Audit, Patch, and Bootstrap

July 2, 2011

I’ve finally posted my second System Note (SN-1): “Audit, Patch, and Bootstrap.”  Written with my co-worker, Rohit Chandra, this note summarizes a recipe that many co-workers have helped refine to minimize the impact of Byzantine errors in systems that replicate state.

For performance and availability reasons, replication of state is commonplace in large-scale systems.  But it is error-prone, especially at scale, where Byzantine errors are inevitable.  And, when errors occur, they can require painful repairs.  After losing more weekends than we’d care to admit to cleaning corrupted data, we developed a simple, three-step pattern that minimizes the impact of replication errors:

  1. Where is the Master?  The first step is to identify clearly the master copy of your data — and ensure that it is resilient.
  2. How do you audit?  The next step is to identify strategies for auditing replicas against that source of truth.
  3. How do you repair?  The final step is to identify procedures for repairing discrepancies uncovered by audits.  We recommend defining both “patch” procedures that repair small, simple errors, and “bootstrap” procedures that rewrite replicas from the ground up.
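
As a rough illustration of the three steps above (a minimal sketch, not code taken from SN-1), the pattern might look like this in Python.  It assumes the master and the replica are both small in-memory dictionaries; the function names and the `patch_limit` threshold are hypothetical, chosen only for this example.

```python
from typing import Dict, List

Store = Dict[str, str]  # assumption: state is a simple key/value map
_MISSING = object()     # sentinel so an absent key never equals a real value


def audit(master: Store, replica: Store) -> List[str]:
    """Step 2: return the sorted keys on which the replica disagrees with the master."""
    keys = master.keys() | replica.keys()
    return sorted(
        k for k in keys
        if master.get(k, _MISSING) != replica.get(k, _MISSING)
    )


def patch(master: Store, replica: Store, bad_keys: List[str]) -> None:
    """Step 3a: repair only the keys that the audit flagged."""
    for k in bad_keys:
        if k in master:
            replica[k] = master[k]   # missing or stale entry: copy from master
        else:
            replica.pop(k, None)     # entry the master does not have: drop it


def bootstrap(master: Store, replica: Store) -> None:
    """Step 3b: rewrite the replica from scratch using the master."""
    replica.clear()
    replica.update(master)


def repair(master: Store, replica: Store, patch_limit: int = 10) -> str:
    """Audit, then patch small discrepancies or bootstrap large ones.

    patch_limit is a hypothetical cutoff; a real system would pick its own rule.
    """
    bad_keys = audit(master, replica)
    if not bad_keys:
        return "clean"
    if len(bad_keys) <= patch_limit:
        patch(master, replica, bad_keys)
        return "patched"
    bootstrap(master, replica)
    return "bootstrapped"


if __name__ == "__main__":
    master = {"a": "1", "b": "2", "c": "3"}
    replica = {"a": "1", "b": "X", "d": "4"}
    print(audit(master, replica))   # keys that differ between the two copies
    print(repair(master, replica))  # which repair path was taken
```

In this sketch the master is always treated as the source of truth (step 1), so a repair only ever writes to the replica.  A real system would audit incrementally, for example by comparing checksums over key ranges, rather than comparing every value.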

These steps are simple to remember and simple to apply, and they always yield actionable insights.  SN-1 describes these steps in a bit more detail, and discusses some practical issues that arise when following them.  If you or some friends are building systems that replicate data, give them a try.  You won’t be sorry.
