Understanding the CAP Theorem
A distributed system (generally running in a datacenter) consists of some large number of servers (nodes) running very similar code, sharing data and talking to each other. Compared to a non-distributed system (everything happening on just one computer), distributed systems have the potential to handle much heavier workloads and to continue to perform well in spite of individual nodes going down.
When building a distributed system, there are three things that you’ll generally want:
- Consistency (all nodes have access to the same data simultaneously)
- Availability (a promise that every request receives a response, at minimum whether the request succeeded or failed)
- Partition tolerance (the system will continue to work even if some arbitrary node goes offline or communication between nodes breaks down, splitting the system into parts that can’t reach each other)
Weird things can happen if these properties aren’t there.
If Consistency is violated, two users can get different answers to the same question: if user A makes a request and it gets routed to node X, and user B makes a request and it gets routed to node Y, and X and Y hold different data, then A and B will get different answers. Not ideal!
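To make this concrete, here is a minimal sketch in Python, using hypothetical in-memory Node objects (not any real database), of two replicas that never propagate writes to each other:

```python
# Hypothetical in-memory "nodes" that do not replicate writes to each other.
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

node_x = Node("X")
node_y = Node("Y")

# User A's write is routed to node X only; nothing propagates it to node Y.
node_x.write("balance", 100)

# User A reads from X, user B reads from Y: they see different answers.
print(node_x.read("balance"))  # 100
print(node_y.read("balance"))  # None -- Y never saw the write, so Consistency is violated
```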
If Availability is violated, a request may simply never receive a response: the node handling it might be down, or it might refuse to answer until it has caught up with the most recent state after a sibling node was updated.
If Partition tolerance is violated, a breakdown in communication between nodes (or a single node going offline) can stop the rest of the system from working correctly: a failure in one part of the system spreads to the others.
Clearly, we would like to respect all three of these properties. But in the late ’90s, computer scientist Eric Brewer formulated what is now known as the “CAP Theorem” (CAP for Consistency, Availability, and Partition tolerance), which states that a distributed system can provide at most two of these three properties at the same time, never all three.
To understand this dynamic, we can imagine the following setup (sketched in code after the list), involving two nodes that are far away from each other and do not communicate directly (so the system is partition-tolerant by construction):
- Scenario 1: One node is updated. The two nodes are immediately inconsistent, and remain so until the second node is also updated. This respects the Availability principle, but violates Consistency.
- Scenario 2: One node is updated. The other node acts as though it is unavailable (while it updates). This respects the Consistency principle, but violates Availability.
- Scenario 3: The two nodes are now allowed to communicate, so that updates to one can be sent to the other. This allows for both Consistency and Availability, but makes the system vulnerable to partitions (where a failure of the network would change the behavior of the system).
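Here is a rough Python sketch of Scenarios 1 and 2, assuming a hypothetical Replica class with a prefer_consistency flag; it is meant only to illustrate the trade-off, not to model any particular system:

```python
# A hypothetical replica that, when it knows it is out of date, either serves a
# stale read (availability-first) or refuses to answer (consistency-first).
class Replica:
    def __init__(self, prefer_consistency=False):
        self.value = None
        self.version = 0
        self.prefer_consistency = prefer_consistency

    def apply_update(self, value, version):
        self.value = value
        self.version = version

    def read(self, latest_version):
        if self.version < latest_version:
            if self.prefer_consistency:
                # Scenario 2: refuse to answer until caught up (sacrifices Availability).
                raise RuntimeError("replica out of date, try again later")
            # Scenario 1: answer immediately with stale data (sacrifices Consistency).
        return self.value

primary = Replica()
follower_ap = Replica(prefer_consistency=False)
follower_cp = Replica(prefer_consistency=True)

# An update reaches the primary but, because the nodes don't communicate,
# never reaches the followers.
primary.apply_update("new value", version=1)

print(primary.read(latest_version=1))      # "new value"
print(follower_ap.read(latest_version=1))  # None -- stale, but available (Scenario 1)
try:
    follower_cp.read(latest_version=1)
except RuntimeError as err:
    print(err)                             # refuses to answer (Scenario 2)
```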
In Brewer’s own words, “The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A.” In other words, it is a requirement that systems be partition-tolerant — they must be able to continue to function correctly even if some nodes become unavailable. System designers must then decide how to balance the trade-off between consistency and availability, choosing to prioritize consistency in some settings, availability in others. Brewer goes on to point out that “all three properties are more continuous than binary. Availability is obviously continuous from 0 to 100 percent, but there are also many levels of consistency, and even partitions have nuances, including disagreement within the system about whether a partition exists.”
It’s important to realize that the CAP tradeoff isn’t a binary 0-1 for each property for a whole system, but rather a general framework for understanding design decisions. Different choices can be made for different parts of the system, and can change over time in response to changing usage patterns.
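One concrete way this continuum shows up in practice is with tunable read/write quorums (as popularized by Dynamo-style stores): with N replicas, requiring W acknowledgments per write and R responses per read such that R + W > N leans toward consistency, while smaller quorums lean toward availability. The sketch below is a simplified, hypothetical illustration of the idea, not any specific product’s API:

```python
# Simplified quorum sketch: N replicas, a write succeeds after W acknowledgments,
# and a read consults R replicas and returns the highest-versioned value it sees.
# Choosing R + W > N biases toward consistency; smaller R and W bias toward
# availability when some replicas are unreachable. (Illustrative only.)
class Replica:
    def __init__(self):
        self.value, self.version = None, 0
        self.reachable = True

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r
        self.next_version = 1

    def write(self, value):
        acked = 0
        for rep in self.replicas:
            if rep.reachable:
                rep.value, rep.version = value, self.next_version
                acked += 1
        self.next_version += 1
        if acked < self.w:
            raise RuntimeError("not enough replicas acknowledged the write")

    def read(self):
        responses = [rep for rep in self.replicas if rep.reachable][: self.r]
        if len(responses) < self.r:
            raise RuntimeError("not enough replicas reachable for a quorum read")
        return max(responses, key=lambda rep: rep.version).value

store = QuorumStore(n=3, w=2, r=2)
store.write("v1")
store.replicas[0].reachable = False  # simulate a partition cutting off one replica
print(store.read())                  # still answers: two of three replicas form a read quorum
```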
When working with distributed systems, it is important to understand their limits. A shared understanding of these trade-offs is a great asset to a technical team, giving everyone a common footing for working together on design decisions.