This is the second piece of the Reading Dynamo series that I started last week. Topics covered in the previous post may be helpful for a better understanding of today’s deep-dive into Dynamo’s data replication strategy, but you’ll survive even if you skip it. You can read the previous post below.
Data Partitioning in Dynamo
Preference Lists
Once a data item's key is hashed, it maps to a specific "coordinator node" on the consistent hashing ring. This coordinator node assumes the critical responsibility of managing the replication of that data item. The item is stored locally on the coordinator node and then replicated to N-1 other nodes, which are typically the immediate clockwise successors to the coordinator on the consistent hashing ring. The parameter N is a crucial, configurable value that dictates the total number of replicas maintained for each data item, directly influencing the system's durability and availability guarantees.
For every key, Dynamo maintains a "preference list," a globally known list of nodes designated to store that key's data. Every node within the Dynamo system possesses the capability to compute this list, fostering a decentralized understanding of data placement across the cluster. A pivotal design choice is that this preference list contains more than N nodes. This serves as a proactive fault-tolerance measure: if one of the initial N intended replica nodes is unavailable due to a temporary failure, Dynamo can utilize the next available node from this extended list to ensure that N replicas are successfully created. This design minimizes the chance of write rejections due to transient failures, directly supporting Dynamo's "always writable" objective. Furthermore, to effectively accommodate virtual nodes, the preference list skips ring positions that correspond to physical nodes already utilized by earlier entries in the list, ensuring that replicas are distributed across distinct physical machines for maximum fault isolation and true redundancy.
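To make the skipping rule concrete, here is a minimal sketch (my own illustration, not Dynamo's code) of how a node could compute a preference list on a ring of virtual nodes: walk clockwise from the key's position and collect v-nodes, skipping any whose physical host is already in the list. The MD5-based hashing, host names, and v-node counts are all assumptions for the example.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a key or a v-node token to a position on the ring (illustrative)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Hypothetical ring: each physical host owns several virtual nodes (tokens).
ring = sorted(
    (ring_hash(f"{host}-vnode-{i}"), host)
    for host in ("A", "B", "C", "D", "E")
    for i in range(3)
)
positions = [pos for pos, _ in ring]

def preference_list(key: str, n: int, extra: int = 2) -> list[str]:
    """Walk clockwise from the key's position, collecting distinct physical
    hosts; keep a few hosts beyond N so transient failures can be absorbed."""
    start = bisect.bisect_right(positions, ring_hash(key))
    hosts: list[str] = []
    for i in range(len(ring)):
        _, host = ring[(start + i) % len(ring)]
        if host not in hosts:            # skip v-nodes of hosts already chosen
            hosts.append(host)
        if len(hosts) == n + extra:
            break
    return hosts

print(preference_list("user:42", n=3))   # first 3 hosts hold replicas; the rest are standby
```

The first N entries are the intended replicas; the extra entries are the fallbacks the system can lean on when one of the intended replicas is unreachable.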
The preference list is also integral to how Dynamo achieves tunable consistency through configurable parameters: R (the minimum number of nodes that must respond for a successful read operation) and W (the minimum number of nodes that must respond for a successful write operation).
During a put operation, the coordinator writes data to the nodes identified by the preference list and waits for W-1 successful responses (in addition to its own local write). For a get operation, the coordinator queries the N highest-ranked reachable nodes in the preference list and awaits R responses. The very existence and robust management of this preference list enable the system's flexibility in consistency. By configuring N, R, and W appropriately (e.g., R + W > N for stronger consistency, or a low W such as 1 for higher write availability), Dynamo provides granular control over the availability-consistency trade-off, allowing applications to tailor their data consistency model to specific needs.
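As a rough, self-contained illustration of how N, R, and W interact (my own toy model, not Dynamo's implementation), the sketch below treats replicas as plain dictionaries and counts acknowledgements against the quorums:

```python
# Toy model of quorum-based replication (an illustration, not Dynamo's code).
N, R, W = 3, 2, 2   # replicas, read quorum, write quorum

def write(replicas: list[dict], key: str, value: str) -> bool:
    """Attempt the write on every replica; succeed once W acknowledge."""
    acks = 0
    for replica in replicas[:N]:
        try:
            replica[key] = value          # pretend this is a network call
            acks += 1
        except Exception:
            continue                      # an unreachable replica simply doesn't ack
    return acks >= W

def read(replicas: list[dict], key: str) -> list[str]:
    """Collect up to R responses; conflicting versions would be reconciled later."""
    return [r[key] for r in replicas[:N] if key in r][:R]

replicas = [dict() for _ in range(N)]
assert write(replicas, "cart:1", "v1")
print(read(replicas, "cart:1"))           # ['v1', 'v1']

# R + W > N (here 2 + 2 > 3) guarantees the read and write quorums overlap,
# so a read always touches at least one replica holding the latest acknowledged write.
```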
Keeping Writes Alive: Hinted Handoff
In large-scale distributed systems, temporary node failures and network partitions are part of the routine. Without a robust mechanism to handle these transient issues, client writes could be rejected. To address this, Dynamo employs Hinted Handoff, a technique that ensures write availability and data durability even when some of the intended replica nodes are temporarily unreachable.
The process of hinted handoff operates as follows: When a client initiates a write operation, the coordinator node (responsible for that key) attempts to write the data to its N primary replica nodes identified by the preference list. If one or more of these intended replica nodes are temporarily down or unreachable, the coordinator does not reject the write. Instead, it temporarily writes the data to an available node that is not one of the original N replicas. This temporary storage is accompanied by a "hint"—metadata that indicates the original, intended destination node for that data. This "hinted" node holds onto the data. When the originally intended node recovers and becomes available again, the hinted node detects its recovery and initiates a "handoff" process, delivering the temporarily stored data (along with the hint) to the correct, recovered node.
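Here is a hedged sketch of that flow (the node objects, method names, and in-memory hint list are all illustrative; real Dynamo keeps hints in a separate local store and scans for recovered nodes in the background):

```python
# Sketch of hinted handoff (illustrative; Node and its methods are hypothetical).
class Node:
    def __init__(self, name: str, up: bool = True):
        self.name, self.up = name, up
        self.data = {}            # key -> value
        self.hinted = []          # (intended_node_name, key, value) held for others

    def store(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value

def replicate(key, value, preference_list, fallbacks):
    """Write to the intended replicas; park misses on a fallback node with a hint."""
    for node in preference_list:
        try:
            node.store(key, value)
        except ConnectionError:
            fallback = next(n for n in fallbacks if n.up)
            fallback.hinted.append((node.name, key, value))   # the "hint"

def handoff(fallback, nodes_by_name):
    """When an intended node recovers, deliver its hinted writes and drop them."""
    remaining = []
    for intended, key, value in fallback.hinted:
        target = nodes_by_name[intended]
        if target.up:
            target.store(key, value)
        else:
            remaining.append((intended, key, value))
    fallback.hinted = remaining

nodes = {name: Node(name) for name in ("A", "B", "C", "F")}
nodes["C"].up = False                       # C is temporarily down
replicate("cart:1", "v1", [nodes["A"], nodes["B"], nodes["C"]], [nodes["F"]])
nodes["C"].up = True                        # C recovers
handoff(nodes["F"], nodes)
print("cart:1" in nodes["C"].data)          # True -- the hinted write reached C
```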
Hinted Handoff significantly improves write availability and translates to fewer client-side write failures and a smoother user experience. It also enhances data durability. A notable aspect of Dynamo's design, particularly when operating with a "sloppy quorum" model, is that these hinted writes count towards write consistency requirements. This means that even if a write doesn't immediately reach all its primary replicas, it can still be considered successful if it reaches a sufficient number of any available nodes, further boosting write availability. The combination of sloppy quorum (a policy decision to accept writes even if not all primary replicas are immediately reachable) and hinted handoff (the concrete mechanism for ensuring eventual delivery from temporary locations) is what truly delivers Dynamo's high write availability.
Dealing with Divergence: Versioning with Vector Clocks
Dynamo's strong availability and its "always writable" nature mean the system permits concurrent updates to the same data item. This scenario is particularly common during network partitions, where different parts of the system might operate independently, or during server outages where replicas become temporarily isolated. In such circumstances, multiple divergent versions of the same data object inevitably arise.
Dynamo's primary mechanism for tracking causality and detecting conflicts between different versions of an object is Vector Clocks. A vector clock is structured as a list of (node, counter) pairs. Every version of an object stored in Dynamo carries its associated vector clock. These clocks allow Dynamo to determine the relationship between different versions: if the counters on one version's clock are all less than or equal to the corresponding counters on another version's clock, then the first version is an ancestor of the second. This indicates that the second version causally descends from the first and can safely overwrite it. Conversely, if neither version's clock dominates the other (meaning they contain independent updates from different nodes), they are considered concurrent modifications, signaling a conflict that requires resolution.
This capability to capture causality provides the semantic context for concurrent writes, allowing Dynamo to identify precisely which writes are true conflicts requiring external resolution, and which are simple sequential updates.
Consider an example of how vector clocks capture causality and identify conflicts:
- Initial Write (D1): A client writes data, handled by node Sx. Its vector clock is {Sx: 1, Sy: 0, Sz: 0}.
- Second Write (D2): The same client updates D1, again handled by Sx. The vector clock becomes {Sx: 2, Sy: 0, Sz: 0}. D2 causally descends from D1.
- Third Write (D3): A different client updates D2, handled by node Sy. The vector clock becomes {Sx: 2, Sy: 1, Sz: 0}. D3 causally descends from D2.
- Concurrent Write (D4): A network partition occurs. Another client updates D2, but this time the update is handled by node Sz (which has not yet seen D3). The vector clock for this new version (D4) is {Sx: 2, Sy: 0, Sz: 1}. At this point, D3 and D4 are concurrent versions because neither causally descends from the other; they represent a conflict.
- Reconciliation Write (D5): When a client retrieves D3 and D4, reconciles them, and writes back a new version D5, its vector clock combines the information from the conflicting versions. If node Sx coordinates the reconciled write, the merged clock takes the element-wise maximum and Sx advances its own counter, giving {Sx: 3, Sy: 1, Sz: 1}, as shown in the sketch below.
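The same relationships can be checked mechanically. Below is a minimal sketch (my own illustration; the actual system stores clocks as (node, counter) pairs on each object version) that recovers the D3/D4 conflict and the D5 reconciliation from above:

```python
# Illustrative vector-clock comparison; clocks are dicts of node -> counter.
def descends(a: dict, b: dict) -> bool:
    """True if the version with clock `a` causally descends from (or equals) clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def relation(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "identical"
    if descends(a, b):
        return "a descends from b (b can be discarded)"
    if descends(b, a):
        return "b descends from a (a can be discarded)"
    return "concurrent (conflict, needs reconciliation)"

D3 = {"Sx": 2, "Sy": 1, "Sz": 0}
D4 = {"Sx": 2, "Sy": 0, "Sz": 1}
print(relation(D3, D4))   # concurrent (conflict, needs reconciliation)

# Reconciliation merges element-wise maxima, then the coordinator (say Sx)
# bumps its own counter: the result descends from both D3 and D4.
D5 = {n: max(D3.get(n, 0), D4.get(n, 0)) for n in set(D3) | set(D4)}
D5["Sx"] += 1
print(descends(D5, D3), descends(D5, D4))   # True True
```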
When a get() request is made, the coordinator node gathers versions from the N highest-ranked reachable nodes and waits for R responses. If vector clocks indicate multiple conflicting versions, Dynamo returns all of them to the client. If versions can be syntactically reconciled (meaning one is clearly newer), only the newest version is returned. This design choice means that Dynamo itself does not attempt to understand the meaning of the data or how to merge application-specific conflicts.
This offloads complexity from the database, keeping it simple, but shifts the burden of "truth" to the application layer. Consequently, developers building on Dynamo must implement robust, application-specific conflict resolution logic, which can be complex but allows for highly customized and intelligent merging based on business rules. A practical challenge arises because vector clocks can grow indefinitely in a highly concurrent system, leading to storage and performance issues. Dynamo truncates vector clocks (oldest first) when they grow too large, which, if crucial causal information is lost, could potentially hinder eventual consistency.
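To show what syntactic reconciliation amounts to, here is a hedged sketch (again my own illustration) that drops any version whose clock is strictly dominated by another and hands everything else back to the client:

```python
# Illustrative syntactic reconciliation: drop versions that are ancestors of others.
def descends(a: dict, b: dict) -> bool:
    return all(a.get(node, 0) >= count for node, count in b.items())

def syntactic_reconcile(versions):
    """versions: list of (vector_clock, value) pairs gathered from R replicas.
    Keep only versions whose clock is not strictly dominated by another's."""
    survivors = []
    for clock, value in versions:
        dominated = any(
            descends(other, clock) and not descends(clock, other)
            for other, _ in versions
        )
        if not dominated:
            survivors.append((clock, value))
    return survivors   # more than one survivor means a true conflict for the client

versions = [
    ({"Sx": 2, "Sy": 1, "Sz": 0}, "cart-with-book"),        # D3
    ({"Sx": 2, "Sy": 0, "Sz": 1}, "cart-with-headphones"),   # D4
    ({"Sx": 2, "Sy": 0, "Sz": 0}, "old-cart"),               # D2, obsolete ancestor
]
print([v for _, v in syntactic_reconcile(versions)])
# ['cart-with-book', 'cart-with-headphones'] -- both go back to the client
```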
Client-Side Reconciliation and Last Write Wins
While client-side reconciliation with vector clocks is Dynamo's primary method, the original paper notes that Dynamo also offers "last write wins" (LWW) as an alternative conflict resolution strategy for specific use cases, such as session management, where the data is less critical and a simpler, automatic resolution is preferred. LWW resolves conflicts by simply selecting the version with the latest timestamp, typically based on wall-clock time, and discarding older versions. It is important to note that while the original Dynamo paper primarily advocated for vector clocks and client-side reconciliation, many subsequent Dynamo-family databases, such as Apache Cassandra, adopted LWW as their primary, simpler mechanism for conflict resolution.
The main drawback of LWW is significant: it can easily lead to data loss. If two concurrent writes occur with very close timestamps, the system essentially makes an arbitrary choice, discarding one version entirely. This means that a valid update might be silently lost. While LWW is simpler to implement due to its automatic, server-side resolution, this simplification comes at the cost of potential data integrity issues. It does not account for the semantic meaning of concurrent updates, making it unsuitable for data where every update is critical and must be preserved or intelligently merged.
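For completeness, here is a toy sketch of timestamp-based LWW (my own illustration, not any specific database's code): whichever version carries the higher wall-clock timestamp survives, and the comment marks where the silent loss happens.

```python
import time

# Toy last-write-wins resolution (illustrative).
def lww_resolve(versions):
    """versions: list of (timestamp, value). The latest timestamp wins outright;
    every other concurrent update is silently discarded."""
    return max(versions, key=lambda v: v[0])[1]

t = time.time()
concurrent = [(t + 0.001, "add headphones"), (t + 0.002, "add book")]
print(lww_resolve(concurrent))   # 'add book' -- the headphones update is lost
```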
Alternatives: CRDTs
Conflict-free Replicated Data Types (CRDTs) are a class of data structures designed to handle concurrent updates in distributed systems without requiring complex coordination or explicit conflict resolution logic. The power of CRDTs lies in their inherent mathematical properties. They are engineered such that concurrent changes, when applied in any order, will always converge to the same, consistent end result across all replicas. This is achieved through properties like commutativity (the order of operations does not affect the outcome) and idempotence (applying an operation multiple times has the same effect as applying it once). For example, a simple counter CRDT can track increments and decrements separately, allowing them to be merged correctly even if operations occur out of order. Similarly, a set CRDT can merge additions and removals without conflicts. Amazon's shopping cart is often cited as an excellent real-world example where CRDT-like properties are highly desirable.
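As a small taste of the idea, below is a sketch of a grow-only counter (G-Counter), one of the simplest CRDTs; the structure and merge rule follow the standard formulation, and the replica names are illustrative.

```python
# Sketch of a G-Counter CRDT: each replica only increments its own slot,
# and merge takes the element-wise maximum, so merges are commutative,
# associative, and idempotent -- every replica converges to the same value.
class GCounter:
    def __init__(self, node: str):
        self.node = node
        self.counts: dict[str, int] = {}

    def increment(self, by: int = 1):
        self.counts[self.node] = self.counts.get(self.node, 0) + by

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter"):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a, b = GCounter("Sx"), GCounter("Sy")
a.increment(); a.increment()     # two increments on replica Sx
b.increment()                    # one concurrent increment on replica Sy
a.merge(b); b.merge(a)           # merge in either order
print(a.value(), b.value())      # 3 3 -- both replicas agree
```

Because merge is an element-wise maximum, applying it twice or in either order changes nothing, which is exactly the commutativity and idempotence described above.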
It is crucial to state that the original Dynamo paper does not mention Conflict-free Replicated Data Types (CRDTs) in relation to its core conflict resolution mechanisms; Dynamo relies on vector clocks and client-side reconciliation for managing divergent versions. CRDTs, however, are a highly relevant and powerful concept in the broader distributed systems landscape, targeting the same fundamental problem Dynamo addresses: managing concurrent updates in highly available, eventually consistent systems. While Dynamo laid the groundwork for managing divergence with vector clocks, CRDTs offer a later, alternative paradigm in which data types are designed to be mergeable by their very nature, simplifying the resolution process.
Importantly, some Dynamo-family databases do incorporate CRDTs. For instance, Apache Cassandra, a prominent Dynamo-style system, uses a specific type of CRDT, a "Last-Write-Wins Element-Set CRDT," for resolving conflicting mutations on its CQL rows. This demonstrates an evolution of conflict resolution within the ecosystem Dynamo inspired, and highlights a pragmatic approach in modern systems where the simplicity of LWW (for ordering) can be combined with the mathematical guarantees of CRDTs (for convergence) for specific data structures.
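A rough sketch of that hybrid, an LWW-element-set (my own illustration, not Cassandra's implementation): every add and remove carries a timestamp, merge keeps the maximum timestamp per element, and membership is decided by whichever operation is latest.

```python
# Sketch of an LWW-Element-Set CRDT: per-element add/remove timestamps,
# merged by taking the maximum, with "latest wins" deciding membership.
class LWWElementSet:
    def __init__(self):
        self.adds: dict[str, float] = {}      # element -> latest add timestamp
        self.removes: dict[str, float] = {}   # element -> latest remove timestamp

    def add(self, element: str, ts: float):
        self.adds[element] = max(self.adds.get(element, 0.0), ts)

    def remove(self, element: str, ts: float):
        self.removes[element] = max(self.removes.get(element, 0.0), ts)

    def contains(self, element: str) -> bool:
        return self.adds.get(element, 0.0) > self.removes.get(element, 0.0)

    def merge(self, other: "LWWElementSet"):
        for e, ts in other.adds.items():
            self.add(e, ts)
        for e, ts in other.removes.items():
            self.remove(e, ts)

a, b = LWWElementSet(), LWWElementSet()
a.add("book", ts=1.0)
b.remove("book", ts=2.0)      # concurrent removal with a later timestamp
a.merge(b); b.merge(a)
print(a.contains("book"), b.contains("book"))   # False False -- replicas agree
```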