Amazon Dynamo

November 20, 2016

There is an excellent paper on Dynamo that appeared in SOSP 2007. These notes are a boil-down of that paper, and they also cast this particular technology as a fable about an entire approach to persistence: one that emphasizes availability with reasonable consistency properties, if not strict consistency.

Modern persistence needs to scale to tremendous database sizes and query/update traffic. It has to be available without fail over long periods of time. Data must remain available through maintenance and platform updates, as well as through failures of hosts or networks. The data models should be supple. They should provide reasonably powerful query tools, but the complex mathematical modeling of a relational database might not be needed, and the price paid for such complexity is effort better put toward other features.

Dynamo is a database technology invented by Amazon to satisfy its own needs as a vast cloud-computing shopping service. Founded in 1994 as a seller of books, Amazon has gone on to become one of the most powerful internet sales companies, innovating the shopping experience and revisiting many computer technologies in support of those innovations. Dynamo stores key-value pairs, where both the key and the value are byte strings.

Concept map:

  • Partitioning (sharding)
  • Replication and quorum
  • Anti-entropy (Merkle trees)
  • Membership (gossip, or manual configuration)
  • Consistency: last write wins (global clocks), vector clocks

Partitioning of data, and replication

Scalability is achieved by partitioning the data set and distributing the partitions across however many machines are currently assigned to handle the data. An algorithm is needed to determine efficiently which machine will store a new data item, and which machine stores an existing data item. The algorithm should be efficient and should distribute the items fairly across all machines. The assignment of items to machines must exhibit both a random character, to distribute evenly across the available machines data items of a (more or less) unknown distribution, and a deterministic character, since when retrieving an existing item the algorithm must know exactly which machine holds the item.

Additionally, the algorithm must support efficient reconfiguration of the set of machines. Machines might be added to scale up for increased load, or to compensate for a faulty machine.

A classical hash function is used in a method of creating partitions called consistent hashing.

The key of the item is hashed into buckets defined by intervals in the range of the hash function. For a typical hash to integers, the intervals form a ring ordered by the successor relation next(i) = (i + 1) mod n. One can check that the number of steps needed to go from a to b is (b - a) mod n. For instance, in the integers mod 6, two steps are required to go from 1 to 3, and likewise two steps from 4 to 0.

The earliest implementation of Dynamo had each data store randomly pick a number s_i in the range of the hash function, then use a protocol to communicate the collection { s_i } to all data stores, as well as to servers acting on behalf of clients to query the data store. Ordered, the partitions are then [s_i, s_{i+1}), and each data item is placed into the data stores owning the next n intervals, walking forward from the item's hash in the sequence of partitions.

This method has the advantage of self-organization. As new data stores are added, they contribute fairly to the partitioning.

The presentation so far has not been entirely accurate. Dynamo does not assume each machine is one data store. Replication has to be across machines, not data stores. If a machine participates in the partitioning by choosing several locations in the hash range, the machine's loss will affect each of those locations. Therefore, when walking forward to place replicas, Dynamo skips over a location if its machine is already represented among the locations chosen so far.
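
As an illustration, here is a minimal sketch of such a ring in Python. The Ring class, its method names, and the vnode naming scheme are invented for this note, not Dynamo's code (though the paper does say Dynamo hashes keys with MD5). Keys are hashed onto a ring of virtual-node positions, and the walk forward skips positions belonging to machines already chosen, until n distinct machines hold the item.

    import hashlib
    from bisect import bisect_right

    def ring_hash(data):
        # Hash a string to an integer position on the ring.
        return int(hashlib.md5(data.encode()).hexdigest(), 16)

    class Ring:
        """A consistent-hashing ring with virtual nodes (illustrative sketch)."""
        def __init__(self, machines, vnodes_per_machine=8):
            # Each machine takes several pseudo-random positions on the ring.
            self.positions = sorted(
                (ring_hash("%s-vnode-%d" % (m, v)), m)
                for m in machines
                for v in range(vnodes_per_machine))

        def preference_list(self, key, n):
            # Walk forward from the key's hash, skipping positions whose
            # machine is already represented, until n distinct machines.
            start = bisect_right(self.positions, (ring_hash(key),))
            chosen = []
            for step in range(len(self.positions)):
                _, machine = self.positions[(start + step) % len(self.positions)]
                if machine not in chosen:
                    chosen.append(machine)
                if len(chosen) == n:
                    break
            return chosen

    ring = Ring(["A", "B", "C", "D"])
    print(ring.preference_list("cart:12345", n=3))  # three distinct machines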

Commitment and quorum

Dynamo considers a write committed, and a read resolved, once it receives responses from a user-selectable number of replicas from among the n replicas. Dynamo defines:

  1. n, the total number of replicas,
  2. r, the number of replicas that must respond to a read before a value can be returned,
  3. w, the number of replicas that must affirm a write before the system affirms that the write is committed (that is, that it will persist, or be durable).

The quorum condition is that r + w > n. If the quorum condition is satisfied and a majority value exists, that value will appear among the results of a read. Given the condition, at least one of r or w must be larger than n/2. If r > n/2, the read set must intersect the set of replicas storing the majority value, when such a value exists, and so the read returns it among its results. If w > n/2, then any majority must be due to a confirmed write to at least w replicas, and since r + w > n, any read of r replicas must intersect that write set.
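
As a sketch of this intersection argument, the following brute-force check (a hypothetical helper in Python, not part of Dynamo) confirms that every read quorum overlaps every write quorum exactly when r + w > n:

    from itertools import combinations

    def quorums_intersect(n, r, w):
        # True iff every read set of size r meets every write set of size w.
        replicas = range(n)
        return all(set(read) & set(write)
                   for read in combinations(replicas, r)
                   for write in combinations(replicas, w))

    for n, r, w in [(3, 2, 2), (3, 1, 2), (5, 3, 3), (5, 2, 3)]:
        # Intersection holds exactly when the quorum condition holds.
        print((n, r, w), quorums_intersect(n, r, w), r + w > n)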

Consistency

Although the quorum condition assures some propagation of a majority result, many update patterns lead to inconsistent data that cannot naturally be resolved to a majority. For instance, if a stock count is decremented by two different orders against the items in stock, the correct resolution is to apply the sum of the two decrements. It is not possible to resolve this by voting; both written values are wrong.

Dynamo provides that a read yielding a spectrum of results will return the entire set of results. The client must resolve the data.

One simple resolution is most recent write wins. Under this regime, all writes are time-stamped, creating a (somewhat arbitrary) total order by time of write. Among any returned read set, the most recent value is considered the result. As the replicas synchronize, newer values overwrite older values.
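
A minimal sketch of last-write-wins resolution, assuming each version returned by a read carries a write timestamp (the representation here is invented for illustration):

    def resolve_lww(versions):
        # versions: list of (timestamp, value) pairs from a read.
        # The value with the greatest timestamp wins.
        return max(versions, key=lambda tv: tv[0])[1]

    print(resolve_lww([(1479600000, "v1"), (1479600042, "v2")]))  # "v2"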

This is a simple, but limited, resolution regime. It does not work for our example of taking items from stock. For that example, the two writes must recognize that they decrement from a common starting point, and when they are presented together, they must resolve to the sum of the decrements, arriving at a new common value.
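
For instance, a client-side merge for the stock example might look like this sketch, assuming the read can supply the common starting point along with the two conflicting versions (an assumption for illustration; Dynamo itself returns only the conflicting versions and leaves the merge policy to the application):

    def merge_stock(start, a, b):
        # a and b each decremented start independently;
        # apply both decrements to reach a new common value.
        return start - ((start - a) + (start - b))

    # Stock was 10; one order took 2 (leaving 8), another took 3 (leaving 7).
    print(merge_stock(10, 8, 7))  # 5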

Dynamo supports vector clocks, which can be used in this situation.

Logical Timestamps

In view of the difficulty of assigning a global clock, many systems use the theory of Logical Time Ordering proposed by Leslie Lamport. The two axioms of this ordering are:

  1. In a local process (or on a single platform) there is a local event clock.
  2. Event clocks between processes (or platforms) are synchronized so that a message is sent strictly before it is received.

We define causality as the transitive relationship of “comes-before” as determined by either the local clock or by a message flowing from one process to another.

The time ordering is captured by an integer assigned to each event, obeying the axioms by the following rules: for two events in the same process (or on the same platform), one directly following the other, increment the local clock by one and assign the later event the incremented value; a send is a local event, and it stamps the message with the resulting time stamp; on receiving a message, set the time stamp of the receive to one plus the larger of the message's time stamp and the current local clock.

Theorem: If two events are causally related, then their time stamps are unequal, and the event with the lower time stamp occurred first. However, the converse is not true: given two unequal time stamps, it is possible that the two events are not causally related.
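
A minimal Lamport clock in Python, as a sketch of these rules (class and method names invented for illustration):

    class LamportClock:
        def __init__(self):
            self.time = 0

        def local_event(self):
            # Rule 1: increment the local clock on each local event.
            self.time += 1
            return self.time

        def send(self):
            # A send is a local event; the message carries its time stamp.
            return self.local_event()

        def receive(self, message_time):
            # Rule 2: stamp the receive one past the larger of the clocks.
            self.time = max(self.time, message_time) + 1
            return self.time

    p, q = LamportClock(), LamportClock()
    t = p.send()         # p's clock becomes 1
    print(q.receive(t))  # q's clock becomes 2: the receive comes after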

Vector clocks

The vector clock is a more robust method for determining causality among distributed processes. Events are stamped with a vector of integers, one index in the vector for each process. Let v_ij be entry j of the vector time stamp of a certain event that occurred in process i. For each j, the value in entry j is process i's knowledge of the event clock on process j.

Each process i keeps track of its own event clock by incrementing v_ii on each local event. Process i learns of the event clocks of other processes through messages, which are time-stamped by the sender with that sender's vector clock. The receiver then uses the update formula:

  • v_ii = v_ii + 1,
  • v_ij = max( v_ij, u_j ), for j ≠ i,

where u is the vector clock time stamp on the message and u_j is its entry j.

Causality and concurrency are then defined using vector dominance: a ≤ b iff for every index i, a_i ≤ b_i; and a < b iff a ≤ b and for some index i, a_i < b_i. Strict inequality of vector time stamps is called causality, and two events a and b are concurrent if both a ≰ b and a ≱ b (incomparable, in the language of partially ordered sets).

In this ordering, causality is well defined in that strict ordering provides proof of causality. Hence, not only is the “comes-after” notion of time ordering consistent with strict dominance of vector clocks, but strict dominance of vector clocks implies a “comes-after” relationship.
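
A sketch of a vector clock and its dominance test in Python (names invented for illustration; in Dynamo each object version carries such a vector of per-node counters):

    class VectorClock:
        def __init__(self, n, i):
            self.v = [0] * n  # entry j: knowledge of process j's clock
            self.i = i        # this process's own index

        def local_event(self):
            self.v[self.i] += 1

        def send(self):
            # A send is a local event; the message carries the vector.
            self.local_event()
            return list(self.v)

        def receive(self, u):
            self.v[self.i] += 1  # v_ii = v_ii + 1
            for j, uj in enumerate(u):
                if j != self.i:
                    self.v[j] = max(self.v[j], uj)  # v_ij = max(v_ij, u_j)

    def dominates(a, b):
        # a < b: a <= b at every index and a < b at some index.
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    def concurrent(a, b):
        return not dominates(a, b) and not dominates(b, a)

    p, q = VectorClock(2, 0), VectorClock(2, 1)
    m = p.send()   # p's stamp: [1, 0]
    q.receive(m)   # q's stamp: [1, 1]; the send causally precedes it
    print(dominates(m, q.v), concurrent([1, 0], [0, 1]))  # True True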
