Modern Datastores: Three Case Studies

December 22, 2016

The storage of persistent data requires data structures that use the capabilities of a block store to organize and index information for efficient and complete retrieval. Traditionally, file systems were the data structures organizing block stores. The traditional alternative to a file system was a database, which constituted an entirely different field of computing than operating systems.

While filesystems have developed in efficiency and sophistication, the variety of compromises that must be negotiated has fragmented the notion of a persistent store more and more drastically in modern computing. For instance, there is a demand for evanescent file systems — where files are temporarily and efficiently available, with a URI identifier, but disappear after a certain time limit.

We will study three data stores that underlie some modern systems for data persistence. These three data stores do use files and file systems, but accommodate their limitations and provide the desired properties by building on top of files. In particular, it is recognized that,

  • file systems are slower than memory, so as few trips to the block store as possible should be made,
  • file systems can support reads at arbitrary locations, but only support well a write that appends to the end of an existing file, or creates a new file,
  • it is simpler to handle files if they have a special immutability — either two copies of a file are identical, or they differ in size and agree exactly if the longer file were to be truncated to the size of the smaller file.

Having special immutability gets rid of problems of file consistency, as there is never an alternative version of the file. This simplifies replication and backup — backup consists of keeping copies of whatever files exist, and replication consists of having all the files on multiple machines. If files have versions, or if files are deleted only to reappear later with different content (by the same name), it is not all that simple.

I have used the term special immutability to encompass both the true immutability of a specially immutable file combined with the promise of no more appending writes, and the simpler consistency that there never be two inconsistent appends to a specially immutable file while the file is open for appending writes. We will term these closed and open specially immutable files, respectively.

Bitcask

Bitcask is a simple datastore for key-value pairs that makes use of in-memory hashing and specially immutable files. See Bitcask: A Log-Structured Hash Table for Fast Key/Value Data by Justin Sheehy and David Smith of Basho Technologies, 2010.

A Bitcask instance has any number of specially immutable files, all of which are called inactive, except for one active file. The inactive files are closed, the active file is open. Each file is a sequence of variable-length records which contain a checksum, a time stamp, the key string with its length, and the value string with its length.

An in-memory hash table maps each key to a file identifier and a position in that file, where the key-value pair for that key resides, if there is such a pair. Updates to a key-value pair create a new record by appending to the active file, and the hash table points only to the newest entry. Hence, updates are made by obsoleting, but not necessarily deleting, old data. Since deletions are hard to implement for hash tables, a deletion is modeled as an update to a tombstone value. If a key is matched with a tombstone, Bitcask returns the same as if the key does not exist.

New keys append the key-value pair to the active file, then an entry is made in the hash table. Doing things in this order makes the system crash tolerant — incomplete updates can be reapplied, and are inconsequential until completed.
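
Here is a minimal sketch in Python of this read and write path. It is not Bitcask’s actual format: it assumes a single data file, a Python dictionary standing in for the hash table (mapping each key to an offset only, since there is just one file), simplified record framing with no checksum or time stamp, and a hypothetical tombstone sentinel.

import os

TOMBSTONE = b"__tombstone__"   # hypothetical sentinel; the real format marks deletes differently

class MiniBitcask:
    def __init__(self, path):
        self.f = open(path, "a+b")    # the single active file, append-only
        self.keydir = {}              # key -> offset of its newest record

    def put(self, key, value):
        # Append the record first; only then update the hash table. A crash
        # between the two steps leaves an unreferenced, harmless record.
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(len(key).to_bytes(4, "big") + len(value).to_bytes(4, "big"))
        self.f.write(key + value)
        self.f.flush()
        self.keydir[key] = offset

    def get(self, key):
        offset = self.keydir.get(key)
        if offset is None:
            return None
        self.f.seek(offset)
        klen = int.from_bytes(self.f.read(4), "big")
        vlen = int.from_bytes(self.f.read(4), "big")
        self.f.seek(klen, os.SEEK_CUR)          # skip over the key
        value = self.f.read(vlen)
        return None if value == TOMBSTONE else value

    def delete(self, key):
        self.put(key, TOMBSTONE)      # a delete is an update to the tombstone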

When the active file reaches a certain maximum permitted size, it is made inactive, immutable, and a new active file is started.

Since there can be collisions in the hash table, depending on how the hash table is implemented, there might be multiple file reads to retrieve a key’s value; however, retrieval takes expected constant time for a properly administered hash table.

The storage requirement under the scheme as described so far is proportional to the number of key-value updates, including the initial value of a key. (Also add in the number of deletions, but this can only affect the size by a factor of two, hence the statement of proportionality is unaffected.)

To save space, obsolete values and tombstones are removed by scanning the collection of inactive files, creating a series of new inactive files, and re-hashing based on the series of new inactive files and the single active file. Note that the old files are retired; they are more irrelevant than destroyed. Backups and replicas will not be corrupted if they continue to host these old files. They are just not referred to by the current hash table.

The details of rebuilding the inactive files are not given in the paper. It could be that the files are walked, and for each key-value pair, the key is looked up to see if this key-value pair is the most current; if not, the pair is skipped, else it is copied into the new file.
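
A plausible sketch of that merge, assuming the fuller hash table that maps each key to a (file, offset) pair, and the record framing of the earlier sketch; the paper does not specify this procedure, so the details here are guesses.

def merge(keydir, inactive_paths, merged_path):
    # Hypothetical compaction: walk every record of the old inactive files
    # and copy forward only those the current hash table still points at.
    out = open(merged_path, "ab")
    for path in inactive_paths:
        with open(path, "rb") as f:
            while True:
                offset = f.tell()
                header = f.read(8)
                if len(header) < 8:
                    break                              # end of this inactive file
                klen = int.from_bytes(header[:4], "big")
                vlen = int.from_bytes(header[4:], "big")
                key, value = f.read(klen), f.read(vlen)
                if keydir.get(key) == (path, offset):  # the most current pair
                    keydir[key] = (merged_path, out.tell())
                    out.write(header + key + value)
    out.close()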

LevelDB

LevelDB was created by Google in 2011. Unlike Bitcask, LevelDB is a key-value store that supports range queries: the keys are linearly ordered and it is possible to return all key-value pairs for all keys falling inside the query range. It builds on top of a specially immutable file called an SST, a sorted string table. An SST is a file consisting of a sequence of key-value pairs, each key and value an arbitrary string, where the keys appear in increasing order. Each SST has an index giving each key and the offset in the file where the key appears. This index can be brought into memory to give log(N) time access to a key in an SST with one file read.
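
A sketch of a single-SST lookup, assuming the index has already been read into memory as parallel sorted lists of keys and offsets, and assuming a hypothetical length-prefixed framing for values; bisect supplies the log(N) binary search, and one seek-and-read retrieves the value.

import bisect

def sst_get(sst_file, index_keys, index_offsets, key):
    # index_keys is a sorted in-memory list; bisect gives the log(N) search.
    i = bisect.bisect_left(index_keys, key)
    if i == len(index_keys) or index_keys[i] != key:
        return None                   # the key is not in this SST
    sst_file.seek(index_offsets[i])   # one seek and one read
    vlen = int.from_bytes(sst_file.read(4), "big")
    return sst_file.read(vlen)        # the value, under the assumed framing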

LevelDB has three sorts of files,

  • an appendable log file for new or updated key-value pairs
  • a collection of level 0 SST’s, containing key-value pairs that were in a recently closed log file, now sorted,
  • a collection of level i SST’s for i greater than 0, which contain key-value pairs, sorted and in disjoint ranges by file.

Log and Level 0 SST’s

Writes append to the log file. When the log file reaches a certain size, its contents are sorted and written to an immutable SST, called a level 0 SST. The collection of level 0 SST’s might contain keys in overlapping ranges — each level 0 SST contains any and all key-value pairs written in some time interval.

Level i compaction of SST’s

Level 0 SST’s are periodically collected up with level 1 SST’s, and the entirety of key-value pairs found in this collection is sorted and a new series of level 1 SST’s is created. These level 1 SST’s contain disjoint ranges of keys, so a search for a key first binary searches for the single SST file in which the key might occur, and then binary searches the index of the located SST for the possible offset in that file at which the key occurs.

Further compaction occurs by periodically collecting up level i and i+1 SST’s which overlap a given key range, and re-sorting all key-value pairs in these files to create a new set of level i+1 SST files.

For the purpose of removing obsolete key-value pairs, the level 0 SST’s must be kept in chronological order. Sweeping through the key-value pairs, any key already encountered in the sweep blocks any new key-value pair from being copied forward — it is obsolete. Otherwise, there is at most one key-value pair for a key on any level, and level i has a more recent pair than level i+1.

Tombstones must be passed to the next level during compaction, since a level i tombstone needs to block a key-value pair on a level j which might be greater than i+1. A tombstone is eventually migrated out when compacting into the highest level SST.
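
A sketch of one compaction step under these rules, assuming each level is held as a list of (key, value) pairs sorted by key; the at_highest_level flag and tombstone sentinel are names of my own invention. On equal keys, the pair from the newer level sorts first and blocks the older pair.

import heapq

TOMBSTONE = object()   # assumed sentinel marking a delete

def compact(newer, older, at_highest_level):
    # newer and older are lists of (key, value) pairs, each sorted by key;
    # newer comes from level i, older from level i+1.
    newer_t = ((key, 0, value) for key, value in newer)
    older_t = ((key, 1, value) for key, value in older)
    out, last_key = [], None
    for key, age, value in heapq.merge(newer_t, older_t):
        if key == last_key:
            continue                  # an older pair for a seen key is obsolete
        last_key = key
        if value is TOMBSTONE and at_highest_level:
            continue                  # tombstones migrate out at the highest level
        out.append((key, value))
    return out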

Reading and writing a key value

Writing is by appending the key-value pair to the log file. Updates make old key-value pairs obsolete, and these are then masked by the search order during reads. Deletes are done through the writing of tombstones. The level compaction will drive out obsolete and deleted key-value pairs.

Reading is accomplished by,

  • scan the log file for the key;
  • if key not found, search in every level 0 SST for the key, using binary search on the associated index file;
  • if the key is not found, beginning with level 1 and continuing level by level, search among the level i SST’s to identify the single SST covering the range of the queried key, then search in the identified file using binary search on the associated index file.

In any step, if the key is found, return the value. If a tombstone is found, return that the key does not exist. If the key is not found in the log or any table, return that the key does not exist.

The search time is therefore L + (M+I) log N, for N elements, with a log of size L, I levels, and at most M level 0 SST’s.

For more information, see Ilya Grigorik’s page.

Bigtable

Bigtable is one of the most important data structures used by Google. It uses sharding to partition the set of key-value pairs into ranges by key, and distributes the shards to hosts, which store the range sorted in SST’s. A query to the table first identifies the host that stores the range including the query key, then the query is sent to that host for it to search among its SST’s, using binary search of the indices associated with the hosted SST’s.

Range queries are supported. All key-value pairs with a key in the given range can be selected and returned.

The Bigtable structure is a precursor of LevelDB, and consequently LevelDB mimics it. The table is broken into smaller tablettes; each tablette handles a range of keys. The tablettes are distributed over hosts. A metadata file, comprising a B-tree of tablette ranges, is first queried to find the tablette. The metadata file uses Chubby, a strong consistency protocol based on Paxos, to keep a single metadata view.

Each tablette is created as the logical merge of SST’s and a memtable. The memtable keeps the most recent changes, and is in memory. A minor compaction closes the memtable and transfers it to an SST. To limit the width of the merge among SST’s, periodic merging compactions replace a collection of SST’s by a single SST, retiring those SST’s so replaced. A major compaction is a merging compaction that merges all SST’s in a tablette.
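
A sketch of the logical merge on the read path, assuming the memtable and each SST have been loaded as Python dictionaries, ordered newest to oldest; the real structures are on-disk SST’s with indices, as above.

def tablette_read(key, memtable, ssts):
    # The memtable holds the most recent changes; consult it first,
    # then each SST from newest to oldest.
    for table in [memtable] + ssts:
        if key in table:
            return table[key]
    return None

def merging_compaction(ssts):
    # Replace a collection of SST's by a single SST; newer values win.
    merged = {}
    for table in reversed(ssts):   # oldest first, so newer writes overwrite
        merged.update(table)
    return merged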

See Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, all at Google, OSDI 2006.

DNS and Eventual Consistency

December 9, 2016

Overview

The internet Domain Name System (DNS) is a distributed, highly available database of facts, called Resource Records (RR’s), relevant to the operation of the Internet. Created by Paul Mockapetris in 1983, it is a harbinger of the highly available, distributed databases of today. The DNS system answers queries about the collection of RR’s by tossing the queries from DNS server to DNS server until a server is found capable of answering the query.

Rather than a single view of all RR’s, the space of RR’s is structured in a hierarchy of domains, and a domain or a collection of neighboring domains in the hierarchy are arranged as zones of authority. DNS servers that are authoritative for a zone of authority have the responsibility of maintaining the RR’s for that zone. These servers are highly replicated so that neither machine failure nor network partition will cause a RR query to go unanswered.

The likelihood of a query being answered when asked is called availability. If the network is partitioned, so that certain machines cannot communicate, replicas of the RR’s must exist in each partition. However, this means that if a RR is updated, it is possible that the answer to a query will depend on which replica is referenced. Strong consistency is a typical data model for data stores. A strongly consistent data store has a single true value for each RR, with a fully time-ordered series of updates, including creation and deletion of the RR’s, and any query is always answered with the most recently updated value.

In the face of a network partition, strong consistency is not possible, or at best possible only under limited availability. During a partition, it is not known what updates are occurring in any part of the network inaccessible from one’s own partition. If, in the best possible case, network partitions are immediately known, a halt to updates can be mandated, thereby assuring strong consistency at the expense of availability for updates. More draconian, all queries can go unanswered during a network partition.

While tacitly understood in the design of the DNS, the observation that consistency, availability, and network partition tolerance are related was enunciated by Eric Brewer in 1998. By his famous CAP theorem, a data store can have at most two of these three properties. We have given the example of how tolerance to network partition and strong consistency cannot both be properties unless availability is sacrificed.

But DNS must have high availability, and network partition is a fact of life, so the DNS sacrifices consistency. It is no longer demanded that there be a single global answer for a RR; it is allowed that the answer to a RR query depend upon the partition in which the query occurs — as long as eventually the answers converge, e.g. once the network partition is repaired. The DNS bounds the time that an inconsistent but available answer can exist by tagging each RR with an expiry time for the fact, the RR’s time to live (TTL), measured in seconds. Hence the DNS system anticipated the definition of eventual consistency for data stores.

This model was also driven by a variant of availability called scalability. Scalability is an architecture’s or protocol’s ability to continue to work at increasing levels of usage and participation. The more computers, the more usage of DNS: both more queries and more RR’s. The increase in client side demands is balanced by the increase in server side resources — domains will supply their own servers to respond to queries about those domains.

Eventual Consistency under Bounded Availability

In order to achieve high availability, the DNS system replicates the RR database and tolerates inconsistency within a time bound set for each fact, called the RR’s TTL. A RR always has a time remaining to expiration, and the handling of a RR includes tracking the time remaining to expiration. Clients are free to use a RR until expiration. Any computer queried about a RR can provide a copy of an unexpired RR, but must update the RR’s TTL to reflect only the time remaining to expiration.
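
A sketch of this hearsay rule, assuming a cache that maps names to a value and an absolute expiry instant; any answer handed out carries only the time remaining.

import time

cache = {}   # name -> (value, absolute expiry instant)

def store_rr(name, value, ttl):
    cache[name] = (value, time.time() + ttl)

def answer(name):
    entry = cache.get(name)
    if entry is None:
        return None
    value, expiry = entry
    remaining = expiry - time.time()
    if remaining <= 0:
        del cache[name]            # the fact has expired; do not serve it
        return None
    return value, int(remaining)   # serve the RR with only the time remaining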

The DNS architecture defines authoritative servers for a zone in the space of domain names. A zone is essentially a connected sub-tree in the tree of domain names. For instance, cs.miami.edu is the sub-domain cs in the domain miami, which is itself a sub-domain of edu. The edu domain is a TLD, Top Level Domain, and is said to be a sub-domain of root, and only root, which is written as the domain “dot”.

The zone of all sub-domains of edu is divided up into zones such as all sub-domains of miami.edu. The zone of authority of miami.edu has name servers that serve authoritatively for names in miami.edu; except that miami.edu is further divided, with one sub-zone being cs.miami.edu and all its sub-domains (e.g. zinc.cs.miami.edu).

The authoritative servers for a zone are intuitively the source of truth about the domain, and are allowed to send out RR’s with the full value of the TTL. All other servers, or hearsay clients, will decrement that value according to the passage of time since the RR was received. There can be multiple authoritative servers, and the conflicting demands of consistency and availability resolve, in parallel, to eventual consistency with bounded availability among the authoritative servers.

In the BIND implementation of DNS, one among the authoritative servers is the master, and the other servers are slaves. The slave servers periodically request zone transfers from the master. The polling window between zone transfer requests is a normal period of inconsistency, where different servers have different answers, until changes propagate. In case a slave cannot contact the master, a second time window is referenced. If the slave cannot refresh within this second time window, the server will no longer answer queries. (NB: SOA records contain an expire value. Some knowledge of the quality of the answer can be inferred from this expire time.)

The authoritative servers do not advertise the state of these windows. If a RR with a TTL value X is received, then either the fact stated in the RR is true, or at least it was true at some moment in the previous X seconds. While a RR has a TTL that provides a definite time of expiration, a slave server will not deprecate its answers during a period of network partition. It is authoritative up until the moment that its data store expires, and it no longer will answer queries.

In the BIND implementation of DNS, all updates to a zone increase the zone’s serial number. As an efficiency measure, a slave requests a zone transfer when the serial number of its data store is behind that of the master’s.

An authoritative server for a zone also has the privilege to mark its responses as authoritative, and clients can request that queries be answered only by authoritative servers. All other responding servers note that they are not authoritative, and in the case of negative responses (that is, a server responds that the requested RR does not exist in the zone) will helpfully and optionally provide a list of authoritative servers along with the reply.

Authoritative servers are known through certain RR’s declaring servers as authoritative. A server authoritative for a domain can declare a sub-domain as a new Start of Authority (SOA) and provide a Name Server (NS) RR for the root domain of that zone. These are called glue records, because they glue the tree of domains together, where one server hands off to another at the SOA boundaries.

A glue record is unique, because it is not technically under the authority of whoever is authoritative. For example, it is the University of Miami that declares certain servers as authoritative for the miami.edu zone; but the NS record for miami.edu belongs in the .edu data store, and whoever is authoritative for that domain determines what is reported as the NS record’s value.

N.B.: It is easy to confuse master/slave with authoritative/non-authoritative. A RR is authoritative if there is an NS record for X.Y in the datastore for domain Y. In BIND, the master and all slaves should have NS records. There are many other cases, such as shadow masters, and so on. Also, the replication among authoritative servers need not proceed according to the master/slave/serial-number scheme of BIND.

Answering queries negatively is special. A response that a RR is not found in a domain can be provided non-authoritatively, if a server has recently queried an authoritative server for the RR, and found none. The SOA record for a zone states the policy for the TTL for negative answers.

Common Resource Records

RR’s are associated with domains, called nodes. A node has a domain name, and is a bucket full of RR’s that are associated with that domain. Each RR has a family, a type, a TTL, and some value. There can be multiple RR’s of the same type in a domain. DNS is an internet distributed database of facts. The meaning attached to those facts is determined by the client’s use of those facts. DNS is by nature an agnostic purveyor of facts.

Perhaps the three most common types of RR are the SOA RR, the NS RR, and the A RR. The SOA and NS RR’s form the DNS hierarchy of names by declaring the existence of zones, and linking between servers that are authoritative for those zones. As the most common use of DNS is to find an IP address associated with a domain name, the A RR, which gives the IPv4 address for a domain, is most commonly found among the RR’s attached to a domain. The AAAA type RR gives the IPv6 address for that domain.

The next three most common, in my experience, are the MX, CNAME, and TXT RR types.

A mail domain, such as gmail.com, is best not associated with a particular host. The DNS system allows senders of mail to request the MX records for the domain, and receive a selection of hosts, ordered by preference, that are handling the mail for that domain. If no MX records exist for a domain, the spirit of the Internet (be strict with what you send, be liberal with what you receive) recommends that A or AAAA records be sought next.

A CNAME RR gives the value of another domain associated with this domain. While it seems like a good idea for renaming services or forwarding requests, CNAME’s can be troublesome. Consider a CNAME in a domain that has MX records, where the target of that CNAME also has MX records — should the mail sender union all MX records found by following all CNAME records? It is also possible to create a cycle of CNAME’s, so that looking for domain X returns a forward through a CNAME to domain Y, and Y has a CNAME back to X. There are best practices recommended by the IETF for using CNAME’s, and they should be followed.

The TXT RR attaches arbitrary text to a domain. The TXT RR’s are useful for experimentation and new mechanisms on the internet, as the interpretation of new TXT’s is entirely at the discretion of clients. The TXT records allow DNS to be a general database for storing arbitrary information attached to domain names. See, however, RFC 1464, which suggests a format when using TXT records as an extension of DNS into a network wide database.

Two examples of how TXT records are being used relate to attempts to reduce spam and phishing leveraged by fake emails. Sender Policy Framework (SPF) and Domain Keys Identified Mail (DKIM) are systems that place TXT records in the DNS system for mail domains, and they help identify whether mail is sent authentically or not.

SPF is more modest: it names the expected IP addresses that are the usual source of mail given that the domain name is the claimed source. Like a physical letter that is sent with a forged return address, hoping that the recipient will consider the mail to actually be from that return address, email return addresses are easily forged. It is more difficult, but not impossible, to forge an IP address. An SPF record for a domain provides the basis for suspicion if mail comes from a wrong IP address. Unfortunately, the SPF initiative can be of only limited help in combating spam and phishing. Email can be legitimately relayed through mail handlers. Hence, mail can come from an unexpected IP address and still be authentically sourced from the domain. Furthermore, just because mail comes from an expected machine does not mean the mail was authentic — the machine might be compromised; the sender’s mail client might be compromised.

DKIM uses cryptography to digitally sign messages. If the signing key is honestly attributable to the sending machine, and there have been no compromises of the key, then there is a definite attestation by the sending machine that this email comes from this source. However, the signature attests to the sending machine, not the human being that wrote the email. It is important not to overstate the importance of the signature. The human sender can still credibly deny sending the email, or never meaning to send the email, or having forwarded someone else’s sentiment in the email.

References

  • RFC 1464: Using the Domain Name System To Store Arbitrary String Attributes (1993)
  • RFC 882: Domain Names – Concepts and Facilities (1983)
  • RFC 883: Domain Names – Implementation and Specification (1983)
  • RFC 1912: Common DNS Operational and Configuration Errors (1996)
  • RFC 7208: Sender Policy Framework (SPF) (2014)
  • http://www.openspf.org/

Example

A typical RR is the Address Record, or A-rec, which pairs a domain name, such as www.cs.miami.edu, with an IP address, 192.31.89.16. DNS, the Domain Name System, is an architecture and protocol defined by the IETF RFC’s. BIND, the Berkeley Internet Name Domain software, is an implementation of the DNS architecture.

Matawan-3:~ ojo$ dig a www.cs.miami.edu

; <<>> DiG 9.8.3-P1 <<>> a www.cs.miami.edu
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33039
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.cs.miami.edu. IN A

;; ANSWER SECTION:
www.cs.miami.edu. 360 IN A 192.31.89.16

;; Query time: 187 msec
;; SERVER: 10.0.1.1#53(10.0.1.1)
;; WHEN: Thu Dec 8 20:41:16 2016
;; MSG SIZE rcvd: 50

Matawan-3:~ ojo$

Amazon Dynamo

November 20, 2016

There is an excellent paper on Dynamo appearing in SOSP 2007. These notes are a boil-down of that paper, perhaps casting this particular technology into a fable about an entire approach to persistence — one that emphasizes availability with reasonable consistency properties, if not strict consistency.

Modern persistence needs to scale to tremendous database sizes and query/update traffic. It has to be available without fail over long periods of time. Data must be available through maintenance and platform updates, as well as failures of hosts or networks. The data models should be supple. They should provide reasonably powerful query tools, but the complex mathematical modeling of a relational database might not be needed, and the price paid for such complexity would be effort better put to other features.

Dynamo is a database technology invented by Amazon to satisfy its own needs as a vast cloud computing shopping service. Founded in 1996 as a seller of books, Amazon has gone on to be one of the most powerful internet sales companies, innovating the shopping experience, and revisiting many computer technologies in support of those innovations. Dynamo stores key-value pairs, where both the key and the value are byte strings.

Concept map:

  • Partitioning (sharding)
  • replication and quorum
  • anti-entropy (merkle trees)
  • membership (gossip, or manual configuration)
  • consistency: last write wins (global clocks), vector clocks

Partitioning of data, and replication

Scalability is achieved by partitioning the data set and distributing the partitions across as many machines as are currently assigned to handle the data. An algorithm is needed to determine efficiently which machine will store a new data item, and which machine stores an existing data item. The algorithm should be efficient, and fairly distribute the items across all machines. The assignment of items to machines must exhibit both a random character — to distribute evenly across the available machines data items of a (more or less) unknown distribution — and a deterministic character — when retrieving an existing item, the algorithm must know exactly on which machine the item resides.

Additionally, the algorithm must support an efficient reconfigure of the set of machines. Machines might be added to scale-up for increased load, or to compensate for a faulty machine.

Classical hash functions are used in a method of creating partitions called distributed consistent hashing.

The key of the item is hashed into buckets defined by intervals in the range of the hash function. For a typical hash to integers, the sequence of buckets is defined by the relation next(i) = (i + 1) mod n. One can check that the number of buckets walked to go from a to b modulo n, inclusive of both endpoints, is (b - a + 1) % n. For instance, in the integers mod 6, according to this formula, three buckets are walked to go from 1 to 3 or from 4 to 0.

The earliest implementation of Dynamo had each data store randomly pick a number si in the range of the hash function, then use a protocol to communicate the collection {si} to all data stores, as well as to servers acting on behalf of clients to query the data store. Ordered, the partitions are then [si, si+1), and each data item hashing into a partition is placed into the data stores of the next n intervals, walking forward in the sequence of partitions.

This method has the advantage of self-organization. As new data stores are added, they contribute fairly to the partition.

The presentation has not been entirely accurate. Dynamo does not assume each machine is one data store. Replication has to be across machines, not data stores. If a machine chooses to participate in the partition by choosing several locations in the hash range, the machine’s loss will affect each of those locations. Therefore, Dynamo skips forward to the next location if it encounters an offer to replicate on a machine already represented in its locations.
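
A sketch of the ring walk, assuming each store’s chosen number si has been gathered into a sorted list of (position, machine) tokens, and using md5 to stand in for the hash; the walk collects n distinct machines, skipping tokens of machines already chosen.

import bisect, hashlib

def key_position(key):
    # Hash the key (bytes) onto the ring's integer range.
    return int(hashlib.md5(key).hexdigest(), 16)

def replicas(key, tokens, n):
    # tokens: a non-empty sorted list of (position, machine) pairs,
    # the positions being the numbers si chosen by the data stores.
    total = len({machine for _, machine in tokens})
    start = bisect.bisect(tokens, (key_position(key),))
    chosen = []
    i = start
    while len(chosen) < min(n, total):
        position, machine = tokens[i % len(tokens)]
        if machine not in chosen:    # skip a token on an already-used machine
            chosen.append(machine)
        i += 1
    return chosen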

Commitment and quorum

Dynamo considers a write committed, and a read resolved, if it receives responses from a user selectable number of replicants from among the n replicants. Dynamo defines:

  1. n, the total number of replicants,
  2. r, the number of replicants offering a response to a read, before a value can be returned,
  3. w, the number of replicants affirming a write, before a system affirmation that the write is committed (or that it will persist, or be durable).

The quorum condition is that r + w > n. If the quorum condition is satisfied, then if there is a majority value, that value will be returned among the reads. Given the condition, at least one of r or w must be larger than n/2. If r > n/2, the read set must intersect the set of replicants storing the majority value, if such a value exists, and that value will be returned by the read. If w > n/2, then any majority must be due to a confirmed write of at least w replicants, and any read of r replicants must intersect one of those.
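
The intersection argument can be checked by brute force for small n; quorums_intersect is a throwaway name, and the check holds exactly when r + w > n.

from itertools import combinations

def quorums_intersect(n, r, w):
    # Brute force: every read set of size r must overlap every write set
    # of size w; this holds exactly when r + w > n.
    nodes = range(n)
    return all(set(reads) & set(writes)
               for reads in combinations(nodes, r)
               for writes in combinations(nodes, w))

assert quorums_intersect(3, 2, 2)       # r + w = 4 > 3: every read sees the write
assert not quorums_intersect(3, 1, 2)   # r + w = 3: a read can miss the write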

Consistency

Although the quorum condition assures some propagation of a majority result, many update cases lead to inconsistent data that cannot naturally be resolved to a majority. For instance, if a stock number is decremented because of two different orders against the items in stock, the resolution of the two decrements should sum the two decrements. It is not possible to resolve this by voting — both results are wrong.

Dynamo provides that a read yielding a spectrum of results will return the entire set of results. The client must resolve the data.

One simple resolution is most recent write wins. Under this regime, all writes are time stamped to create a (somewhat arbitrary) total order according to most recent write. Among any returned read set, the most recent value is considered the result. As the replicants are synchronized, new values overwrite older values.

This is a simple, but limited, resolution regime. It does not work for our example of taking items from stock. For that example, the two writes must recognize they are decrementing from a common starting point, and when they are presented simultaneously, they must resolve to a sum of decrements, to arrive at a new common value.

Dynamo supports vector clocks, which can be used in this situation.

Logical Timestamps

In view of the difficulty of assigning a global clock, many systems use the theory of
Logical Time Ordering proposed by Leslie Lamport. The two axioms of this ordering are:

  1. In a local process (or on a single platform) there is a local event clock
  2. Event clocks between processes (or platforms) are synchronized so that a message is sent strictly before it is received.

We define causality as a transitive relationship of “comes-before” as determined by either the local clock or by a message flowing from one process to another.

The time ordering is captured by an integer assigned to each event, obeying the axioms by the rules: for two events in the same process (or on the same platform) directly following each other, increment the local clock by one and assign to the later event the incremented value; a send is a local event and stamps the message with the resulting time stamp; on receiving a message, set the time stamp of the receive to one plus the larger of the message time stamp and the current local clock.

Theorem: If two events are causally related, then their time stamps are unequal and the event with the lower time stamp occurred first. However, the converse is not true — given two unequal time stamps, it is possible that the two events are not causally related.
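
A sketch of the three rules as a small Python class; the method names are mine.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1                 # increment on each local event
        return self.time

    def send(self):
        return self.local_event()      # a send is a local event; its stamp rides the message

    def receive(self, message_time):
        # one plus the larger of the message stamp and the local clock
        self.time = max(self.time, message_time) + 1
        return self.time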

Vector clocks

The vector clock is a more robust method for determining causality among distributed processes. Events are stamped with a vector of integers, one index in the vector for each process. Let vij be the event time stamp of a certain event that occurred in process i. For each i, the value in entry j is process i’s knowledge of the event clock on process j.

Each process i keeps track of its own event clock by incrementing vii on each local event. Process i learns of event clocks of other processes through messages which are time stamped by the sender with that sender’s vector clock. The receiver then uses the update formula:

  • vii = vii + 1,
  • vij = max ( vij, uj ), j ≠ i,

where u is the vector clock time stamp on the message, with components uj.

Causality and concurrency are then defined using vector dominance: a ≤ b iff for every index i, ai ≤ bi, and a < b iff a ≤ b and for some index i, ai < bi. Strict inequality of vector time stamps is called causality, and two events a and b are concurrent if both a ≰ b and a ≱ b (incomparable, or incommensurate, in the language of partially ordered sets).
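
A sketch of the update formula and the dominance tests, with n processes and this process at index i; the function names are mine.

class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n   # entry j is this process's knowledge of clock j
        self.i = i         # this process's own index

    def local_event(self):
        self.v[self.i] += 1                     # vii = vii + 1
        return list(self.v)

    def receive(self, u):
        # u is the vector time stamp on the message
        self.v = [max(a, b) for a, b in zip(self.v, u)]
        self.v[self.i] += 1                     # the receive is itself a local event
        return list(self.v)

def before(a, b):
    # strict dominance: a < b, giving proof of causality
    return a != b and all(x <= y for x, y in zip(a, b))

def concurrent(a, b):
    return a != b and not before(a, b) and not before(b, a)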

In this ordering, causality is well defined in that strict ordering provides proof of causality. Hence, not only is the “comes-after” notion of time ordering consistent with strict dominance of vector clocks, strict dominance of vector clocks implies a “comes-after” relationship.

The FAT Filesystem

November 14, 2016

File systems are a collection of data structures that use block stores for data persistence. File systems add features to the data, including ways of labeling the data for retrieval, protection and ownership mechanisms, and mechanisms for recovery of data from unfortunate loss or misuse.

One of the most popular file systems is the FAT file system, originally designed by Bill Gates and Marc McDonald in 1977 for use in MS-DOS. FAT has evolved over the years, with versions named by the bit width of the indexing tables: 8 bit FAT, FAT12, FAT16 and FAT32. The file system is used by many devices now, and is such a common file system that it is useful for detachable media such as thumb drives, which will be accessed by a wide variety of operating systems.

All file systems have three categories of information: the superblock, metadata, and content. The superblock is a fixed sized data structure that helps the operating system identify the file system, know its parameters, and find the initial data structures on the block device. The metadata are all the data structures that organize the files and directories on the block device. The content is the pool of data blocks used to store data, other than metadata.

Many file systems have a variety of metadata. FAT has a single metadata structure: the File Allocation Table, FAT. It is remarkable that all organizing needs are satisfied by a single structure.

A block device is an indexed sequence of logical blocks. FAT works instead in clusters. A cluster could be one block, or multiple consecutive blocks subject to an alignment constraint. The FAT table is indexed by cluster index and contains as content a cluster index. The table serves several functions:

  • The sequence of clusters used to store the data of a given file are arranged in a linked list. The FAT entries provide an array-implementation of those linked lists. For this, a particular value of FAT entry is stolen from the range of valid cluster addresses to mark the end of a linked list.
  • The FAT table tracks which clusters are free and which are in use storing data. A FAT entry containing the value 0 indicates that the cluster with address the same as that index is free; any other value indicates that the cluster of same address as that index is claimed.
  • Other content values mark that the cluster with address same as the index is bad, and should not be used.

The FAT file system is laid out on the block device such that the superblock begins at cluster 0 and continues in consecutive clusters as needed; followed by the FAT table in a consecutive sequence of clusters as needed to accommodate the FAT table size; followed by a copy of the FAT table; followed by the data content region. The clusters in the data content region are numbered starting with index 2. This allows for indices 0 (the free cluster mark) and 1 (no meaning) to be reserved.

The data content region contains both files and directories. The original FAT versions only supported a single root directory containing a fixed number of files. The location of this directory was fixed at the start of the data content region, and its size, in directory entries, was recorded in the superblock. Later FAT versions supported subdirectories and allowed for a variably sized root directory. The superblock records the starting cluster for the root directory in this case.

The data for files and directories are stored in a sequence of clusters. The sequencing of the clusters is recorded in the FAT table by entering the address of the next cluster in the sequence at the index of the previous cluster in the sequence.

For instance, suppose a file is of size not larger than three times the cluster size. To store the data, three clusters are assigned to the file, and are arranged in an order, say cluster 17, then cluster 28, then cluster 21. The bytes in the file are then in correspondence with the bytes in the clusters, the first bytes in cluster 17 and the last bytes in cluster 21. The cluster numbers and their ordering are recorded in the FAT table by setting FAT entry 17 to value 28, FAT entry 28 to value 21, and FAT entry 21 to a special end-of-chain value, usually -1.

                17                  21                                 28
----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----
    |    |    | 28 |    |    |    | -1 |    |    |    |    |    |    | 21 |    |
----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----

The file or directory data can be represented by the index of the head of the cluster chain, in this case the number 17.
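
The chain walk is a few lines of code; here is a sketch, with the FAT as a Python list and -1 standing in for the end-of-chain value, replaying the example above.

END_OF_CHAIN = -1   # standing in for the special end-of-list value

def cluster_chain(fat, head):
    # Walk the FAT from the head cluster, collecting the file's clusters in order.
    chain = []
    cluster = head
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain

# The example above: FAT[17] = 28, FAT[28] = 21, FAT[21] = end of chain.
fat = [0] * 32
fat[17], fat[28], fat[21] = 28, 21, END_OF_CHAIN
assert cluster_chain(fat, 17) == [17, 28, 21]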

Files can be arbitrarily sized, so the size of the file need not be exactly the total size of the cluster chain. The directory entry for the file will give the index of the head of the cluster chain and the size of the file. If the file length does not fill the last cluster in the chain, the unused space is called slack space.

Directories in FAT are a sequence of fixed sized (32 byte) records that may or may not be in use, but the sequence always fills completely the clusters in the chain. Therefore the size of a directory can be calculated from its cluster chain as the total byte length of the cluster chain.

The 32 byte directory record contains:

  • The file name and extension in the famous 8 + 3 format,
  • An attribute byte, providing, among other attributes, whether this entry is a file or a directory,
  • The creation and modification time and dates,
  • Last access date (the time is not recorded),
  • The cluster number of the start of the cluster chain,
  • The file length.

The format of the directory entry depends on the version of FAT. For FAT32, 4 bytes are needed for the address of the cluster chain, and those bytes are used for other things in FAT12 and FAT16. Note that even FAT32 has only a 28 bit address space for cluster indices — the high 4 bits of a FAT32 cluster address must always be zero.
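
Here is a sketch of parsing one 32 byte record, using the commonly documented FAT32 field offsets (name at byte 0, attribute byte at 11, high and low halves of the cluster address at bytes 20 and 26, file length at 28); the time and date fields are skipped.

import struct

def parse_dirent(entry):
    # entry: the 32 raw bytes of one directory record (FAT32 layout assumed).
    name = entry[0:8].decode("ascii", "replace").rstrip()
    ext = entry[8:11].decode("ascii", "replace").rstrip()
    attributes = entry[11]
    cluster_hi, = struct.unpack_from("<H", entry, 20)  # high half (FAT32 only)
    cluster_lo, = struct.unpack_from("<H", entry, 26)  # low half
    size, = struct.unpack_from("<I", entry, 28)        # file length in bytes
    first = entry[0]
    if first == 0x00:
        status = "invalid; no valid entries follow"
    elif first == 0xE5:
        status = "invalid; valid entries may follow"
    else:
        status = "valid"
    return {
        "name": name + ("." + ext if ext else ""),
        "is_directory": bool(attributes & 0x10),       # the directory attribute bit
        "head_cluster": (cluster_hi << 16) | cluster_lo,
        "size": size,
        "status": status,
    }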

Directory records, called dirents for directory entries, are fixed length and are allocated to fill the cluster chain comprising the directory. The first character of the file name is used to signal whether the entry is to be considered valid and referring to a file or directory, or invalid. If the first character is zero, the entry is invalid, and so are all entries following this entry. If the first character is value 0xE5 then the entry is invalid but other valid entries might follow this entry. If the first character is an allowable character in a FAT filename, then the entry is valid. The 0xE5 and 0x00 markers allow for directory search to terminate without touching all the directory entries. How this is used depends on the handling of the directory. There is no requirement in the specification to handle directories in the minimum possible space, so even after deletion of many files, a directory that expanded its cluster chain might remain much larger than necessary.

Deleting a file or directory consists of overwriting the first byte of the file name in the directory entry for the file with the invalid mark (value 0x00 or 0xE5). Then the clusters associated with the file or directory are reclaimed by walking the cluster chain and writing zeros into the FAT entry for each cluster on the chain. Generally no erasing is done, and having raw access to the block device of a FAT file system can recover deleted data, as it is never deliberately erased, only eventually overwritten. Data from a previous file that ends up in slack space can persist over the lifetime of the file that has that slack space. Computer viruses sometimes hide themselves or their data in the slack space.

The 8+3 file name format was a great inconvenience, and FAT continues to be a popular file system even now that long file names with a variety of characters are the norm. FAT introduced a system of long file names that is backwardly compatible with the original 8+3 file naming restriction. Every file in the FAT file system has two names: a long name and an 8+3 name. Typically the user chooses the long name and the 8+3 name is created automatically. Users are rarely aware of the 8+3 name.

The long file name is entered into the directory as a series of "illegal" directory entries, just in front of the correct directory entry for the file. That correct entry has the 8+3 name. The entries are illegal because their attributes claim conflicting types, and so older operating systems skip over these entries and see only the legal entry with the short name. However, newer operating systems gather up the data that is scattered into the illegal entries and glue it all together into the long file name.

Classical ACID Persistence

November 13, 2016

Persistent data is a key resource provided by the operating system. The classical persistent store is files, and filesystems are often topics found in classic operating system textbooks. Other forms of persistence are archives, on-disk swap storage, source code control systems such as Subversion and Git, databases, and configuration systems such as the Microsoft Registry and its Hives.

Each of these providers of persistence is a serious research subject in its own right. They respond to very different needs, and are in constant compromise between what is wanted and what can be reliably provided. Many ultimately use a file system as their ultimate data store, where a file system is broadly defined as any set of data structures targeting a block device for data retention. However, even if generally the occupations of operating system architect and data store architect are distinct, issues in persistence are fundamental to operating system design.

Among the persistent data stores, the giant is the Database Management System, DBMS, better known as a database. These are certainly topics and occupations all to themselves. Computers are mostly useful because of databases. The information revolution is mostly the realization of vast databases tracking stuff, information and people. The software that implements a database, the database engine, is a substantial and highly valuable software product. Our discussion of persistence will borrow from the database community, for they are the experts on persistence, and for many years have provided both problems and solutions for persistent datastores.

Data stores are read and updated. The updates can be considered as state transitions, from the entirety of the data and metadata before the update to this entirety after the update. Reads function on the entirety at arbitrary moments, and generate results, which are thought of as revealing portions of the state.

Coming out of the vocabulary of traditional database theory are the four properties of Atomicity, Consistency, Isolation and Durability. The theory of these four properties is known as the ACID theory.

Atomicity: Atomicity is the property that the update is all-or-nothing, including all of the intermediate steps or none of them. Either the change, which might be implemented as a series of changes, completes fully, or is rolled back completely.

The fully completed transition is called a transaction. Transactional databases allow for a series of commonly provided atomic actions to be enveloped into an all-or-nothing unit, also called a transaction. Often the failure of completing a series of actions in a transaction is recovered by actively recovering the state from before the start of the transaction. This is called a roll-back.

Isolation: Isolation enhances the atomicity property by specifying in which ways reads respect the all-or-nothing quality of a transaction.

The strongest form of isolation is serializable actions, meaning that only either the initial or final states of the object are available, and interactions have a consistent time order of occurrence based on whether they observed the initial or final state.

Atomic non-serializable actions are easily demonstrated: consider two bank accounts, each of which is asked to be set $10 more than the other. A serialized solution will have the two accounts ultimately settling with one account being $10 less than the other. However, they could end up each $10 more than the other had at first without contradicting isolation, but with a failure of serializability.

In this case, blocking the advance of one transaction by another during an update could lead to deadlock, as both transactions wait for the other to finish. One technique used in these cases is optimistic locking, in which the input value is read both before and after the output value is calculated, and the result is committed into the world only if the two reads are equal. If the two reads are not equal, the transaction is retried. This turns a deadlock into a possible livelock of continual retries.
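
A sketch of the optimistic retry loop, with a Python thread lock guarding only the brief compare-and-commit step; the names are mine. The while loop is where a blocked transaction would have deadlocked, and where the livelock of continual retries now lives.

import threading

lock = threading.Lock()   # guards only the brief compare-and-commit step
balance = 100             # the shared value being transacted upon

def optimistic_update(compute):
    global balance
    while True:
        before = balance              # first read of the input value
        new_value = compute(before)   # possibly slow calculation
        with lock:
            if balance == before:     # second read agrees: commit
                balance = new_value
                return new_value
        # another transaction committed in between: retry

# e.g. taking $10 from the balance: optimistic_update(lambda b: b - 10)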

It is also possible that one action depend on both the initial and final states caused by another action, essentially straddling the time ordering of the other action. Perhaps one transaction updates two variables and a second transaction posts the sum of the two variables. Isolation would speak to whether the summing transaction must know that its read of the two variables occurred both before or both after the updating action.

Durability: Durability is the very practical concern that a transaction, once completed, never be lost. It does not need to go so far as to ensure a consistency such as a value, once committed, never being (permanently) reverted. Durability is a feature of database backup and recovery, and it is a possibility that a failure might cause a fall back to a previous database state, while recovery is in progress.

For instance, a replicated server might be available when the primary server fails. However, a gap in update propagation exists, even though the data store is impervious to server failure. The replicated server would show old values until the eventual arrival of the propagated update.

Consistency: The persistent store has constraints on the relationships between data items. Consistency states that these constraints are always satisfied. This is an abstract definition, that depends on the constraints that one wishes to have; and these constraints work with and build off of the other three properties.

For instance, consistency can mean a read-consistency that once a change is made to the database, all reads consistently see that change.

Example: Consider a filesystem that enforces line-semantics: writes to files are append-only, and transitions are only on full lines of text appended (ending with a special end-of-line character).

Atomicity will say that the ultimate state of any file has only full lines appended. Isolation says that each append is the result of starting from a previous, correctly appended file.

However, it is possible that two simultaneous appends propose two new conditions for the file. For instance, two transactions appending to the file “good\n” the lines “night\n” and “day\n” propose either “good\nnight\n” or “good\nday\n”. This is a failure of isolation, since one transaction must have identified the state as “good\n” when, respecting atomicity, the file was on the way to transitioning to “good\nnight\n”. Serialized, the two possibilities are either “good\nnight\nday\n” or “good\nday\nnight\n”.

Interprocess Communication

September 28, 2016

The process idealization provides an environment. For the process to affect the world, processes must work together, exchanging information and the timing of events. Processes often exchange information through files, and a process can even exchange information with itself, displaced in time, when information is placed in a file for retrieval later, or when instances of the process are stopped and restarted to continue on with the progress achieved up to the stopping point. Although files are a method of interprocess communication, we discuss them under the heading of persistence, as their primary feature is that they persist and have a system of cataloging.

The methods of process communication discussed here are mechanisms of interprocess communication that can be interactive and are more immediate and fluid.

Unix V6 provided a unified structure for information flows, seeing the stream of bytes as an almost sufficient model of information flow. For historical reasons, the unit of information coalesced around the 8 bit byte, and recognizing the importance of interrelationships between those bytes, unix idealizes information as a strict sequence of those bytes, one after the other, where the sequence is as important as the actual data values. There were no requirements on the rhythm of the flow, and the arrival of a byte, not its contents, can in fact be the content of the message.

While protocols more generally are discussions between processes, with information flowing interactively in both directions, the unix model saw as basic a one-way flow, called half-duplex, after historical terminology dating back to the teletype. A bundle of two half-duplex channels, in opposite directions, formed a full-duplex channel that allows for "discussion" between the processes. There is no real timing mechanism assured between the two halves of a duplex channel. Although bytes entered into a channel should arrive at the outlet in a timely fashion, that is an expectation, not a promise.

Unix V6 also provided signals, which are specifically designed to communicate by the occurrence of an event, with little channel width to parameterize the conditions of the event, just that it happened. This model is useful for its low overhead, lack of setup steps, and when the sender’s role is more facilitative than constructive — the sender of a signal needs only the receiving process’ PID and the right to send; the receiver can do as it pleases with the signal, and the receiver has no obligations to the sender for having been the target of a signal.

Signals are sent by the kill system call. The call takes two parameters, the process ID of the process to receive the signal, and a numeric identifier for the signal. The event of a signal is noted by setting flags in the process’ process control block (PCB), and at any point that the kernel returns to a thread in the process, these flags are checked; if signaled, the kernel arranges to return into a signal handler, rather than the interrupted flow of code.

Many signals have meanings that are defined by a standard. For instance, the SIGKILL signal indicates that the sender is asking that the process terminate immediately. Numerically, SIGKILL is 9. There is a signal handler installed into all processes to receive the kill 9 and to exit the process. Most signals either have no handler, or the handler provided by default can be overridden. The kill 9 handler cannot be overridden.
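
A small demonstration using Python’s signal module; SIGUSR1 is used because, as just noted, the SIGKILL disposition cannot be overridden.

import os, signal

def on_usr1(signum, frame):
    # The receiver does as it pleases with the signal; here it just notes it.
    print("received signal number", signum)

signal.signal(signal.SIGUSR1, on_usr1)   # install a handler, overriding the default
os.kill(os.getpid(), signal.SIGUSR1)     # kill(2): send a signal to a PID

# signal.signal(signal.SIGKILL, on_usr1) would raise an error:
# the kill 9 disposition cannot be changed.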

The Unix Pipe

Two processes can communicate through a byte stream channel, a construction called a pipe. In unix, pipes are constructed naturally because the standard input and output are modeled as byte streams. One process can produce bytes into its standard out and another process can consume those bytes from its standard in. A unix feature was the simple construction of such connections between programs written to consume bytes from standard in and produce bytes to standard out, using a command line notation. For instance, the program ls produces a directory listing to standard out, and the program wc counts words consumed from standard in. Therefore piping standard out of ls to standard in of wc gives a combination program that counts the number of files in a directory.

That simple notation used the pipe symbol as so: ls | wc .

The byte stream channel includes an additional signal for end of file (EOF). In the above example, not only does ls communicate to wc a sequence of file names, it signals when that list has ended. This is not represented in band; that is, there is no byte marker sent that indicates no more bytes follow, even though the ASCII standard does provide for such a byte marker in the EOF character, a.k.a. control-D. Rather, an out of band mechanism signals an EOF condition.
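
A sketch of a half-duplex pipe between a producer child and a consumer parent, using the unix system call wrappers in Python’s os module; closing the write end is what delivers the out of band EOF.

import os

# A half-duplex byte stream: the child produces, the parent consumes.
r, w = os.pipe()
pid = os.fork()
if pid == 0:                   # child: the producer, a toy ls
    os.close(r)
    os.write(w, b"file1\nfile2\nfile3\n")
    os.close(w)                # closing the write end signals EOF
    os._exit(0)
else:                          # parent: the consumer, a toy wc -l
    os.close(w)
    data = b""
    while True:
        chunk = os.read(r, 4096)
        if not chunk:          # an empty read is the EOF condition
            break
        data += chunk
    os.waitpid(pid, 0)
    print(len(data.splitlines()), "lines")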

This simple model of communication as a byte stream influenced the file system interface. A file is any element of data persistence, but in the standard unix model of read and write it is a source or sink, respectively, of a byte stream. This is true when a file is accessed by the read or write system call inside a program, or when a file is attached to the standard in or standard out of a program. This is called file redirection, and on the command line it is indicated with the less than or greater than characters. For instance, to list the directory into the file f.txt, use the unix syntax: ls > f.txt ; and to word count from the file f.txt, use the unix syntax: wc < f.txt , or alternatively: cat f.txt | wc .

There is a slight bit of magic in making this work. The ls program knows to what type of sink it is providing characters, and according to the type, will format the output differently. When providing output to the terminal, the listing is formatted in user-convenient columns. But when providing output into a pipe or file, the listing is formatted machine friendly, one filename per line.

Other Communication Mechanisms

  • Pipes
  • Named pipes
  • Sockets

Process Permissions

September 25, 2016

The operating system enforces a system of rights and permissions. Rights and permissions refer to permitted or restricted actions that one category of entity, the subject, performs on another category of entity, the object. Intuitively, the subject is a user and the object is something over which the user has jurisdiction. The operating system models this jurisdiction and attempts to enforce it.

For example, a certain user may have the right to read or modify a file. The file is the object and the user is the subject. The permitted action is to read or modify. Strictly speaking, it is never the user that takes the action, but a process working on behalf of the user. For this reason, the process is the central actor in the system of rights and permissions. A typical mechanism is for the process to have an owner, and the rights and permissions of the process are those of its owner.

To complete the access decision, there must be an object and a set of rules attached to the object-subject pair. If the rules are attached to the object, the rules are called permissions, and if attached to the subject, the rules are called rights, also known as privileges. If the right or privilege is transferable to another subject, the transferable access is called a capability.

Traditional unix security was based on a permission system for files, and the unix principle that “everything is a file”. Hence a large number of activities passed through the file system, and security was enforced as a permission on an object. Activities that are more appropriately a right, such as the right to reboot the computer, depended on a simple distinction of subjects into root, with UID 0, and any other user (UID not 0). There was also from the start a group identifier, and that was expanded into the ability to create collections of groups. However, most security decisions were made using the three permissions read, write, and execute that could be applied to one of three subjects — the file owner, the group owner, or anyone else. If the subject is root, all is permitted.

The SUID bit

The permission system of Unix required that a process have the proper ownership to accomplish its tasks. Processes are created by fork as replicas of the forking process, so there needed to be a mechanism to modify process ownership. The setuid system call was permitted only to root, and if the process was owned by root, this call could change the ownership to any other user. This mechanism alone is too limited. For instance, if this were all there were to the mechanism, it would be impossible for a user to change their own password. Passwords would be stored in a file, and that file must be changeable only by root. But any process run by a user would be owned by that user, and therefore could not change the password file. Hence, a user could never change their own password.

The SUID bit was introduced into the permission system to allow enough flexibility to accomplish such tasks. The permission system included a set-uid property. If a program had the set-uid property, that is, the file that contained the program was marked set-uid, the exec system call changed the ownership of the process to that of the file, so that the process has the rights and permissions of the file owner, rather than those of the process parent.
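The mechanism can be observed with a small demonstration program. Assuming its compiled file is owned by another user and marked set-uid (chmod u+s), the effective UID it prints is the file owner's, while the real UID remains the invoker's:

/* Print the real and effective UID. Run from a file owned by another
   user and marked set-uid, the two values differ: exec gave the
   process the file owner's effective UID. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("real uid: %d  effective uid: %d\n",
           (int) getuid(), (int) geteuid());
    return 0;
}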

The owner of a file has rights over the file’s permissions, hence this allowed the owner of a file to let an arbitrary user “borrow” the owner’s rights, but only during the run of the program. Since the owner wrote the program, the consequences of the borrowing would be under the file owner’s control, and hopefully the file owner writes a program that has no unintended consequences because of this borrowing. As described by Thompson and Ritchie:

Since the actual user ID of the invoker of any program is always available, set-user-ID programs may take any measures desired to satisfy themselves as to their invoker’s credentials. – The UNIX Time-Sharing System, Dennis M. Ritchie and Ken Thompson, Bell Laboratories, CACM 17:7, 1974.

The unix version 1 man pages describe a real and an effective UID. Exec would change the effective UID, but the real UID would be unchanged. The getuid system call would return the real UID, so that the program could make decisions based on the owner of the process that invoked the program. The setuid system call would return an error if called by a process other than root-owned, and root could only set the real and effective UID to the same value.

By version 6, the definitive "proto-unix", getuid would return both the real and effective UID’s, and setuid could be invoked by other than root, but only to copy the real UID into the effective UID, thereby dropping the privilege of running as the program’s owner. Once the privilege was dropped, it could not be regained. Even root could not set the effective UID different from the real UID. That was only possible through the combination of the exec system call and the suid permission bit on the file.
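
In modern POSIX terms, the version 6 semantics amount to the now-standard idiom for dropping privilege: copy the real UID over the effective UID. A sketch:

/* Drop set-uid privilege in the spirit of the version 6 semantics:
   copy the real UID over the effective UID. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* ... do the privileged work (e.g. update the password file) first ... */

    if (setuid(getuid()) == -1) {   /* drop privilege; it cannot be regained */
        perror("setuid");
        return 1;
    }

    /* from here on, the process runs purely as its invoker */
    printf("effective uid is now %d\n", (int) geteuid());
    return 0;
}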

Various descendants of version 6 unix set about solving the problem of being able to drop privilege and then regain it in different ways, which introduced more heat than light, and also security bugs. For instance, Unix SYSV introduced the saved uid, and a non-root user could set the effective uid to either the real or saved uid. For a full discussion see Setuid Demystified by Chen, Wagner, and Dean.

Access Control Lists

The unix permission system is limited to the three permissions read, write, and execute, interpreted variously depending upon whether the protected object is a file, a directory, or another sort of filesystem resident entity, with the subjects: owner, group, other, and root. A more flexible permission system was needed; in fact, to achieve government certification as C2 secure, according to the Orange Book standard, unix needed an improved system of access control.

An access control list (ACL) is a list of rules, each rule matching a subject and ending in an accept or reject. The list of rules is attached to the object, so that the system is a system of permissions. Still unaddressed was unix’s simplistic system of rights, such as the right to reboot or to install or upgrade software. Some rights can be simulated by creating a filesystem object to represent the right, and then attaching a sophisticated ACL to the object.
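
The first-match evaluation of such a rule list can be pictured in a few lines of C. All names below are hypothetical, for illustration only:

/* Illustrative sketch of first-match ACL evaluation: walk the rule
   list attached to an object, return the verdict of the first rule
   matching the subject. */
enum decision { DENY, ALLOW };

struct acl_rule {
    int subject_uid;        /* the subject this rule matches */
    enum decision verdict;  /* accept or reject */
};

enum decision acl_check(const struct acl_rule *rules, int nrules, int uid)
{
    for (int i = 0; i < nrules; i++)
        if (rules[i].subject_uid == uid)
            return rules[i].verdict;
    return DENY;            /* no matching rule: reject by default */
}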

An implementation of enhanced access control for Linux, SELinux, was undertaken by the National Security Agency, NSA.

Rights and Capabilities

The Process Lifecycle

September 24, 2016

Processes are created by other processes using system calls. At least one thread is created, and that thread can create other threads in that process. When all threads have completed their tasks, the process begins its dissolution. The process is first placed in a dead state, where the process context still exists but there are no running or runnable threads, and finally the process is completely destroyed, with all its resources deallocated and returned to the operating system.

Unix has a particularly simple way of handling processes. A process is created by a fork system call. Inside the kernel, the fork call results in the creation of a new process, whose assets are an exact clone of the old process, at the moment of the call, and then the kernel returns to both processes, so there are now two almost completely identical processes running. The old process is called the parent, the new process is called the child.

They are almost identical, but not exactly. The return value to the parent process is the process identifier, or PID, of the new process; the return value to the child process is 0. Zero is never a valid PID.

If the parent and child processes are to act differently after the fork, they will check the return value to fork in an if statement, and branch to different pieces of code. Typically, the goal of the parent is to launch not just a new process, but a new application, running in the process. The child code then would contain an exec system call whose parameter is the name of the application. After the fork, the child only runs the exec call, and the effect of that call is to replace the assets of the process with those of the named application, and then start the thread at the start of the application’s code.

In code, it is simple. For a parent to fork off a new process that will run the directory listing program, /bin/ls,

if (!fork()) execl("/bin/ls", "ls", (char *) NULL);
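
For completeness, a compilable sketch of the same idea, using execl, one member of the real exec family, and having the parent wait for the child to finish:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0) {                   /* child: becomes /bin/ls */
        execl("/bin/ls", "ls", (char *) NULL);
        perror("execl");              /* reached only if exec failed */
        _exit(1);
    }
    waitpid(pid, NULL, 0);            /* parent: wait for the child */
    return 0;
}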

A newly created process moves to the ready state when there is sufficient resource that some thread in the process can run. Threads then receive cycles, that is, are assigned to a physical thread, while in the running state. If they expire their time slice, they are removed from the running state and put back in the ready state, where they await the allocation of additional cycles. A thread might also take an action that blocks it from further need for cycles until some event occurs. An example of this is the thread making a file read request. Until the read completes and the data is delivered to a buffer accessible to the thread, there is nothing for the thread to do. The thread is placed in the wait state, and no waiting thread is ever considered for assignment of cycles.

When the event occurs, the thread is then moved from the wait state to the ready state, and will then eventually receive cycles.

[Figure: Process Life Cycle]

When a thread has run its program, it exits. This means that the control structure for the thread is marked as a zombie (unix terminology), and the operating system never gives the thread any more cycles. No more cycles are needed, as all the code has been run. A process for which all threads are zombies is a zombie process.

The zombie state allows for rendezvous and status pickup by other processes that are somehow dependent on the state of completion of the thread, or on an integer that the thread returns, said to be the exit status of the thread. Running threads or processes can call the system with the PID of the process, or the TID of the thread, thus reaping the process or thread. The act of reaping releases the process or thread from the zombie state; the process or thread control structures are then destroyed and all resources reclaimed. The process or thread is now gone.

Unix provides a wait system call that implements a wait for the event that the child process goes zombie, a death-of-child event. The unix shell works using forks, execs, and waits. Unix allows any program to be a shell program — it is simply the first program run when the user logs in. Typically it is a proper shell program, which loops endlessly between reading the user’s commands and running them.

For the most part, every command is the name of a program to run. To list a directory, type “ls”, and the shell runs the program “/bin/ls”; to copy a file, type “cp”, and the shell runs the program “/bin/cp”. It does this by parsing the command line, and doing a fork-exec with the command name that results from the parsing. The shell then waits for the command to exit before returning to parse another command. The shell and the run command share the terminal window, so the shell must wait or it will read the user input instead of the command reading that input, or will write out prompts mixed in with what the command is writing.

Overall, it is very easy:

char *parsed_command;
int child_result;
if (!fork()) execlp(parsed_command, parsed_command, (char *) NULL);
wait(&child_result);
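
Filled out into a compilable, if toy, shell. This is a sketch under simplifying assumptions: each command is a single word, and real argument parsing is omitted:

/* A toy shell: read a command, fork, exec it, wait for it to exit. */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char line[256];
    int child_result;

    for (;;) {
        fputs("$ ", stdout);                    /* prompt */
        if (!fgets(line, sizeof line, stdin))   /* EOF: user logged out */
            break;
        line[strcspn(line, "\n")] = '\0';       /* strip the newline */
        if (line[0] == '\0')
            continue;

        if (!fork()) {
            execlp(line, line, (char *) NULL);  /* search PATH and run it */
            _exit(127);                         /* exec failed */
        }
        wait(&child_result);                    /* reap the zombie child */
    }
    return 0;
}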

The Unix threads

When it comes to process creation, the unix fork was not only a mechanism to accomplish process creation, it decisively influenced every process creation primitive that came after it. However, threads were not common currency in 1969 when unix was born, or even in 1988 with Unix System V Release 4 (SVR4, as it is better known). The result is that unix thread models are neither pretty nor influential.

Linux has replaced “fork” with “clone”, which is given a vector of items that are shared or “cloned”. The Linux community is not very strong on theory, however, so despite the practical significance of their innovations, they rarely know exactly what they are talking about. This prevents any other flavor of unix from following Linux in its innovations, but few are ever tempted.

POSIX has worked as a committee to produce a threads API worthy of a committee. But this is what we have and this is what we must use.

Threads are complicated because programmers are not good at concurrent programming, the notion of two algorithms running side by side sharing data structures. Helper structures tend to be required. Newer languages, such as Java and Python, provide more interesting thread structures, and call upon the compiler to help with their correct deployment. Rather than a thread library, these languages have a Thread Object. The object encapsulates the code the thread runs, and messages can be sent to the object to coordinate the thread with the other threads in the process.
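
In C, the committee's API is the POSIX threads (pthreads) library. A minimal sketch: create a second thread in the process's context, then rendezvous with its exit (compile with -lpthread):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    printf("hello from thread %s\n", (const char *) arg);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, worker, "one") != 0)
        return 1;
    pthread_join(tid, NULL);    /* rendezvous with the thread's exit */
    return 0;
}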

Operating system programmers, however, must be proficient in concurrent programming. Most modern operating systems are concurrent. With multiple cores on a CPU this concurrency is in fact real not simulated. Also, the operating system is responsible for delivering the abstractions used by Java and other higher level languages.

Virtual Threads and the Context Switch

September 24, 2016

The process abstraction models the requirements for running programs. It is a container for the context and the collected resources that will be used by a running program. A thread is the ability, or the actuality, of executing the program.

The Thread Abstraction is the conception of a thread. Physical threads, which are what carry out the execution of the code, are an ability of the hardware. This ability is found in the Central Processing Unit, or CPU. The CPU can walk a sequence of instructions and update its state, and the state of the world, according to the instructions’ commands. To each thread there is associated an Instruction Pointer (IP) — a CPU register containing the address of the machine instruction to be executed. The CPU will fetch the instruction at the memory location indicated by the IP, execute the instruction, and go on to the next instruction, as directed by the instruction (e.g. a jump) or by default by incrementing the Instruction Pointer.

Originally, a CPU offered only one thread. There was one IP, and one of each other special purpose register. The entire CPU was dedicated to one flow of instructions. The multiplicity of threads offered by current CPUs is the result of either the CPU actually being a bundle of separate CPUs, which is called multi-core, or of a single CPU (core) cleverly multiplexing hardware so that it can keep track of more than one instruction flow, and execute multiple flows simultaneously, which is called hyper-threading. At the time of this writing, the Intel Xeon Processor E7-8890 v4 has 24 cores, each with 2 threads, for a total of 48 threads, and a list price of just over $7,000.

Just as Virtual Memory provides an idealized memory space, making use of physical memory to realize the functionality, Virtual Threads are an idealized resource which is realized by physical threads. The operating system is responsible for realizing virtual threads by matching them to physical threads, on demand, and by negotiating between the requests for virtual threads, which might far outnumber the available physical threads.

Historically, the apparent multiplication of a single hardware thread into several threads began with the demand for Time Sharing. With time sharing, a single machine with the hardware capabilities of only a single thread could satisfy the work demands of multiple users, each working interactively with the computer, constantly posing requests and getting responses. This was done by breaking clock time into short intervals, and having the physical thread accomplish the work demanded by one virtual thread during one short time interval, then move on to the work demanded by a different thread during the next.

The original word for this was time-slicing, as the exact method used in the simulation is to run a few instructions of each thread in rotation, slicing time up into intervals. This innovation was associated with operating systems that had time sharing in the name, such as IBM’s Time Sharing Option (TSO), introduced in 1971. The forerunners were the Dartmouth Time Sharing System (DTSS), begun around 1962-64, and MIT’s Compatible Time Sharing System (CTSS), which, through MIT’s Multics, influenced Bell Labs’ Unix.

In those days, rather than a formalism of virtual and physical threads, the viewpoint was more like a single chef juggling multiple pizzas in the air. However, with the availability of multiple physical threads, the issues of correctly assigning work to the physical threads increased in complexity, and one really needed to take a wider look around if one were to have a path towards resolving the complexity.

The context switch

While the entire discussion of implementing threads is covered in a later chapter, the basics are presented now, so that the reader does not lack for a concrete example of the concepts being presented.

When a physical thread is used to carry out the work of a virtual thread, it executes in the context of a particular process. The memory the thread accesses is the memory associated with that particular process; the files the thread might read and write are the files opened by that particular process. A physical thread runs for the time slice in that context, and then is moved to a different context for the next time slice. The mechanism of moving the context is called a context switch.

A context switch is demanded by the operating system in order that each virtual thread receive its share of the processor’s physical threads. The delivery of cycles to a process must be timely and sufficient: timely in that there is often a limit to the delay that can be tolerated between runs of the process. For instance, the mouse moving routine must be invoked frequently enough that the cursor appears to move smoothly; and sufficient in that enough cycles must be allocated to the process that it keeps up with its computational appetite.

A context switch occurs in kernel mode. It is worked into the general syscall mechanism. When a user process enters the kernel because of a syscall or interrupt, the kernel might pause the process rather than return to it. If it does this, it will return to a different process, one that was paused earlier in this same manner. A process might make a syscall in order to obtain data. Often the data is not immediately available: it might need to be requested from disk, from a network connection, or from a user’s typed commands. This is a perfect opportunity for the process to be paused, to be continued when the data is available.

A program might, however, attempt to run for a long period of time without a syscall. To ensure that each process has its timely and sufficient access to the CPU, most operating systems use preemptive multitasking, and will force the running program to yield the CPU to another process. Preemptive multitasking requires a hardware clock that interrupts the processor every so often.

An interrupt is similar to a syscall. Both are a form of trap: they cause the running process to jump away from its code to kernel code running in kernel mode; they remember on the stack enough information to return to the interrupted task; the return will reset the processor back to non-kernel mode. However, an interrupt is asynchronous — it pushes its syscall into the instruction stream, so to speak, motivated by an external event. A syscall is synchronous — the syscall appears as an instruction in the instruction stream.

=======+ Process A         +===== Process B
       |                   ^
  ---- | ----------------- | ------- user/kernel
       |   +===scheduler===+
       V   ^     ^  ^
       +===+     |  |
       ^ ^       |  +---- Context of Process B
       | |       +------- Context of Process A
       | |
       | +- clock sets "need scheduler"
       +--- interrupt

HOW PRE-EMPTION WORKS IN TIME SLICING

The diagram above shows the logic path for a context switch driven by a preemption. At the clock tick, a hardware interrupt occurs. This transitions the thread into the kernel to handle the interrupt. While handling the interrupt, the user context is ignored, and the thread runs purely in system context. In effect, the user thread is being borrowed to do system work.

The return from the interrupt is not directly back to the interrupted user context and instruction stream, but travels through scheduling code. If it is determined that the thread should be reassigned to a different context, all data associated with the current thread is stored into the PCB and thread structure, and the next thread structure is read out into the processor’s registers. Then the thread returns.

However, it is returning in the context of the new process (Process B in the diagram), so it returns at the interruption point that previously occurred in process B, either because at its last run that thread was pre-empted, or because an event had otherwise forced that thread into the kernel and the thread had been reassigned. It now picks up where it left off.
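
The bookkeeping just described can be pictured in C. Everything below is an illustrative sketch with hypothetical names; the two register routines would, in a real kernel, be a few lines of assembly:

/* Save the outgoing thread's processor state into its control
   structure, then load the incoming thread's saved state. */
struct cpu_context {
    unsigned long ip;        /* instruction pointer */
    unsigned long sp;        /* stack pointer */
    unsigned long regs[16];  /* general purpose registers */
};

struct thread {
    struct cpu_context ctx;  /* saved state while not running */
};

/* In a real kernel these two are a few lines of assembly. */
void save_registers(struct cpu_context *ctx);
void restore_registers(struct cpu_context *ctx);

/* Run by the scheduler, in kernel mode, after the clock interrupt. */
void context_switch(struct thread *from, struct thread *to)
{
    save_registers(&from->ctx);    /* park Process A's thread state */
    restore_registers(&to->ctx);   /* load Process B's saved state; the
                                      return now lands in B's code */
}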

The Thread Abstraction

September 23, 2016

A process is a data context and one or more threads that run in that context. A thread is the ability, or the actuality, of executing a sequence of instructions. The computer instructions, loaded into memory, are a part of the context. A thread is the running, or potential running, of that code.

A thread is an ability of the hardware. A modern CPU can execute multiple threads. Each thread is, roughly speaking, an Instruction Pointer. The CPU will fetch the instruction at the memory location indicated by the pointer, execute the instruction, and go on to the next instruction, as directed by the instruction (e.g. a jump) or by default by incrementing the Instruction Pointer. Although the purist might cringe at my being so hardware-centric in this explanation, I think one must reference the hardware in order to clearly understand a thread. A thread, as defined in computer science, is an abstraction of an ability of a CPU to run instructions.

In this case, I believe, the hardware came first. Machines were built that could carry out a list of written instructions. As the machines became more complicated, managing their own carrying out of the instructions, the thread became the thing that was managed. For instance, these machines have memory caches to store recent results. The managing of those caches is also a machine action, but an action in support of the “thread”, and if multiple threads are running, the cache actions include negotiations between the threads.

Speaking of hardware: a processor chip often has multiple, almost separate CPUs. This is called multi-core. Each core, however, can be cleverly constructed to execute multiple threads at a time, that is, have multiple Instruction Pointers and share hardware resources so that the data associated with each thread stays correctly associated. This is called hyper-threading. The total number of threads is the product of the number of cores and the hyper-threading multiplicity of each core.

The operating system provides for processes, including the process’ demand for threads. The hardware can really provide only a fixed number of threads. The operating system treats these hardware threads as a resource to assign to software threads, making them active at the correct moment, to simulate an unlimited number of hardware threads. The original word for this was time-slicing, as the exact method used in the simulation is to run a few instructions of each thread in rotation, slicing time up into intervals. This innovation was associated with operating systems that had time sharing in the name, such as IBM’s Time Sharing Option (TSO), introduced in 1971. The forerunners were the Dartmouth Time Sharing System (DTSS), begun around 1962-64, and MIT’s Compatible Time Sharing System (CTSS), which, through MIT’s Multics, influenced Bell Labs’ Unix.

The Unix Fork

To better ground this discussion, we will walk through the various ideas using the unix system as an example. Unix has a particularly simple way of handling processes. A process is created by a fork system call. Inside the kernel, the fork call results in the creation of a new process, whose assets are an exact clone of the old process at the moment of the call, and then the kernel returns to both processes, so there are now two almost completely identical processes running. The old process is called the parent, the new process is called the child.

They are almost identical, but not exactly. The return value to the parent process is the process identifier, or PID, of the new process; the return value to the child process is 0. Zero is never a valid PID.

If the parent and child processes are to act differently after the fork, they will check the return value to fork in an if statement, and branch to different pieces of code. Typically, the goal of the parent is to launch not just a new process, but a new application, running in the process. The child code then would contain an exec system call whose parameter is the name of the application. After the fork, the child only runs the exec call, and the effect of that call is to replace the assets of the process with those of the named application, and then start the thread at the start of the application’s code.

In code, it is simple. For a parent to fork off a new process that will run the directory listing program, /bin/ls,

if (!fork()) execl("/bin/ls", "ls", (char *) NULL);

The unix shell works this way. The shell is the command line program that implements the user’s requests made through typing. Unix allows any program to be a shell program — it is simply the first program run when the user logs in. Typically it is a proper shell program, which loops endlessly between reading the user’s commands and running them.

For the most part, every command is the name of a program to run. To list a directory, type “ls”, and the shell runs the program “/bin/ls”; to copy a file, type “cp”, and the shell runs the program “/bin/cp”. It does this by parsing the command line, and doing a fork-exec with the command name that results from the parsing. The shell then waits for the command to exit before returning to parse another command. The shell and the run command share the terminal window, so the shell must wait or it will read the user input instead of the command reading that input, or will write out prompts mixed in with what the command is writing.

Unix provides a wait system call that implements a wait for the event that the child process goes zombie, a death of child event. It then collects the result value from the zombie process, releases the zombie process to be dissolved by the operating system, and returns this result value as the return value of the wait call.

Overall, it is very easy:

char *parsed_command;
int child_result;
if (!fork()) execlp(parsed_command, parsed_command, (char *) NULL);
wait(&child_result);

