Shannon-Hartley Theorem

April 24, 2012

It was only around the turn of the century that information has existed as a scientific entity. Hartley wrote about it in 1928, as it became very important for the transmission of telegraph and telephone signals. By now, information is not just a natural quantity, it is a commercial quantity, bought and sold. We are very concerned that our cell phone plans have enough megabytes, just as much as our homes have enough electricity. It is metered and charged for.

I have to skip over much of the definition of information. Be warned, however, that information in the scientific sense is not always the same as information in the colloquial sense. Random noise is very dense in information in the scientific sense, but often is despised as, well, noise, in the colloquial sense. Random noise is information dense because I cannot predict what it will be until I see it. Information is about my determining what was sent, not what was meant by the sender. Information is measured in bits. One bit is the smallest quantum of information. It is the equivalent of a yes/no answer to a single question. If I play 20 questions, I play 20 bits.

Information is sent from one place to another through a channel. It might be a fluctuating voltage in a wire; or the compression wave traveling through air that constitutes a sound; a flashing light, either across open space or captured in a fiber optic. The concern of Hartley, and later Shannon, is how much information can go across a channel, and how well it can cross it. The result that we are discussing says that a channel has an innate capacity in bits per second, determined by its physical composition, that equals: C = B log_2 ( 1 + S/N ); where B is the bandwidth, S is Signal Strength, and N is Noise Strength. These are terms we need to discuss.

Hartley-Shannon

Bandwidth of a channel

Information will be sent by disturbing some physical quantity at one of the channel and attempting to observe that disturbance at the other end of the channel. A shock wave of air caused by the vocal cords are decoded by the ear; a voltage change in an ethernet cable is measured at the far end of the cable. It had been observed that the channel is often reluctant to allow fast changes. There is a limit to how quickly the channel can react, and hence if too many changes are made in an instant of time, for instance, the transmission of too many bits are attempted per second, the changes will be lost, and the information will not arrive at the other end of the channel.

For instance, in electrical engineering there are simple RC circuits. A resistor is placed in series with a capacitor so that current will flow first through the resistor and then into the capacitor. A large voltage applied across the combination will cause the current to flow. The current flows through the resistor, creating a voltage across it, and then into the capacitor, which “fills” and whose voltage rises until it ultimately matches the applied voltage and the current flow ceases.

If one is unfamiliar with electrical engineering, the same situation can be modeled with two basins of water and a thin tube connecting them. One basin is full, the other empty, and water flows through the tube to equalize the water level across them. As long as there is a difference in level, current flows and the water builds up on the formerly empty basin. Once the levels are equal, the flow stops.

RC circuits

If I were to try to send information by suddenly increasing the level of water in one of the basins, the other basin’s level would not rise immediately. It would rise at a certain time constant which depends on the resistance the tube gives to the flow of water. The reciprocal of this time constant is the bandwidth, the quicker the levels come close to even the quicker I can adjust the level once again, allowing me to communicate by a sequence of level changes.

The model in electrical engineering would be C dV/dt + V/R = 0, where C is the capacitance of the capacitor, the amount of current that is required to make a given voltage change, analogous to the capacity of the basin: how much water must flow in to make a given level change. R is the resistance of the resistor, how much it impedes flow. The derivative on the voltage expresses that for capacitors the change in voltage (change in level) is caused by current flows. The sum equaling zero means that all current that flows in or out of the capacitor is accounted for by the exact amount of current flowing through the resistor, and the two find balance.

The solution to this equation is, roughly, V1 = V0 * (1 – e^(-t/(RC)), when a sudden change of V0 (or level L0) is applied to the system. RC is the time constant tau. If it is large, t must be large for the voltage to fall, lets say, to e^(-1) of its original value. So t = tau is the amount of time required for the voltage to fall to 1/e = 0.37 of it’s original value. Looking at dV/dt of this solution, 1/tau is the slope of the curve at time zero – it is the maximum speed the circuit will follow a voltage change. This value, 1/tau, is called the bandwidth.

Noise, and Signal to Noise ratio

Every system has noise. A reading of the voltage at the receiving end, or of the level of water in a basin, as an error tolerance. The value read cannot consider changes within that error tolerances of being attributable to the quantity measure, it is attributable to the error of measurement.

Consider a series of changes in voltage, meant to communicate a sequence of values. Given a fixed bandwidth, the system can slide up to equalize with that change just so fast. If the changes are made slowly, the bandwidth allows the system to slide up to each change, and the output reads with good accuracy the input values. As the changes come more rapidly, the system fails to keep up, and is not near sliding up to the applied value before the next value comes along. This can be corrected for at the receiving end, but eventually the information comes so rapidly that the amount of change is more or less equal to the noise band. In this case, the information is lost. The changes seen will be washed out by noise, or can equally be attributable to noise as to signal.

The Shannon-Hartley theorem shows this in the factor 1+S/N. The largest range of possible output values, the signal S, is divided into brackets by the width of a band of noise N. However many levels are partitioned is the practical limit of measures. We can see that the value is in one bracket or another, but within a single bracket, the details of the measurement are as much determined by the noise as by the intended value. Hence the channel capacity is a product of the bandwidth and a function of 1+S/N.

Consider going from a 1+S/N of 2 to a 1+S/N of 4. In the first, I have two brackets, one at minimum signal and another a maximum and they are separated by the width of noise, so I can be probably correct in assigning a measurement closest to the maximum as maximum (since it is an entire noise band away from the other possibility, the minimum), and vice-a-versa. In the second, there are 4 brackets. However, I can also have 4 brackets, virtually, if I double my transmission speed and send in the time of one change, two changes, each using only a 1+S/N of 2. Likewise, I can either have a 1+S/N of 8, or triple my transmission speed and send three changes in the time of one, each with a 1+S/N of 2. That means that it is the log_2(1+S/N) that is the equivalent of bandwidth – I can either quadruple my sending rate or take my signal to noise ratio to the fourth power to achieve comparable channel capacity.

Conclusion, Shannon’s theorem

We have justified that the channel capacity C is B log_2(1+S/N). It depends on the bandwidth B of the channel, that is, it’s ability to transmit disturbance quickly, and the signal to noise ratio S/N, that is, the ability for small changes to be detected as significant, and not lost in the noise level. Information theory gives us guidance in constructing high capacity channels which are physical systems with large bandwidth and large signal to noise.

There is noise, and the detection of our information is limited by it. Sometimes, the noise will cause an error in detection. As we move our sending rate towards the capacity C, you might believe that the errors are more frequent, as we begin to brush the noise levels in our hurry to apply the next level change. This is true. However Shannon showed that (the Shannon noise-channel coding theorem) as long as the sending rate does not exceed C, error-less communication is possible. It will require error-correcting codes to combat the noise. The error correcting codes will themselves reduce the usable information rate, because the codes will adjoin correction information. However, in the limit (what limit?, you might ask) the overhead can be made proportionally negligible.

On the other hand, (the Shannon source coding theorem) it is possible to optimally discard information so that it can fit into a channel. This is called lossy compression. There is also loss-less compression, and it is useful to squeeze out redundancy or noise that has been introduced in data, so that a seemingly large amount of information can be written as a smaller amount of information without “essential” loss.

Quadratic Residues in Z mod a Blum integer

April 23, 2012

Blum integers are integers n such that n is the product of two distinct primes p and q, where p and q are both 3 mod 4. The permutation given by x goes to x squared mod n, when restricted to the quadratic residues of the integers mod n, is the basis of the Blum-Blum-Shub pseudorandom number generator.

The structure of square roots mod p=3 (4)

Given a prime p = 3 (4), either x or -x has a square root. Consequently, -1 does not have a square root. If -1 had a square root, then the fourth roots of 4, i.e. 1, -1, i, and -i (for the “imaginary quantity” i) would for a subgroup of the group of units mod p, hence 4|(p-1), which is does not since p-1=2(4).

Since x and -x have the same square, we can pair up (p-1)/2 elements, and unless for some pair (x,-x) there are y and y’ such that y^2=x and y’^2=-x, exactly one of the pair is a square (by a counting argument, a.k.a. the pigeonhole principal). In the former case, y/y’ would be a square root of -1, hence exactly one of x and -x has a square root, for all x.

Therefore, if y has a square root, exactly one of the two square roots itself has a square root, and that is called the principal square root.

The structure of square roots mod a Blum Integer

Since x^2=y(pq) would give x^2=y(p) and x^2=y(q), x is a QR mod pq iff x is a QR mod p and mod q simultaneously. If the factorization of pq is known, the square root is easily computed from the Chinese Remainder Theorem and the fact that x^2 = x^((p+1)/4) (p) for a prime p which is 3 mod 4. Continuing to work simultaneously, if y has a square root, it has exactly four square roots (corresponding to the lift by the CRT from Zp x Zq to Zpq of (xp,xq), (-xp,xq), (xp,-xq) and (-xp,-xq), where xp^2=y(p) and xq^2=y(q)), and exactly one among these four itself has a square root, and is called the principal square root.

Of the three square roots of a square y which are not themselves squares, two are easily identified has not being squares by calculation of the jacobi symbol. However, if xp is the principal square root of y mod p, and xq is the principal square root of y mod q, then the lift of (-xp,-xq) to mod n will have Jacobi Symbol of 1, and cannot by that method be identified as a non-square. This pair can also be identified as square roots x of y such that x = -x and neither x nor -x is a square. Such numbers are called pseudo-squares, and deciding for a given x, which is either a square or a pseudo-square, it is conjectured to be as hard as factoring (more or less) to decide between whether the number is a square or a pseudo-square.

The remaining two square roots of y are negatives of each other, both have Jacobi Symbol 1, and exactly one (the principal square root) is itself a square. The squaring map is a permutation on the set of squares, and looks like a disjoint collection of cycles, if one were to visualize the situation as a graph where x and x^2 are connected by a (directed) edge. 1 is a self-loop this graph.

Zero knowledge login

April 16, 2012

Some authentication mechanisms depend on a secret knowledge to identify the subject. For instance, knowledge of a password. To prove knowledge of the password, the subject produces the password. This requires that the authenticating party know the password as well, so the knowledge is not that secret. The authenticating party must be authenticated, else that party will now know something it did not know before, and can impersonate the subject.

Challenge-response and one-time passwords mold the information to mitigate the vulnerabilities of password systems. Challenge-response protocols harden against eavesdroppers, or untrusted parties receiving the “password” but keeping the knowledge secret, and just providing a aspect of that knowledge for verification, i.e. the value of a function evaluated on the combination password and random challenge. The password is not revealed even to the authenticating party. One-time passwords disqualify the knowledge on use.

A very interesting protocol are those using Zero-knowledge. This protocol proves that the subject has the secret information, but in such a way that the authenticating party learns nothing other than that the subject knows the secret. Nothing further is learned. The authenticating party cannot impersonate the subject because it has not learned how.

These protocols depend on complexity based cryptography. This means that the security depends on the fact the certain computations, although possible, take too long. Even though computers continually increase in speed, it is possible to propose problems that increase in difficulty as the computer increases in speed, continually outrunning to an ever greater margin the increase in computing power.

On such problem (allegedly) is taking a square root in the system of integers modulo n, where n is the product of two distinct primes, each which is 3 modulo 4. Each element of this as a number theoretic reason to be in the problem statement, but suffice it to say that there is no reasonable way, given an X for which there is an x such that x2 = X modulo n, to actually calculate out the x.

Suppose the Prover P want to prove to the Verifier V that it knows an x such that x2=X mod n, where X and n are publicly known. However, besides proving this to the V, P wants V to learn nothing at all more. This can be done:

P picks a random R (which has a square root mod n) and sends it to V. V can decide whether to ask P to reveal the square root of R, or the square root of R*X. P does as asked, giving an r such that r2=R mod n or and r’ such that r’2=R*X mod n.

Argument the first: If P is honest in all respects, it can answer either of these questions, although it will only reveal the answer to one of them. It therefore can demonstrate knowledge of r and r’, and hence it must also know r’/r = x. (What is knowledge: knowledge anything you can compute in reasonable time from what you know. The ratio of two known quantities is something you know, because it can be computed easily).

Argument the second: If P is lucky, it can answer on of these questions without knowledge of x. P must predict V’s choice. If V will ask for the square root of R, P generates R by picking a random r and revealing R = r2 mod n. If V will ask for the square root of R*X, then P picks are random r and reveals R = r2/X. In both cases, P answers r correctly to V’s choice.

Argument the third: By repeated play of this game, P’s luck can be made to run out. How often can it guess V’s choice, assuming V makes its choice from an unbiased coin not under P’s control? Hence after repeated plays, V can assume with only a small possibility of error that P knows x.

Argument the fourth: However, what has V learned? The coin flips cannot contain the secret of x, and therefore why not let V play the role of P, except that it does know the result of the last coin flip. So V can send itself either r2 or r2/X, as required for r to be the answer to V’s inevitable question (to itself). Hence the entire protocol can be accomplished without P at all, and hence nothing could have been learned from P.

Argument the fourth, alternate: If I might modify the protocol by introducing a negotiator N. N interposed between P and V and does not let V see the R sent by P until it receives the choice from V. It then releases the choice to P and sends the R and the appropriate square root to V. From P’s point of view, nothing has changed. From V’s point of view the order is different. First it sends its and then receives two answers in a row.

That said, N can stop asking P, maybe P doesn’t want to be bothered anymore, and by the techniques of argument the second, N can go on and complete the protocol without P, and without knowledge of x. Therefore V can be receiving no knowledge from N, since there is no knowledge to give. But how will it get more knowledge if N asks P, rather than answering V itself. After all, V can detect absolutely no difference during the times P is on break and N answers, or when P answers.

Low entropy compression

March 22, 2012

Huffman coding doesn’t seem to do the right thing in cases of low entropy, such as a 0-1 data source where p, the probability of a zero, is very close to 1. One possibility is to batch several symbols into one, block encoding, and Huffman code over the blocks. Letting block size increase should limit to the correct average length of code. The calculation can be done by assuming that k-bit sequences are infinitesimally less likely with each additional 1 in the sequence.

Another possibility is to run-length encode, and as an example scheme to consider since the average waiting time for a 1 is 1/p, to set aside k=(-log 1/p) bits to store a count of consecutive zeros. A count of 0 means a one appeared. So the string 0010000001100000001 would encode, for k=3 as 010 000 110 000 000 111 000. (or 0010 1 0110 1 1 0111 1 to encode every counter be beginning by a 0, and compress 000 into a single 1).

Browser programming

January 31, 2012

Browser programming combines a good number of technologies. That fact that writing a web page means mastering at least four technologies and their interactions makes it a challenge. It also must begin with the warning that it’s a rough-and-tumble world of competing technologies, imperfectly realized, and terribly under-documented. However, browser-as-a-platform has been extremely influential, and will continue to dominate the direction of software for at least several year more.

The first technology to master is the HTML language. HTML is a markup language. It is a written description of a page, using tags to describe the graphical elements of the page. It is most noted for its invention of the hyperlink, although they are now simply called links, which can lead from one page to another. The entirety is actually pretty sophisticated. The links are actually resources, often pages, but they can be database queries or calls to action, such as with forms.

The second technology to understand is Javascript. Javascript is a scripting language. Although there is no particular reason why it should live in the browser, and exclusively in the browser, it does. It is an interesting language, with some very unusual concepts. Actionscript, in Flash, has converged with javascript. It is considered to be a very high level language, because it can treat functions as variables. It is object oriented, but uses prototypes rather than templates as the foundation of the objects.

The third technology is the Document Object Model, or DOM. This is the connection between Javascript and the HTML page. If Javascript is to manipulate the page, it needs to access the page’s contents. It does this through objects that are instantiated behind the scenes, and delivers page information into the Javascript environment. The DOM is most confusing and variable piece of the puzzle. Since it bridges javascript and the browser, its definition involved constant negotiation between browser writers and javascript developers. These groups often have competing agendas.

Finally, as a fourth technology, is Cascading Style Sheets, or CSS. CSS is an attempt to separate a web page into content and structure, provided by HTML, and looks and style, described by CSS. It is supposed to allow the programmer and artists to work together without stepping on each other’s feet. The programmer creates the edits the HTML and the artist works with the CSS. It also allows a single HTML document to be viewed in a variety of styles, either at the end-user’s preferences, or based on the end-user’s platform.

The combination of HTML, Javascript, the DOM, and CSS is sometimes referred to as DHTML, Dynamic HTML. It is just one way of creating dynamic web pages, the other possibilities including Flash (on the page) and several server-side mechanisms.

Block devices

December 26, 2011

An historical perspective

At the heart of filesystems, and other I/O, is the abstraction of a block device. The standard block device is a disk drive. It can be abstracted as an array of blocks, data packets of 512 bytes or so, indexed by a positive integer. Data is read and written by block and a random access fashion.

This models the physical reality of a hard drive. A hard drive is comprised of a spinning disk with a magnetizable surface, and a movable head that glides just over the surface of the disk. The head can detect or cause magnetic loops, and the loops are stable on the surface of the drive. The loop is either in one direction or another and will stay that way until re-written by the head. As the head stands still and the disk turns, a circle is traced, called a track. The head is moved in discrete increments towards and away from the spindle of the disk, to scan all tracks. Each track is divided into sectors, which are complete packets of data, with a preamble and other synchronization signals, and these are they physical manifestation of blocks.

In a standard hard drive there are multiple disks on a single spindle, with a head on each side of the disk. The collection of all tracks at a single head position is called a cylinder. Due to limitations on the electronics, or maybe because of the curiously limited way hardware engineers like to conceptualize software, early disks communicated the address of a sector by giving the three numbers cylinder, the head and the sector. This is called CHS addressing. The alternative is to ignore the geometry of the disk and just number sectors starting from zero. This is called Linear Block Addressing, or the LBA. CHS is a pain, although a great category for trivia pursuit, category Personal Computers.

Even at this primitive level, the need for flexibility allowed a single disk to be seen as several disks. The sector could be absolute, or relative within a range of sectors in the disk set aside as a distinct partition. The enumeration of the partitions on a disk are stored on the disk, in a fixed location, called the partition table or disk label.

As an example, the standard PC, from day one up to Windows XP, has a certain format of partition table located at the end of the first 512-byte sector of the disk, LBA 0. Because of the long history, the partition table references both CHS addressing and LBA addressing. I don’t think there really is an official name for this standard, but it is often simply called “the” MBR, for Master Boot Record. FreeBSD and Linux also use this sort of partition table. The table can describe out to 4 partitions, by giving either the starting and ending CHS address of the partition, or the starting LBA and the size in sectors of the partition. Each partition can also have a type, and a flag that marks is bootable. These are the four primary partitions of an MBR format disk.

There is also a scheme to go beyond four partitions by creating an extended partition which can contain an unlimited number of logical partitions. In the first sector of the extended partition is another partition table.The first entry in this table describes the first logical partition; the second is a sector offset to yet another partition table, whose first entry describes the second logical partition, and so forth. This creates a linked list of partition tables, for an unlimited number of logical partitions.

For historical reasons, the MBR partition table can specify a partition by either CHS or LBA addressing. If the CHS address is used, it is limited to a total size of 24 bits, allocated as 8 bits of head number, 6 bits of sector address (ranging only between and 1 and 63), and 10 bits of cylinder address. For 512 byte sectors, a typical number, this limits disks to a capacity of 8 GiB. The details are much more complicated, as various CHS standards are incompatible, with more bits for the cylinder and less bits for the head, great news for this category in trivia pursuit. If LBA addressing is used, there is a simple 32 bits reserved in the MBR partition table, for a 2 TiB limit, assuming 512 byte partitions.

Stuff about MBR’s?

Partly motivated by the desire to increase the range of the LBA space, PC class computers are moving away from the standard MBR partition table to the newer GUID partition table which has 64 bit LBA’s. Looking instead at disk standards, in talking to an ATA drive prior to ATA-6, only 28 bits of LBA could be communicated to the drive. ATA-6 (a.k.a. Ultra ATA/100, released 2003) raised this to 48 bits. Hence at this time, the upper 16 bits of LBA are unused but ATA drives.

While this perspective helps to motivate the introduction of block devices, at this time block devices exist in layers of the operating system that are far enough away from the details of the physical device that block device I/O is more an abstract principal than a practical necessity. Various realizations of block devices might embed in the drivers, or in the circuitry of controllers, the mapping between an abstract block device and the physical block device.

RAID, Redundant Array of Inexpensive Disks

Various techniques can be used to construct the software block device out of hardware block devices. This introduces an entire layer of functionality the refashions, better, the hardware capabilities and re-presents the device improved. The three techniques we shall talk about are RAID, LVM, and Journaling.

RAID places block data on physical block devices in various configurations, either to increase data throughput through parallelism, or redundantly to increase device failure survivability. The idea of RAID is to use a quantity of cheap disks to build a higher performing disk. To construct such a disk directly would be more costly, according to the RAID philosophy. RAID configurations are called levels, and some levels have names. Among the simplest raid level to understand is RAID 1, also known as mirroring.

Mirroring keeps the same information at the same LBA on at least two different devices. How much this increases the survivability depends on how drive failure is handled. Data loss does not occur with the first disk failure, but may occur if the mirror has not yet been restored at the time of subsequent disk failures. A standard configuration is to have two drives in the mirror. The mirror must be restured before the second drive fails. Specialized RAID devices have a hot spare device which is immediately called into service to replace the failed drive. It takes some time to rebuild the mirror, during which time use of the RAID continues on the good drive while it is being mirror-copied into the new drive.

In the standard configuration of mirroring with two drives, survivability is increased however the overall disk capacity is halved. Two one terabyte drives are needed to store one terabyte of data. The odds can be turned slightly using a RAID 5 configuration. A single mathematical equation is established for the data across N drives, called the raid set. The information of N-1 blocks of data across N blocks in the raid set, each on a different drive, subject to the equation:

b_1 (+) b_2 (+) ... (+) b_n = 0

where (+) is the exclusive-or operation over a block b_i of data.

If a drive fails, the missing data is the solution to the equation given the values in the other N-1 drives. The mathematics of this is very simple, and I give an example for 4 bit blocks, rather than 4096 bits normally in a block, assuming a 512 byte block. Assuming N=3, and the data to store is 0110 and 1100. Then 0110 (+) 1100 = 1010; then 0110 (+) 1100 (+) 1010 = 0000. The computed data 1010 is called the parity.

The data stored on the three drives is 0110, 1100 and 1010. Suppose the second drive fails. We can recompute the data on the second drive as 0110 (+) 1010 = 1100. And drives value is the exclusive or of the remaining drives values. We can and must rebuild the failed drive before another drive fails. Although the parity is computed from the data, the data is just as dependent on the parity as the parity is on the data. Any drive can fail and the recovery mathematics is the same.

A complexity of RAID 5 is that updating a data block requires recomputing and updating the parity block. Since the parity block is involved in every write, RAID 5 rotates the parity block among all the disks in the raid set. This evens the work load.

Journaling

Another important observation is that the change to the parity block can be computed from the change in the data block. Just as in “normal” arithmetic, a sum can be kept zero by making compensating changes to two terms involved in the sum. If b_i’ is the new b_i, and b_j’ is the new parity (remember the parity rotates), the compensating value will be given by

b_i (+) b_i' = b_j (+) b_j'.

In our example, to change 0110 to 0011, since 0110 (+) 0011 = 0101, the new parity will be 1010 (+) 0101 = 1111. As proof: compute 0011 (+) 1100 (+) 1111 and verify it is 0000. Given this, a write to a raid 5 device will expand into two reads and two writes. This is a disadvantage of raid 5, but it does increase the usable storage versus raid 1.

With RAID 5 we can survive one failed drive, and the amount of overhead is N/(N-1). This can be tuned as desired, but supposing we have 4 one terabyte drives, we can store 3 terabytes of data, for an overhead of 33%. As with mirroring, specialized raid devices will keep a hot spare to automatically swap into the raid set and begin recovery immediately on detected failure. The operator must only remove the failed drive and replace it with a new drive. The new drive is then assigned as the hot swap drive, and we are ready for more failures!

There are many other raid levels, but mirroring and raid 5 are the only ones I’ve personally used.

LVM: Logical Volume Management

Although LVM is a linux-specific technology, the idea of logical volumes is ubiquitous. I will use LVM as a specific for-instance, but this isn’t just about linux.

Volume management allows great flexibility in forming one or more abstract block devices from one or more physical devices. The discipline of LVM is to form a pool of extents, which are mutliple of blocks, and the unit of storage in LVM. The extents are provided by physical devices, called physical volumes. A physical volume is placed into a volume group and all the volume’s extents are given to the the group. Extents from the group are assigned to create a logical volumes. It seems confusing, but the idea is to remember that the center of is the volume group, which is a pool of extents. Physical volumes give their extents into the pool, and logical volumes are assigned extents from the pool. Neither physical nor logical volumes can span pools, they must be entirely contained in a a single volume group.

Picture here of pv, vg, and lv relationships

Another innovation by volume management is the ability to expand or contract a logical volume. This is conceptually like being able to increase the size of a disk after it has been installed! In LVM, it just means assigning more extents to a logical volume after it has been created. A logical volume is often the container of a file system. We have not considered yet what is in a block device. Whether or not it is of any practical effect to increase the size of a block device will depend on whether the file system that uses the block device can understand this increase and make the new blocks available. Since it was originally impossible for a disk to increase its size after installation, filesystems originally did not provide for that possibility. To increase the size of a filesystem, the filesystem contents were written to tape, the filesystem destroyed and a new larger one created, and the data loaded onto the new filesystem. Newer filesystems can be expanded.

As for making a filesystem smaller, some filesystems can do this as well. I would not be surprised if this is less supported. The need to make a filesystem smaller does not occur that often, and the technical problems of contracting a filesystem can be severe. It is important to keep in mind the different levels, the block device provides storage for a file system. LVM is a technology for block devices.

LVM in essence is very simple. In Linux, it hands off most of the work to MD, the multiple device driver. The MD driver is a general solution to the mapping of abstract block devices to physical block devices. The LVM system labels a disk, or a disk partition, with an metadata file that describes the volume group to which the disk or disk partition belongs as a physical volume. In order to keep the description short, extends are arranged in segments, and a compact notation is used to describe each segment and its assignment to logical volumes. Each physical volume has a copy of the metadata. UUID’s are used to identify each object: physical volumes, logical volumes, and volume groups, so that the system can introspect into it’s own configuration. The MD conspires with UDEV to create entries in the device hierarchy for each logical group and logical volume inside the logical group. Access to the physical volume traverses the MD and the abstract LBA’s get mapped to a device and an LBA on that device.

Using DD to read and LVM metadata file:

TBA

Journaling

To increase the fault tolerance of a filesystem, writes to the file system are committed in a two step process. The intent to change is first written into a journal. Journal entries are strictly chronological. Once a sufficient number of changes have been written, and the collection of changes is considered consistent with all data structure invariants of the filesystem, the collection is considered committed. This is done by adding a commit record to the end of the journal.

Once a set of entries had been committed, the filesystem begins to make the changes permanent, by copying the intents form the journal into the body of the filesystem. Once all the changes up to the commit record have been made permanent, those journal entire are redundant, and can be removed and the space in the the journal reused. Generally, the journal is arranged as a ring buffer, appended to in chronological order, so it all works out very nicely.

Journaling

If a problem occurs, say a system crash. The journal is queried. Any committed set of journal entries are considered good. They can be made permanent. Even if the crash occurred while these journal entries were being copied to the body of the filesystem, this copying can be repeated. Any journal entries which are not followed by a commit record are discarded. The crash has unfortunately caused the loss of that work. However, the filesystem itself is intact.

Journaling is a filesystem idea, but I am placing it here under block devices. If one looks at the implementation of journaling in ext3, the standard linux filesystem, the journal entries are in entire blocks. A journal entry is the new value of a block. A header block to a sequence of journal entries gives the LBA of the block. The commit record is a block, eventually written, to bookend the header block. That said, the mechanism of journaling is only dependent on the device being a block device.

One bit of untruth here is the determination of what a commit record should be written. A file system is a complicated data structure. A change to that data structure might translate into the change of several different blocks. If only some but not all of those blocks are changed, the result will be an inconsistent data structure. Perhaps the stated file length in the metadata describing a file has been changed, but the added space not yet allocated; perhaps space has been allocated but that space has not yet been recorded in-use. Only the filesystem knows when it is appropriate to write a commit record.

Another bit of untruth is that the journal of a journaling filesystem is a file in the filesystem. Hence the journal blocks must pass through the file system to get to the block device driver.

The buffer cache, disk scheduling

The data blocks in transit to the block device will sit in memory. The entirety of blocks in transit can be considered as a cache for the physical storage. The actual writes can be rearranged to suit the device, as long as the view from outside the cache answers consistently according to the time sequence of writes. According to the access statistics, read-ahead might pre-fetch blocks of data at reduced seek costs if there is a reasonable probability that the block will be demanded in the near future.

Linking and loading, Part 3

November 3, 2011

The linking loader can combine compiled programs into executables — files which can be loaded into memory and run.

When invoked on a series of object files (files ending in the “.o” extension), the linking loader will concatenate the text sections of each of the files into a single larger text section, placing them into memory in the order the object files are named on the command line.

As this occurs, the linking loader places all the text symbols defined by the object file into a dictionary, along with the addresses assigned to the symbols. When an unresolved reference is encountered, the dictionary is queried to see if the symbol reference can be resolved.

The linker generally requires at least two passes over the list of object files. The first pass populates the dictionary, gathering up all the symbol definitions. The second pass goes over the files again, this time resolving the unresolved references from the dictionary. Two passes are needed since a symbol might be needed before it is defined, depending on the order that the .o files are placed on the command line.

If a symbol is defined multiple times among the object files, that is, if there are multiple functions by the same name, the linking loader determines which among these symbols will take precedence in the dictionary. Generally, the first time the symbol is encountered it is entered in the dictionary, and subsequent definitions are ignored.

Libraries

A number of object files can be brought together into a single file, or multiple files, called a library. A library would contain code for useful functions, such as string manipulation functions, collected together for use by programmers. A library is simply an object file that is intended to be linked to by other programs. However, the linking loader is written to give certain advantages to files that are considered libraries.

First, while the linking loader tends to collect together all the code of all the object files that are listed on the command line, it only selectively includes the code of a library file. A library file mentioned on the command line which does not resolve any symbols from other files is not loaded into memory. Its code is not needed, so it is not loaded.

Second, the linking loader maintains knowledge of well-known directories where libraries are found, and might have a special command line syntax to refer to libraries. The linking loader might conspire with other conventions in the operating system to update and modify the path along with it will search for libraries.

Since multiple symbol definitions are resolved by the first entry into the linker’s dictionary, the order in which directories are searched is important. By moving a file ahead in the search order, it will become the provider of the symbol, and hence, the function, overriding other definitions.

Shared Libraries

Another advantage of libraries, in some operating systems, is that they can be shared. This means, that while several programs will be using the same code in the library, that code will be loaded only once into memory. This saves in memory. This is particularly important with virtual memory, where the value of physical memory is amplified by a demand-paged strategy of using memory as a cache for the currently-used section of a program. Each program has a working set of pages that it needs to run, and a program is running efficiently only if its the virtual memory system has paged the working set into memory. Sharing code pages allows a larger working sets, or more simultaneous processes, for a given amount of physical memory.

As a drawback, shared libraries complicate the installation of a program. A non-shared library, also called a statically linked library, is a complete run package, and does not depend on the environment to load libraries at run time. If a library changes over time, the library available at run time might not be equivalent to the same-named library used when the program was developed. Libraries might become obsolete, or merged with other libraries. A complete solution to this problem is beyond computers at this time, and can be avoided by using statically linked libraries.

On the other hand, a re-released library might contain bug fixes. In which case a shared library will have access to the new, fixed, library, without recompiling or relinking the code. It is also possible to version libraries, so that multiple versions of the library can exist together, and the program can request either the most recent version, or a specific version.

Position Independent Code

The technology of shared libraries divides between code which is position independent and code which is not position independent. Position independent code uses CPU instructions which do not need to be changed if the program is relocated. PIC always uses relative addresses. Rather than branch to location X, PIC instructions branch to “here plus X”.

If code is not position independent, then in order for code to be shared, it must be loaded into the same address in memory for all the programs that share the code. Microsoft dll’s are not position independent. They load the major system libraries into well-known locations, starting at the top of text memory and working their way down. If program loads and finds a conflict, then it will get its own version of the library, relocated to an open memory space in that program.

Some of Microsoft dll’s cannot be relocated. Ntdll.dll, for instance, will not work if relocated. The Microsoft loader is tricked-up to always load Ntdll.dll at the highest memory address. Other important libraries such as kernel32.dll and user32.dll are then loaded beneath, at well known addresses. Vendor dll’s (not from Microsoft) load underneath system dll’s and can be shared opportunistically.

Even if code is position independent, if two shared libraries call each other, their interconnecting references must also be position independent.

The libraries might need their own data space. It becomes an issue whether the data segment can be made to be a known offset from the text segment so that references to data in the shared library can be position independent.

The solution to this problem is exemplified by unix’s GOT, the Global Offset Table. Addresses which cannot be made position independent are brought together in a table. The location of this table is position independent to the concerned library, and the index of each item in the table is fixed. Therefore, each library can get, in a position independent way, the address of non-position independent data or the entry point of functions. The library can be shared except for the GOT — that will be distinct for each user of the shared library.

This requires extensive use of indirect addressing, which is an addressing mode that is native to most CPU’s. Rather than instructions that read as “jump to X” (or the relative “jump to here+X”), they read as “jump to *GOT[i]“, meaning, use the value stored in the i-th entry in the GOT table as the target jump address. Equivalently data is stored and retrieved indirectly through the GOT using the CPU’s native indirect addressing.

Naively you might think this slows things down considerably. However, the world inside a motherboard is a surprising and complicated place. There are caches (L1 and L2, you might have heard about them). The RAM memory is not fast enough for most CPU’s, and memory is first brought to cache in large bucketfuls. If the GOT is in cache, and stays in cache, the loss of speed in accessing the GOT would be swamped by the need to bring to cache the referent named in the GOT.

Dynamically Linked Libraries

While shared libraries differ the final link until just before the program is run, dynamic linking differs the link to even later — it does not link a procedure until it is used.

As a program is prepared for run, the linker will identify all shared libraries; it will either find them already loaded at a reserved address, or if the library is already loaded and uses position independent code, it will map the loaded library arbitrarily into the address space and fix up the GOT to resolve absolute addresses.

A dynamically linked library will not yet bother to load or map the library, or fix up the GOT. Instead it will point all GOT entries for unresolved symbols back into a dynamic loader. On first use of a dynamically linked function, rather than jumping to the function, the program will jump into the dynamic loader. The dynamic loader will identify the situation and resolve at that point the symbol. It will then fix up the GOT, and continue onto the function. All future references will go directly through the GOT to the function.

The details are in the PLT, the Procedure Linking Table. For each dynamically linked function there is a PLT entry. The called calls the PLT entry, which indirectly jumps through the GOT. The initialization of the GOT will point each GOT entry back to the PLT, plus one. At this location in the PLT will be code that pushes an identifier onto the stack and jumps through to PLT entry 0. That entry jumps through a GOT entry to the dynamic linker.

By this circuitous route the dynamic linker is invoked, with the top of the stack containing an identifier of the function to be resolved. It resolves the function and updates the GOT entry so it no longer jumps back to the PLT, but to the target function.

Only the GOT needs to be writable. The PLT will be accessed in a position independent way, and be read only.


GOT1: <identify>
      &<dy lib function>

GOTn: &PTLn+1

PLT0: push *(GOT+1)   // dynamic linker uses GOT to find itself
      jump *(GOT+2)   // Step 3, jumps to the dynamic linker

PLTn: jump *(GOTn)    // Step 1, called by program. jumps  ...
      push n          // ... right back to here.
      jump PLT0       // Step 2, jumps to the dynamic linker
GOT1: <identify>
      &<dy lib function>

GOTn: &<function n>     // dy linker updates GOT

PLT0: push *(GOT+1)
      jump *(GOT+2)

PLTn: jump *(GOTn)     // Step 1, jumps immediately to <function n>
      push n           // unused after first use
      jump PLT0

Linking and loading, Part 2

November 1, 2011

An Experiment

Here are three files, that are compiled separately and then brought together in a final link. The final link is done with a suggestion of laying the partial codes into memory in different orders. The placement in memory is checked using the nm utility. Running nm on f.o and g.o, they are compiled with the functions f and g (respectively) beginning at location 0×00. Linked together with fg.c, f is relocated to 0x0e30, and g is relocated to 0x0e60.

As a matter for making this ever more clear, the order the object files is changed on the command line, suggesting the g should be located first in memory and then f. nm then shows g relocated to 0x0e30 and f relocated to 0x0e60.

hohokus-3:linkertests burt$ cat f.c
#include<stdio.h>
#include<stdlib.h>

int f(int i) { printf("Hello world\n") ; }

hohokus-3:linkertests burt$ cat g.c
#include<stdio.h>
#include<stdlib.h>

int g(int i) { printf("Goodnight moon\n") ; } 

hohokus-3:linkertests burt$ cat fg.c
#include<stdio.h>
#include<stdlib.h>

int f(int) ;
int g(int) ;

int main(int argc, char * argv[]) {
	f(1);
	g(1);
}

hohokus-3:linkertests burt$ cat Makefile
all:
	cc -c g.c
	cc -c f.c
        cc -c fg.c
	cc -o fg fg.o f.o g.o
	cc -o gf fg.o g.o f.o

hohokus-3:linkertests burt$ nm f.o g.o fg gf
f.o:
0000000000000038 s EH_frame0
0000000000000025 s L_.str
0000000000000000 T _f
0000000000000050 S _f.eh
                 U _puts

g.o:
0000000000000038 s EH_frame0
0000000000000025 s L_.str
0000000000000000 T _g
0000000000000050 S _g.eh
                 U _puts

fg.o:
0000000000000038 s EH_frame0
                 U _f
                 U _g
0000000000000000 T _main
0000000000000050 S _main.eh

fg:
0000000100001048 S _NXArgc
0000000100001050 S _NXArgv
0000000100001060 S ___progname
0000000100000000 A __mh_execute_header
0000000100001058 S _environ
                 U _exit
0000000100000e30 T _f
0000000100000e60 T _g
0000000100000df0 T _main
                 U _puts
0000000100001000 s _pvars
                 U dyld_stub_binder
0000000100000db0 T start

gf:
0000000100001048 S _NXArgc
0000000100001050 S _NXArgv
0000000100001060 S ___progname
0000000100000000 A __mh_execute_header
0000000100001058 S _environ
                 U _exit
0000000100000e60 T _f
0000000100000e30 T _g
0000000100000df0 T _main
                 U _puts
0000000100001000 s _pvars
                 U dyld_stub_binder
0000000100000db0 T start

Referring back to the experiment: The switch -c asks the compiler to stop after compilation. Normally, cc is a short hand for both cc followed by ld, the linking loader. cc -c asks that the linker not be invoked, and the .o file is left without relocations or symbol resolution. Unresolved externals, and exported symbols are noted in the .o file, for later use by ld.

The final cc invokes ld on three .o files: fc.o (not seen), f.o and g.o. It concatenates them together, necessarily relocation f and g, and their data. It then resolves the reference in fc.o to the exported symbols in f.o and g.o. With all references satisfied, the final file, fg or gf, is executable — it is ready for the loader to copy it into memory and set it running.

Linking and loading

November 1, 2011

Compiled programs are prepared for running by a combination of softwares called the linker and loader. The edge between these two is blurry, and all unixes have a single program doing both, called ld.

Why are we talking about linking-loaders at all? Recalling the discussion of what is an operating system, the linking-loader would be part of the development tools. They have a great dependency on the operating system, and must have knowledge of the operating systems internals, but they are perhaps not core elements of an operating system. Most operating systems texts don’t cover the topic of linking-loaders.

However, Windows NT’s loader in part of the operating system. The loader must make considerable use of the virtual memory system. And a too narrow definition of the operating system might allow for better theoretical discussion, but is simply not useful in practical terms. For better or worse, in operating systems I include the utilities of the “greater operating system” as well as the core development tools.

According to Levine’s book, Linkers and Loaders, the combined linker-loader has three functions:

  • Loading: the actual copy of the program and its preinitialized data segments into memory;
  • Relocation: the translation of segments of code or data over to occupy a different location in memory;
  • Symbol resolution: the coordination and connection between locations and symbols between separately compiled code segments.

As the name suggests, loading is tightly associated with loaders. Symbol resolution is always associated with linkers. Relocation can variously be considered a linker’s job, or a loader’s job. As will be plain by the end of this lecture, this is all very arbitrary. Reality is very messy when it comes to linker-loaders.

Loading: The copy of a program and its data into memory has a few more subtasks than one might assume. In simplest operating systems, this can be a straight copy, almost mirror image, for a file into physical memory. However, in any operating system of consequence, virtual memory must be setup to receive the various segments of the program, initialization tasks must be handled, the source might come from various files, as loadable libraries, or various sections of a file.

There we have some of the standard terminology of loader-linkers. In the layout of virtual memory, various spans in virtual memory will have distinct purposes. These intervals are called segments. In Intel architectures, there are various segment registers on the CPU to keep track of segments.

Files that contain loadable images will have various sections, that are more various than segments, and are more like various appendices in a book. There are alignments: a loadable image file will have code sections, areas of the file that contain the text that should be associated with or copied into the code segments set aside in the space of virtual memory.

To load code, virtual memory space is set aside and propertied but a data element sometimes called the VAD, the Virtual Address Descriptor. The VAD might describe that an interval, say from 0×8000 to 0xafff in virtual addresses is read-only (text) and maps to a certain set of bytes in a certain file: the loadable source for the text. The VAD might set aside a range of virtual memory for data, and mark that range as read-write, and initialize data in that range in accordance with a data section in a loadable source. Unlike text, however, the association of the virtual address range with the file section occurs only at start-up. Text sections are associated throughout the run of the program.

Relocation: A compiler works on a piece of the overall program. This partial programs are compiled into assumed addresses — both the code and data. When pieces of code are brought together it is likely that they must be moved. The assumed starting address for the code in both pieces will be the same, and eventually only one piece of code can occupy that address.

Relocation maps out how the memory will be arranged and translates code segments into non-overlapping intervals. This is done by the linker providing reloc records, for each and every address written into the code, whether the target of a code jump or of a data load or store, there is a record of this address, with additional information of the kind of address written. These are also called fixups, and the act of going through the collection reloc records and modifying the address referred to, so as to account for the translation of the segment, is also called a fixup.

For instance, a program might be compiled so that some while loop ends with a jump to location 0×80800. The jump itself is at location 0×80880. A reloc pointing to location 0×80880 will be placed in the linker output file. If the program must be moved upwards by 0×10000 bytes, the reloc will cause the the jump instructed to be rewritten as a jump to 0×90800; the jump itself will appear in memory at location 0×90880.

Symbol resolution: Separately compiled codes make reference to each other. Certainly, functions written in on file are called by code written in another file. At compile time, the compiler must know the signature of the called function: the return type and the number and type of arguments. Knowing these things, the compiler and emit code for the call, and handle the return values, without knowing the exact code of the called function.

In order that the code run, however, the call must eventually be made to the address of the called function. After the code is assembled and relocated, the call instruction must be fixed-up as well, rewritten so that address of the called function, left at zero by the compiler, now have the proper address of the called function.

Additional reloc records point to the location of the call statement, annotating the reloc record with the symbol name. These are unresolved references, and the linker endeavors to resolve them. Code providing symbols has a section of its linker output file stating all the external symbols the file exports. The record for an exported symbol would five the symbol’s name and the unrelocated address of the symbol in text section (or data section for an exported variable).

Interval tree proof

November 1, 2011

It was a bad day in class today. I’d like to fix up discussion of the interval tree.

An interval tree is a structure for organizing a set of intervals { i=[i.low,i.high] } so that the structure can be maintained dynamically, under insert of new intervals and deletion of existing intervals, and an intersection query can be answered efficiently. The intersection query is: given a query interval q = [q.low,q.high], return any interval i in the interval tree such that q ∩ i is non-empty, or return NONE if no i intersects q.

The interval tree is a red-black tree augmented with a field that, for every node n in the tree, contains the value (max i.high) where the maximum is taken over all nodes that are descendents of n, including n itself. The red-black tree has the nodes ordered by the i.low values.

It is efficient to maintain this information: on insert, first propagate any new maximums from the insertion leaf back to the route along the path of the insertion; for rotates, the value of exactly one node needs to be recalculated, and that can be done by taking the maximum of n.left.high, n.right.high and n.high. On deletes, rotation is handled as above, deletion will require climbing the tree back from the deleted node, recalculating.

The search algorithm is as follows: to search for an intersection with query interval q against all intervals in the tree rooted at node n,

  • first check if q intersects the interval contained in n;
  • if not, check if n has a left child, and if not recurse on n.right;
  • if it has, check if n.left.max < q.low and if so recurse on n.right;
  • else recurse on n.left.

The surprising thing about this algorithm, to my mind, is that implication that if q is not fully to the right of all the nodes on the left subtree, then it must intersect something in the left subtree.

If any of the conditions for going to the right are satisfied, then it is obviously impossible for there to be an intersection of q with an interval in the left subtree. However, oddly, barring obvious impossibility, if there is an intersection, it will be in the left subtree.

The proof is as follows. We maintain the assertion that if q intersects some interval in the tree, it intersects some interval descending from the current node n. Begin by setting n to the root so the statement is trivially true.

If q intersects n.i, we are done and an intersection is found and we are done. Else we continue on the assumption that if there is an intersection there must be at least one in either the left or right subtrees.

If n.left is empty, the presumed intersection must be in the right subtree. So continuing with n.right maintains the assertion.

The case that q.low > n.left.max is very similar: q is fully to the left of the rightmost extending interval among all intervals in the left subtree of n, so it cannot intersect with any of those. Therefore if there is an intersection among the intervals in the right subtree of n.

Finally, if q.low ≤ n.left.max then either q intersects with the interval that achieves this n.left.max, i.e. an interval i’ such that i’.high = n.left.max, or it does not. If it does, then continuing searching in the left subtree of n maintains the assertion since i’ is in the left subtree of n.

If it does not, then this implies q.high < i’.low, so q is fully to the left of some interval in the left subtree of n. As the tree is ordered by low endpoints of intervals, q is fully to the left of everything in the right subtree of n. Therefore if the presumed intersection cannot be among the intervals in the right subtree of n, and therefore must be among the intervals in the left subtree of n.

So even when written out, the proof is sort of long. It depends on the ordering of the tree in the last step.

 
Powered by Wordpress and MySQL. Theme by Shlomi Noach, openark.org