Information has existed as a scientific entity only since the early twentieth century. Hartley wrote about it in 1928, when it had become very important for the transmission of telegraph and telephone signals. By now, information is not just a natural quantity; it is a commercial quantity, bought and sold. We are as concerned that our cell phone plans have enough megabytes as that our homes have enough electricity. It is metered and charged for.
I have to skip over much of the definition of information. Be warned, however, that information in the scientific sense is not always the same as information in the colloquial sense. Random noise is very dense in information in the scientific sense, but often is despised as, well, noise, in the colloquial sense. Random noise is information dense because I cannot predict what it will be until I see it. Information is about my determining what was sent, not what was meant by the sender. Information is measured in bits. One bit is the smallest quantum of information. It is the equivalent of a yes/no answer to a single question. If I play 20 questions, I play 20 bits.
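The 20 questions remark can be made concrete with a small sketch. It assumes every outcome is equally likely, which is the simplest case; the function name bits_needed is only illustrative.

```python
import math


def bits_needed(n_outcomes: int) -> float:
    """Bits required to single out one of n equally likely outcomes."""
    return math.log2(n_outcomes)


# Twenty yes/no answers can tell apart 2**20 different possibilities,
# so singling out one of them takes exactly 20 bits.
possibilities = 2 ** 20
print(possibilities, "possibilities need", bits_needed(possibilities), "bits")
```

Each well-chosen question halves the set of remaining possibilities, which is why the count of questions is a base-2 logarithm.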
Information is sent from one place to another through a channel. It might be a fluctuating voltage in a wire; or the compression wave traveling through air that constitutes a sound; a flashing light, either across open space or captured in a fiber optic. The concern of Hartley, and later Shannon, is how much information can go across a channel, and how well it can cross it. The result that we are discussing says that a channel has an innate capacity in bits per second, determined by its physical composition, that equals: C = B log_2 ( 1 + S/N ); where B is the bandwidth, S is Signal Strength, and N is Noise Strength. These are terms we need to discuss.
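As a rough illustration, the formula can be evaluated directly. The sketch below assumes S/N is given as a linear power ratio (the usual reading of the theorem) and B in hertz; the 3 kHz and 30 dB figures are made-up example values, not measurements of any real line.

```python
import math


def db_to_linear(snr_db: float) -> float:
    """Convert a signal to noise ratio from decibels to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)


def channel_capacity(bandwidth_hz: float, snr: float) -> float:
    """C = B log_2(1 + S/N), in bits per second; snr is linear, not dB."""
    return bandwidth_hz * math.log2(1.0 + snr)


# Hypothetical channel: 3 kHz of bandwidth at 30 dB signal to noise.
capacity = channel_capacity(3000.0, db_to_linear(30.0))
print(f"{capacity:.0f} bits per second")  # roughly 29.9 kbit/s
```

Note that doubling B doubles the capacity, while doubling S/N only adds about one bit per second for each hertz of bandwidth once S/N is large.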

Bandwidth of a channel
Information will be sent by disturbing some physical quantity at one end of the channel and attempting to observe that disturbance at the other end of the channel. A compression wave in air caused by the vocal cords is decoded by the ear; a voltage change in an ethernet cable is measured at the far end of the cable. It has been observed that the channel is often reluctant to allow fast changes. There is a limit to how quickly the channel can react, and hence if too many changes are made in an instant of time, for instance, if the transmission of too many bits is attempted per second, the changes will be lost, and the information will not arrive at the other end of the channel.
For instance, in electrical engineering there are simple RC circuits. A resistor is placed in series with a capacitor so that current will flow first through the resistor and then into the capacitor. A large voltage applied across the combination will cause the current to flow. The current flows through the resistor, creating a voltage across it, and then into the capacitor, which “fills” and whose voltage rises until it ultimately matches the applied voltage and the current flow ceases.
If one is unfamiliar with electrical engineering, the same situation can be modeled with two basins of water and a thin tube connecting them. One basin is full, the other empty, and water flows through the tube to equalize the water level across them. As long as there is a difference in level, water flows and builds up in the formerly empty basin. Once the levels are equal, the flow stops.

If I were to try to send information by suddenly increasing the level of water in one of the basins, the other basin’s level would not rise immediately. It would rise with a certain time constant, which depends on the resistance the tube gives to the flow of water. The reciprocal of this time constant is the bandwidth: the quicker the levels come close to even, the quicker I can adjust the level once again, allowing me to communicate by a sequence of level changes.
The model in electrical engineering would be C dV/dt + V/R = 0, where V is the voltage across the resistor, that is, the difference still remaining between the applied voltage and the capacitor voltage (the difference in level between the basins). C is the capacitance of the capacitor, the amount of charge that is required to make a given voltage change, analogous to the capacity of the basin: how much water must flow in to make a given level change. (This C is not the channel capacity C of the theorem.) R is the resistance of the resistor, how much it impedes flow. The derivative on the voltage expresses that for capacitors the change in voltage (change in level) is caused by current flows. The sum equaling zero means that all current that flows in or out of the capacitor is accounted for by the exact amount of current flowing through the resistor, and the two find balance.
The solution to this equation is V = V0 * e^(-t/(RC)) when a sudden change of V0 (or level L0) is applied to the system, so the capacitor voltage rises as V1 = V0 * (1 - e^(-t/(RC))). RC is the time constant tau. If it is large, t must be large for the remaining difference to fall to, let's say, e^(-1) of its original value. So t = tau is the amount of time required for the difference to fall to 1/e = 0.37 of its original value. Looking at dV1/dt of this solution, V0/tau is the slope of the curve at time zero, the fastest the circuit will ever follow a voltage change. The value 1/tau is called the bandwidth. (It is an angular frequency; the conventional cutoff frequency in hertz is 1/(2 pi RC).)
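To check the exponential behaviour numerically, here is a minimal sketch. It assumes the circuit is driven by a step of V0 and tracks the capacitor voltage, so the equation integrated is C dV/dt = (V0 - V)/R; the component values (1 kilohm, 1 microfarad) are arbitrary choices for illustration.

```python
import math


def step_response(v0: float, tau: float, t: float) -> float:
    """Closed-form capacitor voltage t seconds after a step of v0 is applied."""
    return v0 * (1.0 - math.exp(-t / tau))


def simulate_step(v0: float, r: float, c: float, t_end: float, dt: float) -> float:
    """Forward-Euler integration of C dV/dt = (v0 - V)/R, starting from V = 0."""
    v = 0.0
    for _ in range(int(round(t_end / dt))):
        v += dt * (v0 - v) / (r * c)
    return v


R = 1_000.0  # ohms, assumed value
C = 1e-6     # farads, assumed value
tau = R * C  # time constant, one millisecond here

v_exact = step_response(1.0, tau, tau)
v_sim = simulate_step(1.0, R, C, tau, tau / 10_000)
print(f"after one time constant: exact {v_exact:.4f}, simulated {v_sim:.4f}")
print(f"remaining gap: {1.0 - v_exact:.4f} of the step")
```

After one time constant the remaining gap is 1/e of the step, about 0.37, and the simulated value agrees with the closed form to within the step-size error of the integrator.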
Noise, and Signal to Noise ratio
Every system has noise. A reading of the voltage at the receiving end, or of the level of water in a basin, has an error tolerance. A change in the value read that falls within that error tolerance cannot be attributed to the quantity measured; it is attributable to the error of measurement.
Consider a series of changes in voltage, meant to communicate a sequence of values. Given a fixed bandwidth, the system can slide up to equalize with each change just so fast. If the changes are made slowly, the bandwidth allows the system to slide up to each change, and the output reads the input values with good accuracy. As the changes come more rapidly, the system fails to keep up, and is nowhere near the applied value before the next value comes along. This can be corrected for at the receiving end, but eventually the information comes so rapidly that the amount of change is more or less equal to the noise band. In this case, the information is lost. The changes seen will be washed out by noise, or can equally be attributed to noise as to signal.
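The effect can be sketched with the RC model from earlier. The code below assumes the input simply alternates between 0 and V0, holding each level for a fixed time, and compares the resulting output swing with a noise band; the noise band of 0.05 and the hold times are illustrative assumptions, not values from any real channel.

```python
import math


def steady_swing(v0: float, tau: float, period: float, n_symbols: int = 2000) -> float:
    """Peak-to-peak output once an input alternating 0, v0, 0, v0, ... has settled.

    Each level is held for `period` seconds and the channel is a first-order
    system with time constant `tau`, updated exactly over each hold interval.
    """
    decay = math.exp(-period / tau)
    v = lo = hi = 0.0
    for k in range(n_symbols):
        target = v0 if k % 2 == 0 else 0.0
        v = target + (v - target) * decay  # exponential approach to the target
        if k % 2 == 0:
            hi = v
        else:
            lo = v
    return hi - lo


TAU = 1.0          # time constant, arbitrary units
NOISE_BAND = 0.05  # assumed width of the noise, as a fraction of the full step

for period in (5.0, 1.0, 0.2, 0.02):
    swing = steady_swing(1.0, TAU, period)
    verdict = "readable" if swing > NOISE_BAND else "lost in the noise"
    print(f"hold time {period:5.2f} tau: swing {swing:.4f} -> {verdict}")
```

For this alternating pattern the settled swing works out to V0 tanh(T/(2 tau)), where T is the hold time, so it shrinks roughly in proportion to T once the changes come much faster than the time constant.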
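The Shannon-Hartley theorem shows this in the factor 1+S/N. The largest range of possible output values, the signal S, is divided into brackets by the width of a band of noise N. The number of brackets is the practical limit on how many distinct values can be measured. We can see that the value is in one bracket or another, but within a single bracket, the details of the measurement are as much determined by the noise as by the intended value. (In the precise statement S and N are powers, so the number of distinguishable amplitude levels is the square root of 1+S/N; a channel of bandwidth B carries 2B level changes per second, and the factor 2 cancels the square root inside the logarithm.) Hence the channel capacity is a product of the bandwidth and a function of 1+S/N.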
Consider going from a 1+S/N of 2 to a 1+S/N of 4. In the first, I have two brackets, one at minimum signal and another at maximum, and they are separated by the width of the noise, so I am probably correct in assigning a measurement closest to the maximum as maximum (since it is an entire noise band away from the other possibility, the minimum), and vice versa. In the second, there are 4 brackets. However, I can also have 4 brackets, virtually, if I double my transmission speed and send two changes in the time of one, each using only a 1+S/N of 2. Likewise, I can either have a 1+S/N of 8, or triple my transmission speed and send three changes in the time of one, each with a 1+S/N of 2. That means that it is log_2(1+S/N) that is the equivalent of bandwidth: I can either quadruple my sending rate or raise 1+S/N to the fourth power to achieve a comparable channel capacity.
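The trade between speed and brackets is just the arithmetic of logarithms, as this small sketch shows. Here `brackets` stands for the quantity 1+S/N used in the text, and `rate` for the relative sending speed; both names are only illustrative.

```python
import math


def relative_capacity(rate: float, brackets: float) -> float:
    """Bits per unit time when `rate` changes are sent, each with `brackets` levels."""
    return rate * math.log2(brackets)


# One change with 4 brackets carries as much as two changes with 2 brackets each.
print(relative_capacity(1, 4), relative_capacity(2, 2))

# One change with 8 brackets matches three changes with 2 brackets each.
print(relative_capacity(1, 8), relative_capacity(3, 2))

# Quadrupling the rate matches raising the bracket count to the fourth power.
print(relative_capacity(4, 2), relative_capacity(1, 2 ** 4))
```

Each pair prints the same number, which is the sense in which the logarithm of 1+S/N, and not 1+S/N itself, plays the same role as bandwidth.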
Conclusion, Shannon’s theorem
We have justified that the channel capacity C is B log_2(1+S/N). It depends on the bandwidth B of the channel, that is, its ability to transmit disturbance quickly, and the signal to noise ratio S/N, that is, the ability for small changes to be detected as significant, and not lost in the noise level. Information theory gives us guidance in constructing high capacity channels: they are physical systems with large bandwidth and large signal to noise ratio.
There is noise, and the detection of our information is limited by it. Sometimes, the noise will cause an error in detection. As we move our sending rate towards the capacity C, you might believe that the errors become more frequent, as we begin to brush the noise levels in our hurry to apply the next level change. This is true. However, Shannon showed (the Shannon noisy-channel coding theorem) that as long as the sending rate stays below C, communication with an error rate as small as we please is possible. It will require error-correcting codes to combat the noise. The error-correcting codes will themselves reduce the usable information rate, because the codes adjoin correction information. However, in the limit (what limit?, you might ask; the limit of ever longer code blocks) the usable rate can be brought as close to C as we like.
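To see rate being traded for reliability, here is a sketch using the crudest error-correcting code there is: send each bit n times and take a majority vote. It swaps in a simplified channel model that the text does not discuss, a binary channel that flips each bit independently with probability p (0.1 below is an arbitrary choice). The repetition code is far from what Shannon's theorem promises, since its rate falls towards zero, but it shows the direction of the trade.

```python
import math


def majority_error(p: float, n: int) -> float:
    """Chance that a majority vote over n copies (n odd) decodes the wrong bit."""
    return sum(
        math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def bsc_capacity(p: float) -> float:
    """Capacity, in bits per use, of a binary channel that flips bits with probability p."""
    if p in (0.0, 1.0):
        return 1.0
    entropy = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    return 1.0 - entropy


FLIP = 0.1  # assumed probability that the noise flips a transmitted bit

for n in (1, 3, 5, 9):
    print(f"repeat x{n}: rate {1.0 / n:.3f}, error {majority_error(FLIP, n):.6f}")
print(f"capacity of this channel: {bsc_capacity(FLIP):.3f} bits per use")
```

The repetition code pays for each improvement in reliability with a large cut in rate, while the theorem says that cleverer codes can hold the rate just under the capacity printed on the last line and still drive the error rate down.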
On the other hand, it is possible to discard information, in the least damaging way available, so that what remains fits into a channel. This is called lossy compression. There is also lossless compression (the Shannon source coding theorem sets its limit), which is useful to squeeze out redundancy in data, so that a seemingly large amount of data can be written as a smaller amount without any loss of information.
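Lossless compression is easy to try out with the zlib module from Python's standard library. The sketch below compares a highly redundant message with pseudo-random bytes of the same length; the sample data and the seed are arbitrary choices for illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes after lossless compression at zlib's highest level."""
    packed = zlib.compress(data, 9)
    # Lossless means the original comes back exactly.
    assert zlib.decompress(packed) == data
    return len(packed)


rng = random.Random(0)  # fixed seed so the run is repeatable
redundant = b"yes no " * 2000
noise = bytes(rng.getrandbits(8) for _ in range(len(redundant)))

for label, data in (("redundant text", redundant), ("random noise", noise)):
    print(f"{label}: {len(data)} bytes -> {compressed_size(data)} bytes")
```

The repetitive message shrinks to a small fraction of its length, while the random bytes barely shrink at all, which echoes the earlier remark that random noise is dense in information in the scientific sense.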