Information Theory
In order to understand Shannon’s entropy, we return to Bob and Alice and assume that they have a communication channel capable of transferring a single pulse, one bit, by seven in the evening. What Shannon did was to quantify the amount of information that Bob transfers to Alice. At first glance, it seems that the amount of information is simply the sum of the probabilities: if the probability that “Bob is busy” (0) is p, then the probability that “Bob is free” (1) is 1 – p, so the total is one unit (one bit), no matter what the value of p actually is.
But Shannon gave a different answer: I = –[p ln p + (1 – p) ln(1 – p)]. In words, the amount of information, I, is minus the sum of two terms: the probability of the bit being “0” times the log of that probability, plus the probability of the bit being “1” times the log of that probability. This solution is what placed him in the pantheon of great scientists.
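To make the formula concrete, here is a minimal Python sketch (the function name shannon_info is ours, chosen for illustration) that evaluates Shannon’s expression, in nats, for any probability p:

import math

def shannon_info(p):
    # Information, in nats, carried by a bit whose probability of being "0" is p.
    # A certain outcome (p = 0 or p = 1) carries no information at all.
    if p == 0 or p == 1:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))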
First, it is important to understand why I = p + (1 – p), that is, one bit, is not a good answer. It is possible that, on the day Bob spoke to Alice, he had already planned another meeting, and that the one with Alice (who was aware of this) was a secondary option. If we assume that the chance that the initial meeting would be cancelled and Bob would meet Alice is ten percent, then in this additive scheme the bit with a value of “0” carries most of the information (0.9 bits), while the “1” bit carries little (0.1 bits). If we add these probabilities, the answer is always “1,” no matter what the relative probabilities (as known to Alice) of the bit being “0” or “1.” If Alice had not been aware that the odds were 90:10, she might have assumed that they were still 50:50, and the “0” bit would carry the same amount of information as the “1.” Therefore, adding probabilities overlooks the a priori knowledge that is so important in communications.
Shannon’s expression, on the other hand (and he was aware that it was identical in form to the Gibbs entropy), does not disregard this knowledge: the 90% chance of the bit being “0” and the 10% chance of it being “1” (or whatever odds Alice attributes to the chances of meeting Bob) enter the sum as weights.
If Alice estimates that the probability that Bob will be free that evening is 50%, the amount of information that Bob sends her is at its maximum, because Alice will be equally surprised by either value of the bit. Formally, in this case both possibilities have the same uncertainty, and the information that the bit carries is:
I = –(1/2) ln(1/2) – (1/2) ln(1/2) = ln 2
That is, the maximum amount of information that a bit can carry is half the natural log of 2 (for the 50% chance that Bob will come) plus another half the natural log of 2 (for the 50% chance that he won’t), which is ln 2 ≈ 0.69. (In communication theory, this natural-log unit of information is called a “nat.”) Since engineers prefer to calculate logarithms in base 2 rather than in the “natural” base, e, the maximum amount of information that a bit carries is 1 (ln 2 nats = 1 bit).
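Reusing the shannon_info sketch above, the unit conversion is just a division by ln 2:

print(shannon_info(0.5))                # ln 2 ≈ 0.693 nats
print(shannon_info(0.5) / math.log(2))  # exactly 1.0 bit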
If the probabilities that Bob will or will not come are unequal, the bit that he sends to Alice carries less than one bit of information; if they are equal, it carries exactly one bit. If Alice knows that the chance that Bob will come is just 10%, the amount of information (Shannon’s entropy) will be:
I = –(1/10) ln(1/10) – (9/10) ln(9/10) ≈ 0.325
That is, the bit carries 0.325/ln 2 ≈ 0.47 of the maximum amount of information.
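The same sketch reproduces these numbers:

print(shannon_info(0.1))                # ≈ 0.325 nats
print(shannon_info(0.1) / math.log(2))  # ≈ 0.469 of the maximum (in bits)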
Shannon’s definition is superior to the probability-independent one. If we send N bits, the maximum information is N bits, because Shannon’s expression attains its maximum value of one bit for each and every bit only when “0” and “1” are equally probable; when they are not, Shannon’s information per bit is less than one. And here is Shannon’s great insight: an N-bit file whose bits are not equally probable can, in principle, be transferred over a digital channel using fewer than N bits. It is therefore possible to compress files without losing content, which is of course very important for the speed, cost, and ease of transferring information.
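As a back-of-the-envelope illustration (a sketch of Shannon’s bound, not an actual compression algorithm), consider a file of N = 1,000 bits in which “1” appears with probability 0.1; the per-bit information tells us how many bits should suffice on average:

N = 1000
p = 0.1
per_bit = shannon_info(p) / math.log(2)  # information per bit, in bits: ≈ 0.469
print(N * per_bit)  # ≈ 469 bits on average, instead of 1000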