Information Theory
Introduction
Information theory, introduced by Claude Shannon (1948), gives a mathematical framework for quantifying, compressing, and transmitting information.
Its central idea: information corresponds to reduction in uncertainty. Rare events carry more information than predictable ones. This intuition is formalized through entropy and related measures.
Entropy
For a discrete random variable X with distribution p(x), the entropy is H(X) = −∑_x p(x) log p(x).
Entropy is measured in bits (base-2 logarithm) or nats (natural logarithm).
Basic Properties
H(X) ≥ 0, with equality iff X is deterministic.
If X takes n values, then H(X) ≤ log n, with equality for the uniform distribution (both bounds are checked in the sketch below).
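As a quick numerical check (a minimal sketch using only the Python standard library; the example distribution is made up and the helper entropy is defined ad hoc):

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]               # an arbitrary example distribution
H = entropy(p)
print(H)                                    # 1.75 bits
print(0 <= H <= math.log2(len(p)))          # True: 0 <= H(X) <= log n
print(entropy([1.0, 0.0, 0.0, 0.0]) == 0)   # True: deterministic => zero entropy
print(entropy([0.25] * 4))                  # 2.0 = log2(4): uniform hits the bound
```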
Binary Entropy
For a Bernoulli(p) variable, the binary entropy is H(p) = −p log p − (1−p) log(1−p).
Maximum uncertainty occurs at p = 1/2, where H(1/2) = 1 bit.
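A small sketch of the binary entropy function (the probe values are arbitrary):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(p, round(binary_entropy(p), 4))   # symmetric around 0.5, peak of 1 bit at p = 0.5
```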
Joint and Conditional Entropy
For a joint distribution p(x, y), the joint entropy is H(X, Y) = −∑_{x,y} p(x, y) log p(x, y).
Conditional entropy: H(Y|X) = −∑_{x,y} p(x, y) log p(y|x) = H(X, Y) − H(X).
Chain rule: H(X, Y) = H(X) + H(Y|X).
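The chain rule can be verified on a toy joint distribution (a sketch; the probabilities are made up and the helper entropy is defined ad hoc):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up joint distribution p(x, y) over two binary variables.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

H_XY = entropy(pxy.values())                                      # H(X, Y)
px = {x: sum(v for (a, _), v in pxy.items() if a == x) for x in (0, 1)}
H_X = entropy(px.values())                                        # H(X)

# H(Y|X) computed directly as a weighted average of per-row entropies.
H_Y_given_X = sum(
    px[x] * entropy([pxy[(x, y)] / px[x] for y in (0, 1)]) for x in (0, 1)
)

# Chain rule: H(X, Y) = H(X) + H(Y|X).
print(round(H_XY, 4), round(H_X + H_Y_given_X, 4))   # both ~1.8464
```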
Mutual Information
Mutual information measures shared information: I(X;Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y).
Equivalent KL form: I(X;Y) = D_KL(p(x, y) ‖ p(x) p(y)).
Properties
I(X;Y) ≥ 0, with equality iff X and Y are independent.
Symmetric: I(X;Y) = I(Y;X).
I(X;X) = H(X).
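Both forms of I(X;Y) can be compared on the same style of toy joint distribution (again a sketch with made-up numbers):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up joint distribution p(x, y) over two binary variables.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(v for (a, _), v in pxy.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in pxy.items() if b == y) for y in (0, 1)}

# I(X;Y) = H(X) + H(Y) - H(X, Y)
I = entropy(px.values()) + entropy(py.values()) - entropy(pxy.values())

# Equivalent KL form: sum over (x, y) of p(x, y) log [ p(x, y) / (p(x) p(y)) ]
I_kl = sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in pxy.items())

print(round(I, 4), round(I_kl, 4))   # identical values, and non-negative
```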
Kullback–Leibler Divergence
For distributions p and q, D_KL(p ‖ q) = ∑_x p(x) log (p(x)/q(x)).
D_KL(p ‖ q) ≥ 0.
Zero iff p = q.
Not symmetric.
KL divergence measures distributional discrepancy and underlies many learning objectives.
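A minimal sketch of these properties (the example distributions are arbitrary):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]                   # arbitrary example distributions
q = [1 / 3, 1 / 3, 1 / 3]

print(round(kl_divergence(p, q), 4))    # non-negative
print(round(kl_divergence(q, p), 4))    # generally a different value: KL is not symmetric
print(kl_divergence(p, p))              # 0.0: zero iff p = q
```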
Source Coding
Shannon’s source coding theorem: n i.i.d. copies of X can be compressed losslessly into about n·H(X) bits as n grows, and no lossless scheme can use fewer bits per symbol on average.
The average code length L of any uniquely decodable code satisfies L ≥ H(X); an optimal prefix code (e.g., Huffman) achieves L < H(X) + 1.
Entropy is the fundamental limit of lossless compression.
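A sketch of the bound H(X) ≤ L < H(X) + 1 using a small ad hoc Huffman construction (the source distribution is made up; huffman_lengths is an illustrative helper, not a library function):

```python
import heapq
import math

def huffman_lengths(p):
    """Code lengths of an optimal (Huffman) prefix code for distribution p."""
    # Heap items: (probability, unique tiebreaker, symbol indices in this subtree).
    heap = [(pi, i, [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # each merge adds one bit to every symbol below it
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

p = [0.4, 0.3, 0.2, 0.1]                                    # an arbitrary source distribution
L = sum(pi * li for pi, li in zip(p, huffman_lengths(p)))   # average code length
H = -sum(pi * math.log2(pi) for pi in p)                    # entropy
print(round(H, 4), round(L, 4))                             # H <= L < H + 1 (here ~1.8464 vs 1.9)
```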
Channel Capacity
For a channel with transition probabilities p(y|x), the capacity is C = max over p(x) of I(X;Y).
Shannon’s channel coding theorem:
Reliable transmission is possible at any rate R < C.
It is impossible at any rate R > C.
Binary Symmetric Channel
With crossover probability p, the capacity is C = 1 − H(p), where H is the binary entropy function.
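The closed form can be cross-checked against a brute-force maximization of I(X;Y) over input distributions (a sketch; the crossover value 0.11 and the grid resolution are arbitrary):

```python
import math

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(a, eps):
    """I(X;Y) for a BSC with crossover eps when P(X=1) = a."""
    py1 = a * (1 - eps) + (1 - a) * eps      # P(Y=1)
    return H2(py1) - H2(eps)                 # I(X;Y) = H(Y) - H(Y|X)

eps = 0.11
# Brute-force maximization over the input distribution approximates the capacity.
C_numeric = max(bsc_mutual_information(a / 1000, eps) for a in range(1001))
C_formula = 1 - H2(eps)
print(round(C_numeric, 4), round(C_formula, 4))   # both ~0.5, achieved by the uniform input
```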
Differential Entropy
For a density f(x), the differential entropy is h(X) = −∫ f(x) log f(x) dx.
Unlike discrete entropy, it:
Can be negative.
Depends on coordinate scaling.
However, continuous mutual information remains non-negative.
Gaussian
A Gaussian with variance σ² has h(X) = ½ log(2πeσ²) and maximizes entropy among all distributions with fixed variance.
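A quick check of the closed form against a Monte Carlo estimate of h(X) = E[−log f(X)] (a sketch; σ and the sample size are arbitrary):

```python
import math
import random

sigma = 2.0
# Closed form for a Gaussian: h(X) = 0.5 * log2(2 * pi * e * sigma^2) bits.
h_exact = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

# Monte Carlo estimate: average -log2 f(X) over samples drawn from f itself.
def log2_density(x):
    return math.log2(1 / (sigma * math.sqrt(2 * math.pi))) - x ** 2 / (2 * sigma ** 2 * math.log(2))

random.seed(0)
samples = [random.gauss(0.0, sigma) for _ in range(100_000)]
h_mc = -sum(log2_density(x) for x in samples) / len(samples)

print(round(h_exact, 3), round(h_mc, 3))   # the two values agree closely
```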
Data Processing Inequality
If X → Y → Z forms a Markov chain, then I(X;Z) ≤ I(X;Y).
Processing cannot increase information.
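A numerical illustration on a two-step binary Markov chain (a sketch; the noise levels are made up):

```python
import math

def mutual_information(pxy):
    """I(X;Y) in bits from a joint distribution given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), v in pxy.items():
        px[x] = px.get(x, 0) + v
        py[y] = py.get(y, 0) + v
    return sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in pxy.items() if v > 0)

# Markov chain X -> Y -> Z: X is Bernoulli(0.5), Y flips X with prob 0.1,
# and Z flips Y with prob 0.2 (made-up noise levels).
def flip_channel(eps):
    return {(0, 0): 1 - eps, (0, 1): eps, (1, 0): eps, (1, 1): 1 - eps}

px = {0: 0.5, 1: 0.5}
ch1, ch2 = flip_channel(0.1), flip_channel(0.2)

pxy = {(x, y): px[x] * ch1[(x, y)] for x in px for y in (0, 1)}
pxz = {}
for x in px:
    for y in (0, 1):
        for z in (0, 1):
            pxz[(x, z)] = pxz.get((x, z), 0) + px[x] * ch1[(x, y)] * ch2[(y, z)]

print(round(mutual_information(pxy), 4), round(mutual_information(pxz), 4))
# I(X;Z) <= I(X;Y): the second noisy step can only destroy information about X.
```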
Cross-Entropy
Cross-entropy H(p, q) = −∑_x p(x) log q(x) = H(p) + D_KL(p ‖ q) is widely used as a loss function in classification.
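A minimal sketch of the cross-entropy loss for a single one-hot example (the values are made up):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# One-hot target (true class is index 1) and a made-up predicted distribution.
p = [0.0, 1.0, 0.0]
q = [0.1, 0.7, 0.2]

print(round(cross_entropy(p, q), 4))   # -log2(0.7) ~= 0.5146
# With a one-hot target, H(p) = 0, so the loss equals D_KL(p || q):
# minimizing cross-entropy pushes the predicted q toward the target p.
```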
Applications
Compression: Huffman, arithmetic coding.
Error correction: Codes approaching capacity.
Machine learning: Cross-entropy loss, mutual information, variational inference.
Statistics: Maximum entropy principle, likelihood, Bayesian updates.
Summary
Entropy quantifies uncertainty. Mutual information quantifies dependence. KL divergence measures distributional mismatch.
Shannon’s theorems establish fundamental limits for compression and communication. These principles govern modern communication systems, statistical inference, and machine learning.