Information Theory
Introduction
Information theory, introduced by Claude Shannon (1948), gives a mathematical framework for quantifying, compressing, and transmitting information.
Its central idea: information corresponds to reduction in uncertainty. Rare events carry more information than predictable ones. This intuition is formalized through entropy and related measures.
Entropy
For a discrete random variable X with distribution p(x), the entropy is H(X) = −∑_x p(x) log p(x).
Entropy is measured in bits (base-2 logarithm) or nats (natural logarithm).
Basic Properties
H(X) ≥ 0, with equality iff X is deterministic.
If X takes n values, then H(X) ≤ log n, with equality for the uniform distribution (both bounds are checked in the sketch below).
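As a quick numerical check (a minimal sketch using only the Python standard library; the example distribution is made up and the helper entropy is defined ad hoc):

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]               # an arbitrary example distribution
H = entropy(p)
print(H)                                    # 1.75 bits
print(0 <= H <= math.log2(len(p)))          # True: 0 <= H(X) <= log n
print(entropy([1.0, 0.0, 0.0, 0.0]) == 0)   # True: deterministic => zero entropy
print(entropy([0.25] * 4))                  # 2.0 = log2(4): uniform hits the bound
```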
Binary Entropy
For a Bernoulli(p) variable, the binary entropy is H(p) = −p log p − (1−p) log(1−p).
Maximum uncertainty occurs at p = 1/2, where H(1/2) = 1 bit.
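A small sketch of the binary entropy function (the probe values are arbitrary):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(p, round(binary_entropy(p), 4))   # symmetric around 0.5, peak of 1 bit at p = 0.5
```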
Joint and Conditional Entropy
For a joint distribution p(x, y), the joint entropy is H(X, Y) = −∑_{x,y} p(x, y) log p(x, y).
Conditional entropy: H(Y|X) = −∑_{x,y} p(x, y) log p(y|x) = H(X, Y) − H(X).
Chain rule: H(X, Y) = H(X) + H(Y|X).
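The chain rule can be verified on a toy joint distribution (a sketch; the probabilities are made up and the helper entropy is defined ad hoc):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up joint distribution p(x, y) over two binary variables.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

H_XY = entropy(pxy.values())                                      # H(X, Y)
px = {x: sum(v for (a, _), v in pxy.items() if a == x) for x in (0, 1)}
H_X = entropy(px.values())                                        # H(X)

# H(Y|X) computed directly as a weighted average of per-row entropies.
H_Y_given_X = sum(
    px[x] * entropy([pxy[(x, y)] / px[x] for y in (0, 1)]) for x in (0, 1)
)

# Chain rule: H(X, Y) = H(X) + H(Y|X).
print(round(H_XY, 4), round(H_X + H_Y_given_X, 4))   # both ~1.8464
```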
Mutual Information
Mutual information measures shared information: I(X;Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y).
Equivalent KL form: I(X;Y) = D_KL(p(x, y) ‖ p(x) p(y)).
Properties
I(X;Y) ≥ 0, with equality iff X and Y are independent.
Symmetric: I(X;Y) = I(Y;X).
I(X;X) = H(X).
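Both forms of I(X;Y) can be compared on the same style of toy joint distribution (again a sketch with made-up numbers):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up joint distribution p(x, y) over two binary variables.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(v for (a, _), v in pxy.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in pxy.items() if b == y) for y in (0, 1)}

# I(X;Y) = H(X) + H(Y) - H(X, Y)
I = entropy(px.values()) + entropy(py.values()) - entropy(pxy.values())

# Equivalent KL form: sum over (x, y) of p(x, y) log [ p(x, y) / (p(x) p(y)) ]
I_kl = sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in pxy.items())

print(round(I, 4), round(I_kl, 4))   # identical values, and non-negative
```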
Kullback–Leibler Divergence
For distributions p and q, D_KL(p ‖ q) = ∑_x p(x) log (p(x)/q(x)).
D_KL(p ‖ q) ≥ 0.
Zero iff p = q.
Not symmetric.
KL divergence measures distributional discrepancy and underlies many learning objectives.
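A minimal sketch of these properties (the example distributions are arbitrary):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]                   # arbitrary example distributions
q = [1 / 3, 1 / 3, 1 / 3]

print(round(kl_divergence(p, q), 4))    # non-negative
print(round(kl_divergence(q, p), 4))    # generally a different value: KL is not symmetric
print(kl_divergence(p, p))              # 0.0: zero iff p = q
```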
Source Coding
Shannon’s source coding theorem: n i.i.d. copies of X can be compressed losslessly into about n·H(X) bits as n grows, and no lossless scheme can use fewer bits per symbol on average.
The average code length L of any uniquely decodable code satisfies L ≥ H(X); an optimal prefix code (e.g., Huffman) achieves L < H(X) + 1.
Entropy is the fundamental limit of lossless compression.
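A sketch of the bound H(X) ≤ L < H(X) + 1 using a small ad hoc Huffman construction (the source distribution is made up; huffman_lengths is an illustrative helper, not a library function):

```python
import heapq
import math

def huffman_lengths(p):
    """Code lengths of an optimal (Huffman) prefix code for distribution p."""
    # Heap items: (probability, unique tiebreaker, symbol indices in this subtree).
    heap = [(pi, i, [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # each merge adds one bit to every symbol below it
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

p = [0.4, 0.3, 0.2, 0.1]                                    # an arbitrary source distribution
L = sum(pi * li for pi, li in zip(p, huffman_lengths(p)))   # average code length
H = -sum(pi * math.log2(pi) for pi in p)                    # entropy
print(round(H, 4), round(L, 4))                             # H <= L < H + 1 (here ~1.8464 vs 1.9)
```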
Channel Capacity
For a channel with transition probabilities p(y|x), the capacity is C = max over p(x) of I(X;Y).
Shannon’s channel coding theorem:
Reliable transmission is possible at any rate R < C.
It is impossible at any rate R > C.
Binary Symmetric Channel
With crossover probability p, the capacity is C = 1 − H(p), where H is the binary entropy function.
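The closed form can be cross-checked against a brute-force maximization of I(X;Y) over input distributions (a sketch; the crossover value 0.11 and the grid resolution are arbitrary):

```python
import math

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(a, eps):
    """I(X;Y) for a BSC with crossover eps when P(X=1) = a."""
    py1 = a * (1 - eps) + (1 - a) * eps      # P(Y=1)
    return H2(py1) - H2(eps)                 # I(X;Y) = H(Y) - H(Y|X)

eps = 0.11
# Brute-force maximization over the input distribution approximates the capacity.
C_numeric = max(bsc_mutual_information(a / 1000, eps) for a in range(1001))
C_formula = 1 - H2(eps)
print(round(C_numeric, 4), round(C_formula, 4))   # both ~0.5, achieved by the uniform input
```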
Differential Entropy
For a density f(x), the differential entropy is h(X) = −∫ f(x) log f(x) dx.
Unlike discrete entropy, it:
Can be negative.
Depends on coordinate scaling.
However, continuous mutual information remains non-negative.
Gaussian
A Gaussian with variance σ² has h(X) = ½ log(2πeσ²) and maximizes entropy among all distributions with fixed variance.
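A quick check of the closed form against a Monte Carlo estimate of h(X) = E[−log f(X)] (a sketch; σ and the sample size are arbitrary):

```python
import math
import random

sigma = 2.0
# Closed form for a Gaussian: h(X) = 0.5 * log2(2 * pi * e * sigma^2) bits.
h_exact = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

# Monte Carlo estimate: average -log2 f(X) over samples drawn from f itself.
def log2_density(x):
    return math.log2(1 / (sigma * math.sqrt(2 * math.pi))) - x ** 2 / (2 * sigma ** 2 * math.log(2))

random.seed(0)
samples = [random.gauss(0.0, sigma) for _ in range(100_000)]
h_mc = -sum(log2_density(x) for x in samples) / len(samples)

print(round(h_exact, 3), round(h_mc, 3))   # the two values agree closely
```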
Data Processing Inequality
If X → Y → Z forms a Markov chain, then I(X;Z) ≤ I(X;Y).
Processing cannot increase information.
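A numerical illustration on a two-step binary Markov chain (a sketch; the noise levels are made up):

```python
import math

def mutual_information(pxy):
    """I(X;Y) in bits from a joint distribution given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), v in pxy.items():
        px[x] = px.get(x, 0) + v
        py[y] = py.get(y, 0) + v
    return sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in pxy.items() if v > 0)

# Markov chain X -> Y -> Z: X is Bernoulli(0.5), Y flips X with prob 0.1,
# and Z flips Y with prob 0.2 (made-up noise levels).
def flip_channel(eps):
    return {(0, 0): 1 - eps, (0, 1): eps, (1, 0): eps, (1, 1): 1 - eps}

px = {0: 0.5, 1: 0.5}
ch1, ch2 = flip_channel(0.1), flip_channel(0.2)

pxy = {(x, y): px[x] * ch1[(x, y)] for x in px for y in (0, 1)}
pxz = {}
for x in px:
    for y in (0, 1):
        for z in (0, 1):
            pxz[(x, z)] = pxz.get((x, z), 0) + px[x] * ch1[(x, y)] * ch2[(y, z)]

print(round(mutual_information(pxy), 4), round(mutual_information(pxz), 4))
# I(X;Z) <= I(X;Y): the second noisy step can only destroy information about X.
```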
Cross-Entropy
Cross-entropy H(p, q) = −∑_x p(x) log q(x) = H(p) + D_KL(p ‖ q) is widely used as a loss function in classification.
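A minimal sketch of the cross-entropy loss for a single one-hot example (the values are made up):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# One-hot target (true class is index 1) and a made-up predicted distribution.
p = [0.0, 1.0, 0.0]
q = [0.1, 0.7, 0.2]

print(round(cross_entropy(p, q), 4))   # -log2(0.7) ~= 0.5146
# With a one-hot target, H(p) = 0, so the loss equals D_KL(p || q):
# minimizing cross-entropy pushes the predicted q toward the target p.
```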
Applications
Compression: Huffman, arithmetic coding.
Error correction: Codes approaching capacity.
Machine learning: Cross-entropy loss, mutual information, variational inference.
Statistics: Maximum entropy principle, likelihood, Bayesian updates.
Summary
Entropy quantifies uncertainty. Mutual information quantifies dependence. KL divergence measures distributional mismatch.
Shannon’s theorems establish fundamental limits for compression and communication. These principles govern modern communication systems, statistical inference, and machine learning.