encode | decode | compress

> huffman | optimal | codes <

// Huffman Coding - Optimal prefix-free compression algorithm

[OPTIMAL]

Optimal Codes

Generates shortest average code length for given frequencies.

[PREFIX-FREE]

Prefix-Free

No code is a prefix of another, enabling unique decoding.

[ADAPTIVE]

Frequency-Based

Common characters get shorter codes automatically.

>> technical info

How Huffman Coding Works:

Huffman coding builds a binary tree from the bottom up. Starting with character frequencies, it repeatedly combines the two lowest-frequency nodes into a parent node until only the root remains. Left branches are '0', right branches are '1'. Characters with higher frequency get shorter codes, achieving optimal compression for the given frequency distribution.
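The bottom-up build described above can be sketched in Python with the standard library's heapq; the function name and the tuple-based tree representation are illustrative choices, not part of any particular library:

```python
import heapq
from collections import Counter

def build_codes(text):
    """Build a Huffman code table {char: bitstring} from character frequencies."""
    freq = Counter(text)
    # Each heap entry is (frequency, tiebreaker, tree), where a tree is either
    # a character (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, ch) for i, (ch, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # lowest frequency
        f2, _, right = heapq.heappop(heap)   # second lowest
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, str):            # leaf: record its code
            codes[tree] = prefix or "0"      # single-symbol edge case
        else:
            walk(tree[0], prefix + "0")      # left branch = '0'
            walk(tree[1], prefix + "1")      # right branch = '1'
    walk(heap[0][2], "")
    return codes

codes = build_codes("AABBBCCCC")
# C (most frequent) gets a 1-bit code; A and B get 2-bit codes.
```

Tie-breaking differs between implementations, so the exact bit patterns can vary; the code lengths, and therefore the compressed size, are the same for any valid Huffman tree.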

Huffman Coding Example:

Text: "AABBBCCCC"

Frequencies:
A: 2
B: 3
C: 4

Tree Building:
1. Combine A(2) + B(3) = AB(5)
2. Combine C(4) + AB(5) = Root(9)

Tree:
     Root(9)
    /      \
   C(4)    AB(5)
          /     \
        A(2)    B(3)

Codes:
C: 0
A: 10
B: 11

Encoded: 10 10 11 11 11 0 0 0 0
= 10101111110000 (14 bits, versus 18 bits at a fixed 2 bits per character)
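That encoding can be reproduced mechanically; a minimal Python check, assuming the code table derived above:

```python
# Code table from the tree above: C=0, A=10, B=11
codes = {"C": "0", "A": "10", "B": "11"}

encoded = "".join(codes[ch] for ch in "AABBBCCCC")
print(encoded)       # 10101111110000
print(len(encoded))  # 14 bits, vs 9 chars * 2 bits = 18 for a fixed-length code
```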

Why Use Huffman Coding:

  • Optimal compression ratio
  • Simple implementation
  • Foundation of ZIP/JPEG
  • No data loss
  • Proven efficiency

>> frequently asked questions

What is Huffman coding?

Huffman coding is an optimal prefix-free compression algorithm developed by David A. Huffman in 1952. It assigns variable-length codes to characters based on their frequency, with more frequent characters getting shorter codes, achieving the minimum average code length possible.

Why is it called prefix-free?

A code is prefix-free when no codeword is a prefix of another. For example, if 'A' is '01', then no other character can have a code starting with '01'. This property ensures unambiguous decoding - you can always tell where one code ends and another begins without separators.
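A short Python sketch of this greedy, separator-free decoding, reusing the example code table from this page:

```python
codes = {"C": "0", "A": "10", "B": "11"}
decode_table = {bits: ch for ch, bits in codes.items()}

def decode(bitstring):
    """Read bits left to right; emit a character as soon as the buffer
    matches a codeword. Prefix-freeness makes the first match unambiguous."""
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        if buf in decode_table:
            out.append(decode_table[buf])
            buf = ""
    assert buf == "", "trailing bits did not form a complete codeword"
    return "".join(out)

print(decode("10101111110000"))  # AABBBCCCC
```

If the code were not prefix-free (say A='0' and C='01'), the decoder could not tell whether an initial '0' was a complete A or the start of a C.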

How efficient is Huffman coding?

Huffman coding achieves the optimal per-symbol compression for a given frequency distribution, typically coming within 1 bit per symbol of the theoretical entropy limit. Applied character-by-character to English text, it usually shrinks the data to roughly 60-70% of its original size. It is most effective when character frequencies vary significantly.

Where is Huffman coding used?

Huffman coding is used in JPEG image compression, ZIP file compression (combined with LZ77 in DEFLATE), MP3 audio, and many other formats. It's often combined with other algorithms - for example, JPEG uses Huffman after DCT and quantization, while ZIP uses it after LZ77.

What is Huffman coding, and how do you work it out by hand?

Huffman coding, invented by David A. Huffman in 1952, is an optimal prefix code (a variable-length, unambiguously decodable code) based on character frequencies. The more frequent a character, the shorter its code.

Manual steps (using AABBBCCCC as an example):
1. Count frequencies: A=2, B=3, C=4.
2. Create a priority queue with one leaf node per character: [A:2, B:3, C:4].
3. Remove the two lowest-frequency nodes (A:2, B:3) and merge them into a new node AB:5. The queue becomes [C:4, AB:5].
4. Remove the two lowest again (C:4, AB:5) and merge them into the root, Root:9.
5. Walk from the root: left branch = 0, right branch = 1.
  • C sits in the left subtree → 0
  • A sits at right, then left → 10
  • B sits at right, then right → 11

A fixed-length code would need 2 bits per character (3 symbols require ⌈log2 3⌉ = 2 bits), or 18 bits in total; Huffman needs only 10·10·11·11·11·0·0·0·0 = 14 bits, a saving of ~22%.
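The bit counts above can be double-checked in a few lines of Python (frequencies and code lengths taken from the hand calculation):

```python
freqs = {"A": 2, "B": 3, "C": 4}
code_lengths = {"A": 2, "B": 2, "C": 1}   # from the hand-built tree

fixed = sum(freqs.values()) * 2           # 2 bits/char fixed-length: 18 bits
huffman = sum(freqs[c] * code_lengths[c] for c in freqs)  # 4 + 6 + 4 = 14 bits
print(fixed, huffman, round(1 - huffman / fixed, 3))      # 18 14 0.222
```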

How Huffman coding works, and how it relates to compressing Japanese text

Huffman coding is a variable-length code invented by David A. Huffman in 1952. The more frequently a symbol occurs, the shorter the bit string that represents it, bringing the average code length close to the entropy (the information-theoretic lower bound of the source).

Applications to Japanese text compression:
UTF-8: one Japanese character takes 3 bytes. A text that repeats only あいうえお can be compressed from 3 bytes to 1-2 bytes per character with a static Huffman code.
DEFLATE in ZIP / gzip: after LZ77 removes duplication, Huffman encodes the remaining literal/length/distance symbols. Japanese plain text typically shrinks by around 60-70%.
JPEG: DCT → quantization → zigzag scan → Huffman coding as the final stage of still-image compression.

Huffman's greatest strength is the prefix-free property: no codeword is a prefix of any other, so the stream can be decoded unambiguously without separators.

Huffman coding: how do you compute the entropy and know whether it's worth it?

Huffman coding approaches the theoretical compression limit given by Shannon entropy:
H = -Σ p(x) · log2(p(x))
where p(x) is the probability (relative frequency) of each symbol.

For AAAABBC (A=4/7, B=2/7, C=1/7):
H = -(4/7·log2(4/7) + 2/7·log2(2/7) + 1/7·log2(1/7)) ≈ 1.38 bits/symbol
• Huffman produces: A=0 (1 bit), B=10 (2 bits), C=11 (2 bits) → average (4·1 + 2·2 + 1·2)/7 ≈ 1.43 bits/symbol.
The gap to the entropy is only about 0.05 bits/symbol. Huffman matches the entropy exactly when all symbol probabilities are powers of 1/2. For arbitrary distributions, arithmetic coding gets even closer to the entropy bound.
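The same arithmetic can be reproduced in Python with math.log2 (variable names are illustrative):

```python
from math import log2

probs = {"A": 4/7, "B": 2/7, "C": 1/7}      # from "AAAABBC"
lengths = {"A": 1, "B": 2, "C": 2}          # Huffman code lengths

# Shannon entropy: lower bound on average bits per symbol
entropy = -sum(p * log2(p) for p in probs.values())
# Expected Huffman code length
avg_len = sum(probs[s] * lengths[s] for s in probs)
print(round(entropy, 2), round(avg_len, 2))  # 1.38 1.43
```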

Huffman coding: what is it used for in practice?

Huffman coding is rarely used on its own, but it appears everywhere as the final stage of compression pipelines:
DEFLATE (gzip, ZIP, PNG): LZ77, then Huffman on the literal and distance codes.
JPEG: DCT + quantization, then Huffman on the coefficients.
MP3 / AAC: quantized sub-bands, then predefined Huffman tables.
bzip2: BWT + MTF + RLE, then Huffman.
HTTP/2 HPACK: static Huffman on HTTP headers.

On its own, Huffman suits heavily skewed distributions (a few very frequent symbols). For ordinary English text it compresses to roughly 60% of the original size; combined with LZ77 (DEFLATE), that drops to 30-40%.

What is an adaptive Huffman code?

Classic (static) Huffman requires two passes over the data: one to count frequencies, one to encode. Adaptive Huffman (Vitter's algorithm, FGK algorithm) rebuilds the tree incrementally as it reads input — no pre-pass, no shipped frequency table.

Trade-offs:
Pros: single-pass streaming; no header overhead; adapts to changing statistics mid-stream.
Cons: more complex implementation; tree updates per symbol cost CPU; slightly worse compression than static Huffman when statistics are stable.

Adaptive Huffman shines in network streaming and real-time encoding where you don't know the full input in advance. For file compression where you have the whole data upfront, static Huffman with a shipped frequency table (or canonical Huffman) is simpler and faster.
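The synchronization idea can be illustrated with a deliberately naive sketch in Python: rebuild a static Huffman table from running counts before every symbol. This is not FGK or Vitter's algorithm (those update the tree incrementally instead of rebuilding), and it assumes the alphabet is known up front; it only shows how encoder and decoder stay in lockstep with no shipped frequency table:

```python
import heapq

def codes_from_counts(counts):
    """Static Huffman table from a count dict, with deterministic tie-breaking."""
    # Heap entries: (freq, smallest symbol in subtree, tree); a tree is a
    # symbol (leaf) or a (left, right) pair.
    heap = [(f, s, s) for s, f in sorted(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, s1, t1 = heapq.heappop(heap)
        f2, s2, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, min(s1, s2), (t1, t2)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, str):
            codes[tree] = prefix or "0"
        else:
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

def adaptive_encode(text, alphabet):
    counts = {s: 1 for s in alphabet}   # every symbol starts "seen once"
    out = []
    for ch in text:
        out.append(codes_from_counts(counts)[ch])  # code from counts so far
        counts[ch] += 1                 # decoder makes the identical update
    return "".join(out)

def adaptive_decode(bits, alphabet):
    counts = {s: 1 for s in alphabet}
    out, i = [], 0
    while i < len(bits):
        table = {c: s for s, c in codes_from_counts(counts).items()}
        buf = ""
        while buf not in table:         # prefix-free: first match is the symbol
            buf += bits[i]
            i += 1
        out.append(table[buf])
        counts[table[buf]] += 1
    return "".join(out)

msg = "AABBBCCCC"
assert adaptive_decode(adaptive_encode(msg, "ABC"), "ABC") == msg
```

Rebuilding the whole table per symbol is O(n log n) per step, which is exactly the cost FGK and Vitter avoid; the roundtrip works because both sides derive the same table from the same counts at every step.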

Huffman vs Shannon-Fano vs Arithmetic coding — which wins?

All three are entropy codes that exploit symbol probability, but they differ in optimality and complexity:

Algorithm             Optimal?                    Complexity          Used in
Shannon-Fano (1948)   Near-optimal                Simple, top-down    Historical / ZIP Implode
Huffman (1952)        Optimal (per-symbol)        Simple, bottom-up   DEFLATE, JPEG, MP3
Arithmetic (1976)     Optimal (fractional bits)   More complex        JPEG 2000, H.264 (CABAC)
Huffman wastes fractions of a bit because each symbol must use an integer number of bits. Arithmetic coding can spend fractional bits per symbol and reaches the Shannon entropy bound almost exactly. The catch: arithmetic coding was patent-encumbered into the 2000s, which is why DEFLATE and MP3 stuck with Huffman. Modern codecs have moved on: H.265 uses CABAC (binary arithmetic coding), AV1 uses a multi-symbol range coder, and zstd uses ANS (asymmetric numeral systems), which matches arithmetic coding's compression at higher speed.
