> bwt | transform | compress <
// Burrows-Wheeler Transform - Reversible transformation for compression
>> features
Fully Reversible
Lossless transformation that can be perfectly inverted.
Character Clustering
Groups similar characters together for better compression.
Core of bzip2
Foundation of the bzip2 compression algorithm.
>> technical info
How BWT Works
BWT creates all cyclic rotations of the input string, sorts them lexicographically, and outputs the last column of the sorted matrix along with the index of the original string. This transformation tends to group similar characters together, making the output highly compressible by run-length encoding or move-to-front transform.
BWT Example
Input: "banana$" Rotations: 0: banana$ 1: anana$b 2: nana$ba 3: ana$ban 4: na$bana 5: a$banan 6: $banana Sorted: 0: $banana 1: a$banan 2: ana$ban 3: anana$b 4: banana$ 5: na$bana 6: nana$ba Last column: annb$aa Original index: 4 Output: annb$aa|4
Why Use BWT
- Improves compressibility
- Fully reversible
- Groups similar contexts
- Foundation for bzip2
- Effective preprocessing
>> frequently asked questions
What is Burrows-Wheeler Transform?
BWT is a reversible transformation algorithm that rearranges a character string into runs of similar characters. Developed by Michael Burrows and David Wheeler in 1994, it's used as a preprocessing step in compression algorithms like bzip2 to improve compression ratios.
Why does BWT improve compression?
BWT groups characters that occur in similar contexts. For example, in English text, 'th' is common, so after BWT, many 't's will be grouped together (from contexts ending in 'h'). These runs of similar characters compress very well with run-length encoding or entropy coders.
What is the EOF marker?
The EOF (End-Of-File) marker, typically '$' or another unique character, ensures all rotations are unique and helps identify the original string position. Without it, some strings cannot be uniquely transformed. It's removed during the inverse transform.
BWT vs other compression methods?
BWT itself doesn't compress - it's a transformation that makes data more compressible. Unlike LZ77 (used in ZIP), which finds repeated patterns, BWT reorganizes data to create patterns. It's typically combined with Move-to-Front, Run-Length Encoding, and Huffman coding in bzip2.