DEV Community

Cover image for The Magic behind Bzip2 : Burrows Wheeler Transform
Akash Singh
Akash Singh

Posted on

The Magic behind Bzip2 : Burrows Wheeler Transform

ACM Winter School 2024

I recently went to ACM Winter School 2024 and over there Prof. Chirag Jain showed us how Burrows Wheeler Transform works for compression in tools like bzip2 and how it can be used to optimize Suffix Trees and Suffix Arrays. In this blog, we'll unravel the workings of the BWT and its significance, while I try to translate my class notes into this blog xD.

Introduction

I am Akash Singh, a third year engineering student and Open Source Contributor from Bangalore.
Here is my LinkedIn, GitHub and Twitter

Sky Singh

I go by the name SkySingh04 online.


What is the Burrows-Wheeler Transform?

The BWT is a reversible text transformation that rearranges the characters of a string into runs of similar characters, making the string more amenable to compression. This transformation is pivotal in compression algorithms like bzip2 and indexing structures like the FM-index.

The magic of the BWT lies in its ability to group identical characters together while retaining the original string's information, albeit in a transformed format.

Let’s explore how BWT is applied step-by-step using the example banana$.


Applying BWT on banana$

  1. Start with a Sentinel Character Append a sentinel character $ at the end of the string. This character:
    • Marks the end of the string.
    • Is lexicographically smaller than all other characters.

For banana, we get banana$.

  1. Generate All Rotations Create all cyclic rotations of the string:
   banana$
   anana$b
   nana$ba
   ana$ban
   na$bana
   a$banan
   $banana
Enter fullscreen mode Exit fullscreen mode
  1. Sort the Rotations Lexicographically After sorting the rotations:
   $banana
   a$banan
   ana$ban
   anana$b
   banana$
   na$bana
   nana$ba
Enter fullscreen mode Exit fullscreen mode
  1. Extract the Last Column The last column of the sorted rotations forms the BWT:
   BWT: annb$aa
Enter fullscreen mode Exit fullscreen mode

Fun Story :

One of the best parts about attending such Winter schools are the people around you. For example, this is the gossip between me and Alfiya xD.

Image description

Yes this is real and was a meme throughout the winter school lmao.

Reversibility: Going Back to the Original Text

One of the most fascinating aspects of BWT is its reversibility. You can reconstruct the original string from the transformed version without losing any information. Here’s how:

  1. Sort the BWT Create the first column by sorting the characters of the BWT. For annb$aa, the sorted order is:
   $aabnn
Enter fullscreen mode Exit fullscreen mode
  1. Link Columns The relationship between the first and last columns allows us to reconstruct the string. By tracking character positions and their order, we recover banana$.

I think my notes will make more sense here

Image description

This lossless transformation ensures that no data is discarded during the process.


Succinct Indexing with BWT

The succinct index created by BWT is used in data structures like the FM-index, which powers efficient string searching. Instead of storing the entire text, the FM-index enables queries by leveraging the BWT and auxiliary data structures like suffix arrays. This is especially valuable for applications requiring high-speed pattern matching.


BWT and Suffix Trees

In class, Prof. Chirag Jain explained how BWT interacts with suffix trees and suffix arrays:

  1. Suffix Trees:

    By applying BWT to suffix trees, we can reduce memory usage while maintaining efficient query performance. The BWT clusters repeated substrings, which aligns closely with how suffix trees group similar patterns.

  2. Genome Indexing:

    In bioinformatics, the BWT is critical for indexing and searching genomic data. Tools like Bowtie and BWA rely on the FM-index (a BWT-based structure) to align DNA sequences efficiently.


Why is BWT Useful for Genomics?

In genome sequencing, researchers often need to match short DNA sequences against massive genomic datasets. The BWT-based FM-index enables:

  • Space Efficiency: Compact representation of the genome.
  • Fast Searching: Efficient pattern matching without decompressing the data.
  • Scalability: Handles terabytes of genomic data.

For example, if a query DNA sequence is AGCT, the BWT groups occurrences of A, G, C, and T together, making it faster to locate matches.


Conclusion

The Burrows-Wheeler Transform is not just an algorithm—it’s a gateway to understanding the interplay between mathematics and computer science. From powering compression tools like bzip2 to optimizing bioinformatics pipelines, BWT continues to revolutionize how we process and store data.

Attending ACM Winter School 2024 gave me a fresh perspective on how these elegant algorithms shape our world. If you’re interested in exploring string algorithms, BWT is the perfect starting point, right after Suffix Trees and Suffix Trees. Also I'll be willing to sell my winter school notes, the bidding starts at 69 bucks xD.

Top comments (2)

Collapse
 
alfiyafatima09 profile image
Alfiya Fatima

Hey, it was amazing attending the winter school together! Nice Blog :)

Collapse
 
skysingh04 profile image
Akash Singh

Likewise