Suffix BWT vs cyclic shift BWT, and fast computation
13 points by fanf
13 points by fanf
Thank you @purplesyringa for writing this! I had never heard there were two different BWTs before.
these properties give it a place in data compression and genome alignment.
Are there any good rules of thumb for when one should use a Burrows-Wheeler Transform? In which cases does it perform better than other tools?
For data compression, I got the sense that BWT was pretty good but kind of a dead end, whereas the LZ- family of algorithms (.xz, .zstd) kept improving until they won out over BWT (.bz2). But my impression could be mistaken.
(As for genome alignment, I know very little, but my rough understanding is that it's a string search problem? What properties do the strings need to have for BWT to be a better choice than other string search algorithms?)
Are there any good rules of thumb for when one should use a Burrows-Wheeler Transform?
BWT is great for text and code, while LZ-based algorithms win on binary or tabular data. I saved 10% of space by switching from xz (it's better than zstd) to bzip3, and that's pretty common IME. I think that's just what these algorithms are aligned to due to their nature.
I got the sense that BWT was pretty good but kind of a dead end, whereas the LZ- family of algorithms (.xz, .zstd) kept improving until they won out over BWT (.bz2).
So there are actually two board families of compression algorithms: the LZ family and the PPM family.
LZ is "insert literal strings, copy from previous occurrences", you know the drill. PPM encodes the string character-by-character, modelling the probability distribution of the next character based on which character is usually present in the current context (determined by the last few preceding characters, in the simplest scenario). The characters are then encoded with entropy coding.
Both LZ and PPM have plenty of heuristics and parameters to tinker with: much like you can choose different coding formats for lengths/offsets in LZ and set different costs for different lengths/offsets, you can choose very different models in PPM. For instance, you can decode data bit-by-bit and add already decoded bits of the current byte to the context; or you can treat different subsets of preceding bytes as multiple contexts and mix their predictions; etc.
Context modeling is an incredibly broad topic, and the best text compressors use it in some shape of form. When done right, PPM/CM compresses data better than LZ. It's just often impractical because it's incredibly slow, since you need to do a ton of math per bit to get an outstanding result, and LZ is often good enough.
So where does BWT come in? BWT is actually a heuristic-free, greedy version of PPM. BWT grouping characters by context is similar to PPM looking for earlier continuations of the same context; it's just that BWT is global, while PPM decodes data in order. BWT cannot notice local changes in probability distributions and works best on stationary data, much like PPM would if you didn't limit predictions to recent occurrences. It's kind of a Temu PPM, but it's the best you can do without spending a ton of memory and CPU time, so it's more of a compromise than a dead-end.
There is actually a similar greedy algorithm in the LZ family: LZP. LZP supports two operations: "insert literal string S" and "copy N chars from the previous occurrence", which is similar to LZ, except it doesn't say where to copy from. Instead, the encoder and the decoder agree to search for the previous occurrence of the current context (e.g. the last 3 decoded characters) and copy from there. So this situation of having a single different greedy variant is not unique to PPM/BWT.
Wow, thank you so much for your detailed reply!
The theme I'm getting is that LZ is better when there are repeated sequences of symbols, whereas BWT is better when there are relationships between adjacent symbols. Both families of compression algorithm work because most data has both kinds of redundancy. But text has more of the latter (shorter repeats but higher correlation), which is why BWT does such a good job.
And today I learned that BWT is a form of PPM in disguise! That helps ground it for me. Previously, BWT had seemed like such an oddball: do this thing which doesn't look like any other data compression algorithm, and your data magically becomes more compressible.
What properties do the strings need to have for BWT to be a better choice than other string search algorithms?)
It's not BWT specifically but the related suffix array structure, used to quickly find all occurrences of a substring with binary search. It's an efficient index for when you might want to count how often any substring appears, also helpful for finding longest matches. Because the suffix array is on the order of the size of the text, probably most helpful when you need to do arbitrary queries frequently on some static corpus.
My guess is that it's helpful with genomes specifically because of the small alphabet and lack of easily parsed structure, but I'm not so familiar with other string search algorithms.
As for genome alignment, I know very little, but my rough understanding is that it's a string search problem? What properties do the strings need to have for BWT to be a better choice than other string search algorithms?
I don't really know much about this topic, but to the best of my knowledge, the most pressing issue here is memory use. DNA strings are incredibly long, and most other search algorithms have large indexes. BWT-based indexes takes about the same space as the original string, so they're the only practical approach.