llmfuse: a self-compressing filesystem backed by an LLM
9 points by grohan
A lossy filesystem kind of reminds me of how you can slowly degrade the encoding the longer a file sits in a CDN. Would be nice to be able to do this for large amounts of text as well.
Huh, arithmetic encoding isn't lossy; unless I misunderstand this project, it should be lossless compression, just slow.
As the diagram shows, arithmetic coding splits up the unit interval, giving each piece of data its own sub-interval. We can "decode" a particular numerical value by checking which interval contains it; and we can use nested intervals to represent sequences. It's like a continuous analogue of Huffman trees.
This could be used for lossy compression, but that requires some extra assumptions. Firstly, to get any compression we need to correlate the size of the intervals with the probability of the data: less-likely data gets a smaller interval, so the numerical values contained in that interval need higher precision (i.e. take longer to write out).
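If it helps to see that concretely, here's a toy sketch (a made-up three-symbol distribution, nothing from the actual project): likelier symbols get wider slices of [0, 1), and pinning down a value inside a slice of width p costs roughly -log2(p) bits.

    import math

    # Toy next-symbol distribution: likelier symbols get wider slices of
    # [0, 1); a value inside a slice of width p takes ~ -log2(p) bits.
    probs = {"the": 0.5, "cat": 0.3, "xylophone": 0.2}

    low = 0.0
    for sym, p in probs.items():
        high = low + p
        print(f"{sym!r}: [{low:.2f}, {high:.2f})  ~{-math.log2(p):.2f} bits")
        low = high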
To get lossy compression, we'd want truncations of a numerical value to be contained in intervals for "similar" data. In practice this would mean semantically-similar data gets assigned to numerically-similar intervals. I'm not sure how much that holds for a GPT-like LLM; my guess is that it's reasonably good. "Variational" techniques are the classic way to link numerical similarity to semantic similarity; e.g. variational autoencoders are learned compression algorithms, whose compressed representations act as a "latent space", similar to what you describe.
So the way this works is that it takes the next-token probabilities given a context and uses them to subdivide the interval. The tokens the LLM considers most likely get the biggest slice of the space (which ultimately means they take fewer bits to code), and the less-likely tokens get smaller slices, taking more bits. So as long as the input "looks like text", you get compression.
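Roughly, the encode loop looks like this (an idealized float sketch of my own, not llmfuse's actual code; next_token_probs is a placeholder for whatever call produces the model's softmax over the next token given the context so far):

    # Idealized encode loop (floats for clarity; real coders work with
    # integer ranges and renormalize as they go).
    def encode(tokens, next_token_probs):
        low, high = 0.0, 1.0
        for i, tok in enumerate(tokens):
            probs = next_token_probs(tokens[:i])  # dict: token_id -> prob
            width = high - low
            cum = 0.0
            for tid in sorted(probs):             # sub-intervals ordered by token ID
                p = probs[tid]
                if tid == tok:
                    low, high = low + cum * width, low + (cum + p) * width
                    break
                cum += p
        # Any number in [low, high) identifies the whole sequence; a likelier
        # sequence leaves a wider interval, so fewer bits pin it down.
        return (low + high) / 2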
LLM embeddings do have the kind of similarity-space properties you're talking about, but the arithmetic space for this is ordered by token IDs, which don't. Qwen uses a BPE tokenizer, which tends to assign lower token IDs to more common tokens, but "cat" and "feline" aren't going to be anywhere near each other.
But arithmetic coding does have a prefix property: each token's portion of the space is subdivided for the next token, and then the next, recursively. So "there is", "there are", and "there in the sky, it's Superman" are all going to be nearby. That's what lets it produce output progressively: once you've read enough bits to know that the value is definitely in a range that includes "there", it can output "there" and then rescale the intervals so that it doesn't need infinite precision.
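And the decode side of the same idealized sketch (again just my illustration; it needs to know the token count, or an end-of-text symbol, to stop, and real coders consume bits incrementally with the rescaling described above rather than holding one exact value):

    # Decode mirror of the sketch above: find which token's sub-interval
    # contains the value, emit that token, recurse into that sub-interval.
    def decode(value, n_tokens, next_token_probs):
        tokens = []
        low, high = 0.0, 1.0
        for _ in range(n_tokens):
            probs = next_token_probs(tokens)
            width = high - low
            cum = 0.0
            for tid in sorted(probs):
                p = probs[tid]
                if low + cum * width <= value < low + (cum + p) * width:
                    tokens.append(tid)
                    low, high = low + cum * width, low + (cum + p) * width
                    break
                cum += p
        return tokens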
Long story short, if you want to do lossy text compression with one of these models, you can, but you would go about it differently.