Xz format inadequate for general use
33 points by df
There used to be a site-generated list of previous submissions. Is that functionality broken now?
Yes, it’s broken: https://github.com/lobsters/lobsters/issues/1529
The code to require the submitter to write a comment starting a new discussion clearly didn’t run as there’s no top-level comment from him. I don’t know if it even presented the list of previous submissions so they’d know they were about to resubmit.
Oops, sorry. I haven’t been on Lobsters much for about a year and just assumed it would warn me if it was already posted, and then I wouldn’t post. I did not see any warning at all.
It should be me apologizing, it’s almost certainly my bug and I’m responsible regardless. Sorry it’s not fixed yet, and no worries!
Reposting after a number of months is ok.
I tried to submit a previously submitted item a few days ago and did get the warning. So the issue seems to be older items.
The other submissions have this one shown on the bottom as “stories with similar links.” It’s just that this one seems bugged out.
Glad to see there’s an “already posted” flag
😂 thank you for adding that missing context.
I reviewed the previous submissions and comments. I considered the content and the fact that this was posted 7 years ago, one year ago, and today. It is my opinion that nothing about xz has changed between 7 years ago, last year, and today, which is why I flagged the post.
Independent of anything else, its performance seems to be:
-rw-r--r-- 1 14208 Oct 21 17:26 100MBzeros.lz
-rw-r--r-- 1 14195 Oct 21 17:26 100MBzeros.lzma
-rw-r--r-- 1 14676 Oct 21 17:26 100MBzeros.xz
The naming seems to imply it’s comparing the performance of these compression algorithms by testing a 100 MB file of identical bytes? Even purely RLE schemes could trivially match that performance.
There are plenty of real-world data corpuses that can be used to compare compression performance, so why not just use those? That’s literally what they’re there for.
These are not comparisons of compression algorithms. All three formats use the same compression algorithm; they are comparisons of the container format. Zeroes are used because they should compress to almost nothing (though, because the compression algorithm is block based, the output will include metadata for multiple blocks). The comparison is of the size of the metadata for files containing the same number of blocks.
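Python’s standard-library lzma module can reproduce this kind of container-overhead comparison directly, since it can emit both the .xz container (FORMAT_XZ, LZMA2 chunks plus headers, checksums, and an index) and the legacy .lzma container (FORMAT_ALONE, a minimal header plus a bare LZMA1 stream). A sketch, using a smaller 16 MiB zero buffer just to keep it quick:

```python
import lzma

# 16 MiB of zeroes (smaller than the 100 MB files above, to keep this quick).
data = bytes(16 * 1024 * 1024)

# Same underlying LZMA compression, two different containers:
# FORMAT_XZ    -> .xz   (LZMA2 chunks, block headers, checksum, index)
# FORMAT_ALONE -> .lzma (minimal 13-byte header, raw LZMA1 stream)
xz_size = len(lzma.compress(data, format=lzma.FORMAT_XZ, preset=9))
lzma_size = len(lzma.compress(data, format=lzma.FORMAT_ALONE, preset=9))

print(f".xz:   {xz_size} bytes")
print(f".lzma: {lzma_size} bytes")
```

Both outputs are tiny, and the gap between them is pure container overhead, which is exactly what the quoted numbers are measuring.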
That section seems to be specifically talking about overhead from headers, so a trivially compressible file seems like a good point of comparison there.
Overhead for headers is not a thing though?
There is no world in which a compression algorithm’s performance is measured by “overhead from headers”. That’s not a concept that makes sense.
For the exact same reason that when comparing compression algorithms you do not get to exclude headers and metadata.
There is a reason you use corpuses for measuring compression performance.
What you quoted is not a comparison of compression algorithms, all three files use the same algorithm, it’s a critique of the lzma2 container / stream format:
This wording suggests that LZMA2 is some sort of improved LZMA algorithm. (After all, the ‘A’ in LZMA stands for ‘algorithm’). But LZMA2 is a container format that divides LZMA data into chunks in an unsafe way. […] In practice, for compressible data, LZMA2 is just LZMA with 0.015%-3% more overhead.
To compare different codecs, we need to use the Pareto frontier https://richg42.blogspot.com/2015/11/the-lossless-decompression-pareto.html
Some good codecs, like LZNA (though deprecated in the proprietary Oodle) and LZHAM, never gained popularity.
In addition to the many good reasons given in the article, there are other reasons why it’s problematic.
For instance, it’s interesting that xz with -T greater than one makes files that are larger than with -T 1. Apparently there’s waste at the points where the data is joined. For instance, using the NetBSD-11 src, xsrc and current pkgsrc trees:
tar cf - pkgsrc src xsrc | xz -9e -T 12 > netbsd_t12.tar.xz
tar cf - pkgsrc src xsrc | xz -9e > netbsd.tar.xz
gives netbsd_t12.tar.xz at 511295504 bytes and netbsd.tar.xz at 505340992 bytes. That’s 1.178% larger.
Finally, given that xz -9e can take upwards of a gigabyte of memory (-T 18 takes 25 gigs!), one can make a case for it being rather special purpose.
I kinda wish the integrity stuff was separate so it could be used with any compression format
I needed to download a FreeBSD 14.3 image yesterday and I opted for the xz-compressed one since I wanted to be a good netizen and use less of the donated bandwidth, only to discover that macOS Sequoia can’t extract it from within the GUI and throws a very cryptic error message that doesn’t tell me anything useful.
Extracting it with tar on the command line worked as usual, though. Come to think of it, I don’t remember ever seeing an xz archive before in my life. Now I wonder why BSD folks prefer it over other formats.
It’s smaller than gzip or bzip2, which are the other common formats, and it supports multithreading in the compressor and decompressor. Unlike gzip (and like bzip2, I believe), it compresses blocks, not streams. This means that you can compress and decompress blocks independently. This also makes it possible to seek within a compressed xz file without decompressing the whole thing (I used this in a tool to visualise instruction traces from our CHERI prototype CPUs), though the fact that it’s most commonly used to wrap tar archives makes this less useful.
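A rough sketch of that seekable-blocks idea in Python, approximating xz’s independent blocks with one concatenated .xz stream per chunk (real .xz files keep the blocks and an index inside a single stream, so treat this as an illustration of the concept, not the actual on-disk layout; `compress_chunked` and `read_byte` are made-up names):

```python
import lzma

CHUNK = 1 << 16  # 64 KiB of uncompressed data per independent chunk

def compress_chunked(data):
    """Compress each chunk as its own .xz stream; return (blob, offset index)."""
    blob, index = b"", []
    for i in range(0, len(data), CHUNK):
        index.append((len(blob), i))  # (compressed offset, uncompressed offset)
        blob += lzma.compress(data[i:i + CHUNK])
    return blob, index

def read_byte(blob, index, pos):
    """Return data[pos] by decompressing only the chunk that contains it."""
    comp_off, uncomp_off = max(e for e in index if e[1] <= pos)
    dec = lzma.LZMADecompressor()  # stops at the end of the first stream
    chunk = dec.decompress(blob[comp_off:])
    return chunk[pos - uncomp_off]

data = bytes(range(256)) * 4000          # ~1 MB of sample data
blob, index = compress_chunked(data)
assert read_byte(blob, index, 777_777) == data[777_777]
```

The trade-off is the same one xz makes: smaller blocks mean cheaper random access but slightly worse compression, since matches can’t span block boundaries.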
I think it’s likely to be replaced by zstd in the next few years. Last time I did some ad-hoc benchmarking, zstd was, depending on the compression level, either better for CPU usage, better for compression ratio, or both (or, in the worst case, as good).
I believe (at least on the data I care about, that is, text, text-adjacent documents, and executable binaries) xz -9 is still typically better (in both compression ratio and CPU time) than zstd --ultra -22, so there’s one case where LZMA stands its ground.
For everything else, though, zstd is a no-brainer.
Yes. zstd with a high compression level like -22 can be (much) slower than xz -6 or xz -9 on some data without achieving the same compression ratio. .zst does decompress much faster, though.
I hope that there will be open-source implementations of newer codecs like Oodle Leviathan.
Thank you for the information, now it all makes much more sense.
By the way, I was astonished to see zstd being almost a first-class citizen on FreeBSD already, given that it’s almost nowhere to be found on Linux unless you compile it from source. I haven’t tested it extensively, but it really does look like it’s the compression algorithm of the future.
It was added as a ZFS compression scheme a while ago. Originally ZFS provided gzip as a compression algorithm, but at low compression levels it didn’t save much and at high levels it was very slow. They then added lz4, which didn’t compress as well as gzip at high compression levels but was much faster to both compress and decompress. Zstd outperforms lz4 at pretty much all levels, but for a while lz4 was better for some cases because it had an ‘early abort’ mode where it would detect data it couldn’t compress further (e.g. video files that are already compressed) and skip them. The ZFS folks added a similar mode to zstd a year or two ago.
It’s great for filesystem compression because you can typically compress faster and decompress much faster than the disk. My NAS has a Pentium Silver (which appears to be the new brand for Atom) that’s a few years old (home of dead-end technology, I think it’s the only core with both SGX and MPX support). It can decompress zstd at about 2TB/s, so wouldn’t be a bottleneck for reads even with NVMe. Compression is slower but still would not likely be a bottleneck even if I replaced the spinning rust with SSDs.
Since the code has to be in the kernel, and in the minimal ZFS read-only implementation that loader uses, it doesn’t really make sense to not ship it for userland.
I believe Arch has been compressing the initramfs with zstd for a few years, despite still calling it “initcpio”
It is an initcpio. It is a (potentially compressed) cpio archive, with the user’s choice of a compression algorithm (or absence thereof).
On boot, the cpio archive (“initcpio”) is uncompressed and extracted into a tmpfs, at which point it becomes “initramfs”.
At very high compression levels, zstd tends to either be slower than lzma, compress worse, or both.
However zstd has the huge advantage that its decompression speed is more or less constant, and its memory requirements don’t shoot into the stratosphere.
Do you have any links to back up your zstd intuition? I’m still torn on what to use for long-term archive compression. The attack on xz, and what I read in its aftermath about the complexity involved in building it, made me a little wary of using it. But on the other hand, it does compress slightly better than zstd in most cases I checked, and it is preinstalled in a lot more places.
I did some measurements on some data I cared about a few years ago when I was choosing between zstd and xz. No idea how representative the results were.
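For what it’s worth, this kind of ad-hoc measurement is easy to redo on your own data with Python’s standard-library codecs (zstd itself isn’t in the stdlib, so it’s only sketched in a comment via the third-party zstandard package; the sample data here is made up, so substitute a corpus you actually care about):

```python
import bz2
import lzma
import time
import zlib

def bench(name, compress, decompress, data):
    """Time one compress/decompress round trip and report the ratio."""
    t0 = time.perf_counter(); comp = compress(data)
    t1 = time.perf_counter(); out = decompress(comp)
    t2 = time.perf_counter()
    assert out == data  # round trip must be lossless
    print(f"{name:6} ratio={len(data)/len(comp):8.2f} "
          f"comp={t1-t0:.3f}s decomp={t2-t1:.3f}s")
    return len(comp)

# Highly repetitive text-like sample data; results vary a lot by corpus.
data = b"the quick brown fox jumps over the lazy dog\n" * 20000

gz = bench("zlib",  lambda d: zlib.compress(d, 9), zlib.decompress, data)
bz = bench("bzip2", lambda d: bz2.compress(d, 9), bz2.decompress, data)
xz = bench("xz",    lambda d: lzma.compress(d, preset=9), lzma.decompress, data)
# For zstd, the third-party `zstandard` package offers an analogous API:
#   zstandard.ZstdCompressor(level=19).compress(data)
```

No idea how representative any single run is either, which is rather the point: run it on your own data before committing to a format.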
I posted this because I used to work on enterprise backup software. We tried to design all our databases and file formats to last 30 years, because some companies planned for 30-year data retention. I am not sure if we succeeded. I guess we’ll find out in 25 years or so.