How ClickHouse handles strings
12 points by typesanitizer
12 points by typesanitizer
Regarding the overlapping load trick where you handle the length range (n, 2n] by doing a pair of n-byte loads aligned to the beginning and end, respectively. While this is a very useful (and very standard) trick for dealing with slices within untrusted allocations, if you control the allocation you can just ensure that the longest fixed-length load you do stays within the bounds of the allocation, so pad the allocation by the max overhang. E.g. you can do a 16-byte vector load + length-masked comparison to handle all <= 16-byte strings. You have to take care with data races/write exclusivity of physically adjacent but logically separate strings suballocated from the same language-level allocation, but in my experience you can almost always design around that constraint.
As @snej suggests, you can also implement it in assembly (as glibc and others do) and then you just have to worry about page-splitting loads. That's slightly less efficient than what I'm suggesting, and one disadvantage is that it doesn't operate within the language-level abstract machine semantics (e.g. one consequence is that you have to allowlist it with asan, tsan and similar tools, which is IMHO borderline unacceptable for non-libc-like use cases), but it does mean that you don't have to assume anything about the allocations.
The string comparison stuff is cool! But I thought modern C libraries already used these tricks to optimize memcmp and strcmp?
Modern C libraries also use these tricks, but ClickHouse has a lot of their own primitive functions.