Das Problem mit German Strings
22 points by asubiotto
Coming up with string representation optimizations is a lot of fun. When I worked in games, we had our own string class with a short-string optimization. Back then, the std::string implementation lacked this.
One thought I had when reading this and the original article was that it seems slightly wasteful to store a 32-bit length even for short strings. It shouldn’t be too tricky to use 15 bytes of the buffer for the short-string optimization instead of just 12, and reserve one bit to indicate whether the string is short or not. Maybe that’s overkill.
Storing a prefix in the pointer case is clever, though, especially when strings are immutable and storing a capacity doesn’t make sense.
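To make the 15-byte idea concrete, here is a rough Rust sketch. Everything in it is hypothetical (the `Compact15` name, the choice of the last byte as the tag, the 56-bit length for long strings), and it assumes a 64-bit little-endian-agnostic target where a pointer fits in 8 bytes. Note that the inline case still needs a length, so the tag byte carries both the "short" bit and a 4-bit length.

```rust
use std::marker::PhantomData;

const INLINE_CAP: usize = 15;
const SHORT_FLAG: u8 = 0x80; // high bit of the last byte marks an inline string

/// Hypothetical 16-byte immutable string view (not from any real library).
/// Short: bytes 0..15 hold the data, byte 15 holds SHORT_FLAG | length.
/// Long:  bytes 0..8 hold the pointer, bytes 8..15 a 56-bit length, byte 15 is 0.
#[derive(Clone, Copy)]
pub struct Compact15<'a> {
    bytes: [u8; 16],
    _borrow: PhantomData<&'a str>,
}

impl<'a> Compact15<'a> {
    pub fn new(s: &'a str) -> Self {
        let mut bytes = [0u8; 16];
        if s.len() <= INLINE_CAP {
            bytes[..s.len()].copy_from_slice(s.as_bytes());
            bytes[15] = SHORT_FLAG | s.len() as u8;
        } else {
            let ptr = s.as_ptr() as usize as u64;
            let len = s.len() as u64;
            assert!(len < (1 << 56), "length must fit in 56 bits");
            bytes[..8].copy_from_slice(&ptr.to_le_bytes());
            bytes[8..15].copy_from_slice(&len.to_le_bytes()[..7]);
        }
        Compact15 { bytes, _borrow: PhantomData }
    }

    pub fn is_short(&self) -> bool {
        self.bytes[15] & SHORT_FLAG != 0
    }

    pub fn as_str(&self) -> &str {
        if self.is_short() {
            let len = (self.bytes[15] & !SHORT_FLAG) as usize;
            std::str::from_utf8(&self.bytes[..len]).expect("inline bytes were copied from a str")
        } else {
            let mut ptr = [0u8; 8];
            ptr.copy_from_slice(&self.bytes[..8]);
            let mut len = [0u8; 8];
            len[..7].copy_from_slice(&self.bytes[8..15]);
            let ptr = u64::from_le_bytes(ptr) as usize as *const u8;
            let len = u64::from_le_bytes(len) as usize;
            // SAFETY: ptr and len were taken from a `&'a str` that outlives `self`.
            unsafe { std::str::from_utf8_unchecked(std::slice::from_raw_parts(ptr, len)) }
        }
    }
}
```

The trade-off is visible in the long branch: with the tag taking the last byte, there is no room left for a 4-byte prefix next to the pointer unless the length shrinks further, which is presumably part of why the German string layout keeps a plain 32-bit length instead.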
Someone’s just re-implemented Pascal strings, haven’t they?
(Yes, I know Pascal only had a single byte for length, but the concept’s the same until the string gets big)
Pascal strings were a pointer to memory containing a length followed by the contents of the string. Modern languages put the length (and maybe capacity) next to the pointer to save an indirection for common operations, and to make it possible to slice substrings without duplicating them. A length and pointer is loads of space (16 bytes) compared to many strings, so it’s common to have a small string optimization that can put the string data in the metadata block when it fits. German strings are a variant of the small string optimization that is adjusted for columnar data processing: keeping a prefix of the string in the metadata block is handy for shortcutting comparisons.
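For anyone who wants to see the prefix shortcut spelled out, here is a minimal Rust sketch of the layout as described in this thread: a 32-bit length, then either 12 inline bytes or a 4-byte prefix plus a pointer. The names (`GermanStr`, `Tail`, `compare`) are made up for illustration, the 16-byte size assumes a 64-bit target, and real implementations may differ in details.

```rust
use std::marker::PhantomData;

/// Second half of the 16-byte block (hypothetical name).
#[repr(C)]
#[derive(Clone, Copy)]
union Tail {
    inline: [u8; 8], // bytes 4..12 of a short string, zero-padded
    ptr: *const u8,  // start of the full string when len > 12
}

/// Hypothetical immutable "German string" view over borrowed bytes.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct GermanStr<'a> {
    len: u32,
    prefix: [u8; 4], // first 4 bytes, zero-padded
    tail: Tail,
    _borrow: PhantomData<&'a [u8]>,
}

impl<'a> GermanStr<'a> {
    pub fn new(s: &'a [u8]) -> Self {
        let len = u32::try_from(s.len()).expect("length must fit in 32 bits");
        let mut prefix = [0u8; 4];
        let n = s.len().min(4);
        prefix[..n].copy_from_slice(&s[..n]);
        let tail = if s.len() <= 12 {
            let mut inline = [0u8; 8];
            if s.len() > 4 {
                inline[..s.len() - 4].copy_from_slice(&s[4..]);
            }
            Tail { inline }
        } else {
            Tail { ptr: s.as_ptr() }
        };
        GermanStr { len, prefix, tail, _borrow: PhantomData }
    }

    /// Bytes after the 4-byte prefix (empty for strings of 4 bytes or fewer).
    fn suffix(&self) -> &[u8] {
        let len = self.len as usize;
        if len <= 12 {
            // SAFETY: strings of at most 12 bytes always initialise `inline`.
            let inline = unsafe { &self.tail.inline };
            &inline[..len.saturating_sub(4)]
        } else {
            // SAFETY: ptr and len came from a `&'a [u8]` that outlives `self`.
            unsafe { std::slice::from_raw_parts(self.tail.ptr.add(4), len - 4) }
        }
    }

    /// Lexicographic byte comparison. Most calls are decided by the prefix
    /// alone, without following the pointer.
    pub fn compare(&self, other: &Self) -> std::cmp::Ordering {
        self.prefix
            .cmp(&other.prefix)
            .then_with(|| self.suffix().cmp(other.suffix()))
            // Tie-break for strings that differ only in trailing zero bytes
            // hidden by the zero-padded prefix.
            .then_with(|| self.len.cmp(&other.len))
    }
}
```

In a columnar scan, `compare` only dereferences `ptr` when two prefixes are identical, so mismatches are resolved entirely from the 16-byte blocks that are already in cache.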
A length and pointer is loads of space (16 bytes) compared to many strings, so it’s common to have a small string optimization that can put the string data in the metadata block when it fits.
Even more so because owned strings are commonly 24 bytes (or more), since they also store a capacity to amortise reallocations on concatenation.
Indeed, but (thinking out loud) it occurs to me that if you have a <24 byte small string optimization for fat mutable strings, and a <16 byte small string optimization for slices, there’s an awkward quandary for slices of strings that are at least 16 but under 24 bytes long: the slice is too big to be stored inline, so it needs to point somewhere, but the string’s bytes live inside the string object itself and might not have a stable address … dunno how existing implementations deal with this.
I don’t think SSO makes sense for a borrowed slice (a string_view or an &str). Sure, there’s a gain from locality, but I’m not sure that compensates for the increased complexity, and you’re not avoiding an allocation.
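The unstable-address worry can be shown with a toy example. This is only a sketch: `InlineString` and `addresses_across_move` are hypothetical names, and the 23-byte inline capacity is just chosen to match the "<24 byte" figure above.

```rust
/// Hypothetical owned string that keeps up to 23 bytes inside the object itself.
pub struct InlineString {
    len: u8,
    buf: [u8; 23],
}

impl InlineString {
    /// Returns None when the input does not fit inline; a real type would
    /// fall back to a heap allocation here.
    pub fn new(s: &str) -> Option<Self> {
        if s.len() > 23 {
            return None;
        }
        let mut buf = [0u8; 23];
        buf[..s.len()].copy_from_slice(s.as_bytes());
        Some(InlineString { len: s.len() as u8, buf })
    }

    pub fn as_str(&self) -> &str {
        std::str::from_utf8(&self.buf[..self.len as usize]).expect("copied from a str")
    }

    /// Address of the first byte of string data.
    pub fn data_addr(&self) -> usize {
        self.buf.as_ptr() as usize
    }
}

/// Records where the string data lives before and after the value is moved
/// from the stack into a heap-allocated Box.
pub fn addresses_across_move(s: InlineString) -> (usize, usize) {
    let before = s.data_addr();
    let boxed = Box::new(s); // the move relocates the inline bytes
    let after = boxed.data_addr();
    (before, after)
}
```

For a 17-byte string the two addresses differ, so a slice that stored the first one would dangle after the move. A heap-backed `String` does not have this problem: moving the `String` moves only its pointer, length, and capacity, and `as_ptr()` stays the same.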
And if you have an &str to a string, SSO or not, Rust would not let you move that string. I assume in C++ it would be UB.
I checked what Typst does, with the EcoString type from its ecow crate, and it doesn’t have a complex slice type whatsoever: it only has a method to give a standard Rust &str instead of trying to small-string-optimize the slice type. Instead of using views, Typst seems to prefer passing around owned EcoString values and generating new ones as needed. Additionally, EcoString is atomically ref-counted and copy-on-write.
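As a rough illustration of what "atomically ref-counted and copy-on-write" buys you, here is a std-only sketch built on `Arc<String>`. It is not how ecow implements EcoString (the `CowString` type and its methods are invented for this example, and there is no inline storage here); it only shows the sharing behaviour.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a ref-counted, copy-on-write string.
/// Cloning bumps an atomic counter instead of copying the bytes.
#[derive(Clone)]
pub struct CowString(Arc<String>);

impl CowString {
    pub fn new(s: &str) -> Self {
        CowString(Arc::new(s.to_owned()))
    }

    /// Borrowing is just a plain &str; there is no special slice type.
    pub fn as_str(&self) -> &str {
        self.0.as_str()
    }

    /// Appends in place if this handle is the only owner; otherwise
    /// `Arc::make_mut` clones the underlying String first.
    pub fn push_str(&mut self, s: &str) {
        Arc::make_mut(&mut self.0).push_str(s);
    }

    /// True when both handles point at the same allocation.
    pub fn shares_buffer_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}
```

With cheap clones like this, handing out owned values is not much more expensive than handing out views, which fits the preference for owned values described above.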