All of the String types
21 points by juliethefoxcoon
I love this article, whoever wrote this is so cool.
So cool. Could add some detail for Python, though: bytes, raw strings, formatted strings, and string templates are all among the things, never mind collections of the same.
Still, so cool.
Raw strings and f-strings are just different syntaxes for creating the exact same strings; they're not different string types. That would be like saying a + b is a string type.
And no, I don’t think list[str] or StringBuilder are strings. At that point you’re well on your way to claim that an integer is a string.
Under the hood I think they're both LiteralString now… but I'd argue it's the fact that they all have different semantics that matters… they're a bit more than just syntactic sugar. And no, I agree a list[str] isn't a string type, but I'd say a bytearray is, while a memoryview could be.
F-strings are definitely not literal strings.
And raw strings have the exact same semantics as "normal" strings, they just have a slightly different tokenization. They're so much the same thing that equivalent raw and non-raw string literals have the same id.
This is nitpicky but I've never seen a reason to use Vec<char>. That's basically UTF-32.
And it ignores Vec<u8> and &[u8], which are really common string representations (the byte string version of String and &str respectively). Rust sadly doesn't offer great string processing affordances for these byte string types, but I have lots of code which uses it. It's a great alternative any time the "every string must be valid UTF-8" constraint is inappropriate (which happens a lot when working with external data).
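A quick sketch (plain std, no external crates) of what the byte-string situation looks like in practice, when external data isn't valid UTF-8:

```rust
// &[u8] as a byte string: external data that need not be valid UTF-8.
fn main() {
    let raw: &[u8] = b"caf\xe9"; // Latin-1 encoded "café"; 0xE9 is not valid UTF-8
    // Strict decoding fails, so &str/String cannot hold this data as-is.
    assert!(std::str::from_utf8(raw).is_err());
    // You can still process it byte-wise, or decode lossily (U+FFFD substitution).
    assert_eq!(String::from_utf8_lossy(raw), "caf\u{FFFD}");
    println!("byte length: {}", raw.len());
}
```

Crates like bstr exist precisely to add the string-processing affordances std doesn't give `&[u8]`.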
Ditto, I’ve used Rc<str> more than I’ve used Vec<char>.
And I don’t think it’s nitpicky. Claiming that Rust has only 3 string types, then claiming that Vec<char> is one of them, is odd. Does the stdlib even provide string operations on that out of the box?
I'll go you one better - it quite literally is UTF-32 :)
char is guaranteed to have the same size, alignment, and function call ABI as u32 on all platforms.
https://doc.rust-lang.org/std/primitive.char.html#validity-and-layout
You can also cast char to u32 directly with the as operator, so I believe transmuting Vec<char> to Vec<u32> is safe. It is not safe however to transmute from u32 to char as not all u32s are valid chars. char::from_u32 returns an Option<char> and char::from_u32_unchecked is unsafe.
so I believe transmuting Vec<char> to Vec<u32> is safe
Unfortunately, no. Even if you can transmute A to B, there is no guarantee that you can transmute Foo<A> to Foo<B> in general.
Yep. You need to into_raw_parts → cast/transmute the pointer → from_raw_parts. The exact same instructions are generated: https://play.rust-lang.org/?mode=release&gist=a2cd08edd55fa811f2b1936ad2310cfb
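A stable-Rust sketch of that dance (Vec::into_raw_parts is still unstable, so this uses ManuallyDrop to get the same three raw parts):

```rust
use std::mem::ManuallyDrop;

// Reuse a Vec<char>'s buffer as a Vec<u32> without copying.
fn chars_to_u32s(v: Vec<char>) -> Vec<u32> {
    let mut v = ManuallyDrop::new(v); // don't drop the original buffer
    let (ptr, len, cap) = (v.as_mut_ptr(), v.len(), v.capacity());
    // SAFETY: char and u32 have identical size, alignment, and ABI,
    // and every char is a valid u32 (the reverse does not hold).
    unsafe { Vec::from_raw_parts(ptr.cast::<u32>(), len, cap) }
}

fn main() {
    let v: Vec<char> = "héllo".chars().collect();
    assert_eq!(chars_to_u32s(v), vec![0x68, 0xE9, 0x6C, 0x6C, 0x6F]);
}
```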
It is occasionally useful. It can be a convenient representation in, say, a text editor where character-wise editing is pretty common, although a more correct implementation would operate on graphemes instead of codepoints.
For Zig:
[]const u8 is an immutable, non-null-terminated, non-growable array of bytes.
This isn't true. It's a slice of bytes: a length + a pointer. This distinction is important as
fn foo(s: []const u8) void { ... }
const slice: []const u8 = try allocator.alloc(u8, 1024 * 1024);
foo(slice);
Won't copy 1 MiB of data. Whereas the following might:
fn foo(a: [1024 * 1024]u8) void { ... }
const array: [1024 * 1024]u8 = @splat(42);
foo(array);
Arrays are values in Zig. They can be coerced to slices, but they are distinct, unlike in Go, which conflates them, causing a lot of headaches for newcomers.
Nice article! I might use this as a reference if I end up using some of the latter languages. One minor comment:
&mut str is a non-growable array of characters in which each character can be mutated
str is UTF-8 encoded, so mutating a character might require a reallocation if a character's codepoint size changes. It might be better to say that the individual bytes are available for mutation, provided UTF-8 validity over the length of the slice is retained. You can't, however, just start arbitrarily mutating characters.
If one goes through the methods on str and search for &mut self, one sees that - aside from methods used to index and reinterpret as bytes - pretty much the only useful functionality provided by &mut str is make_ascii_uppercase/make_ascii_lowercase, which work only because ASCII characters always have the same codepoint size, avoiding the need to mess with any codepoint offsets.
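A minimal illustration of why those ASCII-only methods are safe through &mut str (same-width byte replacement, so UTF-8 validity is preserved):

```rust
fn main() {
    let mut s = String::from("héllo");
    let slice: &mut str = s.as_mut_str();
    // In-place mutation: an ASCII case change never alters a byte's
    // length, so the slice stays valid UTF-8. Non-ASCII 'é' is untouched.
    slice.make_ascii_uppercase();
    assert_eq!(s, "HéLLO");
}
```

Changing 'é' to 'É' in place would happen to work (both are two bytes), but the general operation can change codepoint width, which is why std doesn't offer it on &mut str.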
C#
The whole C# section seems really confused to me.
string is immutable, non-growable list of bytes. It is not unicode aware.
This is... confusing! string is an alias for the System.String type. It is a sequence of UTF-16 code units, not bytes. (That is, when you index a string you get a UTF-16 code unit back.) "not unicode aware" is so vague as to be meaningless, but System.String is natively UTF-16, and .NET as a whole has generally excellent Unicode support. System.String is neither a list nor an array.
cstring is a null terminated version of a string.
There's no cstring in C#, AFAIK!
char [] and char * are mutable non growable list of chars. chars are 2 bytes making them utf-16 code points.
char[] is a character array, not a List. Arrays in .NET have a fixed length (Array.Resize allocates a new array and copies). char* is a pointer to a character, and is not usable in safe C#.
Finally we have
StringBuilder which is [...] a mutable, growable, non-null-terminated array of bytes
As with System.String, it would be more accurate to describe this as a mutable, growable "list" of UTF-16 code units. Although it is not a List.
System.String is natively UTF-16
Minor clarification: They aren't always UTF-16, because as you say they're sequences of (arbitrary) UTF-16 code units.
Or as I like to say, UTF-16 doesn’t exist: every system that uses 16-bit code units is actually WTF-16 because they all allow unpaired surrogates.
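Rust's strict UTF-8 strings make the unpaired-surrogate point easy to demonstrate: a code-unit sequence that is perfectly representable in a .NET/Java/JS string is rejected by strict UTF-16 decoding.

```rust
fn main() {
    // 'a', a lone (unpaired) high surrogate, 'b': representable in
    // .NET/Java/JS strings, but not valid UTF-16.
    let units: [u16; 3] = [0x0061, 0xD800, 0x0062];
    assert!(String::from_utf16(&units).is_err());
    // Lossy decoding replaces the unpaired surrogate with U+FFFD.
    assert_eq!(String::from_utf16_lossy(&units), "a\u{FFFD}b");
}
```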
I know I'm in the minority around here, but I prefer simple, high level programming languages in which there is no string type (or exactly one string type from a different perspective).
There should be a single type representing an ordered sequence of values. Call this an "array" or a "list" according to your preference. And there should be a single type representing a character value. It is fine to have a string literal syntax like "foo", but this is just an array literal where the array happens to contain characters.
There should be no magical String type where the String type is disjoint from the array or list type, especially if it has an incompatible API. Of course there are string operations, but they are mostly just general array operations. The rest are free functions residing in libraries; they are not "in" the String type.
I got my first taste of this kind of programming simplicity from the array languages APL and K.
I feel that one reason simple languages are not more popular is that most programming languages are designed by programmers, for programmers. APL and K, by contrast, are industrial programming languages used in the finance industry by people who are domain experts in finance first and programmers second. Domain experts are more likely to appreciate a simple programming language.
Programmers prefer complex languages. The simple languages I prefer deny you the feeling of mastery that comes from understanding the 14 different string types, and the correct situations for using each one.
That simple string type is probably a list (of some sort) of Unicode graphemes, which could be exceedingly large. So it's probably not a great idea to make a grapheme[] where each grapheme could practically be up to five code points or so. You'd need a special case for "list of graphemes" that used some encoding (e.g., UTF-8) internally, and that representation loses many of the characteristics of an array (e.g., O(1) read/write) that a programmer, if perhaps not a domain expert, expects. At that point it might as well be a String type that is made "iterable" for most needs.
That's one possible design. The concept you are referring to is called "grapheme cluster" in the Unicode standard, not "grapheme". https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
Unicode has no stable definition of a grapheme cluster. It can be locale or application specific. It may even vary based on context for the same user. There is a generic specification for Extended Grapheme Cluster which isn't locale dependent, but even this definition is not stable, as it changes from year to year.
So you might not want to hard wire a particular definition of grapheme cluster into your language core.
What is stable is the concept of a code point. The design I prefer, for a simple language, is to equate characters with code points. If you are a domain expert on Unicode text, and you are writing something like a text renderer, then you will be using a unicode library full of complicated concepts that are of no interest to regular users. This library may be versioned to a specific Unicode release, and upgrading your code to a new Unicode release may require code changes. And you will be working with arrays of code points anyway. For most people, the most common use for strings is to encode natural language messages, which are constructed by concatenating smaller strings. You generally do not need to care about the distinction between a code point and a grapheme cluster for most programming tasks.
This is a small companion for strings in Rust: https://steveklabnik.com/writing/when-should-i-use-string-vs-str/