TIL: serde's borrowing can be treacherous
79 points by yossarian
Cow<T> is one of a few Rust stdlib types where I found I had to learn to just ignore the name because it kept confusing me. It’s commonly used as a return type for a value that may be either borrowed or freshly allocated; it has nothing to do with whether there will be, or ever were, any writes to it. (The idea itself is quite elegant, and this is probably just me, and sadly I don’t have a better suggestion…)
I totally agree. I always think of "Cow" as something like "BorrowedOrOwned", because it's pretty useful for that specific use case, and it's not like there's any operation (a DerefMut impl, say) that automatically clones the underlying T if it's Borrowed; the clone-on-write only happens when you explicitly call to_mut().
Also confusingly, the case where I most frequently run into Cow is Cow<'static, str>, where the "borrowed" case is also long-lived, just not heap allocated.
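For anyone new to the type, here's a small illustration (function name made up) of why "BorrowedOrOwned" is a good mental model, and where the clone-on-write part actually lives:

```rust
use std::borrow::Cow;

// A function that usually hands back a borrowed &'static str, but
// sometimes has to allocate; Cow lets both cases share one return type.
fn greeting(shout: bool) -> Cow<'static, str> {
    if shout {
        Cow::Owned("hello".to_uppercase())
    } else {
        Cow::Borrowed("hello")
    }
}

fn main() {
    let mut g = greeting(false);
    assert!(matches!(g, Cow::Borrowed(_)));

    // There's no DerefMut; the clone-on-write step is explicit via to_mut(),
    // which copies the borrowed data into an owned String the first time.
    g.to_mut().push_str(", world");
    assert!(matches!(g, Cow::Owned(_)));
    assert_eq!(g, "hello, world");
}
```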
Previously, on Lobsters, I shared a proposal by ais523 to use some concepts from linear logic. For those return types, they might propose replacing cows with pirates.
Yep. In a language with so much attention put into naming things right and intuitively (see "borrow", "move", "drop", etc.), "Cow" seems like the exception that proves the rule. I also got confused by it and thought it meant something different the first time(s) I stumbled upon it.
TIL serde can even do that. A good nuanced case of "library doing the right thing that is otherwise unintuitive at first". Good writeup!
It seems to me that Serde should always refuse to deserialize a JSON string into an &str, rather than allow the creation of a program which accepts some JSON inputs, but rejects other JSON inputs which should be semantically equivalent.
I guess that if you know some field will never need escaping (maybe it’s a string with some fixed format you know), and the source of your JSON will never do weird things, you could deserialize it to an &str to save a little bit of overhead. But that seems like a niche optimization you should opt into if you really need it, rather than something you can accidentally do if you’re not careful.
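For concreteness, here's roughly the failure mode being discussed, assuming serde_json and a made-up struct:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct Doc<'a> {
    // Borrows straight from the input buffer; serde_json can only do this
    // when the JSON string needs no un-escaping.
    title: &'a str,
}

fn main() {
    // No escapes: the &str can point directly into the input, so this works.
    let ok: Doc = serde_json::from_str(r#"{"title": "plain"}"#).unwrap();
    assert_eq!(ok.title, "plain");

    // With an escape (\n), the un-escaped string has to live somewhere,
    // so deserializing into a bare &str fails at runtime.
    let err = serde_json::from_str::<Doc>(r#"{"title": "line\nbreak"}"#);
    assert!(err.is_err());
}
```

The program compiles cleanly either way; which inputs it accepts then depends entirely on whether they happen to contain escape sequences.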
That restriction would hinder serde Deserializer implementations for formats which do allow you to always safely zero-copy reference utf-8 strings, though. (Generally these are binary formats like Apache Arrow, or zeugma-hamper's libPlasma).
I've not personally thought it through, but it's possible a different design of serde's trait system would allow Deserializer implementations to declare statically whether or not they support zero-copy string references, avoiding this runtime issue.
That’s an annoying API design problem, but I think I’d still rather serde_json always panic on &str rather than its current behavior. At least you catch the bug the first time you try to run the program.
First, that's a downstream choice. You can .unwrap() if you want it.
Second, it's probably been a decade by now, but I still don't trust anything by the author of goblin after I managed to get goblin to panic with one of my "Hello, World!" test .exes from Open Watcom C/C++. (I can't remember if it was the Win16 or the OS/2 EXE that did it and, yes, I did open a bug report and contribute my EXEs for regression testing purposes... doesn't change that I don't trust their judgment.)
As I see it, if you put a panic! somewhere that external data could bring down the entire binary if the user wasn't treating their dependencies as pseudo-hostile and wrapping std::panic::catch_unwind around each unit of work, then you have bad judgement and nobody should trust your --lib crates.
Heck, while I don't agree with that design decision, if you want more than an empty Error 500 response (i.e. blank white page) out of actix-web's catch_unwind handler, you have to put it behind a reverse proxy.
(As a UI/UX guy, for my miniserve-esque projects, I've been waffling between whether I should do the more correct but more work-intensive thing and maintain a following fork of actix-web for my projects or whether I should do the more convenient but more cursed thing and imitate the iterator adapter I wrote to work around pulldown-cmark/pulldown-cmark#890 by building some kind of reverse proxy into the same statically linked binary... it'll probably depend on whether I can get something like listenfd working intra-process so I don't need to expose the underlying actix-web listener as a localhost-bound TCP socket.)
To be clear, I’m not saying it should panic depending on the input - I agree that’s terrible. I’m saying it should panic, unconditionally, when you try to deserialize a JSON string into an &str, whether or not it can get away with it for that particular string, because we’re asking for an operation for which there is no generally correct implementation. It’d be best if it was a compilation error, but failing that, the best we can do is crash the program on any attempt to test it.
Panicking unconditionally is not suited to all the purposes Serde is used for.
Rust occupies niches where that kind of optimization actually has value, so banning it outright would be equivalent to arguing that unsafe shouldn't exist because people use it without fully understanding it. And even if it didn't, if the failure is unconditional, it should be a compile-time error encoded in the type system rather than a panic.
The problem is either one of documentation, or of not yet having found some not-too-onerous way to add a safety catch akin to "You can't use &str, but you can use the DeserializationMayFail<&str> newtype." (Bearing in mind that, as others have pointed out, the same Serde schema can be used across multiple formats, some of which can zero-copy deserialize to an &str without needing to handle un-escaping.)
(Sorry to anyone who saw the earlier version of this post. I was up far too late last night.)
You can't use &str, but you can use the DeserializationMayFail<&str> newtype.
Yeah, I’d be entirely happy with that sort of approach! I just think this behavior is far too broken to allow by default with no explicit opt-in.
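Nothing like this exists in serde today; purely as a sketch of what that opt-in marker might look like, the newtype could just delegate to the existing borrowed-&str impl, so the name is doing all the work:

```rust
use serde::{Deserialize, Deserializer};

// Hypothetical opt-in wrapper: deserializes exactly like &str, but the
// type name spells out that this can fail on semantically valid input.
struct DeserializationMayFail<T>(T);

impl<'de: 'a, 'a> Deserialize<'de> for DeserializationMayFail<&'a str> {
    fn deserialize<D: Deserializer<'de>>(d: D) -> Result<Self, D::Error> {
        <&str>::deserialize(d).map(DeserializationMayFail)
    }
}
```

Actually rejecting a bare &str would need a change inside serde itself (or a lint); a wrapper like this only makes the contract visible at the use site.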
Rust’s type system is sufficiently capable to allow the design of an API where it could statically determine, based on the specific decoder impl, whether deserializing into such a type would be legal or not, but it would obviously come with complications and possibly UX issues.
I agree. I feel like I keep seeing these stories from Rust, where after scratching the surface the shine seems to come off. Zero-copy deserialisation is cool, and it makes sense why you can't do it in every case, but this behaviour is really surprising to me given that the Rust community pushes the "safety" message so loudly. I think it's really worrying that it's so easy to use a flagship library to write a cleanly compiling program and have it fail on JSON that matches the shape the deserialiser "should" accept. Expecting people to "just know" to sprinkle Cows around is a big footgun.
It's definitely not ideal, but I'm not sure what a better API would be here, given that you can still use &str in cases where you know that you'll never run into this.
This is a non-obvious limitation, but it's not a safety issue. It's as careful and correct as possible for zero-copy JSON deserialization. If parsing succeeds, you always get valid data that has been parsed correctly, including escapes. When it can't deserialize to a temporary string slice, it correctly reports it as a failure (instead of trying to be clever and mutate immutable data or return slices belonging to some hidden buffer, which could be unsafe).
The footgun here is from opting into a specific performance optimization by using the &str type. The string slice is generally a weirdly restrictive type, but I have an impression that most Rust devs mistake it for merely a "faster string" rather than a very special case. Every Rust noob tries writing Person { name: &str }, and has their day ruined by &str being incapable of storing string data, even before any JSON is involved.
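For reference, the usual "sprinkle Cows around" fix looks like this, again assuming serde_json and made-up names; it borrows when it can and allocates only when un-escaping forces it to:

```rust
use serde::Deserialize;
use std::borrow::Cow;

#[derive(Deserialize)]
struct Doc<'a> {
    // Unlike &str, Cow is not borrowed by default; without #[serde(borrow)]
    // it would always come back as Cow::Owned.
    #[serde(borrow)]
    title: Cow<'a, str>,
}

fn main() {
    // No escapes: zero-copy, borrowed from the input.
    let a: Doc = serde_json::from_str(r#"{"title": "plain"}"#).unwrap();
    assert!(matches!(a.title, Cow::Borrowed(_)));

    // Escapes present: serde_json allocates and un-escapes instead of failing.
    let b: Doc = serde_json::from_str(r#"{"title": "line\nbreak"}"#).unwrap();
    assert!(matches!(b.title, Cow::Owned(_)));
    assert_eq!(b.title, "line\nbreak");
}
```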
Serde's documentation has that property about it. The learning curve is a bit of a learning cliff once you get outside the basic cases and could really benefit from someone (who understands Serde better than I do) volunteering to improve it.
I ran into this issue as well when trying to deserialize Windows paths (with backslashes) into &Path, and wrote about it a few months ago. It’s a tricky issue that is somewhat difficult to diagnose using the error message!
I ran into a similar situation when I had a program error out mid-job because Serde will just happily attempt to serialize Path/PathBuf into a JSON string and fail for POSIX paths that aren't valid UTF-8.
I wound up writing an encoding scheme which takes advantage of NUL being the one character allowed in spec-compliant JSON but not POSIX paths to use it as an escape character so that \0\xf means "\xf, the byte. Not \xf, the codepoint." ...that way, any Path/PathBuf that would have passed through before has its JSON representation remain unchanged.
(And yes, I tested. PHP is the only non-C language I could find that mishandles JSON strings containing null bytes.)
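I don't have the commenter's actual code; as a rough sketch, the escaping half of such a scheme might look something like this (Unix-only because of OsStrExt, and the exact rules here are my guess):

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;

// Sketch: pass valid UTF-8 through unchanged, and prefix each byte that
// breaks UTF-8 validity with NUL, carrying the byte's value as a codepoint
// in U+0080..U+00FF. NUL itself can't appear in a POSIX path, so it's free
// to act as the escape marker, and any already-valid path is a no-op.
fn encode_path(path: &OsStr) -> String {
    let mut out = String::new();
    let mut rest = path.as_bytes();
    loop {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, bad) = rest.split_at(err.valid_up_to());
                out.push_str(std::str::from_utf8(valid).unwrap());
                out.push('\0');
                // NUL, then the codepoint carrying the raw byte's value.
                out.push(bad[0] as char);
                rest = &bad[1..];
            }
        }
    }
}
```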
PHP is the only non-C language I could find that mishandles JSON strings containing null bytes.
I’m pretty sure R will have trouble. I believe it cannot handle null bytes in strings.
Also JSON tools that use C strings under the hood cannot. E.g. jo
Fair. I don't use R and couldn't think of a reason I might want to decode a dump of arbitrary paths in it.
IIRC, the languages I tested were C# (Mono), Java (OpenJDK), JavaScript (Node.js, Firefox, Chrome), Lua, Perl, Python, Ruby, Rust, and maybe some others I'm forgetting at the moment.
I've also tested null bytes in JSON strings in a bunch of languages: Go, Raku, and Swift all worked. R didn't. Bash has trouble with null bytes because it uses C strings for (at least) command-line arguments (ARGV).
Huh. So PHP's fixed that bug since I tested it and now decodes \u0000 correctly instead of producing truncated output.
(I just re-did the json_decode test I used before and var_export now produces 'Hello World ' . "\0" . '' for your test string.)
...Go might have been another one I tested. It feels like one that was low-effort accessible enough (compiler in Ubuntu's APT repos) and "various similarities to C# and Java" enough for me to have tested and forgotten about.
Go has a weird default where it converts byte slices to base64 in JSON. It’s weird but it actually makes a lot of sense because it’s trivial to decode base64 in any popular language and it lets you transport any bytes you need to transport. I suppose you could also use uuencode or something. I don’t really see the reason to invent a new encoding unless you really need the performance, but in that case you probably shouldn’t use JSON at all.
I don't want to pessimize the 99.9...% case just to keep it from blowing up or silently omitting data in the 0.0...1% case. (Similar to how I had to manually serialize/deserialize mtime values to keep serde from failing on encountering "mtime is before std::time::SystemTime::UNIX_EPOCH" with mildly corrupted metadata like mtime = -1 taken off old DOS floppy disks.)
One of the explicit design goals of the encoding was that the transformation is a no-op/identity function for any path that doesn't contain non-UTF8-able bytes, so it's 100% capable of deserializing paths serialized using serde's normal Path/PathBuf serialization. (Before I noticed that JSON was spec'd to round-trip NULs in strings, I was planning to do the same thing but using a codepoint in the Private Use Area.)
(This particular project is basically a specialized knock-off of locate/updatedb where I want some of my other Rust projects and maybe a Python project or two to be able to read the database.)
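Not the parent's actual code, but the mtime workaround can be sketched with a #[serde(with = ...)] helper that stores a signed offset from the epoch (ignoring sub-second precision), so pre-1970 timestamps round-trip instead of erroring:

```rust
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical helper module: serde's built-in SystemTime handling rejects
// times before UNIX_EPOCH, so store seconds as a signed integer instead.
mod unix_seconds {
    use super::*;

    pub fn serialize<S: Serializer>(t: &SystemTime, s: S) -> Result<S::Ok, S::Error> {
        let secs: i64 = match t.duration_since(UNIX_EPOCH) {
            Ok(d) => d.as_secs() as i64,
            Err(e) => -(e.duration().as_secs() as i64), // e.g. mtime = -1
        };
        secs.serialize(s)
    }

    pub fn deserialize<'de, D: Deserializer<'de>>(d: D) -> Result<SystemTime, D::Error> {
        let secs = i64::deserialize(d)?;
        Ok(if secs >= 0 {
            UNIX_EPOCH + Duration::from_secs(secs as u64)
        } else {
            UNIX_EPOCH - Duration::from_secs(secs.unsigned_abs())
        })
    }
}

#[derive(Serialize, Deserialize)]
struct Entry {
    path: String,
    #[serde(with = "unix_seconds")]
    mtime: SystemTime,
}
```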
I bet you'd see similar issues with Windows paths as well, since their UTF-16 with potentially unpaired surrogates can't be encoded as UTF-8. Honestly, paths should probably ideally be treated as arbitrary bytes (using base64 or the like) rather than strings. But your solution sounds nicer for human consumption.
I wish JSON had a byte string type.
But your solution sounds nicer for human consumption.
*nod* I chose this approach because I'm sort of "a UI/UX guy who considers things like uncaught exceptions or poor performance on an Athlon II X2 from 2011 or human-unreadable paths in config files to be user interface characteristics".
I wish JSON had a byte string type.
Likewise for TOML. I have another project where the configuration file for its specialized equivalent to libmagic's configuration database currently has lines like header = [137, 80, 78, 71, 13, 10, 26, 10] because that's how serde serializes/deserializes bytestrings and I need to add a transformation step so it can instead have lines more like what you get when you print(bytes(header)) in Python. (b'\x89PNG\r\n\x1a\n')
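A serialize_with helper along these lines (hypothetical; the matching deserializer isn't shown, and TOML's own backslash escaping inside basic strings is a separate wrinkle) could emit the Python-style form:

```rust
use serde::Serializer;

// Sketch: render a magic-number header as a Python-bytes-literal-style
// string, e.g. b'\x89PNG\r\n\x1a\n', instead of an array of integers.
// Use with #[serde(serialize_with = "bytes_as_literal")] on a Vec<u8> field.
fn bytes_as_literal<S: Serializer>(bytes: &[u8], s: S) -> Result<S::Ok, S::Error> {
    let mut out = String::from("b'");
    for &b in bytes {
        match b {
            b'\\' | b'\'' => {
                out.push('\\');
                out.push(b as char);
            }
            b'\r' => out.push_str("\\r"),
            b'\n' => out.push_str("\\n"),
            b'\t' => out.push_str("\\t"),
            0x20..=0x7e => out.push(b as char),
            _ => out.push_str(&format!("\\x{:02x}", b)),
        }
    }
    out.push('\'');
    s.serialize_str(&out)
}
```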
I learned this real quick when interacting with some JSON produced by PHP, because it likes to escape forward slashes by default. (For example: "http:\/\/...").