edn.c: A fast, zero-copy EDN (Extensible Data Notation) reader written in C11 with SIMD acceleration
52 points by delaguardo
This looks great. I wish EDN would get a proper specification and conforming parsers/generators for Clojure on the JVM; it's very implementation-specific right now.
I had a similar feeling, but reading https://github.com/edn-format/edn ... while it's fairly casual I have a hard time reading this and thinking "this is underspecified" (the same cannot be said of YAML! At least the EDN spec says "this is UTF-8")
But I haven't dug into this too much beyond thinking "edn might be nice to use" for some project and then just kinda forgetting about it. Just bit surprised cuz I feel like I looked this up recently and was unsatisfied and now I feel like it's fine. Maybe it was me looking at the YAML "spec" which tempered my expectations
The spec isn't too bad (missing small things like "how to write infinity"), but the built-in library clojure.edn has a number of weird behaviors that are copied directly from Clojure's reader/printer (for example, EdnReader.java is more liberal with numbers).
I wonder how much of this kind of explicit SIMD intrinsic code is actually necessary given that you can just write SIMD-amenable regular code and the GCC/LLVM vectorizer will try to emit optimal vector code for the target platform?
I was actually learning as I went :) From what I know so far, you can definitely write regular code and the compiler will optimize it, but only if you have predictably sized data buffers. For a parser like this that's not always the case, so I decided to play it safe. But I wouldn't say I'm 100% sure about this; if you know a better way, please let me know.
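To make the tradeoff concrete, here is roughly what the hand-written intrinsics style looks like: a 16-byte-at-a-time scan for a closing quote, assuming an x86-64 target with SSE2. The function name find_quote and its structure are my own sketch, not code from the library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Scan for the first '"' byte in 16-byte SSE2 blocks; the remaining
   tail falls back to a scalar loop. Returns len if no quote is found. */
static size_t find_quote(const char *s, size_t len) {
    const __m128i quote = _mm_set1_epi8('"');
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        /* One bit per byte: set where the byte equals '"'. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, quote));
        if (mask)
            return i + (size_t)__builtin_ctz(mask);
    }
    for (; i < len; i++)
        if (s[i] == '"')
            return i;
    return len;
}
```

The scalar tail loop is exactly the "unpredictable buffer size" problem mentioned above: the vector path only covers full 16-byte blocks, and you handle the remainder by hand.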
The annoying thing about trusting the compiler to optimize is that you can't trust the compiler to optimize, because there's no actual specification or promise for when it optimizes.
You can add automated tests that ensure the optimization you expect is actually happening, lest a new point release of the compiler silently decide to not do the optimization you were relying on. In addition to the blunt instrument of just generating assembly and grepping it, there are compiler vectorization reports that may help. (GCC and LLVM have different methods for this.)
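For instance, a simple reduction loop like the one below is a reliable autovectorization target, and the comments show real report flags you can use to confirm what the compiler actually did. The function itself is just an illustration, not code from the project.

```c
/* Build with vectorization reports to see whether the loop vectorized:
 *   gcc   -O3 -fopt-info-vec-optimized sum.c   (or -fopt-info-vec-missed)
 *   clang -O3 -Rpass=loop-vectorize sum.c      (or -Rpass-missed=loop-vectorize)
 */
#include <stddef.h>

long sum_bytes(const unsigned char *p, size_t n) {
    long total = 0;
    /* A plain reduction with no loop-carried control flow: both GCC
       and LLVM vectorize this at -O3. */
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}
```

Grepping the report output (or the generated assembly for packed instructions) in CI is a cheap guard against a compiler upgrade silently dropping the optimization.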
For a different approach to the problem, take a look at ispc, a SIMD-oriented dialect of C.
> The annoying thing about trusting the compiler to optimize is that you can't trust the compiler to optimize, because there's no actual specification or promise for when it optimizes.
One thing that worked well for me is sprinkling in #pragma omp simd, which gets the compiler to warn when it wasn't able to vectorize the loop.
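A minimal sketch of that approach (the dot function is my own illustration, and the flag behavior is my understanding, worth verifying on your toolchain): compile with -fopenmp-simd so the pragma is honored, and pair it with the compilers' missed-optimization reports to surface loops where the requested vectorization failed.

```c
/* Build with: gcc -O2 -fopenmp-simd dot.c   or   clang -O2 -fopenmp-simd dot.c
 * The pragma asks for vectorization explicitly; -fopt-info-vec-missed (GCC)
 * or -Rpass-missed=loop-vectorize (Clang) then reports when it didn't happen.
 * Without -fopenmp-simd the pragma is simply ignored and the code still works. */
#include <stddef.h>

float dot(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    #pragma omp simd reduction(+:acc)
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```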
Ah, I was also recently learning more about SIMD, that's why I ask! Because I had an interesting experience of (1) realizing I could optimize a certain procedure with vector intrinsics; (2) rewriting it with Zig's explicit vector stuff; (3) having to change certain things to make it possible like padding a certain thing to four bytes etc; and then (4) realizing that the LLVM vectorizer could generate very good code if I only did step 3!
In other words, I could tee it up for the compiler by making my code vectorizable. I was excited by this, so I tried compiling for obscure architectures like LoongArch and WASM64+SIMD to see it generate different vector opcodes I'd never heard of...
So I recommend taking some appropriate part of the code and writing it in a merely vectorizable way to see what your compiler can do!
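In the same spirit, a branch-free classification loop is the kind of shape both vectorizers handle well. This whitespace counter (comma counts as whitespace in EDN) is a hypothetical illustration, not code from the project:

```c
#include <stddef.h>

/* Branch-free byte classification: each == compiles to a compare/select
   rather than a branch, so GCC and LLVM can turn the loop body into
   packed compares at -O3. */
size_t count_ws(const unsigned char *s, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        /* EDN treats comma as whitespace, same as space/tab/newline. */
        unsigned is_ws = (s[i] == ' ') | (s[i] == '\t') |
                         (s[i] == '\n') | (s[i] == ',');
        count += is_ws;
    }
    return count;
}
```

Writing the same logic with an early if/continue per character would give the vectorizer a much harder time; computing a 0/1 value unconditionally keeps the loop body straight-line.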
As far as I know compilers can’t yet autovectorize this kind of lexical analysis, so the state of the art is still to write them by hand with vector intrinsics.
Interesting. Any idea what's preventing them? I've had LLVM thoroughly autovectorize string buffer code—after I rearranged it just a bit to make it actually possible (avoiding bounds checks, etc).
The tricky part is spreading the lexer state rightwards across the vector, and altering the state according to the characters in left-to-right order.
That makes sense, thanks. I see this code also does stuff like gated dynamic array growth that you'd have to deal with in a way that doesn't cause branching on every iteration, etc.
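A tiny example of that loop-carried state problem, as I understand it (simplified: no escape handling, names are mine): whether byte i is "inside a string" depends on every byte before it, which is exactly the dependency an autovectorizer refuses to break. Hand-written SIMD lexers work around it with tricks like a prefix-XOR over the quote mask.

```c
#include <stddef.h>

/* Count bytes that sit inside string literals. The in_string flag is a
   loop-carried dependency: iteration i needs the state left by i-1, so
   the compiler cannot process 16 bytes independently in one vector op. */
size_t bytes_inside_strings(const char *s, size_t n) {
    int in_string = 0;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] == '"') {
            in_string = !in_string;  /* state flips at each quote */
            continue;
        }
        count += (size_t)in_string;
    }
    return count;
}
```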
Is this for jank? I am eagerly awaiting the first alpha. That project has a lot of promise
I kept jank in mind during development and I would be glad to know what they think about this reader. But compatibility with jank wasn't the goal from the beginning.
So if we need fast edn in e.g. Python, it's all about FFI'ing the hell out of the API? Or is there something more clever coming?
From my experiments, Cython was a bit better from an ergonomics and performance point of view, but I'm still testing alternative approaches. I will add it later, the same way as the wasm bindings.