From Languages to Language Sets
15 points by veqq
C should not fall into the same category as C++ and Rust (where is D?). Even if you call both "manual memory management" (I would rather say: not garbage collected), there is a big difference: SBRM (RAII). This feature significantly increases safety and programmer comfort.
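To make the SBRM/RAII point concrete, here is a minimal Rust sketch; the `Connection` type and its field are illustrative, not from the article. The resource is released on every exit path without an explicit close call:

```rust
// Scope-bound resource management (SBRM/RAII): cleanup runs
// automatically when the owning value goes out of scope.
struct Connection {
    name: String,
}

impl Drop for Connection {
    // Called automatically when a Connection leaves scope --
    // including on early return or unwinding.
    fn drop(&mut self) {
        println!("closing {}", self.name);
    }
}

fn main() {
    let conn = Connection { name: String::from("db") };
    println!("using {}", conn.name);
} // `drop` runs here; no manual cleanup needed
```

In C, by contrast, every exit path must remember to call the cleanup function by hand, which is exactly the comfort gap the comment describes.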
(you can have GC even in C or C++ and other typically non-GC languages – however it is not common)
Java has rich possibilities and would span multiple categories. Bytecode is both compiled and interpreted. GraalVM can compile it into a native binary (even a static one). Java itself (the language) is strongly, statically typed, but the JVM can also run scripting languages like Groovy, JavaScript, Python, PHP, etc., and in GraalVM you can mix them all in a single program and call e.g. JavaScript functions from Python and vice versa.
So Java can be your "Level 2", but also 3 and 4 (the platform runs and integrates various scripting languages with syntax similar to Java's), and even 1 (Java Card runs on smart cards, such as banking cards or SIM cards, which have very limited computing power and usually no GC).
C has the non-standard but widely supported cleanup attribute, which gets you most of the way to RAII. (It's more like Go's defer or C#'s using.)
As long as you're not using MSVC, you can probably use it.
I agree with you about C — I would call it Level 0, with no support for memory / resource management at all. Zig is sort of Level 0.5; I don’t count “defer” as equivalent to RAII since you still have to write it out manually.
D is at two levels: level 2 (its 'native' level, with GC) for common use, and level 1 (-betterC mode) when going without the GC.
As to RAII, I view it as a bit player; but D has it, even in betterC mode.
The major gain is from bounds checking, so Rust would natively have that, as would D. I have hopes for the Apple enhancements to clang there, allowing C to be safer, even if only for UT and system/integration testing.
The author assumes that the compilation model is fundamentally too slow to compete with interpreters on iteration speed. I think that’s wrong. I believe it’s possible to design modern languages that perform clean builds of 10M+ loc codebases end-to-end within a fraction of a second on a laptop.
I agree, but further I don’t believe a 10M+ loc codebase makes sense. Such verbosity should be factored out by macros, coherent frameworks and configuration DSLs. A system unable to do this seems rather flawed (yes, I’m criticizing most things around today.)
Code generated by macros still has to be compiled. And macros are an easy way to overwhelm the compiler.
Would you say the same for OS projects? I feel like in a codebase with driver/fw support for thousands of devices from thousands of vendors, there’s too much entropy to keep the code concise.
(yes, I’m criticizing most things around today.)
To be clear, my PoV is not possible today because of inertia, industrial practices etc. but I still think it’s a healthier mental model to grok code and design things.
Yes, and in particular driver support is another example of this issue: things are often undocumented, non-standardized, etc., so you need a lot of glue code that a better design would obviate. But I'm also against monorepos and don't think drivers should architecturally be considered part of the OS (rather, parts of the devices); a distro would of course contain code from many of them, as different libraries, just as is done today. My view isn't the only way, of course. But it seems sensible that one side of the interface to a device family should be part of the OS, while the hardware/software implementations that enable it (e.g. handling display) aren't part of it, since you wouldn't use every one of a device type's drivers on the same machine. In no other space do we consider every implementation of an interface part of that project.
Anyway, Forth drivers require an order of magnitude less code, which shows a sleeker way to handle things today. That's my actual point: we have too much code because we're (stuck) using bad approaches.
@stassats you aren't arguing that we have millions of billions of lines of code because of all the assembly at the end. Rather, there's some logical point where you think it's valid to stop and analyze. I am arguing that the highest (densest) abstraction layer is that point. (In Common Lisp, even things like if are macros, without taxing the compiler.) You can lazily evaluate macros too (n.b. this isn't idiomatic). There is also drama around fexprs, whose support is now rare and whose definition is unclear; but e.g. cf. Kernel. I approach all of this from a Lisp view (build Forth on the hardware, then build a Lisp from Forth, and the Lisp image is your OS), but conceptual issues around fexprs (when still possible) led to Smalltalk and object orientation. Ctrl-F "fexpr" here.
I believe it’s possible to design modern languages that perform clean builds of 10M+ loc codebases end-to-end within a fraction of a second on a laptop
Curious as to where you get this intuition from. Sorbet is probably one of the fastest type-checkers that I know of for a reasonably complex language, and its speed for clean type-checking is about 200K LOC/core/s (parsing can be about 1 to 1.5 orders of magnitude faster, so ignoring that). With 10 cores on a modern laptop, that's still only around 2M LOC/s, assuming zero synchronization overhead.
Your numbers seem an order of magnitude faster than Sorbet.
I’ve implemented a compiler for a roughly C-like language w/ basic generics and unidirectional type inference that compiles to RISC-V at 6MLoC/core/s on an M2.
The main enabler for this is strict declare-before-use and an easy-to-parse grammar. But I also made several design decisions that sacrificed compilation speed in favor of codegen quality (I want something closer to clang -O1). Instead of a true single pass, I'm compiling each function in two passes, with an SSA-based IR. The main optimizations are lazy linear-scan regalloc, constant folding, DCE, scalar replacement of aggregates, and inlining, but I'm planning a few more.
The main thing I’m lacking is closures and nested function definitions. That’s certainly going to be tough to implement with my current design, but I think it’s doable. Other planned features like memory safety should be efficient based on some prototyping I did last year.
My code is mostly plain Rust. No assembly, no SIMD intrinsics, no sophisticated data structures. I think the ceiling is likely to be a lot higher.
Very cool. :)
Would love to dig in to your compiler architecture and language feature set if you ever decide to open source it.
Through the whole thing, a huge [citation needed] was glaring so brightly I could barely finish it.
Every successful business started with those languages eventually rewrites their codebase to use one of the “lower level” languages because big codebases written by many people are hard to maintain and modify without the support of a static type-checker.
[citation needed]
GitHub, the very site it was published on, is written in Ruby. By the OP's classification, that's a level 4 language. As of right now, GitHub is 17 years old. GitHub is also a 6,500-strong company. Granted, not everyone is working on GitHub itself, but even if it's only 20%, that's still 1,300 people working on a Ruby project. They seem to manage, more or less.
Shopify is another big Ruby app. Founded 2 years before GitHub (so 19 years old). At this time about 8,300 people work there. By the same 20% assumption, that's 1,600 people working on a Ruby app. They seem to manage even better than GitHub.
These are not the only two examples out there; they're just the two I think most of you are familiar with.
So when does "eventually" come? Or at what point is the team too big?
And so we come to the levels 2 and 3, where most professional programmers today spend their time.
[citation needed]
I would wager a solid $1 that JS, Python, and PHP alone employ more developers than every other language combined. I’m also fairly confident that code in those languages is produced at higher rate, as well.
The only gap between them is that interpreted languages can include “eval” and dynamic meta-programming (modification of program structure at runtime). These features are usually shied away from in production code though, and are more helpful during development, especially for testing.
[citation needed]
Rails (the thing GitHub and Shopify are written on) is arguably a skyscraper of meta-programming. Every Rails-like framework employs meta-programming to some degree. It wades a bit into the weeds of semantics, but Spring (Java framework) annotations are a kind of meta-programming, as are Rust #[derive(...)] and other annotations, even the C pre-processor. Meta-programming is all over the place and has been since the dawn of programming. It's in every production system, even if the system doesn't support eval at runtime.
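The Rust #[derive(...)] case can be sketched in a few lines; the `User` type here is illustrative. The derive macros generate the trait impls at compile time, i.e. code writing code, with no runtime eval involved:

```rust
// Compile-time meta-programming: each derive below expands into a
// generated trait implementation during compilation.
#[derive(Debug, Clone, PartialEq)]
struct User {
    id: u64,
    name: String,
}

fn main() {
    let a = User { id: 1, name: String::from("veqq") };
    let b = a.clone();   // Clone impl generated by the derive macro
    assert_eq!(a, b);    // PartialEq impl generated by the derive macro
    println!("{:?}", a); // Debug impl generated by the derive macro
}
```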
The best part is that all three languages share pretty much the same syntax, and they are built so that calling from higher level to lower level variant is effortless.
[citation needed]
I’d argue that different syntaxes at different levels are more beneficial, as they help with switching context. You’re in for a whole lot of… adventures when you think you’re writing one language when in fact you’re writing a completely different one.
Take this “RustScript” for example:
let rect1 = { width: 30, height: 50 };
There’s a whole lot of questions I have about this line. How do you map this onto lower-level Rust? You have to map it onto the lower level somehow, because it’s gotta be “effortless”. But how do you infer the type of this? Is this type equivalent to every struct with those fields? How do we convert between the types if the conversion is not defined in the lower level?
Consider this:
struct Rectangle {
width: u32,
height: u32,
}
struct WindowSize {
width: u32,
height: u32,
}
Which type is assigned to each of these variables?
let rect = { width: 30, height: 50 };
let size = { width: 50, height: 30 };
Is this a valid RustScript code?
if (rect == size) {
// do something
}
Can we call lower-level functions with any of these?
window.set_size(rect);
println!(rect_area(size));
Would both of these work? What type would they have?
let x = { width: 100, ..rect };
let y = { width: 100, ..size };
How dynamic is RustScript? The original line is valid JS. It creates an object, and you can add fields to it to your heart’s content:
let rect = { width: 30, height: 50 };
rect.awesomeness = 9001;
console.log(rect); // Object { width: 30, height: 50, awesomeness: 9001 }
Can you do it in RustScript? It’s supposed to be dynamically typed. What would happen to the type? Can you still use this awesome rect with lower-level Rust proper?
This is just from a single line in the example. I’m sure there will be many more questions when expanded to the whole of Rust.
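For contrast, Rust proper answers the equality and conversion questions definitively, because its typing is nominal. A sketch, reusing the structs above (the From impl is something you would have to write yourself; nothing is inferred):

```rust
#[derive(Debug, PartialEq)]
struct Rectangle { width: u32, height: u32 }

#[derive(Debug, PartialEq)]
struct WindowSize { width: u32, height: u32 }

impl From<Rectangle> for WindowSize {
    // Conversions between structurally identical types must be
    // spelled out explicitly; the compiler never infers them.
    fn from(r: Rectangle) -> Self {
        WindowSize { width: r.width, height: r.height }
    }
}

fn main() {
    let rect = Rectangle { width: 30, height: 50 };
    // `rect == some_window_size` would not even compile:
    // the two types are distinct despite identical fields.
    let size: WindowSize = rect.into(); // explicit, via the From impl
    assert_eq!(size, WindowSize { width: 30, height: 50 });
}
```

Any "effortless" RustScript-to-Rust mapping would have to either abandon this nominal model or synthesize such conversions, which is exactly the tension the questions above point at.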
Now, don’t get me wrong, I appreciate the idea. It can be an interesting thing to explore. I’m just a bit miffed that OP is backwards: it makes some dubious claims to set the stage for the idea and does no exploration of it at all. I’d rather it started with the core idea of language sets and explored it in more depth. Instead of stripping down Rust syntax to look like JS, I’d love OP to think about interop (arguably the core feature of a language set) and how it might work.
I think Shopify is a Sorbet user; if so their codebase moved from level 4 to level 3. Dunno if Github uses Sorbet.
Re. metaprogramming, the dynamic qualifier is important. I like to write code that writes code, but that’s almost always static codegen. Code that rewrites itself is far too brainbending!
Yeah I think the opposite direction bears just as much fruit :-)
You could start with the most ergonomic/productive thing and make it fast – in contrast to starting with the fast, complex language, and simplifying it.
We did that with https://oils.pub/ – started with Python, and then statically typed it with MyPy, and then translated that to garbage-collected C++
So we went from tier 4 to tier 3 to tier 2 !
bash is written in a tier 1 language – C. And OSH is faster than bash in many cases, although there are still cases where we’re slower.
We are able to be faster because bash is a very suboptimal program! (e.g. it uses a ton of linked lists, not sure there are any hash tables at all) Most big programs are suboptimal, especially big C programs.
RustGC is the language I wish existed. I like Rust, but memory management adds so much clutter.
So, basically, Graydon Hoare’s Rust or Borgo?
I can only second this. I wish something like this — a kind of Rust on top of Go’s runtime, perhaps with a touch of Python’s ergonomics — were more popular. Some of it kinda exists now in the form of Borgo, but I don’t think I’d pick it in any kind of professional setting due to relative obscurity and immaturity.
What about GraalVM native-image? It generates static binaries that are quite big, but run fast. Rust binaries are also big. Or maybe D?
Rust binaries are also big
Are you just comparing them to dynamically linked C binaries, which rely on code already present on the system? Rust and Go include everything (including the whole runtime) within the binary, hence the size, while C leans on shared libraries installed on the system.
This makes a lot of sense and echoes thoughts I’ve been having. Level 2 appears to be my “Goldilocks”, because memory safety without some form of GC requires IMHO too much awkwardness.
It’s possible to tease this apart into more levels — I see a distinction between full-on GC vs. refcounting, in that the former requires a larger runtime library and imposes barriers to FFI — but I’m not sure that’s useful at this level of abstraction.
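The refcounting side of that distinction can be sketched with Rust's Rc (values here are illustrative): memory is freed deterministically when the last reference goes away, with no tracing collector or large runtime, which is part of why refcounting imposes fewer barriers to FFI.

```rust
use std::rc::Rc;

fn main() {
    let shared = Rc::new(vec![1, 2, 3]);
    let alias = Rc::clone(&shared);  // bumps the refcount, no deep copy
    assert_eq!(Rc::strong_count(&shared), 2);

    drop(alias);                     // count drops back to 1
    assert_eq!(Rc::strong_count(&shared), 1);
    // When `shared` leaves scope, the count hits 0 and the vec is
    // freed immediately -- deterministic, collector-free cleanup.
}
```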
I generally agree with the 4 tier categorization:
But I’d also add 2 or 3 more tiers:
The goal of YSH is actually to unify tiers 4, 5, 6, and 7 under one language. The slogan I’ve been using is “minimal YSH is shell+Python+JSON+YAML”
Instead of having Unix sludge (autotools - m4 generating make) and Cloud sludge (Helm - Go templates generating YAML), you have one language
This is a hard design challenge, but I just made a release with an overhaul of Hay - https://oils.pub/release/0.28.0/
Hay version 1 was hard-coded in the interpreter - https://oils.pub/release/0.28.0/doc/hay.html
But we realized it’s actually better to self-host it in YSH, using YSH reflection. We will be testing this by rewriting Hay in YSH
So that’s our language design response to https://news.ycombinator.com/item?id=43386115
It’s madness that languages are effectively siloed from each other.
Instead of tiers 4, 5, 6 being silo’d, we have them all under YSH and the Oils runtime (which is tiny, 2.3 MB of pure native code).
(As a bonus, OSH also runs on the Oils runtime, and it’s the most bash-compatible shell!)
[1] Garbage Collection Makes YSH Different - https://www.oilshell.org/blog/2024/09/gc.html
Shell, Awk, and Make Should Be Combined - https://www.oilshell.org/blog/2016/11/13.html - all these languages lack GC!
[2] Survey of Config Languages - https://github.com/oils-for-unix/oils/wiki/Survey-of-Config-Languages - divides this category into 5 tiers:
[3] Zest: Notation and Representation addresses this - https://www.scattered-thoughts.net/writing/notation-and-representation/
YSH also has a common subset with J8 Notation (which is a superset of JSON)
Cloud sludge (Helm - Go templates generating YAML)
This is really apt, I like your characterization of it :-)
Similarly, you can implement Ruby-like DSLs in YSH – e.g. Ruby has “Rake” for “Make”: https://github.com/ruby/rake/blob/master/doc/rational.rdoc
file "hello.cc"
file "hello.o" => ["hello.cc"] do |t|
srcfile = t.name.sub(/\.o$/, ".cc")
sh %{g++ #{srcfile} -c -o #{t.name}}
end
IMO it would be nicer in YSH:
file hello.cc
file hello.o : hello.cc {
srcfile = name.replace(/ '.o' %end /, '.cc')
action {
g++ $srcfile -c -o $name
}
}
You also don’t need to “shell out”, because YSH is already a shell! Both traditional Make and Ruby Rake delegate to the shell, which makes escaping non-obvious, among other things
So if anyone wants to help prove our reflection APIs in this way, feel free to join https://oilshell.zulipchat.com/
Again, we prefer using reflection on data structures in the runtime, rather than “text sludge”
Ruby is my favorite language, but this looks super neat!
Thanks for noticing! I would definitely like more Ruby users to check out YSH (feel free to join https://oilshell.zulipchat.com/ or send other interested people that way)
A slogan is “YSH is for Python and JavaScript users who avoid shell” … But that really includes all of Python/JS/Ruby/Perl/PHP/Lua (though Ruby and Perl users don’t seem to avoid shell as much, which I think is good!)
Partly this is because I know Python and JS pretty well, but not Ruby. But I’m interested in and respect Ruby, e.g. I generated some “survey code” as part of the YSH language design, e.g.
https://github.com/oils-for-unix/oils/blob/master/demo/survey-closure.rb
I think our reflection compares favorably to Ruby’s; e.g. instead of Ruby’s binding, we have
var mydict = eval(block_of_code, to_dict=true)
I wrote some design notes in June 2023 about this:
There I wrote
rule {
outputs = ['grammar.cc', 'grammar.h']
inputs = ['grammar.y']
command {
yacc -C $[_inputs[0]]
}
}
Which is almost the same thing!
But yeah I hope you can do almost anything with YSH that you can do with Ruby, though we need to test it.
One difference is that we don’t have declared params to the { } blocks, while Ruby has
myfunc do |x, y|
statement
end
But so far it seems OK to have x and y be “implicit” rather than explicit (?) Feedback is welcome
Another tidbit I remember is a Gary Bernhardt talk, where he says that the thing that Ruby has, that Python doesn’t, is blocks
e.g. for the RSpec test framework and so forth – it makes it nicer
And I agree!
So yeah YSH is influenced by Python and JS, but it has Ruby-like blocks too! I think that adds a lot …
I have been missing blocks for all these years in Python-land :)
Putting TS/MyPy between JS/Python and everything else is a bit odd
Static typing / type-checking. That’s a huge difference from my perspective. I gave up on JS years ago for anything but little hacks; and yet in recent years I’ve come to love TypeScript.
Sure. But the order just feels inside out in that the usual “built on top of” relation is subverted.