No, Really, Bash Is Not Enough: Why Large-Scale CI Needs an Orchestrator

45 points by iand675

chris-evelyn

Very good article overall, but one little thing always makes me want to jump out of my seat:

nobody really knows how to write bash

You mean you don't. I do. And why don't you? The manual is right there! You can even have it read to you! For free!

Yes, it is an old language that looks and feels very strange and has some very gnarly corners. Yes, a lot of things should be written in another language, especially build/CI systems.

But not because "nobody knows" bash.

Felt good to get that off my chest :D

l0b0
I'd say I "know" bash:
- Used it for some 15 years now
- Taught it to a few dozen people in different workplaces
- Wrote a book about it which sold a few hundred copies
At the same time, I absolutely agree that "nobody" (to a best approximation, with the possible exception of Stéphane Chazelas) really knows Bash. If you actually read the Bash manual, you'll find the relevant pieces of information about a subject spread far and wide, with important facts mentioned in sub-clauses, and details omitted because if you wanted to include everything Bash does you might as well read the entire source code - it would probably be faster. "Knowing" Bash is basically equivalent to being able to hold its entire source code in your working memory. Since nobody can do that, it's probably fair to say that nobody knows Bash.
- chris-evelyn
  
  I havent written a book yet. Out of curiousity, what was yours? Is it still available?
  
  Yeah, but simple to medium sized scripts don't need that level of deep expertise. You should know which areas you should keep away from and not try to be too clever (that's the impossible thing for most programmers I know - myself included).
  - l0b0
    
    Just checked, yep, it's still available.
    
    The problem with even "simple" scripts is that the smallest things can trip anyone up:
    
    0N being interpreted as octal in numeric contexts. Yonks ago a script I wrote failed on August 1st, because I was trying to parse the month ("08") as a number.
    
    Wrote your script on Windows? Be prepared for a really cryptic error message.
    
    Functions inside functions aren't actually in an inner scope. And have fun with local, export, etc.
    
    Variables being global scope (but not exported) by default.
    
    Various commands (like SSH) doing funny things with stdin.
    
    Looping over files safely is fiendishly hard.
    
    Having to deal with, NUL, ANSI escape codes, and other non-visible bytes.
    
    Too-long argument lists.
    
    grep returns 1 if it doesn't find a match, but did you know find returns 0?
    
    On that note, exit codes are a mess. Every program does its own thing, there are only 256 of the damn things, and most programs just give up and use a single exit code for absolutely every single error condition.
    
    return vs. exit.
    
    That's just a few off the top of my mind. I'm sure you can think of other things which really should be simple, but aren't.
    
    Oh yeah, and even ShellCheck (an absolute must if you write any scripts) only catches a tiny fraction of common issues.
    
    ryan-duve
    
    Is your book available in dead-tree format? From the link it looks like it's a "course" which I assume means either a digital book or an interactive site.
    
    l0b0
    
    Yeah, that's a weird one. It's not available in print, and the publisher decided to change it from a book format to a "course", despite the entire thing being written as a book.
    
    adityaathalye
    
    Well I subscribe to the wisdom of Master Foo... I "know" enough Bash to avoid having to learn any other scripting language.
    
    Master Foo and the Shell Tools: https://soda.privatevoid.net/foo/arc/07.html
    
    Master Foo nodded and replied: “When you are hungry, eat; when you are thirsty, drink; when you are tired, sleep.”
    
    l0b0
    
    It's worth knowing a great deal about Bash. One of the most important things to know about it is when not to use it, or at least use it extremely sparingly. Off the top of my mind, these are some of the important ones:
    
    Maths more complex than integer addition and subtraction (use bc in a pinch).
    
    Looping over files, if their names can contain any valid byte (only NUL and slash are actually disallowed). For example, creating a file with a newline in it is surprisingly easy, especially for a new user: forget to end a quoted string, press Enter, then type the end quote, and press Enter again.
    
    Basically anything where you do something with untrusted user input.
    
    Nested language (JSON, XML, YAML) parsing. There's jq, yq, etc, but those are basically write-only languages.
    
    adityaathalye
    
    Yeah, I prefer using Bash as glue. I also try to program Bash defensively and use functional programming style because it's easier to debug and compose / recompose... "Pipeline all the things", never mind forking and process bloat, because my linux laptop is basically a supercomputer for 80s era software.
    
    I've blogged about some of my Bash stuff here: https://www.evalapply.org/tags/bash/ and I am always happy to receive email about that (and other stuff, per my "standing invitation").
    
    l0b0
    
    Same. And nice site!
    
    adityaathalye
    
    Thanks :) Same good vibes back at you, for your "shell scripting with bash" course / slides. I see you referenced "Greg's Wiki" there. "Bash Pitfalls" is a fantastic resource. It helped me not make so many mistakes when I started to seriously use Bash as a hammer for all my problem-nails.
    
    iand675
    
    When I say "nobody knows bash", it's more in the sense that it has a huge amount of gotchas, and I don't really believe that more than 1 out of 500 engineers (being generous) could write a reasonably complex bash script that doesn't trigger shellcheck the first time, nor hit any of the pitfalls listed here: https://mywiki.wooledge.org/BashPitfalls
    
    It's perfectly doable and okay (IMO) to write bash without understanding bash, and I think most bash is written by people who don't know it very well. However, when it comes to a mission-critical part of your corporate infrastructure, the bar should be higher.
    
    chris-evelyn
    
    I know. It was meant as a lighthearted jab at the general expression that "nobody knows" - should have made it clearer that it was not aimed at you or the article in particular.
    
    mxey
    
    I don't like shell scripts, but there is nothing in the same niche that is as widely known and supported and it is acceptable for glue.
    
    I don't really believe that more than 1 out of 500 engineers (being generous) could write a reasonably complex bash script that doesn't trigger shellcheck the first time
    
    Sounds accurate which is why one runs shellcheck on all scripts. That's a lot harder to do if the shell script is written as a list in a YAML file.
    
    srtcd424
    
    Sporadically following oils(hell) development has been a bit of an eye-opener for me. As an official "old git" with 30 odd years bash experience, I thought I knew it .. now my eyes have been opened :(
    
    kwas
    
    Eh, I don't know. IMHO any Bash guidance that doesn't start with "install shellcheck" is wrong. Also it's a tutorial / reference, but skimming the ToC there's not much how to structure longer code to make it maintainable.
    
    alper
    
    But not because "nobody knows" bash.
    
    I "know" bash. Still you wouldn't want me to do anything serious in it.
    
    junon
    
    I get the same feeling whenever someone says something similar. "Nobody really knows X.", "All code is terrible.", etc.
    
    Yes they do, and no it's not. These defeatist attitudes paired with the sweeping conclusions people make thereafter kill me a bit inside. It's like saying "no matter how hard you try to improve, give up because it's not enough."
    
    Terrible mindset and very annoying, LinkedIn-tier sentiments to parrot.
    
    madhadron
    
    You mean you don't. I do. And why don't you?
    
    I used to. I have been happily forgetting it, and I think it was a waste to have ever learned. It's a travesty how much useful stuff is locked in shell commands because Unix provided no good way of making it a library.
    
    atmosx
    
    Especially today with LLMs doing the heavy lifting, lots of ppl can write bash.
    
    Our main deployment script is written in bash and works fine. Now there is BATS for testing as well.
    
    Granted that past a certain point IMO bash doesn’t make sense, it can really go a long way and if you leverage fictions properly, you can have a very clean high level logic.
    
    alper
    
    I've seen some of that LLM generated bash fly by this week. That's write-only code.
    
    square_usual
    
    Yeah, of all the things, the last thing I want an LLM to write is bash, because I have effectively no way to verify that that code is good.
    
    ThinkChaos
    
    Even Jeffrey Epstein read the manual: https://www.justice.gov/epstein/files/DataSet%209/EFTA00315849.pdf
    
    jonahx
    
    Comment removed by author
    
    izabera
    
    and right there
    
    wink
    
    Same as the last article, this is kinda based on a false premise, I think.
    
    I have never encountered anyone who meant "just use a bash script" literally.
    
    Usually it's a very long (and usually at least 70% true) rant about how there must be a middle ground between "make && make test" wrapped in bash and a (needless) abomination where people struggle to run anything sensible on their local machine, for no good reason - just because people used the wrong level of abstraction in the CI and every single step like "cd a; cmd; cd ../b; cmd" is put into a yaml file instead of a bash file or something adequate.
    
    iand675
    
    Ok, so what’s the correct thing to do here? Easy to post saying that it’s a false premise, but I feel like I made a reasonable effort to articulate the problem space. What am I missing?
    
    stephenr
    
    Not the person you replied to, but here's my take on the 'bash vs bash-in-yaml': aspect.
    
    So many of these CI/etc systems rely on a particularly named file (written in nobody's favourite language: Yet-Another-Migraine-Looming) inside the repo, which then has a metric fuck-ton of shell script snippets embedded within it.
    
    The problem isn't that you have a complex state-aware system to manage the workflow as a whole. The problem is that the part the user interacts with (i.e. the aforementioned YAML file) is needlessly expressed in the most inappropriate language possible.
    
    If you can extract snippets of code from a defined pathname YAML file inside the repo, you can either copy or directly run executable script files at defined pathnames. Heck you can even extract a known-format of metadata from a file header comment if you really wanted to keep the logic for a given step together and you need to know things before it's executed.
    
    There are so many ways you could not rely on a mess of YAML, it's hard to take anyone seriously when that is their offered solution.
    
    singpolyma
    
    The worst part of bash in yaml is that it basically boils down to a list and the list is just executed in order... Which means it could be not a list. At which point it's a bash script possibly with some formatter but in a more annoying syntax
    
    pm
    
    Someone gets it!
    
    "Shellscripts suck!" ---> proceeds to extract the commands one by one and throw them Ina YAML file.
    
    It feels surreal. Then when you point that out, the counter argument is just trying to short circuit the discussion to the end with some dogma.
    
    orib
    
    Anything that lets me run a local command to run your build pipeline, and debug the CI pipeline without a CI runner.
    
    The rest is just quibbling over personal preference.
    
    iand675
    
    Sure, there are plenty of non-bash options (such as for my prior recommendation, buildkite: https://buildkite.com/resources/changelog/44-run-pipelines-locally-with-bk-cli/) that let you run your build pipeline. I think discouraging people from going down an unproductive path is a little more than quibbling.
    
    wink
    
    Brevity. You're actually even saying somewhere "bash is fine until it's not" but then it kinda contradicts (iirc both of the) intros.
    
    I even mostly agree with most of your points, but I think you're arguing against some sort of strawman, or not leading with "in huge projects...", that only comes late.. or maybe it comes early, but only in the second post (and I read both of them today, so sorry for mixing up details)
    
    srtcd424
    
    The problem with stuff that's fine until it's not, is there's usually no obvious point where you should stop evolving/extending it, before it becomes unmanageable. I guess with enough personal or intra-team discipline it's OK, but I suspect commercial pressures make that really hard.
    
    swaits
    
    You assume, like yourself, that nobody else understands how to use bash.
    
    It’s not that hard. It’s ubiquitous. It can be done nicely.
    
    pm
    
    That person would be me. I find this whole topic crazy and I am yet to be provided a practical example of what exactly those y'all's based build servers do that a shellscript wouldn't. They are essentially mutilated shellscripts with the silly YAML to male them more painful.
    
    The example you provided is a good one. I think people don't know how to write shellscripts then jump into absurd conclusions. The CD command is intended for interactive she'll usage. Using it on shellscripts is an anti pattern because it has a zillion gotchas for good reasons.
    
    wink
    
    I think it's not unreasonable to have a somewhat ok method for parallel tasks though.
    
    Naively people who only have bash would either:
    
    make build make test make docs make static_analysis make whatever
    
    while in reality you don't want to background/tmux/sleep/whatever hack but just so
    
    | +- make build test \ +- make docs + - whatever only after all 3 succes +- make static_analysis / `` and I agree with OP that this needs a "proper" system, for whatever definition. Or just fanout to 3 build servers on 3 architectures. why serialize?
    
    pm
    
    Oh, if need be I would absolutely fork those by backgrounding them with & and put a wait in the end. That is the pretty much the canonical way to do it. If you want to execute that in other machines, then I would go with SSH, which I have done countless times from build scripts. It does work.
    
    wink
    
    Yeah I know, I wouldn't want to do that though with lockfiles or whatever and make sure 1-n have terminated successfully so n+1 can run..
    
    If that works for you you seem to like bash a lot more than I do :)
    
    mxey
    
    I feel like this is still misrepresents what I think of as the Bash approach.
    
    At least the way I see, the Bash suggestion is not to disregard the features of a CI runner. By all means, run every job inside a fresh container. That’s what GitLab CI does. But what runs inside the container is a Bash script. A Bash script that does NOT do much by itself but just launches other stuff. If that stuff cannot run in the same environment, either fix the stuff or make it multiple scripts you run as multiple CI jobs.
    
    What this approach still cannot give you is better problem analysis than “read the log”, though. To me that has usually been sufficient but I guess it really depends on what things you run.
    
    For reference: we used to mostly use Earthly, but with its death, we are actually moving to Bash scripts. All the jobs run from the same container image which is our build environment. Gitlab CI runs one job for each script, which are different containers. Notably, in the majority of repos, there is not much in the scripts. They just provide a common entrypoint so that in any repo you can run the test script and it will do "go test" and so on. This is primarly inspired by https://github.com/github/scripts-to-rule-them-all
    
    If I had more complicated requirements, I would actually look towards Bazel instead of trying to put more into the CI system.
    
    doctor_eval
    
    All my current problems involve fighting the CI to get it to do what I can do trivially on the command line. I love the idea of CI/CD but with these orchestrators, almost any change, even a simple one, becomes a tedious two hour death march after which I’m intellectually drained. I tried using an LLM for my last one and it worked at first, but I ended up in exactly the same place. A trivial change fell apart because I couldn’t find a version of all the random pieces that had the tools I needed.
    
    With CICD systems, hard things are easy, and easy things are hard.
    
    So, can I ask how your system works?
    
    I imagine the docker container you use as the base checks out a git repo and runs a shell script at some well known location? Something else?
    
    mxey
    
    I imagine the docker container you use as the base checks out a git repo and runs a shell script at some well known location? Something else?
    
    We use GitLab CI. This is just how it works: every job runs in a container with an image you set in your job/pipeline with the repo checked out already.
    
    The only thing specific to us is that we put almost nothing in the job. Each job just calls a shell script. (Might have some CI-only set up before like docker login).
    
    test: image: … script: - ./hack/test
    
    There is bit more to it with setting some caching defaults in a template job and adding Docker-in-Docker if you need it, but that’s the gist of it.
    
    The idea is to align the CI environment and the client environment well enough that you can run those scripts the same in either place. On the client there might be some minor version differences in installed tools but we only have compatible enough tools in our allowed assumptions. Everything else the build scripts might need they have to procure themselves, usually by running Docker containers.
    
    There could still be failures due to tiny differences between the environments but since the CI environment is a Docker image, it’s not that hard to pull that image and run the script in it manually, if you need the exact same environment to debug the build.
    
    doctor_eval
    
    Thanks for that. That makes total sense.
    
    jlarocco
    
    Funny, I think I replied to that article complaining about Azure YAML being a poor wrapper around CLI tools.
    
    I don't disagree about needing a CI orchestrator, but I still maintain that the best way to say, "Run this command" is to let me give the command.
    
    None of this silliness where every possible command gets its own YAML entity with "inputs" that map to command line arguments. It's extra, needless work - now I need to know msbuild's arguments AND the YAML inputs that they map to, and how they work together. I already know the msbuild command line I need, just let me specify it.
    
    MatejKafka
    
    I'm on a Windows-based team with a CI based on Azure DevOps (can't really recommend). We currently use a custom PowerShell-based orchestrator, with most of the glue written in PowerShell that occasionally calls into native tools (written in Rust, as that's the main project language; not a fan of the long compile times and (over)focus on correctness for tooling, would prefer C#, but I digress).
    
    I'm generally happy with the setup, PowerShell strikes a good balance between a shell and a full scripting language for infra code. On the shell side, it's trivial to call other tools and process their inputs and outputs, it provides built-in parameter handling (typing, structured data, validation, completions,...) and you don't need a separate compilation step. As a scripting language, it's somewhat more idiosyncratic than, e.g., Python, but still very usable, and if something is missing, you have the escape hatch of calling into .NET directly.
    
    For the CI structure, I have a rule that anything invoked in the CI must be easily testable locally, since I'm unwilling to have the CI as part of my inner dev loop. Most CI steps just call into the orchestrator, which prints each invocation to the CI log, so most issues are reproducible locally by copy-pasting a few relevant invocations into your local shell.
    
    mxey
    
    For the CI structure, I have a rule that anything invoked in the CI must be easily testable locally, since I'm unwilling to have the CI as part of my inner dev loop.
    
    This here is the crucial part. Everything else for me follows from that.
    
    pmarreck
    
    If You’re a Nix Shop, You Can Leave Early
    
    LOL, truer words were never spoken. For chrissakes, my fellow devs, please learn to control your project dependencies with a flake.nix and be done with it. The AI will MASSIVELY assist you with this at this point.
    
    mariusor
    
    Go outside. Touch grass.
    
    Thank you, but fuck off.
    
    I think it's enough to acknowledge that CI requirements are not the same for everyone without needling at any specific group.
    
    [edit] But other than that, excellent article.
    
    iand675
    
    Oh, hm. Maybe the way I wrote that came off wrong. I was simply trying to say that it’s nice to have the freedom that comes with having a simple system.
    
    As in, literally, go enjoy the grass on my behalf because I’m stuck in my office in front of a screen
    
    srtcd424
    
    I took it the way you intended, FWIW.
    
    stephenr
    
    This reads like a cringey text of a mother responding "LOL" to "Uncle Albert just passed away" because she thinks it means "lots of love" and is desperate to use "the language the kids use".
    
    I can't remember the last time I saw someone use modern slang so incorrectly with such confidence, and I'm in my 40s.
    
    srtcd424
    
    Getting very off-topic, but: part of being a "grown up" in my book is sensing context when interpreting someone else's language, not just immediately going looking for trouble. Language evolves, but people don't all use it the same way, for all sorts of reasons. (I'm also in my 40s, though sadly not for much longer.)
    
    whjms
    
    I see 'touch grass' used with a more positive connotation these days because of the implication that it means you'll be spending time offline, and that's a good thing.
    
    tomjakubowski
    
    Comment removed by author
    
    evmar
    
    I just saw some linguist talking about how broadening is a common phenomenon with phrases like this! Another example is how "raw dog" originally only had a sexual meaning but now it just means doing anything unsafely. It's something like, whenever slang gets used in a larger audience than where it started, there are more people who don't know/understand the specific original meaning, s they attempt to infer a meaning from context and end up somewhere more general.
    
    mariusor
    
    I think that despite how you wanted to present the idea, you're inadvertently creating a separation between developers that do serious work in teams and someone, supposedly doing not as serious work, that can be satisfied by a "simple" CI pipeline mechanism like bash, you're saying "I'm not talking to you". But, yes, yes you are, because I'm the one reading it.
    
    pm
    
    I think I am the person mentioned in the post.
    
    First of all, everyone should calm down. This is a technical discussion. No need to get worked up over this. After all anything other than technical merits of a piece of tech, is irrelevant in this conversation.
    
    The author doubles down on a bunch of still unproven claims. I reject the idea that shellscripts (not necessarily bash scripts) have worked for some because the problem at hand was simple/easy.
    
    I believe they have worked for those who knew how to use them properly. Frankly speaking, Which percentage of engineers knows how to capture a command output to a variable? Most will get all confused about return values vs output to standard streams. How many know why we have stderr and stdout? Heck, most will try to pass everything as ARGS if being there is a built mechanism for data input, then complain about she'll expansion. Which exists for clear practical reasons. Same as the admittedly tricky quoting sequences.
    
    Yes, learning these things takes time and effort. The reason being that the problems it tries to solve so have non trivial complexity and some intricate details.
    
    But to th point, what is there exactly that is too complex for a shellscript that say GitHub actions solve? And how does it solve? What exactly is its capability that you don't have from a shellscript? You could call me stubborn, but I am genuinely confused, and respectfully suspect the author just hasn't digged into shellscript to the point that he realizes how useful and capable it is.
    
    Shellscripts are the canonical way of scripting your operative system as a user. I can't think of an automated task they can't do. I suppose if you need to manipulate a program that provides no way to interact with itself other via an ABI. At that point you pretty much wrote a library.
    
    matklad
    
    But to th point, what is there exactly that is too complex for a shellscript that say GitHub actions solve? [...] What exactly is its capability that you don't have from a shellscript?
    
    To answer this specific question: running tests simultaneously on a windows box, a linux box, and a mac box, using the current latest version of all there OSes, and signaling success only if all three of them signaled success.
    
    pm
    
    Then the question becomes what building resources and infrastructure are needed, rather than weather a shellscript os capable or not.
    
    Obviously, using a shellscript to test a windows desktop application doesn't make sense to start with. I think this is established before the discussion.
    
    If we're talking about GUIs and such things, that is a whole other discussion, IMO. Certainly build servers are a gain for the machine consumable part of your software. Not so much for GUIs. What software would you want to run on windows and mac and why? We're for sure talking about something that is meant to be interacted with by humans.
    
    sunshowers
    
    Shell scripts cannot express event loops/actor patterns, which are a requirement for many kinds of software. You cannot do epoll or equivalent with shell scripts.
    
    DanOpcode
    
    Very interesting and thought provoking! Thanks for sharing!