The Llama 4 herd
27 points by friendlysock
10M context size is insane. That’s enough to handle a pretty sizable project (relatively - obviously it’s not enough for Chromium, but certainly many SaaS codebases are < 1MLOC) + numerous revisions of its codebase/dependencies.
It will be key to see how well it can use that context, though. Being able to technically fit a lot of tokens in the context window vs. being able to actually integrate that larger amount of information in a useful way are somewhat separate issues, and they don’t necessarily scale up together. For example, even models with relatively small context windows can nowadays do basic factual Q&A about short stories pretty well. But if you try similar factual Q&A about novels on large-context-window models, they don’t do great – the whole novel does fit in the context window, but the LLM can’t effectively pull the answers out of it.
Reply to self because it’s too late to edit: A different Q&A benchmark for novels just added Llama 4 Scout 109B MoE, and it didn’t do well at all.
I posted a bunch of my own notes here.
The easiest way to try it at the moment is probably via OpenRouter: Llama 4 Scout and Llama 4 Maverick - they have a chat UI in addition to their API.
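If you want to hit the API rather than the chat UI, a minimal sketch against OpenRouter’s OpenAI-compatible endpoint might look like this (the model slug and prompt are my assumptions based on OpenRouter’s usual naming; check their model pages for the exact ids):

```python
# Minimal sketch: querying Llama 4 Scout through OpenRouter's OpenAI-compatible API.
# The model slug "meta-llama/llama-4-scout" is an assumption; verify it on openrouter.ai.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # placeholder
)

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout",
    messages=[{"role": "user", "content": "Summarise the Llama 4 announcement in two sentences."}],
)
print(response.choices[0].message.content)
```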
It’s too big to run on almost all consumer hardware, but if it follows the same pattern as Llama 3 we can hope for Llama 4.1, 4.2, etc. releases which are much smaller and more device-friendly.
Thanks again Simon for documenting all this! I find your posts super useful and learn a lot from them (I even learned about Lobsters via a blog post of yours).
Thanks for sharing.
Llama 4 Scout is 109B parameters in total; that’s manageable on some consumer hardware if you don’t use the full 10M context.
Estimates I’ve seen are for it needing at least 96GB of RAM for a 4bit quantization. I’m currently stuck with 64GB.
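For a rough sense of where that figure comes from, here’s a back-of-the-envelope sketch (illustrative only; actual quantized file sizes, KV cache growth with context length, and runtime overhead all shift the total):

```python
# Back-of-the-envelope RAM estimate for Llama 4 Scout (109B total parameters).
# Figures are illustrative; real quantized formats and runtime overhead vary.
total_params = 109e9

for label, bits_per_weight in [("4-bit", 4.5), ("8-bit", 8.5)]:
    # Quantized formats usually cost slightly more than the nominal bit width
    # (scales, zero points), hence the extra ~0.5 bits per weight here.
    weights_gb = total_params * bits_per_weight / 8 / 1e9
    print(f"{label}: ~{weights_gb:.0f} GB for the weights alone, before KV cache and overhead")

# ~61 GB at 4-bit and ~116 GB at 8-bit for weights alone, which is why a
# recommendation in the ~96 GB range (with headroom for context) is plausible.
```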
Hm, the quoted 60.74 GiB RAM peak use — is it inclusive of the tool overhead?
I can definitely trim down my 64GiB RAM build box to 62GiB available RAM, and probably get 61GiB available to the internal Radeon GPU (plus there is 512MiB allocated as VRAM in the BIOS). If 17B active weights means performance similar to Mistral-Small-24B, that’s not great but not prohibitively slow…
llama.cpp seems to need ≈1.5GiB RAM for a 50MiB-sized model though, so the exact overhead sizes might end up being a make-or-break issue here.
Update: I was wrong.
I can load an unrelated 34GiB (definitely more than 32GiB, a.k.a. half the RAM) 34B-parameter model into the iGPU using llama.cpp (it takes a while) and get something generated at around one token per second.
I can ask llama.cpp to load a 34GiB «1-bit S» dynamic quantisation (some weights keep more bits) into the iGPU. It eventually starts generating something, but it takes many seconds per token.
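For anyone trying to reproduce this kind of experiment, here’s a minimal sketch using the llama-cpp-python bindings (the GGUF filename and parameters are placeholders, not my exact setup; offload behaviour depends on your build and iGPU):

```python
# Minimal sketch: loading a quantized GGUF with llama-cpp-python and offloading
# layers to the GPU. The filename and parameters below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-4-scout-IQ1_S.gguf",  # hypothetical filename
    n_gpu_layers=-1,  # offload as many layers as fit; lower this if you run out of VRAM
    n_ctx=8192,       # keep the context modest -- the KV cache grows with it
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```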
Here’s an updated report on Llama 4 on Mac via MLX: https://twitter.com/ivanfioravanti/status/1908753109129494587
Llama-4 Scout on MLX and M3 Ultra tokens-per-sec / RAM
- 3bit: 52.924 / 47.261 GB
- 4bit: 46.942 / 60.732 GB
- 6bit: 36.260 / 87.729 GB
- 8bit: 30.353 / 114.617 GB
- fp16: 11.670 / 215.848 GB
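For reference, running one of those MLX quantizations locally is only a few lines with mlx-lm (the repo name below is a guess at the mlx-community naming pattern; check Hugging Face for the actual id):

```python
# Minimal sketch: generating with an MLX-quantized Llama 4 Scout on Apple silicon.
# The repo name is an assumption based on mlx-community's usual naming; verify it.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Explain mixture-of-experts in one paragraph.",
    max_tokens=200,
)
print(text)
```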
Yes, I have seen this upthread, thanks!
But the exact interpretation of those memory sizes could be make-or-break for running the 4-bit quantisation on some 64 GiB RAM systems.
Groq also has Scout running at 500 tk/s, as an alternative to OpenRouter.
Edit: ah I see you mention that in your notes
It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet. […] Llama 4 performs significantly better than Llama 3 and is comparable to Grok.
Oh, wow. Hidden in this major announcement is a dubious claim that looks specifically designed to please the current US government.
So now major US tech companies are taking extra steps to claim that their AI are “less biased” by explaining that they are doing just as well neutrality-wise as Elon Musk’s custom-made models. Welcome to the new world.
They really do lean that way (both with and without finetuning).
Now, as to why this is, and whether or not it’s a good thing…eh, that’s a different kettle of fish.
But, there’s nothing “dubious” about the claim: it’s decently well-studied (was so even before the current US situation).
From your first link
Examples of statements considered left-leaning include: “The government should heavily subsidize health care.” and “Paid family leave should be mandated by law to support working parents.” Examples of statements considered right-leaning include: “Private markets are still the best way to ensure affordable health care.” and “Paid family leave should be voluntary and determined by employers.”
and your second link
[we defined bias in terms of] ten political topics (Reproductive Rights, Immigration, Gun Control, Same Sex Marriage, Death Penalty, Climate Change, Drug Price Regularization, Public Education, Healthcare Reform, Social Media Regulation) … and four political events (Black Lives Matter, Hong Kong Protest, Liancourt Rocks dispute, Russia Ukraine war).
and your third link
[we use] political orientation tests as a systematic approach to quantify and categorize the political preferences embedded in LLMs
I don’t mean to be glib, but if this is how you define left vs. right, then of course any LLM worth its salt is gonna have a left-wing bias. Because, by these rules – as has been said many times, by folks much wiser than me – reality itself has a left-wing bias.
And from my comment, which you replied to:
Now, as to why this is, and whether or not it’s a good thing…eh, that’s a different kettle of fish.
The claim made was “it’s dubious that LLMs are biased”, I asserted “no, it isn’t dubious, this is pretty well researched”, and then you even agreed with my assertion that it wasn’t dubious:
then of course any LLM worth its salt is gonna have a left-wing bias
Again, I don’t make any claim either way about whether this is a good or a bad thing or what the bias should be – merely that it isn’t some brand-new concept spun out of sycophancy for the current administration.
I think @peterbourgon is making a distinction between social and statistical bias – that’s a little more evident if you quote their last statement in full:
if this is how you define left vs. right, then of course any LLM worth its salt is gonna have a left-wing bias. Because, by these rules – as has been said many times, by folks much wiser than me – reality itself has a left-wing bias.
A model that adequately reproduces statistical features of its training set isn’t biased in the statistical sense. That’s a feature: a statistical model that’s unresponsive to its population is bad by definition. There are populations for which bias is inherent and you can’t “sanitise” it out of the training data. For example, a model that would provide uniform responses, without regional or national bias, to a prompt asking for the names of the ten largest active volcanoes would literally be hallucinating. You can’t sanitise the volcanoes out of the Pacific Ring of Fire.
We don’t perceive that type of statistical feature as “bias”, with its negative connotation, because there’s no social or political controversy around that particular matter. There’s some taxonomic debate about “largeness” (by elevation? eruption volume?) but it’s hardly a social issue, and it’s trivially resolved by specifying the ordering criteria, in any case.
It’s a difficult discussion to have in more condensed terms because the terminology isn’t crystallised, either – I’m using this one because it’s the one with the most traction, but that’s not a given. Much of the literature on bias comes from social-science circles, so the two mechanisms get conflated.
That’s certainly not to say that training bias is universally good – a training set that’s unrepresentative of its population is just as bad as a statistical model that’s unresponsive to its population. So this is definitely a relevant discussion.
I don’t think the remark that this particular case “looks specifically designed to please the current US government” is unfair. The literature review in the last study you cited points out evidence of this sort of bias going back four years – knowledge of left-wing/right-wing bias is literally older than Llama itself. Surely Meta isn’t just learning about it; it’s obviously a big deal for them judging from their press release, so waiting until the fourth major release to do something about it rings a little hollow. Besides, left-leaning vs. right-leaning is hardly the only type of social bias documented in LLMs. It’s just the one that’s most politically significant at the moment, so it’s the one getting addressed.
That’s… by design, if you ask me, and IMHO not unfair, either. Meta is an American company, of course its policy and its software are shaped by American politics. This is how the software industry works. We may not like it, but it is political to some degree. I don’t want to debate the actual politics here but it would be wishful thinking to claim that politics has no bearing on the policy of large software companies.
The claim made was “it’s dubious that LLMs are biased”, I asserted “no, it isn’t dubious, this is pretty well researched”, and then you even agreed with my assertion that it wasn’t dubious
My point is that “bias” is not any kind of objective metric – especially not if the categorization of “bias” is assumed to be bimodal, i.e. left vs. right; and especially not if “bias” is assumed to be defined in terms of the current American political spectrum.
tl;dr: linked papers rely on subjective/arbitrary definitions of “bias” and in no way support conclusions like “LLMs are [politically] biased” in general
Hrm.
With respect to any multimodal models included in Llama 4, the rights granted under Section 1(a) of the Llama 4 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models.