Using ‘Slop Forensics’ to Determine Model Ancestry
17 points by emschwartz
If you dismiss this because you’re not interested in LLMs, you might miss a pretty cool statistical idea!
Combining stylometry with phylogeny to draw trees is clever! Using it to reveal the phylogeny of synthetic training data is pretty neat, but I want to see the same technique used on real literature. Can we draw a phylogeny (maybe a more complex graph) of influential authors? Genres?
I’d like to see, or maybe I’ll do, a sort of network analysis of literature. Maybe some statistical methods can quantify the impact of Poe and Arthur Conan Doyle on mystery fiction. I don’t know, it looks like there’s a lot of potential for some novel digital humanities work on quantitative analysis of literary genres & influences.
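To make that concrete, here’s a rough sketch of the kind of thing I mean (the authors and snippets are placeholders, and I’m using plain hierarchical clustering as a crude stand-in for a proper phylogenetic method like neighbor joining): represent each author by stylometric features, compute pairwise distances, and draw a tree.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus: in practice these would be full texts per author.
texts = {
    "Poe": "once upon a midnight dreary, while I pondered, weak and weary",
    "Doyle": "to Sherlock Holmes she is always the woman",
    "Christie": "Hercule Poirot sat in the morning room and considered the letter",
}

# Character trigram frequencies are a common stylometric feature set;
# function-word counts or word n-grams would work just as well.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(list(texts.values())).toarray()

# Pairwise cosine distances between style vectors, then an average-linkage
# tree as a rough approximation of a phylogenetic reconstruction.
distances = pdist(X, metric="cosine")
tree = linkage(distances, method="average")
dendrogram(tree, labels=list(texts.keys()))
plt.show()
```

With real corpora (whole novels, many authors per genre) and a proper phylogenetics toolkit instead of the clustering shortcut, you’d start to get trees you could actually argue about.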
Perhaps, but (asking as someone who spent exactly 60 seconds studying this field) are all the influences reducible to factors that are in the text? I suppose this is easier to answer for genetics than for cultural artifacts like fiction.
As someone who hasn’t even spent 60 seconds studying this field: yes!
Words don’t have intrinsic meanings, and between two individuals the interpretation of a word or phrase (the intersection of their understandings of its use) may be empty! Between any two individuals, there are certainly many words for which it is!
Now to go read the article …
After LLMs, everyone is a postmodern literary critic. There’s no objective truth, just Text.
I was a postmodern literary critic before LLMs as well, but I generally expected silence or a nice discussion, not dismissal and derision.
On second thought, it’s probably time to get out of here.
I’ve been on lobste.rs since hearing about it on the IRC at Railsconf 2012, but it’s time to move on.
Please DM me if you know of a better community of early adopters / more open minded people.
That article was fascinating… If these models are trading off ‘learning from each other’ under human guidance, at what point will we expect a model with an understanding of at least one concept which is completely divorced from any human understanding? Does it even require multiple ‘generations’?
Is there some kind of… common novelty metric in genetics which would fit this definition when applied to semantics? (i.e., a novel gene with no obvious ancestors, corresponding to a novel phrase which an LLM uses frequently but which, statistically, did not exist in the training data)
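Something like this, maybe (purely a sketch of the idea; the corpora, n-gram size, and threshold are made-up assumptions, not anything from the article): count n-grams in the model’s output and in a reference corpus, and flag the ones the model repeats that the reference never contains.

```python
from collections import Counter

def ngrams(tokens, n=3):
    # Yield overlapping word n-grams as tuples.
    return zip(*(tokens[i:] for i in range(n)))

def novel_phrases(model_text, reference_text, n=3, min_count=3):
    model_counts = Counter(ngrams(model_text.lower().split(), n))
    ref_counts = Counter(ngrams(reference_text.lower().split(), n))
    # "Novel" here just means frequent in the model output and absent from the
    # reference; a real metric would normalize by corpus size and smooth counts.
    return [(" ".join(gram), count)
            for gram, count in model_counts.most_common()
            if count >= min_count and gram not in ref_counts]
```

A real metric would need the kind of frequency normalization and smoothing population genetics uses for allele frequencies, but the shape of the comparison seems similar.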
at what point will we expect a model with an understanding of at least one concept which is completely divorced from any human understanding
When there’s something there that can contribute some understanding, based on its experience of something other than text, i.e. not any time soon.
Unless you want to set a different bar for “understanding”, in which case it happened a long time ago with SolidGoldMagikarp.
Where should we set the bar in the first place?
I’m 35, so my definition of ‘a long time ago’ is probably different than yours, and I find it’s very helpful to set the goalposts in advance so that when we inevitably blow through them we can say, ‘well, there it went’ instead of ‘herp, well, we should just ignore the last 60 years of comp-sci/philosophy/ethics papers that discuss the Turing test’.