Using ‘Slop Forensics’ to Determine Model Ancestry
17 points by emschwartz
If you dismiss this because you’re not interested in LLMs, you might miss a pretty cool statistical idea!
Combining stylometry with phylogeny to draw trees is clever! Using it to reveal the phylogeny of synthetic training data is pretty neat, but I want to see the same technique used on real literature. Can we draw a phylogeny (maybe a more complex graph) of influential authors? Genres?
I’d like to see, or maybe I’ll do, a sort of network analysis of literature. Maybe some statistical methods can quantify the impact of Poe and Arthur Conan Doyle on mystery fiction. I don’t know, it looks like there’s a lot of potential for some novel digital humanities work on quantitative analysis of literary genres & influences.
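To make that concrete, here’s a rough sketch of the kind of thing I mean (the authors and snippets are placeholders, and I’m using plain hierarchical clustering as a crude stand-in for a proper phylogenetic method like neighbor joining): represent each author by stylometric features, compute pairwise distances, and draw a tree.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus: in practice these would be full texts per author.
texts = {
    "Poe": "once upon a midnight dreary, while I pondered, weak and weary",
    "Doyle": "to Sherlock Holmes she is always the woman",
    "Christie": "Hercule Poirot sat in the morning room and considered the letter",
}

# Character trigram frequencies are a common stylometric feature set;
# function-word counts or word n-grams would work just as well.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(list(texts.values())).toarray()

# Pairwise cosine distances between style vectors, then an average-linkage
# tree as a rough approximation of a phylogenetic reconstruction.
distances = pdist(X, metric="cosine")
tree = linkage(distances, method="average")
dendrogram(tree, labels=list(texts.keys()))
plt.show()
```

With real corpora (whole novels, many authors per genre) and a proper phylogenetics toolkit instead of the clustering shortcut, you’d start to get trees you could actually argue about.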
Perhaps, but (asking as someone who spent exactly 60 seconds studying this field) are all the influences reducible to factors that are in the text? I suppose this is easier to answer for genetics than for cultural artifacts like fiction.
As someone who hasn’t even spent 60 seconds studying this field: yes!
Words don’t have intrinsic meanings, and between two individuals the interpretation of a word or phrase (the intersection of their understandings of its use) may be empty! Between any two individuals, there are certainly many words for which it is!
Now to go read the article …
After LLMs, everyone is a postmodern literary critic. There’s no objective truth, just Text.
I was a postmodern literary critic before LLMs as well, but I generally expected silence or a nice discussion, not dismissal and derision.
On second thought, it’s probably time to get out of here.
I’ve been on lobste.rs since hearing about it on the IRC at Railsconf 2012, but it’s time to move on.
Please DM me if you know of a better community of early adopters / more open minded people.
That article was fascinating… If these models are trading off ‘learning from each other’ under human guidance, at what point will we expect a model with an understanding of at least one concept which is completely divorced from any human understanding? Does it even require multiple ‘generations’?
Is there some kind of… common novelty metric in genetics which would fit this definition when applied to semantics? (i.e., a novel gene with no obvious ancestors, corresponding to a novel phrase which an LLM uses frequently but which, statistically, did not exist in the training data)
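Something like this, maybe (purely a sketch of the idea; the corpora, n-gram size, and threshold are made-up assumptions, not anything from the article): count n-grams in the model’s output and in a reference corpus, and flag the ones the model repeats that the reference never contains.

```python
from collections import Counter

def ngrams(tokens, n=3):
    # Yield overlapping word n-grams as tuples.
    return zip(*(tokens[i:] for i in range(n)))

def novel_phrases(model_text, reference_text, n=3, min_count=3):
    model_counts = Counter(ngrams(model_text.lower().split(), n))
    ref_counts = Counter(ngrams(reference_text.lower().split(), n))
    # "Novel" here just means frequent in the model output and absent from the
    # reference; a real metric would normalize by corpus size and smooth counts.
    return [(" ".join(gram), count)
            for gram, count in model_counts.most_common()
            if count >= min_count and gram not in ref_counts]
```

A real metric would need the kind of frequency normalization and smoothing population genetics uses for allele frequencies, but the shape of the comparison seems similar.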
at what point will we expect a model with an understanding of at least one concept which is completely divorced from any human understanding
When there’s something there that can contribute some understanding, based on its experience of something other than text, i.e. not any time soon.
Unless you want to set a different bar for “understanding”, in which case it happened a long time ago with SolidGoldMagikarp.
Where should we set the bar in the first place?
I’m 35, so my definition of ‘a long time ago’ is probably different than yours, and I find it’s very helpful to set the goalposts in advance so that when we inevitably blow through them we can say, ‘well, there it went’ instead of ‘herp, well, we should just ignore the last 60 years of comp-sci/philosophy/ethics papers that discuss the Turing test’.