defuddle: Get the main content of any page as Markdown
11 points by danlamanna
HTML has built-in functionality for accessibility & semantic data (such as microdata) & we are throwing it out in favor of something much less semantic/expressive. For the past 12 years, if I wanted some of the noise gone & something plaintext-readable without regard for the semantics, I would just dump the page from w3m. What’s the appeal?
To me, this looks like a way to prepare a website for ingestion into an LLM without spending tokens on "cleaning" it.
If developers stuck to semantic markup instead of Tailwind class soup, there wouldn’t be such an explosion of token usage, and the semantics would be retained. It all seems so backwards: all these folks taking shortcuts have brought us to a place where we keep trying to code our way around what should have been simple design to begin with, instead of piles of div soup with 8000 classes that are effectively inline styles. Even if the use case is LLMs, w3m’s output would have been good enough. With the barrier so low that you can just touch CLAUDE.md and tell the machine to build it, skipping the first step of asking whether this has been done already seems pretty bad to me.