Self-Hosted AI news aggregator using Cloudflare Workers, Vectorize, and Nostr

2 points by delirehberi

I built this primarily to solve my own reading fatigue from jumping between HN, Lobsters, and Reddit, while keeping data completely within my own infrastructure.

Architecture & Storage Choices:

The entire stack is built to run on Cloudflare’s free and low-cost tiers, to keep self-hosting accessible.

Backend & API: Implemented using Hono on Workers.
Database & State: Cloudflare D1 for strict relational storage (sources, cron tracking, user state).
Vector Search: Cloudflare Vectorize stores the 768-dimension embeddings, which are generated on Workers AI with @cf/baai/bge-base-en-v1.5.
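
To make the embedding-to-index path concrete, here is a minimal sketch, not the repo's actual code. The binding names (AI, VECTORIZE), the Story shape, and the indexStories helper are assumptions for illustration. The run and upsert calls mirror the Workers AI and Vectorize bindings, but they are typed here with small local interfaces so the snippet stands alone.

```typescript
// Minimal structural types for the two bindings. In a real Worker these come
// from @cloudflare/workers-types; the binding names AI and VECTORIZE are assumed.
interface EmbeddingModel {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}
interface IndexedVector {
  id: string;
  values: number[];
  metadata?: Record<string, string>;
}
interface VectorStore {
  upsert(vectors: IndexedVector[]): Promise<unknown>;
}
interface Env {
  AI: EmbeddingModel;
  VECTORIZE: VectorStore;
}

// Hypothetical shape of one ingested story.
interface Story {
  id: string;
  title: string;
  url: string;
  source: string;
}

const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";
const EMBEDDING_DIMS = 768;

// Embed story titles in one batch and upsert them into the vector index.
// Returns the number of vectors written.
async function indexStories(env: Env, stories: Story[]): Promise<number> {
  if (stories.length === 0) return 0;

  const { data } = await env.AI.run(EMBEDDING_MODEL, {
    text: stories.map((s) => s.title),
  });

  const vectors: IndexedVector[] = stories.map((s, i) => {
    const values = data[i];
    // Guard against a model/index dimension mismatch before writing.
    if (!values || values.length !== EMBEDDING_DIMS) {
      throw new Error(`unexpected embedding for story ${s.id}`);
    }
    return { id: s.id, values, metadata: { url: s.url, source: s.source } };
  });

  await env.VECTORIZE.upsert(vectors);
  return vectors.length;
}
```

Using the story id as the vector id means a re-run of the cron job overwrites existing entries instead of duplicating them.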

Ingestion Details:

For Lobsters specifically, the background cron job polls the .json endpoints rather than scraping raw HTML. To bootstrap historical preferences, I added endpoints that ingest past JSON or RSS activity exports, so the cosine similarity calculation has a baseline vector profile to match against.
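
As an illustration of what "baseline vector profile" can mean, the sketch below averages the embeddings of past activity into a single profile vector and ranks new stories by cosine similarity against it. Averaging is one simple choice and an assumption here, not necessarily what the repo does. The function names are hypothetical, and an index configured with a cosine metric would normally do the scoring itself at query time.

```typescript
// Average a set of equal-length embeddings into one profile vector.
function meanVector(vectors: number[][]): number[] {
  const first = vectors[0];
  if (first === undefined) throw new Error("meanVector needs at least one vector");
  const sum: number[] = new Array<number>(first.length).fill(0);
  for (const v of vectors) {
    if (v.length !== first.length) throw new Error("dimension mismatch");
    v.forEach((x, i) => {
      sum[i] = (sum[i] ?? 0) + x;
    });
  }
  return sum.map((x) => x / vectors.length);
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Sort candidates so the ones closest to the profile come first.
function rankByProfile<T extends { embedding: number[] }>(
  profile: number[],
  candidates: T[],
): T[] {
  return [...candidates].sort(
    (x, y) =>
      cosineSimilarity(profile, y.embedding) -
      cosineSimilarity(profile, x.embedding),
  );
}
```

Without imported history the profile is empty and there is nothing to compare against, which is why the import endpoints matter for a fresh install.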

Current Constraints & Trade-offs:

Model Choice: I opted for bge-base-en-v1.5 because it runs entirely inside Workers AI, with no external service to call or wait on. I plan to try better-performing embedding models if they become available natively on the edge.
Nostr Identity: Instead of inventing a custom OAuth or JWT layer, I used NIP-07 (browser extensions like Alby) for user authentication. Since every user already has a Nostr key, it is trivial to let followers re-publish your curated feed back to public relays.
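
For readers unfamiliar with NIP-07, here is a rough sketch of what a login handshake can look like on the client. NIP-07 itself only defines the signer interface that extensions expose as window.nostr (getPublicKey and signEvent). The event format below follows NIP-98 HTTP Auth (kind 27235 with u and method tags), which is an assumption on my part for illustration, not something the post above specifies. The helper names are hypothetical.

```typescript
// Event shapes as used by Nostr clients.
interface UnsignedEvent {
  kind: number;
  created_at: number; // Unix time in seconds
  tags: string[][];
  content: string;
}
interface SignedEvent extends UnsignedEvent {
  id: string;
  pubkey: string;
  sig: string;
}

// The subset of the NIP-07 interface used here. In a browser with an
// extension such as Alby installed, this object is window.nostr.
interface Nip07Signer {
  getPublicKey(): Promise<string>;
  signEvent(event: UnsignedEvent): Promise<SignedEvent>;
}

// Build a NIP-98 style auth event bound to one URL and HTTP method.
// Assumption: the server accepts NIP-98; the original post does not say so.
function buildAuthEvent(
  url: string,
  method: string,
  nowMs: number = Date.now(),
): UnsignedEvent {
  return {
    kind: 27235,
    created_at: Math.floor(nowMs / 1000),
    tags: [
      ["u", url],
      ["method", method.toUpperCase()],
    ],
    content: "",
  };
}

// Ask the extension to sign the auth event. The user approves this in the
// extension UI; the private key never reaches the page.
async function signLogin(
  signer: Nip07Signer,
  url: string,
  method: string,
): Promise<SignedEvent> {
  return signer.signEvent(buildAuthEvent(url, method));
}
```

On the server side, the Worker would still have to verify the event id and signature, check that created_at is recent, and confirm that the u and method tags match the incoming request before trusting the pubkey.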

The repo includes a Makefile that handles full remote provisioning (make db-init, make vectorize-init, make deploy) so you don't have to navigate the Cloudflare UI manually.

Would love any feedback on the vector indexing strategy at the edge, or how people are handling semantic search history pruning over time.