Newsphere Live

A self-updating semantic map of today's news, built with zero recurring costs
TL;DR

Four times a day, a pipeline pulls in fresh articles from 11 major news outlets, runs them through a machine learning model that converts each article into a numerical representation of its meaning, and groups articles about the same story together even when they use completely different wording. The result is an interactive 3D map where proximity reflects semantic similarity: stories about the same event cluster together, and different topics sit in different regions of the space.

Clicking a cluster opens a panel with every article covering that story, including which other outlets picked it up. A timeline slider lets you filter by date to watch how topics evolved across the week. Subscribers get a daily email digest of the top clusters each morning.

The whole system runs on free infrastructure with no backend, no database, and no recurring costs. GitHub Actions runs the pipeline, GitHub Pages serves the site, and the output is a single static JSON file that the browser loads directly.

Background

Newsphere Live is the production successor to a static personal project that mapped 7,000 news articles from 2018 using sentence embeddings, UMAP, and HDBSCAN, built while studying Computational Social Science at Leiden University. You can explore the original at rahulrayy.github.io/Corpus-Map. That project demonstrated the ML pipeline worked. This one asks whether the same approach can produce something genuinely useful: a map of today's news that updates itself without any manual intervention.

The original project included quantitative evaluation of cluster quality using silhouette scores and topic coherence metrics across different parameter settings. The architecture of Newsphere Live reflects those findings directly. The HDBSCAN minimum cluster size fraction, the cosine similarity threshold for semantic deduplication, and the UMAP neighbourhood parameter were all informed by what worked and what did not in the static version. Newsphere Live is not a fresh experiment with these choices. It is the production implementation of the best configuration found during that evaluation.

The two projects share a frontend architecture (Three.js, static JSON) and the same core ML stack, but are otherwise built from scratch. The original remains unchanged as a standalone portfolio piece.

| Property | Newsphere (original) | Newsphere Live |
| --- | --- | --- |
| Data | CC-News 2018, static | Live RSS feeds, rolling 7-day window |
| Embedding input | Titles only | Titles + description text |
| Deduplication | Exact title match | Normalised title + cosine similarity |
| Cluster labels | Plain TF-IDF | c-TF-IDF with substring dedup |
| Layout stability | N/A | Procrustes-aligned run-to-run |
| Pipeline execution | Manual, Colab | GitHub Actions, cron schedule |
| Update frequency | Never | Every 6 hours |
| Backend | None | None |
| Recurring cost | €0 | €0 |

Pipeline Architecture

The pipeline runs as a GitHub Actions workflow on a cron schedule, rebuilding every 6 hours at 00:00, 06:00, 12:00, and 18:00 UTC. All steps are Python scripts. The only external runtime dependency is the all-MiniLM-L6-v2 sentence-transformer model, which is cached between runs using the Actions cache API to avoid re-downloading 90MB of weights on every run.

1. fetch_articles.py: parse 11 RSS feeds, normalised-title dedup, merge into rolling store
2. embed.py: encode title + description with MiniLM, output embeddings.npy
3. cluster.py: semantic dedup, UMAP 3D, Procrustes align, HDBSCAN, c-TF-IDF labels
4. diff.py: URL-keyed diff against previous run's output
5. validate.py: adaptive quality checks, exit(1) on failure to block bad commits
6. digest.py: generate digest.html, send Resend email to subscribers
7. git commit + push only if validate passes; GitHub Pages redeploys automatically
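
A minimal sketch of how such a step sequence can be orchestrated, assuming each stage is invoked as a standalone script. The runner itself is illustrative; only the script names and the exit(1) contract come from the list above:

```python
import subprocess
import sys

# Pipeline stages in execution order. Each is an independent script,
# so a failure at any stage aborts the run before anything is committed.
STAGES = [
    "fetch_articles.py",
    "embed.py",
    "cluster.py",
    "diff.py",
    "validate.py",  # exits non-zero on a quality failure, blocking the commit
    "digest.py",
]

def run_pipeline() -> None:
    for stage in STAGES:
        result = subprocess.run([sys.executable, stage])
        if result.returncode != 0:
            # Propagate the failure so the Actions job is marked red
            # and the commit + push step never runs.
            sys.exit(result.returncode)

if __name__ == "__main__":
    run_pipeline()
```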

A typical run takes around 6 to 8 minutes end-to-end, well inside the 6-hour window between runs. Because the repository is public, the 2,000 minute/month cap on Actions does not apply, so all four daily runs are unmetered.

Key Engineering Decisions

Data source: RSS over NewsAPI or GDELT

NewsAPI's free tier returns headlines only, with no description text, which degrades embedding quality significantly. It also requires an API key, adding a secret-management dependency and a single point of failure. GDELT provides richer data, but its format requires substantial cleaning that would have been disproportionate to the project's scope. RSS feeds are free, unauthenticated, provide full description text, and are natively supported by every major English-language publisher. The feedparser library handles all eleven feeds with a single function call each.
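
A sketch of the fetch step under these constraints. The feedparser calls are the library's actual API; the feed URLs and the normalisation rule are illustrative, not the project's exact list:

```python
import re
import feedparser

# Illustrative subset of the 11 feeds; URLs are examples only.
FEEDS = [
    "https://feeds.bbci.co.uk/news/rss.xml",
    "https://www.theguardian.com/world/rss",
]

def normalise_title(title: str) -> str:
    # Lowercase and strip punctuation so trivial re-posts collapse.
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def fetch_articles() -> list[dict]:
    seen, articles = set(), []
    for url in FEEDS:
        feed = feedparser.parse(url)  # one call per feed
        for entry in feed.entries:
            key = normalise_title(entry.get("title", ""))
            if not key or key in seen:
                continue  # first-pass dedup: normalised title
            seen.add(key)
            articles.append({
                "title": entry.get("title", ""),
                "description": entry.get("description", ""),
                "url": entry.get("link", ""),
                "published": entry.get("published", ""),
            })
    return articles
```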

MiniLM over MPNet

all-mpnet-base-v2 produces marginally better embeddings but takes roughly 3x longer to run. On the bootstrap corpus (~350 articles, run 1) the quality difference is not visible to a human. On the full 7-day corpus (~2,500 articles) the difference matters more, but so does keeping each run under an hour so that a failed run can be diagnosed and retried within a single working session without waiting for the next scheduled run. MiniLM is the right call at this scale.
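
The embedding step reduces to a few lines of sentence-transformers. The model name comes from the stack above; batch size and normalisation are illustrative choices:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# ~90MB of weights, cached between Actions runs so the download happens once.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_articles(articles: list[dict]) -> np.ndarray:
    # Title + description gives the model more signal than headlines alone.
    texts = [f"{a['title']}. {a['description']}" for a in articles]
    # Unit-normalised vectors make cosine similarity a plain dot product later.
    return model.encode(texts, batch_size=64, normalize_embeddings=True)

# embeddings.npy is the artifact consumed by cluster.py
# np.save("embeddings.npy", embed_articles(articles))
```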

Two-pass deduplication

Cross-publisher coverage of the same event is the main data quality problem. Reuters' "Fed holds rates steady" and the BBC's "Federal Reserve keeps interest rates unchanged" are the same story and should collapse to one point on the map. A single exact-title pass catches same-publisher re-posts but misses the main case entirely. The second pass runs after embedding: articles with cosine similarity above 0.90 are collapsed, with the longer-description version kept as canonical and an also_covered_by list appended. This list is surfaced in the UI. The breadth of coverage across outlets is itself a useful signal worth surfacing.
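
A sketch of the second pass, assuming unit-normalised embeddings (so cosine similarity is a dot product). The 0.90 threshold and the longest-description-wins rule are from the description above; the greedy structure is illustrative:

```python
import numpy as np

SIM_THRESHOLD = 0.90

def semantic_dedup(articles: list[dict], embeddings: np.ndarray):
    """Collapse cross-publisher duplicates; keep the longest description as canonical."""
    sims = embeddings @ embeddings.T  # pairwise cosine (unit-normalised rows)
    # Visit articles longest-description first, so the richest copy is canonical.
    order = sorted(range(len(articles)),
                   key=lambda i: len(articles[i]["description"]), reverse=True)
    assigned, kept = set(), []
    for i in order:
        if i in assigned:
            continue
        assigned.add(i)
        articles[i]["also_covered_by"] = []
        for j in order:
            if j not in assigned and sims[i, j] > SIM_THRESHOLD:
                # The real list surfaces the covering outlets in the UI.
                articles[i]["also_covered_by"].append(articles[j]["url"])
                assigned.add(j)
        kept.append(i)
    return [articles[i] for i in kept], embeddings[kept]
```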

Corpus-fraction HDBSCAN sizing

A fixed min_cluster_size breaks silently as corpus size varies run to run. Setting it as a fraction of corpus size (2% of N, floored at 8) means the threshold scales naturally. Small bootstrap runs still produce clusters, and large mature runs do not fragment into noise.
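
The sizing rule in code; the 2% fraction and floor of 8 are the values above, and the EOM selection method matches the stack table:

```python
import hdbscan
import numpy as np

def cluster(coords_3d: np.ndarray) -> np.ndarray:
    n = len(coords_3d)
    # Scale the threshold with corpus size: 2% of N, floored at 8. A ~350-article
    # bootstrap run gets min_size 8; a ~2,500-article mature run gets 50.
    min_size = max(8, int(0.02 * n))
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_size,
        cluster_selection_method="eom",  # excess-of-mass selection
    )
    return clusterer.fit_predict(coords_3d)  # -1 marks noise points
```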

Procrustes alignment for layout stability

UMAP is stochastic and refits from scratch on every run. Without intervention, the 3D cloud flips, rotates, and rescales unpredictably between runs, which is disorienting for repeat visitors. After UMAP runs, the new coordinates are aligned to the previous run's via orthogonal Procrustes (reflection permitted, since UMAP has no preferred handedness) using URL-matched articles as anchors. The global frame stays consistent; cluster-internal shuffling is inherent to the method and cannot be fully suppressed without sacrificing UMAP quality.
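
A sketch of the alignment step via numpy SVD, with reflection permitted (no determinant correction, since UMAP has no preferred handedness). It assumes the URL-matched anchor arrays are row-aligned; the helper name is illustrative:

```python
import numpy as np

def procrustes_align(prev_anchors: np.ndarray,
                     new_anchors: np.ndarray,
                     new_coords: np.ndarray) -> np.ndarray:
    """Rotate (and possibly reflect) new_coords onto the previous run's frame.

    prev_anchors, new_anchors: (k, 3) coordinates of URL-matched articles.
    new_coords: (n, 3) full new layout to transform.
    """
    # Centre both anchor sets so the rotation is estimated about the origin.
    prev_c = prev_anchors - prev_anchors.mean(axis=0)
    new_c = new_anchors - new_anchors.mean(axis=0)
    # Orthogonal Procrustes: R minimises ||new_c @ R - prev_c||_F over
    # orthogonal R, solved by the SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(new_c.T @ prev_c)
    r = u @ vt  # reflection permitted: no sign flip on the last singular vector
    return (new_coords - new_anchors.mean(axis=0)) @ r + prev_anchors.mean(axis=0)
```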

Zero-Cost Architecture

Every component was chosen with a hard constraint of no recurring infrastructure costs. This is not just a budget decision. It removes the risk of the project going dark because a free tier expires or a credit card lapses.

| Component | Service | Cost | Why not the obvious alternative |
| --- | --- | --- | --- |
| ML pipeline compute | GitHub Actions (public repo) | €0 / unlimited | Public repos are unmetered; private repos cap at 2,000 min/month |
| Static hosting | GitHub Pages | €0 | Netlify/Vercel free tiers have bandwidth caps |
| News data | RSS feeds | €0 / no auth | NewsAPI free tier: headlines only, key required; GDELT: cleaning overhead |
| Email delivery | Resend (free tier) | €0 / 100 emails/day | Raw Gmail SMTP: app passwords deprecated, lands in spam |
| Subscriber storage | Resend Audiences | €0 | Avoids storing email addresses in a public repository entirely |
| Bot protection | Honeypot + rate limiting | €0 | Turnstile was evaluated but had propagation issues; honeypot covers the main abuse vector |
| Subscribe endpoint | Cloudflare Workers (free tier) | €0 / 100K req/day | Required to keep the Resend API key off the client; no other serverless option at zero cost |
| **Total monthly cost** | | **€0** | |

The Cloudflare Worker deserves elaboration. Resend's key permission model offers only full_access or sending_access. There is no "write contacts only" scope. This means no Resend key is safe to embed in client-side JavaScript on a public page. The Worker holds the key as a server-side secret and is CORS-locked to the GitHub Pages origin. An invisible honeypot field and a 3-second per-tab rate limit in the browser sit in front of it as additional abuse-mitigation layers.
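
The Worker itself is JavaScript, but its job is small enough to express as an equivalent Python sketch for clarity. The contacts endpoint is Resend's documented Audiences API; the audience ID is a placeholder:

```python
import requests

RESEND_API_KEY = "..."            # server-side secret; never shipped to the client
AUDIENCE_ID = "YOUR_AUDIENCE_ID"  # placeholder

def subscribe(email: str) -> bool:
    """Add a subscriber to the Resend Audience (what the Worker does in JS)."""
    resp = requests.post(
        f"https://api.resend.com/audiences/{AUDIENCE_ID}/contacts",
        headers={"Authorization": f"Bearer {RESEND_API_KEY}"},
        json={"email": email, "unsubscribed": False},
        timeout=10,
    )
    return resp.ok
```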

Known Limitations and Honest Tradeoffs

Limitation: UMAP cluster drift. Procrustes alignment locks the global orientation but cannot prevent clusters from shuffling positions relative to each other when UMAP's internal graph changes between runs. Over long horizons, alignment error accumulates roughly as ε√n over n days. With ε around 1.0 in UMAP units, that is around 4 to 5 units of drift over a month: visible but tolerable, and logged daily for monitoring.

Limitation: RSS sampling gap. Feeds are fetched every 6 hours. High-volume publishers rotate items faster than this window, so some articles that appeared briefly in a feed between fetches are missed. Hourly fetching would close most of this gap but was left for a later iteration: it roughly doubles pipeline complexity without meaningfully improving cluster quality at current scale.

Tradeoff: Resend free tier caps at 100 subscribers. The digest sends to each subscriber individually, so the 100 emails/day free tier supports at most 100 subscribers before a paid plan is needed. At current scale this is not a constraint (see the sketch after this list).

Tradeoff: Repository growth. The pipeline commits approximately 1.3MB of compressed delta per run, around 2GB per year at 4 runs per day. The fix, when needed, is a one-off git filter-repo squash of old rebuild commits, or migrating generated files to an orphan data branch. Neither is needed in v1.

Tradeoff: Scheduled workflow auto-disable. GitHub disables scheduled workflows in public repositories after 60 days without a commit. The pipeline avoids this implicitly: every successful run commits updated data, resetting the clock. A failure notification step catches silent pipeline failures within 6 hours, long before 60 days of consecutive failures become a risk.

Full Stack

| Layer | Technology |
| --- | --- |
| Embedding | sentence-transformers, all-MiniLM-L6-v2 |
| Dimensionality reduction | umap-learn 0.5.6, cosine metric, 3 components |
| Clustering | hdbscan 0.8.33, EOM selection, corpus-fraction min_size |
| Topic labelling | c-TF-IDF via sklearn CountVectorizer + substring dedup |
| Layout alignment | Orthogonal Procrustes (reflection permitted) via numpy SVD |
| RSS parsing | feedparser 6.0.11 |
| Email API | Resend HTTP API via requests |
| Visualization | Three.js r128, OrbitControls, canvas sprite texture |
| Subscribe endpoint | Cloudflare Worker (JavaScript), Wrangler deploy |
| Abuse protection | Honeypot field + browser-side rate limit |
| CI/CD | GitHub Actions, 6-hourly cron, model caching, git commit and push |
| Hosting | GitHub Pages (static, auto-deploy on push to main) |
| Dependency management | uv pip compile lockfile + Dependabot weekly PRs |