Four times a day, a pipeline pulls in fresh articles from 11 major news outlets, runs them through a machine learning model that converts each article into a numerical representation of its meaning, and then groups articles about the same story together even when they use completely different wording. The result is an interactive 3D map where proximity reflects semantic similarity: stories about the same event cluster together, and different topics sit in different regions of the space.
Clicking a cluster opens a panel with every article covering that story, including which other outlets picked it up. A timeline slider lets you filter by date to watch how topics evolved across the week. Subscribers get a daily email digest of the top clusters each morning.
The whole system runs on free infrastructure with no backend, no database, and no recurring costs. GitHub Actions runs the pipeline, GitHub Pages serves the site, and the output is a single static JSON file that the browser loads directly.
Newsphere Live is the production successor to a static personal project that mapped 7,000 news articles from 2018 using sentence embeddings, UMAP, and HDBSCAN, built while studying Computational Social Science at Leiden University. You can explore the original at rahulrayy.github.io/Corpus-Map. That project demonstrated the ML pipeline worked. This one asks whether the same approach can produce something genuinely useful: a map of today's news that updates itself without any manual intervention.
The original project included quantitative evaluation of cluster quality using silhouette scores and topic coherence metrics across different parameter settings. The architecture of Newsphere Live reflects those findings directly. The HDBSCAN minimum cluster size fraction, the cosine similarity threshold for semantic deduplication, and the UMAP neighbourhood parameter were all informed by what worked and what did not in the static version. Newsphere Live is not a fresh experiment with these choices. It is the production implementation of the best configuration found during that evaluation.
The two projects share a frontend architecture (Three.js, static JSON) and the same core ML stack, but are otherwise built from scratch. The original remains unchanged as a standalone portfolio piece.
| Property | Newsphere (original) | Newsphere Live |
|---|---|---|
| Data | CC-News 2018, static | Live RSS feeds, rolling 7-day window |
| Embedding input | Titles only | Titles + description text |
| Deduplication | Exact title match | Normalised title + cosine similarity |
| Cluster labels | Plain TF-IDF | c-TF-IDF with substring dedup |
| Layout stability | N/A | Procrustes-aligned run-to-run |
| Pipeline execution | Manual, Colab | GitHub Actions, cron schedule |
| Update frequency | Never | Every 6 hours |
| Backend | None | None |
| Recurring cost | €0 | €0 |
The pipeline runs as a GitHub Actions workflow on a cron schedule, rebuilding every 6 hours
at 00:00, 06:00, 12:00, and 18:00 UTC. All steps are Python scripts. The only external runtime
dependency is the all-MiniLM-L6-v2 sentence-transformer model, which is cached
between runs using the Actions cache API to avoid re-downloading 90MB of weights on every run.
A typical run takes around 6 to 8 minutes end-to-end, well inside the 6-hour window between runs. Because the repository is public, the 2,000 minute/month cap on Actions does not apply, so all four daily runs are unmetered.
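The schedule and model caching described above map onto a workflow roughly like the following. This is a sketch, not the repository's actual workflow file: the step names, script path (`pipeline.py`), output path, and cache key are illustrative, and the Hugging Face cache location may differ by sentence-transformers version.

```yaml
on:
  schedule:
    - cron: "0 0,6,12,18 * * *"   # 00:00, 06:00, 12:00, 18:00 UTC
  workflow_dispatch: {}           # manual trigger for debugging failed runs

jobs:
  rebuild:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: actions/cache@v4    # keep ~90MB of MiniLM weights between runs
        with:
          path: ~/.cache/huggingface
          key: model-all-MiniLM-L6-v2
      - run: pip install -r requirements.txt
      - run: python pipeline.py   # fetch -> embed -> dedup -> cluster -> JSON
      - run: |                    # commit the regenerated data file
          git config user.name "actions-bot"
          git config user.email "actions-bot@users.noreply.github.com"
          git add data/latest.json
          git commit -m "scheduled rebuild" || echo "no changes"
          git push
```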
Data source: RSS over NewsAPI or GDELT
NewsAPI's free tier returns headlines only, with no description text, which degrades embedding quality
significantly. It also requires an API key, adding a secret management dependency and a single point of failure.
GDELT provides richer data, but its format requires substantial cleaning, an effort disproportionate to this project's scope.
RSS feeds are free, unauthenticated, provide full description text, and are supported by every major
English-language publisher natively. The feedparser library handles all eleven feeds
with a single function call each.
MiniLM over MPNet
all-mpnet-base-v2 produces marginally better embeddings but takes roughly 3x longer
to run. On the bootstrap corpus (~350 articles, run 1) the quality difference is not visible to a human.
On the full 7-day corpus (~2,500 articles) the difference matters more, but so does keeping each
run under an hour so that a failed run can be diagnosed and retried within a single working session
without waiting for the next scheduled run. MiniLM is the right call at this scale.
Two-pass deduplication
Cross-publisher coverage of the same event is the main data quality problem. Reuters'
"Fed holds rates steady" and the BBC's "Federal Reserve keeps interest rates unchanged"
are the same story and should collapse to one point on the map. A single exact-title pass
catches same-publisher re-posts but misses the main case entirely. The second pass runs
after embedding: articles with cosine similarity above 0.90 are collapsed, with the longer-description
version kept as canonical and an also_covered_by list appended. This list is surfaced
in the UI. The breadth of coverage across outlets is itself a useful signal worth surfacing.
Corpus-fraction HDBSCAN sizing
A fixed min_cluster_size breaks silently as corpus size varies run to run.
Setting it as a fraction of corpus size (2% of N, floored at 8) means the threshold scales
naturally. Small bootstrap runs still produce clusters, and large mature runs do not fragment
into noise.
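The sizing rule itself is one line; the parameter names below are illustrative:

```python
def corpus_min_cluster_size(n_articles, fraction=0.02, floor=8):
    """min_cluster_size as a fraction of corpus size, floored at 8."""
    return max(floor, int(fraction * n_articles))

# Bootstrap run (~350 articles): the floor applies.
assert corpus_min_cluster_size(350) == 8
# Mature 7-day corpus (~2,500 articles): 2% of N takes over.
assert corpus_min_cluster_size(2500) == 50
```

The resulting value would then be passed to `hdbscan.HDBSCAN(min_cluster_size=..., cluster_selection_method="eom")`, matching the EOM selection noted in the stack table.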
Procrustes alignment for layout stability
UMAP is stochastic and refits from scratch on every run. Without intervention, the 3D cloud flips, rotates, and rescales unpredictably between runs, which is disorienting for repeat visitors. After UMAP runs, the new coordinates are aligned to the previous run's via orthogonal Procrustes (reflection permitted, since UMAP has no preferred handedness) using URL-matched articles as anchors. The global frame stays consistent; cluster-internal shuffling is inherent to the method and cannot be fully suppressed without sacrificing UMAP quality.
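Assuming the matched articles are passed as index arrays into the two coordinate sets, the alignment step looks roughly like this (a sketch of orthogonal Procrustes via SVD, not the exact production code):

```python
import numpy as np

def procrustes_align(new_coords, prev_coords, new_anchor_idx, prev_anchor_idx):
    """Map this run's UMAP layout into the previous run's frame.

    Anchors are articles present in both runs (matched by URL). The
    rotation comes from the SVD of the anchors' cross-covariance; we do
    NOT force det(R) = +1, so reflections are permitted, since UMAP has
    no preferred handedness.
    """
    A = new_coords[new_anchor_idx]      # (k, 3) anchors, this run
    B = prev_coords[prev_anchor_idx]    # (k, 3) same anchors, previous run
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    U, _, Vt = np.linalg.svd((A - mu_a).T @ (B - mu_b))
    R = U @ Vt                          # orthogonal; det may be -1
    return (new_coords - mu_a) @ R + mu_b
```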
Every component was chosen with a hard constraint of no recurring infrastructure costs. This is not just a budget decision. It removes the risk of the project going dark because a free tier expires or a credit card lapses.
| Component | Service | Cost | Why not the obvious alternative |
|---|---|---|---|
| ML pipeline compute | GitHub Actions (public repo) | €0 / unlimited | Public repos are unmetered; private repos cap at 2,000 min/month |
| Static hosting | GitHub Pages | €0 | Netlify/Vercel free tiers have bandwidth caps |
| News data | RSS feeds | €0 / no auth | NewsAPI free tier: headlines only, key required; GDELT: cleaning overhead |
| Email delivery | Resend (free tier) | €0 / 100 emails/day | Raw Gmail SMTP: app passwords deprecated, lands in spam |
| Subscriber storage | Resend Audiences | €0 | Avoids storing email addresses in a public repository entirely |
| Bot protection | Honeypot + rate limiting | €0 | Turnstile was evaluated but had propagation issues; honeypot covers the main abuse vector |
| Subscribe endpoint | Cloudflare Workers (free tier) | €0 / 100K req/day | Required to keep the Resend API key off the client; no other serverless option at zero cost |
| Total monthly cost | | €0 | |
The Cloudflare Worker deserves elaboration. Resend's key permission model offers only
`full_access` or `sending_access`; there is no "write contacts only" scope.
This means no Resend key is safe to embed in client-side JavaScript on a public page.
The Worker holds the key as a server-side secret and is CORS-locked to the GitHub Pages origin.
An invisible honeypot field and a 3-second per-tab rate limit in the browser sit in front of it
as additional abuse-mitigation layers.
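Condensed into one illustrative function, the screening logic across these layers looks like this. This is a sketch only: in production the origin check lives in the Worker while the rate limit runs in the browser, and the origin constant is an assumption based on the Pages URL mentioned earlier.

```javascript
// Assumed GitHub Pages origin; the real Worker's allowlist may differ.
const ALLOWED_ORIGIN = "https://rahulrayy.github.io";

function screenSubscribe({ origin, honeypot, nowMs, lastSubmitMs }) {
  if (origin !== ALLOWED_ORIGIN) {
    return { ok: false, reason: "cors" };     // Worker is CORS-locked
  }
  if (honeypot && honeypot.length > 0) {
    return { ok: false, reason: "honeypot" }; // bots fill invisible fields
  }
  if (lastSubmitMs != null && nowMs - lastSubmitMs < 3000) {
    return { ok: false, reason: "rate" };     // 3-second per-tab limit
  }
  return { ok: true };                        // safe to call the Resend API
}
```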
Committing a regenerated JSON file four times a day does grow the repository's history over time. The eventual mitigations would be a git filter-repo squash of old rebuild commits, or migrating generated files to an orphan data branch. Neither is needed in v1.
| Layer | Technology |
|---|---|
| Embedding | sentence-transformers, all-MiniLM-L6-v2 |
| Dimensionality reduction | umap-learn 0.5.6, cosine metric, 3 components |
| Clustering | hdbscan 0.8.33, EOM selection, corpus-fraction min_size |
| Topic labelling | c-TF-IDF via sklearn CountVectorizer + substring dedup |
| Layout alignment | Orthogonal Procrustes (reflection permitted) via numpy SVD |
| RSS parsing | feedparser 6.0.11 |
| Email API | Resend HTTP API via requests |
| Visualization | Three.js r128, OrbitControls, canvas sprite texture |
| Subscribe endpoint | Cloudflare Worker (JavaScript), Wrangler deploy |
| Abuse protection | Honeypot field + browser-side rate limit |
| CI/CD | GitHub Actions, 6-hourly cron, model caching, git commit and push |
| Hosting | GitHub Pages (static, auto-deploy on push to main) |
| Dependency management | uv pip compile lockfile + Dependabot weekly PRs |