Standard keyword search treats words as discrete symbols, which means that two articles covering the same event can fail to match simply because they use different vocabulary. A piece about "Federal Reserve interest rate policy" and one about "Fed hiking borrowing costs" are semantically identical but lexically distant. Semantic embedding models address this by mapping text into a continuous vector space where proximity reflects meaning rather than surface form.
This project asks a straightforward question: if a large news corpus is embedded into that space and left alone, does it naturally organise itself into interpretable topics? The answer, it turns out, is yes. Without providing a single label or category, the pipeline recovers coherent clusters corresponding to events including the Cambridge Analytica data scandal, the US-China trade war, the 2018 FIFA World Cup, and 24 other distinct topics, all from the geometric structure of the embeddings alone.
Beyond the clustering result itself, the project explores a practical deployment question: can a full ML pipeline, from raw data to interactive 3D visualisation, be hosted entirely as a static site with no backend? The answer is also yes, and the approach generalises to any text corpus.
Articles were drawn from the CC-News dataset, a large crawl of English-language news articles made available via HuggingFace Datasets. The dataset was streamed directly rather than downloaded in full, with the first 20,000 articles fetched and then filtered to a single calendar year to produce a temporally coherent corpus.
Articles were filtered to the period January to July 2018, retaining only those with a valid publication date and a title of more than 20 characters. Duplicate titles were removed. The resulting corpus contains 7,000 articles from a wide range of English-language publishers, with publication dates spanning approximately six months. This time window was chosen deliberately: 2018 was a particularly eventful year for news, containing several well-defined events such as the Cambridge Analytica story breaking in March, the Avengers: Infinity War release in April, and the FIFA World Cup beginning in June. This makes it a good corpus for evaluating whether the clustering method can recover known real-world events.
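The streaming-and-filter step can be sketched as follows. The filter rules (valid date in January to July 2018, title longer than 20 characters, duplicate titles dropped) come from the text; the dataset id `cc_news`, the field names `date`/`title`, and a datetime-typed date field are assumptions about the HuggingFace schema rather than details taken from the project:

```python
from datetime import date

# Filter rules from the text: a valid date in Jan-Jul 2018 and a title
# longer than 20 characters; duplicate titles are rejected via seen_titles.
def keep(article, seen_titles):
    d = article.get("date")
    title = (article.get("title") or "").strip()
    if not d or not (date(2018, 1, 1) <= d <= date(2018, 7, 31)):
        return False
    if len(title) <= 20 or title in seen_titles:
        return False
    seen_titles.add(title)
    return True

# Streaming sketch (commented out so the example stays self-contained;
# assumes the "cc_news" dataset id on the HuggingFace Hub):
#   from datasets import load_dataset
#   stream = load_dataset("cc_news", split="train", streaming=True)
#   seen, corpus = set(), []
#   for i, article in enumerate(stream):
#       if i >= 20_000:
#           break
#       if keep(article, seen):
#           corpus.append(article)
```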
Only article titles were used for embedding rather than full article text. This was a deliberate design choice for two reasons. First, titles are more consistent in length and register across different publishers, reducing the risk that embedding quality varies by source. Second, titles carry sufficient semantic signal to separate topics cleanly in practice, and embedding 7,000 full articles would have been prohibitively slow on consumer hardware.
Each article title was embedded using all-MiniLM-L6-v2 from the
sentence-transformers library (Reimers and Gurevych, 2019). This model produces
384-dimensional dense vectors and was chosen as a balance between embedding quality
and inference speed. On a standard CPU, the full corpus of 7,000 titles was embedded
in approximately eight minutes using batched inference with a batch size of 64.
The resulting embedding matrix has shape (7000, 384). Each row is a point in a 384-dimensional semantic space where cosine distance approximates semantic dissimilarity.
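A minimal sketch of the embedding step, assuming the standard sentence-transformers API; the import is deferred inside the helper so the cosine-distance demonstration runs without downloading the model:

```python
import numpy as np

def embed_titles(titles, batch_size=64):
    """Embed a list of titles with all-MiniLM-L6-v2 (384-dim vectors).
    Import is deferred so the rest of this sketch runs without the model."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(titles, batch_size=batch_size)

def cosine_distance(a, b):
    """Cosine distance: 0 for identical direction, up to 2 for opposite."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```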
The 384-dimensional embeddings were reduced to three dimensions using UMAP (McInnes et al., 2018) with cosine distance as the metric. Three dimensions were chosen over the more common two for a principled reason: the trustworthiness evaluation in Section 4 confirms that 3D UMAP preserves more local neighbourhood structure from the original embedding space than 2D UMAP or t-SNE, which justifies the additional dimension despite the added complexity of 3D rendering.
The key hyperparameters were n_neighbors=20, which controls the
balance between local and global structure, and min_dist=0.1, which
controls how tightly points are packed together in the projection. Both were
selected after manual inspection of the resulting geometry.
Clustering was performed on the 3D UMAP coordinates using HDBSCAN (McInnes et al., 2017). HDBSCAN was preferred over k-means for three reasons. First, it does not require specifying the number of clusters in advance. Second, it handles variable-density clusters, which is important here because topic popularity in news varies substantially. Third, it explicitly assigns a noise label (-1) to points that do not belong to any coherent cluster, rather than forcing every article into the nearest group.
The final configuration used min_cluster_size=60 and
min_samples=10. These values were arrived at through iterative
tuning: lower values produced too many small, fragmented clusters (68 clusters
at min_cluster_size=25), while the final values produced 27
interpretable clusters that corresponded clearly to recognisable news topics.
Each cluster was labelled automatically using TF-IDF. The titles within each cluster were concatenated into a single document, and TF-IDF was computed across all cluster documents with a vocabulary of 5,000 unigrams and bigrams and English stop words removed. The top three terms by TF-IDF score were used as the cluster label. This approach reliably surfaces distinctive topic vocabulary and requires no manual annotation.
The pipeline produced 27 clusters from 7,000 articles, with 3,525 points (50.4%) assigned as noise. A selection of the most coherent clusters is shown below.
| Cluster | Auto label | Size | Interpretation |
|---|---|---|---|
| 01 | facebook \| cambridge \| analytica | 84 | Cambridge Analytica scandal (March 2018) |
| 02 | korea \| north korea \| korean | 61 | North Korea nuclear negotiations |
| 03 | sunderland \| cup \| world cup | 185 | 2018 FIFA World Cup |
| 12 | capitals \| knights \| stanley cup | 82 | NHL Stanley Cup playoffs |
| 13 | crash \| driving \| self driving | 101 | Uber self-driving fatality (March 2018) |
| 14 | china \| tariffs \| news | 95 | US-China trade war |
| 26 | infinity \| infinity war \| netflix | 108 | Avengers: Infinity War release (April 2018) |
Two quantitative metrics were used to evaluate the pipeline. DBCV (Density-Based Clustering Validation) measures cluster cohesion and separation and is designed specifically for density-based methods such as HDBSCAN, making it more appropriate than silhouette score in this setting. Trustworthiness measures how well the low-dimensional projection preserves local neighbourhood structure from the original 384-dimensional embedding space, with values closer to 1 indicating better preservation.
To justify the choice of UMAP over t-SNE, both methods were run on a 2,000-point subsample and their trustworthiness scores compared. UMAP in both 2D and 3D was also compared directly.
| Method | Trustworthiness | DBCV |
|---|---|---|
| UMAP 3D + HDBSCAN (this project) | 0.9065 | 0.1994 |
| t-SNE 2D + HDBSCAN | 0.8727 | n/a |
| UMAP 2D + HDBSCAN | 0.8511 | n/a |
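Trustworthiness as reported above is available directly in scikit-learn. A minimal sketch of the 2D-versus-3D comparison, using PCA as a stand-in projector and random data as a stand-in for the 2,000-point subsample, since neither UMAP nor t-SNE is assumed here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

# Stand-in for the (2000, 384) embedding subsample: random data with a few
# dominant directions so the projection has structure to preserve.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
X[:, :5] *= 10.0

# Trustworthiness near 1 means local neighbourhoods survive the projection.
scores = {}
for n_components in (2, 3):
    proj = PCA(n_components=n_components).fit_transform(X)
    scores[n_components] = trustworthiness(X, proj, n_neighbors=12)
```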
The most coherent clusters correspond to events with a distinctive and consistent vocabulary. The Cambridge Analytica cluster (01) is particularly clean because the phrase "Cambridge Analytica" is highly specific and appears repeatedly across articles from different publishers. Similarly, "self-driving" and "autonomous vehicle" are distinctive enough that the Uber fatality cluster (13) has very little overlap with general transport coverage.
By contrast, clusters around broader topics, such as general crime reporting (cluster 16, labelled "news | police | suspect"), are large and produce uninformative TF-IDF labels. This reflects a genuine limitation of title-only embedding: crime articles use similar surface vocabulary regardless of the specific event, so they cluster together without forming a semantically tight group. This is worth noting as a structural property of news language rather than a failure of the method.
The noise rate of 50.4% is higher than would be ideal but is not unexpected for a general news corpus. Many articles cover singular events that do not repeat often enough to form a cluster under the chosen minimum cluster size. HDBSCAN correctly identifies these as noise rather than forcing them into the nearest cluster, which is one of its key advantages over k-means. In a more focused corpus (for example, coverage of a single ongoing event over several months), the noise rate would be substantially lower.
One notable limitation is that cluster 21, labelled "di | dan | yang", appears to consist largely of articles from Chinese-language sources that have been transliterated into English. This suggests that source domain can sometimes be a stronger clustering signal than topic, particularly when a publisher uses consistently different vocabulary from the rest of the corpus. A simple fix would be to filter by language or normalise by source during pre-processing.
The trustworthiness and DBCV comparisons were not run on identical subsamples, which limits their interpretability as a controlled evaluation. A more rigorous evaluation would fix the subsample across all methods and report confidence intervals over multiple random seeds.
Several natural extensions would improve both the ML pipeline and the
visualisation. Replacing title-only embedding with full article embedding using
a larger model such as all-mpnet-base-v2 would likely produce
tighter clusters, at the cost of significantly higher compute. Extending the
corpus to cover multiple years would enable tracking of how topics evolve,
split, and merge over time, which could be visualised using a temporal animation
rather than a simple date filter. Finally, replacing TF-IDF labelling with a
prompted language model would produce more readable cluster descriptions,
particularly for the clusters that currently surface generic terms.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: sentence embeddings using siamese BERT networks. EMNLP 2019.
McInnes, L., Healy, J. and Melville, J. (2018). UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426.
McInnes, L., Healy, J. and Astels, S. (2017). HDBSCAN: hierarchical density based clustering. Journal of Open Source Software.
Moulavi, D., Jaskowiak, P.A., Campello, R.J.G.B., Zimek, A. and Sander, J. (2014). Density-based clustering validation. SDM 2014.