MSc project  ·  Unsupervised NLP  ·  2025

Mapping the news: unsupervised topic discovery across 7,000 articles

This project builds an interactive 3D semantic map of 7,000 news articles from 2018, using sentence embeddings, UMAP dimensionality reduction, and HDBSCAN clustering to surface latent topic structure without any labelled data. The full pipeline runs offline in Python and exports a single static JSON file, which is loaded by a Three.js frontend deployed on GitHub Pages. Users can rotate the 3D embedding space, filter articles by publication date, highlight individual topic clusters, and read source articles directly from the interface.
Open the visualization

1. Introduction

Standard keyword search treats words as discrete symbols, which means that two articles covering the same event can fail to match simply because they use different vocabulary. A piece about "Federal Reserve interest rate policy" and one about "Fed hiking borrowing costs" are semantically identical but lexically distant. Semantic embedding models address this by mapping text into a continuous vector space where proximity reflects meaning rather than surface form.

This project asks a straightforward question: if a large news corpus is embedded into that space and left alone, does it naturally organise itself into interpretable topics? The answer, it turns out, is yes. Without providing a single label or category, the pipeline recovers coherent clusters corresponding to events including the Cambridge Analytica data scandal, the US-China trade war, the 2018 FIFA World Cup, and 24 other distinct topics, all from the geometric structure of the embeddings alone.

Beyond the clustering result itself, the project explores a practical deployment question: can a full ML pipeline, from raw data to interactive 3D visualization, be hosted entirely as a static site with no backend? The answer is also yes, and the approach generalises to any text corpus.

2. Data

2.1 Source

Articles were drawn from the CC-News dataset, a large crawl of English-language news articles made available via HuggingFace Datasets. The dataset was streamed directly rather than downloaded in full, with the first 20,000 articles fetched and then filtered to a single calendar year to produce a temporally coherent corpus.

2.2 Filtering and cleaning

Articles were filtered to the period January to July 2018, retaining only those with a valid publication date and a title of more than 20 characters. Duplicate titles were removed. The resulting corpus contains 7,000 articles from a wide range of English-language publishers, with publication dates spanning approximately six months. This time window was chosen deliberately: 2018 was a particularly eventful year for news, containing several well-defined events such as the Cambridge Analytica story breaking in March, the Avengers: Infinity War release in April, and the FIFA World Cup beginning in June. This makes it a good corpus for evaluating whether the clustering method can recover known real-world events.
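The streaming and filtering steps above can be sketched as follows. The field names (title, date) and the date string format are assumptions about the CC-News schema on HuggingFace Datasets, and fetch_corpus is a hypothetical helper name:

```python
from datetime import datetime

def keep(record, seen_titles,
         start=datetime(2018, 1, 1), end=datetime(2018, 8, 1)):
    """Apply the Section 2.2 filters to a single CC-News record."""
    title = (record.get("title") or "").strip()
    if len(title) <= 20 or title in seen_titles:
        return False          # too short, or a duplicate title
    try:
        # CC-News dates are assumed to look like "2018-03-20 00:00:00".
        date = datetime.fromisoformat(record.get("date") or "")
    except ValueError:
        return False          # missing or unparseable publication date
    if not (start <= date < end):
        return False          # outside the January-July 2018 window
    seen_titles.add(title)
    return True

def fetch_corpus(limit=20_000):
    """Stream the first `limit` CC-News articles and keep those that pass."""
    from datasets import load_dataset  # heavyweight import kept local
    stream = load_dataset("cc_news", split="train", streaming=True)
    seen, corpus = set(), []
    for i, record in enumerate(stream):
        if i >= limit:
            break
        if keep(record, seen):
            corpus.append({"title": record["title"], "date": record["date"]})
    return corpus
```

Streaming keeps memory flat: only one record is materialised at a time, and iteration stops after the first 20,000 regardless of the crawl's total size.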

2.3 Embedding scope

Only article titles were used for embedding rather than full article text. This was a deliberate design choice for two reasons. First, titles are more consistent in length and register across different publishers, reducing the risk that embedding quality varies by source. Second, titles carry sufficient semantic signal to separate topics cleanly in practice, and embedding 7,000 full articles would have been prohibitively slow on consumer hardware.

3. Method

3.1 Sentence embedding

Each article title was embedded using all-MiniLM-L6-v2 from the sentence-transformers library (Reimers and Gurevych, 2019). This model produces 384-dimensional dense vectors and was chosen as a balance between embedding quality and inference speed. On a standard CPU, the full corpus of 7,000 titles was embedded in approximately eight minutes using batched inference with a batch size of 64.

The resulting embedding matrix has shape (7000, 384). Each row is a point in a 384-dimensional semantic space where cosine distance approximates semantic dissimilarity.

3.2 Dimensionality reduction with UMAP

The 384-dimensional embeddings were reduced to three dimensions using UMAP (McInnes et al., 2018) with cosine distance as the metric. Three dimensions were chosen over the more common two for a principled reason: the trustworthiness evaluation in Section 4 shows that 3D UMAP preserves more local neighbourhood structure from the original embedding space than either 2D UMAP or 2D t-SNE, which justifies the extra dimension despite the added complexity of 3D rendering.

The key hyperparameters were n_neighbors=20, which controls the balance between local and global structure, and min_dist=0.1, which controls how tightly points are packed together in the projection. Both were selected after manual inspection of the resulting geometry.

3.3 Clustering with HDBSCAN

Clustering was performed on the 3D UMAP coordinates using HDBSCAN (McInnes et al., 2017). HDBSCAN was preferred over k-means for three reasons. First, it does not require specifying the number of clusters in advance. Second, it handles variable-density clusters, which is important here because topic popularity in news varies substantially. Third, it explicitly assigns a noise label (-1) to points that do not belong to any coherent cluster, rather than forcing every article into the nearest group.

The final configuration used min_cluster_size=60 and min_samples=10. These values were arrived at through iterative tuning: lower values produced too many small, fragmented clusters (68 clusters at min_cluster_size=25), while the final values produced 27 interpretable clusters that corresponded clearly to recognisable news topics.

3.4 Automatic cluster labelling

Each cluster was labelled automatically using TF-IDF. The titles within each cluster were concatenated into a single document, and TF-IDF was computed across all cluster documents with a vocabulary of 5,000 unigrams and bigrams and English stop words removed. The top three terms by TF-IDF score were used as the cluster label. This approach reliably surfaces distinctive topic vocabulary and requires no manual annotation.

4. Results

4.1 Cluster quality

The pipeline produced 27 clusters from 7,000 articles, with 3,525 points (50.4%) assigned as noise. A selection of the most coherent clusters is shown below.

Cluster  Auto label                         Size  Interpretation
01       facebook | cambridge | analytica     84  Cambridge Analytica scandal (March 2018)
02       korea | north korea | korean         61  North Korea nuclear negotiations
03       sunderland | cup | world cup        185  2018 FIFA World Cup
12       capitals | knights | stanley cup     82  NHL Stanley Cup playoffs
13       crash | driving | self driving      101  Uber self-driving fatality (March 2018)
14       china | tariffs | news               95  US-China trade war
26       infinity | infinity war | netflix   108  Avengers: Infinity War release (April 2018)

4.2 Evaluation metrics

Two quantitative metrics were used to evaluate the pipeline. DBCV (Density-Based Clustering Validation) measures cluster cohesion and separation and is designed specifically for density-based methods such as HDBSCAN, making it more appropriate than the silhouette score in this setting. Trustworthiness measures how well the low-dimensional projection preserves local neighbourhood structure from the original 384-dimensional embedding space, with values closer to 1 indicating better preservation.

To justify the choice of UMAP over t-SNE, both methods were run on a 2,000-point subsample and their trustworthiness scores compared. UMAP in both 2D and 3D was also compared directly.

Method                            Trustworthiness  DBCV
UMAP 3D + HDBSCAN (this project)  0.9065           0.1994
t-SNE 2D + HDBSCAN                0.8727           n/a
UMAP 2D + HDBSCAN                 0.8511           n/a

UMAP 3D achieves the highest trustworthiness of the three methods tested, at 0.91, suggesting that the additional dimension captures neighbourhood structure that is genuinely lost in 2D projections. Notably, t-SNE 2D outperforms UMAP 2D on this metric: UMAP's advantage only becomes clear once the third dimension is included. The DBCV score of 0.20 is moderate, consistent with the inherently fuzzy topic boundaries of a general news corpus.

Note that the t-SNE and UMAP 2D scores were computed on a 2,000-point subsample for computational reasons, so the comparison is indicative rather than strictly controlled.
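Assuming trustworthiness was computed with scikit-learn's implementation (a reasonable but unstated assumption), the evaluation looks roughly as follows, with random vectors standing in for the real embeddings and projection:

```python
import numpy as np
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 384))   # stand-in for the 384-d title embeddings
Y = X[:, :3]                      # stand-in for a 3D projection

# Checks whether points that are neighbours in the projection were also
# neighbours in the original space; 1.0 means perfect preservation.
score = trustworthiness(X, Y, n_neighbors=5, metric="cosine")
```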

5. Discussion

5.1 What clusters well and why

The most coherent clusters correspond to events with a distinctive and consistent vocabulary. The Cambridge Analytica cluster (01) is particularly clean because the phrase "Cambridge Analytica" is highly specific and appears repeatedly across articles from different publishers. Similarly, "self-driving" and "autonomous vehicle" are distinctive enough that the Uber fatality cluster (13) has very little overlap with general transport coverage.

By contrast, clusters around broader topics, such as general crime reporting (cluster 16, labelled "news | police | suspect"), are large and produce uninformative TF-IDF labels. This reflects a genuine limitation of title-only embedding: crime articles use similar surface vocabulary regardless of the specific event, so they cluster together without forming a semantically tight group. This is worth noting as a structural property of news language rather than a failure of the method.

5.2 The noise rate

The noise rate of 50.4% is higher than would be ideal but is not unexpected for a general news corpus. Many articles cover singular events that do not repeat often enough to form a cluster under the chosen minimum cluster size. HDBSCAN correctly identifies these as noise rather than forcing them into the nearest cluster, which is one of its key advantages over k-means. In a more focused corpus (for example, coverage of a single ongoing event over several months), the noise rate would be substantially lower.

5.3 Limitations

One notable limitation is that cluster 21, labelled "di | dan | yang", appears to consist largely of articles from Chinese-language sources that have been transliterated into English. This suggests that source domain can sometimes be a stronger clustering signal than topic, particularly when a publisher uses consistently different vocabulary from the rest of the corpus. A simple fix would be to filter by language or normalise by source during pre-processing.

The trustworthiness and DBCV comparisons were not run on identical subsamples, which limits their interpretability as a controlled evaluation. A more rigorous evaluation would fix the subsample across all methods and report confidence intervals over multiple random seeds.

5.4 Future work

Several natural extensions would improve both the ML pipeline and the visualization. Replacing title-only embedding with full article embedding using a larger model such as all-mpnet-base-v2 would likely produce tighter clusters, at the cost of significantly higher compute. Extending the corpus to cover multiple years would enable tracking of how topics evolve, split, and merge over time, which could be visualised using a temporal animation rather than a simple date filter. Finally, replacing TF-IDF labelling with a prompted language model would produce more readable cluster descriptions, particularly for the clusters that currently surface generic terms.

6. References

Reimers, N. and Gurevych, I. (2019). Sentence-BERT: sentence embeddings using Siamese BERT networks. EMNLP 2019.

McInnes, L., Healy, J. and Melville, J. (2018). UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426.

McInnes, L., Healy, J. and Astels, S. (2017). HDBSCAN: hierarchical density based clustering. Journal of Open Source Software.

Moulavi, D., Jaskowiak, P.A., Campello, R.J.G.B., Zimek, A. and Sander, J. (2014). Density-based clustering validation. SDM 2014.