HDBSCAN for Narrative Clustering

nlpclusteringmethodology

After extracting AAT frames from messages, we need to group similar frames into narrative clusters to identify recurring disinformation themes.

Use HDBSCAN (Hierarchical Density-Based Spatial Clustering) on sentence embeddings of extracted frames.

K-Means clustering

Pros
  • Simple and fast
  • Well-understood behavior
Cons
  • Requires specifying number of clusters in advance
  • Assumes spherical clusters

Agglomerative clustering

Pros
  • Produces hierarchical structure
  • No need to specify k
Cons
  • O(n²) memory complexity
  • Sensitive to distance metric

LLM-based classification

Pros
  • Can use semantic understanding
  • Flexible categories
Cons
  • Expensive at scale
  • Hard to ensure consistency across millions of frames

HDBSCAN handles clusters of varying density, doesn't require specifying the number of clusters, and naturally identifies noise points (frames that don't belong to any narrative cluster). This matches our use case where some narratives are highly concentrated while others are diffuse, and many messages don't contain clear narrative frames.