HDBSCAN for Narrative Clustering
Context
After extracting AAT frames from messages, we need to group similar frames into narrative clusters to identify recurring disinformation themes.
Decision
Use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) on sentence embeddings of extracted frames.
Alternatives Considered
K-Means clustering
Pros
- Simple and fast
- Well-understood behavior
Cons
- Requires specifying the number of clusters in advance
- Assumes roughly spherical clusters and assigns every point to a cluster, with no notion of noise
Agglomerative clustering
Pros
- Produces hierarchical structure
- No need to specify k if the dendrogram is cut at a distance threshold instead
Cons
- O(n²) memory for the pairwise distance matrix in typical implementations
- Sensitive to distance metric
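To make the O(n²) memory concern concrete, here is a rough back-of-the-envelope sketch. It assumes the implementation stores every pairwise distance once as a 64-bit float; the helper name `condensed_distance_bytes` is hypothetical and only for illustration.

```python
def condensed_distance_bytes(n_points: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to store each pairwise distance once (float64 assumed)."""
    n_pairs = n_points * (n_points - 1) // 2
    return n_pairs * bytes_per_value


# Print the estimated footprint for a few illustrative corpus sizes.
for n in (10_000, 100_000, 1_000_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>9,} frames -> {gib:,.1f} GiB")
```

Under these assumptions, one million frames would need several terabytes for the distance matrix alone, which is why this alternative is hard to apply at the scale of millions of frames.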
LLM-based classification
Pros
- Can use semantic understanding
- Flexible categories
Cons
- Expensive at scale
- Hard to ensure consistency across millions of frames
Reasoning
HDBSCAN handles clusters of varying density, does not require the number of clusters to be specified, and explicitly labels noise points (frames that don't belong to any narrative cluster). This matches our use case, where some narratives are highly concentrated while others are diffuse, and many messages don't contain clear narrative frames.