Pipeline Overview

Vector operates as a multi-stage pipeline that transforms raw Telegram messages into structured, queryable narrative intelligence. Each stage is designed to work independently, enabling modular upgrades and parallel processing.
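
As a rough sketch of that contract, each stage can be thought of as consuming and emitting plain records; the interface below is illustrative Python, not Vector's actual API.

```python
# Illustrative only: a minimal stage contract, assuming each stage consumes and
# emits dict records. Names here are hypothetical, not Vector's actual API.
from typing import Iterable, Protocol

class Stage(Protocol):
    def run(self, records: Iterable[dict]) -> Iterable[dict]: ...

def run_pipeline(stages: list[Stage], records: Iterable[dict]) -> Iterable[dict]:
    # Stages only share the record format, so any one can be swapped or parallelized.
    for stage in stages:
        records = stage.run(records)
    return records
```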

01. Collection

Automated ingestion of public Telegram channel messages via the Telegram API. Channels are selected based on relevance to tracked topics, geographic scope, and audience reach.
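
As an illustration, collection might look like the following sketch, which assumes the Telethon client library (the project does not specify its Telegram client); credentials and channel names are placeholders.

```python
# Illustrative sketch only: collection via the Telethon client library (an assumed
# choice; the project does not name its Telegram API client). Credentials and
# channel usernames below are placeholders.
import asyncio
from telethon import TelegramClient

API_ID = 12345                     # placeholder: obtained from my.telegram.org
API_HASH = "0123456789abcdef"      # placeholder
CHANNELS = ["example_channel_a", "example_channel_b"]   # placeholder usernames

async def collect(limit_per_channel: int = 500) -> list[dict]:
    """Fetch recent public messages, keeping metadata needed for provenance."""
    records = []
    async with TelegramClient("vector_session", API_ID, API_HASH) as client:
        for channel in CHANNELS:
            async for msg in client.iter_messages(channel, limit=limit_per_channel):
                if not msg.message:            # skip media-only posts with no text
                    continue
                records.append({
                    "channel": channel,
                    "message_id": msg.id,
                    "date": msg.date.isoformat(),
                    "text": msg.message,
                    # forwarding header, if present, preserves the propagation chain
                    "forwarded_from": str(msg.fwd_from.from_id) if msg.fwd_from else None,
                })
    return records

if __name__ == "__main__":
    print(f"collected {len(asyncio.run(collect()))} messages")
```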

02. Preprocessing

Raw messages are cleaned, deduplicated, and normalized. Language detection filters non-target languages. Metadata (timestamps, channel info, forwarding chains) is preserved for provenance tracking.
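
A minimal sketch of this stage is shown below, assuming the langdetect package for language filtering and hash-based deduplication; the record fields and target languages are illustrative.

```python
# Illustrative sketch only: cleaning, deduplication, and language filtering,
# assuming the langdetect package; record fields mirror the collection sketch above.
import hashlib
import re

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

TARGET_LANGS = {"ru", "uk"}   # assumed target languages; adjust per deployment

def normalize(text: str) -> str:
    """Strip URLs and collapse whitespace so near-duplicates hash identically."""
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(records: list[dict]) -> list[dict]:
    seen: set[str] = set()
    kept = []
    for rec in records:
        norm = normalize(rec["text"])
        if not norm:
            continue
        digest = hashlib.sha256(norm.lower().encode()).hexdigest()
        if digest in seen:                         # drop duplicate texts
            continue
        try:
            if detect(norm) not in TARGET_LANGS:   # drop non-target languages
                continue
        except LangDetectException:
            continue
        seen.add(digest)
        kept.append({**rec, "norm_text": norm})    # original metadata preserved
    return kept
```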

03. Frame Extraction

NLP models extract actor-action-target (AAT) frames from each message. This structured representation captures who is doing what to whom according to the narrative.

04. Narrative Clustering

Similar frames are grouped into narrative clusters using semantic similarity. This reveals coordinated messaging patterns and tracks how narratives evolve over time.
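
One way to implement this step, sketched below, is to embed each frame with a multilingual sentence-embedding model and apply hierarchical clustering with a distance cutoff; the packages, model name, and threshold are example choices rather than Vector's documented configuration.

```python
# Illustrative sketch only: grouping frames by semantic similarity, assuming the
# sentence-transformers and scikit-learn packages (scikit-learn >= 1.2 for the
# `metric` argument). Model name and threshold are example choices.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def cluster_frames(frame_texts: list[str]) -> list[int]:
    """Return a cluster label for each frame description."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = model.encode(frame_texts, normalize_embeddings=True)

    # Hierarchical clustering with a cosine-distance cutoff instead of a fixed k,
    # so new narratives can emerge without re-tuning the number of clusters.
    clusterer = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=0.4,
        metric="cosine",
        linkage="average",
    )
    return clusterer.fit_predict(embeddings).tolist()
```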

05. Analysis & Reporting

Clustered narratives are analyzed for spread patterns, source attribution, and temporal dynamics. Results are published as structured reports with full source traceability.
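
As a simple illustration of the temporal side of this stage, the sketch below (assuming pandas) counts messages and distinct channels per narrative cluster per day; field names are hypothetical.

```python
# Illustrative sketch only: a per-cluster temporal summary, assuming pandas and
# hypothetical 'cluster', 'channel', and ISO-format 'date' fields on each frame.
import pandas as pd

def narrative_report(rows: list[dict]) -> pd.DataFrame:
    """Count messages and distinct channels per narrative cluster per day."""
    df = pd.DataFrame(rows)
    df["day"] = pd.to_datetime(df["date"]).dt.date
    return (
        df.groupby(["cluster", "day"])
          .agg(messages=("channel", "size"), channels=("channel", "nunique"))
          .reset_index()
          .sort_values(["cluster", "day"])
    )
```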

Actor-Action-Target Framework

The core analytical unit in Vector is the actor-action-target (AAT) frame. Every message is decomposed into one or more frames that answer three questions:

Actor

Who is performing the action, or to whom is it attributed? This can be a named individual, organization, country, or abstract entity (e.g., "the West", "the regime").

Action

What is the actor claimed to be doing? Actions range from concrete (attacking, funding, blocking) to abstract (threatening, undermining, supporting).

Target

Who or what is affected by the action? Targets can be individuals, populations, institutions, or concepts (e.g., "democracy", "sovereignty").

By structuring unstructured text into AAT frames, Vector enables quantitative analysis of narrative patterns that would otherwise require manual reading of thousands of messages.
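
In code, an AAT frame is a small structure; the sketch below is an illustrative Python representation with an invented example decomposition, not Vector's actual schema. The stance field anticipates the framing-valence classification described under NLP & Machine Learning.

```python
# Illustrative sketch only: one way to represent an AAT frame in code. Field names
# follow the framework above; the example decomposition is invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class AATFrame:
    actor: str        # who performs or is attributed the action
    action: str       # what the actor is claimed to be doing
    target: str       # who or what is affected
    stance: str       # framing valence: "positive", "negative", or "neutral"
    message_id: int   # provenance link back to the source message

# A message claiming that "the West" is funding "the opposition" might decompose as:
frame = AATFrame(
    actor="the West",
    action="funding",
    target="the opposition",
    stance="negative",
    message_id=12345,
)
```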

Data Sources

Vector currently focuses on Telegram as its primary data source due to the platform's outsized role in information warfare, particularly in Eastern Europe and the post-Soviet space. Telegram channels offer:

  • Public accessibility without authentication barriers
  • High volume of politically relevant content
  • Clear forwarding chains that reveal information propagation paths
  • Minimal content moderation, making it a primary vector for disinformation

Channel selection follows a snowball sampling methodology: starting from known disinformation-linked channels and expanding through forwarding networks and cross-references.
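
The expansion step amounts to a bounded breadth-first traversal of the forwarding graph, sketched below; get_forward_sources is a hypothetical helper that returns the channels a given channel forwards from.

```python
# Illustrative sketch only: snowball expansion as a bounded breadth-first traversal.
# `get_forward_sources` is a hypothetical helper returning the channels a given
# channel forwards from; it is not part of Vector's documented interface.
from collections import deque
from typing import Callable, Iterable

def snowball_channels(
    seeds: Iterable[str],
    get_forward_sources: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> set[str]:
    """Expand a seed list of channels through their forwarding network."""
    discovered = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        channel, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for source in get_forward_sources(channel):
            if source not in discovered:
                discovered.add(source)
                queue.append((source, depth + 1))
    return discovered
```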

NLP & Machine Learning

Vector uses a combination of transformer-based models and rule-based heuristics for frame extraction (a minimal NER sketch follows the list):

  • Named Entity Recognition (NER) — Identifies actors and targets in text using multilingual models fine-tuned on media and political text.
  • Semantic Role Labeling — Extracts predicate-argument structures to identify actions and their relationships to actors and targets.
  • Sentiment & Stance Detection — Classifies the framing valence (positive, negative, neutral) of each actor-target relationship.
  • Embedding-based Clustering — Groups semantically similar frames into narrative clusters using sentence embeddings and hierarchical clustering.
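
As an illustration of the first item above, the sketch below runs a public multilingual NER checkpoint through the Hugging Face transformers pipeline; the specific model is an assumption, not necessarily the one Vector uses.

```python
# Illustrative sketch only: candidate actor/target extraction with the Hugging Face
# transformers NER pipeline. The checkpoint is a public multilingual NER model used
# here as an example, not necessarily the one Vector runs.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="Davlan/bert-base-multilingual-cased-ner-hrl",
    aggregation_strategy="simple",   # merge sub-word tokens into whole entity spans
)

def extract_entities(text: str) -> list[dict]:
    """Return candidate actors/targets with entity type and confidence."""
    return [
        {"text": ent["word"], "label": ent["entity_group"], "score": float(ent["score"])}
        for ent in ner(text)
    ]
```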

Limitations & Ethics

Vector is a research tool, not an oracle. Key limitations include:

  • NLP models have inherent error rates. All automated extractions should be treated as hypotheses, not ground truth.
  • Channel selection introduces sampling bias. The dataset does not represent all of Telegram.
  • Vector analyzes public channels only. Private groups and direct messages are not collected.
  • The system identifies narrative patterns but does not make claims about the intent or coordination behind them without additional evidence.

The project follows responsible disclosure practices and does not publish personally identifiable information of private individuals.