The Dynamics of Hoaxes in Indonesia

Topic Modeling Analysis of TurnBackHoax.id Data in 2024

By: Ali Al Harkan (2406480304)

Course: Digital Research Methods | Instructor: Dr. Eriyanto

Hoax Documents: 3,746

Topics Identified: 22

Executive Summary

This analysis applies Latent Dirichlet Allocation (LDA) topic modeling to 3,746 hoax documents collected from turnbackhoax.id, Indonesia's leading hoax-busting platform. The dataset spans the 2024 Indonesian election period and is categorized into three major types: Politics, Scam, and Others.

Key Achievement: Successfully identified 22 distinct misinformation narrative themes across all categories, providing quantitative evidence of hoax patterns during Indonesia's democratic process.

Research Context

TurnBackHoax.id is operated by MAFINDO (Masyarakat Anti Fitnah Indonesia), Indonesia's first anti-hoax community. It serves as a comprehensive database of fact-checked misinformation circulating in Indonesian social media and messaging platforms.

2024 Indonesian Election Context: Indonesia held its presidential election on February 14, 2024, with three candidate pairs competing. The election period was marked by intense political polarization, widespread social media misinformation, and concerns about foreign interference narratives.

Main Finding

44% of political hoaxes are candidate-specific character attacks, with narratives predominantly targeting Anies Baswedan (33%) and promoting the Jokowi-Prabowo-Gibran coalition (11%). This reveals a highly polarized information environment where personal attacks dominate over policy discussions.

Methodology

Data Collection & Preparation

Source: Scraped from turnbackhoax.id public archive
Initial Dataset: 3,746 fact-checked hoax articles
LLM Categorization: Used Gemini 2.5 Flash to categorize into Politics (36%), Scam (25%), Others (39%)
NARASI Extraction: Extracted hoax narrative text from CONTENT field using regex patterns
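
The NARASI extraction step can be sketched as below. The bracketed section labels (`[NARASI]`, `[PENJELASAN]`), the `=====` separators, and the sample article are assumptions made for illustration; the project's actual regex patterns are not reproduced here.

```python
import re
from typing import Optional

# Assumed layout: the hoax narrative follows a "[NARASI]" label and runs
# until the next bracketed, upper-case section label or the end of the text.
NARASI_PATTERN = re.compile(
    r"\[NARASI\]\s*:?\s*(.*?)\s*(?=\[[A-Z ]+\]|\Z)",
    re.DOTALL,
)

def extract_narasi(content: str) -> Optional[str]:
    """Return the narrative text, or None if no NARASI section is found."""
    match = NARASI_PATTERN.search(content)
    if match is None:
        return None
    # Drop separator runs such as "=====" and collapse whitespace.
    text = re.sub(r"=+", " ", match.group(1))
    text = re.sub(r"\s+", " ", text).strip()
    return text or None

# Hypothetical CONTENT value, not taken from the dataset.
sample = (
    "[KATEGORI]: Konten Palsu\n=====\n"
    "[NARASI]: Lowongan kerja gaji 10 juta, daftar sekarang!\n=====\n"
    "[PENJELASAN]: Informasi tersebut tidak benar."
)
narasi = extract_narasi(sample)
print(narasi)
```

Stopping at the next section label is what keeps fact-checking metadata out of the modeled text, so a pattern that is too permissive here is one plausible source of metadata artifacts in the topics.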

Dataset Statistics

Category	Documents	% of Total	Avg Text Length
Politics	1,358	36.2%	392 chars
Scam	939	25.1%	401 chars
Others	1,449	38.7%	455 chars

Indonesian Text Preprocessing Pipeline

Tokenization: Lowercase conversion, URL removal, special character cleaning
Stopword Removal: Custom Indonesian stopwords (66 terms) + Sastrawi library defaults
Stemming: Sastrawi Indonesian stemmer for morphological normalization
Bigram Detection: Identified multi-word phrases (min count: 5, threshold: 10)
Dictionary Filtering: Removed terms appearing in fewer than 2 docs or in more than 50% of docs
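
The cleaning and dictionary-filtering steps above can be sketched with the standard library alone. This is a simplified illustration, not the project's code: the stopword set is a tiny hypothetical stand-in for the 66 custom terms plus Sastrawi defaults, the sample documents are invented, and the stemming and bigram steps are left out. In gensim, the filtering step would correspond to `Dictionary.filter_extremes(no_below=2, no_above=0.5)`.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not the project's list).
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}

def tokenize(text):
    """Lowercase, drop URLs and non-letter characters, remove stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

def filter_vocabulary(docs, no_below=2, no_above=0.5):
    """Keep terms found in at least `no_below` docs and at most
    `no_above` (as a fraction) of all docs."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[tok for tok in doc if tok in keep] for doc in docs]

# Hypothetical documents for illustration only.
raw_docs = [
    "Lowongan kerja BUMN, daftar di https://contoh.example/loker",
    "Lowongan kerja gaji 10 juta!",
    "Banjir besar di kota ini",
    "Gempa dan banjir di kota",
    "Vaksin covid berbahaya",
]
docs = filter_vocabulary([tokenize(d) for d in raw_docs])
print(docs)
```

Terms seen in only one document are dropped by the lower bound, which is why the last sample document ends up empty; on a real corpus such documents would need to be handled before training.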

LDA Configuration

Algorithm: Latent Dirichlet Allocation (gensim 4.4.0)
Topic Numbers Tested: 5, 7, and 10 for each category
Training: 10 passes, 100 iterations per model
Evaluation Metric: C_v coherence score (interpretability-focused)
Selection: Best model chosen based on highest coherence
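
The selection rule can be sketched as a small loop over candidate topic counts. The function and argument names are hypothetical. In a real run, `train_fn` would presumably wrap `gensim.models.LdaModel(..., num_topics=k, passes=10, iterations=100)` and `score_fn` would wrap `gensim.models.CoherenceModel(..., coherence="c_v").get_coherence()`; here both are replaced by stand-ins so the sketch runs without gensim.

```python
def select_best_k(candidate_ks, train_fn, score_fn):
    """Train one model per topic count and return (best_k, scores)."""
    scores = {}
    for k in candidate_ks:
        model = train_fn(k)
        scores[k] = score_fn(model)
    # Highest coherence wins.
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Stand-ins: the "model" is just k, and the scores are the Politics
# coherence values reported in the Model Selection table of this analysis.
reported_politics = {5: 0.4520, 7: 0.4525, 10: 0.4581}
best_k, scores = select_best_k(
    [5, 7, 10],
    train_fn=lambda k: k,
    score_fn=lambda model: reported_politics[model],
)
print(best_k, scores)
```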

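Why different topic numbers? The best-scoring model differed by category: 10 topics for Politics, 5 for Scam, and 7 for Others. One reading is that political narratives are more varied and need more topics, while scam patterns fall into a handful of basic templates. The coherence gaps between configurations are small (under 0.02 in every category), so these choices are indicative rather than definitive. The Others category's slightly higher coherence (0.461) suggests its topics follow more stereotypical patterns than political attacks.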

Model Overview

Model Selection & Performance (Coherence Score)

Category	5 Topics	7 Topics	10 Topics	Best Choice
Politics	0.4520	0.4525	0.4581	10 topics
Scam	0.4523	0.4487	0.4419	5 topics
Others	0.4502	0.4610	0.4531	7 topics

Limitations & Considerations

Topic 6 (Politics): Contains fact-checking metadata artifacts, suggesting NARASI extraction could be refined
Topic 5 (Politics): Very large (449 docs, 33%) - may benefit from subdivision in future analysis
Temporal Gaps: Analysis doesn't track topic evolution over the election cycle - recommended for Phase II
Language Dependency: Indonesian-specific preprocessing may not transfer to other languages

Identified Topics by Category

Comprehensive overview of all 22 topics identified across three categories of hoax narratives.

Politics Category: 10 Topics

1,358 political hoax documents from the 2024 Indonesian election period.

Topic ID	Topic Label	Documents	% Share	Top Terms
0	Social Media Verification	107	7.9%	`temu, rupa, gambar, jelas_akun, judul, sama, sedang`
1	Palace Appointments	8	0.6%	`istana, lantik, kaesang, pramono, ganti, bakal`
2	Foreign Interference & Religion	8	0.6%	`bongkar, cina, agama, partai, ancam, bikin`
3	Parliamentary Affairs	9	0.7%	`dpr, bayar, lapor, gratis, milik, jelas_buah`
4	Anti-Corruption Protests	8	0.6%	`demo, kpk, mahasiswa, gagal, kasus, libat`
5	General Politics (Anies Attacks)	449	33.1%	`indonesia, presiden, anies, negara, jadi, sebut`
6	Fact-Check Metadata (Artifacts)	46	3.4%	`disinformasi_first, draft_news, jakarta, jenis_mis`
7	Election Fraud (China)	22	1.6%	`china, suara, kalah, google, mau, jabat`
8	Jokowi-Prabowo-Gibran Coalition	146	10.8%	`jokowi, prabowo, gibran, ikn, pilkada, anak`
9	KPU Manipulation	14	1.0%	`kpu, pecat, kuasa, panggil, gak, naik`

Key Insight: Topics 5 and 8 together represent 44% of all political hoaxes (595 of 1,358 documents) and focus on candidate-specific character attacks.
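
The combined share can be checked directly from the document counts in the table above; the variable names are illustrative.

```python
politics_docs = 1358
topic_docs = {5: 449, 8: 146}  # counts from the Politics topic table

combined = sum(topic_docs.values())
combined_share = 100 * combined / politics_docs
size_ratio = topic_docs[5] / topic_docs[8]

print(combined, round(combined_share, 1), round(size_ratio, 2))
```

By these counts Topic 5 is roughly three times the size of Topic 8.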

Scam Category: 5 Topics

939 scam-related hoax documents targeting economic vulnerabilities.

Topic ID	Topic Label	Documents	% Share	Top Terms
0	Fake Job Recruitment	446	47.5%	`juta, loker, lowong_kerja, daftar, indonesia, gaji`
1	Lottery & Banking Scams	121	12.9%	`bank, undi, festival, gratis, motor, hadiah`
2	Account Phishing	334	35.6%	`akun, resmi, nomor, hubung, pihak, minta`
3	Fake Recruitment Letters	155	16.5%	`gaji, kerja, surat, posisi, terima, bulan`
4	Celebrity Deepfake Endorsements	110	11.7%	`rupa, temu, guna, taut_daftar, pasti, profil`

Key Insight: Topics 0, 2, and 3 represent 82% of scam hoaxes and exploit economic vulnerability through fake job offers and financial fraud.

Others Category: 7 Topics

1,449 miscellaneous hoax documents covering health, disasters, and social issues.

Topic ID	Topic Label	Documents	% Share	Top Terms
0	Health Misinformation	165	11.4%	`sehat, obat, sakit, sebab, akibat, bahaya`
1	News Articles & Headlines	144	9.9%	`artikel, judul, periksa_mafindo, tiba, tanda`
2	Disasters & Events	263	18.2%	`kota, banjir, bencana, gunung, warga, gempa`
3	COVID-19 & Conspiracy	51	3.5%	`vaksin, covid, virus, bill_gates, wef, digital`
4	Religion & Sports	56	3.9%	`islam, timnas, piala_dunia, umat, masjid`
5	Mixed Content	639	44.1%	`indonesia, baru, dapat, orang, masuk, jadi`
6	Medical Miracle Cures	77	5.3%	`sembuh, darah, air, ginjal, israel, minum`

Note: This category achieved the highest coherence score (0.461), which suggests comparatively well-separated and interpretable topics. Health misinformation (Topics 0, 3, and 6; 293 documents) represents about 20% of this category.

Interactive Topic Exploration

Use the tabs below to explore pyLDAvis interactive visualizations for each category. Each circular visualization shows topics positioned by their similarity - topics closer together share more vocabulary. Click on topics to see their top terms.

How to use: Click a category tab, then interact with the visualization below. Hover over circles to see topic prevalence, click to see top terms. Adjust the λ (lambda) slider to balance term frequency vs. topic specificity.

Politics Category: 2024 Election Narratives

1,358 political hoax documents analyzed. Topics identified include candidate attacks, election fraud narratives, foreign interference conspiracies, and institutional distrust themes.

Documents: 1,358
Topics: 10
Coherence: 0.458
Dominant Pattern: Candidate Attacks

Scam Category: Fraudulent Schemes

939 scam-related hoaxes analyzed. Topics include fake job offers, lottery scams, banking phishing, government aid fraud, and celebrity deepfake endorsements.

Documents: 939
Topics: 5
Coherence: 0.452
Main Target: Job Seekers

Others Category: Health, Disasters, Religion

1,449 miscellaneous hoaxes analyzed. Topics include natural disasters, COVID-19 vaccines, food safety scares, religious content, celebrity news, and conspiracy theories.

Documents: 1,449
Topics: 7
Coherence: 0.461
Quality: Highest

Understanding the Visualization

Left Panel: Topic Map (PC1 vs PC2)

PC1 & PC2 (Principal Components): 2D projection of the high-dimensional topic space using dimensionality reduction. Each topic exists in vocabulary-sized space (1,234 dimensions), compressed to 2D for visualization.
Circle Position: Close circles = topics sharing similar vocabulary (semantically related)
Distant circles = topics with different word distributions (distinct themes)
Circle Size: Overall prevalence of the topic in the corpus. pyLDAvis scales each circle by the topic's share of tokens, which usually tracks, but is not identical to, its document count. Larger circles = more prevalent topics.
Example: Topic 5 (General Politics) has 449 docs → very large circle
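
To make "closer together" concrete: assuming the pyLDAvis defaults were used, the distance between two topics is the Jensen-Shannon divergence between their word distributions, and the resulting distance matrix is then projected to 2D. The sketch below computes that divergence for toy distributions over a four-word vocabulary; the numbers are hypothetical and the projection step is omitted.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric divergence between two probability distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy topic-word distributions (hypothetical, each sums to 1).
topic_a = [0.50, 0.40, 0.05, 0.05]
topic_b = [0.45, 0.45, 0.05, 0.05]  # similar vocabulary to topic_a
topic_c = [0.05, 0.05, 0.50, 0.40]  # different vocabulary

near = jensen_shannon(topic_a, topic_b)
far = jensen_shannon(topic_a, topic_c)
print(near < far)
```

Topics A and B put their weight on the same words, so their divergence is small and their circles would sit close together; topic C would be placed far from both.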

Right Panel: Term Rankings

Salient Terms: Measures how distinctive and informative a term is across all topics.
High saliency = frequent AND concentrated in specific topics (helps distinguish topics)
Low saliency = appears uniformly across all topics (generic words)
Relevance Metric (λ, lambda): Controls term ranking balance using the λ slider:
• λ = 1.0: Shows most frequent terms in topic
• λ = 0.0: Shows most exclusive terms (rare in other topics)
• λ = 0.6 (default): Balanced view
Tip: Set λ to 0.3-0.4 to find the most characteristic words for each topic
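
The effect of the λ slider can be illustrated with the relevance formula from the LDAvis paper (Sievert & Shirley), relevance = λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)). The probabilities below are toy values chosen for illustration, not estimates from the trained models.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

def rank_terms(topic_probs, corpus_probs, lam):
    """Sort a topic's terms from most to least relevant for a given lambda."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )

# Hypothetical probabilities: p(w|t) within one topic, p(w) across the corpus.
topic_probs = {"indonesia": 0.10, "anies": 0.05, "jadi": 0.08}
corpus_probs = {"indonesia": 0.09, "anies": 0.01, "jadi": 0.08}

by_frequency = rank_terms(topic_probs, corpus_probs, lam=1.0)
by_exclusivity = rank_terms(topic_probs, corpus_probs, lam=0.0)
print(by_frequency, by_exclusivity)
```

At λ = 1.0 the generic but frequent word ranks first; at λ = 0.0 the word that is rare elsewhere in the corpus moves to the top, which is why lower λ values surface more characteristic terms.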

Interactive Features

Hover over a word in bar chart: Circles resize to show which topics contain that word:
• Large circle = word is highly associated with this topic
• Multiple large circles = word appears in multiple topics (ambiguous term)
• One dominant circle = word is exclusive to one topic
Click a circle: Bar chart updates to show top terms for that topic
Hover over a circle: Shows topic prevalence (the topic's share of tokens in the corpus)

Exploration Workflow:

Click a topic circle to see its top terms
Adjust λ slider to 0.3 to find distinctive words
Hover over specific words to see which other topics share them
Identify topic clusters (nearby circles) to understand thematic relationships

Detailed Results by Category

Explore in-depth analysis, visualizations, and key topics for each category.

Politics Category: Detailed Results

The political hoax landscape is dominated by candidate-specific character attacks (44%), revealing a highly polarized campaign environment. Topic 5 (Anies Baswedan attacks) and Topic 8 (Jokowi dynasty narratives) account for 595 documents combined - nearly half of all political misinformation.

Visualizations: Politics

[Figures: Topic Distribution; Coherence Comparison; Topic Word Clouds]

Scam Category: Detailed Results

Scam hoaxes overwhelmingly target job seekers (60%) through fake recruitment and employment offers. People affected by unemployment in Indonesia are economically vulnerable, which makes them prime targets for fraudulent schemes.

Visualizations: Scam

[Figures: Topic Distribution; Coherence Comparison; Topic Word Clouds]

Others Category: Detailed Results

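The Others category achieved the highest coherence score (0.461), suggesting its topics are the most distinct and interpretable of the three categories. Apart from the large Mixed Content topic (44%), health misinformation (about 20% combined across Topics 0, 3, and 6) and disaster-related hoaxes (18%) are the most prominent themes.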

Visualizations: Others

[Figures: Topic Distribution; Coherence Comparison; Topic Word Clouds]

Key Findings & Implications

1. Electoral Polarization Through Character Attacks

Finding: 44% of political hoaxes are candidate-specific character attacks. The topic dominated by attacks on Anies Baswedan (449 docs) is about three times the size of the Jokowi-Prabowo-Gibran coalition topic (146 docs).

Implication: Campaign discourse prioritizes personal attacks over policy discussions, deepening societal polarization and reducing substantive democratic debate.

2. Economic Vulnerability Exploitation

Finding: 82% of scam content targets job seekers through fake recruitment (60%) and financial fraud (22%).

Implication: Scammers systematically exploit Indonesia's unemployment challenges, preying on economic desperation with fraudulent employment opportunities.

3. Persistent COVID-19 Misinformation

Finding: Health misinformation (Topics 0, 3, and 6) represents about 20% of the Others category, with vaccine conspiracies and miracle cure claims persisting beyond the pandemic peak.

Implication: Pandemic-era misinformation has lasting effects on public health discourse and vaccine hesitancy.

4. Institutional Trust Erosion

Finding: Conspiracy narratives target the KPU (election commission), KPK (anti-corruption commission), and DPR (parliament), although the corresponding topics are small (14, 8, and 9 documents respectively).

Implication: Deliberate erosion of institutional legitimacy threatens Indonesia's democratic stability beyond the election cycle.

5. Foreign Interference Narratives

Finding: Chinese interference conspiracies appear in two political topics (Topics 2 and 7) and are mixed with religious polarization themes.

Implication: Xenophobic narratives are weaponized for electoral purposes, potentially damaging Indonesia-China relations and fueling racial tensions.

Democratic Implications: The combination of character attacks (44%), institutional distrust narratives, and foreign interference conspiracies creates a triple threat to Indonesian democracy: candidate delegitimization, institutional erosion, and nationalist mobilization.

Recommendations & Next Steps

For Researchers

Temporal Analysis: Track topic prevalence over time, correlate with election events (debates, rallies, voting day)
Sentiment Integration: Apply sentiment analysis to each topic cluster to measure emotional manipulation tactics
Network Analysis: Build entity co-occurrence networks to map relationships between candidates, institutions, and narratives
Cross-Platform Study: Compare hoax patterns across WhatsApp, Facebook, Twitter/X, and TikTok
Predictive Modeling: Train classifiers to auto-categorize new hoaxes and build early warning systems
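
As a starting point for the temporal analysis suggested above, topic prevalence per month could be tallied as in the following sketch. The record format (ISO date plus dominant topic ID) and the sample records are hypothetical; the current dataset description does not confirm that publication dates were retained.

```python
from collections import Counter, defaultdict

def monthly_topic_share(records):
    """records: iterable of (iso_date, topic_id) pairs.
    Returns {month: {topic_id: share of that month's documents}}."""
    counts = defaultdict(Counter)
    for iso_date, topic_id in records:
        month = iso_date[:7]  # "YYYY-MM"
        counts[month][topic_id] += 1
    shares = {}
    for month in sorted(counts):
        total = sum(counts[month].values())
        shares[month] = {t: n / total for t, n in counts[month].items()}
    return shares

# Hypothetical records for illustration only.
records = [
    ("2024-01-15", 5), ("2024-01-20", 5), ("2024-01-28", 8),
    ("2024-02-10", 5), ("2024-02-14", 7), ("2024-02-15", 7), ("2024-02-20", 8),
]
shares = monthly_topic_share(records)
print(shares)
```

Plotting these monthly shares against election events such as debates and voting day would show whether particular narratives spike around them.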

For Policymakers

Counter-Narrative Programs: Develop targeted fact-checking for the 5 most prevalent topics identified
Digital Literacy: Focus on job-seeking populations (scam targets) and elderly (health misinformation targets)
Platform Accountability: Require social media platforms to address celebrity deepfake abuse
Institutional Communication: KPU, KPK, DPR should proactively counter distrust narratives with transparency

For Journalists

Investigation Topics: Use identified clusters (Chinese interference, dynastic politics) for investigative reporting
Fact-Check Prioritization: Focus on high-prevalence topics (Anies attacks, fake jobs) for maximum impact
Data Journalism: Visualize topic evolution during campaign period to show manipulation patterns

Appendix

Technical Specifications

Software Stack

LDA Implementation: gensim 4.4.0
Indonesian NLP: Sastrawi 1.0.1 (stemming), NLTK 3.8.0
Visualization: pyLDAvis 3.4.1, matplotlib 3.10.7, seaborn, wordcloud
Data Processing: pandas 2.2.3, numpy 2.3.5, scipy 1.16.3
Categorization: Google Gemini 2.5 Flash (LLM)

Reproducibility

All models trained with random_state=42 for reproducibility. Complete code, data, and models available in the project repository.

Files Generated

9 trained LDA models (.pkl) - 3 configurations × 3 categories
3 dictionaries (.pkl) - one per category
12 visualization images (.png) - 4 per category
9 data files (topic_terms.csv, document_topics.csv, coherence_scores.csv) - 3 sets
1 combined interactive visualization (this page)

Computational Resources

Analysis completed on standard hardware (no GPU required). Training time: ~5 minutes per category (15 minutes total). Preprocessing: ~2 minutes per category.

About This Analysis

Project Title: The Dynamics of Hoaxes in Indonesia: Topic Modeling Analysis of TurnBackHoax.id Data in 2024

Researcher: Ali Al Harkan (2406480304)

Course: Digital Research Methods (Metode Riset Digital)

Instructor: Dr. Eriyanto

Date: November 24, 2025

Data Source: TurnBackHoax.id (MAFINDO)

Context: 2024 Indonesian Presidential Election period

Acknowledgments

This analysis is based on fact-checking work by MAFINDO (Masyarakat Anti Fitnah Indonesia), Indonesia's first anti-hoax community. Their tireless efforts to combat misinformation make research like this possible.

Citation

If using this analysis, please cite:
Al Harkan, A. (2025). The Dynamics of Hoaxes in Indonesia: Topic Modeling Analysis of TurnBackHoax.id Data in 2024. Digital Research Methods Course Project. Dataset: TurnBackHoax.id, 3,746 documents. Method: LDA with Indonesian text preprocessing.

Contact: For questions or collaboration opportunities regarding this analysis, please reach out through the course instructor.
