The Dynamics of Hoaxes in Indonesia

Topic Modeling Analysis of TurnBackHoax.id Data in 2024

By: Ali Al Harkan (2406480304)

Course: Digital Research Methods | Instructor: Dr. Eriyanto

3,746 Hoax Documents | 22 Topics Identified | 3 Categories

Executive Summary

This analysis applies Latent Dirichlet Allocation (LDA) topic modeling to 3,746 hoax documents collected from turnbackhoax.id, Indonesia's leading hoax-busting platform. The dataset spans the 2024 Indonesian election period and has been categorized into three major types: Politics, Scam, and Others.

Key Achievement: Successfully identified 22 distinct misinformation narrative themes across all categories, providing quantitative evidence of hoax patterns during Indonesia's democratic process.

Total Documents Analyzed: 3,746
Topics Identified: 22
Categories: 3
Best Coherence: 0.461

Research Context

TurnBackHoax.id is operated by MAFINDO (Masyarakat Anti Fitnah Indonesia), Indonesia's first anti-hoax community. It serves as a comprehensive database of fact-checked misinformation circulating in Indonesian social media and messaging platforms.

2024 Indonesian Election Context: Indonesia held its presidential election on February 14, 2024, with three candidate pairs competing. The election period was marked by intense political polarization, widespread social media misinformation, and concerns about foreign interference narratives.

Main Finding

44% of political hoaxes are candidate-specific character attacks, with narratives predominantly targeting Anies Baswedan (33%) and promoting the Jokowi-Prabowo-Gibran coalition (11%). This reveals a highly polarized information environment where personal attacks dominate over policy discussions.


Methodology

Data Collection & Preparation

  1. Source: Scraped from turnbackhoax.id public archive
  2. Initial Dataset: 3,746 fact-checked hoax articles
  3. LLM Categorization: Used Gemini 2.5 Flash to categorize into Politics (36%), Scam (25%), Others (39%)
  4. NARASI Extraction: Extracted hoax narrative text from CONTENT field using regex patterns
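
As a rough illustration of the NARASI extraction step, the sketch below pulls the narrative out of a CONTENT string with a regular expression. The `[NARASI]` and `[PENJELASAN]` markers, the sample text, and the `extract_narasi` helper are assumptions made for this example; the actual patterns used in the project are not shown in this report.

```python
import re

# Assumed layout: the narrative follows a "[NARASI]" marker and runs until the
# next bracketed, upper-case marker (for example "[PENJELASAN]") or end of text.
NARASI_PATTERN = re.compile(
    r"\[NARASI\]\s*:?\s*(.*?)(?=\n\s*\[[A-Z ]+\]|\Z)",
    re.DOTALL,
)


def extract_narasi(content: str) -> str:
    """Return the hoax narrative from a CONTENT string, or '' if no marker is found."""
    match = NARASI_PATTERN.search(content)
    if match is None:
        return ""
    # Collapse runs of whitespace so the narrative becomes one clean line.
    return " ".join(match.group(1).split())


# Hypothetical CONTENT value, for illustration only.
sample = (
    "[KATEGORI]: Konten Palsu\n"
    "[NARASI]: \"Lowongan kerja PT Contoh,\n gaji 10 juta\"\n"
    "[PENJELASAN]: Informasi tersebut tidak benar."
)
print(extract_narasi(sample))
```

Documents without a recognizable marker return an empty string, which makes them easy to count and inspect separately.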

Dataset Statistics

Category | Documents | % of Total | Avg Text Length
Politics | 1,358 | 36.2% | 392 chars
Scam | 939 | 25.1% | 401 chars
Others | 1,449 | 38.7% | 455 chars

Indonesian Text Preprocessing Pipeline

  1. Tokenization: Lowercase conversion, URL removal, special character cleaning
  2. Stopword Removal: Custom Indonesian stopwords (66 terms) + Sastrawi library defaults
  3. Stemming: Sastrawi Indonesian stemmer for morphological normalization
  4. Bigram Detection: Identified multi-word phrases (min count: 5, threshold: 10)
  5. Dictionary Filtering: Removed terms appearing in fewer than 2 docs or in more than 50% of docs
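
The cleaning and dictionary-filtering steps above can be sketched with the standard library alone. This is a simplified illustration, not the project's code: the stopword set is a tiny stand-in for the 66 custom terms plus Sastrawi defaults, stemming and bigram detection are left out, and `tokenize` and `filter_extremes` are hypothetical helper names. The filtering mirrors what gensim's `Dictionary.filter_extremes(no_below=2, no_above=0.5)` does.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption, not the report's actual list).
STOPWORDS = {"yang", "dan", "di", "ini", "itu"}


def tokenize(text: str) -> list[str]:
    """Lowercase, strip URLs and non-letter characters, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URL removal
    text = re.sub(r"[^a-z\s]", " ", text)               # special-character cleaning
    return [tok for tok in text.split() if tok not in STOPWORDS]


def filter_extremes(docs, no_below=2, no_above=0.5):
    """Keep terms found in at least `no_below` docs and in at most `no_above` of all docs."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    max_docs = no_above * len(docs)
    keep = {t for t, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[t for t in doc if t in keep] for doc in docs]


# Hypothetical hoax narratives, for illustration only.
raw = [
    "Lowongan kerja BUMN, daftar di https://contoh.example/loker",
    "Lowongan kerja Pertamina gaji 10 juta",
    "Banjir besar di kota ini",
    "Gempa dan banjir di kota",
]
docs = [tokenize(r) for r in raw]
filtered = filter_extremes(docs)
print(filtered)
```

On this toy corpus only terms that occur in exactly two of the four documents survive, which shows how aggressively the two thresholds prune both rare and very common terms on small inputs.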

LDA Configuration

  • Algorithm: Latent Dirichlet Allocation (gensim 4.4.0)
  • Topic Numbers Tested: 5, 7, and 10 for each category
  • Training: 10 passes, 100 iterations per model
  • Evaluation Metric: C_v coherence score (interpretability-focused)
  • Selection: Best model chosen based on highest coherence
Why different topic numbers? The best-scoring model differs by category: 10 topics for Politics, 5 for Scam, and 7 for Others. One reading is that political narratives are more diverse, while scam hoaxes follow a handful of basic templates. The Others category's slightly higher coherence (0.461) suggests its topics follow more stereotypical patterns than political attacks.

Model Overview

Model Selection & Performance (Coherence Score)

Category | 5 Topics | 7 Topics | 10 Topics | Best Choice
Politics | 0.4520 | 0.4525 | 0.4581 | 10 topics
Scam | 0.4523 | 0.4487 | 0.4419 | 5 topics
Others | 0.4502 | 0.4610 | 0.4531 | 7 topics
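
The selection rule (pick the topic count with the highest C_v score) can be checked directly against the scores in this table. The snippet below is a minimal sketch; `best_topic_count` is a hypothetical helper name, and the numbers are copied from the table.

```python
# C_v coherence scores copied from the table, keyed by number of topics.
coherence = {
    "Politics": {5: 0.4520, 7: 0.4525, 10: 0.4581},
    "Scam": {5: 0.4523, 7: 0.4487, 10: 0.4419},
    "Others": {5: 0.4502, 7: 0.4610, 10: 0.4531},
}


def best_topic_count(scores: dict[int, float]) -> int:
    """Return the number of topics with the highest coherence score."""
    return max(scores, key=scores.get)


best = {category: best_topic_count(scores) for category, scores in coherence.items()}
total_topics = sum(best.values())

for category, k in best.items():
    spread = max(coherence[category].values()) - min(coherence[category].values())
    print(f"{category}: best k = {k}, score spread = {spread:.4f}")
print("Total topics:", total_topics)
```

The per-category spread between the best and worst configuration is about 0.01 or less, so the ranking may be sensitive to the random seed; repeating the runs with several seeds would be one way to check this.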

Top 5 Most Prevalent Topics

  1. Mixed Content (Others-5): 639 docs (17% of total) - General miscellaneous hoaxes
  2. General Political Discourse (Politics-5): 449 docs (12%) - Anies Baswedan attacks
  3. Fake Job Recruitment (Scam-0): 446 docs (12%) - Freeport, Pertamina, BUMN scams
  4. Account Phishing (Scam-2): 334 docs (9%) - Banking and social media phishing
  5. Disasters & Events (Others-2): 263 docs (7%) - Natural disaster misinformation
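
The shares in this list are each topic's document count divided by the full corpus of 3,746 documents. A short sketch that ranks the five topics by count and reproduces the rounded percentages (the dictionary keys are shorthand labels chosen for this example):

```python
TOTAL_DOCS = 3746  # full corpus size reported in the executive summary

# Document counts for the five topics listed above.
topic_docs = {
    "Politics-5": 449,
    "Scam-0": 446,
    "Others-5": 639,
    "Scam-2": 334,
    "Others-2": 263,
}

# Sort by document count, largest first, and compute the share of the corpus.
ranked = sorted(topic_docs.items(), key=lambda item: item[1], reverse=True)
shares = {name: round(100 * count / TOTAL_DOCS) for name, count in ranked}

for name, count in ranked:
    print(f"{name}: {count} docs ({shares[name]}% of total)")
```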

Limitations & Considerations

  • Topic 6 (Politics): Contains fact-checking metadata artifacts, suggesting NARASI extraction could be refined
  • Topic 5 (Politics): Very large (449 docs, 33%) - may benefit from subdivision in future analysis
  • Temporal Gaps: Analysis doesn't track topic evolution over the election cycle - recommended for Phase II
  • Language Dependency: Indonesian-specific preprocessing may not transfer to other languages

Identified Topics by Category

Comprehensive overview of all 22 topics identified across three categories of hoax narratives.

Politics Category: 10 Topics

1,358 political hoax documents from the 2024 Indonesian election period.

Topic ID | Topic Label | Documents | % Share | Top Terms
0 | Social Media Verification | 107 | 7.9% | temu, rupa, gambar, jelas_akun, judul, sama, sedang
1 | Palace Appointments | 8 | 0.6% | istana, lantik, kaesang, pramono, ganti, bakal
2 | Foreign Interference & Religion | 8 | 0.6% | bongkar, cina, agama, partai, ancam, bikin
3 | Parliamentary Affairs | 9 | 0.7% | dpr, bayar, lapor, gratis, milik, jelas_buah
4 | Anti-Corruption Protests | 8 | 0.6% | demo, kpk, mahasiswa, gagal, kasus, libat
5 | General Politics (Anies Attacks) | 449 | 33.1% | indonesia, presiden, anies, negara, jadi, sebut
6 | Fact-Check Metadata (Artifacts) | 46 | 3.4% | disinformasi_first, draft_news, jakarta, jenis_mis
7 | Election Fraud (China) | 22 | 1.6% | china, suara, kalah, google, mau, jabat
8 | Jokowi-Prabowo-Gibran Coalition | 146 | 10.8% | jokowi, prabowo, gibran, ikn, pilkada, anak
9 | KPU Manipulation | 14 | 1.0% | kpu, pecat, kuasa, panggil, gak, naik
Key Insight: Topics 5 and 8 together represent 44% of all political hoaxes and focus on candidate-specific character attacks.

Scam Category: 5 Topics

939 scam-related hoax documents targeting economic vulnerabilities.

Topic ID | Topic Label | Documents | % Share | Top Terms
0 | Fake Job Recruitment | 446 | 47.5% | juta, loker, lowong_kerja, daftar, indonesia, gaji
1 | Lottery & Banking Scams | 121 | 12.9% | bank, undi, festival, gratis, motor, hadiah
2 | Account Phishing | 334 | 35.6% | akun, resmi, nomor, hubung, pihak, minta
3 | Fake Recruitment Letters | 155 | 16.5% | gaji, kerja, surat, posisi, terima, bulan
4 | Celebrity Deepfake Endorsements | 110 | 11.7% | rupa, temu, guna, taut_daftar, pasti, profil
Key Insight: Topics 0, 2, and 3 represent 82% of scam hoaxes and exploit economic vulnerability through fake job offers and financial fraud.

Others Category: 7 Topics

1,449 miscellaneous hoax documents covering health, disasters, and social issues.

Topic ID | Topic Label | Documents | % Share | Top Terms
0 | Health Misinformation | 165 | 11.4% | sehat, obat, sakit, sebab, akibat, bahaya
1 | News Articles & Headlines | 144 | 9.9% | artikel, judul, periksa_mafindo, tiba, tanda
2 | Disasters & Events | 263 | 18.2% | kota, banjir, bencana, gunung, warga, gempa
3 | COVID-19 & Conspiracy | 51 | 3.5% | vaksin, covid, virus, bill_gates, wef, digital
4 | Religion & Sports | 56 | 3.9% | islam, timnas, piala_dunia, umat, masjid
5 | Mixed Content | 639 | 44.1% | indonesia, baru, dapat, orang, masuk, jadi
6 | Medical Miracle Cures | 77 | 5.3% | sembuh, darah, air, ginjal, israel, minum
Note: This category achieved the highest coherence score (0.461), indicating well-separated and interpretable topics. Health misinformation (Topics 0, 3, and 6; 293 documents) represents about 20% of this category.

Interactive Topic Exploration

Use the tabs below to explore pyLDAvis interactive visualizations for each category. Each circular visualization shows topics positioned by their similarity - topics closer together share more vocabulary. Click on topics to see their top terms.

How to use: Click a category tab, then interact with the visualization below. Hover over circles to see topic prevalence, click to see top terms. Adjust the λ (lambda) slider to balance term frequency vs. topic specificity.

Politics Category: 2024 Election Narratives

1,358 political hoax documents analyzed. Topics identified include candidate attacks, election fraud narratives, foreign interference conspiracies, and institutional distrust themes.

Documents: 1,358
Topics: 10
Coherence: 0.458
Dominant Pattern: Candidate Attacks

Scam Category: Fraudulent Schemes

939 scam-related hoaxes analyzed. Topics include fake job offers, lottery scams, banking phishing, government aid fraud, and celebrity deepfake endorsements.

Documents: 939
Topics: 5
Coherence: 0.452
Main Target: Job Seekers

Others Category: Health, Disasters, Religion

1,449 miscellaneous hoaxes analyzed. Topics include natural disasters, COVID-19 vaccines, food safety scares, religious content, celebrity news, and conspiracy theories.

Documents: 1,449
Topics: 7
Coherence: 0.461
Quality: Highest

Understanding the Visualization

Left Panel: Topic Map (PC1 vs PC2)

PC1 & PC2 (Principal Components):
2D projection of high-dimensional topic space using dimensionality reduction. Each topic exists in vocabulary-sized space (1,234 dimensions), compressed to 2D for visualization.
Circle Position:
Close circles = topics sharing similar vocabulary (semantically related)
Distant circles = topics with different word distributions (distinct themes)
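Circle Size:
Proportional to the topic's overall prevalence in the corpus (pyLDAvis derives this from the share of tokens assigned to the topic). Larger circles = more prevalent topics.
Example: Topic 5 (General Politics), the dominant topic for 449 docs → very large circle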
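
To make "closeness" on the topic map concrete: with its default settings, pyLDAvis measures the distance between two topics as the Jensen-Shannon divergence of their term distributions, then projects those distances to 2D. Assuming the defaults were used here, the distance itself can be sketched as follows. The three toy distributions over a four-term vocabulary are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-term distributions over a 4-term vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.65, 0.25, 0.05, 0.05]  # vocabulary similar to topic_a
topic_c = [0.05, 0.05, 0.20, 0.70]  # vocabulary very different from topic_a

print("a vs b:", js_divergence(topic_a, topic_b))
print("a vs c:", js_divergence(topic_a, topic_c))
```

Topics a and b would be drawn close together, while a and c would sit far apart on the map.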

Right Panel: Term Rankings

Salient Terms:
Measures how distinctive and informative a term is across all topics.
High saliency = frequent AND concentrated in specific topics (helps distinguish topics)
Low saliency = appears uniformly across all topics (generic words)
Relevance Metric (λ lambda):
Controls term ranking balance using the λ slider:
λ = 1.0: Shows most frequent terms in topic
λ = 0.0: Shows most exclusive terms (rare in other topics)
λ = 0.6: Balanced view (the setting suggested by the LDAvis authors)
Tip: Set λ to 0.3-0.4 to find the most characteristic words for each topic
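
The relevance metric behind the λ slider, as defined by the LDAvis authors, is λ · log p(term | topic) + (1 − λ) · log(p(term | topic) / p(term)). The sketch below shows how the ranking flips between λ = 1 and λ = 0. The probabilities are made-up values for illustration, not figures from this analysis.

```python
import math


def relevance(p_term_given_topic: float, p_term: float, lam: float) -> float:
    """Relevance of a term to a topic for a given lambda in [0, 1]."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical (p(term | topic), p(term)) pairs for two terms in one topic.
topic_terms = {
    "indonesia": (0.050, 0.040),  # frequent in the topic, but frequent everywhere
    "anies": (0.030, 0.005),      # less frequent, but concentrated in this topic
}


def rank(lam: float) -> list[str]:
    """Order terms from most to least relevant at the given lambda."""
    return sorted(
        topic_terms,
        key=lambda term: relevance(*topic_terms[term], lam),
        reverse=True,
    )


for lam in (1.0, 0.6, 0.0):
    print(f"lambda = {lam}: {rank(lam)}")
```

At λ = 1 the generic but frequent term ranks first; at λ = 0 the term concentrated in the topic takes over, which is why lower λ values surface more characteristic words.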

Interactive Features

Hover over a word in bar chart:
Circles resize to show which topics contain that word:
Large circle = word is highly associated with this topic
Multiple large circles = word appears in multiple topics (ambiguous term)
One dominant circle = word is exclusive to one topic
Click a circle:
Bar chart updates to show top terms for that topic
Hover over a circle:
Shows topic prevalence (the topic's share of tokens in the corpus)
Exploration Workflow:
  1. Click a topic circle to see its top terms
  2. Adjust λ slider to 0.3 to find distinctive words
  3. Hover over specific words to see which other topics share them
  4. Identify topic clusters (nearby circles) to understand thematic relationships

Detailed Results by Category

Explore in-depth analysis, visualizations, and key topics for each category.

Politics Category: Detailed Results

The political hoax landscape is dominated by candidate-specific character attacks (44%), revealing a highly polarized campaign environment. Topic 5 (Anies Baswedan attacks) and Topic 8 (Jokowi dynasty narratives) together account for 595 documents, about 44% of all political misinformation.

Top Topics Identified

Topic 5: General Politics & Anies Attacks (449 docs - 33%)

Top terms: indonesia, presiden, anies, negara, jadi, sebut, telusur

The largest topic cluster, focused on attacking Anies Baswedan's character and qualifications. Represents the dominant narrative strategy in political misinformation.

Topic 8: Jokowi-Prabowo-Gibran Coalition (146 docs - 11%)

Top terms: jokowi, prabowo, gibran, ikn, pilkada, anak

Narratives about dynastic politics, coalition building, and the controversial candidacy of Gibran (Jokowi's son) as vice presidential candidate.

Topic 0: Social Media Verification (107 docs - 8%)

Top terms: temu, rupa, gambar, jelas_akun, judul

Hoaxes involving manipulated social media content, fake accounts, and doctored images/videos.

Topic 7: Election Fraud & Chinese Interference (22 docs)

Top terms: china, suara, kalah, google, mau

Conspiracy theories about Chinese interference in vote counting and election manipulation.

Topic 9: KPU Manipulation (14 docs)

Top terms: kpu, pecat, kuasa, panggil, gak

Narratives undermining trust in the General Election Commission (KPU) through claims of corruption and manipulation.

Key Insight: Institutional Trust Erosion

Beyond candidate attacks, topics targeting KPU (election commission), KPK (anti-corruption commission), and DPR (parliament) collectively represent systematic efforts to undermine institutional legitimacy - a dangerous pattern for Indonesian democracy.

Visualizations: Politics

[Figures: Topic Distribution; Coherence Comparison; Top Terms per Topic; Topic Word Clouds]

Scam Category: Detailed Results

Scam hoaxes overwhelmingly target job seekers (60%) through fake recruitment and employment offers. Economic vulnerability among Indonesians affected by unemployment makes them prime targets for fraudulent schemes.

Top Topics Identified

Topic 0: Fake Job Recruitment (446 docs - 47%)

Top terms: juta, loker, lowong_kerja, daftar, indonesia

Fraudulent job offers from fake PT Freeport, Pertamina, BUMN companies. Often promise high salaries and easy acceptance.

Topic 2: Account Phishing (334 docs - 36%)

Top terms: akun, resmi, nomor, hubung, pihak

Phishing attempts targeting banking, social media, and messaging app accounts through fake customer service contacts.

Topic 3: Fake Recruitment Letters (155 docs - 17%)

Top terms: gaji, kerja, surat, posisi, terima

Fake official recruitment letters claiming candidates have been accepted for positions at major companies.

Topic 1: Lottery & Banking Scams (121 docs - 13%)

Top terms: bank, undi, festival, gratis, motor

Fake lottery wins, free giveaways, and fraudulent banking promotions.

Topic 4: Celebrity Deepfake Endorsements (110 docs - 12%)

Top terms: rupa, temu, guna, taut_daftar, pasti

Deepfake videos of celebrities (Rhoma Irama, Nagita Slavina, Raffi Ahmad) promoting gambling sites.

Exploitation Pattern

82% of scam hoaxes (Topics 0, 2, and 3) exploit economic vulnerability by targeting job seekers and those seeking financial opportunities. This suggests scammers tailor their schemes to Indonesia's unemployment challenges.

Visualizations: Scam

[Figures: Topic Distribution; Coherence Comparison; Top Terms per Topic; Topic Word Clouds]

Others Category: Detailed Results

The Others category achieved the highest coherence score (0.461), suggesting its topics are the most distinct and interpretable. Health misinformation (about 20% combined across Topics 0, 3, and 6) and disaster-related hoaxes (18%) are the largest identifiable themes, alongside the catch-all Mixed Content topic (44%).

Top Topics Identified

Topic 2: Disasters & Events (263 docs - 18%)

Top terms: kota, banjir, bencana, gunung, warga

Misinformation about natural disasters (floods, earthquakes, tsunamis) and emergency events.

Topic 0: Health Misinformation (165 docs - 11%)

Top terms: sehat, obat, sakit, sebab, akibat

General health misinformation including fake cure claims and medical advice.

Topic 1: News Articles (144 docs - 10%)

Top terms: artikel, judul, periksa_mafindo, tiba

Fake news articles and misleading headlines from unreliable sources.

Topic 6: Medical Miracle Cures (77 docs - 5%)

Top terms: sembuh, darah, air, ginjal, israel

Unverified miracle cure claims and medical conspiracies (e.g., water cures, blood treatments).

Topic 4: Religion & Sports (56 docs - 4%)

Top terms: islam, timnas, piala_dunia, umat

Religious content mixed with sports hoaxes, particularly around Indonesian national football team.

Health Misinformation Legacy

COVID-19 and vaccine narratives persist beyond the pandemic peak. Combined health topics (0, 3, and 6) represent about 20% of the Others category, showing the lasting impact of pandemic-era misinformation on public health discourse.

Visualizations: Others

[Figures: Topic Distribution; Coherence Comparison; Top Terms per Topic; Topic Word Clouds]

Key Findings & Implications

1. Electoral Polarization Through Character Attacks

Finding: 44% of political hoaxes are candidate-specific character attacks. Attacks on Anies Baswedan (449 docs) outnumber narratives about the Jokowi-Prabowo-Gibran coalition (146 docs) by roughly three to one.

Implication: Campaign discourse prioritizes personal attacks over policy discussions, deepening societal polarization and reducing substantive democratic debate.

2. Economic Vulnerability Exploitation

Finding: 82% of scam content targets job seekers through fake recruitment (60%) and financial fraud (22%).

Implication: Scammers systematically exploit Indonesia's unemployment challenges, preying on economic desperation with fraudulent employment opportunities.

3. Persistent COVID-19 Misinformation

Finding: Health misinformation (Topics 0, 3, and 6) represents about 20% of the Others category, with vaccine conspiracies and miracle cure claims persisting beyond the pandemic peak.

Implication: Pandemic-era misinformation has lasting effects on public health discourse and vaccine hesitancy.

4. Institutional Trust Erosion

Finding: Systematic targeting of KPU (election commission), KPK (anti-corruption), and DPR (parliament) through conspiracy narratives.

Implication: Deliberate erosion of institutional legitimacy threatens Indonesia's democratic stability beyond the election cycle.

5. Foreign Interference Narratives

Finding: Chinese interference conspiracies appear in two political topics (Topics 2 and 7) and are mixed with religious polarization themes.

Implication: Xenophobic narratives are weaponized for electoral purposes, potentially damaging Indonesia-China relations and fueling racial tensions.

Democratic Implications: The combination of character attacks (44%), institutional distrust narratives, and foreign interference conspiracies creates a triple threat to Indonesian democracy: candidate delegitimization, institutional erosion, and nationalist mobilization.

Recommendations & Next Steps

For Researchers

  1. Temporal Analysis: Track topic prevalence over time, correlate with election events (debates, rallies, voting day)
  2. Sentiment Integration: Apply sentiment analysis to each topic cluster to measure emotional manipulation tactics
  3. Network Analysis: Build entity co-occurrence networks to map relationships between candidates, institutions, and narratives
  4. Cross-Platform Study: Compare hoax patterns across WhatsApp, Facebook, Twitter/X, and TikTok
  5. Predictive Modeling: Train classifiers to auto-categorize new hoaxes and build early warning systems
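
As a starting point for the temporal analysis suggested above, topic prevalence can be tallied per month once each document has a date and a dominant topic. This is only a sketch under assumptions: the `records` list is invented sample data, and `monthly_topic_share` is a hypothetical helper, not part of the project's code.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical (publication date, dominant topic) pairs, for illustration only.
records = [
    (date(2024, 1, 15), "Politics-5"),
    (date(2024, 1, 28), "Politics-8"),
    (date(2024, 2, 10), "Politics-5"),
    (date(2024, 2, 13), "Politics-5"),
    (date(2024, 2, 20), "Scam-0"),
]


def monthly_topic_share(records):
    """Return {month: {topic: share of that month's documents}}, months in order."""
    counts = defaultdict(Counter)
    for day, topic in records:
        counts[day.strftime("%Y-%m")][topic] += 1
    return {
        month: {topic: n / sum(c.values()) for topic, n in c.items()}
        for month, c in sorted(counts.items())
    }


shares = monthly_topic_share(records)
for month, topics in shares.items():
    print(month, topics)
```

Plotting these monthly shares against election events (debates, rallies, voting day) would show whether particular narratives spike around them.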

For Policymakers

  1. Counter-Narrative Programs: Develop targeted fact-checking for the 5 most prevalent topics identified
  2. Digital Literacy: Focus on job-seeking populations (scam targets) and elderly (health misinformation targets)
  3. Platform Accountability: Require social media platforms to address celebrity deepfake abuse
  4. Institutional Communication: KPU, KPK, DPR should proactively counter distrust narratives with transparency

For Journalists

  1. Investigation Topics: Use identified clusters (Chinese interference, dynastic politics) for investigative reporting
  2. Fact-Check Prioritization: Focus on high-prevalence topics (Anies attacks, fake jobs) for maximum impact
  3. Data Journalism: Visualize topic evolution during campaign period to show manipulation patterns

Appendix

Technical Specifications

Software Stack

  • LDA Implementation: gensim 4.4.0
  • Indonesian NLP: Sastrawi 1.0.1 (stemming), NLTK 3.8.0
  • Visualization: pyLDAvis 3.4.1, matplotlib 3.10.7, seaborn, wordcloud
  • Data Processing: pandas 2.2.3, numpy 2.3.5, scipy 1.16.3
  • Categorization: Google Gemini 2.5 Flash (LLM)

Reproducibility

All models were trained with random_state=42 for reproducibility. Complete code, data, and models are available in the project repository.

Files Generated

  • 9 trained LDA models (.pkl) - 3 configurations × 3 categories
  • 3 dictionaries (.pkl) - one per category
  • 12 visualization images (.png) - 4 per category
  • 9 data files (topic_terms.csv, document_topics.csv, coherence_scores.csv) - 3 sets
  • 1 combined interactive visualization (this page)

Computational Resources

Analysis completed on standard hardware (no GPU required). Training time: ~5 minutes per category (15 minutes total). Preprocessing: ~2 minutes per category.

About This Analysis

Project Title: The Dynamics of Hoaxes in Indonesia: Topic Modeling Analysis of TurnBackHoax.id Data in 2024

Researcher: Ali Al Harkan (2406480304)

Course: Digital Research Methods (Metode Riset Digital)

Instructor: Dr. Eriyanto

Date: November 24, 2025

Data Source: TurnBackHoax.id (MAFINDO)

Context: 2024 Indonesian Presidential Election period

Acknowledgments

This analysis is based on fact-checking work by MAFINDO (Masyarakat Anti Fitnah Indonesia), Indonesia's first anti-hoax community. Their tireless efforts to combat misinformation make research like this possible.

Citation

If using this analysis, please cite:
Al Harkan, A. (2025). The Dynamics of Hoaxes in Indonesia: Topic Modeling Analysis of TurnBackHoax.id Data in 2024. Digital Research Methods Course Project. Dataset: TurnBackHoax.id, 3,746 documents. Method: LDA with Indonesian text preprocessing.

Contact: For questions or collaboration opportunities regarding this analysis, please reach out through the course instructor.