Mapping out the DRAPAC26 Submission proposal entries by numbers and topics

The Digital Rights in Asia-Pacific Assembly (DRAPAC) 2026 is an important gathering of digital rights advocates, policymakers, activists, and technologists dedicated to discussing and addressing pressing digital rights issues in the Asia-Pacific region. I'm very excited about this year's DRAPAC, and I'm curious about what organizations and individuals are proposing for their sessions. I believe it is worth looking over all the ideas and getting the gist of them, to understand how the narrative has evolved and what the submitters' interests and demands are.

To do that, I scraped all the entries using my personal DRAPAC account and tallied the metadata myself, with the help of LLMs/AI. The organizing team did not give me the data, and I have informed them that I ran a scraper.

This year, DRAPAC26 focuses on 2 themes (Track 1: Co-creating shared resources and Track 2: Collaborative action across movements) and 8 session formats. The submission period is closed, and it drew 393 entries. So, let's unfold this further by the numbers.

Thematic tracks:

| Track | Sessions | Share |
| --- | --- | --- |
| Co-creating shared resources | 227 | 57.8% |
| Collaborative action across movements | 163 | 41.5% |
| Untagged | 3 | 0.8% |

Session Formats

| Format | Total | Share |
| --- | --- | --- |
| Co-learning workshop | 122 | 31.0% |
| Panel discussion | 104 | 26.5% |
| Roundtable discussion | 62 | 15.8% |
| Ideation workshop | 56 | 14.2% |
| Exhibit / display / performance | 20 | 5.1% |
| Social gathering | 13 | 3.3% |
| Booth, stand, or clinic | 9 | 2.3% |
| Parallel event | 7 | 1.8% |

Track × Format Matrix

| Format | Co-creating | Collaborative |
| --- | --- | --- |
| Co-learning workshop | 91 | 30 |
| Panel discussion | 43 | 59 |
| Roundtable discussion | 27 | 35 |
| Ideation workshop | 42 | 14 |
| Exhibit / display / performance | 13 | 7 |
| Social gathering | 3 | 10 |
| Booth, stand, or clinic | 5 | 4 |
| Parallel event | 3 | 4 |

As the data above shows, the "Co-creating shared resources" track leads the "Collaborative action across movements" track by 16.3 percentage points (57.8% versus 41.5%). This highlights the high interest in this track, which focuses on building shared infrastructure, from platforms, systems, and databases to skills, networks, and strategies, that can be scaled across the region.
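For readers who want to check the arithmetic, here is a minimal sketch that recomputes the shares and the gap from the counts in the "Thematic tracks" table. The variable names are my own and are not taken from the analysis scripts.

```python
# Track counts copied from the "Thematic tracks" table above.
track_counts = {
    "Co-creating shared resources": 227,
    "Collaborative action across movements": 163,
    "Untagged": 3,
}

total = sum(track_counts.values())  # all scraped submissions

# Share of all submissions per track, in percent.
shares = {name: 100 * n / total for name, n in track_counts.items()}

# Gap between the two tracks, in percentage points (not percent).
gap_pp = (shares["Co-creating shared resources"]
          - shares["Collaborative action across movements"])

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"Gap: {gap_pp:.1f} percentage points")
```

The gap is a difference between two shares of the same total, which is why it is expressed in percentage points.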

Although it trails in submission count, Track 2 (Collaborative action across movements) still accounts for roughly two in five entries. It focuses on bridging the gap between civil society, government, and the private sector to co-develop rights-based digital policies, and it stresses the importance of discussion around these issues.

Looking at the track-by-format matrix, each track leans toward the session formats that best suit its discussions or activities. Track 1 favors co-learning workshops and ideation workshops, while Track 2 leans toward panel discussions, roundtable discussions, and social gatherings.
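Because Track 1 has more submissions overall, raw counts can overstate its lead in any one format. A rough way to compare preferences is to look at each format's share within its own track. The sketch below does that with the numbers from the matrix; it is an illustration with my own variable names, not part of the original scripts.

```python
# Rows copied from the Track x Format matrix: (Co-creating, Collaborative).
matrix = {
    "Co-learning workshop": (91, 30),
    "Panel discussion": (43, 59),
    "Roundtable discussion": (27, 35),
    "Ideation workshop": (42, 14),
    "Exhibit / display / performance": (13, 7),
    "Social gathering": (3, 10),
    "Booth, stand, or clinic": (5, 4),
    "Parallel event": (3, 4),
}

co_total = sum(co for co, _ in matrix.values())
col_total = sum(col for _, col in matrix.values())

# Share of each format within its own track, in percent.
within_track = {
    fmt: (100 * co / co_total, 100 * col / col_total)
    for fmt, (co, col) in matrix.items()
}

for fmt, (co_pct, col_pct) in within_track.items():
    lead = "Track 1" if co_pct > col_pct else "Track 2"
    print(f"{fmt:<32} {co_pct:5.1f}% {col_pct:5.1f}%  -> {lead}")
```

Read this way, co-learning and ideation workshops take a larger share of Track 1, while panels, roundtables, and social gatherings take a larger share of Track 2.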

Furthermore, to better understand what people are proposing, I ran basic topic modeling on the submission content. I used LDA (Latent Dirichlet Allocation) to uncover the central topics and how they are distributed across the sessions. The following table shows each topic, the number of sessions where it is the dominant topic, and its most heavily weighted words.

| Topic | Sessions | Top Words (topn=12) |
| --- | --- | --- |
| T0: AI Systems & Public Accountability | 65 (16.5%) | data, social, public, platforms, media, surveillance, systems, algorithmic, platform, youth, people, power |
| T1: Civil Society Digital Security Capacity | 9 (2.3%) | human, security, strategies, civil, society, technical, law, training, myanmar, challenges, action, cybersecurity |
| T2: Regional Governance & Policy | 110 (28.0%) | regional, civil, governance, policy, society, shared, platform, data, systems, online, public, asia-pacific |
| T3: Digital Rights Practice & Experience | 65 (16.5%) | space, data, human, shared, climate, people, online, open, collective, building, work, challenges |
| T4: Community Organizing & Grassroots Capacity | 53 (13.5%) | communities, shared, systems, community, collective, movement, regional, movements, asia-pacific, end, to:, grassroots |
| T5: Legal Frameworks, Evidence & Platform Accountability | 32 (8.1%) | legal, human, evidence, platform, security, platforms, challenges, international, regional, resource, criminal, media |
| T6: Digital Security Tools & Practical Response | 20 (5.1%) | human, online, legal, security, collective, right, practical, tools, strategies, session, decision, response |
| T7: Youth, Media & Narrative Resistance | 39 (9.9%) | internet, civic, movements, people, young, media, communication, social, online, collective, narratives, storytelling |

Next, let's see which countries or regions are mentioned most across the proposals. The counts come from a rule-based keyword match over the session Title and Content fields. A session counts once per country if any variant of the country's name is present.

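| Country | Co-creating | Collaborative | Total |
| --- | --- | --- | --- |
| Pacific (region) | 112 | 83 | 196 |
| Indonesia | 67 | 45 | 114 |
| Philippines | 51 | 44 | 96 |
| India | 56 | 50 | 106 |
| Nepal | 41 | 32 | 73 |
| Myanmar | 31 | 19 | 50 |
| Bangladesh | 30 | 18 | 48 |
| Pakistan | 11 | 14 | 26 |
| Malaysia | 19 | 7 | 26 |
| Sri Lanka | 13 | 7 | 20 |
| Thailand | 9 | 4 | 14 |
| Taiwan | 10 | 3 | 13 |
| Singapore | 6 | 6 | 12 |
| Vietnam | 5 | 4 | 9 |
| China | 2 | 5 | 7 |
| Cambodia | 4 | 3 | 7 |
| Korea | 3 | 3 | 6 |
| Japan | 2 | 3 | 5 |
| Australia | 1 | 2 | 4 |
| Papua | 2 | 0 | 2 |
| Timor | 1 | 0 | 1 |
| Laos | 1 | 0 | 1 |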

Lastly, I ran a word-matching script to see if AI-related keywords were mentioned in the content of the proposals. It turns out that almost 30% of the entries include AI-related terms, as follows:

| Term | Sessions | Share |
| --- | --- | --- |
| "ai" (word-boundary) | 94 | 23.9% |
| "artificial intelligence" (substring) | 14 | 3.6% |
| "algorithmic" (word-boundary) | 36 | 9.2% |
| "algorithm" (word-boundary) | 7 | 1.8% |
| "genai" (substring) | 1 | 0.3% |
| Union (ai, algorithmic, algorithm, genai) | 117 | 29.8% |

It is fascinating to see diverse ideas being submitted. We certainly won't see most of them being presented at DRAPAC, but I'm sure the ideas are worth spreading and must be taken into account, especially for individuals and organizations within the region and beyond.

To close, most of the scripts/code that helped me generate the data above were generated by LLMs and agentic AI platforms. I manually reviewed and adjusted some of them as a cross-check. I attached the full analysis below, generated 100% by AI. The wording above was mostly written by me. I am planning to update this post to include all the scraped sessions and the generated code, for reproducibility and further experimentation.

DRAPAC 26 — Session Analysis

Source: drap.ac/26/activities/
Scraped: Wave 1 = 2026-03-29 (311 files); Wave 2 = 2026-03-30 (+82 files, total 393)
Total sessions: 393 markdown files
Vault source: D-ARCHIVES/DRAPAC 26 Sessions Submission/


Executive Summary

DRAPAC 26 is the 2026 edition of the Digital Rights in Asia-Pacific Assembly. Three methods were applied to the 393 session Content fields:

  1. Unsupervised LDA topic modeling (k=8, k=12) — discovers latent themes statistically
  2. UMAP + HDBSCAN clustering — semantic clustering of sessions by document-topic similarity
  3. Content coverage clusters (keyword-based) — comparative reference

Key findings:

  • LDA identifies Regional Governance & Policy (T2, 110 sessions, 28%) as the single largest latent topic, covering civil society's role in internet governance, regulatory frameworks, and regional policy coordination. This is the programme's backbone discourse.
  • Semantic clustering discovers 7 distinct session communities: "Open Space & People", "Data, AI & Platform Accountability", "Civil Society Governance AI", "Collective Care & Movement", "Country-Specific Implementation", "Legal Evidence & Journalism", and "Community Infrastructure Design", each with its own track composition.
  • Track differentiation is real but nuanced: Collaborative skews toward AI and surveillance discourse (T0, +8pp) and practice-oriented sessions (T3, +4.5pp); Co-creating skews toward collective care movements (C5, 80% Co-creating) and community infrastructure (C2, 61% Co-creating), and leads on Governance & Policy framing (T2, +5pp). Legal/evidence (T5) is evenly split between the tracks.
  • AI-related terms ("ai", "algorithmic", "algorithm", "genai") appear in 29.8% of sessions, which makes AI one of the most widely mentioned substantive concerns alongside governance.

Overview

DRAPAC 26 is the 2026 edition of the Digital Rights in Asia-Pacific Assembly. The vault contains 393 scraped session submissions across two waves: 311 files scraped 2026-03-29 and 82 additional files from 2026-03-30.

Theme analysis design: Three complementary unsupervised methods were applied to the Content field (session description body only). The Content field was preprocessed: lowercased, stopwords removed, markdown stripped, and tokenised (min word length 3, no punctuation-only tokens). Gensim LDA was used for topic modeling; UMAP + HDBSCAN for semantic clustering; content coverage analysis (keyword clusters) as comparative reference.

All theme statistics are Content-only — Organiser and Title fields are excluded to avoid inflation from personal specialisations and session labels.


Thematic Tracks

DRAPAC 26 has two thematic tracks:

| Track | Sessions | Share |
| --- | --- | --- |
| Co-creating shared resources | 227 | 57.8% |
| Collaborative action across movements | 163 | 41.5% |
| Untagged | 3 | 0.8% |

Session Formats

| Format | Total | Share |
| --- | --- | --- |
| Co-learning workshop | 122 | 31.0% |
| Panel discussion | 104 | 26.5% |
| Roundtable discussion | 62 | 15.8% |
| Ideation workshop | 56 | 14.2% |
| Exhibit / display / performance | 20 | 5.1% |
| Social gathering | 13 | 3.3% |
| Booth, stand, or clinic | 9 | 2.3% |
| Parallel event | 7 | 1.8% |

Track × Format Matrix

| Format | Co-creating | Collaborative |
| --- | --- | --- |
| Co-learning workshop | 91 | 30 |
| Panel discussion | 43 | 59 |
| Roundtable discussion | 27 | 35 |
| Ideation workshop | 42 | 14 |
| Exhibit / display / performance | 13 | 7 |
| Social gathering | 3 | 10 |
| Booth, stand, or clinic | 5 | 4 |
| Parallel event | 3 | 4 |

Countries & Regional Focus

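| Country | Co-creating | Collaborative | Total |
| --- | --- | --- | --- |
| Pacific (region) | 112 | 83 | 196 |
| Indonesia | 67 | 45 | 114 |
| Philippines | 51 | 44 | 96 |
| India | 56 | 50 | 106 |
| Nepal | 41 | 32 | 73 |
| Myanmar | 31 | 19 | 50 |
| Bangladesh | 30 | 18 | 48 |
| Pakistan | 11 | 14 | 26 |
| Malaysia | 19 | 7 | 26 |
| Sri Lanka | 13 | 7 | 20 |
| Thailand | 9 | 4 | 14 |
| Taiwan | 10 | 3 | 13 |
| Singapore | 6 | 6 | 12 |
| Vietnam | 5 | 4 | 9 |
| China | 2 | 5 | 7 |
| Cambodia | 4 | 3 | 7 |
| Korea | 3 | 3 | 6 |
| Japan | 2 | 3 | 5 |
| Australia | 1 | 2 | 4 |
| Papua | 2 | 0 | 2 |
| Timor | 1 | 0 | 1 |
| Laos | 1 | 0 | 1 |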

Method 1: LDA Topic Modeling

Design

Latent Dirichlet Allocation (LDA) discovers latent topics as probability distributions over words — sessions are not hard-assigned to topics but have a probability distribution across all topics. k is the number of topics the model is constrained to find; you choose it before running the model. Gensim LDA was used with:

  • Dictionary: tokenised Content, stopwords removed (custom stoplist + Gensim STOPWORDS), min doc frequency 5, max doc frequency 65%
  • Two runs: k=8 (primary, 40 passes) and k=12 (for HDBSCAN input, 30 passes)
  • Alpha/eta: auto — the model learns document-topic and topic-word sparsity from the data

LDA Topics Discovered (k=8)

Each topic is a ranked list of words. The number in parentheses is the number of sessions where this topic is the dominant topic (highest probability in the k=8 model).

| Topic | Sessions | Top Words (topn=12) |
| --- | --- | --- |
| T0: AI Systems & Public Accountability | 65 (16.5%) | data, social, public, platforms, media, surveillance, systems, algorithmic, platform, youth, people, power |
| T1: Civil Society Digital Security Capacity | 9 (2.3%) | human, security, strategies, civil, society, technical, law, training, myanmar, challenges, action, cybersecurity |
| T2: Regional Governance & Policy | 110 (28.0%) | regional, civil, governance, policy, society, shared, platform, data, systems, online, public, asia-pacific |
| T3: Digital Rights Practice & Experience | 65 (16.5%) | space, data, human, shared, climate, people, online, open, collective, building, work, challenges |
| T4: Community Organizing & Grassroots Capacity | 53 (13.5%) | communities, shared, systems, community, collective, movement, regional, movements, asia-pacific, end, to:, grassroots |
| T5: Legal Frameworks, Evidence & Platform Accountability | 32 (8.1%) | legal, human, evidence, platform, security, platforms, challenges, international, regional, resource, criminal, media |
| T6: Digital Security Tools & Practical Response | 20 (5.1%) | human, online, legal, security, collective, right, practical, tools, strategies, session, decision, response |
| T7: Youth, Media & Narrative Resistance | 39 (9.9%) | internet, civic, movements, people, young, media, communication, social, online, collective, narratives, storytelling |

Interpretation: The largest latent topic is Regional Governance & Policy (T2, 28%) — sessions on civil society's role in internet governance, regulatory frameworks, and regional policy coordination. T0 (AI Systems & Public Accountability, 16.5%) is the second major focus: AI audits, platform surveillance, and public sector accountability. T3 (Digital Rights Practice & Experience, 16.5%) is a broad catch-all for sessions focused on practice and experience across accessibility, OGBV, climate data, and community care. T4 (Community Organizing & Grassroots Capacity, 13.5%) covers collective infrastructure: queer communities, trauma-informed care, peer exchange. T7 (Youth, Media & Narrative Resistance, 9.9%) is a distinct internet-culture-and-activism cluster. T1 and T6 are smaller but substantive: CSO security infrastructure and practical digital security tools respectively.

LDA Topic × Track

| Topic | Co-creating | Collaborative |
| --- | --- | --- |
| T2: Regional Governance & Policy | 69 (30.4%) | 41 (25.2%) |
| T0: AI Systems & Public Accountability | 30 (13.2%) | 35 (21.5%) |
| T3: Digital Rights Practice & Experience | 33 (14.5%) | 31 (19.0%) |
| T4: Community Organizing & Grassroots Capacity | 31 (13.7%) | 22 (13.5%) |
| T7: Youth, Media & Narrative Resistance | 28 (12.3%) | 10 (6.1%) |
| T5: Legal Frameworks, Evidence & Platform Accountability | 18 (7.9%) | 13 (8.0%) |
| T6: Digital Security Tools & Practical Response | 14 (6.2%) | 6 (3.7%) |
| T1: Civil Society Digital Security Capacity | 4 (1.8%) | 5 (3.1%) |

Key track differentiators:

  • Co-creating leads on: T2 Regional Governance & Policy (+5pp), T7 Youth, Media & Narrative Resistance (+6pp), T6 Digital Security Tools & Practical Response (+2.5pp)
  • Collaborative leads on: T0 AI Systems & Public Accountability (+8pp), T3 Digital Rights Practice & Experience (+4.5pp)
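The percentage-point gaps above can be reproduced from the counts in the LDA Topic × Track table. The following is a minimal sketch, assuming the track totals of 227 and 163 sessions; the helper name pp_gap is hypothetical and does not come from drapac_analysis.py.

```python
TRACK_TOTALS = {"co": 227, "col": 163}

# Dominant-topic counts per track, copied from the LDA Topic x Track table:
# (Co-creating, Collaborative).
topic_counts = {
    "T2": (69, 41), "T0": (30, 35), "T3": (33, 31), "T4": (31, 22),
    "T7": (28, 10), "T5": (18, 13), "T6": (14, 6), "T1": (4, 5),
}

def pp_gap(co, col):
    """Co-creating share minus Collaborative share, in percentage points."""
    return 100 * co / TRACK_TOTALS["co"] - 100 * col / TRACK_TOTALS["col"]

gaps = {topic: pp_gap(co, col) for topic, (co, col) in topic_counts.items()}

# Positive gap: Co-creating leads. Negative gap: Collaborative leads.
for topic, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    leader = "Co-creating" if gap > 0 else "Collaborative"
    print(f"{topic}: {gap:+.1f}pp ({leader} leads)")
```

Topics with a gap near zero, such as T4 and T5, are effectively shared between the two tracks.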

Method 2: UMAP + HDBSCAN Semantic Clustering

Design

UMAP (Uniform Manifold Approximation and Projection) reduces the doc-topic matrix (12 LDA topics) to 2 dimensions using cosine distance, preserving semantic neighbourhood structure. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) then clusters the 2D embedding, identifying dense regions as clusters and sparse regions as noise — without requiring a fixed number of clusters.

Parameters: UMAP (n_neighbors=15, min_dist=0.1, metric=cosine) → HDBSCAN (min_cluster_size=25, metric=euclidean, eom selection).

Clusters Discovered

7 clusters + noise (60 sessions, 15.3%). Each cluster is characterised by its dominant LDA topic, top TF-IDF terms, and track composition.

| Cluster | Sessions | Dominant LDA Topic | Co-creating | Collaborative | Label |
| --- | --- | --- | --- | --- | --- |
| C4 | 86 (21.9%) | T3: Digital Rights Practice & Experience | 44 (51%) | 40 (47%) | Open Space & People |
| C0 | 59 (15.0%) | T0: AI Systems & Public Accountability | 29 (49%) | 30 (51%) | Data, AI & Platform Accountability |
| C3 | 54 (13.7%) | T2: Regional Governance & Policy | 27 (50%) | 27 (50%) | Civil Society Governance AI |
| C5 | 46 (11.7%) | T4: Community Organizing & Grassroots | 37 (80%) | 9 (20%) | Collective Care & Movement |
| C6 | 31 (7.9%) | T9: Systems & Accountability | 14 (45%) | 17 (55%) | Country-Specific Implementation |
| C1 | 29 (7.4%) | T5: Legal & Evidence | 17 (59%) | 11 (38%) | Legal Evidence & Journalism |
| C2 | 28 (7.1%) | T8: Infrastructure Design | 17 (61%) | 11 (39%) | Community Infrastructure Design |
| Noise | 60 (15.3%) | - | 42 (70%) | 18 (30%) | - |

Cluster Profiles

C4 — Open Space & People (86 sessions, 21.9%)

The largest cluster. Sessions focus on people, internet, online space, media access, and open tools — session design, community digital access, and open digital commons. Nearly equal track split.
Top TF-IDF: session, people, internet, online, space, rights, media, access, social, open, tools
LDA: T3 "space | human | data | people | open"

C0 — Data, AI & Platform Accountability (59 sessions, 15.0%)

Sessions discuss data governance, AI, surveillance, platform accountability, and government oversight of digital platforms. Equal track split — this is a shared concern.
Top TF-IDF: data, ai, rights, surveillance, platforms, public, governments, governance, systems, accountability
LDA: T0 "data | public | surveillance | platforms"

C3 — Civil Society Governance AI (54 sessions, 13.7%)

Sessions focus on AI governance, civil society engagement with AI systems, and governance policy frameworks. Equal track split — this is a programmatic centrepiece.
Top TF-IDF: ai, governance, rights, data, systems, civil, civil society, content, society, digital rights
LDA: T2 "civil | data | governance | policy"

C5 — Collective Care & Movement (46 sessions, 11.7%)

The clearest Co-creating-skewed cluster (80% Co-creating, only 20% Collaborative). Sessions focus on collective care, movement infrastructure, activists' wellbeing, and digital security support for movements. This is the social-justice-organising heart of the programme.
Top TF-IDF: care, collective, support, communities, digital security, movement, activists, space, movements, online
LDA: T4 "shared | collective | movements | movement"

C6 — Country-Specific Implementation (31 sessions, 7.9%)

Sessions focused on country-level digital rights implementation, accountability mechanisms in India, Nepal, Philippines, and Indonesia — regulatory frameworks, remedies, and community digital systems at country level.
Top TF-IDF: systems, communities, indonesia nepal, digital systems, nepal, philippines indonesia, india philippines, remedy, centre
LDA: T9 "systems | communities | accountability | community"

C1 — Legal Evidence & Journalism (29 sessions, 7.4%)

Sessions on human rights evidence, criminal and international law, and journalist protection: a specialist legal cluster that skews Co-creating (59% versus 38%).
Top TF-IDF: evidence, human, criminal, human rights, legal, rights, journalists, platforms, ai
LDA: T5 "legal | human | evidence | platforms | criminal"

C2 — Community Infrastructure Design (28 sessions, 7.1%)

Sessions focused on designing resources, frameworks, and infrastructure for communities — practical Co-creating sessions about building shared tools and frameworks.
Top TF-IDF: resources, design, session, community, framework, data, infrastructure, regional, platform, user, shared
LDA: T8 "shared | platform | regional | infrastructure | practical"


AI Presence (Content-Only)

AI detection uses case-insensitive matching against the Content field only (not Title or Organiser). Five forms are tracked separately, each using the appropriate match type:

import re

ai_re    = re.compile(r'\bai\b',                  re.IGNORECASE)  # word-boundary
artif_re = re.compile(r'artificial intelligence',   re.IGNORECASE)  # phrase (no word boundary)
algo_re  = re.compile(r'\balgorithmic\b',          re.IGNORECASE)  # word-boundary
algow_re = re.compile(r'\balgorithm\b',            re.IGNORECASE)  # word-boundary
genai_re = re.compile(r'genai',                    re.IGNORECASE)  # substring: GenAI, genai-based
# Union row: a session counts once if any of these four patterns matches
union_re = re.compile(r'\bai\b|\balgorithmic\b|\balgorithm\b|genai', re.IGNORECASE)

Edge cases:

  • Hyphenated forms ("AI-powered", "AI-driven"): 37 of 38 sessions are caught by \bai\b, because a hyphen is a non-word character and leaves a word boundary after "AI"; the remaining session is caught only by genai
  • "GenAI-based tools" (#37157): genai substring catches it; \bai\b does not (no word boundary in merged token "GenAI") — this is the sole genai count
  • "Geospatial Artificial Intelligence (GeoAI)" (#37432): caught by artificial intelligence row
  • "journalism aids remembering" (#38744): not AI — word "aids" has no word boundary between "ai" and "ds"; correctly excluded

| Term | Sessions | Share |
| --- | --- | --- |
| "ai" (word-boundary) | 94 | 23.9% |
| "artificial intelligence" (substring) | 14 | 3.6% |
| "algorithmic" (word-boundary) | 36 | 9.2% |
| "algorithm" (word-boundary) | 7 | 1.8% |
| "genai" (substring) | 1 | 0.3% |
| Union (ai, algorithmic, algorithm, genai) | 117 | 29.8% |
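The edge cases listed above can be checked directly against the two patterns involved. This is a small sketch using Python's re module: the second and third strings quote the report's own examples, while the first is a hypothetical phrase made up for illustration.

```python
import re

AI_RE = re.compile(r"\bai\b", re.IGNORECASE)      # word-boundary match
GENAI_RE = re.compile(r"genai", re.IGNORECASE)    # plain substring match

texts = [
    "AI-powered moderation tools",   # hypothetical example phrase
    "GenAI-based tools",             # quoted in the edge cases above
    "journalism aids remembering",   # quoted in the edge cases above
]

# For each text: (matched by the "ai" pattern, matched by the "genai" pattern)
results = {
    text: (bool(AI_RE.search(text)), bool(GENAI_RE.search(text)))
    for text in texts
}

for text, (ai_hit, genai_hit) in results.items():
    print(f"{text!r}: ai={ai_hit}, genai={genai_hit}")
```

The hyphen in "AI-powered" is a non-word character, so the word-boundary pattern matches it, whereas "GenAI" and "aids" both have a word character adjacent to "ai" and are skipped by that pattern.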

LDA context: T0 (AI Systems & Public Accountability) is the primary AI topic, covering sessions that discuss data, platforms, surveillance, and algorithmic systems. Among the HDBSCAN clusters, C0 (Data, AI & Platform Accountability) and C3 (Civil Society Governance AI) are where AI discourse concentrates: governance frameworks, surveillance systems, and platform accountability.


Cross-Method Synthesis

LDA and HDBSCAN both identify consistent thematic structure. The old hardcoded keyword-cluster analysis is superseded and not referenced here.

| Finding | LDA (k=8) | HDBSCAN |
| --- | --- | --- |
| Largest substantive theme | T2 Regional Governance & Policy (28.0%, 110 sessions) | C4 Open Space (21.9%, 86 sessions) |
| AI discourse | 29.5% of sessions (keyword match on word-boundary "ai", "algorithmic", or "algorithm"; not an LDA figure) | C0 Data & AI Accountability (15.0%), C3 Civil Society Governance AI (13.7%) |
| Collective/movement focus | T4 Community Organizing & Grassroots (13.5%, 53 sessions) | C5 Collective Care & Movement (11.7%, 46 sessions, 80% Co-creating) |
| Legal/evidence cluster | T5 Legal Frameworks, Evidence & Platform Accountability (8.1%, 32 sessions) | C1 Legal Evidence & Journalism (7.4%, 29 sessions) |
| Youth & civic media | T7 Youth, Media & Narrative Resistance (9.9%, 39 sessions) | Concentrated in C4 Open Space (T7 k=12 rank-2 in C4) |
| Myanmar as outlier | T1 Civil Society Digital Security Capacity (2.3%, 9 sessions) | Mostly noise (HDBSCAN does not isolate it as a cluster) |

Disagreement note: The keyword-match share for AI discourse (29.5%) and the HDBSCAN cluster shares (C0 15.0% + C3 13.7%) are not additive, because they measure different things. Keyword matching counts sessions whose Content contains an AI term; HDBSCAN measures cluster membership. That the two figures are close (29.5% versus about 29% combined) does not mean they cover the same sessions: a session that mentions "AI" may sit in another cluster or in noise, and a session in C0 or C3 need not use the term.


Top Sessions by Upvotes

Co-creating shared resources

| Votes | ID | Format | Title |
| --- | --- | --- | --- |
| 14 | #37330 | Panel | Confronting Online Hate and Digital Censorship in South and Southeast Asia |
| 12 | #37163 | Roundtable | From Risk to Resilience: Bringing Communities and Trainers Together for Digital Safety |
| 12 | #37058 | Panel | How AI is shaping political communications during elections |
| 11 | #37609 | Co-learning workshop | Brain-rot Activism: Strategy or Setback? |
| 11 | #37174 | Roundtable | Resourcing Digital Rights Advocacy in Southeast Asia |

Collaborative action across movements

| Votes | ID | Format | Title |
| --- | --- | --- | --- |
| 14 | #37015 | Co-learning workshop | Garbage In, Garbage Out: Exposing Gender Bias and Stereotypes in Large Language Models (LLMs) |
| 12 | #37129 | Panel | Engaging Big Tech in Southeast Asia: Strategies, Challenges, and Collective Leverage for Human Rights |
| 9 | #35073 | Panel | Myanmar Voices, Regional Support: Digital Security Peer Lab |
| 9 | #33009 | Panel | Open Tech Jam: Privacy-respecting, secure, and open digital tools for at-risk communities |
| 9 | #38333 | Ideation workshop | (Re)Imagining a Multistakeholder Model for Digital Platforms in ASEAN |
| 9 | #34922 | Panel | Your Boss is an Algorithm: Are You Playing or Being Played? |

Vote counts are a snapshot as of the scrape date (2026-03-29/30). DRAPAC voting is live; counts may have shifted since.


Methodology

Reproducibility

The complete pipeline is published as a standalone script:

A-PROJECTS/DRAPAC 26 Analysis/drapac_analysis.py

Run it with:

python3 drapac_analysis.py "$HOME/Vault/D-ARCHIVES/DRAPAC 26 Sessions Submission/"

Dependencies:

pip install gensim scikit-learn umap-learn hdbscan scipy

The script produces: LDA topics (k=8 + k=12), HDBSCAN cluster labels, UMAP 2D embedding, per-cluster topic/TF-IDF profiles, and a Track × Cluster cross-tabulation.

Data Source

Vault path:  ~/Vault/D-ARCHIVES/DRAPAC 26 Sessions Submission/*.md
Source URL:  https://drap.ac/26/activities/?view=<ID>
Total files: 393 (as of 2026-03-30)

Content Preprocessing

Sessions are loaded from .md files. The Content field (session description body) is extracted via regex and tokenised:

  1. Lowercase — normalises all text
  2. Strip markdown — # headings, * bold, URLs, numbers, table pipes
  3. Stopword removal — Gensim STOPWORDS + 50+ custom terms (conference procedures: "session", "workshop", "panel"; region names: "asia", "pacific"; generic modifiers: "also", "however", "new", "used")
  4. Filter tokens < 3 chars and pure-punctuation tokens
  5. Drop sessions with < 5 tokens (empty or near-empty descriptions)
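
The five steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in drapac_analysis.py: the stoplist is a tiny stand-in for Gensim STOPWORDS plus the custom terms, and the function names, regexes, and sample sentence are hypothetical.

```python
import re

# Assumption: a small stand-in stoplist; the real pipeline uses
# Gensim STOPWORDS plus 50+ custom terms.
STOPLIST = {"the", "and", "for", "this", "will", "also", "however", "new",
            "used", "session", "workshop", "panel", "asia", "pacific"}

def preprocess(text):
    """Return the tokens kept for topic modeling."""
    text = text.lower()                              # step 1: lowercase
    text = re.sub(r"https?://\S+", " ", text)        # step 2: strip URLs
    text = re.sub(r"[#*|\d]", " ", text)             # step 2: markdown marks, digits
    tokens = re.findall(r"[a-z][a-z-]*[a-z]", text)  # word-like tokens only
    return [t for t in tokens
            if t not in STOPLIST and len(t) >= 3]    # steps 3 and 4

def keep_session(tokens, min_tokens=5):
    """Step 5: keep only sessions with at least min_tokens tokens."""
    return len(tokens) >= min_tokens

sample = ("## This **workshop** maps AI surveillance in Asia-Pacific: "
          "see https://example.org (2026)")
tokens = preprocess(sample)
print(tokens)
print(keep_session(tokens))
```

Under these rules the two-letter token "ai" is removed by the minimum-length filter, which may explain why it does not show up among the LDA top words even though it is common in the keyword counts. The hyphenated token "asia-pacific" survives because only "asia" and "pacific" are on the stoplist.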

Dictionary filtering: no_below=5 (must appear in ≥5 sessions), no_above=0.65 (removed if in >65% of sessions). This gives a vocabulary of ~1,894 terms for 393 sessions — sufficient for LDA without sparsity.
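
To make the two thresholds concrete, here is a standard-library sketch of document-frequency filtering with the same no_below and no_above semantics. It is not Gensim's implementation, the function name is hypothetical, and the toy corpus is invented for illustration.

```python
from collections import Counter

def filter_vocabulary(docs, no_below=5, no_above=0.65):
    """Keep tokens found in at least no_below docs and at most no_above of all docs."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))   # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus of 10 tokenised "sessions" (hypothetical data).
docs = [["digital", "governance"]] * 6 + [["digital", "surveillance"]] * 4

vocab = filter_vocabulary(docs, no_below=5, no_above=0.65)
print(sorted(vocab))
```

In the toy corpus, "digital" appears in every document and is dropped as too common, "surveillance" appears in only four documents and is dropped as too rare, and "governance" falls inside both thresholds.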

LDA Topic Modeling

Two runs:

Model num_topics Passes Purpose
lda8 k=8 40 Primary — cleaner topics, main report table
lda12 k=12 30 HDBSCAN input — higher dim = better clustering

Shared parameters:

alpha='auto'    # learn per-document topic sparsity from data
eta='auto'      # learn per-topic word sparsity from data
random_state=42 # reproducibility

Output: each session gets a probability distribution over all k topics. The "dominant topic" is the one with highest probability. Topic-word distributions are ranked lists used for human interpretation.
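
As an illustration of the "dominant topic" rule, the sketch below picks the highest-probability topic for each row of a small doc-topic matrix. The three rows are hypothetical values chosen for the example, not output from the lda8 model.

```python
from collections import Counter

# Hypothetical doc-topic rows for three sessions and k=4 topics;
# each row is a probability distribution over topics.
doc_topic = [
    [0.70, 0.10, 0.15, 0.05],
    [0.05, 0.40, 0.35, 0.20],
    [0.25, 0.25, 0.45, 0.05],
]

def dominant_topic(row):
    """Index of the highest-probability topic in one session's distribution."""
    return max(range(len(row)), key=row.__getitem__)

dominant = [dominant_topic(row) for row in doc_topic]
counts = Counter(dominant)   # sessions per dominant topic, as in the report table

print(dominant)
print(counts)
```

The second row shows what a dominant-topic count hides: its top two topics are only 0.05 apart, yet the session is counted entirely under one of them.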

Doc-topic matrix shape: (393, k). The k=12 matrix is the input to UMAP, and HDBSCAN then clusters the resulting 2D embedding.

UMAP + HDBSCAN Clustering

UMAP reduces the lda12 doc-topic matrix from (393 × 12) to (393 × 2) using cosine distance:

umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
          metric='cosine', random_state=42)

Cosine distance is a reasonable metric for comparing topic probability distributions: it compares the direction of doc_topic[i] between sessions, so two sessions with a similar topic mix end up close together.
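
For intuition, cosine distance between two topic distributions can be computed by hand. The sketch below uses three hypothetical 3-topic distributions; UMAP computes this internally when metric='cosine' is set, so the helper here is only illustrative.

```python
import math

def cosine_distance(p, q):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

# Hypothetical topic mixes for three sessions.
a = [0.8, 0.1, 0.1]   # mostly topic 0
b = [0.7, 0.2, 0.1]   # also mostly topic 0
c = [0.1, 0.1, 0.8]   # mostly topic 2

print(round(cosine_distance(a, b), 3))
print(round(cosine_distance(a, c), 3))
```

Sessions a and b share a dominant topic and come out much closer to each other than a and c, which is the neighbourhood structure UMAP tries to preserve.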

HDBSCAN clusters the 2D embedding by density:

hdbscan.HDBSCAN(min_cluster_size=25, metric='euclidean',
                 cluster_selection_method='eom')

  • eom (Excess of Mass) selects the most persistent clusters from the cluster hierarchy rather than cutting it at a single fixed density threshold
  • min_cluster_size=25 was chosen after testing {20, 25, 30}; it gives 7 clusters + 15.3% noise

Session Metadata Extraction

The Track, Format, Track×Format, Countries, and Upvotes tables are extracted from the session .md files using regex + keyword matching. Complete standalone script (no external dependencies):

A-PROJECTS/DRAPAC 26 Analysis/drapac_metadata.py

Run it with:

python3 drapac_metadata.py ~/Vault/D-ARCHIVES/DRAPAC\ 26\ Sessions\ Submission/

Field Extraction

Fields, with the regex used and notes:

  • Session ID: r'-(\d+)\.md$' (matched against the file basename)
  • Upvotes: r'##\s*#\d+\s*\n+\[(\d+)\]' (the [N](url) link after the heading)
  • Track: r'^\|\s*Track\s*\|\s*(.+?)\s*\|' (table row, normalised to Co-creating/Collaborative)
  • Format: r'^\|\s*Format\s*\|\s*(.+?)\s*\|' (table row, title-cased)
  • Content: r'^\|\s*Content\s*\|\s*(.+?)(?=\n\|)' (session body only; all theme analysis uses this)
  • Organiser: r'^\|\s*Organiser\s*\|\s*(.+?)\s*\|' (table row)
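
To show how the patterns listed above fit together, here is a sketch that applies them to one made-up session file. The file name, the file layout, and the table_field helper are assumptions inferred from the regexes; the real scraped files may differ. For simplicity the Content field uses the same single-row pattern as the other fields rather than the lookahead variant.

```python
import re

# Hypothetical file name and layout, inferred from the regexes above.
SAMPLE_NAME = "sample-session-12345.md"
SAMPLE_MD = """## #12345

[7](https://example.org/vote)

| Field | Value |
| --- | --- |
| Track | Co-creating shared resources |
| Format | Panel discussion |
| Content | A panel on platform accountability. |
| Organiser | Example Org |
"""

def table_field(name, text):
    """Return the value cell of a '| name | value |' table row, or None."""
    pattern = rf"^\|\s*{re.escape(name)}\s*\|\s*(.+?)\s*\|"
    m = re.search(pattern, text, re.MULTILINE)
    return m.group(1) if m else None

session_id = re.search(r"-(\d+)\.md$", SAMPLE_NAME).group(1)
votes_match = re.search(r"##\s*#\d+\s*\n+\[(\d+)\]", SAMPLE_MD)
votes = int(votes_match.group(1)) if votes_match else 0

record = {
    "id": session_id,
    "votes": votes,
    "track": table_field("Track", SAMPLE_MD),
    "format": table_field("Format", SAMPLE_MD),
    "content": table_field("Content", SAMPLE_MD),
}
print(record)
```

Returning None for a missing row, instead of raising, mirrors the report's fallback to 'Unknown' when a field is absent.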

Track Normalisation

import re

# c holds the full text of one session .md file
tm = re.search(r'^\|\s*Track\s*\|\s*(.+?)\s*\|', c, re.MULTILINE)
track = 'Unknown'  # no Track row found; presumably the "Untagged" row in the tables
if tm:
    t = tm.group(1).strip().lower()
    if 'co-creat' in t:
        track = 'Co-creating'    # Co-creating shared resources
    elif 'collab' in t:
        track = 'Collaborative'  # Collaborative action across movements

Format Extraction

fm = re.search(r'^\|\s*Format\s*\|\s*(.+?)\s*\|', c, re.MULTILINE)
fmt = fm.group(1).strip().title() if fm else 'Unknown'

Country matching: a case-insensitive substring match across the entire file (Title + Organiser + Content). Sessions are counted once per matched country, so duplicates within a file do not inflate counts.

COUNTRY_VARIANTS = {
    'Indonesia':    ['indonesia'],
    'Philippines':  ['philippines'],
    'India':        ['india'],
    'Nepal':        ['nepal'],
    'Myanmar':      ['myanmar', 'burma'],
    'Bangladesh':   ['bangladesh'],
    'Pakistan':     ['pakistan'],
    'Malaysia':     ['malaysia'],
    'Sri Lanka':    ['sri lanka'],
    'Thailand':     ['thailand'],
    'Taiwan':       ['taiwan'],
    'Singapore':    ['singapore'],
    'Vietnam':      ['vietnam'],
    'China':        ['china'],
    'Cambodia':     ['cambodia'],
    'Korea':        ['korea', 'south korea', 'north korea'],
    'Japan':        ['japan'],
    'Australia':    ['australia'],
    'Pacific':      ['pacific', 'asia pacific', 'asia-pacific', 'apac', 'oceania'],  # caution: as a substring, 'apac' also matches inside 'DRAPAC'
    'Papua':        ['papua'],
    'Timor':        ['timor'],
    'Laos':         ['laos'],
}

def search_countries(text: str) -> list[str]:
    """Case-insensitive substring match. Returns unique countries found."""
    text_lower = text.lower()
    found = []
    for country, variants in COUNTRY_VARIANTS.items():
        for variant in variants:
            if variant in text_lower:    # substring — not word-boundary
                found.append(country)
                break
    return found

# Usage: countries = search_countries(title + ' ' + organiser + ' ' + content)
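
Substring matching is simple but can over-match. The sketch below contrasts it with a word-boundary check on a hypothetical sentence. It is an alternative shown for comparison only: the published counts use the substring approach above, and the mentions helper is not part of drapac_metadata.py.

```python
import re

def mentions(variant, text, word_boundary=False):
    """Check whether variant occurs in text, optionally as a whole word only."""
    if word_boundary:
        pattern = rf"\b{re.escape(variant)}\b"
        return re.search(pattern, text, re.IGNORECASE) is not None
    return variant.lower() in text.lower()

# Hypothetical sentence containing two look-alike strings.
text = "Join us at DRAPAC to discuss Indochina-era archives."

for variant in ("apac", "china"):
    print(variant,
          "| substring:", mentions(variant, text),
          "| word-boundary:", mentions(variant, text, word_boundary=True))
```

The trade-off is that a strict word-boundary check would also stop 'india' from matching "Indian", so a stricter variant would likely need a longer variant list.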

Track × Format Cross-Tabulation

from collections import Counter, defaultdict
track_fmt = defaultdict(lambda: defaultdict(int))
for s in sessions:
    track_fmt[s['track']][s['format']] += 1

Upvotes Extraction

m = re.search(r'##\s*#\d+\s*\n+\[(\d+)\]', content)
votes = int(m.group(1)) if m else 0

Aggregation

  • Thematic Tracks and Session Formats tables: Counter aggregation of extracted track and format fields
  • Track × Format matrix: defaultdict cross-tabulation
  • Countries: Counter per country per track

Note: Country counts reflect unique sessions mentioning each country. A session mentioning Indonesia and Philippines counts once in each row, so the sum of the country rows exceeds the total number of sessions.

Known Limitations

| Issue | Effect | Mitigation |
| --- | --- | --- |
| k is a chosen hyperparameter | Topics depend on choice of k | Ran k={8,12}; 8-topic was most interpretable |
| LDA is probabilistic | Same k with different seeds gives slightly different topics | Fixed random_state=42 throughout |
| Session length (~232 words avg) limits LDA quality | Shorter documents = noisier topic distributions | Filtered dictionary extremes; 40 passes for k=8 |
| HDBSCAN requires min_cluster_size tuning | Different values produce different cluster counts | Tested mcs={20,25,30}; chose mcs=25 (7 clusters, 15% noise) |
| Stopword list may filter legitimate terms | Some thematic terms removed | Custom stoplist avoids removing domain-specific terms |
| Votes | Live snapshot from scrape date | Do not treat as current values |