1 Apr 2026 • 16 min read

Mapping out the DRAPAC26 Submission proposal entries by numbers and topics

The Digital Rights in Asia-Pacific Assembly (DRAPAC) 2026 is an important gathering of digital rights advocates, policymakers, activists, and technologists dedicated to discussing and addressing pressing digital rights issues in the Asia-Pacific region. I'm very excited for this year's DRAPAC, and I am curious about what organizations and individuals are proposing for their sessions, and how. I believe it is important to look over all the ideas and get the gist of them, to understand how the narrative has evolved and what the submitters' interests and demands are.

To achieve that, I gathered (scraped) all entries using my personal DRAPAC account and counted all the metadata manually myself (with the help of LLMs/AI). The organizing team did not give me the data, and I have informed them that I ran a scraper.

This year, DRAPAC26 focuses on 2 themes (Track 1: Co-creating shared resources and Track 2: Collaborative action across movements) and 8 session formats. The submission period is closed, and it gathered 393 entries. So, let's unfold this further by the numbers.

Thematic tracks:

Track	Sessions	Share
Co-creating shared resources	227	57.8%
Collaborative action across movements	163	41.5%
Untagged	3	0.8%

Session Formats

Format	Total	Share
Co-learning workshop	122	31.0%
Panel discussion	104	26.5%
Roundtable discussion	62	15.8%
Ideation workshop	56	14.2%
Exhibit / display / performance	20	5.1%
Social gathering	13	3.3%
Booth, stand, or clinic	9	2.3%
Parallel event	7	1.8%

Track × Format Matrix

Format	Co-creating	Collaborative
Co-learning workshop	91	30
Panel discussion	43	59
Roundtable discussion	27	35
Ideation workshop	42	14
Exhibit / display / performance	13	7
Social gathering	3	10
Booth, stand, or clinic	5	4
Parallel event	3	4

As the data above shows, the "Co-creating shared resources" track leads the "Collaborative action across movements" track by 16.3 percentage points (57.8% vs. 41.5%). This highlights the high interest in this track, which focuses on building shared infrastructure (from platforms, systems, and databases, to skills, networks, and strategies) that can be scaled across the region.

Track 2 (Collaborative action across movements) is smaller but still substantial, at 41.5% of entries. It focuses on bridging the gap between civil society, government, and the private sector to co-develop rights-based digital policies. It also stresses the importance of discussions pertaining to these issues.

Looking at the track-by-format matrix, each track leans toward the session formats that best suit its discussions or activities. Track 1 favors co-learning workshops (91 vs. 30) and ideation workshops (42 vs. 14), while Track 2 leans toward panel discussions (59 vs. 43), roundtable discussions (35 vs. 27), and social gatherings (10 vs. 3).
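Because the two tracks differ in size, raw counts can overstate or understate a preference. A minimal sketch that normalises each column by its track total, using the counts from the Track × Format matrix above (the variable and function names are illustrative only):

```python
# Counts copied from the Track x Format matrix: (Co-creating, Collaborative).
MATRIX = {
    "Co-learning workshop": (91, 30),
    "Panel discussion": (43, 59),
    "Roundtable discussion": (27, 35),
    "Ideation workshop": (42, 14),
    "Exhibit / display / performance": (13, 7),
    "Social gathering": (3, 10),
    "Booth, stand, or clinic": (5, 4),
    "Parallel event": (3, 4),
}

co_total = sum(co for co, _ in MATRIX.values())
collab_total = sum(cl for _, cl in MATRIX.values())

def within_track_shares(fmt):
    """Share (%) of each track's sessions that use the given format."""
    co, cl = MATRIX[fmt]
    return round(100 * co / co_total, 1), round(100 * cl / collab_total, 1)

for fmt in MATRIX:
    co_share, cl_share = within_track_shares(fmt)
    leader = "Co-creating" if co_share > cl_share else "Collaborative"
    print(f"{fmt:34s} {co_share:5.1f}% {cl_share:5.1f}%  -> {leader}")
```

The column totals come out to 227 and 163, matching the track table, so the three untagged sessions are simply absent from this matrix.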

Furthermore, to get a better understanding of what people are proposing, I ran basic topic modeling on the submission content to look at the ideas people are interested in focusing on or discussing. I used a topic modeling technique called LDA (Latent Dirichlet Allocation) to uncover the central topics and their distribution across the sessions' content. The following table shows the top words for each topic.

Topic	Sessions	Top Words (topn=12)
T0: AI Systems & Public Accountability	65 (16.5%)	data, social, public, platforms, media, surveillance, systems, algorithmic, platform, youth, people, power
T1: Civil Society Digital Security Capacity	9 (2.3%)	human, security, strategies, civil, society, technical, law, training, myanmar, challenges, action, cybersecurity
T2: Regional Governance & Policy	110 (28.0%)	regional, civil, governance, policy, society, shared, platform, data, systems, online, public, asia-pacific
T3: Digital Rights Practice & Experience	65 (16.5%)	space, data, human, shared, climate, people, online, open, collective, building, work, challenges
T4: Community Organizing & Grassroots Capacity	53 (13.5%)	communities, shared, systems, community, collective, movement, regional, movements, asia-pacific, end, to:, grassroots
T5: Legal Frameworks, Evidence & Platform Accountability	32 (8.1%)	legal, human, evidence, platform, security, platforms, challenges, international, regional, resource, criminal, media
T6: Digital Security Tools & Practical Response	20 (5.1%)	human, online, legal, security, collective, right, practical, tools, strategies, session, decision, response
T7: Youth, Media & Narrative Resistance	39 (9.9%)	internet, civic, movements, people, young, media, communication, social, online, collective, narratives, storytelling

Moreover, let's see which countries or regions are mentioned the most across the proposals. The country count was calculated using a rule-based keyword match over the session Title and Content fields. Each session is counted once per country if any variant of the country's name is present.

Country	Co-creating	Collaborative	Total
Pacific (region)	112	83	196
Indonesia	67	45	114
India	56	50	106
Philippines	51	44	96
Nepal	41	32	73
Myanmar	31	19	50
Bangladesh	30	18	48
Pakistan	11	14	26
Malaysia	19	7	26
Sri Lanka	13	7	20
Thailand	9	4	14
Taiwan	10	3	13
Singapore	6	6	12
Vietnam	5	4	9
China	2	5	7
Cambodia	4	3	7
Korea	3	3	6
Japan	2	3	5
Australia	1	2	4
Papua	2	0	2
Timor	1	0	1
Laos	1	0	1

Lastly, I ran a word-matching script to see if AI-related keywords were mentioned in the content of the proposals. It turns out that almost 30% of the entries include AI-related terms, as follows:

Term	Sessions	Share
"ai" (word-boundary)	94	23.9%
"artificial intelligence" (substring)	14	3.6%
"algorithmic" (word-boundary)	36	9.2%
"algorithm" (word-boundary)	7	1.8%
"genai" (substring)	1	0.3%
Union (ai | algorithmic | algorithm | genai)	117	29.8%

It is fascinating to see diverse ideas being submitted. We certainly won't see most of them being presented at DRAPAC, but I'm sure the ideas are worth spreading and must be taken into account, especially for individuals and organizations within the region and beyond.

To close, most of the scripts/code that helped me generate the data above were produced by LLMs and agentic AI platforms. I manually reviewed and adjusted some of them as a cross-check. I attached the full analysis below, generated 100% by AI. The wording above was mostly written by me. I am planning to update this post to include all the scraped sessions and the generated code, for reproducibility and further experimentation.

DRAPAC 26 — Session Analysis

Source: drap.ac/26/activities/
Scraped: Wave 1 = 2026-03-29 (311 files); Wave 2 = 2026-03-30 (+82 files, total 393)
Total sessions: 393 markdown files
Vault source: D-ARCHIVES/DRAPAC 26 Sessions Submission/

Executive Summary

DRAPAC 26 is the 2026 edition of the Digital Rights in Asia-Pacific Assembly. Three methods were applied to the 393 session Content fields:

Unsupervised LDA topic modeling (k=8, k=12) — discovers latent themes statistically
UMAP + HDBSCAN clustering — semantic clustering of sessions by document-topic similarity
Content coverage clusters (keyword-based) — comparative reference

Key findings:

LDA identifies Regional Governance & Policy (T2, 110 sessions, 28%) as the single largest latent topic, covering governance frameworks, civil society engagement, and regulatory policy. This is the programme's backbone discourse.
Semantic clustering discovers 7 distinct session communities: "Open Space & People", "Data, AI & Platform Accountability", "Civil Society Governance AI", "Collective Care & Movement", "Country-Specific Implementation", "Legal Evidence & Journalism", and "Community Infrastructure Design". Track skew varies by cluster, from a near-even split to 80% Co-creating.
Track differentiation is real but nuanced: Collaborative skews toward AI and surveillance discourse (T0, +8pp) and practice/experience sessions (T3, +4.5pp); Co-creating skews toward collective care movements (C5, 80% Co-creating) and community infrastructure (C2, 61% Co-creating), and leads on Governance & Policy framing (T2, +5pp). Legal/evidence (T5) is essentially even across tracks (7.9% vs. 8.0%).
AI (union of "ai", "algorithmic", "algorithm", and "genai") appears in 29.8% of sessions (117 of 393), a reach comparable to the largest LDA topic (governance, 28%), although keyword presence and dominant-topic share are different measures.

Overview

DRAPAC 26 is the 2026 edition of the Digital Rights in Asia-Pacific Assembly. The vault contains 393 scraped session submissions across two waves: 311 files scraped 2026-03-29 and 82 additional files from 2026-03-30.

Theme analysis design: Three complementary methods (two unsupervised, one keyword-based) were applied to the Content field (session description body only). The Content field was preprocessed: lowercased, stopwords removed, markdown stripped, and tokenised (min word length 3, no punctuation-only tokens). Gensim LDA was used for topic modeling; UMAP + HDBSCAN for semantic clustering; content coverage analysis (keyword clusters) as comparative reference.

All theme statistics are Content-only — Organiser and Title fields are excluded to avoid inflation from personal specialisations and session labels.

Thematic Tracks

DRAPAC 26 has two thematic tracks:

Track	Sessions	Share
Co-creating shared resources	227	57.8%
Collaborative action across movements	163	41.5%
Untagged	3	0.8%

Session Formats

Format	Total	Share
Co-learning workshop	122	31.0%
Panel discussion	104	26.5%
Roundtable discussion	62	15.8%
Ideation workshop	56	14.2%
Exhibit / display / performance	20	5.1%
Social gathering	13	3.3%
Booth, stand, or clinic	9	2.3%
Parallel event	7	1.8%

Track × Format Matrix

Format	Co-creating	Collaborative
Co-learning workshop	91	30
Panel discussion	43	59
Roundtable discussion	27	35
Ideation workshop	42	14
Exhibit / display / performance	13	7
Social gathering	3	10
Booth, stand, or clinic	5	4
Parallel event	3	4

Countries & Regional Focus

Country	Co-creating	Collaborative	Total
Pacific (region)	112	83	196
Indonesia	67	45	114
India	56	50	106
Philippines	51	44	96
Nepal	41	32	73
Myanmar	31	19	50
Bangladesh	30	18	48
Pakistan	11	14	26
Malaysia	19	7	26
Sri Lanka	13	7	20
Thailand	9	4	14
Taiwan	10	3	13
Singapore	6	6	12
Vietnam	5	4	9
China	2	5	7
Cambodia	4	3	7
Korea	3	3	6
Japan	2	3	5
Australia	1	2	4
Papua	2	0	2
Timor	1	0	1
Laos	1	0	1

Method 1: LDA Topic Modeling

Design

Latent Dirichlet Allocation (LDA) discovers latent topics as probability distributions over words — sessions are not hard-assigned to topics but have a probability distribution across all topics. k is the number of topics the model is constrained to find; you choose it before running the model. Gensim LDA was used with:

Dictionary: tokenised Content, stopwords removed (custom stoplist + Gensim STOPWORDS), min doc frequency 5, max doc frequency 65%
Two runs: k=8 (primary, 40 passes) and k=12 (for HDBSCAN input, 30 passes)
Alpha/eta: auto — the model learns document-topic and topic-word sparsity from the data

LDA Topics Discovered (k=8)

Each topic is a ranked list of words. The number in parentheses is the number of sessions where this topic is the dominant topic (highest probability in the k=8 model).

Topic	Sessions	Top Words (topn=12)
T0: AI Systems & Public Accountability	65 (16.5%)	data, social, public, platforms, media, surveillance, systems, algorithmic, platform, youth, people, power
T1: Civil Society Digital Security Capacity	9 (2.3%)	human, security, strategies, civil, society, technical, law, training, myanmar, challenges, action, cybersecurity
T2: Regional Governance & Policy	110 (28.0%)	regional, civil, governance, policy, society, shared, platform, data, systems, online, public, asia-pacific
T3: Digital Rights Practice & Experience	65 (16.5%)	space, data, human, shared, climate, people, online, open, collective, building, work, challenges
T4: Community Organizing & Grassroots Capacity	53 (13.5%)	communities, shared, systems, community, collective, movement, regional, movements, asia-pacific, end, to:, grassroots
T5: Legal Frameworks, Evidence & Platform Accountability	32 (8.1%)	legal, human, evidence, platform, security, platforms, challenges, international, regional, resource, criminal, media
T6: Digital Security Tools & Practical Response	20 (5.1%)	human, online, legal, security, collective, right, practical, tools, strategies, session, decision, response
T7: Youth, Media & Narrative Resistance	39 (9.9%)	internet, civic, movements, people, young, media, communication, social, online, collective, narratives, storytelling

Interpretation: The largest latent topic is Regional Governance & Policy (T2, 28%) — sessions on civil society's role in internet governance, regulatory frameworks, and regional policy coordination. T0 (AI Systems & Public Accountability, 16.5%) is the second major focus: AI audits, platform surveillance, and public sector accountability. T3 (Digital Rights Practice & Experience, 16.5%) is a broad catch-all for sessions focused on practice and experience across accessibility, OGBV, climate data, and community care. T4 (Community Organizing & Grassroots Capacity, 13.5%) covers collective infrastructure: queer communities, trauma-informed care, peer exchange. T7 (Youth, Media & Narrative Resistance, 9.9%) is a distinct internet-culture-and-activism cluster. T1 and T6 are smaller but substantive: CSO security infrastructure and practical digital security tools respectively.

LDA Topic × Track

Topic	Co-creating	Collaborative
T2: Regional Governance & Policy	69 (30.4%)	41 (25.2%)
T0: AI Systems & Public Accountability	30 (13.2%)	35 (21.5%)
T3: Digital Rights Practice & Experience	33 (14.5%)	31 (19.0%)
T4: Community Organizing & Grassroots Capacity	31 (13.7%)	22 (13.5%)
T7: Youth, Media & Narrative Resistance	28 (12.3%)	10 (6.1%)
T5: Legal Frameworks, Evidence & Platform Accountability	18 (7.9%)	13 (8.0%)
T6: Digital Security Tools & Practical Response	14 (6.2%)	6 (3.7%)
T1: Civil Society Digital Security Capacity	4 (1.8%)	5 (3.1%)

Key track differentiators:

Co-creating leads on: T2 Regional Governance & Policy (+5pp), T7 Youth, Media & Narrative Resistance (+6pp), T6 Digital Security Tools & Practical Response (+2.5pp)

Collaborative leads on: T0 AI Systems & Public Accountability (+8pp), T3 Digital Rights Practice & Experience (+4.5pp)
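The percentage-point gaps quoted above can be recomputed from the counts in the Topic × Track table. A small sketch, assuming the track sizes of 227 and 163 from the Thematic Tracks table as denominators (the names `TOPIC_BY_TRACK` and `pp_gap` are illustrative):

```python
# Dominant-topic counts from the LDA Topic x Track table: (Co-creating, Collaborative).
TOPIC_BY_TRACK = {
    "T0": (30, 35), "T1": (4, 5), "T2": (69, 41), "T3": (33, 31),
    "T4": (31, 22), "T5": (18, 13), "T6": (14, 6), "T7": (28, 10),
}
CO_TOTAL, COLLAB_TOTAL = 227, 163  # track sizes from the Thematic Tracks table

def pp_gap(topic):
    """Co-creating share minus Collaborative share, in percentage points."""
    co, collab = TOPIC_BY_TRACK[topic]
    return 100 * co / CO_TOTAL - 100 * collab / COLLAB_TOTAL

# Positive values mean Co-creating leads; negative values mean Collaborative leads.
for topic in sorted(TOPIC_BY_TRACK, key=pp_gap, reverse=True):
    print(f"{topic}: {pp_gap(topic):+.1f} pp")
```

On these counts T5 comes out within a tenth of a point of zero, so it does not separate the two tracks.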

Method 2: UMAP + HDBSCAN Semantic Clustering

Design

UMAP (Uniform Manifold Approximation and Projection) reduces the doc-topic matrix (12 LDA topics) to 2 dimensions using cosine distance, preserving semantic neighbourhood structure. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) then clusters the 2D embedding, identifying dense regions as clusters and sparse regions as noise — without requiring a fixed number of clusters.

Parameters: UMAP (n_neighbors=15, min_dist=0.1, metric=cosine) → HDBSCAN (min_cluster_size=25, metric=euclidean, eom selection).

Clusters Discovered

7 clusters + noise (60 sessions, 15.3%). Each cluster is characterised by its dominant LDA topic, top TF-IDF terms, and track composition. Topic IDs T8 and T9 can only refer to the k=12 model, since the k=8 model stops at T7.

Cluster	Sessions	Dominant LDA Topic	Co-creating	Collaborative	Label
C4	86 (21.9%)	T3: Digital Rights Practice & Experience	44 (51%)	40 (47%)	Open Space & People
C0	59 (15.0%)	T0: AI Systems & Public Accountability	29 (49%)	30 (51%)	Data, AI & Platform Accountability
C3	54 (13.7%)	T2: Regional Governance & Policy	27 (50%)	27 (50%)	Civil Society Governance AI
C5	46 (11.7%)	T4: Community Organizing & Grassroots	37 (80%)	9 (20%)	Collective Care & Movement
C6	31 (7.9%)	T9: Systems & Accountability	14 (45%)	17 (55%)	Country-Specific Implementation
C1	29 (7.4%)	T5: Legal & Evidence	17 (59%)	11 (38%)	Legal Evidence & Journalism
C2	28 (7.1%)	T8: Infrastructure Design	17 (61%)	11 (39%)	Community Infrastructure Design
Noise	60 (15.3%)	—	42 (70%)	18 (30%)	—

Cluster Profiles

C4 — Open Space & People (86 sessions, 21.9%)

The largest cluster. Sessions focus on people, internet, online space, media access, and open tools — session design, community digital access, and open digital commons. Nearly equal track split.
Top TF-IDF: session, people, internet, online, space, rights, media, access, social, open, tools
LDA: T3 "space | human | data | people | open"

C0 — Data, AI & Platform Accountability (59 sessions, 15.0%)

Sessions discuss data governance, AI, surveillance, platform accountability, and government oversight of digital platforms. Equal track split — this is a shared concern.
Top TF-IDF: data, ai, rights, surveillance, platforms, public, governments, governance, systems, accountability
LDA: T0 "data | public | surveillance | platforms"

C3 — Civil Society Governance AI (54 sessions, 13.7%)

Sessions focus on AI governance, civil society engagement with AI systems, and governance policy frameworks. Equal track split — this is a programmatic centrepiece.
Top TF-IDF: ai, governance, rights, data, systems, civil, civil society, content, society, digital rights
LDA: T2 "civil | data | governance | policy"

C5 — Collective Care & Movement (46 sessions, 11.7%)

The clearest Co-creating-skewed cluster (80% Co-creating, only 20% Collaborative). Sessions focus on collective care, movement infrastructure, activists' wellbeing, and digital security support for movements. This is the social-justice-organising heart of the programme.
Top TF-IDF: care, collective, support, communities, digital security, movement, activists, space, movements, online
LDA: T4 "shared | collective | movements | movement"

C6 — Country-Specific Implementation (31 sessions, 7.9%)

Sessions focused on country-level digital rights implementation, accountability mechanisms in India, Nepal, Philippines, and Indonesia — regulatory frameworks, remedies, and community digital systems at country level.
Top TF-IDF: systems, communities, indonesia nepal, digital systems, nepal, philippines indonesia, india philippines, remedy, centre
LDA: T9 "systems | communities | accountability | community"

C1 — Legal Evidence & Journalism (29 sessions, 7.4%)

Sessions on human rights evidence, criminal and international law, and journalist protection: a specialist legal cluster with a slight Co-creating skew.
Top TF-IDF: evidence, human, criminal, human rights, legal, rights, journalists, platforms, ai
LDA: T5 "legal | human | evidence | platforms | criminal"

C2 — Community Infrastructure Design (28 sessions, 7.1%)

Sessions focused on designing resources, frameworks, and infrastructure for communities — practical Co-creating sessions about building shared tools and frameworks.
Top TF-IDF: resources, design, session, community, framework, data, infrastructure, regional, platform, user, shared
LDA: T8 "shared | platform | regional | infrastructure | practical"

AI Presence (Content-Only)

AI detection uses case-insensitive matching against the Content field only (not Title or Organiser). Five forms are tracked separately, each using the appropriate match type:

import re

ai_re    = re.compile(r'\bai\b',                  re.IGNORECASE)  # word-boundary
artif_re = re.compile(r'artificial intelligence', re.IGNORECASE)  # phrase (no word boundary)
algo_re  = re.compile(r'\balgorithmic\b',         re.IGNORECASE)  # word-boundary
algow_re = re.compile(r'\balgorithm\b',           re.IGNORECASE)  # word-boundary
genai_re = re.compile(r'genai',                   re.IGNORECASE)  # substring: GenAI, genai-based
# Union row: \bai\b OR \balgorithmic\b OR \balgorithm\b OR genai ("artificial intelligence" has its own row)
def in_union(text):
    return any(rx.search(text) for rx in (ai_re, algo_re, algow_re, genai_re))

Edge cases:

Hyphenated forms ("AI-powered", "AI-driven"): 37 of 38 already caught by \bai\b (the hyphen is a non-word character, so "AI-powered" matches directly); 1 additional session caught by genai
"GenAI-based tools" (#37157): genai substring catches it; \bai\b does not (no word boundary in merged token "GenAI") — this is the sole genai count
"Geospatial Artificial Intelligence (GeoAI)" (#37432): caught by artificial intelligence row
"journalism aids remembering" (#38744): not AI — word "aids" has no word boundary between "ai" and "ds"; correctly excluded

Term	Sessions	Share
"ai" (word-boundary)	94	23.9%
"artificial intelligence" (substring)	14	3.6%
"algorithmic" (word-boundary)	36	9.2%
"algorithm" (word-boundary)	7	1.8%
"genai" (substring)	1	0.3%
Union (ai | algorithmic | algorithm | genai)	117	29.8%
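The edge cases listed above can be reproduced with a few lines of standard-library Python. This is only a sketch: the patterns follow the ones described in this section, while the sample strings are short hypothetical snippets written to mirror the edge cases, not the actual session text.

```python
import re

# The five tracked forms; the first four make up the union row.
AI_PATTERNS = {
    "ai": re.compile(r"\bai\b", re.IGNORECASE),
    "algorithmic": re.compile(r"\balgorithmic\b", re.IGNORECASE),
    "algorithm": re.compile(r"\balgorithm\b", re.IGNORECASE),
    "genai": re.compile(r"genai", re.IGNORECASE),
    "artificial intelligence": re.compile(r"artificial intelligence", re.IGNORECASE),
}
UNION_TERMS = ("ai", "algorithmic", "algorithm", "genai")

def matched_terms(text):
    """Names of all patterns that match anywhere in `text`."""
    return [name for name, rx in AI_PATTERNS.items() if rx.search(text)]

def in_union(text):
    """True if any of the four union patterns matches."""
    return any(AI_PATTERNS[name].search(text) for name in UNION_TERMS)

# Hypothetical snippets mirroring the edge cases described above.
samples = [
    "AI-powered moderation",
    "GenAI-based tools",
    "journalism aids remembering",
    "Geospatial Artificial Intelligence (GeoAI)",
]
for s in samples:
    print(f"{s!r:46} terms={matched_terms(s)} union={in_union(s)}")
```

Note that the last snippet on its own matches only the "artificial intelligence" pattern, so a session phrased that way would enter the union row only if it also contains one of the four union terms elsewhere.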

LDA context: T0 (AI Systems & Public Accountability) is the primary AI cluster — sessions discussing data, platforms, surveillance, and algorithmic systems. C0 (Data, AI & Platform Accountability) and C3 (Civil Society Governance AI) are where AI discourse concentrates — governance frameworks, surveillance systems, and platform accountability.

Cross-Method Synthesis

LDA and HDBSCAN both identify consistent thematic structure. The old hardcoded keyword-cluster analysis is superseded and not referenced here.

Finding	LDA (k=8)	HDBSCAN
Largest substantive theme	T2 Regional Governance & Policy (28.0%, 110 sessions)	C4 Open Space (21.9%, 86 sessions)
AI discourse	29.5% of sessions by keyword match (word-boundary "ai" | "algorithmic" | "algorithm"); not an LDA output	C0 Data, AI & Platform Accountability (15.0%), C3 Civil Society Governance AI (13.7%)
Collective/movement focus	T4 Community Organizing & Grassroots (13.5%, 53 sessions)	C5 Collective Care & Movement (11.7%, 46 sessions, 80% Co-creating)
Legal/evidence cluster	T5 Legal Frameworks, Evidence & Platform Accountability (8.1%, 32 sessions)	C1 Legal Evidence & Journalism (7.4%, 29 sessions)
Youth & civic media	T7 Youth, Media & Narrative Resistance (9.9%, 39 sessions)	Concentrated in C4 Open Space (T7 k=12 rank-2 in C4)
Myanmar as outlier	T1 Civil Society Digital Security Capacity (2.3%, 9 sessions)	Mostly noise (HDBSCAN does not isolate it as a cluster)

Disagreement note: The 29.5% figure for AI discourse is a keyword-match share (sessions whose Content contains "ai", "algorithmic", or "algorithm"), not an LDA topic weight, while the HDBSCAN proportions (C0 15.0% + C3 13.7%, about 29% combined) measure cluster membership. The two are not additive, and their closeness should not be read as describing the same set of sessions: a session can mention AI without landing in C0 or C3, and a session in C0 or C3 need not contain an AI keyword.

Top Sessions by Upvotes

Co-creating shared resources

Votes	ID	Format	Title
14	#37330	Panel	Confronting Online Hate and Digital Censorship in South and Southeast Asia
12	#37163	Roundtable	From Risk to Resilience: Bringing Communities and Trainers Together for Digital Safety
12	#37058	Panel	How AI is shaping political communications during elections
11	#37609	Co-learning workshop	Brain-rot Activism: Strategy or Setback?
11	#37174	Roundtable	Resourcing Digital Rights Advocacy in Southeast Asia

Collaborative action across movements

Votes	ID	Format	Title
14	#37015	Co-learning workshop	Garbage In, Garbage Out: Exposing Gender Bias and Stereotypes in Large Language Models (LLMs)
12	#37129	Panel	Engaging Big Tech in Southeast Asia: Strategies, Challenges, and Collective Leverage for Human Rights
9	#35073	Panel	Myanmar Voices, Regional Support: Digital Security Peer Lab
9	#33009	Panel	Open Tech Jam: Privacy-respecting, secure, and open digital tools for at-risk communities
9	#38333	Ideation workshop	(Re)Imagining a Multistakeholder Model for Digital Platforms in ASEAN
9	#34922	Panel	Your Boss is an Algorithm: Are You Playing or Being Played?

Vote counts are a snapshot as of the scrape date (2026-03-29/30). DRAPAC voting is live; counts may have shifted since.

Methodology

Reproducibility

The complete pipeline is published as a standalone script:

A-PROJECTS/DRAPAC 26 Analysis/drapac_analysis.py

Run it with:

python3 drapac_analysis.py ~/Vault/D-ARCHIVES/DRAPAC\ 26\ Sessions\ Submission/

Dependencies:

pip install gensim scikit-learn umap-learn hdbscan scipy

The script produces: LDA topics (k=8 + k=12), HDBSCAN cluster labels, UMAP 2D embedding, per-cluster topic/TF-IDF profiles, and a Track × Cluster cross-tabulation.

Data Source

Vault path:  ~/Vault/D-ARCHIVES/DRAPAC 26 Sessions Submission/*.md
Source URL:  https://drap.ac/26/activities/?view=<ID>
Total files: 393 (as of 2026-03-30)

Content Preprocessing

Sessions are loaded from .md files. The Content field (session description body) is extracted via regex and tokenised:

Lowercase — normalises all text
Strip markdown — # headings, * bold, URLs, numbers, table pipes
Stopword removal — Gensim STOPWORDS + 50+ custom terms (conference procedures: "session", "workshop", "panel"; region names: "asia", "pacific"; generic modifiers: "also", "however", "new", "used")
Filter tokens < 3 chars and pure-punctuation tokens
Drop sessions with < 5 tokens (empty or near-empty descriptions)

Dictionary filtering: no_below=5 (must appear in ≥5 sessions), no_above=0.65 (removed if in >65% of sessions). This gives a vocabulary of ~1,894 terms for 393 sessions — sufficient for LDA without sparsity.
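The preprocessing and dictionary-filtering steps above can be approximated with the standard library alone. This is a sketch under stated assumptions, not the published pipeline: the markdown-stripping regexes, the tiny `STOPWORDS` set, and the toy documents are all hypothetical, and `filter_vocabulary` only mimics the document-frequency rule described for `no_below` and `no_above`.

```python
import re
from collections import Counter

# Tiny illustrative stoplist; the described pipeline uses Gensim STOPWORDS plus custom terms.
STOPWORDS = {"the", "and", "for", "with", "session", "workshop", "panel", "asia", "pacific"}

def tokenise(content):
    """Lowercase, strip simple markdown/URLs/numbers, drop stopwords and short tokens."""
    text = content.lower()
    text = re.sub(r"https?://\S+", " ", text)     # URLs
    text = re.sub(r"[#*|`_>\[\]()]", " ", text)    # headings, bold, table pipes, links
    text = re.sub(r"\d+", " ", text)               # numbers
    tokens = re.findall(r"[a-z][a-z\-]*[a-z]|[a-z]", text)
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]

def filter_vocabulary(docs, no_below, no_above):
    """Keep terms found in at least `no_below` docs and at most `no_above` (fraction) of docs."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    return {t for t, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}

# Toy documents (hypothetical); thresholds are scaled down from no_below=5 for four docs.
raw = [
    "## Digital **security** training for rights activists: https://example.org",
    "Data governance and platform accountability for rights in 2026",
    "Platform accountability workshop for rights activists",
    "Security training and data governance panel",
]
docs = [tokenise(r) for r in raw]
vocab = filter_vocabulary(docs, no_below=2, no_above=0.65)
print(sorted(vocab))
```

In this toy run "digital" is dropped for being too rare and "rights" for being too common. A side effect worth noting: with a minimum token length of 3, the two-letter token "ai" cannot survive preprocessing, so it cannot appear among the LDA top words.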

LDA Topic Modeling

Two runs:

Model	`num_topics`	Passes	Purpose
`lda8`	`k=8`	40	Primary — cleaner topics, main report table
`lda12`	`k=12`	30	HDBSCAN input — higher dim = better clustering

Shared parameters:

alpha='auto'    # learn per-document topic sparsity from data
eta='auto'      # learn per-topic word sparsity from data
random_state=42 # reproducibility

Output: each session gets a probability distribution over all k topics. The "dominant topic" is the one with highest probability. Topic-word distributions are ranked lists used for human interpretation.

Doc-topic matrix shape: (393, k). The k=12 matrix is the input to UMAP, and HDBSCAN clusters the resulting 2D embedding.
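Picking the dominant topic is an argmax over each row of the doc-topic matrix, followed by a count per topic. A minimal illustration with a hypothetical 3 × 4 matrix (real rows would come from the fitted LDA model):

```python
from collections import Counter

def dominant_topic(distribution):
    """Index of the highest-probability topic in one session's distribution."""
    return max(range(len(distribution)), key=lambda k: distribution[k])

# Hypothetical doc-topic rows for three sessions and k=4 topics (each row sums to 1).
doc_topic = [
    [0.70, 0.10, 0.15, 0.05],
    [0.05, 0.05, 0.60, 0.30],
    [0.10, 0.20, 0.55, 0.15],
]

dominant = [dominant_topic(row) for row in doc_topic]
counts = Counter(dominant)
for topic, n in sorted(counts.items()):
    print(f"T{topic}: {n} sessions ({100 * n / len(doc_topic):.1f}%)")
```

Hard assignment like this discards the rest of the distribution: the second session above is 30% topic 3, but it is counted only under topic 2.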

UMAP + HDBSCAN Clustering

UMAP reduces the lda12 doc-topic matrix (393 × 12) → (393 × 2) using cosine distance:

umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
          metric='cosine', random_state=42)

Cosine distance is a reasonable choice for comparing doc-topic probability vectors: it compares the direction of doc_topic[i] between sessions, that is, their relative topic mix.
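To make the metric concrete, here is cosine distance computed by hand on three hypothetical topic mixtures. It is a plain-Python sketch of the formula, not the UMAP internals:

```python
import math

def cosine_distance(p, q):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

# Hypothetical topic mixtures over three topics.
a = [0.8, 0.1, 0.1]   # mostly topic 0
b = [0.7, 0.2, 0.1]   # also mostly topic 0
c = [0.1, 0.1, 0.8]   # mostly topic 2

print(f"a vs b: {cosine_distance(a, b):.3f}")  # similar mix, small distance
print(f"a vs c: {cosine_distance(a, c):.3f}")  # different mix, larger distance
```

Sessions with a similar topic mix end up close together, which is the neighbourhood structure UMAP is asked to preserve.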

HDBSCAN clusters the 2D embedding by density:

hdbscan.HDBSCAN(min_cluster_size=25, metric='euclidean',
                 cluster_selection_method='eom')

eom (Excess of Mass) selects the most persistent clusters from the density hierarchy rather than cutting it at a single fixed threshold
min_cluster_size=25 chosen after testing {20, 25, 30} — gives 7 clusters + 15.3% noise

Session Metadata Extraction

The Track, Format, Track×Format, Countries, and Upvotes tables are extracted from the session .md files using regex + keyword matching. Complete standalone script (no external dependencies):

A-PROJECTS/DRAPAC 26 Analysis/drapac_metadata.py

Run it with:

python3 drapac_metadata.py ~/Vault/D-ARCHIVES/DRAPAC\ 26\ Sessions\ Submission/

Field Extraction

Field	Regex	Notes
Session ID	`r'-(\d+)\.md$'`	Basename filename
Upvotes	`r'##\s*#\d+\s*\n+\[(\d+)\]'`	`[N](url)` link after heading
Track	`r'^\|\s*Track\s*\|\s*(.+?)\s*\|'`	Table row, normalised to Co-creating/Collaborative
Format	`r'^\|\s*Format\s*\|\s*(.+?)\s*\|'`	Table row, title-cased
Content	`r'^\|\s*Content\s*\|\s*(.+?)(?=\n\|)'`	Session body only; all theme analysis uses this
Organiser	`r'^\|\s*Organiser\s*\|\s*(.+?)\s*\|'`	Table row

Track Normalisation

import re

# c holds the full text of one session .md file
tm = re.search(r'^\|\s*Track\s*\|\s*(.+?)\s*\|', c, re.MULTILINE)
track = 'Unknown'  # default when no Track row matches either label
if tm:
    t = tm.group(1).strip().lower()
    if 'co-creat' in t:
        track = 'Co-creating'    # Co-creating shared resources
    elif 'collab' in t:
        track = 'Collaborative'  # Collaborative action across movements

Format Extraction

fm = re.search(r'^\|\s*Format\s*\|\s*(.+?)\s*\|', c, re.MULTILINE)
fmt = fm.group(1).strip().title() if fm else 'Unknown'

Countries — Substring Keyword Search

Case-insensitive substring match across the entire file (Title + Organiser + Content). Sessions are counted once per matched country — duplicates within a file do not inflate counts.

COUNTRY_VARIANTS = {
    'Indonesia':    ['indonesia'],
    'Philippines':  ['philippines'],
    'India':        ['india'],
    'Nepal':        ['nepal'],
    'Myanmar':      ['myanmar', 'burma'],
    'Bangladesh':   ['bangladesh'],
    'Pakistan':     ['pakistan'],
    'Malaysia':     ['malaysia'],
    'Sri Lanka':    ['sri lanka'],
    'Thailand':     ['thailand'],
    'Taiwan':       ['taiwan'],
    'Singapore':    ['singapore'],
    'Vietnam':      ['vietnam'],
    'China':        ['china'],
    'Cambodia':     ['cambodia'],
    'Korea':        ['korea', 'south korea', 'north korea'],
    'Japan':        ['japan'],
    'Australia':    ['australia'],
    'Pacific':      ['pacific', 'asia pacific', 'asia-pacific', 'apac', 'oceania'],
    'Papua':        ['papua'],
    'Timor':        ['timor'],
    'Laos':         ['laos'],
}

def search_countries(text: str) -> list[str]:
    """Case-insensitive substring match. Returns unique countries found."""
    text_lower = text.lower()
    found = []
    for country, variants in COUNTRY_VARIANTS.items():
        for variant in variants:
            if variant in text_lower:    # substring — not word-boundary
                found.append(country)
                break
    return found

# Usage: countries = search_countries(title + ' ' + organiser + ' ' + content)
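Substring matching is simple, but it can also fire inside unrelated words. The sketch below contrasts it with a word-boundary alternative; this is not the method used for the tables above, and the `VARIANTS` subset, the function names, and the sample sentence are illustrative assumptions.

```python
import re

# Subset of the variants above, kept short for illustration.
VARIANTS = {
    "China": ["china"],
    "Pacific": ["pacific", "apac", "oceania"],
    "Timor": ["timor"],
}

def search_substring(text):
    """Plain case-insensitive substring match, as described above."""
    low = text.lower()
    return [c for c, vs in VARIANTS.items() if any(v in low for v in vs)]

def search_word_boundary(text):
    """Alternative: each variant must start and end at a word boundary."""
    found = []
    for country, variants in VARIANTS.items():
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(country)
    return found

# Contrived sentence with no country mention at all.
text = "Political machinations and the pacification of a timorous press"
print(search_substring(text))       # matches inside unrelated words
print(search_word_boundary(text))   # no matches

# Hyphenated names still match, because a hyphen is a word boundary.
print(search_word_boundary("Asia-Pacific and Timor-Leste"))
```

The trade-off runs both ways: word boundaries avoid accidental hits, but demonyms such as "Indonesian" are caught by substring matching and would need to be added as explicit variants under the stricter rule. Either way, "Asia-Pacific" counts toward the Pacific row.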

Track × Format Cross-Tabulation

from collections import Counter, defaultdict
track_fmt = defaultdict(lambda: defaultdict(int))
for s in sessions:
    track_fmt[s['track']][s['format']] += 1

Upvotes Extraction

m = re.search(r'##\s*#\d+\s*\n+\[(\d+)\]', content)
votes = int(m.group(1)) if m else 0
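The extraction snippets above can be exercised together on a small sample. In this sketch the file body is hypothetical: its heading, vote link, and table layout are modelled on what the regexes expect, not copied from a real scraped file, and `extract_fields` and `normalise_track` are illustrative helper names.

```python
import re

# Hypothetical file body; the real vault layout may differ in detail.
SAMPLE = """## #37330

[14](https://drap.ac/26/activities/?view=37330)

| Field | Value |
| --- | --- |
| Track | Co-creating shared resources |
| Format | Panel discussion |
"""

UPVOTES_RE = re.compile(r'##\s*#\d+\s*\n+\[(\d+)\]')
TRACK_RE = re.compile(r'^\|\s*Track\s*\|\s*(.+?)\s*\|', re.MULTILINE)
FORMAT_RE = re.compile(r'^\|\s*Format\s*\|\s*(.+?)\s*\|', re.MULTILINE)

def normalise_track(raw):
    """Map the raw Track cell to the short labels used in the tables."""
    low = raw.strip().lower()
    if 'co-creat' in low:
        return 'Co-creating'
    if 'collab' in low:
        return 'Collaborative'
    return 'Unknown'

def extract_fields(text):
    """Return votes, track, and format for one session file, with defaults if missing."""
    votes = UPVOTES_RE.search(text)
    track = TRACK_RE.search(text)
    fmt = FORMAT_RE.search(text)
    return {
        'votes': int(votes.group(1)) if votes else 0,
        'track': normalise_track(track.group(1)) if track else 'Unknown',
        'format': fmt.group(1).strip().title() if fmt else 'Unknown',
    }

print(extract_fields(SAMPLE))
print(extract_fields("no recognisable fields here"))
```

A file that matches none of the patterns falls back to 0 votes and 'Unknown' for both fields, so a vote count of 0 can mean either "no votes" or "no match".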

Aggregation

Thematic Tracks and Session Formats tables: Counter aggregation of extracted track and format fields
Track × Format matrix: defaultdict cross-tabulation
Countries: Counter per country per track — sessions can appear in multiple country rows (sum of rows > total sessions)

Note: Country counts reflect unique sessions mentioning each country, so a session mentioning both Indonesia and the Philippines counts once in each row.

Known Limitations

Issue	Effect	Mitigation
`k` is a chosen hyperparameter	Topics depend on choice of `k`	Ran k={8,12}; 8-topic was most interpretable
LDA is probabilistic	Same `k` with different seeds gives slightly different topics	Fixed `random_state=42` throughout
Session length (~232 words avg) limits LDA quality	Shorter documents = noisier topic distributions	Filtered dictionary extremes; 40 passes for k=8
HDBSCAN requires `min_cluster_size` tuning	Different values produce different cluster counts	Tested mcs={20,25,30}; chose mcs=25 (7 clusters, 15% noise)
Stopword list may filter legitimate terms	Some thematic terms removed	Custom stoplist avoids removing domain-specific terms
Votes	Live snapshot from scrape date	Do not treat as current values