# Methodology — State of AI Dubbing 2026

> Companion methodology document for *State of AI Dubbing 2026: A Multi-Vertical Analysis of Perso AI's Professional Creator Data.*
> Released under CC BY 4.0. Free to cite with attribution to Perso AI.
> Version 1.0 · Published May 27, 2026

---

## 1. Data Source and Scope

**Platform**: Perso AI (perso.ai) — a global AI dubbing platform operating across 80+ countries.

**Data export**: Complete project-level export of dubbing activity on the Perso AI platform between **January 1, 2025 and April 28, 2026** (16 months).

**Total records**: 316,856 unique dubbing projects across 141,638 unique creator accounts.

**Scope of findings**: This report's findings describe behavior **within Perso AI's professional creator cohort**. We do not claim the patterns observed represent the entire AI dubbing market globally. Single-platform data is subject to user-acquisition bias and platform-specific behavioral patterns (see Section 6 — Limitations).

---

## 2. Two-Lens Analytical Approach

The report applies two complementary analytical lenses to the same dataset.

### 2.1 Lens A — Full Platform (Behavioral Patterns)

Lens A includes all 316,856 projects, regardless of status (ACTIVE, SYSTEM_DELETED, DELETED), when measuring behavioral patterns that do not depend on project lifecycle.

**Used for**:
- Share rate (96% of dubbed videos shared immediately)
- Multi-language adoption per creator
- Source × target language pair counts (909 active pairs)
- Quarterly YoY comparisons of segment activity

**Why include deleted projects?** Behavioral signals like share rate occur at the time of dubbing, before later deletion. Including deleted projects yields the most complete behavioral signal.

### 2.2 Lens B — Categorized Active Projects (Use Case Map)

The Use Case Map cross-tabulation is built from ACTIVE projects with category metadata (n = 42,060 platform-wide). The broader categorized analysis period, which also includes some non-ACTIVE projects, contains 112,797 projects.

**For the published Use Case Map**, we restrict to the roughly seven-month window in which Perso AI's automated industry categorization reached production-grade coverage (≥96%):

- **Period**: October 1, 2025 – April 28, 2026
- **N**: 112,797 categorized projects

**Why restrict?** Prior to October 2025, industry categorization was applied retroactively and inconsistently. Restricting to projects from October 1, 2025 onward ensures the cross-tabulation reflects production-quality categorization, not partial backfills.
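As a minimal sketch of the Lens B restriction, the filter below keeps only projects that carry a category label and fall inside the window stated above. The record layout (`created`, `category`) and the inclusive treatment of both boundary dates are assumptions made for illustration; the raw export schema is not published.

```python
from datetime import date

# Window boundaries as stated in the methodology.
WINDOW_START = date(2025, 10, 1)
WINDOW_END = date(2026, 4, 28)

def in_use_case_window(project):
    """True if the project is categorized and was created inside the window.

    `project` is a dict with hypothetical keys `created` (datetime.date)
    and `category` (str or None); these names are assumptions.
    """
    has_category = project["category"] is not None
    return has_category and WINDOW_START <= project["created"] <= WINDOW_END

# Toy records, not real data.
toy_projects = [
    {"created": date(2025, 9, 30), "category": "Education"},  # before window
    {"created": date(2025, 10, 1), "category": "Religion"},   # kept
    {"created": date(2026, 1, 15), "category": None},         # uncategorized
    {"created": date(2026, 4, 28), "category": "Gaming"},     # kept (inclusive end)
]
kept = [p for p in toy_projects if in_use_case_window(p)]
print(len(kept))  # 2
```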

---

## 3. Key Definitions

### Professional Creator

A creator account on the Perso AI platform producing **6 or more dubbing projects** within the analysis period. This threshold was chosen because:

- 6 projects approximates one project per major category (≥5 categories per creator), which distinguishes professional usage from one-off experimentation
- Below 6 projects, language and category patterns are dominated by single-use creators, who may not represent serious adoption
- The threshold aligns with creator-economy analyses (e.g., Lenny Rachitsky) that distinguish "trial" from "production" usage

**N**: 4,023 professional creators in this dataset.

### Single-Use Creator (Mass-Market)

A creator account producing exactly 1 dubbing project (N = 115,439 creators). These accounts are used in Lens A (Section 2.1) to measure mass-market trial activity and are explicitly **excluded from the main findings**, which focus on structural professional patterns rather than trial behavior.
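The two creator segments defined above can be sketched as a simple count-and-threshold rule. This is an illustrative sketch, not the pipeline used for the report: the input (one creator id per project) and the label `"other"` for accounts with 2 to 5 projects are assumptions.

```python
from collections import Counter

PRO_THRESHOLD = 6  # "6 or more dubbing projects", per the definition above

def segment_creators(project_creator_ids):
    """Map each creator id to 'professional', 'single_use', or 'other'.

    `project_creator_ids` holds one creator id per project (hypothetical input).
    """
    counts = Counter(project_creator_ids)
    segments = {}
    for creator_id, n_projects in counts.items():
        if n_projects >= PRO_THRESHOLD:
            segments[creator_id] = "professional"
        elif n_projects == 1:
            segments[creator_id] = "single_use"
        else:
            segments[creator_id] = "other"  # 2-5 projects: in neither segment
    return segments

# Toy input with made-up creator ids, not real data.
toy_projects = ["a"] * 6 + ["b"] + ["c"] * 3
segments = segment_creators(toy_projects)
print(segments)  # {'a': 'professional', 'b': 'single_use', 'c': 'other'}
```

If the three published counts share the same base, subtraction gives 141,638 − 4,023 − 115,439 = 22,176 accounts in the 2 to 5 project band.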

### Active Project

A project where the dubbed output is currently retrievable on the Perso AI platform (not deleted by user or system).

### Categorized Project

A project with industry classification metadata (e.g., "Education", "Animation", "Religion") applied by Perso AI's automated categorization. Categorization is based on title, description, and content analysis at the time of project creation.

### Active Language Pair

A unique source × target language combination with at least one project on Perso AI in the analysis period. Example: "Korean → English" is one active pair. **Total: 909 active pairs** across 36 source × 34 target languages.
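Counting active pairs amounts to collecting distinct ordered (source, target) tuples. The sketch below shows this on toy records; the field names `source_lang` and `target_lang` are hypothetical. It also relates the published total to the 36 × 34 grid, under the assumption that every source could in principle pair with every target.

```python
def active_language_pairs(projects):
    """Return the set of distinct (source, target) pairs with at least one project."""
    return {(p["source_lang"], p["target_lang"]) for p in projects}

# Toy records, not real data. Direction matters: ko->en and en->ko are distinct.
toy_projects = [
    {"source_lang": "ko", "target_lang": "en"},
    {"source_lang": "ko", "target_lang": "en"},  # duplicate pair, counted once
    {"source_lang": "en", "target_lang": "ko"},
    {"source_lang": "en", "target_lang": "pt"},
]
pairs = active_language_pairs(toy_projects)
print(len(pairs))  # 3

# Share of the full source x target grid covered by the 909 active pairs.
grid_size = 36 * 34
coverage = 909 / grid_size
print(grid_size, f"{coverage:.1%}")  # 1224 74.3%
```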

### Heavy-Tail Distribution

A statistical distribution in which a long upper tail pulls the mean well above the median. In this report, multi-language adoption among professional creators is heavy-tailed: median = 1 language, mean = 2.43 languages, and the top 1% (n=47) average 15 languages. We report both median and mean so that neither statistic alone misrepresents the typical creator.
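To illustrate why the two statistics diverge, the sketch below uses a small synthetic list of languages-per-creator counts. The numbers are invented for illustration and are not drawn from the report's data.

```python
from statistics import mean, median

# Synthetic languages-per-creator counts (illustrative only, not report data).
toy_languages_per_creator = [1, 1, 1, 1, 1, 1, 2, 2, 3, 15]

toy_median = median(toy_languages_per_creator)
toy_mean = mean(toy_languages_per_creator)
print(toy_median)  # 1.0
print(toy_mean)    # 2.8

# One heavy user accounts for more than half of all languages used.
top_share = max(toy_languages_per_creator) / sum(toy_languages_per_creator)
print(f"{top_share:.0%}")  # 54%
```

A single creator at 15 languages lifts the mean to almost three times the median, which is the same shape the report describes at larger scale.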

### Among Perso AI's Data

A scoping qualifier indicating findings describe Perso AI's professional creator cohort, not the entire AI dubbing market. **Used consistently throughout the report** to maintain methodological discipline against extrapolation.

---

## 4. The Use Case Map Construction

The Use Case Map is the report's hero analytical artifact. It is a cross-tabulation of:

- **Rows**: 19 industry categories (Education, Animation, Film & Drama, Gaming, Religion, Science & Tech, Medical & Health, Business & Finance, Talk & Interview, Entertainment & Doc, News, Product Review, Lifestyle, Comedy, Sports & Fitness, Food & Cooking, Beauty, Pets & Animals, Travel & Events, plus "Other")
- **Columns**: 34 target languages (filtered to top 10 in published version)

**Cell value**: the percentage of an industry's categorized projects dubbed into a given target language, i.e., that industry's target-language distribution.

**Sample size guardrail**: Within the top-10 target-language columns, every industry × target language cell shown in the published Use Case Map has n ≥ 500 categorized projects. Cells below this threshold are aggregated into "Other" for visualization but remain available in the long-format CSV download (`use-case-map-long-2026.csv`).
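A minimal sketch of the cross-tabulation and guardrail described above, under two stated assumptions: the field names (`industry`, `target_lang`) are hypothetical, and sub-threshold cells are folded into a per-industry "Other" column, which is one possible reading of the aggregation rule.

```python
from collections import Counter, defaultdict

def use_case_map(projects, min_n=500):
    """Row-normalized % share per (industry, target language) cell.

    Cells with fewer than `min_n` projects are folded into an "Other"
    column for that industry (an assumed reading of the guardrail).
    """
    counts = Counter((p["industry"], p["target_lang"]) for p in projects)
    rows = defaultdict(Counter)
    for (industry, lang), n in counts.items():
        column = lang if n >= min_n else "Other"
        rows[industry][column] += n
    table = {}
    for industry, cols in rows.items():
        total = sum(cols.values())
        table[industry] = {c: round(100 * n / total, 1) for c, n in cols.items()}
    return table

# Toy data with a small threshold so the folding is visible; not real data.
toy_projects = (
    [{"industry": "Religion", "target_lang": "pt"}] * 4
    + [{"industry": "Religion", "target_lang": "en"}] * 4
    + [{"industry": "Religion", "target_lang": "ko"}] * 2
    + [{"industry": "Education", "target_lang": "en"}] * 5
)
table = use_case_map(toy_projects, min_n=3)
print(table["Religion"])  # {'pt': 40.0, 'en': 40.0, 'Other': 20.0}
```

Folding counts before normalizing keeps each row summing to 100%, so the visualized shares stay comparable across industries.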

**Industry category normalization**: The raw export contains industry labels in multiple languages (Korean, Japanese, Portuguese, Spanish, Indonesian, Hindi, Russian, etc.) because users select category in their local language. We normalize to a single English taxonomy of 19 industries plus "Other". The mapping is published in the data files (see Section 8).
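The normalization step can be pictured as a lookup from raw local-language labels to the English taxonomy. The sketch below is illustrative only: the handful of entries in `LABEL_MAP` are example translations, not the published mapping, and sending unmapped labels to "Other" is an assumption.

```python
import unicodedata

# Example entries only; the actual mapping ships with the data files.
LABEL_MAP = {
    "education": "Education",
    "교육": "Education",        # Korean
    "educación": "Education",  # Spanish
    "pendidikan": "Education", # Indonesian
    "religião": "Religion",    # Portuguese
    "アニメ": "Animation",      # Japanese
}

def normalize_industry(raw_label):
    """Map a raw category label to the English taxonomy.

    Labels are Unicode-normalized (NFC), trimmed, and case-folded before
    lookup; unmapped labels fall back to "Other" (assumed behavior).
    """
    key = unicodedata.normalize("NFC", raw_label).strip().casefold()
    return LABEL_MAP.get(key, "Other")

print(normalize_industry("  Educación "))  # Education
print(normalize_industry("교육"))           # Education
print(normalize_industry("unmapped"))      # Other
```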

---

## 5. Statistical Conventions

### Confidence Intervals

Where findings rely on industry × language cell percentages, we report 95% confidence intervals using normal approximation for binomial proportions:

```
CI = p ± 1.96 × √(p(1-p)/n)
```

For Finding 1 (Religion Dual Hub, n=6,229), the 95% CI half-width on each of the two leading shares (25.6% and 25.2%) is ±1.0–1.2 percentage points, which is wider than the 0.4-point gap between them. The report's body acknowledges this explicitly: the "Dual Hub" headline describes magnitude (Portuguese reaching parity with English at scale), not a statistical lead.
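The formula above can be checked numerically for Finding 1. This sketch assumes n = 6,229 is the denominator for both shares (that is, the Religion row total); that reading is an assumption, not something the source states outright.

```python
import math

def proportion_ci(p, n, z=1.96):
    """Normal-approximation CI for a binomial proportion.

    Returns (lower, upper, half_width) as fractions, not percentages.
    """
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width, half_width

N_RELIGION = 6229  # assumed denominator for both shares

results = {share: proportion_ci(share, N_RELIGION) for share in (0.256, 0.252)}
for share, (low, high, hw) in results.items():
    print(f"{share:.1%}: [{low:.1%}, {high:.1%}]  half-width {hw * 100:.2f} pp")

# The two intervals overlap, so neither share has a statistical lead.
overlap = results[0.252][1] > results[0.256][0]
print(overlap)  # True
```

Both half-widths come out near 1.1 percentage points, inside the ±1.0–1.2 range quoted for Finding 1.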

### Sub-Sample Analysis (n=47 cohort)

Finding 3 cites the top 1% of professional creators (n=47) for the multi-language adoption frontier. We present this as a **directional signal**, not a population estimate. Three robustness signals partially mitigate the small-sample concern:

1. **47 creators distributed across 44 unique workspaces** (87% unique) — no single-organization dominance
2. **Median 6 distinct industries per creator** — not single-vertical specialists
3. **13,982 projects total** in this cohort (range: 20–2,559 per creator) — multi-language behavior repeated across substantial individual project counts

Detailed anonymized cohort composition is published in `top1pct-cohort-anonymized-2026.csv` and Appendix A of the main report.

### YoY Comparisons

YoY comparisons (e.g., +73.2% growth in professional creator base, Q1 2025 → Q1 2026) are computed within **consistent creator segments**. We do not compare total platform volume YoY because of a mid-2025 pricing model change that introduced noise in absolute volume figures. Within-segment comparisons (professional creators vs single-use creators) are not affected by this pricing change.
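A within-segment YoY figure is a plain percentage change computed separately for each segment. In the sketch below, the counts are placeholders chosen only to show the arithmetic; they are not the report's underlying figures, and the segment keys are hypothetical.

```python
def yoy_growth_pct(previous, current):
    """Percentage change from `previous` to `current`."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous * 100

# Placeholder counts per segment: (Q1 2025, Q1 2026). NOT the report's data.
placeholder_counts = {
    "professional": (1000, 1732),
    "single_use": (20000, 23000),
}
growth = {
    segment: round(yoy_growth_pct(q1_prev, q1_curr), 1)
    for segment, (q1_prev, q1_curr) in placeholder_counts.items()
}
print(growth)  # {'professional': 73.2, 'single_use': 15.0}
```

Keeping the computation per segment avoids mixing populations that the pricing change affected differently.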

---

## 6. Limitations (Honest Acknowledgment)

Two specific limitations (Sections 6.1 and 6.2) apply to every finding in this report; Section 6.3 describes how findings were selected in light of them. We disclose them upfront so the data can be evaluated on its merits.

### 6.1 User-Acquisition Mix May Skew Certain Industry-Language Patterns

Within Perso AI's data, certain target language concentrations likely reflect Perso AI's user-acquisition footprint as much as broader market trends. Specifically:

- **Hindi-target concentration in Animation (31.5%) and Film & Drama (34.9%)** may reflect Perso AI's user-acquisition strength in Indian and South Asian markets, not necessarily a global Bollywood-content-economy signal. We intentionally **excluded these patterns from the main Findings** (Section 3 of the main report) and discuss them only as data observations in the heatmap visualization.

- **Korean-target concentration in Science & Tech (12.5%)** (Finding 2) admits two equally plausible explanations:
  - **(A)** K-Content cultural spillover into knowledge content consumption
  - **(B)** Perso AI's user-acquisition footprint in Korea elevating Korean-target demand
  - Single-platform data cannot adjudicate between these explanations. We present Finding 2 as a *pattern consistent with* K-Content spillover, **not as proof of it**.

External validation across non-Perso-AI datasets (other AI dubbing platforms, broader content distribution data) would be required to distinguish user-acquisition bias from market signal.

### 6.2 Volume-Based Time-Series Is Excluded

A pricing model change in mid-2025 (transition from unlimited to usage-based pricing for certain plan tiers) introduced noise in absolute volume comparisons. Total project counts before and after this change are not directly comparable.

**Mitigation**: The report uses distribution metrics (% target share, language pair counts, multi-language adoption gaps) and consistent within-segment YoY comparisons — not absolute volume YoY across the platform. Quarterly trends published in `quarterly-trends-2026.csv` are advisory only and not used in the main Findings.

### 6.3 Finding Selection Filters

Findings highlighted in this report (Religion Dual Hub, K-Content Spillover, Multi-Language Frontier) were selected because they pass three filters:

1. **Statistically robust within Perso AI's data** (n ≥ 500 per cell where applicable)
2. **Connect to a global macro narrative** the press already covers (Pew Research on LATAM religion; K-Content cultural mainstreaming; creator economy expansion-revenue thesis)
3. **Survive scrutiny against potential user-acquisition bias** (acknowledged where present, e.g., Finding 2)

Findings that did not pass all three filters (e.g., Hindi dominance in Animation, Indonesian patterns in Film & Drama) are visible in the published Use Case Map data but were not promoted to Hero Findings in the main report.

---

## 7. Editorial Framing (Acknowledgment)

The report's 4-Layer Model framing of AI Dubbing as a "distribution-stage" category distinct from "creation-stage" voice cloning (ElevenLabs) and avatar generation (HeyGen, Synthesia) is **editorial**, not an objective industry taxonomy.

Voice cloning tools also offer dubbing features, and the boundaries between layers are blurry. We frame the categories around production stage and output type because we find this framing more useful for understanding where the AI media stack is heading — but readers should know this is one perspective, not a settled industry classification. The 96% share rate observation in Perso AI's data is presented as a *behavioral signal* consistent with this framing, not as definitive proof of category separation.

---

## 8. Data Files Released (CC BY 4.0)

All aggregated data behind this report's findings is released under Creative Commons Attribution 4.0 (CC BY 4.0) at:
**`perso.ai/research/state-of-ai-dubbing-2026/data/`**

| File | Description |
|---|---|
| `headline-statistics-2026.csv` | 14 headline metrics (totals, share rates, language adoption) |
| `use-case-map-counts-2026.csv` | Industry × Target Language project counts (matrix format) |
| `use-case-map-pct-2026.csv` | Industry × Target Language % distribution (matrix format) |
| `use-case-map-long-2026.csv` | Same data in long format with industry totals and target language totals per row |
| `per-target-language-inverse-2026.csv` | For each target language, the top 10 industries with % share |
| `multi-language-adoption-2026.csv` | Distribution of professional creators by number of target languages used |
| `multi-language-adoption-cumulative-2026.csv` | Same with cumulative counts at-or-above each threshold |
| `top1pct-cohort-anonymized-2026.csv` | Anonymized composition of the top 1% creator cohort (n=47) |
| `industry-deep-dive-2026.csv` | Top 15 industries with top-3 target languages |
| `per-language-deep-dive-2026.csv` | Top 15 target languages with top-3 industries |
| `quarterly-trends-2026.csv` | Quarterly project counts by segment (advisory; see Section 6.2) |
| `methodology-2026.md` | This document |

---

## 9. Reproducibility

The aggregated data files are sufficient to reproduce all percentages, cross-tabulations, and ranking analyses in the main report. Raw project-level data (containing PII) is not released.

To reproduce a specific finding, see the corresponding section of the main report at perso.ai/research/state-of-ai-dubbing-2026/ and the relevant CSV from Section 8 above.
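As one hedged illustration of reproduction from the aggregated files, the sketch below converts a counts matrix into long format with row totals and percentages. The column layout (an `industry` column followed by one column per target language) is an assumption about the matrix files, and the inline sample stands in for a real download.

```python
import csv
import io

# Inline stand-in for a counts matrix; layout and values are hypothetical.
SAMPLE_MATRIX = """industry,en,pt,ko
Religion,40,40,20
Education,75,20,5
"""

def matrix_to_long(csv_text):
    """Turn an industry x target-language counts matrix into long-format rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        industry = record.pop("industry")
        counts = {lang: int(n) for lang, n in record.items()}
        total = sum(counts.values())
        for lang, n in counts.items():
            rows.append({
                "industry": industry,
                "target_lang": lang,
                "count": n,
                "industry_total": total,
                "pct": round(100 * n / total, 1),
            })
    return rows

long_rows = matrix_to_long(SAMPLE_MATRIX)
print(len(long_rows))  # 6
print(long_rows[0])
```

For a downloaded file, the same function could be fed the file's text once its actual header names are confirmed.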

---

## 10. Update Cadence

| Edition | Schedule |
|---|---|
| State of AI Dubbing 2026 (this edition) | Published May 27, 2026 |
| 2026 Mid-Year Mini-Update | August 2026 (Q2 data refresh) |
| 2026 Q3 Stat Drop | November 2026 |
| State of AI Dubbing 2027 | June 2027 (annual cadence) |

---

## 11. Citation

**APA**:
> Perso AI Data Team. (2026). *State of AI Dubbing 2026: A Multi-Vertical Analysis of Perso AI's Professional Creator Data.* Perso AI. https://perso.ai/research/state-of-ai-dubbing-2026/

**Methodology citation specifically**:
> Perso AI Data Team. (2026). *State of AI Dubbing 2026 — Methodology Notes.* https://perso.ai/research/state-of-ai-dubbing-2026/data/methodology-2026.md

---

## 12. Contact

- **Press inquiries**: press@perso.ai
- **Data inquiries**: data@perso.ai
- **Press Kit**: perso.ai/research/state-of-ai-dubbing-2026/press-kit/

---

*Released under CC BY 4.0. Free to share, cite, and re-use with attribution to Perso AI.*
*Version 1.0 · 2026-05-27 · Perso AI Data Team*
