
Today we're releasing our first open source dataset, the Multi-Accent English ASR Dataset: 7,377 recordings totaling 10.25 hours of audio across 11 L1 language backgrounds, gender-balanced within each. Every recording was produced end-to-end by native speakers of each language on Workbolt, our Expert Network, and processed through our Data Foundry.
English is the world's most-transcribed language, and one of the clearest examples of what happens when frontier AI models haven't learned from the full range of real human speech. Accent coverage is still where most automatic speech recognition (ASR) systems silently degrade: higher word error rates (WERs), worse disfluency handling, and a long tail of speakers who have learned to speak louder and slower to be understood by their devices.[1]
"Works for me" quietly means "works for people who sound like the model's training set."
Closing that gap isn't a compute or architecture problem. It's a human-data problem — the cadence of a real human voice, across accents and dialects that no model was trained to understand.[2]
Scaling compute won't conjure it. Synthetic data won't approximate it.
The only way through is to capture it from the people who actually speak that way, and encode it into the data frontier models learn from. This dataset is a step toward doing that, in the open.
What's in the dataset
Each contribution is an expert-produced, two-part task (a read-aloud recording followed by a verbatim same-speaker transcript) designed to be directly useful for training and evaluating accent-aware ASR systems.
What's in the release
You can explore the raw files and listen to samples directly in the Data Studio, and inspect the exact prompts, requirements, and rubrics used to produce the dataset on the Sample Tasks page.
Dataset Information
A single datasheet covering every property a team typically asks about before pulling down the bundle — release version, modality, totals, format, provenance, QA, and license — so you can confirm fit at a glance before reading the deeper sections below.
| Property | Value |
|---|---|
| Name | Multi-Accent English ASR Dataset |
| Version | v1.0 — April 2026 |
| Tier | Open Source |
| Modality | Audio + text (speech-to-text / ASR) |
| Spoken language | English |
| L1 backgrounds | 11 — Chinese, Vietnamese, Thai, Japanese, Russian, Polish, General East European, Indonesian, French, German, South Korean |
| Total recordings | 7,377 utterances |
| Total duration | 10.25 hours |
| Speaking style | Read speech — Harvard Sentences corpus (public domain) |
| Sample rate | 16 kHz, mono |
| Transcripts | Verbatim · lowercase a–z + spaces only · disfluencies included · no punctuation, no capitalization, no non-speech tags |
| Speaker demographics | Self-reported by the contributor: ethnicity, country, city, timezone, gender — gender-balanced within every L1 group |
| Provenance | 100% human-produced by native L1 speakers on Workbolt in their home country; no synthetic, no scraped web audio |
| Recording conditions | Contributors' own hardware (laptop or phone microphones) in quiet rooms; SNR floor and clipping checks enforced per file |
| QA pipeline | Automated acoustic QA (clipping, SNR, completeness) plus human review against the rubric in the Ocular Data Foundry |
| License | ODC-By v1.0 — permissive, attribution-only, suitable for research and commercial training |
| Access | Sample audio and spec public on the Data Studio; full bundles via contact |
| Suggested citation | Contains information from the Multi-Accent English ASR Dataset by Ocular AI, made available under ODC-By v1.0. |
Properties listed here describe the v1.0 release shipped April 2026. Future releases will be documented as new versions with their own datasheets.
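As a quick sanity check on the totals in the datasheet, the two headline numbers imply an average clip length of about five seconds, which is what you would expect from single read sentences. The snippet below only restates figures from the table; the variable names are illustrative.

```python
# Figures copied from the datasheet above.
TOTAL_HOURS = 10.25
TOTAL_UTTERANCES = 7377

total_seconds = TOTAL_HOURS * 3600               # 36,900 seconds of audio
mean_seconds = total_seconds / TOTAL_UTTERANCES  # average clip length

print(f"{mean_seconds:.2f} s per utterance on average")  # prints "5.00 s per utterance on average"
```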
Dataset Structure
The dataset is organized to slot cleanly into existing ASR training and evaluation workflows. Each row is a single utterance paired with rich metadata about the script, the speaker, and the recording itself.
Core columns
What you need to train and evaluate.
| Column | Description |
|---|---|
| `target_accent` | The accent the contributor was recruited to represent. |
| `expert_id` | Stable identifier for the contributor on Workbolt. |
| `script` | The prompt the contributor was asked to read. |
| `audio` | Path to the audio file. |
| `transcription` | The verbatim transcript. |
| `language` | Speaker's L1 language. |
| `gender` | Self-reported speaker gender. |
| `expert_accent(s)` | Any additional accents the contributor natively speaks. |
| `size` | File size of the audio clip. |
Each row in the dataset is a single utterance. Pair `audio` with `transcription` for ASR training, and use `target_accent` / `language` / `gender` for stratified sampling and per-accent WER evaluation.
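A minimal sketch of that pairing and stratified sampling, using only the standard library. It assumes each row has already been loaded as a dictionary keyed by the column names above; the toy rows, clip names, and the `stratified_sample` helper are hypothetical and not part of the release.

```python
import random
from collections import defaultdict

def stratified_sample(rows, keys, per_group, seed=0):
    """Pick up to `per_group` rows from each combination of `keys`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    sample = []
    for group_key in sorted(groups):   # sorted for a reproducible order
        members = groups[group_key]
        rng.shuffle(members)
        sample.extend(members[:per_group])
    return sample

# Toy rows standing in for the real table; clip names and values are made up.
rows = [
    {"audio": "clip_001", "transcription": "the birch canoe slid on the smooth planks",
     "target_accent": "Vietnamese", "gender": "female"},
    {"audio": "clip_002", "transcription": "glue the sheet to the dark blue background",
     "target_accent": "Vietnamese", "gender": "female"},
    {"audio": "clip_003", "transcription": "the birch canoe slid on the smooth planks",
     "target_accent": "Vietnamese", "gender": "male"},
    {"audio": "clip_004", "transcription": "glue the sheet to the dark blue background",
     "target_accent": "French", "gender": "female"},
]

sample = stratified_sample(rows, keys=("target_accent", "gender"), per_group=1)
pairs = [(row["audio"], row["transcription"]) for row in sample]  # (audio path, reference text)
```

Sampling per group rather than over the whole table keeps a small accent or gender slice from being crowded out by larger ones.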
Demographic & provenance metadata
Useful for slicing and fairness analysis.
| Column | Description |
|---|---|
| `ethnicity` | Self-reported ethnicity. |
| `country` | Country of residence at recording time. |
| `city` | City of residence. |
| `timezone` | Local timezone of the recording. |
| `script_id` | Stable ID for the prompt, so you can group utterances by script. |
| `estimated_duration` | Expected length of the read prompt. |
| `actual_duration` | Measured length of the recording. |
| `created_at` | Timestamp the recording was submitted. |
All demographic fields are self-reported by the contributor at task time and surfaced verbatim. Treat them as voluntary metadata, not training labels — and as the axes you slice WER along when you audit a model on this dataset.
Audio files are grouped by accent and speaker, with transcription files aligned at the utterance level. That structure makes it straightforward to pull stratified samples — for example "all female Vietnamese speakers with actual_duration within 10% of estimated_duration" — and to run in-depth analyses by accent, speaker, or demographic characteristics without having to re-join data from external sources.
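The example query in the paragraph above can be sketched as a plain filter. This assumes both duration columns are numeric and share a unit, and that `language` and `gender` are stored as the strings `"Vietnamese"` and `"female"`; the exact spellings in the release may differ, so check a few rows first. The helper names are hypothetical.

```python
def within_tolerance(row, tolerance=0.10):
    """True when actual_duration is within `tolerance` of estimated_duration."""
    estimated = float(row["estimated_duration"])
    actual = float(row["actual_duration"])
    return estimated > 0 and abs(actual - estimated) <= tolerance * estimated

def select(rows, language, gender, tolerance=0.10):
    """Rows for one L1 language and gender whose duration is close to the estimate."""
    return [
        row for row in rows
        if row["language"] == language
        and row["gender"] == gender
        and within_tolerance(row, tolerance)
    ]

# Toy rows; values are made up for illustration.
rows = [
    {"language": "Vietnamese", "gender": "female", "estimated_duration": 5.0, "actual_duration": 5.2},
    {"language": "Vietnamese", "gender": "female", "estimated_duration": 5.0, "actual_duration": 7.0},
    {"language": "Vietnamese", "gender": "male", "estimated_duration": 5.0, "actual_duration": 5.0},
    {"language": "French", "gender": "female", "estimated_duration": 5.0, "actual_duration": 5.1},
]

matches = select(rows, language="Vietnamese", gender="female")
```

A large gap between the two durations can point to restarts or long pauses, so the same filter is also a cheap way to find clips that are likely rich in disfluencies by inverting the condition.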
Why accent-balanced data matters
Most public ASR benchmarks are dominated by a handful of native-English accents.[3] Models trained and evaluated on that distribution look great on paper and then underperform the moment they meet a real user base — Chinese-, Vietnamese-, Japanese-, Russian-, or French-accented English, and dozens of other varieties that are entirely standard for the people speaking them.[4]
Balanced, accent-labeled data lets teams do three things that are otherwise hard:
- Train models that generalize across accents instead of regressing to the majority voice.
- Evaluate accent-specific WER as a first-class metric, not a footnote.
- Debug failure modes by slicing errors along accent, gender, and L1 language background — the axes that matter in production.
That's the shape we optimized this dataset for.
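To make accent-specific WER concrete, here is a minimal sketch that scores model output per group using a word-level edit distance. It assumes the hypotheses are already in the same lowercase, punctuation-free format as the reference transcripts; `wer_by_group` and the toy rows are hypothetical, and a production evaluation would more likely use an established scoring library.

```python
from collections import defaultdict

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = curr
    return prev[-1]

def wer_by_group(rows, hypotheses, key="target_accent"):
    """Corpus-level WER per group: total word errors / total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for row, hypothesis in zip(rows, hypotheses):
        ref_tokens = row["transcription"].split()
        errors[row[key]] += edit_distance(ref_tokens, hypothesis.split())
        words[row[key]] += len(ref_tokens)
    return {group: errors[group] / words[group] for group in words if words[group]}

# Toy example: one dropped word in the first clip, a perfect match in the second.
rows = [
    {"target_accent": "Vietnamese", "transcription": "the birch canoe slid on the smooth planks"},
    {"target_accent": "French", "transcription": "glue the sheet to the dark blue background"},
]
hypotheses = [
    "the birch canoe slid on smooth planks",
    "glue the sheet to the dark blue background",
]
scores = wer_by_group(rows, hypotheses)  # {'Vietnamese': 0.125, 'French': 0.0}
```

Passing `key="gender"` or `key="language"` gives the other slices mentioned in the list above without changing the scoring code.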
Building the Dataset
Every recording and transcript in this release was produced by a native speaker of one of the 11 target L1 languages, working on Workbolt, our global Expert Network. Contributors were recruited in their home country against each language background, so Chinese-accented English is read by a native Chinese speaker in China, Vietnamese-accented English by a native Vietnamese speaker in Vietnam, and so on.
Covered geographies
The 11 L1 backgrounds map to the following source geographies. Contributors were recruited in-country for each language so the accent in the audio reflects the speaker's natural everyday English, not an imitation studied abroad.
| L1 Background | Primary Country / Region |
|---|---|
| Chinese | China |
| Vietnamese | Vietnam |
| Thai | Thailand |
| Japanese | Japan |
| Russian | Russia |
| Polish | Poland |
| General East European | Eastern Europe (mixed, including Ukrainian and other regional varieties) |
| Indonesian | Indonesia |
| French | France |
| German | Germany |
| South Korean | South Korea |
'General East European' is intentionally a regional bucket rather than a single country — it captures the shared phonetic patterns across Eastern European L1 speakers of English (including Ukrainian and other regional varieties) for teams that care about regional-level coverage without splitting every nationality into its own row.
The two-part task
Read more about the Sample Tasks here.
Each contributor completed a single task with two tightly-coupled parts, designed so the audio and the transcript are produced by the same person and stay honest to one another:
- Read and record. The contributor is shown a scripted English prompt drawn from the Harvard Sentences corpus[5] and records themselves reading it aloud in their natural accent at 16 kHz — no imitation, no "neutralizing," no performance voice.
- Transcribe verbatim with disfluencies. The same contributor then transcribes their own recording: not the prompt they were given, but exactly what came out of their mouth. Transcripts follow a deliberately minimal format:
  - Verbatim text: every word that is said, in the order it is said.
  - Character set: only the 26 lowercase letters `a–z` plus the space character.
  - Disfluencies included: fillers (um, uh, er, mm), repetitions, false starts, and self-corrections are all written out.[6]
  - No special annotation: no punctuation, no capitalization, and no tags for non-speech events (laughter, coughs, breaths, and background noise are simply not transcribed).
This two-part design keeps the audio and text grounded in the same speaker's production, while the minimal character set makes the transcripts trivial to tokenize and easy to use as ground truth for ASR training and WER evaluation. Every submission then flowed through our Data Foundry — see Data Quality below for the full QA pipeline.
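Because the references use such a small character set, model output usually needs to be mapped onto the same format before scoring, otherwise punctuation and casing are counted as word errors. The sketch below is one way to do that. How the references render contractions or digits is not stated in this post, so dropping apostrophes and replacing digits with spaces are assumptions to verify against real transcripts; `normalize_hypothesis` is a hypothetical helper.

```python
import re
import unicodedata

def normalize_hypothesis(text):
    """Map arbitrary ASR output onto lowercase a-z plus single spaces."""
    # Decompose accented letters so "café" becomes "cafe" rather than "caf".
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    lowered = ascii_only.lower()
    # Assumption: contractions collapse ("don't" -> "dont") instead of splitting.
    no_apostrophes = lowered.replace("'", "")
    # Every remaining run of non-letters (punctuation, digits, spaces) becomes one space.
    letters_only = re.sub(r"[^a-z]+", " ", no_apostrophes)
    return letters_only.strip()

cleaned = normalize_hypothesis("The birch canoe, slid on the SMOOTH planks.")
# cleaned == "the birch canoe slid on the smooth planks"
```

Apply the same function to every system you compare so that differences in WER reflect recognition quality rather than output formatting.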
Data Quality
Dataset quality is enforced end-to-end in our Data Foundry. Every submission has to pass both an automated check and a human review against the rubric before it ships.
Audio quality gates
- Clean capture, not studio. Recordings must be made in a quiet environment at 16 kHz, with no clipping and no truncation.
- SNR floors applied to every file.
- Completeness — the recording must cover the full prompt, with no missing or cut-off portions.
- Natural accent only — reviewers flag imitation, "neutralizing," or performance voice.
- No synthetic augmentation, no scraped web audio — every file is an original recording from a Workbolt contributor.
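If you want to re-check files on your side, clipping and a rough signal-to-noise figure can be approximated with a few lines of code. This is a sketch under stated assumptions, not the Data Foundry implementation: the post does not publish its thresholds, so the `threshold` default, the 20 ms frame size, and the decile-based SNR estimate are all hypothetical choices. Samples are assumed to be floats in the range -1 to 1.

```python
import math

def clipping_ratio(samples, threshold=0.999):
    """Fraction of samples at or beyond `threshold` of full scale."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= threshold)
    return clipped / len(samples)

def rough_snr_db(samples, sample_rate=16000, frame_ms=20):
    """Crude SNR: mean energy of the loudest 10% of frames over the quietest 10%."""
    frame_len = sample_rate * frame_ms // 1000
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    if len(energies) < 10:
        raise ValueError("need at least 10 full frames")
    energies.sort()
    decile = len(energies) // 10
    noise = sum(energies[:decile]) / decile
    speech = sum(energies[-decile:]) / decile
    eps = 1e-12  # avoids taking the log of zero on digital silence
    return 10 * math.log10((speech + eps) / (noise + eps))

# Synthetic check: half a second of faint tone followed by half a second of loud tone.
quiet = [0.001 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(8000)]
loud = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(8000)]
snr = rough_snr_db(quiet + loud)
```

The decile approach treats the quietest frames as the noise floor, which works for short read sentences with leading or trailing silence but will overestimate noise on clips with no pauses at all.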
Transcript quality gates
- Verbatim to the audio, not to the prompt: if the contributor misreads or re-starts, the transcript follows what was actually said.
- Full disfluency coverage: fillers, repetitions, false starts, and self-corrections must be written out.
- Character-set compliance: lowercase `a–z` and spaces only; anything else (punctuation, capital letters, non-speech tags) is flagged.
- Same-speaker alignment: contributors correct their own transcripts against playback, so the text is grounded in what they actually produced.
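The character-set rule is simple enough to verify yourself after download. A minimal sketch, with hypothetical helper names; it checks only the allowed characters and non-emptiness, since this post does not say whether repeated or trailing spaces are permitted.

```python
import re

DISALLOWED = re.compile(r"[^a-z ]")

def find_disallowed(transcript):
    """Return (index, character) pairs for everything outside a-z and space."""
    return [(match.start(), match.group()) for match in DISALLOWED.finditer(transcript)]

def is_compliant(transcript):
    """True for a non-empty transcript made only of lowercase a-z and spaces."""
    return bool(transcript.strip()) and not find_disallowed(transcript)

problems = find_disallowed("The birch, canoe")
# problems == [(0, 'T'), (9, ',')]
```

Returning positions rather than a bare pass or fail makes it easy to see whether a failure is a stray capital, punctuation, or a bracketed tag.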
QA pipeline
- Automated acoustic QA — clipping detection, background-noise and SNR thresholds, silence/truncation checks, and completeness comparison against the prompt.
- Human review against the rubric — reviewers listen to each file and read each transcript, flagging imitated accents, missing disfluencies, disallowed characters, or transcript drift. Failing tasks are sent back to the contributor for re-work.
- Provenance — every one of the 7,377 files is traceable to a Workbolt contributor, a task submission, and at least one Data Foundry review pass.
The goal is boring in the best way: every file is produced the same way, by a verified native speaker, with the quality bar written down and enforced.
How to use it
Find the dataset on our website: Multi-Accent English ASR Dataset — browse files, stream audio samples, and read the task spec on the sample tasks page. Full downloads are gated behind a short licensing step (see Licensing below).
Typical uses:
- Fine-tuning or domain-adapting existing ASR models for broader L2 accent coverage.
- Building accent-stratified evaluation suites sliced by L1 background and gender.
- Benchmarking pronunciation and disfluency handling across L2 varieties of English.
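For the fine-tuning and evaluation uses above, it usually helps to keep each speaker entirely in one split so that evaluation measures generalization to unseen voices. The sketch below assigns speakers by hashing `expert_id`; the helper names and the 20% default are hypothetical. Hashing does not guarantee balance across accent or gender, so check the resulting counts per group.

```python
import hashlib

def split_for(expert_id, eval_fraction=0.2):
    """Deterministically assign a speaker to 'train' or 'eval'."""
    digest = hashlib.sha256(str(expert_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return "eval" if bucket < eval_fraction * 2**64 else "train"

def speaker_disjoint_split(rows, eval_fraction=0.2):
    """Group rows into splits so that no expert_id appears in both."""
    splits = {"train": [], "eval": []}
    for row in rows:
        splits[split_for(row["expert_id"], eval_fraction)].append(row)
    return splits

# Toy rows: 50 made-up speakers with three clips each.
rows = [
    {"expert_id": f"speaker_{i}", "audio": f"clip_{i}_{k}"}
    for i in range(50) for k in range(3)
]
splits = speaker_disjoint_split(rows)
```

Because every prompt comes from the same Harvard Sentences corpus, the same sentence text can still appear in both splits. If that matters for your evaluation, hold out by `script_id` as well.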
Licensing
The Multi-Accent English ASR Dataset is released under the Open Data Commons Attribution License (ODC-By v1.0) — a permissive, attribution-only license designed specifically for open databases. In short, you can:
- Use the data for research or commercial purposes.
- Modify, combine, and redistribute it, including as part of derivative datasets.
- Train models (open or commercial) on it.
…as long as you attribute Ocular AI and preserve the license notice when you redistribute the database or a substantial part of it. Read the full license text (or the plain-language summary) for the formal terms.
How to get access
To keep a light touch on provenance and make sure we know who is building on top of the data, full audio + transcript bundles are not a one-click public download. Instead, contact us with a short note about who you are and what you're planning to use the dataset for, and we'll send you the download link and any updates to the release. Sample audio and the task spec remain openly browsable on the dataset page.
Suggested attribution
When you use the dataset in a paper, model card, release notes, or a derivative dataset, please include a notice along the lines of:
Contains information from the Multi-Accent English ASR Dataset by Ocular AI, made available under the ODC Attribution License.
Limitations
A few things to keep in mind before you use the dataset:
- Read, not spontaneous. Every utterance is a contributor reading a Harvard Sentences prompt, so disfluencies are far less frequent than in real conversation. Treat this as a read-speech resource, not a conversational one.
- 16 kHz sample rate. Great for ASR; not suitable for full-band TTS, voice conversion, or anything that needs frequencies above 8 kHz, the Nyquist limit at this sample rate.
- Real-world recording conditions. Audio is captured on contributors' own hardware — laptop and phone mics in quiet rooms, not studios. That's deliberate: it looks more like production data than a clean-room corpus, but it also means background noise, room acoustics, and device coloration vary from file to file.
- Minimal transcripts. Lowercase `a–z` and spaces only, with disfluencies included and no punctuation, capitalization, or non-speech tags (laughter, coughs, breaths). If you need richer annotation, plan to layer it on top.
- Not every accent. The 11 L1 backgrounds above cover several of the largest L2-English populations, but they're far from exhaustive. Notably absent: Spanish, Portuguese, Arabic, Hindi, and Swahili L1 speakers, along with most native-English regional accents. More to come.
What's next
This release is published as-is, under the license above. We welcome contributions from the community — additional accents, richer annotations, derivative datasets, or downstream benchmarks — under the terms of the license.
Beyond this open release, we've developed more robust, enterprise-grade datasets tackling harder scenarios — including scenario-driven speech, full-duplex conversation, and broader coverage across TTS, STT, and LLM training use cases. If any of those are relevant to what you're building, read more about our Proprietary Datasets or talk to us directly — we're happy to walk through samples. If you're a native or fluent speaker interested in contributing to future Ocular datasets, join the Expert Network. And if you want to see what else we've shipped in the open, start with our open source datasets.
Authors

Louis Murerwa
Co-founder & CTO

Michael Moyo
Co-founder & CEO



