Multi-Accent English ASR Dataset
Our first open source dataset: a diverse English speech corpus spanning 11 L1 language backgrounds, balanced across genders, built for accent-aware ASR.

Today we're releasing our first open source dataset: a Multi-Accent English ASR Dataset — 7,377 recordings totaling 10.25 hours of audio, spanning 11 L1 language backgrounds, gender-balanced, and produced end-to-end by native speakers of each language on our Expert Network, Workbolt, and processed through our Data Foundry.
English is the world's most-transcribed language, and one of the clearest examples of what happens when frontier AI models haven't learned from the full range of real human speech. "Works for me" quietly means "works for people who sound like the model's training set." Accent coverage is still where most automatic speech recognition (ASR) systems silently degrade: higher word error rates (WERs), worse disfluency handling, and a long tail of speakers who have learned to speak louder and slower to be understood by their devices.
Closing that gap isn't a compute or architecture problem. It's a human-data problem — the cadence of a real human voice, across accents and dialects that no model was trained to understand. Scaling compute won't conjure it. Synthetic data won't approximate it. The only way through is to capture it from the people who actually speak that way, and encode it into the data frontier models learn from.
This dataset is a step toward doing that, in the open.
What's in the dataset
Each contribution is an expert-produced, two-part task (a read-aloud recording followed by a verbatim same-speaker transcript) designed to be directly useful for training and evaluating accent-aware ASR systems.
- 7,377 audio files / 10.25 hours of read English speech, sampled at 16 kHz, each paired with a verbatim transcript.
- 11 L1 language backgrounds of English speakers — Chinese, Vietnamese, Thai, Japanese, Russian, Polish, general East European, Indonesian, French, German, and South Korean — captured from native L1 speakers of each language reading English in their natural accent.
- Gender-balanced within each language group, so accent effects aren't confounded with speaker demographics.
- Script-driven read speech based on the public-domain Harvard Sentences corpus, giving a consistent, phonetically balanced prompt set across contributors.
- Verbatim lowercase transcripts with disfluencies included — see the two-part task below for the full spec.
You can explore the raw files and listen to samples directly in the Data Studio, and inspect the exact prompts, requirements, and rubrics used to produce it on the Sample Tasks page.
Dataset Structure
The dataset is organized to slot cleanly into existing ASR training and evaluation workflows. Each row is a single utterance paired with rich metadata about the script, the speaker, and the recording itself.
Core columns — what you need to train / evaluate:
- `target_accent` — the accent the contributor was recruited to represent.
- `expert_id` — stable identifier for the contributor on Workbolt.
- `script` — the prompt the contributor was asked to read.
- `audio` — path to the audio file.
- `transcription` — the verbatim transcript.
- `language` — speaker's L1 language.
- `gender` — self-reported speaker gender.
- `expert_accent(s)` — any additional accents the contributor natively speaks.
- `size` — file size of the audio clip.
Demographic & provenance metadata — useful for slicing and fairness analysis:
- `ethnicity` — self-reported ethnicity.
- `country` — country of residence at recording time.
- `city` — city of residence.
- `timezone` — local timezone of the recording.
- `script_id` — stable ID for the prompt, so you can group utterances by script.
- `estimated_duration` — expected length of the read prompt.
- `actual_duration` — measured length of the recording.
- `created_at` — timestamp the recording was submitted.
Audio files are grouped by accent and speaker, with transcription files aligned at the utterance level. That structure makes it straightforward to pull stratified samples — for example "all female Vietnamese speakers with actual_duration within 10% of estimated_duration" — and to run in-depth analyses by accent, speaker, or demographic characteristics without having to re-join data from external sources.
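With those columns, a stratified slice like the one above is a few lines of pandas. A minimal sketch — the stand-in rows below are illustrative placeholders, not real dataset values, and the real release ships its metadata as a table you would load instead:

```python
import pandas as pd

# Stand-in rows mimicking the dataset's metadata columns; in practice
# you would load the metadata table shipped with the release.
df = pd.DataFrame({
    "expert_id": ["vn_01", "vn_01", "vn_02", "cn_01"],
    "language": ["Vietnamese", "Vietnamese", "Vietnamese", "Chinese"],
    "gender": ["female", "female", "male", "female"],
    "estimated_duration": [3.0, 3.0, 3.0, 3.0],
    "actual_duration": [3.1, 4.5, 3.0, 3.2],
})

# Stratified slice: female Vietnamese speakers whose recording length
# is within 10% of the prompt's expected length.
mask = (
    (df["language"] == "Vietnamese")
    & (df["gender"] == "female")
    & ((df["actual_duration"] - df["estimated_duration"]).abs()
       <= 0.10 * df["estimated_duration"])
)
subset = df[mask]
```

Because every axis (accent, gender, L1, duration) lives in the same table, no external joins are needed for this kind of filtering.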
Why accent-balanced data matters
Most public ASR benchmarks are dominated by a handful of native-English accents. Models trained and evaluated on that distribution look great on paper and then underperform the moment they meet a real user base — Chinese-, Vietnamese-, Japanese-, Russian-, or French-accented English, and dozens of other varieties that are entirely standard for the people speaking them.
Balanced, accent-labeled data lets teams do three things that are otherwise hard:
- Train models that generalize across accents instead of regressing to the majority voice.
- Evaluate accent-specific WER as a first-class metric, not a footnote.
- Debug failure modes by slicing errors along accent, gender, and L1 language background — the axes that matter in production.
That's the shape we optimized this dataset for.
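To make accent-specific WER concrete as a first-class metric, here is a minimal, dependency-free sketch of per-accent aggregation (in practice you might use a library such as jiwer; the helper names here are illustrative):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)

def wer_by_accent(rows):
    """rows: iterable of (accent, reference, hypothesis) triples.

    Aggregates error counts within each accent slice so the metric is
    weighted by words, not by utterances.
    """
    totals = {}
    for accent, ref, hyp in rows:
        n = len(ref.split())
        errs = wer(ref, hyp) * n
        e, t = totals.get(accent, (0.0, 0))
        totals[accent] = (e + errs, t + n)
    return {a: e / t for a, (e, t) in totals.items()}
```

Reporting the per-accent dictionary alongside the corpus-level number is what keeps accent degradation from hiding inside an aggregate score.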
Building the Dataset
Every recording and transcript in this release was produced by a native speaker of one of the 11 target L1 languages, working on Workbolt, our global Expert Network. Contributors were recruited in their home country against each language background, so Chinese-accented English is read by a native Chinese speaker in China, Vietnamese-accented English by a native Vietnamese speaker in Vietnam, and so on.
Covered geographies
The 11 L1 backgrounds map to the following source geographies. "General East European" is intentionally a regional bucket rather than a single country — it captures the shared phonetic patterns across Eastern European speakers of English for teams that care about regional-level coverage without splitting every nationality into its own row.
| L1 Background | Primary Country / Region |
|---|---|
| Chinese | China |
| Vietnamese | Vietnam |
| Thai | Thailand |
| Japanese | Japan |
| Russian | Russia |
| Polish | Poland |
| General East European | Eastern Europe (mixed, incl. speakers from Ukraine and other regional varieties) |
| Indonesian | Indonesia |
| French | France |
| German | Germany |
| South Korean | South Korea |
The two-part task
Read more about the Sample Tasks here.
Each contributor completed a single task with two tightly-coupled parts, designed so the audio and the transcript are produced by the same person and stay honest to one another:
- Read and record. The contributor is shown a scripted English prompt drawn from the Harvard Sentences corpus and records themselves reading it aloud in their natural accent at 16 kHz — no imitation, no "neutralizing," no performance voice.
- Transcribe verbatim with disfluencies. The same contributor then transcribes their own recording — not the prompt they were given, but exactly what came out of their mouth. Transcripts follow a deliberately minimal format:
- Verbatim text — every word that is said, in the order it is said.
- Character set — only the 26 lowercase letters `a–z` plus the space character.
- Disfluencies included — fillers (um, uh, er, mm), repetitions, false starts, and self-corrections are all written out.
- No special annotation — no punctuation, no capitalization, and no tags for non-speech events (laughter, coughs, breaths, background noise are simply not transcribed).
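Because the character set is so small, spec compliance reduces to a single regular expression. A sketch of a validator — not the actual Data Foundry check, just one way to enforce the same constraint:

```python
import re

# Transcript spec: lowercase a-z words separated by single spaces,
# with no leading or trailing whitespace. Illustrative helper only.
VALID = re.compile(r"[a-z]+( [a-z]+)*")

def is_valid_transcript(text: str) -> bool:
    """True iff `text` satisfies the minimal transcript character set."""
    return VALID.fullmatch(text) is not None
```

The same pattern doubles as a tokenizer sanity check: any transcript that passes splits cleanly on spaces into pure lowercase word tokens.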
This two-part design keeps the audio and text grounded in the same speaker's production, while the minimal character set makes the transcripts trivial to tokenize and easy to use as ground truth for ASR training and WER evaluation. Every submission then flowed through our Data Foundry — see Data Quality below for the full QA pipeline.
Data Quality
Dataset quality is enforced end-to-end in our Data Foundry. Every submission has to pass both an automated check and a human review against the rubric before it ships.
Audio quality gates
- Clean capture, not studio. Recordings must be made in a quiet environment at 16 kHz, with no clipping and no truncation.
- SNR floors applied to every file.
- Completeness — the recording must cover the full prompt, with no missing or cut-off portions.
- Natural accent only — reviewers flag imitation, "neutralizing," or performance voice.
- No synthetic augmentation, no scraped web audio — every file is an original recording from a Workbolt contributor.
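The automated side of these gates is straightforward to approximate in code. A rough sketch — the threshold values and the quietest-frames noise heuristic are illustrative assumptions, not the actual Data Foundry parameters:

```python
import numpy as np

def audio_gates(samples: np.ndarray, clip_thresh: float = 0.999,
                snr_floor_db: float = 20.0) -> dict:
    """Rough automated gates: clipping ratio and an SNR estimate.

    Assumes `samples` is mono float audio in [-1, 1] at 16 kHz. The SNR
    estimate treats the quietest 10% of 25 ms frames as the noise floor,
    a crude stand-in for whatever the production pipeline uses.
    """
    clipped = float(np.mean(np.abs(samples) >= clip_thresh))
    frame = 400  # 25 ms at 16 kHz
    n = len(samples) // frame
    frames = samples[: n * frame].reshape(n, frame)
    energies = np.sort(np.mean(frames ** 2, axis=1))
    noise = max(float(np.mean(energies[: max(n // 10, 1)])), 1e-12)
    signal = max(float(np.mean(energies)), 1e-12)
    snr_db = 10 * np.log10(signal / noise)
    return {
        "clipping_ratio": clipped,
        "snr_db": snr_db,
        "passes": clipped < 0.001 and snr_db >= snr_floor_db,
    }
```

Completeness and truncation checks additionally need the prompt (e.g. comparing `actual_duration` against `estimated_duration`), so they live outside this purely acoustic sketch.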
Transcript quality gates
- Verbatim to the audio, not to the prompt — if the contributor misreads or re-starts, the transcript follows what was actually said.
- Full disfluency coverage — fillers, repetitions, false starts, and self-corrections must be written out.
- Character-set compliance — lowercase `a–z` and spaces only; any disallowed character (punctuation, capitalization, non-speech tags) is flagged.
- Same-speaker alignment — contributors correct their own transcripts against playback, so the text is grounded in what they actually produced.
QA pipeline
- Automated acoustic QA — clipping detection, background-noise and SNR thresholds, silence/truncation checks, and completeness comparison against the prompt.
- Human review against the rubric — reviewers listen to each file and read each transcript, flagging imitated accents, missing disfluencies, disallowed characters, or transcript drift. Failing tasks are sent back to the contributor for re-work.
- Provenance — every one of the 7,377 files is traceable to a Workbolt contributor, a task submission, and at least one Data Foundry review pass.
The goal is boring in the best way: every file is produced the same way, by a verified native speaker, with the quality bar written down and enforced.
How to use it
Find the dataset on our website: Multi-Accent English ASR Dataset — browse files, stream audio samples, and read the task spec on the sample tasks page. Full downloads are gated behind a short licensing step (see Licensing below).
Typical uses:
- Fine-tuning or domain-adapting existing ASR models for broader L2 accent coverage.
- Building accent-stratified evaluation suites sliced by L1 background and gender.
- Benchmarking pronunciation and disfluency handling across L2 varieties of English.
Licensing
The Multi-Accent English ASR Dataset is released under the Open Data Commons Attribution License (ODC-By v1.0) — a permissive, attribution-only license designed specifically for open databases. In short, you can:
- Use the data for research or commercial purposes.
- Modify, combine, and redistribute it, including as part of derivative datasets.
- Train models (open or commercial) on it.
…as long as you attribute Ocular AI and preserve the license notice when you redistribute the database or a substantial part of it. Read the full license text (or the plain-language summary) for the formal terms.
How to get access
To keep a light touch on provenance and make sure we know who is building on top of the data, full audio + transcript bundles are not a one-click public download. Instead, contact us with a short note about who you are and what you're planning to use the dataset for, and we'll send you the download link and any updates to the release. Sample audio and the task spec remain openly browsable on the dataset page.
Suggested attribution
When you use the dataset in a paper, model card, release notes, or a derivative dataset, please include a notice along the lines of:
Contains information from the Multi-Accent English ASR Dataset by Ocular AI, made available under the ODC Attribution License.
Limitations
A few things to keep in mind before you use the dataset:
- Read, not spontaneous. Every utterance is a contributor reading a Harvard Sentences prompt, so disfluencies are far less frequent than in real conversation. Treat this as a read-speech resource, not a conversational one.
- 16 kHz sample rate. Great for ASR; but the Nyquist limit means nothing above 8 kHz is captured, so it's not suitable for high-fidelity TTS, voice conversion, or anything that needs higher frequencies.
- Real-world recording conditions. Audio is captured on contributors' own hardware — laptop and phone mics in quiet rooms, not studios. That's deliberate: it looks more like production data than a clean-room corpus, but it also means background noise, room acoustics, and device coloration vary from file to file.
- Minimal transcripts. Lowercase `a–z` and spaces only, with disfluencies included and no punctuation, capitalization, or non-speech tags (laughter, coughs, breaths). If you need richer annotation, plan to layer it on top.
- Not every accent. The 11 L1 backgrounds above cover several of the largest L2-English populations, but they're far from exhaustive. Notably absent: Spanish, Portuguese, Arabic, Hindi, and Swahili L1 speakers, along with most native-English regional accents. More to come.
What's next
This release is published as-is, under the license above. We welcome contributions from the community — additional accents, richer annotations, derivative datasets, or downstream benchmarks — under the terms of the license.
Beyond this open release, we've developed more robust, enterprise-grade datasets tackling harder scenarios — including scenario-driven speech, full-duplex conversation, and broader coverage across TTS, STT, and LLM training use cases. If any of those are relevant to what you're building, read more about our Proprietary Datasets or talk to us directly — we're happy to walk through samples. If you're a native or fluent speaker interested in contributing to future Ocular datasets, join the Expert Network. And if you want to see what else we've shipped in the open, start with our open source datasets.
Authors
Louis Murerwa
Co-founder & CTO
Michael Moyo
Co-founder & CEO
