April 21, 2026 | Product Release

Multi-Accent English ASR Dataset

Our first open source dataset: a diverse English speech corpus spanning 11 L1 language backgrounds, balanced across genders, built for accent-aware ASR.

Louis Murerwa and Michael Moyo
Today we're releasing our first open source dataset, the Multi-Accent English ASR Dataset: 7,377 recordings totaling 10.25 hours of audio, spanning 11 L1 language backgrounds and gender-balanced. It was produced end-to-end by native speakers of each language on Workbolt, our Expert Network, and processed through our Data Foundry.

English is the world's most-transcribed language, and one of the clearest examples of what happens when frontier AI models haven't learned from the full range of real human speech. Accent coverage is still where most automatic speech recognition (ASR) systems silently degrade: higher word error rates (WERs), worse disfluency handling, and a long tail of speakers who have learned to speak louder and slower to be understood by their devices.[1]

"Works for me" quietly means "works for people who sound like the model's training set."

Closing that gap isn't a compute or architecture problem. It's a human-data problem — the cadence of a real human voice, across accents and dialects that no model was trained to understand.[2]

Scaling compute won't conjure it. Synthetic data won't approximate it.

The only way through is to capture it from the people who actually speak that way, and encode it into the data frontier models learn from. This dataset is a step toward doing that, in the open.

What's in the dataset

Each contribution is an expert-produced, two-part task (a read-aloud recording followed by a verbatim same-speaker transcript) designed to be directly useful for training and evaluating accent-aware ASR systems.

What's in the release

  1. 7,377 audio files: 10.25 hours of read English speech sampled at 16 kHz, each paired with a verbatim transcript produced by the same speaker.
  2. 11 L1 language backgrounds (Chinese, Vietnamese, Thai, Japanese, Russian, Polish, general East European, Indonesian, French, German, and South Korean), recorded by native L1 speakers in their natural accent.
  3. Gender-balanced within every language group, so accent effects in your model aren't confounded with speaker demographics.
  4. Script-driven read speech from the public-domain Harvard Sentences corpus, giving a consistent, phonetically balanced prompt set across every contributor.
  5. Verbatim lowercase transcripts with disfluencies: fillers, repetitions, false starts, and self-corrections written out exactly as spoken, with no punctuation, no capitalization, and no non-speech tags.
  6. Open under ODC-By v1.0: a permissive, attribution-only license for research and commercial training use, including derivative datasets and downstream models.

You can explore the raw files and listen to samples directly in the Data Studio, and inspect the exact prompts, requirements, and rubrics used to produce it on the Sample Tasks page.

Dataset Information

A single datasheet covering every property a team typically asks about before pulling down the bundle — release version, modality, totals, format, provenance, QA, and license — so you can confirm fit at a glance before reading the deeper sections below.

Every property a team asks about before training on the dataset, in one datasheet.

| Property | Value |
| --- | --- |
| Name | Multi-Accent English ASR Dataset |
| Version | v1.0 (April 2026) |
| Tier | Open Source |
| Modality | Audio + text (speech-to-text / ASR) |
| Spoken language | English |
| L1 backgrounds | 11: Chinese, Vietnamese, Thai, Japanese, Russian, Polish, General East European, Indonesian, French, German, South Korean |
| Total recordings | 7,377 utterances |
| Total duration | 10.25 hours |
| Speaking style | Read speech: Harvard Sentences corpus (public domain) |
| Sample rate | 16 kHz, narrowband, mono |
| Transcripts | Verbatim · lowercase a–z + spaces only · disfluencies included · no punctuation, no capitalization, no non-speech tags |
| Speaker demographics | Self-reported by the contributor: ethnicity, country, city, timezone, gender; gender-balanced within every L1 group |
| Provenance | 100% human-produced by native L1 speakers on Workbolt in their home country; no synthetic, no scraped web audio |
| Recording conditions | Contributors' own hardware (laptop or phone microphones) in quiet rooms; SNR floor and clipping checks enforced per file |
| QA pipeline | Automated acoustic QA (clipping, SNR, completeness) plus human review against the rubric in the Ocular Data Foundry |
| License | ODC-By v1.0: permissive, attribution-only, suitable for research and commercial training |
| Access | Sample audio and spec public on the Data Studio; full bundles via contact |
| Suggested citation | Contains information from the Multi-Accent English ASR Dataset by Ocular AI, made available under ODC-By v1.0. |

Properties listed here describe the v1.0 release shipped April 2026. Future releases will be documented as new versions with their own datasheets.
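As a quick sanity check, the totals in the datasheet imply an average clip of roughly five seconds. The calculation below uses only the published figures (7,377 recordings, 10.25 hours, 16 kHz mono); the variable names are ours, and the result is approximate because the published duration is rounded.

```python
# Back-of-the-envelope check using only figures from the datasheet.
TOTAL_RECORDINGS = 7_377
TOTAL_HOURS = 10.25
SAMPLE_RATE_HZ = 16_000  # mono, so one sample per tick

total_seconds = TOTAL_HOURS * 3600
avg_seconds = total_seconds / TOTAL_RECORDINGS
total_samples = int(total_seconds * SAMPLE_RATE_HZ)

print(f"total audio: {total_seconds:.0f} s")
print(f"average utterance: {avg_seconds:.2f} s")
print(f"total samples at 16 kHz: {total_samples:,}")
```

An average near five seconds is in line with single-sentence read prompts.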

Dataset Structure

The dataset is organized to slot cleanly into existing ASR training and evaluation workflows. Each row is a single utterance paired with rich metadata about the script, the speaker, and the recording itself.

Core columns

What you need to train and evaluate.

| Column | Description |
| --- | --- |
| target_accent | The accent the contributor was recruited to represent. |
| expert_id | Stable identifier for the contributor on Workbolt. |
| script | The prompt the contributor was asked to read. |
| audio | Path to the audio file. |
| transcription | The verbatim transcript. |
| language | Speaker's L1 language. |
| gender | Self-reported speaker gender. |
| expert_accent(s) | Any additional accents the contributor natively speaks. |
| size | File size of the audio clip. |

Each row in the dataset is a single utterance. Pair `audio` with `transcription` for ASR training, and use `target_accent` / `language` / `gender` for stratified sampling and per-accent WER evaluation.
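To make the per-accent WER idea concrete, here is a minimal sketch in plain Python. It assumes each row is a dict keyed by the column names above and that you already have one model hypothesis per row; the function names, the example rows, and the accent label values are illustrative assumptions, not part of the release.

```python
from collections import defaultdict

def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1], len(ref)

def wer_by_group(rows, hypotheses, key="target_accent"):
    """Pooled WER per group: total word errors / total reference words."""
    errors, words = defaultdict(int), defaultdict(int)
    for row, hyp in zip(rows, hypotheses):
        e, n = word_errors(row["transcription"], hyp)
        errors[row[key]] += e
        words[row[key]] += n
    return {g: errors[g] / words[g] for g in words if words[g]}

# Hypothetical rows; the label values are assumptions for illustration.
rows = [
    {"target_accent": "Vietnamese",
     "transcription": "the birch canoe slid on the smooth planks"},
    {"target_accent": "Vietnamese",
     "transcription": "glue the sheet to the dark blue background"},
    {"target_accent": "German",
     "transcription": "the uh the boy was there when the sun rose"},
]
hypotheses = [
    "the birch canoe slid on the smooth planks",
    "glue the sheet to the dark background",
    "the boy was there when the sun rose",
]

for accent, wer in sorted(wer_by_group(rows, hypotheses).items()):
    print(f"{accent}: {wer:.4f}")
# German: 0.2000
# Vietnamese: 0.0625
```

Note that the last example is penalized for dropping the filler and the repeated word: because the transcripts are verbatim, a model that silently cleans up disfluencies will score worse unless you normalize both sides first. Passing `key="gender"` or `key="language"` slices along the other axes.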

Demographic & provenance metadata

Useful for slicing and fairness analysis.

| Column | Description |
| --- | --- |
| ethnicity | Self-reported ethnicity. |
| country | Country of residence at recording time. |
| city | City of residence. |
| timezone | Local timezone of the recording. |
| script_id | Stable ID for the prompt, so you can group utterances by script. |
| estimated_duration | Expected length of the read prompt. |
| actual_duration | Measured length of the recording. |
| created_at | Timestamp the recording was submitted. |

All demographic fields are self-reported by the contributor at task time and surfaced verbatim. Treat them as voluntary metadata, not training labels — and as the axes you slice WER along when you audit a model on this dataset.

Audio files are grouped by accent and speaker, with transcription files aligned at the utterance level. That structure makes it straightforward to pull stratified samples — for example "all female Vietnamese speakers with actual_duration within 10% of estimated_duration" — and to run in-depth analyses by accent, speaker, or demographic characteristics without having to re-join data from external sources.
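The example query above could be sketched as follows. This assumes rows are dicts keyed by the column names, that both duration columns are numeric and in the same unit, and that `language` and `gender` hold values spelled like "Vietnamese" and "female"; none of those encodings are confirmed here, so check them against the actual files.

```python
def within_tolerance(row, tolerance=0.10):
    """True if actual_duration is within `tolerance` of estimated_duration."""
    est, act = row["estimated_duration"], row["actual_duration"]
    return est > 0 and abs(act - est) <= tolerance * est

def select(rows, language, gender, tolerance=0.10):
    """Filter rows by L1 language, gender, and duration agreement."""
    return [
        r for r in rows
        if r["language"] == language
        and r["gender"] == gender
        and within_tolerance(r, tolerance)
    ]

# Hypothetical rows for illustration only.
rows = [
    {"expert_id": "e1", "language": "Vietnamese", "gender": "female",
     "estimated_duration": 5.0, "actual_duration": 5.3},
    {"expert_id": "e1", "language": "Vietnamese", "gender": "female",
     "estimated_duration": 5.0, "actual_duration": 6.2},
    {"expert_id": "e2", "language": "Vietnamese", "gender": "male",
     "estimated_duration": 5.0, "actual_duration": 5.0},
    {"expert_id": "e3", "language": "Thai", "gender": "female",
     "estimated_duration": 5.0, "actual_duration": 4.8},
]

subset = select(rows, "Vietnamese", "female")
print(len(subset))  # 1
```

A recording that runs well past its estimate often contains restarts or long pauses, so the tolerance doubles as a rough filter for cleaner reads.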

Why accent-balanced data matters

Most public ASR benchmarks are dominated by a handful of native-English accents.[3] Models trained and evaluated on that distribution look great on paper and then underperform the moment they meet a real user base — Chinese-, Vietnamese-, Japanese-, Russian-, or French-accented English, and dozens of other varieties that are entirely standard for the people speaking them.[4]

Balanced, accent-labeled data lets teams do three things that are otherwise hard:

  1. Train models that generalize across accents instead of regressing to the majority voice.
  2. Evaluate accent-specific WER as a first-class metric, not a footnote.
  3. Debug failure modes by slicing errors along accent, gender, and L1 language background — the axes that matter in production.

That's the shape we optimized this dataset for.

Building the Dataset

Every recording and transcript in this release was produced by a native speaker of one of the 11 target L1 languages, working on Workbolt, our global Expert Network. Contributors were recruited in their home country against each language background, so Chinese-accented English is read by a native Chinese speaker in China, Vietnamese-accented English by a native Vietnamese speaker in Vietnam, and so on.

Covered geographies

The 11 L1 backgrounds map to the following source geographies. Contributors were recruited in-country for each language so the accent in the audio reflects the speaker's natural everyday English, not an imitation studied abroad.

The 11 L1 backgrounds and where their contributors were recruited.

| L1 Background | Primary Country / Region |
| --- | --- |
| Chinese | China |
| Vietnamese | Vietnam |
| Thai | Thailand |
| Japanese | Japan |
| Russian | Russia |
| Polish | Poland |
| General East European | Eastern Europe (mixed, including Ukrainian and other regional varieties) |
| Indonesian | Indonesia |
| French | France |
| German | Germany |
| South Korean | South Korea |

'General East European' is intentionally a regional bucket rather than a single country — it captures the shared phonetic patterns across Eastern European L1 speakers of English (including Ukrainian and other regional varieties) for teams that care about regional-level coverage without splitting every nationality into its own row.

The two-part task

Read more about the Sample Tasks here.

Each contributor completed a single task with two tightly-coupled parts, designed so the audio and the transcript are produced by the same person and stay honest to one another:

  1. Read and record. The contributor is shown a scripted English prompt drawn from the Harvard Sentences corpus[5] and records themselves reading it aloud in their natural accent at 16 kHz — no imitation, no "neutralizing," no performance voice.
  2. Transcribe verbatim with disfluencies. The same contributor then transcribes their own recording — not the prompt they were given, but exactly what came out of their mouth. Transcripts follow a deliberately minimal format:
    • Verbatim text — every word that is said, in the order it is said.
    • Character set — only the 26 lowercase letters a–z plus the space character.
    • Disfluencies included — fillers (um, uh, er, mm), repetitions, false starts, and self-corrections are all written out.[6]
    • No special annotation — no punctuation, no capitalization, and no tags for non-speech events (laughter, coughs, breaths, background noise are simply not transcribed).

This two-part design keeps the audio and text grounded in the same speaker's production, while the minimal character set makes the transcripts trivial to tokenize and easy to use as ground truth for ASR training and WER evaluation. Every submission then flowed through our Data Foundry — see Data Quality below for the full QA pipeline.
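Because the character set is so small, checking a transcript against the format takes only a few lines. The sketch below is one way to do it, not the Data Foundry's actual check; the helper names are ours, and the filler list is just the four examples given above.

```python
import re

ALLOWED = re.compile(r"[a-z ]+")     # lowercase a-z plus space, non-empty
FILLERS = {"um", "uh", "er", "mm"}   # the filler examples named in the spec

def is_valid_transcript(text):
    """True if text is non-empty and uses only lowercase a-z and spaces."""
    return ALLOWED.fullmatch(text) is not None

def disallowed_characters(text):
    """Set of characters that fall outside the allowed character set."""
    return {ch for ch in text if not ("a" <= ch <= "z" or ch == " ")}

def count_fillers(text):
    """Number of whitespace-separated tokens that are known fillers."""
    return sum(1 for word in text.split() if word in FILLERS)

print(is_valid_transcript("the um the birch canoe slid"))   # True
print(is_valid_transcript("The birch canoe slid."))         # False
print(sorted(disallowed_characters("The birch canoe slid.")))  # ['.', 'T']
print(count_fillers("the um the uh birch canoe"))           # 2
```

The spec does not say whether leading, trailing, or doubled spaces are permitted; this sketch accepts them, so tighten the pattern if your pipeline needs a canonical form.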

Data Quality

Dataset quality is enforced end-to-end in our Data Foundry. Every submission has to pass both an automated check and a human review against the rubric before it ships.

Audio quality gates

  • Clean capture, not studio. Recordings must be made in a quiet environment at 16 kHz, with no clipping and no truncation.
  • SNR floors applied to every file.
  • Completeness — the recording must cover the full prompt, with no missing or cut-off portions.
  • Natural accent only — reviewers flag imitation, "neutralizing," or performance voice.
  • No synthetic augmentation, no scraped web audio — every file is an original recording from a Workbolt contributor.

Transcript quality gates

  • Verbatim to the audio, not to the prompt — if the contributor misreads or re-starts, the transcript follows what was actually said.
  • Full disfluency coverage — fillers, repetitions, false starts, and self-corrections must be written out.
  • Character-set compliance — lowercase a–z and spaces only; any disallowed character (punctuation, capitalization, non-speech tags) is flagged.
  • Same-speaker alignment — contributors correct their own transcripts against playback, so the text is grounded in what they actually produced.

QA pipeline

  • Automated acoustic QA — clipping detection, background-noise and SNR thresholds, silence/truncation checks, and completeness comparison against the prompt.
  • Human review against the rubric — reviewers listen to each file and read each transcript, flagging imitated accents, missing disfluencies, disallowed characters, or transcript drift. Failing tasks are sent back to the contributor for re-work.
  • Provenance — every one of the 7,377 files is traceable to a Workbolt contributor, a task submission, and at least one Data Foundry review pass.
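To give a feel for what an automated clipping check can look like, here is a deliberately simple sketch. It is not the Data Foundry's implementation: it assumes the audio has been decoded to 16-bit signed PCM samples (the bit depth is our assumption; only the 16 kHz rate is stated), and any pass/fail threshold you apply to the result is yours to choose.

```python
INT16_MAX = 32767
INT16_MIN = -32768

def _is_clipped(sample, margin):
    """True if the sample sits at, or within `margin` of, an int16 limit."""
    return sample >= INT16_MAX - margin or sample <= INT16_MIN + margin

def clipped_fraction(samples, margin=0):
    """Fraction of samples at the int16 limits (0.0 for empty input)."""
    if not samples:
        return 0.0
    return sum(_is_clipped(s, margin) for s in samples) / len(samples)

def longest_clipped_run(samples, margin=0):
    """Length of the longest run of consecutive clipped samples."""
    longest = run = 0
    for s in samples:
        if _is_clipped(s, margin):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest

# Tiny synthetic example, not real audio.
samples = [0, 1000, 32767, 32767, 32767, -200, -32768, 5]
print(clipped_fraction(samples))     # 0.5
print(longest_clipped_run(samples))  # 3
```

Runs of consecutive full-scale samples are usually a stronger sign of clipping than isolated peaks, which is why the sketch reports both numbers.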

The goal is boring in the best way: every file is produced the same way, by a verified native speaker, with the quality bar written down and enforced.

How to use it

Find the dataset on our website: Multi-Accent English ASR Dataset — browse files, stream audio samples, and read the task spec on the sample tasks page. Full downloads are gated behind a short licensing step (see Licensing below).

Typical uses:

  • Fine-tuning or domain-adapting existing ASR models for broader L2 accent coverage.
  • Building accent-stratified evaluation suites sliced by L1 background and gender.
  • Benchmarking pronunciation and disfluency handling across L2 varieties of English.
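When building an evaluation suite, one thing worth guarding against is the same speaker appearing in both training and test data, which would let a model score well on voices it has already heard. A minimal sketch, assuming rows are dicts carrying the `expert_id` column described earlier; the helper names, the 20% test fraction, and the ID format in the example are our assumptions, not part of the release.

```python
import hashlib

def split_for(expert_id, test_fraction=0.2):
    """Deterministically assign a speaker to 'train' or 'test'."""
    digest = hashlib.sha256(expert_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def split_rows(rows, test_fraction=0.2):
    """Group rows so that every speaker lands in exactly one split."""
    splits = {"train": [], "test": []}
    for row in rows:
        splits[split_for(row["expert_id"], test_fraction)].append(row)
    return splits

# Hypothetical rows: five made-up speakers, four utterances each.
rows = [{"expert_id": f"expert-{i % 5}", "script_id": i} for i in range(20)]
splits = split_rows(rows)

train_speakers = {r["expert_id"] for r in splits["train"]}
test_speakers = {r["expert_id"] for r in splits["test"]}
print(train_speakers & test_speakers)  # set()
```

Hashing the ID keeps the assignment stable across runs and across future releases, but it does not guarantee balance per accent or gender; for a stratified suite, apply the same idea within each `target_accent` and `gender` group. Because every contributor reads from the same prompt corpus, the same `script_id` can still appear on both sides of the split.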

Licensing

The Multi-Accent English ASR Dataset is released under the Open Data Commons Attribution License (ODC-By v1.0) — a permissive, attribution-only license designed specifically for open databases. In short, you can:

  • Use the data for research or commercial purposes.
  • Modify, combine, and redistribute it, including as part of derivative datasets.
  • Train models (open or commercial) on it.

…as long as you attribute Ocular AI and preserve the license notice when you redistribute the database or a substantial part of it. Read the full license text (or the plain-language summary) for the formal terms.

How to get access

To keep a light touch on provenance and make sure we know who is building on top of the data, full audio + transcript bundles are not a one-click public download. Instead, contact us with a short note about who you are and what you're planning to use the dataset for, and we'll send you the download link and any updates to the release. Sample audio and the task spec remain openly browsable on the dataset page.

Suggested attribution

When you use the dataset in a paper, model card, release notes, or a derivative dataset, please include a notice along the lines of:

Contains information from the Multi-Accent English ASR Dataset by Ocular AI, made available under the ODC Attribution License.

Limitations

A few things to keep in mind before you use the dataset:

  • Read, not spontaneous. Every utterance is a contributor reading a Harvard Sentences prompt, so disfluencies are far less frequent than in real conversation. Treat this as a read-speech resource, not a conversational one.
  • Narrowband, 16 kHz. Great for ASR, but a 16 kHz sample rate only captures frequencies up to 8 kHz, so the audio is not suitable for full-band TTS, voice conversion, or anything else that needs content above that limit.
  • Real-world recording conditions. Audio is captured on contributors' own hardware — laptop and phone mics in quiet rooms, not studios. That's deliberate: it looks more like production data than a clean-room corpus, but it also means background noise, room acoustics, and device coloration vary from file to file.
  • Minimal transcripts. Lowercase a–z and spaces only, with disfluencies included and no punctuation, capitalization, or non-speech tags (laughter, coughs, breaths). If you need richer annotation, plan to layer it on top.
  • Not every accent. The 11 L1 backgrounds above cover several of the largest L2-English populations, but they're far from exhaustive. Notably absent: Spanish, Portuguese, Arabic, Hindi, and Swahili L1 speakers, along with most native-English regional accents. More to come.

What's next

This release is published as-is, under the license above. We welcome contributions from the community — additional accents, richer annotations, derivative datasets, or downstream benchmarks — under the terms of the license.

Beyond this open release, we've developed more robust, enterprise-grade datasets tackling harder scenarios — including scenario-driven speech, full-duplex conversation, and broader coverage across TTS, STT, and LLM training use cases. If any of those are relevant to what you're building, read more about our Proprietary Datasets or talk to us directly — we're happy to walk through samples. If you're a native or fluent speaker interested in contributing to future Ocular datasets, join the Expert Network. And if you want to see what else we've shipped in the open, start with our open source datasets.


Authors

Louis Murerwa, Co-founder & CTO

Michael Moyo, Co-founder & CEO
