Beyond Fisher: A High Fidelity Full-Duplex Dataset for Conversational AI
Every major full-duplex voice AI model traces its training data to Fisher English — telephone conversations recorded at 8 kHz in 2004. Here is the high-fidelity corpus the field has been asking for.


Every major full-duplex voice AI model traces its training data back to the same source: the Fisher English corpus — a catalog of telephone conversations recorded at 8 kHz in 2004.
Meta's dGSLM (2022), the foundational full-duplex model, trained on 2,000 hours of Fisher. SyncLLM (University of Washington, 2024) used 1,927 hours of Fisher as its real-data anchor — the remaining 99% of its 215,000-hour training set was synthesized by TTS. SALM-Duplex (NVIDIA, Interspeech 2025) and PersonaPlex (NVIDIA, 2026) — the two openly available end-to-end duplex speech-to-speech models — were both grounded on the same telephony-grade corpus. PersonaPlex drew speaker voice samples from Fisher among others, then synthesized the conversations themselves.
The 2025 survey of the field (From Turn-Taking to Synchronous Dialogue, Chen et al.) names data scarcity as the primary bottleneck to progress — more limiting than architecture. The Full-Duplex-Bench evaluation (Lin et al., NTU / UC Berkeley, 2025) finds that no open-source model achieves both natural backchanneling and appropriate interruption behavior simultaneously, and attributes the gap to training data, not model design.
The field has asked for better data. Here it is.
How Labs Have Patched the Gap
Rather than collect better data, labs have patched the Fisher gap with increasingly elaborate simulations.
SyncLLM synthesized 193,000 hours of dialogue via TTS, then used 1,927 hours of Fisher as a real-data anchor. Synthetic TTS data does not contain real prosody, real disfluency, or real overlaps. The model learns the structural pattern of conversation from fake conversations, then anchors to telephony-grade audio for acoustic grounding.
SALM-Duplex and PersonaPlex go further. For barge-in — the moment a speaker interrupts mid-utterance — SALM-Duplex inserts a programmatic 0.64-second silence followed by the interrupting voice. PersonaPlex uses "negative-duration silence" to stitch the interrupting voice on top of the interrupted speaker's audio.
These are engineering approximations of a real human behavior. A barge-in involves overlapping prosody, pitch alignment between both speakers, and a floor negotiation — one speaker yields — that unfolds over hundreds of milliseconds with acoustic cues at every step. The simulation captures none of that.
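The two stitching strategies described above can be sketched with a single signed gap parameter. This is an illustration only, not the actual data pipeline of SALM-Duplex or PersonaPlex: the function name `stitch_turns`, the 16 kHz rate, the 0.25-second overlap, and the sine-tone stand-ins for speech are all assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed rate for this illustration


def stitch_turns(first: np.ndarray, second: np.ndarray, gap_s: float,
                 sample_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Place `second` after `first`, offset by `gap_s` seconds.

    gap_s > 0 inserts silence between the turns; gap_s < 0 starts the
    second turn before the first has ended, so the two are summed.
    """
    gap = int(round(gap_s * sample_rate))
    start = max(len(first) + gap, 0)            # where the second turn begins
    total = max(len(first), start + len(second))
    out = np.zeros(total, dtype=np.float64)
    out[:len(first)] += first
    out[start:start + len(second)] += second    # any overlap is a plain sum
    return out


# One-second stand-in "utterances" (tones, not speech).
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
speaker_a = 0.3 * np.sin(2 * np.pi * 220 * t)
speaker_b = 0.3 * np.sin(2 * np.pi * 330 * t)

with_silence = stitch_turns(speaker_a, speaker_b, gap_s=0.64)   # fixed pause
with_overlap = stitch_turns(speaker_a, speaker_b, gap_s=-0.25)  # negative gap
```

In both cases the interrupted speaker's audio is unchanged by the interruption: nothing in the first signal reacts to the second, which is the limitation the paragraph above describes.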
The Thinking Machines Interaction Model scores 77.8 on FD-bench against ~50 for the best open models. Researchers attribute this gap not to architectural superiority but to proprietary access to high-quality real conversational data.
The conclusion is the same everywhere: the architecture is largely solved; the data is the constraint.
What the Field Has Explicitly Asked For
Two recent papers are worth reading as calls for data:
Full-Duplex-Bench (Lin et al., 2025): The benchmark uses 727 samples drawn from Candor, ICC, and GPT-4o/ChatTTS synthetic audio. The authors explicitly note the scarcity of high-quality annotated full-duplex data as a limitation of the benchmark itself. A corpus with labeled backchannel timing, barge-in types, and turn-mechanics would allow the benchmark to be run at scale with ground truth rather than proxy metrics.
FLEXI (Ge et al., 2025): Introduces six interaction scenario types — including emergency and emotional support — and finds significant open-source/commercial gaps in every scenario. The authors identify per-scenario training data coverage as a key missing ingredient. A corpus annotated with scenario types and role dynamics directly addresses this.
Ocular Hi-Fi Full-Duplex Dataset
Ocular captures each speaker on a separate isolated channel at 48 kHz / 24-bit. Before recording begins, the session passes a quality gate: spectral energy verified above 2 kHz, noise floor below threshold, peak levels within headroom, SNR confirmed. Sessions that fail the gate don't record. Fisher had no equivalent gate — it accepted whatever the telephone network delivered.
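A gate of this kind could look roughly like the sketch below. It is not Ocular's implementation: the function name `quality_gate`, the percentile-based noise and signal estimates, and every threshold value are assumptions chosen for illustration, since the post does not publish the real ones.

```python
import math

import numpy as np


def db(x: float) -> float:
    """Convert a linear amplitude to decibels, guarding against log(0)."""
    return 20.0 * math.log10(max(x, 1e-12))


def quality_gate(samples: np.ndarray, sample_rate: int,
                 max_peak_dbfs: float = -3.0,       # assumed threshold
                 max_noise_dbfs: float = -60.0,     # assumed threshold
                 min_snr_db: float = 30.0,          # assumed threshold
                 min_high_band_ratio: float = 1e-4  # assumed threshold
                 ) -> dict:
    """Run four checks on a mono float clip in [-1, 1] (at least 20 ms long)."""
    frame = sample_rate // 50                       # 20 ms frames
    n = len(samples) // frame
    frames = samples[:n * frame].reshape(n, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))

    peak_dbfs = db(float(np.max(np.abs(samples))))
    noise_dbfs = db(float(np.percentile(rms, 10)))   # quietest frames
    signal_dbfs = db(float(np.percentile(rms, 90)))  # loudest frames
    snr_db = signal_dbfs - noise_dbfs

    # Share of total spectral power at or above 2 kHz.
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    high_ratio = float(power[freqs >= 2000.0].sum()) / max(float(power.sum()), 1e-12)

    checks = {
        "peak_headroom": peak_dbfs <= max_peak_dbfs,
        "noise_floor": noise_dbfs <= max_noise_dbfs,
        "snr": snr_db >= min_snr_db,
        "energy_above_2khz": high_ratio >= min_high_band_ratio,
    }
    return {"passed": all(checks.values()), "checks": checks}


# Synthetic test clip: half a second of room tone, then a tone mix.
rng = np.random.default_rng(0)
sr = 48_000
t = np.arange(sr // 2) / sr
quiet = 1e-4 * rng.standard_normal(sr // 2)
speechlike = (0.3 * np.sin(2 * np.pi * 300 * t)
              + 0.05 * np.sin(2 * np.pi * 3000 * t)
              + 1e-4 * rng.standard_normal(sr // 2))
clip = np.concatenate([quiet, speechlike])

report = quality_gate(clip, sr)
clipped_report = quality_gate(np.clip(clip * 5, -1.0, 1.0), sr)
```

Returning the individual checks alongside the overall verdict lets a failed session report which condition to fix, for example input gain when only `peak_headroom` fails.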
The sessions are unscripted. Speakers are not given prompts, do not read from scripts, and are not asked to demonstrate specific behaviors. The conversations are collected worldwide across diverse accents, ages, and topics. Natural barge-ins, real backchannels, and genuine hesitation occur because the conditions for them to occur are present.
Fisher captured what telephone infrastructure allowed: audio sampled at 8 kHz, which cannot represent anything above 4 kHz (the Nyquist limit), through handset microphones, in whatever acoustic environment callers happened to be in at the time.
The acoustic signals that separate human speech from TTS output (the audible breath before a response, the prosodic fall of a completed clause, the 2–5 kHz presence region that carries vocal warmth, the overlap of two voices at full fidelity) live partly or wholly above 4 kHz. Fisher cannot contain that part of them. Every model trained primarily on Fisher is learning from data that is acoustically blind to the layer it most needs.
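The 4 kHz ceiling follows from the sampling rate: audio sampled at 8 kHz has a Nyquist limit of 8000 / 2 = 4000 Hz. The sketch below shows the effect on a synthetic 48 kHz signal. It uses an idealized brick-wall filter as a stand-in, which is an assumption for illustration and not a model of the telephone network's actual frequency response. The helper names and the two test tones are likewise hypothetical.

```python
import numpy as np


def energy_fraction_above(samples: np.ndarray, sample_rate: int,
                          cutoff_hz: float) -> float:
    """Fraction of total spectral power strictly above `cutoff_hz`."""
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return float(power[freqs > cutoff_hz].sum() / power.sum())


def brickwall_lowpass(samples: np.ndarray, sample_rate: int,
                      cutoff_hz: float) -> np.ndarray:
    """Zero every frequency bin at or above `cutoff_hz` (idealized filter)."""
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    spectrum[freqs >= cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(samples))


capture_rate = 48_000
telephone_rate = 8_000
nyquist_hz = telephone_rate / 2          # 4000 Hz

# One second of a 1 kHz tone plus a weaker 6 kHz tone.
t = np.arange(capture_rate) / capture_rate
wideband = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)
telephone_like = brickwall_lowpass(wideband, capture_rate, nyquist_hz)

before = energy_fraction_above(wideband, capture_rate, nyquist_hz)
after = energy_fraction_above(telephone_like, capture_rate, nyquist_hz)
```

Here `before` is the share of power carried by the 6 kHz component, and `after` is what remains of it once the signal is limited to the telephone band.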
Samples
Two unscripted conversations, captured per-speaker on isolated channels. Speaker A and Speaker B were recorded simultaneously on separate Shure wired microphones (MV7 / MV6), time-synchronized, at 48 kHz / 24-bit lossless.
Conversation 7 — Speaker A · Shure MV7 · Room 1
Conversation 7 — Speaker B · Shure MV6 · Room 2
Two speakers, separate rooms, separate microphones, time-synchronized. Each channel is cleanly isolated: no bleed, no merge, no re-encode.
Conversation 8 — Speaker A · Shure MV6 · Room 2
Conversation 8 — Speaker B · Shure MV7 · Room 1
Same pair, different room assignments. Cross-correlation between channels confirms independent capture: Pearson r = 0.001, indicating no mic bleed.
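A check of this kind can be sketched as follows. The post does not say exactly how the correlation was computed, so this is one plausible version under stated assumptions: Pearson r between the two channels, scanned over a small range of lags because bleed between microphones would normally arrive with a delay. The function names, the lag range, and the synthetic noise signals are hypothetical.

```python
import numpy as np


def pearson_at_lag(a: np.ndarray, b: np.ndarray, lag: int) -> float:
    """Pearson r between a[i] and b[i + lag] over the overlapping samples."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    if lag >= 0:
        x, y = a[:n - lag], b[lag:]
    else:
        x, y = a[-lag:], b[:n + lag]
    return float(np.corrcoef(x, y)[0, 1])


def max_abs_correlation(a: np.ndarray, b: np.ndarray, max_lag: int = 20) -> float:
    """Largest |r| across lags in [-max_lag, max_lag] samples."""
    return max(abs(pearson_at_lag(a, b, lag))
               for lag in range(-max_lag, max_lag + 1))


# Synthetic stand-ins: two independent channels, then one with simulated bleed.
rng = np.random.default_rng(0)
channel_a = rng.standard_normal(48_000)
channel_b = rng.standard_normal(48_000)
# Channel B with 20% of channel A leaking in, 10 samples late.
channel_b_bleed = channel_b + 0.2 * np.roll(channel_a, 10)

r_independent = max_abs_correlation(channel_a, channel_b)
r_bleed = max_abs_correlation(channel_a, channel_b_bleed)
r_bleed_zero_lag = abs(pearson_at_lag(channel_a, channel_b_bleed, 0))
```

The lag scan matters because delayed bleed can be nearly invisible at zero lag: in this synthetic case `r_bleed_zero_lag` stays small while `r_bleed` picks up the leak.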
What Annotation of This Data Enables
The samples above are raw capture — 48 kHz / 24-bit, channel-separated, unprocessed. What makes a corpus training-ready is the annotation layer on top of that capture. Because the audio exists at full fidelity with independent channels, it can support annotation depth that lower-quality or merged captures cannot.
The table below maps each annotation type to the specific research gap it closes and the benchmark metric it directly improves. This is the layer that every paper in this space has identified as missing.
| Annotation type | Research gap it closes | Benchmark metric |
|---|---|---|
| Backchannel timing — "mm-hmm", "yeah", "right" labeled with timestamp, speaker, and whether it occurred during the other speaker's turn | No public corpus annotates backchannels at broadband quality | JSD-D (Full-Duplex-Bench) |
| Barge-in type — aggressive interruption vs. collaborative overlap, with overlap duration and floor outcome | Every open model simulates barge-in; none trained on labeled real barge-in | TOR (Full-Duplex-Bench) |
| Silence classification — thinking pause vs. turn yield vs. floor hold | Models can't distinguish "wait" from "speak now"; causes false interruptions | Response latency, TOR-D |
| Turn-construction unit boundaries — where a syntactically complete turn ends, with prosodic signals | "When is it safe to speak?" — the hardest open problem in full-duplex | TRP detection accuracy |
| Filled pauses and disfluencies — "um", "uh", false-starts, verbatim | 16 filler tokens no TTS produces naturally; absent from every training corpus | Naturalness MOS |
| Paralinguistic events — breath, laughter, sigh, clear-throat | Voice cloning and expressive TTS both require labeled breath and laugh events | Naturalness MOS, voice cloning quality |
| Scenario type — QA, emotional support, procedural, casual, emergency | FLEXI finds open/commercial gaps in every scenario; coverage requires labeled data | CSC (FLEXI) |
Each annotation layer is a direct training signal for a metric the field already uses to evaluate models. The annotation protocol covering all seven layers — plus prosodic markers, EQ trajectory, and semantic role dynamics — is available on request.
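To make the table concrete, the sketch below shows one way time-aligned annotations could be represented. It is not the Ocular annotation protocol, which is available only on request: every class name, field name, and label string here is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    BACKCHANNEL = "backchannel"
    BARGE_IN = "barge_in"
    SILENCE = "silence"
    TCU_BOUNDARY = "tcu_boundary"
    DISFLUENCY = "disfluency"
    PARALINGUISTIC = "paralinguistic"


@dataclass(frozen=True)
class Event:
    type: EventType
    speaker: str       # "A" or "B"
    start_s: float
    end_s: float
    label: str = ""    # e.g. "mm-hmm", "collaborative", "thinking_pause"


@dataclass(frozen=True)
class Turn:
    speaker: str
    start_s: float
    end_s: float


@dataclass(frozen=True)
class Session:
    scenario: str      # e.g. "casual", "emotional_support"
    turns: list[Turn]
    events: list[Event]


def during_other_turn(event: Event, turns: list[Turn]) -> bool:
    """True if the event starts inside a turn held by the other speaker."""
    return any(
        t.speaker != event.speaker and t.start_s <= event.start_s < t.end_s
        for t in turns
    )


turns = [Turn("A", 0.0, 4.2), Turn("B", 4.5, 7.0)]
backchannel = Event(EventType.BACKCHANNEL, "B", 2.1, 2.4, "mm-hmm")
filler = Event(EventType.DISFLUENCY, "B", 5.0, 5.3, "um")
session = Session("casual", turns, [backchannel, filler])

overlapping = [e for e in session.events if during_other_turn(e, session.turns)]
```

Keeping scenario type at the session level and the other layers as timestamped events mirrors the split in the table between per-conversation labels and per-moment labels.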
How This Corpus Compares
| Corpus | Sample rate | Channels | Backchannels | Real barge-in | Paralinguistics | Scenario labels |
|---|---|---|---|---|---|---|
| Ocular Hi-Fi | 48 kHz / 24-bit | Separated | Labeled | Real | Labeled | Labeled |
| Fisher English | 8 kHz | Separated | None | Real (unlabeled) | None | None |
| SyncLLM training | TTS | Merged | None | Simulated | None | Text-only |
| SALM-Duplex training | TTS + 8 kHz | Merged | None | Simulated (0.64s) | None | None |
| PersonaPlex training | TTS + audiobook | Merged | None | Simulated (neg. duration) | None | Role prompts |
Every "None" in this table is a gap the evaluation literature has named as a bottleneck.
What We Have Not Done Yet
We have not fine-tuned a model on this corpus.
The claim this post makes is a prediction: that training on this corpus rather than on Fisher or TTS-synthetic data will produce measurable improvements on Full-Duplex-Bench metrics — particularly TOR, because real barge-in labels are the annotation type most absent from every prior training set. That experiment is the next step and will be published as a follow-up.
What we can demonstrate now: the corpus exists and its capture quality is verified, and together these address the gaps the evaluation literature identifies as the primary training-data constraints. The samples above are the warrant for that claim.
Request Access
We are sharing sample packages with research teams working on full-duplex speech systems, evaluation benchmarks, and TTS training. A sample package includes channel-separated FLAC files, spectrograms, conformance reports, and the full annotation protocol specification.
The fine-tuning results follow in a subsequent post.
Author

Louis Murerwa
Co-founder & CTO

