May 20, 2026 · Datasets

Beyond Fisher: A High Fidelity Full-Duplex Dataset for Conversational AI

Every major full-duplex voice AI model traces its training data to Fisher English — telephone conversations recorded at 8 kHz in 2004. Here is the high-fidelity corpus the field has been asking for.

Louis Murerwa

Every major full-duplex voice AI model traces its training data back to the same source: the Fisher English corpus — a catalog of telephone conversations recorded at 8 kHz in 2004.

Meta's dGSLM (2022), the foundational full-duplex model, trained on 2,000 hours of Fisher. SyncLLM (University of Washington, 2024) used 1,927 hours of Fisher as its real-data anchor — the remaining 99% of its 215,000-hour training set was synthesized by TTS. SALM-Duplex (NVIDIA, Interspeech 2025) and PersonaPlex (NVIDIA, 2026) — the two openly available end-to-end duplex speech-to-speech models — were both grounded on the same telephony-grade corpus. PersonaPlex drew speaker voice samples from Fisher among others, then synthesized the conversations themselves.

The 2025 survey of the field (From Turn-Taking to Synchronous Dialogue, Chen et al.) names data scarcity as the primary bottleneck to progress — more limiting than architecture. The Full-Duplex-Bench evaluation (Lin et al., NTU / UC Berkeley, 2025) finds that no open-source model achieves both natural backchanneling and appropriate interruption behavior simultaneously, and attributes the gap to training data, not model design.

The field has asked for better data. Here it is.


How Labs Have Patched the Gap

Rather than collect better data, labs have patched the Fisher gap with increasingly elaborate simulations.

SyncLLM synthesized the bulk of its training set, more than 210,000 hours of dialogue, via TTS, then used 1,927 hours of Fisher as a real-data anchor. Synthetic TTS data does not contain real prosody, real disfluency, or real overlaps. The model learns the structural pattern of conversation from synthetic conversations, then anchors to telephony-grade audio for acoustic grounding.

SALM-Duplex and PersonaPlex go further. For barge-in — the moment a speaker interrupts mid-utterance — SALM-Duplex inserts a programmatic 0.64-second silence followed by the interrupting voice. PersonaPlex uses "negative-duration silence" to stitch the interrupting voice on top of the interrupted speaker's audio.

These are engineering approximations of a real human behavior. A barge-in involves overlapping prosody, pitch alignment between both speakers, and a floor negotiation — one speaker yields — that unfolds over hundreds of milliseconds with acoustic cues at every step. The simulation captures none of that.
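To make the two stitching strategies described above concrete, here is a minimal sketch of what they amount to at the waveform level. It is an illustration under stated assumptions, not code from either project: the `stitch` function, its parameters, and the placeholder signals are hypothetical, and the -0.30 s overlap is an arbitrary example value (only the 0.64 s gap comes from the description above).

```python
import numpy as np


def stitch(first: np.ndarray, second: np.ndarray, gap_s: float, sr: int) -> np.ndarray:
    """Place `second` gap_s seconds after the end of `first` on one timeline.

    gap_s > 0 inserts silence between the two utterances.
    gap_s < 0 starts `second` before `first` ends, so the two are summed
    over the overlapping region ("negative-duration silence").
    """
    offset = len(first) + int(round(gap_s * sr))
    if offset < 0:
        raise ValueError("overlap cannot exceed the length of the first utterance")
    out = np.zeros(max(len(first), offset + len(second)), dtype=np.float64)
    out[: len(first)] += first
    out[offset : offset + len(second)] += second
    return out


sr = 16_000
interrupted = np.ones(sr * 2)  # placeholder for 2 s of the interrupted speaker
interrupter = np.ones(sr)      # placeholder for 1 s of the interrupting speaker

# Fixed 0.64 s silence, then the interrupting voice.
silence_gap = stitch(interrupted, interrupter, gap_s=0.64, sr=sr)

# Negative gap: the interrupting voice is laid over the tail of the first speaker.
overlapped = stitch(interrupted, interrupter, gap_s=-0.30, sr=sr)
```

In both cases the two utterances are generated independently and only positioned relative to each other afterwards, so neither speaker's audio reacts to the other.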

The Thinking Machines Interaction Model scores 77.8 on FD-bench against ~50 for the best open models. Researchers attribute this gap not to architectural superiority but to proprietary access to high-quality real conversational data.

The conclusion is the same everywhere: the architecture is largely solved; the data is the constraint.


What the Field Has Explicitly Asked For

Two recent papers are worth reading as calls for data:

Full-Duplex-Bench (Lin et al., 2025): The benchmark uses 727 samples drawn from Candor, ICC, and GPT-4o/ChatTTS synthetic audio. The authors explicitly note the scarcity of high-quality annotated full-duplex data as a limitation of the benchmark itself. A corpus with labeled backchannel timing, barge-in types, and turn-mechanics would allow the benchmark to be run at scale with ground truth rather than proxy metrics.

FLEXI (Ge et al., 2025): Introduces six interaction scenario types — including emergency and emotional support — and finds significant open-source/commercial gaps in every scenario. The authors identify per-scenario training data coverage as a key missing ingredient. A corpus annotated with scenario types and role dynamics directly addresses this.


Ocular Hi-Fi Full-Duplex Dataset

Ocular captures each speaker on a separate isolated channel at 48 kHz / 24-bit. Before recording begins, the session passes a quality gate: spectral energy verified above 2 kHz, noise floor below threshold, peak levels within headroom, SNR confirmed. Sessions that fail the gate don't record. Fisher had no equivalent gate — it accepted whatever the telephone network delivered.
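A pre-recording gate of this kind could look roughly like the sketch below. The four checks mirror the ones listed above, but everything else is an assumption made for illustration: the function names, the 20 ms framing, the percentile-based noise and signal estimates, and all four threshold values are hypothetical placeholders, not Ocular's actual gate.

```python
import numpy as np

# Hypothetical thresholds, chosen only so the example runs end to end.
MIN_HF_ENERGY_RATIO = 1e-4    # minimum fraction of energy above 2 kHz
MAX_NOISE_FLOOR_DBFS = -60.0  # noise floor must be at or below this
MAX_PEAK_DBFS = -3.0          # peaks must leave at least 3 dB of headroom
MIN_SNR_DB = 30.0


def dbfs(value: float) -> float:
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    return 20.0 * float(np.log10(max(value, 1e-12)))


def frame_rms(x: np.ndarray, frame_len: int) -> np.ndarray:
    """RMS of consecutive non-overlapping frames."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))


def quality_gate(x: np.ndarray, sr: int) -> dict:
    """Run the four checks on a mono float signal scaled to [-1, 1]."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    hf_ratio = power[freqs > 2000.0].sum() / max(power.sum(), 1e-24)

    rms = frame_rms(x, frame_len=sr // 50)              # 20 ms frames
    noise_floor = dbfs(float(np.percentile(rms, 10)))   # quietest frames
    signal_level = dbfs(float(np.percentile(rms, 90)))  # loudest frames
    peak = dbfs(float(np.max(np.abs(x))))

    checks = {
        "spectral_energy_above_2k": bool(hf_ratio >= MIN_HF_ENERGY_RATIO),
        "noise_floor": bool(noise_floor <= MAX_NOISE_FLOOR_DBFS),
        "peak_headroom": bool(peak <= MAX_PEAK_DBFS),
        "snr": bool(signal_level - noise_floor >= MIN_SNR_DB),
    }
    checks["passed"] = all(checks.values())
    return checks
```

Returning the individual check results rather than a single boolean makes it possible to tell a speaker why a session was refused, for example a gain setting that leaves no headroom.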

The sessions are unscripted. Speakers are not given prompts, do not read prepared text, and are not asked to demonstrate specific behaviors. The conversations are collected worldwide across diverse accents, ages, and topics. Natural barge-ins, real backchannels, and genuine hesitation occur because the conditions for them to occur are present.

Fisher captured what telephone infrastructure allowed: audio sampled at 8 kHz, which limits usable bandwidth to 4 kHz at most, through handset microphones, in whatever acoustic environment callers happened to be in at the time.

Many of the acoustic signals that separate human speech from TTS output extend above 4 kHz: the broadband noise of an audible breath before a response, the upper part of the 2–5 kHz presence region that carries vocal warmth, and the overlap of two voices at full fidelity. Fisher cannot contain that band. Pitch-based cues such as the prosodic fall of a completed clause do survive at 8 kHz, but only as heard through a handset and a telephone channel. Every model trained primarily on Fisher is learning from data that is acoustically blind to much of the layer it most needs.
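The effect of the 4 kHz ceiling can be demonstrated with a small synthetic example. The sketch below is only an idealization: `band_limit` is a hypothetical brick-wall filter applied in the frequency domain, the two sine tones stand in for speech content inside and outside the band, and real telephone channels are narrower still (roughly 300 to 3400 Hz).

```python
import numpy as np


def band_limit(x: np.ndarray, sr: int, cutoff_hz: float) -> np.ndarray:
    """Zero every frequency bin above cutoff_hz (idealized brick-wall low-pass)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(x))


def band_energy(x: np.ndarray, sr: int, lo_hz: float, hi_hz: float) -> float:
    """Sum of squared spectral magnitude for lo_hz <= f < hi_hz."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return float(power[(freqs >= lo_hz) & (freqs < hi_hz)].sum())


sr = 48_000
t = np.arange(sr) / sr
# Synthetic stand-in for speech: one component below 4 kHz, one above it.
wideband = np.sin(2 * np.pi * 1_000 * t) + 0.5 * np.sin(2 * np.pi * 6_000 * t)
telephone_like = band_limit(wideband, sr, cutoff_hz=4_000)

# Fraction of each band's energy that survives the band limit.
low_kept = band_energy(telephone_like, sr, 0, 4_000) / band_energy(wideband, sr, 0, 4_000)
high_kept = band_energy(telephone_like, sr, 4_000, 24_000) / band_energy(wideband, sr, 4_000, 24_000)
print(f"below 4 kHz kept: {low_kept:.3f}, above 4 kHz kept: {high_kept:.3e}")
```

The low band passes through essentially unchanged while the 6 kHz component is removed entirely, which is the sense in which no amount of post-processing can recover that content from an 8 kHz recording.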


Samples

Two unscripted conversations, captured per-speaker on isolated channels. Speaker A and Speaker B recorded simultaneously on separate Shure wired microphones (MV7 / MV6), time-synchronized, 48 kHz / 24-bit lossless.

Conversation 7 — Speaker A · Shure MV7 · Room 1

Spectrogram of Speaker A, Conversation 7 — 48 kHz capture showing full spectral content above 4 kHz
Speaker A isolated channel — 48 kHz / 24-bit / mono FLAC

Conversation 7 — Speaker B · Shure MV6 · Room 2

Spectrogram of Speaker B, Conversation 7 — 48 kHz capture showing full spectral content above 4 kHz
Speaker B isolated channel — 48 kHz / 24-bit / mono FLAC

Two speakers, separate rooms, separate microphones, time-synchronized. Each channel is cleanly isolated: no bleed, no merge, no re-encode.

Conversation 8 — Speaker A · Shure MV6 · Room 2

Spectrogram of Speaker A, Conversation 8 — 48 kHz capture
Speaker A isolated channel — 48 kHz / 24-bit / mono FLAC

Conversation 8 — Speaker B · Shure MV7 · Room 1

Spectrogram of Speaker B, Conversation 8 — 48 kHz capture
Speaker B isolated channel — 48 kHz / 24-bit / mono FLAC

Same pair, different room assignments. Cross-correlation between channels confirms independent capture: Pearson r = 0.001, no mic bleed.
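A channel-independence check of this kind can be sketched as follows. This is an assumed procedure, not a description of how the figure above was computed: `pearson_r` is the standard sample correlation, while `max_abs_r` and its lag range are hypothetical additions that also look for bleed arriving with a short delay.

```python
import numpy as np


def pearson_r(a: np.ndarray, b: np.ndarray) -> float:
    """Sample Pearson correlation between two equally long signals."""
    n = min(len(a), len(b))
    a = a[:n] - np.mean(a[:n])
    b = b[:n] - np.mean(b[:n])
    denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0


def max_abs_r(a: np.ndarray, b: np.ndarray, max_lag: int) -> float:
    """Largest |r| over integer lags in [-max_lag, max_lag] samples."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            r = pearson_r(a[: n - lag], b[lag:])   # b delayed relative to a
        else:
            r = pearson_r(a[-lag:], b[: n + lag])  # a delayed relative to b
        best = max(best, abs(r))
    return best
```

With the two FLAC channels loaded as float arrays, `pearson_r(channel_a, channel_b)` gives the zero-lag value and `max_abs_r(channel_a, channel_b, max_lag)` guards against leakage that the zero-lag figure alone would miss.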


What Annotation of This Data Enables

The samples above are raw capture — 48 kHz / 24-bit, channel-separated, unprocessed. What makes a corpus training-ready is the annotation layer on top of that capture. Because the audio exists at full fidelity with independent channels, it can support annotation depth that lower-quality or merged captures cannot.

The table below maps each annotation type to the specific research gap it closes and the benchmark metric it directly improves. This is the layer that the papers cited above identify as missing.

| Annotation type | Research gap it closes | Benchmark metric |
| --- | --- | --- |
| Backchannel timing: "mm-hmm", "yeah", "right" labeled with timestamp, speaker, and whether it occurred during the other speaker's turn | No public corpus annotates backchannels at broadband quality | JSD-D (Full-Duplex-Bench) |
| Barge-in type: aggressive interruption vs. collaborative overlap, with overlap duration and floor outcome | Every open model simulates barge-in; none trained on labeled real barge-in | TOR (Full-Duplex-Bench) |
| Silence classification: thinking pause vs. turn yield vs. floor hold | Models can't distinguish "wait" from "speak now"; causes false interruptions | Response latency, TOR-D |
| Turn-construction unit boundaries: where a syntactically complete turn ends, with prosodic signals | "When is it safe to speak?" is the hardest open problem in full-duplex | TRP detection accuracy |
| Filled pauses and disfluencies: "um", "uh", false-starts, verbatim | 16 filler tokens no TTS produces naturally; absent from every training corpus | Naturalness MOS |
| Paralinguistic events: breath, laughter, sigh, clear-throat | Voice cloning and expressive TTS both require labeled breath and laugh events | Naturalness MOS, voice cloning quality |
| Scenario type: QA, emotional support, procedural, casual, emergency | FLEXI finds open/commercial gaps in every scenario; coverage requires labeled data | CSC (FLEXI) |

Each annotation layer is a direct training signal for a metric the field already uses to evaluate models. The annotation protocol covering all seven layers — plus prosodic markers, EQ trajectory, and semantic role dynamics — is available on request.
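As an illustration of why channel-separated capture matters for these labels, the sketch below shows one way a barge-in record could be derived from per-speaker segment timestamps. It is not the annotation protocol mentioned above: the class names, fields, and the rule that the floor goes to whoever keeps speaking longest are all hypothetical simplifications.

```python
from dataclasses import dataclass
from enum import Enum


class BargeInType(Enum):
    AGGRESSIVE = "aggressive_interruption"
    COLLABORATIVE = "collaborative_overlap"


@dataclass(frozen=True)
class Segment:
    speaker: str    # e.g. "A" or "B"
    start_s: float  # segment start, seconds from session start
    end_s: float    # segment end, seconds from session start


@dataclass(frozen=True)
class BargeIn:
    interrupter: str
    kind: BargeInType
    overlap_s: float
    floor_winner: str  # speaker still talking after the overlap ends


def overlap_seconds(a: Segment, b: Segment) -> float:
    """Duration for which both segments are active at the same time."""
    return max(0.0, min(a.end_s, b.end_s) - max(a.start_s, b.start_s))


def label_barge_in(current: Segment, incoming: Segment, kind: BargeInType) -> BargeIn:
    """Build a barge-in record; `kind` is assumed to come from a human annotator."""
    overlap = overlap_seconds(current, incoming)
    if overlap <= 0.0:
        raise ValueError("segments do not overlap, so this is not a barge-in")
    # Simplifying assumption: whoever stops speaking last holds the floor.
    winner = incoming.speaker if incoming.end_s > current.end_s else current.speaker
    return BargeIn(incoming.speaker, kind, overlap, winner)


event = label_barge_in(
    Segment("A", 0.0, 3.0),
    Segment("B", 2.4, 5.0),
    BargeInType.AGGRESSIVE,
)
```

Overlap duration can only be measured this way when each speaker has an independent channel; on a merged recording the two segments' boundaries are not separately observable.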


How This Corpus Compares

| Corpus | Sample rate | Channels | Backchannels | Real barge-in | Paralinguistics | Scenario labels |
| --- | --- | --- | --- | --- | --- | --- |
| Ocular Hi-Fi | 48 kHz / 24-bit | Separated | Labeled | Real | Labeled | Labeled |
| Fisher English | 8 kHz | Separated | None | Real (unlabeled) | None | None |
| SyncLLM training | TTS | Merged | None | Simulated | None | Text-only |
| SALM-Duplex training | TTS + 8 kHz | Merged | None | Simulated (0.64 s) | None | None |
| PersonaPlex training | TTS + audiobook | Merged | None | Simulated (neg. duration) | None | Role prompts |

Every "None" or "Simulated" cell in this table is a gap the evaluation literature has named as a bottleneck.


What We Have Not Done Yet

We have not fine-tuned a model on this corpus.

The claim this post makes is a prediction: that training on this corpus rather than on Fisher or TTS-synthetic data will produce measurable improvements on Full-Duplex-Bench metrics — particularly TOR, because real barge-in labels are the annotation type most absent from every prior training set. That experiment is the next step and will be published as a follow-up.

What we can demonstrate now: the corpus exists, the capture quality is verified, and both directly address the gaps the evaluation literature identifies as primary training data constraints. The samples above are the warrant for that claim.


Request Access

We are sharing sample packages with research teams working on full-duplex speech systems, evaluation benchmarks, and TTS training. A sample package includes channel-separated FLAC files, spectrograms, conformance reports, and the full annotation protocol specification.

engineering@useocular.com

The fine-tuning results follow in a subsequent post.

Author

Louis Murerwa


Co-founder & CTO
