May 26, 2026 · Product, Release

Beyond Fisher: High Fidelity, Full-Duplex Multilingual Conversational AI Datasets

Every major full-duplex voice AI model still traces its training data back to Fisher English — two-party telephone calls recorded at 8 kHz in 2004. Ocular Full-Duplex Hi-Fi is the high-fidelity, multilingual conversational corpus the field has been asking for.

Louis Murerwa

Every major full-duplex voice AI model traces its training data back to the same source: the Fisher English corpus^[1], a catalog of two-party telephone calls recorded at 8 kHz in 2004.

Meta's dGSLM,^[2] the foundational full-duplex model, was trained on 2,000 hours of Fisher. SyncLLM^[3] used 1,927 hours of Fisher as its real-data anchor; the remaining ~99% of its 215,000-hour training set was synthesized by TTS. SALM-Duplex^[4] and PersonaPlex^[5] (the two openly available end-to-end duplex speech-to-speech models) are both grounded in the same telephony-grade audio. PersonaPlex draws speaker voice samples from Fisher and adjacent audiobook corpora, then synthesizes the conversations themselves.

The recent survey From Turn-Taking to Synchronous Dialogue^[6] names data scarcity as the primary bottleneck to progress, more limiting than architecture. The Full-Duplex-Bench evaluation^[7] finds that no open-source model achieves both natural backchanneling and appropriate interruption behavior simultaneously, and attributes the gap to training data, not model design.

The field has asked for better data. The Ocular Full-Duplex Hi-Fi corpus is our answer.

How Labs Have Patched the Gap

Rather than collect better data, labs have patched the Fisher gap with increasingly elaborate simulations.

Synthetic TTS data does not contain real prosody, real disfluency, or real overlaps. SyncLLM is illustrative: the model learns the structural pattern of conversation from synthesized dialogue, then anchors to telephony-grade audio for acoustic grounding. The structure is fake; only the acoustics (and only at 8 kHz) are real.

SALM-Duplex and PersonaPlex go further. Each layers its TTS dialogue on top of a different real-audio backbone: SALM-Duplex mixes VoxPopuli's^[12] European Parliament recordings (16 kHz) with 8 kHz Fisher, while PersonaPlex draws speaker voices from Libriheavy^[13] audiobook readings and synthesizes the conversational structure on top of them.

For barge-in (the moment a speaker interrupts mid-utterance), SALM-Duplex inserts a programmatic 0.64-second silence followed by the interrupting voice. PersonaPlex uses "negative-duration silence" to stitch the interrupting voice on top of the interrupted speaker's audio.

These are engineering approximations of a real human behavior. A barge-in involves overlapping prosody, pitch alignment between both speakers, and a floor negotiation (one speaker yields) that unfolds over hundreds of milliseconds with acoustic cues at every step. The simulation captures none of that.
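To make the two stitching strategies concrete, here is a minimal sketch of what a fixed silence gap and a "negative-duration" gap amount to at the sample level. It illustrates the general idea only and is not code from either paper; the `stitch` helper, the list-of-floats audio representation, and the 0.25-second overlap value are assumptions made for this example.

```python
def stitch(first, second, gap_s, sample_rate):
    """Join two mono clips (lists of float samples) with a signed gap.

    gap_s > 0 inserts that much silence before the second clip.
    gap_s < 0 starts the second clip early and sums the overlapping samples.
    """
    gap = round(gap_s * sample_rate)
    start = max(0, len(first) + gap)               # index where `second` begins
    tail = max(0, start + len(second) - len(first))
    out = list(first) + [0.0] * tail               # extend with silence as needed
    for i, sample in enumerate(second):
        out[start + i] += sample                   # mix the second clip in place
    return out


sr = 8000
interrupted = [1.0] * sr     # 1 s placeholder for the first speaker
interrupter = [0.5] * sr     # 1 s placeholder for the interrupting speaker

silence_style = stitch(interrupted, interrupter, 0.64, sr)    # fixed 0.64 s gap
overlap_style = stitch(interrupted, interrupter, -0.25, sr)   # 0.25 s overlap (assumed value)
```

Either way the result is an arithmetic splice of two independently produced clips: neither speaker's audio changes in response to the other, which is the part a real interruption supplies.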

That this is a data problem and not an architecture problem is visible directly in the benchmark numbers. The Thinking Machines Interaction Model^[11] scores 77.8 on Full-Duplex-Bench against ~50 for the best openly available models, and the researchers attribute the gap not to architectural superiority but to proprietary access to high-quality real conversational data.

What the Field Has Explicitly Asked For

Three recent papers read as direct calls for better full-duplex training data:

From Turn-Taking to Synchronous Dialogue:^[6] A survey of full-duplex spoken language models that classifies architectures and unifies evaluation across temporal dynamics, behavioral arbitration, semantic coherence, and acoustic performance. Its headline finding names three open problems: "synchronous data scarcity, architectural divergence, and evaluation gaps." Data scarcity comes first, and solving it unblocks the other two.

Full-Duplex-Bench:^[7] Evaluates full-duplex models across pause handling, backchanneling, turn-taking, and user interruption using 727 samples drawn from CANDOR,^[9] ICC,^[10] and GPT-4o / ChatTTS synthetic audio. The authors explicitly note the scarcity of high-quality annotated full-duplex data as a limitation of the benchmark itself: a corpus with labeled backchannel timing, barge-in types, and turn-mechanics would let the benchmark be run at scale with ground truth rather than proxy metrics.

FLEXI:^[8] Introduces six interaction scenario types (including emergency and emotional support) and finds significant open-source/commercial gaps in every scenario. The authors identify per-scenario training data coverage as a key missing ingredient. A corpus annotated with scenario types and role dynamics directly addresses this.

The Ocular Full-Duplex Hi-Fi Corpus

Ocular Full-Duplex Hi-Fi is a studio-grade conversation corpus built deliberately against the constraints Fisher imposed. Every parameter (sample rate, channel topology, capture device, acoustic environment, speech mode) is set against what Fisher allowed.

Ocular Full-Duplex Hi-Fi at a glance

The corpus, parameter-by-parameter, against the data the field has been training on for two decades.

Parameter	Ocular Full-Duplex Hi-Fi	Fisher English
Sample rate	48 kHz	8 kHz
Bit depth	24-bit	8-bit µ-law
Effective bandwidth	DC – 24 kHz	DC – 4 kHz (PSTN-capped)
Channels per session	One isolated channel per speaker	One per speaker (telephony mix)
Channel correlation	Pearson r ≈ 0.001 (no bleed)	Crosstalk inherent to PSTN
Capture device	Studio-grade wired mics	PSTN handset microphone
Capture environment	Acoustically gated rooms, pass/fail SNR pre-check	Whatever environment the caller was in
Speech mode	Unscripted, naturalistic	Topic-prompted, telephone
Per-file format	Mono FLAC (lossless)	µ-law `.sph`
Coverage	Worldwide, with diverse accents, ages, and topics	US English, telephony-grade

Ocular Full-Duplex Hi-Fi values reflect the v1.0 corpus shipped at 48 kHz / 24-bit. Fisher English values cited from the LDC Fisher English Training Speech Part 1 catalog entry (LDC2004S13).

Per-session quality gate. Before recording begins, each session must pass a quality check: spectral energy verified above 2 kHz, noise floor below threshold, peak levels within headroom, SNR confirmed. Sessions that fail the gate don't record. Fisher had no equivalent gate; it accepted whatever the telephone network delivered.
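The gate logic can be pictured as a simple pass/fail check over four measurements. The sketch below is an illustration under stated assumptions, not the production gate: the `Precheck` field names are hypothetical, and every threshold is a placeholder value chosen for the example rather than a published corpus parameter.

```python
from dataclasses import dataclass


@dataclass
class Precheck:
    """Measurements taken before recording starts (field names are hypothetical)."""
    energy_above_2khz_db: float   # band energy relative to full-band energy
    noise_floor_dbfs: float
    peak_dbfs: float
    snr_db: float


# Placeholder thresholds chosen for illustration; not published corpus values.
MIN_ENERGY_ABOVE_2KHZ_DB = -30.0
MAX_NOISE_FLOOR_DBFS = -60.0
MAX_PEAK_DBFS = -3.0
MIN_SNR_DB = 40.0


def gate_failures(m: Precheck) -> list[str]:
    """Return the failed checks; an empty list means the session may record."""
    failures = []
    if m.energy_above_2khz_db < MIN_ENERGY_ABOVE_2KHZ_DB:
        failures.append("spectral energy above 2 kHz")
    if m.noise_floor_dbfs > MAX_NOISE_FLOOR_DBFS:
        failures.append("noise floor")
    if m.peak_dbfs > MAX_PEAK_DBFS:
        failures.append("peak headroom")
    if m.snr_db < MIN_SNR_DB:
        failures.append("SNR")
    return failures
```

Returning the list of failed checks, rather than a bare boolean, keeps the reason for a rejected session available for a per-session report.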

Unscripted conversation. Speakers are not given prompts, do not read from scripts, and are not asked to demonstrate specific behaviors. Conversations are collected worldwide across diverse accents, ages, and topics. Natural barge-ins, real backchannels, and genuine hesitation occur because the conditions for them to occur are present.

Why fidelity matters above 4 kHz. Fisher captured what telephone infrastructure allowed: 8 kHz audio, capped at 4 kHz by the network, through handset microphones, in whatever acoustic environment callers happened to be in at the time. The acoustic signals that separate human speech from TTS output (the audible breath before a response, the prosodic fall of a completed clause, the 2–5 kHz presence region that carries vocal clarity, the overlap of two voices at full fidelity) all extend above 4 kHz. Fisher cannot contain them in full. Every model trained primarily on Fisher is learning from data that is acoustically blind to the layer it most needs.
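One way to quantify what a 4 kHz ceiling removes is to measure the share of a signal's energy at or above that frequency. The sketch below is an assumption-laden illustration: `energy_fraction_above` is a hypothetical helper, the test signal is two synthetic tones rather than speech, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def energy_fraction_above(samples, sample_rate, cutoff_hz):
    """Fraction of non-DC spectral energy at or above cutoff_hz.

    Uses a naive O(n^2) DFT so the sketch has no dependencies; a real
    pipeline would use an FFT library instead.
    """
    n = len(samples)
    total = above = 0.0
    for k in range(1, n // 2 + 1):                     # positive-frequency bins
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n >= cutoff_hz:
            above += power
    return above / total if total else 0.0


def tone(freq_hz, sample_rate, n):
    """Synthetic sine tone used as a stand-in for real audio."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


sr, n = 48_000, 480                                    # 10 ms, 100 Hz per bin
low, high = tone(1_000, sr, n), tone(6_000, sr, n)
mix = [a + b for a, b in zip(low, high)]

# Half of this synthetic signal's energy sits above 4 kHz.
print(round(energy_fraction_above(mix, sr, 4_000), 3))
```

At an 8 kHz sample rate the highest representable frequency is 4 kHz, so the 6 kHz component in this example has no place to be stored; at 48 kHz it is kept intact.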

You can hear the difference in the samples below: the breaths, the 2–5 kHz presence band, and the unmixed overlap, all preserved.

Samples

Two unscripted conversations, captured per-speaker on isolated channels. Speaker A and Speaker B recorded simultaneously on separate studio-grade wired microphones, time-synchronized, 48 kHz / 24-bit lossless.

Conversation 7

Channel isolation, side by side

Same conversation, two simultaneous wired microphones in different rooms. The pair below is the raw per-speaker capture, what the model would actually train on.

Speaker A · Studio-grade mic · Room 1

Speaker B · Studio-grade mic · Room 2

Each speaker is captured on an isolated channel at 48 kHz / 24-bit / mono FLAC. The two spectrograms are the raw per-microphone recordings: no mixing, no re-encoding.

Two speakers, separate rooms, separate microphones, time-synchronized. Each channel is clean isolation: no bleed, no merge, no re-encode.

Conversation 8

Swapped rooms, same isolation

The same two speakers swap microphone and room assignments. Cross-correlation between channels comes back at Pearson r = 0.001, so what you're seeing is two independent captures, not one mic bleeding into the other.

Speaker A · Studio-grade mic · Room 2

Speaker B · Studio-grade mic · Room 1

Same pair of speakers, microphone and room assignments swapped, still captured on isolated channels at 48 kHz / 24-bit / mono FLAC.
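For readers who want to reproduce a channel-correlation check on their own recordings, a zero-lag Pearson r between two channels can be computed as in the sketch below. This is a generic illustration, not the corpus's conformance tooling: the `pearson_r` helper and the synthetic sine-wave "speakers" are assumptions, and the 0.5 bleed factor is made up to show what crosstalk does to the number.

```python
import math


def pearson_r(x, y):
    """Zero-lag Pearson correlation between two equal-length sample sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    sxx = sum(v * v for v in dx)
    syy = sum(v * v for v in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant channel")
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


sr, n = 8_000, 1_600
speaker_a = [math.sin(2 * math.pi * 100 * i / sr) for i in range(n)]
speaker_b = [math.sin(2 * math.pi * 150 * i / sr) for i in range(n)]
bleed_b = [b + 0.5 * a for a, b in zip(speaker_a, speaker_b)]   # simulated crosstalk

r_isolated = pearson_r(speaker_a, speaker_b)   # close to 0
r_bleed = pearson_r(speaker_a, bleed_b)        # clearly above 0
```

A zero-lag coefficient only catches bleed that arrives time-aligned; checking a range of lags would be a natural extension for acoustic leakage with a propagation delay.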

What Annotation of This Data Enables

The samples above are raw capture: 48 kHz / 24-bit, channel-separated, unprocessed. What makes a corpus training-ready is the annotation layer on top of that capture. Because the audio exists at full fidelity with independent channels, it can support annotation depth that lower-quality or merged captures cannot: turn-construction unit boundaries and transition-relevance places (TRPs),^[14] backchannel timing, and barge-in type all require prosodic detail that 8 kHz telephony erases.
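Independent channels also make overlap straightforward to locate: voice activity can be detected on each channel separately and the two masks intersected. The sketch below assumes per-frame boolean activity masks from any voice-activity detector; `overlap_spans` and the 20 ms frame size are illustrative choices, not part of the annotation protocol.

```python
def overlap_spans(active_a, active_b, frame_ms):
    """Return (start_ms, end_ms) spans where both speakers are active.

    active_a / active_b are per-frame booleans from any voice-activity
    detector run separately on each isolated channel.
    """
    spans, start = [], None
    n = min(len(active_a), len(active_b))
    for i in range(n):
        both = active_a[i] and active_b[i]
        if both and start is None:
            start = i                                   # overlap begins
        elif not both and start is not None:
            spans.append((start * frame_ms, i * frame_ms))
            start = None                                # overlap ends
    if start is not None:                               # overlap runs to the end
        spans.append((start * frame_ms, n * frame_ms))
    return spans


a = [True, True, True, True, False, False, True, True]
b = [False, False, True, True, True, False, False, True]
print(overlap_spans(a, b, frame_ms=20))   # [(40, 80), (140, 160)]
```

On a merged single-channel recording the same spans would have to be inferred by source separation or diarization, which is where labeling errors tend to enter.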

The table below maps each annotation type to the specific research gap it closes and the benchmark metric it directly improves. This is the layer that every paper in this space has identified as missing.

Annotation Layers

Each annotation type, the research gap it closes, and the benchmark metric it directly improves.

Annotation type	Research gap it closes	Benchmark metric
Backchannel timing ("mm-hmm", "yeah", "right" labeled with timestamp, speaker, and whether it occurred during the other speaker's turn)	No public corpus annotates backchannels at broadband quality	JSD-D (Full-Duplex-Bench^[7])
Barge-in type (aggressive interruption vs. collaborative overlap, with overlap duration and floor outcome)	Every open model simulates barge-in; none trained on labeled real barge-in	TOR (Full-Duplex-Bench^[7])
Silence classification (thinking pause vs. turn yield vs. floor hold)	Models can't distinguish "wait" from "speak now"; causes false interruptions	Response latency, TOR-D (Full-Duplex-Bench^[7])
Turn-construction unit boundaries (where a syntactically complete turn ends, with prosodic signals)	"When is it safe to speak?", the hardest open problem in full-duplex	TRP detection accuracy (Sacks et al.^[14])
Filled pauses and disfluencies ("um", "uh", false-starts, repetitions, verbatim)	Standard TTS doesn't produce filler tokens naturally; they're effectively absent from every training corpus	Naturalness MOS
Paralinguistic events (breath, laughter, sigh, clear-throat)	Voice cloning and expressive TTS both require labeled breath and laugh events	Naturalness MOS, voice cloning quality
Scenario type (QA, emotional support, procedural, casual, emergency)	FLEXI finds open/commercial gaps in every scenario; coverage requires labeled data	CSC (FLEXI^[8])

Research gaps cited from the SyncLLM, SALM-Duplex, and PersonaPlex papers. Benchmark metric names from the Full-Duplex-Bench (FD-Bench) evaluation suite: JSD-D (Jensen-Shannon divergence of pause distributions), TOR (Take-Over Rate), and BPC (Backchannel Prediction Consistency).
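For reference, a Jensen-Shannon divergence between two histograms can be computed as in the sketch below. It shows the generic formula only; the benchmark's own binning and preprocessing are not reproduced here, and the pause-duration counts are made-up numbers for illustration.

```python
import math


def jensen_shannon_divergence(p, q):
    """JSD in bits between two histograms over the same bins (0 = identical, 1 = disjoint)."""
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must contain counts")
    p = [v / total_p for v in p]
    q = [v / total_q for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


human_pauses = [5, 20, 40, 25, 10]     # made-up counts per pause-duration bin
model_pauses = [30, 40, 20, 8, 2]      # made-up counts for a hypothetical model
score = jensen_shannon_divergence(human_pauses, model_pauses)
```

With base-2 logarithms the score is bounded between 0 and 1, so lower values mean the model's distribution sits closer to the human one.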

Each annotation layer is a direct training signal for a metric the field already uses to evaluate models. The annotation protocol covering all seven layers (plus prosodic markers, EQ trajectory, and semantic role dynamics) is available on request.
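To give a sense of what such labels could look like in code, the sketch below models three of the layers from the table as Python dataclasses. Every class and field name here is hypothetical and chosen for illustration; the actual schema is defined by the annotation protocol specification, which this example does not reproduce.

```python
from dataclasses import dataclass, field
from enum import Enum


class BargeInType(Enum):
    AGGRESSIVE_INTERRUPTION = "aggressive_interruption"
    COLLABORATIVE_OVERLAP = "collaborative_overlap"


class SilenceType(Enum):
    THINKING_PAUSE = "thinking_pause"
    TURN_YIELD = "turn_yield"
    FLOOR_HOLD = "floor_hold"


@dataclass
class Backchannel:
    token: str                 # e.g. "mm-hmm"
    start_s: float
    speaker: str
    during_other_turn: bool


@dataclass
class BargeIn:
    kind: BargeInType
    start_s: float
    overlap_s: float           # overlap duration
    floor_winner: str          # floor outcome: who holds the turn afterwards


@dataclass
class Silence:
    kind: SilenceType
    start_s: float
    end_s: float


@dataclass
class SessionAnnotations:
    scenario: str              # e.g. "casual", "emergency"
    backchannels: list[Backchannel] = field(default_factory=list)
    barge_ins: list[BargeIn] = field(default_factory=list)
    silences: list[Silence] = field(default_factory=list)


session = SessionAnnotations(scenario="casual")
session.backchannels.append(Backchannel("mm-hmm", 12.48, "B", during_other_turn=True))
session.barge_ins.append(
    BargeIn(BargeInType.COLLABORATIVE_OVERLAP, 31.02, overlap_s=0.42, floor_winner="A")
)
```

Using enums for the categorical layers keeps label values closed and easy to validate when annotations are loaded for training or evaluation.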

How This Corpus Compares

The table below lines up Ocular Full-Duplex Hi-Fi against Fisher and the three open simulated training sets across the axes every recent full-duplex paper names as a training bottleneck.

Corpus Comparison

How Ocular Full-Duplex Hi-Fi stacks up against Fisher English and the simulated training sets that followed it.

Corpus	Audio (sample rate or source mix)	Channels	Backchannels	Real barge-in	Paralinguistics	Scenario labels
Ocular Full-Duplex Hi-Fi	48 kHz / 24-bit	Separated	Labeled	Real	Labeled	Labeled
Fisher English	8 kHz / 8-bit µ-law	Separated	None	Real (unlabeled)	None	None
SyncLLM training	~2k h Fisher + ~213k h TTS	Merged	None	Simulated	None	Text-only
SALM-Duplex training	TTS + 8 kHz Fisher	Merged	None	Simulated (0.64 s silence)	None	None
PersonaPlex training	TTS + audiobook (Fisher voices)	Merged	None	Simulated (neg. duration)	None	Role prompts

Ocular Full-Duplex Hi-Fi values shown in bold are the per-row reference. Comparator values cited from the SyncLLM, SALM-Duplex, and PersonaPlex papers and the Fisher English LDC catalog entry.

Every "None" or "Simulated" cell in this table is a gap the evaluation literature has named as a bottleneck. The interactive comparison below picks each of those corpora apart on the audio itself.

Audio Comparison

The same corpora, on the audio itself.

[Interactive audio comparison: the Ocular Full-Duplex Hi-Fi reference (Speaker A and Speaker B, each labeled 24 kHz) alongside Fisher English (real LDC corpus recordings, 8 kHz telephone, Sample A and Sample B).]

Attribute Comparison

Ocular Full-Duplex Hi-Fi against Fisher English, axis by axis.

One row per parameter every recent full-duplex paper names as a training bottleneck. The Ocular reference column stays bolded so the row-by-row deltas are easy to read at a glance.

Parameter	Ocular Full-Duplex Hi-Fi (reference)	Fisher English
Sample rate	48 kHz / 24-bit	8 kHz telephony
Channels	Per-speaker isolated	Per-speaker
Backchannels labeled	Labeled	None
Real barge-in	Real	Real (unlabeled)
Paralinguistics labeled	Labeled	None
Scenario labels	Labeled	None

Ocular Full-Duplex Hi-Fi values shown in bold are the per-row reference. Comparator values for Fisher English cited from the corresponding corpus paper or catalog entry referenced in the article body.

The 8 kHz spectral ceiling is visible in the Fisher spectrograms: energy drops abruptly at 4 kHz, the Nyquist limit of 8 kHz sampling, where the PSTN band limit cuts it off. The breath, the 2–5 kHz presence region, and the unmixed overlap that Ocular Full-Duplex Hi-Fi captures all extend above that line, into a band Fisher cannot reach.

Why This Matters

Five points from the argument above, in the order they bear on a fine-tuning decision.

Key takeaways

01. The bottleneck is data, not architecture. Every recent benchmark and survey lands on the same conclusion, and the Thinking Machines gap on Full-Duplex-Bench corroborates it from the proprietary side.
02. Fisher is acoustically blind above 4 kHz. The breath, the prosodic fall, the 2–5 kHz presence region, and the overlap cues that separate human speech from TTS all extend into a band 8 kHz telephony cannot reach.
03. Simulated barge-in is not barge-in. 0.64-second silences and "negative-duration" splicing recreate the surface pattern, not the prosody, pitch alignment, or floor negotiation that constitute a real interruption.
04. Annotations are the multiplier. Channel-separated 48 kHz audio is the prerequisite; labeled backchannels, barge-in types, silence classification, and scenario tags are what make the corpus a direct training signal for the metrics in Full-Duplex-Bench and FLEXI.
05. Isolation is measured, not asserted. Per-session conformance reports (SNR, peak headroom, spectral coverage, channel cross-correlation) ship with the corpus and put the capture quality on the same footing as the audio itself.

Future Work

We are running LoRA fine-tunes of SALM-Duplex and comparable open architectures on the Ocular corpus and measuring the delta on Full-Duplex-Bench. We expect the signal to be strongest on TOR (Take-Over Rate) and backchannel timing, the two axes where the gap between simulated and real training data is most direct, and where real barge-in labels are the annotation type most absent from every prior training set.

Until those numbers ship as a follow-up post, the warrant for the claim is the corpus itself: verified capture, channel-separated audio, and the annotation protocol above. The samples make the rest of the argument.

Request Access

We're sharing sample packages with research teams working on full-duplex speech systems, evaluation benchmarks, and TTS training. A sample package includes:

Channel-separated FLAC files: one mono track per speaker, 48 kHz / 24-bit lossless.
Spectrograms: full-band, per-channel, for every sample.
Conformance reports: quality-gate output (SNR, peak headroom, spectral coverage, channel correlation) for every session.
Annotation protocol specification: full schema for the seven annotation layers above plus prosodic markers, EQ trajectory, and semantic role dynamics.

The corpus ships across American English, French, Mandarin Chinese, Spanish, Vietnamese, Bahasa Indonesia, Japanese, Thai, German, Arabic, Hindi, Korean, Polish, and Russian, with regional dialect coverage and per-speaker dialect tags inside each language. Every language is captured against the same studio-grade pipeline (paired native speakers, isolated channels, 48 kHz / 24-bit, the full annotation layer above), so cross-lingual transfer experiments train on like-for-like data. Browse the full multilingual catalogue and request samples on the marketplace.

Request a sample package, or email founders@useocular.com directly. Mention the model or benchmark you're targeting and we'll size the package to fit. To be notified when the fine-tuning results post ships, include "follow-up" in your subject line.

For more on the corpus, see the Hi-Fi dataset page.

Author

Louis Murerwa

Co-founder & CTO
