Mandarin Full-Duplex Conversational Dataset

Overview

Naturalistic, two-speaker Mandarin Chinese conversations captured at studio quality in full-duplex stereo. Pairs of native Mandarin Chinese speakers from mainland China, Taiwan, and the Chinese diaspora discuss everyday topics for the full duration of each session, with no read scripts and no cuts. Each recording preserves real overlapping speech, backchannels, hesitations, and code-switching, so downstream models train on the way Mandarin Chinese actually sounds in the wild. Every clip is collected from paid contributors with explicit consent, session-level provenance, and metadata for speaker demographics, dialect, and acoustic environment.

Key highlights

01
Mainland Putonghua and Taiwanese Mandarin pairings with per-utterance tone variation tagged — critical for Mandarin TTS training.
02
Chinglish code-switching, English loanwords, and modern Chinese internet slang ("yyds", "emo", "绝绝子") preserved verbatim.
03
Family-style group conversational rhythm with overlapping turn-taking, rapid topic shifts, and culturally specific honorifics.
04
Regional accent coverage spanning Beijing, Shanghai, Guangdong (Mandarin-speaking), Taipei, and the overseas Chinese diaspora.
05
Disfluencies and non-speech sounds (filled pauses such as uh, um, and hmm, false starts, self-repairs, hesitations, laughter, sighs, breaths, and throat clears) are preserved with utterance-level timestamps rather than normalised away. They are treated as a first-class signal, so models can either learn from them or filter them out.

Technical specifications

Coverage

Hundreds of paired sessions from native Mandarin Chinese speakers across mainland China and Taiwan. Coverage can be extended to specific dialects, age groups, and topics on request.

Capture specs

Stereo full-duplex audio at 48 kHz / 24-bit per channel from studio-grade microphones, with per-speaker channel isolation, a calibrated noise floor, and continuous capture for the full duration of each session rather than cherry-picked moments.
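
As a rough illustration of what per-speaker channel isolation makes possible, the sketch below splits a stereo PCM WAV file into two mono byte streams using only the Python standard library. The function name `split_stereo` is hypothetical, and the sketch assumes uncompressed, interleaved PCM in a WAV container with one speaker per channel; the actual delivery format and the channel-to-speaker mapping should be confirmed against the sample clips.

```python
import wave


def split_stereo(wav_file):
    """Split a 2-channel PCM WAV into two mono byte strings.

    `wav_file` may be a path or a binary file-like object.
    Assumption: channel 0 holds one speaker and channel 1 the other.
    Returns (channel_0_bytes, channel_1_bytes, sample_rate, sample_width).
    """
    with wave.open(wav_file, "rb") as wav:
        if wav.getnchannels() != 2:
            raise ValueError("expected a stereo (2-channel) file")
        width = wav.getsampwidth()   # 3 bytes per sample for 24-bit PCM
        rate = wav.getframerate()    # 48000 under the stated capture spec
        frames = wav.readframes(wav.getnframes())

    # Frames are interleaved: [ch0 sample][ch1 sample][ch0 sample]...
    stride = 2 * width
    ch0 = b"".join(frames[i:i + width] for i in range(0, len(frames), stride))
    ch1 = b"".join(frames[i + width:i + stride] for i in range(0, len(frames), stride))
    return ch0, ch1, rate, width
```

For storage planning under the same assumption, 48,000 samples per second at 3 bytes per sample is 144,000 bytes per second per channel, or 288,000 bytes per second for the stereo pair, which works out to roughly 1.04 GB per hour of uncompressed audio.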

Annotations

Every session ships with speaker / contributor metadata (age, gender, region, dialect, native language, acoustic environment) plus an utterance-level annotation layer:

Emotion tags: joy, frustration, neutral, surprise, sadness, anger, amusement, empathy, and more
Topic tags spanning everyday domains: work, family, sports, travel, health, finance, technology, food, pop culture, politics, education
Intent labels: question, agreement, backchannel, hedge, interruption, repair, opinion
Turn-taking markers: overlap onset/offset, gap, hold, yield
Prosody cues: pitch contour, stress, laughter, sighs, hesitation, code-switch boundaries

Custom annotation schemas (domain-specific intents, fine-grained emotion taxonomies, named-entity spans, sentiment scoring, or any task-specific labels) are available on request.
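
To show how utterance-level timestamps could feed turn-taking research, here is a minimal sketch that finds cross-speaker overlap regions. The `Utterance` record, its field names, and the `find_overlaps` helper are all hypothetical; the dataset's real annotation schema is not described here and may already provide overlap onset/offset markers directly.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str   # hypothetical speaker/channel id, e.g. "A" or "B"
    start: float   # seconds from session start
    end: float     # seconds from session start
    intent: str    # e.g. "question", "backchannel", "interruption"


def find_overlaps(utterances):
    """Return (onset, offset, earlier_speaker, later_speaker) for every
    region where two different speakers talk at the same time."""
    ordered = sorted(utterances, key=lambda u: u.start)
    found = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start, so nothing later can overlap `a`
            if a.speaker != b.speaker:
                found.append((b.start, min(a.end, b.end), a.speaker, b.speaker))
    return found


session = [
    Utterance("A", 0.0, 2.5, "opinion"),
    Utterance("B", 2.0, 2.4, "backchannel"),
    Utterance("B", 3.0, 4.0, "question"),
    Utterance("A", 3.8, 5.0, "interruption"),
]
overlap_regions = find_overlaps(session)
# → [(2.0, 2.4, "A", "B"), (3.8, 4.0, "B", "A")]
```

The same records could be filtered by `intent` to separate backchannels from genuine interruptions before measuring overlap duration.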

Use cases

Full-duplex conversational AI training and evaluation
Speaker diarization and Mandarin Chinese ASR / TTS modelling
Turn-taking, backchannel, and overlap-handling research
Emotion-aware and intent-aware voice agent fine-tuning
Voice agent benchmarks for natural, multi-party conversation

Request samples

Share your use case and we'll send sample clips, pricing, and recommended next steps for your pipeline.
