Hi-Fi, Studio-Grade Datasets

Premium Quality Datasets

The Hi-Fi tier is a 48 kHz / 24-bit capture spec we apply across our entire dataset catalogue — from full-duplex conversation to monologue ASR and narrated TTS — wherever audio quality is non-negotiable. Available as an upgrade on any dataset on the marketplace.

Overview

Hi-Fi, Studio-Grade is the premium capture tier offered across the entire Ocular marketplace: every dataset and every format, captured at 48 kHz / 24-bit on studio-grade microphones with a calibrated noise floor. Use it for full-duplex conversation, monologue ASR, narrated TTS, or multilingual reads; the spec stays the same and the audio ships lossless end-to-end (FLAC by default). Each session arrives with its acoustic environment, microphone configuration, and speaker / expert metadata, so downstream models train on captures they can trust at production resolution.

Key highlights

  • 01: 48 kHz / 24-bit per channel from studio-grade microphones: every session preserved at production fidelity, never re-encoded down to telephony rates.
  • 02: Three data modalities (audio-only, video-only, and synced audiovisual), all captured at the same Hi-Fi spec across the catalogue.
  • 03: Six dataset types under one tier: full-duplex conversation, monologue ASR, narrated TTS, multilingual reads, video-only captures, and synced audiovisual sessions.
  • 04: 14 shipping languages (American English, French, Mandarin, Spanish, and 10 more), with bespoke languages and dialects scoped on request.
  • 05: Video and audiovisual sessions captured at 4K UHD (3840×2160) at 30 / 60 fps, shipped as a color-graded master plus a delivery proxy with lens, sensor, and lighting documented per recording.
  • 06: Audio delivered as lossless FLAC by default. WAV, MP4 (H.264), and ProRes MOV masters available on request so the file format never forces a fidelity compromise.
  • 07: Per-speaker channel isolation and continuous full-duplex capture available on conversational sessions, so overlap, backchannels, and turn-taking arrive intact.
  • 08: Calibrated noise floor verified per session with a pre-recording pass/fail gate: sessions that miss the threshold simply don't ship.
  • 09: Acoustic environment, microphone configuration, and pre-/post-processing chain documented per session so models train on captures they can trust.
  • 10: Speaker / expert metadata shipped with every recording: age, gender, region, dialect, native language. Transcripts, diarization, and custom annotation layers available on request.

Technical specifications

Coverage

Offered across our entire dataset catalogue — full-duplex conversation, monologue ASR, narrated TTS, and multilingual reads — spanning our 14 shipping languages (American English, French, Mandarin, Spanish, and 10 more). Coverage extends to bespoke dialects, age groups, capture modalities, and topical targets on request.

Capture specs

48 kHz / 24-bit per channel from studio-grade microphones, with a calibrated noise floor and continuous capture for the full duration of each session rather than cherry-picked moments. Conversational sessions add per-speaker channel isolation and stereo full-duplex recording; single-speaker sessions ship at the same fidelity in the channel layout the dataset calls for. Mic configuration and acoustic environment are documented per recording.
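
To make the noise-floor gate concrete, here is a minimal Python sketch of how a pass/fail check on a room-tone recording could work. It is an illustration only: the function names, the RMS-in-dBFS metric, and the -60 dBFS limit are assumptions for the example, not Ocular's published calibration procedure or threshold.

```python
import math

# Peak magnitude of signed 24-bit PCM, used as the 0 dBFS reference.
FULL_SCALE_24BIT = 2 ** 23

# Hypothetical gate limit for illustration; the real threshold is not stated here.
NOISE_FLOOR_LIMIT_DBFS = -60.0


def rms_dbfs(samples, full_scale=FULL_SCALE_24BIT):
    """Return the RMS level of integer PCM samples in dB relative to full scale."""
    if not samples:
        raise ValueError("need at least one sample")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square / (full_scale ** 2))


def passes_noise_gate(room_tone, limit_dbfs=NOISE_FLOOR_LIMIT_DBFS):
    """True when the room-tone RMS level sits at or below the limit."""
    return rms_dbfs(room_tone) <= limit_dbfs
```

A gate of this kind would run on a short silent take before the session starts, so a noisy room is rejected before any speech is recorded.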

Annotations

Speaker / expert metadata shipped with every session: age, gender, region, dialect, native language, and acoustic environment. Word-level transcripts, diarization, and custom annotation layers available on request.

Use cases

  • Production TTS modelling where mic colour and noise floor matter
  • High-fidelity ASR training across accents, languages, and recording environments
  • Full-duplex conversational AI training and evaluation
  • Speaker diarization, voice cloning, and expressive-speech pipelines
  • Voice agent benchmarks for natural, multi-party conversation

Technical Comparisons

Open-source speech corpora vs. Ocular Hi-Fi

Public corpora like LibriTTS, VCTK, Common Voice, and LJSpeech are the de-facto baselines for speech research, but most are solo, read-prompt recordings distributed as 16 to 24 kHz mono audio with light speaker metadata. The Hi-Fi tier is the inverse, regardless of modality: 48 kHz / 24-bit captures on studio microphones, per-speaker isolation where the format calls for it, rich speaker metadata, and clean commercial licensing across 14 languages.

Spec sheet

How the Ocular Hi-Fi tier stacks up against the de-facto open-source speech corpora, attribute by attribute.

| Category | Attribute | Ocular Hi-Fi (Commercial) | LibriTTS (Open source) | VCTK (Open source) | Common Voice (Open source) | LJSpeech (Open source) |
|---|---|---|---|---|---|---|
| Audio format | Sample rate (Hz of analog capture preserved end-to-end) | 48 kHz | 24 kHz | 48 kHz (downsampled to 16 kHz in most distros) | 32–48 kHz source → 16 kHz MP3 | 22.05 kHz |
| Audio format | Bit depth (per-sample dynamic range) | 24-bit | 16-bit | 16-bit | MP3 (lossy) | 16-bit |
| Audio format | Channels (per-speaker channel isolation when the format calls for it) | Stereo / per-speaker L / R on conversational sessions | Mono | Mono | Mono | Mono |
| Capture | Capture modalities (recording formats supported at the Hi-Fi tier) | Full-duplex conversation, monologue ASR, narrated TTS, multilingual reads | Solo read audiobooks | Solo read prompts | Solo read prompts (crowdsourced) | Solo read (non-fiction) |
| Capture | Overlap & turn-taking (backchannels, interruptions, dual-talk on conversational sessions) | Preserved verbatim (full-duplex sessions) | None | None | None | None |
| Coverage | Speaker metadata (demographic + acoustic context per recording) | Age, gender, region, dialect, native language, environment | Gender + speaker ID | Age, gender, accent | Self-reported age / gender / accent (optional) | Single speaker |
| Coverage | Languages (catalogue breadth) | 14 (and growing) | English only | English only | 100+ (long-tail thin) | English only |
| Licensing | Licensing (commercial usability) | Commercial: clean consent at scene level | CC BY 4.0 (research-friendly, audiobook attribution required) | ODC-BY 1.0 | CC0 | Public domain |
| Licensing | Provenance (source + consent trail) | Paid contributors, signed consent, scene-level audit trail | LibriVox volunteers reading public-domain books | Newspaper sentences read by paid actors | Anonymous web volunteers | Single LibriVox volunteer (Linda Johnson) |

Ocular Hi-Fi values are the per-row reference. Open-source values cited from the LibriTTS paper, the VCTK README, Common Voice corpus stats, and the LJSpeech project page.
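
To put the bit-depth row in numbers, the sketch below uses the standard textbook estimate for an ideal quantizer driven by a full-scale sine, roughly 6.02 dB per bit plus 1.76 dB. The function name is illustrative, and real converters and microphones will deliver less than this theoretical ceiling.

```python
import math


def ideal_dynamic_range_db(bits):
    """Theoretical SNR of an ideal N-bit quantizer for a full-scale sine wave.

    Equivalent to the familiar 6.02 * N + 1.76 dB rule of thumb.
    """
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)


for bits in (16, 24):
    print(f"{bits}-bit: {ideal_dynamic_range_db(bits):.2f} dB")
```

Under this idealised model, moving from 16-bit to 24-bit adds about 48 dB of headroom between the loudest representable signal and the quantization noise.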


Spectrogram comparison

See the bandwidth gap

Pick an open-source dataset on the right. The same Ocular Hi-Fi reference sits on the left so the spectral content is easy to compare by eye: open-source speech corpora cut off at the Nyquist limit of their distributed sample rate (8, 11.025, or 12 kHz for 16, 22.05, and 24 kHz audio), while every Hi-Fi capture, regardless of dataset modality, carries detail all the way up to 24 kHz.

[Figure: side-by-side spectrograms, Ocular Hi-Fi reference (left) and LibriTTS (right).]

LibriTTS is distributed at 24 kHz / 16-bit, so its spectrogram caps at the 12 kHz Nyquist limit — every band of detail above that line is missing before training even starts. Ocular Hi-Fi carries spectral energy all the way to 24 kHz from the original 48 kHz capture.

What you're looking at

Each spectrogram plots time on the x-axis and frequency on the y-axis, with brighter pixels marking more energy at that frequency. The hard horizontal line where every open-source sample goes flat is the Nyquist limit — half the sample rate, and the ceiling on detail the dataset can physically carry. Above that line the file contains no information at all; no amount of training can recover energy that was never captured.
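
The Nyquist ceiling described above can be checked with a few lines of Python. This is a minimal sketch with illustrative function names: it computes the limit for a given sample rate and shows that a 10 kHz tone sampled at 16 kHz produces exactly the same samples as a 6 kHz tone, which is why content above the limit cannot be stored in the file.

```python
import math


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def sample_cosine(freq_hz, sample_rate_hz, n_samples):
    """Sample a unit-amplitude cosine of freq_hz at sample_rate_hz."""
    return [
        math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


# At 16 kHz the limit is 8 kHz, so a 10 kHz tone folds back onto 6 kHz.
at_16k_high = sample_cosine(10_000, 16_000, 64)
at_16k_alias = sample_cosine(6_000, 16_000, 64)
indistinguishable = all(
    abs(a - b) < 1e-9 for a, b in zip(at_16k_high, at_16k_alias)
)

# At 48 kHz both tones sit below the 24 kHz limit and stay distinct.
at_48k_high = sample_cosine(10_000, 48_000, 64)
at_48k_low = sample_cosine(6_000, 48_000, 64)
distinct = any(abs(a - b) > 0.1 for a, b in zip(at_48k_high, at_48k_low))
```

In practice an anti-aliasing filter removes the out-of-band content before sampling, which produces the flat region above the limit in the spectrogram.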

That gap is not cosmetic. For text-to-speech, much of the mic colour, sibilance, breath noise, and high-frequency air that make a synthetic voice sound embodied sits above 8 kHz, so a dataset sampled at 16 kHz cannot teach a model to reproduce it and caps output at wideband-telephony quality by construction. For full-duplex voice agents, the same upper band adds texture to paralinguistic cues such as laughter, breathy hesitation, and smile voice that listeners use to decide whether they're talking to a person.

Ocular Hi-Fi sessions are captured at 48 kHz / 24-bit per channel and shipped without re-encoding, so the full spectrum the microphone heard reaches your training pipeline intact — no Nyquist cliff, no MP3 ringing, no downstream resampling artefacts. Switching open-source datasets on the selector above swaps the right panel so you can see exactly what each baseline is leaving on the table.

Tier information

Every property a buyer asks about before booking a Hi-Fi sample, in one datasheet.

Name: Hi-Fi, Studio-Grade
Tier: Premium
Data modalities:
  • Audio
  • Video
  • Audiovisual
Dataset types:
  • Full-duplex conversational sessions (two-speaker, isolated channels)
  • Monologue ASR (single-speaker read or spontaneous speech)
  • Narrated TTS (single-speaker performance reads for voice cloning)
  • Multilingual read prompts (phonetically balanced sentence sets)
  • Video-only captures (4K UHD, no audio track)
  • Audiovisual sessions (synced 4K video + Hi-Fi audio)
File formats:
  • FLAC: lossless audio (default)
  • WAV: uncompressed audio (on request)
  • MP4 (H.264): video, up to 4K UHD
  • MOV (ProRes): video master, on request
Sample rate: 48 kHz
Bit depth: 24-bit
Channel layout:
  • Per-speaker stereo on conversational sessions
  • Mono or stereo on single-speaker sessions
  • Synced multi-channel on audiovisual sessions
Video specs: Up to 4K UHD (3840×2160) at 30 / 60 fps on video and audiovisual sessions; color-graded master plus delivery proxy
Microphones: Studio-grade wired, documented per session
Cameras: 4K-capable mirrorless / cinema cameras with fixed framing per session; lens, sensor, and lighting documented per recording
Noise floor: Calibrated per session with a pre-recording pass/fail gate; sessions that miss the threshold don't ship
Languages: 14 shipping (American English, French, Mandarin, Spanish, +10) · bespoke languages and dialects on request
Speaker metadata: Age, gender, region, dialect, native language, acoustic environment
Annotations:
  • Word-level transcripts
  • Diarization and speaker turns
  • Prosodic markers and disfluency tags
  • Scenario / role labels
  • Custom annotation layers on request

Properties listed here apply to every dataset shipped at the Hi-Fi tier. Per-dataset deviations (additional languages, dialects, or annotation layers) are documented on the relevant marketplace listing.


Treat the datasheet as the floor, not the ceiling. Every property is a contractual commitment held to across all 14 shipping languages, every recording modality, and every dataset in the marketplace catalogue — that uniformity is the whole point. A pipeline built against Hi-Fi captures inherits the same sample rate, bit depth, channel discipline, and noise-floor calibration delivery after delivery, so training runs don't need per-dataset accommodations for spec drift between releases. If your use case needs something the table doesn't list — a non-standard mic configuration, an extra annotation layer, a bespoke language or dialect — request samples below and we'll scope it against the same guarantees.
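
A pipeline that relies on this uniformity can check it on ingest. The sketch below is one possible approach, assuming the delivery is WAV (the on-request format) and using Python's standard `wave` module; the function and constant names are illustrative, and the default FLAC deliverables would need a third-party reader instead.

```python
import wave

HIFI_SAMPLE_RATE_HZ = 48_000   # sample rate from the datasheet
HIFI_SAMPLE_WIDTH_BYTES = 3    # 24-bit PCM is 3 bytes per sample


def check_wav_against_spec(source, expected_channels=None):
    """Return a list of spec violations for a WAV file (empty list means OK).

    `source` is a file path or a binary file-like object. The channel count
    is only checked when `expected_channels` is given, because the channel
    layout depends on the dataset type.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()

    problems = []
    if rate != HIFI_SAMPLE_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {HIFI_SAMPLE_RATE_HZ} Hz")
    if width != HIFI_SAMPLE_WIDTH_BYTES:
        problems.append(f"bit depth is {width * 8}-bit, expected 24-bit")
    if expected_channels is not None and channels != expected_channels:
        problems.append(f"found {channels} channel(s), expected {expected_channels}")
    return problems
```

For a two-speaker conversational session, calling `check_wav_against_spec(path, expected_channels=2)` and failing the ingest job on a non-empty result would catch spec drift before training starts.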

Request samples

Share your use case and we'll send sample clips, pricing, and recommended next steps for your pipeline.
