Hi-Fi, Studio-Grade Datasets

Premium Quality Datasets

The Hi-Fi tier is a 48 kHz / 24-bit capture spec we apply across our entire dataset catalogue — from full-duplex conversation to monologue ASR and narrated TTS — wherever audio quality is non-negotiable. Available as an upgrade on any dataset on the marketplace.

Overview

Hi-Fi, Studio-Grade is the premium capture tier we offer across the entire Ocular marketplace: every dataset, in every format, is captured at 48 kHz / 24-bit on studio-grade microphones with a calibrated noise floor. Use it for full-duplex conversation, monologue ASR, narrated TTS, or multilingual reads; the spec stays the same, and the audio ships lossless end-to-end (FLAC by default, uncompressed WAV on request). Sessions arrive with acoustic-environment, microphone-configuration, and speaker / expert metadata, so downstream models train on captures they can trust at production resolution.

Key highlights

01
48 kHz / 24-bit per channel from studio-grade microphones — every session preserved at production fidelity, never re-encoded down to telephony rates.
02
Three data modalities — audio-only, video-only, and synced audiovisual — all captured at the same Hi-Fi spec across the catalogue.
03
Six dataset types under one tier: full-duplex conversation, monologue ASR, narrated TTS, multilingual reads, video-only captures, and synced audiovisual sessions.
04
14 shipping languages — American English, French, Mandarin, Spanish, and 10 more — with bespoke languages and dialects scoped on request.
05
Video and audiovisual sessions captured at 4K UHD (3840×2160) at 30 / 60 fps, shipped as a color-graded master plus a delivery proxy with lens, sensor, and lighting documented per recording.
06
Audio delivered as lossless FLAC by default. WAV, MP4 (H.264), and ProRes MOV masters available on request so the file format never forces a fidelity compromise.
07
Per-speaker channel isolation and continuous full-duplex capture available on conversational sessions — overlap, backchannels, and turn-taking arrive intact.
08
Calibrated noise floor verified per session with a pre-recording pass/fail gate — sessions that miss the threshold simply don't ship.
09
Acoustic environment, microphone configuration, and pre-/post-processing chain documented per session so models train on captures they can trust.
10
Speaker / expert metadata shipped with every recording: age, gender, region, dialect, native language. Transcripts, diarization, and custom annotation layers available on request.
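
The pre-recording gate in highlight 08 can be pictured as a level check on a short room-tone clip. The sketch below is only an illustration, not Ocular's actual procedure: the page does not publish the gate's metric or limit, so the RMS-in-dBFS measure, the threshold value, and the function names are all assumptions.

```python
import math

# Hypothetical threshold: the page does not state the real limit.
ASSUMED_MAX_NOISE_DBFS = -60.0


def rms_dbfs(samples):
    """RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    if not samples:
        raise ValueError("need at least one sample")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)


def passes_noise_gate(room_tone, max_dbfs=ASSUMED_MAX_NOISE_DBFS):
    """True when the room-tone clip sits at or below the assumed threshold."""
    return rms_dbfs(room_tone) <= max_dbfs
```

A clip that fails this kind of check would be re-recorded rather than shipped, which matches the pass/fail behaviour the highlight describes.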

Technical specifications

Coverage

Offered across our entire dataset catalogue — full-duplex conversation, monologue ASR, narrated TTS, and multilingual reads — spanning our 14 shipping languages (American English, French, Mandarin, Spanish, and 10 more). Coverage extends to bespoke dialects, age groups, capture modalities, and topical targets on request.

Capture specs

48 kHz / 24-bit per channel from studio-grade microphones, with a calibrated noise floor and continuous capture for the full duration of each session, not cherry-picked moments. Conversational sessions add per-speaker channel isolation and stereo full-duplex recording; single-speaker sessions ship at the same fidelity in the channel layout the dataset calls for. Mic configuration and acoustic environment are documented per recording.
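
To make per-speaker channel isolation concrete, here is a minimal sketch of separating a two-speaker session into one sample list per channel. It assumes a WAV delivery (available on request; the default FLAC needs a third-party decoder), 24-bit little-endian PCM, and that each speaker occupies one stereo channel. The function names are hypothetical, and which speaker sits on which channel would have to come from the session documentation.

```python
import wave


def decode_pcm24_stereo(frames: bytes):
    """Split interleaved 24-bit little-endian stereo PCM into two integer lists."""
    frame_size = 6  # 2 channels x 3 bytes per sample
    if len(frames) % frame_size:
        raise ValueError("byte length is not a whole number of stereo frames")
    left, right = [], []
    for offset in range(0, len(frames), frame_size):
        left.append(int.from_bytes(frames[offset:offset + 3], "little", signed=True))
        right.append(int.from_bytes(frames[offset + 3:offset + 6], "little", signed=True))
    return left, right


def split_speakers(path):
    """Read a stereo 24-bit WAV file and return (left_channel, right_channel)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 3:
            raise ValueError("expected a stereo, 24-bit WAV file")
        return decode_pcm24_stereo(wav.readframes(wav.getnframes()))
```

Because the channels are isolated at capture time, overlapping speech stays separable with a plain deinterleave like this, with no source-separation model involved.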

Annotations

Speaker / expert metadata shipped with every session: age, gender, region, dialect, native language, and acoustic environment. Word-level transcripts, diarization, and custom annotation layers available on request.

Use cases

Production TTS modelling where mic colour and noise floor matter
High-fidelity ASR training across accents, languages, and recording environments
Full-duplex conversational AI training and evaluation
Speaker diarization, voice cloning, and expressive-speech pipelines
Voice agent benchmarks for natural, multi-party conversation

Technical Comparisons

Open-source speech corpora vs. Ocular Hi-Fi

Public corpora like LibriTTS, VCTK, Common Voice, and LJSpeech are the de facto baselines for English speech research, but most are distributed as solo, read-prompt mono audio at 16 to 24 kHz with light speaker metadata. The Hi-Fi tier is the inverse, regardless of modality: 48 kHz / 24-bit captures on studio microphones, per-speaker isolation where the format calls for it, rich speaker metadata, and clean commercial licensing across 14 languages.

Spec sheet

How the Ocular Hi-Fi tier stacks up against the de-facto open-source speech corpora, attribute by attribute.

Category	Attribute	Ocular Hi-Fi (commercial)	LibriTTS (open source)	VCTK (open source)	Common Voice (open source)	LJSpeech (open source)
Audio format	Sample rate (Hz of analog capture preserved end-to-end)	48 kHz	24 kHz	48 kHz (downsampled to 16 kHz in most distros)	32–48 kHz source → 16 kHz MP3	22.05 kHz
Audio format	Bit depth (per-sample dynamic range)	24-bit	16-bit	16-bit	MP3 (lossy)	16-bit
Audio format	Channels (per-speaker channel isolation when the format calls for it)	Stereo / per-speaker L / R on conversational sessions	Mono	Mono	Mono	Mono
Capture	Capture modalities (recording formats supported at the Hi-Fi tier)	Full-duplex conversation, monologue ASR, narrated TTS, multilingual reads	Solo read audiobooks	Solo read prompts	Solo read prompts (crowdsourced)	Solo read (non-fiction)
Capture	Overlap & turn-taking (backchannels, interruptions, dual-talk on conversational sessions)	Preserved verbatim (full-duplex sessions)	None	None	None	None
Coverage	Speaker metadata (demographic + acoustic context per recording)	Age, gender, region, dialect, native language, environment	Gender + speaker ID	Age, gender, accent	Self-reported age / gender / accent (optional)	Single speaker
Coverage	Languages (catalogue breadth)	14 (and growing)	English only	English only	100+ (long-tail thin)	English only
Licensing	Licensing (commercial usability)	Commercial, with clean consent at scene level	CC BY 4.0 (research-friendly, audiobook attribution required)	ODC-BY 1.0	CC0	Public domain
Licensing	Provenance (source + consent trail)	Paid contributors, signed consent, scene-level audit trail	LibriVox volunteers reading public-domain books	Newspaper sentences read by paid actors	Anonymous web volunteers	Single LibriVox volunteer (Linda Johnson)

The Ocular Hi-Fi column is the per-row reference. Open-source values cited from the LibriTTS paper, the VCTK README, Common Voice corpus stats, and the LJSpeech project page.
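
The bit-depth row can be turned into a number with the standard rule of thumb for linear PCM: each bit adds roughly 6.02 dB between full scale and one quantization step. The sketch below computes that theoretical figure only; the function name is illustrative, and the range actually achieved in a recording is limited by the microphone and analog chain, which this page does not quantify.

```python
import math


def pcm_dynamic_range_db(bit_depth: int) -> float:
    """Theoretical ratio between full scale and one quantization step, in dB."""
    if bit_depth <= 0:
        raise ValueError("bit depth must be positive")
    return 20.0 * math.log10(2 ** bit_depth)


for bits in (16, 24):
    print(f"{bits}-bit: {pcm_dynamic_range_db(bits):.1f} dB")
```

On this measure, 16-bit works out to about 96.3 dB and 24-bit to about 144.5 dB.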

Spectrogram comparison

See the bandwidth gap

Pick an open-source dataset on the right. The same Ocular Hi-Fi reference sits on the left so the spectral content is easy to compare by eye. Open-source speech corpora cut off at the Nyquist limit of their sample rate (8, 11.025, or 12 kHz for the 16, 22.05, and 24 kHz distributions in the spec sheet), while every Hi-Fi capture, regardless of dataset modality, carries detail all the way up to 24 kHz.

LibriTTS spectrogram: LibriTTS is distributed at 24 kHz / 16-bit, so its spectrogram caps at the 12 kHz Nyquist limit, and every band of detail above that line is missing before training even starts.

Ocular Hi-Fi spectrogram: spectral energy extends all the way to 24 kHz from the original 48 kHz capture.

What you're looking at

Each spectrogram plots time on the x-axis and frequency on the y-axis, with brighter pixels marking more energy at that frequency. The hard horizontal line where every open-source sample goes flat is the Nyquist limit — half the sample rate, and the ceiling on detail the dataset can physically carry. Above that line the file contains no information at all; no amount of training can recover energy that was never captured.
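
The Nyquist ceiling is simple arithmetic: half the sample rate. The sketch below applies it to the sample rates quoted in the spec sheet on this page and reports how much bandwidth each one lacks relative to a 48 kHz capture. The dictionary keys and function names are illustrative.

```python
# Sample rates as quoted in the spec sheet on this page.
DISTRIBUTED_SAMPLE_RATES_HZ = {
    "Ocular Hi-Fi": 48_000,
    "LibriTTS": 24_000,
    "LJSpeech": 22_050,
    "16 kHz distributions": 16_000,
}


def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a file at this sample rate can represent."""
    return sample_rate_hz / 2


def lost_band_hz(sample_rate_hz: float, reference_rate_hz: float = 48_000) -> float:
    """Bandwidth missing relative to a capture at the reference sample rate."""
    return max(0.0, nyquist_hz(reference_rate_hz) - nyquist_hz(sample_rate_hz))


for name, rate in DISTRIBUTED_SAMPLE_RATES_HZ.items():
    print(f"{name}: ceiling {nyquist_hz(rate):.0f} Hz, missing {lost_band_hz(rate):.0f} Hz")
```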

That gap is not cosmetic. For text-to-speech, much of the sibilance, breath noise, and high-frequency air that make a synthetic voice sound embodied sits above 8 kHz, so a model trained on a 16 kHz dataset cannot reproduce it and produces band-limited speech by construction. For full-duplex voice agents, the same upper band carries the breathiness and fricative detail that accompany laughter, hesitation, and smile voice, cues that listeners use to decide whether they're talking to a person.

Ocular Hi-Fi sessions are captured at 48 kHz / 24-bit per channel and shipped without lossy re-encoding, so everything the microphone captured up to 24 kHz reaches your training pipeline intact: no Nyquist cliff inside the audible band, no MP3 ringing, no downstream resampling artefacts. Switching open-source datasets on the selector above swaps the right panel so you can see exactly what each baseline is leaving on the table.

Tier information

Every property a buyer asks about before booking a Hi-Fi sample, in one datasheet.

Name	Hi-Fi, Studio-Grade
Tier	Premium
Data modalities	Audio; Video; Audiovisual
Dataset types	Full-duplex conversational sessions (two-speaker, isolated channels); Monologue ASR (single-speaker read or spontaneous speech); Narrated TTS (single-speaker performance reads for voice cloning); Multilingual read prompts (phonetically balanced sentence sets); Video-only captures (4K UHD, no audio track); Audiovisual sessions (synced 4K video + Hi-Fi audio)
File formats	FLAC (lossless audio, default); WAV (uncompressed audio, on request); MP4 / H.264 (video, up to 4K UHD); MOV / ProRes (video master, on request)
Sample rate	48 kHz
Bit depth	24-bit
Channel layout	Per-speaker stereo on conversational sessions; mono or stereo on single-speaker sessions; synced multi-channel on audiovisual sessions
Video specs	Up to 4K UHD (3840×2160) at 30 / 60 fps on video and audiovisual sessions — color-graded master plus delivery proxy
Microphones	Studio-grade wired — documented per session
Cameras	4K-capable mirrorless / cinema cameras with fixed framing per session; lens, sensor, and lighting documented per recording
Noise floor	Calibrated per session with a pre-recording pass/fail gate — sessions that miss the threshold don't ship
Languages	14 shipping (American English, French, Mandarin, Spanish, +10) · bespoke languages and dialects on request
Speaker metadata	Age, gender, region, dialect, native language, acoustic environment
Annotations	Word-level transcripts; diarization and speaker turns; prosodic markers and disfluency tags; scenario / role labels; custom annotation layers on request

Properties listed here apply to every dataset shipped at the Hi-Fi tier. Per-dataset deviations (additional languages, dialects, or annotation layers) are documented on the relevant marketplace listing.
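
Because the sample rate and bit depth are fixed across the tier, an ingestion pipeline can check them once per file. The following is a minimal sketch under stated assumptions: it covers WAV deliveries only (the standard library cannot decode FLAC), it checks just the two datasheet rows that are single fixed values, and the function names are hypothetical rather than part of any Ocular tooling.

```python
import wave

# Values copied from the datasheet rows "Sample rate" and "Bit depth".
HI_FI_SAMPLE_RATE_HZ = 48_000
HI_FI_BIT_DEPTH = 24


def datasheet_mismatches(sample_rate_hz: int, bit_depth: int):
    """List the ways a file's format differs from the Hi-Fi datasheet."""
    problems = []
    if sample_rate_hz != HI_FI_SAMPLE_RATE_HZ:
        problems.append(
            f"sample rate is {sample_rate_hz} Hz, expected {HI_FI_SAMPLE_RATE_HZ} Hz"
        )
    if bit_depth != HI_FI_BIT_DEPTH:
        problems.append(f"bit depth is {bit_depth}, expected {HI_FI_BIT_DEPTH}")
    return problems


def check_wav(path):
    """Read a WAV header and compare it to the datasheet values."""
    with wave.open(path, "rb") as wav:
        return datasheet_mismatches(wav.getframerate(), wav.getsampwidth() * 8)
```

An empty list means the header matches; anything else is worth raising before the file enters a training run.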

Treat the datasheet as the floor, not the ceiling. Every property is a contractual commitment held to across all 14 shipping languages, every recording modality, and every dataset in the marketplace catalogue — that uniformity is the whole point. A pipeline built against Hi-Fi captures inherits the same sample rate, bit depth, channel discipline, and noise-floor calibration delivery after delivery, so training runs don't need per-dataset accommodations for spec drift between releases. If your use case needs something the table doesn't list — a non-standard mic configuration, an extra annotation layer, a bespoke language or dialect — request samples below and we'll scope it against the same guarantees.

Share your use case and we'll send sample clips, pricing, and recommended next steps for your pipeline.