Hi-Fi, Studio-Grade Datasets
Premium Quality Datasets
The Hi-Fi tier is a 48 kHz / 24-bit capture spec we apply across our entire dataset catalogue, from full-duplex conversation to monologue ASR and narrated TTS, wherever audio quality is non-negotiable. It is available as an upgrade on any dataset on the marketplace.
Overview
Hi-Fi, Studio-Grade is the premium capture tier we offer across the entire Ocular marketplace — every dataset, every format, captured at 48 kHz / 24-bit on studio-grade microphones with a calibrated noise floor. Use it for full-duplex conversation, monologue ASR, narrated TTS, or multilingual reads; the spec stays the same and the audio ships uncompressed end-to-end. Sessions arrive with acoustic environment, microphone configuration, and speaker / expert metadata so downstream models train on captures they can trust at production resolution.
Key highlights
Technical specifications
Coverage
Offered across our entire dataset catalogue — full-duplex conversation, monologue ASR, narrated TTS, and multilingual reads — spanning our 14 shipping languages (American English, French, Mandarin, Spanish, and 10 more). Coverage extends to bespoke dialects, age groups, capture modalities, and topical targets on request.
Capture specs
48 kHz / 24-bit per channel from studio-grade microphones, with a calibrated noise floor and continuous capture for the full duration of each session, not cherry-picked moments. Conversational sessions add per-speaker channel isolation and stereo full-duplex recording; single-speaker sessions ship at the same fidelity in the channel layout the dataset calls for. Mic configuration and acoustic environment are documented per recording.
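A buyer can confirm the headline numbers on arrival by reading each file's header. The sketch below is a minimal check that assumes deliveries are PCM WAV files (the container is not stated on this page); `check_hifi_header` and its constants are illustrative names, not part of any Ocular tooling.

```python
import wave

# Target values from the Hi-Fi spec; the WAV container itself is an assumption.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 3  # 24-bit PCM is 3 bytes per sample


def check_hifi_header(path, expected_channels=None):
    """Return a list of human-readable spec violations (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"bit depth is {8 * wav.getsampwidth()}-bit")
        if expected_channels is not None and wav.getnchannels() != expected_channels:
            problems.append(f"channel count is {wav.getnchannels()}")
    return problems
```

Pass `expected_channels=2` for conversational sessions and leave it unset where the channel layout varies by dataset. The header only tells you how the file is stored; it cannot prove the audio was captured at that resolution rather than upsampled.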
Annotations
Speaker / expert metadata shipped with every session: age, gender, region, dialect, native language, and acoustic environment. Word-level transcripts, diarization, and custom annotation layers available on request.
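For pipelines that key on this metadata, it can help to check each session record for the six documented fields before training. The sketch below is illustrative only: the field names, types, and the dictionary-shaped record are assumptions, since the delivery schema is not specified on this page.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SpeakerMetadata:
    """The six documented fields; names and types are hypothetical."""
    age: int  # could equally be delivered as an age band
    gender: str
    region: str
    dialect: str
    native_language: str
    acoustic_environment: str


def missing_fields(record: dict) -> list[str]:
    """Names of documented metadata fields that are absent or empty."""
    return [
        f.name
        for f in fields(SpeakerMetadata)
        if record.get(f.name) in (None, "")
    ]
```

A record that returns an empty list can be unpacked with `SpeakerMetadata(**record)`, provided it carries no extra keys.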
Use cases
- Production TTS modelling where mic colour and noise floor matter
- High-fidelity ASR training across accents, languages, and recording environments
- Full-duplex conversational AI training and evaluation
- Speaker diarization, voice cloning, and expressive-speech pipelines
- Voice agent benchmarks for natural, multi-party conversation
Technical Comparisons
Open-source speech corpora vs. Ocular Hi-Fi
Public corpora like LibriTTS, VCTK, Common Voice, and LJSpeech are the de-facto baselines for English speech research, but most are distributed as solo, read-prompt mono audio at 16 to 24 kHz with light speaker metadata. The Hi-Fi tier is the inverse, regardless of modality: 48 kHz / 24-bit captures on studio microphones, per-speaker isolation where the format calls for it, rich speaker metadata, and clean commercial licensing across 14 languages.
Spec sheet
How the Ocular Hi-Fi tier stacks up against the de-facto open-source speech corpora, attribute by attribute.
| Category | Attribute | Ocular Hi-Fi (commercial) | LibriTTS (open source) | VCTK (open source) | Common Voice (open source) | LJSpeech (open source) |
|---|---|---|---|---|---|---|
| Audio format | Sample rate (Hz of analog capture preserved end-to-end) | **48 kHz** | 24 kHz | 48 kHz (downsampled to 16 kHz in most distros) | 32–48 kHz source → 16 kHz MP3 | 22.05 kHz |
| Audio format | Bit depth (per-sample dynamic range) | **24-bit** | 16-bit | 16-bit | MP3 (lossy) | 16-bit |
| Audio format | Channels (per-speaker channel isolation when the format calls for it) | **Stereo / per-speaker L / R on conversational sessions** | Mono | Mono | Mono | Mono |
| Capture | Capture modalities (recording formats supported at the Hi-Fi tier) | **Full-duplex conversation, monologue ASR, narrated TTS, multilingual reads** | Solo read audiobooks | Solo read prompts | Solo read prompts (crowdsourced) | Solo read (non-fiction) |
| Capture | Overlap & turn-taking (backchannels, interruptions, dual-talk on conversational sessions) | **Preserved verbatim (full-duplex sessions)** | None | None | None | None |
| Coverage | Speaker metadata (demographic + acoustic context per recording) | **Age, gender, region, dialect, native language, environment** | Gender + speaker ID | Age, gender, accent | Self-reported age / gender / accent (optional) | Single speaker |
| Coverage | Languages (catalogue breadth) | **14 (and growing)** | English only | English only | 100+ (long-tail thin) | English only |
| Licensing | Licensing (commercial usability) | **Commercial, with clean consent at scene level** | CC BY 4.0 (research-friendly, audiobook attribution required) | ODC-BY 1.0 | CC0 | Public domain |
| Licensing | Provenance (source + consent trail) | **Paid contributors, signed consent, scene-level audit trail** | LibriVox volunteers reading public-domain books | Newspaper sentences read by paid actors | Anonymous web volunteers | Single LibriVox volunteer (Linda Johnson) |
Ocular Hi-Fi values shown in bold are the per-row reference. Open-source values cited from the LibriTTS paper, the VCTK README, Common Voice corpus stats, and the LJSpeech project page.
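To put the bit-depth row in numbers, the sketch below applies the textbook rule for an ideal quantiser driven by a full-scale sine wave, roughly 6.02 dB per bit plus 1.76 dB. These are theoretical ceilings; real converters, microphones, and rooms deliver less, and the function name is illustrative.

```python
import math


def ideal_dynamic_range_db(bits: int) -> float:
    """Theoretical SNR of an ideal N-bit quantiser for a full-scale sine."""
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)


for bits in (16, 24):
    print(f"{bits}-bit: {ideal_dynamic_range_db(bits):.1f} dB")
# 16-bit: 98.1 dB
# 24-bit: 146.3 dB
```

The difference is about 48 dB, or 8 extra bits at roughly 6 dB each.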
Spectrogram comparison
See the bandwidth gap
Pick an open-source dataset on the right. The same Ocular Hi-Fi reference sits on the left so the spectral content is easy to eyeball: open-source speech corpora cut off at the Nyquist limit of their distributed sample rate (8 kHz for 16 kHz files, about 11 kHz for 22.05 kHz, 12 kHz for 24 kHz), while every Hi-Fi capture, regardless of dataset modality, carries detail all the way up to 24 kHz.
[Spectrogram panels: LibriTTS (selected open-source dataset) and Ocular Hi-Fi (reference)]
What you're looking at
Each spectrogram plots time on the x-axis and frequency on the y-axis, with brighter pixels marking more energy at that frequency. The hard horizontal line where every open-source sample goes flat is the Nyquist limit — half the sample rate, and the ceiling on detail the dataset can physically carry. Above that line the file contains no information at all; no amount of training can recover energy that was never captured.
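The ceiling is simple arithmetic: half the sample rate. A minimal sketch, using the sample rates quoted in the spec sheet on this page (the dictionary and function names are illustrative):

```python
# Distributed sample rates as quoted in the spec sheet above.
SAMPLE_RATES_HZ = {
    "Ocular Hi-Fi": 48_000,
    "LibriTTS": 24_000,
    "LJSpeech": 22_050,
    "Common Voice (16 kHz MP3)": 16_000,
}


def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a file at this sample rate can represent."""
    return sample_rate_hz / 2


for name, rate in SAMPLE_RATES_HZ.items():
    print(f"{name}: content up to {nyquist_hz(rate) / 1000:g} kHz")
```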
That gap is not cosmetic. For text-to-speech, much of the mic colour, sibilance, breath noise, and high-frequency air that make a synthetic voice sound embodied lives above 8 kHz, so a model trained on a 16 kHz dataset can reproduce none of it and is band-limited by construction. For full-duplex voice agents, the same upper band carries the breath and fricative detail in laughter, hesitation, and smile voice, cues that listeners use to decide whether they're talking to a person.
Ocular Hi-Fi sessions are captured at 48 kHz / 24-bit per channel and shipped without re-encoding, so the full spectrum the microphone heard reaches your training pipeline intact — no Nyquist cliff, no MP3 ringing, no downstream resampling artefacts. Switching open-source datasets on the selector above swaps the right panel so you can see exactly what each baseline is leaving on the table.
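Shipping without re-encoding means uncompressed PCM, so it is worth budgeting storage before a delivery lands. A rough sketch of the arithmetic, ignoring container headers and any sidecar metadata (function names are illustrative):

```python
def pcm_bytes_per_second(sample_rate_hz=48_000, bit_depth=24, channels=2):
    """Raw PCM payload rate: samples per second x bytes per sample x channels."""
    return sample_rate_hz * (bit_depth // 8) * channels


def session_size_gb(duration_s, **spec):
    """Approximate payload size in decimal gigabytes for one session."""
    return pcm_bytes_per_second(**spec) * duration_s / 1e9


print(pcm_bytes_per_second())            # 288000 bytes per second, stereo
print(f"{session_size_gb(3600):.2f} GB")  # 1.04 GB for a one-hour stereo session
```

A single-speaker mono session at the same fidelity needs half of that: `session_size_gb(3600, channels=1)`.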
Tier information
Every property a buyer asks about before booking a Hi-Fi sample, in one datasheet.
| Name | Hi-Fi, Studio-Grade |
|---|---|
| Tier | Premium |
| Data modalities | |
| Dataset types | Full-duplex conversation, monologue ASR, narrated TTS, multilingual reads |
| File formats | |
| Sample rate | 48 kHz |
| Bit depth | 24-bit |
| Channel layout | Stereo with per-speaker channel isolation on conversational sessions; the layout the dataset calls for on single-speaker sessions |
| Video specs | Up to 4K UHD (3840×2160) at 30 / 60 fps on video and audiovisual sessions; color-graded master plus delivery proxy |
| Microphones | Studio-grade wired, documented per session |
| Cameras | 4K-capable mirrorless / cinema cameras with fixed framing per session; lens, sensor, and lighting documented per recording |
| Noise floor | Calibrated per session with a pre-recording pass/fail gate; sessions that miss the threshold don't ship |
| Languages | 14 shipping (American English, French, Mandarin, Spanish, +10) · bespoke languages and dialects on request |
| Speaker metadata | Age, gender, region, dialect, native language, acoustic environment |
| Annotations | Word-level transcripts, diarization, and custom annotation layers on request |
Properties listed here apply to every dataset shipped at the Hi-Fi tier. Per-dataset deviations (additional languages, dialects, or annotation layers) are documented on the relevant marketplace listing.
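The noise-floor row describes a pass/fail gate before recording starts. How Ocular implements that gate is not documented here; the sketch below shows one plausible form, measuring the RMS level of a short room-tone capture in dBFS and comparing it with a threshold. The -60 dBFS default, the 24-bit little-endian PCM input, and all function names are assumptions for illustration.

```python
import math

FULL_SCALE_24BIT = 2 ** 23  # magnitude of the most negative 24-bit sample


def decode_pcm24(frames: bytes) -> list[int]:
    """Decode little-endian signed 24-bit PCM bytes into integers."""
    usable = len(frames) - len(frames) % 3  # ignore a trailing partial sample
    return [
        int.from_bytes(frames[i:i + 3], "little", signed=True)
        for i in range(0, usable, 3)
    ]


def rms_dbfs(samples) -> float:
    """RMS level relative to digital full scale; -inf for digital silence."""
    if not samples:
        raise ValueError("no samples to measure")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square / FULL_SCALE_24BIT ** 2)


def passes_noise_gate(samples, threshold_dbfs=-60.0) -> bool:
    """True when the room tone is at or below the (hypothetical) threshold."""
    return rms_dbfs(samples) <= threshold_dbfs
```

Interleaved multi-channel frames are measured together here; a stricter gate would split channels first and test each microphone separately.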
Treat the datasheet as the floor, not the ceiling. Every property is a contractual commitment held across all 14 shipping languages, every recording modality, and every dataset in the marketplace catalogue; that uniformity is the whole point. A pipeline built against Hi-Fi captures inherits the same sample rate, bit depth, channel discipline, and noise-floor calibration delivery after delivery, so training runs don't need per-dataset accommodations for spec drift between releases. If your use case needs something the table doesn't list, such as a non-standard mic configuration, an extra annotation layer, or a bespoke language or dialect, request samples below and we'll scope it against the same guarantees.
Request samples