How much bitrate does an audio-multimodal LLM actually need?
Daniel Rosehill


An eval across 12 audio-capable models on OpenRouter, 5 MP3 bitrates, 4 dictation samples. The default upload quality in most dictation apps is 2–4× overprovisioned — and the more interesting finding had nothing to do with bitrate.


Conventional transcription pipelines run in two stages. An ASR system — Whisper, Deepgram, AssemblyAI — emits a token-level hypothesis, and then a text-only LLM does the editorial cleanup: strips fillers, applies punctuation, fixes the occasional homophone via semantic context. Two models, two API calls, two bills.

Audio-multimodal LLMs collapse that into a single pass. The model that consumes the audio tokens also emits the cleaned text. For dictation workloads the win is real: one call instead of two, roughly half the latency, and the cleanup stage has acoustic features (prosody, disfluency patterns) that a two-stage pipeline throws away at the ASR boundary.

The catch is that these models are new enough that basic operational questions — sample rate, codec, bitrate — don't have settled answers in the public record. So I ran an eval.

The question

How sensitive is transcription accuracy to MP3 bitrate across the audio-input models currently accessible via OpenRouter, and how do those models differ from each other on accuracy, latency, and instruction adherence?

Setup

  • 12 models — every audio-capable model in OpenRouter's catalogue as of April 2026 (Gemini 2.0/2.5/3 variants, GPT-Audio family, Voxtral Small 24B, MiMo V2 Omni).

  • 4 dictation samples — 20–30 second clips of native-English read-aloud prose. Vocabulary deliberately bounded to unambiguous tokens so formatting variance wouldn't pollute WER.

  • 5 MP3 bitrates — 16, 24, 32, 48, 64 kbps. CBR, mono, 16 kHz held constant across all variants.

  • Verbatim transcription prompt — explicit instruction to transcribe exactly what was spoken, sentence punctuation and capitalisation only.

  • WER = Levenshtein edit distance over lowercased word tokens, normalised by reference length. Latency is client-side wall-clock from Jerusalem.

12 × 4 × 5 = 240 API calls. Aggregate cost ≈ $0.25 at April 2026 OpenRouter rates.
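The WER metric above is simple enough to state exactly. A minimal sketch of the computation as defined (token-level Levenshtein distance, lowercased, normalised by reference length); the function name is illustrative, not taken from the eval harness:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein edit distance over lowercased
    word tokens, normalised by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis is longer than the reference, which is exactly what happens on the conversationalised outputs discussed below.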

Finding 1: compression below the perceptual-audio threshold is safe for most models

[Figure: WER vs bitrate, per model]

For the Gemini and Voxtral families, WER is statistically flat across the 16–64 kbps range. Per-bitrate variance within a model exceeds any trend across bitrates; slopes are indistinguishable from zero at n=4. The heatmap makes the model × bitrate grid legible:

[Figure: WER heatmap, models × bitrate]

The operational implication is unambiguous: the default upload bitrate in most dictation apps is substantially overprovisioned. 64 kbps MP3, roughly the perceptual-audio floor for music listening, carries no transcription-accuracy benefit over 32 kbps for speech content on Gemini or Voxtral, and sending it wastes 2× the bandwidth. 24 kbps is likely also safe; 16 kbps is the point at which model-specific testing becomes warranted before committing.

This shouldn't surprise anyone who has looked at the information content of speech: the useful spectral bands sit below 4 kHz, and a 16 kbps MP3 at a 16 kHz sample rate preserves the formants ASR systems rely on. But the assumption and the measurement are different things, and the measurement for audio-multimodal models specifically didn't previously exist in the public record.

Further compression (Opus at 8–16 kbps, which outperforms MP3 at equivalent bitrates for speech) isn't accessible via OpenRouter's OpenAI-compatible input_audio schema, which accepts wav and mp3 only. Opus would require bypassing OpenRouter and calling providers that expose it natively.
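For reference, a single eval call against that schema is roughly the following payload shape (the OpenAI-compatible input_audio content part with base64-encoded bytes). The prompt text here is a paraphrase of the verbatim instruction and the model id is a placeholder, not values copied from the harness:

```python
import base64

def build_transcription_request(mp3_bytes: bytes, model: str) -> dict:
    """Build an OpenAI-compatible chat payload carrying one MP3 clip
    as an input_audio content part (OpenRouter accepts wav and mp3)."""
    audio_b64 = base64.b64encode(mp3_bytes).decode("ascii")
    return {
        "model": model,  # e.g. an OpenRouter model id
        "messages": [
            {"role": "system",
             "content": "Transcribe VERBATIM. Do NOT rephrase, "
                        "summarise, or reformat. Output plain text only."},
            {"role": "user",
             "content": [
                 {"type": "input_audio",
                  "input_audio": {"data": audio_b64, "format": "mp3"}},
             ]},
        ],
    }
```

The same payload is POSTed to the chat-completions endpoint for every model × sample × bitrate cell; only `model` and the audio bytes change.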

Finding 2: large cross-model differences in accuracy and latency

The relevant optimisation surface for a transcription pipeline is accuracy × latency × cost:

[Figure: accuracy vs latency scatter]

Three clusters stand out.

Voxtral Small 24B: average WER ≈ 0.02, average latency ≈ 1.0s. Fastest model in the panel by a significant margin — 2–8× faster than comparable-accuracy Gemini variants. On latency-sensitive pipelines (live dictation with visible response time), this is the model to beat. The 32k context window becomes a constraint for clips over ~15 minutes.

Gemini 3 Flash Preview: average WER ≈ 0.014, average latency ≈ 2.2s. Best accuracy in the panel, consistent across all bitrates. Sensible default when accuracy matters more than latency.

Gemini 2.5 Pro: average WER ≈ 0.018, average latency ≈ 7.2s, significantly higher cost. Strictly dominated by Gemini 3 Flash Preview for this workload. Dictation doesn't benefit from reasoning-model capabilities; the additional capacity is unused. Not recommended for transcription.

[Figure: latency by model]

Latency values include network round-trip and are specific to the test location. Absolute numbers will differ elsewhere; the ordering should be broadly stable since it reflects serving-infrastructure differences rather than routing.

Finding 3: instruction adherence varies — and it matters more than compression

The most operationally significant finding is not about bitrate at all. It's about whether the model does what the prompt asks.

[Figure: WER distribution per model]

The Gemini and Voxtral WER distributions are tight — median ≈ 0.02, narrow IQR, no tail. The three OpenAI audio models (GPT-Audio, GPT-Audio-Mini, GPT-4o-Audio-Preview) show bimodal behaviour: a cluster of calls with WER ≈ 0.02 (as tight as Gemini) and a second cluster with WER ≈ 0.9–1.2. Inspection of the outliers reveals the failure mode: the model treats the audio as conversational input and emits a response to the content rather than a transcription of it.

Example — GPT-Audio-Mini, sample 2, 16 kbps:

Reference (what the speaker read): "My grandmother used to make soup from whatever was in the kitchen on a Sunday afternoon. Carrots, a little onion, sometimes a handful of barley if she remembered to buy it…"

Model output: "That's a beautiful description. It paints a vivid picture of the scene—your grandmother's methodical and careful preparation, the simple ingredients, and the comforting aroma filling the apartment…"

WER on this call: 0.96. The same audio at the same bitrate produced WER = 0 on Gemini 2.0 Flash Lite and WER = 0.014 on Voxtral. The audio was fine. The model decoded the acoustic signal correctly. It then chose to generate conversationally despite an explicit "transcribe VERBATIM … do NOT rephrase, summarise, or reformat … output plain text only" system prompt.

This is a prompt-adherence failure, not an audio-understanding failure. The implications are significant:

  1. Verbatim transcription of open-form speech is not a reliable capability across the audio-multimodal landscape. All three OpenAI audio variants exhibit the behaviour; Gemini, Voxtral, and MiMo do not.

  2. Output validation becomes non-optional when using GPT-Audio-family models for transcription. A length-ratio check or a semantic-coherence check catches most failures — at the cost of a second model call that partly defeats the one-pass architectural advantage.

  3. Provider selection should be evaluated on instruction-adherence specifically, not just accuracy on successful calls.

The conversationalisation failure rate in this eval is roughly 25–40% across the three OpenAI models, varying by sample (narrative prose appears to elicit it more than task-oriented content). At that rate, the one-pass architecture is no longer a win — the probability of pipeline failure exceeds the value of eliminating the second stage.
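A minimal sketch of the length-ratio check mentioned above: flag any output whose word count is implausible for the clip's duration. The speech-rate bounds here are illustrative assumptions, not values derived from the eval, and this check will miss conversational replies that happen to match the clip's length:

```python
def plausible_transcript(output: str, audio_seconds: float,
                         min_wps: float = 1.0, max_wps: float = 4.5) -> bool:
    """Cheap output validation for one-pass transcription: reject
    outputs whose implied speech rate (words per second of audio)
    falls outside a plausible band. Bounds are illustrative."""
    rate = len(output.split()) / audio_seconds
    return min_wps <= rate <= max_wps
```

A semantic-coherence check (a second, text-only model call comparing output against expectations) catches the remainder, at the cost noted above of partly reintroducing the two-stage pipeline.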

Recommendations

  1. Reduce upload bitrate to 32 kbps MP3 mono 16 kHz unless eval data against your own samples shows a model-specific regression. Most production dictation pipelines are over-provisioned on audio quality by 2–4×.

  2. 16–24 kbps is worth testing for high-volume pipelines where bandwidth dominates cost.

  3. Do not send 44.1 or 48 kHz audio to audio-multimodal LLMs. The encoders operate on 16 kHz (or lower); higher rates are server-side resampled and waste bandwidth.

  4. For latency-sensitive transcription, default to Voxtral Small 24B at 24–32 kbps. Nothing else in the OpenRouter catalogue matches its latency at comparable accuracy.

  5. For accuracy-sensitive workloads, default to Gemini 3 Flash Preview at 32 kbps.

  6. Avoid GPT-Audio-family models for verbatim transcription without output validation. Fine for audio-understanding tasks (captioning, content analysis) where conversational output is wanted; not fine when verbatim is wanted.

  7. Audit your own pipeline for this failure mode if currently using GPT-Audio variants in production. Length-ratio checks are the cheapest defence.
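Recommendation 1 translates to a single ffmpeg invocation. A sketch of the argument list, assuming a stock ffmpeg build with the libmp3lame encoder (the wrapper function is illustrative; run the list with `subprocess.run` or paste it as a shell command):

```python
def ffmpeg_mp3_args(src: str, dst: str, kbps: int = 32) -> list:
    """ffmpeg arguments for the recommended upload format:
    CBR MP3, mono, 16 kHz."""
    return [
        "ffmpeg", "-i", src,
        "-ac", "1",                   # mono
        "-ar", "16000",               # 16 kHz sample rate
        "-b:a", f"{kbps}k",           # constant bitrate
        "-codec:a", "libmp3lame",     # MP3 encoder
        dst,
    ]
```

The same settings, held constant except for `-b:a`, produced the five bitrate variants in the eval's setup.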

Caveats

n=4 samples per cell — sufficient for the effect sizes observed, insufficient for tight confidence intervals on small deltas. All recordings share a single speaker (native English, Israeli-inflected), microphone (consumer USB condenser), and acoustic environment (quiet indoor room). Results won't generalise cleanly to accented speech, noisy environments, multi-speaker conversation, or non-English without re-testing. Latency numbers are region-specific.

Dataset and code

The full dataset — source WAVs, every MP3 variant byte-for-byte, per-call transcriptions, and all 240 calls in a single CSV — is published under MIT. The eval harness is reusable against your own samples with a single CLI argument change.

Repositories