Testing Gemini 3.1 Flash Lite's Audio Understanding With 49 Structured Prompts
I built a 49-prompt test suite to evaluate Gemini 3.1 Flash Lite's audio understanding capabilities across 13 categories — from accent detection to deception analysis. Here's what worked, what didn't, and why it matters.
Audio multimodal models can transcribe speech. That much is well established. But what else can they actually infer from a raw audio signal? Can they detect accents, estimate speaker demographics, identify emotional tone, or spot deception? I decided to find out by building a structured test suite and throwing it at Google's Gemini 3.1 Flash Lite.
The Setup
The experiment used a single, natural voice sample: a 21-minute unscripted recording of myself rambling into a smartphone in Jerusalem. No studio conditions, no script — just stream-of-consciousness chat about AI podcasts, voice cloning experiments, rocket sirens, and GitHub LFS issues. The kind of messy, real-world audio that actually tests a model's limits.
Against this audio, I ran 49 structured prompts spanning 13 categories of audio analysis. Each prompt was sent to Gemini 3.1 Flash Lite (Preview) via the API alongside the FLAC audio file, and the model's response was captured as markdown.
The 13 Test Categories
The prompts were designed to probe progressively deeper layers of audio understanding:
Speaker Analysis (8 prompts) — accent identification, phonetic patterns, voice profiling
Audio Engineering (6 prompts) — EQ recommendations, microphone inference, room acoustics
Emotion & Sentiment (5 prompts) — emotional tone, valence-arousal mapping, timestamped tracking
Speaker Demographics (4 prompts) — gender, age, education level inference
Health & Wellness (4 prompts) — inebriation, drug influence, hydration assessment
Environment (4 prompts) — indoor/outdoor classification, background noise identification
Speech Metrics (3 prompts) — words per minute, dictation coaching, STT model ranking
Forensic Audio (3 prompts) — deception detection, deepfake detection, insincerity timestamps
Voice Cloning (2 prompts) — TTS cloning characteristics, clonability assessment
Content Analysis (2 prompts) — words-vs-tone deviation, address pattern analysis
Language Learning (2 prompts) — Hebrew phonetic difficulty, easiest foreign language to learn
Production (1 prompt) — voiceover potential assessment
Speaker ID (1 prompt) — celebrity voice match
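Some of these categories have an objective ground truth you can compute yourself and compare against the model's answer. Speaking rate is the simplest case: a minimal sketch, assuming you have a transcript string and the clip duration (the word count below is hypothetical, only the 21-minute duration comes from this experiment):

```python
# Ground-truth words-per-minute from a transcript and clip duration,
# useful for sanity-checking a model's speech-rate estimate.

def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Average speaking rate over the whole clip."""
    word_count = len(transcript.split())
    return word_count * 60.0 / duration_seconds

# Hypothetical example: 3,150 words over a 21-minute (1,260 s) recording.
rate = words_per_minute(" ".join(["word"] * 3150), 21 * 60)
print(round(rate))  # 150
```

A model whose WPM estimate lands well outside this kind of computed figure is narrating rather than measuring.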
How It Works
The pipeline is straightforward. A Python script (run-prompts.py) loads the 49 prompts from a JSON index, uploads the audio file to Google's GenAI File API, then iterates through each prompt — sending audio + text to the model and saving the response as markdown. A separate script generates a formatted PDF report with cover page, experiment metadata, and all 49 prompt-output pairs.
The whole thing is open source and designed to be reproducible — swap out the audio file and re-run the suite against any model that supports audio input.
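The loop described above can be sketched in a few lines with the google-genai Python SDK. This is a minimal reconstruction, not the repository's actual script: the prompt-index schema, file names, output layout, and the model identifier string are all assumptions.

```python
import json
from pathlib import Path

def load_prompts(index_text: str) -> list[dict]:
    """Parse the JSON prompt index. Assumes a top-level list of
    {id, category, prompt} objects; the real repo's schema may differ."""
    return json.loads(index_text)

def run_suite(audio_path: str, index_path: str, model: str, out_dir: str) -> None:
    """Upload the audio once, then send it alongside each prompt in turn."""
    from google import genai  # deferred import: requires the google-genai package
    client = genai.Client()   # reads GEMINI_API_KEY from the environment
    audio = client.files.upload(file=audio_path)  # GenAI File API upload
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for item in load_prompts(Path(index_path).read_text()):
        resp = client.models.generate_content(
            model=model, contents=[audio, item["prompt"]]
        )
        (out / f"{item['id']}.md").write_text(resp.text)  # save as markdown

if __name__ == "__main__":
    # Hypothetical paths and model name; substitute your own.
    run_suite("audio.flac", "prompts.json", "gemini-flash-lite-preview", "outputs")
```

Because the model and audio path are plain parameters, re-running the suite against a different model or recording is a one-line change.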
What Worked Well
Gemini Flash Lite handled content-driven audio understanding impressively for a lightweight model:
Speaker identification and accent analysis — correctly identified an Irish male speaker in his late 30s, with detailed phonetic observations about vowel patterns and speech rhythm.
Content comprehension — accurately extracted topics, named entities, geographic references, and technical terminology from completely unscripted speech.
Emotional tone and deception analysis — appropriately characterized the speech as casual and non-deceptive, and correctly noted fatigue consistent with the speaker's own description.
Audio engineering — produced technically sound EQ recommendations and plausible microphone/room acoustics inferences.
Ethical responsibility — consistently added appropriate disclaimers on sensitive prompts (mental health, drug influence detection) without refusing to engage.
Where It Struggled
The model's limitations showed up primarily in tasks requiring genuine signal-level acoustic analysis:
Quantitative acoustic metrics — tasks like valence-arousal mapping with timestamps or precise emotional peak identification produced plausible-sounding narratives without demonstrable acoustic grounding. The numbers felt estimated rather than measured.
Adversarial demographic prompts — age detection leaned on the speaker's own stated age from the content rather than independent vocal analysis. It's unclear how much was acoustic inference vs. content comprehension.
Environmental inference — weather estimation and some voice-matching tasks produced minimal outputs, with little conditional reasoning from the available audio cues.
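To make the estimated-vs-measured distinction concrete: a quantity like short-term signal energy is computed directly from samples, with no language model in the loop. A minimal stdlib sketch over synthetic floating-point samples (real code would first decode the FLAC to PCM; frame size and the energy-as-arousal reading are illustrative assumptions):

```python
import math

def frame_rms(samples: list[float], frame_size: int) -> list[float]:
    """Root-mean-square energy per non-overlapping frame: a crude but
    genuinely measured signal-level feature, often used as an arousal proxy."""
    rms = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return rms

# Synthetic example: a quiet frame followed by a loud frame.
signal = [0.1] * 100 + [0.8] * 100
print([round(v, 2) for v in frame_rms(signal, 100)])  # [0.1, 0.8]
```

A timestamped arousal curve grounded in measurements like this is checkable; the model's timestamped valence-arousal narratives offered no comparable audit trail.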
The Key Takeaway
Gemini 3.1 Flash Lite excels at content-driven audio understanding — who is speaking, what they're saying, what emotional register they're in, and what environment they're in. It's considerably weaker at signal-level acoustic analysis — precise measurements, spectral characteristics, and tasks that require genuinely "hearing" the waveform rather than understanding the speech.
This distinction matters because it suggests the model's audio "understanding" is still heavily mediated by its language capabilities. It's excellent at reasoning about what it hears, but less capable of performing the kind of acoustic signal processing that an audio engineer or forensic analyst would do.
All 49 Prompts Completed
One noteworthy result: all 49 prompts completed without a single failure or refusal. Even the more provocative ones — deception detection, drug influence screening, deepfake analysis — received substantive responses with appropriate caveats. For a lightweight model variant, that's a solid baseline.
Try It Yourself
The full test suite, all 49 prompts, the execution script, and the complete output report are open source on GitHub. Swap in your own audio file, point it at a different model, and see how the results compare. The prompts are designed to be model-agnostic — they'd work just as well against GPT-5, Claude, or any future audio-capable model.