Daniel Rosehill Hey, It Works!
Gemini Transcription MCP: Voice Notes to Formatted Documents via Claude
· Daniel Rosehill

An MCP server that turns audio files into transcripts, meeting notes, blog posts, emails, and dev specs using Google Gemini — with 200+ transformation presets, VAD preprocessing, and SSH file retrieval.

I record a lot of voice notes: ideas, meeting recaps, stream-of-consciousness project planning. The problem was always the gap between recording and having something usable. Raw transcription gets you text, but that text is full of filler words and false starts and has no structure. What I wanted was to speak into a microphone and get back a formatted blog outline, a clean meeting summary, or a development spec.

Gemini Transcription MCP is the MCP server I built to do exactly that. It uses Google's Gemini multimodal API (via OpenRouter) to transcribe audio and then transform the output into whatever format you need — with 200+ presets and full custom prompt support.

Seven Transcription Tools

The server exposes seven specialized tools, each tuned for a different use case:

  1. transcribe_audio — The recommended default. Returns a lightly edited transcript: filler words removed, verbal corrections applied ("wait, I meant bananas" gets fixed), punctuation added, markdown subheadings generated for structure.

  2. transcribe_audio_raw — Verbatim transcription with minimal cleanup. Preserves filler words and false starts. Only applies essential fixes: spelled-out words, incomplete sentences, basic punctuation.

  3. transcribe_audio_vad — Voice Activity Detection preprocessing using the Silero VAD model. Strips silence and non-speech audio before transcription. Essential for recordings with long pauses or background noise.

  4. transcribe_audio_format — Transcribe and format as a specific document type: email, to-do list, meeting notes, technical document, blog post, executive summary, letter, report, or outline. Applies format-specific structural conventions automatically.

  5. transcribe_with_preset — Uses curated presets from a library of 200+ transformations. Style presets modify tone (formal, academic, journalistic, shakespearean, dejargonizer). Format presets restructure into document types (blog_outline, meeting_minutes, bug_report, cover_letter, tech_documentation). Supports fuzzy matching.

  6. transcribe_audio_custom — Full control via a user-defined prompt. Whatever specialized transcription instructions you need.

  7. list_transcription_presets — Lists all available presets, filterable by category (style or format) and searchable by name.
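To make the fuzzy matching in transcribe_with_preset concrete, here is a minimal sketch of one way a loosely typed preset name could be resolved. The algorithm (normalize, then substring match, then edit distance) and the `matchPreset` helper are assumptions for illustration, and the server's actual matching logic may differ. The preset list contains only the names mentioned above.

```typescript
// Preset names taken from the examples in this post; the real library has 200+.
const PRESETS: string[] = [
  "formal", "academic", "journalistic", "shakespearean", "dejargonizer",
  "blog_outline", "meeting_minutes", "bug_report", "cover_letter", "tech_documentation",
];

// Lowercase and drop everything except letters and digits.
function normalize(s: string): string {
  return s.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Levenshtein edit distance using a single rolling row.
function editDistance(a: string, b: string): number {
  const row: number[] = [];
  for (let j = 0; j <= b.length; j++) row.push(j);
  for (let i = 1; i <= a.length; i++) {
    let prevDiag = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, prevDiag + cost);
      prevDiag = above;
    }
  }
  return row[b.length];
}

// Hypothetical helper: returns the closest preset name, or null if nothing is close.
function matchPreset(query: string, presets: string[] = PRESETS): string | null {
  const q = normalize(query);
  if (q.length === 0) return null;
  let best: string | null = null;
  let bestScore = Infinity;
  for (const name of presets) {
    const n = normalize(name);
    // Exact hits beat substring hits, which beat edit-distance matches.
    const score = n === q ? -1 : n.includes(q) ? 0 : editDistance(q, n);
    if (score < bestScore) {
      best = name;
      bestScore = score;
    }
  }
  // Reject candidates that need too many edits relative to the query length.
  const maxEdits = Math.max(1, Math.floor(q.length / 3));
  return bestScore <= maxEdits ? best : null;
}
```

With this sketch, "Meeting Minutes" and "bug-reprot" resolve to meeting_minutes and bug_report, while an unrelated query such as "xyz" returns null instead of a random preset.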

Three Ways to Feed It Audio

Every tool supports all three input methods:

  • Base64-encoded content — passed directly in the request

  • HTTP/HTTPS URLs — streaming download, no memory buffering

  • SSH/SCP retrieval — pull files from remote machines via SCP (local deployment only, requires SSH key access)
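The three input methods can be modeled as a small tagged union. The sketch below is illustrative only: the parameter names (`audioBase64`, `audioUrl`, `sshHost`, `sshPath`) and the `resolveSource` helper are hypothetical and not the server's documented schema. It shows the two rules stated above, one source per request and SSH limited to local deployments.

```typescript
// Hypothetical request shape; the real tool parameters may be named differently.
interface AudioInput {
  audioBase64?: string;
  audioUrl?: string;
  sshHost?: string;
  sshPath?: string;
}

type AudioSource =
  | { kind: "base64"; data: string }
  | { kind: "url"; url: string }
  | { kind: "ssh"; host: string; path: string };

function resolveSource(input: AudioInput, isLocalDeployment: boolean): AudioSource {
  const provided = [
    Boolean(input.audioBase64),
    Boolean(input.audioUrl),
    Boolean(input.sshHost && input.sshPath),
  ].filter(Boolean).length;
  if (provided !== 1) {
    throw new Error("Provide exactly one audio source");
  }

  if (input.audioBase64) {
    return { kind: "base64", data: input.audioBase64 };
  }
  if (input.audioUrl) {
    // Only HTTP and HTTPS URLs are listed as supported.
    if (!/^https?:\/\//i.test(input.audioUrl)) {
      throw new Error("Only HTTP/HTTPS URLs are supported");
    }
    return { kind: "url", url: input.audioUrl };
  }
  if (input.sshHost && input.sshPath) {
    // SSH/SCP retrieval is described as local-deployment only.
    if (!isLocalDeployment) {
      throw new Error("SSH retrieval is only available in local deployments");
    }
    return { kind: "ssh", host: input.sshHost, path: input.sshPath };
  }
  throw new Error("Provide exactly one audio source");
}
```

Resolving the source up front means the rest of the pipeline only has to switch on `kind`, regardless of which tool was called.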

Audio Processing Pipeline

The server handles format conversion and compression automatically via ffmpeg:

  • Native formats (MP3, WAV, OGG, FLAC, AAC, AIFF) pass through directly

  • Non-native formats (Opus, M4A, WebM, WMA, AMR, 3GP, CAF) auto-convert to OGG/Opus

  • Large files (>15MB) get downsampled to OGG/Opus at 16kHz, 24kbps, mono — a 1-hour WAV (~600MB) compresses to ~10MB while maintaining transcription quality

  • 100MB absolute maximum file size limit
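The rules above boil down to a small decision function plus some bitrate arithmetic. This is a sketch under stated assumptions, not the server's code: the `planProcessing` and `estimateCompressedBytes` helpers, the `out.ogg` output name, and the reading of 15MB as 15 × 1024 × 1024 bytes are all illustrative. The ffmpeg flags (`-c:a libopus`, `-b:a`, `-ar`, `-ac`, `-vn`) are standard ffmpeg options. The 100MB limit is left out because the post does not say at which stage it applies.

```typescript
const NATIVE_FORMATS = new Set(["mp3", "wav", "ogg", "flac", "aac", "aiff"]);
const CONVERTIBLE_FORMATS = new Set(["opus", "m4a", "webm", "wma", "amr", "3gp", "caf"]);
const COMPRESS_THRESHOLD_BYTES = 15 * 1024 * 1024; // assumption: "15MB" read as MiB

type Plan =
  | { action: "passthrough" }
  | { action: "convert" | "compress"; ffmpegArgs: string[] };

// Hypothetical helper: decide what to do with a file based on extension and size.
function planProcessing(fileName: string, sizeBytes: number): Plan {
  const ext = (fileName.split(".").pop() ?? "").toLowerCase();
  if (!NATIVE_FORMATS.has(ext) && !CONVERTIBLE_FORMATS.has(ext)) {
    throw new Error(`Unsupported audio format: ${ext}`);
  }
  const output = "out.ogg"; // illustrative output path

  // Large files are downsampled to 16 kHz mono Opus at 24 kbps.
  if (sizeBytes > COMPRESS_THRESHOLD_BYTES) {
    return {
      action: "compress",
      ffmpegArgs: ["-i", fileName, "-vn", "-c:a", "libopus", "-b:a", "24k", "-ar", "16000", "-ac", "1", output],
    };
  }
  // Small native files need no work.
  if (NATIVE_FORMATS.has(ext)) {
    return { action: "passthrough" };
  }
  // Small non-native files are re-encoded to OGG/Opus.
  return { action: "convert", ffmpegArgs: ["-i", fileName, "-vn", "-c:a", "libopus", output] };
}

// Approximate size of constant-bitrate audio: seconds * bits per second / 8.
function estimateCompressedBytes(durationSeconds: number, bitrateKbps: number = 24): number {
  return (durationSeconds * bitrateKbps * 1000) / 8;
}
```

The arithmetic checks out against the figure quoted above: one hour at 24 kbps is 3600 × 24,000 / 8 = 10,800,000 bytes, or roughly 10MB before container overhead.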

Deployment Options

Local (stdio) for Claude Code / Claude Desktop:

npx gemini-transcription-mcp

Remote (Docker/HTTP) for containerized deployments:

docker run -d -p 3000:3000 -e OPENROUTER_API_KEY=your-key ghcr.io/danielrosehill/gemini-transcription-mcp

The Docker image includes ffmpeg and exposes a streamable HTTP endpoint at /mcp plus a health check at /health.
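For the local stdio route, Claude Desktop reads MCP servers from an `mcpServers` object in its `claude_desktop_config.json`. The sketch below builds such an entry for the npx command shown earlier. The server label `gemini-transcription` and the `buildClaudeConfig` helper are hypothetical names chosen for illustration, and `-y` is the standard npx flag for skipping the install prompt.

```typescript
interface McpServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

interface ClaudeDesktopConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Hypothetical helper: build the config fragment for the stdio deployment.
function buildClaudeConfig(apiKey: string, serverName: string = "gemini-transcription"): ClaudeDesktopConfig {
  if (apiKey.trim() === "") {
    throw new Error("OPENROUTER_API_KEY must not be empty");
  }
  return {
    mcpServers: {
      [serverName]: {
        command: "npx",
        args: ["-y", "gemini-transcription-mcp"],
        // The API key is passed to the server process as an environment variable.
        env: { OPENROUTER_API_KEY: apiKey },
      },
    },
  };
}

// Pretty-printed JSON ready to merge into claude_desktop_config.json.
const configJson: string = JSON.stringify(buildClaudeConfig("your-key"), null, 2);
```

Pasting the resulting JSON into the config file and restarting Claude Desktop should make the server's tools available, assuming Node.js and ffmpeg are installed locally.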

Configuration

  • OPENROUTER_API_KEY — required, for accessing Gemini via OpenRouter

  • OPENROUTER_MODEL — optional, defaults to Gemini Flash Lite. Use flash for the more capable model

  • TRANSCRIPT_OUTPUT_DIR — optional, auto-saves transcripts as markdown files with slugified titles
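As a rough illustration of how these three variables might be consumed, here is a minimal sketch that validates the environment and derives a slugified file name for a saved transcript. The `loadConfig`, `slugify`, and `transcriptFileName` helpers, the `untitled` fallback, and the exact slug rules are assumptions. The post only says that transcripts are saved as markdown files with slugified titles.

```typescript
interface ServerConfig {
  apiKey: string;
  model: string | null; // null means "use the server default"
  outputDir: string | null; // null means "do not auto-save"
}

// Takes the environment as a plain object so it can be tested without a real process.
function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const apiKey = env.OPENROUTER_API_KEY?.trim();
  if (!apiKey) {
    throw new Error("OPENROUTER_API_KEY is required");
  }
  return {
    apiKey,
    model: env.OPENROUTER_MODEL?.trim() || null,
    outputDir: env.TRANSCRIPT_OUTPUT_DIR?.trim() || null,
  };
}

// Lowercase, collapse runs of non-alphanumerics into single hyphens, trim the ends.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Hypothetical naming scheme: "<slug>.md", with a fallback for empty slugs.
function transcriptFileName(title: string): string {
  const slug = slugify(title);
  return `${slug || "untitled"}.md`;
}
```

Under these rules a transcript titled "Q3 Planning: Next Steps!" would be saved as q3-planning-next-steps.md inside TRANSCRIPT_OUTPUT_DIR.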

The repo is at github.com/danielrosehill/Gemini-Transcription-MCP — published on npm as gemini-transcription-mcp.

danielrosehill/Gemini-Transcription-MCP ★ 0

MCP for Gemini multimodal audio transcription with built in post-processing

TypeScript · Updated Apr 2026
Tags: audio-multimodal, dictation, gemini, gemini-mcp, mcp