Cloud ASR MCP: multi-backend transcription via multimodal LLMs
An MCP server for audio transcription using multimodal LLMs like Gemini, GPT-4o Audio, and Voxtral — not traditional ASR.
After building the Gemini-specific transcription MCP, I realised I was hitting a limitation that would be obvious to anyone who's worked with multiple LLM providers: I'd tuned everything for one backend, and when I wanted to compare transcription quality across models or route to a different provider for a specific task, I had to swap MCP servers. That's silly. What I actually wanted was a single MCP server that could route transcription to different multimodal LLM backends depending on the task — Gemini for its massive context window and single-pass capability, GPT-4o Audio for its formatting accuracy, Voxtral for its open-source appeal. Cloud ASR MCP is the result: a multi-backend transcription server that uses audio-capable multimodal models instead of traditional speech-to-text engines.
danielrosehill/Cloud-ASR-MCP View on GitHub

Why multimodal LLMs instead of traditional ASR
The key difference from conventional speech-to-text tools like Whisper is how the audio gets processed. Traditional ASR converts speech to text and stops there — any cleanup, formatting, or analysis requires a separate step. Multimodal LLMs process audio holistically in a single pass, which means you can provide text prompt guidance to clean up transcripts on the fly: removing filler words, formatting speaker turns, applying domain-specific vocabulary corrections, or even generating structured summaries alongside the transcript. All in one API call.

I've successfully transcribed 50-minute audio files with Gemini in a single pass, no chunking required — genuinely impressive compared to the chunk-and-stitch approach traditional ASR tools need for long recordings. The quality tradeoff is real: Whisper is still more accurate for pure word-for-word transcription. But for my use case, which is usually "give me a clean, readable version of this meeting recording," the multimodal approach wins on practicality.
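To make the "prompt plus audio in one call" idea concrete, here is a minimal sketch of the kind of request an audio-capable chat model accepts. The exact payload Cloud ASR MCP builds is an assumption on my part; the shape below follows the OpenAI-compatible chat-completions convention (also usable via OpenRouter) where audio is passed as a base64 `input_audio` content part next to a text instruction:

```python
import base64

def build_transcription_request(audio_bytes: bytes, audio_format: str,
                                model: str = "gpt-4o-audio-preview") -> dict:
    """Build a chat-completions payload that transcribes and cleans up
    audio in one call: the text part carries the cleanup instructions,
    the audio part carries the recording itself.

    Note: the model name and prompt wording here are illustrative
    assumptions, not taken from the Cloud ASR MCP source.
    """
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {
                    # The prompt is where the "cleanup on the fly" happens:
                    # filler-word removal, speaker formatting, vocabulary fixes.
                    "type": "text",
                    "text": ("Transcribe this recording. Remove filler words, "
                             "format speaker turns, and keep punctuation clean."),
                },
                {
                    # The recording travels in the same message as base64.
                    "type": "input_audio",
                    "input_audio": {
                        "data": base64.b64encode(audio_bytes).decode("ascii"),
                        "format": audio_format,  # e.g. "wav" or "mp3"
                    },
                },
            ],
        }],
    }
```

The payload would then be sent with any OpenAI-compatible client, e.g. `client.chat.completions.create(**build_transcription_request(audio, "wav"))` — one call, transcript plus cleanup, no separate post-processing step.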
The multi-backend architecture
The server supports Gemini 2.5 Flash and Pro, GPT-4o Audio, and Voxtral (Mistral's voice model), all accessible through OpenRouter with a single API key or via direct API access for each provider. OpenRouter is the recommended path since it gives you access to all models with one key and built-in cost tracking. The unified OpenRouter tool lets you pick a model at transcription time, which is handy when you want to compare outputs or match a model to a specific task. There are also direct API tools for Gemini (both cleaned and raw verbatim modes), OpenAI's gpt-4o-transcribe, and Voxtral via Mistral's API. The server supports both stdio transport for local use with Claude Code and SSE transport for remote deployments with MCP aggregators like MetaMCP. Audio can be provided as files, as base64 content, or fetched from URLs. Available on npm and GitHub.
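For the local stdio case, an MCP client registration might look something like the following. This is a sketch, not copied from the repo: the npm package name and the environment variable name are assumptions based on the npm availability and OpenRouter recommendation above, so check the project's README for the real values.

```json
{
  "mcpServers": {
    "cloud-asr": {
      "command": "npx",
      "args": ["-y", "cloud-asr-mcp"],
      "env": {
        "OPENROUTER_API_KEY": "sk-or-..."
      }
    }
  }
}
```

With a single OpenRouter key in the environment, the unified tool can route each transcription request to whichever backend model you name at call time.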