Voice Blog Creator: turning voice recordings into polished blog posts with Gemini
An automated pipeline that converts raw voice recordings into polished blog posts using audio preprocessing, Gemini transcription, and AI-powered formatting.
I do a lot of thinking out loud. Sometimes the best way to work through an idea is to just talk it out, record it, and figure out the structure later. The problem is that "later" often means never — raw voice recordings pile up and never get turned into anything useful. Voice Blog Creator is my solution to that problem: an automated pipeline that takes a raw audio recording and produces a formatted, publish-ready blog post.
The three-step pipeline
The workflow has three distinct stages, each optimized for its specific purpose:
Step 1: Audio preprocessing. The raw recording gets cleaned up for optimal speech-to-text performance. This means converting to mono, removing silence while keeping natural pauses, reducing background noise, normalizing audio levels, applying dynamic range compression, and downsampling to 16 kHz, all handled by ffmpeg under the hood.
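To make the preprocessing step concrete, here is a minimal sketch of how the ffmpeg call could be assembled from Python. The filter chain and every threshold in it are my illustrative assumptions rather than the values the project uses, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Speech cleanup filter chain. All parameter values are illustrative guesses.
AUDIO_FILTERS = ",".join([
    "afftdn",  # FFT-based background noise reduction (ffmpeg defaults)
    # Trim silences longer than one second, so shorter natural pauses survive.
    "silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-40dB",
    "acompressor",  # dynamic range compression (ffmpeg defaults)
    "loudnorm",     # loudness normalization (ffmpeg defaults)
])


def build_ffmpeg_command(src, dst):
    """Return the ffmpeg argument list without running anything."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", str(src),
        "-ac", "1",            # downmix to mono
        "-af", AUDIO_FILTERS,
        "-ar", "16000",        # resample to 16 kHz
        str(dst),
    ]


def preprocess(src, dst):
    """Run ffmpeg (must be on PATH) and return the cleaned file's path."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    return Path(dst)
```

Keeping the command builder separate from the function that runs it makes the settings easy to inspect or tweak without touching any audio.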
Step 2: Transcription. The processed audio goes to Gemini 2.5 Flash for transcription with light editing. The model removes filler words (um, uh, like, you know), organizes the text into paragraphs based on topic changes, and adds proper spacing, all while maintaining the original meaning and the speaker's voice.
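The transcription step might look roughly like the sketch below. It assumes the google-genai Python SDK with an API key in the environment; the project may well use a different client, and the prompt wording and the transcribe helper are hypothetical. The client can be passed in, which keeps the function easy to exercise without network access.

```python
# Hypothetical prompt; the project's actual wording is not shown in this post.
TRANSCRIPTION_PROMPT = (
    "Transcribe this recording. Remove filler words such as um, uh, like, "
    "and you know. Break the text into paragraphs when the topic changes. "
    "Preserve the original meaning and the speaker's voice."
)


def transcribe(audio_path, client=None, model="gemini-2.5-flash"):
    """Upload the cleaned audio and return a lightly edited transcript."""
    if client is None:
        # Assumption: the google-genai package is installed and the API key
        # is available in the environment.
        from google import genai
        client = genai.Client()
    uploaded = client.files.upload(file=audio_path)
    response = client.models.generate_content(
        model=model,
        contents=[TRANSCRIPTION_PROMPT, uploaded],
    )
    # response.text can be None if the model returns no text part.
    return (response.text or "").strip()
```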
Step 3: Blog post generation. The cleaned transcript gets transformed into a formatted blog post with a compelling title, introduction, subheadings, conclusion, and proper markdown formatting. Again powered by Gemini 2.5 Flash.
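As a rough illustration of the generation step, the sketch below builds a prompt from the elements listed above and hands it to whatever text-generation callable you supply. The prompt wording, BLOG_SECTIONS, and both function names are assumptions for illustration, not the project's code.

```python
from typing import Callable

# The elements this post says the generated article should contain.
BLOG_SECTIONS = [
    "a compelling title",
    "an introduction",
    "subheadings",
    "a conclusion",
]


def build_blog_prompt(transcript: str) -> str:
    """Assemble a hypothetical instruction prompt around the transcript."""
    wanted = ", ".join(BLOG_SECTIONS)
    return (
        "Turn the transcript below into a blog post in Markdown. "
        f"Include {wanted}. "
        "Keep the speaker's voice and do not add new claims.\n\n"
        f"Transcript:\n{transcript.strip()}"
    )


def generate_blog_post(transcript: str, generate: Callable[[str], str]) -> str:
    """Send the prompt through `generate` and return the post text."""
    if not transcript.strip():
        raise ValueError("transcript is empty")
    # Normalize surrounding whitespace and end the file with one newline.
    return generate(build_blog_prompt(transcript)).strip() + "\n"
```

In practice, generate would wrap a Gemini 2.5 Flash call; passing it in as a parameter means the formatting logic can be tried out with a stub.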
Smart caching and flexibility
The pipeline is designed to be practical. Each step caches its output, so if you've already preprocessed the audio and just want to regenerate the blog post with different settings, it skips the earlier steps. You can also run individual steps independently, or force regeneration of everything with the --force flag. The cost is minimal: roughly $0.01 to $0.05 per hour of audio through the Gemini API.
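The caching behavior can be sketched as a small wrapper around each step: reuse the output file if it exists, otherwise run the step and save the result. This is one simple way to get the behavior described, not necessarily how the project implements it; cached_step and its parameters are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def cached_step(output: Path, produce: Callable[[], str], force: bool = False) -> str:
    """Return the cached text at `output`, or run `produce` and cache it."""
    if output.exists() and not force:
        # A previous run already produced this step's output, so reuse it.
        return output.read_text(encoding="utf-8")
    result = produce()
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(result, encoding="utf-8")
    return result
```

A --force flag would then simply pass force=True to every step, so each output file is regenerated and overwritten.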
Why not just use a transcription service?
The key difference is that this isn't just transcription. A raw transcript of someone speaking extemporaneously is basically unreadable as a blog post. The magic is in the combination: clean up the audio first so the transcription is accurate, then lightly edit the transcript to remove verbal tics while preserving voice, and finally restructure the content into a proper blog format. Three steps, each doing one thing well.
Check it out on GitHub: Voice-Blog-Creator