Voice Blog Creator: turning voice recordings into polished blog posts with Gemini

I do a lot of thinking out loud. Sometimes the best way to work through an idea is to just talk it out, record it, and figure out the structure later. The problem is that "later" often means never — raw voice recordings pile up and never get turned into anything useful. Voice Blog Creator is my solution to that problem: an automated pipeline that takes a raw audio recording and produces a formatted, publish-ready blog post.

The three-step pipeline

The workflow has three distinct stages, each optimized for its specific purpose:

Step 1: Audio preprocessing. The raw recording gets cleaned up for optimal speech-to-text performance. This means converting to mono, removing silence while keeping natural pauses, reducing background noise, normalizing audio levels, applying dynamic range compression, and downsampling to 16 kHz. ffmpeg handles all of it under the hood.
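To make the preprocessing stage concrete, here is a minimal sketch of how such an ffmpeg call could be assembled from Python. The filter chain and every threshold in it are illustrative assumptions, not the settings the project actually uses, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Illustrative filter chain; all values are assumptions, tune them for your recordings.
AUDIO_FILTERS = [
    "highpass=f=80",                      # cut low-frequency rumble
    "afftdn=nf=-25",                      # FFT-based noise reduction
    "silenceremove=stop_periods=-1:stop_duration=1.5:stop_threshold=-45dB",  # trim long silent stretches
    "acompressor=threshold=0.1:ratio=3",  # dynamic range compression
    "loudnorm=I=-16:TP=-1.5:LRA=11",      # loudness normalization
]


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Build the ffmpeg argument list without running it."""
    return [
        "ffmpeg", "-y",                   # overwrite the output file if it exists
        "-i", str(src),
        "-af", ",".join(AUDIO_FILTERS),   # apply the audio filters in order
        "-ac", "1",                       # mono output
        "-ar", "16000",                   # 16 kHz sample rate
        str(dst),
    ]


def preprocess(src: Path, dst: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Keeping command construction separate from execution makes it easy to print and inspect the exact ffmpeg invocation before running it on a long recording.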

Step 2: Transcription. The processed audio goes to Gemini 2.5 Flash for transcription with light editing. The model removes filler words (um, uh, like, you know), breaks the text into paragraphs at topic changes, and adds proper spacing, all while preserving the original meaning and the speaker's voice.
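The transcription rules above map naturally onto a prompt. The sketch below assumes the google-genai Python SDK and an API key in the environment; the prompt wording and the function names are hypothetical, and the project may well call the API differently.

```python
from pathlib import Path

FILLER_WORDS = ["um", "uh", "like", "you know"]


def build_transcription_prompt(filler_words=FILLER_WORDS) -> str:
    """Turn the light-editing rules into instructions for the model."""
    fillers = ", ".join(filler_words)
    return (
        "Transcribe this audio recording.\n"
        f"- Remove filler words ({fillers}).\n"
        "- Start a new paragraph whenever the topic changes.\n"
        "- Separate paragraphs with a blank line.\n"
        "- Preserve the original meaning and the speaker's voice; do not summarize.\n"
    )


def transcribe(audio_path: Path, model: str = "gemini-2.5-flash") -> str:
    """Upload the audio and ask the model for a lightly edited transcript."""
    # Imported here so the prompt helper works without the SDK installed.
    from google import genai  # assumption: the google-genai package

    client = genai.Client()  # picks up the API key from the environment
    uploaded = client.files.upload(file=str(audio_path))
    response = client.models.generate_content(
        model=model,
        contents=[build_transcription_prompt(), uploaded],
    )
    return response.text
```

Spelling out "do not summarize" matters here: without it, a model asked to clean up speech may compress it, which would lose the speaker's voice that this step is meant to keep.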

Step 3: Blog post generation. The cleaned transcript gets transformed into a formatted blog post with a compelling title, introduction, subheadings, conclusion, and proper markdown formatting. This step also runs on Gemini 2.5 Flash.
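As a rough illustration of this stage, the sketch below builds a generation prompt from the transcript and runs a cheap structural check on whatever markdown comes back. Both functions and the prompt text are hypothetical; the check only looks for a title and at least one subheading, and says nothing about the quality of the writing.

```python
def build_blog_prompt(transcript: str) -> str:
    """Ask for the blog structure described above: title, intro, subheadings, conclusion."""
    return (
        "Turn the transcript below into a blog post in markdown.\n"
        "Include a compelling title as a level-1 heading, a short introduction, "
        "subheadings as level-2 headings, and a conclusion.\n"
        "Keep the author's voice.\n\n"
        f"Transcript:\n{transcript.strip()}\n"
    )


def has_expected_structure(markdown: str) -> bool:
    """Return True if the text has a level-1 title and at least one level-2 subheading."""
    lines = [line.strip() for line in markdown.splitlines()]
    has_title = any(line.startswith("# ") for line in lines)
    has_subheading = any(line.startswith("## ") for line in lines)
    return has_title and has_subheading
```

A check like this is useful mainly as a guard before caching or publishing: if the model returns plain prose with no headings, the step can be retried instead of saving a malformed post.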

Smart caching and flexibility

The pipeline is designed to be practical. Each step caches its output, so if you've already preprocessed the audio and just want to regenerate the blog post with different settings, the pipeline skips the earlier steps. You can also run individual steps independently, or force regeneration of everything with the --force flag. The cost is minimal: roughly $0.01 to $0.05 per hour of audio through the Gemini API.
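The caching behaviour can be sketched with a small helper that treats each step's output file as its cache. This is an assumption about how such caching might work, not the project's actual code; run_step and its parameters are hypothetical names, and the force argument stands in for the --force flag.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_step(output: Path, produce: Callable[[], str], force: bool = False) -> str:
    """Return the cached output if present, otherwise run the step and cache it."""
    if output.exists() and not force:
        return output.read_text(encoding="utf-8")  # cache hit: skip the work
    result = produce()
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(result, encoding="utf-8")
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "transcript.txt"
        print(run_step(cache, lambda: "first run"))               # runs and caches
        print(run_step(cache, lambda: "second run"))              # prints "first run" (cached)
        print(run_step(cache, lambda: "forced run", force=True))  # reruns and overwrites
```

File existence is the simplest possible cache key. It does not notice when the input or the settings change, which is why a force option is needed to regenerate on demand.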

Why not just use a transcription service?

The key difference is that this isn't just transcription. A raw transcript of someone speaking extemporaneously is basically unreadable as a blog post. The magic is in the combination: clean up the audio first so the transcription is accurate, then lightly edit the transcript to remove verbal tics while preserving voice, and finally restructure the content into a proper blog format. Three steps, each doing one thing well.

Check it out on GitHub: Voice-Blog-Creator

danielrosehill/Voice-Blog-Creator: create a blog (or other doc) from a voice recording. Python, updated Oct 2025. Topics: ai, ai-agents, gemini.