Fine-tuning Whisper on Modal's serverless GPUs
A script for fine-tuning OpenAI's Whisper speech recognition models using Modal's serverless GPU infrastructure.
If you've spent any time fine-tuning Whisper models, you know the pain: spinning up a GPU instance, wrestling with CUDA drivers, babysitting a training run, and then realizing you forgot to set up checkpointing so when the kernel panics at 3 AM you lose everything. I got tired of that cycle and built a script that runs the entire Whisper fine-tuning process on Modal, a serverless GPU platform that eliminates basically all of that infrastructure overhead.
The motivation: GPU access without the headache
I run an AMD GPU locally (Radeon RX 7700 XT), which is great for inference with Vulkan-accelerated whisper.cpp but essentially useless for training with the PyTorch/HuggingFace ecosystem that assumes CUDA everywhere. Renting cloud GPUs through the traditional providers means committing to an instance, paying by the hour whether you're actively training or debugging a config issue, and dealing with setup each time. Modal flips that model: you define your environment as code, and they spin up an A100 only when your function actually runs. You pay for GPU-seconds, not GPU-hours-while-you-read-Stack-Overflow.
How the script works
The script supports all five Whisper model variants: tiny (39M params), base (74M), small (244M), medium (769M), and large-v3-turbo (809M). Each variant runs as an isolated Modal app with its own caching volume, so you can train multiple variants simultaneously without them stepping on each other. The architecture is clean: one Python file, one dataclass per model config, separate Modal apps so you can deploy and trigger them independently.
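The one-dataclass-per-variant pattern can be sketched roughly like this (field and class names here are illustrative assumptions, not the script's actual API; the parameter counts come from the text above):

```python
from dataclasses import dataclass

# Hypothetical sketch of the per-variant config described above.
@dataclass(frozen=True)
class WhisperVariant:
    name: str             # short variant name used in app/volume names
    hf_model: str         # checkpoint ID on the HuggingFace Hub
    params_millions: int  # approximate parameter count

VARIANTS = {
    "tiny":   WhisperVariant("tiny",   "openai/whisper-tiny",           39),
    "base":   WhisperVariant("base",   "openai/whisper-base",           74),
    "small":  WhisperVariant("small",  "openai/whisper-small",          244),
    "medium": WhisperVariant("medium", "openai/whisper-medium",         769),
    "large":  WhisperVariant("large",  "openai/whisper-large-v3-turbo", 809),
}
```

Because each variant carries its own config, each can be wired to its own Modal app and caching volume, which is what allows parallel training runs without conflicts.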
The environment is defined declaratively using Modal's image builder. The container starts from a Debian slim base with Python 3.12, installs ffmpeg for audio handling, and then pip-installs the full HuggingFace training stack: torch, torchaudio, transformers, datasets, accelerate, evaluate, jiwer, soundfile, librosa, and tensorboard. Because the environment is rebuilt from the same definition on every run, you sidestep the usual "works on my machine" issues with training pipelines.
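In Modal's image-builder API, that environment definition looks roughly like this (the package list is taken from the text; the app name and any version pins are assumptions):

```python
import modal

# Declarative image: Debian slim + Python 3.12, ffmpeg for audio decoding,
# and the HuggingFace training stack listed above.
image = (
    modal.Image.debian_slim(python_version="3.12")
    .apt_install("ffmpeg")
    .pip_install(
        "torch", "torchaudio", "transformers", "datasets",
        "accelerate", "evaluate", "jiwer", "soundfile",
        "librosa", "tensorboard",
    )
)

# Hypothetical app name; the script uses one app per model variant.
app = modal.App("whisper-finetune-small", image=image)
```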
The training pipeline
The pipeline handles the full lifecycle: loading your dataset from HuggingFace Hub, resampling audio to 16kHz if needed, automatically creating a 90/10 train/eval split, converting audio to mel spectrograms via the Whisper feature extractor, and then running Seq2Seq training with evaluation every 50 steps. Checkpoints save to a persistent Modal volume every 100 steps, so if something goes wrong you don't lose everything. When training finishes, the model gets pushed directly to your HuggingFace Hub repo.
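The training cadence described above maps onto HuggingFace's Seq2Seq training arguments roughly as follows (a sketch, not the script verbatim; the output directory and Hub settings are placeholders, and the defaults match the ones listed later in this post):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the training config: evaluate every 50 steps, checkpoint every
# 100 steps to a persistent volume, push the final model to the Hub.
training_args = Seq2SeqTrainingArguments(
    output_dir="/checkpoints",        # assumed mount path of the Modal volume
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    max_steps=250,
    eval_strategy="steps",
    eval_steps=50,
    save_steps=100,
    predict_with_generate=True,       # generate text at eval time (needed for WER)
    push_to_hub=True,                 # upload the finished model to your Hub repo
    report_to=["tensorboard"],
)
```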
Your dataset needs to be in parquet format on HuggingFace with an audio column (WAV files at 16kHz) and a text or sentence column for transcriptions. The minimum is 10 samples, though I'd recommend at least 100 for meaningful results. I ran my fine-tuning experiments with about 90 minutes of audio, which was enough to see real differences in output quality.
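A pre-flight check for those requirements is cheap to write before burning GPU-seconds. This sketch validates rows shaped like HuggingFace audio datasets (the column names and 10-sample minimum come from the text; the helper itself is hypothetical):

```python
# Minimal dataset validation mirroring the requirements above: a "text" or
# "sentence" transcription column, 16 kHz audio, and at least 10 samples.
MIN_SAMPLES = 10
TARGET_SR = 16_000

def validate_dataset(rows: list[dict]) -> list[str]:
    """Return a list of problems; empty list means the dataset looks usable."""
    problems = []
    if len(rows) < MIN_SAMPLES:
        problems.append(f"need >= {MIN_SAMPLES} samples, got {len(rows)}")
    for i, row in enumerate(rows):
        if "text" not in row and "sentence" not in row:
            problems.append(f"row {i}: no 'text' or 'sentence' column")
        sr = row.get("audio", {}).get("sampling_rate")
        if sr != TARGET_SR:
            problems.append(f"row {i}: sampling rate {sr}, will be resampled to {TARGET_SR}")
    return problems
```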
Running a training job
The actual usage is delightfully simple. After setting your dataset name and target HuggingFace repos in the config, you deploy with modal deploy finetune.py and then kick off training with commands like modal run finetune.py::main_large for the large-v3-turbo variant. Everything is configurable per-run: learning rate (default 1e-5), batch size (default 8), gradient accumulation steps (default 2), max training steps (default 250), and number of epochs. This makes it trivial to experiment with hyperparameters across model sizes without editing the script each time.
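The deploy-and-run flow above looks like this on the command line (the main_large entrypoint name is quoted in the text; the other entrypoint names and the override flags are assumptions based on the described config options):

```shell
# Build the image and register the per-variant apps with Modal
modal deploy finetune.py

# Kick off a training run for the large-v3-turbo variant
modal run finetune.py::main_large

# Hypothetical per-run overrides, assuming the entrypoint exposes these flags
modal run finetune.py::main_small --learning-rate 1e-5 --max-steps 250
```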
What it actually costs
This was the part that surprised me most. Running 250 training steps on an A100-40GB typically takes 30 to 90 minutes depending on model size, costing between $0.50 and $2.00 per run at Modal's rate of roughly $1.10/hour. Compare that to renting a comparable instance on a traditional cloud provider where you'd pay for setup time, idle time while debugging, and teardown. For iterative experimentation --- run a training job, evaluate, tweak parameters, run again --- the serverless model saves real money.
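The per-run cost math is simple enough to sanity-check: Modal bills by the GPU-second, so cost is just runtime times the hourly rate (using the ~$1.10/hour figure quoted above; the helper is illustrative):

```python
# Back-of-the-envelope run cost at the A100-40GB rate quoted above.
RATE_PER_HOUR = 1.10

def run_cost(minutes: float, rate_per_hour: float = RATE_PER_HOUR) -> float:
    """Estimated dollar cost of a training run of the given length."""
    return round(minutes / 60 * rate_per_hour, 2)

print(run_cost(30))  # 0.55 -- a short run on a small variant
print(run_cost(90))  # 1.65 -- a long run on large-v3-turbo
```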
The smaller models (tiny, base) are particularly cheap to experiment with. You can run dozens of training jobs for the cost of a single hour on a reserved GPU instance. I used this to rapidly iterate on hyperparameters before committing to longer training runs on the larger variants.
Broader context: why fine-tuning Whisper matters
Whisper out of the box is remarkably good for general English transcription, but it struggles with domain-specific vocabulary, code-switching between languages, and speaker-specific pronunciation patterns. If you're transcribing medical dictation, legal proceedings, or --- in my case --- English peppered with Hebrew words, fine-tuning on even a modest amount of representative data can meaningfully improve accuracy. The hard part has always been the infrastructure, not the concept. This script is my attempt to make the infrastructure part trivial.
I've also published the fine-tuned models that came out of this pipeline on HuggingFace, along with the evaluation dataset I used. If you want to see what fine-tuning actually does to accuracy across different model sizes, check out my companion project Whisper-Fine-Tune-Accuracy-Eval where I benchmark the results systematically.
The evaluation project lives at danielrosehill/Whisper-Fine-Tune-Accuracy-Eval on GitHub. The fine-tuning script itself is open source under MIT: if you're looking to fine-tune Whisper without the headache of GPU management, give danielrosehill/Modal-Whisper-Finetune-Script a look on GitHub.