Whisper Large V3: Production-Ready Speech Recognition
OpenAI's Whisper Large V3 delivers near-human transcription accuracy for 100+ languages. But running it at production scale requires careful optimization. This guide shows you how to deploy Whisper for 1,000+ hours of audio per day efficiently.
💡 What you'll build
A production Whisper system capable of processing 50+ concurrent transcription requests with automatic language detection, timestamps, and speaker diarization.
Hardware Requirements
Model Sizes and GPU Memory
| Model | VRAM | Time to transcribe 1 hr of audio | WER |
|---|---|---|---|
| tiny | 1GB | ~2 min | ~14% |
| base | 1GB | ~3 min | ~11% |
| small | 2GB | ~5 min | ~7% |
| medium | 5GB | ~10 min | ~5% |
| large-v3 | 10GB | ~15 min | ~3% |
For production, large-v3 provides the best accuracy. You'll want at least 16GB VRAM for comfortable batch processing.
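The table above can be read as a simple lookup. The sketch below is illustrative only: `MODEL_VRAM_GB` copies the approximate VRAM figures from the table, while `pick_model` and its `headroom_gb` parameter are hypothetical names that are not part of Whisper.

```python
# Approximate VRAM needs (GB) copied from the table above, largest model first.
MODEL_VRAM_GB = [
    ("large-v3", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model(available_vram_gb: float, headroom_gb: float = 0.0) -> str:
    """Return the largest model that fits while leaving `headroom_gb` free."""
    budget = available_vram_gb - headroom_gb
    for name, needed in MODEL_VRAM_GB:
        if needed <= budget:
            return name
    raise ValueError(f"No Whisper model fits in {available_vram_gb} GB of VRAM")
```

For example, `pick_model(16, headroom_gb=6)` still selects `large-v3`, which is one way to read the 16GB recommendation: about 10GB for the weights plus room for batching.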
Recommended GPUs on GPUBrazil
- RTX 4090 (24GB): $0.40/hr, best for startups
- L40S (48GB): $0.79/hr, higher throughput
- A100 80GB: $1.20/hr, maximum batch size
Basic Setup
First, spin up a GPU instance on GPUBrazil and install dependencies:
# Install system dependencies
apt update && apt install -y ffmpeg
# Install Python packages
pip install openai-whisper torch torchaudio
# For faster inference
pip install faster-whisper
Basic Transcription
import whisper
# Load model
model = whisper.load_model("large-v3")
# Transcribe
result = model.transcribe("audio.mp3")
print(result["text"])
print(result["segments"]) # With timestamps
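Each entry in `result["segments"]` is a dict carrying `start`, `end`, and `text`, which is enough to produce subtitles. As a minimal sketch, the helpers below (`format_timestamp` and `segments_to_srt` are hypothetical names, not Whisper functions) turn such a list into SRT text.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn a list of {"start", "end", "text"} dicts into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)
```

With the transcription above, `segments_to_srt(result["segments"])` would give a string that can be written straight to a `.srt` file.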
Faster-Whisper: 4x Speed Boost
The faster-whisper library reimplements Whisper on the CTranslate2 inference engine and runs up to 4x faster than the reference implementation at comparable accuracy:
from faster_whisper import WhisperModel

# Load with int8 quantization for speed
model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="int8_float16"  # Best speed/accuracy
)

def transcribe_audio(audio_path):
    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        language=None,  # Auto-detect
        word_timestamps=True,
        vad_filter=True  # Skip silence
    )
    return {
        "language": info.language,
        "duration": info.duration,
        "segments": [
            {
                "start": seg.start,
                "end": seg.end,
                "text": seg.text,
                "words": [
                    {"word": w.word, "start": w.start, "end": w.end}
                    for w in (seg.words or [])
                ]
            }
            for seg in segments
        ]
    }

# Usage
result = transcribe_audio("podcast.mp3")
print(f"Language: {result['language']}")
print(f"Duration: {result['duration']:.1f}s")
Performance Comparison
Faster-whisper processes 1 hour of audio in ~4 minutes on RTX 4090, vs ~15 minutes with standard Whisper. That's 15 hours of audio per GPU-hour!
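The throughput figure follows directly from the processing time, and the same arithmetic gives a rough GPU count for a daily target. This is a back-of-the-envelope sketch that assumes GPUs run 24 hours a day at full utilization; the function names are hypothetical.

```python
import math


def audio_hours_per_gpu_hour(minutes_per_audio_hour: float) -> float:
    """Hours of audio one GPU transcribes per wall-clock hour."""
    return 60.0 / minutes_per_audio_hour


def gpus_needed(audio_hours_per_day: float, minutes_per_audio_hour: float) -> int:
    """GPUs required to keep up, assuming 24 h/day at full utilization."""
    daily_capacity = audio_hours_per_gpu_hour(minutes_per_audio_hour) * 24
    return math.ceil(audio_hours_per_day / daily_capacity)
```

At ~4 minutes per audio hour, one GPU handles 15 × 24 = 360 audio hours a day, so 1,000 hours a day needs three GPUs under these assumptions. At ~15 minutes per audio hour the same target would need eleven.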
Production API with FastAPI
from fastapi import FastAPI, UploadFile, BackgroundTasks
from faster_whisper import WhisperModel
import aiofiles
import uuid
import os

app = FastAPI()

# Load the model once at import time so every request reuses it
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

# In-memory job store (results are lost on restart)
results = {}

def process_audio(job_id: str, file_path: str):
    """Background transcription task.

    Declared with plain `def` so FastAPI runs it in a worker thread
    instead of blocking the event loop while the GPU is busy.
    """
    try:
        segments, info = model.transcribe(
            file_path,
            beam_size=5,
            vad_filter=True,
            word_timestamps=True
        )
        # `segments` is a generator: materialize it once, because it is
        # read twice below and a second pass would otherwise be empty
        segments = list(segments)
        results[job_id] = {
            "status": "completed",
            "language": info.language,
            "duration": info.duration,
            "text": " ".join(seg.text for seg in segments),
            "segments": [
                {"start": s.start, "end": s.end, "text": s.text}
                for s in segments
            ]
        }
    except Exception as e:
        results[job_id] = {"status": "error", "error": str(e)}
    finally:
        os.remove(file_path)  # Cleanup

@app.post("/transcribe")
async def transcribe(file: UploadFile, background_tasks: BackgroundTasks):
    """Upload audio and start transcription"""
    job_id = str(uuid.uuid4())
    # Save uploaded file; basename() drops any directory parts in the client-supplied name
    safe_name = os.path.basename(file.filename or "audio")
    file_path = f"/tmp/{job_id}_{safe_name}"
    async with aiofiles.open(file_path, 'wb') as f:
        content = await file.read()
        await f.write(content)
    # Queue transcription
    results[job_id] = {"status": "processing"}
    background_tasks.add_task(process_audio, job_id, file_path)
    return {"job_id": job_id, "status": "processing"}

@app.get("/result/{job_id}")
async def get_result(job_id: str):
    """Get transcription result"""
    return results.get(job_id, {"status": "not_found"})
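Because the API is asynchronous, a client has to poll `/result/{job_id}` until the status is no longer `"processing"`. The sketch below shows one way to do that. `wait_for_result` is a hypothetical helper, and it takes a `fetch` callable so that it makes no assumption about which HTTP client you use.

```python
import time
from typing import Callable, Dict


def wait_for_result(
    fetch: Callable[[str], Dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
) -> Dict:
    """Poll `fetch(job_id)` until the job is no longer "processing"."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch(job_id)
        if result.get("status") != "processing":
            return result  # "completed", "error", or "not_found"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still processing after {timeout_s}s")
        time.sleep(interval_s)
```

With the `requests` library, `fetch` could be `lambda job_id: requests.get(f"{base_url}/result/{job_id}").json()`, where `base_url` stands for wherever you deployed the service.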
Batch Processing Pipeline
To process thousands of files, push jobs onto a Redis-backed queue (RQ in this example) and run the workers as separate processes:
import os
import uuid

import requests
from redis import Redis
from rq import Queue
from faster_whisper import WhisperModel

# Redis connection
redis_conn = Redis(host='localhost', port=6379)
queue = Queue('transcription', connection=redis_conn)

# Worker function (executed by a separate `rq worker transcription` process)
def transcribe_job(audio_url, callback_url):
    """Worker function for batch processing"""
    # Download audio
    response = requests.get(audio_url, timeout=60)
    response.raise_for_status()
    temp_path = f"/tmp/{uuid.uuid4()}.mp3"
    with open(temp_path, 'wb') as f:
        f.write(response.content)
    try:
        # Transcribe
        model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
        segments, info = model.transcribe(temp_path, vad_filter=True)
        result = {
            "audio_url": audio_url,
            "language": info.language,
            "text": " ".join(s.text for s in segments)
        }
    finally:
        os.remove(temp_path)  # Remove the download even if transcription fails
    # Callback with result
    requests.post(callback_url, json=result, timeout=30)
    return result

# Queue jobs
def queue_batch(audio_urls, callback_url):
    """Queue multiple transcriptions"""
    jobs = []
    for url in audio_urls:
        job = queue.enqueue(transcribe_job, url, callback_url)
        jobs.append(job.id)
    return jobs
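The worker function above builds a `WhisperModel` inside every job, so each job pays the model load time again. One possible refinement, sketched here under the assumption that your worker process stays alive between jobs, is to cache the model per process. `load_whisper` and `get_model` are hypothetical helper names.

```python
from functools import lru_cache


def load_whisper(name: str):
    """Hypothetical loader: builds a faster-whisper model on the GPU."""
    # Imported lazily so only the worker process needs the GPU stack.
    from faster_whisper import WhisperModel

    return WhisperModel(name, device="cuda", compute_type="int8_float16")


@lru_cache(maxsize=1)
def get_model(name: str = "large-v3", loader=load_whisper):
    """Return a cached model; `loader` runs only on the first call with these arguments."""
    return loader(name)
```

Inside the job you would then call `get_model("large-v3")` instead of constructing the model directly. Whether the cache survives between jobs depends on how your worker executes them, so measure before relying on it.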
Adding Speaker Diarization
Identify who speaks when using pyannote.audio:
from pyannote.audio import Pipeline
from faster_whisper import WhisperModel
import torch

# Initialize models
whisper_model = WhisperModel("large-v3", device="cuda")
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
diarization.to(torch.device("cuda"))

def transcribe_with_speakers(audio_path):
    """Transcribe with speaker identification"""
    # Get speaker segments
    diarization_result = diarization(audio_path)
    # Get transcription
    segments, _ = whisper_model.transcribe(audio_path, word_timestamps=True)
    segments = list(segments)
    # Merge speaker info with transcription
    result = []
    for seg in segments:
        # Find speaker at segment midpoint
        midpoint = (seg.start + seg.end) / 2
        speaker = "UNKNOWN"
        for turn, _, spk in diarization_result.itertracks(yield_label=True):
            if turn.start <= midpoint <= turn.end:
                speaker = spk
                break
        result.append({
            "speaker": speaker,
            "start": seg.start,
            "end": seg.end,
            "text": seg.text
        })
    return result

# Usage
transcript = transcribe_with_speakers("meeting.mp3")
for seg in transcript:
    print(f"[{seg['speaker']}] {seg['start']:.1f}s: {seg['text']}")
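The midpoint rule labels a segment `UNKNOWN` whenever its midpoint falls in a gap between speaker turns. An alternative, which is not what the code above does, is to pick the speaker whose turns overlap the segment the longest. In this sketch, `speaker_by_overlap` is a hypothetical helper that takes turns as plain `(start, end, label)` tuples.

```python
from typing import Iterable, Tuple

Turn = Tuple[float, float, str]  # (start, end, speaker label)


def speaker_by_overlap(seg_start: float, seg_end: float, turns: Iterable[Turn]) -> str:
    """Return the speaker whose turns overlap the segment the longest."""
    totals = {}
    for start, end, speaker in turns:
        overlap = min(seg_end, end) - max(seg_start, start)
        if overlap > 0:
            totals[speaker] = totals.get(speaker, 0.0) + overlap
    if not totals:
        return "UNKNOWN"
    return max(totals, key=totals.get)
```

The tuples can be built from the diarization output with `[(turn.start, turn.end, spk) for turn, _, spk in diarization_result.itertracks(yield_label=True)]`.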
Real-Time Streaming Transcription
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
import queue
import threading

model = WhisperModel("small", device="cuda")  # Use smaller for realtime
audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    """Capture audio chunks"""
    audio_queue.put(indata.copy())

def transcribe_stream():
    """Process audio stream in realtime"""
    buffer = np.array([], dtype=np.float32)
    while True:
        # Accumulate audio
        chunk = audio_queue.get()
        buffer = np.append(buffer, chunk.flatten())
        # Process every 5 seconds
        if len(buffer) >= 16000 * 5:  # 5 seconds at 16kHz
            # Transcribe buffer
            segments, _ = model.transcribe(
                buffer,
                language="en",
                beam_size=1,
                best_of=1
            )
            for seg in segments:
                print(seg.text, end=" ", flush=True)
            # Keep last 0.5s for context
            buffer = buffer[-8000:]

# Start streaming
with sd.InputStream(samplerate=16000, channels=1, callback=audio_callback):
    transcribe_stream()
⚠️ Streaming Latency
Real-time streaming has ~2-5 second latency. For production live captions, consider a smaller model (small/medium) and tune buffer sizes.
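To experiment with buffer sizes offline, it can help to see how a window length and overlap in seconds map to sample counts. The sketch below approximates the buffering in the streaming loop above on a pre-recorded sample array. `iter_windows` is a hypothetical helper, and the defaults mirror the 16 kHz rate used above.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000  # Hz, matching the InputStream above


def iter_windows(
    samples: Sequence[float],
    window_s: float,
    overlap_s: float,
    rate: int = SAMPLE_RATE,
) -> Iterator[Sequence[float]]:
    """Yield full windows of `window_s` seconds that share `overlap_s` with the previous one."""
    window = int(window_s * rate)
    overlap = int(overlap_s * rate)
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap_s must be smaller than window_s")
    # Trailing samples that do not fill a whole window are left out.
    for start in range(0, len(samples) - window + 1, step):
        yield samples[start:start + window]
```

With `window_s=5` and `overlap_s=0.5` this yields 80,000-sample windows that overlap by 8,000 samples, the same sizes as the loop above. Shrinking the window lowers latency but gives the model less context per call.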
Cost Optimization
Smart GPU Usage
- Batch similar-length audio: keeps padding waste to a minimum
- Use VAD filtering: skips silence automatically
- INT8 quantization: roughly 2x faster with near-identical accuracy
- Scale to zero: Shut down when queue is empty
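The first tip, batching similar-length audio, comes down to sorting files by duration before cutting them into batches. A minimal sketch follows, assuming you already know each file's duration in seconds; `bucket_by_duration` is a hypothetical helper name.

```python
from typing import List, Tuple


def bucket_by_duration(files: List[Tuple[str, float]], batch_size: int) -> List[List[str]]:
    """Sort (path, duration_seconds) pairs by duration and cut them into batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = sorted(files, key=lambda item: item[1])
    return [
        [path for path, _ in ordered[i:i + batch_size]]
        for i in range(0, len(ordered), batch_size)
    ]
```

Short clips end up together and long recordings end up together, so no batch is dominated by one long file.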
Cost per Hour of Audio
Using RTX 4090 on GPUBrazil ($0.40/hr) with faster-whisper:
- Processing speed: ~15 hours audio per GPU-hour
- Cost: $0.027 per hour of audio
Compare to API services:
- OpenAI Whisper API: $0.36/hour
- Google Speech-to-Text: $1.44/hour
- AWS Transcribe: $1.44/hour
At full GPU utilization, that makes self-hosting on GPUBrazil roughly 13-54x cheaper for high-volume transcription.
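The comparison can be reproduced from the figures quoted in this section. The sketch assumes the GPU is busy the whole time it is billed; the constant and function names are illustrative.

```python
GPU_PRICE_PER_HOUR = 0.40      # RTX 4090 price quoted above, USD
AUDIO_HOURS_PER_GPU_HOUR = 15  # faster-whisper throughput quoted above

# API prices per hour of audio, as listed above (USD)
API_PRICE_PER_AUDIO_HOUR = {
    "OpenAI Whisper API": 0.36,
    "Google Speech-to-Text": 1.44,
    "AWS Transcribe": 1.44,
}


def self_hosted_cost(gpu_price: float, audio_hours_per_gpu_hour: float) -> float:
    """Cost in USD to transcribe one hour of audio on a fully utilized GPU."""
    return gpu_price / audio_hours_per_gpu_hour


def savings_ratio(api_price: float, self_hosted_price: float) -> float:
    """How many times cheaper self-hosting is than the API price."""
    return api_price / self_hosted_price


cost = self_hosted_cost(GPU_PRICE_PER_HOUR, AUDIO_HOURS_PER_GPU_HOUR)
for service, price in API_PRICE_PER_AUDIO_HOUR.items():
    print(f"{service}: {savings_ratio(price, cost):.1f}x")
```

Idle GPU time raises the effective cost per audio hour, which is why scaling to zero matters for the comparison.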
Process 1,000+ Hours Daily
Deploy your Whisper pipeline on GPUBrazil. RTX 4090s from $0.40/hr.
Production Checklist
- ✅ Use faster-whisper for 4x speedup
- ✅ Enable VAD filter to skip silence
- ✅ INT8 quantization for best performance
- ✅ Queue system (Redis/RQ) for batch jobs
- ✅ Auto-scaling based on queue depth
- ✅ Health checks and monitoring
- ✅ Result storage (S3/database)
- ✅ Webhook callbacks for async jobs
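For the auto-scaling item, one simple policy is to size the worker pool from the amount of queued audio. The sketch below is an assumption-laden illustration, not a GPUBrazil or RQ feature: `workers_needed`, the drain target, and the worker cap are all hypothetical, and the default throughput is the ~15 audio hours per GPU-hour quoted earlier.

```python
import math


def workers_needed(
    queued_audio_hours: float,
    target_drain_hours: float,
    audio_hours_per_gpu_hour: float = 15.0,
    max_workers: int = 10,
) -> int:
    """GPU workers required to drain the queue within the target time (0 when empty)."""
    if queued_audio_hours <= 0:
        return 0  # scale to zero when there is nothing to do
    capacity_per_worker = audio_hours_per_gpu_hour * target_drain_hours
    needed = math.ceil(queued_audio_hours / capacity_per_worker)
    return min(max(needed, 1), max_workers)
```

A scheduler could call this every minute with the current backlog and start or stop GPU instances to match the returned count.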
Conclusion
Running Whisper at scale is entirely achievable with the right setup. The combination of faster-whisper optimization, proper batching, and affordable GPU compute makes it possible to transcribe thousands of hours daily at a fraction of API costs.
Start with a single RTX 4090 on GPUBrazil, test your pipeline, and scale horizontally as needed.