Whisper Large V3: Production-Ready Speech Recognition

OpenAI's Whisper Large V3 delivers near-human transcription accuracy for 100+ languages. But running it at production scale requires careful optimization. This guide shows you how to deploy Whisper for 1,000+ hours of audio per day efficiently.

💡 What you'll build

A production Whisper system capable of processing 50+ concurrent transcription requests with automatic language detection, timestamps, and speaker diarization.

Hardware Requirements

Model Sizes and GPU Memory

Model      VRAM     Time per 1 hr of audio   WER
tiny       1 GB     ~2 min                   ~14%
base       1 GB     ~3 min                   ~11%
small      2 GB     ~5 min                   ~7%
medium     5 GB     ~10 min                  ~5%
large-v3   10 GB    ~15 min                  ~3%

For production, large-v3 provides the best accuracy. You'll want at least 16GB VRAM for comfortable batch processing.
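As a rough illustration of how the table can drive model selection, here is a minimal sketch that picks the largest model fitting a given VRAM budget. The function name `pick_model` and the `headroom_gb` parameter are hypothetical; the VRAM figures are the approximate ones from the table, not measured values.

```python
# Approximate VRAM needs (GB) copied from the table above,
# listed from smallest to largest model.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3": 10,
}


def pick_model(available_vram_gb: float, headroom_gb: float = 0.0) -> str:
    """Return the largest model whose VRAM need fits the budget.

    headroom_gb reserves memory for batching and other processes.
    """
    budget = available_vram_gb - headroom_gb
    fitting = [name for name, need in MODEL_VRAM_GB.items() if need <= budget]
    if not fitting:
        raise ValueError(f"No model fits in {budget:.1f} GB of VRAM")
    # Dicts keep insertion order, so the last fitting entry is the largest.
    return fitting[-1]


if __name__ == "__main__":
    print(pick_model(16, headroom_gb=6))  # 16 GB card, 6 GB kept free
    print(pick_model(8))
```

Reserving headroom reflects the advice above: a 16 GB card leaves about 6 GB spare once large-v3 is loaded.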

Recommended GPUs on GPUBrazil

Basic Setup

First, spin up a GPU instance on GPUBrazil and install dependencies:

# Install system dependencies
apt update && apt install -y ffmpeg

# Install Python packages
pip install openai-whisper torch torchaudio

# For faster inference
pip install faster-whisper

Basic Transcription

import whisper

# Load model
model = whisper.load_model("large-v3")

# Transcribe
result = model.transcribe("audio.mp3")

print(result["text"])
print(result["segments"])  # With timestamps

Faster-Whisper: 4x Speed Boost

The faster-whisper library reimplements Whisper on the CTranslate2 inference engine, giving up to 4x faster inference at comparable accuracy:

from faster_whisper import WhisperModel

# Load with int8 quantization for speed
model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="int8_float16"  # Best speed/accuracy
)

def transcribe_audio(audio_path):
    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        language=None,  # Auto-detect
        word_timestamps=True,
        vad_filter=True  # Skip silence
    )
    
    return {
        "language": info.language,
        "duration": info.duration,
        "segments": [
            {
                "start": seg.start,
                "end": seg.end,
                "text": seg.text,
                "words": [
                    {"word": w.word, "start": w.start, "end": w.end}
                    for w in (seg.words or [])
                ]
            }
            for seg in segments
        ]
    }

# Usage
result = transcribe_audio("podcast.mp3")
print(f"Language: {result['language']}")
print(f"Duration: {result['duration']:.1f}s")
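Since the result already carries per-segment timestamps, one common next step is exporting subtitles. The sketch below converts a list of segment dicts (with the `start`, `end`, and `text` keys returned by `transcribe_audio`) into SRT text. The helper names `format_srt_time` and `segments_to_srt` are illustrative, not part of faster-whisper.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts with 'start', 'end' and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        # Each cue: sequence number, time range, text, trailing newline.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 1.25, "text": " Hello"},
        {"start": 1.25, "end": 3.0, "text": " and welcome."},
    ]
    print(segments_to_srt(demo))
```

With the function above, `segments_to_srt(result["segments"])` would produce a subtitle file body for `podcast.mp3`.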

🚀 Performance Comparison

Faster-whisper processes 1 hour of audio in ~4 minutes on RTX 4090, vs ~15 minutes with standard Whisper. That's 15 hours of audio per GPU-hour!
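To connect that throughput to the 1,000 hours/day target, the arithmetic can be sketched as follows. The `utilization` parameter is an assumption added here to model idle time, downloads, and queue gaps; the 4-minute and 15-minute figures are the ones quoted above.

```python
import math


def gpus_needed(
    audio_hours_per_day: float,
    minutes_per_audio_hour: float = 4.0,
    utilization: float = 1.0,
) -> int:
    """Estimate how many GPUs are needed to keep up with a daily volume.

    minutes_per_audio_hour: GPU minutes spent per hour of audio.
    utilization: fraction of each day a GPU spends transcribing (assumed).
    """
    gpu_hours = audio_hours_per_day * minutes_per_audio_hour / 60.0
    available_per_gpu = 24.0 * utilization
    return math.ceil(gpu_hours / available_per_gpu)


if __name__ == "__main__":
    print(gpus_needed(1000))                             # faster-whisper, ideal
    print(gpus_needed(1000, utilization=0.7))            # with 30% idle time
    print(gpus_needed(1000, minutes_per_audio_hour=15))  # standard Whisper
```

At 4 minutes per audio hour, 1,000 hours of audio is about 67 GPU-hours, so three fully utilized RTX 4090s cover it, or four at 70% utilization.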

Production API with FastAPI

from fastapi import FastAPI, UploadFile, BackgroundTasks
from faster_whisper import WhisperModel
import aiofiles
import uuid
import os

app = FastAPI()

# Load model on startup
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

# Store results
results = {}

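def process_audio(job_id: str, file_path: str):
    """Background transcription task.

    Declared as a plain function so FastAPI runs it in a worker thread;
    an async def here would block the event loop during transcription.
    """
    try:
        segments, info = model.transcribe(
            file_path,
            beam_size=5,
            vad_filter=True,
            word_timestamps=True
        )

        # transcribe() returns a lazy generator that can be consumed only
        # once, so materialize it before building both text and segments.
        segments = list(segments)

        results[job_id] = {
            "status": "completed",
            "language": info.language,
            "duration": info.duration,
            "text": " ".join(seg.text for seg in segments),
            "segments": [
                {"start": s.start, "end": s.end, "text": s.text}
                for s in segments
            ]
        }
    except Exception as e:
        results[job_id] = {"status": "error", "error": str(e)}
    finally:
        os.remove(file_path)  # Cleanup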

@app.post("/transcribe")
async def transcribe(file: UploadFile, background_tasks: BackgroundTasks):
    """Upload audio and start transcription"""
    job_id = str(uuid.uuid4())
    
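    # Save uploaded file (basename only, so a crafted filename cannot escape /tmp)
    safe_name = os.path.basename(file.filename or "audio")
    file_path = f"/tmp/{job_id}_{safe_name}"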
    async with aiofiles.open(file_path, 'wb') as f:
        content = await file.read()
        await f.write(content)
    
    # Queue transcription
    results[job_id] = {"status": "processing"}
    background_tasks.add_task(process_audio, job_id, file_path)
    
    return {"job_id": job_id, "status": "processing"}

@app.get("/result/{job_id}")
async def get_result(job_id: str):
    """Get transcription result"""
    return results.get(job_id, {"status": "not_found"})
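One thing to watch in the API above: the module-level `results` dict grows without bound. A minimal sketch of a drop-in replacement with expiry is shown below. The `JobStore` class, its `ttl_seconds` default, and the injectable `clock` are all hypothetical. Like the dict, it lives in one process, so it assumes a single server process.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class JobStore:
    """In-memory job results that expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, dict]] = {}
        # Background tasks run in threads, so guard the dict with a lock.
        self._lock = threading.Lock()

    def set(self, job_id: str, value: dict) -> None:
        with self._lock:
            self._purge()
            self._items[job_id] = (self._clock(), value)

    def get(self, job_id: str) -> dict:
        with self._lock:
            self._purge()
            entry = self._items.get(job_id)
            return entry[1] if entry else {"status": "not_found"}

    def _purge(self) -> None:
        """Drop entries older than the TTL (caller holds the lock)."""
        cutoff = self._clock() - self._ttl
        expired = [k for k, (ts, _) in self._items.items() if ts < cutoff]
        for key in expired:
            del self._items[key]


if __name__ == "__main__":
    store = JobStore(ttl_seconds=3600)
    store.set("abc", {"status": "processing"})
    print(store.get("abc"))
    print(store.get("missing"))
```

In the endpoints, `results[job_id] = ...` would become `store.set(job_id, ...)` and `results.get(job_id, ...)` would become `store.get(job_id)`. Sharing results across several server processes would need an external store such as Redis.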

Batch Processing Pipeline

To process thousands of files, use a batch pipeline built on a Redis-backed job queue (RQ):

import os
import uuid

import requests
from redis import Redis
from rq import Queue
from faster_whisper import WhisperModel

# Redis connection
redis_conn = Redis(host='localhost', port=6379)
queue = Queue('transcription', connection=redis_conn)

# Worker function (executed by a separate `rq worker transcription` process)
def transcribe_job(audio_url, callback_url):
    """Worker function for batch processing"""
    # Download audio; fail fast on HTTP errors or a stalled connection
    response = requests.get(audio_url, timeout=60)
    response.raise_for_status()
    temp_path = f"/tmp/{uuid.uuid4()}.mp3"
    with open(temp_path, 'wb') as f:
        f.write(response.content)

    try:
        # Transcribe (note: the model is reloaded on every job, which adds startup cost)
        model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
        segments, info = model.transcribe(temp_path, vad_filter=True)

        result = {
            "audio_url": audio_url,
            "language": info.language,
            "text": " ".join(s.text for s in segments)
        }
    finally:
        # Remove the temp file even if transcription fails
        os.remove(temp_path)

    # Callback with result
    requests.post(callback_url, json=result, timeout=30)

    return result

# Queue jobs
def queue_batch(audio_urls, callback_url):
    """Queue multiple transcriptions"""
    jobs = []
    for url in audio_urls:
        job = queue.enqueue(transcribe_job, url, callback_url)
        jobs.append(job.id)
    return jobs

Adding Speaker Diarization

Identify who speaks when using pyannote.audio:

from pyannote.audio import Pipeline
from faster_whisper import WhisperModel
import torch

# Initialize models
whisper_model = WhisperModel("large-v3", device="cuda")
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
diarization.to(torch.device("cuda"))

def transcribe_with_speakers(audio_path):
    """Transcribe with speaker identification"""
    
    # Get speaker segments
    diarization_result = diarization(audio_path)
    
    # Get transcription
    segments, _ = whisper_model.transcribe(audio_path, word_timestamps=True)
    segments = list(segments)
    
    # Merge speaker info with transcription
    result = []
    for seg in segments:
        # Find speaker at segment midpoint
        midpoint = (seg.start + seg.end) / 2
        speaker = "UNKNOWN"
        
        for turn, _, spk in diarization_result.itertracks(yield_label=True):
            if turn.start <= midpoint <= turn.end:
                speaker = spk
                break
        
        result.append({
            "speaker": speaker,
            "start": seg.start,
            "end": seg.end,
            "text": seg.text
        })
    
    return result

# Usage
transcript = transcribe_with_speakers("meeting.mp3")
for seg in transcript:
    print(f"[{seg['speaker']}] {seg['start']:.1f}s: {seg['text']}")
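The midpoint lookup above returns "UNKNOWN" whenever the midpoint falls in a gap between speaker turns. One alternative, sketched here as a swapped-in technique rather than the method used above, is to pick the speaker with the most total overlap with the segment. The function `assign_speaker` and the plain `(start, end, speaker)` tuples are illustrative.

```python
from collections import defaultdict
from typing import Iterable, Tuple

Turn = Tuple[float, float, str]  # (start, end, speaker label)


def assign_speaker(seg_start: float, seg_end: float, turns: Iterable[Turn]) -> str:
    """Return the speaker whose turns overlap the segment the longest."""
    totals = defaultdict(float)
    for turn_start, turn_end, speaker in turns:
        # Length of the intersection of the two intervals (negative if disjoint).
        overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
        if overlap > 0:
            totals[speaker] += overlap
    if not totals:
        return "UNKNOWN"
    return max(totals, key=totals.get)


if __name__ == "__main__":
    turns = [(0.0, 2.0, "SPEAKER_00"), (3.0, 5.0, "SPEAKER_01")]
    # Midpoint 2.75 s falls in the gap between turns, but SPEAKER_01
    # overlaps the segment for 1.0 s versus 0.5 s for SPEAKER_00.
    print(assign_speaker(1.5, 4.0, turns))
```

The turns can be collected once per file with `[(turn.start, turn.end, spk) for turn, _, spk in diarization_result.itertracks(yield_label=True)]`, which also avoids re-iterating the diarization result for every segment.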

Real-Time Streaming Transcription

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
import queue
import threading

model = WhisperModel("small", device="cuda")  # Use smaller for realtime
audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    """Capture audio chunks"""
    audio_queue.put(indata.copy())

def transcribe_stream():
    """Process audio stream in realtime"""
    buffer = np.array([], dtype=np.float32)
    
    while True:
        # Accumulate audio
        chunk = audio_queue.get()
        buffer = np.append(buffer, chunk.flatten())
        
        # Process every 5 seconds
        if len(buffer) >= 16000 * 5:  # 5 seconds at 16kHz
            # Transcribe buffer
            segments, _ = model.transcribe(
                buffer,
                language="en",
                beam_size=1,
                best_of=1
            )
            
            for seg in segments:
                print(seg.text, end=" ", flush=True)
            
            # Keep last 0.5s for context
            buffer = buffer[-8000:]

# Start streaming
with sd.InputStream(samplerate=16000, channels=1, callback=audio_callback):
    transcribe_stream()
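Because the loop keeps the last 0.5 s of audio as context, the next pass may transcribe a word or two that were already printed. A minimal, text-only sketch for trimming such repeats is shown below. The helper `strip_repeated_prefix` and its `max_words` limit are hypothetical and not part of faster-whisper, and the matching is a simple heuristic.

```python
def _normalize(word: str) -> str:
    """Lowercase and drop surrounding punctuation for comparison."""
    return word.lower().strip(".,!?;:")


def strip_repeated_prefix(previous_text: str, new_text: str, max_words: int = 10) -> str:
    """Drop leading words of new_text that repeat the tail of previous_text."""
    prev_words = previous_text.split()
    new_words = new_text.split()
    limit = min(max_words, len(prev_words), len(new_words))
    # Try the longest possible overlap first, then shrink.
    for size in range(limit, 0, -1):
        tail = [_normalize(w) for w in prev_words[-size:]]
        head = [_normalize(w) for w in new_words[:size]]
        if tail == head:
            return " ".join(new_words[size:])
    return " ".join(new_words)


if __name__ == "__main__":
    print(strip_repeated_prefix("we will meet on Monday", "on Monday at noon"))
```

In the streaming loop, this would mean tracking the text printed for the previous buffer and passing each new buffer's text through the helper before printing it.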

โš ๏ธ Streaming Latency

Real-time streaming has ~2-5 second latency. For production live captions, consider a smaller model (small/medium) and tune buffer sizes.

Cost Optimization

Smart GPU Usage

Cost per Hour of Audio

Using RTX 4090 on GPUBrazil ($0.40/hr) with faster-whisper:
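A rough per-hour figure follows from numbers already quoted in this guide: $0.40 per GPU-hour and about 4 minutes of GPU time per hour of audio. The sketch below only does that arithmetic. It assumes a fully utilized GPU and ignores storage, bandwidth, and idle time, and the function name `cost_per_audio_hour` is illustrative.

```python
def cost_per_audio_hour(gpu_price_per_hour: float, minutes_per_audio_hour: float) -> float:
    """GPU cost (USD) to transcribe one hour of audio."""
    return gpu_price_per_hour * minutes_per_audio_hour / 60.0


if __name__ == "__main__":
    faster = cost_per_audio_hour(0.40, 4)     # faster-whisper on RTX 4090
    standard = cost_per_audio_hour(0.40, 15)  # standard Whisper on RTX 4090
    print(f"faster-whisper:   ${faster:.3f} per audio hour")
    print(f"standard Whisper: ${standard:.3f} per audio hour")
    print(f"1,000 audio hours/day with faster-whisper: ${faster * 1000:.2f}")
```

Under those assumptions, faster-whisper comes to a little under three cents per hour of audio, or roughly $27 per day at 1,000 hours.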

Compare to API services:

GPUBrazil is 13-53x cheaper for high-volume transcription.


Production Checklist

Conclusion

Running Whisper at scale is entirely achievable with the right setup. The combination of faster-whisper optimization, proper batching, and affordable GPU compute makes it possible to transcribe thousands of hours daily at a fraction of API costs.

Start with a single RTX 4090 on GPUBrazil, test your pipeline, and scale horizontally as needed.