Whisper: Speech Recognition & Audio Understanding

OpenAI Whisper is a state-of-the-art open-source model for automatic speech recognition (ASR): it transcribes audio to text in 99 languages.

Why Whisper?

  • Multilingual: 99 languages, automatic language detection

  • Robust: Handles accents, background noise, and technical jargon

  • Open source: Free, runs locally, no API key needed

  • Timestamps: Word-level timestamps for subtitle generation

  • Translation: Transcribe + translate to English in one step

| Model    | Parameters | Relative speed | Quality   |
|----------|------------|----------------|-----------|
| tiny     | 39M        | Fastest        | Good      |
| base     | 74M        | Fast           | Better    |
| small    | 244M       | Medium         | Great     |
| medium   | 769M       | Slow           | Excellent |
| large-v3 | 1.5B       | Slowest        | Best      |
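
If you are not sure which checkpoints your installed package supports, it can list them itself. A small sketch, assuming the openai-whisper package; note that the English-only '.en' variants (tiny.en, base.en, ...) trade multilingual support for slightly better English accuracy at the same size:

import whisper

# Lists every checkpoint this whisper version can download,
# including the English-only '.en' variants not shown in the table above.
print(whisper.available_models())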

# Install dependencies
# !pip install openai-whisper torch torchaudio ffmpeg-python
# Also needs: brew install ffmpeg (Mac) or apt install ffmpeg (Linux)

1. Local Whisper Transcription

import whisper
import torch

# Load model (downloads on first use)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = whisper.load_model('base', device=device)
print(f'Loaded Whisper base on {device}')

def transcribe(audio_path: str, language: str | None = None) -> dict:
    """Transcribe audio file. language=None for auto-detection."""
    result = model.transcribe(
        audio_path,
        language=language,
        word_timestamps=True,
        verbose=False
    )
    return result

# Example usage (replace with your audio file):
# result = transcribe('my_audio.mp3')
# print('Text:', result['text'])
# print('Language:', result['language'])
print('transcribe() function ready; pass any audio file path.')
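
If you only need the language rather than a full transcript, Whisper's lower-level API can identify it from the first 30 seconds of audio. A minimal sketch reusing the model loaded above, following the detection example in the openai-whisper README:

def detect_language(audio_path: str) -> str:
    """Return the most probable language code for the first 30 seconds."""
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)  # Whisper operates on 30-second windows
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)  # dict: language code -> probability
    return max(probs, key=probs.get)

# detect_language('my_audio.mp3')  # e.g. 'en'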

2. OpenAI Whisper API (Faster, No Local GPU)

from openai import OpenAI

client = OpenAI()

def transcribe_api(audio_path: str, language: str | None = None, task: str = 'transcribe'):
    """
    Transcribe via the OpenAI API.
    task: 'transcribe' (keep original language) or 'translate' (to English)
    Cost: $0.006/minute.
    Returns plain text for 'translate', a verbose-JSON response object
    (with word timestamps) for 'transcribe'.
    """
    with open(audio_path, 'rb') as f:
        if task == 'translate':
            return client.audio.translations.create(
                model='whisper-1',
                file=f,
                response_format='text'
            )
        # Only pass `language` when the caller supplies one,
        # so the API auto-detects otherwise.
        kwargs = {'language': language} if language else {}
        return client.audio.transcriptions.create(
            model='whisper-1',
            file=f,
            response_format='verbose_json',
            timestamp_granularities=['word'],
            **kwargs
        )

print('OpenAI Whisper API function defined.')
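
The API rejects uploads larger than 25 MB, so long recordings have to be split client-side. One possible approach is sketched below, assuming pydub is installed (pip install pydub); transcribe_long and the temporary file name are illustrative, not part of any library:

from pydub import AudioSegment

def transcribe_long(audio_path: str, chunk_minutes: int = 10) -> str:
    """Split audio into chunks under the API size limit, then join transcripts."""
    audio = AudioSegment.from_file(audio_path)
    chunk_ms = chunk_minutes * 60 * 1000
    texts = []
    for start in range(0, len(audio), chunk_ms):  # len() is duration in ms
        audio[start:start + chunk_ms].export('_chunk.mp3', format='mp3')
        with open('_chunk.mp3', 'rb') as f:
            texts.append(client.audio.transcriptions.create(
                model='whisper-1', file=f, response_format='text'))
    return ' '.join(texts)

Fixed boundaries can cut a word in half mid-chunk; for production use, split on silence or overlap chunks slightly and deduplicate the seams.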

3. Generate SRT Subtitles

def seconds_to_srt_time(seconds: float) -> str:
    """Convert seconds to the SRT timestamp format: HH:MM:SS,mmm"""
    total_ms = int(round(seconds * 1000))  # round once to avoid float truncation
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1_000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}'

def segments_to_srt(segments: list) -> str:
    """Convert Whisper segments to SRT subtitle format."""
    lines = []
    for i, seg in enumerate(segments, 1):
        start = seconds_to_srt_time(seg['start'])
        end   = seconds_to_srt_time(seg['end'])
        text  = seg['text'].strip()
        lines.append(f'{i}\n{start} --> {end}\n{text}\n')
    return '\n'.join(lines)

def transcribe_to_srt(audio_path: str, output_path: str | None = None) -> str:
    """Transcribe audio and save as SRT subtitle file."""
    result = model.transcribe(audio_path, word_timestamps=True)
    srt_content = segments_to_srt(result['segments'])

    if output_path is None:
        output_path = audio_path.rsplit('.', 1)[0] + '.srt'

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(srt_content)

    print(f'Saved subtitles to: {output_path}')
    return srt_content

# Demo with synthetic segments
demo_segments = [
    {'start': 0.0,  'end': 3.5,  'text': 'Welcome to the multimodal AI course.'},
    {'start': 3.5,  'end': 7.2,  'text': 'Today we will learn about Whisper speech recognition.'},
    {'start': 7.2,  'end': 11.0, 'text': 'It can transcribe audio in 99 languages automatically.'},
]

print('Sample SRT output:')
print(segments_to_srt(demo_segments))
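
HTML5 video players expect WebVTT rather than SRT. The cue format differs only in a WEBVTT header and a dot (instead of a comma) before the milliseconds, so a converter is a near copy of the one above; a small sketch built on the same helpers:

def segments_to_vtt(segments: list) -> str:
    """Convert Whisper segments to WebVTT (for HTML5 <track> elements)."""
    lines = ['WEBVTT', '']
    for seg in segments:
        start = seconds_to_srt_time(seg['start']).replace(',', '.')
        end = seconds_to_srt_time(seg['end']).replace(',', '.')
        lines.append(f"{start} --> {end}\n{seg['text'].strip()}\n")
    return '\n'.join(lines)

print(segments_to_vtt(demo_segments))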

4. Faster Whisper (4x Speed, Same Quality)

# faster-whisper uses the CTranslate2 backend: up to 4x faster, about half the memory
# !pip install faster-whisper

from faster_whisper import WhisperModel

# int8 quantization for even more speed on CPU
fast_model = WhisperModel('base', device='cpu', compute_type='int8')

def fast_transcribe(audio_path: str):
    segments, info = fast_model.transcribe(audio_path, beam_size=5)
    print(f'Detected language: {info.language} (probability: {info.language_probability:.2%})')
    full_text = ' '.join(seg.text for seg in segments)
    return full_text

print('faster-whisper model loaded (int8 quantized for speed)')
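
faster-whisper returns segments as a generator, so decoding happens lazily as you iterate, and it supports per-word timestamps too. A short sketch assuming the fast_model loaded above:

def fast_transcribe_words(audio_path: str):
    """Print word-level timestamps as segments are decoded."""
    segments, _ = fast_model.transcribe(audio_path, word_timestamps=True)
    for seg in segments:        # generator: decoding happens here, lazily
        for word in seg.words:  # Word objects with .start, .end, .word
            print(f'[{word.start:6.2f}s - {word.end:6.2f}s] {word.word}')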

5. Real-World Applications

# Meeting transcription pipeline: local Whisper for the transcript, an LLM for the summary
def process_meeting(audio_file: str) -> dict:
    # 1. Transcribe with the transcribe() helper from section 1
    result = transcribe(audio_file)
    # 2. Summarize the transcript with an LLM
    from openai import OpenAI
    client = OpenAI()
    summary = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': f'Summarize this meeting transcript in bullet points:\n\n{result["text"]}'}]
    ).choices[0].message.content
    return {'transcript': result['text'], 'summary': summary, 'language': result['language']}
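
Usage mirrors the earlier helpers (the filename is a placeholder):

# meeting = process_meeting('standup_recording.mp3')
# print(meeting['summary'])
print('process_meeting() pipeline defined.')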

Exercises

  1. Record a 1-minute audio clip and transcribe it with Whisper base vs. large-v3. Compare accuracy.

  2. Download a YouTube video (using yt-dlp), extract the audio, and generate subtitles.

  3. Build a meeting transcription + summarization pipeline using Whisper + GPT-4o.

  4. Test multilingual transcription: record yourself speaking two languages and check auto-detection.