Promptzone - Community

Transcribe 2.5 Hours of Audio in 98 Seconds: Meet Insanely Fast Whisper

Speed matters as much as accuracy in speech-to-text. A new tool called "insanely-fast-whisper" has drawn attention in the AI community for transcribing 2.5 hours of audio in roughly 98 seconds while running locally on a Mac or an Nvidia GPU. It combines OpenAI's Whisper model with the Pyannote library to deliver fast transcription and speaker segmentation. Here is how to use it.

What is Insanely Fast Whisper?

Insanely Fast Whisper is an opinionated command-line interface (CLI) for transcribing audio files with OpenAI's Whisper large-v3 model, with speaker diarization provided by the Pyannote library. It runs on both Macs (through Apple Silicon's mps backend) and Nvidia GPUs.

Key Features

  • Blazing Fast Transcriptions: Transcribe 150 minutes (2.5 hours) of audio in less than 98 seconds on an Nvidia A100 GPU (see Benchmark Results).
  • Local Processing: Inference runs entirely on-device, so once the model weights have been downloaded your audio never leaves your machine.
  • Advanced Optimizations: Uses Flash Attention 2, fp16 precision, and batching for rapid processing.

Installation and Setup

To get started with Insanely Fast Whisper, you'll need to install the tool via pip or pipx. Here’s a step-by-step guide:

  1. Install Insanely Fast Whisper:
   pip install insanely-fast-whisper

Alternatively, you can use pipx for a more isolated environment:

   pipx install insanely-fast-whisper
  2. Run the CLI:
   insanely-fast-whisper --file-name <FILE NAME or URL> --batch-size 2 --device-id mps --hf-token <HF TOKEN>
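
The run command above passes --device-id mps, which only applies to Macs with Apple Silicon. As a minimal sketch, a script that wraps the CLI could choose the value from the host platform. The helper name pick_device_id is hypothetical, and the fallback "0" (the first Nvidia GPU) is an assumption, not something the article states.

```python
import platform


def pick_device_id(system: str, machine: str, cuda_index: str = "0") -> str:
    """Return a value for --device-id based on the host platform.

    "mps" is the value the article gives for Apple Silicon Macs.
    The cuda_index fallback ("0") is an assumption for Nvidia GPUs.
    """
    if system == "Darwin" and machine == "arm64":
        return "mps"
    # Anything else falls back to the caller-supplied GPU index.
    return cuda_index


if __name__ == "__main__":
    # platform.machine() reports "arm64" on Apple Silicon when Python
    # runs natively (not under Rosetta).
    print(pick_device_id(platform.system(), platform.machine()))
```

The function takes the platform strings as arguments rather than reading them itself, so the choice can be checked on any machine.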

Using Insanely Fast Whisper

Once installed, you can transcribe audio files using a simple command. Here’s how:

  1. Basic Transcription:
   insanely-fast-whisper --file-name <filename or URL>
  2. Using Flash Attention 2 for even faster processing:
   insanely-fast-whisper --file-name <filename or URL> --flash True
  3. Running on Mac:
   insanely-fast-whisper --file-name <filename or URL> --device-id mps
  4. Using Distil-Whisper Model:
   insanely-fast-whisper --model-name distil-whisper/large-v2 --file-name <filename or URL>
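
The variations above differ only in which flags are appended, so they can be assembled programmatically when transcription is part of a larger script. The sketch below uses only flags that appear in this article. The helper build_command and the file name interview.mp3 are hypothetical, chosen for illustration.

```python
import shlex


def build_command(file_name, device_id=None, flash=False,
                  model_name=None, batch_size=None):
    """Assemble an insanely-fast-whisper argument list from optional flags."""
    cmd = ["insanely-fast-whisper", "--file-name", file_name]
    if device_id is not None:
        cmd += ["--device-id", device_id]
    if flash:
        # The CLI expects an explicit value for --flash.
        cmd += ["--flash", "True"]
    if model_name is not None:
        cmd += ["--model-name", model_name]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    return cmd


# Hypothetical input file, for illustration only.
cmd = build_command("interview.mp3", device_id="mps", batch_size=2)
print(shlex.join(cmd))
```

Returning a list rather than a single string means the result can be handed directly to subprocess.run(cmd, check=True) without shell quoting problems in file names.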

Benchmark Results

The speed and efficiency of Insanely Fast Whisper have been benchmarked extensively. Here are some results using an Nvidia A100 - 80GB GPU:

| Optimization Type | Time to Transcribe (150 mins of Audio) |
| --- | --- |
| large-v3 (Transformers) (fp32) | ~31 minutes |
| large-v3 (Transformers) (fp16 + batching [24] + bettertransformer) | ~5 minutes |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~2 minutes |
| distil-large-v2 (Transformers) (fp16 + batching [24] + bettertransformer) | ~3 minutes |
| distil-large-v2 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~1 minute |
| large-v2 (Faster Whisper) (fp16 + beam_size [1]) | ~9 minutes |
| large-v2 (Faster Whisper) (8-bit + beam_size [1]) | ~8 minutes |
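
To put these figures in perspective, it can help to express them as a multiple of real time (audio duration divided by processing time). The short sketch below does that arithmetic for a few of the rows above and for the 98-second headline figure. The times are the approximate values from the table, so the results are rough estimates.

```python
AUDIO_MINUTES = 150  # length of the benchmark audio

# Approximate wall-clock minutes, copied from the benchmark results.
BENCHMARKS = {
    "large-v3 fp32": 31,
    "large-v3 fp16 + batching + bettertransformer": 5,
    "large-v3 fp16 + batching + Flash Attention 2": 2,
    "distil-large-v2 fp16 + batching + Flash Attention 2": 1,
    "large-v2 Faster Whisper fp16": 9,
}


def realtime_factor(audio_minutes, wall_minutes):
    """How many minutes of audio are processed per minute of compute."""
    return audio_minutes / wall_minutes


for name, minutes in BENCHMARKS.items():
    factor = realtime_factor(AUDIO_MINUTES, minutes)
    print(f"{name}: ~{factor:.0f}x real time")

# The headline figure: 150 minutes of audio in 98 seconds.
headline = realtime_factor(AUDIO_MINUTES, 98 / 60)
print(f"headline (98 s): ~{headline:.0f}x real time")
```

By this measure the headline run is roughly 92 times faster than real time, compared with about 5 times for the unoptimized fp32 baseline.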

Advanced CLI Options

Insanely Fast Whisper provides a range of options to customize your transcription experience:

  -h, --help                              Show help message and exit
  --file-name FILE_NAME                   Path or URL to the audio file to be transcribed
  --device-id DEVICE_ID                   Device ID for your GPU. Use "mps" for Macs with Apple Silicon
  --transcript-path TRANSCRIPT_PATH       Path to save the transcription output (default: output.json)
  --model-name MODEL_NAME                 Name of the pretrained model/checkpoint (default: openai/whisper-large-v3)
  --task {transcribe, translate}          Task to perform (default: transcribe)
  --language LANGUAGE                     Language of the input audio (default: auto-detect)
  --batch-size BATCH_SIZE                 Number of parallel batches to compute (default: 24)
  --flash FLASH                           Use Flash Attention 2 (default: False)
  --timestamp {chunk, word}               Timestamps granularity (default: chunk)
  --hf-token HF_TOKEN                     Hugging Face token for Pyannote audio diarization
  --diarization_model DIARIZATION_MODEL   Model for speaker diarization (default: pyannote/speaker-diarization)
  --num-speakers NUM_SPEAKERS             Exact number of speakers (default: None)
  --min-speakers MIN_SPEAKERS             Minimum number of speakers (default: None)
  --max-speakers MAX_SPEAKERS             Maximum number of speakers (default: None)
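
By default the transcript is written to output.json (see --transcript-path above). The article does not describe the file's layout, so the sketch below rests on an assumption: that the JSON contains a "chunks" list whose items hold a "timestamp" pair and a "text" string. Check a real output file before relying on it. The sample data is made up for illustration, and both helper names are hypothetical.

```python
import json


def load_transcript(path="output.json"):
    """Read the JSON file written by the CLI (default --transcript-path)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def format_chunks(result):
    """Render each chunk as '[start - end] text'.

    Assumes result["chunks"] is a list of
    {"timestamp": [start, end], "text": "..."} items.
    """
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # Guard against a missing end time (null in the JSON).
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text'].strip()}")
    return lines


# Made-up sample in the assumed shape; a real run would call
# load_transcript() instead.
sample = {
    "chunks": [
        {"timestamp": [0.0, 2.5], "text": " Hello and welcome."},
        {"timestamp": [2.5, None], "text": " Let's begin."},
    ]
}

for line in format_chunks(sample):
    print(line)
```

With --timestamp word the same loop would presumably produce one line per word rather than per chunk, since only the granularity of the entries changes.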


Insanely Fast Whisper makes fast local transcription practical. In the A100 benchmarks above, Flash Attention 2 with fp16 and batching brings a 150-minute transcription down from about 31 minutes to about 2 minutes. Because it runs locally on both Macs and Nvidia GPUs, it suits developers and researchers alike, whether you're transcribing lengthy interviews or need quick speaker segmentation.

For more information and to get started, visit the Insanely Fast Whisper GitHub repository.
