This is a simplified guide to an AI model called Incredibly-Fast-Whisper, maintained by Vaibhavs10. If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.
Model overview
The incredibly-fast-whisper model is an opinionated CLI tool built on top of OpenAI's Whisper large-v3 model, designed for blazingly fast audio transcription. Powered by Hugging Face Transformers, Optimum, and Flash Attention 2, it can transcribe 150 minutes of audio in less than 98 seconds, a significant speedup over the standard Whisper model. The tool is part of a community-driven project started by vaibhavs10 to showcase advanced Transformers optimizations.
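To make that performance claim concrete, here is a minimal sketch of the kind of chunked, batched Transformers pipeline the tool builds on. The model ID and parameter values are illustrative rather than the tool's exact configuration, and enabling Flash Attention 2 assumes the flash-attn package and a supported GPU are available.

```python
import torch
from transformers import pipeline

# Whisper large-v3 served through the Transformers ASR pipeline,
# in half precision with Flash Attention 2 enabled.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",  # assumes a CUDA GPU
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

# Long audio is split into 30-second chunks, and chunks are batched
# across the GPU; this chunked, batched inference is the main source
# of the speedup over sequential Whisper decoding.
result = pipe(
    "audio.mp3",  # hypothetical local file; a URL also works
    chunk_length_s=30,
    batch_size=24,  # lower this if you run out of GPU memory
    return_timestamps=True,
)
print(result["text"])
```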
The incredibly-fast-whisper model is comparable to other Whisper-based models such as whisperx, whisper-diarization, and metavoice, each of which offers its own set of features and optimizations for speech-to-text transcription.
Model inputs and outputs
Inputs
- Audio file: The primary input, which can be provided as a local file path or a URL.
- Task: Either transcription (the default) or translation into another language.
- Language: The language of the input audio; leave as "None" to let the model auto-detect it.
- Batch size: The number of parallel batches to compute; lower it to avoid out-of-memory (OOM) errors.
- Timestamp format: Timestamps at either the chunk or word level.
- Diarization: Speaker diarization via Pyannote.audio, which requires providing a Hugging Face API token (see the example call after this list).
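Since the model is hosted on Replicate, these inputs can be passed as a plain dictionary. The sketch below uses the Replicate Python client (and assumes an API token in the environment); the input key names mirror the descriptions above and should be checked against the model's published schema before use.

```python
import replicate

# Illustrative request; key names follow the inputs described above
# and may not match the model's exact schema.
output = replicate.run(
    "vaibhavs10/incredibly-fast-whisper",
    input={
        "audio": "https://example.com/sample.mp3",  # hypothetical URL
        "task": "transcribe",   # or "translate"
        "language": "None",     # "None" auto-detects the language
        "batch_size": 24,       # reduce if you hit OOM errors
        "timestamp": "chunk",   # or "word" for word-level timestamps
    },
)
print(output)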
Outputs
The primary output of the incredibly-fast-whisper model is a transcription of the input audio, which can be saved to a JSON file.
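As a rough illustration, a saved transcript can be post-processed like this. The "text", "chunks", and "timestamp" keys are assumptions based on the Transformers pipeline output format, not a documented schema for this tool.

```python
import json

# Load a saved transcript and print chunk-level timestamps.
with open("output.json") as f:
    transcript = json.load(f)

print(transcript["text"])
for chunk in transcript.get("chunks", []):
    start, end = chunk["timestamp"]
    print(f"[{start}-{end}] {chunk['text']}")
```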
Capabilities
The incredibly-fast-whisper model le...
Click here to read the full guide to Incredibly-Fast-Whisper