Whisper Model Description
Basic Information
Model Name: Whisper
Developer/Creator: OpenAI
Release Date: September 2022 (original series), December 2022 (large-v2), and November 2023 (large-v3)
Model Type: Sequence-to-sequence ASR (automatic speech recognition) and speech translation model
Versions:
| Size   | Parameters | Relative speed |
|--------|------------|----------------|
| tiny   | 39 M       | ~32x           |
| base   | 74 M       | ~16x           |
| small  | 244 M      | ~6x            |
| medium | 769 M      | ~2x            |
| large  | 1550 M     | 1x             |
The Whisper models are intended primarily for AI research on model robustness, generalization, and bias, and are also effective for English speech recognition. Using them to transcribe recordings made without consent, or to inform high-risk decisions, is strongly discouraged due to potential inaccuracies and ethical concerns.
Key Features:
- Multilingual: shows strong results in roughly 10 languages, though evaluation on other tasks such as voice activity detection and speaker classification remains limited.
- Robust to diverse accents and noisy environments.
- Suitable for speech transcription, translation, and subtitle generation (see the sketch after this list).
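As a minimal sketch of subtitle generation and X-to-English translation, here is one way to call a hosted Whisper endpoint using OpenAI's official `openai` Node SDK (a different service from the AIML API shown later on this page); the file name `audio.mp3` and an `OPENAI_API_KEY` environment variable are assumptions:

```javascript
// Sketch: subtitles and translation via OpenAI's hosted Whisper endpoint.
// Assumes OPENAI_API_KEY is set and audio.mp3 exists locally.
const fs = require('fs');
const OpenAI = require('openai');

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const main = async () => {
  // Transcribe and request SubRip subtitles directly.
  const srt = await openai.audio.transcriptions.create({
    file: fs.createReadStream('audio.mp3'),
    model: 'whisper-1',
    response_format: 'srt', // 'json', 'text', 'vtt', 'verbose_json' also work
  });
  console.log(srt);

  // Translate non-English speech into English text.
  const english = await openai.audio.translations.create({
    file: fs.createReadStream('audio.mp3'),
    model: 'whisper-1',
  });
  console.log(english.text);
};

main().catch(console.error);
```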
Intended Use:
Intended for developers and researchers interested in incorporating speech-to-text capabilities into applications, supporting accessibility features, or conducting linguistic research.
Technical Details
Architecture:
The model is an encoder-decoder (sequence-to-sequence) Transformer trained end-to-end with large-scale weak supervision on audio paired with transcripts from the web. Audio is resampled to 16 kHz, split into 30-second chunks, and converted to log-Mel spectrograms for the encoder; the decoder generates text tokens, with special control tokens steering a single model between language identification, transcription, and translation.
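As a rough illustration of that multitask format (the token names come from the Whisper paper and tokenizer; `buildPrompt` is a hypothetical helper, not part of any SDK):

```javascript
// Illustration only: how Whisper's decoder prompt selects a task.
// Token names are real Whisper special tokens; buildPrompt is hypothetical.
function buildPrompt({ language = 'en', task = 'transcribe', timestamps = false }) {
  const tokens = ['<|startoftranscript|>', `<|${language}|>`, `<|${task}|>`];
  if (!timestamps) tokens.push('<|notimestamps|>');
  return tokens; // the decoder generates text conditioned on these tokens
}

console.log(buildPrompt({ language: 'de', task: 'translate' }));
// [ '<|startoftranscript|>', '<|de|>', '<|translate|>', '<|notimestamps|>' ]
```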
Training Data:
The models were trained on 680,000 hours of audio and corresponding transcripts collected from the internet: 65% English audio with English transcripts, 18% non-English audio with English transcripts, and 17% non-English audio with matching non-English transcripts, covering 98 languages in total.
Performance Metrics:
Research indicates that these models outperform many existing ASR systems. They are notably robust to accents, background noise, and technical language, and perform zero-shot speech recognition and translation from many languages into English with near state-of-the-art accuracy.
Performance varies across languages, degrading markedly for low-resource or less commonly studied languages, and accuracy also differs across accents, dialects, and demographic groups. The models can fall into repetitive output loops, a failure mode partly mitigated by beam search and temperature fallback during decoding, as sketched below.
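The fallback works roughly like this: decode at temperature 0 (optionally with beam search) and, if quality heuristics fail, retry at progressively higher temperatures. A minimal sketch, where `decode` is a hypothetical function and the thresholds mirror the defaults in the open-source whisper repository:

```javascript
// Sketch of Whisper-style temperature fallback decoding.
// `decode` is a hypothetical function returning
// { text, avgLogProb, compressionRatio }.
async function transcribeWithFallback(audio, decode) {
  const temperatures = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0];
  let result;
  for (const temperature of temperatures) {
    result = await decode(audio, { temperature });
    const repetitive = result.compressionRatio > 2.4; // likely looping text
    const lowConfidence = result.avgLogProb < -1.0;
    if (!repetitive && !lowConfidence) return result; // accept this pass
    // Otherwise retry: higher temperature tends to break repetitive loops.
  }
  return result; // all temperatures tried; return the last attempt
}
```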
Knowledge cutoff:
The audio and text data used for training would not include information beyond mid-2022.
Usage
Code Samples/SDK:
```javascript
// Transcribe a remote audio file with Whisper via the AIML API.
const axios = require('axios');

const api = axios.create({
  baseURL: 'https://api.aimlapi.com/v1',
  headers: { Authorization: 'Bearer <YOUR_API_KEY>' },
});

const main = async () => {
  // POST the audio URL to the speech-to-text endpoint.
  const response = await api.post('/stt', {
    model: '#g1_whisper-large',
    url: 'https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3',
  });
  console.log('[transcription]', response.data.results.channels[0].alternatives[0].transcript);
};

main().catch(console.error);
```
Tutorials: Speech-to-text Multimodal Experience in NodeJS
File Size
The maximum file size is limited to 2 GB.
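Since oversized uploads will be rejected, it can be worth checking the size client-side first. A minimal sketch using Node's built-in fs module; the 2 GB figure is the only value taken from this page:

```javascript
// Reject local files over the documented 2 GB limit before uploading.
const fs = require('fs');

const MAX_BYTES = 2 * 1024 ** 3; // 2 GB

function assertUploadable(path) {
  const { size } = fs.statSync(path);
  if (size > MAX_BYTES) {
    throw new Error(`${path} is ${size} bytes, over the 2 GB limit`);
  }
}
```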
Support and Community
Community Resources:
AIML API Discord
Support Channels:
Issues and contributions can be made directly through the GitHub repository.
Ethical Considerations
- Ethical Guidelines: OpenAI provides guidance on responsible usage, emphasizing privacy and ethical use of AI technologies.
- Bias Mitigation: Continuous efforts to reduce biases in speech recognition accuracy across different languages and accents.
Licensing
- License Type: Released under the MIT license, allowing for commercial and non-commercial use.