- openai/whisper-medium
---

# Whisper-medium Singlish2English transcription model

[](https://huggingface.co/ivabojic/whisper-medium-sing2eng-transcribe)

## Model overview

This model is a fine-tuned version of [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium), trained on over **2 million speech samples** from the [Singapore National Speech Corpus (NSC)](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus). It focuses on **Singaporean-accented English** (Singlish), which is typically underrepresented in general-purpose ASR systems.

---

## Custom dataset overview

To enable fine-tuning of open-source foundation ASR models, we curated **NSC<sub>P16</sub>**, a bespoke dataset constructed from the NSC corpus. It is designed to capture the range and richness of Singlish across both non-conversational and conversational contexts.

- **Non-conversational speech** includes:
  - **Part 1:** Phonetically balanced scripts consisting of standard English sentences spoken in local accents.
  - **Part 2:** Sentences randomly generated from themes such as people, food, places, and brands.

- **Conversational and expressive speech** includes:
  - **Part 3:** Natural dialogues on everyday topics between Singaporean speakers.
  - **Part 5:** Stylized recordings simulating debates, finance-related discussions, and emotional expressions (both positive and negative).
  - **Part 6:** Scenario-based dialogues, where speakers engage in topic-driven, semi-scripted interactions covering various themes.

Together, these components make NSC<sub>P16</sub> a robust dataset for building speech models that generalize well across local speech styles, tones, and speaking conditions.

**Table 1: Overview of the custom-created transcription datasets.**

| **Name** | **Samples** | **Total hours** | **Avg. duration (s)** | **Min (s)** | **Max (s)** |
|-------------------------|-----------|---------|-----|-----|------|
| NSC<sub>P16_train</sub> | 2,048,000 | 2,944.1 | 5.2 | 0.1 | 30.1 |
| NSC<sub>P16_valid</sub> | 50,000    | 73.4    | 5.3 | 0.8 | 29.1 |
| NSC<sub>P16_test</sub>  | 10,000    | 19.1    | 6.9 | 1.0 | 26.1 |

## Evaluation

Evaluation was conducted on the held-out NSC<sub>P16</sub> test set. Performance was measured using **Word Error Rate (WER)**, comparing the fine-tuned model against the off-the-shelf Whisper-medium baseline.

**Table 2: Evaluation results on the test dataset using WER. A lower WER indicates better performance (↓).**

| Model | WER (↓) |
|--------------------------------------|----------|
| Whisper-medium (off-the-shelf)       | 21.09    |
| Whisper-medium-Sing2Eng (fine-tuned) | **6.63** |

This represents a **14.46 percentage point absolute reduction** and a **68.6% relative improvement** in WER over the baseline Whisper-medium model on the NSC<sub>P16</sub> test set.

By learning from diverse local accents and speaking styles, this model significantly improves transcription accuracy for Singaporean speech, making it suitable for both research and production applications in **multilingual** and **code-switched** environments.
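
The WER reported above is the word-level edit distance between hypothesis and reference, divided by the number of reference words. In practice a library such as `jiwer` is typically used; the dependency-free helper below is a minimal illustrative sketch, not part of this repository:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("go to the hawker centre", "go to da hawker center"))  # 2 errors / 5 words = 0.4
```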

## Usage

```python
import torchaudio, torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_name = 'ivabojic/whisper-medium-sing2eng-transcribe'
audio_path = 'path_to_audio'  # e.g. https://github.com/IvaBojic/Singlish2English/blob/main/small_dataset/audios/00862042_713.wav

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Load and resample audio to 16 kHz if needed
audio, sr = torchaudio.load(audio_path)
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    audio = resampler(audio)
audio = audio.squeeze().numpy()

# Preprocess and generate transcription
inputs = processor(audio=audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
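
Note that Whisper models process audio in 30-second windows, so the snippet above transcribes roughly the first 30 seconds of a longer file. A common workaround is to split the waveform into chunks of at most 30 seconds and run each through the processor/model loop above, concatenating the per-chunk transcriptions. A minimal sketch (the `chunk_audio` helper is illustrative, not part of this repository):

```python
def chunk_audio(samples, sampling_rate, max_seconds=30):
    """Split a 1-D sample sequence into consecutive chunks of at most max_seconds."""
    chunk_len = int(max_seconds * sampling_rate)
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

# Tiny demo with a toy "waveform" of 10 samples at 2 Hz, 2-second chunks:
chunks = chunk_audio(list(range(10)), sampling_rate=2, max_seconds=2)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```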

## Project repository

For training scripts, evaluation tools, sample audio files, and more, visit the GitHub repository:
[https://github.com/IvaBojic/Singlish2English](https://github.com/IvaBojic/Singlish2English)