base_model:
  - openai/whisper-medium
---

# Whisper-medium Singlish2English transcription model

[![Hugging Face](https://img.shields.io/badge/HuggingFace-Model-blue)](https://huggingface.co/ivabojic/whisper-medium-sing2eng-transcribe)

## Model overview

This model is a fine-tuned version of [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium), trained on over **2 million speech samples** from the [Singapore National Speech Corpus (NSC)](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus). It focuses on **Singaporean-accented English** (Singlish), which is typically underrepresented in general-purpose ASR systems.

---
## Custom dataset overview

To enable fine-tuning of open-source foundation ASR models, we curated **NSC<sub>P16</sub>**, a bespoke dataset constructed from the NSC corpus. It is designed to capture the range and richness of Singlish across both non-conversational and conversational contexts.

- **Non-conversational speech** includes:
  - **Part 1:** Phonetically balanced scripts consisting of standard English sentences spoken in local accents.
  - **Part 2:** Sentences randomly generated from themes such as people, food, places, and brands.

- **Conversational and expressive speech** includes:
  - **Part 3:** Natural dialogues on everyday topics between Singaporean speakers.
  - **Part 5:** Stylized recordings simulating debates, finance-related discussions, and emotional expressions (both positive and negative).
  - **Part 6:** Scenario-based dialogues, where speakers engage in topic-driven, semi-scripted interactions covering various themes.

Together, these components make NSC<sub>P16</sub> a robust dataset for building speech models that generalize well across local speech styles, tones, and speaking conditions.
**Table 1: Overview of the custom-created transcription datasets.**

| **Name**                 | **Samples** | **Total hours** | **Avg. duration (s)** | **Min (s)** | **Max (s)** |
|--------------------------|-------------|-----------------|-----------------------|-------------|-------------|
| NSC<sub>P16_train</sub>  | 2,048,000   | 2944.1          | 5.2                   | 0.1         | 30.1        |
| NSC<sub>P16_valid</sub>  | 50,000      | 73.4            | 5.3                   | 0.8         | 29.1        |
| NSC<sub>P16_test</sub>   | 10,000      | 19.1            | 6.9                   | 1.0         | 26.1        |
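The figures in Table 1 are simple aggregates over per-clip durations. A minimal, self-contained sketch of how such statistics can be computed (the duration values below are made up for illustration, not real NSC data):

```python
def duration_stats(durations_s: list[float]) -> dict:
    """Summarize clip durations the way Table 1 does."""
    return {
        "samples": len(durations_s),
        "total_hours": round(sum(durations_s) / 3600, 1),
        "avg_duration_s": round(sum(durations_s) / len(durations_s), 1),
        "min_s": round(min(durations_s), 1),
        "max_s": round(max(durations_s), 1),
    }

# Illustrative durations in seconds, not real NSC data.
print(duration_stats([4.8, 5.6, 0.9, 29.7, 6.2]))
```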
## Evaluation

Evaluation was conducted on the held-out NSC<sub>P16</sub> test set. Performance was measured using **Word Error Rate (WER)**, comparing the fine-tuned model against the off-the-shelf Whisper-medium baseline.

**Table 2: Evaluation results on the test dataset using WER. A lower WER indicates better performance (↓).**

| Model                                | WER (↓)  |
|--------------------------------------|----------|
| Whisper-medium (off-the-shelf)       | 21.09    |
| Whisper-medium-Sing2Eng (fine-tuned) | **6.63** |

This represents a **14.46 percentage point absolute reduction** and a **68.6% relative improvement** in WER over the baseline Whisper-medium model on the NSC<sub>P16</sub> test set.

By learning from diverse local accents and speaking styles, this model significantly improves transcription accuracy for Singaporean speech, making it suitable for both research and production applications in **multilingual** and **code-switched** environments.
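WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal self-contained sketch of the metric (for illustration only; this is not the evaluation script used for the numbers above, and the example sentences are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# 3 edits (insert "to", delete "the", centre -> center) over 7 reference words.
print(wer("i go makan at the hawker centre", "i go to makan at hawker center"))
```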

## Usage

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_name = 'ivabojic/whisper-medium-sing2eng-transcribe'
audio_path = 'path_to_audio'  # e.g. https://github.com/IvaBojic/Singlish2English/blob/main/small_dataset/audios/00862042_713.wav

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Load and resample audio to 16 kHz if needed
audio, sr = torchaudio.load(audio_path)
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    audio = resampler(audio)
audio = audio.squeeze().numpy()

# Preprocess and generate transcription
inputs = processor(audio=audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
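Whisper models process audio in 30-second windows, so recordings longer than that should be split before transcription. A minimal sketch of fixed-size chunking over raw sample indices (illustrative only; a production setup would typically add overlap between chunks, which is omitted here):

```python
def chunk_samples(num_samples: int, sample_rate: int = 16000,
                  window_s: float = 30.0) -> list[tuple[int, int]]:
    """Split a mono waveform of num_samples into consecutive
    (start, end) index pairs, each at most window_s seconds long."""
    window = int(window_s * sample_rate)
    chunks = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        chunks.append((start, end))
        start = end
    return chunks

# A 75-second clip at 16 kHz splits into 30 s + 30 s + 15 s.
print(chunk_samples(75 * 16000))
```

Each `(start, end)` slice of the waveform can then be fed through the same processor/generate loop shown above, and the per-chunk transcriptions concatenated.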
## Project repository

For training scripts, evaluation tools, sample audio files, and more, visit the GitHub repository:
[https://github.com/IvaBojic/Singlish2English](https://github.com/IvaBojic/Singlish2English)