Automatic Speech Recognition System using KALDI from scratch

Hello Researchers! In this post, we will understand how to build an ASR system.


Kaldi is an open-source toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. We can use it to train speech recognition models and decode audio from audio files.

Download and Install KALDI

You can skip this if you have already done the setup for KALDI.

git clone

Now, Go to the directory, open the Install file, and compile the KALDI Framework according to the instruction given on that file.KALDI takes time during installation, so utilize that time and have some dark chocolate coffee. (Do you know kaldi was a legendary Ethiopian goatherd who discovered the coffee plant around 850 AD)

Let’s Talk about Speech Recognition

In general Speech Recognition framework:
1. Process incoming wav speech
2. Then from the wave signal, we extract acoustic features using an acoustic model
3. Linking those features to words or vocabulary or lexicon
4. Language model or grammar defines how words can be connected to each.

Let’s understand the folder structure

The “egs” folder contains example models and scripts for Kaldi. Make copy of any example folder and rename it. Below is your folder structure.

KALDI Default folder structure

Conf- folder contains the configuration file for compute-and-process-kaldi.

local, Steps, and Utils- folders contain all the required files for creating language models and other supporting files for training and decoding ASR.

Data Preparation

The initial task is to properly curate the data as per KALDI format which includes the general files wav.scp, utt2spk, spk2utt, text, So create a data folder inside your directory. Inside the data folder create two more directories test and train. Also, put wav format audio files in your base folder.

Make sure your wav audio file name has the below naming convention (This step we are doing for our ease not necessary)

The first 2 letters signify your language name (for example: for english- en or for spanish -sp) , Next 4 characters specify the speaker_id (suppose we have 100 different speaker data for the training then we can give id like 0001), Next character specifies the speaker gender(M or F) and the last four characters signify the sentence ID per Speaker. So your audio file name should be like en0001M0001 or en0002F0002.

Below are the steps for KALDI format data.

Create wav.scp file in your train folder and save it.

wav.scp file format (Pattern: <filename> <full_path_to_audio_file>)

Create text file and save it

text (Pattern: <filename> <text_transcription>)

utt2spk: create file on <filename> <speakerID> pattern and save it

spk2utt: Sentences spoken by each speaker. <group same speaker per uttarance> and save it.

spk: create file list on <lang_name+speakerid> pattern. ex: <en0001M> and save it.

utt: create file lists on <unique utterance id> pattern and save it.

Repeat same steps for your test folder.

Language Data Preparation :

Create lexicon.txt file inside your data/local/dict/ folder. This file contains every word from your dictionary with its ‘phonetic transcriptions” , see below example.

lexicon.txt (Pattern <word> <phone 1> <phone 2> )

The phonetic transcription of Reason can be R IY Z AH N or R ee Z ahn etc.

Below are the files that are required for Language model preparation.

nonsilence_phones.txt: This file lists nonsilence phones that are present in corpus like aa, umm, etc. Create this file and save it.

optional_silence.txt: type sil and save to data/local/dict/.

silence_phones.txt: Not contain the acoustic information but are present. (noise). type sil and save it. Now our next step is to create language model.

Language Model Preparation

Here we are working with the N-gram language model, copy the below script to your folder and replace the path, I have taken the n_gram=2 which means that I am building a bi-gram language model, you can change it with your requirement.

#set-up for single machine or cluster based execution
. ./
#set the paths to binaries and other executables
[ -f ] && . ./
#Creating input to the LM training
#corpus file contains list of all sentences
cat $basepath/data/train/text | awk '{first = $1; $1 = ""; print $0; }' > $basepath/data/train/transwhile read linedoecho "<s> $line </s>" >> $basepath/data/train/lmtrain.txtdone <$basepath/data/train/trans#*******************************************************************************#lm_arpa_path=$basepath/data/local/lmtrain_dict=dict
n_gram=2 # This specifies bigram or trigram. for bigram set n_gram=2 for tri_gram set n_gram=3
echo " Creating n-gram LM "

rm -rf $basepath/data/local/$train_dict/lexicon_c.txt $basepath/data/local/$train_lang $basepath/data/local/tmp_$train_lang $basepath/data/$train_lang
mkdir $basepath/data/local/tmp_$train_lang
utils/ --num-sil-states 3 data/local/$train_dict '!SIL' data/local/$train_lang data/$train_lang$kaldi_root_dir/tools/irstlm/bin/ -i $basepath/data/$train_folder/lmtrain.txt -n $n_gram -o $basepath/data/local/tmp_$train_lang/lm_phone_bg.ilm.gzgunzip -c $basepath/data/local/tmp_$train_lang/lm_phone_bg.ilm.gz | utils/ data/$train_lang/words.txt > data/local/tmp_$train_lang/oov.txtgunzip -c $basepath/data/local/tmp_$train_lang/lm_phone_bg.ilm.gz | grep -v '<s> <s>' | grep -v '<s> </s>' | grep -v '</s> </s>' | grep -v 'SIL' | $kaldi_root_dir/src/lmbin/arpa2fst - | fstprint | utils/ data/local/tmp_$train_lang/oov.txt | utils/ | utils/ | fstcompile --isymbols=data/$train_lang/words.txt --osymbols=data/$train_lang/words.txt --keep_isymbols=false --keep_osymbols=false | fstrmepsilon > data/$train_lang/G.fst$kaldi_root_dir/src/fstbin/fstisstochastic data/$train_lang/G.fstecho "End of Script"

save the above code as and run sh on your terminal. you will see the below output.

Language Model Process Flow

When you get a success message enjoy! You have created your first language model. For checking, go to your data folder and there you can see two directories local and langmodel, open langmodel and you will find below folder structure.

Compiled Language Model folder structure

G.fst is a word level grammar finite state transducer
L.fst is a pronunciation lexicon finite state transducer

Feature Extraction: In this step we extract MFCC features of each utterance (audio). Open your terminal and run below command.

steps/ -nj 4 data/train exp/make_mfcc/train mfcc used for computing MFCC coefficients.
nj- number of jobs, you can set it to according to your CPU.
data/train: folder path for which you want to compute MFCC.
log file stored in this directory.
mfcc: directory name where we store extracted feature.

MFCC Feature Extraction

For extracting cepstral mean and variance statistics indexed by speakers run below command.

steps/ data/train exp/make_mfcc/train mfcc
CMVN Indexing

Acoustic Model Preparation

In this step we train the Monophone HMM system , by using below command.

steps/ --nj 4 data/train data/langmodel exp/mono
Training flow of Monophone HMM System

Run the below command for combining the acoustic model and language model to get the final model.

utils/ — — mono data/langmodel exp/mono exp/mono/graph

Yeah! Finally, you have developed a training model for your own ASR System.


For checking how your ASR system performing, use below command on unseen testing data

steps/ — nj 4 exp/mono/graph data/test exp/mono/decode

To see the decoded results use below command:

utils/ -f 2- data/langmodel/words.txt exp/mono/decode/scoring/3.tra

Now we successfully build an ASR system for custom language. YAY !
If you like the post ! Hit clap ! Thanks !

Author: Ravi Pandey Date: 2020-06-05 15:00:00
Quick Reply