Sunday, December 24, 2006

Bibliography

[Bib1] Tebelskis, Joe. “Speech Recognition using Neural Networks.” Carnegie Mellon University, Pittsburgh, Pennsylvania, May 1995.

[Bib2] Kingston, Andrew. “Speech Recognition by Machine.” Victoria University of Wellington, New Zealand, October 1992.

[Bib3] Rahim, Mazin G. “Artificial Neural Networks for Speech Analysis/Synthesis.” Chapman & Hall Neural Computing, 1994.

[Bib4] Brookes, Mike. “VOICEBOX: Speech Processing Toolbox for MATLAB.” Department of Electrical & Electronic Engineering, Imperial College, London, UK.

[Bib5] The MathWorks. “MATLAB 6.1.” Signal Processing Toolbox and Neural Network Toolbox.

Appendix C

Matlab Files


[C1] MAIN File

%This function reads several sets of speech data and tells in which
%class each set of data belongs
function finalproject = pj

%Read data of the letter "a" sound (first 22050 samples = 1 second)
f1=wavread('vc_a',22050);
%Transform the data using the Fast Fourier Transform function
F1=fft(f1,11025);
%Compute the normalized power spectrum density (PSD) of the
%transformed data
Pf1 = F1.* conj(F1)/11025;
Pf1 = transpose(Pf1); %Transpose to a row vector
Pf1 = Pf1(1:2206); %Only the first 2206 sample data are needed

f2=wavread('vc_a2',22050);
F2=fft(f2,11025);
Pf2 = F2.* conj(F2)/11025;
Pf2 = transpose(Pf2);
Pf2 = Pf2(1:2206);

f3=wavread('vc_a3',22050);
F3=fft(f3,11025);
Pf3 = F3.* conj(F3)/11025;
Pf3 = transpose(Pf3);
Pf3 = Pf3(1:2206);

%Compute the average of the three PSDs
Pf_AVG1 = (Pf1 + Pf2 + Pf3) / 3;

%Plot the PSD of the sound "a"
f = 22050*(0:2205)/11025; %Frequency of each FFT bin in Hz (bin spacing fs/NFFT = 2 Hz)
mel = frq2mel(f); %Convert the frequency axis to the mel scale
figure(1);
plot(mel,Pf_AVG1);
title('The PSD of the Letter "a" Sound');
xlabel('Frequency (Mel)');
ylabel('Power');

%-----------------------------------------------------------------------

%Read data of the letter "e" and "o" sounds in the same way
f4=wavread('vc_e',22050);
F4=fft(f4,11025);
Pf4 = F4.* conj(F4)/11025;
Pf4 = transpose(Pf4);
Pf4 = Pf4(1:2206);

f5=wavread('vc_e2',22050);
F5=fft(f5,11025);
Pf5 = F5.* conj(F5)/11025;
Pf5 = transpose(Pf5);
Pf5 = Pf5(1:2206);

f6=wavread('vc_e3',22050);
F6=fft(f6,11025);
Pf6 = F6.* conj(F6)/11025;
Pf6 = transpose(Pf6);
Pf6 = Pf6(1:2206);

Pf_AVG2 = (Pf4 + Pf5 + Pf6) / 3;

figure(2);
plot(mel,Pf_AVG2);
title('The PSD of the Letter "e" Sound');
xlabel('Frequency (Mel)');
ylabel('Power');

f7=wavread('vc_o',22050);
F7=fft(f7,11025);
Pf7 = F7.* conj(F7)/11025;
Pf7 = transpose(Pf7);
Pf7 = Pf7(1:2206);

f8=wavread('vc_o2',22050);
F8=fft(f8,11025);
Pf8 = F8.* conj(F8)/11025;
Pf8 = transpose(Pf8);
Pf8 = Pf8(1:2206);

f9=wavread('vc_o3',22050);
F9=fft(f9,11025);
Pf9 = F9.* conj(F9)/11025;
Pf9 = transpose(Pf9);
Pf9 = Pf9(1:2206);

Pf_AVG3 = (Pf7 + Pf8 + Pf9) / 3;

figure(3);
plot(mel,Pf_AVG3);
title('The PSD of the Letter "o" Sound');
xlabel('Frequency (Mel)');
ylabel('Power');

%Combine the averaged PSDs of all three sounds "a", "e" and "o"
%into one matrix (one row per class)
Pf_Total(1,:) = Pf_AVG1; %"a"
Pf_Total(2,:) = Pf_AVG2; %"e"
Pf_Total(3,:) = Pf_AVG3; %"o"

%Train by creating a feed-forward back-propagation network
net1 = newff(minmax(mel),[80,3],{'tansig' 'purelin'},'trainlm','learnp');
net1.trainParam.epochs = 30;
net1 = train(net1,mel,Pf_Total);
%Simulate to obtain the outputs
y1 = sim(net1,mel);
%Plot the outputs
figure(4);
plot(mel,y1);
title('Plot of the Outputs of the Trained Network');
xlabel('Frequency (Mel)');
ylabel('Power');

%------------------------------------------------------------------

%Read the test data (word "pay") and process it the same way as the
%training data
s1=wavread('pl_pay',22050);
S1=fft(s1,11025);
Ps1 = S1.* conj(S1)/11025;
Ps1 = transpose(Ps1);
Ps1 = Ps1(1:2206);

s2=wavread('pl_pay2',22050);
S2=fft(s2,11025);
Ps2 = S2.* conj(S2)/11025;
Ps2 = transpose(Ps2);
Ps2 = Ps2(1:2206);

Ps_AVG = (Ps1 + Ps2) / 2;

figure(5);
plot(mel,Ps_AVG);
title('The PSD of the Word "Pay"');
xlabel('Frequency (Mel)');
ylabel('Power');

%Create another network and train it using the test data
net2 = newff(minmax(mel),[80,1],{'tansig' 'logsig'},'trainlm','learnp');
net2.trainParam.epochs = 30;
net2 = train(net2,mel,Ps_AVG);
%Simulate the output
y2 = sim(net2,mel);
figure(6);
plot(mel,y2);
title('Plot of the Output of the Trained Network with the Word "Pay"');
xlabel('Frequency (Mel)');
ylabel('Power');

%--------------------------------------------------------------------------

%Compare each output of the first network with the output of the
%second network to test in which sound class the word "pay" belongs.
%Compute the norm of the error between the outputs of the first
%network and the output of the second network
'Result of the Test with the Word "Pay"'
A1 = sqrt(abs(y1(1,:) - y2) * transpose(abs(y1(1,:) - y2)))
E1 = sqrt(abs(y1(2,:) - y2) * transpose(abs(y1(2,:) - y2)))
O1 = sqrt(abs(y1(3,:) - y2) * transpose(abs(y1(3,:) - y2)))
%Check in which class the word "pay" belongs:
%the class with the smallest error norm wins
if A1 < E1
    if A1 < O1
        'The word "pay" belongs to the "a" sound group'
    end
elseif E1 < A1
    if E1 < O1
        'The word "pay" belongs to the "e" sound group'
    end
end
if O1 < A1
    if O1 < E1
        'The word "pay" belongs to the "o" sound group'
    end
end

%---------------------------------------------------------------

%Test again with the test word "so"
'Another Test with the Word "So"'
s1=wavread('fc_so',22050);
S1=fft(s1,11025);
Ps1 = S1.* conj(S1)/11025;
Ps1 = transpose(Ps1);
Ps1 = Ps1(1:2206);

s2=wavread('fc_so2',22050);
S2=fft(s2,11025);
Ps2 = S2.* conj(S2)/11025;
Ps2 = transpose(Ps2);
Ps2 = Ps2(1:2206);

Ps_AVG = (Ps1 + Ps2) / 2;

figure(7);
plot(mel,Ps_AVG);
title('The PSD of the Word "So"');
xlabel('Frequency (Mel)');
ylabel('Power');

net3 = newff(minmax(mel),[80,1],{'tansig' 'purelin'},'trainlm','learnp');
net3.trainParam.epochs = 30;
net3 = train(net3,mel,Ps_AVG);
y2 = sim(net3,mel);
figure(8);
plot(mel,y2);
title('Plot of the Output of the Trained Network with the Word "So"');
xlabel('Frequency (Mel)');
ylabel('Power');

'Result of the Test with the Word "So"'
A2 = sqrt(abs(y1(1,:) - y2) * transpose(abs(y1(1,:) - y2)))
E2 = sqrt(abs(y1(2,:) - y2) * transpose(abs(y1(2,:) - y2)))
O2 = sqrt(abs(y1(3,:) - y2) * transpose(abs(y1(3,:) - y2)))
if A2 < E2
    if A2 < O2
        'The word "so" belongs to the "a" sound group'
    end
elseif E2 < A2
    if E2 < O2
        'The word "so" belongs to the "e" sound group'
    end
end
if O2 < A2
    if O2 < E2
        'The word "so" belongs to the "o" sound group'
    end
end
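The MAIN file runs end to end from the MATLAB command window, assuming it is saved as pj.m and the .wav files listed in Appendix A are in the current directory:

pj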

[C2] Frequency to Mel File

function mel = frq2mel(frq)
%FRQ2MEL Convert Hertz to Mel frequency scale MEL=(FRQ)
% mel = frq2mel(frq) converts a vector of frequencies (in Hz)
% to the corresponding values on the Mel scale, which corresponds
% to the perceived pitch of a tone.
% The relationship between mel and frq is given by:
%
% m = ln(1 + f/700) * 1000 / ln(1 + 1000/700)
%
% This means that m(1000) = 1000
%
% References:
%
% [1] S. S. Stevens & J. Volkman "The relation of pitch to
%     frequency", American J of Psychology, V 53, p329 1940
% [2] C. G. M. Fant, "Acoustic description & classification
%     of phonetic units", Ericsson Technics, No 1 1959
%     (reprinted in "Speech Sounds & Features", MIT Press 1973)
% [3] S. B. Davis & P. Mermelstein, "Comparison of parametric
%     representations for monosyllabic word recognition in
%     continuously spoken sentences", IEEE ASSP, V 28,
%     pp 357-366 Aug 1980
% [4] J. R. Deller Jr, J. G. Proakis, J. H. L. Hansen,
%     "Discrete-Time Processing of Speech Signals", p380,
%     Macmillan 1993
% [5] HTK Reference Manual p73
%
% Copyright (C) Mike Brookes 1998
%
% Last modified Fri Apr 3 14:57:14 1998
%
% VOICEBOX home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This program is free software; you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation; either version 2 of the License, or
% (at your option) any later version.
%
% This program is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You can obtain a copy of the GNU General Public License from
% ftp://prep.ai.mit.edu/pub/gnu/COPYING-2.0 or by writing to
% Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

mel = log(1+frq/700)*1127.01048;
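As a quick check of this mapping (a small usage sketch, assuming frq2mel.m above is on the MATLAB path), the scale is normalized so that 1000 Hz maps to approximately 1000 mel:

frq2mel([0 700 1000]) %returns roughly [0 781.2 1000.0]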

Appendix B

Matlab Simulations and Graphs





Appendix A

Training and Test Data

[A1]

Training data: Letter “a” sound: vc_a.wav, vc_a2.wav, vc_a3.wav

Letter “e” sound: vc_e.wav, vc_e2.wav, vc_e3.wav

Letter “o” sound: vc_o.wav, vc_o2.wav, vc_o3.wav

Test data: Word “pay”: pl_pay.wav, pl_pay2.wav

Word “so”: fc_so.wav, fc_so2.wav

[A2] Plots




Comments

This project was very challenging because it involved many areas of electrical engineering, with signal processing involved the most. My area of study is control systems, so I did not have much knowledge of DSP. But in completing this project I gained a variety of skills and learned how designing a single system draws on many areas of engineering. Working on this project became one of the most valuable experiences of my college life.

Conclusion/Problems

During the simulations, I encountered two difficulties with MATLAB:

1) Long execution time: when training the network, the more hidden units there are, the longer the training takes. However, the number of units must be very large in order to obtain an accurate result.

2) Maximum memory: as the size of the network grows through additional units and layers, the computer becomes unable to execute the program because it takes up so much memory. This is another cause of inaccurate simulation results.

Improvements/Extensions

Several improvements and extensions can be considered:

1) Increase accuracy by increasing the number of units in the hidden layer and the amount of training data (see the sketch after this list).

2) Extend the system to sentence-level recognition with speaker independence.

3) Add a temporal model to handle the various speaking rates of a speaker.
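As a rough illustration of improvement 1), only the layer-size argument of the newff call needs to change; the variable names mirror Appendix C, and the 200-unit count is an arbitrary example, not a tested value:

%Hypothetical sketch: a larger hidden layer (200 units instead of 80)
%for potentially better accuracy, at the cost of the training time and
%memory problems noted above
net1 = newff(minmax(mel),[200,3],{'tansig' 'purelin'},'trainlm');
net1.trainParam.epochs = 30;
net1 = train(net1,mel,Pf_Total);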

Results

The recognition results were 60% for the word “pay” and 80% for the word “so”, with an MSE of 0.28.

Simulations

The simulations are shown in Appendix B, and the training and test data are listed in Appendix A. The goal was to determine to which sound class (the letter “a”, “e”, or “o” class) each of the two test inputs belongs. The two test inputs are the words “pay” and “so.” The sounds of the letters “a”, “e”, and “o” were recorded as training data, and the words “pay” and “so” were recorded as test data. The expected result is that the word “pay”, since it sounds like the letter “a”, belongs to the letter “a” class, and the word “so” belongs to the “o” class, because it sounds similar to “o.”

Approach

Procedure

There are a number of steps in designing the speech recognition system using MATLAB:

1) Record raw speech. A microphone and the Sound Recorder in Windows are used to record the signals, which serve as both training and test data.

2) Read the signals into MATLAB. The function “wavread” is used to read the recorded signals.

3) Transform the signals into a simpler form. In this step, the function “fft(signal, N)” is used, where N is the FFT length.

4) Create a feed-forward back-propagation network using the function “newff.”

5) Train the network with the training data.

6) Test the network by creating a new network from the test data and comparing the outputs of the two networks to see how similar they are.

7) Determine and classify the test data into the different groups using the result of step 6) (a compact sketch of steps 6 and 7 follows this list).
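The following is a condensed sketch of steps 6) and 7), assuming y1 holds the three class outputs of the trained network and y2 the output for a test word, as in the MAIN file of Appendix C; the test word is assigned to the class whose output is nearest in the Euclidean sense:

%Euclidean distance between each class output and the test output
dA = norm(y1(1,:) - y2); %distance to the "a" class output
dE = norm(y1(2,:) - y2); %distance to the "e" class output
dO = norm(y1(3,:) - y2); %distance to the "o" class output
%Pick the class with the smallest distance
[dmin,k] = min([dA dE dO]);
classes = {'a','e','o'};
sprintf('The test word belongs to the "%s" sound group', classes{k})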

Introduction

What Is Speech?

Speech is a sequence of sounds, a signal with certain frequencies, generated by the vocal apparatus: a complex machine consisting of the lips, jaw, tongue, velum, and larynx (the upper part of the trachea). This machine has nonlinear properties and is so sensitive that it is easily affected by various factors, from gender to emotional state. Dealing with these factors makes speech recognition a very complex design problem.

Speech Recognition Mechanism

A diagram of the human vocal apparatus is shown.

Speech sounds are roughly classified into three classes, depending on their method of excitation:

a) Voiced sounds - produced when air passing through the larynx causes vibration of the vocal folds.

- Ex. /i/, /u/, /a/.

b) Fricative sounds - generated by forcing air through a constriction formed at some point in the vocal tract, which results in a turbulent flow of air in that region.

- Ex. /θ/ ('thin')

c) Plosive sounds - produced when a build-up of pressure behind a complete closure in the vocal tract is suddenly released by the articulators (lips, tongue, etc).

- Ex. /p/ ('pick')

The sounds in classes a) through c) are considered basic linguistic units and are often referred to as phonemes.

Speech Recognition

Speech recognition is a system that takes speech signals as inputs, interprets them, and produces some form of output.

Why is this system important?

There are several reasons this kind of system is useful in our lives. For example, people are more comfortable interacting with computers through speech than by typing on a keyboard and pointing with a mouse.

Another reason is convenience. Devices such as voice recognition in cell phones, telephone directory assistance, and automatic voice translation into foreign languages are all made for convenience.

Levels of Speech Inputs

There are three levels of inputs that can be considered in speech recognition:

a) Sentence-level: the largest level, in which a whole sentence is represented as one input.

b) Word-level: each isolated word in a sentence.

c) Phoneme-level: the individual sounds within each word.

The sentence-level recognition system is the most difficult to design, because a sentence is a combination of words and phonemes, and a large recognition system is necessary to classify each sentence.

Accuracy

The accuracy of speech recognition systems is affected by various conditions, such as:

1) Vocabulary size and confusability

It is easy to discriminate among a small set of words (e.g., the words “zero” through “nine”). However, as the set grows to 5,000 or 10,000 words, the system has a harder time discriminating among them, since many words have similar sounds. This is what confusability means.

2) Speaker dependence and independence

Speaker dependent means that the recognition experiment deals with a single speaker, whereas a speaker independent system must handle any speaker. Speaker independent systems are harder to build because each person's voice has its own characteristics. Generally, the error rates are 3 to 5 times higher for speaker independent systems than for dependent systems [Bib1].

3) Isolated, discontinuous, or continuous speech

Isolated speech means a single word. Discontinuous speech means full sentences in which the words are artificially separated by silence. Continuous speech means naturally spoken sentences. Of the three, isolated and discontinuous speech recognition are relatively easy to cope with because the words are separated, but continuous speech recognition is more difficult because word boundaries are unclear, much like a written sentence with no spaces between the words.

4) Adverse Conditions

Examples of adverse conditions would be:

  • Environmental noise
  • Acoustical distortions - echoes, room acoustics
  • Different microphones used for recording
  • Altered speaking manner – shouting, whining, speaking quickly

Kinds of Variability

Currently, speech recognition systems distinguish between two kinds of variability:

1) Acoustic variability – covers different accents, pronunciations, pitches, volumes, etc.

2) Temporal variability – covers different speaking rates

One thing to note: although different speaking rates do alter the acoustic patterns (accents, pitches, etc.), it is a useful simplification to treat the two kinds of variability independently.

In this project, only the acoustic model is considered, meaning the speaking rate is assumed to be constant. Also, the simulations are speaker dependent (trained and tested on my own voice), and a word-level recognition system is used.

Structure of Speech Recognition



Components of the System

1) Raw Speech:

Raw speech is recorded using a microphone, and each speech signal is sampled at a high frequency, here 22 kHz. Human speech carries only a little information beyond about 9 kHz, and since the maximum frequency that can be sampled without aliasing is f = fs/2, where fs is the sampling rate, f = 22 kHz / 2 = 11 kHz is adequate.
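As a quick numerical check of this sampling argument (a sketch; the 22050 Hz rate matches the code in Appendix C):

fs = 22050;   %sampling rate of the recordings (Hz)
fmax = fs/2   %Nyquist frequency = 11025 Hz; frequencies above
              %this cannot be represented without aliasing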

2) Signal Analysis:

Raw speech must be compressed and transformed, without losing useful information, to simplify the subsequent processing. In this project, the FFT (Fast Fourier Transform) is used, as sketched below.
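A minimal sketch of this analysis step, mirroring the MAIN file in Appendix C (vc_a.wav is one of the training recordings listed in Appendix A):

x = wavread('vc_a',22050); %read the first 22050 samples (1 s) of speech
X = fft(x,11025);          %11025-point Fast Fourier Transform
Px = X.*conj(X)/11025;     %normalized power spectrum density
Px = Px(1:2206);           %keep only the first 2206 bins as features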

3) Speech Frames:

The result of the signal analysis is the speech frames: after the data is transformed by the FFT, the transformed signals become the frames that feed the next stage.

Acoustic Model/Acoustic Analysis:

The speech frames then go to the acoustic analysis stage. This block trains the acoustic model using the speech-frame data, and it is where the concept of neural networks comes in. How the model is trained is explained in more detail in the body of this report; a condensed sketch follows.
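In this project the acoustic model is the feed-forward back-propagation network created in Appendix C. Below is a condensed sketch of its training step, using the same calls and layer sizes as the MAIN file (mel is the input frequency axis and Pf_Total the matrix of class PSDs):

%One hidden layer of 80 tansig units and a linear output layer,
%trained with Levenberg-Marquardt for 30 epochs
net1 = newff(minmax(mel),[80,3],{'tansig' 'purelin'},'trainlm');
net1.trainParam.epochs = 30;
net1 = train(net1,mel,Pf_Total); %inputs: mel frequencies, targets: class PSDs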


Abstract

A speech recognition system is a device that analyzes raw speech signal inputs by converting them into a readable form and produces certain outputs. In this project, the system classifies raw speech signals using multilayer perceptrons, a form of artificial neural network. With this network, the speech recognition system can be trained to memorize each word and can determine to which class a test input word belongs. This report explains speech recognition in detail and shows how the classifications are done by describing the classification algorithm and briefly explaining the MATLAB simulation programs and output plots, all of which I produced myself.