Sunday, December 24, 2006


What Is Speech?

Speech is a sequence of sounds, or a signal with certain frequencies, generated by the vocal apparatus, a complex machine that consists of the lips, jaw, tongue, velum, and larynx (at the upper end of the trachea). This machine has nonlinear properties and is very sensitive: it is easily affected by various factors, from the speaker's gender to emotional state. Dealing with these factors makes designing a speech recognition system a very complex problem.

Speech Production Mechanism

A diagram of the human vocal apparatus is shown.

Speech sounds are roughly classified into three classes, depending on their method of excitation:

a) Voiced sounds - produced when air passing through the larynx causes vibration of the vocal folds.

- Ex. /i/, /u/, /a/.

b) Fricative sounds - generated by forcing air through a constriction (a narrowing) formed at some point in the vocal tract, which results in a turbulent flow of air in that region.

- Ex. /θ/

c) Plosive sounds - produced when a build-up of pressure behind a complete closure in the vocal tract is suddenly released by the articulators (lips, tongue, etc.).

- Ex. /p/ ('pick')

The individual sounds in classes a) through c) are basic linguistic units, often referred to as phonemes.

Speech Recognition

Speech recognition is a system that takes speech signals as input, interprets them, and produces some form of output.

Why is this system important?

There are several reasons why this system is useful in daily life. For example, people are more comfortable interacting with computers through speech than by typing on a keyboard and pointing with a mouse.

Another reason is convenience. Applications such as voice recognition in cell phones, telephone directory assistance, and automatic voice translation into foreign languages are all built for convenience.

Levels of Speech Inputs

There are three levels of inputs that can be considered in speech recognition:

a) Sentence-level: the largest level, in which a whole sentence is represented as one input.

b) Word-level: each isolated word in a sentence.

c) Phoneme-level: each distinct sound within a word.

The sentence-level recognition system is the most difficult to design because a sentence is a combination of words and phonemes, so a large recognition system is necessary to classify each sentence.


The accuracy of a speech recognition system is affected by various conditions, such as:

1) Vocabulary size and confusability

It is easy to discriminate among a small set of words (e.g. the words “zero” to “nine”). However, as the vocabulary grows to 5,000 or 10,000 words, the system has a harder time discriminating each word because more words sound similar to one another. This is the meaning of confusability.

2) Speaker dependence and independence

A speaker-dependent system is trained and tested on a single speaker. A speaker-independent system, on the other hand, must handle any speaker. Speaker independence is harder to achieve because each person's voice has its own characteristics. Generally, error rates are 3 to 5 times higher for speaker-independent systems than for speaker-dependent ones [Bib1].

3) Isolated, discontinuous, or continuous speech

Isolated speech means single words. Discontinuous speech means full sentences in which the words are artificially separated by silence. Continuous speech means naturally spoken sentences. Of these three, isolated and discontinuous speech are relatively easy to recognize because the words are separated. Continuous speech recognition is more difficult because word boundaries are unclear, much like a written sentence with no spaces between words.
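To illustrate why the silences in discontinuous speech help, here is a toy sketch that finds word boundaries by looking at short-time energy. It is not part of this project: the function name `segment_by_silence`, the window size, and the threshold are all assumptions picked for the example, and the "words" are just synthetic bursts.

```python
def segment_by_silence(samples, window, threshold):
    """Return (start, end) sample ranges whose short-time energy exceeds threshold."""
    segments = []
    start = None
    for pos in range(0, len(samples), window):
        chunk = samples[pos:pos + window]
        # Mean squared amplitude of this window.
        energy = sum(x * x for x in chunk) / len(chunk)
        if energy > threshold and start is None:
            start = pos                      # a loud region begins
        elif energy <= threshold and start is not None:
            segments.append((start, pos))    # the loud region ended
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


# Toy "discontinuous speech": two bursts separated by silence (made-up data).
silence = [0.0] * 300
word = [0.5, -0.5] * 100  # 200 samples
signal = silence + word + silence + word + silence

print(segment_by_silence(signal, window=100, threshold=0.01))
# [(300, 500), (800, 1000)]
```

With continuous speech there are no such quiet gaps between words, so a simple threshold like this has nothing to latch onto.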

4) Adverse Conditions

Examples of adverse conditions would be:

  • Environmental noise
  • Acoustical distortions - echoes, room acoustics
  • Different microphones used for recording
  • Altered speaking manner – shouting, whining, speaking quickly

Kinds of Variability

Currently, speech recognition systems distinguish between two kinds of variability:

1) Acoustic variability – covers different accents, pronunciations, pitches, volumes, etc.

2) Temporal variability – covers different speaking rates

Note that although different speaking rates also change the acoustic patterns (accents, pitches, etc.), it is a useful simplification to treat the two kinds of variability independently.

In this project, only acoustic variability is considered, which means the speaking rate is assumed to be constant. For the simulations, the system is speaker dependent (the only speaker is myself) and performs word-level recognition.

Structure of Speech Recognition

Components of the System

1) Raw Speech:

Raw speech is recorded using a microphone, and each speech signal is sampled at a high rate of 22 kHz. Human speech carries only a little information above 9 kHz, and the maximum frequency that can be captured without aliasing is half the sampling rate:

f = s/2, where s = sampling rate (samples per second). With s = 22 kHz, f = 22k / 2 = 11 kHz, which is adequate.
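The arithmetic above can be checked with a short sketch. The helper names `nyquist_frequency` and `apparent_frequency` are hypothetical, introduced only for illustration. The second one shows where a pure tone above the limit would show up after sampling.

```python
def nyquist_frequency(sample_rate_hz: float) -> float:
    """Highest frequency that can be captured without aliasing."""
    return sample_rate_hz / 2.0


def apparent_frequency(tone_hz: float, sample_rate_hz: float) -> float:
    """Frequency at which a pure tone appears after sampling (folding about the limit)."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


sample_rate = 22_000.0  # 22 kHz, the rate used in the text

print(nyquist_frequency(sample_rate))             # 11000.0
print(apparent_frequency(9_000.0, sample_rate))   # 9000.0 (below the limit, unchanged)
print(apparent_frequency(15_000.0, sample_rate))  # 7000.0 (above the limit, aliased)
```

Since speech has little content above 9 kHz, nearly everything of interest sits below the 11 kHz limit and is captured as-is.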

2) Signal Analysis:

Raw speech must be transformed and compressed, while preserving the information needed for recognition, to simplify the subsequent processing. In this project, the FFT (Fast Fourier Transform) is used.
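As a rough sketch of what the FFT gives us, the code below implements a textbook radix-2 FFT with only the standard library and uses it to find the strongest frequency in a synthetic tone. The 1024-sample length, the test tone, and the helper `dominant_frequency` are assumptions for the example, not the settings used in this project. In practice a library FFT routine would normally be used instead.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return [complex(samples[0])]
    if n % 2 != 0:
        raise ValueError("length must be a power of two")
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) of the strongest bin below half the sampling rate."""
    spectrum = fft(samples)
    half = len(samples) // 2
    magnitudes = [abs(c) for c in spectrum[:half]]
    peak_bin = max(range(half), key=magnitudes.__getitem__)
    return peak_bin * sample_rate_hz / len(samples)


SAMPLE_RATE = 22_000  # Hz, as in the text
N = 1024              # assumed analysis length (power of two)
TONE_BIN = 47         # tone placed exactly on a bin to avoid leakage
tone_hz = TONE_BIN * SAMPLE_RATE / N

signal = [math.sin(2 * math.pi * tone_hz * t / SAMPLE_RATE) for t in range(N)]
print(dominant_frequency(signal, SAMPLE_RATE), tone_hz)
```

The two printed values should agree, which shows that the transform turns a waveform into a description of which frequencies are present.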

3) Speech Frames:

The result of the signal analysis is a set of speech frames: once the data has been transformed by the FFT, the signals are converted to speech frames.
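One common way to arrive at frames, sketched here as an assumption rather than as the exact procedure of this project, is to cut the signal into short overlapping windows and taper each one before it is transformed. The frame length, hop size, and the names `split_into_frames` and `hamming` are illustrative choices.

```python
import math


def split_into_frames(samples, frame_len, hop):
    """Cut a signal into overlapping frames of frame_len samples, hop samples apart."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def hamming(frame):
    """Apply a Hamming window to soften the edges of a frame."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


SAMPLE_RATE = 22_000  # Hz, as in the text
FRAME_LEN = 512       # about 23 ms at 22 kHz (assumed)
HOP = 256             # 50% overlap between frames (assumed)

# One second of a synthetic 440 Hz tone standing in for recorded speech.
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
frames = [hamming(f) for f in split_into_frames(signal, FRAME_LEN, HOP)]

print(len(frames), len(frames[0]))  # 84 512
```

Each frame is short enough that the sound inside it is roughly steady, which is what makes a per-frame spectrum meaningful.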

4) Acoustic Model/Acoustic Analysis:

The speech frames are passed to the acoustic analysis block, which trains the acoustic model using the speech frame data. This is where neural networks come in. How the model is trained will be explained in more detail later on.