We have also established, through experimentation, that the problems of using spoken words as a source of activity recognition features are very challenging, and worse than we initially imagined. We expected the speech recognition aspect to be extremely difficult (and solving it completely would be impossible within this project), but we had hoped that the approach of matching words to activities would be relatively effective straight out of the box. However, our experimentation has shown that while the idea is promising, it would require much more careful design and investigation to make it work.
In this document we discuss our progress on the two pieces of this project, and within that discussion we clearly outline what we will do over the remaining duration of the project. At this point we have done a lot of work in collecting data, performing experiments, and understanding the audio recognition problem. A brief summary of the next stage of the project is given in the two "Expected Work over the next 3 weeks" sections below.
Description of Currently Implemented Recognizer
Words are modeled with individual HMMs over MFCC features. The feature vector is essentially the standard representation [10] used in most speech recognition systems and described in many accounts of these techniques: 12 MFCC coefficients, their 12 first-order derivatives, their 12 second-order derivatives, and the energy in the signal. Features are computed over 25 ms frames of audio that partially overlap. To do this we use Malcolm Slaney's MFCC Matlab implementation (available here [1]). The features are generated in Matlab, but training and classification are done with HTK [3], an HMM toolkit written in C.

Here we are only applying the basic approach to word recognition that we read about in multiple sources (e.g., [7]), in which the whole word is modeled without explicitly representing the internal components that make up the word (e.g., phonemes and other sub-word units). An HMM is trained for each word: we model each word with a 7-node HMM, trained with explicit silence nodes at the start and end of the temporal sequence of each word model. Word models are then linked together through these silence nodes. The HTK toolkit provides two training functions, implementing standard speech recognition techniques, that we use to train this basic word model: (i) Viterbi-based alignment, which segments the sequence of features and assigns them to specific HMM nodes, and (ii) a Baum-Welch refinement of the initial segmentation and HMM set up by (i). This approach is discussed frequently (e.g., [9]) and documented as a process that can be applied in [8]. When we want to determine which word is being overheard, we use a Viterbi-driven approach that finds the HMM most likely to have produced the observations (i.e., the sequence of features calculated from the stream of sound).
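To make the feature pipeline concrete, the sketch below shows roughly how the per-frame feature vector could be assembled in Matlab. It assumes the mfcc(signal, fs, frameRate) interface from Slaney's toolbox, uses simple first differences for the derivative terms, and treats the first cepstral row as an energy proxy; the file name and the exact window, delta, and energy settings of our actual pipeline are not shown here and may differ.

```matlab
% Sketch of the feature extraction stage (assumes Malcolm Slaney's MFCC
% implementation [1] is on the path; window/hop details are illustrative).
[signal, fs] = audioread('coffee_example.wav');   % hypothetical mono recording

frameRate = 100;                      % ~10 ms hop; our frames are 25 ms with overlap
ceps = mfcc(signal, fs, frameRate);   % cepstral coefficients, one column per frame

C = ceps(2:13, :);                    % 12 MFCC coefficients (dropping the first row)
E = ceps(1, :);                       % first coefficient used as an energy proxy here

% Simple first differences as delta features (a regression window is more common)
dC  = [zeros(12, 1), diff(C, 1, 2)];  % first-order derivatives
ddC = [zeros(12, 1), diff(dC, 1, 2)]; % second-order derivatives

features = [C; dC; ddC; E];           % 37-dimensional feature vector per frame
```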
Experimental Setup
At this point our recognizer is only set up to recognize words clearly separated by clean blocks of silence, which are explicitly modeled with their own HMM. We have trained the recognizer on only two words, using a laptop microphone. We trained the HMMs with 15 examples of each word, the words being "slices" and "coffee"; we chose these words because they were two discriminative words that we heard at specific locations in our experiments. We have performed only very minor testing because the current recognizer, although it does work, is quite brittle. Training was a pain: we recorded each word with a microphone attached to a desktop computer and then hand-labeled the sections of the trace corresponding to silence or to the word itself. These labels were written to a text label file in the format accepted by the HTK toolkit. We labeled 30 training examples and 21 test examples.
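As an illustration, one of these hand-made HTK label files looks roughly like the following (times are in HTK's 100 ns units; the boundary values shown here are made up):

```
      0  3600000 sil
3600000  8200000 coffee
8200000 11600000 sil
```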
Preliminary Results
We perform two types of test. In the simple test, we provide input under conditions very similar to training: a single word surrounded by two generous blocks of silence. In the challenging test, we provide a single stream of words in which we deliberately pronounce the words a bit differently, with some instances separated only by short gaps of silence. In both experiments we use the same single speaker who provided the training set (me!).
As an example of the HMMs we were able to train, I have included the specification of the HMM learned for the word "coffee". It is available here, in the representation that the HTK toolkit produces from the training phase: a markup-language-like format that includes the states, the emission distributions associated with each state, and a transition matrix over the states:
specification for the coffee word hmm
Under the simple test we performed 20 additional single-shot experiments in which the speaker said only a single word: 10 examples of each word (20 instances in total). Every instance was classified correctly. The words were deliberately said with different inflection and style; however, the audio is very clean (little background noise), the vocabulary is obviously extremely limited, and the recognizer has been trained and tested with a single speaker, so it is clearly not speaker independent. This has been a toy exercise carried out while we get used to the ideas involved in speech recognition, but it shows the recognizer essentially works under these limited conditions.
Under the challenging test the system fails badly, even at segmenting the stream into word chunks. In this experiment 40 words were spoken, all drawn from the same two words (slices and coffee); the recognizer identified only 7 word chunks and labeled a single word correctly, a word accuracy of about 2.5% (1 of 40). The words were said quickly or slowly and with other deliberate variations, but the result is remarkably bad even allowing for that; segmenting only 7 of 40 words is very poor. We believe that poor segmentation contributes heavily to this result: if a word chunk is poorly segmented from the audio stream, it is very difficult to make a good match with the known HMMs that represent words in the vocabulary.
Clearly our model, as it stands, lacks the additional robustness required by the challenging experiment. Bear in mind also that the ultimate goal of classifying ambient, overheard spoken words is much more challenging than even this tougher example.
Expected Work over the next 3 weeks
We have to be realistic about what can be done over the next three weeks. An important goal of this project is to strengthen our understanding of the HMM as a tool. Another goal is to port this prototype recognizer to run on an Apple iPhone. Thinking this through, all we need on the phone is to evaluate the model, not to train it. Assuming we stay with a word model, we just need to determine which word model is most likely given the inputs. This means we only need to run the MFCC feature extraction and the evaluation (forward) part of the forward-backward machinery on the phone in order to decide the most likely word given the inputs.
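As a starting point for that port, here is a minimal Matlab sketch of the evaluation step: it scores a feature sequence against one word HMM using the forward algorithm in the log domain, assuming a single diagonal-covariance Gaussian per state. The function and variable names are ours (not HTK's); the C version for the phone would follow the same structure.

```matlab
function loglik = word_loglik(X, prior, transmat, mu, sigma2)
% Score a feature sequence against one word HMM using the forward algorithm
% (log domain). X: D x T features; prior: N x 1 initial state probabilities;
% transmat: N x N transitions; mu, sigma2: D x N per-state Gaussian means and
% (diagonal) variances. Returns log p(X | word model).
[D, T] = size(X);
N = numel(prior);

% Log emission probability of every frame under every state's Gaussian
logB = zeros(N, T);
for j = 1:N
    d = bsxfun(@minus, X, mu(:, j));
    logB(j, :) = -0.5 * sum(bsxfun(@rdivide, d.^2, sigma2(:, j)), 1) ...
                 - 0.5 * (D * log(2 * pi) + sum(log(sigma2(:, j))));
end

% Forward recursion in the log domain to avoid underflow
logA  = log(transmat + realmin);
alpha = log(prior + realmin) + logB(:, 1);
for t = 2:T
    m = bsxfun(@plus, alpha, logA);      % m(i,j) = alpha_i(t-1) + log a_ij
    alpha = logsumexp_cols(m)' + logB(:, t);
end
loglik = logsumexp_cols(alpha);          % sum over states at the final frame
end

function s = logsumexp_cols(M)
% Log-sum-exp down each column of M, guarding against underflow.
mx = max(M, [], 1);
s = mx + log(sum(exp(bsxfun(@minus, M, mx)), 1));
end
```

At recognition time we would compute this score for each trained word model (with parameters read out of the HTK definition files) and pick the word whose model gives the largest value.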
What we plan to do is stay with this simple recognizer design and select two or three words to recognize. We will then strengthen our model, which currently recognizes silence, by adding a model that captures other words and general background noise. We plan to implement the forward-backward algorithm by hand in C, based on Matlab and published implementations. This should be fun and will give us an application that recognizes a couple of words directly on the phone. The hope is for it to do this much more robustly than it does currently, and we will experiment with the strengthened background model to help with this. Also, [7] describes a simple technique called 'cepstral high-pass filtering' that has been found to be quite effective in making recognition robust to background noise. Essentially (assuming I have understood it correctly), it just normalizes the average of the features to zero in both the training and test sets. The mismatch between training and test conditions is the major problem with speech recognition in a noisy environment, since the classifier is asked to identify a word it has never seen spoken under such conditions; by normalizing, the differences between the test and training environments can be partly mitigated.
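Assuming we have understood the technique correctly, the normalization itself is very small; a sketch over the feature matrix from the earlier extraction step, applied identically to training and test data, might look like this:

```matlab
% Cepstral mean normalization (our reading of the 'high-pass filtering' idea):
% subtract each coefficient's mean over the utterance so that training and
% test features share a zero average. 'features' is the D x T matrix above.
features_cmn = bsxfun(@minus, features, mean(features, 2));
```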
This is the plan for this part of the project from now until the end.
Our first step towards examining how effective word recognition can be as a feature for activity recognition has been to look at transcriptions of the words heard in recordings made during everyday activities. This is a nice idea that was suggested to us, because it allows us to side-step for the moment the technical difficulties of speech recognition and get quickly to the heart of the problem. We have done a fairly extensive data collection and performed some preliminary analysis.
Description of Experiments
We collected audio with an Apple iPhone while performing everyday activities. We have collected around nine different activity types, with multiple examples of each. In total we have captured around 30 hours of audio data while going about our normal activities. Each activity was manually annotated by the person stating, at the start of a new activity, what was going on; when the audio file was later reviewed, this description was treated as outside the experimental data and served as a source of ground truth. The types of activities we captured were: bank, burrito place, bookstore, gas station, bar, library, gym, pizza place, and coffee shop.
Our method for these transcriptions was to apply a best-effort principle, because perfect transcription would have been too time consuming. We listened to each trace only once, and stopped after 10 minutes of audio for any one activity sample. We would pause the trace if we fell behind in typing the words that were said, but we would not go back, and we only transcribed words that were fairly clear and that we were certain of. In performing these experiments we were quite surprised by just how hard it was to transcribe the audio: many, many words in the traces could not be understood at all.
Preliminary Results and Observations
We have manually transcribed probably around a third of the data we have collected. The results were very useful in shaping our thinking, but not very promising for supporting our hypothesis in an unmodified, out-of-the-box form, which is of course why we do experiments in the first place!
We made the following observations during our experiments and transcriptions:
The result of these limitations is that our word-based activity recognition is quite poor.
We performed the following experiment to demonstrate the effect of these limitations of the data set when combined with off-the-shelf information retrieval (IR) techniques. Using [11], a Matlab module for textual data mining, we built a TF-IDF structure, applying the Porter stemming algorithm [12] and a stop-word list based on a standard dictionary that ships with the TMG project [11]. We then applied k-means clustering to the vectors of term frequencies. The result was very poor, and this is even after setting aside the activities for which no words were left once stop words were removed, and those for which no words were overheard in the trace at all (e.g., some of the examples taken from the gym).
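For reference, a rough Matlab sketch of this clustering step is given below. It does not use TMG's actual interface; the tiny example documents, the stop-word list, and the use of Matlab's kmeans (Statistics Toolbox) are stand-ins for the real pipeline, and stemming is omitted for brevity.

```matlab
% Rough sketch of the TF-IDF + k-means step (the real pipeline uses TMG [11];
% its actual interface, stop-word dictionary, and stemming are not shown here).
docs = {lower(strsplit('large coffee please and two sugars')), ...
        lower(strsplit('two slices of pepperoni to go'))};     % illustrative only
stopwords = {'and', 'of', 'to', 'two', 'please'};              % stand-in list

vocab = setdiff(unique([docs{:}]), stopwords);                 % remove stop words

% Term-document matrix of raw counts (terms x documents)
td = zeros(numel(vocab), numel(docs));
for d = 1:numel(docs)
    for w = 1:numel(vocab)
        td(w, d) = sum(strcmp(docs{d}, vocab{w}));
    end
end

% TF-IDF weighting followed by k-means over the document vectors
idf   = log(numel(docs) ./ max(sum(td > 0, 2), 1));
tfidf = bsxfun(@times, td, idf);
k = 2;                            % illustrative; the table below has seven clusters
labels = kmeans(tfidf', k);       % requires the Statistics Toolbox
```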
The following clusters were created. In this table the rows are the clusters and the columns are the true labels of the activity examples assigned to each cluster; it can be read as a type of confusion matrix.
cluster | bank | burrito place | bookstore | gas station | bar | library | gym | pizza | coffee |
cluster 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
cluster 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
cluster 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
cluster 4 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 1 |
cluster 5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
cluster 6 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 |
cluster 7 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 3 | 4 |
Some of the reasons for this very poor clustering were already mentioned in the list of observations about the data collected. However, it should be made clear that a lot of the error is due to the confusion caused by overheard conversations, which are of course fairly similar regardless of the location and activity. Many of the problems discussed stem from the words captured during an activity not being associated with the activity at all, but with ambient conversations, which spread words unrelated to the activity throughout the data set.
Expected Work over the next 3 weeks
We performed quite a broad data collection of activities and were, frankly, disappointed by even the basic material we found in the audio traces (a lack of words overall, and a high proportion of words that were rare but unrelated to the activity).
However, when listening to the traces we did notice the following promising aspects:
The goal will be to show from the data that this idea has potential and to provide some basic results supporting this. That is all.