We have also established, through experimentation, that the problems of using spoken words as a source of activity recognition features are very challenging, and worse than we initially imagined. We expected the speech recognition aspect to be extremely difficult (and solving it completely would be impossible within this project), but we had hoped that the approach of matching words to activities would be relatively effective straight out of the box. However, our experimentation has shown that while the idea is promising, it would require much more careful design and investigation to make it work.
In this document we discuss our progress on the two pieces of this project, and within that discussion we clearly outline what we will do over the remaining duration of the project. At this point we have done a lot of work in collecting data, performing experiments, and understanding the audio recognition problem. A brief summary of the next stage of the project is given in the two "Expected Work over the next 3 weeks" sections below.
Description of Currently Implemented Recognizer
Words are modeled with individual HMMs over MFCC features. The feature vector is essentially the standard representation [10] used in most speech recognition systems and described in many accounts of these techniques: 12 MFCC coefficients, their 12 first-order derivatives, their 12 second-order derivatives, and the energy in the signal. Features are computed over 25 ms frames of audio that partially overlap. To do this we use Malcolm Slaney's MFCC Matlab implementation (available here [1]). The features are generated in Matlab, but training and classification are done with HTK [3], an HMM toolkit written in C.

Here we are only applying the basic approach to word recognition that we read about in multiple sources (e.g., [7]), in which the whole word is modeled without explicitly representing the internal components that make up the word (e.g., phonemes and other sub-word units). An HMM is trained for each word: we model each word with a 7-node HMM, trained with explicit silence nodes at the start and end of the temporal sequence of each word model. Word models are then linked together through these silence nodes. The HTK toolkit provides two training functions, implementing standard speech recognition techniques, that we use to train this basic word model: (i) Viterbi-based alignment, which segments the sequence of features and assigns them to specific HMM nodes, and (ii) a Baum-Welch refinement of the initial segmentation and HMM set up by (i). This approach is discussed frequently (e.g., [9]) and documented as a process that can be applied in [8]. When we want to determine which word is being overheard, we use a Viterbi-driven approach that finds the HMM most likely to have produced the observations (i.e., the sequence of features calculated from the stream of sound).
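To make the feature pipeline concrete, the sketch below shows roughly how the per-frame feature vector could be assembled in Matlab. It assumes the mfcc(signal, fs, frameRate) interface from Slaney's toolbox, uses simple first differences for the derivative terms, and treats the first cepstral row as an energy proxy; the file name and the exact window, delta, and energy settings of our actual pipeline are not shown here and may differ.

```matlab
% Sketch of the feature extraction stage (assumes Malcolm Slaney's MFCC
% implementation [1] is on the path; window/hop details are illustrative).
[signal, fs] = audioread('coffee_example.wav');   % hypothetical mono recording

frameRate = 100;                      % ~10 ms hop; our frames are 25 ms with overlap
ceps = mfcc(signal, fs, frameRate);   % cepstral coefficients, one column per frame

C = ceps(2:13, :);                    % 12 MFCC coefficients (dropping the first row)
E = ceps(1, :);                       % first coefficient used as an energy proxy here

% Simple first differences as delta features (a regression window is more common)
dC  = [zeros(12, 1), diff(C, 1, 2)];  % first-order derivatives
ddC = [zeros(12, 1), diff(dC, 1, 2)]; % second-order derivatives

features = [C; dC; ddC; E];           % 37-dimensional feature vector per frame
```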
Experimental Setup
At this point our recognizer is only set up to recognize words clearly separated by clean blocks of silence, which are explicitly modeled with their own HMM. We have trained the recognizer on only two words, using a laptop microphone. We trained the HMMs with 15 examples of each word, the words being "slices" and "coffee"; we chose these words because they were two discriminative words that we heard at specific locations in our experiments. We have performed only very minor testing because the current recognizer, although it does work, is quite brittle. Training was a pain: we recorded each word with a microphone attached to a desktop computer and then hand-labeled the sections of the trace corresponding to silence or to the word itself. These labels were written to a text label file in the format accepted by the HTK toolkit. We labeled 30 training examples and 21 test examples.
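As an illustration, one of these hand-made HTK label files looks roughly like the following (times are in HTK's 100 ns units; the boundary values shown here are made up):

```
      0  3600000 sil
3600000  8200000 coffee
8200000 11600000 sil
```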
Preliminary Results
We perform two types of test. In the simple test, we provide input under conditions very similar to training: a single word surrounded by two generous blocks of silence. In the challenging test, we provide a single stream of words in which we deliberately pronounce the words a bit differently, with some instances separated only by short gaps of silence. In both experiments we use the same single speaker who provided the training set (me!).
As an example of the HMMs we were able to train, I have included the specification of the HMM learned for the word "coffee". It is available here, in the representation that the HTK toolkit produces from the training phase: a markup-language-like format that includes the states, the emission distributions associated with each state, and a transition matrix over the states:
specification for the coffee word hmm
Under the simple test we performed 20 additional single-shot experiments in which the speaker said only a single word: 10 examples of each word (20 instances in total). Every instance was classified correctly. The words were deliberately said with different inflection and style; however, the audio is very clean (little background noise), the vocabulary is obviously extremely limited, and the recognizer has been trained and tested with a single speaker, so it is clearly not speaker independent. This has been a toy exercise carried out while we get used to the ideas involved in speech recognition, but it shows the recognizer essentially works under these limited conditions.
Under the challenging test the system fails badly, even at segmenting the stream into word chunks. In this experiment 40 words were spoken, all drawn from the same two words (slices and coffee); the recognizer identified only 7 word chunks and labeled a single word correctly, a word accuracy of about 2.5% (1 of 40). The words were said quickly or slowly and with other deliberate variations, but the result is remarkably bad even allowing for that; segmenting only 7 of 40 words is very poor. We believe that poor segmentation contributes heavily to this result: if a word chunk is poorly segmented from the audio stream, it is very difficult to make a good match with the known HMMs that represent words in the vocabulary.
Clearly our model, as it stands, lacks the additional robustness required by the challenging experiment. Bear in mind also that the ultimate goal of classifying ambient, overheard spoken words is much more challenging than even this tougher example.
Expected Work over the next 3 weeks
We have to be realistic about what can be done over the next three weeks. An important goal of this project is to strengthen our understanding of the HMM as a tool. Another goal is to port this prototype recognizer to run on an Apple iPhone. Thinking this through, all we need on the phone is to evaluate the model, not to train it. Assuming we stay with a word model, we just need to determine which word model is most likely given the inputs. This means we only need to run the MFCC feature extraction and the evaluation (forward) part of the forward-backward machinery on the phone in order to decide the most likely word given the inputs.
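As a starting point for that port, here is a minimal Matlab sketch of the evaluation step: it scores a feature sequence against one word HMM using the forward algorithm in the log domain, assuming a single diagonal-covariance Gaussian per state. The function and variable names are ours (not HTK's); the C version for the phone would follow the same structure.

```matlab
function loglik = word_loglik(X, prior, transmat, mu, sigma2)
% Score a feature sequence against one word HMM using the forward algorithm
% (log domain). X: D x T features; prior: N x 1 initial state probabilities;
% transmat: N x N transitions; mu, sigma2: D x N per-state Gaussian means and
% (diagonal) variances. Returns log p(X | word model).
[D, T] = size(X);
N = numel(prior);

% Log emission probability of every frame under every state's Gaussian
logB = zeros(N, T);
for j = 1:N
    d = bsxfun(@minus, X, mu(:, j));
    logB(j, :) = -0.5 * sum(bsxfun(@rdivide, d.^2, sigma2(:, j)), 1) ...
                 - 0.5 * (D * log(2 * pi) + sum(log(sigma2(:, j))));
end

% Forward recursion in the log domain to avoid underflow
logA  = log(transmat + realmin);
alpha = log(prior + realmin) + logB(:, 1);
for t = 2:T
    m = bsxfun(@plus, alpha, logA);      % m(i,j) = alpha_i(t-1) + log a_ij
    alpha = logsumexp_cols(m)' + logB(:, t);
end
loglik = logsumexp_cols(alpha);          % sum over states at the final frame
end

function s = logsumexp_cols(M)
% Log-sum-exp down each column of M, guarding against underflow.
mx = max(M, [], 1);
s = mx + log(sum(exp(bsxfun(@minus, M, mx)), 1));
end
```

At recognition time we would compute this score for each trained word model (with parameters read out of the HTK definition files) and pick the word whose model gives the largest value.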
What we plan to do is stay with this simple recognizer design and select two or three words to recognize. We will then strengthen our model, which currently recognizes silence, by adding a model that captures other words and general background noise. We plan to implement the forward-backward algorithm by hand in C, based on Matlab and published implementations. This should be fun and will give us an application that recognizes a couple of words directly on the phone. The hope is for it to do this much more robustly than it does currently, and we will experiment with the strengthened background model to help with this. Also, [7] describes a simple technique called 'cepstral high-pass filtering' that has been found to be quite effective in making recognition robust to background noise. Essentially (assuming I have understood it correctly), it just normalizes the average of the features to zero in both the training and test sets. The mismatch between training and test conditions is the major problem with speech recognition in a noisy environment, since the classifier is asked to identify a word it has never seen spoken under such conditions; by normalizing, the differences between the test and training environments can be partly mitigated.
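Assuming we have understood the technique correctly, the normalization itself is very small; a sketch over the feature matrix from the earlier extraction step, applied identically to training and test data, might look like this:

```matlab
% Cepstral mean normalization (our reading of the 'high-pass filtering' idea):
% subtract each coefficient's mean over the utterance so that training and
% test features share a zero average. 'features' is the D x T matrix above.
features_cmn = bsxfun(@minus, features, mean(features, 2));
```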
This is the plan for this part of the project from now until the end.
Our first step towards examining how effective word recognition can be as a feature for activity recognition has been to look at transcriptions of the words heard in recordings made during everyday activities. This is a nice idea that was suggested to us, because it allows us to side-step for the moment the technical difficulties of speech recognition and get quickly to the heart of the problem. We have done a fairly extensive data collection and performed some preliminary analysis.
Description of Experiments
We collected audio with an Apple iPhone while performing everyday activities. We have collected around nine different activity types, with multiple examples of each. In total we have captured around 30 hours of audio data while going about our normal activities. Each activity was manually annotated by the person stating, at the start of a new activity, what was going on; when the audio file was later reviewed, this description was treated as outside the experimental data and served as a source of ground truth. The types of activities we captured were: bank, burrito place, bookstore, gas station, bar, library, gym, pizza place, and coffee shop.
Our method for these transcriptions was to apply a best-effort principle, because perfect transcription would have been too time consuming. We listened to each trace only once, and stopped after 10 minutes of audio for any one activity sample. We would pause the trace if we fell behind in typing the words that were said, but we would not go back, and we only transcribed words that were fairly clear and that we were certain of. In performing these experiments we were quite surprised by just how hard it was to transcribe the audio: many, many words in the traces could not be understood at all.
Preliminary Results and Observations
We have manually transcribed probably around a third of the data we have collected. The results were very useful in shaping our thinking, but not very promising for supporting our hypothesis in an unmodified, out-of-the-box form, which is of course why we do experiments in the first place!
We made the following observations during our experiments and transcriptions:
The result of these limitations is that our word-based activity recognition is quite poor.
We performed the following experiment to demonstrate the effect of these limitations of the data set when combined with off-the-shelf information retrieval (IR) techniques. Using [11], a Matlab module for textual data mining, we built a TF-IDF structure, applying the Porter stemming algorithm [12] and a stop-word list based on a standard dictionary that ships with the TMG project [11]. We then applied k-means clustering to the vectors of term frequencies. The result was very poor, and this is even after setting aside the activities for which no words were left once stop words were removed, and those for which no words were overheard in the trace at all (e.g., some of the examples taken from the gym).
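For reference, a rough Matlab sketch of this clustering step is given below. It does not use TMG's actual interface; the tiny example documents, the stop-word list, and the use of Matlab's kmeans (Statistics Toolbox) are stand-ins for the real pipeline, and stemming is omitted for brevity.

```matlab
% Rough sketch of the TF-IDF + k-means step (the real pipeline uses TMG [11];
% its actual interface, stop-word dictionary, and stemming are not shown here).
docs = {lower(strsplit('large coffee please and two sugars')), ...
        lower(strsplit('two slices of pepperoni to go'))};     % illustrative only
stopwords = {'and', 'of', 'to', 'two', 'please'};              % stand-in list

vocab = setdiff(unique([docs{:}]), stopwords);                 % remove stop words

% Term-document matrix of raw counts (terms x documents)
td = zeros(numel(vocab), numel(docs));
for d = 1:numel(docs)
    for w = 1:numel(vocab)
        td(w, d) = sum(strcmp(docs{d}, vocab{w}));
    end
end

% TF-IDF weighting followed by k-means over the document vectors
idf   = log(numel(docs) ./ max(sum(td > 0, 2), 1));
tfidf = bsxfun(@times, td, idf);
k = 2;                            % illustrative; the table below has seven clusters
labels = kmeans(tfidf', k);       % requires the Statistics Toolbox
```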
The following clusters were created. In this table the rows are the clusters and the columns are the true labels of the activity examples assigned to each cluster; it can be read as a type of confusion matrix.
cluster | bank | burrito place | bookstore | gas station | bar | library | gym | pizza | coffee |
cluster 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
cluster 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
cluster 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
cluster 4 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 1 |
cluster 5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
cluster 6 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 |
cluster 7 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 3 | 4 |
Some of the reasons for this very poor clustering were already mentioned in the list of observations about the data collected. However, it should be made clear that a lot of the error is due to the confusion caused by overheard conversations, which are of course fairly similar regardless of the location and activity. Many of the problems discussed stem from the words captured during an activity not being associated with the activity at all, but with ambient conversations, which spread words unrelated to the activity throughout the data set.
Expected Work over the next 3 weeks
We performed quite a broad data collection of activities and were, frankly, disappointed by even the basic material we found in the audio traces (a lack of words overall, and a high proportion of words that were rare but unrelated to the activity).
However, when listening to the traces we did notice the following promising aspects:
The goal will be to show from the data that this idea has potential and to provide some basic results supporting this. That is all.