We will now describe the approach we expect to take; however, the discussion here is subject to revision as we learn more about the problems and techniques in this area. Initially the audio segments will be manually partitioned into words. We will use Mel-frequency cepstral coefficients (MFCCs) as features; MFCCs are frequency-domain features that have proven effective for a variety of uses, including speech recognition. Words will be modeled as sequences of audio frames, each frame represented by the MFCC coefficients of a windowed portion of the signal. We will use HMMs to model the temporal state sequence of these frames, training a separate model for each word. A segment of audio containing a word will then be classified as the word whose hidden states have the highest probability of producing the observed data; a minimal sketch of this decision rule appears below. Later, in the iPhone port, we will need to automatically segment the audio stream into chunks at word boundaries. For this we will use coarse, RMS-based silence detection, though this is unlikely to be robust enough in real-world noisy environments; in that case we will adopt a more sophisticated technique (e.g., a voicing-based HMM approach).
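The following Matlab fragment is a minimal, hypothetical sketch of the per-word HMM decision rule, not our actual prototype. It assumes the MFCC frames have already been vector-quantized into a discrete symbol sequence (the Statistics Toolbox function hmmdecode handles discrete emissions only), and that TRANS{w} and EMIS{w} hold the trained transition and emission matrices for word w; all variable names here are illustrative.

    % Classify a word utterance by comparing log-likelihoods across per-word HMMs.
    % symbolSeq  : vector-quantized MFCC frame sequence (symbol indices)
    % TRANS, EMIS: cell arrays of trained HMM parameters, one entry per word
    % words      : cell array holding the vocabulary
    function word = classifyWord(symbolSeq, TRANS, EMIS, words)
        logLik = -inf(1, numel(words));
        for w = 1:numel(words)
            % hmmdecode returns the posterior state probabilities and the
            % log-probability of the whole observation sequence under model w
            [pstates, logLik(w)] = hmmdecode(symbolSeq, TRANS{w}, EMIS{w});
        end
        [junk, best] = max(logLik);  % word whose HMM best explains the observations
        word = words{best};
    end

Training the per-word models themselves (e.g., with the Statistics Toolbox hmmtrain function) is a separate step not shown here.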
While developing this simple Matlab-based recognizer, we will also collect data from an iPhone while a person performs a range of different activities. Only a limited amount of data will be collected, since the words audible in the collected audio stream must be labeled manually. The expected activities include, for example: working out, working in the lab, and going to the coffee shop. Classifying words heard in a dirty audio stream will be significantly more difficult. We will not try to capture all words, only those with the following characteristics: words that are loud enough; words without overlapping loud sounds (or other words) at the time they are spoken; and words that are captured clearly enough. Even though these may account for only a small fraction of the words spoken (e.g., 20%), we hope this will still be sufficient, given that we only want to associate words with activities rather than support conventional applications such as dictation. We expect to constrain the vocabulary of words used in order to improve accuracy; this vocabulary will be shaped by how useful each word is for discriminating between activities.
A main goal of the project will be to have our classifier run directly on an iPhone. We will base our iPhone implementation on our Matlab prototype. However, we may need to modify this implementation to cope with dirty audio (given that our initial prototype will be based on clean, lab-sourced audio data) and with any computational or energy limitations. With some modifications the iPhone can run code written in C, so the process we have running in Matlab will need to be ported to C suitable for the phone. For this port we will use a combination of original code and ported libraries originally written in C for other platforms, relying on these libraries where required to keep the project manageable within the time allowed. When this is done, the use of existing libraries will be documented, along with the modifications that were necessary for each library to work on the phone.
Our final outputs from this stage of the project will be: (i) an HMM-based Matlab speech recognizer, and (ii) a modified version of this prototype that runs on an iPhone. We will test our iPhone-based prototype against the dirty audio data set to see how well it performs, and will run additional experiments for sanity checking and to validate correctness.
The approach in this stage will be to examine the dirty audio data collected while actual activities were being performed, using the manually extracted words overheard in this audio trace. These words would need to be extracted regardless of the needs of this stage, since they provide the ground truth used in developing the speech recognizer; here they serve a different purpose. They will be used as the actual input data, as if they had been produced by a speech recognizer running on the phone. This will allow us to examine our primary conjecture (the benefit of these spoken words) without the confounding factor of the quality of the speech recognition process. This is important because we do not expect the quality of our speech recognizer to come anywhere near that of state-of-the-art methods, given the difficulty of the problem and the decades of work that already exist in the field; this gap should not artificially skew the evaluation of the idea that motivates our speech recognition work.
In particular, we will examine the following specific hypothesis:
Can human activity be inferred effectively using the words commonly spoken during the activity?
We will ignore other aspects of activity recognition (for example, those mentioned in the introduction).
The approach we will take is to see whether an information retrieval technique can be applied in this scenario. We intend to transform this problem into an information retrieval one in the following way: we will treat the words overheard during an activity as a 'document', and the subject of the document will be the activity. In conventional information retrieval, a document about 'cleaning the car' might be identified within a corpus based on the frequency of words such as 'car', 'clean', and 'water'. In our case, we expect that activities such as getting coffee will be associated with words such as 'coffee', 'latte', 'milk', 'sugar', and so on.
We plan to test a vector space approach, a simple and common technique in information retrieval, for mapping observed 'keywords' that are overheard to activities. We will filter out the words that are common across all activities (i.e., the stopwords), leaving only those typically found with a particular type of activity (i.e., the keywords associated with that activity). Each example of an activity will then be represented in a vector space based on the remaining words heard during the activity: the dimensions of the vector space will be the unique keywords observed across all activity types, and the magnitude along each dimension will be the frequency of that word during the activity. We can then map all the activity examples into this vector space and recognize the class of activities by clustering (e.g., k-means) the examples. We can assume that some activity examples serve as ground truth whose underlying class is known; a minimal sketch of this pipeline follows.
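As a concrete illustration, the following Matlab sketch builds term-frequency vectors for a handful of toy activity examples and clusters them with the Statistics Toolbox kmeans function. The word lists, stopword list, and number of clusters are made-up placeholders, not data from our collection.

    % Toy activity examples: each cell holds the words overheard during one activity
    activities = { {'coffee','latte','milk','please','sugar'}, ...
                   {'weights','reps','gym','please'}, ...
                   {'coffee','sugar','milk','cup'} };
    stopwords  = {'please','the','a'};   % words common across all activities

    % Keyword vocabulary: unique words minus the stopwords
    allWords = [activities{:}];
    vocab = setdiff(unique(allWords), stopwords);

    % Term-frequency matrix: one row per activity example, one column per keyword
    tf = zeros(numel(activities), numel(vocab));
    for i = 1:numel(activities)
        for j = 1:numel(vocab)
            tf(i, j) = sum(strcmp(activities{i}, vocab{j}));
        end
    end

    % Cluster the activity examples in the keyword vector space
    k = 2;                       % assumed number of activity classes
    idx = kmeans(tf, k);         % idx(i) is the cluster label of example i

Given ground-truth labels for some examples, each cluster can then be mapped to the activity class most common among its labeled members.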
More complex information retrieval techniques are certainly possible, but we will not explore these unless time allows.
21st April 2009 - 12th May 2009 - [3 weeks]
Week 1 Read widely about the different techniques for speech recognition. Talk to people in the department who know about audio analysis. Adapt existing code so that sound can be sampled and activities labeled easily from an application running on the iPhone. Use this adapted code to start collecting 'dirty' audio data during real activities, and also collect clean audio in the lab setting. Start developing the Matlab-based prototype that recognizes certain words.
Week 2 Continue developing the Matlab prototype and continue collecting 'dirty' audio data; data collection should finish this week. Once it is finished, the dirty audio will be manually labeled to identify the words that can be heard.
Week 3 Perform the stage 2 analysis on the manually extracted words. Complete the Matlab prototype and evaluate its performance.
12th May 2009 - 2nd June 2009 - [3 weeks]
Week 4 At the milestone stage, present the results of the Matlab-based speech recognizer and the results of performing activity recognition as an information retrieval problem (stage 2). Use this week to plan how to port that code to the iPhone: how to approach writing the code (adapting libraries for extracting features?) and the classification algorithms (adapting HMM libraries?). Begin to develop the phone code.
Week 5 Continue developing the code.
Week 6 Finish the code on the phone. Evaluate it against the audio collected earlier in the project. Write the final report.