We will now describe the approach we expect to take; however, the discussion here is subject to revision as we learn more about the problems and techniques in this area. Initially the audio segments will be manually partitioned into words. We will use Mel-frequency cepstral coefficients (MFCCs) as features; MFCCs are frequency-domain features that have proven effective for a variety of uses, including speech recognition. Words will be modeled as sequences of audio frames, each frame represented by the MFCC coefficients of a windowed portion of the signal. We will use HMMs to model the temporal state sequence of these frames, training a separate model for each word. A segment of audio containing a word will then be classified as the word whose hidden states have the highest probability of producing the observed data; a minimal sketch of this decision rule appears below. Later, in the iPhone port, we will need to automatically segment the audio stream into chunks at word boundaries. For this we will use coarse, RMS-based silence detection, though this is unlikely to be robust enough in real-world noisy environments; in that case we will adopt a more sophisticated technique (e.g., a voicing-based HMM approach).
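The following Matlab fragment is a minimal, hypothetical sketch of the per-word HMM decision rule, not our actual prototype. It assumes the MFCC frames have already been vector-quantized into a discrete symbol sequence (the Statistics Toolbox function hmmdecode handles discrete emissions only), and that TRANS{w} and EMIS{w} hold the trained transition and emission matrices for word w; all variable names here are illustrative.

    % Classify a word utterance by comparing log-likelihoods across per-word HMMs.
    % symbolSeq  : vector-quantized MFCC frame sequence (symbol indices)
    % TRANS, EMIS: cell arrays of trained HMM parameters, one entry per word
    % words      : cell array holding the vocabulary
    function word = classifyWord(symbolSeq, TRANS, EMIS, words)
        logLik = -inf(1, numel(words));
        for w = 1:numel(words)
            % hmmdecode returns the posterior state probabilities and the
            % log-probability of the whole observation sequence under model w
            [pstates, logLik(w)] = hmmdecode(symbolSeq, TRANS{w}, EMIS{w});
        end
        [junk, best] = max(logLik);  % word whose HMM best explains the observations
        word = words{best};
    end

Training the per-word models themselves (e.g., with the Statistics Toolbox hmmtrain function) is a separate step not shown here.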
While developing this simple Matlab-based recognizer, we will also collect data from an iPhone while a person performs a range of different activities. Only a limited amount of data will be collected, since the words audible in the collected audio stream must be labeled manually. The expected activities include, for example: working out, working in the lab, and going to the coffee shop. Classifying words heard in a dirty audio stream will be significantly more difficult. We will not try to capture all words, only those with the following characteristics: words that are loud enough; words without overlapping loud sounds (or other words) at the time they are spoken; and words that are captured clearly enough. Even though these may account for only a small fraction of the words spoken (e.g., 20%), we hope this will still be sufficient, given that we only want to associate words with activities rather than support conventional applications such as dictation. We expect to constrain the vocabulary of words used in order to improve accuracy; this vocabulary will be shaped by how useful each word is for discriminating between activities.
A main goal of the project will be to have our classifier run directly on an iPhone. We will base our iPhone implementation on our Matlab prototype. However, we may need to modify this implementation to cope with dirty audio (given that our initial prototype will be based on clean, lab-sourced audio data) and with any computational or energy limitations. With some modifications the iPhone can run code written in C, so the process we have running in Matlab will need to be ported to C suitable for the phone. For this port we will use a combination of original code and ported libraries originally written in C for other platforms, relying on these libraries where required to keep the project manageable within the time allowed. When this is done, the use of existing libraries will be documented, along with the modifications that were necessary for each library to work on the phone.
Our final outputs from this stage of the project will be: (i) an HMM-based Matlab speech recognizer, and (ii) a modified version of this prototype that runs on an iPhone. We will test our iPhone-based prototype against the dirty audio data set to see how well it performs, and will run additional experiments for sanity checking and to validate correctness.
The approach in this stage will be to examine the dirty audio data collected while actual activities were being performed, using the manually extracted words overheard in this audio trace. These words would need to be extracted regardless of the needs of this stage, since they provide the ground truth used in developing the speech recognizer; here they serve a different purpose. They will be used as the actual input data, as if they had been produced by a speech recognizer running on the phone. This will allow us to examine our primary conjecture (the benefit of these spoken words) without the confounding factor of the quality of the speech recognition process. This is important because we do not expect the quality of our speech recognizer to come anywhere near that of state-of-the-art methods, given the difficulty of the problem and the decades of work that already exist in the field; this gap should not artificially skew the evaluation of the idea that motivates our speech recognition work.
In particular, we will examine the following specific hypothesis:
Can human activity be inferred effectively using the words commonly spoken during the activity?
We will ignore other aspects of activity recognition (for example, those mentioned in the introduction).
The approach we will take is to see whether an information retrieval technique can be applied in this scenario. We intend to transform this problem into an information retrieval one in the following way: we will treat the words overheard during an activity as a 'document', and the subject of the document will be the activity. In conventional information retrieval, a document about 'cleaning the car' might be identified within a corpus based on the frequency of words such as 'car', 'clean', and 'water'. In our case, we expect that activities such as getting coffee will be associated with words such as 'coffee', 'latte', 'milk', 'sugar', and so on.
We plan to test a vector space approach, a simple and common technique in information retrieval, for mapping observed 'keywords' that are overheard to activities. We will filter out the words that are common across all activities (i.e., the stopwords), leaving only those typically found with a particular type of activity (i.e., the keywords associated with that activity). Each example of an activity will then be represented in a vector space based on the remaining words heard during the activity: the dimensions of the vector space will be the unique keywords observed across all activity types, and the magnitude along each dimension will be the frequency of that word during the activity. We can then map all the activity examples into this vector space and recognize the class of activities by clustering (e.g., k-means) the examples. We can assume that some activity examples serve as ground truth whose underlying class is known; a minimal sketch of this pipeline follows.
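As a concrete illustration, the following Matlab sketch builds term-frequency vectors for a handful of toy activity examples and clusters them with the Statistics Toolbox kmeans function. The word lists, stopword list, and number of clusters are made-up placeholders, not data from our collection.

    % Toy activity examples: each cell holds the words overheard during one activity
    activities = { {'coffee','latte','milk','please','sugar'}, ...
                   {'weights','reps','gym','please'}, ...
                   {'coffee','sugar','milk','cup'} };
    stopwords  = {'please','the','a'};   % words common across all activities

    % Keyword vocabulary: unique words minus the stopwords
    allWords = [activities{:}];
    vocab = setdiff(unique(allWords), stopwords);

    % Term-frequency matrix: one row per activity example, one column per keyword
    tf = zeros(numel(activities), numel(vocab));
    for i = 1:numel(activities)
        for j = 1:numel(vocab)
            tf(i, j) = sum(strcmp(activities{i}, vocab{j}));
        end
    end

    % Cluster the activity examples in the keyword vector space
    k = 2;                       % assumed number of activity classes
    idx = kmeans(tf, k);         % idx(i) is the cluster label of example i

Given ground-truth labels for some examples, each cluster can then be mapped to the activity class most common among its labeled members.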
More complex information retrieval techniques are certainly possible, but we will not explore these unless time allows.
21st April 2009 - 12th May 2009 - [3 weeks]
Week 1 Read widely about the different techniques for speech recognition. Talk to people in the department who know about audio analysis. Adapt existing code so that sound can be sampled and activities labeled easily from an application running on the iPhone. Use this adapted code to start collecting 'dirty' audio data during real activities, and also collect clean audio in the lab setting. Start developing the Matlab-based prototype that recognizes certain words.
Week 2 Continue developing the Matlab prototype and continue collecting 'dirty' audio data; data collection should finish this week. Once it is finished, the dirty audio will be manually labeled to identify the words that can be heard.
Week 3 Perform the stage 2 analysis on the manually extracted words. Complete the Matlab prototype and evaluate its performance.
12th May 2009 - 2nd June 2009 - [3 weeks]
Week 4 At the milestone stage, present the results of the Matlab-based speech recognizer and the results of performing activity recognition as an information retrieval problem (stage 2). Use this week to plan how to port that code to the iPhone: how to approach writing the code (adapting libraries for extracting features?) and the classification algorithms (adapting HMM libraries?). Begin to develop the phone code.
Week 5 Continue developing the code.
Week 6 Finish the code on the phone. Evaluate it against the audio collected earlier in the project. Write the final report.