Classification of Gene Expression Data through Distance Metric Learning: Milestone

Feb 19th, 2013

Craig O. Mackenzie, Qian Yang

Introduction

Classification of gene expression data shows great promise, as it may one day be used routinely to predict health outcomes in clinical settings. The expression level of a gene is determined by the abundance of its messenger RNA (mRNA), the gene product that serves as the template for making proteins. Proteins are the engines and building blocks of life, and changes in protein levels may indicate disease. Gene expression data can be thought of as a matrix whose rows are the patients (samples) and whose columns are the genes (features).


Goal

There are many challenges associated with the classification of gene expression data. One of them is that there are relatively few samples but a very large number of features (usually in the thousands). Because of the high-dimensional nature of this data, distance-based classification methods can suffer when traditional distance metrics such as the Euclidean distance are used. Our goal, as described in our proposal, was to use a distance metric learning method to classify gene expression data. We wanted to test this method on a number of gene expression datasets and compare it against the Euclidean distance.


Accomplishments

A number of metric learning methods have been developed for kNN classification. We decided to use the method called Large Margin Component Analysis (LMCA) (Torresani & Lee, 2006). We feel this method is well suited to expression data, since it not only learns a Mahalanobis distance metric but also reduces the dimensionality of the feature space. Rather than learning a distance metric directly, the method learns a linear transformation of the data such that the Euclidean distance in the transformed space equals the learned distance in the original space: $\lVert L x_i - L x_j \rVert^2 = (x_i - x_j)^\top L^\top L\,(x_i - x_j)$, i.e., a Mahalanobis distance with $M = L^\top L$. Intuitively, the space should be transformed so that samples with the same class label end up near each other and samples with different labels end up farther apart. The goal of the algorithm is therefore to find a matrix L that transforms the original data X in this way. Note that L is a d × D matrix, where D is the dimension of the data and d < D. Equation 1 gives the objective that L must minimize.

$$\varepsilon(L) = \sum_{ij} N_{ij}\,\lVert L(x_i - x_j)\rVert^2 \;+\; c \sum_{ijl} N_{ij}\,(1 - Y_{il})\; h\!\big(\lVert L(x_i - x_j)\rVert^2 - \lVert L(x_i - x_l)\rVert^2 + 1\big) \qquad \text{(Equation 1)}$$

Y is the matrix whose ij-th entry is 1 if samples i and j share the same class label (and 0 otherwise). N is the matrix whose ij-th entry is 1 if sample j is a k-nearest neighbor of sample i and shares its class label. c is a constant and h is the hinge function, $h(s) = \max(s, 0)$. The first term in this equation encourages small distances between points sharing the same class label, and the second term encourages large distances between points with different labels. There is no closed-form solution for minimizing $\varepsilon(L)$, so we use a gradient descent method with the update given in Equation 2.

$$L_{t+1} = L_t - \alpha\, \frac{\partial\, \varepsilon(L)}{\partial L}\bigg|_{L = L_t} \qquad \text{(Equation 2)}$$

The step size, α, is adjusted over the course of the algorithm, and the gradient is given in Equation 3.

$$\frac{\partial\, \varepsilon(L)}{\partial L} = 2L \sum_{ij} N_{ij}\,(x_i - x_j)(x_i - x_j)^\top \;+\; 2cL \sum_{ijl} N_{ij}\,(1 - Y_{il})\; h'\!\big(\lVert L(x_i - x_j)\rVert^2 - \lVert L(x_i - x_l)\rVert^2 + 1\big)\,\big[(x_i - x_j)(x_i - x_j)^\top - (x_i - x_l)(x_i - x_l)^\top\big] \qquad \text{(Equation 3)}$$
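For concreteness, the following is a minimal NumPy sketch of Equations 1 and 3, assuming X holds the samples in its rows and N and Y are the 0/1 matrices defined above. The function name and the dense triple loop are illustrative only, not our full implementation.

    import numpy as np

    def hinge(s):
        # h(s) = max(s, 0); its derivative is taken as 1 where s > 0, else 0
        return np.maximum(s, 0.0)

    def lmca_objective_and_gradient(L, X, N, Y, c):
        # L: (d, D) transformation; X: (n, D) samples in rows;
        # N, Y: the n x n 0/1 matrices defined above; c: trade-off constant.
        n, D = X.shape
        Z = X @ L.T                                          # samples in the d-dim space
        sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # sq[i, j] = ||L(x_i - x_j)||^2

        pull = (N * sq).sum()                                # first term of Equation 1
        margin = sq[:, :, None] - sq[:, None, :] + 1.0       # margin[i, j, l]
        mask = N[:, :, None] * (1.0 - Y)[:, None, :]         # N_ij * (1 - Y_il)
        obj = pull + c * (mask * hinge(margin)).sum()

        # Equation 3: accumulate outer products, then left-multiply by 2L
        G = np.zeros((D, D))
        for i in range(n):
            for j in np.flatnonzero(N[i]):
                d_ij = X[i] - X[j]
                C_ij = np.outer(d_ij, d_ij)
                G += C_ij
                for l in np.flatnonzero((1.0 - Y[i]) * (margin[i, j] > 0)):
                    d_il = X[i] - X[l]
                    G += c * (C_ij - np.outer(d_il, d_il))
        return obj, 2.0 * (L @ G)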

We have fully implemented the LMCA algorithm in Python using gradient descent and incorporated it into the kNN algorithm.
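A hypothetical sketch of how the descent loop and the kNN hookup might fit together appears below; the adaptive step-size rule shown (grow α on a successful step, shrink it otherwise) is one common choice, not necessarily the exact rule our code uses.

    def fit_lmca(X, N, Y, c, d, alpha=1e-4, n_iter=200):
        # Start from a small random d x D matrix and take plain gradient steps.
        rng = np.random.default_rng(0)
        L = 0.01 * rng.standard_normal((d, X.shape[1]))
        obj, grad = lmca_objective_and_gradient(L, X, N, Y, c)
        for _ in range(n_iter):
            L_new = L - alpha * grad
            obj_new, grad_new = lmca_objective_and_gradient(L_new, X, N, Y, c)
            if obj_new < obj:
                L, obj, grad = L_new, obj_new, grad_new
                alpha *= 1.1          # step succeeded: grow the step size
            else:
                alpha *= 0.5          # step overshot: shrink and retry
        return L

    # kNN then runs unchanged on the transformed data, e.g.
    # Z_train, Z_test = X_train @ L.T, X_test @ L.T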


Results

We do not yet have results for the LMCA-kNN algorithm, since even our smallest datasets have not finished running when given days to run. The method is quite computationally intensive, as can be seen from the triple summation in Equations 1 and 3. The gradient descent and cross-validation procedures only add to the computational burden.

We have also completed the code for the basic kNN algorithm in both MATLAB and Python. In the algorithm, when multiple classes receive the same largest vote among the k nearest neighbors, we break the tie by picking one of those classes at random, as sketched below.
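The sketch below shows the idea in NumPy form (names are illustrative; our MATLAB version differs only in syntax):

    import numpy as np

    def knn_predict(X_train, y_train, x, k, rng=np.random.default_rng()):
        # Plain kNN with the random tie-break described above.
        dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training sample
        labels = y_train[np.argsort(dist)[:k]]       # labels of the k nearest neighbors
        classes, votes = np.unique(labels, return_counts=True)
        winners = classes[votes == votes.max()]      # every class tied for the largest vote
        return rng.choice(winners)                   # pick one of the tied classes at random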

The first dataset we used to test the basic kNN algorithm is the colon cancer tumor dataset from (Alon et al., 1999). It contains gene expression data for 62 patients and 2000 genes, with two class labels. Figure 1 plots the leave-one-out cross-validation error for this dataset, computed in MATLAB.

Another dataset we used to test the basic kNN algorithm is the Scleroderma dataset from (Sargent, Milano, Connolly, & Whitfield). It contains gene expression data for 75 patients and 995 genes, with four class labels. Figure 2 plots the leave-one-out cross-validation error for this dataset, computed in MATLAB.
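The error curves in Figures 1 and 2 come from leave-one-out cross-validation. A minimal sketch of that loop, reusing the knn_predict sketch above (the range of k shown is illustrative):

    def loocv_error(X, y, k):
        # Leave-one-out error for a given k: hold out each sample in turn.
        n = len(y)
        wrong = 0
        for i in range(n):
            mask = np.arange(n) != i
            wrong += knn_predict(X[mask], y[mask], X[i], k) != y[i]
        return wrong / n

    # error curve over k, as plotted in the figures:
    # errors = [loocv_error(X, y, k) for k in range(1, 16)]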

Figure 1. Error rate of the colon cancer tumor dataset by kNN


Figure 2. Error rate of the Scleroderma dataset by kNN


Future Work

Our most pressing goal is to speed up the algorithm so that we can actually obtain results on real data with the LMCA-kNN method. We see three directions: making algorithmic changes, switching programming languages, or using parallel processing. We would prefer an algorithmic solution, but we may have to combine approaches to bring the running time down to a practical level; for example, the gradient's outer sum parallelizes naturally, as sketched below.
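To illustrate the parallel-processing direction, one possibility (a sketch assuming the joblib library, which we have not committed to) is to split the gradient's outer sum over samples across workers, since each sample's contribution to Equation 3 is independent:

    import numpy as np
    from joblib import Parallel, delayed    # assumption: joblib as the parallel backend

    def grad_for_sample(i, X, Z, N, Y, c):
        # D x D gradient contribution of sample i; the outer sum over i in
        # Equation 3 is embarrassingly parallel, so each worker takes a slice.
        sq = ((Z - Z[i]) ** 2).sum(axis=1)          # ||L(x_i - x_a)||^2 for all a
        G = np.zeros((X.shape[1], X.shape[1]))
        for j in np.flatnonzero(N[i]):
            d_ij = X[i] - X[j]
            C_ij = np.outer(d_ij, d_ij)
            G += C_ij
            for l in np.flatnonzero((1.0 - Y[i]) * (sq[j] - sq + 1.0 > 0)):
                d_il = X[i] - X[l]
                G += c * (C_ij - np.outer(d_il, d_il))
        return G

    # Z = X @ L.T
    # G = sum(Parallel(n_jobs=-1)(delayed(grad_for_sample)(i, X, Z, N, Y, c)
    #                             for i in range(len(X))))
    # grad = 2.0 * (L @ G)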

We will also develop more tie-breaking methods, such as taking the class of the single nearest neighbor amongst the k closest. As can be seen in Figure 1, there is a spike in the error rate at k = 4, which is likely due to ties being broken randomly. Another goal is to use a weighted voting scheme in kNN; a sketch of one such scheme appears below. Lastly, if time permits, we would like to address the problem of missing data, which occurs often in expression data.
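As a sketch of the weighted voting idea (one possible scheme, not a final design; the inverse-distance weights are an assumption):

    def knn_predict_weighted(X_train, y_train, x, k):
        # Distance-weighted voting: closer neighbors count for more, which also
        # makes exact vote ties far less likely than with unweighted counts.
        dist = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dist)[:k]
        votes = {}
        for idx in nearest:
            w = 1.0 / (dist[idx] + 1e-12)            # guard against zero distance
            votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + w
        return max(votes, key=votes.get)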

Works Cited

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proceedings of the National Academy of Sciences of the United States of America, 96(12), 6745-6750.

Sargent, J., Milano, A., Connolly, M., & Whitfield, M. (n.d.). Scleroderma gene expression and pathway signatures.

Torresani, L., & Lee, K.-c. (2006). Large Margin Component Analysis. Advances in Neural Information Processing Systems 19.