Classification of Gene Expression Data through Distance Metric Learning: Milestone
Feb 19th, 2013
Craig O. Mackenzie, Qian Yang
Introduction
Classification of gene expression data shows great promise, as it may one day be used routinely to predict health outcomes in clinical settings. The expression level of a gene is determined by the abundance of its messenger RNA (mRNA), the gene product that serves as the template for making proteins. Proteins are the engines and building blocks of life, and changes in protein levels may indicate disease. Gene expression data can be thought of as a matrix whose rows are the patients (samples) and whose columns are the genes (features).
Goal
There are many challenges associated with the classification of gene expression data. One is that there are relatively few samples but a very large number of features (usually in the thousands). Because of the high-dimensional nature of this data, distance-based classification methods can suffer when traditional distance metrics such as the Euclidean distance are used. Our goal, as described in our proposal, was to use a distance metric learning method to classify gene expression data. We wanted to test this method on a number of gene expression datasets and compare it against the Euclidean distance.
Accomplishments
There are a number of different metric learning methods that have been developed for kNN classification. We decided to use the method called Large Margin Component Analysis (LMCA) (Torresani & Lee, 2006), which learns a rectangular transformation matrix L by minimizing the objective

\epsilon(L) = \sum_{i,j} N_{ij} \|L(x_i - x_j)\|^2 + c \sum_{i,j,l} N_{ij} (1 - Y_{il}) \, h\!\left(1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2\right) (Equation 1)

Y is a matrix whose ijth entry is 1 if samples i and j share the same class label. N is the matrix whose ijth entry is 1 if sample j is a k-nearest neighbor of sample i and shares its class label. c is a positive constant and h(z) = \max(z, 0) is the hinge function. The first term in this equation encourages small distances between points sharing the same class label, and the second term encourages large distances between points with different labels. There is no closed-form solution for minimizing \epsilon(L). Thus we minimize it by gradient descent, using the update

L \leftarrow L - \alpha \frac{\partial \epsilon(L)}{\partial L} (Equation 2)

The step size \alpha is adjusted as the algorithm runs, and the gradient is

\frac{\partial \epsilon(L)}{\partial L} = 2L \sum_{i,j} N_{ij} (x_i - x_j)(x_i - x_j)^T + 2cL \sum_{i,j,l} N_{ij} (1 - Y_{il}) \, h'\!\left(1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2\right) \left[(x_i - x_j)(x_i - x_j)^T - (x_i - x_l)(x_i - x_l)^T\right] (Equation 3)

where h' is the (sub)derivative of the hinge, equal to 1 where its argument is positive and 0 otherwise.
We have fully implemented the LMCA algorithm in Python using gradient descent and incorporated it into the kNN algorithm.
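To make the structure concrete, here is a minimal NumPy sketch of the pieces described above. Function and variable names are ours, not those of our actual code, and the sketch uses a fixed step size where our implementation adapts α:

```python
import numpy as np

def hinge(z):
    # h(z) = max(z, 0), the hinge function in Equation 1
    return np.maximum(z, 0.0)

def pairwise_sq_dists(Z):
    # squared Euclidean distances between all rows of Z
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(-1)

def lmca_objective(L, X, N, Y, c):
    """Equation 1: pull term over target neighbors N plus c times the
    hinge-penalized push term over differently labeled points (1 - Y)."""
    D = pairwise_sq_dists(X @ L.T)                 # distances after projecting by L
    margins = 1.0 + D[:, :, None] - D[:, None, :]  # indexed (i, j, l)
    push = (N[:, :, None] * (1.0 - Y)[:, None, :] * hinge(margins)).sum()
    return (N * D).sum() + c * push

def lmca_gradient(L, X, N, Y, c):
    """Equation 3: (sub)gradient of the objective with respect to L."""
    D = pairwise_sq_dists(X @ L.T)
    margins = 1.0 + D[:, :, None] - D[:, None, :]
    active = N[:, :, None] * (1.0 - Y)[:, None, :] * (margins > 0)
    # net weight on each pair's squared distance ||L(x_i - x_j)||^2
    M = N + c * active.sum(axis=2) - c * active.sum(axis=1)
    # sum_ij M_ij (x_i - x_j)(x_i - x_j)^T, computed via a graph-Laplacian identity
    lap = np.diag(M.sum(1)) + np.diag(M.sum(0)) - M - M.T
    return 2.0 * L @ (X.T @ lap @ X)

def fit_lmca(X, y, k=3, d=2, c=1.0, alpha=0.005, iters=100, seed=0):
    """Minimize Equation 1 with the update of Equation 2 (fixed alpha here)."""
    n = len(y)
    Y = (y[:, None] == y[None, :]).astype(float)
    D0 = pairwise_sq_dists(X)
    N = np.zeros((n, n))
    for i in range(n):                             # k same-class target neighbors
        same = np.flatnonzero((Y[i] > 0) & (np.arange(n) != i))
        N[i, same[np.argsort(D0[i, same])[:k]]] = 1.0
    rng = np.random.default_rng(seed)
    L = rng.normal(scale=0.1, size=(d, X.shape[1]))
    history = []
    for _ in range(iters):
        history.append(lmca_objective(L, X, N, Y, c))
        L = L - alpha * lmca_gradient(L, X, N, Y, c)
    return L, history
```

The triple loop over (i, j, l) is materialized here as an n × n × n array, which makes the cubic cost of each gradient step explicit.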
Results
We do not have any results for the LMCA-kNN algorithm yet: even our smallest datasets have not finished running after several days. The method is quite computationally intensive, as can be seen from the triple summation in Equations 1 and 3, and the gradient descent and cross-validation procedures only add to the computational burden.
We have also completed the code for the basic kNN algorithm in both MATLAB and Python. When multiple classes receive the same largest vote among the k nearest neighbors, we break the tie by picking one of the tied classes at random.
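A compact sketch of this tie-breaking rule in Python (NumPy-based; names are illustrative, not our actual code):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k, rng=None):
    """Majority vote among the k nearest training samples under Euclidean
    distance; ties among top-voted classes are broken uniformly at random."""
    if rng is None:
        rng = np.random.default_rng()
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    neighbor_labels = y_train[np.argsort(dists)[:k]]
    classes, votes = np.unique(neighbor_labels, return_counts=True)
    tied = classes[votes == votes.max()]   # every class with the top vote count
    return rng.choice(tied)                # random tie-break among them
```

When a single class has the most votes, `tied` has one element and the choice is deterministic; randomness only enters on genuine ties.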
The first dataset that we used to test the basic kNN algorithm is the colon cancer tumor dataset from Alon et al. (1999). The second is the Scleroderma dataset from Sargent, Milano, Connolly, and Whitfield (n.d.).
Figure 1. Error rate of the colon cancer tumor dataset by kNN
Figure 2. Error rate of the Scleroderma dataset by kNN
Future Work
Our most pressing goal is to find a way of speeding up the algorithm so that we can actually get results for real data using the LMCA-kNN method. We could go in three directions: making algorithmic changes, switching programming languages, or using parallel processing. We would prefer an algorithmic solution, but we may have to combine approaches to bring the running time down to a reasonable level.
We will also develop more tie-breaking methods, such as choosing the class of the single nearest neighbor among the k closest. As can be seen in Figure 1, there is a spike in the error rate at k = 4, likely due to ties that were broken randomly. Another goal is to use a weighted voting scheme in kNN. Lastly, if we have time, we would like to deal with the problem of missing data that occurs often in expression data.
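As an illustration of the weighted-voting idea, one common scheme weights each neighbor's vote by inverse distance; this particular choice is a sketch of one option, not a design we have settled on:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k, eps=1e-12):
    """kNN where each of the k nearest neighbors votes with weight 1/distance,
    so closer neighbors count more and exact vote ties become unlikely."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    scores = {}
    for i in np.argsort(dists)[:k]:
        label = y_train[i]
        # eps guards against division by zero when x coincides with a sample
        scores[label] = scores.get(label, 0.0) + 1.0 / (dists[i] + eps)
    return max(scores, key=scores.get)     # class with the largest total weight
```

Because real-valued weight totals rarely coincide exactly, this scheme would also reduce how often the random tie-break is needed.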
References
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proceedings of the National Academy of Sciences of the United States of America, 6745-6750.
Sargent, J., Milano, A., Connolly, M., & Whitfield, M. (n.d.). Scleroderma gene expression and pathway signatures.
Torresani, L., & Lee, K.-c. (2006). Large Margin Component Analysis.