Classification of Gene Expression Data through Distance Metric Learning

Jan 23rd, 2013

Craig O. Mackenzie, Qian Yang



One of the great promises of bioinformatics is that we may be able to predict health outcomes based on genetic makeup (and environmental factors). Classification of gene expression data is one way that this can be done. Gene expression levels are essentially measurements of how active genes are. The expression level of a gene is determined by the abundance of its messenger RNA (mRNA), the product of the gene that serves as the template for making proteins. Proteins are the engines and building blocks of life, and changes in protein levels may indicate disease. Certain studies have found a correlation between mRNA and protein levels, though the relationship is fairly complex. Even in cases where mRNA levels do not correlate with protein levels, they could still be an important factor in understanding disease.

The ultimate goal of gene expression classification is to predict whether someone has, or will have, a certain health outcome based on the expression levels of certain genes. These outcomes could be disease vs. control, stages of cancer, or subtypes of a disease. Depending on the classification method used, it may also be possible to home in on the genes that discriminate most between disease states, giving us biological insight into the disease. These problems also pose great computational challenges for those working in bioinformatics and machine learning, since there are usually thousands of genes and relatively few samples. A gene expression dataset can be thought of as a matrix in which the rows represent patients (samples) and the columns represent genes (features).

Gene expression classification is a well-studied field, and many computational methods have been used or developed for it, including support vector machines, regression trees, and k-nearest neighbors, to name a few. In our project we will predict the class labels of samples using a modified k-nearest neighbor (kNN) algorithm.

The k-nearest neighbor algorithm is one of the simplest machine learning algorithms; it classifies samples based on the closest training samples in the feature space. The unlabeled sample is usually given the class label that is most frequent among its k nearest training samples (neighbors). This procedure of deciding the class label by frequency is referred to as “majority voting”. The kNN algorithm also has strong consistency guarantees: as the amount of training data approaches infinity, its error rate is guaranteed to be no more than twice the Bayes error rate (Cover & Hart, 1967; Bremner et al., 2005).
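
For reference, a minimal sketch of this classical algorithm, with Euclidean distance and unweighted majority voting, might look like the following (the function and variable names are ours and purely illustrative):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=5):
        """Classify one sample by majority vote among its k nearest training samples."""
        # Euclidean distance from the new sample to every training sample.
        dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
        # Indices of the k closest training samples.
        nearest = np.argsort(dists)[:k]
        # Most frequent class label among those neighbors.
        return Counter(y_train[nearest]).most_common(1)[0][0]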

We will modify the classical kNN algorithm in two ways and test it on gene expression datasets. The first modification is to the procedure used to assign class labels based on the k nearest neighbors. A drawback of basic “majority voting” is that classes with more examples tend to dominate the prediction for a new sample. This could be a problem in gene expression classification, since the classes rarely have equal sizes. One solution is to adopt a weighting scheme in which the “votes” are weighted by the distance from the test sample to its neighbors (Coomans & Massart, 1982). We will try a few different weighting schemes, including plain “majority voting”.
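
To illustrate, one common distance-weighted scheme gives each neighbor a vote of 1/d, so closer neighbors count more; the sketch below shows this idea and is only one of the weightings we may try, not a final choice:

    def weighted_vote(neighbor_labels, neighbor_dists, eps=1e-12):
        """Weight each neighbor's vote by the inverse of its distance to the test sample."""
        scores = {}
        for label, dist in zip(neighbor_labels, neighbor_dists):
            # eps guards against division by zero when a neighbor coincides with the test sample.
            scores[label] = scores.get(label, 0.0) + 1.0 / (dist + eps)
        # The predicted class is the one with the largest total weighted vote.
        return max(scores, key=scores.get)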

Secondly, we will incorporate distance metric learning into the algorithm. A typical choice of distance metric is the Euclidean distance; however, as the dimension of the data increases, this measure generally does not work well. This is a great concern for gene expression data, where there are typically thousands of features (genes) in the dataset. Xiong and Chen (2006) used distance metric learning for kNN on gene expression data with good results, taking a data-dependent, kernel-based approach. On many datasets, the error rates were much better than kNN with a fixed (Euclidean) distance metric and comparable to methods like SVM. Weinberger, Blitzer, and Saul (2006) developed a method called Large Margin Nearest Neighbor (LMNN), in which they learn a Mahalanobis distance metric for kNN classification and test it on handwritten digits. Essentially, their algorithm learns a distance metric that minimizes the distances to nearest neighbors with matching class labels and maximizes the distances to nearest neighbors with different class labels; they used PCA to reduce the dimension first. Torresani and Lee (2006) improved upon this by combining dimensionality reduction and distance metric learning for kNN classification, and also developed a kernelized version of LMNN called Large Margin Component Analysis. We are not yet sure which of these distance learning methods we will end up using; we may try several, or modify them slightly depending on our needs.
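
To make the metric-learning idea concrete: a Mahalanobis metric is parameterized by a positive semidefinite matrix M, with distance sqrt((x - y)^T M (x - y)); taking M to be the identity recovers the Euclidean distance, and LMNN-style methods learn M from the training labels. A minimal sketch, assuming M has already been learned by whichever method we settle on:

    import numpy as np

    def mahalanobis_distance(x, y, M):
        """Distance under a learned metric: sqrt((x - y)^T M (x - y)).

        M must be positive semidefinite; M = identity gives back the Euclidean distance.
        """
        diff = x - y
        return float(np.sqrt(np.dot(diff, np.dot(M, diff))))

Equivalently, writing M = L^T L, this is just the Euclidean distance after the linear map x -> Lx, so a learned metric can be plugged into an ordinary kNN implementation by transforming the data once up front.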

Finding the best choice of k is another challenging part of the kNN algorithm. The optimum value of k depends on the dataset itself: large values of k result in higher bias, while small values of k lead to higher variance. We will determine the best value of k for each dataset by testing a range of reasonable values for this hyper-parameter. It will be interesting to see whether the best values of k are relatively close to each other across the expression datasets we use; this could serve as a heuristic for other researchers who may want to apply similar methods to gene expression data.
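
One way this selection might work is with leave-one-out error on the training data, as sketched below (reusing the illustrative knn_predict function from above; the candidate values of k are placeholders):

    import numpy as np

    def choose_k(X, y, candidate_ks=(1, 3, 5, 7, 9, 15, 21)):
        """Return the k with the lowest leave-one-out error, along with all error rates."""
        n = len(y)
        errors = {}
        for k in candidate_ks:
            mistakes = 0
            for i in range(n):
                # Hold out sample i and classify it using the remaining samples.
                mask = np.arange(n) != i
                mistakes += (knn_predict(X[mask], y[mask], X[i], k=k) != y[i])
            errors[k] = mistakes / n
        return min(errors, key=errors.get), errors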

Since the end goal of gene expression classification is to predict whether a patient in a clinical setting (not part of the dataset) has a disease, it is important to find a method that can handle some missing features. Gene expression studies of the same disease by different researchers frequently include different genes: there is usually overlap, but the sets of genes measured for the patients are not exactly the same. If we have time, we may address this issue by incorporating methods for dealing with missing data and testing them on two different datasets for the same disease.
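
The simplest option, assuming gene identifiers are available for both studies, is to restrict both datasets to the genes they share before training; the sketch below illustrates this (more sophisticated imputation schemes are also possible):

    def restrict_to_common_genes(X1, genes1, X2, genes2):
        """Keep only the genes (columns) measured in both expression matrices, in a shared order."""
        common = sorted(set(genes1) & set(genes2))
        idx1 = [genes1.index(g) for g in common]
        idx2 = [genes2.index(g) for g in common]
        # X1 and X2 are assumed to be numpy arrays with samples as rows and genes as columns.
        return X1[:, idx1], X2[:, idx2], common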

In this project, a kNN algorithm with the aforementioned modifications will be developed in Python and tested through a cross-validation scheme on gene expression datasets; this is the milestone goal of the project. The datasets will come from the Gene Expression Omnibus (GEO) website (http://www.ncbi.nlm.nih.gov/geo) (Edgar, Domrachev, & Lash, 2002) and will include breast cancer, lymphoma, and colon cancer datasets. These datasets usually contain a set of control samples (patients without cancer) and samples from patients who have cancer or some stage of cancer. We will include at least one dataset with more than two class labels. We will test our methods against kNN using the Euclidean distance and another suitable distance metric (perhaps one based on the Pearson correlation coefficient) by comparing classification errors. We may also compare against SVMs and other classification methods, although we do not plan to write the code for these methods ourselves.
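
For the correlation-based baseline, one natural choice is the dissimilarity 1 - r, where r is the Pearson correlation between two expression profiles; a brief sketch:

    import numpy as np

    def pearson_distance(x, y):
        """Dissimilarity between two expression profiles: one minus their Pearson correlation."""
        r = np.corrcoef(x, y)[0, 1]
        return 1.0 - r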

We feel that our method is novel in the sense that it combines two (or more) major modifications to the kNN algorithm with a rigorous test of the hyper-parameter k on gene expression data. Depending on the running times in our initial tests, we may also employ methods to speed up the kNN algorithm, such as KD trees or approximate kNN.
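
If running time does become an issue, SciPy's KD-tree is one off-the-shelf option we could try, bearing in mind that KD trees degrade in very high dimensions and would therefore mainly help after dimensionality reduction; a sketch:

    from scipy.spatial import cKDTree

    def knn_indices_kdtree(X_train, X_test, k=5):
        """For each test sample, return the indices of its k nearest training samples."""
        tree = cKDTree(X_train)
        _, indices = tree.query(X_test, k=k)
        return indices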

Bibliography

Bremner, D., Demaine, E., Erickson, J., Iacono, J., Langerman, S., Morin, P., et al. (2005). Output-sensitive algorithms for computing nearest-neighbor decision boundaries. Discrete & Computational Geometry, 33(4), 593-604.

Coomans, D., & Massart, D. L. (1982). Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-Nearest neighbour classification by using alternative voting rules. Analytica Chimica Acta, 136, 15-27.

Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27.

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207-210.

Torresani, L., & Lee, K.-c. (2006). Large margin component analysis. Advances in Neural Information Processing Systems.

Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. Advances in Neural Information Processing Systems.

Xiong, H., & Chen, X.-w. (2006). Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics, 7, 299.