Learning Bayesian Networks from Incomplete Data Using the EM Algorithm

 

Problem Statement

We often encounter missing values or hidden variables when learning Bayesian networks (BNs) from medical data. However, most existing state-of-the-art learning algorithms cannot learn structure from incomplete data. Hence, I propose to use the EM algorithm to learn BNs from incomplete data.

 

Methods

There are two major components of Bayesian network learning:

● implement the EM algorithm for parameter learning, i.e., estimating the conditional probability tables

● learn structures using the Bayesian information criterion (BIC)

 

Data

The data I used came from Dartmouth Medical School. The table below shows a portion of the data used for this analysis. The data set consists of 3 variables and 1113 observations: arsenic exposure, XRCC3 DNA genotype, and bladder cancer status. About 20% of the values for the DNA variable are missing. According to a recent study from Dartmouth Medical School, there is evidence of a gene-environment interaction between the XRCC3 variant genotype and high toenail arsenic levels on bladder cancer risk (Andrew 2009a).
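The original table is not reproduced here, but a data set with this shape is easy to sketch. The snippet below builds a stand-in frame with the same dimensions and missingness pattern; the column names and the synthetic values are my own assumptions, not the Dartmouth data.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction: 1113 observations, three binary variables,
# with roughly 20% of the XRCC3 genotype values missing.
rng = np.random.default_rng(0)
n = 1113
data = pd.DataFrame({
    "arsenic": rng.integers(0, 2, n),                 # 1 = high toenail arsenic
    "xrcc3":   rng.integers(0, 2, n).astype(float),   # 1 = variant genotype
    "cancer":  rng.integers(0, 2, n),                 # 1 = bladder cancer case
})
# Knock out ~20% of the genotype column to mimic the missingness pattern.
mask = rng.random(n) < 0.20
data.loc[mask, "xrcc3"] = np.nan

print(data["xrcc3"].isna().mean())  # roughly 0.20
```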

 


 

Brief Introduction to Bayesian Networks

The Bayesian network below is suggested by Dr. Andrew's study and illustrates the interactions among the three variables. In this DAG, arsenic exposure and DNA genotype both influence cancer. The conditional probability tables associated with each node are also given.


A typical BN consists of a directed acyclic graph (DAG) and a set of conditional probability tables. The edges represent causal relationships between variables. For a particular BN, the joint probability of the variables factorizes over the graph:

P(X1, ..., Xn) = Π_i P(Xi | Parents(Xi))

For the three-variable network above, this gives P(Arsenic, DNA, Cancer) = P(Arsenic) P(DNA) P(Cancer | Arsenic, DNA).
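The factorization for this three-variable network can be checked numerically. The CPT entries below are illustrative placeholders, not the report's actual estimates:

```python
# Illustrative CPTs (numbers are made up, not the paper's estimates).
p_arsenic = {1: 0.3, 0: 0.7}                 # P(A)
p_dna     = {1: 0.4, 0: 0.6}                 # P(D)
p_cancer  = {                                # P(C=1 | A, D)
    (0, 0): 0.05, (0, 1): 0.10,
    (1, 0): 0.15, (1, 1): 0.40,
}

def joint(a, d, c):
    """P(A=a, D=d, C=c) = P(A=a) * P(D=d) * P(C=c | A=a, D=d)."""
    pc1 = p_cancer[(a, d)]
    return p_arsenic[a] * p_dna[d] * (pc1 if c == 1 else 1.0 - pc1)

# A valid joint distribution must sum to 1 over all 8 configurations.
total = sum(joint(a, d, c) for a in (0, 1) for d in (0, 1) for c in (0, 1))
print(round(total, 10))  # 1.0
```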

Parameter Learning from Missing Data


In the E-step, the algorithm estimates the missing values from the current parameters: for each observation with a missing value, it assigns fractional counts (expected frequencies) to the possible completions based on the conditional probability tables. After each E-step the completed data set is updated, and new parameters are derived from it in the M-step. The whole process repeats until convergence, i.e., when the log-likelihood of the parameters stops changing. The log-likelihood of the observed data, with the missing values summed out, is:

ℓ(θ) = Σ_i log P(x_i | θ) = Σ_i log Σ_h P(x_i, h | θ)

where h ranges over the possible values of the variables that are missing in observation i.
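The E-step/M-step loop described above can be sketched for this specific three-variable network. This is a minimal reconstruction under my own assumptions (binary variables, missing values only in the DNA column, maximum-likelihood M-step), not the report's exact code:

```python
import numpy as np

def em_missing_dna(rows, n_iter=50, tol=1e-6):
    """EM for the arsenic -> cancer <- DNA network when some DNA values
    are missing (None). Each row is (arsenic, dna, cancer) with values 0/1.
    A sketch, not the report's implementation."""
    pA, pD = 0.5, 0.5                                    # initial guesses
    pC = {(a, d): 0.5 for a in (0, 1) for d in (0, 1)}   # P(C=1 | A, D)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: build a weighted, completed data set.
        completed = []  # entries are (a, d, c, weight)
        for a, d, c in rows:
            if d is not None:
                completed.append((a, d, c, 1.0))
            else:
                # Posterior P(D = dd | A=a, C=c) under current parameters.
                w = []
                for dd in (0, 1):
                    pd_ = pD if dd == 1 else 1 - pD
                    pc_ = pC[(a, dd)] if c == 1 else 1 - pC[(a, dd)]
                    w.append(pd_ * pc_)
                z = w[0] + w[1]
                completed.append((a, 0, c, w[0] / z))
                completed.append((a, 1, c, w[1] / z))
        # M-step: re-estimate parameters from expected counts.
        W = sum(w for *_, w in completed)
        pA = sum(w for a, d, c, w in completed if a == 1) / W
        pD = sum(w for a, d, c, w in completed if d == 1) / W
        for key in pC:
            denom = sum(w for a, d, c, w in completed if (a, d) == key)
            num = sum(w for a, d, c, w in completed if (a, d) == key and c == 1)
            pC[key] = num / denom if denom > 0 else 0.5
        # Convergence check on the observed-data log-likelihood.
        ll = 0.0
        for a, d, c in rows:
            pa_ = pA if a == 1 else 1 - pA
            if d is not None:
                pd_ = pD if d == 1 else 1 - pD
                pc_ = pC[(a, d)] if c == 1 else 1 - pC[(a, d)]
                ll += np.log(pa_ * pd_ * pc_)
            else:  # marginalise over the missing DNA value
                s = 0.0
                for dd in (0, 1):
                    pd_ = pD if dd == 1 else 1 - pD
                    pc_ = pC[(a, dd)] if c == 1 else 1 - pC[(a, dd)]
                    s += pd_ * pc_
                ll += np.log(pa_ * s)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pA, pD, pC
```

Since arsenic is fully observed in this sketch, its marginal is just the empirical frequency; only the DNA marginal and the cancer CPT are affected by the fractional counts.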

Structure Learning

Once the parameters are learned, we can search through the possible structures. According to prior knowledge, DNA and arsenic exposure are independent, hence only the four possible structures (DAGs) listed below need to be considered. The algorithm searches this space and finds the structure that best describes the data set according to a penalized likelihood score, which in this model is the Bayesian information criterion (BIC):

BIC = -2 ln L + k ln m

where L is the maximized value of the likelihood function for the estimated model, k is the number of free parameters, and m is the sample size. The structure with the lowest BIC score is the optimal structure.
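A minimal scoring helper following that definition. The DAG enumeration and parameter counts in the comments are my own bookkeeping for three binary variables with no arsenic-DNA edge, not figures taken from the report:

```python
import numpy as np

def bic_score(log_likelihood, k, m):
    """BIC = -2 ln L + k ln m; lower is better.
    k: number of free parameters, m: sample size."""
    return -2.0 * log_likelihood + k * np.log(m)

# Free-parameter counts for candidate DAGs over three binary variables
# (A = arsenic, D = DNA, C = cancer), assuming no A-D edge:
#   no edges:           k = 1 + 1 + 1 = 3
#   A -> C only:        k = 1 + 1 + 2 = 4
#   D -> C only:        k = 1 + 1 + 2 = 4
#   A -> C <- D:        k = 1 + 1 + 4 = 6
print(bic_score(-700.0, 6, 1113))
```

At equal likelihood, the richer structure pays a larger penalty, so it only wins if the extra parameters buy enough likelihood.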

 

 


 

 

 

Results

The EM algorithm estimates the parameters more accurately as the sample size increases. From the percent-error graph, the percent error falls to about 2.16% when the sample size is 900.

The error rate for the imputed missing values also drops as the sample size increases; when the sample size is 900, the error rate is 0.0812.
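The percent-error formula itself is not reproduced here; a plausible reading is the mean absolute deviation of the estimated CPT entries from the true parameters, relative to the true values. A sketch under that assumption:

```python
def percent_error(estimated, true):
    """Mean absolute percent error of estimated parameters vs. the truth.
    Assumed definition -- the report's exact formula is not shown."""
    return 100.0 * sum(abs(e - t) / t for e, t in zip(estimated, true)) / len(true)

# Hypothetical CPT entries: EM estimates vs. ground-truth parameters.
print(percent_error([0.31, 0.39, 0.41], [0.30, 0.40, 0.40]))
```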

 


 


 

 

The graph below shows BIC scores versus EM iterations. We can see that DAG 4 has the lowest BIC score. Thus DAG 4 is the best model learned by the algorithm, which is consistent with the structure proposed in the introduction.


 

Conclusion

I have accomplished what I proposed in the project proposal: the algorithm satisfactorily estimates parameters using the EM method, and the BN learned by this algorithm is consistent with Dr. Andrew's study. However, the current model can only be applied to data with a small number of variables. Moreover, since we assume there are no hidden variables apart from DNA, arsenic, and cancer, the BN learned by this model might differ with the inclusion of other cancer risk factors such as gender, age, and smoking. Future work would be to develop models that can handle more variables and more complicated BNs.

 

References:

Andrew, A.S., Mason, R.A., Kelsey, K.T., Schned, A.R., Marsit, C.J., Nelson, H.H., Karagas, M.R., DNA repair genotype interacts with arsenic exposure to increase bladder cancer risk. Toxicol. Lett., 187, 10-14, 2009a.

Friedman, N., 1998. The Bayesian structural EM algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98), Cooper, G. F. & Moral, S. (eds). Morgan Kaufmann, 129-138.

Darwiche, Adnan, Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 446 pp., Cambridge, UK, 2009.