Learning of Bayesian Networks Using ML and the EM Algorithm

Deqing Li

Project Overview
In this project, I detect deception by learning Bayesian Networks (BNs) with machine learning techniques. Specifically, human experts are simulated by BNs, and the parameters of these simulators, namely the conditional probability tables (CPTs), are learned from observations generated by the simulators, assuming the structure of the simulator BN is already known. To test the reliability of the learned model, I reason through the model and compare its answers with the experts' answers inferred from the simulator. Here, the answer of the model/simulator refers to the posterior probability distributions inferred from the corresponding BN. To detect deception, the answers from the model are compared with the deceptive answers from the simulator. The architecture of the system is shown in Figure 1.

Fig. 1 Architecture of Deception Detection by Learning Bayesian Networks

Achievements
  • Generating Training Data
          Each entry of the training data has the format "r.v.(i,j) = state(i,j)", where i indexes the sample and j indexes the random variable. The state of each r.v. is generated randomly according to its CPT. To ease the generation process, the states of the root nodes, which are not conditioned on any other nodes, are generated first; based on their states, the states of their child nodes are then generated, and so on down the network. An example of 10 pieces of training data is depicted in Figure 2, and a minimal sampling sketch follows it.

Fig. 2 (left) A BN with four nodes. (right) Ten pieces of training data generated from the BN
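The generation process can be illustrated with a short Python sketch. This is a minimal illustration only: the four-node structure (A -> B, A -> C, (B, C) -> D), the binary states, and the CPT values are assumptions made for the example, not the network or numbers used in the project.

import random

# Hypothetical four-node network A -> B, A -> C, (B, C) -> D with binary
# states; the CPT values are illustrative placeholders.
cpt_A = {(): [0.6, 0.4]}                       # P(A)
cpt_B = {(0,): [0.7, 0.3], (1,): [0.2, 0.8]}   # P(B | A)
cpt_C = {(0,): [0.5, 0.5], (1,): [0.1, 0.9]}   # P(C | A)
cpt_D = {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6],
         (1, 0): [0.3, 0.7], (1, 1): [0.05, 0.95]}  # P(D | B, C)

def sample_node(cpt, parent_states):
    # Draw one state according to the CPT row selected by the parents' states.
    probs = cpt[tuple(parent_states)]
    return random.choices(range(len(probs)), weights=probs)[0]

def sample_once():
    # Roots first, then children in topological order, as described above.
    a = sample_node(cpt_A, [])
    b = sample_node(cpt_B, [a])
    c = sample_node(cpt_C, [a])
    d = sample_node(cpt_D, [b, c])
    return {"A": a, "B": b, "C": c, "D": d}

training_data = [sample_once() for _ in range(10)]   # ten training samples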
  • Implementing Maximum Likelihood in Parameter Estimation
          Given the structure of the network, the optimal CPTs are estimated by maximizing the likelihood of the data. In the case of learning a BN, the likelihood of a data set is the joint probability of the r.v.s taking the states observed in the data. Because the joint probability factorizes over the network,

    log L(θ) = Σ_i log P(x^(i) | θ) = Σ_i Σ_j log P(x_j^(i) | pa(X_j)^(i), θ_j),

the log likelihood decomposes into sub-log-likelihoods, each governed by the CPT of a single r.v., so the problem is simplified to maximizing the likelihood of each r.v. independently. Now suppose we have a root node Y whose probability of taking state 1 is denoted by σ. The likelihood of its data is

    L(σ) = σ^(n1) (1 − σ)^(n0),

where n1 and n0 are the numbers of samples with Y=1 and Y=0. Maximizing this likelihood (setting the derivative of its logarithm to zero) gives its prior probability when Y=1:

    σ̂ = n1 / (n1 + n0).

Similarly, the conditional probability of a node X dependent on Y is obtained from the corresponding ratio of counts:

    P̂(X=x | Y=y) = count(X=x, Y=y) / count(Y=y).
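These counting estimates can be sketched in a few lines of Python, assuming each data record is a dictionary mapping node names to states as in the sampling sketch after Figure 2 (the function names are illustrative):

from collections import Counter

def ml_prior(data, node):
    # sigma_hat: fraction of samples in which the root node takes each state.
    counts = Counter(row[node] for row in data)
    n = len(data)
    return {state: c / n for state, c in counts.items()}

def ml_conditional(data, node, parent):
    # P_hat(X = x | Y = y) = count(X = x, Y = y) / count(Y = y).
    joint = Counter((row[parent], row[node]) for row in data)
    parent_counts = Counter(row[parent] for row in data)
    return {(y, x): c / parent_counts[y] for (y, x), c in joint.items()}

# e.g. ml_prior(training_data, "A") and ml_conditional(training_data, "B", "A")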
  • Smoothing
          When a particular combination of r.v. instantiations never occurs in the data, the CPT estimate runs into the problem of 0/0. An ad hoc way to solve this is to add 1 to the denominator, which turns 0/0 into 0/1. A probably more principled alternative is to smooth the counts in the manner of Laplacian (add-one) smoothing, so that no entry is estimated from zero observations. Both variants are sketched below.
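Sketched for a single CPT entry, the two fixes could look as follows; this reflects my reading of the two rules, and alpha, num_states, and the function names are illustrative:

def estimate_adhoc(joint_count, parent_count):
    # Ad hoc fix: use denominator 1 when count(Y = y) is zero, so 0/0 becomes 0/1.
    return joint_count / max(parent_count, 1)

def estimate_laplace(joint_count, parent_count, num_states, alpha=1.0):
    # Laplacian-style smoothing: alpha pseudo-counts for each of the
    # num_states possible states of X, so no entry is estimated from zero data.
    return (joint_count + alpha) / (parent_count + alpha * num_states)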
  • Designing and Implementing Detection Method
          The intuition behind the detection method is that if the deviation between a reported answer and the modeled answer is too large to be explained by noise, it is flagged as possible deception. Concretely, I first calculate the prediction error, i.e. the difference between the answers from the simulator and the answers from the model; this prediction error serves as an estimate of the noise. Assuming the error is normally distributed, I obtain its standard deviation. If the difference between the deceptive answer and the answer from the model exceeds four standard deviations, the answer is identified as deception. A sketch of this rule follows.
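A minimal sketch of the threshold rule, assuming the answers are scalar posterior probabilities and that the prediction errors on honest queries have been collected in a list (function and variable names are illustrative):

import statistics

def detect_deception(honest_errors, model_answer, reported_answer, k=4.0):
    # honest_errors: differences between the simulator's and the model's
    # answers on non-deceptive queries, used to estimate the noise level.
    sigma = statistics.stdev(honest_errors)
    # Flag the reported answer if it deviates from the model by more than k sigma.
    return abs(reported_answer - model_answer) > k * sigma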

Experiments and Results
The network I use as the simulator is an existing BN called the Alarm Network. It was originally built to monitor patients in intensive care, and it is used in my experiment because of its moderate size: 37 nodes, 105 states, and 46 arcs. In the experiment, 10, 10^2, 10^3, 10^4, and 10^5 training samples are generated from the simulator. For each training-set size, a "model" BN is learned and compared with its simulator BN.
  • Learning Error and Error Rate
          After the so-called "model" BN is learned, I perform probabilistic inference on it and obtain the posterior probability distribution of each variable. The learning error is the difference between the posterior distributions of the simulator and of the model, averaged over all variables. I am more interested in the error in the posterior probabilities than in the parameters (CPTs), because when mimicking someone's mind, his final opinion on an event matters more than how he derives it. In the results I also plot the error rate, defined as (error)/(posterior probability of the simulator); a small sketch of this computation follows Figure 3. Figure 3 shows that the error rate is satisfactorily small and decreases as the amount of training data grows. However, smoothing is at first less effective than the ad hoc method: smoothing imposes a strong prior on the CPTs, especially when the data set is small, so the estimate depends less on the data and performs worse. Smoothing starts to pay off at around 1,000 samples, but even then it is not much more effective than the ad hoc method, presumably because with a large data set the prior has little influence.

Fig. 3 (left) Plot of error rate. (right) Plot of error.
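For concreteness, one plausible way to compute the two quantities is sketched below; the posteriors are assumed to be dictionaries mapping each variable to a distribution over its states, and the exact averaging used in the project may differ:

def learning_error(post_sim, post_model):
    # post_sim / post_model: dicts mapping each variable to its posterior
    # distribution (a dict of state -> probability).
    diffs, rates = [], []
    for var, dist in post_sim.items():
        for state, p_sim in dist.items():
            diff = abs(p_sim - post_model[var][state])
            diffs.append(diff)
            if p_sim > 0:
                rates.append(diff / p_sim)   # error rate normalizes by the simulator's posterior
    return sum(diffs) / len(diffs), sum(rates) / len(rates)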
  • Detection Rate and False Alarm
          The detection rate increases with the size of the data set. At first the detection rate with smoothing is higher than with the ad hoc method, and since the detection rate is higher, the false alarm rate is higher as well. The influence of smoothing versus the ad hoc fix is more significant when the data set is small and becomes negligible at 10,000 samples and beyond. Another observation is that the detection rate stops increasing after 10,000 samples even though the learning error of the BN keeps decreasing.

Fig. 4 (left) Plot of detection rate. (right) Plot of false alarm.

Timeline
The rest of the term will be spent studying and implementing the EM algorithm for learning BNs from incomplete observations, and applying the implemented detection method and the learned BN to detect deception.