COSC 34 Project Proposal

Regression using Markov Chain Absorbing States


Vipul Kakkad

Email: vipul.r.kakkad@dartmouth.edu


Goal of Project and Method

Semi-supervised regression is performed by first modeling the data set as a directed, symmetrically weighted complete graph: there is a vertex for each data point, edges run in both directions between every pair of vertices, and the weight of each edge measures the similarity between the two vertices it connects.
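For illustration, the similarity weights can be computed with a Gaussian (RBF) kernel, the weighting used by Zhu et al. [1]. The sketch below is in Python (rather than the Matlab mentioned later, purely for illustration), and the bandwidth sigma is a free parameter of the sketch.

    import numpy as np

    def similarity_matrix(X, sigma=1.0):
        """RBF similarity weights w_ij = exp(-||x_i - x_j||^2 / sigma^2).

        X     : (n, d) array, one row per data point.
        sigma : kernel bandwidth (a free parameter of the sketch).
        Returns an (n, n) symmetric weight matrix with zero diagonal.
        """
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        W = np.exp(-sq_dists / sigma**2)
        np.fill_diagonal(W, 0.0)  # no self-loop edges
        return W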

A real-valued ‘harmonic function’ on a graph is one whose value at each node equals the weighted average of its values at the node’s neighbors. This property is intuitively appealing for regression because of what we call the ‘Semi-Supervised Smoothness (SSS) Assumption’: the ground-truth function we are trying to model is smooth, so nodes/data points that are close to each other should have similar function values.
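In symbols, writing w_{ij} for the weight of the edge between nodes i and j, the harmonic property at node i reads

    f(i) = \frac{1}{d_i} \sum_j w_{ij} \, f(j), \qquad d_i = \sum_j w_{ij}.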

If we normalize the outgoing edge weights of every node to sum to 1, the graph becomes a Markov chain: each node is a state, and the normalized weight matrix is the transition matrix of the chain. Using the Laplacian matrix, it is possible to find, for any two nodes, a harmonic function that gives the probability that a random walker starting at each node reaches one of the two before the other. If we find this for all the labelled nodes, we can take a linear combination of these functions by setting the values at the training set equal to the given labels. We hence obtain a harmonic solution that minimizes the ‘harmonic energy’ of the graph subject to the given values at the labelled nodes [1].
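A minimal sketch of this standard harmonic solution, using the closed form f_u = (D_uu - W_uu)^{-1} W_ul f_l from Zhu et al. [1]; ordering the labelled nodes first is an assumption of the sketch.

    import numpy as np

    def harmonic_solution(W, f_l):
        """Harmonic interpolation of the labels, per Zhu et al. [1].

        W   : (n, n) symmetric weight matrix, labelled nodes ordered first.
        f_l : (l,) given values at the labelled nodes.
        Returns f_u, the harmonic values at the n - l unlabelled nodes.
        """
        f_l = np.asarray(f_l, dtype=float)
        l = len(f_l)
        D = np.diag(W.sum(axis=1))
        L = D - W                 # combinatorial graph Laplacian
        L_uu = L[l:, l:]          # unlabelled-unlabelled block
        L_ul = L[l:, :l]          # unlabelled-labelled block
        return np.linalg.solve(L_uu, -L_ul @ f_l)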

The problem is that these harmonic functions are biased and often have large flat regions where the function changes very little, so they do not capture the local behaviour of the target function well.

The proposed solution is to treat the labelled nodes as absorbing states of the Markov chain. From these we can obtain, for any starting node, the probability of being absorbed by each absorbing state. We can then minimize E[f(absorbing state reached) - f(starting node)] for each node. The hypothesis is that there exists a closed-form solution that allows us to compute this relatively efficiently. It is also possible that the function obtained this way will retain the harmonic property.
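Computing the absorption probabilities themselves is standard absorbing-Markov-chain algebra via the fundamental matrix (I - Q)^{-1}; the sketch below assumes the same ordering convention as before (labelled, i.e. absorbing, nodes first). Whether the proposed objective then admits its own closed form is exactly the conjecture to be verified.

    import numpy as np

    def absorption_probabilities(W, l):
        """B[i, k] = P(a walk started at unlabelled node i is absorbed
        at labelled node k), computed as B = (I - Q)^{-1} R.

        W : (n, n) weight matrix, labelled (absorbing) nodes ordered first.
        l : number of labelled nodes.
        """
        P = W / W.sum(axis=1, keepdims=True)  # row-normalized transition matrix
        Q = P[l:, l:]                         # unlabelled -> unlabelled transitions
        R = P[l:, :l]                         # unlabelled -> labelled (absorption)
        n_u = Q.shape[0]
        return np.linalg.solve(np.eye(n_u) - Q, R)

For reference, B @ f_l gives E[f(absorbing state reached)] for each unlabelled starting node, i.e. the first term inside the expectation above.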

The goal, then, is to verify the above conjectures and then compare the performance of the harmonic solutions generated using the Laplacian with those generated by minimizing the new objective function described above.



Datasets
To compare performance, the datasets used will be ones for which results of the standard semi-supervised regression technique are available. Artificial data sets, such as samples of curves generated in Matlab (mainly of the form z = f(x, y)), may be used for preliminary testing, as in the sketch below.
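A sketch of this preliminary setup (written in Python rather than Matlab, with a hypothetical test surface; any smooth f(x, y) would do):

    import numpy as np

    def make_surface_data(n=500, noise=0.05, seed=0):
        """Sample n noisy points from a smooth test surface z = f(x, y).

        The product-of-sinusoids surface is a hypothetical choice for
        preliminary testing; the noise level is likewise illustrative.
        """
        rng = np.random.default_rng(seed)
        X = rng.uniform(-1.0, 1.0, size=(n, 2))      # (x, y) inputs
        z = np.sin(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1])
        return X, z + noise * rng.standard_normal(n)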

Timeline:

  • By Milestone: Implement the proposed solution and the standard method, and try both on artificial data sets and on the data sets used by the papers that propose semi-supervised learning and its variants. Verify the harmonic property. (Possibly devise an alternative energy function to minimize if this one does not work well.)
  • Before Final: Debug the algorithm and test its correctness. Identify the strengths and weaknesses of the algorithm. Find a way to show that minimizing this energy function is not just a ‘hack’ but has theoretical backing. Apply it to a real dataset and evaluate the results.


References

  • [1] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.
  • [2] John Lafferty and Larry Wasserman. Statistical analysis of semi-supervised regression. In Advances in Neural Information Processing Systems (NIPS), 2007.