My email: vipul.r.kakkad@dartmouth.edu
Semi-supervised regression is performed by first modeling the data set as a complete directed graph with symmetric weights: there is a vertex for each data point, an edge in each direction between every pair of vertices, and each edge weight measures the similarity between the two data points that the edge connects.
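As a minimal sketch of this graph construction, the Python snippet below builds the symmetric weight matrix for a handful of points. The similarity measure is not specified above, so the Gaussian (RBF) kernel, the `sigma` bandwidth and the function name `similarity_weights` are all assumptions made for illustration.

```python
import math


def similarity_weights(points, sigma=1.0):
    """Symmetric weight matrix W for a complete graph over `points`.

    Assumption: similarity is a Gaussian (RBF) kernel of the Euclidean
    distance; the proposal only says that weights measure similarity.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]  # no self-loops: diagonal stays 0
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w = math.exp(-sq_dist / (2.0 * sigma ** 2))
            W[i][j] = w  # edge i -> j
            W[j][i] = w  # edge j -> i carries the same weight
    return W


# Three 1-D data points; the first two are close, the third is far away.
points = [(0.0,), (1.0,), (3.0,)]
W = similarity_weights(points)
```

Closer points receive larger weights, so `W[0][1]` is larger than `W[0][2]` here.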
A ‘real harmonic function’ on a graph is one whose value at each node equals the weighted average of its values at that node's neighbors. This property is intuitively appealing for regression purposes because of what we call the ‘Semi-Supervised Smoothness (SSS) Assumption’: the ground truth function that we are trying to model is smooth, so nodes/datapoints that are close to each other will have similar values of the function.
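Written out, with notation assumed here for illustration ($w_{ij}$ for the weight of the edge between nodes $i$ and $j$, and $f$ for the function), the harmonic property at a node $i$ reads:

```latex
f(i) = \frac{\sum_{j \neq i} w_{ij}\, f(j)}{\sum_{j \neq i} w_{ij}}
```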
If we normalize the outgoing edge weights of every node so that they sum to 1, the graph becomes a Markov chain in which each node is a state and the normalized weight matrix is the transition matrix. Using the Laplacian matrix, it is possible to find, for any two nodes, a harmonic function that gives the probability that a random walker starting at each node reaches one of the two nodes before the other. If we find these functions for all the labelled nodes, we can take a linear combination of them whose values at the training set equal the given values. We hence obtain a harmonic solution that minimizes the 'harmonic energy' of the graph for the given values at the labelled nodes.
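The sketch below illustrates the normalization step and one simple way to compute a harmonic solution on a toy graph. It is not the Laplacian-based construction described above: as a stand-in it repeatedly replaces each unlabelled value with the weighted average of its neighbors, which settles on the same harmonic function when every unlabelled node is connected to a labelled one. The function names, the 4-node weight matrix and the labels are hypothetical.

```python
def transition_matrix(W):
    """Row-normalize W so that every node's outgoing weights sum to 1."""
    P = []
    for row in W:
        total = sum(row)
        P.append([w / total for w in row])
    return P


def harmonic_solution(W, labels, sweeps=500):
    """Harmonic function that matches `labels` on the labelled nodes.

    `labels` maps node index -> known value. Unlabelled values are found by
    repeated weighted averaging (an assumption of this sketch, used in place
    of a direct solve with the Laplacian).
    """
    P = transition_matrix(W)
    n = len(W)
    start = sum(labels.values()) / len(labels)  # initial guess: label mean
    f = [labels.get(i, start) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            if i not in labels:
                # harmonic property: value = weighted average of neighbors
                f[i] = sum(P[i][j] * f[j] for j in range(n))
    return f


# Hypothetical symmetric weights on a complete 4-node graph.
W = [
    [0.0, 2.0, 1.0, 1.0],
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 2.0, 0.0],
]
f = harmonic_solution(W, labels={0: 0.0, 3: 1.0})
```

On this toy graph the unlabelled nodes 1 and 2 converge to 0.4 and 0.6: each leans toward the labelled node it is more strongly connected to.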
The problem with this approach is that these harmonic functions are biased and often have large flat areas where the function changes very little, so they do not capture the local behaviour of the function very well.
The proposed solution is to treat the labelled nodes as absorbing states in the graph. Using this, we can obtain, for any given starting node, the probability of being absorbed by each of these absorbing states. We can then minimize E[f(absorbing state reached) - f(starting node)] for each node. The hypothesis is that there exists a closed-form solution that allows us to compute this relatively efficiently. It is also possible that the new function obtained will maintain the harmonic property.
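To make the absorption probabilities concrete, here is a minimal sketch on a toy graph. It covers only the first step (the probabilities themselves) and does not attempt the minimization or the conjectured closed form. The function name `absorption_probabilities`, the weight matrix and the choice of labelled nodes are hypothetical, and the fixed-point iteration is just one simple way to obtain the values.

```python
def absorption_probabilities(W, labelled, sweeps=500):
    """B[i][k] = probability that a walk from unlabelled node i is
    absorbed at labelled node k, with labelled nodes as absorbing states."""
    n = len(W)
    unlabelled = [i for i in range(n) if i not in labelled]
    # Row-normalized transition probabilities of the random walk.
    P = [[w / sum(row) for w in row] for row in W]
    B = {i: {k: 0.0 for k in labelled} for i in unlabelled}
    for _ in range(sweeps):
        for i in unlabelled:
            for k in labelled:
                # Either step straight into k, or step to an unlabelled
                # node j and get absorbed at k from there.
                B[i][k] = P[i][k] + sum(P[i][j] * B[j][k] for j in unlabelled)
    return B


# Hypothetical symmetric weights; nodes 0 and 3 are the labelled ones.
W = [
    [0.0, 2.0, 1.0, 1.0],
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 2.0, 0.0],
]
B = absorption_probabilities(W, labelled=[0, 3])
```

Each row of `B` sums to 1, since a walk on this graph is eventually absorbed at one of the labelled nodes; here node 1 is absorbed at node 0 with probability 0.6 and at node 3 with probability 0.4.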
The goal, then, is to verify the above conjectures and then compare the performance of the harmonic solutions generated using the Laplacian with those generated by minimizing the new objective function described above.
Timeline:
References