ORCREC: An Intelligent Course Recommendation System

By James Brofos, Cameron Orth, Cooper Stimson
Table of Contents
1 Overview
2 Preprocessing
3 Algorithms
4 Milestone Report Results
5 Final Submission Expectations

1 Overview

Current course selection utilities are passive and lack the ability to form unique recommendations for users. To address this issue, we have developed numerical tools for mining the natural language in the Registrar's ORC for keywords and content indicators that characteristically define similar classes. Using data scraped from the ORC and clustering algorithms, we propose to deliver confident custom recommendations to Dartmouth students who are using the ORC to select courses and plan their schedules.

2 Preprocessing

The ORC presents course descriptions in natural language form. Prior to constructing descriptive numerical representations of the classes in the ORC, it was necessary to first parse the Registrar's website, pulling critical information from the Academic Timetable of Class Meetings. This information is then stored in JavaScript Object Notation (JSON, referring to both the notation and an object so represented): one .json file for courses (department offerings as listed, usually with descriptions, in the ORC) and one for classes (specific incarnations of a course, with term, instructor, and scheduling information). Below is an example of a JSON obtained by parsing the ORC:
{"TERM": "201203", "WC": "", "ROOM": "", "HOUR": "10A", "CID": "672", "LIM": "", "BLDG": "", "XLIST": "COSC 174 01", "DEPT": ["COSC", "COSC"], "NUM": ["074", "174"], "SEC": "01", "PROF": "Lorenzo Torresani", "DIST": "QDS", "ENRL": "37", "CRN": ["31512", "31925"], "NAME": ["Machine Lrng&Stat Analysis"]}, {"TERM": "201301", "WC": "", "ROOM": "008", "HOUR": "10A", "CID": "673", "LIM": "", "BLDG": "Kemeny Hall", "XLIST": "COSC 174 01", "DEPT": ["COSC", "COSC"], "NUM": ["074", "174"], "SEC": "01", "PROF": "Lorenzo Torresani", "DIST": "QDS", "ENRL": "61", "CRN": ["10905", "10920"], "NAME": ["Machine Learning"]}, {"TERM": "201209", "WC": "", "ROOM": "006", "HOUR": "2A", "CID": "674", "LIM": "", "BLDG": "Kemeny Hall", "XLIST": "COSC 175 01, QBS 175 01", "DEPT": ["COSC", "COSC"], "NUM": ["075", "175"], "SEC": "01", "PROF": "Gevorg Grigoryan, Christopher Bailey-Kellogg", "DIST": "TLA", "ENRL": "34", "CRN": ["91860", "91864"], "NAME": ["Intro. to Bioinformatics"]}, {"TERM": "201209", "WC": "", "ROOM": "B03", "HOUR": "2A", "CID": "675", "LIM": "45", "BLDG": "Moore", "XLIST": "COSC 179 01, PSYC 040 01", "DEPT": ["COSC", "COSC", "PSYC"], "NUM": ["079", "179", "040"], "SEC": "01", "PROF": "Richard Granger", "DIST": "SCI", "ENRL": "63", "CRN": ["90644", "91865", "90499"], "NAME": ["Intro Computatnal Neurosci", "Intro Computational Neurosci"]}, {"TERM": "201209", "WC": "", "ROOM": "007", "HOUR": "10A", "CID": "676", "LIM": "", "BLDG": "Kemeny Hall", "XLIST": "COSC 183 01", "DEPT": ["COSC", "COSC"], "NUM": ["083", "183"], "SEC": "01", "PROF": "Lorenzo Torresani", "DIST": "", "ENRL": "39", "CRN": ["91861", "91874"], "NAME": ["Computer Vision"]}, {"TERM": "201301", "WC": "", "ROOM": "105", "HOUR": "2", "CID": "677", "LIM": "", "BLDG": "Life Sciences Center", "XLIST": "COSC 184 01", "DEPT": ["COSC", "COSC"], "NUM": ["084", "184"], "SEC": "01", "PROF": "Gwen Spencer", "DIST": "TAS", "ENRL": "23", "CRN": ["10906", "10921"], "NAME": ["Mathematical Optimization"]}
In particular, naive fast-prototyped parsers were created in Python to scrape the large XML table on the Registrar's site and produce the classes JSON. A second naive Python parser then grouped the classes, and a third added course descriptions to produce the courses JSON. As of the milestone, these Python parsers are able to take 4,000 classes from the Academic Timetable and produce the two JSONs in under 15 minutes: fast enough for a nightly update run, but too slow for a runtime calculation. A next step will be a second parser in C, to make the parsing process more robust (to guard against changes to ORC formatting, for example) and faster, though still not fast enough for runtime. As a result, and because MATLAB can only meaningfully handle numerical data, parsing and feature extraction are performed offline, and the numerical analysis operates only on the precomputed representations described in the next section.
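As an illustration of the grouping step, the following minimal Python sketch collects scheduled classes into courses keyed by department and number. It assumes the classes file is a JSON array of objects with the fields shown above; the actual parsers' input handling and grouping rules may differ.

import json
from collections import defaultdict

def group_classes_into_courses(classes_path, courses_path):
    # Assumes the classes file is a JSON array of objects like those above;
    # the real parser's format and grouping rules may differ.
    with open(classes_path) as f:
        classes = json.load(f)

    grouped = defaultdict(lambda: {"classes": [], "names": set()})
    for cls in classes:
        # A cross-listed class belongs to every (DEPT, NUM) pair it carries.
        for dept, num in zip(cls["DEPT"], cls["NUM"]):
            key = dept + " " + num
            grouped[key]["classes"].append(cls["CID"])
            grouped[key]["names"].update(cls["NAME"])

    courses = {k: {"classes": v["classes"], "names": sorted(v["names"])}
               for k, v in grouped.items()}
    with open(courses_path, "w") as f:
        json.dump(courses, f, indent=2)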

3 Algorithms

3.1 Term Frequency-Inverse Document Frequency (TF-IDF):

TF-IDF provides a descriptive measure of how similar courses are to one another. Further, it uniquely captures which words are highly distinctive of certain class types. We have currently calculated one-gram TF-IDF feature vectors for a subset of courses in the Dartmouth ORC, and we do not anticipate that generalizing to all of the courses in the ORC will be difficult. We also intend to generalize the TF-IDF Python program to compute n-grams for the parsed ORC classes, and to use these n-grams as explanatory variables in constructing new hierarchical clustering models on the basis of the "nth order similarity" of Dartmouth courses in the ORC.
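The one-gram computation can be sketched as follows. Here term frequency is the raw count normalized by document length and inverse document frequency is log(N/df); our actual program's weighting scheme and tokenization may differ.

import math
from collections import Counter

def tfidf_vectors(descriptions):
    # One-gram TF-IDF: tf = count / document length, idf = log(N / df).
    docs = [Counter(text.lower().split()) for text in descriptions]
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())          # document frequency of each term
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        length = sum(doc.values())
        vectors.append([(doc[term] / length) * math.log(n_docs / df[term])
                        for term in vocab])
    return vocab, vectors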

3.2 Laplacian and Hessian Eigenmapping (LE and HLLE):

The explanatory vectors produced by the TF-IDF generating function tend to have high dimensionality as the number of courses under consideration increases. These large vectors are undesirable because the clustering algorithms tend to be unable to properly decompose their properties and relationships to other courses. To address this problem, we implemented Hessian Eigenmapping, which developed numerical precision problems related to eigenvalue calculation. To redress this auxiliary problem, we implemented Laplacian Eigenmapping, which demonstrates greater numerical stability and is capable of compressing the high-dimensional TF-IDF vectors to vectors containing only twelve components.
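For readers following along in Python, the compression step could be sketched with scikit-learn's SpectralEmbedding, which implements Laplacian Eigenmaps. The neighborhood size here is an assumption, and our own implementation may construct the affinity graph differently.

import numpy as np
from sklearn.manifold import SpectralEmbedding

def embed_tfidf(tfidf_matrix, n_components=12, n_neighbors=10):
    # SpectralEmbedding implements Laplacian Eigenmaps; twelve components
    # match the text above, but the neighborhood size is an assumption.
    embedding = SpectralEmbedding(n_components=n_components,
                                  affinity="nearest_neighbors",
                                  n_neighbors=n_neighbors)
    return embedding.fit_transform(np.asarray(tfidf_matrix))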

3.3 Agglomerative Hierarchical Clustering (AHC):

To analyze the structure of Dartmouth courses using an unsupervised methodology, we implemented AHC. This is advantageous for revealing patterns in the course space in the form of subgroups of classes, groups of classes, and broad categories of classes, all of which exist under the banner of Dartmouth classes. This methodology of further clustering classes as one ascends the tree allows us to maintain a distance measurement from every course to every other course on the basis of their distance in the tree structure. This distance is calculated as the cophenetic distance in the hierarchical clustering tree.
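A sketch of the clustering and leaf-to-leaf distance computation using SciPy follows; the linkage criterion shown is an assumption, and our implementation may use a different one.

from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def cluster_courses(embedded):
    condensed = pdist(embedded)                  # pairwise distances in the embedded space
    tree = linkage(condensed, method="average")  # agglomerative clustering; linkage choice assumed
    _, coph = cophenet(tree, condensed)          # cophenetic distance between every pair of leaves
    return tree, squareform(coph)                # square matrix with zero diagonal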

3.4 Stochastic Gradient Descent (SGD) with Multi-Dimensional Scaling (MDS):

To produce recommendations, we sum the cluster distances to a set of previously liked courses and perform a simple stochastic gradient descent. However, the target space is too large to descend efficiently in our target application, a web-based utility. To shrink the target space, we apply Multi-Dimensional Scaling to reduce it to a smaller set of representative points, and perform gradient descent over that set. Once minima have been found, gradient descent is reapplied within each cluster containing at least one of the representative minima.
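Because the course space is discrete, the descent is best thought of as moving between nearby courses rather than along a continuous gradient. The following hedged sketch illustrates one such stochastic step rule over the cophenetic-distance matrix; the neighborhood scheme and step count are illustrative assumptions, not our production logic.

import random

def descend(dist, liked, start, n_steps=50, k_neighbors=8):
    # dist: square cophenetic-distance matrix; liked: indices of liked courses.
    def objective(i):
        return sum(dist[i][j] for j in liked)

    current = start
    for _ in range(n_steps):
        # Candidate step: sample one of the current course's nearest neighbors.
        order = sorted(range(len(dist)), key=lambda j: dist[current][j])
        candidate = random.choice(order[1:k_neighbors + 1])
        if objective(candidate) < objective(current):
            current = candidate      # accept only distance-reducing moves
    return current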

3.5 Delegation Vote Conflation Committee (DVCC):

To avoid high-distance local minima and to permit identification of multiple low-distance minima, we have implemented a Python script to perform committee voting. Three delegations of three models each are trained on different data: one-, two-, and three-grams. Within each delegation, each delegate begins its gradient descent at a different point. The committee receives from each delegate a vote along with a percentage certainty rating; these certainty ratings weight the votes to optimize the final results, which are presented to the user.
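A minimal sketch of the conflation step, assuming each delegate reports a (course, certainty) pair and that votes are combined by a certainty-weighted sum; the actual script's weighting rule may differ.

from collections import defaultdict

def committee_vote(delegate_results):
    # delegate_results: one (course_id, certainty) vote per delegate,
    # with certainty in [0, 1]; combination rule is a simple weighted sum.
    scores = defaultdict(float)
    for course_id, certainty in delegate_results:
        scores[course_id] += certainty
    return sorted(scores, key=scores.get, reverse=True)

For example, committee_vote([("COSC 074", 0.9), ("COSC 074", 0.6), ("COSC 083", 0.8)]) would rank COSC 074 first, since its summed certainty (1.5) exceeds that of COSC 083 (0.8).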

4 Milestone Report Results

We have successfully parsed the ORC for data and have obtained JSON representations of all courses contained in the Registrar's database. For these classes we have calculated one-gram TF-IDF feature vectors; we still need to generate two-gram and three-gram vectors to yield the additional hierarchical clustering models on which our committee voting Python script will operate. We must also implement the mechanism that retrieves a recommendation from the class space for the user to examine and consider.

Our numerical progress so far has centered on developing hierarchical clustering and dimensionality-reduction algorithms. After applying Laplacian Eigenmapping to the TF-IDF data obtained from Python, we cluster the resulting lower-dimensional data using agglomerative hierarchical clustering and obtain a tree object that is illustrative of the structure of course similarities in the class space. For visualization purposes, we obtain the following tree structure from a subset of ninety-six courses in the ORC:
[Figure: Dendrogram of the agglomerative hierarchical clustering for the ninety-six-course subset]
On the basis of this clustering, we implemented in MATLAB a mechanism to obtain a measure of the distance from one leaf node (a single class in the course space) to any other leaf node. This metric is calculated as the cophenetic distance between two leaf nodes, and provides a measure of similarity between classes on the basis of their locations in the hierarchical tree structure. We visualize this matrix for the same ninety-six-course subset in the following image:
[Figure: Heatmap of the hierarchical (cophenetic) distance measure for the ninety-six-course subset]
Naturally this matrix is square. Notice that the distance between classes is zero along the main diagonal, reflecting the obvious fact that a course does not differ from itself. The other courses in the appropriate rows and columns of the matrix that have low distance measures (indicated in blue, as opposed to red courses, which are dissimilar) are candidates to be recommended to the user when a similar course is sought, and candidates to be avoided when a very different course is desired. The dimensionality-reduced feature vectors that describe the natural language descriptions of courses are not easily interpretable. However, it is possible to visualize their informative power. In particular, we can see that certain explanatory variables produced by Laplacian Eigenmapping (and clusters of those explanatory variables) are highly indicative of courses belonging to a given sub-hierarchy of classes in the tree structure.
[Figure: Dendrogram with heatmap of the Laplacian Eigenmapping explanatory variables]
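For reference, a heatmap like the distance figure above can be rendered from the square cophenetic-distance matrix with matplotlib; the colormap choice (blue for similar, red for dissimilar) is an assumption matching the description above.

import matplotlib.pyplot as plt

def plot_distance_heatmap(coph_matrix, path="heatmap.png"):
    # In the "jet" colormap, low distances render blue and high distances
    # red, matching the figure described above; the colormap is assumed.
    plt.imshow(coph_matrix, cmap="jet")
    plt.colorbar(label="cophenetic distance")
    plt.xlabel("course index")
    plt.ylabel("course index")
    plt.savefig(path)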

5 Final Submission Expectations

We intend to implement the following for the final submission:

References

1 Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006. Print.
2 Brazdil, Pavel B. Metalearning: Applications to Data Mining. Berlin: Springer, 2009. Print.
3 Donoho, D. L., and C. E. Grimes. "Hessian Eigenmaps: Locally Linear Embedding Techniques for High-Dimensional Data." Proceedings of the National Academy of Sciences 100 (2003): 5591-5596.
4 Gelbukh, Alexander, ed. Computational Linguistics and Intelligent Text Processing: 7th International Conference, CICLing 2006, Mexico City, Mexico, February 19-25, 2006, Proceedings. Berlin: Springer, 2006. Print.
5 Grenander, Ulf, and Michael Miller. Pattern Theory: From Representation to Inference. Oxford: Oxford UP, 2007. Print.
6 Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. "Document and Query Weighting Schemes." Introduction to Information Retrieval. New York: Cambridge UP, 2008. 118. Print.
7 Belkin, Mikhail, and Partha Niyogi. "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation." University of Chicago, Department of Mathematics, December 8, 2002.