ORCREC: An Intelligent Course Recommendation System

By MM. James Brofos, Cameron Orth, and Cooper Stimson
Table of Contents
1 PROBLEM >
2 DATA >
3 METHODS >
4 OPTIMIZATION >
5 MILESTONE ACCOMPLISHMENTS >

1 PROBLEM

A fundamental challenge facing Dartmouth students today is the seemingly impenetrable ORC, which is provided by the Office of the Registrar yet is difficult to manage effectively except in tempered portions. Thus, we propose to develop a recommender system specifically designed to make informed, data-driven decisions for course selection on the basis of objective and subjective data in a high-dimensional feature space. This recommender system will use a combination of distance optimization and automated voting frameworks to provide an intuitive and fast utility for course selection. In particular, we intend to develop a web interface, built on the Django package with routines in MATLAB, Python, and C, to allow Dartmouth students to efficiently plan their schedules based on their preferences.

DATA

We intend to parse both the ORC and the Student Assembly’s course evaluation sites to provide both objective and subjective data that will define a point in the “course-space.” In particular, objective data will be gathered to characterize course instructor, field of study, and the time of course offering. Subjectively, we will establish a list of crucial vocabulary to determine student impressions of courses, as well as course content. To provide a more user-oriented application, the web utility will require participants to input their previous course selections and their personal opinion of those courses. User-defined preferences will be numerically described as though, for instance, on a one-to-ten scale.

METHODS

We will produce recommendations via a committee of models. Each model will locate an optimal course point via stochastic gradient descent from a random starting point (of suitably low error), using a TBD Minkowski metric. We adopt the Minkowski metric so as to generalize a measure of similarity and distance between points in the course-space, allowing us to easily re-measure distance as either a city-block or a Euclidean metric. Once a point has been identified, the algorithm will probe outwards within a growing radius to find the nearest existing course. This course, C, along with the reciprocal of the product of its error and the matching radius,
E , will be returned as a pair.
The committee will be populated by three delegations: models trained on global data, models trained on major-specific data, and models trained on user-only data. Each delegation will consist of 3 models, each of which will select a different starting point within class-space.
Once each ( C,E ) pair has been returned, each vote will be weighted by its error-distance ( V= 1 E ), and votes will be tallied. The top results will be returned to the user.

OPTIMIZATION

Because the course-space is high dimensional and highly textural, our committee method will sum to a computationally intensive process; it is therefore imperative to optimize this to a return speed suitable for a web application (ideally strictly less than one second).
At a baseline computational level, we plan to accomplish this using Category Theory, mapping binary float operations into the category of rational functions, using as functors an array of number theoretic Diophantine approximation techniques, which will serve to minimize the error noise of the functor whilst simultaneously minimizing the magnitude of the rational denominators, thereby greatly increasing computational speed.
Additionally, we will prefilter the course-space using a computationally cheap, rough algorithm to toss out courses with extremely high error. Thus we reduce the input to the more numerically-intensive algorithms used to produce the final recommendations. This prefiltration algorithm will be optimized for performance in high-error regions.
Finally, we will select languages for different steps to optimize performance, using MATLAB where its built-in functions and specific optimizations are advantageous, C for heavy lifting computation, and Python for the final committee and GUI presentation.

MILESTONE ACCOMPLISHMENTS

By the milestone we intend to have completed the data parsing of both ORC and Student Assembly inputs. By this time, implementations of the learning algorithms (distance optimization, unique kernel methodologies, boosted committee voting, etc.) will exist in C, MATLAB, and Python, respectively. These implementations will be applied to hard-coded preference data to produce naive recommendations that will constitute preliminary results.

References

1Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006. Print.
2Gelbukh, Alexander. Computational Linguistics and Intelligent Text Processing: 7th International Conference, CICLing 2006, Mexico City, Mexico, February 19-25, 2006 : Proceedings. Seventh International Conference, CICLing, 2006, Mexico, Mexico City. Berlin: Springer, 2006. Print.
3Grenander, Ulf, and Michael Miller. Pattern Theory: From Representation to Inference. Oxford: Oxford UP, 2007. Print.
4Lang, Serge. Algebra. New York: Springer, 2002. Print.
5Zannier, Umberto. Lecture Notes on Diophantine Analysis. Pisa, Italy: Edizioni della Normale, 2009. Print.