ORCREC: An Intelligent Course Recommendation System

By James Brofos, Cameron Orth, Cooper Stimson
Table of Contents
1 Overview
2 Preprocessing
3 Algorithms
4 Milestone Report Results
5 Final Submission Expectations

1 Overview

Current course selection utilities are passive and lack the ability to form unique recommendations for users. To address this issue, we have developed numerical tools for mining the natural language in the Registrar's ORC for keywords and content indicators that characteristically define similar classes. Using data scraped from the ORC and clustering algorithms, we propose to deliver confident custom recommendations to Dartmouth students who are using the ORC to select courses and plan their schedules.

2 Preprocessing

The ORC presents course descriptions in natural language form. Prior to constructing descriptive numerical representations of the classes in the ORC, it was necessary to first parse the Registrar's website, pulling critical information from the Academic Timetable of Class Meetings. This information is then stored in JavaScript Object Notation (JSON, referring to both the notation and an object so represented): one .json file for courses (department offerings as listed, usually with descriptions, in the ORC) and one for classes (specific incarnations of a course, with term, instructor, and scheduling information). Below is an example of a JSON obtained by parsing the ORC:
{"TERM": "201203", "WC": "", "ROOM": "", "HOUR": "10A", "CID": "672", "LIM": "", "BLDG": "", "XLIST": "COSC 174 01", "DEPT": ["COSC", "COSC"], "NUM": ["074", "174"], "SEC": "01", "PROF": "Lorenzo Torresani", "DIST": "QDS", "ENRL": "37", "CRN": ["31512", "31925"], "NAME": ["Machine Lrng&Stat Analysis"]}, {"TERM": "201301", "WC": "", "ROOM": "008", "HOUR": "10A", "CID": "673", "LIM": "", "BLDG": "Kemeny Hall", "XLIST": "COSC 174 01", "DEPT": ["COSC", "COSC"], "NUM": ["074", "174"], "SEC": "01", "PROF": "Lorenzo Torresani", "DIST": "QDS", "ENRL": "61", "CRN": ["10905", "10920"], "NAME": ["Machine Learning"]}, {"TERM": "201209", "WC": "", "ROOM": "006", "HOUR": "2A", "CID": "674", "LIM": "", "BLDG": "Kemeny Hall", "XLIST": "COSC 175 01, QBS 175 01", "DEPT": ["COSC", "COSC"], "NUM": ["075", "175"], "SEC": "01", "PROF": "Gevorg Grigoryan, Christopher Bailey-Kellogg", "DIST": "TLA", "ENRL": "34", "CRN": ["91860", "91864"], "NAME": ["Intro. to Bioinformatics"]}, {"TERM": "201209", "WC": "", "ROOM": "B03", "HOUR": "2A", "CID": "675", "LIM": "45", "BLDG": "Moore", "XLIST": "COSC 179 01, PSYC 040 01", "DEPT": ["COSC", "COSC", "PSYC"], "NUM": ["079", "179", "040"], "SEC": "01", "PROF": "Richard Granger", "DIST": "SCI", "ENRL": "63", "CRN": ["90644", "91865", "90499"], "NAME": ["Intro Computatnal Neurosci", "Intro Computational Neurosci"]}, {"TERM": "201209", "WC": "", "ROOM": "007", "HOUR": "10A", "CID": "676", "LIM": "", "BLDG": "Kemeny Hall", "XLIST": "COSC 183 01", "DEPT": ["COSC", "COSC"], "NUM": ["083", "183"], "SEC": "01", "PROF": "Lorenzo Torresani", "DIST": "", "ENRL": "39", "CRN": ["91861", "91874"], "NAME": ["Computer Vision"]}, {"TERM": "201301", "WC": "", "ROOM": "105", "HOUR": "2", "CID": "677", "LIM": "", "BLDG": "Life Sciences Center", "XLIST": "COSC 184 01", "DEPT": ["COSC", "COSC"], "NUM": ["084", "184"], "SEC": "01", "PROF": "Gwen Spencer", "DIST": "TAS", "ENRL": "23", "CRN": ["10906", "10921"], "NAME": ["Mathematical Optimization"]}
In particular, naive fast-prototyped parsers were created in Python to scrape the large XML table on the Registrar's site and produce the classes JSON. A second naive Python parser then grouped the classes, and a third added course descriptions to produce the courses JSON. As of the milestone, these Python parsers are able to take 4,000 classes from the Academic Timetable and produce the two JSONs in under 15 minutes: fast enough for a nightly update run, but too slow for a runtime calculation. A next step will be a second parser in C, to make the parsing process more robust (to guard against changes to ORC formatting, for example) and faster, though still not fast enough for runtime. As a result, and because MATLAB can only meaningfully handle numerical data, parsing and feature extraction are performed offline, and the numerical analysis operates only on the precomputed representations described in the next section.
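As an illustration of the grouping step, the following minimal Python sketch collects scheduled classes into courses keyed by department and number. It assumes the classes file is a JSON array of objects with the fields shown above; the actual parsers' input handling and grouping rules may differ.

import json
from collections import defaultdict

def group_classes_into_courses(classes_path, courses_path):
    # Assumes the classes file is a JSON array of objects like those above;
    # the real parser's format and grouping rules may differ.
    with open(classes_path) as f:
        classes = json.load(f)

    grouped = defaultdict(lambda: {"classes": [], "names": set()})
    for cls in classes:
        # A cross-listed class belongs to every (DEPT, NUM) pair it carries.
        for dept, num in zip(cls["DEPT"], cls["NUM"]):
            key = dept + " " + num
            grouped[key]["classes"].append(cls["CID"])
            grouped[key]["names"].update(cls["NAME"])

    courses = {k: {"classes": v["classes"], "names": sorted(v["names"])}
               for k, v in grouped.items()}
    with open(courses_path, "w") as f:
        json.dump(courses, f, indent=2)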

3 Algorithms

3.1 Term Frequency-Inverse Document Frequency (TF-IDF):

TF-IDF provides a descriptive measure of how similar courses are to one another. Further, it uniquely captures which words are highly distinctive of certain class types. We have currently calculated one-gram TF-IDF feature vectors for a subset of courses in the Dartmouth ORC, and we do not anticipate that generalizing to all of the courses in the ORC will be difficult. We also intend to generalize the TF-IDF Python program to compute n-grams for the parsed ORC classes, and to use these n-grams as explanatory variables in constructing new hierarchical clustering models on the basis of the "nth order similarity" of Dartmouth courses in the ORC.
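The one-gram computation can be sketched as follows. Here term frequency is the raw count normalized by document length and inverse document frequency is log(N/df); our actual program's weighting scheme and tokenization may differ.

import math
from collections import Counter

def tfidf_vectors(descriptions):
    # One-gram TF-IDF: tf = count / document length, idf = log(N / df).
    docs = [Counter(text.lower().split()) for text in descriptions]
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())          # document frequency of each term
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        length = sum(doc.values())
        vectors.append([(doc[term] / length) * math.log(n_docs / df[term])
                        for term in vocab])
    return vocab, vectors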

3.2 Laplacian and Hessian Eigenmapping (LE and HLLE):

The explanatory vectors produced by the TF-IDF generating function tend to have high dimensionality as the number of courses under consideration increases. These large vectors are undesirable because the clustering algorithms tend to be unable to properly decompose their properties and relationships to other courses. To address this problem, we implemented Hessian Eigenmapping, which developed numerical precision problems related to eigenvalue calculation. To redress this auxiliary problem, we implemented Laplacian Eigenmapping, which demonstrates greater numerical stability and is capable of compressing the high-dimensional TF-IDF vectors to vectors containing only twelve components.
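For readers following along in Python, the compression step could be sketched with scikit-learn's SpectralEmbedding, which implements Laplacian Eigenmaps. The neighborhood size here is an assumption, and our own implementation may construct the affinity graph differently.

import numpy as np
from sklearn.manifold import SpectralEmbedding

def embed_tfidf(tfidf_matrix, n_components=12, n_neighbors=10):
    # SpectralEmbedding implements Laplacian Eigenmaps; twelve components
    # match the text above, but the neighborhood size is an assumption.
    embedding = SpectralEmbedding(n_components=n_components,
                                  affinity="nearest_neighbors",
                                  n_neighbors=n_neighbors)
    return embedding.fit_transform(np.asarray(tfidf_matrix))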

3.3 Agglomerative Hierarchical Clustering (AHC):

To analyze the structure of Dartmouth courses using an unsupervised methodology, we implemented AHC. This is advantageous for revealing patterns in the course space in the form of subgroups of classes, groups of classes, and broad categories of classes, all of which exist under the banner of Dartmouth classes. This methodology of further clustering classes as one ascends the tree allows us to maintain a distance measurement from every course to every other course on the basis of their distance in the tree structure. This distance is calculated as the cophenetic distance in the hierarchical clustering tree.
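A sketch of the clustering and leaf-to-leaf distance computation using SciPy follows; the linkage criterion shown is an assumption, and our implementation may use a different one.

from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def cluster_courses(embedded):
    condensed = pdist(embedded)                  # pairwise distances in the embedded space
    tree = linkage(condensed, method="average")  # agglomerative clustering; linkage choice assumed
    _, coph = cophenet(tree, condensed)          # cophenetic distance between every pair of leaves
    return tree, squareform(coph)                # square matrix with zero diagonal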

3.4 Stochastic Gradient Descent (SGD) with Multi-Dimensional Scaling (MDS):

To produce recommendations, we sum the cluster distances to a set of previously liked courses and perform a simple stochastic gradient descent. However, the target space is too large to descend efficiently in our target application, a web-based utility. To shrink the target space, we apply Multi-Dimensional Scaling to reduce it to a smaller set of representative points, and perform gradient descent over that set. Once minima have been found, gradient descent is reapplied within each cluster containing at least one of the representative minima.
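Because the course space is discrete, the descent is best thought of as moving between nearby courses rather than along a continuous gradient. The following hedged sketch illustrates one such stochastic step rule over the cophenetic-distance matrix; the neighborhood scheme and step count are illustrative assumptions, not our production logic.

import random

def descend(dist, liked, start, n_steps=50, k_neighbors=8):
    # dist: square cophenetic-distance matrix; liked: indices of liked courses.
    def objective(i):
        return sum(dist[i][j] for j in liked)

    current = start
    for _ in range(n_steps):
        # Candidate step: sample one of the current course's nearest neighbors.
        order = sorted(range(len(dist)), key=lambda j: dist[current][j])
        candidate = random.choice(order[1:k_neighbors + 1])
        if objective(candidate) < objective(current):
            current = candidate      # accept only distance-reducing moves
    return current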

3.5 Delegation Vote Conflation Committee (DVCC):

To avoid high-distance local minima and to permit identification of multiple low-distance minima, we have implemented a Python script to perform committee voting. Three delegations of three models each are trained on different data: one-, two-, and three-grams. Within each delegation, each delegate begins its gradient descent at a different point. The committee receives from each delegate a vote along with a percentage certainty rating; these certainty ratings weight the votes to optimize the final results, which are presented to the user.
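A minimal sketch of the conflation step, assuming each delegate reports a (course, certainty) pair and that votes are combined by a certainty-weighted sum; the actual script's weighting rule may differ.

from collections import defaultdict

def committee_vote(delegate_results):
    # delegate_results: one (course_id, certainty) vote per delegate,
    # with certainty in [0, 1]; combination rule is a simple weighted sum.
    scores = defaultdict(float)
    for course_id, certainty in delegate_results:
        scores[course_id] += certainty
    return sorted(scores, key=scores.get, reverse=True)

For example, committee_vote([("COSC 074", 0.9), ("COSC 074", 0.6), ("COSC 083", 0.8)]) would rank COSC 074 first, since its summed certainty (1.5) exceeds that of COSC 083 (0.8).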

4 Milestone Report Results

We have successfully parsed the ORC for data and have obtained JSON representations of all courses contained in the Registrar's database. For these classes we have calculated one-gram TF-IDF feature vectors; we still need to generate two-gram and three-gram vectors to yield the additional hierarchical clustering models on which our committee voting Python script will operate. We must also implement the mechanism that retrieves a recommendation from the class space for the user to examine and consider.

Our numerical progress so far has centered on developing hierarchical clustering and dimensionality-reduction algorithms. After applying Laplacian Eigenmapping to the TF-IDF data obtained from Python, we cluster the resulting lower-dimensional data using agglomerative hierarchical clustering and obtain a tree object that is illustrative of the structure of course similarities in the class space. For visualization purposes, we obtain the following tree structure from a subset of ninety-six courses in the ORC:
[Figure: Dendrogram of the agglomerative hierarchical clustering for the ninety-six-course subset]
On the basis of this clustering, we implemented in MATLAB a mechanism to obtain a measure of the distance from one leaf node (a single class in the course space) to any other leaf node. This metric is calculated as the cophenetic distance between two leaf nodes, and provides a measure of similarity between classes on the basis of their locations in the hierarchical tree structure. We visualize this matrix for the same ninety-six-course subset in the following image:
[Figure: Heatmap of the hierarchical (cophenetic) distance measure for the ninety-six-course subset]
Naturally this matrix is square. Notice that the distance between classes is zero along the main diagonal, reflecting the obvious fact that a course does not differ from itself. The other courses in the appropriate rows and columns of the matrix that have low distance measures (indicated in blue, as opposed to red courses, which are dissimilar) are candidates to be recommended to the user when a similar course is sought, and candidates to be avoided when a very different course is desired. The dimensionality-reduced feature vectors that describe the natural language descriptions of courses are not easily interpretable. However, it is possible to visualize their informative power. In particular, we can see that certain explanatory variables produced by Laplacian Eigenmapping (and clusters of those explanatory variables) are highly indicative of courses belonging to a given sub-hierarchy of classes in the tree structure.
[Figure: Dendrogram with heatmap of the Laplacian Eigenmapping explanatory variables]
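For reference, a heatmap like the distance figure above can be rendered from the square cophenetic-distance matrix with matplotlib; the colormap choice (blue for similar, red for dissimilar) is an assumption matching the description above.

import matplotlib.pyplot as plt

def plot_distance_heatmap(coph_matrix, path="heatmap.png"):
    # In the "jet" colormap, low distances render blue and high distances
    # red, matching the figure described above; the colormap is assumed.
    plt.imshow(coph_matrix, cmap="jet")
    plt.colorbar(label="cophenetic distance")
    plt.xlabel("course index")
    plt.ylabel("course index")
    plt.savefig(path)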

5 Final Submission Expectations

We intend to implement the following for the final submission:

References

1 Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006. Print.
2 Brazdil, Pavel B. Metalearning: Applications to Data Mining. Berlin: Springer, 2009. Print.
3 Donoho, D. L., and C. E. Grimes. "Hessian Eigenmaps: Locally Linear Embedding Techniques for High-Dimensional Data." Proceedings of the National Academy of Sciences 100 (2003): 5591-5596.
4 Gelbukh, Alexander, ed. Computational Linguistics and Intelligent Text Processing: 7th International Conference, CICLing 2006, Mexico City, Mexico, February 19-25, 2006, Proceedings. Berlin: Springer, 2006. Print.
5 Grenander, Ulf, and Michael Miller. Pattern Theory: From Representation to Inference. Oxford: Oxford UP, 2007. Print.
6 Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. "Document and Query Weighting Schemes." Introduction to Information Retrieval. New York: Cambridge UP, 2008. 118. Print.
7 Belkin, Mikhail, and Partha Niyogi. "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation." University of Chicago, Department of Mathematics, December 8, 2002.