Assembling a global database of university course listings​
by Amir Hosein Sadeghi Isfahani (He/him) ​
​University of Waterloo
The “global database of university course listings” shows interconnections between programs, courses, and course topics. The database offers benefits for both students and university administrators. While a global database allows a student to have a big picture of their program and stay motivated by knowing the ties between courses, it also enables universities to optimize their programs and courses and effectively run student exchange programs with their partner universities. Using the course listing at University of Waterloo, we are currently working to understand the interconnection between the course topics in the physics courses. Our initial goal is to have an intelligent agent that tells us how much the topics in two physics courses overlaps. This project is the first step towards our ambitious vision for fully-online and accessible degree programs based on the elements of computational thinking.

Introduction

Background

Students use course listings (or catalogs) to choose their courses of interest as they pursue their university degrees. Academic and/or program advisors usually help students in course selection, enabling students to refine and realize an academic plan that is based on and aligned with their goals and aspirations. On the other hand, universities compose course listings to efficiently manage their resources, standardize student admissions, create and facilitate student exchange programs, and handle student transfers.
Course listings are usually published once or twice per year, showing how frequently different courses are offered in each term or semester. While a course listing usually varies from one university to another, a typical one provides information about various courses in each subject as well as course requisites, components, descriptions, and possible instructors. Additionally, subject, course and component names differ from one listing to another.
From an administrative standpoint, the main challenge for comparability of course listings is the name discrepancy, not the structural difference. Therefore, the advantages of course listing are hindered and universities usually consider some procedures to resolve this issue. In particular, universities generally go through three stages to approve the courses that students took in colleges or other universities; they first check course descriptions, then course syllabi, and finally intended learning outcomes (ILOs). Hence, this issue raises a practical question about the applicability of course listings in a global landscape; if a global database of university course listings is designed and implemented, can it resolve this issue? What are other advantages of such a database? What is the roadmap toward such a database?
To start to answer the aforementioned questions, a global database of university course listings is created from the undergraduate (or graduate) course listings and other course-related data of participating universities. This database is trained to guess how much two randomly chosen courses, from similar or different subjects at the same university or two different universities, overlap in their topics, syllabi, and ILOs. Therefore, it can resolve the administrative issue by intelligently checking the course descriptions or any other course-related data that are fed into it. Being aware the interconnections between courses and/or programs at such a high level of granularity, universities can have customized/personalized programs without adding new programs and/or courses to their workflows. They also can optimize their programs and courses, reducing their costs. Additionally, this database benefits instructors and students by allowing them to know the overlaps and ties between courses. Using this database, instructors can tie their course materials to other courses students take in their projects, motivating students to learn the materials. Students can see a big picture of what they learn at their programs, discovering how different courses satisfy their interests and contribute to their skill-sets.
Finally, this global database has some advantages for the paradigm of computational thinking. it allows us to design a program in which the material in each course interconnects with materials in other ones. Moreover, if the design of such a program is integrated with elements of computation literacy, the program empowers its students with inherently-innovative computational thinking and prepares them for multi- and/or inter-disciplinary projects and working with rapid-advancing technologies. This universal database enables us to visually inspect the knowledge network of programs/course/course topics and extract some features of the space of university programs/course/course topics. These features are then used to design and implement a holistic approach to program and course design. While this holistic approach gives students knowledge and freedom to reroute their academic paths, it also provides instructors with knowledge and flexibility they need to design/revise their courses. This database can be used to design online courses and make their content interconnected. Moreover, this project itself can be expanded and used to define fully-online programs based on computation knowledge, furthering fair and accessible higher education all around the world.

Implementation and Analysis

Outline

As mentioned, the structure and contents of course listings change across universties. Additionally, the ways universities compile and upload course listings on their webpages are not similar. Consequently, the process of web scrapping, converting the mined data into a valid and structured dataset, and storing in an standard and secure manner is itself a challenge.
In the rest of this section, two separate dataset are constructed from the undergraduate course listings at UWaterloo and MIT. While it is possible to create a dataset from the whole undergraduate listing at both of these universities (In fact, a code snippet is written that produces a dataset from UWaterloo’s undergraduate course listing), ours effort is focused on physics subject. In this way, the analysing, testing, and troubleshooting of the project is faster and more reliable.
We start with scrapping the undergraduate course listing at UWaterloo. For this, we show a sample course information at UWaterloo (undergraduate) listing. This image reveals the structure and content of each course profile at UWaterloo listing. Next, we explore the XML source codes of UWaterloo course listing and extract some XML patterns by which we can build a dataset from the publicly-available UWaterloo course listing. A similar procedure can be repeated for MIT undergraduate course listing. However, it is more difficult to find XML patterns in MIT XML source codes. For this, we focus on MIT physics undergraduate course listing and create a dataset from its XML source code.
Having clean datasets, we generate some graphs that show the relations between courses in four different subjects at UWaterloo. These graphs provide useful insights into the the interconnections between different courses at the subjective level.
In the rest of notebook, we focus on finding the interconnections between different courses, using their course descriptions. For this, we work with physics undergraduate course listings at UWaterloo and MIT. These has two advantageous: We focus on a subject that we are familiar with its course topics and curricula. Additionally, we work with a prototype that eases analysing, testing, and troubleshooting of the project faster and more reliable.
To achieve or goal, we explain, implement, and discus different strategies in text analysis. Finally, we create a simple course recommendation system, based on the “edit distance”. For a given physics course at MIT, this system suggests the three contextually nearest physics courses at UWaterloo to that course.

Course listing at University of Waterloo

Design of undergraduate course listing at University of Waterloo

Some auxiliary functions for styling fonts, figures with captions,
In[]:=
styleTitle[title_String]:=Style[title,14,FontFamily"Helvetica"];​​styleFootnote[footnote_String]:=Style[footnote,10,Italic,FontFamily"Helvetica"];
The help webpage of UWaterloo listing
In[]:=
SetOptions[EvaluationNotebook[],DockedCells->None]​​SetDirectory[NotebookDirectory[]];​​listGuideLink="https://uwaterloo.ca/registrar/registering-courses/understanding-course-description-listings";
A sample course description at UWaterloo (The first image in the list of imported images).
Out[]=
Sample course information at UWaterloo
retrieved on Tue 13 Jul 2021 from University of Waterloo's Registrar Office Webpage.
Parsing course requisites: At UWaterloo, each course has zero or more pre-, co-, or anti-requisite(s). Moreover, some information such as the minimum grade for taking a course is usually associated with a string that contain the requisite information. requisiteParser parses requisite string and return a list of pre-, co-, or anti-requisite(s).
In[]:=
requisiteParser[req_String]:=Module[​​{minGradDroped=StringReplace[StringDelete[req,"("~~Shortest[___]~~"%"~~Shortest[___]~~")"](* Deleting the extra info about min grade *),{ "or "|"/"}->" "]}​​(*Replacing different course seperators with " "*),​​Flatten[StringCases[minGradDroped,​​(letters:Repeated[LetterCharacter,{2,Infinity}]..~~(" "|",")~~commaSeparatedNumbers:(((DigitCharacter..~~LetterCharacter)|(DigitCharacter..))|" "|","|";").. )​​/; UpperCaseQ[letters]:>​​Outer[StringJoin,{letters},StringCases[commaSeparatedNumbers,DigitCharacter..~~LetterCharacter|DigitCharacter..]]​​(* Each group of a sequence of 2 or more uppercase letters followed 1 or more sequece of 2 or more digits are related to one subject *) ]]];
Based on the structure of UWaterloo course data in the above figure, parseCourse extract all different information from the XMLBlock of a course.
In[]:=
parseCourse[course_XMLElement]:=Association[​​(* course code, subject code, category name, components, unit *)​​FirstCase[​​course[[3]],​​XMLElement["div",{"class""divTableCell"},{XMLElement["strong",{},​​{XMLElement["a",{"shape""rect","name"courseShortCode_String},{}],courseLongCode_String}]}]:><|​​"Code"->StringTrim[courseShortCode],​​"SubjectCode"->StringTrim[Part[StringSplit[courseLongCode],1]],​​"CatalogNumber"->StringTrim[Part[StringSplit[courseLongCode],2]],​​Association[#->True&/@StringSplit[Part[StringSplit[courseLongCode],3],","]],​​"Unit"->Interpreter["Number"][Part[StringSplit[courseLongCode],4]]|>],​​(* course ID *)​​"ID"->FirstCase[​​course[[3]],​​XMLElement[___,{"class""divTableCell crseid"},{courseIDString_String}]:>​​First@StringCases[courseIDString,DigitCharacter..]],​​(* course name *)​​"Name"->FirstCase[​​course[[3]],​​XMLElement["div",{"class""divTableCell colspan-2"},{XMLElement["strong",{},{courseName_String}]}]:>​​StringTrim[courseName]],​​(* course description *)​​"Description"->FirstCase[​​course[[3]],​​XMLElement["div",{"class""divTableCell colspan-2"},{courseDesc_String}]:>StringTrim[StringSplit[courseDesc,"["][[1]]]]​​(* a course description at UWaterloo ends with this pattern: ... [Inforamation on the terms in which a course offered].*),​​(* course notes *)​​"Note"->FirstCase[​​course[[3]],​​XMLElement["div",{"class""divTableCell colspan-2"},{note_String}]/;StringContainsQ[note,"Note:"]:>StringTrim[note],Missing["NotApplicable"],All], ​​(* pre-requisite(s) *)​​"Prerequisite"->FirstCase[​​course[[3]],element:​​XMLElement[___,___,{s_String}]/;StringContainsQ[s,"Prereq"]:>requisiteParser[Part[StringSplit[s,":"],2]],Missing["NotApplicable"],All],​​(* co-requisite(s) *)​​"Corequisite"->FirstCase[​​course[[3]],element:​​XMLElement[___,___,{s_String}]/;StringContainsQ[s,"Coreq"]:>requisiteParser[Part[StringSplit[s,":"],2]],Missing["NotApplicable"],All],​​(* anti-requisite(s) *)​​"Antirequisite"->FirstCase[​​course[[3]],element:​​XMLElement[___,___,{s_String}]/;StringContainsQ[s,"Antireq"]:>requisiteParser[Part[StringSplit[s,":"],2]],Missing["NotApplicable"],All],​​(* offered online *)​​"OfferedOnline"->FirstCase[​​course[[3]],element:​​XMLElement[___,___,{s_String}]/;StringContainsQ[s,"offered Online"]->True,Missing["NotApplicable"],All]​​];
This is the template for subjective course listing at UWaterloo: https://ucalendar.uwaterloo.ca/*AcademicCalender:1920*/COURSE/course-*SubjectCode:PHYS*.html where, for instance 1920 means academic 2019-20.​
​subjectLinkUW generates the link for a subject listing at a given academic calender.
In[]:=
subjectLinkUW[academicCal_Integer,subjectCode_String]:="https://ucalendar.uwaterloo.ca/"<>ToString[academicCal]<>"/COURSE/"<>"course-"<>subjectCode<>".html";
As mention, the XTML versions of the course listings do not follow a standard protocol. Here, importXML fixes a coding problem in HTML source files of UWaterloo course listings.
coursesInSubject extract all the courses in a subject listing.
subjectListGenUW downloads the subjective course at a given academic year, exports as a dataset, and also returns it as a dataset
The list of different course components at UWaterloo. The component list is used as guide for UWaterloo course listing.
Physics, Applied Mathematics, Mathematics, and Computer Science course listings at UWaterloo in the academic year 2021-22 is created and exported.

The dataset of UWaterloo undergraduate course listing

uwListingUndergrad downloads the UWaterloo undergraduate listing at a given academic, exports as a dataset, and also returns it as a dataset
The UWaterloo undergraduate listing in the academic year 2021-22

Course listing at MIT

The HTML sources of the course listing webpages at MIT are less structured than UWaterloo, so it is more difficult to extract information in a systematic way. Moreover, it is not possible to combine its pipeline with UWaterloo’s one. Consequently, MIT course listing has only information about course codes, names, and topics. Moreover, it is unclear how to access the archive of the course listings for previous years .

MIT physics undergraduate course listing

In comparison to UWaterloo, functions here are shorter and smaller in number. The focus is on the Physics course listing with subject code 8.
Based on the structure of MIT course description, parseCourseMIT extract all different information from the XMLBlock of a course.
This is the template for subjective course listing at MIT: http://student.mit.edu/catalog/m*SubjectCode*a.html where, for instance SubjectCode=8 is for the physics course listing in Fall 2021. subjectLinkMIT generates a valid link to a subjective course listing.
As mention, the XTML versions of the course listings do not follow a standard protocol. Here, importXMLMIT extract a list of course descriptions from a subjective course listing
subjectListGenMIT downloads the subjective course at the current term, exports it as a dataset, and also returns it as a dataset. The academicTerm argument, for instance Fall2021, is given as input by the user by hand
Physics course listing at MIT in the academic term Fall 2021 is created and exported.

Exploratory Data Visualization

The subjective or whole course listing is visualized to give some ideas about listings before text analysis.

Visualizing the relation between courses in UWaterloo

It is impossible to visually export all the course listings with the visualization tools due to differences in the course descriptions in different universities. Her
graphReq plot a common neighbors network with some chosen options, using the modularity measure — Caution: The readability of the visualized graphs is highly sensitive to input graph.
graphAntiReq plot a common neighbors network with some chosen options, using the modularity measure.
The networks of pre-, co-, and anti-requisite course are plotted for a given subjective course list
Let’s look at the requisites for different course at four subjects at UWaterloo:
This above figures show the complexity of course interconnections at the subjective level. For the courses in each subject, the interconnections with courses from other subjects are also considered in generating each sub-plot.

Text Analysis

There are different approaches by which the course similarity can be measured between two courses in one subject or two different subjects in one university or more universties. Regardless of the chosen approach, it is needed to first tokenize the course topic, syllabus, or ILOs, hence it can be used in Machine Learning (ML).

Making course information machine-understandable

Below, different tokenizing functions are defined and the reason for defining each of them is given.

Course descriptions at UWaterloo and MIT

Here, the subsets of UWaterloo and MIT physics course listings are selected and converted to associations. For UWaterloo physics course listing, the lab courses which are the co-requisites of other course are dropped and the ones which are independent courses are preserved. Moreover, these co-requisite lab courses usually do not have standalone course descriptions. At MIT, the lab courses are independent courses and need their theoretical counterparts as prerequisites. Additionally, they have their own course descriptions. Therefore, these lab course are not dropped from MIT physics course listing.
In[]:=
Association of pairs of course code and description.
Association of pairs of course code and name.

Tokenize functions

In natural language processing (NLP), tokenization is a process in which a sequence of characters into smaller units called tokens, based on a pre-defined document unit. Tokens are sometimes vaguely called words or terms. As an instance of the document unit, A token can be a character, a sequence of characters, a word, a phrase, or a sentence, depending on the context of document(s) and the measures used in definition of the document unit.
In what follows, several tokenization functions are introduced, each constructed based on a series of rules for defining a document unit. For instance, the first function tokenizeCommon use five rules: making uppercase, removing diacritics, deleting stop words, chunking into words, and trimming whitespace from the beginning and end of each word.
tokenizeCommon is a popular tokenizing function.
Course topics are usually noun phrases (more exactly one or more noun phrases that are connected with “and”) and are separated by “,”, “;”,”.”, and “and” — This statement is based on the visual inspection of course listing at UWaterloo and MIT. To have better understanding of these two features, let’s look at two course descriptions, one from each university
These feature of course topics can allow more precise definitions of tokenize functions, as done below.
tokenizeNounPhrase uses TextCases with text content type “NounPhrase” as the input form and tokenizes a course topic into a list of noun phrases. Since TextCases[text,”NounPhrase”] returns recursively noun phrases, i.e. return noun phrases and noun phrases with each noun phrases until there is no more noun phrases, DeleteDuplicates is applied to drop the repeated noun phrases at the course level.
tokenizeStringSplit uses {“,”, “;”,”.”} to split a course topic into a list of topics
Two other strategies can also be considered. In the first strategy, “Noun” text content type as the input form in Textcases.
tokenizeNoun returns a list of nouns as token. Like tokenizeNounPhrase, it drops the duplicates. Unlike tokenizeNounPhrase and tokenizeStringSplit, it drops stop
Before introducing the second strategy, let’s compare the outcomes of the tokenize functions:
As can be seen, tokenizeNounPhrase and tokenizeStringSplit produce long noun phrases (that can be understood as topics). On the other hand, tokenizeCommon and tokenizeNoun produce a sequence of words and nouns, respectively. While the former functions (tokenizeNounPhrase and tokenizeStringSplit) are human-friendly, the later ones (tokenize and tokenizeNoun) are machine-friendly and can ease the text analysis task at hand.
But let’s compare tokenize and tokenizeNoun with the original text and see which one compiles results closer to the original text
The above figure reveals the the tokenizeNoun drops the adjectives which are very important for several topics such as “linear” in “linear momentum”. Therefore, it is more helpful to use the popular tokenize in later analysis,
The second strategy is to delete words such as course(s), student(s), study, and the like that are common in course descriptions and are not technical terms or course topics. One question arises here: Are the list of common words constructed manually or in an automatic manner? To answer this question, let’s look at the word clouds of the whole physics listings at UWaterloo and MIT with their first five commonest words.
As can be seen, the word students is among the five common words—While words like physics or introduction or even applications seem to be irrelevant to course topics, they are possibly useful when interconnections between courses in same or different subjects are studied. Moreover, there are words such as course, topics, study and the like, and their associated forms (like courses for course) which are not useful. How can we efficiently handle these types of words? This is the topic of the next subsection. The interesting point here is that four of the five commonest words in each of the two listings are the same.

N-gram: Generating n-word sequences

An n-gram is a sequence of n words; a 2-gram or bigram is a two-word sequence of words like “electromagnetism theory” or “applied mathematics”, “kinetic energy”; a 3-gram or trigram is a three-word sequence of words like “conservation of energy”, “rotational kinetic energy. It is possible to create a list 1-grams, 2-grams, 3-grams, or higher n-grams from tokenized list. Let’s first define a n-gram generator.
nGram generates a sequence of n-grams from a given list — Caution: nGram returns one if n is equal to the length of the list and zero if n is larger than the length of the list.

A course recommendation system with Nearest function in WL

nGrammedText tokenized and bi-gramms the course descriptions.
In[]:=
Below, for each physics course at MIT, its three nearest courses at UWaterloo are found. There are more the one courses with the same name at MIT (For instance, there are four “Physics I”, each has their own course structure).

Concluding remarks

The final goal of this project is to produce a pipeline by which the interconnections between two courses at two different universities can be found. For this, we focused our effort on the physics undergraduate course listings at UWaterloo and MIT as the sample database.
In this post, we took the first step towards that goal by:
◼
  • Creating well-organized datasets of UWaterloo and MIT physics undergraduate course listing and discussing the strategies.
  • ◼
  • Visualizing the interconnections between different course at the subjective level.
  • ◼
  • Developing strategies for a success text analysis.
  • ◼
  • Implementing a simple yet useful course recommendation system that can suggests three contextually nearest course to a given course.
  • Keywords

    ◼
  • Course listing
  • ◼
  • Tokenize
  • ◼
  • N-gram
  • Acknowledgment

    Amir Sadeghi acknowledges useful discussions with Paul Abbott (Mentor), W Craig Carter (Mentor), Jason Sonnenberg (Lecturer), Mads Bahrami (WSS 21 Academic Director), Siria Sadeddin (TA), and Jesse Friedman (TA). Amir also acknowledges Paul Abbott (Mentor), W Craig Carter (Mentor), Mads Bahrami (WSS 21 Academic Director), Siria (TA), and Jesse Friedman (TA), and other WSS 21 attendees for their help and support in using Wolfram language and Mathematica for his project.

    References

    ◼
  • Private communication with mentors and TAs.
  • ◼
  • Wolfram Language & System Documentation Center.