create clusters based on most similar course descriptions #415
Conversation
[diff-counting] Significant lines: 140.
It looks great Kemba! I'm so excited to see progress with the recommendations feature. I left some minor comments about separating certain sections within your `clusters.py` file.
course-clusters/clusters.py
Outdated
subjects_url = "https://classes.cornell.edu/api/2.0/config/subjects.json?roster=FA23"
subjects = getSubjects(subjects_url)
Can we move this section into a new file and import the `subjects` variable?
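For instance, a minimal sketch of what that split could look like; the file name `subjects.py` is my suggestion, not from the PR:

```python
# subjects.py (hypothetical file name) -- isolates the subjects fetch
from scraper import getSubjects  # assumes getSubjects lands in the scraper module suggested below

subjects_url = "https://classes.cornell.edu/api/2.0/config/subjects.json?roster=FA23"
subjects = getSubjects(subjects_url)
```

Then `clusters.py` would just do `from subjects import subjects`.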
course-clusters/clusters.py
Outdated
urls = [f"https://classes.cornell.edu/api/2.0/search/classes.json?roster=FA23&subject={sub}" for sub in subjects]


def fetchDescriptions(urls):
    # Collect course descriptions across every subject's class listing.
    descriptions = []
    for url in urls:
        descriptions += loadCourseDesc(url)
    print(len(descriptions))
    return descriptions


course_descriptions = fetchDescriptions(urls)
Can we also move this to a separate file? We can add the `getSubjects` and `fetchDescriptions` functions into a separate `scraper.py` file.
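A rough sketch of what that module could look like. The bodies of `getSubjects` and `loadCourseDesc` below are assumptions (they aren't shown in this diff), based on the usual JSON shape of the Cornell Class Roster API (`data.subjects[].value` and `data.classes[].description`):

```python
# scraper.py -- a sketch; the JSON field names are assumptions about the roster API
import requests


def getSubjects(subjects_url):
    """Return the list of subject codes (e.g. 'CS') for the roster."""
    data = requests.get(subjects_url).json()
    return [s["value"] for s in data["data"]["subjects"]]  # assumed field names


def loadCourseDesc(url):
    """Return the course descriptions from one subject's class listing."""
    data = requests.get(url).json()
    return [c.get("description") for c in data["data"]["classes"]]  # assumed field names


def fetchDescriptions(urls):
    """Collect course descriptions across all subject URLs."""
    descriptions = []
    for url in urls:
        descriptions += loadCourseDesc(url)
    return descriptions
```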
course-clusters/clusters.py
Outdated
def preprocess_text(text):
    """Tokenization and preprocessing with custom stopwords."""
    custom_stopwords = ["of", "the", "in", "and", "on", "an", "a", "to"]
    strong_words = ["technology", "calculus", "business", "Artificial Intelligence", "First-Year Writing",
                    "computer", "python", "java", "economics", "US", "writing", "biology", "chemistry",
                    "physics", "engineering", "ancient", "programming", "algorithms", "data structures",
                    "art", "software", "anthropology", "databases", "fiction", "mathematics", "history",
                    "civilization"]

    # Lowercase, strip punctuation, and drop the custom stopwords.
    translator = str.maketrans("", "", string.punctuation)
    text = text.lower()
    text = text.translate(translator)
    tokens = text.split()
    tokens = [token for token in tokens if token not in custom_stopwords]
    # Up-weight domain-signal words by repeating them 10 extra times.
    for word in strong_words:
        if word in tokens:
            tokens += [word] * 10

    return " ".join(tokens)


def removeStopwords():
    """Removes stopwords from all the descriptions."""
    preprocessed = []
    for desc in course_descriptions:
        if desc:
            preprocessed.append(preprocess_text(desc))
    return preprocessed


preprocessed_descriptions = removeStopwords()
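For illustration, here's how `preprocess_text` behaves on a made-up description (the sample text is hypothetical, not from the dataset):

```python
sample = "An introduction to programming in Python, with a focus on algorithms."
print(preprocess_text(sample))
# Stopwords ("an", "to", "in", "a", "on") are dropped, and each strong word
# found ("programming", "python", "algorithms") is appended 10 extra times
# to up-weight it in the downstream similarity calculation.
```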
Let's also move this preprocessing out, separating the preprocessing/web scraping from the actual model itself. Can we also create some persistent objects to store the web-scraping data and the preprocessing results, so we don't have to redo that work every run?
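As a sketch of what that persistence could look like (the cache path and the use of `pickle` are my assumptions, not from the PR):

```python
import os
import pickle

CACHE_PATH = "preprocessed_descriptions.pkl"  # hypothetical cache file


def load_or_build_descriptions():
    """Load cached preprocessed descriptions, or build and cache them once."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    descriptions = fetchDescriptions(urls)  # from scraper.py
    preprocessed = [preprocess_text(d) for d in descriptions if d]
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(preprocessed, f)
    return preprocessed
```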
Summary
Created clusters based on the similarity of course descriptions, using a library called gensim to do those calculations for us.
Clusters are based on all courses offered in FA23.
Each cluster has 50 courses inside, or slightly more if an outlier was merged in. There are currently 87 clusters (a rough sketch of the embedding step follows).
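The model code itself isn't shown in this excerpt. As a rough sketch of the described approach (gensim embeddings over the preprocessed descriptions, then clustering), something along these lines would work; the PR's exact fixed-size-50 clustering and outlier merging aren't reproduced here:

```python
# Sketch only: shows the gensim embedding step plus a generic clustering pass;
# the PR's fixed-size clusters and outlier merging are not reproduced.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, text in enumerate(preprocessed_descriptions)]
model = Doc2Vec(docs, vector_size=100, min_count=2, epochs=40)
vectors = [model.dv[i] for i in range(len(docs))]

labels = KMeans(n_clusters=87, n_init=10, random_state=0).fit_predict(vectors)
```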
NOTES
Currently, courses that don't have a course description aren't accounted for; that's 97 courses.