Data
Data Overview
1. NUSMods
Course data was retrieved from the NUSMods API and compiled into modules.csv (7,015 rows, 14 columns).
| Core Fields | Description |
|---|---|
| moduleCode | Unique identifier |
| title | Course title |
| description | Course description |
| faculty | Offering faculty |
| prerequisite | Required modules |
| moduleCredit | Credit units |
2. MyCareersFuture Job Ads
The dataset consists of 22,720 job postings, flattened from JSON into a structured format.
| Core Fields | Description |
|---|---|
| title | Job title |
| skills | Required skills |
| categories | Job categories |
| minimum_years_experience | Experience required |
| salary_min / salary_max | Salary range |
| posted_date / expiry_date | Posting dates |
| position_levels | Seniority level |
Exploratory Data Analysis (EDA)
EDA identifies structural properties that inform modelling choices and potential sources of bias.
NUSMods
Distribution by Faculty

Figure 1: Module representation is uneven, with FASS, CDE, and Science dominating the corpus.
Implications for Framework:
- Similarity matching risks representation bias towards the faculties with the most modules
Mitigation:
- Construct degree-specific module baskets (≈15 core + 8 common modules)
- Use length-normalised embeddings for fair comparison
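One way to read "length-normalised" is L2 normalisation: each embedding vector is scaled to unit length so that similarity depends on direction rather than magnitude. The sketch below illustrates that reading only; the function names and sample vectors are hypothetical, and the framework's actual normalisation may differ.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of two L2-normalised vectors, i.e. their cosine similarity."""
    a_n, b_n = l2_normalise(a), l2_normalise(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Hypothetical embeddings: same direction, very different magnitudes.
short_profile = [1.0, 2.0, 2.0]
long_profile = [10.0, 20.0, 20.0]
similarity = cosine_similarity(short_profile, long_profile)
```

After normalisation the two profiles compare as near-identical, so a degree profile built from longer or more numerous texts is not favoured purely because of its vector magnitude.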
Description Length

Figure 2: Most descriptions fall within 60–100 words, with few long outliers (>250 words).
Implications for Framework:
- Descriptions provide sufficient semantic signal for embeddings.
- Text length is bounded during profile construction to control computational cost.
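Bounding text length can be as simple as a word-level cap. The following is a minimal sketch; the helper name `truncate_words` and the 250-word cap (borrowed from the outlier threshold in Figure 2) are assumptions, not the framework's documented settings.

```python
def truncate_words(text, max_words=250):
    """Keep at most max_words whitespace-delimited words (cap value is an assumption)."""
    words = text.split()
    return " ".join(words[:max_words])


# Hypothetical over-long description of 300 words.
long_description = " ".join(f"word{i}" for i in range(300))
bounded = truncate_words(long_description)
```

Descriptions already under the cap pass through unchanged apart from whitespace normalisation.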
MyCareersFuture Job Ads
Market Breadth and Skills

Figure 3: Soft skills (e.g., teamwork, communication) dominate job postings.
Implications for Framework:
- These skills are non-discriminative and introduce noise.
- They are removed during preprocessing.
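A minimal sketch of this filtering step is shown below. The stop-list `GENERIC_SOFT_SKILLS` is a hypothetical example seeded with the two skills named in Figure 3; the real list used by the framework is not given here.

```python
# Hypothetical stop-list; only "teamwork" and "communication" come from the text above.
GENERIC_SOFT_SKILLS = {"teamwork", "communication", "leadership", "interpersonal skills"}


def clean_skills(skills):
    """Lowercase, collapse whitespace, drop generic soft skills and duplicates."""
    cleaned = []
    seen = set()
    for skill in skills:
        normalised = " ".join(skill.lower().split())
        if not normalised or normalised in GENERIC_SOFT_SKILLS or normalised in seen:
            continue
        seen.add(normalised)
        cleaned.append(normalised)
    return cleaned


technical_skills = clean_skills(["Python", "Teamwork", " SQL ", "python", "Communication"])
```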
Category Co-occurrence

Figure 4: Job categories frequently co-occur, reflecting overlapping roles.
Implications for Framework:
- Categories are retained as structured features to enrich representations.
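Co-occurrence of the kind summarised in Figure 4 can be counted by tallying unordered category pairs per posting. This is an illustrative sketch only; the function name and the sample category labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def category_cooccurrence(postings):
    """Count how often each unordered pair of categories appears in the same posting."""
    counts = Counter()
    for categories in postings:
        unique = sorted(set(categories))  # sort so each pair has one canonical order
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Hypothetical multi-label category lists, one per posting.
sample_postings = [
    ["Information Technology", "Engineering"],
    ["Engineering", "Information Technology", "Consulting"],
    ["Consulting"],
]
pair_counts = category_cooccurrence(sample_postings)
```

Keeping categories as a multi-label list, rather than collapsing each posting to one label, is what makes this kind of overlap visible.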
Seniority and Experience

Figure 5: Entry-level roles dominate, though senior roles exist.
Implications for Framework:
- The entry-level skew aligns with the framework's focus on graduate outcomes
- Senior roles are excluded to keep the job set relevant to fresh graduates
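The exclusion could be expressed as a simple predicate over the `position_levels` and `minimum_years_experience` fields. In the sketch below, the senior-level labels and the experience cut-off are assumptions chosen for illustration; the actual labels and thresholds are not stated in this section.

```python
# Hypothetical seniority labels and experience cut-off; adjust to the real data.
SENIOR_LEVELS = {"senior management", "senior executive", "middle management"}
MAX_YEARS_EXPERIENCE = 2


def is_graduate_relevant(posting):
    """Return True if a posting is neither senior-level nor above the experience cut-off."""
    levels = {level.lower() for level in posting.get("position_levels", [])}
    if levels & SENIOR_LEVELS:
        return False
    years = posting.get("minimum_years_experience")
    return years is None or years <= MAX_YEARS_EXPERIENCE


# Hypothetical postings.
sample_jobs = [
    {"title": "Junior Analyst", "position_levels": ["Fresh/entry level"], "minimum_years_experience": 0},
    {"title": "Head of Data", "position_levels": ["Senior Management"], "minimum_years_experience": 10},
    {"title": "Engineer", "position_levels": ["Executive"], "minimum_years_experience": 5},
]
relevant_titles = [job["title"] for job in sample_jobs if is_graduate_relevant(job)]
```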
Data Cleaning and Preprocessing
NUSMods
- Standardise text: Remove HTML and normalise whitespace
- Handle missing descriptions: Use title as fallback if informative and exclude generic modules (e.g., internship, UROPS)
- Construct module text: Combine cleaned title and description
- Filter by relevance: Derive module level from code and retain undergraduate modules
- Remove low-quality entries: Drop modules without sufficient text
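The steps above can be sketched with the standard library alone. The helper names, the regex-based HTML stripping, the three-word minimum, and the assumption that the first digit of a module code encodes its level (with levels 1000 to 4000 treated as undergraduate) are all illustrative choices rather than the pipeline's confirmed implementation.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")         # crude tag stripper, adequate for a sketch
LEVEL_RE = re.compile(r"^[A-Z]+(\d)")   # letters, then the first digit of the code


def clean_text(raw):
    """Remove HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw or "")
    return " ".join(html.unescape(no_tags).split())


def module_level(module_code):
    """Derive the level (e.g. 1000, 4000) from a code such as 'CS1010'; None if unparseable."""
    match = LEVEL_RE.match(module_code)
    return int(match.group(1)) * 1000 if match else None


def is_undergraduate(module_code):
    """Assumption: levels 1000 to 4000 are undergraduate."""
    level = module_level(module_code)
    return level is not None and 1000 <= level <= 4000


def build_module_text(title, description):
    """Combine cleaned title and description, falling back to the title alone."""
    title_c, desc_c = clean_text(title), clean_text(description)
    return f"{title_c}. {desc_c}" if desc_c else title_c


def has_sufficient_text(text, min_words=3):
    """Hypothetical quality gate: keep modules with at least min_words words."""
    return len(text.split()) >= min_words


module_text = build_module_text("Machine Learning", "<p>Covers supervised &amp;  unsupervised methods.</p>")
```

A module would then be kept only if `is_undergraduate(code)` and `has_sufficient_text(module_text)` both hold.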
MyCareersFuture Job Ads
- Parse and structure data: Convert JSON into structured format and preserve multi-label fields
- Clean text fields: Remove HTML, URLs, and boilerplate, then filter low-information descriptions
- Clean skills: Standardise text (lowercase), remove generic soft skills and retain meaningful technical terms
- Filter by scope: Exclude internships, academic, and senior roles
- Deduplicate postings: Remove exact and near-duplicates
- Construct job text: Combine title, categories, skills and truncated description
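The last two steps can be illustrated as follows. Near-duplicate detection here uses Jaccard similarity over word sets with a 0.8 threshold, which is a stand-in technique chosen for brevity; the method and threshold actually used are not specified above, and the pairwise comparison would be slow on the full 22,720 postings. All function names and the 100-word description cap are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts, threshold=0.8):
    """Keep the first of each group of exact or near-duplicate texts (O(n^2) sketch)."""
    kept, kept_tokens = [], []
    for text in texts:
        tokens = set(text.lower().split())
        if any(jaccard(tokens, seen) >= threshold for seen in kept_tokens):
            continue
        kept.append(text)
        kept_tokens.append(tokens)
    return kept


def build_job_text(title, categories, skills, description, max_desc_words=100):
    """Combine title, categories, skills, and a truncated description into one string."""
    truncated = " ".join(description.split()[:max_desc_words])
    parts = [title, ", ".join(categories), ", ".join(skills), truncated]
    return ". ".join(part for part in parts if part)


# Hypothetical postings: the second and third duplicate the first.
sample_texts = [
    "Data analyst role using python and sql",
    "Data analyst role using Python and SQL",
    "Data analyst role using Python and SQL daily",
    "Software engineer building backend services",
]
unique_texts = deduplicate(sample_texts)
job_text = build_job_text("Data Analyst", ["Information Technology"], ["python", "sql"], "a b c", max_desc_words=2)
```

For a corpus of this size, a hashing-based approach such as MinHash would scale better than the pairwise loop, at the cost of an extra dependency.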