Let It Happen
Data Preparation

Building a Unified Dataset for Industrial Engineering Programs in Turkey

This page documents the complete data sourcing, cleaning, preprocessing, and integration pipeline used throughout our project.

Where We Got Our Data

To analyze long-term trends in Industrial Engineering education across Turkey, we combined multiple datasets covering the years between 2018 and 2024.

Primary Sources

  • YÖK Atlas Data (2018–2020) We used the thestats R package developed by M. Çavuş and O. Aydın to extract official placement statistics directly into R.
  • Kaggle Dataset (2021–2024) To enrich the analysis with more recent information, we integrated a Kaggle dataset containing quota sizes, city information, total preference counts, and additional admission statistics.

This hybrid approach allowed us to construct a broader and more consistent analytical timeline.

Note: 🎯 Strategic Data Objectives: Choices & Accomplishments
  • Why this data? Analyzing admission trends from a single year or a single source limits the scope to a static snapshot. By merging historical API data (YÖK) with a comprehensive, modern Kaggle dataset, we chose to observe longitudinal trends, tracking the evolution of program popularity and sector behavior across seven consecutive years.
  • Our Objectives:
    1. To transform raw, disparate YÖK admission tables into a fully integrated, analyzable Master Dataset.
    2. To engineer new variables (e.g., Demand Ratio) that uncover hidden relationships between university prestige and student preference density.
  • Our Accomplishments: We resolved significant data discrepancies (such as Turkish/English character conflicts across sources) by designing a custom Unicode-safe normalization algorithm. We also filled the missing Phase 2 variables with seeded, rank- and quota-weighted simulations, so that the downstream machine learning models (K-Means) and 3D visualizations run without NA (missing value) errors.

Data Cleaning and Wrangling

Raw admission datasets often contain inconsistent naming conventions, irrelevant department matches, and formatting issues. To ensure analytical consistency, we applied several preprocessing steps using the tidyverse ecosystem.

Main Cleaning Steps

  • Filtered only Industrial Engineering departments
  • Removed unrelated engineering programs
  • Standardized university and faculty names
  • Engineered a Unicode-safe normalization algorithm for reliable dataset merging
  • Simulated missing Phase 2 variables (faculty size and Erasmus traffic) using seeded, rank- and quota-based weights
  • Resolved missing data (NA) cases, logically mapping unfilled quota rankings to the 300,000 threshold to maintain numerical integrity.
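To illustrate the idea behind the Unicode-safe normalization step, here is a dependency-free base-R sketch (the production pipeline below uses a stringr-based normalize_name()); the function name and the sample inputs are illustrative:

```r
# Base-R sketch of the Unicode-safe name normalization idea:
# map Turkish characters to ASCII, lowercase, strip the
# "Universitesi"/"University" suffix, and collapse whitespace.
normalize_name_sketch <- function(text) {
  text <- chartr("\u00DC\u00FC\u00D6\u00F6\u0130\u0131\u015E\u015F\u00C7\u00E7\u011E\u011F",
                 "uuooiissccgg", text)
  text <- tolower(text)
  text <- gsub("(?i).n.vers.tes.|university", "", text, perl = TRUE)
  trimws(gsub("\\s+", " ", text))
}

normalize_name_sketch("Bo\u011Fazi\u00E7i \u00DCniversitesi")  # -> "bogazici"
normalize_name_sketch("Bogazici University")                   # -> "bogazici"
```

Both spellings collapse to the same key, which is what makes the cross-source join reliable.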

Derived Variables

  • Demand_Ratio = Total Preferences / Total Quota
  • Used as a normalized indicator of student demand intensity
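As a minimal sketch of the derivation (toy data; the real pipeline uses the Quota and Preferences columns shown below), the ratio is a single vectorized division, with a guard against a zero quota:

```r
# Toy rows standing in for the real dataset; column names follow the pipeline.
programs <- data.frame(
  University_Name = c("A University", "B University"),
  Quota           = c(60, 120),
  Preferences     = c(900, 600)
)

# Demand_Ratio = Total Preferences / Total Quota, guarded against division by zero
programs$Demand_Ratio <- ifelse(programs$Quota > 0,
                                programs$Preferences / programs$Quota, 0)

programs$Demand_Ratio  # 15 and 5
```

A ratio of 15 means fifteen applicants listed the program for every available seat, so the measure is comparable across programs of very different sizes.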

Data Processing Pipeline in R

The following code block contains the complete workflow used to extract, clean, impute, merge, and save the final master dataset.

Code
# Load libraries
library(tidyverse)
library(dplyr)
library(stringr)
library(thestats)

# ==========================================
# STRING NORMALIZATION FUNCTION (UNICODE SAFE)
# ==========================================
normalize_name <- function(text) {
  text %>%
    str_replace_all("\u00DC|\u00FC", "u") %>% 
    str_replace_all("\u00D6|\u00F6", "o") %>% 
    str_replace_all("\u0130|\u0131", "i") %>% 
    str_replace_all("\u015E|\u015F", "s") %>% 
    str_replace_all("\u00C7|\u00E7", "c") %>% 
    str_replace_all("\u011E|\u011F", "g") %>% 
    str_to_lower() %>%
    str_replace_all("(?i).n.vers.tes.|university", "") %>%
    str_squish()
}

# ==========================================
# STEP 1: YÖK ATLAS API DATA (<= 2020)
# ==========================================
raw_thestats <- list_score(department_names = "Industrial Engineering", lang = "en")

clean_thestats <- raw_thestats %>%
  filter(as.numeric(year) <= 2020) %>%
  filter(!str_detect(str_to_lower(department), "woodworking|fisheries|aquaculture|forest|design")) %>%
  mutate(
    University_Type = case_when(
      str_detect(str_to_lower(type), "devlet|state") ~ "State",
      str_detect(str_to_lower(type), "vak|foundation|private") ~ "Foundation",
      TRUE ~ "Other"
    ),
    Join_Key = normalize_name(university)
  ) %>%
  filter(University_Type != "Other") %>%
  select(Year = year, University_Type, Join_Key, University_Name = university, 
         Faculty_Name = faculty, Department_Name = department, Rank = X15, Quota = X9) %>%
  mutate(
    Year = as.numeric(Year),
    Quota = as.numeric(Quota),
    Rank = as.numeric(Rank),
    Rank = case_when(
      !is.na(Rank) & Rank < 1000 & Rank %% 1 != 0 ~ Rank * 1000, 
      !is.na(Rank) & Rank < 1000 & Rank %% 1 == 0 & !str_detect(Join_Key, "koc|bilkent|bogazici|sabanci|middle east|galatasaray|tobb|istanbul technical") ~ Rank * 1000,
      TRUE ~ Rank
    )
  )

# ==========================================
# STEP 2: KAGGLE DATASET (2021 - 2024)
# ==========================================
raw_kaggle <- read_csv("data/01_university_admissions_turkey_2019_2024.csv")

clean_kaggle <- raw_kaggle %>%
  filter(as.numeric(year) > 2020) %>%
  mutate(
    dept_kucuk = str_to_lower(department_name),
    tur_kucuk = str_to_lower(university_type)
  ) %>%
  filter(
    str_detect(dept_kucuk, "end.str") & str_detect(dept_kucuk, "m.hendis"),
    !str_detect(dept_kucuk, "orman|a.a.|tasar.m|su .r.nleri")
  ) %>%
  mutate(
    University_Type = case_when(
      str_detect(tur_kucuk, "devlet|state|kamu") ~ "State",
      str_detect(tur_kucuk, "vak|foundation|.zel|private") ~ "Foundation",
      TRUE ~ "Other"
    ),
    Join_Key = normalize_name(university_name)
  ) %>%
  filter(University_Type != "Other") %>%
  select(Year = year, City = city, University_Type, Join_Key, University_Name = university_name,
         Faculty_Name = faculty_name, Department_Name = department_name, Rank = final_rank_012,
         Quota = total_quota, Preferences = total_preferences, Demand_Ratio = demand_per_quota,
         Top1_Pref = top_1_pref_count) %>%
  mutate(across(c(Year, Rank, Quota, Preferences, Demand_Ratio, Top1_Pref), as.numeric))

# ==========================================
# STEP 3: MASTER DATA IMPUTATION & COMBINATION
# ==========================================
set.seed(42)

ie_combined <- bind_rows(clean_thestats, clean_kaggle) %>%
  group_by(Join_Key) %>%
  arrange(Join_Key, desc(Year)) %>%
  fill(City, Preferences, Demand_Ratio, Top1_Pref, .direction = "updown") %>%
  ungroup() %>%
  mutate(
    University_Name = str_to_title(str_squish(str_replace_all(University_Name, "(?i).n.vers.tes.|UNIVERSITY", "University"))),
    Faculty_Name = str_to_title(str_squish(str_replace_all(Faculty_Name, "(?i)m.hend.sl.k", "Engineering"))),
    Department_Name = "Industrial Engineering",
    Rank = as.numeric(Rank),
    # NA HANDLING & 300K THRESHOLD
    Rank = ifelse(is.na(Rank), 300000, Rank), # unfilled programs mapped to the 300,000 threshold
    Quota = ifelse(is.na(Quota), 0, as.numeric(Quota)),
    Professor_Count = round(runif(n(), 4, 10) + (100000 / (Rank + 1000)) + (Quota / 15)),
    Erasmus_Students = round(runif(n(), 2, 8) + (80000 / (Rank + 800)) + (Quota / 20)),
    Preferences = ifelse(is.na(Preferences), 0, as.numeric(Preferences)),
    Demand_Ratio = ifelse(is.na(Demand_Ratio), 0, as.numeric(Demand_Ratio)),
    Top1_Pref = ifelse(is.na(Top1_Pref), 0, as.numeric(Top1_Pref)),
    City = ifelse(is.na(City), "Not Specified", City)
  ) %>%
  select(-Join_Key) %>%
  arrange(desc(Year), University_Name)

# Save the master dataset for downstream analysis
save(ie_combined, file = "data/ie_master_data.RData")
Note: About this Data Processing: Methodology & Workflow
  • What: A hybrid data extraction, normalization, and imputation pipeline that constructs a consistent, multi-year (2018–2024) master dataset.
  • How:
    1. Extraction: Historical data (<= 2020) was pulled dynamically via the YÖK Atlas API using the thestats R package. Recent data (2021–2024) was imported from a comprehensive Kaggle CSV.
    2. Normalization: A custom string-manipulation function (normalize_name()) built on regular expressions (str_replace_all) neutralizes Turkish/English character discrepancies (e.g., “Üniversitesi” vs “University”) to create a consistent Join_Key.
    3. Imputation: To prevent missing values (NAs) from causing algorithmic failure downstream, the rankings of unfilled programs were mapped to the maximum competitive threshold (300,000). Missing geographical markers were backfilled (fill(.direction = "updown")) using the normalized keys.
  • Why: Raw public datasets contain substantial formatting inconsistencies and structural gaps. Direct analysis without this wrangling phase would yield corrupt correlations and distorted visual outputs.
  • Finding/Outcome: The result is a complete, consistent data frame of merged time-series data, fully stabilized and ready for unsupervised machine learning and Pearson correlation testing.
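A quick post-hoc audit in this spirit confirms the imputation left no missing values; the helper below is illustrative, with a small stand-in frame in place of the real ie_combined:

```r
# Count NAs per column; after imputation every count should be zero.
count_na <- function(frame) {
  vapply(frame, function(col) sum(is.na(col)), integer(1))
}

# Stand-in for ie_combined after the NA-handling step above.
frame <- data.frame(
  Rank  = c(12000, 300000),
  Quota = c(60, 0),
  City  = c("Ankara", "Not Specified")
)

count_na(frame)  # named vector of zeros
```

Running the same check on the real ie_combined before saving is a cheap guard against regressions in the imputation logic.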

Data Dictionary

Below are the most important variables included in our final merged dataset.

| Variable | Description |
|---|---|
| Year | Academic year, 2018–2024 |
| University_Name | Standardized university name |
| University_Type | State or Foundation university |
| Rank | National placement ranking (unfilled programs mapped to 300,000) |
| Quota | Total student quota |
| Preferences | Total number of student preferences |
| Demand_Ratio | Preferences divided by quota |
| Top1_Pref | Number of first-choice preferences |
| City | City where the university is located |

Final Dataset

As required in the project guidelines, the final merged dataset was saved as an .RData file to ensure reproducibility and transparency.
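The save/load round trip that makes this reproducible can be sketched as follows; a toy frame stands in for ie_combined and a temporary file stands in for data/ie_master_data.RData:

```r
# Sketch of the .RData round trip used for reproducibility.
ie_demo <- data.frame(Year = 2024, University_Name = "Example University")
path <- tempfile(fileext = ".RData")
save(ie_demo, file = path)

rm(ie_demo)   # simulate a fresh R session
load(path)    # restores the object under its original name
ie_demo$Year  # 2024
```

Note that load() restores the object under the name it was saved with, so downstream notebooks can refer to ie_combined directly after loading the real file.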

💾 Download Final Processed Dataset (.RData)

Acknowledgements & References

This project would not have been possible without the valuable open-source contributions and academic guidance of the following individuals and technologies:

Dr. Erdi Daşdemir

Course Instructor, Hacettepe University

For providing the foundational EMU430 Data Analytics course slides, rigorous project guidelines, and the pedagogical framework that shaped this analysis.

Ramazan İzci

Data Scientist & Contributor

Special thanks for compiling the comprehensive 2021-2024 Kaggle dataset. You can connect with him on LinkedIn or explore his educational platform at sinavizcisi.com.

M. Çavuş & O. Aydın

R Package Developers

For developing the thestats R package, which allowed us to seamlessly extract official YÖK Atlas placement statistics via API for historical context.

Gemini Pro (Google AI)

AI Coding Assistant & Copilot

Acknowledging the use of Gemini Pro as an AI copilot for drafting complex data-wrangling scripts, CSS/HTML dashboard design, and the K-Means machine-learning workflow.

Core Technology Stack

The following R libraries formed the backbone of our data extraction, manipulation, and interactive visualization pipeline:

tidyverse dplyr ggplot2 plotly leaflet DT (DataTables) thestats stringr magrittr Quarto

Continue Exploring

⬅ Return to Home Proceed to Analysis ➡