DATA

DATA CLEANING AND EDA ANALYSIS

Data source: TÜİK

We imported the data, which we received from TÜİK as an Excel file, into R with the read_excel() function from the readxl package.
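As a minimal sketch of this import step, the snippet below lists the sheets of a workbook and reads one of them by name. It uses the example workbook that ships with readxl (datasets.xlsx, located with readxl_example()) as a stand-in, because the project workbook itself is not available here; the object names example_path, sheets and mtcars_sheet are illustrative only.

```r
library(readxl)

# Stand-in workbook bundled with readxl; the project reads its own TÜİK workbook instead.
example_path <- readxl_example("datasets.xlsx")

# List the sheet names first so that the sheet argument can be checked against them.
sheets <- excel_sheets(example_path)

# Read a single sheet by name, the same way the project reads "Bolge26" and "Goc Bilgileri".
mtcars_sheet <- read_excel(example_path, sheet = "mtcars")
```

Checking excel_sheets() before reading may help catch a misspelled sheet name early.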

In this code, we created a “region2id” column for the cities in the population data set by using the region26 data set. Our aim is to have a common column across all the data sets (population, region26 and migration) so that we can work with them comfortably. Additionally, we renamed the columns with the names() function to make them easier to work with in R, and we tidied our data sets with the pivot_longer() function so that we can work with them more flexibly and visualize them more easily. The tidy data sets are named last_migration_tidy_data, tidy_region26 and last_population_tidy_data, respectively. Apart from this, using the population data set, we added the city and regions columns to the migration data set, in both its tidy and its wide form, to allow better visualization.
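To illustrate the renaming and tidying steps on a small scale, here is a sketch with a toy table. The city names and all numbers are made up for illustration and do not come from the TÜİK data; only names() and pivot_longer() mirror what the project code does.

```r
library(tibble)
library(tidyr)

# Toy wide table; the values are invented for illustration only.
wide <- tibble(
  a = c(1, 2),
  b = c("CityA", "CityB"),
  c = c(10, 20),
  d = c(11, 19)
)

# Rename the columns in one step, as done for the real data sets.
names(wide) <- c("IDD", "city", "male", "female")

# Move the two gender columns into a name column and a value column.
long <- wide |>
  pivot_longer(c(male, female), names_to = "gender", values_to = "gender_value")
```

Each city now appears once per gender, which is the shape that plotting functions such as those in ggplot2 usually expect.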

The population data frame has 31 columns. IDD is the code that identifies the district or region; it will be used in migration analysis to track migration patterns between different regions. Regionid and regions give the region ID and name; migration data will be examined on a regional basis to understand population movements between regions. Totalpop, male and female correspond to the total, male and female population; they will be used in migration analysis to understand how populations change from one region to another.

  • Age groups (age04, age59, age1014, age1519, age2024, age2529, age3034, age3539, age4044, otherage) give the population distribution across age groups. They will be used to examine the migration trends of particular age groups from one region to another. Unknowns indicates unknown or uncertain population information. When analyzing migration, it will be important to consider missing data and how it may affect migration trends.

  • Education levels (doctorate, primaryedu, elementaryedu, highschool, literatebutnoschool, notliterate, middleschool, master, university) give the population distribution by education level. They will be used to analyze the cities receiving immigrants according to their education levels. Fertilityrate is the fertility rate; its effect, as one of the factors affecting the tendency of individuals to migrate from one region to another, will be examined.

  • Electricity is the total electricity consumption (kWh) per capita. Migration may be directed towards areas with better infrastructure or living conditions. Housingsalesnumbers is the number of housing sales; migration may be related to housing sales, so changes in the housing market and trends in housing sales will be important in migration analysis.

The migration data frame has 28 columns. Since it shares some columns with the population data frame, we explain below only the columns that differ from it.

  • Turnbackfamily provides information about families that return, such as the rate of families migrating and the return rates of families that migrated. Betterlifecond covers the search for better living conditions, which may be one of the reasons why people migrate. Others reflects other reasons for migration; this category includes a variety of reasons that are not specifically defined but may be associated with migration. Retirement refers to relocation after retirement or migration to preferred provinces for retirement.

  • Buyhome reflects home buying data. An increase in housing sales in a particular region may indicate that migration to that region has increased and that people prefer that region. Familymig describes migrations that occur for family reasons, such as family reunification or family relationships.

  • Finjob indicates migrations for job opportunities or to find a job; people may be more likely to migrate for work or career opportunities. Maritalstatuschange describes situations such as a change of marital status, marriage, and divorce. Health refers to migrations that occur for health reasons. Appointment represents a new job appointment or a change of job.

The region26 data frame has 5 columns. We explain below the columns that differ from those in the population and migration data frames.

  • Region2id is the second-level region code (for example, TR62). Workforce15plus and workforce1564 are the workforce aged 15 and over and the workforce aged 15-64, respectively. These figures describe the workforce dynamics of a region; the labor force participation or workforce structure of migrating individuals may matter for economic impacts and changes in employment. Usableincome is disposable income, which reflects the income level and economic well-being of a region. Migration into or out of a region can change its income distribution and living standards.
# importing necessary packages
library(tidyverse) 
library(readxl)
library(readr)

population <- read_excel("C:\\Users\\beyza\\Desktop\\DataVizards_FinalDataFrame.xlsx")
names(population) <- c('IDD','city','regionid','regions','totalpop','male','female','age04','age59','age1014','age1519','age2024','age2529','age3034','age3539','age4044','otherage','unknowns','doctorate','primaryedu','elementaryedu','highschool','literatebutnoschool','notliterate','middleschool','master','university','fertilityrate','electricity','numberofattempts','housingsalesnumbers')

region26 <- read_excel("C:\\Users\\beyza\\Desktop\\DataVizards_FinalDataFrame.xlsx", sheet = "Bolge26")
names(region26) <- c('region','region2id','workforce15plus','workforce1564','usableincome')

migration <- read_excel("C:\\Users\\beyza\\Desktop\\DataVizards_FinalDataFrame.xlsx", sheet = "Goc Bilgileri")
names(migration) <- c('IDD','male2','female2','turnbackfamily','unknowns2','betterlifecond','others','education','retirement','buyhome','familymig','finjob','maritalstatuschange','health','appointment','age04_2','age59_2','age1014_2','age1519_2','age2024_2','age2529_2','age3034_2','age3539_2','age4044_2','university2','highschool2','middleschool2', 'elementaryschool2')


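# tidy migration data set
# city and regions are copied by row position, which assumes that migration
# and population list the cities in the same order
migration$city <- population$city
migration$regions <- population$regions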

tidy_data_gender <- migration |> pivot_longer(c(male2,female2),names_to = "gender2",values_to = "gender_value2")

tidy_data_causes <- tidy_data_gender |> pivot_longer(c(turnbackfamily,unknowns2,betterlifecond,others,education,
                                                retirement,buyhome,familymig,finjob,maritalstatuschange,health,appointment),names_to = "migrationcauses",values_to = "migrationcauses_value")

tidy_data_age <- tidy_data_causes |> pivot_longer(c(age04_2,age59_2,age1014_2,age1519_2,
                                                    age2024_2,age2529_2,age3034_2,age3539_2,age4044_2),names_to = "agerange2",values_to = "agerange_value2")

last_migration_tidy_data <- tidy_data_age |> pivot_longer(c(university2,highschool2,middleschool2,elementaryschool2),names_to = "education2",values_to = "education_value2")
head(last_migration_tidy_data)
# A tibble: 6 × 11
    IDD city  regions         gender2 gender_value2 migrationcauses
  <dbl> <chr> <chr>           <chr>           <dbl> <chr>          
1     1 Adana Akdeniz Bölgesi male2           25200 turnbackfamily 
2     1 Adana Akdeniz Bölgesi male2           25200 turnbackfamily 
3     1 Adana Akdeniz Bölgesi male2           25200 turnbackfamily 
4     1 Adana Akdeniz Bölgesi male2           25200 turnbackfamily 
5     1 Adana Akdeniz Bölgesi male2           25200 turnbackfamily 
6     1 Adana Akdeniz Bölgesi male2           25200 turnbackfamily 
# ℹ 5 more variables: migrationcauses_value <dbl>, agerange2 <chr>,
#   agerange_value2 <dbl>, education2 <chr>, education_value2 <dbl>
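The six identical-looking rows above arise because each pivot_longer() call repeats every existing row once per pivoted column, so chaining several calls yields one row per combination of gender, cause, age range and education level. The sketch below shows this on a toy table with invented numbers (the names wide, long, naive_total and gender_total are illustrative) and one possible way to avoid double counting when summing a single variable group.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Toy wide table for one city; the values are invented for illustration only.
wide <- tibble(IDD = 1, male2 = 10, female2 = 12,
               finjob = 5, health = 3, education = 4)

# Chain two pivots, as in the migration tidying code.
long <- wide |>
  pivot_longer(c(male2, female2),
               names_to = "gender2", values_to = "gender_value2") |>
  pivot_longer(c(finjob, health, education),
               names_to = "migrationcauses", values_to = "migrationcauses_value")

# 2 gender columns x 3 cause columns give 6 rows for this single city.
n_rows <- nrow(long)

# Summing directly counts each gender value once per cause.
naive_total <- sum(long$gender_value2)

# Keeping one row per city and gender first restores the intended total.
gender_total <- long |>
  distinct(IDD, gender2, gender_value2) |>
  summarise(total = sum(gender_value2)) |>
  pull(total)
```

Under these assumptions, a summary of last_migration_tidy_data would need a similar distinct() step, or a filter on the other name columns, before values are added up.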
# tidy region26 data set
tidy_region26 <- region26 |> pivot_longer(c(workforce15plus,workforce1564),names_to = "workforce",values_to = "workforce_values")
head(tidy_region26)
# A tibble: 6 × 5
  region                     region2id usableincome workforce   workforce_values
  <chr>                      <chr>            <dbl> <chr>                  <dbl>
1 Adana, Mersin              TR62             0.382 workforce1…             1579
2 Adana, Mersin              TR62             0.382 workforce1…             1528
3 Ağrı, Kars, Iğdır, Ardahan TRA2             0.381 workforce1…              383
4 Ağrı, Kars, Iğdır, Ardahan TRA2             0.381 workforce1…              369
5 Ankara                     TR51             0.353 workforce1…             2341
6 Ankara                     TR51             0.353 workforce1…             2308
# tidy population data set
tidy_data_gender2 <- population |> pivot_longer(c(male,female),names_to = "gender",values_to = "gender_value") 

# named tidy_data_age2 so that the migration object tidy_data_age is not overwritten
tidy_data_age2 <- tidy_data_gender2 |> pivot_longer(c(age04,age59,age1014,age1519,
                                                    age2024,age2529,age3034,age3539,age4044,otherage),names_to = "agerange",values_to = "agerange_value")

tidy_literate <- tidy_data_age2 |> pivot_longer(c(unknowns,literatebutnoschool,notliterate),names_to = "literate",values_to = "literate_values")

last_tidy_data_education2 <- tidy_literate |> pivot_longer(c(university,highschool,middleschool,elementaryedu,doctorate,primaryedu,master),names_to = "education",values_to = "education_value")

last_population_tidy_data <- last_tidy_data_education2 |> pivot_longer(c(fertilityrate,electricity,numberofattempts,housingsalesnumbers),names_to = "othervariables",values_to = "othervariables_value")
head(last_population_tidy_data)
# A tibble: 6 × 15
    IDD city  regionid regions         totalpop gender gender_value agerange
  <dbl> <chr> <chr>    <chr>              <dbl> <chr>         <dbl> <chr>   
1     1 Adana A        Akdeniz Bölgesi  2263373 male        1130862 age04   
2     1 Adana A        Akdeniz Bölgesi  2263373 male        1130862 age04   
3     1 Adana A        Akdeniz Bölgesi  2263373 male        1130862 age04   
4     1 Adana A        Akdeniz Bölgesi  2263373 male        1130862 age04   
5     1 Adana A        Akdeniz Bölgesi  2263373 male        1130862 age04   
6     1 Adana A        Akdeniz Bölgesi  2263373 male        1130862 age04   
# ℹ 7 more variables: agerange_value <dbl>, literate <chr>,
#   literate_values <dbl>, education <chr>, education_value <dbl>,
#   othervariables <chr>, othervariables_value <dbl>
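# add region2id to population: find the region26 row whose name contains the city
population$region2id <- ""
for (i in seq_len(nrow(population))) {
  city <- population$city[i]

  for (j in seq_along(region26$region)) {
    # fixed = TRUE matches the city name literally instead of as a regular expression
    if (grepl(city, region26$region[j], fixed = TRUE)) {
      population$region2id[i] <- region26$region2id[j]
      break
    }
  }
}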
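The same lookup could also be written as a join instead of a nested loop. This is only a sketch of an alternative, not the approach used in the project: it assumes that each region name is a comma-separated list of cities spelled exactly as in population$city. The toy tables below reuse two region rows shown in the tidy_region26 output; region26_toy, population_toy and city_lookup are hypothetical names.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Toy stand-ins for region26 and population.
region26_toy <- tibble(
  region    = c("Adana, Mersin", "Ankara"),
  region2id = c("TR62", "TR51")
)
population_toy <- tibble(city = c("Adana", "Mersin", "Ankara"))

# Split each region name into one row per city to build a city -> region2id lookup.
city_lookup <- region26_toy |>
  separate_rows(region, sep = ",\\s*") |>
  rename(city = region)

# Attach region2id to every city with an exact match on the city name.
population_toy <- population_toy |>
  left_join(city_lookup, by = "city")
```

An exact join avoids partial matches between city names, and cities without a matching region get NA rather than an empty string, which makes missing matches easier to spot.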