Assignment 2

My second assignment has 4 parts.

Part 1

Using the filters on https://m.imdb.com/search, Turkish movies with more than 2500 reviews are saved in the URLs.

Show the code
library(tidyverse)
library(rvest)
library(knitr)
library(ggplot2)
library(stringr)
first <- "https://m.imdb.com/search/title/?title_type=feature&release_date=,2009-12-31&num_votes=2500,&country_of_origin=TR&count=250"
second <- "https://m.imdb.com/search/title/?title_type=feature&release_date=2010-01-01,2023-12-31&num_votes=2500,&country_of_origin=TR&count=250"

url <- c(first,second)

Part 2

Made web scrapping to create a Data Frame with columns: Title, Year, Duration, Rating, Votes.

Show the code
# Initialize an empty data frame
movies <- tibble(Title = character(), Year = numeric(), Duration = character(), Rating = numeric(), Votes = numeric())


# Function to convert duration from to minutes
convert_to_minutes <- function(time_str) {
  if (str_detect(time_str, "h")) {
    # "h" ve "m" arasındaki dakikaları ve "h"den önceki saatleri bul
    hours <- as.numeric(str_extract(time_str, "\\d+(?=h)"))
    minutes <- as.numeric(str_extract(time_str, "(?<=h)\\s*\\d+"))
    total_minutes <- hours * 60 + ifelse(is.na(minutes), 0, minutes)
  } else {
    # "h" bulunmuyorsa ilk iki karakteri al
    total_minutes <- as.numeric(substr(time_str, 1, str_detect(time_str, "m")*2))
  }
  return(total_minutes)
}

# Loop through each URL
for (i in url) {
  data_html <- read_html(i)
  
  # Extracting movie titles
  title_names <- data_html |> html_nodes('.ipc-title__text') |> html_text()
  title_names <- tail(head(title_names, -1), -1)
  title_names <- str_split(title_names, " ", n = 2)
  title_names <- unlist(lapply(title_names, function(x) {x[2]}))
  
  # Extracting years
  year <- data_html %>% html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata")
  year <- html_text(year)
  year <- substr(year, 1, 4)
  year <- as.numeric(year)
  
  # Extracting durations
  duration <- data_html %>% html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata")
  duration <- html_text(duration)
  duration <- substr(duration, start = 5, stop = 14)

  # Extracting ratings
  rating <- data_html %>% html_nodes(".ipc-rating-star.ipc-rating-star--base.ipc-rating-star--imdb.ratingGroup--imdb-rating")
  rating <- html_text(rating)
  rating <- substr(rating, 1, 3)
  rating <- as.numeric(rating)
  
  # Extracting votes
  vote <- data_html %>% html_nodes(".sc-53c98e73-0.kRnqtn")
  vote <- html_text(vote)
  vote <- sub("Votes", "" ,vote)
  vote <- sub(",", "", vote)
  vote <- as.numeric(vote)
  
  
  temp <- tibble(Title = title_names, Year = year, Duration = duration, Rating = rating, Votes = vote)
  
  # Append to the main data frame
  movies <- bind_rows(movies, temp)
}

#The function written above is run to organize the durations of the films
duration_final <- sapply(movies$Duration, convert_to_minutes)
movies$Duration <- duration_final

# Final results
kable(head(movies,10), caption = "Turkish Movies")
Turkish Movies
Title Year Duration Rating Votes
Nefes: Vatan Sagolsun 2009 128 8.0 35029
Babam ve Oglum 2005 108 8.2 91051
Masumiyet 1997 110 8.1 19315
Kader 2006 103 7.8 16282
Uzak 2002 110 7.5 22381
Eskiya 1996 128 8.1 71705
A.R.O.G 2008 127 7.3 44642
Sevmek Zamani 1965 86 8.0 7138
Hababam Sinifi 1975 87 9.2 42518
Üç Maymun 2008 109 7.3 22670

Part 3

Let’s conduct an EDA on the data set!!!

a. Arranged the data frame in descending order by Rating. Presented the top 5 and bottom 5 movies based on user ratings.

Show the code
top_5_movies <- movies |> arrange(desc(Rating))
kable(head(top_5_movies, 5), caption = "Top 5 Movies")
Top 5 Movies
Title Year Duration Rating Votes
Hababam Sinifi 1975 87 9.2 42518
CM101MMXI Fundamentals 2013 139 9.1 46999
Tosun Pasa 1976 90 8.9 24330
Hababam Sinifi Sinifta Kaldi 1975 95 8.9 24370
Süt Kardesler 1976 80 8.8 20890

As a member of Generation Z, I can’t understand why Yeşilçam films are still so loved. I think their ongoing popularity is not because they are genuinely great films, but rather due to a nostalgia for the old Turkey.

Show the code
bottom_5_movies <- movies |> arrange(Rating)
kable(head(bottom_5_movies, 5), caption = "Bottom 5 Movies")
Bottom 5 Movies
Title Year Duration Rating Votes
Cumali Ceber: Allah Seni Alsin 2017 100 1.0 39269
Reis 2017 108 1.0 73974
Cumali Ceber 2 2018 100 1.2 10230
Müjde 2022 48 1.2 9920
15/07 Safak Vakti 2021 95 1.2 20608

In my opinion, take these films and throw them in the trash.

b. Let’s check my favorite movies :D

Show the code
my_favs = c("Ölümlü Dünya", "Ölümlü Dünya 2", "G.O.R.A.", "Eyyvah Eyvah", "Babam ve Oglum")
kable(head(movies |> arrange(desc(Rating)) |> mutate(Rank = row_number()) |> filter(Title %in% my_favs), 5), caption = "My Favorites")
My Favorites
Title Year Duration Rating Votes Rank
Babam ve Oglum 2005 108 8.2 91051 23
G.O.R.A. 2004 127 8.0 66041 39
Ölümlü Dünya 2018 107 7.6 30293 87
Ölümlü Dünya 2 2023 117 7.4 3558 122
Eyyvah Eyvah 2010 104 7.0 21445 211

c. Visualization of the dataset

Show the code
movies |> group_by(Year) |> summarise(rating_ave = 
  mean(Rating)) |> ggplot(aes(x = Year, 
  y = rating_ave)) + geom_smooth(method = "lm", se = FALSE, color = "blue", formula = y ~ x) + ylab("Average Ratings of Turkish Movies") + geom_point()

That’s sad…

Show the code
movies %>% group_by(Year) |> summarise(movies_num = n())   |> ungroup() |> ggplot(aes(x = Year, y = movies_num)) + ylab("Number of Turkish Movies") + geom_point()

Show the code
movies |> ggplot(aes(x = Year, y = Rating, group = Year, fill = Year)) + geom_boxplot()

d. Correlation between the number of votes a movie received and its rating

Show the code
ggplot(movies, aes(x = log(Votes), y = Rating)) + geom_point()

Show the code
corr1 = cor(movies$Rating, movies$Votes)
print(paste("Correlation is:", corr1))

[1] “Correlation is: 0.131089313642832”

Correlation value is small. By looking at the graph, we can’t say there is a strong relationship between votes and ratings.

e. Correlation between a movie’s duration and its rating

Show the code
ggplot(movies, aes(x = Rating, y = Duration)) + geom_point()

Show the code
corr2 = cor(movies$Rating, movies$Duration)
print(paste("Correlation is:", corr2))

[1] “Correlation is: 0.0357059353259163”

Correlation value is too small. There is almost no linear relationship between duration and rating.

Part 4

Turkish movies that are in the top 1000 movies on IMDb:

Show the code
url <- "https://m.imdb.com/search/title/?title_type=feature&groups=top_1000&country_of_origin=TR"

new_html <- read_html(url)

title_names2 <- new_html |> html_nodes('.ipc-title__text') |> html_text()
title_names2 <- tail(head(title_names2, -1), -1)
title_names2 <- str_split(title_names2, " ", n = 2)
title_names2 <- unlist(lapply(title_names2, function(x) {x[2]}))
  
year2 <- new_html %>% html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata")
year2 <- html_text(year2)
year2 <- substr(year2, 1, 4)
year2 <- as.numeric(year2)

top_1000 <- tibble(Title = title_names2, Year = year2)
kable(top_1000, caption = "Turkish Movies in IMDb Top 1000")
Turkish Movies in IMDb Top 1000
Title Year
Yedinci Kogustaki Mucize 2019
Kis Uykusu 2014
Nefes: Vatan Sagolsun 2009
Ayla: The Daughter of War 2017
Babam ve Oglum 2005
Ahlat Agaci 2018
Bir Zamanlar Anadolu’da 2011
Eskiya 1996
G.O.R.A. 2004
Vizontele 2001
Her Sey Çok Güzel Olacak 1998

Let’s find the duration, rating and votes of these movies by using left join on Turkish Movies Data Frame, then order by their ratings.

Show the code
top_1000_joined <- top_1000 |> left_join(movies, by = "Title", suffix = c(" ", " "))
kable((top_1000_joined |> arrange(desc(Rating))), caption = "Turkish Movies in IMDb Top1000 Ordered by Ratings")
Turkish Movies in IMDb Top1000 Ordered by Ratings
Title Year Duration Rating Votes
Ayla: The Daughter of War 2017 125 8.3 43001
Yedinci Kogustaki Mucize 2019 132 8.2 54193
Babam ve Oglum 2005 108 8.2 91051
Eskiya 1996 128 8.1 71705
Her Sey Çok Güzel Olacak 1998 107 8.1 27128
Kis Uykusu 2014 196 8.0 54660
Nefes: Vatan Sagolsun 2009 128 8.0 35029
Ahlat Agaci 2018 188 8.0 27029
G.O.R.A. 2004 127 8.0 66041
Vizontele 2001 110 8.0 38409
Bir Zamanlar Anadolu’da 2011 157 7.8 49379

Are these the same first high-rated 11 movies in our initial data frame? Let’s see!

Show the code
kable(head((movies |> arrange(desc(Rating))),11), caption = "Turkish Movies Top 11")
Turkish Movies Top 11
Title Year Duration Rating Votes
Hababam Sinifi 1975 87 9.2 42518
CM101MMXI Fundamentals 2013 139 9.1 46999
Tosun Pasa 1976 90 8.9 24330
Hababam Sinifi Sinifta Kaldi 1975 95 8.9 24370
Süt Kardesler 1976 80 8.8 20890
Saban Oglu Saban 1977 90 8.7 18536
Zügürt Aga 1985 101 8.7 16140
Neseli Günler 1978 95 8.7 11810
Kibar Feyzo 1978 83 8.7 17126
Hababam Sinifi Uyaniyor 1976 94 8.7 20641
Canim Kardesim 1973 85 8.6 10099

It can be clearly seen from here that IMDb takes care to include more recent movies when creating the Top 1000 list. Because while the list of the top 11 Turkish movies consists almost entirely of films produced before 1990, the Turkish movies in the top 1000 list are productions from after 1990.

Back to top