Assignment 2

Question 1

Using the filters on https://m.imdb.com/search, list all Turkish movies with more than 2500 reviews, and save the URLs.

library(tidyverse)
library(stringr)
library(rvest)
library(ggplot2)
library(knitr)
library(reshape2)

URL_1 <- "https://www.imdb.com/search/title/?title_type=feature&release_date=2010-01-01,2023-12-31&num_votes=2500,&country_of_origin=TR&count=250"
URL_2 <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2009-12-31&num_votes=2500,&country_of_origin=TR&count=250"
url_vector <- c(URL_1,URL_2)
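The two URLs above differ only in their release-date window. As a minimal sketch, they could be built from parameters instead of being written out by hand. `build_search_url` and its argument names are hypothetical and introduced only for illustration; the query-string layout is copied from the URLs above, not taken from any documented IMDb API.

```r
# Hypothetical helper: assemble a search URL with the same query layout
# as URL_1 and URL_2. An empty date leaves that side of the window open.
build_search_url <- function(release_from = "", release_to = "",
                             min_votes = 2500, country = "TR", count = 250) {
  paste0(
    "https://www.imdb.com/search/title/?title_type=feature",
    "&release_date=", release_from, ",", release_to,
    "&num_votes=", min_votes, ",",
    "&country_of_origin=", country,
    "&count=", count
  )
}

# Rebuild the two URLs used in this assignment.
sketch_urls <- c(
  build_search_url("2010-01-01", "2023-12-31"),
  build_search_url(release_to = "2009-12-31")
)
sketch_urls
```

Keeping the filters as arguments makes it easier to change the vote threshold or the date split without editing two long strings.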

Question 2

Start web scraping to create a data frame with the columns Title, Year, Duration, Rating, and Votes.

movie_titles <- c()
movie_years <- c()
movie_durations <- c()
movie_ratings <- c()
movie_votes <- c()

for(url in url_vector){
  HTML = read_html(url)
  
  title_names <- HTML %>% html_nodes('.ipc-title__text')
  title_names <- html_text(title_names)
  title_names <- tail(head(title_names,-1),-1)
  title_names <- str_split(title_names, " ", n=2)
  title_names <- unlist(lapply(title_names, function(x) {x[2]}))
  
  year <- HTML %>% html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata")
  year <- html_text(year)
  year <- substr(year, 1, 4)
  year <- as.numeric(year)
  
  duration_trash <- HTML %>% html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata")
  duration_trash <- html_text(duration_trash)
  duration <- c()
  
  # Each metadata string starts with the 4-digit year, so the runtime
  # (presumably of the form "2h 12m", "2h" or "48m") starts at character 5.
  for (string in duration_trash) {
    start_index <- 5

    if (grepl("m", string, fixed = TRUE)) {
      # The runtime has a minutes part: keep everything up to the first "m".
      end_index <- regexpr("m", string)
    } else {
      # Hours only: keep everything up to the first "h".
      end_index <- regexpr("h", string)
    }
    duration <- append(duration, substr(string, start_index, end_index))
  }
    
  
  hour_duration <- str_split(duration, " ")
  hour_duration <- sapply(hour_duration, function(x) ifelse(grepl("h", x[1], fixed = TRUE), x[1], 0))
  hour_duration <- sub("h", "", hour_duration)
  hour_duration <- as.numeric(hour_duration)
  hour_duration <- hour_duration * 60
  
  minute_duration <- str_split(duration, " ")
  minute_duration <- sapply(minute_duration, function(x) ifelse(length(x) >= 2, x[2], ifelse(grepl("m", x[1], fixed = TRUE), x[1], 0)))
  minute_duration <- sub("m", "", minute_duration)
  minute_duration <- as.numeric(minute_duration)
  
  rating <- HTML %>% html_nodes(".ipc-rating-star.ipc-rating-star--base.ipc-rating-star--imdb.ratingGroup--imdb-rating")
  rating <- html_text(rating)
  rating <- substr(rating, 1, 3)
  rating <- as.numeric(rating)
  
  vote <- HTML %>% html_nodes(".sc-53c98e73-0.kRnqtn")
  vote <- html_text(vote)
  vote <- sub("Votes", "" ,vote)
  vote <- gsub(",", "", vote, fixed = TRUE)  # remove every thousands separator, not just the first
  vote <- as.numeric(vote)
  
  movie_titles <- append(movie_titles,title_names)
  movie_years <- append(movie_years, year)
  movie_durations <- append(movie_durations, hour_duration + minute_duration)
  movie_ratings <- append(movie_ratings, rating)
  movie_votes <- append(movie_votes, vote)
  
}

movies_df <- data.frame(movie_titles, movie_years, movie_durations, movie_ratings, movie_votes)
kable(head(movies_df,10), caption = "Movies Dataframe")

Movies Dataframe
movie_titles	movie_years	movie_durations	movie_ratings	movie_votes
Kuru Otlar Üstüne	2023	197	8.1	5044
Istanbul Için Son Çagri	2023	91	5.3	7310
Yedinci Kogustaki Mucize	2019	132	8.2	54142
Ölümlü Dünya 2	2023	117	7.5	3411
Bihter	2023	113	3.6	3337
Ölümlü Dünya	2018	107	7.6	30246
Kis Uykusu	2014	196	8.0	54621
Dag II	2016	135	8.2	109860
Do Not Disturb	2023	114	6.3	8762
Ayla: The Daughter of War	2017	125	8.3	42986
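The duration step can also be sketched on its own, without any scraping. The example below assumes the year has already been stripped and that runtimes look like "3h 17m", "2h" or "48m"; `extract_unit` and `parse_duration` are hypothetical helpers written for this illustration, not part of the code above.

```r
# Hypothetical helper: pull the number in front of a unit letter ("h" or "m").
# Returns 0 where the unit is absent. Assumes x has no missing values.
extract_unit <- function(x, unit) {
  match_pos <- regexpr(paste0("[0-9]+(?=", unit, ")"), x, perl = TRUE)
  found <- as.integer(match_pos) > 0
  out <- numeric(length(x))
  # regmatches() returns only the matched substrings, in order.
  out[found] <- as.numeric(regmatches(x, match_pos))
  out
}

# Hypothetical helper: convert runtime strings to total minutes.
parse_duration <- function(x) {
  60 * extract_unit(x, "h") + extract_unit(x, "m")
}

runtimes <- c("3h 17m", "1h 31m", "2h", "48m")
parse_duration(runtimes)
# 197  91 120  48
```

Working with a lookahead pattern per unit avoids splitting on spaces, so hour-only and minute-only runtimes go through the same path.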

Question 3

Arrange your data frame in descending order by Rating. Present the top 5 and bottom 5 movies based on user ratings. Have you watched any of these movies? Do you agree or disagree with their current IMDb Ratings?

movies_df <- movies_df[order(movies_df$movie_ratings, decreasing = TRUE),]

Top 5 movies based on user ratings.

kable(head(movies_df, 5), caption = "Top 5 Movies Based On User Ratings.")

Top 5 Movies Based On User Ratings.
	movie_titles	movie_years	movie_durations	movie_ratings	movie_votes
257	Hababam Sinifi	1975	87	9.2	42509
39	CM101MMXI Fundamentals	2013	139	9.1	46994
273	Tosun Pasa	1976	90	8.9	24325
337	Hababam Sinifi Sinifta Kaldi	1975	95	8.9	24367
321	Süt Kardesler	1976	80	8.8	20883

I disagree with the top of this list, which is based on the scores given by users. In my opinion, films are made by bringing commentary on particular events, problems, or situations to the screen. I therefore think that “Yeşil Çam” films are overrated, and that the emotions these films try to convey are unnecessary. There are much better directors working today and much better films being made, yet they cannot reach such high scores.

Bottom 5 movies based on user ratings.

kable(tail(movies_df, 5), caption = "Bottom 5 Movies Based On User Ratings.")

Bottom 5 Movies Based On User Ratings.
	movie_titles	movie_years	movie_durations	movie_ratings	movie_votes
189	Cumali Ceber 2	2018	100	1.2	10227
199	Müjde	2022	48	1.2	9920
245	15/07 Safak Vakti	2021	95	1.2	20606
101	Cumali Ceber: Allah Seni Alsin	2017	100	1.0	39264
150	Reis	2017	108	1.0	73972

I definitely agree with the bottom part of this list, but I can’t explain why. :) :D

Check the ratings of 2-3 of your favorite movies. What are their standings?

My top 10 list is below:

Note: This list is not ordered. Please don’t judge me based on this order.
Note 2: Yes! Recep İvedik 2 is still one of the funniest movies for me. You can also find Onur Ünlü’s comments about the Recep İvedik movies. Interview here.

Let’s check the ratings of “Babam ve Oğlum”, “Sen Aydınlatırsın Geceyi” and “İşe Yarar Bir Şey”.

Babam ve Oğlum

kable(movies_df[movies_df$movie_titles == "Babam ve Oglum",], caption = "Babam ve Oğlum")

Babam ve Oğlum
	movie_titles	movie_years	movie_durations	movie_ratings	movie_votes
250	Babam ve Oglum	2005	108	8.2	91016

sprintf("Rank of the *Babam ve Oğlum* is %d", which(movies_df$movie_titles=="Babam ve Oglum"))

[1] “Rank of the Babam ve Oğlum is 27”

İşe Yarar Bir Şey

kable(movies_df[movies_df$movie_titles == "Ise Yarar Bir Sey",], caption = "İşe Yarar Bir Şey")

İşe Yarar Bir Şey
	movie_titles	movie_years	movie_durations	movie_ratings	movie_votes
94	Ise Yarar Bir Sey	2017	104	7.6	5507

sprintf("Rank of the *İşe Yarar Bir Şey* is %d", which(movies_df$movie_titles=="Ise Yarar Bir Sey"))

[1] “Rank of the İşe Yarar Bir Şey is 85”

Sen Aydınlatırsın Geceyi

kable(movies_df[movies_df$movie_titles == "Sen Aydinlatirsin Geceyi",], caption = "Sen Aydınlatırsın Geceyi")

Sen Aydınlatırsın Geceyi
	movie_titles	movie_years	movie_durations	movie_ratings	movie_votes
60	Sen Aydinlatirsin Geceyi	2013	107	7.7	10483

sprintf("Rank of the *Sen Aydınlatırsın Geceyi* is %d", which(movies_df$movie_titles=="Sen Aydinlatirsin Geceyi"))

[1] “Rank of the Sen Aydınlatırsın Geceyi is 68”
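The ranks above are row positions in the sorted data frame, so movies with the same rating get different ranks depending on how `order()` happens to arrange ties. A small sketch of an alternative, assuming that tied movies should share a rank, uses base R's `rank()` with `ties.method = "min"`. The toy data frame holds the top five rows from the table earlier in this question; the column names are chosen for this example only.

```r
# Top five rows from the sorted data frame (titles and ratings as shown above).
toy <- data.frame(
  title  = c("Hababam Sinifi", "CM101MMXI Fundamentals", "Tosun Pasa",
             "Hababam Sinifi Sinifta Kaldi", "Süt Kardesler"),
  rating = c(9.2, 9.1, 8.9, 8.9, 8.8)
)

# Row position: what which() returns on the sorted data frame.
toy$position <- seq_len(nrow(toy))

# Competition ranking: tied ratings share the lowest rank of the group.
toy$rank_min <- rank(-toy$rating, ties.method = "min")

toy
```

Here the two movies rated 8.9 both get rank 3 and the next movie gets rank 5, whereas their row positions are 3 and 4.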

Considering that audience rating is a crucial indicator of movie quality, what can you infer about the average ratings of Turkish movies over the years? Calculate yearly rating averages and plot them as a scatter plot. Similarly, plot the number of movies over the years. You might observe that using yearly averages could be misleading due to the increasing number of movies each year. As an alternative solution, plot box plots of ratings over the years (each year having a box plot showing statistics about the ratings of movies in that year). What insights do you gather from the box plot?

Average Ratings vs Year

yearly_rating <- movies_df %>% group_by(movie_years) %>%
  summarise(
    average_rating = mean(movie_ratings),
    .groups = "drop"
  )

yearly_rating_scatter_plot <- ggplot(yearly_rating, aes(x=movie_years, y=average_rating)) + geom_point()
yearly_rating_scatter_plot

Year vs Rating boxplot.

yearly_rating_box_plot <- ggplot(movies_df, aes(x=movie_years, y=movie_ratings, group=movie_years)) + geom_boxplot()
yearly_rating_box_plot

Number of Movies vs Year

yearly_movie_count <- movies_df %>% group_by(movie_years) %>%
  summarise(
    number_of_movies = n(),
    .groups = "drop"
  )

yearly_count_plot <- ggplot(yearly_movie_count, aes(x=movie_years, y=number_of_movies)) + geom_point()
yearly_count_plot
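To see why a yearly average can mislead when the number of movies per year varies, the sketch below puts the count, mean, and median side by side for a few rows taken from the tables in this report. It uses base R only; the data frame and column names are assumptions made for this example.

```r
# Seven movies from the tables above: two from 1975, one from 1976, four from 2017.
toy <- data.frame(
  year   = c(1975, 1975, 1976, 2017, 2017, 2017, 2017),
  rating = c(9.2, 8.9, 8.9, 1.0, 1.0, 7.6, 8.3)
)

# tapply() returns one value per year, in increasing year order.
yearly <- data.frame(
  year          = sort(unique(toy$year)),
  n             = as.vector(tapply(toy$rating, toy$year, length)),
  mean_rating   = as.vector(tapply(toy$rating, toy$year, mean)),
  median_rating = as.vector(tapply(toy$rating, toy$year, median))
)

yearly
```

A year with a single movie has a mean equal to that movie's rating, while a year with several movies can hide a wide spread behind one number, which is what the box plot makes visible.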

Do you believe there is a relationship between the number of votes a movie received and its rating? Investigate the correlation between Votes and Ratings.

corr_rating_vote = cor(movies_df$movie_ratings, movies_df$movie_votes)
corr_rating_vote

[1] 0.1307194

rating_vs_votes <- ggplot(movies_df, aes(x=movie_ratings, y=log(movie_votes))) + geom_point()
rating_vs_votes
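The correlation above is Pearson's coefficient on raw vote counts, while the plot uses `log(movie_votes)`. Because vote counts are heavily skewed, it may be worth comparing a few variants. The sketch below does this on the first ten rows of the data frame shown in Question 2; the choice of Spearman's rank correlation as an alternative is a suggestion, not something the assignment asks for.

```r
# First ten rows of the movies data frame (ratings and votes as shown earlier).
ratings <- c(8.1, 5.3, 8.2, 7.5, 3.6, 7.6, 8.0, 8.2, 6.3, 8.3)
votes   <- c(5044, 7310, 54142, 3411, 3337, 30246, 54621, 109860, 8762, 42986)

# Pearson on raw votes: sensitive to a few movies with very many votes.
pearson_raw <- cor(ratings, votes)

# Pearson on log votes: matches the scale used in the scatter plot.
pearson_log <- cor(ratings, log(votes))

# Spearman uses ranks only, so the log transform does not change it.
spearman <- cor(ratings, votes, method = "spearman")

c(pearson_raw = pearson_raw, pearson_log = pearson_log, spearman = spearman)
```

Ten rows are far too few to draw conclusions; the same three calls can be run on the full `movies_df` columns.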

Do you believe there is a relationship between a movie’s duration and its rating? Investigate the correlation between Duration and Ratings.

corr_duration_rating = cor(movies_df$movie_durations, movies_df$movie_ratings)
corr_duration_rating

[1] 0.03343216

duration_vs_rating <- ggplot(movies_df, aes(x=movie_durations, y=movie_ratings)) + geom_point()
duration_vs_rating

Let’s look at a correlation heatmap.

correlation_df <- movies_df[, c(3,4,5)]
correlation_df <- round(cor(correlation_df), 5)

correlation_df_melted <- melt(correlation_df)
correlation_plot <- ggplot(correlation_df_melted, aes(x=Var1, y=Var2, fill=value)) + geom_tile() +
  geom_text(aes(label = value),
            color = "white", size = 4)

correlation_plot
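The heatmap needs the correlation matrix in long form, with one row per pair of variables. As a sketch of what `melt()` produces here, the same shape can be obtained in base R with `as.table()` and `as.data.frame()`. The toy data are the first five rows of the data frame from Question 2, and the short column names are assumptions for this example.

```r
# First five rows of the movies data frame (duration, rating, votes).
toy <- data.frame(
  duration = c(197, 91, 132, 117, 113),
  rating   = c(8.1, 5.3, 8.2, 7.5, 3.6),
  votes    = c(5044, 7310, 54142, 3411, 3337)
)

# 3 x 3 correlation matrix, rounded as in the code above.
cor_mat <- round(cor(toy), 5)

# Long form: columns Var1, Var2 and value, one row per matrix cell.
cor_long <- as.data.frame(as.table(cor_mat), responseName = "value")

cor_long
```

The resulting data frame has the `Var1`, `Var2`, and `value` columns that the `ggplot()` call maps to x, y, and fill.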

Question 4

Use IMDb’s Advanced Title Search interface with the Title Type set to “Movie” only, the Country set to “Turkey” with the option “Search country of origin only” active, and the Awards & Recognition set to “IMDB Top 1000”. You should find a total of 11 movies.

URL_3 = "https://www.imdb.com/search/title/?title_type=feature&groups=top_1000&country_of_origin=TR&count=250"
movie_name <- c()
movie_year <- c()

HTML = read_html(URL_3)

title_names <- HTML %>% html_nodes('.ipc-title__text')
title_names <- html_text(title_names)
title_names <- tail(head(title_names,-1),-1)
title_names <- str_split(title_names, " ", n=2)
title_names <- unlist(lapply(title_names, function(x) {x[2]}))

year <- HTML %>% html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata")
year <- html_text(year)
year <- substr(year, 1, 4)
year <- as.numeric(year)

movie_name <- append(movie_name, title_names)
movie_year <- append(movie_year, year)
top1000_df <- data.frame(movie_name, movie_year)
kable(top1000_df, caption = "Turkish movies in IMDB Top1000 without rating, duration and votes")

Turkish movies in IMDB Top1000 without rating, duration and votes
movie_name	movie_year
Yedinci Kogustaki Mucize	2019
Kis Uykusu	2014
Nefes: Vatan Sagolsun	2009
Ayla: The Daughter of War	2017
Babam ve Oglum	2005
Ahlat Agaci	2018
Bir Zamanlar Anadolu’da	2011
Eskiya	1996
G.O.R.A.	2004
Vizontele	2001
Her Sey Çok Güzel Olacak	1998

Note that you now have a new data frame with Turkish movies in the top 1000, containing only the title and year. Use your initial data frame and an appropriate join operation to fill in the duration, rating, and votes attributes of the new data frame.

Top 1000 merged dataframe

top1000_df_merged <- merge(x=top1000_df, y=movies_df,
                           by.x=c("movie_name", "movie_year"),
                           by.y=c("movie_titles", "movie_years"), all.x=TRUE)
kable(top1000_df_merged, caption = "Turkish movies in IMDb Top1000 with rating, duration and votes")

Turkish movies in IMDb Top1000 with rating, duration and votes
movie_name	movie_year	movie_durations	movie_ratings	movie_votes
Ahlat Agaci	2018	188	8.0	26986
Ayla: The Daughter of War	2017	125	8.3	42986
Babam ve Oglum	2005	108	8.2	91016
Bir Zamanlar Anadolu’da	2011	157	7.8	49344
Eskiya	1996	128	8.1	71695
G.O.R.A.	2004	127	8.0	66020
Her Sey Çok Güzel Olacak	1998	107	8.1	27113
Kis Uykusu	2014	196	8.0	54621
Nefes: Vatan Sagolsun	2009	128	8.0	35007
Vizontele	2001	110	8.0	38396
Yedinci Kogustaki Mucize	2019	132	8.2	54142
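With `all.x=TRUE` the merge is a left join: every row of the Top 1000 data frame is kept, and a movie with no match in the main data frame gets `NA` in the joined columns. The sketch below shows how such rows could be detected. The third title is a made-up placeholder that deliberately has no match; the other rows come from the tables above.

```r
# Left table: two real Top 1000 rows plus one placeholder with no match.
top <- data.frame(
  movie_name = c("Eskiya", "Vizontele", "Not In Main Table"),
  movie_year = c(1996, 2001, 2000)
)

# Right table: a few rows in the shape of movies_df.
main <- data.frame(
  movie_titles  = c("Eskiya", "Vizontele", "Reis"),
  movie_years   = c(1996, 2001, 2017),
  movie_ratings = c(8.1, 8.0, 1.0)
)

# Left join on title and year, as in the merge above.
merged <- merge(top, main,
                by.x = c("movie_name", "movie_year"),
                by.y = c("movie_titles", "movie_years"),
                all.x = TRUE)

# Rows that found no partner in the main table.
unmatched <- merged[is.na(merged$movie_ratings), ]
unmatched
```

In the merged table above all 11 movies have ratings, so no row was left unmatched; a check like this is mainly useful when titles or years are spelled differently in the two sources.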

Order the 11 movies based on their Rank. Are these the same first high-rated 11 movies in your initial data frame? If yes, does this imply that IMDb uses rankings alone to determine their top 1000 movie list? If not, what does this imply?

top1000_df_merged <- top1000_df_merged[order(top1000_df_merged$movie_ratings, decreasing = TRUE),]
kable(top1000_df_merged, caption = "Turkish movies in IMDb Top1000, ordered by rating.")

Turkish movies in IMDb Top1000, ordered by rating.
	movie_name	movie_year	movie_durations	movie_ratings	movie_votes
2	Ayla: The Daughter of War	2017	125	8.3	42986
3	Babam ve Oglum	2005	108	8.2	91016
11	Yedinci Kogustaki Mucize	2019	132	8.2	54142
5	Eskiya	1996	128	8.1	71695
7	Her Sey Çok Güzel Olacak	1998	107	8.1	27113
1	Ahlat Agaci	2018	188	8.0	26986
6	G.O.R.A.	2004	127	8.0	66020
8	Kis Uykusu	2014	196	8.0	54621
9	Nefes: Vatan Sagolsun	2009	128	8.0	35007
10	Vizontele	2001	110	8.0	38396
4	Bir Zamanlar Anadolu’da	2011	157	7.8	49344

Let’s take a look at the movies data frame, ordered by rating.

kable(head(movies_df,11), caption = "Movies Dataframe")

Movies Dataframe
	movie_titles	movie_years	movie_durations	movie_ratings	movie_votes
257	Hababam Sinifi	1975	87	9.2	42509
39	CM101MMXI Fundamentals	2013	139	9.1	46994
273	Tosun Pasa	1976	90	8.9	24325
337	Hababam Sinifi Sinifta Kaldi	1975	95	8.9	24367
321	Süt Kardesler	1976	80	8.8	20883
284	Saban Oglu Saban	1977	90	8.7	18533
307	Zügürt Aga	1985	101	8.7	16133
317	Neseli Günler	1978	95	8.7	11804
323	Kibar Feyzo	1978	83	8.7	17124
380	Hababam Sinifi Uyaniyor	1976	94	8.7	20638
343	Canim Kardesim	1973	85	8.6	10093

We can clearly see that the two data frames above are not the same, so IMDb does not rely on ratings alone. The first thing I noticed is that the Top 1000 list contains no Turkish movie released before 1996. This suggests that IMDb takes the release date into account and that older movies are at a disadvantage in this ranking calculation.
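The comparison can also be checked in code rather than by eye. The sketch below copies the titles and years from the two tables above and looks for movies that appear in both; the vector and data frame names are chosen for this example only.

```r
# Eleven highest-rated movies in the full data frame (from the table above).
top_rated <- data.frame(
  title = c("Hababam Sinifi", "CM101MMXI Fundamentals", "Tosun Pasa",
            "Hababam Sinifi Sinifta Kaldi", "Süt Kardesler", "Saban Oglu Saban",
            "Zügürt Aga", "Neseli Günler", "Kibar Feyzo",
            "Hababam Sinifi Uyaniyor", "Canim Kardesim"),
  year  = c(1975, 2013, 1976, 1975, 1976, 1977, 1985, 1978, 1978, 1976, 1973)
)

# Turkish movies in the Top 1000 list (from the merged table above).
top1000_titles <- c("Ayla: The Daughter of War", "Babam ve Oglum",
                    "Yedinci Kogustaki Mucize", "Eskiya",
                    "Her Sey Çok Güzel Olacak", "Ahlat Agaci", "G.O.R.A.",
                    "Kis Uykusu", "Nefes: Vatan Sagolsun", "Vizontele",
                    "Bir Zamanlar Anadolu’da")

# Movies present in both lists.
overlap <- intersect(top_rated$title, top1000_titles)
overlap

# Highly rated movies from 1996 or later that are still missing from the Top 1000.
recent_missing <- top_rated[top_rated$year >= 1996 &
                              !(top_rated$title %in% top1000_titles), ]
recent_missing
```

The overlap is empty, which supports the conclusion that rating alone does not decide the list. The second check returns one 2013 title, so the release year may not be the only factor either.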

for fun :D

demirkubuz vs nbc