Assignment 2

Question 1

We have two search links, one for movies released before 2010 and one for movies released from 2010 onwards, and we combine them in a single vector.

Code
library(tidyverse)
library(stringr)
library(rvest)
library(ggplot2)
library(knitr)
library(reshape2)
IMDB_1 <- "https://www.imdb.com/search/title/?title_type=feature&release_date=2010-01-01,2023-12-31&num_votes=2500,&country_of_origin=TR&count=250"
IMDB_2 <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2009-12-31&num_votes=2500,&country_of_origin=TR&count=250"
IMDB_vector <- c(IMDB_1, IMDB_2)

Question 2

Scraping the title, year, duration, rating, and number of votes for each movie.

Code
# Vectors that accumulate the results from every URL
table_titles <- c()
table_years <- c()
table_durations <- c()
table_ratings <- c()
table_votes <- c()

for (url in IMDB_vector) {
  page <- read_html(url)

  # Titles: drop the first and last matched elements, then keep everything
  # after the first space (this removes the list-number prefix)
  title_names <- page |> html_nodes('.ipc-title__text') |> html_text()
  title_names <- tail(head(title_names, -1), -1)
  title_names <- str_split(title_names, " ", n = 2)
  title_names <- unlist(lapply(title_names, function(x) {x[2]}))

  # Year: the first four characters of the metadata text
  year <- page |> html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata") |> html_text()
  year <- as.numeric(substr(year, 1, 4))

  # Duration: every number found in the watch-bar text is summed into a single
  # page-level total, so this value is not a per-movie run time
  duration <- page |> html_nodes(".sc-1b7x5y9-6.gwfzJB.dli-watch-bar") |>
    html_text() |>
    str_extract_all("\\d+") |>
    unlist() |>
    as.numeric() |>
    sum()

  # Rating: the first three characters of the rating text
  rating <- page |> html_nodes(".ipc-rating-star.ipc-rating-star--base.ipc-rating-star--imdb.ratingGroup--imdb-rating") |> html_text()
  rating <- as.numeric(substr(rating, 1, 3))

  # Votes: remove the "Votes" label and every thousands separator
  vote <- page |> html_nodes(".sc-53c98e73-0.kRnqtn") |> html_text()
  vote <- sub("Votes", "", vote)
  vote <- gsub(",", "", vote)
  vote <- as.numeric(vote)

  # Append to the running vectors so that results from earlier URLs are kept
  table_titles <- append(table_titles, title_names)
  table_years <- append(table_years, year)
  table_durations <- append(table_durations, rep(duration, length(title_names)))
  table_ratings <- append(table_ratings, rating)
  table_votes <- append(table_votes, vote)
}

movies_df <- data.frame(Titles = table_titles,
                        Years = table_years,
                        Durations = table_durations,
                        Ratings = table_ratings,
                        Votes = table_votes)
kable(head(movies_df,10), caption = "Movies Dataframe")
Movies Dataframe
Titles                  Years  Durations  Ratings  Votes
Nefes: Vatan Sagolsun   2009   0          8.0      35017
Babam ve Oglum          2005   0          8.2      91022
Masumiyet               1997   0          8.1      19282
Kader                   2006   0          7.8      16249
Uzak                    2002   0          7.5      22362
Eskiya                  1996   0          8.1      71698
A.R.O.G                 2008   0          7.3      44631
Sevmek Zamani           1965   0          8.0      7125
Hababam Sinifi          1975   0          9.2      42512
Üç Maymun               2008   0          7.3      22652
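
Every Durations value in the table above is 0, which suggests the run times were not captured. As a minimal sketch, assuming the run time is available as text such as "1h 53m" (the sample strings and the helper name parse_runtime are hypothetical and are not taken from the IMDb page), such text could be converted to minutes per movie like this:

```r
# Hypothetical helper: convert run-time text such as "1h 53m" to minutes.
# The input format is an assumption; adjust the patterns to the scraped text.
parse_runtime <- function(x) {
  has_h <- grepl("\\d+\\s*h", x, perl = TRUE)
  has_m <- grepl("\\d+\\s*m", x, perl = TRUE)

  # Number in front of "h" and number in front of "m"; 0 when that part is absent
  hours   <- ifelse(has_h, sub(".*?(\\d+)\\s*h.*", "\\1", x, perl = TRUE), "0")
  minutes <- ifelse(has_m, sub(".*?(\\d+)\\s*m.*", "\\1", x, perl = TRUE), "0")

  total <- as.numeric(hours) * 60 + as.numeric(minutes)
  total[!(has_h | has_m)] <- NA  # no run time found in the text
  total
}

# Made-up example strings
runtimes <- c("1h 53m", "2h", "48m", "")
parse_runtime(runtimes)
```

Returning one value per input string keeps the result the same length as the title vector, which is what data.frame() needs.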

Question 3

  1. Arrange your data frame in descending order by Rating. Present the top 5 and bottom 5 movies based on user ratings. Have you watched any of these movies? Do you agree or disagree with their current IMDb Ratings?

Code
movies_df <- movies_df[order(movies_df$Ratings, decreasing = TRUE),]
Top 5 movies based on user ratings.
Code
kable(head(movies_df, 5), caption = "Top 5 Movies Based On User Ratings.")
Top 5 Movies Based On User Ratings.
Row  Titles                        Years  Durations  Ratings  Votes
9    Hababam Sinifi                1975   0          9.2      42512
25   Tosun Pasa                    1976   0          8.9      24327
89   Hababam Sinifi Sinifta Kaldi  1975   0          8.9      24370
73   Süt Kardesler                 1976   0          8.8      20885
36   Saban Oglu Saban              1977   0          8.7      18535

I have watched all the movies listed in the top 5. They are films that I can watch repeatedly without getting bored. However, I don’t think they deserve to be in the top 5.

Bottom 5 movies based on user ratings.
Code
kable(tail(movies_df, 5), caption = "Bottom 5 Movies Based On User Ratings.")
Bottom 5 Movies Based On User Ratings.
Row  Titles                         Years  Durations  Ratings  Votes
158  Araf                           2006   0          2.4      4276
86   Çilgin Dersane                 2007   0          1.9      3899
129  Keloglan Karaprens’e Karsi     2006   0          1.6      9616
33   Dünyayi Kurtaran Adam’in Oglu  2006   0          1.5      16704
195  Emret Komutanim: Sah Mat       2007   0          1.5      7047

I haven’t watched any of the bottom 5 movies. Honestly, I believe they deserve their places in the ranking.

  1. Showing my favourite films and evaluating them

My top 5 list is below:
  1. Nefes: Vatan Sağolsun
  2. Dağ
  3. Dağ 2
  4. Babam ve Oğlum
  5. Kader

This list contains my favourite movies. If I had to choose two movies from this list, they would be Nefes: Vatan Sağolsun and Babam ve Oğlum.

  1. Considering that audience rating is a crucial indicator of movie quality, what can you infer about the average ratings of Turkish movies over the years? Calculate yearly rating averages and plot them as a scatter plot. Similarly, plot the number of movies over the years. You might observe that using yearly averages could be misleading due to the increasing number of movies each year. As an alternative solution, plot box plots of ratings over the years (each year having a box plot showing statistics about the ratings of movies in that year). What insights do you gather from the box plot?

Code
# Assumes the movies_df data frame built above, with columns: Titles, Years, Ratings, Votes

library(tidyverse)
library(ggplot2)

# Calculate yearly rating averages
average_ratings <- movies_df %>%
  group_by(Years) %>%
  summarise(Average_Rating = mean(Ratings),
            Number_of_Movies = n())

# Scatter plot for yearly rating averages
ggplot(average_ratings, aes(x = Years, y = Average_Rating, size = Number_of_Movies)) +
  geom_point() +
  labs(title = "Yearly Average Ratings of Turkish Movies",
       x = "Years",
       y = "Average Rating") +
  theme_minimal()

Code
# Number of movies over the years
ggplot(average_ratings, aes(x = Years, y = Number_of_Movies)) +
  geom_col(fill = "blue") +
  labs(title = "Number of Turkish Movies Released Each Year",
       x = "Years",
       y = "Number of Movies") +
  theme_minimal()

Code
# Box plots of ratings over the years
ggplot(movies_df, aes(x = as.factor(Years), y = Ratings)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Box Plots of Ratings of Turkish Movies Over the Years",
       x = "Years",
       y = "Ratings") +
  theme_minimal()
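
The box plots can be read alongside the numbers behind them. As a minimal sketch, using a small stand-in data frame (toy_df is hypothetical and takes the place of movies_df; its values are copied from the tables in this report), the count, median and interquartile range of the ratings can be computed per year:

```r
# Stand-in data for illustration; replace toy_df with movies_df
toy_df <- data.frame(
  Years   = c(1975, 1975, 1976, 1976, 2006, 2006, 2006, 2006),
  Ratings = c(9.2, 8.9, 8.9, 8.8, 7.8, 2.4, 1.6, 1.5)
)

# One row per year: number of movies, median rating, interquartile range
yearly_summary <- do.call(rbind, lapply(split(toy_df, toy_df$Years), function(d) {
  data.frame(Years      = d$Years[1],
             Movies     = nrow(d),
             Median     = median(d$Ratings),
             Rating_IQR = IQR(d$Ratings))
}))
yearly_summary
```

Showing the number of movies next to the median makes it easier to see which years rest on only a few films.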

  1. Do you believe there is a relationship between the number of votes a movie received and its rating? Investigate the correlation between Votes and Ratings.

Code
corr_rating_vote <- cor(movies_df$Ratings, movies_df$Votes)
corr_rating_vote
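[1] 0.233614

A correlation of about 0.23 indicates a weak positive linear relationship between the number of votes and the rating.
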
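
Vote counts tend to be skewed, so a Pearson correlation on raw votes can be driven by a few very popular movies. As a hedged sketch on a handful of rows copied from the tables in this report (the vector names toy_ratings and toy_votes are hypothetical), two common alternatives are the rank-based Spearman correlation and a Pearson correlation on log-transformed votes:

```r
# A few rating/vote pairs taken from the tables in this report
toy_ratings <- c(9.2, 8.2, 8.1, 7.5, 7.3, 2.4, 1.9, 1.5)
toy_votes   <- c(42512, 91022, 71698, 22362, 44631, 4276, 3899, 16704)

# Pearson on the raw values, Spearman on ranks, Pearson on log10(votes)
pearson_raw <- cor(toy_ratings, toy_votes)
spearman    <- cor(toy_ratings, toy_votes, method = "spearman")
pearson_log <- cor(toy_ratings, log10(toy_votes))

c(pearson_raw = pearson_raw, spearman = spearman, pearson_log = pearson_log)
```

On the full data, the same three calls would use movies_df$Ratings and movies_df$Votes.
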
  1. Do you believe there is a relationship between a movie’s duration and its rating?

Code
corr_duration_rating <- cor(movies_df$Durations, movies_df$Ratings)
Warning in cor(movies_df$Durations, movies_df$Ratings): the standard deviation
is zero
Code
corr_duration_rating
[1] NA

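The correlation is NA because every Durations value in the scraped data is 0, so the standard deviation is zero and the correlation cannot be computed. This output therefore says nothing about whether duration and rating are related; the durations would need to be scraped correctly first.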
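
The warning above comes from a variable with zero standard deviation. A small guard can make that case explicit before calling cor(). This is only a sketch; the helper name safe_cor is hypothetical:

```r
# Hypothetical helper: return NA with a clear message when the correlation
# is undefined, otherwise the usual Pearson correlation
safe_cor <- function(x, y) {
  ok <- complete.cases(x, y)  # keep only pairs without missing values
  x <- x[ok]
  y <- y[ok]
  if (length(x) < 2 || sd(x) == 0 || sd(y) == 0) {
    message("Correlation is undefined: too few values or no variation.")
    return(NA_real_)
  }
  cor(x, y)
}

# All durations equal: undefined, returns NA
safe_cor(c(0, 0, 0, 0), c(8.0, 8.2, 8.1, 7.8))

# Made-up durations that vary: returns a number
safe_cor(c(100, 110, 95, 120), c(8.0, 8.2, 8.1, 7.8))
```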

Question 4

  1. Use IMDb’s Advanced Title Search interface with the Title Type set to “Movie” only, the Country set to “Turkey” with the option “Search country of origin only” active, and the Awards & Recognition set to “IMDB Top 1000”. You should find a total of 11 movies.

Code
IMDB_3 <- "https://www.imdb.com/search/title/?title_type=feature&groups=top_1000&country_of_origin=TR&count=250"

page <- read_html(IMDB_3)

# Titles: same cleaning steps as in Question 2
title_names <- page |> html_nodes('.ipc-title__text') |> html_text()
title_names <- tail(head(title_names, -1), -1)
title_names <- str_split(title_names, " ", n = 2)
movie_name <- unlist(lapply(title_names, function(x) {x[2]}))

# Year: the first four characters of the metadata text
year <- page |> html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata") |> html_text()
movie_year <- as.numeric(substr(year, 1, 4))

top1000_df <- data.frame(movie_name, movie_year)
kable(top1000_df, caption = "Turkish movies in IMDB Top1000 without rating, duration and votes")
Turkish movies in IMDB Top1000 without rating, duration and votes
movie_name                  movie_year
Yedinci Kogustaki Mucize    2019
Kis Uykusu                  2014
Nefes: Vatan Sagolsun       2009
Ayla: The Daughter of War   2017
Babam ve Oglum              2005
Ahlat Agaci                 2018
Bir Zamanlar Anadolu’da     2011
Eskiya                      1996
G.O.R.A.                    2004
Vizontele                   2001
Her Sey Çok Güzel Olacak    1998
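
The table above holds only titles and years. One way to add rating and votes is to join it to the scraped movies data frame on title and year. A minimal sketch on small stand-in data frames follows (toy_top and toy_movies are hypothetical names; the values are copied from the tables in this report):

```r
# Stand-ins for top1000_df and movies_df
toy_top <- data.frame(
  movie_name = c("Babam ve Oglum", "Eskiya", "Kis Uykusu"),
  movie_year = c(2005, 1996, 2014)
)
toy_movies <- data.frame(
  Titles  = c("Babam ve Oglum", "Eskiya", "Uzak"),
  Years   = c(2005, 1996, 2002),
  Ratings = c(8.2, 8.1, 7.5),
  Votes   = c(91022, 71698, 22362)
)

# Left join: keep every Top 1000 movie; movies without a match get NA
joined <- merge(toy_top, toy_movies,
                by.x = c("movie_name", "movie_year"),
                by.y = c("Titles", "Years"),
                all.x = TRUE)
joined
```

Joining on both title and year rather than title alone avoids mixing up films that share a name.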