Assignment 2

Question 1

Using the filters on https://m.imdb.com/search, list all Turkish movies with more than 2500 reviews, and save the URLs.

urls <- c( "https://m.imdb.com/search/title/?title_type=feature&release_date=2010-01-01,2023-12-31&sort=moviemeter,asc&num_votes=2500,&country_of_origin=TR&count=250",
           "https://m.imdb.com/search/title/?title_type=feature&release_date=,2009-12-31&sort=moviemeter,asc&num_votes=2500,&country_of_origin=TR&count=250" )

print(urls)
[1] "https://m.imdb.com/search/title/?title_type=feature&release_date=2010-01-01,2023-12-31&sort=moviemeter,asc&num_votes=2500,&country_of_origin=TR&count=250"
[2] "https://m.imdb.com/search/title/?title_type=feature&release_date=,2009-12-31&sort=moviemeter,asc&num_votes=2500,&country_of_origin=TR&count=250"          
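
The two URLs differ only in the release-date window, presumably because a single results page lists at most 250 titles. As a minimal sketch, the query string could be assembled from its parts. `build_search_url` and its argument names are hypothetical helpers introduced here for illustration; they are not part of the assignment or of any IMDb API.

```r
# Hypothetical helper: assembles the search URL from the filter values
# that appear in the two hand-written URLs above.
build_search_url <- function(date_from = "", date_to = "", min_votes = 2500,
                             country = "TR", count = 250) {
  paste0(
    "https://m.imdb.com/search/title/?title_type=feature",
    "&release_date=", date_from, ",", date_to,
    "&sort=moviemeter,asc",
    "&num_votes=", min_votes, ",",
    "&country_of_origin=", country,
    "&count=", count
  )
}

# One URL per release-date window, matching the split used above.
urls_rebuilt <- c(
  build_search_url("2010-01-01", "2023-12-31"),
  build_search_url(date_to = "2009-12-31")
)
print(urls_rebuilt)
```

Building the strings this way makes it easier to change the vote threshold or add another date window without retyping the whole URL.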

Question 2

Start web scraping to create a data frame with columns: Title, Year, Duration, Rating, Votes.

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.2
Warning: package 'ggplot2' was built under R version 4.3.2
Warning: package 'stringr' was built under R version 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(rvest)
Warning: package 'rvest' was built under R version 4.3.2

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding
library(ggplot2)
library(knitr)
Warning: package 'knitr' was built under R version 4.3.2
library(reshape2)
Warning: package 'reshape2' was built under R version 4.3.2

Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths
df <- data.frame(titles = character(),
                 years = numeric(),
                 durations = character(),
                 ratings = numeric(),
                 votes = numeric())

titles <- c()
years <- c()
durations <- c()
ratings <- c()
votes <- c()

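for (url in urls) {
  html <- read_html(url)

  # Drop the first and last matched headings, which are not movie titles,
  # then remove the leading token before the first space (the list position).
  title <- html %>% html_nodes(".ipc-title__text")
  title <- html_text(title)
  title <- tail(head(title, -1), -1)
  title <- str_split(title, " ", n = 2)
  title <- unlist(lapply(title, function(x) x[2]))

  # The metadata text begins with the four-digit release year.
  year <- html %>% html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata")
  year <- html_text(year)
  year <- substr(year, 1, 4)
  year <- as.numeric(year)

  # Keep the first three characters of the rating text, e.g. "8.1".
  rating <- html %>% html_nodes(".ipc-rating-star.ipc-rating-star--base.ipc-rating-star--imdb.ratingGroup--imdb-rating")
  rating <- html_text(rating)
  rating <- substr(rating, 1, 3)
  rating <- as.numeric(rating)

  # NOTE: html_node() (singular) returns only the first match, so this gives
  # one vote count per page, not per movie; data.frame() below recycles the
  # two page-level values down the votes column.
  vote <- html %>%
    html_node(".sc-53c98e73-0.kRnqtn") %>%
    html_text() %>%
    parse_number()

  # The runtime sits in the same metadata text as the year.
  duration <- html %>% html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata")
  duration <- html_text(duration)

  # "\\d+h" can also pick up the year digits in front of the hour count,
  # so `%% 10` keeps only the last digit as the number of hours.
  hour <- str_extract(duration, "\\d+h") %>%
    str_replace("h", "") %>%
    as.numeric() %% 10

  # Entries lacking either an "h" or an "m" part come out as NA.
  total_duration <- hour * 60 + str_extract(duration, "\\d+m") %>%
    str_replace("m", "") %>%
    as.numeric()

  titles <- append(titles, title)
  years <- append(years, year)
  durations <- append(durations, total_duration)
  ratings <- append(ratings, rating)
  votes <- append(votes, vote)
}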

df <- data.frame(titles, years, durations, ratings,votes)
print(head(df,10), caption= "dataframe")
                      titles years durations ratings votes
1          Kuru Otlar Üstüne  2023       197     8.1  5063
2    Istanbul Için Son Çagri  2023        91     5.3 35018
3   Yedinci Kogustaki Mucize  2019       132     8.2  5063
4             Ölümlü Dünya 2  2023       117     7.5 35018
5                     Bihter  2023       113     3.6  5063
6               Ölümlü Dünya  2018       107     7.6 35018
7                 Kis Uykusu  2014       196     8.0  5063
8                     Dag II  2016       135     8.2 35018
9             Do Not Disturb  2023       114     6.3  5063
10 Ayla: The Daughter of War  2017       125     8.3 35018
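
Entries whose metadata lacks either the hour or the minute part end up with an NA duration in the loop above. The sketch below, in base R, parses a few made-up metadata strings and treats a missing part as zero. The string format (a four-digit year fused to the runtime) is an assumption inferred from the scraping code, and `parse_duration` is a hypothetical helper, not the method used in this assignment.

```r
# Hypothetical helper: converts metadata text such as "20232h 11m16+"
# into a runtime in minutes. Assumes the first four characters are the year.
parse_duration <- function(meta) {
  runtime <- substring(meta, 5)  # drop the leading four-digit year

  # Extract the number in front of a unit letter ("h" or "m"), NA if absent.
  get_part <- function(unit) {
    hit <- grepl(paste0("[0-9]+", unit), runtime)
    out <- rep(NA_real_, length(runtime))
    out[hit] <- as.numeric(sub(paste0("^.*?([0-9]+)", unit, ".*$"), "\\1",
                               runtime[hit], perl = TRUE))
    out
  }

  hours <- get_part("h")
  minutes <- get_part("m")

  # A missing part counts as zero; a runtime with neither part stays NA.
  total <- ifelse(is.na(hours), 0, hours) * 60 +
    ifelse(is.na(minutes), 0, minutes)
  total[is.na(hours) & is.na(minutes)] <- NA_real_
  total
}

# Made-up examples: full runtime, minutes only, hours only, no runtime.
meta <- c("20232h 11m16+", "201945m", "20152h", "2022")
print(parse_duration(meta))
```

Dropping the year before matching avoids the need for the modulo step, and it would also handle runtimes of ten hours or more.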

Question 3

  1. Arrange your data frame in descending order by Rating. Present the top 5 and bottom 5 movies based on user ratings. Have you watched any of these movies? Do you agree or disagree with their current IMDb Ratings?

df <- df[order(df$ratings, decreasing = TRUE),]

Top 5 movies based on user ratings

top5_movies <- head(df, 5)
print(top5_movies)
                          titles years durations ratings votes
257               Hababam Sinifi  1975        87     9.2  5063
39        CM101MMXI Fundamentals  2013       139     9.1  5063
273                   Tosun Pasa  1976        90     8.9  5063
337 Hababam Sinifi Sinifta Kaldi  1975        95     8.9  5063
321                Süt Kardesler  1976        80     8.8  5063

I do not fully agree with this list, which is based on user ratings. These are certainly enjoyable, entertaining, and valuable movies. However, the events depicted in these Yeşilçam films are dated and somewhat exaggerated by today's standards. I think there are better movies.

Bottom 5 movies based on user ratings.

bottom5_movies <- tail(df, 5)
print(bottom5_movies)
                            titles years durations ratings votes
189                 Cumali Ceber 2  2018       100     1.2  5063
199                          Müjde  2022        NA     1.2  5063
245              15/07 Safak Vakti  2021        95     1.2  5063
101 Cumali Ceber: Allah Seni Alsin  2017       100     1.0  5063
150                           Reis  2017       108     1.0 35018

To be honest, I haven't watched any of the movies on this list, so I cannot comment on their ratings.

  2. Check the ratings of 2-3 of your favorite movies. What are their standings?

My favorite movies are: 1. Aşk Tesadüfleri Sever 2. Kelebeğin Rüyası 3. İncir Reçeli

[Aşk Tesadüfleri Sever]

print(df[df$titles == "Ask Tesadüfleri Sever",], caption = "Aşk Tesadüfleri Sever")
                  titles years durations ratings votes
89 Ask Tesadüfleri Sever  2011       118     7.2  5063
sprintf("Rank of the *Aşk Tesadüfleri Sever* is %d", which(df$titles=="Ask Tesadüfleri Sever"))
[1] "Rank of the *Aşk Tesadüfleri Sever* is 151"

[Kelebeğin Rüyası]

print(df[df$titles == "Kelebegin Rüyasi",], caption = "Kelebeğin Rüyası")
             titles years durations ratings votes
42 Kelebegin Rüyasi  2013       138     7.7 35018
sprintf("Rank of the *Kelebeğin Rüyası* is %d", which(df$titles=="Kelebegin Rüyasi"))
[1] "Rank of the *Kelebeğin Rüyası* is 67"

[İncir Reçeli]

print(df[df$titles == "Incir Reçeli",], caption = "İncir Reçeli")
         titles years durations ratings votes
63 Incir Reçeli  2011        94     6.5  5063
sprintf("Rank of the *İncir Reçeli* is %d", which(df$titles=="Incir Reçeli"))
[1] "Rank of the *İncir Reçeli* is 262"
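
`which()` on the sorted data frame returns a row position, so two movies with the same rating receive different ranks depending on where the sort happens to place them. The sketch below uses made-up titles and ratings to contrast that position with a tie-aware rank from base R's `rank()`.

```r
# Made-up data: titles "A" and "C" share the same rating.
movies <- data.frame(
  titles  = c("A", "B", "C", "D"),
  ratings = c(7.2, 8.9, 7.2, 6.5)
)

# Sort by rating, highest first, as done for the real data frame.
movies <- movies[order(movies$ratings, decreasing = TRUE), ]

# Row position in the sorted table: tied movies get different values.
position_c <- which(movies$titles == "C")

# Tie-aware rank: tied movies share the best rank of their group.
movies$rank <- rank(-movies$ratings, ties.method = "min")
rank_c <- movies$rank[movies$titles == "C"]

print(movies)
```

With `ties.method = "min"`, "A" and "C" share one rank, whereas the row positions differ.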
  3. Scatter Plot
yearly_rating <- df %>% group_by(years) %>%
  summarise(average_rating = mean(ratings))
yearly_rating_scatter_plot <- ggplot(yearly_rating, aes(x=years, y=average_rating)) + geom_point()

print(yearly_rating_scatter_plot)

Box Plot

yearly_rating_box_plot <- ggplot(df, aes(x=years, y=ratings, group=years)) + geom_boxplot()
print(yearly_rating_box_plot)

Number of Movies

yearly_movie_count <- df %>% group_by(years) %>%
  summarise(number_of_movies = n())
yearly_count <- ggplot(yearly_movie_count, aes(x=years, y=number_of_movies)) + geom_point()
yearly_count
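
A yearly average is only as reliable as the number of movies behind it, so it can help to keep the average and the count side by side. The sketch below does this in base R on four rows copied from the output above; because it is such a small subset, the resulting averages are not the real yearly averages.

```r
# Four rows copied from the tables above (a tiny illustrative subset).
movies <- data.frame(
  years   = c(1975, 1975, 1976, 2023),
  ratings = c(9.2, 8.9, 8.9, 8.1)
)

# Mean rating and number of movies per year.
average_rating   <- tapply(movies$ratings, movies$years, mean)
number_of_movies <- tapply(movies$ratings, movies$years, length)

# Combine both summaries into one table, one row per year.
yearly <- data.frame(
  years            = as.numeric(names(average_rating)),
  average_rating   = as.vector(average_rating),
  number_of_movies = as.vector(number_of_movies)
)
print(yearly)
```

Years represented by a single movie can then be read with more caution when interpreting the scatter plot.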

  4. Correlation between Votes and Ratings.
corr_vote = cor(df$ratings, df$votes)
corr_vote
[1] 0.0332948
  5. Correlation between Duration and Ratings.
# The result is NA because at least one duration is missing (NA).
corr_duration= cor(df$durations, df$ratings)
corr_duration
[1] NA
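
By default `cor()` returns NA as soon as either vector contains a missing value, and the durations column has at least one. A minimal sketch of the usual workaround, using five duration and rating pairs copied from the tables above (so the coefficient it yields says nothing about the full data set):

```r
# Five rows copied from the output above; the second duration is missing.
durations <- c(100, NA, 95, 108, 87)
ratings   <- c(1.2, 1.2, 1.2, 1.0, 9.2)

# Default behaviour: any NA in the input makes the result NA.
r_default <- cor(durations, ratings)

# Use only the rows where both values are present.
r_complete <- cor(durations, ratings, use = "complete.obs")

print(r_default)
print(r_complete)
```

`use = "complete.obs"` silently drops incomplete rows, so it is worth checking how many rows remain before reading much into the coefficient.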

Question 4

url = "https://www.imdb.com/search/title/?title_type=feature&groups=top_1000&country_of_origin=TR&count=250"
name <- c()
year <- c()

html = read_html(url)

title <- html %>% html_nodes('.ipc-title__text')
title <- html_text(title)
title <- tail(head(title,-1),-1)
title <- str_split(title, " ", n=2)
title <- unlist(lapply(title, function(x) {x[2]}))

year <- html %>% html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata")
year <- html_text(year)
year <- substr(year, 1, 4)
year <- as.numeric(year)

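name <- append(name, title)
# NOTE: `year` already holds the scraped years, so appending it to itself
# doubles its length; data.frame() then recycles `name`, which is why every
# movie appears twice in the tables below.
year <- append(year, year)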
top1000_df <- data.frame(name, year)
print(top1000_df, caption = "Turkish movies in IMDB Top1000 without rating, duration and votes")
                        name year
1   Yedinci Kogustaki Mucize 2019
2                 Kis Uykusu 2014
3      Nefes: Vatan Sagolsun 2009
4  Ayla: The Daughter of War 2017
5             Babam ve Oglum 2005
6                Ahlat Agaci 2018
7    Bir Zamanlar Anadolu'da 2011
8                     Eskiya 1996
9                   G.O.R.A. 2004
10                 Vizontele 2001
11  Her Sey Çok Güzel Olacak 1998
12  Yedinci Kogustaki Mucize 2019
13                Kis Uykusu 2014
14     Nefes: Vatan Sagolsun 2009
15 Ayla: The Daughter of War 2017
16            Babam ve Oglum 2005
17               Ahlat Agaci 2018
18   Bir Zamanlar Anadolu'da 2011
19                    Eskiya 1996
20                  G.O.R.A. 2004
21                 Vizontele 2001
22  Her Sey Çok Güzel Olacak 1998
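
Each movie is listed twice above because the year vector is twice as long as the name vector, and `data.frame()` recycles the shorter column to fit. The sketch below reproduces the effect with three titles and years taken from the table and shows two ways around it; the variable names are chosen here for illustration.

```r
# Three titles and years copied from the table above.
name <- c("Eskiya", "G.O.R.A.", "Vizontele")
year <- c(1996, 2004, 2001)

# Appending a vector to itself doubles its length...
doubled_year <- append(year, year)

# ...and data.frame() recycles `name` to match, duplicating every row.
duplicated_df <- data.frame(name, year = doubled_year)

# Option 1: build the frame from the vectors as scraped.
fixed_df <- data.frame(name, year)

# Option 2: drop the repeated rows after the fact.
deduplicated_df <- unique(duplicated_df)

print(duplicated_df)
print(fixed_df)
```

Building the frame from equal-length vectors is the cleaner choice, since `unique()` would also remove any rows that are legitimately identical.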

The data frame above holds only the title and year of the Turkish movies in the Top 1000. Merging it with the full data frame adds duration, rating, and votes.

top1000_new_df<- merge(x=top1000_df, y=df,
                           by.x=c("name", "year"),
                           by.y=c("titles", "years"), all.x=TRUE)
print(top1000_new_df, caption = "Turkish movies in IMDB Top1000 with rating, duration and votes")
                        name year durations ratings votes
1                Ahlat Agaci 2018       188     8.0 35018
2                Ahlat Agaci 2018       188     8.0 35018
3  Ayla: The Daughter of War 2017       125     8.3 35018
4  Ayla: The Daughter of War 2017       125     8.3 35018
5             Babam ve Oglum 2005       108     8.2 35018
6             Babam ve Oglum 2005       108     8.2 35018
7    Bir Zamanlar Anadolu'da 2011       157     7.8  5063
8    Bir Zamanlar Anadolu'da 2011       157     7.8  5063
9                     Eskiya 1996       128     8.1 35018
10                    Eskiya 1996       128     8.1 35018
11                  G.O.R.A. 2004       127     8.0 35018
12                  G.O.R.A. 2004       127     8.0 35018
13  Her Sey Çok Güzel Olacak 1998       107     8.1 35018
14  Her Sey Çok Güzel Olacak 1998       107     8.1 35018
15                Kis Uykusu 2014       196     8.0  5063
16                Kis Uykusu 2014       196     8.0  5063
17     Nefes: Vatan Sagolsun 2009       128     8.0  5063
18     Nefes: Vatan Sagolsun 2009       128     8.0  5063
19                 Vizontele 2001       110     8.0  5063
20                 Vizontele 2001       110     8.0  5063
21  Yedinci Kogustaki Mucize 2019       132     8.2  5063
22  Yedinci Kogustaki Mucize 2019       132     8.2  5063
top1000_new_df <- top1000_new_df[order(top1000_new_df$ratings, decreasing = TRUE),]
print(top1000_new_df, caption = "Turkish movies in IMDB Top 1000 according to rankings.")
                        name year durations ratings votes
3  Ayla: The Daughter of War 2017       125     8.3 35018
4  Ayla: The Daughter of War 2017       125     8.3 35018
5             Babam ve Oglum 2005       108     8.2 35018
6             Babam ve Oglum 2005       108     8.2 35018
21  Yedinci Kogustaki Mucize 2019       132     8.2  5063
22  Yedinci Kogustaki Mucize 2019       132     8.2  5063
9                     Eskiya 1996       128     8.1 35018
10                    Eskiya 1996       128     8.1 35018
13  Her Sey Çok Güzel Olacak 1998       107     8.1 35018
14  Her Sey Çok Güzel Olacak 1998       107     8.1 35018
1                Ahlat Agaci 2018       188     8.0 35018
2                Ahlat Agaci 2018       188     8.0 35018
11                  G.O.R.A. 2004       127     8.0 35018
12                  G.O.R.A. 2004       127     8.0 35018
15                Kis Uykusu 2014       196     8.0  5063
16                Kis Uykusu 2014       196     8.0  5063
17     Nefes: Vatan Sagolsun 2009       128     8.0  5063
18     Nefes: Vatan Sagolsun 2009       128     8.0  5063
19                 Vizontele 2001       110     8.0  5063
20                 Vizontele 2001       110     8.0  5063
7    Bir Zamanlar Anadolu'da 2011       157     7.8  5063
8    Bir Zamanlar Anadolu'da 2011       157     7.8  5063
print(head(df,11), caption = "Movies Dataframe")
                          titles years durations ratings votes
257               Hababam Sinifi  1975        87     9.2  5063
39        CM101MMXI Fundamentals  2013       139     9.1  5063
273                   Tosun Pasa  1976        90     8.9  5063
337 Hababam Sinifi Sinifta Kaldi  1975        95     8.9  5063
321                Süt Kardesler  1976        80     8.8  5063
284             Saban Oglu Saban  1977        90     8.7 35018
307                   Zügürt Aga  1985       101     8.7  5063
317                Neseli Günler  1978        95     8.7  5063
323                  Kibar Feyzo  1978        83     8.7  5063
380      Hababam Sinifi Uyaniyor  1976        94     8.7 35018
343               Canim Kardesim  1973        85     8.6  5063
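
The two printed tables can also be compared programmatically. As a small sketch, the titles below are copied from the outputs above and checked for overlap with base R's set functions; the vector names are chosen here for illustration.

```r
# The 11 highest-rated titles in the scraped data frame (from the output above).
top11_by_rating <- c(
  "Hababam Sinifi", "CM101MMXI Fundamentals", "Tosun Pasa",
  "Hababam Sinifi Sinifta Kaldi", "Süt Kardesler", "Saban Oglu Saban",
  "Zügürt Aga", "Neseli Günler", "Kibar Feyzo",
  "Hababam Sinifi Uyaniyor", "Canim Kardesim"
)

# The 11 Turkish titles in the IMDb Top 1000 search (from the output above).
top1000_titles <- c(
  "Yedinci Kogustaki Mucize", "Kis Uykusu", "Nefes: Vatan Sagolsun",
  "Ayla: The Daughter of War", "Babam ve Oglum", "Ahlat Agaci",
  "Bir Zamanlar Anadolu'da", "Eskiya", "G.O.R.A.", "Vizontele",
  "Her Sey Çok Güzel Olacak"
)

# Titles that appear in both lists, and titles unique to the Top 1000 list.
in_both      <- intersect(top11_by_rating, top1000_titles)
only_top1000 <- setdiff(top1000_titles, top11_by_rating)

print(length(in_both))
print(only_top1000)
```

The overlap is empty, which suggests that the Top 1000 grouping is not determined by user rating alone.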