Assignment 2

1 and 2

Scraping and Tidying

For the first part of the assignment I was required to scrap Turkish movie data from the IMDB site using the advanced movie search. You can see the code I used to scrap the html and tidy it into a data frame called movie_info.

Show the code

library(tidyverse)
library(rvest)
library(stringr)

site_2010 <- read_html("https://m.imdb.com/search/title/?title_type=feature&release_date=,2010-01-01&num_votes=2500,&country_of_origin=TR&count=250")
movies_1 <- site_2010 |> html_elements("h3") |> html_text()

site_2023 <- read_html("https://m.imdb.com/search/title/?title_type=feature&release_date=2010-01-01,&num_votes=2500,&country_of_origin=TR&count=250")
movies_2 <- site_2023 |> html_elements("h3") |> html_text()

movies <- c(movies_1, movies_2)
movies <- movies[! movies == "Recently viewed"]
movies <- data.frame(movies) |>
  separate(movies, into = c("number", "titles"), sep = ". ", extra = "merge")
movies <- movies[,2] |> data.frame()
colnames(movies) <- c("title")

rates_1 <- site_2010 |> 
  html_elements(".dli-ratings-container") |> 
  html_text() |> 
  substr(0,3)

rates_2 <- site_2023 |> 
  html_elements(".dli-ratings-container") |> 
  html_text() |> 
  substr(0,3)

rates <- c(rates_1, rates_2) |> as.numeric() |> data.frame() 
colnames(rates) <- c("rate")

dli_1 <- site_2010 |> 
  html_elements(".dli-title-metadata-item") |> 
  html_text()

dli_2 <- site_2023 |> 
  html_elements(".dli-title-metadata-item") |> 
  html_text()

years <- c(dli_1 |> str_extract("\\d{4}") |> as.numeric(), 
           dli_2 |> str_extract("\\d{4}") |> as.numeric())

years <- years[complete.cases(years)] |> data.frame()
colnames(years) <- c("year")

dur_detect <- c(dli_1 |> str_detect("[hm]"), dli_2 |> str_detect("[hm]"))
durations <- c(dli_1,dli_2)[dur_detect] |> data.frame()
colnames(durations) <- c("duration")

vote_read <- c(site_2010 |> html_elements(".kRnqtn") |> html_text(), 
               site_2023 |> html_elements(".kRnqtn") |> html_text()) |>
  substring(6, last = 10000) |> 
  str_replace(",", "") |>
  as.numeric() |>
  data.frame()

colnames(vote_read) <- c("votes")

movie_info <- data.frame(movies, years, rates, vote_read, durations)

3

a)

The 5 Top Rated movies:

Code

movie_info <- arrange(movie_info, desc(rate))
first_5 <- movie_info[1:5,1:3]
last_5 <- tail(movie_info[,1:3], n=5)
first_5

                         title year rate
1               Hababam Sinifi 1975  9.2
2       CM101MMXI Fundamentals 2013  9.1
3                   Tosun Pasa 1976  8.9
4 Hababam Sinifi Sinifta Kaldi 1975  8.9
5                Süt Kardesler 1976  8.8

The 5 Worst Rated movies:

                             title year rate
466                 Cumali Ceber 2 2018  1.2
467                          Müjde 2022  1.2
468              15/07 Safak Vakti 2021  1.2
469 Cumali Ceber: Allah Seni Alsin 2017  1.0
470                           Reis 2017  1.0

Out of the top rated movies, I have watched all except for Tosun Pasa. Since the IMDb ratings are based on subjective rates of people, although I think there are movies which deserve top 5 more, I can not say that these movies do not deserve their ratings. The bottom rated movies however, are disliked by the majority of the users so there must be something wrong with them to receive these ratings.

b)

My Favorites

      title year rate votes duration
39 G.O.R.A. 2004    8 66033    2h 7m

    title year rate votes duration
90 Mucize 2015  7.6 13899   2h 16m

I adore G.O.R.A. and Mucize. I think both got the ratings they deserved, but Mucize has a special place in my heart with its beautiful sceneries, convincing acting and accents, good ending, and emotional story.

c)

Extracting Knowledge out of Data

Yearly rating averages:

Code

grouped_rate <- group_by(movie_info, year) |> summarise(rate_avg = mean(rate))

plot_rate <- ggplot(grouped_rate, aes(year, rate_avg)) +
  geom_point(col = "#CC0066")

plot_rate

Number of movies over the years:

Code

grouped_num <- group_by(movie_info, year) |> summarise(n_movies = n())

plot_num <- ggplot(grouped_num, aes(year, n_movies)) +
  geom_point(col = "#330066")

plot_num

Box plots of ratings over the years:

Code

plot_box <- ggplot(movie_info, aes(factor(year), rate)) +
  geom_boxplot(fill = "#339999")+
  xlab("year")+
  ylab("rate")+
  theme(axis.text.x = element_text(angle = 90))

plot_box

The box plot shows that other than a few exceptions means and lower quantiles for movie ratings tend to drop over the years. And the lowest rated movies also appear in more recent years.

d)

Votes vs Rate:

Code

grouped_rate <- group_by(movie_info, rate) |> summarise(vote_avg = mean(votes))

plot_vote_avg <- ggplot(grouped_rate, aes(rate, vote_avg)) +
  geom_point(col = "#005566")

plot_vote_avg

The only conclusion we can come to from this plot is that movies with really high ratings tend to have higher votes, which can be caused by more people watching them because they are good already? Seems pretty natural.

e)

Duration vs Rate

Code

grouped_duration <- group_by(movie_info, duration) |> summarise(rate = mean(rate))

plot_dur <- ggplot(grouped_duration, aes(duration, rate)) +
  geom_point(col = "#FF33CC")+
  theme(axis.text.x = element_text(angle = 90))

plot_dur

No. Maybe “extreme duration = bad rating” but that is pretty much it.

4

Turkish Movies in Top 1000

Code

site_top1000 <- read_html("https://m.imdb.com/search/title/?title_type=feature&groups=top_1000&country_of_origin=TR")
top1000_title <- site_top1000 |> html_elements("h3") |> html_text()
top1000_title <- top1000_title[! top1000_title == "Recently viewed"]
top1000_title <- data.frame(top1000_title) |>
  separate(top1000_title, into = c("number", "titles"), sep = ". ", extra = "merge")
top1000_title <- data.frame(top1000_title$titles)
colnames(top1000_title) <- c("title")

top1000_year <- site_top1000 |> html_elements(".dli-title-metadata-item") |> html_text()
top1000_year <- c(top1000_year |> str_extract("\\d{4}") |> as.numeric())
top1000_year <- top1000_year[complete.cases(top1000_year)] |> data.frame()
colnames(top1000_year) <- c("year")

top1000 <- data.frame(top1000_title,top1000_year)

joined_df <- left_join(top1000, movie_info, by = "title")

joined_df <- joined_df[order(desc(joined_df$rate)),]
joined_df

                       title year.x year.y rate votes duration
4  Ayla: The Daughter of War   2017   2017  8.3 42992    2h 5m
1   Yedinci Kogustaki Mucize   2019   2019  8.2 54171   2h 12m
5             Babam ve Oglum   2005   2005  8.2 91035   1h 48m
8                     Eskiya   1996   1996  8.1 71704    2h 8m
11  Her Sey Çok Güzel Olacak   1998   1998  8.1 27122   1h 47m
2                 Kis Uykusu   2014   2014  8.0 54646   3h 16m
3      Nefes: Vatan Sagolsun   2009   2009  8.0 35022    2h 8m
6                Ahlat Agaci   2018   2018  8.0 27015    3h 8m
9                   G.O.R.A.   2004   2004  8.0 66033    2h 7m
10                 Vizontele   2001   2001  8.0 38403   1h 50m
7    Bir Zamanlar Anadolu'da   2011   2011  7.8 49365   2h 37m

These are obviously not the highest rated movies of all the Turkish movies, so maybe the Top 1000 is dependent on popularity as well.