Assignment 2

Analysis of Turkish Movies (IMDb)

The movies have been filtered, and their URLs have been saved.

# Movies released between 2010-01-01 and 2023-12-31
url_1 <- "https://m.imdb.com/search/title/?title_type=feature&release_date=2010-01-01,2023-12-31&num_votes=2500,&country_of_origin=TR&count=250"

# Movies released on or before 2009-12-31
url_2 <- "https://m.imdb.com/search/title/?title_type=feature&release_date=,2009-12-31&num_votes=2500,&country_of_origin=TR&count=250"

combined_url <- c(url_1, url_2)
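
The two URLs differ only in their release-date window, so they could also be generated from shared parameters. The following is a minimal sketch in base R, assuming a hypothetical helper `build_search_url()` that is not part of the original analysis; its arguments simply mirror the query strings used above.

```r
# Hypothetical helper (an assumption of this sketch): builds an IMDb
# advanced-search URL for Turkish feature films from the same query
# parameters that appear in url_1 and url_2.
build_search_url <- function(from = "", to = "", min_votes = 2500, count = 250) {
  paste0(
    "https://m.imdb.com/search/title/?title_type=feature",
    "&release_date=", from, ",", to,
    "&num_votes=", min_votes, ",",
    "&country_of_origin=TR",
    "&count=", count
  )
}

combined_url <- c(
  build_search_url(from = "2010-01-01", to = "2023-12-31"),
  build_search_url(to = "2009-12-31")
)
```

Keeping the vote threshold and page size as arguments makes it easier to rerun the scrape with a different filter.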

Necessary Packages

knitr::opts_chunk$set(warning = FALSE)
library(tidyverse)
library(rvest)
library(stringr)
library(ggplot2)
library(ggthemes)

Web Scraping

Turkish movies with a minimum of 2500 votes have been filtered.

# Convert a runtime string such as "2h 12m" to total minutes
convert_to_minutes <- function(duration) {
  hours <- as.numeric(str_extract(duration, "\\d+(?=h)"))
  minutes <- as.numeric(str_extract(duration, "\\d+(?=m)"))
  ifelse(is.na(hours), 0, hours) * 60 + ifelse(is.na(minutes), 0, minutes)
}

# Extract the first number from a string, dropping thousands separators
extract_numeric <- function(string) {
  numeric_part <- str_extract(string, "\\d[0-9,]+")
  as.numeric(gsub(",", "", numeric_part))
}

title <- c()
release_year <- c()
duration <- c()
rating <- c()
vote <- c()

# Note: the "sc-..." class names appear to be auto-generated and may change
# whenever IMDb updates its pages
for (url in combined_url) {
  Html <- read_html(url)

  # Titles: drop the first and last matches, then strip the leading list index
  title_names <- Html |> html_nodes('.ipc-title__text') |> html_text()
  title_names <- tail(head(title_names, -1), -1)
  title_names <- str_split(title_names, " ", n = 2)
  title_names <- unlist(lapply(title_names, function(x) x[2]))
  title <- append(title, title_names)

  # Metadata text: the first four characters are the release year,
  # and the runtime follows directly after them
  metadata <- Html |> html_nodes('.sc-43986a27-7.dBkaPT.dli-title-metadata') |> html_text()

  # Release year: first four characters
  release_year <- append(release_year, as.numeric(strtrim(metadata, 4)))

  # Runtime: match year plus runtime, drop the four year digits, convert to minutes
  durations <- str_extract(metadata, "\\d+h( \\d+m)?|\\d+m|\\d+") |>
    str_extract("(?<=^.{4}).*")
  duration <- append(duration, convert_to_minutes(durations))

  # Rating: first three characters of the rating text, e.g. "8.1"
  ratings <- Html |> html_nodes(".sc-43986a27-1.fVmjht") |> html_text()
  rating <- append(rating, as.numeric(str_sub(ratings, 1, 3)))

  # Votes: numeric part of the vote-count text
  votes <- Html |> html_nodes(".sc-53c98e73-0.kRnqtn") |> html_text()
  vote <- append(vote, extract_numeric(votes))
}

movies <- data.frame(title, release_year, duration, rating, vote)

head(movies)

                     title release_year duration rating  vote
1        Kuru Otlar Üstüne         2023      197    8.1  5063
2  Istanbul Için Son Çagri         2023       91    5.3  7346
3 Yedinci Kogustaki Mucize         2019      132    8.2 54155
4           Ölümlü Dünya 2         2023      117    7.5  3447
5                   Bihter         2023      113    3.6  3341
6             Ölümlü Dünya         2018      107    7.6 30260

Above, you can observe the dataframe generated by scraping data from the web.
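
The cleaning steps can be tried without touching the network. The sketch below uses base R only and works on made-up raw strings whose shape is an assumption inferred from the scraping code (a leading list index on each title, and metadata text that starts with the four-digit year immediately followed by the runtime). The helper `extract_unit()` is hypothetical and is not part of the original code.

```r
# Assumed raw strings (illustrative; the exact page text is not shown in
# this report). Years and runtimes match rows of the data frame above.
raw_titles <- c("5. Bihter", "3. Yedinci Kogustaki Mucize")
raw_meta   <- c("20231h 53m", "20192h 12m")

# Strip the leading "N. " index from each title
titles <- sub("^[0-9]+\\.[[:space:]]+", "", raw_titles)

# Year: the first four characters; runtime: whatever follows them.
# Removing the year first matters, otherwise "20231h" would be read as
# 20231 hours.
years   <- as.numeric(substr(raw_meta, 1, 4))
runtime <- substring(raw_meta, 5)

# Hypothetical helper: number in front of a unit letter ("h" or "m"),
# or 0 when that unit is absent
extract_unit <- function(x, unit) {
  parts <- regmatches(x, regexec(paste0("([0-9]+)", unit), x))
  vapply(parts, function(p) if (length(p) == 2) as.numeric(p[2]) else 0, numeric(1))
}

minutes <- extract_unit(runtime, "h") * 60 + extract_unit(runtime, "m")

data.frame(title = titles, release_year = years, duration = minutes)
```

Testing the parsing on fixed strings like these makes it easier to notice when a page change breaks the extraction.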

a) Arranged by Rating

Pre-processing

movies <- movies %>%
  arrange(desc(rating)) %>%
  mutate(ranking = row_number()) %>%
  select(ranking, everything())

Head

The top 5 movies by user rating are shown below.

head(movies, n = 5L) %>% select(title, rating, vote, release_year)

                         title rating  vote release_year
1               Hababam Sinifi    9.2 42512         1975
2       CM101MMXI Fundamentals    9.1 46995         2013
3                   Tosun Pasa    8.9 24327         1976
4 Hababam Sinifi Sinifta Kaldi    8.9 24370         1975
5                Süt Kardesler    8.8 20886         1976

I believe these movies represent the best in Turkish cinema. They are widely adored, and I personally enjoyed watching each of them. While CM101 MMXI Fundamentals is a stand-up performance rather than a movie, it remains exceptionally popular. In my opinion, Cem Yılmaz is the standout figure in the world of stand-up comedy, making CM101 MMXI Fundamentals a must-see for comedy enthusiasts.
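
In the table above, ranks 3 and 4 share a rating of 8.9, so their relative order comes from the order in which the rows were scraped rather than from the data. One option is to break ties by vote count. The sketch below does this in base R on three rows copied from the table; the tie-break rule is an assumption of this sketch, not something the original ranking applies.

```r
# Three rows taken from the top-5 output above
top <- data.frame(
  title  = c("Hababam Sinifi", "Tosun Pasa", "Hababam Sinifi Sinifta Kaldi"),
  rating = c(9.2, 8.9, 8.9),
  vote   = c(42512, 24327, 24370)
)

# Sort by rating (descending), then by vote count (descending) to break ties
ranked <- top[order(-top$rating, -top$vote), ]
ranked$ranking <- seq_len(nrow(ranked))
rownames(ranked) <- NULL
ranked
```

With dplyr, the same ordering would be `arrange(desc(rating), desc(vote))`.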

Tail

The bottom 5 movies are shown below.

tail(movies, n = 5L) %>% select(title, rating, vote)

                             title rating  vote
466                 Cumali Ceber 2    1.2 10228
467                          Müjde    1.2  9920
468              15/07 Safak Vakti    1.2 20606
469 Cumali Ceber: Allah Seni Alsin    1.0 39266
470                           Reis    1.0 73972

I recently watched Cumali Ceber, and in my opinion, it stands out as an exceptionally poor movie. It might even be considered the worst in the Turkish film industry. I concur with its low rating. Although I haven’t seen the others, judging by their ratings, I’m inclined to believe they are also of subpar quality, and as a result, I’ve decided not to watch them.

b) My Favorite Ones

My three favorite movies are listed below with their rankings and ratings.

movies %>%
  filter(title %in% c("Kolpaçino", "Yedinci Kogustaki Mucize", "Babam ve Oglum"))

  ranking                    title release_year duration rating  vote
1      23 Yedinci Kogustaki Mucize         2019      132    8.2 54155
2      27           Babam ve Oglum         2005      108    8.2 91024
3     271                Kolpaçino         2009       98    6.5 13652

c) Visualization

The plot below shows the average movie rating for each release year; the averages decline as we approach the present day. The number of movies released each year must be taken into account, however, because an average over one or two films is far more volatile than an average over dozens.

movies %>% 
  group_by(release_year) %>%
  summarize(yearly_average = mean(rating)) %>%
  ggplot(aes(x = release_year, y = yearly_average)) + geom_point() +
  ggtitle("Yearly Rating Averages")

Below, you can observe a general increase in the number of movies released over the years.

movies %>%
  group_by(release_year) %>%
  summarize(movie_number = n()) %>%
  ggplot(aes(x = release_year, y = movie_number)) + geom_point() +
  ggtitle("Number of Movies per Year")
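
Because the yearly average and the yearly count are read together, it can help to compute them side by side so that each average carries its sample size. The sketch below does this in base R on the six rows printed by head(movies) earlier; it is only an illustration on that small sample, not a summary of the full dataset.

```r
# The six rows printed by head(movies) earlier in this analysis
sample_movies <- data.frame(
  release_year = c(2023, 2023, 2019, 2023, 2023, 2018),
  rating       = c(8.1, 5.3, 8.2, 7.5, 3.6, 7.6)
)

# Mean rating and number of movies per release year
avg <- tapply(sample_movies$rating, sample_movies$release_year, mean)
n   <- tapply(sample_movies$rating, sample_movies$release_year, length)

yearly <- data.frame(
  release_year   = as.numeric(names(avg)),
  yearly_average = as.numeric(avg),
  movie_number   = as.integer(n)
)
yearly
```

In this sample, 2018 and 2019 each rest on a single film, which is exactly the situation in which a yearly average says little on its own.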

Box Plot

movies %>%
  ggplot(aes(x = as.factor(release_year), y = rating)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+ 
  xlab("release_year")

Since 2003, there has been a dramatic increase in the number of movies released, coinciding with relatively lower ratings.

Vote vs Rating

movies %>%
  ggplot(aes(x = vote, y = rating)) + geom_point() +
  ggtitle("Vote vs Rating")

Most movies have fewer than 15,000 votes (the filter sets a floor of 2,500), and in that range the ratings are concentrated above 5.0. As the number of votes increases, the ratings tend to be high. However, a more in-depth investigation with a larger dataset is necessary to draw definitive conclusions.
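
One way to put a number on this visual impression is a rank correlation, which is less sensitive than a linear correlation to the few movies with very large vote counts. The sketch below computes Spearman's rho in base R on the six rows printed by head(movies) earlier. Six rows are far too few to support any conclusion, so this only shows the mechanics; on the full data the same call would take movies$vote and movies$rating.

```r
# The six rows printed by head(movies) earlier in this analysis
sample_movies <- data.frame(
  rating = c(8.1, 5.3, 8.2, 7.5, 3.6, 7.6),
  vote   = c(5063, 7346, 54155, 3447, 3341, 30260)
)

# Spearman's rho compares the rank order of votes with the rank order of ratings
rho <- cor(sample_movies$vote, sample_movies$rating, method = "spearman")
rho
```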

Duration vs Rating

movies %>%
  ggplot(aes(x = duration, y = rating)) + geom_point() +
  ggtitle("Duration vs Rating")

The durations of the movies tend to accumulate within the range of approximately 75 to 130 minutes. Interestingly, within this range, we observe both high and low ratings. Therefore, there doesn’t appear to be a clear relationship between the duration of the movies and their ratings.

Turkish Movies in IMDb Top 1000

Web Scraping

knitr::opts_chunk$set(warning = FALSE)

url <- "https://m.imdb.com/search/title/?title_type=feature&groups=top_1000&country_of_origin=TR"

Html_ <- read_html(url)

# Titles: drop the first and last matches, then strip the leading list index
title_top <- Html_ |> html_nodes('.ipc-title__text') |> html_text()
title_top <- tail(head(title_top, -1), -1)
title_top <- str_split(title_top, " ", n = 2)
title_top <- unlist(lapply(title_top, function(x) x[2]))

# Release year: first four characters of the metadata text
release_years <- Html_ |> html_nodes(".sc-43986a27-7.dBkaPT.dli-title-metadata") |> html_text()
release_year_top <- as.numeric(strtrim(release_years, 4))

# tibble() replaces the deprecated data_frame()
movies_top <- tibble(title_top, release_year_top)

Joining with the original table

left_join(movies_top, movies, by = c("title_top" = "title")) %>%
  select(-release_year_top) %>% select(ranking, everything()) %>%
  arrange(desc(rating))

# A tibble: 11 × 6
   ranking title_top                 release_year duration rating  vote
     <int> <chr>                            <dbl>    <dbl>  <dbl> <dbl>
 1      20 Ayla: The Daughter of War         2017      125    8.3 42990
 2      23 Yedinci Kogustaki Mucize          2019      132    8.2 54155
 3      27 Babam ve Oglum                    2005      108    8.2 91024
 4      31 Eskiya                            1996      128    8.1 71698
 5      32 Her Sey Çok Güzel Olacak          1998      107    8.1 27119
 6      37 Kis Uykusu                        2014      196    8   54633
 7      40 Nefes: Vatan Sagolsun             2009      128    8   35018
 8      38 Ahlat Agaci                       2018      188    8   27002
 9      42 G.O.R.A.                          2004      127    8   66029
10      44 Vizontele                         2001      110    8   38399
11      58 Bir Zamanlar Anadolu'da           2011      157    7.8 49352

The ranking reflects the actual positions of these movies within the ‘movies’ dataset. Although all 11 are in IMDb's top 1000 list, they do not occupy the highest positions in the initial dataframe; the best-placed one ranks 20th. This suggests that IMDb uses criteria beyond the raw user rating to build its top 1000 list. Awards received or tickets sold are possibilities, and since every film in the table has at least 27,000 votes, a vote-count requirement may also play a role.
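
A note on the join itself: it matches on title alone, which would pair a top-1000 entry with the wrong row if two films in ‘movies’ shared a title. The sketch below uses base R merge() to show the difference between joining on title and joining on title plus release year. The first two rows come from the table above; the third row (a second "Eskiya" from 2020) is hypothetical and exists only to create a title clash.

```r
# Two entries from the top-1000 table above
top <- data.frame(
  title_top        = c("Eskiya", "Vizontele"),
  release_year_top = c(1996, 2001)
)

# Matching rows from the table above, plus one HYPOTHETICAL same-title film
movies_small <- data.frame(
  ranking      = c(31L, 44L, 300L),
  title        = c("Eskiya", "Vizontele", "Eskiya"),
  release_year = c(1996, 2001, 2020),  # the 2020 row is made up
  rating       = c(8.1, 8.0, 5.0)      # the 5.0 rating is made up
)

# Join on title only: the hypothetical film is matched as well
by_title <- merge(top, movies_small, by.x = "title_top", by.y = "title")

# Join on title and release year: only the intended rows match
by_both <- merge(top, movies_small,
                 by.x = c("title_top", "release_year_top"),
                 by.y = c("title", "release_year"))

nrow(by_title)
nrow(by_both)
```

With dplyr, the equivalent key would be `by = c("title_top" = "title", "release_year_top" = "release_year")`.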