Assignment 1

My first assignment has four parts. Three parts can be seen inside within this file while the fourth part is rendering and commiting all of my work here.

deneme

(a) A Brief Summary of Data Manipulation Tools

This summary is based on Data Manipulation Tools: dplyr -- Pt 3 Intro to the Grammar of Data Manipulation with R video uploaded by Posit YouTube channel.

Main object of the video and this summary is to discuss key concepts related to data wranling using R and the tidyverse which is a collection of R packages.

Ways to Access Information

  1. Extract existing variables by select()

  2. Extract existing observations by filter()

  3. Derive new variables by mutate()

  4. Change the unit of analysis by summarise()

Pipe Operator

The pipe operator (%>%) allows to chain these functions together in a more readable and efficient way. It is easy to build a sequence of data transformations by piping the results of one expression into the next one.

Unit of Analysis

It is possible to change the unit of analysis by using group_by() and summarise(). Grouping a data frame allows to calculate summary statistics for different subsets of the data. You can group by one or more columns and then apply summarise() to create summary statistics specific to those groups.

Conclusion

This process of grouping and summarizing allows to change the unit of analysis and create different levels of summaries. It makes it easier to explore data at different levels, such as by city, year, or a combination of factors, to gain insights into the data.

(b) Three Differences Between R and Python 1

Even though both R and Python are used for data analysis, statistics and machine learning; there are still some key differences between them.

1. Data Structures and Syntax

R is known for its data manipulation capabilities, especially with data frames and vectors. It uses a special syntax for data frames, making it easy to work with tabular data.

Python, on the other hand, uses lists, dictionaries, and other data structures for data manipulation. It has a more general-purpose syntax.

  • Example in R:
# Creating a data frame in R
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35)
)
  • Example in Python:
# Creating a list of dictionaries in Python
data = [
  {"Name": "Alice", "Age": 25},
  {"Name": "Bob", "Age": 30},
  {"Name": "Charlie", "Age": 35}
  ]

2. Vectorization

R is designed to work with vectors and supports vectorized operations, which means you can perform operations on entire vectors without explicit loops.

Python, while capable of vectorized operations with libraries like NumPy, often requires explicit loops for many tasks.

Example in R:

# Vectorized operation in R
vector1 <- c(1, 2, 3, 4, 5)
vector2 <- c(6, 7, 8, 9, 10)
result <- vector1 + vector2
  • Example in Python:
# Using a for loop in Python
list1 = [1, 2, 3, 4, 5]
list2 = [6, 7, 8, 9, 10]
result = [a + b for a, b in zip(list1, list2)]

3. Libraries and Ecosystem

Python has a more extensive ecosystem and is often preferred for machine learning and deep learning due to libraries like scikit-learn, TensorFlow, and PyTorch. It’s also widely used for web development and general-purpose programming.

R, on the other hand, has a strong focus on statistical analysis and data visualization with packages like ggplot2, dplyr, and Shiny. While R has packages for machine learning as well, Python is often a more popular choice in this domain.

  • Example of R’s ggplot2 for data visualization:
library(ggplot2)
ggplot(df, aes(x = Age, y = Name)) + geom_point()
  • Example of Python’s scikit-learn for machine learning:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
# Fit the model, make predictions, etc.

In conclusion, R and Python are both powerful programming languages. The choice between R and Python ultimately depends on the specific needs and preferences of the user, as well as the nature of the tasks at hand.

(c) Dataset Example from dslabs

#Setup of dslabs package
install.packages("dslabs") 
library(dslabs) 
data("na_example") # Load the "na_example" data set
head(na_example) # Display the first few rows of the data
total_na_original <- sum(is.na(na_example))
print(paste("Total number of NAs in the original data set:", total_na_original))
na_example_no_na <- na_example
na_example_no_na[is.na(na_example_no_na)] <- 0
head(na_example_no_na) #Replace all the NAs with 0 and display the new data frame
total_na_new <- sum(is.na(na_example_no_na))
print(paste("Total number of NAs in the new data frame:", total_na_new))
Back to top

Footnotes

  1. It is important to highlight that, codes from Section (b) are AI generated.↩︎