Assignment 1
My first assignment has four parts. Three of them appear in this file; the fourth is rendering and committing all of my work here.
(a) A Brief Summary of Data Manipulation Tools
This summary is based on the video "Data Manipulation Tools: dplyr -- Pt 3 Intro to the Grammar of Data Manipulation with R", uploaded by the Posit YouTube channel.
The main objective of the video and of this summary is to discuss key concepts of data wrangling using R and the tidyverse, a collection of R packages.
Ways to Access Information
Extract existing variables by select()
Extract existing observations by filter()
Derive new variables by mutate()
Change the unit of analysis by summarise()
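The four verbs above can be sketched one at a time. This is a minimal illustration, not the example from the video: it assumes the dplyr package is installed and uses the built-in mtcars data set as a stand-in, and the result names (cars_small, four_cyl, with_ratio, overall) are made up for this sketch.

```r
# Sketch only: mtcars is a built-in R data set used as a stand-in,
# not the data from the video.
library(dplyr)

# select(): keep only the chosen variables (columns)
cars_small <- select(mtcars, mpg, cyl, wt)

# filter(): keep only the observations (rows) that meet a condition
four_cyl <- filter(mtcars, cyl == 4)

# mutate(): derive a new variable from existing ones
# (wt is recorded in units of 1000 lbs)
with_ratio <- mutate(mtcars, mpg_per_1000lb = mpg / wt)

# summarise(): collapse all rows into a single summary row
overall <- summarise(mtcars, mean_mpg = mean(mpg))
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes them easy to combine.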
Pipe Operator
The pipe operator (%>%) lets you chain these functions together in a more readable and efficient way. Piping the result of one expression into the next makes it easy to build a sequence of data transformations.
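To show what the pipe changes, here is the same calculation written twice, once nested and once piped. It is a sketch under the assumption that dplyr is installed; the built-in mtcars data set and the names nested and piped are illustrative choices, not taken from the video.

```r
# Sketch only: mtcars is a stand-in data set.
library(dplyr)

# Nested form: has to be read from the inside out
nested <- summarise(
  filter(select(mtcars, mpg, cyl), cyl == 4),
  mean_mpg = mean(mpg)
)

# Piped form: the same steps, read from top to bottom
piped <- mtcars %>%
  select(mpg, cyl) %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg))
```

Both versions give the same result; the piped one lists the steps in the order they are carried out.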
Unit of Analysis
You can change the unit of analysis with group_by() and summarise(). Grouping a data frame lets you calculate summary statistics for different subsets of the data: group by one or more columns, then apply summarise() to get one row of summary statistics per group.
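A minimal sketch of grouping by one column and then by two. It assumes dplyr 1.0.0 or later (for the .groups argument) and uses the built-in mtcars data set in place of the video's data; by_cyl and by_cyl_gear are names invented for this illustration.

```r
# Sketch only: mtcars is a stand-in data set.
library(dplyr)

# Unit of analysis: one row per number of cylinders
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n_cars = n(), mean_mpg = mean(mpg))

# Unit of analysis: one row per (cylinders, gears) combination
by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n_cars = n(), mean_mpg = mean(mpg), .groups = "drop")
```

The input is the same in both cases; only the grouping columns change, and with them the level at which each summary row is reported.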
Conclusion
Grouping and summarizing lets you change the unit of analysis and produce summaries at different levels. This makes it easier to explore the data, for example by city, by year, or by a combination of factors, and to gain insights from it.
(b) Three Differences Between R and Python 1
Even though both R and Python are used for data analysis, statistics, and machine learning, there are still some key differences between them.
1. Data Structures and Syntax
R is known for its data manipulation capabilities, especially with data frames and vectors. Data frames are a built-in type, which makes it easy to work with tabular data.
Python, on the other hand, offers general-purpose structures such as lists and dictionaries in its core language; tabular data is usually handled through a library such as pandas.
- Example in R:
# Creating a data frame in R
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35)
)
- Example in Python:
# Creating a list of dictionaries in Python
data = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
    {"Name": "Charlie", "Age": 35}
]
2. Vectorization
R is designed around vectors and supports vectorized operations, so you can operate on entire vectors without explicit loops.
Python's built-in lists do not support element-wise arithmetic. You need either an explicit loop or comprehension, or a library such as NumPy that provides vectorized arrays.
- Example in R:
# Vectorized operation in R
vector1 <- c(1, 2, 3, 4, 5)
vector2 <- c(6, 7, 8, 9, 10)
result <- vector1 + vector2
- Example in Python:
# Element-wise addition with a list comprehension in Python
list1 = [1, 2, 3, 4, 5]
list2 = [6, 7, 8, 9, 10]
result = [a + b for a, b in zip(list1, list2)]
3. Libraries and Ecosystem
Python has a larger ecosystem outside statistics and is often preferred for machine learning and deep learning thanks to libraries like scikit-learn, TensorFlow, and PyTorch. It’s also widely used for web development and general-purpose programming.
R, on the other hand, focuses on statistical analysis and data visualization, with packages like ggplot2 and dplyr, plus Shiny for interactive web apps. R has machine learning packages as well, but Python is often the more popular choice in this domain.
- Example of R’s ggplot2 for data visualization:
library(ggplot2)
ggplot(df, aes(x = Age, y = Name)) + geom_point()
- Example of Python’s scikit-learn for machine learning:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
# Fit the model, make predictions, etc.
In conclusion, R and Python are both powerful programming languages. The choice between R and Python ultimately depends on the specific needs and preferences of the user, as well as the nature of the tasks at hand.
(c) Dataset Example from dslabs
# Set up the dslabs package
install.packages("dslabs")
library(dslabs)
data("na_example") # Load the "na_example" data set
head(na_example) # Display the first few elements of the data
# Count the NAs in the original data
total_na_original <- sum(is.na(na_example))
print(paste("Total number of NAs in the original data set:", total_na_original))
# Copy the data and replace all the NAs with 0
na_example_no_na <- na_example
na_example_no_na[is.na(na_example_no_na)] <- 0
head(na_example_no_na) # Display the first few elements of the new data
# Count the NAs again; the result should be 0
total_na_new <- sum(is.na(na_example_no_na))
print(paste("Total number of NAs in the new data set:", total_na_new))
Footnotes
1. The code in Section (b) is AI-generated.