R is a programming language used for statistical analysis, visualization, and other data analysis. As a data analyst, you will use R to complete many of the tasks associated with the data analysis process. Understanding how it works and why you use it is crucial to developing a mastery of data analytics.
This is the download page for R itself. Pick a CRAN mirror, then choose the version of R that matches your OS: https://cran.r-project.org/mirrors.html
RStudio desktop download page
https://posit.co/download/rstudio-desktop/#download
Open the R console, where you can write and execute commands. You can type simple commands at the prompt, such as print("Hey Jay") or 1 + 2
To exit, enter quit() or its shorthand q()
Packages
Units of reproducible R code. R community members create packages to keep track of the R functions that they write and reuse. To install a package, the syntax is install.packages("name_of_package")
For instance, install.packages("tidyverse"). The tidyverse is a collection of R packages that share a common design philosophy for data manipulation, exploration, and visualization. You only need to install it once, but you must load it with the library() function every time you start a new session: library(tidyverse)
Common tidyverse packages
ggplot2 - used for data visualization, specifically plots.
tidyr - used for cleaning data to make it tidy.
readr - used for importing data. A common function is read_csv().
dplyr - offers a consistent set of functions for common data manipulation tasks, e.g. the select() function.
?print - opens the documentation for the print() function in the help window.
my_variable <- "My variable"
my_variable <- 2.23
Vector - a group of data elements of the same type, stored in a sequence in R
vec_variable <- c(10, 45, 78.2)
vec_variable # run the variable name on its own to print its contents
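A minimal sketch of how the vector above behaves, using only base R. The name `mixed` is made up for this illustration; it shows what happens when the "same type" rule is tested by combining numbers and text.

```r
# The numeric vector from the notes above
vec_variable <- c(10, 45, 78.2)

length(vec_variable)   # number of elements in the vector
typeof(vec_variable)   # storage type shared by every element ("double")
vec_variable[2]        # R indexes from 1, so this is the second element

# Mixing types: R coerces every element to one common type
mixed <- c(1, "two", 3)   # 'mixed' is a hypothetical name for this sketch
typeof(mixed)             # "character", the numbers became text
```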
Pipe - a tool in R for expressing a sequence of multiple operations, written as "%>%". It passes the output of one function in as the input to the next function.
The most common data structures (a format for organizing and storing data) in R include Vectors, Data frames, Matrices and Arrays.
A data frame is a collection of columns, similar to a spreadsheet or SQL table. Each column has a name at the top that represents a variable, and each row holds one observation.
file.create("new_csv_file.csv") - create a file.
unlink("some_.file.csv") - delete a file.
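The two file functions can be tried safely with a throwaway file. This is a sketch that assumes you would rather not touch your project folder, so it writes into R's temporary directory; `tmp_file`, `created`, and `existed` are names invented for the example.

```r
# Build a path inside the session's temporary directory
tmp_file <- file.path(tempdir(), "new_csv_file.csv")

created <- file.create(tmp_file)   # returns TRUE when the empty file was created
existed <- file.exists(tmp_file)   # confirm the file is there

unlink(tmp_file)                   # delete the file
file.exists(tmp_file)              # FALSE once the file is gone
```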
A matrix is a two-dimensional collection of data elements. This means it has both rows and columns.
matrix() function - create a matrix
matrix(c(3:8), nrow = 2) - creates a two rows by three columns matrix containing the values 3-8, nrow=2 specifies the number of rows.
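One detail worth seeing in action is the order in which matrix() fills the cells. The sketch below uses base R only; `m` and `m_by_row` are illustrative names, and `byrow = TRUE` is the standard matrix() argument for filling row by row.

```r
# Default: values fill the matrix column by column
m <- matrix(3:8, nrow = 2)
dim(m)      # 2 rows, 3 columns
m[1, ]      # first row: 3 5 7

# byrow = TRUE fills row by row instead
m_by_row <- matrix(3:8, nrow = 2, byrow = TRUE)
m_by_row[1, ]   # first row: 3 4 5
```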
Examples of date/time data in R include a time (06:11:13 UTC), a date (2019-04-16), and a date-time (2018-12-21 16:35:28 UTC).
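As a rough illustration, the date and date-time values above can be stored with base R's as.Date() and as.POSIXct(). This is only one way to do it (the notes do not say which package was used), and the names `a_date` and `a_datetime` are made up for the sketch.

```r
# A calendar date
a_date <- as.Date("2019-04-16")
class(a_date)                      # "Date"

# A date-time with an explicit time zone
a_datetime <- as.POSIXct("2018-12-21 16:35:28", tz = "UTC")

# Pull out just the time-of-day portion as text
format(a_datetime, "%H:%M:%S")     # "16:35:28"

# Dates support arithmetic in whole days
a_date + 30                        # "2019-05-16"
```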
CRAN is a commonly used online archive with R packages and other resources.
data("dataset_name") - to load a dataset.
View(dataset_name) - open the dataset in a spreadsheet-style viewer.
Pipe operator (%>%)
filtered_toothgrowth <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  group_by(supp) %>%
  arrange(len)
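To see what the pipe buys you, it can help to compare it with the same steps written as nested calls. The sketch below assumes the dplyr package is installed and uses the built-in ToothGrowth dataset; it leaves out the group_by() step to keep the comparison short, and `nested` and `piped` are illustrative names.

```r
library(dplyr)

# Nested form: has to be read from the inside out
nested <- arrange(filter(ToothGrowth, dose == 0.5), len)

# Piped form: reads top to bottom, one step per line
piped <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

identical(nested, piped)   # both forms produce the same data frame
nrow(piped)                # rows left after keeping only dose 0.5
```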
head(gold) - returns a preview of the first 6 rows.
str(gold) - returns the structure of the data frame.
colnames(gold) - returns the column names of the data frame.
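The notes do not show where the `gold` data frame comes from, so this sketch runs the same three functions on the built-in ToothGrowth dataset instead. `preview` is a name invented for the example.

```r
data("ToothGrowth")              # load a dataset that ships with R

preview <- head(ToothGrowth)     # first 6 rows by default
nrow(preview)

head(ToothGrowth, n = 3)         # ask for a different number of rows with n
colnames(ToothGrowth)            # "len" "supp" "dose"
str(ToothGrowth)                 # column types plus the first few values
```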
How to create a data frame
Here's how this works. First, create a vector of names:
names <- c("Peter", "Jennifer", "Julie", "Alex")
Then create a vector of ages:
age <- c(15, 19, 21, 25)
With these two vectors, you can create a new data frame called people:
people <- data.frame(names, age)
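Once the data frame exists, a few base R calls are enough to check it and pull values back out. A small sketch that repeats the two vectors so it runs on its own:

```r
names <- c("Peter", "Jennifer", "Julie", "Alex")
age <- c(15, 19, 21, 25)
people <- data.frame(names, age)

dim(people)        # 4 rows, 2 columns
colnames(people)   # column names are taken from the vector names
people$age         # the $ operator extracts one column as a vector
mean(people$age)   # 20

# Keep only the rows where age is above 18
people[people$age > 18, ]
```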
**<u>TIBBLES</u>**
**Tibbles** are a little different from standard data frames. A data frame is a collection of columns, like a spreadsheet or a SQL table. Tibbles are like streamlined data frames that are automatically set to pull up only the first 10 rows of a dataset, and only as many columns as can fit on the screen. Unlike data frames, tibbles never change the names of your variables, or the data types of your inputs.
**as_tibble() function** - Used to create a tibble from existing data. For example, as_tibble(gold)
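The point about variable names can be checked with a quick comparison. This sketch assumes the tibble package is installed (it comes with the tidyverse); the column name `hotel type` and its values are made up purely to show a name containing a space.

```r
library(tibble)

# Same column name, two different constructors
df <- data.frame(`hotel type` = c("City", "Resort"))
tb <- tibble(`hotel type` = c("City", "Resort"))

colnames(df)   # "hotel.type", base R rewrote the name
colnames(tb)   # "hotel type", the tibble kept it as written

# Converting an existing data frame to a tibble
tg <- as_tibble(ToothGrowth)
class(tg)
```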
**data(Indometh)** - load a specific dataset.
**`read_csv()` function** - import data from a .csv in the project folder called "hotel_bookings.csv" and save it as a data frame called `bookings_df`
bookings_df <- read_csv("hotel_bookings.csv")
**<u>Cleaning data example</u>**
Let's say you are primarily interested in the following variables: hotel, is_canceled, and lead_time. Create a new data frame with just those columns, calling it `trimmed_df`.
trimmed_df <- bookings_df %>%
select(hotel, is_canceled, lead_time)
Rename the variable 'hotel' to be named 'hotel_type' to be crystal clear on what the data is about:
trimmed_df %>%
select(hotel, is_canceled, lead_time) %>%
rename(hotel_type = hotel)
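Note that a pipeline only prints its result unless you assign it to a name. The sketch below shows the select-then-rename steps saved back into `trimmed_df`, assuming dplyr is installed. Because the real hotel_bookings.csv is not included here, it uses a three-row stand-in with invented values.

```r
library(dplyr)

# Hypothetical stand-in for the real bookings data
bookings_df <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0),
  lead_time = c(12, 340, 7),
  adults = c(2, 2, 1)
)

# Assign the result so the renamed column is kept
trimmed_df <- bookings_df %>%
  select(hotel, is_canceled, lead_time) %>%
  rename(hotel_type = hotel)   # syntax is new_name = old_name

colnames(trimmed_df)
```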
You can also use the `mutate()` function to make changes to your columns. Let's say you wanted to create a new column that summed up all the adults, children, and babies on a reservation for the total number of people. The code chunk below creates that new column:
example_df <- bookings_df %>%
mutate(guests = adults + children + babies)
head(example_df)
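One thing to watch for with a sum like this is missing values: adding NA to anything gives NA. The sketch below assumes dplyr is installed and uses a tiny made-up data frame (not the real bookings file) in which one `children` value is missing.

```r
library(dplyr)

# Hypothetical rows; the third booking has a missing children count
bookings_df <- data.frame(
  adults = c(2, 2, 1),
  children = c(0, 1, NA),
  babies = c(0, 0, 0)
)

example_df <- bookings_df %>%
  mutate(guests = adults + children + babies)

example_df$guests   # 2 3 NA, the missing value carries through the sum
```

If a total is wanted even when a count is missing, one option is to replace the NA values before summing, but whether that is appropriate depends on the data.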
**<u>Manually create a Data Frame</u>**
id <- c(1:10)
name <- c("John Mendes", "Rob Stewart", "Rachel Abrahamson", "Christy Hickman", "Johnson Harper", "Candace Miller", "Carlson Landy", "Pansy Jordan", "Darius Berry", "Claudia Garcia")
job_title <- c("Professional", "Programmer", "Management", "Clerical", "Developer", "Programmer", "Management", "Clerical", "Developer", "Programmer")
employee <- data.frame(id, name, job_title)
**<u>Import data</u>**
Load the dataset from a .csv file and save it in a data frame called hotel_bookings:
hotel_bookings <- read_csv("hotel_bookings.csv")
head(hotel_bookings) - preview the columns and the first 6 rows.
View(hotel_bookings) - view the data frame.
str(hotel_bookings) or glimpse(hotel_bookings) - see a summary of each column.
arrange(hotel_bookings, desc(lead_time)) - sort the data frame by lead_time in descending order.
max(hotel_bookings$lead_time) - check out the maximum lead_time without sorting the entire data.
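Here is a small sketch of the sorting and summary calls from this list, assuming dplyr is installed. The three rows are invented stand-ins for the real file, and `sorted` is an illustrative name.

```r
library(dplyr)

# Hypothetical stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel"),
  lead_time = c(12, 340, 7)
)

# Sort by lead_time, largest first
sorted <- arrange(hotel_bookings, desc(lead_time))
sorted$lead_time                 # 340 12 7

# Summary values without sorting anything
max(hotel_bookings$lead_time)    # 340
min(hotel_bookings$lead_time)    # 7
```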
**<u>Benefits of Data cleaning in R</u>**
Cleaning data in R is done by applying cleaning functions that read the original data without modifying it. We save the result of our cleaning as a new dataset that holds the changes, so the initial data stays exactly as it was, and the code shows how it was processed from dirty to clean.
This gives us great traceability of our data, a better understanding of the work that has been done for future users and, of course, the possibility of getting back to the original dataset whenever we need it.
On the other hand, cleaning done in spreadsheets usually overwrites the data, so we have to rely on regularly saving versions and documenting the changes to keep tabs on the process.
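The idea that the original data survives the cleaning can be checked directly. This is a base R sketch with a small invented data frame; `raw_df`, `raw_copy`, and `clean_df` are names made up for the illustration.

```r
# Hypothetical messy data: stray spaces and missing values
raw_df <- data.frame(
  name = c(" Ann", "Bob ", NA),
  score = c(90, NA, 75)
)
raw_copy <- raw_df   # snapshot to compare against later

# Cleaning steps are saved under a new name
clean_df <- raw_df[complete.cases(raw_df), ]   # drop rows with any NA
clean_df$name <- trimws(clean_df$name)         # strip surrounding spaces

identical(raw_df, raw_copy)   # TRUE, the original is untouched
nrow(clean_df)                # rows that survived the cleaning
```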
The **sd()**, **cor()**, and **mean()** functions can provide a statistical summary of the dataset using standard deviation, correlation, and mean.
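A quick sketch of the three functions on two small made-up vectors. `y` is built as an exact linear function of `x`, so the correlation should come out as 1 (up to floating-point rounding).

```r
x <- c(2, 4, 6, 8)
y <- 2 * x + 1      # perfectly linear in x

mean(x)      # average of the values: 5
sd(x)        # sample standard deviation
cor(x, y)    # correlation between the two vectors
```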