dplyr is an R package, a collection of functions and data sets that enhance the R language. First you will master the five verbs of R data manipulation with dplyr: select, mutate, filter, arrange and summarise. Next, you will learn how to chain your dplyr operations using the pipe operator of the magrittr package. The final section focuses on working with subsets of your data using the group_by function, and on accessing data stored outside of R in a database. By the end, you will be familiar with tools and techniques that allow you to manipulate data efficiently.
Introduction to dplyr and tbls
Introduction to the dplyr package and the tbl class. Meet the data structures that dplyr uses behind the scenes.
A data frame can be converted to a tbl with as_tibble().
Meet the five verbs. dplyr's grammar is built around five functions, or verbs, that do the basic tasks of data manipulation.
Each verb is simple by itself, but you can combine them to manipulate your data in sophisticated ways. They work best when your data is tidy, that is, when each row is an observation and each column is a variable.
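As a minimal sketch of the tbl class mentioned above, the following converts a data frame to a tbl and inspects it. The built-in mtcars data set and the name cars_tbl are stand-ins chosen for illustration; they do not come from the course material.

```r
library(dplyr)

# mtcars is a built-in data frame, used here only as a stand-in data set
cars_tbl <- as_tibble(mtcars)

# A tbl is still a data frame, with extra classes layered on top
class(cars_tbl)

# glimpse() prints a compact overview of every column
glimpse(cars_tbl)
```

Printing cars_tbl shows only the first rows and the columns that fit on screen, which is the main practical difference from a plain data frame.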
Select and mutate
Inside select(), use : to select a range of variables and - to exclude some variables, similar to indexing a data.frame with square brackets. You can use variable names as well as integer indexes. This call selects the first four variables of a data frame df, except for the second one:
select(df, 1:4, -2)
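A small sketch of that call on a toy data frame may help. The data frame df and its column names a to e are hypothetical, made up for this example.

```r
library(dplyr)

# Hypothetical five-column data frame
df <- tibble(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15)

# By position: the first four columns, minus the second
by_index <- select(df, 1:4, -2)

# By name: the same selection written with column names
by_name <- select(df, a:d, -b)

names(by_index)
names(by_name)
```

Both calls keep the columns a, c and d, which shows that names and integer positions are interchangeable inside select().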
dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:
starts_with("X"): every name that starts with "X",
ends_with("X"): every name that ends with "X",
contains("X"): every name that contains "X",
matches("X"): every name that matches "X", where "X" can be a regular expression,
num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5 (use the width argument for zero-padded names such as x01),
one_of(x): every name that appears in x, which should be a character vector (superseded in recent versions by all_of() and any_of()).
Pay attention here: when you refer to columns directly inside select(), you don't use quotes. If you use the helper functions, you do use quotes.
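The helpers can be tried out on a toy data frame, as in the sketch below. The column names (x1 to x3, dep_time, arr_time, dep_delay) and the vector wanted are hypothetical, chosen only so that each helper has something to match.

```r
library(dplyr)

# Hypothetical data frame with names that exercise each helper
df <- tibble(x1 = 1, x2 = 2, x3 = 3,
             dep_time = 4, arr_time = 5, dep_delay = 6)

starts  <- names(select(df, starts_with("dep")))    # prefix match
ends    <- names(select(df, ends_with("time")))     # suffix match
has     <- names(select(df, contains("_")))         # substring match
regex   <- names(select(df, matches("^x[0-9]$")))   # regular expression
numbers <- names(select(df, num_range("x", 1:2)))   # x1 and x2

# one_of() takes a character vector of names
wanted <- c("x3", "arr_time")
listed <- names(select(df, one_of(wanted)))
```

Note that the arguments to the helpers are quoted strings, while bare column names such as dep_time would be written without quotes.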
mutate() uses the data to build new columns of values; that is, it reveals information that your data set already contains but does not display. To use it, pass the tbl name, then define the new variables that you'd like to create.
Add multiple variables. To create more than one variable, place a comma between each variable that you define inside mutate(). mutate() even allows you to use a new variable while creating the next variable in the same call. In this example, the new variable x is directly reused to create the new variable y:
mutate(my_df, x = a + b, y = x + c)
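To make the call above runnable, the sketch below supplies a small my_df. The values in columns a, b and c are made up for illustration.

```r
library(dplyr)

# Hypothetical input with the three columns the call expects
my_df <- tibble(a = c(1, 2), b = c(10, 20), c = c(100, 200))

# x is created first and is immediately available when y is computed
result <- mutate(my_df, x = a + b, y = x + c)

result
```

The original columns are kept, and x and y are appended to the right in the order they were defined.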
Filter and Arrange
Summarize and the pipe operator
You can use any function inside summarize() so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them:
min(x) - minimum value of vector x.
max(x) - maximum value of vector x.
mean(x) - mean value of vector x.
median(x) - median value of vector x.
quantile(x, p) - pth quantile of vector x.
sd(x) - standard deviation of vector x.
var(x) - variance of vector x.
IQR(x) - interquartile range (IQR) of vector x.
diff(range(x)) - total range of vector x.
dplyr adds several aggregating functions of its own:
first(x) - the first element of vector x.
last(x) - the last element of vector x.
nth(x, n) - the nth element of vector x.
n() - the number of rows in the data.frame or group of observations that summarize() describes.
n_distinct(x) - the number of unique values in vector x.
Group_by and working with databases
Learn to use group_by to group your data into subsets of observations, and use dplyr to access data stored outside of R in a database.
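The sketch below combines group_by() with summarise() and the pipe operator, using a few of the aggregating functions. The flights data frame, its carrier and delay columns, and the values in them are hypothetical stand-ins for a real data set.

```r
library(dplyr)

# Hypothetical flight delays for two carriers
flights <- tibble(
  carrier = c("AA", "AA", "UA", "UA", "UA"),
  delay   = c(10, 20, 5, 15, 40)
)

# Each step receives the result of the previous one through %>%
by_carrier <- flights %>%
  group_by(carrier) %>%              # one group per carrier
  summarise(
    n_flights  = n(),                # rows in each group
    mean_delay = mean(delay),        # base R aggregate
    max_delay  = max(delay),
    first_seen = first(delay)        # dplyr aggregate
  ) %>%
  arrange(desc(mean_delay))          # worst average delay first

by_carrier
```

Without group_by(), the same summarise() call would collapse the whole data frame into a single row. With it, the result has one row per carrier.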
hflights2 is a copy of hflights that is saved as a data table using the following code:
library(data.table)
hflights2 <- as.data.table(hflights)
Although nycflights is a reference to data that lives outside of R, you can use dplyr commands on it as usual. Behind the scenes, dplyr will convert the commands to the database's native language (in this case, SQL) and return the results. This allows you to work with data that is too large to fit in R: only the fraction of the data that you need is actually downloaded into R, and that fraction will usually fit in memory.
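The same idea can be sketched without access to the course database by using an in-memory SQLite database instead. This is an assumption-laden stand-in: it requires the DBI, RSQLite and dbplyr packages to be installed, and the table name "cars" and the mtcars data are illustrative, not the nycflights source used in the course.

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed.
# An in-memory SQLite database stands in for a remote server.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in data set into the database and get a reference to it
copy_to(con, mtcars, "cars")
cars_db <- tbl(con, "cars")

# Ordinary dplyr verbs; nothing is pulled into R yet
query <- cars_db %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, wt)

# Show the SQL that dplyr generated for the database
show_query(query)

# collect() runs the query and downloads only the matching rows
local_result <- collect(query)

DBI::dbDisconnect(con)
```

Only the filtered rows and the three selected columns reach R, which is what makes this approach workable for tables that are far larger than available memory.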