Tidyverse R Cheat Sheet



Approximate time: 75 minutes

The packages in the Tidyverse suite are designed to work together to make common data science operations more user friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, R for Data Science, with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as specific functions for data wrangling with the ‘dplyr’ package, data tidying with the ‘tidyr’ package, and data visualization with the ‘ggplot2’ package.

All of these packages use the same style of code, which is snake_case formatting for all function names and arguments. The tidy style guide is available for perusal.

Adding files to your working directory

We have three files that we need to bring in for this lesson:

  1. A normalized counts file (gene expression counts normalized for library size)
  2. A metadata file corresponding to the samples in our normalized counts dataset
  3. The differential expression results output from our DE analysis using DESeq2

Download the files to the data folder by right-clicking the links below:

  • Normalized counts file: right-click here
  • Differential expression results: right-click here

Choose to Save Link As or Download Linked File As and navigate to your Visualizations-in-R/data folder. You should now see the files appear in the data folder in the RStudio file directory.

Reading in the data files

Let’s read in all of the files we have downloaded:
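A minimal read-in sketch follows; the file names are hypothetical stand-ins for the three files you just downloaded, and we load the tidyverse up front:

```r
library(tidyverse)

# Hypothetical file names -- substitute the names of your downloaded files
normalized_counts <- read.csv("data/normalized_counts.csv", row.names = 1)
metadata          <- read.csv("data/metadata.csv", row.names = 1)
res               <- read.csv("data/DE_results.csv")  # assumed to contain a `gene` column
```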

Tidyverse basics

As it is difficult to change how fundamental base R structures/functions work, the Tidyverse suite of packages create and use data structures, functions and operators to make working with data more intuitive. The two most basic changes are in the use of pipes and tibbles.

Pipes

Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing.

To make R code more human readable, the Tidyverse tools use the pipe, %>%, which was acquired from the ‘magrittr’ package and comes installed automatically with Tidyverse. The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.

NOTE: The RStudio keyboard shortcut for the pipe is Shift + Cmd + M (Mac) or Shift + Ctrl + M (Windows/Linux).

An example of using the pipe to run multiple commands:
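For instance, here is a nested call and its piped equivalent (a toy example, not from the lesson data):

```r
# Nested: must be read inside-out
sqrt(sum(c(1, 4, 9, 16)))

# Piped: reads left to right, one step per line
c(1, 4, 9, 16) %>%
  sum() %>%
  sqrt()
```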

The pipe represents a much easier way of writing and deciphering R code, and we will be taking advantage of it for all future activities.

Exercises

  1. Extract the replicate column from the metadata data frame (use the $ notation) and save the values to a vector named rep_number.

  2. Use the pipe (%>%) to perform two steps in a single line:

    1. Turn rep_number into a factor.
    2. Use the head() function to return the first six values of the rep_number factor.

Tibbles

A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have column names that are not normally allowed, such as numbers/symbols.

Important: tidyverse is very opinionated about row names. These packages insist that all column data (e.g. in a data.frame) be treated equally, and that special designation of a column as rownames should be deprecated. Tibble provides simple utility functions to handle rownames: rownames_to_column() and column_to_rownames(). More help for dealing with row names in tibbles can be found in the tibble package documentation.

Tibbles can be created directly using the tibble() function or data frames can be converted into tibbles using as_tibble(name_of_df).

NOTE: The function as_tibble() will ignore row names, so if a column representing the row names is needed, then the function rownames_to_column(name_of_df) should be run prior to turning the data.frame into a tibble. Also, as_tibble() will not coerce character vectors to factors by default.

Exercises

  1. Create a tibble called df_tibble using the tibble() function to combine the vectors species and glengths.

  2. Change the metadata data frame to a tibble called meta_tibble. Use the rownames_to_column() function to preserve the rownames combined with using %>% and the as_tibble() function.

Differences between tibbles and data.frames

The main differences between tibbles and data.frames relate to printing and subsetting.

Printing

A nice feature of a tibble is that when printing a variable to screen, it will show only the first 10 rows and the columns that fit to the screen by default. This is nice since you don’t have to specify head to take a quick look at your dataset. If it is desirable to view more of the dataset, the print() function can change the number of rows or columns displayed.
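For example, assuming the meta_tibble created in the exercise above:

```r
meta_tibble                  # shows the first 10 rows by default
print(meta_tibble, n = 20)   # ask for 20 rows instead
```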

Subsetting

When subsetting base R data.frames the default behavior is to simplify the output to the simplest data structure. Therefore, if subsetting a single column from a data.frame, R will output a vector (unless drop=FALSE is specified). In contrast, subsetting a single column of a tibble will by default return another tibble, not a vector.

Due to this behavior, some older functions do not work with tibbles, so if you need to convert a tibble to a data.frame, the function as.data.frame(name_of_tibble) will easily convert it.

Also note that if you use piping to subset a data frame, then the notation is slightly different, requiring a placeholder . prior to the [ ] or $.
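A sketch of both behaviors, assuming the metadata data frame and meta_tibble from above, each with a replicate column:

```r
# Base data.frame: single-column subsetting simplifies to a vector
metadata[, "replicate"]

# Tibble: the same subsetting returns a one-column tibble
meta_tibble[, "replicate"]

# In a pipe, use the `.` placeholder before [ ] or $
meta_tibble %>% .$replicate
meta_tibble %>% .[["replicate"]]
```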

Tidyverse tools

While all of the tools in the Tidyverse suite are deserving of being explored in more depth, we are going to investigate only the tools we will be using most for data wrangling and tidying.

Dplyr

The most useful tool in the tidyverse is dplyr. It’s a Swiss Army knife for data wrangling. dplyr has many handy functions that we recommend incorporating into your analysis:

  • select() extracts columns and returns a tibble.
  • arrange() changes the ordering of the rows.
  • filter() picks cases based on their values.
  • mutate() adds new variables that are functions of existing variables.
  • rename() changes the name of one or more columns.
  • summarise() reduces multiple values down to a single summary.
  • pull() extracts a single column as a vector.
  • _join() a group of functions that merge two data frames together: inner_join(), left_join(), right_join(), and full_join().

Note: dplyr underwent a massive revision in 2017, switching versions from 0.5 to 0.7. If you consult other dplyr tutorials online, note that many materials developed prior to 2017 are no longer correct. In particular, this applies to writing functions with dplyr (see Notes section below).

select()

To extract columns from a tibble we can use select().

Conversely, you can remove columns you don’t want with negative selection.
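A sketch of both forms, using the res results table from above; the column names are assumptions typical of DESeq2 output:

```r
# Keep only the named columns
res %>% select(gene, baseMean, log2FoldChange, padj)

# Negative selection drops columns instead
res %>% select(-lfcSE)
```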

arrange()

Note that the rows are sorted by the gene symbol. Let’s fix that and sort them by adjusted P value instead with arrange().
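For example (padj is an assumed column name):

```r
# Ascending adjusted P value
res %>% arrange(padj)

# desc() sorts in descending order instead
res %>% arrange(desc(baseMean))
```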

filter()

Let’s keep only genes that are expressed (baseMean above 0) with an adjusted P value below 0.01. You can perform multiple filter() operations together in a single command.
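A sketch of that combined filter:

```r
# Comma-separated conditions are combined with AND
res %>% filter(baseMean > 0, padj < 0.01)
```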

mutate()

mutate() enables you to create a new column from an existing column. Let’s generate log10 calculations of our baseMeans for each gene.
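A sketch:

```r
# Add a log10-transformed copy of baseMean as a new column
res %>% mutate(log10_baseMean = log10(baseMean))
```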

rename()

You can quickly rename an existing column with rename(). The syntax is new_name = old_name.
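A sketch, assuming the results table has a gene column:

```r
# new_name = old_name
res %>% rename(symbol = gene)
```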

summarise()

You can perform column summarization operations with summarise().

Advanced: summarise() is particularly powerful in combination with the group_by() function, which allows you to group related rows together.
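Two sketches, using the assumed columns from above:

```r
# A single summary value for the whole table
res %>% summarise(mean_baseMean = mean(baseMean))

# Grouped: one summary row per replicate group
meta_tibble %>%
  group_by(replicate) %>%
  summarise(n_samples = n())
```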

Note: summarize() also works if you prefer to use American English. This applies across the board to tidyverse functions, including in ggplot2 (e.g. color in place of colour).

pull()

In the recent dplyr 0.7 update, pull() was added as a quick way to access column data as a vector. This is very handy in chain operations with the pipe operator.
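A sketch, continuing with the assumed res columns:

```r
# Filter, then pull out a plain character vector of gene names
res %>%
  filter(padj < 0.01) %>%
  pull(gene)
```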

_join()

Dplyr has a powerful group of join operations, which join together a pair of data frames based on a variable or set of variables present in both data frames that uniquely identify all observations. These variables are called keys.

  • inner_join: Only the rows with keys present in both datasets will be joined together.

  • left_join: Keeps all the rows from the first dataset, regardless of whether they appear in the second, and joins the rows of the second that have matching keys in the first.

  • right_join: Keeps all the rows from the second dataset, regardless of whether they appear in the first, and joins the rows of the first that have matching keys in the second.

  • full_join: Keeps all rows in both datasets. Rows without matching keys will have NA values for those variables from the other dataset.

To practice with the join functions, we can use a couple of built-in R datasets.
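For instance, the small band_members and band_instruments tables that ship with dplyr (0.7+) share a name key:

```r
library(dplyr)

inner_join(band_members, band_instruments, by = "name")  # only matching names
left_join(band_members, band_instruments, by = "name")   # all of band_members
right_join(band_members, band_instruments, by = "name")  # all of band_instruments
full_join(band_members, band_instruments, by = "name")   # everything, NA-padded
```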

Tidyr

The purpose of Tidyr is to have well-organized or tidy data, which Tidyverse defines as having:

  1. Each variable in a column
  2. Each observation in a row
  3. Each value as a cell

There are two main functions in Tidyr, gather() and spread(). These functions allow for conversion between long data format and wide data format. The downstream use of the data will determine which format is required.

gather()

The gather() function changes a wide data format into a long data format. This function is particularly helpful when using ‘ggplot2’ to get all of the values to plot into a single column.

To use this function, you need to give the columns in the data frame you would like to gather together as a single column. Then, provide a name to give the column where all of the column names will be present using the key argument, and the name to give the column where all of the values will be present using the value argument.
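A sketch with the normalized counts, assuming the genes are stored as row names:

```r
# Move row names into a `gene` column, then gather the sample columns
long_counts <- normalized_counts %>%
  rownames_to_column(var = "gene") %>%
  gather(key = "samplename", value = "normalized_count", -gene)
```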

spread()

The spread() function is the reverse of the gather() function. The categories of the key column will become separate columns, and the values in the value column split across the associated key columns.
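A sketch that reverses the gather() above:

```r
# One column per sample again, filled with the normalized counts
wide_counts <- long_counts %>%
  spread(key = samplename, value = normalized_count)
```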

Programming notes

Underneath the hood, tidyverse packages build upon the base R language using rlang, which is a complete rework of how functions handle variable names and evaluate arguments. This is achieved through the tidyeval framework, which interprets command operations using tidy evaluation. This is outside the scope of the course, but it is explained in detail in the Programming with dplyr vignette, in case you’d like to understand how these new tools behave differently from base R.

Source: vignettes/dbplyr.Rmd

As well as working with local in-memory data stored in data frames, dplyr also works with remote on-disk data stored in databases. This is particularly useful in two scenarios:

  • Your data is already in a database.

  • You have so much data that it does not all fit into memory simultaneously and you need to use some external storage engine.

(If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating.)

This vignette focuses on the first scenario because it’s the most common. If you’re using R to do data analysis inside a company, most of the data you need probably already lives in a database (it’s just a matter of figuring out which one!). However, you will learn how to load data in to a local database in order to demonstrate dplyr’s database tools. At the end, I’ll also give you a few pointers if you do need to set up your own database.

Getting started

To use databases with dplyr you need to first install dbplyr:
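```r
install.packages("dbplyr")
```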


You’ll also need to install a DBI backend package. The DBI package provides a common interface that allows dplyr to work with many different databases using the same code. DBI is automatically installed with dbplyr, but you need to install a specific backend for the database that you want to connect to.

Five commonly used backends are:

  • RMariaDB connects to MySQL and MariaDB.

  • RPostgres connects to Postgres and Redshift.

  • RSQLite embeds a SQLite database.

  • odbc connects to many commercial databases via the Open Database Connectivity protocol.

  • bigrquery connects to Google’s BigQuery.

If the database you need to connect to is not listed here, you’ll need to do some investigation (i.e. googling) yourself.

In this vignette, we’re going to use the RSQLite backend, which is automatically installed when you install dbplyr. SQLite is a great way to get started with databases because it’s completely embedded inside an R package. Unlike most other systems, you don’t need to set up a separate database server. SQLite is great for demos, but it is also surprisingly powerful, and with a little practice you can use it to easily work with many gigabytes of data.

Connecting to the database

To work with a database in dplyr, you must first connect to it using DBI::dbConnect(). We’re not going to go into the details of the DBI package here, but it’s the foundation upon which dbplyr is built. You’ll need to learn more about it if you need to do things to the database that are beyond the scope of dplyr.

The arguments to DBI::dbConnect() vary from database to database, but the first argument is always the database backend. It’s RSQLite::SQLite() for RSQLite, RMariaDB::MariaDB() for RMariaDB, RPostgres::Postgres() for RPostgres, odbc::odbc() for odbc, and bigrquery::bigquery() for BigQuery. SQLite only needs one other argument: the path to the database. Here we use the special string ':memory:' which causes SQLite to make a temporary in-memory database.
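Following that pattern:

```r
library(dplyr)

# A temporary, in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
```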

Most existing databases don’t live in a file, but instead live on another server. That means that, in real life, your code will look more like this:
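The host and user below are placeholders:

```r
con <- DBI::dbConnect(RMariaDB::MariaDB(),
  host = "database.mycompany.com",   # placeholder host
  user = "hadley",                   # placeholder user
  password = rstudioapi::askForPassword("Database password")
)
```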

(If you’re not using RStudio, you’ll need some other way to securely retrieve your password. You should never record it in your analysis scripts or type it into the console. Securing Credentials provides some best practices.)

Our temporary database has no data in it, so we’ll start by copying over nycflights13::flights using the convenient copy_to() function. This is a quick and dirty way of getting data into a database and is useful primarily for demos and other small jobs.
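A sketch of the copy, including indexes on the table:

```r
copy_to(con, nycflights13::flights, "flights",
  temporary = FALSE,
  indexes = list(
    c("year", "month", "day"),  # a compound index on the date columns
    "carrier",
    "tailnum",
    "dest"
  )
)
```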

As you can see, the copy_to() operation has an additional argument that allows you to supply indexes for the table. Here we set up indexes that will allow us to quickly process the data by day, carrier, plane, and destination. Creating the right indices is key to good database performance, but is unfortunately beyond the scope of this article.

Now that we’ve copied the data, we can use tbl() to take a reference to it:
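```r
flights_db <- tbl(con, "flights")
```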


When you print it out, you’ll notice that it mostly looks like a regular tibble:
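(The exact header varies by dbplyr version.)

```r
flights_db
# Prints like a tibble, except the header flags the remote source,
# along the lines of:
#   # Source:   table<flights> [?? x 19]
#   # Database: sqlite [:memory:]
```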

The main difference is that you can see that it’s a remote source in a SQLite database.

Generating queries

To interact with a database you usually use SQL, the Structured Query Language. SQL is over 40 years old, and is used by pretty much every database in existence. The goal of dbplyr is to automatically generate SQL for you so that you’re not forced to use it. However, SQL is a very large language and dbplyr doesn’t do everything. It focusses on SELECT statements, the SQL you write most often as an analyst.


Most of the time you don’t need to know anything about SQL, and you can continue to use the dplyr verbs that you’re already familiar with:
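For example, all of these run against the database, not in local memory:

```r
flights_db %>% select(year:day, dep_delay, arr_delay)

flights_db %>% filter(dep_delay > 240)

flights_db %>%
  group_by(dest) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))
```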

However, in the long run, I highly recommend you at least learn the basics of SQL. It’s a valuable skill for any data scientist, and it will help you debug problems with dplyr’s automatic translation. If you’re completely new to SQL you might start with this Codecademy tutorial. If you have some familiarity with SQL and you’d like to learn more, I found how indexes work in SQLite and 10 easy steps to a complete understanding of SQL to be particularly helpful.

The most important difference between ordinary data frames and remote database queries is that your R code is translated into SQL and executed in the database on the remote server, not in R on your local machine. When working with databases, dplyr tries to be as lazy as possible:

  • It never pulls data into R unless you explicitly ask for it.

  • It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.

For example, take the following code:
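A sketch of such a lazy pipeline:

```r
# Builds up a query; nothing is sent to the database yet
tailnum_delay <- flights_db %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(desc(delay)) %>%
  filter(n > 100)
```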

Surprisingly, this sequence of operations never touches the database. It’s not until you ask for the data (e.g. by printing tailnum_delay) that dplyr generates the SQL and requests the results from the database. Even then it tries to do as little work as possible and only pulls down a few rows.

Behind the scenes, dplyr is translating your R code into SQL. You can see the SQL it’s generating with show_query():
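```r
tailnum_delay %>% show_query()
```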

If you’re familiar with SQL, this probably isn’t exactly what you’d write by hand, but it does the job. You can learn more about the SQL translation in vignette('translation-verb') and vignette('translation-function').

Typically, you’ll iterate a few times before you figure out what data you need from the database. Once you’ve figured it out, use collect() to pull all the data down into a local tibble:
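```r
# Executes the query and brings the results into a local tibble
tailnum_delay_local <- tailnum_delay %>% collect()
```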

collect() requires that the database do some work, so it may take a long time to complete. Otherwise, dplyr tries to prevent you from accidentally performing expensive query operations:

  • Because there’s generally no way to determine how many rows a query will return unless you actually run it, nrow() is always NA.

  • Because you can’t find the last few rows without executing the whole query, you can’t use tail().

You can also ask the database how it plans to execute the query with explain(). The output is database dependent, and can be esoteric, but learning a bit about it can be very useful because it helps you understand if the database can execute the query efficiently, or if you need to create new indices.
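```r
# Show the database's query plan (output is database dependent)
tailnum_delay %>% explain()
```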


Creating your own database

If you don’t already have a database, here’s some advice from my experiences setting up and running all of them. SQLite is by far the easiest to get started with. PostgreSQL is not too much harder to use and has a wide range of built-in functions. In my opinion, you shouldn’t bother with MySQL/MariaDB: it’s a pain to set up, the documentation is subpar, and it’s less featureful than Postgres. Google BigQuery might be a good fit if you have very large data, or if you’re willing to pay (a small amount of) money to someone who’ll look after your database.

All of these databases follow a client-server model: a client computer connects to the database, and a server computer runs it (the two may be one and the same, but usually aren’t). Getting one of these databases up and running is beyond the scope of this article, but there are plenty of tutorials available on the web.

MySQL/MariaDB

In terms of functionality, MySQL lies somewhere between SQLite and PostgreSQL. It provides a wider range of built-in functions than SQLite, and it gained support for window functions in 2018.

PostgreSQL

PostgreSQL is a considerably more powerful database than SQLite. It has a much wider range of built-in functions, and is generally a more featureful database.


BigQuery

BigQuery is a hosted database server provided by Google. To connect, you need to provide your project, dataset, and optionally a project for billing (if billing isn’t enabled for your main project).

It provides a similar set of functions to Postgres and is designed specifically for analytic workflows. Because it’s a hosted solution, there’s no setup involved, but if you have a lot of data, getting it to Google can be an ordeal (especially because upload support from R is not great currently). (If you have lots of data, you can ship hard drives!)




