This post compares common data manipulation operations in dplyr and data.table.
For new-comers to R who are not aware, there are many ways to do the same thing in R. Depending on the purpose of the code (readability vs creating functions) and the size of the data, I for one often find myself switching from one flavour (or dialect) of R data manipulation to another. Generally, I prefer the dplyr style for its readability and intuitiveness (for myself), data.table for its speed in grouping and summarising operations,1 and base R when I am writing functions. This is by no means the R community consensus by the way (perfectly aware that I am venturing into a total minefield),2 but is more of a representation of how I personally navigate the messy (but awesome) R world.
In this post, I am going to list out some of the most common data manipulations in both styles:
The dplyr package. The package dplyr provides easy tools for the most common data manipulation tasks. It can work with gigantic databases which enables to conduct queries directly, and pull back into R only what you need for analysis.
- In addition to data frames/tibbles, dplyr makes working with other computational backends accessible and efficient. Below is a list of alternative backends: dtplyr: for large, in-memory datasets. Translates your dplyr code to high performance data.table code. Dbplyr: for data stored in a relational database. Translates your dplyr code to SQL.
- 8 Week 7: The dplyr package. 8.1 What is dplyr? 8.2 Cheat Sheet; 8.3 Tibbles; 8.4 Select columns with select 8.5 Filter rows with filter 8.6 Add new variables with mutate 8.7 Arrange rows with arrange 8.8 Grouped summaries with summarize 8.9 The pipe operator%% 8.10 Style Guide; 8.11 Grouped summaries with summarize (continue) 8.
- Data Wrangling with dplyr and tidyr Cheat Sheet Data Wrangling with dplyr and tidyr Cheat Sheet Scipy SciPy builds on the NumPy array object and is part of the NumPy stack which includes tools like Matplotlib, pandas and SymPy, and an expanding set of scientific computing libraries.
group_by()
,summarise()
(a single column)group_by()
,summarise_at()
(multiple columns)filter()
,mutate()
mutate_at()
(changing multiple columns)- Row-wise operations
- Vectorised multiple if-else (
case_when()
) - Function-writing: referencing a column with string
There is a vast amount of resources out there on the internet on the comparison of dplyr and data.table. For those who love to get into the details, I would really recommend Atrebas’s seminal blog post that gives a comprehensive tour of dplyr and data.table, comparing the code side-by-side. I would also recommend this comparison of the three R dialects by Jason Mercer, which not only includes base R in its comparison, but also goes into a fair bit of detail on elements such as piping/chaining (%>%
). There’s also a very excellent cheat sheet from DataCamp, linked here.
Why write a new blog post then, you ask? One key (selfish / self-centred) reason is that I myself often refer to my blog for an aide-memoire on how to do a certain thing in R, and my notes are optimised to only contain my most frequently used code. They also contain certain idiosyncracies in the way that I code (e.g. using pipes with data.table), which I’d like to be upfront about - and would at the same time very much welcome any discussion on it. It is perhaps also justifiable that I at least attempted to build on and unify the work of others in this post, which I have argued as what is ultimately important in relation of duplicated R artefacts.
Rambling on… so here we go!
To make it easy to reproduce results, I am going to just stick to the good ol’ mtcars and iris datasets which come shipped with R. I will also err on the side of verbosity and load the packages at the beginning of each code chunk, as if each code chunk is its own independent R session.
- Analysis: Maximum MPG (
mpg
) value for each cylinder type in the mtcars dataset. - Operations: Summarise with the
max()
function by group.
To group by and summarise values, you would run something like this in dplyr:
You could do the same in data.table, and still use magrittr pipes:
- Analysis: Average mean value for
Sepal.Width
andSepal.Length
for each irisSpecies
in the iris dataset. - Operations: Summarise with the
mean()
function by group.
Note: this is slightly different from the scenario above because the “summarisation” is applied to multiple columns.
In dplyr:
In data.table with pipes:
- Analysis: Find out what the multiple of
Sepal.Width
andSepal.Length
would be for the iris speciessetosa
. - Operations: Filter by
Species'setosa'
and create a new column calledSepal_Index
.
In dplyr:
In data.table:
- Analysis: Multiply
Sepal.Width
andSepal.Length
by 100. - Operations: As above
In dplyr:
In data.table with pipes:
This is always an awkward one, even for dplyr. For this, I will list a couple of options for row-wise calculations.
- Analysis: Create a
TotalSize
column by summing all four columns ofSepal.Length
,Sepal.Width
,Petal.Length
, andPetal.Width
. - Operations: As above
In dplyr:
In data.table with pipes:
- Analysis: Classify an
Age
into different categories - Operations: Create a new column called
AgeLabel
based on theAge
variable
In dplyr:
In data.table:
One thing to note is that there are two options here - Option 2 with and Option 1 without using magrittr pipes. The reason why Option 1 is possible without any assignment (<-
) is because of reference semantics in data.table. When :=
is used in data.table, a change is made to the data.table object via ‘modify by reference’, without creating a copy of the data.table object; when you assign it to a new object, that is referred to as ‘modify by copy’.
As Tyson Barrett nicely summarises, this ‘modifying by reference’ behaviour in data.table is partly what makes it efficient, but can be surprising if you do not expect or understand it; however, the good news is that data.table gives you the option whether to modify by reference or by making a copy.
Data Wrangling Dplyr Cheat Sheet
- Requirement: Create a function that will multiply a column by three. A string should be supplied to the argument to specify the column to be multiplied. The function returns the original data frame with the modified column.
Rstudio Dplyr Cheat Sheet
Here, I intentionally name the packages explicitly within the function and not load them, as it’s best practice for functions to be able to run on their own without loading in an entire library.
In dplyr:
In data.table:
(See https://stackoverflow.com/questions/45982595/r-using-get-and-data-table-within-a-user-defined-function)
This is it! For anything with greater detail, please consult the blogs and cheat sheets I recommended at the beginning of this blog post. I’d say this covers 65% (not a strictly empirical statistic) of my needs for data manipulation, so I hope this is of some help to you. (The gather()
vs melt()
vs pivot_longer()
subject is a whole other beast, and ought to be dealt with in another post)
Elio Campitelli has an [excellent blog post] on Why I love data.table, which is a nice short piece on why data.table is pretty awesome.↩︎
As noted in the DS4PS blog, the debate of dplyr versus data.table has resulted in “Twitter clashes, and even became an inspiration for memes.”↩︎