Row operations
Featured in 4 main posts
This page briefly explains the principles of row operations on a data table, using the data.table syntax. For a general primer on data tables, check the relevant page.
The information of a data table is accessed with square brackets next to its name (e.g. table.name[]). The first argument inside square brackets designates which rows and in which order they’re selected or displayed.
This argument can be one of two vector types:
-
logical(i.e.TRUE/FALSEvalues): Its length must be the same as the total number of rows. Selects the rows for which the value isTRUE. Very convenient if we want to select rows based on a condition. -
integer: Selects rows based on row position. Can have any length.
If the first argument is left blank, then no filter is applied and all rows are selected. To leave the first element blank, simply type a comma as the first character in square brackets and move on the next arguments (table.name[,...]).
Example data
Let’s use one of R’s built-in data sets, called iris. In order to use data.table’s syntax, we will first convert it from data.frame to data.table.
library(data.table)
iris.dt <- as.data.table(iris)
iris.dt
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.2 setosa
## 2: 4.9 3.0 1.4 0.2 setosa
## 3: 4.7 3.2 1.3 0.2 setosa
## 4: 4.6 3.1 1.5 0.2 setosa
## 5: 5.0 3.6 1.4 0.2 setosa
## ---
## 146: 6.7 3.0 5.2 2.3 virginica
## 147: 6.3 2.5 5.0 1.9 virginica
## 148: 6.5 3.0 5.2 2.0 virginica
## 149: 6.2 3.4 5.4 2.3 virginica
## 150: 5.9 3.0 5.1 1.8 virginica
Filter rows
Condition-based filtering
Simple conditions
Suppose we want to select the observations for which sepal length is higher than the median sepal length. In other words, the filter we need is a comparison, which we can perform with a logical expression:
iris.dt[Sepal.Length > median(Sepal.Length)]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 7.0 3.2 4.7 1.4 versicolor
## 2: 6.4 3.2 4.5 1.5 versicolor
## 3: 6.9 3.1 4.9 1.5 versicolor
## 4: 6.5 2.8 4.6 1.5 versicolor
## 5: 6.3 3.3 4.7 1.6 versicolor
## 6: 6.6 2.9 4.6 1.3 versicolor
## 7: 5.9 3.0 4.2 1.5 versicolor
## 8: 6.0 2.2 4.0 1.0 versicolor
## 9: 6.1 2.9 4.7 1.4 versicolor
## 10: 6.7 3.1 4.4 1.4 versicolor
## 11: 6.2 2.2 4.5 1.5 versicolor
## 12: 5.9 3.2 4.8 1.8 versicolor
## 13: 6.1 2.8 4.0 1.3 versicolor
## 14: 6.3 2.5 4.9 1.5 versicolor
## 15: 6.1 2.8 4.7 1.2 versicolor
## 16: 6.4 2.9 4.3 1.3 versicolor
## 17: 6.6 3.0 4.4 1.4 versicolor
## 18: 6.8 2.8 4.8 1.4 versicolor
## 19: 6.7 3.0 5.0 1.7 versicolor
## 20: 6.0 2.9 4.5 1.5 versicolor
## 21: 6.0 2.7 5.1 1.6 versicolor
## 22: 6.0 3.4 4.5 1.6 versicolor
## 23: 6.7 3.1 4.7 1.5 versicolor
## 24: 6.3 2.3 4.4 1.3 versicolor
## 25: 6.1 3.0 4.6 1.4 versicolor
## 26: 6.2 2.9 4.3 1.3 versicolor
## 27: 6.3 3.3 6.0 2.5 virginica
## 28: 7.1 3.0 5.9 2.1 virginica
## 29: 6.3 2.9 5.6 1.8 virginica
## 30: 6.5 3.0 5.8 2.2 virginica
## 31: 7.6 3.0 6.6 2.1 virginica
## 32: 7.3 2.9 6.3 1.8 virginica
## 33: 6.7 2.5 5.8 1.8 virginica
## 34: 7.2 3.6 6.1 2.5 virginica
## 35: 6.5 3.2 5.1 2.0 virginica
## 36: 6.4 2.7 5.3 1.9 virginica
## 37: 6.8 3.0 5.5 2.1 virginica
## 38: 6.4 3.2 5.3 2.3 virginica
## 39: 6.5 3.0 5.5 1.8 virginica
## 40: 7.7 3.8 6.7 2.2 virginica
## 41: 7.7 2.6 6.9 2.3 virginica
## 42: 6.0 2.2 5.0 1.5 virginica
## 43: 6.9 3.2 5.7 2.3 virginica
## 44: 7.7 2.8 6.7 2.0 virginica
## 45: 6.3 2.7 4.9 1.8 virginica
## 46: 6.7 3.3 5.7 2.1 virginica
## 47: 7.2 3.2 6.0 1.8 virginica
## 48: 6.2 2.8 4.8 1.8 virginica
## 49: 6.1 3.0 4.9 1.8 virginica
## 50: 6.4 2.8 5.6 2.1 virginica
## 51: 7.2 3.0 5.8 1.6 virginica
## 52: 7.4 2.8 6.1 1.9 virginica
## 53: 7.9 3.8 6.4 2.0 virginica
## 54: 6.4 2.8 5.6 2.2 virginica
## 55: 6.3 2.8 5.1 1.5 virginica
## 56: 6.1 2.6 5.6 1.4 virginica
## 57: 7.7 3.0 6.1 2.3 virginica
## 58: 6.3 3.4 5.6 2.4 virginica
## 59: 6.4 3.1 5.5 1.8 virginica
## 60: 6.0 3.0 4.8 1.8 virginica
## 61: 6.9 3.1 5.4 2.1 virginica
## 62: 6.7 3.1 5.6 2.4 virginica
## 63: 6.9 3.1 5.1 2.3 virginica
## 64: 6.8 3.2 5.9 2.3 virginica
## 65: 6.7 3.3 5.7 2.5 virginica
## 66: 6.7 3.0 5.2 2.3 virginica
## 67: 6.3 2.5 5.0 1.9 virginica
## 68: 6.5 3.0 5.2 2.0 virginica
## 69: 6.2 3.4 5.4 2.3 virginica
## 70: 5.9 3.0 5.1 1.8 virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
In this case, we use the condition Sepal.Length > median(Sepal.Length). It is tested for all rows, and the rows that satisfy the condition (TRUE) are selected.
Note: Row numbers are not preserved when filtering. This is because
data.tabledoesn’t store row numbers in memory. The row numbers we see on the left are just for display purposes.
Combined conditions
We can combine logical expressions to make more complex filters. If we want to select the rows of species virginica where sepal length is higher than the grand median (median of all observations), we can combine the two expressions with & - which is the AND operator in R.
iris.dt[Sepal.Length > median(Sepal.Length) & Species == "virginica"]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 6.3 3.3 6.0 2.5 virginica
## 2: 7.1 3.0 5.9 2.1 virginica
## 3: 6.3 2.9 5.6 1.8 virginica
## 4: 6.5 3.0 5.8 2.2 virginica
## 5: 7.6 3.0 6.6 2.1 virginica
## 6: 7.3 2.9 6.3 1.8 virginica
## 7: 6.7 2.5 5.8 1.8 virginica
## 8: 7.2 3.6 6.1 2.5 virginica
## 9: 6.5 3.2 5.1 2.0 virginica
## 10: 6.4 2.7 5.3 1.9 virginica
## 11: 6.8 3.0 5.5 2.1 virginica
## 12: 6.4 3.2 5.3 2.3 virginica
## 13: 6.5 3.0 5.5 1.8 virginica
## 14: 7.7 3.8 6.7 2.2 virginica
## 15: 7.7 2.6 6.9 2.3 virginica
## 16: 6.0 2.2 5.0 1.5 virginica
## 17: 6.9 3.2 5.7 2.3 virginica
## 18: 7.7 2.8 6.7 2.0 virginica
## 19: 6.3 2.7 4.9 1.8 virginica
## 20: 6.7 3.3 5.7 2.1 virginica
## 21: 7.2 3.2 6.0 1.8 virginica
## 22: 6.2 2.8 4.8 1.8 virginica
## 23: 6.1 3.0 4.9 1.8 virginica
## 24: 6.4 2.8 5.6 2.1 virginica
## 25: 7.2 3.0 5.8 1.6 virginica
## 26: 7.4 2.8 6.1 1.9 virginica
## 27: 7.9 3.8 6.4 2.0 virginica
## 28: 6.4 2.8 5.6 2.2 virginica
## 29: 6.3 2.8 5.1 1.5 virginica
## 30: 6.1 2.6 5.6 1.4 virginica
## 31: 7.7 3.0 6.1 2.3 virginica
## 32: 6.3 3.4 5.6 2.4 virginica
## 33: 6.4 3.1 5.5 1.8 virginica
## 34: 6.0 3.0 4.8 1.8 virginica
## 35: 6.9 3.1 5.4 2.1 virginica
## 36: 6.7 3.1 5.6 2.4 virginica
## 37: 6.9 3.1 5.1 2.3 virginica
## 38: 6.8 3.2 5.9 2.3 virginica
## 39: 6.7 3.3 5.7 2.5 virginica
## 40: 6.7 3.0 5.2 2.3 virginica
## 41: 6.3 2.5 5.0 1.9 virginica
## 42: 6.5 3.0 5.2 2.0 virginica
## 43: 6.2 3.4 5.4 2.3 virginica
## 44: 5.9 3.0 5.1 1.8 virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Similarly, we can select the rows of species setosa where sepal length is higher than the grand median:
iris.dt[Sepal.Length > median(Sepal.Length) & Species == "setosa"]
## Empty data.table (0 rows and 5 cols): Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
This returns an empty data table, as there are no rows that satisfy this condition - for all setosa observations, sepal length is below the grand median.
See more details on logical operations.
Position-based filtering
This is not very common to select rows based on position, but if needed, it’s quite easy. All we need is a numeric vector of the row positions we want.
For example, to select the 2nd row:
iris.dt[2]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 4.9 3 1.4 0.2 setosa
To select a range of adjacent rows (e.g. from the 3rd to the 10th row), we need a sequence:
iris.dt[3:10]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 4.7 3.2 1.3 0.2 setosa
## 2: 4.6 3.1 1.5 0.2 setosa
## 3: 5.0 3.6 1.4 0.2 setosa
## 4: 5.4 3.9 1.7 0.4 setosa
## 5: 4.6 3.4 1.4 0.3 setosa
## 6: 5.0 3.4 1.5 0.2 setosa
## 7: 4.4 2.9 1.4 0.2 setosa
## 8: 4.9 3.1 1.5 0.1 setosa
To select rows with a periodicity (e.g. every 10th row, starting from the 1st), we again need a sequence:
iris.dt[seq(from=1,to=.N,by=10)]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.2 setosa
## 2: 5.4 3.7 1.5 0.2 setosa
## 3: 5.4 3.4 1.7 0.2 setosa
## 4: 4.8 3.1 1.6 0.2 setosa
## 5: 5.0 3.5 1.3 0.3 setosa
## 6: 7.0 3.2 4.7 1.4 versicolor
## 7: 5.0 2.0 3.5 1.0 versicolor
## 8: 5.9 3.2 4.8 1.8 versicolor
## 9: 5.5 2.4 3.8 1.1 versicolor
## 10: 5.5 2.6 4.4 1.2 versicolor
## 11: 6.3 3.3 6.0 2.5 virginica
## 12: 6.5 3.2 5.1 2.0 virginica
## 13: 6.9 3.2 5.7 2.3 virginica
## 14: 7.4 2.8 6.1 1.9 virginica
## 15: 6.7 3.1 5.6 2.4 virginica
Note:
.Nis a special symbol indata.table, counting the number of rows.
Sort rows
Display sorted rows
Suppose we want to display the iris data table with the plants sorted by petal length in ascending order. We can take advantage of order, an R built-in function which returns the ordered position of elements (also works for sorting character elements in alphabetical order).
iris.dt[order(Petal.Length)]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 4.6 3.6 1.0 0.2 setosa
## 2: 4.3 3.0 1.1 0.1 setosa
## 3: 5.8 4.0 1.2 0.2 setosa
## 4: 5.0 3.2 1.2 0.2 setosa
## 5: 4.7 3.2 1.3 0.2 setosa
## ---
## 146: 7.9 3.8 6.4 2.0 virginica
## 147: 7.6 3.0 6.6 2.1 virginica
## 148: 7.7 3.8 6.7 2.2 virginica
## 149: 7.7 2.8 6.7 2.0 virginica
## 150: 7.7 2.6 6.9 2.3 virginica
To display the same table sorted by petal length in descending order, we can simply use a negative sign:
iris.dt[order(-Petal.Length)]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 7.7 2.6 6.9 2.3 virginica
## 2: 7.7 3.8 6.7 2.2 virginica
## 3: 7.7 2.8 6.7 2.0 virginica
## 4: 7.6 3.0 6.6 2.1 virginica
## 5: 7.9 3.8 6.4 2.0 virginica
## ---
## 146: 4.4 3.2 1.3 0.2 setosa
## 147: 5.8 4.0 1.2 0.2 setosa
## 148: 5.0 3.2 1.2 0.2 setosa
## 149: 4.3 3.0 1.1 0.1 setosa
## 150: 4.6 3.6 1.0 0.2 setosa
Note: This preserves the original table in memory and merely displays a reordered version of it.
To better understand what exactly order does, let’s try a simple alphabetical order example:
# Create a character vector
x <- c("C","A","B")
# Display the proper alphabetical order of its elements
order(x)
## [1] 2 3 1
This means that:
-
the 2nd element should be 1st
-
the 3rd element should be 2nd
-
the 1st element should be 3rd
Sort rows in-memory
The data.table package has the function setorder, which sets the order of rows in-memory without displaying. We can use it to set the order of rows by ascending sepal length:
setorder(iris.dt,Sepal.Length)
Let’s see the resulting table:
iris.dt
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 4.3 3.0 1.1 0.1 setosa
## 2: 4.4 2.9 1.4 0.2 setosa
## 3: 4.4 3.0 1.3 0.2 setosa
## 4: 4.4 3.2 1.3 0.2 setosa
## 5: 4.5 2.3 1.3 0.3 setosa
## ---
## 146: 7.7 3.8 6.7 2.2 virginica
## 147: 7.7 2.6 6.9 2.3 virginica
## 148: 7.7 2.8 6.7 2.0 virginica
## 149: 7.7 3.0 6.1 2.3 virginica
## 150: 7.9 3.8 6.4 2.0 virginica
Note: This changes the row order of the table in-memory and does not preserve the original order.