Column operations
Featured in 4 main posts
This page briefly explains the principles of column operations in tables, using the data.table syntax. For a general primer on data tables, check the relevant page.
The information of a data table is accessed with square brackets next to its name. Any operation involving columns is determined by the second argument within the square brackets.
For the examples below, we want to include all rows. Because of this, the first argument (meant for row filtering) will be left blank (e.g. table.name[,...]). However, rows can be easily selected if needed, by applying a filter in the first argument.
In most cases, the second argument can be one of two things:
-
A
listof columns to display - existing or new. Anydata.tableis essentially a list of columns, even tables with a single column. In thedata.tablesyntax,list()is synonymous to the more compact.(), so that’s what we’ll use. -
An expression (calculation or function), determining what to do - including the creation of new columns.
Example data
Let’s use one of R’s built-in data sets, called iris. In order to use data.table’s syntax, we will first convert it from data.frame to data.table.
library(data.table)
iris.dt <- as.data.table(iris)
iris.dt
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.2 setosa
## 2: 4.9 3.0 1.4 0.2 setosa
## 3: 4.7 3.2 1.3 0.2 setosa
## 4: 4.6 3.1 1.5 0.2 setosa
## 5: 5.0 3.6 1.4 0.2 setosa
## ---
## 146: 6.7 3.0 5.2 2.3 virginica
## 147: 6.3 2.5 5.0 1.9 virginica
## 148: 6.5 3.0 5.2 2.0 virginica
## 149: 6.2 3.4 5.4 2.3 virginica
## 150: 5.9 3.0 5.1 1.8 virginica
Selecting columns
Selecting columns to display
The iris.dt table has 5 columns in total, but let’s assume that we only want to display 3: petal length, petal width and species. We can choose which columns to show as follows:
iris.dt[,.(Petal.Length,Petal.Width,Species)]
## Petal.Length Petal.Width Species
## 1: 1.4 0.2 setosa
## 2: 1.4 0.2 setosa
## 3: 1.3 0.2 setosa
## 4: 1.5 0.2 setosa
## 5: 1.4 0.2 setosa
## ---
## 146: 5.2 2.3 virginica
## 147: 5.0 1.9 virginica
## 148: 5.2 2.0 virginica
## 149: 5.4 2.3 virginica
## 150: 5.1 1.8 virginica
Omitting columns from display
To reach the same displayed columns, we can also choose to omit Sepal.Length and Sepal.Width. The easiest way to do this is the following:
iris.dt[,!c("Sepal.Length","Sepal.Width")]
## Petal.Length Petal.Width Species
## 1: 1.4 0.2 setosa
## 2: 1.4 0.2 setosa
## 3: 1.3 0.2 setosa
## 4: 1.5 0.2 setosa
## 5: 1.4 0.2 setosa
## ---
## 146: 5.2 2.3 virginica
## 147: 5.0 1.9 virginica
## 148: 5.2 2.0 virginica
## 149: 5.4 2.3 virginica
## 150: 5.1 1.8 virginica
This is a slightly different syntax from the list symbol used above (.()). Here, we combine the names of the columns we don’t want in a vector and exclude them with the ! symbol - the NOT operator in R (see logical operations).
Extracting a single column as vector
We can also call the name of a column without enclosing it in .() or list(). This will return the column as a vector:
iris.dt[,Petal.Length]
## [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
## [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
## [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
## [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
## [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
## [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
## [109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
## [127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
## [145] 5.7 5.2 5.0 5.2 5.4 5.1
Performing calculations
Let’s assume we want to calculate the petal area for each observation: petal area = petal length · petal width
There are two different ways to perform calculations, depending on our needs: simply display the calculation, or create a new column with the calculation and store it in memory.
Calculating to display
It’s important to consider what exactly we want to display. We could just display the calculated petal area, but it would be meaningless without the species name - it would be just a set of numbers.
Instead, we can calculate and show the petal area and the corresponding species, enclosed within .():
iris.dt[,.(Petal.Area = Petal.Length*Petal.Width,
Species)]
## Petal.Area Species
## 1: 0.28 setosa
## 2: 0.28 setosa
## 3: 0.26 setosa
## 4: 0.30 setosa
## 5: 0.28 setosa
## ---
## 146: 11.96 virginica
## 147: 9.50 virginica
## 148: 10.40 virginica
## 149: 12.42 virginica
## 150: 9.18 virginica
Creating new columns
We might want to store our calculation in memory as a new column of the table. To create new columns, we use the := assignment operator, which is specifically used for the data.table package. Let’s create a new column named Petal.Area:
iris.dt[,Petal.Area := Petal.Length*Petal.Width]
This creates a new column without displaying the table. To see the table, we simply need to call its name:
iris.dt
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
## 1: 5.1 3.5 1.4 0.2 setosa 0.28
## 2: 4.9 3.0 1.4 0.2 setosa 0.28
## 3: 4.7 3.2 1.3 0.2 setosa 0.26
## 4: 4.6 3.1 1.5 0.2 setosa 0.30
## 5: 5.0 3.6 1.4 0.2 setosa 0.28
## ---
## 146: 6.7 3.0 5.2 2.3 virginica 11.96
## 147: 6.3 2.5 5.0 1.9 virginica 9.50
## 148: 6.5 3.0 5.2 2.0 virginica 10.40
## 149: 6.2 3.4 5.4 2.3 virginica 12.42
## 150: 5.9 3.0 5.1 1.8 virginica 9.18
To create multiple columns at once, the easiest way is the functional form of :=. We will use it to create 2 additional columns:
-
Sepal.Area: similarly calculated sepal area. -
index: a simple index of each observation (i.e. each row).
iris.dt[
,`:=`(Sepal.Area = Sepal.Length*Sepal.Width,
index = .I)
]
iris.dt
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
## 1: 5.1 3.5 1.4 0.2 setosa 0.28
## 2: 4.9 3.0 1.4 0.2 setosa 0.28
## 3: 4.7 3.2 1.3 0.2 setosa 0.26
## 4: 4.6 3.1 1.5 0.2 setosa 0.30
## 5: 5.0 3.6 1.4 0.2 setosa 0.28
## ---
## 146: 6.7 3.0 5.2 2.3 virginica 11.96
## 147: 6.3 2.5 5.0 1.9 virginica 9.50
## 148: 6.5 3.0 5.2 2.0 virginica 10.40
## 149: 6.2 3.4 5.4 2.3 virginica 12.42
## 150: 5.9 3.0 5.1 1.8 virginica 9.18
## Sepal.Area index
## 1: 17.85 1
## 2: 14.70 2
## 3: 15.04 3
## 4: 14.26 4
## 5: 18.00 5
## ---
## 146: 20.10 146
## 147: 15.75 147
## 148: 19.50 148
## 149: 21.08 149
## 150: 17.70 150
Note:
.Iis a special symbol indata.table, returning an index number for each row.
In this instance, := is used as a function. Its arguments are the new columns we create, separated by comma.
Deleting columns
We can also use := to delete columns. We will most likely not need the index column we just created, so let’s delete it:
iris.dt[,index := NULL]