Aggregate tables :: Dimitris Kokoretsis — Data analytics education

This page briefly explains the principles of aggregating table data, using the data.table syntax. For a general primer on data tables, check the relevant page.

An aggregate table is a table of appropriately grouped (aggregated) individual items of a more extensive table. In other words, it’s a table that summarizes a more extensive table in a meaningful way, focusing on what we’re interested in.

In the data.table syntax, a table is aggregated by using square brackets next to its name. The first and second argument in the square brackets are meant for row operations and column operations, respectively.

For the examples below, we want to include all the data, without excluding rows. Because of this, the first argument will be left blank (e.g. table.name[,...]). However, rows can be easily selected if needed, by applying a filter in the first argument.

The key to aggregating is an additional by argument, which groups rows together by the values of specified columns.

Example data

Let’s use one of R’s built-in data sets, called iris. In order to use data.table’s syntax, we will first convert it from data.frame to data.table.

library(data.table)
iris.dt <- as.data.table(iris)
iris.dt

##      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##   1:          5.1         3.5          1.4         0.2    setosa
##   2:          4.9         3.0          1.4         0.2    setosa
##   3:          4.7         3.2          1.3         0.2    setosa
##   4:          4.6         3.1          1.5         0.2    setosa
##   5:          5.0         3.6          1.4         0.2    setosa
##  ---                                                            
## 146:          6.7         3.0          5.2         2.3 virginica
## 147:          6.3         2.5          5.0         1.9 virginica
## 148:          6.5         3.0          5.2         2.0 virginica
## 149:          6.2         3.4          5.4         2.3 virginica
## 150:          5.9         3.0          5.1         1.8 virginica

One value: row count

This data set has 150 rows, each row being an observation with measurements on petal and sepal size. Not all observations are of the same species though, as we can infer from the Species column.

We can display an aggregate table that shows how many observations correspond to each species:

iris.dt[,.N,by=Species]

##       Species  N
## 1:     setosa 50
## 2: versicolor 50
## 3:  virginica 50

Note: .N is a special symbol in data.table, counting the number of rows.

The key here is the by argument. It’s set to Species, so it groups the rows according to the value of Species. This way, .N counts the rows of each different Species separately.

Importantly, the displayed table only includes the requested columns (.N in this case) and the by column(s).

So there are 50 observations of each species.

Multiple values: mean & SD

We can display another aggregate table with the means and standard deviations (SD) of Petal.Width, for each species.

iris.dt[
  ,.(petal.width.mean=mean(Petal.Width),
     petal.width.sd=sd(Petal.Width)),
  by=Species]

##       Species petal.width.mean petal.width.sd
## 1:     setosa            0.246      0.1053856
## 2: versicolor            1.326      0.1977527
## 3:  virginica            2.026      0.2746501

We request more than one columns, so we need to include them in a list (.()) in the second argument, as needed (see column selection).

The mean and sd functions calculate the mean and SD, respectively. As we want to compare the different species, the rows are again grouped by Species.

Chaining table operations

An interesting feature of data table operations is that the output is also a data table.

We’ve seen by now that data table operations are performed within square brackets next to it. We can perform further operations within square brackets - essentially chaining together table operations.

In the example below, we will calculate the petal area of each observation (product of petal length and width) in the 1st step (see column calculations for details), and then the mean and SD of this petal area for each species in the 2nd step.

# 1st step
iris.dt[
  ,.(petal.area=Petal.Length*Petal.Width,
     Species)
][ # 2nd step
  ,.(petal.area.mean=mean(petal.area),
     petal.area.sd=sd(petal.area)),
  by=Species
]

##       Species petal.area.mean petal.area.sd
## 1:     setosa          0.3656     0.1811546
## 2: versicolor          5.7204     1.3684029
## 3:  virginica         11.2962     2.1574124

Note: It’s important to consider how we want the table to enter each step. In this example, we need the Species column for the 2nd step, so we include it in the 1st step.

Aggregate tables

Example data #

One value: row count #

Multiple values: mean & SD #

Chaining table operations #

Contents

Example data

One value: row count

Multiple values: mean & SD

Chaining table operations