Data tables
A data table is a tabular structure with values organized in rows (records) and columns (fields). It is the data structure that most resembles a spreadsheet. All values in a column must be of the same type (for example numeric, character, etc.), much like a vector.
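The one-type-per-column rule can be seen with a minimal base R sketch. The pets table and its column names are made up for illustration only:

```r
# Hypothetical example table: each column is a vector of a single type
pets <- data.frame(name = c("Rex", "Tom"), age = c(3, 5),
                   stringsAsFactors = FALSE)
sapply(pets, class)

# Mixing types in one vector forces R to coerce them to a common type
mixed <- c(1, "a", TRUE)
class(mixed)
```

Because a column is a vector, the same coercion would apply if values of different types were placed in one column.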
Data table variations in R
In R, there are slight variations of this data structure, but all fit this general description.
For example, R features its standard data.frame right out of the box, without any additional requirement. The data.table is an “enhanced version” of the standard R data.frame provided by the package of the same name. The data.table package features more optimized memory management and a slightly different syntax that makes certain tasks easier (in my opinion). Because of this, I will be using data.table in the vast majority of cases. In the end, it’s a matter of personal preference.
There are two ways to convert a data.frame to a data.table:
- To store the data.table under a new variable name, assign it as usual: data.table.name <- as.data.table(data.frame.name).
- To convert it in place and keep the same variable name, run the function setDT(data.frame.name).
The corresponding functions for the other way around are as.data.frame and setDF.
To find out the variation of your table, run the line class(your.table) in the R console.
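The conversions described above can be sketched as follows. The df table and its columns are hypothetical, and the data.table package is assumed to be installed:

```r
library(data.table)

# Hypothetical example data.frame
df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# Option 1: store a data.table copy under a new name; df is left untouched
dt <- as.data.table(df)
class(df)   # still a plain data.frame
class(dt)   # "data.table" "data.frame"

# Option 2: convert in place, keeping the same variable name
setDT(df)
class(df)

# And back again, also in place
setDF(df)
class(df)
```

The in-place functions avoid copying the table, which may matter for large data sets.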
Data table input/output
Creating a data table
We can create a data table from scratch with the data.table function. The arguments inside the parentheses will be the table’s columns. Let’s create a data table with the first and last names of three employees of an Albuquerque law firm:
library(data.table)
test.data <- data.table(first.name=c("Jimmy","Kim","Howard"),
last.name=c("McGill","Wexler","Hamlin"))
To view the table, all we need to do is call its name:
test.data
## first.name last.name
## 1: Jimmy McGill
## 2: Kim Wexler
## 3: Howard Hamlin
Reading data from a delimited text file
We can read tabular data from any file on our system or the web with the fread function of the data.table package. Let’s use it to import a data set from the web:
iris.dt <- fread(input="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                 col.names=c("sepallength","sepalwidth","petallength","petalwidth","class"))
To get a glimpse of our imported data table, we can just display it:
iris.dt
## sepallength sepalwidth petallength petalwidth class
## 1: 5.1 3.5 1.4 0.2 Iris-setosa
## 2: 4.9 3.0 1.4 0.2 Iris-setosa
## 3: 4.7 3.2 1.3 0.2 Iris-setosa
## 4: 4.6 3.1 1.5 0.2 Iris-setosa
## 5: 5.0 3.6 1.4 0.2 Iris-setosa
## ---
## 146: 6.7 3.0 5.2 2.3 Iris-virginica
## 147: 6.3 2.5 5.0 1.9 Iris-virginica
## 148: 6.5 3.0 5.2 2.0 Iris-virginica
## 149: 6.2 3.4 5.4 2.3 Iris-virginica
## 150: 5.9 3.0 5.1 1.8 Iris-virginica
We can see that it has 150 rows and 5 columns. A slightly more informative option is the str (structure) function:
str(iris.dt)
## Classes 'data.table' and 'data.frame': 150 obs. of 5 variables:
## $ sepallength: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ sepalwidth : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ petallength: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ petalwidth : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ class : chr "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...
## - attr(*, ".internal.selfref")=<pointer: 0x000001763f14bf30>
Some observations:
- The table belongs to classes data.table and data.frame. Any data.table is also a data.frame, but not vice versa.
- It has 150 observations (rows) of 5 variables (columns).
- The first 4 columns are of type numeric (num), while the fifth one is of type character (chr).
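These observations can also be checked programmatically. The sketch below assumes the built-in iris data set as a stand-in for the imported file; note that its species column is a factor rather than a character vector, and the name iris.builtin is made up for this example:

```r
library(data.table)

# Assumption: the built-in iris data set stands in for the imported file
iris.builtin <- as.data.table(iris)

dim(iris.builtin)            # number of rows and columns
sapply(iris.builtin, class)  # type of each column

# A data.table passes the data.frame check, but not the other way around
is.data.frame(iris.builtin)
is.data.table(iris)
```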
Accessing and manipulating data tables
A data table is the primary source of information for our work, but it is rarely useful in its raw form. We usually need to extract the information we’re interested in, perform calculations, summarize the data in some way, or combine several of these actions.
The plainest way to summarize a data table is with the summary function:
summary(iris.dt)
## sepallength sepalwidth petallength petalwidth
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## class
## Length :150
## N.unique : 3
## N.blank : 0
## Min.nchar: 11
## Max.nchar: 15
##
This displays a very simple summary of each column. For each numeric column, certain statistics are displayed (minimum, maximum, quartiles and mean), while the character column can’t be summarized numerically, so only basic information such as its length is shown.
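One possible way to get more out of a character column is to count how often each value occurs. This is only a sketch, assuming the built-in iris data set with its Species column converted to character; the variable name dt is arbitrary:

```r
library(data.table)

# Assumption: built-in iris, with Species converted to character
dt <- as.data.table(iris)
dt[, Species := as.character(Species)]

# Count the rows per distinct value (.N is data.table's row counter)
counts <- dt[, .N, by = Species]
counts

# Base R alternative
table(dt$Species)
```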
For more useful options on accessing and manipulating data tables, check the pages dedicated to filtering rows, performing column operations, creating aggregate tables and converting between wide/long format.
Note: The linked pages are intended as a lightweight guide to basic data.table options. Many more options are available, and this is by no means an exhaustive resource. For a more comprehensive resource, refer to the rich documentation of the data.table package (on its website or by running help(data.table) in the R console).
Saving to a delimited text file
We can use the fwrite function of the data.table package to write a data table into a delimited text file:
fwrite(x=iris.dt, file="path/to/file.csv")
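To check that a file was written as expected, it can be read back with fread. The sketch below assumes a temporary file in place of "path/to/file.csv" and uses the built-in iris data set; the names out.file and dt.back are made up for illustration:

```r
library(data.table)

# Assumption: a temporary file stands in for "path/to/file.csv"
out.file <- tempfile(fileext = ".csv")
dt <- as.data.table(iris)

fwrite(x = dt, file = out.file)     # comma-separated by default
dt.back <- fread(input = out.file)  # read the file back in

# Same dimensions and column names after the round trip
dim(dt.back)
names(dt.back)

unlink(out.file)  # clean up the temporary file
```

Column types may not survive the round trip unchanged: a factor column, for instance, is read back as character.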
Built-in R data sets
The iris data set also comes with the R language installation and is available in any R session under the name iris, albeit with slightly different column names and species names:
# The head() function shows only the first few rows of a data table
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Find other available data sets by running library(help="datasets") in the R console.
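The built-in data sets are plain data.frame objects (or other base R structures), so they need to be converted before using data.table syntax on them. A minimal sketch, assuming the mtcars data set and a made-up column name model for its row names:

```r
library(data.table)

# List the data sets shipped with the base "datasets" package
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# mtcars stores car names as row names; keep.rownames moves them to a column
cars.dt <- as.data.table(mtcars, keep.rownames = "model")
head(cars.dt)
```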