Data tables :: Dimitris Kokoretsis — Data analytics education

A data table is a tabular structure with values organized in rows (records) and columns (fields). It is the data structure that most resembles a spreadsheet. All values in a column must be of the same type (for example numeric or character), much like a vector.
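
As a minimal sketch of the same-type rule, the hypothetical table below tries to mix numbers and text in one column, and R converts the whole column to character. The names mixed and value are made up for illustration.

```r
# Hypothetical example: one column that mixes numbers and text
mixed <- data.frame(value = c(1, 2, "three"), stringsAsFactors = FALSE)

# The whole column is stored as a single type
class(mixed$value)   # "character"
mixed$value          # "1" "2" "three"
```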

Data table variations in R

In R, there are slight variations of this data structure, but all fit this general description.

For example, R ships with its standard data.frame, which needs no additional packages. The data.table is an “enhanced version” of the standard R data.frame, provided by the package of the same name. The data.table package features more optimized memory management and a slightly different syntax that makes certain tasks easier (in my opinion). Because of this, I will be using data.table in the vast majority of cases. In the end, it’s a matter of personal preference.

There are two ways to convert a data.frame to a data.table:

To store the data.table under a new variable name, assign it as usual: data.table.name <- as.data.table(data.frame.name). This creates a copy and leaves the original data.frame unchanged.
To convert it in place and keep the same variable name, run the function setDT(data.frame.name).

The corresponding functions for the other way around are as.data.frame and setDF.

To find out which variation your table is, run class(your.table) in the R console.
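
The conversions above can be sketched as follows. The table df and its columns are hypothetical and exist only to show how class() changes after each step.

```r
library(data.table)

# Hypothetical data.frame used only for illustration
df <- data.frame(id = 1:3, score = c(90, 85, 77))

# as.data.table() returns a new object; df itself is unchanged
dt <- as.data.table(df)
class(df)   # "data.frame"
class(dt)   # "data.table" "data.frame"

# setDT() converts df in place, so no new variable is needed
setDT(df)
class(df)   # "data.table" "data.frame"

# setDF() converts it back in place
setDF(df)
class(df)   # "data.frame"
```

The in-place versions avoid copying the data, which can matter for large tables.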

Data table input/output

Creating a data table

We can create a data table from scratch with the data.table function. The arguments inside the parentheses become the table’s columns. Let’s create a data table with the first and last names of three employees of an Albuquerque law firm:

library(data.table)
test.data <- data.table(first.name=c("Jimmy","Kim","Howard"),
                        last.name=c("McGill","Wexler","Hamlin"))

To view the table, all we need to do is call its name:

test.data

##    first.name last.name
## 1:      Jimmy    McGill
## 2:        Kim    Wexler
## 3:     Howard    Hamlin
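
As a small follow-up sketch, the size and column names of the table can also be queried directly instead of being read off the printout. The block recreates test.data so that it runs on its own.

```r
library(data.table)

# Same table as in the example above
test.data <- data.table(first.name = c("Jimmy", "Kim", "Howard"),
                        last.name  = c("McGill", "Wexler", "Hamlin"))

nrow(test.data)    # 3 rows
ncol(test.data)    # 2 columns
names(test.data)   # "first.name" "last.name"
```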

Reading data from a delimited text file

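We can read tabular data from any file on our system or the web with the fread function of the data.table package. Let’s use it to import a data set from the web. The file has no header row, so we supply the column names through the col.names argument:

iris.dt <- fread(input="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                 col.names=c("sepallength","sepalwidth","petallength","petalwidth","class"))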

To get a glimpse of our imported data table, we can just display it:

iris.dt

##      sepallength sepalwidth petallength petalwidth          class
##   1:         5.1        3.5         1.4        0.2    Iris-setosa
##   2:         4.9        3.0         1.4        0.2    Iris-setosa
##   3:         4.7        3.2         1.3        0.2    Iris-setosa
##   4:         4.6        3.1         1.5        0.2    Iris-setosa
##   5:         5.0        3.6         1.4        0.2    Iris-setosa
##  ---                                                             
## 146:         6.7        3.0         5.2        2.3 Iris-virginica
## 147:         6.3        2.5         5.0        1.9 Iris-virginica
## 148:         6.5        3.0         5.2        2.0 Iris-virginica
## 149:         6.2        3.4         5.4        2.3 Iris-virginica
## 150:         5.9        3.0         5.1        1.8 Iris-virginica

We can see that it has 150 rows and 5 columns. A slightly more informative option is the str (structure) function:

str(iris.dt)

## Classes 'data.table' and 'data.frame':	150 obs. of  5 variables:
##  $ sepallength: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ sepalwidth : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ petallength: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ petalwidth : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ class      : chr  "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...
##  - attr(*, ".internal.selfref")=<pointer: 0x000001763f14bf30>

Some observations:

The table belongs to classes data.table and data.frame. Any data.table is also a data.frame, but not vice versa.
It has 150 observations (rows) of 5 variables (columns).
The first 4 columns are of type numeric (num), while the fifth one is of type character (chr).
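
To try fread without a network connection, the sketch below reads a few lines of inline text through the text argument instead of a file path or URL. The data and the names csv.text and people are hypothetical.

```r
library(data.table)

# Hypothetical inline data, standing in for the contents of a small CSV file
csv.text <- "name,age,city
Jimmy,45,Albuquerque
Kim,40,Albuquerque"

# fread detects the separator, the header row and the column types
people <- fread(text = csv.text)

# One type per column, as reported by str() above
sapply(people, class)
```

Here name and city are read as character and age as integer.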

Accessing and manipulating data tables

A data table is the primary source of information for our work, but it’s often not very useful in its raw form. We usually need to extract the information we’re interested in, perform calculations, summarize the data in some way, or combine several of these actions.

The simplest way to summarize a data table is with the summary function:

summary(iris.dt)

##   sepallength      sepalwidth     petallength      petalwidth   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        class    
##  Length   :150  
##  N.unique :  3  
##  N.blank  :  0  
##  Min.nchar: 11  
##  Max.nchar: 15  
##

This displays a simple summary of each column. For each numeric column it shows the minimum, maximum, quartiles and mean, while the character column can’t really be “summarized” beyond basic descriptors such as its length.
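
For a text-like column, counting how often each value occurs is often more telling than summary. The sketch below assumes R’s built-in iris data set rather than the downloaded file, so that it runs offline. The name iris.builtin is made up for illustration.

```r
library(data.table)

# Built-in iris, copied into a data.table so the example runs offline
iris.builtin <- as.data.table(iris)

# Number of rows per species; .N is data.table's built-in row counter
species.counts <- iris.builtin[, .N, by = Species]
species.counts   # 50 rows for each of the three species

# Number of distinct values in the column
uniqueN(iris.builtin$Species)   # 3
```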

For more useful options on accessing and manipulating data tables, check the pages dedicated to filtering rows, performing column operations, creating aggregate tables and converting between wide/long format.

Note: The linked pages are intended as a lightweight guide to basic data.table options and are by no means exhaustive. For a more comprehensive resource, refer to the rich documentation of the data.table package (on its website or by running help(data.table) in the R console).

Saving to a delimited text file

We can use the fwrite function of the data.table package to write a data table into a delimited text file:

fwrite(x=iris.dt, file="path/to/file.csv")
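
A quick way to check what fwrite produced is to read the file back with fread. The sketch below does this round trip with a small hypothetical table and a temporary file path from tempfile(), so nothing is left behind in the working directory.

```r
library(data.table)

# Small hypothetical table to write out
out.dt <- data.table(id = 1:3, label = c("a", "b", "c"))

# tempfile() returns a throwaway path
csv.path <- tempfile(fileext = ".csv")
fwrite(x = out.dt, file = csv.path)

# Read the file back and compare it with the original
back.dt <- fread(input = csv.path)
all.equal(out.dt, back.dt)   # TRUE if the round trip preserved the data

# The sep argument switches the delimiter, here to a tab
tsv.path <- tempfile(fileext = ".tsv")
fwrite(x = out.dt, file = tsv.path, sep = "\t")
```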

Built-in R data sets

The iris data set also comes with the R installation and is available in any R session under the name iris, albeit with slightly different column names and species names:

# The head() function shows only the first few rows of a data table
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Find other available data sets by running library(help="datasets") in the R console.
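
The built-in iris is a plain data.frame. As a sketch, it can be copied into a data.table and relabeled to follow the naming of the downloaded file. The name iris.builtin is made up for illustration, and a few values in the built-in version may differ from the downloaded one, so the two tables should not be assumed identical.

```r
library(data.table)

# as.data.table() makes a copy, so the built-in iris stays untouched
iris.builtin <- as.data.table(iris)

# Rename the columns to the names used for the downloaded file
setnames(iris.builtin,
         old = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
         new = c("sepallength", "sepalwidth", "petallength", "petalwidth", "class"))

# Replace the factor column with character labels such as "Iris-setosa"
iris.builtin[, class := paste0("Iris-", class)]

head(iris.builtin, 3)
```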
