Functions :: Dimitris Kokoretsis — Data analytics education

A function is a piece of code that accepts data and other parameters as input and produces an output (e.g. processed data, summary statistics or visualizations).

Essentially, functions help us reuse pieces of code that perform specific actions. If variables are the building blocks of data, functions are the building blocks of actions.

Using functions

Functions are ubiquitous in R. For example, sum, prod, mean and sd are all functions - calculating the sum, product, mean and standard deviation of a given vector, respectively.

Let’s take as an example the simple seq function, which produces arithmetic sequences. A common way to use it is the following:

seq(from=1,to=9,by=1)

## [1] 1 2 3 4 5 6 7 8 9

We just called the seq function with certain values as input: from (set to 1), to (set to 9) and by (set to 1) are called arguments of the function, and we use them to give input data.

Note: We don’t necessarily have to refer to each argument by name, as long as we give the input values in the correct order. For example, seq(1,9,1) would give the same result. I avoid doing this though, to minimize the risk of confusion and mistakes. To see the order of arguments in a function, read its help file in the R documentation (e.g. with ?seq).
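To make the note above concrete, here is a small sketch comparing named and positional calls. The variable names (by_name, reordered, by_position) are arbitrary and only serve the illustration.

```r
# Named arguments: the order in which we write them does not matter
by_name <- seq(from=1,to=9,by=1)
reordered <- seq(by=1,to=9,from=1)

# Positional arguments: values are matched in the order from, to, by
by_position <- seq(1,9,1)

# Both comparisons return TRUE: all three calls give the same sequence
identical(by_name,reordered)
identical(by_name,by_position)

# Print the argument order and default values of seq's default method
args(seq.default)
```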

Argument default values

A function may take many arguments to perform its task, but we often don’t need to give values to all of them, as we might be happy with their default values.

The following call of seq produces the exact same result as before.

seq(to=9)

## [1] 1 2 3 4 5 6 7 8 9

But if we want a different increment, we need to input the by argument manually:

seq(to=9,by=2)

## [1] 1 3 5 7 9
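As a rough sketch, default values can also be inspected from code instead of the help file. This assumes we look at seq.default, the method that seq uses for ordinary numbers; the variable names are arbitrary.

```r
# Default values of from and to in seq's default method (both are 1)
formals(seq.default)$from
formals(seq.default)$to

# Leaving out from is the same as setting it to its default value
default_from <- seq(to=9,by=2)
explicit_from <- seq(from=1,to=9,by=2)

# Returns TRUE
identical(default_from,explicit_from)
```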

Nested and piped function calls

The most basic way to call a function is to put all input arguments within brackets, next to the function’s name. Let’s pick another function as an example: the simple sum function, which calculates the sum of a given vector. The following piece of code calculates the sum of vector v:

v <- c(5,8,4)
sum(v)

## [1] 17

Another way to perform the same function call is the following:

v |> sum()

## [1] 17

The |> symbol is called the pipe (more precisely, R’s native pipe, available from R 4.1.0). It takes the value of its left-hand side and feeds it as input to the function call on its right-hand side.

The pipe is meant to make code easier to read, write, and debug - particularly when functions are used as steps in a processing “pipeline”. The difference is practically non-existent in this example, but it becomes clearer when there are more processing steps.
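A minimal sketch of what the pipe does behind the scenes, assuming R 4.1.0 or later. The vector w and the names piped and nested are made up for this illustration.

```r
v <- c(5,8,4)

# quote() shows the expression R builds without running it:
# the piped call is turned into the ordinary call sum(v)
quote(v |> sum())

# Further arguments go inside the brackets as usual;
# the left-hand side still becomes the first argument
w <- c(5,8,NA)
piped <- w |> sum(na.rm=TRUE)
nested <- sum(w,na.rm=TRUE)

# Returns TRUE: both calls give 13
identical(piped,nested)
```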

Example 1: a simple function pipeline

Let’s assume we want to calculate the sum of all positive integers up to 200. We will first need to create this arithmetic sequence using seq_len, then calculate its sum. Both of the following function calls perform this task:

# Nested call
sum(seq_len(200))

## [1] 20100

# Piped call
200 |> seq_len() |> sum()

## [1] 20100

While these two function calls perform the same operation, they have very different written structure:

The nested call is performed from inner-most to outer-most.
The piped call is performed from left to right.

When coding, my thinking has to follow the order of operations. A nested call is conceived from the inner-most to the outer-most part, but written from left to right. On the contrary, a piped call is conceived, written and read from left to right.

In other words, the pipe allows us to read and write the operations in the order that they are performed.
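A quick sanity check, as a sketch: the sum of the first n positive integers is known to equal n(n+1)/2, so the pipeline’s result can be compared against that formula. The variable names below are arbitrary.

```r
n <- 200

# Sum computed step by step with the pipeline
piped_sum <- n |> seq_len() |> sum()

# Closed-form value n(n+1)/2 for comparison
formula_sum <- n*(n+1)/2

# Returns TRUE: both are 20100
piped_sum == formula_sum
```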

Example 2: a more complicated function pipeline

The difference in readability is even more apparent if we consider a more complicated task:

Calculate the sum of all multiples of 3, up to the mean of [1243,6424,5455].

It is indeed quite complicated. To begin, we can put these numbers in a vector named a and just see what their mean is:

a <- c(1243,6424,5455)
mean(a)

## [1] 4374

As we need multiples of 3, we will return to our seq function to create the sequence, since neither the starting point nor the increment is equal to 1. The elements of the sequence then need to be summed.

The following nested and piped calls perform this whole task and produce identical results:

# Nested call
sum(seq(from=3,to=mean(a),by=3))

## [1] 3190833

# Piped call
a |>
  mean() |>
  seq(from=3,to=_,by=3) |>
  sum()

## [1] 3190833

In my opinion, the nested call is not readable at all, compared to the piped call.

A major difference from example 1 is that seq takes more than one argument. By default, the pipe inserts its left-hand side value as the 1st argument of the right-hand side function call.

We don’t want that here though! We want to use the left-hand side value as to (2nd argument). To do that, we use the _ placeholder, as we did here. The placeholder requires R 4.2.0 or later and has to be given to a named argument (to=_).
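If the placeholder is not available or feels unclear, one possible alternative is to wrap the step in an anonymous function, so that the piped value arrives as that function’s first argument. This is only a sketch, assuming R 4.1.0 or later for the \(m) shorthand; the argument name m and the result names are arbitrary.

```r
a <- c(1243,6424,5455)

# With the _ placeholder, given to the named argument to
with_placeholder <- a |>
  mean() |>
  seq(from=3,to=_,by=3) |>
  sum()

# Alternative: an anonymous function that receives the mean as m
with_lambda <- a |>
  mean() |>
  (\(m) seq(from=3,to=m,by=3))() |>
  sum()

# Returns TRUE: both give 3190833
identical(with_placeholder,with_lambda)
```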

Passing arguments in a list (do.call)

Sometimes the number of arguments is too large, or we don’t know it in advance, so we can’t or don’t want to code each argument manually, one by one. Let’s take the function rbind (as in “row bind”), which can take vectors and stack them on top of one another, giving a matrix:

rbind(c(4,8,6),
      c(4,8,6))

##      [,1] [,2] [,3]
## [1,]    4    8    6
## [2,]    4    8    6

What if we want to do this for 10 vectors like these instead of 2? We really shouldn’t have to write 10 vectors manually, especially if they are repeated.

The do.call function comes to our rescue: it allows us to pass arguments to any function as a list. A list is very convenient in this case, as we can create it programmatically.

We will use the rep function (“repeat”) to create a list of 10 identical vectors, named rbind.input:

rbind.input <- list(c(4,8,6)) |>
  rep(times=10)

We can now use the do.call function to pass this list to rbind.

# Pass the whole list as arguments to rbind
do.call(what=rbind,args=rbind.input)

##       [,1] [,2] [,3]
##  [1,]    4    8    6
##  [2,]    4    8    6
##  [3,]    4    8    6
##  [4,]    4    8    6
##  [5,]    4    8    6
##  [6,]    4    8    6
##  [7,]    4    8    6
##  [8,]    4    8    6
##  [9,]    4    8    6
## [10,]    4    8    6
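The same idea works when the vectors are not identical. Here is a small sketch in which the list is built with lapply, plus a second call showing that names in the list are used as argument names. The vectors and the names rows, m and s are made up for illustration.

```r
# Build the input list programmatically: one vector per value of i
rows <- lapply(1:4,function(i) c(i,i^2,i^3))

# Equivalent to rbind(rows[[1]],rows[[2]],rows[[3]],rows[[4]])
m <- do.call(what=rbind,args=rows)

# The result is a matrix with 4 rows and 3 columns
dim(m)

# Names in the list are matched to the function's argument names
s <- do.call(what=seq,args=list(from=3,to=15,by=3))
s
```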

Creating functions

If the built-in functions of R and its packages are not enough for us, we can create our own functions. We might want to do that for two reasons:

To reuse a piece of code that performs a specific process.
To iterate the same process on multiple elements, such as with lapply.

We will create a function for the process we described in example 2 and we’ll name it example.2.function.

example.2.function <- function(x) {
  result <- x |>
    mean() |>
    seq(from=3,to=_,by=3) |>
    sum()
  return(result)
}

Some important highlights:

The function is assigned to its name with the <- operator, as any other assignment.
x is the function’s input argument, and has no default value. To give it a default value (e.g. the vector [3,3,3]), we would write x=c(3,3,3) instead of x inside function() when creating the function.
The code of the function is enclosed in curly brackets ({}).
A variable named result is calculated inside the function.
return hands back the content of its brackets as the function’s result - in this case, the variable result.
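To illustrate the point about default values, here is a sketch of the same function with a default for x. The name example.2.default is made up for this illustration.

```r
# Same body as example.2.function, but x now has a default value
example.2.default <- function(x=c(3,3,3)) {
  result <- x |>
    mean() |>
    seq(from=3,to=_,by=3) |>
    sum()
  return(result)
}

# Called without input, the default vector is used: the result is 3
no_input <- example.2.default()

# Called with input, the default is ignored
with_input <- example.2.default(c(1243,6424,5455))
```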

Let’s see how our new function works for different inputs:

# The variable we used previously
example.2.function(a)

## [1] 3190833

example.2.function(c(132,621,591))

## [1] 33525

example.2.function(c(645,9423,6084))

## [1] 4830345
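Instead of calling the function once per input, we can iterate over a list of inputs with lapply, as mentioned earlier. The following is a minimal sketch; the function is defined again so that the block runs on its own, and the names inputs and results are arbitrary.

```r
example.2.function <- function(x) {
  result <- x |>
    mean() |>
    seq(from=3,to=_,by=3) |>
    sum()
  return(result)
}

# The three input vectors used above, collected in a list
inputs <- list(c(1243,6424,5455),
               c(132,621,591),
               c(645,9423,6084))

# Apply the function to each element; lapply returns a list
results <- lapply(inputs,example.2.function)

# Flatten the list into a plain vector: 3190833, 33525, 4830345
unlist(results)
```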
