Random samples :: Dimitris Kokoretsis — Data analytics education

R offers a range of options for drawing random samples from finite sets of discrete elements.

Initialize seed

Computer-generated “random” values are not truly random but pseudorandom: they follow a deterministic pattern, but the pattern is not obvious, so the values appear random.

To generate pseudorandom values, the computer needs a starting value (the seed) and a process (the generator algorithm). If both are known, the sequence of values is reproducible. In most cases we actually want reproducibility, so that an experiment involving random processes gives exactly the same results when it is re-run.

The starting value and process are set with the set.seed function, which takes an integer seed as its first argument and, optionally, the name of the generator in its kind argument. It’s common practice to call it at the beginning of an experiment, before any random process. Let’s set the starting value to 1:

set.seed(seed=1)

The choice of number is not important, as long as it’s consistently used when re-running the same script.
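
As a minimal sketch of what reproducibility means in practice, the code below sets the same seed twice and draws the same sample after each call. The variable names first_draw and second_draw are only illustrative.

```r
# Re-seeding with the same value restarts the pseudorandom sequence,
# so the two draws below are identical.
set.seed(seed=1)
first_draw <- letters |> sample(size=5)

set.seed(seed=1)
second_draw <- letters |> sample(size=5)

identical(first_draw, second_draw)  # TRUE
```

Without the second set.seed call, the second draw would continue the sequence instead of restarting it, and would generally differ from the first.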

There are different processes that produce pseudorandom values. The R default, which we are now using, is the Mersenne-Twister.

Find more information on random value generation in the relevant R documentation file by running help(Random) in the R console.
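
To check which process is active, one option is the RNGkind function, which reports the generators currently in use. The sketch below selects the Mersenne-Twister explicitly through the kind argument of set.seed, so the result does not depend on earlier session settings.

```r
# Set the seed and name the generator explicitly
set.seed(seed=1, kind="Mersenne-Twister")

# RNGkind() returns a character vector; its first element names
# the main pseudorandom number generator
current_kinds <- RNGkind()
current_kinds[1]  # "Mersenne-Twister"
```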

Subsampling

The following examples draw random samples from the letters of the Latin alphabet. R has a built-in character vector with the lower-case alphabet, named letters:

letters

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

To draw a random element from the alphabet, use the sample function. The code below makes two separate draws of one element each from letters and displays them:

letters |> sample(size=1)

## [1] "y"

letters |> sample(size=1)

## [1] "d"

Likewise, we can draw samples larger than 1 by adjusting the size argument. The code below draws and displays two samples of size 5:

letters |> sample(size=5)

## [1] "g" "a" "b" "w" "k"

letters |> sample(size=5)

## [1] "n" "r" "s" "a" "u"

Reshuffling

We can even draw the whole set as a random sample. In this case all elements will be drawn in random order. In other words, it will be a reshuffling or permutation of the set. The code below creates two permutations of the Latin alphabet (which has 26 letters):

letters |> sample(size=26)

##  [1] "u" "j" "v" "n" "y" "g" "i" "o" "e" "t" "q" "w" "r" "s" "b" "x" "p" "a" "d"
## [20] "c" "f" "l" "m" "h" "k" "z"

letters |> sample(size=26)

##  [1] "t" "l" "w" "f" "h" "y" "x" "g" "j" "z" "v" "n" "b" "m" "o" "q" "k" "a" "c"
## [20] "p" "r" "d" "u" "s" "i" "e"
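
A quick way to convince ourselves that the result really is a permutation is to check that it has the same elements as the original set and no repeats. This is a small sketch; the name shuffled is illustrative. It also relies on the fact that size defaults to the length of the input, so it can be omitted when reshuffling.

```r
set.seed(seed=1)

# With size omitted, sample() draws as many elements as the input has
shuffled <- letters |> sample()

length(shuffled)             # 26
anyDuplicated(shuffled)      # 0: no letter appears twice
setequal(shuffled, letters)  # TRUE: same elements, only the order differs
```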

Note: The sample function works with vectors, as well as lists. The result is again a vector or list, respectively. If one element is drawn from a list, the result is a list of one element.
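
The list behaviour described in the note can be sketched as follows. The list items is a made-up example; double square brackets extract the drawn element itself from the one-element list.

```r
set.seed(seed=1)

# A hypothetical list with elements of different types
items <- list(1, "a", TRUE)

# Drawing one element from a list returns a list of length one
one_item <- items |> sample(size=1)
is.list(one_item)  # TRUE
length(one_item)   # 1

# Extract the drawn element itself
drawn_value <- one_item[[1]]
```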

Replacement

There are two ways to draw a sample from a finite set: with or without replacement.

When sampling with replacement, each drawn element is returned to the set before the next draw, so it can be drawn again.
When sampling without replacement, drawn elements are not returned to the set. Therefore, drawing an element excludes it from future draws.

Neither way is “right” or “wrong”, of course: the two simply simulate different processes and serve different purposes.

By default, the sample function draws without replacement. To change that, set its replace argument to TRUE:

letters |> sample(size=10,replace=TRUE)

##  [1] "d" "m" "h" "y" "p" "y" "w" "n" "t" "g"

In this sample, “y” has been drawn twice (positions 4 and 6). This is only possible because we sampled with replacement.

With replacement, we can draw samples of any size, even larger than the original set. For example, here is a sample of the Latin alphabet of size 30:

letters |> sample(size=30,replace=TRUE)

##  [1] "m" "v" "l" "p" "a" "m" "u" "f" "q" "i" "g" "w" "s" "v" "r" "z" "p" "k" "j"
## [20] "g" "s" "b" "j" "a" "k" "z" "o" "z" "x" "j"

This is not possible without replacement:

letters |> sample(size=30)

## Error in `sample.int()`:
## ! cannot take a sample larger than the population when 'replace = FALSE'
