Foundations

Session 2: Data processing

Joshua Wilson Black

joshua.black@canterbury.ac.nz

Te Kāhui Roro Reo | New Zealand Institute of Language, Brain and Behaviour

Te Whare Wānanga o Waitaha | University of Canterbury

Overview

tidyverse and base R.
Functions from two tidyverse packages:
1. dplyr: “a grammar of data manipulation”
- The ‘verbs’
2. tidyr: a tool to “help you create tidy data”

Code and slides

usethis::create_from_github(
  "https://github.com/nzilbb/ws-data-processing"
)

`tidyverse`

What is the `tidyverse`?

A set of packages for R which follow a similar philosophy.
- they are ‘opinionated’ tools.
These include:
- dplyr - for data manipulation
- tidyr - for creating ‘tidy’ data
- ggplot2 - for plotting (see next week)

What is base R?

…anything other than the tidyverse
R has techniques for data processing built in.
- e.g., from last week, filtering with a Boolean vector.

# Filtering
toddlers[toddlers$name == "Deano", ]
toddlers[toddlers$happiness_score < 2, ]

…or creating new columns.

# Create new column.
toddlers$pick_up_time <- c("12:45", "1:00", "2:45")
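
To run the base R lines above you need a data frame called toddlers. The following is a minimal sketch, assuming invented values: only the column names name, happiness_score and pick_up_time come from the examples, and everything else is hypothetical.

```r
# Hypothetical data: the values (and the names other than "Deano")
# are invented for illustration.
toddlers <- data.frame(
  name = c("Deano", "Mia", "Tama"),
  happiness_score = c(1, 4, 3)
)

# A comparison returns a Boolean (logical) vector, one value per row.
is_deano <- toddlers$name == "Deano"
is_deano

# Rows where the vector is TRUE are kept; leaving the slot after the
# comma empty means "keep all columns".
toddlers[is_deano, ]
toddlers[toddlers$happiness_score < 2, ]

# Assigning to a new name with `$` adds a column, here with one value
# per row.
toddlers$pick_up_time <- c("12:45", "1:00", "2:45")
```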

Interaction between base R and the tidyverse

A salient feature of tidyverse code: %>%
- the ‘pipe’ sends the output of one function as an input to another function.
- It comes from the tidyverse package magrittr.
The pipe proved so popular that there is now a base R version: |>
- Introduced in R 4.1.0.
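
As a small illustration of what the two pipes do, the sketch below writes the same calculation three ways. The vector x and the result names are made up for the example; it assumes dplyr is installed (loading it makes %>% available) and R 4.1.0 or later for |>.

```r
library(dplyr)  # loading dplyr also makes the magrittr pipe %>% available

x <- c(1, 4, 9)

# Nested calls are read from the inside out.
nested <- sum(sqrt(x))

# With a pipe, the steps are read left to right.
piped_magrittr <- x %>% sqrt() %>% sum()
piped_native <- x |> sqrt() |> sum()

# All three give the same value.
c(nested, piped_magrittr, piped_native)
```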

`dplyr`

A ‘grammar’ for data manipulation.
- An abstract, universal, way of thinking about data problems and solutions.
The core: a set of ‘verbs’ — things you can do to data.
The ideal of dplyr:
- Reading: mostly human readable
- Writing: encourages us to break problems into a series of simple steps

Some verbs

Here are some dplyr verbs:
- select(): select one or more columns
- filter(): keep the rows that satisfy a condition
- mutate(): create new columns
We’ll learn how these work in context.
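
A minimal sketch of each verb on its own, assuming a small invented toddlers table (the values and the age_in_years column are hypothetical, chosen only for illustration):

```r
library(dplyr)

# Hypothetical data for illustration.
toddlers <- tibble(
  name = c("Deano", "Mia", "Tama"),
  happiness = c(1, 4, 3),
  age_in_months = c(20, 26, 20)
)

# select(): keep only the named columns.
select(toddlers, name, happiness)

# filter(): keep only the rows where the condition is TRUE.
filter(toddlers, happiness < 2)

# mutate(): add a column computed from existing columns.
mutate(toddlers, age_in_years = age_in_months / 12)
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes them easy to chain.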

Pipes

We string together verbs using pipes.
- |> or %>%
e.g.:

sad_toddlers <- toddlers |> 
  filter(
    happiness < 2
  ) |> 
  mutate(
    hungry = (current_time - last_meal_time) > 2
  )

NB: you can use variable names inside these functions (the dplyr ‘verbs’).
You don’t constantly have to type, e.g., toddlers$happiness.
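
To try the pipeline above, here is a sketch with invented data. It assumes times are stored as hours on a 24-hour clock and that current_time is an ordinary variable rather than a column; both are assumptions made for the example.

```r
library(dplyr)

# Hypothetical data; times are hours on a 24-hour clock.
toddlers <- tibble(
  name = c("Deano", "Mia", "Tama"),
  happiness = c(1, 4, 0),
  last_meal_time = c(9, 11.5, 12)
)

# Not a column: dplyr looks in the data first, then in your environment.
current_time <- 13

sad_toddlers <- toddlers |>
  filter(happiness < 2) |>
  mutate(hungry = (current_time - last_meal_time) > 2)

sad_toddlers
```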

Grouped data

We can apply the same steps to groups in the data independently.
- e.g., apply a series of operations separately to male and female experimental participants.
- group_by(): Creates groups
Some functions implicitly group…
- count(age_in_months): if you had a column called ‘age_in_months’, this would group the data by the values in age_in_months and count how many rows there are in each group.
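
A sketch of explicit and implicit grouping, assuming invented data. summarise() is a dplyr verb not introduced above; it is used here only because it makes the effect of group_by() easy to see.

```r
library(dplyr)

# Hypothetical data for illustration.
toddlers <- tibble(
  name = c("Deano", "Mia", "Tama", "Ari"),
  age_in_months = c(20, 26, 20, 26),
  happiness = c(1, 4, 3, 2)
)

# Explicit grouping: the mean is calculated separately for each age,
# giving one summary row per group.
mean_by_age <- toddlers |>
  group_by(age_in_months) |>
  summarise(mean_happiness = mean(happiness))

# Implicit grouping: count() groups by the column, counts the rows in
# each group, and returns the counts in a column called n.
n_by_age <- toddlers |>
  count(age_in_months)
```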

Two kinds of pipe problem:
1. Nothing’s coming through (i.e., you get an error message).
2. Mysterious liquids (i.e., not what you expected and/or warning messages).
Find an ‘inspection opening’
- Check output is correct at each step.
- Highlight parts of the pipe and press ‘Run’

penguins <- penguins |> 
  filter(
    Island == "Torgersen"
  ) |> 
  mutate(
    mean_bill = mean(bill_len)
  )

Error in `filter()`:
ℹ In argument: `Island == "Torgersen"`.
Caused by error:
! object 'Island' not found

Error in filter()?
In this case: a misspelled column name (Island should be island).

penguins <- penguins |> 
  filter(
    island == "Torgersen"
  ) |> 
  mutate(
    mean_bill = mean(bill_len)
  )

penguins$mean_bill

 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[51] NA NA

Where’s my mean bill length!?
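
One likely answer: mean() returns NA whenever any of its inputs is NA, so a single missing bill_len value is enough to make every entry of mean_bill missing. The sketch below reproduces the problem on a small invented table (birds is a hypothetical stand-in, not the real penguins data) and shows one common fix, na.rm = TRUE.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data, with one missing value.
birds <- tibble(
  island = c("Torgersen", "Torgersen", "Biscoe"),
  bill_len = c(39.1, NA, 46.1)
)

# Inspection opening: run only the first step and check the column.
torgersen <- birds |>
  filter(island == "Torgersen")
anyNA(torgersen$bill_len)

# na.rm = TRUE drops missing values before the mean is calculated.
torgersen <- torgersen |>
  mutate(mean_bill = mean(bill_len, na.rm = TRUE))

torgersen$mean_bill
```

Whether dropping missing values is appropriate depends on the analysis; the point here is only to see where the NA values come from.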

`tidyr`

‘tidy’ data

Tidy data is data where:

Each variable is a column; each column is a variable.

Each observation is a row; each row is an observation.

Each value is a cell; each cell is a single value.

(https://tidyr.tidyverse.org/)

By contrast:

Storing data in column names
More than one variable stored in a column.
Different ‘observational types’ in one dataframe.
- e.g. Participant info + tokens
For more, see: https://tidyr.tidyverse.org/articles/tidy-data.html
It’s possible to worry too much about this…
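
As an illustration of “more than one variable stored in a column”, the sketch below splits a combined column into two. The data, the column names and the underscore separator are all invented for the example, and it assumes tidyr 1.3.0 or later, where separate_wider_delim() is available.

```r
library(tidyr)

# Hypothetical untidy data: participant and vowel share one column.
untidy <- data.frame(
  token = c("p01_FLEECE", "p01_TRAP", "p02_FLEECE"),
  f1 = c(310, 690, 335)
)

# Split `token` at the underscore into two new columns.
tidy <- untidy |>
  separate_wider_delim(
    token,
    delim = "_",
    names = c("participant", "vowel")
  )

tidy
```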

Pivoting

Are our ‘observations’ participants, vowel tokens, or individual formant readings?
The answer varies with context, and sometimes we need to switch between contexts.
Wider data has more columns and (usually) fewer rows.
Longer data has fewer columns and (usually) more rows.
tidyr provides the functions pivot_wider() and pivot_longer().
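
A sketch of both directions, assuming a small invented table of formant values (token_id, F1, F2, formant and hz are hypothetical names chosen for the example):

```r
library(tidyr)

# Hypothetical wide data: one row per vowel token, one column per formant.
wide <- data.frame(
  token_id = c(1, 2),
  F1 = c(310, 690),
  F2 = c(2400, 1700)
)

# Longer: one row per individual formant reading.
long <- wide |>
  pivot_longer(
    cols = c(F1, F2),
    names_to = "formant",
    values_to = "hz"
  )

# Wider: back to one row per token.
wide_again <- long |>
  pivot_wider(
    names_from = formant,
    values_from = hz
  )
```

Here long has four rows (two tokens times two formants) and wide_again has the same two rows as wide.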

What now?

usethis::create_from_github(
  "https://github.com/nzilbb/ws-data-processing"
)

Work through material at https://nzilbb.github.io/statistics_workshops/chapters/data_processing.html
The script in scripts/data_processing.R contains some of the code already.
The data is in the data directory.

