Foundations

Session 1: Getting Started

Joshua Wilson Black

joshua.black@canterbury.ac.nz

Te Kāhui Roro Reo | New Zealand Institute of Language, Brain and Behaviour

Te Whare Wānanga o Waitaha | University of Canterbury

2025-07-31

Overview

R and RStudio Orientation
Core concepts and use
Jumping ahead: a realistic(ish) example

This session starts our journey through the “Foundations” section of the NZILBB statistics workshops. The goal is to get from no knowledge of R and RStudio to reproducible and shareable data analysis, in line with at least some of the Open Science principles, without picking up some common bad habits.

Today:

We’ll look at what R and RStudio are and do a bit of orientation.
We’ll look at the core concepts of R and do some sensible configuration of RStudio. This will include basic maths, the idea of a data frame, which is how R represents the kind of data we work with (basically, something which could go in a spreadsheet), etc.
Finally, we’ll jump to the end of the story by looking at a realistic-ish example of a data analysis workflow including data, filtering, plotting, and modelling. This should give a taste of why we are bothering with all of this effort!

Orientation

Ctrl + Shift + N

Let’s open RStudio and see what’s in front of us.

You should see a window containing three ‘panes’, all of which have multiple tabs. You’ll find that there’s quite a bit of variation in the terminology people use to describe all of the aspects of RStudio (and, indeed, R), but here is the official terminology.

First we have the ‘console pane’. This lets us directly interact both with R and with the computer we are using more generally (in the ‘Terminal’ tab).
To the right we have the ‘environment pane’. This lets us see what we are currently working with, what commands have been sent to R previously in a session, and any details from git, if we are using it.
We also have the ‘output pane’. This is where we usually see any plots we make. It also lets us browse the files in our project (or elsewhere on our computer), just as we would in an ordinary file explorer.
Now, if you press Ctrl+Shift+N (or Cmd+Shift+N on a Mac), a new ‘pane’ should appear: the ‘source’ pane.

Access to this material

The best way to get the material for these sessions is to run the following command in the ‘Console’.

usethis::create_from_github(
  "https://github.com/nzilbb/ws-getting-started"
)

If it doesn’t work, follow instructions at https://tinyurl.com/nzilbb-induction
You should now have an RStudio window open in a new project with the slides and example R scripts.

Avoiding bad habits

Change default settings (Tools > Global Options).
Don’t save results of computations in .RData.
Set line endings to avoid git issues with collaborators.

We’ve got a few small bits of housekeeping before we continue.

You should change your RStudio settings to encourage good habits and avoid bad ones.

One of the biggest bad habits is keeping a single R session going for a long time, constantly rerunning code, and keeping models and data in ‘short term’ memory. Instead, you need to rely on your code and scripts to produce everything you need. A huge number of problems can be avoided by turning off R’s capacity to keep a ‘session’ alive by saving ‘.RData’ between sessions.

I also encourage you to insist on POSIX (LF) line endings. This will help you to collaborate with others using git. I won’t explain what this is here, but if you are interested, there is a great talk on YouTube called “There’s no such thing as plain text” by Dylan Beattie, which I encourage you to watch!

RStudio projects

An RStudio project is a directory (folder) which contains everything you need for a data analysis project, including:
- the data,
- R scripts,
- plots,
- models,
- write-ups, etc
Use projects all the time!

Core Concepts

First steps in R

Go to the console pane.
You can enter code after the >
Let’s start simple: type ‘2 + 2’ and press enter.

🥳🥳🥳🥳🥳

R scripts

The console can be very useful for small tasks.
An R script allows us to keep a step-by-step record.
Open a new R script (Ctrl + Shift + N, or File > New File > R Script)
Save it inside a directory called ‘scripts’ with a name such as getting-started.R.

More maths

1 * 2
2 ^ 3
2 / 4
"kea" + "tūī"

Enter the above lines one-by-one and run them by pressing ‘Ctrl/Cmd + Enter/Return’.
Output will appear in the console pane.
The final statement produces an error
- Why?

Help! I just see a ‘+’!

If you enter an incomplete statement, R will wait for you to complete it in the console pane.

5 *

Two options:
1. Complete the statement in the console.
2. Press “Esc” (in the RStudio console) or “Ctrl + C” (in a terminal) to escape.

Comments

Anything after a # will be ignored by R.

# Add 4 to 3
3 + 4

Use this to add ‘comments’ for human readers.

Functions

toupper("this is an example sentence")
paste("tūī", "kea", sep = "-")
rnorm(n = 100, mean = 0, sd = 1)
rbinom(1, 1, 0.5)
?toupper

Functions take input(s) and produce an output.
The function’s name is followed by the input(s) inside brackets.
The inputs are called arguments.
A ? before a function’s name opens its help page (in the output pane).

The real value of something like R comes when we can use ‘functions’, ideally written by someone else, to perform complex data analysis tasks. In almost all cases, functions take some input and produce an output (why am I hedging? sometimes functions have side effects and don’t technically produce an output, and sometimes they don’t need any input). The technical name for an input to a function is an ‘argument’.

Whatever the case, the usual pattern with a function is to write its name and then the inputs inside brackets. Let’s try it with these functions.

Ask what toupper() is going to do.

We have a function which ‘pastes’ strings together (paste()). We also have some functions for producing random data from a distribution, which is very important for statistics. Here there are two examples: rnorm() produces data from a normal distribution and rbinom() from a binomial distribution. With these arguments, rbinom() gives us a way to flip a fair coin in R.

Functions typically come with documentation, which we can get at using a question mark before the function name. Help pages can be hard to read for beginners, but they are an important source for telling you how to use the function.
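
As a small sketch of how arguments work, the calls below pass the same arguments first by name and then by position. Note that set.seed() is not used in the slides. It is added here only so that the ‘random’ coin flips come out the same each time, and the seed value 42 is arbitrary.

```r
# Fix the random number generator so the results are repeatable.
# (set.seed() is an addition for illustration; 42 is an arbitrary choice.)
set.seed(42)

# Arguments matched by name: ten flips of a fair coin (1 = heads, 0 = tails).
flips <- rbinom(n = 10, size = 1, prob = 0.5)

# The same call with arguments matched by position instead of by name.
set.seed(42)
flips_again <- rbinom(10, 1, 0.5)

# Same seed and same arguments, so the two results are identical.
identical(flips, flips_again)

# Functions can be nested: the output of paste() is the input to toupper().
shout <- toupper(paste("kea", "ruru", sep = "-"))
shout  # "KEA-RURU"
```

Naming the arguments is more typing, but it makes the call easier to read when you come back to it later.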

Logic

"tūī" == "kea"
1 == 2
1 == "kea"
"tūī" != "kea" 
"tūī" != "kea" & 1 != 2 # "And"
"tūī" == "kea" | 1 != 2 # "Or"
1 >= 4

Logical statements can be TRUE or FALSE
These are very useful for filtering data
Combine multiple logical statements with ‘and’ (‘&’) or ‘or’ (|)
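
Here is a short sketch of how these operators behave, with the results stored in variables so we can look at them afterwards. The variable names are made up for illustration. The last part shows why comparing a number with a string gives FALSE rather than an error: R converts the number to a string before comparing.

```r
# `&` is TRUE only when both sides are TRUE.
both_true <- ("tūī" != "kea") & (1 != 2)     # TRUE & TRUE   -> TRUE

# `|` is TRUE when at least one side is TRUE.
either_true <- ("tūī" == "kea") | (1 != 2)   # FALSE | TRUE  -> TRUE
neither_true <- ("tūī" == "kea") | (1 == 2)  # FALSE | FALSE -> FALSE

# Comparing a number with a string: 1 is converted to "1" first.
mixed_comparison <- 1 == "kea"               # FALSE
same_after_conversion <- 1 == "1"            # TRUE

# A comparison applied to several numbers gives one answer per number.
at_least_four <- c(1, 5, 10) >= 4            # FALSE TRUE TRUE
```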

Vectors

Data isn’t just one or two numbers!
Vectors allow us to combine many observations.

c("I", "went", "to", "the", "shop")
c(1, 2, 3, 4, 5)
c(TRUE, TRUE, FALSE, FALSE, TRUE)
mean(c(1, 2, 3, 4, 5))
paste0(c(1, 2, 3), ". ", c("I", "went", "to"))

The c stands for “combine”.
All the entries in a vector must be of the same type.
A vector counts as a single argument to a function.
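
To see what ‘same type’ means in practice, here is a minimal sketch (the variable names are made up). If we try to mix types inside c(), R does not complain. Instead, it quietly converts everything to a single type.

```r
# Mixing a number, a string, and a logical: everything becomes a string.
mixed <- c(1, "kea", TRUE)
mixed         # "1" "kea" "TRUE"
class(mixed)  # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0.
numbers_and_logicals <- c(1, TRUE, FALSE)
numbers_and_logicals         # 1 1 0
class(numbers_and_logicals)  # "numeric"

# Many functions work on every entry of a vector at once.
word_lengths <- nchar(c("I", "went", "to"))
word_lengths  # 1 4 2
```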

Variables

We want to reuse the same data multiple times
Variables are the solution.
Variables associate a name with an object using ‘<-’

age <- c(2, 4, 2, 3)

Look in the environment pane.
Your variable is there now.

age + 1 # What does this do?
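
Once you have had a go at the question above, this sketch may help to confirm your answer. The name age_next_year is made up for illustration. The main point is that an operation on a variable does not change the variable unless we assign the result to a name.

```r
age <- c(2, 4, 2, 3)

# Adding 1 applies to every entry of the vector.
age_next_year <- age + 1
age_next_year  # 3 5 3 4

# The original variable is unchanged, because we did not reassign it.
age            # 2 4 2 3

# Variables can be passed to functions just like the values they name.
mean(age)      # 2.75
```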

Data frames

How do we associate multiple data points from the same individual?
Data frames
- Rows are ‘observations’
- Columns are ‘variables’ 🫤🫤🫤🫤🫤

toddlers <- data.frame(
  age = c(2, 4, 2, 3),
  happiness = c(5, 4, 5, 1),
  name = c("Basil", "Glenys", "Offa", "Gronk")
)

NB: Statements can go across multiple lines.

We continue to build up to something actually useful. We need to associate multiple values with the same individual or observation or whatever it is we are working with. This is done using data frames, a core feature of R. In a data frame, each row is …

Note that there is an ambiguity between a ‘variable’ in the sense of a column of a data frame and a variable in the sense of anything we have given a name while using R. In practice this isn’t usually an issue.

Let’s create a tiny data frame with information about a handful of toddlers.

This is the first time where I’ve asked you to write a statement across multiple lines in a script. A statement does not need to be confined to one line. Often we add new lines after an open bracket, giving a new line to each argument. Many people find this easier to read and I encourage you to do it for all but the simplest statements.
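
After creating a data frame, it is worth checking that it has the shape we expect. The sketch below repeats the toddlers example and uses a few built-in functions to inspect it. These particular functions are one reasonable choice and are not prescribed by the slides.

```r
toddlers <- data.frame(
  age = c(2, 4, 2, 3),
  happiness = c(5, 4, 5, 1),
  name = c("Basil", "Glenys", "Offa", "Gronk")
)

# How many rows (observations) and columns (variables) are there?
nrow(toddlers)   # 4
ncol(toddlers)   # 3

# What are the columns called?
names(toddlers)  # "age" "happiness" "name"

# str() prints a compact summary: one line per column, with its type.
str(toddlers)
```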

Accessing data

View(toddlers): a spreadsheet-style view of data.
Square brackets give access to portions of the data (whether a vector or a data frame).

age[2]
age[3:4]
toddlers[1,3]
toddlers[2,]
toddlers[,3]
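
With a data frame, the pattern inside the square brackets is [row, column], and leaving one side empty means ‘all of them’. As a sketch, here are the same statements again with the results noted in comments (the data is repeated so that the example runs on its own).

```r
age <- c(2, 4, 2, 3)
toddlers <- data.frame(
  age = c(2, 4, 2, 3),
  happiness = c(5, 4, 5, 1),
  name = c("Basil", "Glenys", "Offa", "Gronk")
)

age[2]          # 4: the second entry
age[3:4]        # 2 3: the third and fourth entries

toddlers[1, 3]  # "Basil": first row, third column

# An empty column position means "all columns": we get a one-row data frame.
second_row <- toddlers[2, ]
class(second_row)    # "data.frame"

# An empty row position means "all rows": a single column comes back as a vector.
third_column <- toddlers[, 3]
class(third_column)  # "character"
```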

Accessing data with names

Use ‘$’ to access columns in a data frame.

toddlers$name
toddlers$happiness

Accessing data with vectors

age[c(2, 4)] # the second and fourth entry in age.
age[age > 2] # filter using a logical statement
toddlers[, c(2, 3)]
toddlers[c(1, 2), ]
toddlers[toddlers$happiness < 2, ]

Use of logical statements lets us filter data.
Logical statements produce ‘logical vectors’.

toddlers$happiness < 2

[1] FALSE FALSE FALSE  TRUE
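
Because a logical vector is just another vector, we can store it in a variable, count how many entries are TRUE, and combine conditions. A minimal sketch follows; the names is_unhappy and young_and_happy are made up for illustration.

```r
toddlers <- data.frame(
  age = c(2, 4, 2, 3),
  happiness = c(5, 4, 5, 1),
  name = c("Basil", "Glenys", "Offa", "Gronk")
)

# Store the logical vector so we can reuse it.
is_unhappy <- toddlers$happiness < 2
is_unhappy                 # FALSE FALSE FALSE TRUE

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches.
sum(is_unhappy)            # 1

# Use the logical vector to pick out entries of another column.
toddlers$name[is_unhappy]  # "Gronk"

# Combine two conditions with `&` to filter rows.
young_and_happy <- toddlers[toddlers$age == 2 & toddlers$happiness >= 4, ]
young_and_happy$name       # "Basil" "Offa"
```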

Packages

Packages contain R code and data for solving specific problems.

install.packages('cowsay') 
library(cowsay)

We install packages with install.packages().
- You shouldn’t put install.packages() directly in a script.
Packages are loaded with library(), or individual functions can be accessed using ::

cowsay::say(what = "Ko te Kāhui Roro Reo te rōpū pai rawe", by = "cat")


 _______________________________________ 
<Ko te Kāhui Roro Reo te rōpū pai rawe>
 --------------------------------------- 
         \
          \

            |\___/|
          ==) ^Y^ (==
            \  ^  /
             )=*=(
            /     \
            |     |
           /| | | |\
           \| | |_|/\
      jgs  //_// ___/
               \_)
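
One way to follow the advice about keeping install.packages() out of scripts is to check whether a package is available and, if it is not, tell the reader what to do. The sketch below is only one possible pattern. requireNamespace() is a base R function, but the wording of the message is made up for illustration.

```r
# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
# Unlike library(), it does not attach the package or stop with an error.
has_cowsay <- requireNamespace("cowsay", quietly = TRUE)

if (has_cowsay) {
  # Use the function via `::` without attaching the whole package.
  cowsay::say(what = "Kia ora", by = "cat")
} else {
  # Tell the reader what to do rather than installing on their behalf.
  message("Please run install.packages('cowsay') in the console first.")
}
```

This way the script never changes what is installed on someone else’s computer without them choosing to do so.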

The tidyverse

The tidyverse is a popular collection of packages.
It includes, e.g.:
- dplyr: functions for data filtering and transformation.
- ggplot2: a popular package for data visualisation.
- readr: functions for reading and writing data.
- stringr: functions for manipulating strings.
Code written with tidyverse packages has a different style. (…better)

install.packages("tidyverse")

A realistic(ish) example

Get some data

We’re borrowing data from (Mattingley et al. 2024).
The data lives inside a folder called data in our project.
It was downloaded from OSF, here: https://osf.io/ucx8n

Data inspection (text)

library(tidyverse)
library(here)

# read in data
wellform <- read_tsv(here('data', 'wellFormTask2021.tsv'))

# examine data
View(wellform)
summary(wellform)

read_tsv() loads ‘tab separated values’ (similar to a .csv file).
summary() tells us about some of the variables.
The data comes from an experiment exploring judgements of well-formedness of Māori words among non-Māori speakers.

Data inspection (visualise)

hist() is a built-in function for making histograms.

# Look at distribution of reaction times
hist(wellform$reactionTime)

# Look at shorter reaction times
hist(wellform$reactionTime[wellform$reactionTime < 20])

Filter

# Filter out long reaction times.
# base R style
wellform_filtered <- wellform[wellform$reactionTime < 5, ]

# tidyverse style
wellform_filtered <- wellform |> 
  filter(reactionTime < 5)
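
The two styles differ in one way that is worth knowing about, although it only matters when the column contains missing values (NA). Base R square brackets keep a row of NAs wherever the condition is NA, whereas filter() drops those rows. Whether the workshop data has any missing reaction times is not stated here, so this sketch uses a small made-up data frame and shows a base R way to get the ‘drop’ behaviour using which().

```r
# A made-up data frame with one missing reaction time.
rt_data <- data.frame(
  id = 1:4,
  reaction_time = c(1.2, NA, 7.5, 3.1)
)

# Comparing NA with a number gives NA, not TRUE or FALSE.
keep <- rt_data$reaction_time < 5
keep                  # TRUE NA FALSE TRUE

# Square brackets with an NA in the condition return an all-NA row.
with_na_row <- rt_data[keep, ]
nrow(with_na_row)     # 3

# which() returns only the positions that are TRUE, so the NA row is dropped.
without_na_row <- rt_data[which(keep), ]
nrow(without_na_row)  # 2
```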

Chaining functions

# Chaining functions together with pipes.
wellform_filtered <- wellform |> 
  filter(reactionTime < 5) |> 
  select(
    workerId, stimulus, word, 
    enteredResponse, reactionTime, score.shortv
  ) |> 
  rename(
    participant = workerId,
    response = enteredResponse,
    reaction_time = reactionTime,
    phonotactic_score = score.shortv
  )

Pipes send the output of one function to the next function, as its first argument.
They are commonly used in tidyverse style.
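
To see that a pipe is only a different way of writing a nested function call, here is a minimal sketch using built-in functions. It assumes R version 4.1 or later, which is when the |> pipe was added. The variable names are made up for illustration.

```r
values <- c(1, 4, 9)

# Nested style: read from the inside out.
nested <- sum(sqrt(values))

# Piped style: read from left to right.
piped <- values |> sqrt() |> sum()

nested  # 6
piped   # 6

# The left-hand side becomes the first argument; other arguments go in as usual.
rounded <- 3.14159 |> round(digits = 2)
rounded  # 3.14
```

Both versions do exactly the same work. The piped version simply puts the steps in the order in which they happen.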

Visualise again

# visualise again using ggplot
wellform_filtered |> 
  ggplot(
    aes(
      x = factor(response),
      y = phonotactic_score
    )
  ) +
  geom_boxplot()

More on visualisation later in the day.

Model

Let’s fit a model using the built-in lm() function.

# model
wellform_fit <- lm(
  response ~ phonotactic_score + reaction_time,
  data = wellform_filtered
)

summary(wellform_fit)


Call:
lm(formula = response ~ phonotactic_score + reaction_time, data = wellform_filtered)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0752 -0.8669  0.1387  1.0105  2.8295 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        6.16334    0.11274   54.67   <2e-16 ***
phonotactic_score  3.53412    0.12287   28.76   <2e-16 ***
reaction_time      0.04016    0.02039    1.97   0.0489 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.217 on 4327 degrees of freedom
Multiple R-squared:  0.1605,    Adjusted R-squared:  0.1602 
F-statistic: 413.8 on 2 and 4327 DF,  p-value: < 2.2e-16
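
To get a feel for how the formula relates to the ‘Estimate’ column, here is a minimal sketch using simulated data rather than the workshop data. The variable names (score, rating) and the ‘true’ intercept and slope are made up for illustration. Because we choose them ourselves, we can check that lm() recovers something close to them.

```r
# Simulate data where we know the answer. All values here are made up.
set.seed(1)
n <- 500
score <- runif(n, min = -1, max = 1)

# True intercept 6, true slope 3.5, plus random noise.
rating <- 6 + 3.5 * score + rnorm(n, mean = 0, sd = 1)
sim_data <- data.frame(score = score, rating = rating)

# The formula reads: "rating is modelled as a function of score".
sim_fit <- lm(rating ~ score, data = sim_data)

# coef() extracts the 'Estimate' column as a named vector.
estimates <- coef(sim_fit)
estimates
```

The estimates will not be exactly 6 and 3.5, because of the noise we added, but they should be close.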

Plot predictions

# Get predictions from model.

# Step 1. Decide what we want predictions for. In this case,
# the full range of phonotactic scores at the mean value for
# reaction time.
to_predict <- data.frame(
  phonotactic_score = seq(-1.2, -0.6, by = 0.01),
  reaction_time = mean(wellform_filtered$reaction_time)
)

# Step 2: Get predictions using the `predict()` function.
model_predictions <- predict(
  wellform_fit, 
  newdata = to_predict,
  se.fit = TRUE
)

# Step 3: add predictions and 95% confidence intervals to the `to_predict` data
# frame.
to_predict$prediction <- model_predictions$fit
to_predict$upper <- model_predictions$fit + 1.96 * model_predictions$se.fit
to_predict$lower <- model_predictions$fit - 1.96 * model_predictions$se.fit

# Step 4: visualise again
to_predict |> 
  ggplot(
    aes(
      x = phonotactic_score,
      y = prediction,
      ymin = lower,
      ymax = upper
    )
  ) +
  geom_ribbon(alpha = 0.4) +
  geom_line()
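
The 1.96 in Step 3 comes from the normal distribution. As a sketch of where it comes from, and of an alternative, the code below uses R’s built-in cars dataset as a stand-in for the workshop data (an assumption made so that the example runs on its own). It compares the manual approach with asking predict() for a confidence interval directly, which uses the t distribution and so gives a slightly wider interval.

```r
# `cars` is a built-in dataset (speed and stopping distance), used as a stand-in.
fit <- lm(dist ~ speed, data = cars)
new_data <- data.frame(speed = c(10, 15, 20))

# Manual approach: qnorm(0.975) is approximately 1.96.
manual <- predict(fit, newdata = new_data, se.fit = TRUE)
manual_lower <- manual$fit - qnorm(0.975) * manual$se.fit
manual_upper <- manual$fit + qnorm(0.975) * manual$se.fit

# Alternative: predict() returns a matrix with columns fit, lwr, and upr.
built_in <- predict(
  fit,
  newdata = new_data,
  interval = "confidence",
  level = 0.95
)
built_in
```

With a large dataset the two approaches give nearly the same interval, so the manual version is a reasonable approximation.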

Summary

We’ve covered a lot!

An example of a full analysis in a single script.
Core R concepts.
Orientation to R and RStudio.

Next time: More on data processing.

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.

Chamberlain, Scott, and Amanda Dobbyn. 2025. cowsay: Messages, Warnings, Strings with Ascii Animals. https://doi.org/10.32614/CRAN.package.cowsay.

Mattingley, Wakayo, Forrest Panther, Simon Todd, Jeanette King, Jennifer Hay, and Peter J. Keegan. 2024. “Awakening the Proto-Lexicon: A Proto-Lexicon Gives Learning Advantages for Intentionally Learning a Language.” Language Learning 74 (3): 744–76. https://doi.org/10.1111/lang.12635.

Müller, Kirill. 2020. here: A Simpler Way to Find Your Files. https://doi.org/10.32614/CRAN.package.here.

R Core Team. 2025. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown.

Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.
