Foundations

Session 1: Getting Started

Joshua Wilson Black

Te Kāhui Roro Reo | New Zealand Institute of Language, Brain and Behaviour

Te Whare Wānanga o Waitaha | University of Canterbury

2025-07-31

Overview

  1. R and RStudio Orientation
  2. Core concepts and use
  3. Jumping ahead: a realistic(ish) example

Orientation

Ctrl + Shift + N

Access to this material

  • The best way to get the material for these sessions is to run the following command in the ‘Console’.
usethis::create_from_github(
  "https://github.com/nzilbb/ws-getting-started"
)
  • If it doesn’t work, follow the instructions at https://tinyurl.com/nzilbb-induction
  • You should now have an RStudio window open in a new project with the slides and example R scripts.

Avoiding bad habits

  • Change default settings (Tools > Global Options).
  • Don’t save results of computations in .RData.
  • Set line endings to avoid git issues with collaborators.

RStudio projects

  • An RStudio project is a directory (folder) which contains everything you need for a data analysis project, including:
    • the data,
    • R scripts,
    • plots,
    • models,
    • write-ups, etc.
  • Use projects all the time!

Core Concepts

First steps in R

  • Go to the console pane.
  • You can enter code after the > prompt.
  • Let’s start simple: type ‘2 + 2’ and press enter.

🥳🥳🥳🥳🥳

R scripts

  • The console can be very useful for small tasks.
  • An R script allows us to keep a step-by-step record.
  • Open a new R script (Ctrl + Shift + N, or File > New File > R Script)
  • Save it inside a directory called ‘scripts’ with a name such as getting-started.R.

More maths

1 * 2
2 ^ 3
2 / 4
"kea" + "tūī"
  • Enter the above lines one-by-one and run them by pressing ‘Ctrl/Cmd + Enter/Return’.
  • Output will appear in the console pane.
  • The final statement produces an error.
    • Why?
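One way to see why the last line fails: ‘+’ is arithmetic, so it needs numbers on both sides. The sketch below catches the error with tryCatch() so that the rest of the script keeps running. The variable name result is only for illustration.

```r
# `+` is arithmetic, so it needs a number on each side.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  "kea" + "tūī",
  error = function(e) conditionMessage(e)
)
print(result)

# Checking the type of a value shows the problem.
is.numeric("kea")    # FALSE
is.character("kea")  # TRUE
```

Joining strings is a different operation from adding numbers, and R has separate functions for it.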

Help! I just see a ‘+’!

  • If you enter an incomplete statement, R will wait for you to complete it in the console pane.
5 *

  • Two options:
    1. Complete the statement in the console.
    2. Press ‘Esc’ in the RStudio console (or ‘Ctrl + C’ if you are running R in a terminal) to escape.

Comments

  • Anything after a # will be ignored by R.
# Add 4 to 3
3 + 4
  • Use this to add ‘comments’ for human readers.

Functions

toupper("this is an example sentence")
paste("tūī", "kea", sep = "-")
rnorm(n = 100, mean = 0, sd = 1)
rbinom(1, 1, 0.5)
?toupper
  • Functions take input(s) and produce an output.
  • The function’s name is followed by the input(s) inside brackets.
  • The inputs are called arguments.
  • Putting ? before a function name opens its help page (in the output pane).
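Arguments can be matched by position or by name. The sketch below reuses rbinom() from the examples above. The argument names n, size and prob come from its help page, while set.seed() and the variable names are additions for illustration.

```r
# set.seed() fixes the random numbers so the calls can be compared.
set.seed(1)
by_position <- rbinom(1, 1, 0.5)

set.seed(1)
by_name <- rbinom(n = 1, size = 1, prob = 0.5)

identical(by_position, by_name)  # TRUE

# Named arguments can be given in any order.
set.seed(1)
reordered <- rbinom(prob = 0.5, n = 1, size = 1)
identical(by_name, reordered)  # TRUE

# Arguments with default values can be left out.
# paste() uses sep = " " unless told otherwise.
paste("tūī", "kea")
```

Naming arguments makes a call easier to read, especially when a function takes several numbers.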

Logic

"tūī" == "kea"
1 == 2
1 == "kea"
"tūī" != "kea" 
"tūī" != "kea" & 1 != 2 # "And"
"tūī" == "kea" | 1 != 2 # "Or"
1 >= 4
  • Logical statements can be TRUE or FALSE.
  • These are very useful for filtering data.
  • Combine multiple logical statements with ‘and’ (‘&’) or ‘or’ (‘|’).
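A short sketch of how the logical operators combine. The ‘!’ (not) operator and the variable names are additions here, not part of the examples above.

```r
# The basic combinations.
TRUE & FALSE  # FALSE: 'and' needs both sides to be TRUE
TRUE | FALSE  # TRUE: 'or' needs at least one side to be TRUE
!TRUE         # FALSE: '!' flips a logical value

# Brackets make the order of evaluation explicit.
both <- ("tūī" != "kea") & (1 != 2)
either <- ("tūī" == "kea") | (1 != 2)
neither <- !either

both     # TRUE
either   # TRUE
neither  # FALSE

# Comparing a number with a string converts the number to text first.
1 == "1"  # TRUE
```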

Vectors

  • Data isn’t just one or two numbers!
  • Vectors allow us to combine many observations.
c("I", "went", "to", "the", "shop")
c(1, 2, 3, 4, 5)
c(TRUE, TRUE, FALSE, FALSE, TRUE)
mean(c(1, 2, 3, 4, 5))
paste0(c(1, 2, 3), ". ", c("I", "went", "to"))
  • The c stands for “combine”.
  • All the entries in a vector must be of the same type.
  • A vector counts as a single argument to a function.
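What happens if we mix types? A minimal sketch: R quietly converts every entry to a single type that can hold them all. The variable names are illustrative.

```r
# Mixing numbers, strings and logicals gives a character vector.
mixed <- c(1, "kea", TRUE)
mixed         # "1" "kea" "TRUE"
class(mixed)  # "character"

# Mixing numbers and logicals gives numbers (TRUE is 1, FALSE is 0).
numbers <- c(1, TRUE, FALSE)
numbers         # 1 1 0
class(numbers)  # "numeric"

# Many functions work on every entry of a vector at once.
toupper(c("i", "went"))  # "I" "WENT"

# length() counts the entries.
length(mixed)  # 3
```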

Variables

  • We want to reuse the same data multiple times.
  • Variables are the solution.
  • Variables associate a name with an object using ‘<-’.
age <- c(2, 4, 2, 3)
  • Look in the environment pane.
  • Your variable is there now.
age + 1 # What does this do?
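A sketch of what happens when arithmetic is applied to a variable holding a vector. The name age_next_year is made up for illustration.

```r
age <- c(2, 4, 2, 3)

# Arithmetic is applied to every entry.
age + 1  # 3 5 3 4

# The original variable does not change...
age      # 2 4 2 3

# ...unless we store the result under a name.
age_next_year <- age + 1
mean(age_next_year)  # 3.75
```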

Data frames

  • How do we associate multiple data points from the same individual?
  • Data frames
    • Rows are ‘observations’
    • Columns are ‘variables’ 🫤🫤🫤🫤🫤
toddlers <- data.frame(
  age = c(2, 4, 2, 3),
  happiness = c(5, 4, 5, 1),
  name = c("Basil", "Glenys", "Offa", "Gronk")
)
  • NB: Statements can go across multiple lines.
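A few built-in functions give a quick overview of a data frame. This sketch rebuilds the toddlers data frame from above so that it can be run on its own.

```r
toddlers <- data.frame(
  age = c(2, 4, 2, 3),
  happiness = c(5, 4, 5, 1),
  name = c("Basil", "Glenys", "Offa", "Gronk")
)

nrow(toddlers)   # 4 observations (rows)
ncol(toddlers)   # 3 variables (columns)
names(toddlers)  # "age" "happiness" "name"

# str() prints the type and first few values of each column.
str(toddlers)
```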

Accessing data

  • View(toddlers): a spreadsheet-style view of data.
  • Square brackets give access to portions of the data (whether a vector or a data frame).
age[2]
age[3:4]
toddlers[1,3]
toddlers[2,]
toddlers[,3]
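What do the lines above return? A sketch, assuming R 4.0 or later (where strings in a data frame stay as character vectors by default). The names row_two and col_three are illustrative.

```r
age <- c(2, 4, 2, 3)
toddlers <- data.frame(
  age = c(2, 4, 2, 3),
  happiness = c(5, 4, 5, 1),
  name = c("Basil", "Glenys", "Offa", "Gronk")
)

age[2]    # 4: the second entry
age[3:4]  # 2 3: the third and fourth entries

# For data frames, the order is [row, column].
toddlers[1, 3]  # "Basil": first row, third column

# Leaving the column blank returns whole rows, as a data frame.
row_two <- toddlers[2, ]
class(row_two)  # "data.frame"

# Leaving the row blank returns a whole column, as a vector.
col_three <- toddlers[, 3]
class(col_three)  # "character"
```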

Accessing data with names

  • Use ‘$’ to access columns in a data frame.
toddlers$name
toddlers$happiness

Accessing data with vectors

age[c(2, 4)] # the second and fourth entry in age.
age[age > 2] # filter using a logical statement
toddlers[, c(2, 3)]
toddlers[c(1, 2), ]
toddlers[toddlers$happiness < 2, ]
  • Use of logical statements lets us filter data.
  • Logical statements produce ‘logical vectors’.
toddlers$happiness < 2
[1] FALSE FALSE FALSE  TRUE
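The filtering step can be broken into two parts: build the logical vector, then use it inside the square brackets. A minimal sketch, in which the variable name unhappy is made up for illustration.

```r
toddlers <- data.frame(
  age = c(2, 4, 2, 3),
  happiness = c(5, 4, 5, 1),
  name = c("Basil", "Glenys", "Offa", "Gronk")
)

# Step 1: one TRUE or FALSE per row.
unhappy <- toddlers$happiness < 2
unhappy         # FALSE FALSE FALSE TRUE

sum(unhappy)    # 1: how many rows match
which(unhappy)  # 4: which row matches

# Step 2: keep only the rows where the vector is TRUE.
toddlers[unhappy, "name"]  # "Gronk"

# Logical statements can be combined before filtering.
toddlers$name[toddlers$age > 2 & toddlers$happiness >= 4]  # "Glenys"
```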

Packages

  • Packages contain R code and data for solving specific problems.
install.packages('cowsay') 
library(cowsay) 
  • We install packages with install.packages().
    • You shouldn’t put install.packages() directly in a script.
  • Packages are loaded with library(), or individual functions can be accessed using ::
cowsay::say(what = "Ko te Kāhui Roro Reo te rōpū pai rawe", by = "cat")

 _______________________________________ 
<Ko te Kāhui Roro Reo te rōpū pai rawe>
 --------------------------------------- 
         \
          \

            |\___/|
          ==) ^Y^ (==
            \  ^  /
             )=*=(
            /     \
            |     |
           /| | | |\
           \| | |_|/\
      jgs  //_// ___/
               \_)
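One way to keep install.packages() out of a script is to check whether the package is available and print a reminder if it is not. This is only a sketch: has_package() is a hypothetical helper written for this example, built on the base R function requireNamespace().

```r
# Hypothetical helper: TRUE if the package is installed, FALSE otherwise.
has_package <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (has_package("cowsay")) {
  cowsay::say(what = "Kia ora", by = "cat")
} else {
  # Remind the user instead of installing without asking.
  message("Run install.packages('cowsay') once in the console, then re-run this script.")
}
```

Installing is a one-off step, while the script will be run many times, so the two are best kept apart.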

The tidyverse

  • The tidyverse is a popular collection of packages.
  • It includes, e.g.:
    • dplyr: functions for data filtering and transformation.
    • ggplot2: a popular package for data visualisation.
    • readr: functions for reading and writing data.
    • stringr: functions for manipulating strings.
  • Code written with tidyverse packages has a different style. (…better)
install.packages("tidyverse") 

A realistic(ish) example

Get some data

Data inspection (text)

library(tidyverse)
library(here)

# read in data
wellform <- read_tsv(here('data', 'wellFormTask2021.tsv'))

# examine data
View(wellform)
summary(wellform)
  • read_tsv() loads ‘tab separated values’ (similar to a .csv file).
  • summary() tells us about some of the variables.
  • The data comes from an experiment exploring judgements of well-formedness of Māori words among non-Māori speakers.
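What does a ‘tab separated values’ file look like? The sketch below writes a tiny file and reads it back using read.delim(), the base R counterpart to read_tsv(). The rows are toy values invented for illustration and are not part of the wellform data.

```r
# Toy data, invented for illustration: NOT the wellform data.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(
  c(
    "word\treactionTime",  # header row; \t is a tab character
    "kea\t1.2",
    "weka\t0.8"
  ),
  tsv_path
)

# read.delim() expects a header row and tab-separated columns.
toy <- read.delim(tsv_path)
toy
str(toy)
```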

Data inspection (visualise)

  • hist() is a built-in function for making histograms.
# Look at distribution of reaction times
hist(wellform$reactionTime)

# Look at shorter reaction times
hist(wellform$reactionTime[wellform$reactionTime < 20])

Filter

# Filter out long reaction times.
# base R style
wellform_filtered <- wellform[wellform$reactionTime < 5, ]

# tidyverse style
wellform_filtered <- wellform |> 
  filter(reactionTime < 5)
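The two styles differ when values are missing: square brackets return a row of NAs for each missing comparison, while filter() drops those rows. The sketch below shows the base R behaviour with a toy data frame (invented for illustration) and two base R ways to drop such rows.

```r
# Toy data, invented for illustration.
toy <- data.frame(
  id = c("a", "b", "c"),
  reactionTime = c(1.5, NA, 7.0)
)

# The comparison gives TRUE, NA, FALSE.
toy$reactionTime < 5

# Square brackets keep an all-NA row for the missing value.
bracket_result <- toy[toy$reactionTime < 5, ]
nrow(bracket_result)  # 2

# which() keeps only the positions that are TRUE.
which_result <- toy[which(toy$reactionTime < 5), ]
nrow(which_result)  # 1

# subset() also drops rows where the comparison is NA.
subset_result <- subset(toy, reactionTime < 5)
nrow(subset_result)  # 1
```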

Chaining functions

# Chaining functions together with pipes.
wellform_filtered <- wellform |> 
  filter(reactionTime < 5) |> 
  select(
    workerId, stimulus, word, 
    enteredResponse, reactionTime, score.shortv
  ) |> 
  rename(
    participant = workerId,
    response = enteredResponse,
    reaction_time = reactionTime,
    phonotactic_score = score.shortv
  )
  • Pipes send output of a function to the next function.
  • Commonly used in tidyverse style.
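A pipe is another way of writing a nested function call. A minimal sketch using only built-in functions, assuming R 4.1 or later (where ‘|>’ was introduced). The variable names are illustrative.

```r
# Nested: read from the inside out.
nested <- sum(sqrt(c(1, 4, 9)))

# Piped: read from left to right.
piped <- c(1, 4, 9) |> sqrt() |> sum()

nested                    # 6
identical(nested, piped)  # TRUE

# Extra arguments go inside the brackets as usual.
birds <- c("kea", "weka") |>
  toupper() |>
  paste(collapse = "-")
birds  # "KEA-WEKA"
```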

Visualise again

# visualise again using ggplot
wellform_filtered |> 
  ggplot(
    aes(
      x = factor(response),
      y = phonotactic_score
    )
  ) +
  geom_boxplot()
  • More on visualisation later in the day.

Model

  • Let’s fit a model using the built-in lm() function.
# model
wellform_fit <- lm(
  response ~ phonotactic_score + reaction_time,
  data = wellform_filtered
)

summary(wellform_fit)

Call:
lm(formula = response ~ phonotactic_score + reaction_time, data = wellform_filtered)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0752 -0.8669  0.1387  1.0105  2.8295 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        6.16334    0.11274   54.67   <2e-16 ***
phonotactic_score  3.53412    0.12287   28.76   <2e-16 ***
reaction_time      0.04016    0.02039    1.97   0.0489 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.217 on 4327 degrees of freedom
Multiple R-squared:  0.1605,    Adjusted R-squared:  0.1602 
F-statistic: 413.8 on 2 and 4327 DF,  p-value: < 2.2e-16
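The ‘Estimate’ column can be turned into a prediction by hand. In the sketch below the coefficients are copied from the output above, while the phonotactic score and reaction time are hypothetical values chosen only to show the arithmetic.

```r
# Coefficients copied from the model summary.
intercept <- 6.16334
b_score <- 3.53412
b_rt <- 0.04016

# Hypothetical values for one trial.
score <- -1
rt <- 1

# Prediction = intercept + (slope * value) for each predictor.
predicted <- intercept + b_score * score + b_rt * rt
predicted  # 2.66938

# A 0.1 increase in phonotactic score changes the prediction by:
b_score * 0.1  # 0.353412
```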

Plot predictions

# Get predictions from model.

# Step 1. Decide what we want predictions for. In this case,
# the full range of phonotactic scores at the mean value for
# reaction time.
to_predict <- data.frame(
  phonotactic_score = seq(-1.2, -0.6, by = 0.01),
  reaction_time = mean(wellform_filtered$reaction_time)
)

# Step 2: Get predictions using the `predict()` function.
model_predictions <- predict(
  wellform_fit, 
  newdata = to_predict,
  se.fit = TRUE
)

# Step 3: add predictions and 95% confidence intervals to the `to_predict` data
# frame.
to_predict$prediction <- model_predictions$fit
to_predict$upper <- model_predictions$fit + 1.96 * model_predictions$se.fit
to_predict$lower <- model_predictions$fit - 1.96 * model_predictions$se.fit

# Step 4: visualise again
to_predict |> 
  ggplot(
    aes(
      x = phonotactic_score,
      y = prediction,
      ymin = lower,
      ymax = upper
    )
  ) +
  geom_ribbon(alpha = 0.4) +
  geom_line()
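Where does 1.96 come from? It is the normal approximation for a 95% interval. The sketch below uses the built-in cars data set (not the wellform data) to compare a hand-built interval with the one predict() returns when given interval = "confidence", which uses the t distribution instead. The variable names are illustrative.

```r
# The multiplier for a 95% interval under the normal distribution.
qnorm(0.975)  # approximately 1.96

# A small model on a built-in data set.
fit <- lm(dist ~ speed, data = cars)
new_data <- data.frame(speed = c(10, 15, 20))

pred <- predict(fit, newdata = new_data, se.fit = TRUE)

# predict() for lm objects uses a t multiplier based on residual df.
t_mult <- qt(0.975, df = pred$df)
manual_upper <- pred$fit + t_mult * pred$se.fit

# The same interval, calculated by predict() directly.
auto <- predict(fit, newdata = new_data, interval = "confidence")

all.equal(unname(manual_upper), unname(auto[, "upr"]))  # TRUE
```

With several thousand residual degrees of freedom, as in the model above, the t multiplier is very close to 1.96.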

Summary

We’ve covered a lot!

  1. An example of a full analysis in a single script.
  2. Core R concepts.
  3. Orientation to R and RStudio.

Next time: More on data processing.

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Chamberlain, Scott, and Amanda Dobbyn. 2025. cowsay: Messages, Warnings, Strings with Ascii Animals. https://doi.org/10.32614/CRAN.package.cowsay.
Mattingley, Wakayo, Forrest Panther, Simon Todd, Jeanette King, Jennifer Hay, and Peter J. Keegan. 2024. “Awakening the Proto-Lexicon: A Proto-Lexicon Gives Learning Advantages for Intentionally Learning a Language.” Language Learning 74 (3): 744–76. https://doi.org/10.1111/lang.12635.
Müller, Kirill. 2020. here: A Simpler Way to Find Your Files. https://doi.org/10.32614/CRAN.package.here.
R Core Team. 2025. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.