1 Overview

This RMarkdown presents all code used in the article ‘Using Principal Component Analysis to explore co-variation of vowels’ and additional discussion. Throughout, we use data from the ONZE corpus.

Section One covers the toy example used to present PCA. We take three variables with known interrelations: the mean raw F1 values of dress, kits, and trap. These values are taken from a sample of 100 speakers from the ONZE corpus. By looking at three variables, we can visualise the entire process. By looking at three variables with known relationships, we can be confident in our interpretation of our PCA analysis.

Section Two provides a standalone variant of the analysis carried out in Brand et al. (2021). The details of the implementation and models have been slightly changed to enable faster computation.1 Unlike the analysis as it first appears in Brand et al. (2021), the version of the analysis presented here is carried out with generalisation at the front of mind. In particular, we will go into detail regarding the three core questions asked of any PCA analysis:

  1. Is PCA appropriate?
  2. How many PCs should we consider?
  3. Which variables should we use to interpret our PCs?

Throughout, we use functions from the package nzilbb.vowels. These functions are designed to apply to any vowel data to which PCA might be applied. The code in nzilbb.vowels is designed to generalise the methods used in Brand et al. (2021) and Wilson Black et al. (under review). The code for the various functions can be found at the GitHub link above and, if desired, can be directly copied from there if one wishes to modify it and avoid using the nzilbb.vowels package directly.2 If you use code from the package for this kind of analysis, please cite this paper and Brand et al. (2021).

NB: Running this RMarkdown will require around 3GB of RAM.

We will use the following packages:

# We're working in the tidyverse 'dialect'.
library(tidyverse)

# Used for plots.
library(ggrepel)
library(scales)
library(glue)

# Factoextra used for PCA visualisation
library(factoextra)

# nzilbb.vowels used to share the data and some useful functions.
# Install via github using following commented line. If this does not work,
# ensure that you have the package `remotes` installed.
# remotes::install_github('https://github.com/nzilbb/nzilbb_vowels')
library(nzilbb.vowels)
## Warning: package 'patchwork' was built under R version 4.2.3
# We will fit GAMM models as part of our preprocessing example
library(mgcv)
## Warning: package 'mgcv' was built under R version 4.2.3
## Warning: package 'nlme' was built under R version 4.2.3
library(itsadug)

# Plotly for 3D visualisation
library(plotly)

# File path management
library(here)

The following is the details of the R installataion and environment used to knit this markdown:

sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_New Zealand.utf8  LC_CTYPE=English_New Zealand.utf8   
## [3] LC_MONETARY=English_New Zealand.utf8 LC_NUMERIC=C                        
## [5] LC_TIME=English_New Zealand.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] here_1.0.1          plotly_4.10.1       itsadug_2.4.1      
##  [4] plotfunctions_1.4   mgcv_1.8-42         nlme_3.1-162       
##  [7] nzilbb.vowels_0.2.0 patchwork_1.1.2     factoextra_1.0.7   
## [10] glue_1.6.2          scales_1.2.1        ggrepel_0.9.3      
## [13] lubridate_1.9.1     forcats_1.0.0       stringr_1.5.0      
## [16] dplyr_1.1.0         purrr_1.0.1         readr_2.1.3        
## [19] tidyr_1.3.0         tibble_3.1.8        ggplot2_3.4.1      
## [22] tidyverse_2.0.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.10       lattice_0.20-45   listenv_0.9.0     rprojroot_2.0.3  
##  [5] digest_0.6.31     utf8_1.2.2        parallelly_1.35.0 R6_2.5.1         
##  [9] evaluate_0.20     httr_1.4.5        pillar_1.9.0      rlang_1.1.0      
## [13] lazyeval_0.2.2    data.table_1.14.8 rstudioapi_0.14   furrr_0.3.1      
## [17] jquerylib_0.1.4   Matrix_1.5-3      rmarkdown_2.21    splines_4.2.2    
## [21] htmlwidgets_1.6.2 munsell_0.5.0     compiler_4.2.2    xfun_0.37        
## [25] pkgconfig_2.0.3   globals_0.16.2    htmltools_0.5.4   tidyselect_1.2.0 
## [29] bookdown_0.33     codetools_0.2-19  viridisLite_0.4.1 fansi_1.0.4      
## [33] future_1.32.0     tzdb_0.3.0        withr_2.5.0       grid_4.2.2       
## [37] jsonlite_1.8.4    gtable_0.3.3      lifecycle_1.0.3   magrittr_2.0.3   
## [41] cli_3.6.0         stringi_1.7.12    cachem_1.0.7      rsample_1.1.1    
## [45] bslib_0.4.2       generics_0.1.3    vctrs_0.5.2       tools_4.2.2      
## [49] hms_1.1.3         parallel_4.2.2    fastmap_1.1.0     yaml_2.3.7       
## [53] timechange_0.2.0  colorspace_2.1-0  knitr_1.42        sass_0.4.5       
## [57] gghalves_0.1.4

2 Data

Before turning to PCA, it’s worth explaining our data a little further. The data we are working with come from the Origins of New Zealand English project (ONZE) (Gordon et al. 2004, 2007). The ONZE corpus provides the longest record of New Zealand English (NZE) speech recordings in existence, with speakers born over a 137-year time period (1851-1988). The speech samples themselves were collected during oral history interviews and sociolinguistic interviews. The combined corpora contain speech data from over 600 different speakers from different social backgrounds, ethnic groups and geographical locations in New Zealand’s South Island. This paper takes processed data from Brand et al. 2021, which, after filtering, includes 481 speakers (225 female, 256 male), born between 1864 and 1982).

There is extensive evidence to suggest that some of the vowels of NZE may well show patterns of co-variation over time. The three short front vowels trap, dress and kit have all changed their place of articulation in NZE, such that there are now two front vowels which have both raised and fronted (trap and dress) and one centralized vowel (kit) (Langstrof, 2006). This series of vowel changes is an example of a push chain (Hay et al., 2015), and is one place where we might expect to find systematic covariation.

Formant values for the monophthongs were automatically extracted by Brand et al. using Praat, resulting in comparable measurements across a coherent set of variables. The initial data set of raw formant measurements of the 12 NZE monophthongs contained over 2 million tokens. This was only possible because the corpus is in the LaBB-CAT environment (Fromont & Hay 2008; 2012).

Brand et al. then apply an extensive data filtering protocol at the token level in which a series of conditions were placed on the removal of individual vowel tokens from the data set (see Brand et al. 2021, supplementary material). FOOT was removed as its token count was insufficient for our pre-PCA modelling step (see below) and SCHWA was removed along with all unstressed tokens. This left 10 monophthongs.

Brand et al.’s filtering procedure ran as follows:

  1. Remove speakers with missing gender or year of birth information.
  2. Remove transcripts that are word lists (interview speech only).
  3. Remove tokens where Praat failed to extract F1 and F2.
  4. Remove tokens with F1 > 1000 Hz (likely errors).
  5. Remove vowel tokens with durations <= 0.01s or >= 3s (likely errors).
  6. Remove tokens with phonological length > 25 (likely errors).
  7. Remove tokens with a hesitation (affects pronunciation of vowels).
  8. Remove tokens where the word is not transcribed.
  9. Remove stopwords.3
  10. Remove unstressed vowel tokens.
  11. Outlier removal (+- 2.5 SD).
  12. Removal of foot tokens (insufficient data).
  13. Removal of speakers for whom all vowels heavily overlap (likely errors).
  14. Removal of speakers who do not have 5 tokens for each remaining vowel.
  15. Removal of tokens with ‘l’ or ‘r’ as following segment (standard for NZE).
  16. Removal of speakers at the very ends of the year of birth distribution.

We do not go into detail about these steps here as there is nothing special about them in connection with PCA. They are part of the tool kit of work with large audio corpora and of statistical modelling in general.4

Figure 2.1 depicts the count of speakers in the filtered data by year of birth and gender. (This plot is Figure 1 of the main paper.)

figure_1 <- onze_vowels_full %>%
  group_by(speaker) %>%
  summarise(
    yob = first(yob),
    gender = first(gender)
  ) %>%
  ggplot(
    aes(
      x = yob,
      fill = gender
    )
  ) +
  geom_histogram(bins = 50) +
  labs(
    title = "Speakers by Year of Birth and Gender",
    x = "Year of birth",
    y = "Count of speakers",
    fill = "Gender"
  ) +
  scale_fill_manual(values = c("#f1a340", "#998ec3")) +
  scale_x_continuous(breaks = seq(1875, 1975, 25))

ggsave(
  here('plots', 'figure_1.png'), 
  figure_1, 
  dpi = 300, 
  width = 160, 
  height = 100,
  units = "mm"
)
figure_1
ONZE speakers by year of birth and gender.

Figure 2.1: ONZE speakers by year of birth and gender.

3 PCA explainer

This section provides the initial example to explain PCA in the paper.

We begin by extracting the raw F1 values of dress, kits, and trap from a pre-generated random sample of speakers from the ONZE corpus, with 50 born before 1920 and 50 born after 1920.

The following code loads the data for this subset of 100 speakers from the nzilbb.vowels package (which includes the dataset onze_vowels). It selects the vowels which we will consider and the useful columns (to see what the onze_vowels dataset looks like before we remove information just type onze_vowels into the console and press enter).5 The code block then calculates the mean first formant value for each speaker.

# Load ONZE data for 100 speakers, take only DRESS KIT and TRAP.
onze_sub <- onze_vowels %>%
  filter(
    vowel %in% c("DRESS", "KIT", "TRAP") 
  ) %>%
  select(
    speaker, vowel, F1_50, yob, gender
  ) %>%
  # Take means of first formant data
  group_by(
    speaker, vowel
  ) %>%
  summarise(
    F1_50 = mean(F1_50),
    yob = first(yob),
    gender = first(gender)
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'speaker'. You can override using the
## `.groups` argument.
onze_sub

The output above shows that there is a row for each vowel and speaker. That is, there are three rows for each speaker. However, PCA requires a column for each variable we are interested in. So we pivot our data so there is a column for each of the three included vowels.

onze_to_pca <- onze_sub %>%
  pivot_wider(
    names_from = vowel,
    values_from = F1_50
  )

At this point we can plot the data, using colour to represent the third dimension. That is, dress and trap are represented by the \(x\) and \(y\) axes, respectively, while kit is represented by shade of blue, with darker blues indicating lower values and lighter blues representing higher values.

three_vowel_plot <- onze_to_pca %>%
  ggplot(
    aes(
      x = DRESS,
      y = TRAP,
      colour = KIT
    )
  ) +
  geom_point() +
  labs(
    title = "First Formant Values",
    x = "DRESS (Hz)",
    y = "TRAP (Hz)",
    colour = "KIT (Hz)"
  )

three_vowel_plot
Midpoint first formant values for DRESS, KIT and TRAP.

Figure 3.1: Midpoint first formant values for DRESS, KIT and TRAP.

Figure 3.1 is a representation of a 3D cloud of points. As explained in the paper, PCA can be thought of as placing a point at the centre of the cloud, and then drawing lines through the cloud, passing through the central point, which successively capture the largest possible amount of variance in the cloud with each line at right angles to the previous line. The lines are what we call principal components (PCs). If you find it hard to interpret the use of colour in this plot, scroll down for an interactive 3D scatter plot of the same data which does not require the use of colour.

To show this visually, we will run a PCA analysis using R and plot the results. We won’t pause over the details of the code here as it will be explained in more detail in the following section.

We run the PCA analysis using the base prcomp function. In this case, we’re not scaling the data, because I want to show what PCA looks like in the original cloud of points. NB: in almost all situations the scale argument to prcomp should be set to TRUE.6

# Running PCA
onze_pca <- prcomp(
  # Only select the numeric columns which you want to feed into the PCA.
  onze_to_pca %>% select(DRESS, KIT, TRAP),
  scale=FALSE
)

We then extract the required information from the PCA analysis to plot the results. Again, we will discuss how to extract this information in more detail in the following section.

# Load scores for individuals (contained in onze_pca$x)
onze_to_pca <- onze_to_pca %>%
  mutate(
    PC1 = onze_pca$x[, 1],
    PC2 = onze_pca$x[, 2]
  )

# Extract centre of the point cloud.
pca_centre <- onze_pca$center

centre_data <- tibble(
  "DRESS" = pca_centre[[1]],
  "KIT" = pca_centre[[2]],
  "TRAP" = pca_centre[[3]]
)

# Extract loadings.
pca_loadings <- onze_pca$rotation

# Use the loadings and centre to find where each point sits along PC1 and PC2
# when represented in the original 3D space.
onze_to_pca <- onze_to_pca %>%
  mutate(
    PC1_DRESS = (PC1 * pca_loadings[1, 1]) + pca_centre[[1]],
    PC1_KIT = (PC1 * pca_loadings[2, 1]) + pca_centre[[2]],
    PC1_TRAP = (PC1 * pca_loadings[3, 1]) + pca_centre[[3]],
    PC2_DRESS = (PC2 * pca_loadings[1, 2]) + pca_centre[[1]],
    PC2_KIT = (PC2 * pca_loadings[2, 2]) + pca_centre[[2]],
    PC2_TRAP = (PC2 * pca_loadings[3, 2]) + pca_centre[[3]],
  )

We plot the centre of the point cloud:

centre_plot <- onze_to_pca %>%
  ggplot(
    aes(
      x = DRESS,
      y = TRAP,
      colour = KIT
    )
  ) +
  geom_point() +
  geom_point(data = centre_data, size = 10, shape = 4, stroke = 2) +
  labs(
    title = "Centre of Point Cloud",
    x = "DRESS (Hz)",
    y = "TRAP (Hz)",
    colour = "KIT (Hz)",
    caption = "The centre of the point cloud is indicated by a cross."
  )

centre_plot
Midpoint first formant values for DRESS, KIT and TRAP with centre of point cloud.

Figure 3.2: Midpoint first formant values for DRESS, KIT and TRAP with centre of point cloud.

Figure 3.2 shows an ‘x’ where the middle of the point cloud is. This is clear to see for the dress and trap axes. Take a moment to convince yourself that the colour sits in the middle of the range of colours of the points for the kit variable. Don’t worry if you can’t discern fine differences in the shade of blue, for the purposes of this illustration all that is required is a sense of ‘high’, ‘mid’, and ‘low’ values.

We now add the first and second PCs, using line type to visually distinguish them:

PC1_projections <-  onze_to_pca %>%
  select(
    speaker, PC1_DRESS, PC1_TRAP, PC1_KIT
  ) %>%
  rename(
    DRESS = PC1_DRESS,
    TRAP = PC1_TRAP,
    KIT = PC1_KIT
  )
  
PC2_projections <-  onze_to_pca %>%
  select(
    speaker, PC2_DRESS, PC2_TRAP, PC2_KIT
  ) %>%
  rename(
    DRESS = PC2_DRESS,
    TRAP = PC2_TRAP,
    KIT = PC2_KIT
  )

PC_projections <- bind_rows(
  "PC1" = PC1_projections,
  "PC2" = PC2_projections,
  .id = "PC"
)

pc_plot <- centre_plot +
  geom_line(
    aes(
      x = DRESS,
      y = TRAP,
      colour = KIT,
      size = PC,
      group = PC
    ),
    data = PC_projections
  ) +
  scale_size_manual(
    values = c(
      "PC1" = 2,
      "PC2" = 1
    )
  ) +
  labs(
    title = "First Two PCs",
    caption = NULL
  )
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
pc_plot
Midpoint first formant values for DRESS, KIT and TRAP with centre of point cloud and first two PCs.

Figure 3.3: Midpoint first formant values for DRESS, KIT and TRAP with centre of point cloud and first two PCs.

Figure 3.3 might initially be a bit hard to interpret. We see that PC1 goes from low values of dress, trap, and kit. This is easy to see for dress and trap. For kit, notice that the thicker line is dark blue at low values of dress and trap and light blue at high values of dress and trap. Dark blue indicates low values of kit, while light blue represents high values.

The second PC, however, puts kit against dress and trap. This is easiest to see by noting that the light to dark blue pattern is the opposite of the pattern for PC1.

The PCs represent new variables against which we can plot our data. We can reduce this 3D data to two dimensions by plotting the data with respect to PC1 and PC2.

two_dimm <- onze_to_pca %>%
  ggplot(
    aes(
      x = PC1,
      y = PC2
    )
  ) + 
  geom_point() +
  labs(
    title = "2D Representation",
  )

two_dimm
ONZE data for DRESS, TRAP, and KIT, plotted against PC1 and PC2.

Figure 3.4: ONZE data for DRESS, TRAP, and KIT, plotted against PC1 and PC2.

Sometimes, plots like 3.4 allow for individuals to be visually clustered into distinct groups. This is the basis of use of PCA as a prepatory step for classification methods. There are no obvious clusters in the PC1 and PC2 plot for this this collection of individuals. However, even if there is no clear grouping, the plot can still be made useful with the addition of supplementary information (namely, information not fed to the PCA analysis), to aid in the interpretation of two PCs. For instance:

two_dim_supps <- onze_to_pca %>%
  ggplot(
    aes(
      x = PC1,
      y = PC2,
      shape = gender,
      colour = yob
    )
  ) + 
  geom_point() +
  labs(
    title = "PCA Representation of DRESS, TRAP, and KIT F1 Data",
    subtitle = "Supplementary variables: gender and year of birth."
  )

two_dim_supps