This document sets out the analysis for the paper ‘How stable are patterns of covariation across time?’
This paper uses the QuakeBox (QB1) and QuakeBox 2 (QB2) corpora to investigate the stability of PCA across the same speakers at two time points.
The structure is as follows, numbering by section:
Preprocessing, Filtering, and Normalisation.
Overall Pattern of Change
QB1 data is used to set out the overall direction of sound change
for the largest possible subset of speakers.
Analysis 1: Speaker Change Between QB1 and QB2
The main analysis is carried out, comparing PCs from Brand et al.
(2021), and between QB1 and QB2.
Analysis 2: Uncontrolled Social Variation
The main analysis is repeated, including social variables in the
GAMM models.
Additional Checks
Note the occasional use of tabs to organise material which takes up a lot of space (such as diagnostic plots for each model). These appear as follows:
renv
If you wish to run this code yourself, we recommend cloning the GitHub
repository and running the code using the .Rproj
file to load the
entire project. We use the renv
package to track the package versions
we have used and allow you to load the same package versions yourself
using the function renv::restore()
from within the project. For more
details, look here.
This document uses the following libraries.
# We use a wide range of tidyverse functions for data filtering and plotting.
library(tidyverse)
library(glue)
library(furrr)
library(tidylog) #gives summaries
library(scales) #rescale functions
library(scico)
library(gt) # For presentation tables.
# The following packages are used for plotting.
library(patchwork)
library(ggrepel)#more plotting things
library(gganimate) #animation plots
library(gifski) #needed to generate gifs
library(itsadug)
library(DiagrammeR)
library(plotly)
library(ggforce)
library(corrplot)
# The following packages are used for modelling and analysis
library(mgcv) #gamms
library(lme4) #mixed-effects models
library(factoextra) #pca
library(lmerTest) #p values from lmer models
# The following packages are used for table output.
library(DT) #interactive data tables
library(kableExtra) #outputting nice tables
# We use `here` for file path management.
library(here)
library(nzilbb.vowels)
# We set a seed here. This will only give the same models if you run each
# code block sequentially (rather than interactively running code blocks)
set.seed(114)
The environment we have used to knit this document is:
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.7.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Pacific/Auckland
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] nzilbb.vowels_0.3.1 here_1.0.1 kableExtra_1.4.0
## [4] DT_0.33 lmerTest_3.1-3 factoextra_1.0.7
## [7] lme4_1.1-35.3 Matrix_1.7-0 corrplot_0.92
## [10] ggforce_0.4.2 plotly_4.10.4 DiagrammeR_1.0.11
## [13] itsadug_2.4.1 plotfunctions_1.4 mgcv_1.9-1
## [16] nlme_3.1-164 gifski_1.12.0-2 gganimate_1.0.9
## [19] ggrepel_0.9.5 patchwork_1.2.0 gt_0.10.1
## [22] scico_1.5.0 scales_1.3.0 tidylog_1.1.0
## [25] furrr_0.3.1 future_1.33.2 glue_1.7.0
## [28] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
## [31] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
## [34] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
## [37] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 shape_1.4.6.1 rstudioapi_0.16.0
## [4] jsonlite_1.8.8 magrittr_2.0.3 jomo_2.7-6
## [7] farver_2.1.2 nloptr_2.0.3 rmarkdown_2.27
## [10] vctrs_0.6.5 minqa_1.2.7 base64enc_0.1-3
## [13] htmltools_0.5.8.1 progress_1.2.3 polynom_1.4-1
## [16] plotrix_3.8-4 broom_1.0.6 weights_1.0.4
## [19] Formula_1.2-5 mitml_0.4-5 sass_0.4.9
## [22] parallelly_1.37.1 bslib_0.7.0 htmlwidgets_1.6.4
## [25] cachem_1.1.0 iterators_1.0.14 lifecycle_1.0.4
## [28] pkgconfig_2.0.3 R6_2.5.1 fastmap_1.2.0
## [31] rbibutils_2.3 digest_0.6.35 numDeriv_2016.8-1.1
## [34] colorspace_2.1-0 rprojroot_2.0.4 ellipse_0.5.0
## [37] Hmisc_5.2-0 gdata_3.0.1 fansi_1.0.6
## [40] timechange_0.3.0 nnls_1.6 httr_1.4.7
## [43] polyclip_1.10-6 compiler_4.4.0 proxy_0.4-27
## [46] doParallel_1.0.17 withr_3.0.0 htmlTable_2.4.3
## [49] backports_1.5.0 pan_1.9 MASS_7.3-60.2
## [52] gtools_3.9.5 tools_4.4.0 foreign_0.8-86
## [55] nnet_7.3-19 grid_4.4.0 checkmate_2.3.2
## [58] cluster_2.1.6 generics_0.1.3 gtable_0.3.5
## [61] tzdb_0.4.0 class_7.3-22 data.table_1.15.4
## [64] hms_1.1.3 rsample_1.2.1 xml2_1.3.6
## [67] utf8_1.2.4 foreach_1.5.2 pillar_1.9.0
## [70] splines_4.4.0 smacof_2.1-7 tweenr_2.0.3
## [73] lattice_0.22-6 gghalves_0.1.4 survival_3.5-8
## [76] tidyselect_1.2.1 knitr_1.47 gridExtra_2.3
## [79] bookdown_0.39 svglite_2.1.3 xfun_0.44
## [82] visNetwork_2.1.2 stringi_1.8.4 lazyeval_0.2.2
## [85] yaml_2.3.8 boot_1.3-30 evaluate_0.23
## [88] codetools_0.2-20 wordcloud_2.6 cli_3.6.2
## [91] rpart_4.1.23 systemfonts_1.1.0 Rdpack_2.6.2
## [94] munsell_0.5.1 jquerylib_0.1.4 Rcpp_1.0.12
## [97] globals_0.16.3 parallel_4.4.0 prettyunits_1.2.0
## [100] glmnet_4.1-8 listenv_0.9.1 viridisLite_0.4.2
## [103] e1071_1.7-14 crayon_1.5.2 clisymbols_1.2.0
## [106] rlang_1.1.4 mice_3.17.0
We load anonymised data. Data anonymisation requires that we carry out some preprocessing steps before the steps presented in this Markdown.
The anonymised data has already had stopwords,1 words with hesitations, and tokens without transcribed words removed. We have also removed unimportant speaker demographics (such as their household location, occupation, areas lived, etc.) because otherwise, there would be enough information to determine the identities of many of our participants.
We set the ggplot
theme to theme_bw()
:
theme_set(theme_bw())
We load the data as follows:
QB1 <- read_rds(here('data', 'QB1_anon.rds'))
QB1_initial_count <- nrow(QB1)
QB2 <- read_rds(here('data', 'QB2_anon.rds'))
QB2_initial_count <- nrow(QB2)
The _count
variables are used later in the Markdown in order to
visualise the counts of tokens at each stage of the filtering process.
The data consists of midpoint first and second formant readings from
vowel tokens with additional data at the level of participant (e.g.
participant ethnic background and gender), at the level of utterance
(e.g. speech rate). Midpoint F1 and F2 values were extracted using
LaBB-CAT’s interface with Praat, with a maximum frequency of 5500 Hz for
female speakers and 5000 Hz for male speakers (see
scripts/qb1_extraction
and scripts/qb2_extraction
in the GitHub
repository for this project
here).
The data is summarised in the following table (for both QB1 and QB2 data):
column_descriptions_initial <- c(
"Anonymised speaker code",
"Speaker gender (binary)",
"Speaker age category",
"Speaker ethnicity (at QB1)",
"Country speaker grew up",
"Region speaker grew up",
"Duration of speakers' speech (seconds)",
"Duration of transcript (seconds)",
"LaBB-CAT token identifier",
"Anonymised word identifier",
"Vowel category (DISC)",
"Onset of vowel token (seconds)",
"Offset of vowel token (seconds)",
"Stress code (primary, secondary, etc)",
"Vowel midpoint (seconds)",
"Previous segment (DISC)",
"Following segment (DISC)",
"Utterance articulation rate",
"Utterance speech rate",
"First formant midpoint reading",
"Second formant midpoint reading"
)
prefilter_table <- tibble(
Name = names(QB1),
Type = reframe(QB1, across(everything(), class))[2,] %>% as_vector(),
Description = column_descriptions_initial
)
prefilter_table <- prefilter_table %>%
gt() %>%
tab_header(
title = "Variables before preprocessing"
)
prefilter_table
Variables before preprocessing | ||
Name | Type | Description |
---|---|---|
Speaker | character | Anonymised speaker code |
Gender | character | Speaker gender (binary) |
Age | character | Speaker age category |
Ethnicity | character | Speaker ethnicity (at QB1) |
participant_grew_up | character | Country speaker grew up |
participant_grew_up_region | character | Region speaker grew up |
participant_duration | numeric | Duration of speakers' speech (seconds) |
transcript_duration | numeric | Duration of transcript (seconds) |
MatchId | character | LaBB-CAT token identifier |
Word | character | Anonymised word identifier |
VowelDISC | character | Vowel category (DISC) |
VowelStart | character | Onset of vowel token (seconds) |
VowelEnd | character | Offset of vowel token (seconds) |
stress | character | Stress code (primary, secondary, etc) |
VowelMid | numeric | Vowel midpoint (seconds) |
previous_segment | character | Previous segment (DISC) |
following_segment | character | Following segment (DISC) |
articulation_rate | numeric | Utterance articulation rate |
speech_rate | numeric | Utterance speech rate |
F1_50 | numeric | First formant midpoint reading |
F2_50 | numeric | Second formant midpoint reading |
This section tidies the data and implements the remaining filtering steps.
We change the DISC notation for vowels to Wells Lexical Set labels:
wells_code <- function(in_data) {
in_data %>%
mutate(
Vowel = fct_recode(
factor(VowelDISC),
FLEECE = "i",
KIT = "I",
DRESS = "E",
TRAP = "{",
START = "#",
LOT = "Q",
THOUGHT = "$",
NURSE = "3",
STRUT = "V",
FOOT = "U",
GOOSE = "u"
)
)
}
QB1 <- wells_code(QB1)
QB2 <- wells_code(QB2)
We generate a vowel duration and orthographic length column.
QB1 <- QB1 %>%
mutate(
VowelEnd = as.numeric(VowelEnd),
VowelStart = as.numeric(VowelStart),
VowelDur = VowelEnd - VowelStart
)
QB2 <- QB2 %>%
mutate(
VowelDur = VowelEnd - VowelStart
)
As a final tidying step, we modify the Ethnicity
column so that we get
a binary distinction between Māori and non-Māori. The code here is
messy, as it is capturing quite a few different cases. We also change
the gender codes from lower case to upper case (e.g. ‘m’ -> ‘M’) for
compatibility with earlier code.
QB1$Ethnicity[QB1$Ethnicity == "NZ mixed ethnicity"] <- "Maori"
QB1$Ethnicity[QB1$Ethnicity == "NZ Maori"] <- "Maori"
QB1$Ethnicity[QB1$Ethnicity == "na"] <- "Non-Maori"
QB1$Ethnicity[QB1$Ethnicity == "Other"] <- "Non-Maori"
QB1$Ethnicity[QB1$Ethnicity == "NZ European"] <- "Non-Maori"
QB2$Ethnicity[QB2$Ethnicity == "NZ Ma�ori"] <- "Maori"
QB2$Ethnicity[QB2$Ethnicity == "NZ mixed ethnicity"] <- "Maori"
QB2$Ethnicity[QB2$Ethnicity == "NZ Maori"] <- "Maori"
QB2$Ethnicity[QB2$Ethnicity == ""] <- "na"
QB2$Ethnicity[QB2$Ethnicity == "Fijian Indian"] <- "Non-Maori"
QB2$Ethnicity[QB2$Ethnicity == "Other"] <- "Non-Maori"
QB2$Ethnicity[QB2$Ethnicity == "NZ European"] <- "Non-Maori"
# If a speaker provides ethnicity info in QB1 but not QB2, we add their
# ethnicity info from QB1 to QB2.
QB2 <- within(QB2, Ethnicity[Speaker == 'QB_NZ_M_594'] <- 'Non-Maori')
QB2 <- within(QB2, Ethnicity[Speaker == 'QB_NZ_F_848'] <- 'Non-Maori')
QB2 <- within(QB2, Ethnicity[Speaker == 'QB_NZ_M_493'] <- 'Non-Maori')
# For convenience, we change gender coding from 'f', 'm' -> 'F', 'M'.
QB1 <- QB1 %>%
mutate(
Gender = str_to_upper(Gender)
)
QB2 <- QB2 %>%
mutate(
Gender = str_to_upper(Gender)
)
We code factor variables as factors (rather than character vectors).
factor_vars <- c(
"Gender", "Ethnicity", "stress"
)
QB1 <- QB1 %>%
mutate(
across(all_of(factor_vars), as.factor),
Age = as.ordered(Age)
)
QB2 <- QB2 %>%
mutate(
across(all_of(factor_vars), as.factor),
Age = as.ordered(Age)
)
We now remove tokens which are, or are likely to be, errors introduced by forced alignment and formant tracking.
First, we remove tokens:
error_filter <- function(in_data) {
out_data <- in_data %>%
filter(
between(VowelDur, 0.01, 2), #filter tokens with very short or long vowel
# durations
F1_50 < 1100, # Remove all tokens with F1 at or above 1100hz.
# !(Vowel %in% c("THOUGHT", "LOT") & F2_50 > 2000),
!is.na(Gender), #filter speakers with missing gender
!Gender == "",
!is.na(Age), # Filter speakers with missing age.
Age != 'na', # Filter speakers with missing age.
!is.na(F1_50), # Filter missing formant values.
!is.na(F2_50)
)
}
QB1 <- error_filter(QB1)
QB1_error_count <- nrow(QB1)
QB2 <- error_filter(QB2)
QB2_error_count <- nrow(QB2)
We then remove unstressed tokens and those which precede liquids.
unstressed_filter <- function(in_data) {
out_data <- in_data %>% filter(
stress != "0" #filter any unstressed tokens
)
}
liquid_filter <- function(in_data) {
out_data <- in_data %>% mutate(
following_segment_category = fct_collapse(
as_factor(following_segment),
labial = c('m', 'p', 'b', 'f', 'w'),
velar = c('k', 'g', 'N'),
liquid = c('r', 'l'),
other_level = "other"
)
) %>%
filter(
!following_segment_category == 'liquid'
)
}
QB1 <- QB1 %>%
unstressed_filter() %>%
liquid_filter()
QB1_stressliquid <- nrow(QB1)
QB2 <- QB2 %>%
unstressed_filter() %>%
liquid_filter()
QB2_stressliquid <- nrow(QB2)
Our liquid filter works on the basis of CELEX. But we haven’t checked manually for linking /r/ or intrusive /r/. To satisfy ourselves that this isn’t a problem, and in response to a helpful comment from the anonymous reviewer, we checked the non-anonymised data and determined that less than 1% of the tokens of start, thought, and nurse occur word-finally. Of these, only a minority would be pre-vocalic. This is the relevant context for linking /r/. We think it is very unlikely that this would have an impact on the overall analysis.
We then remove tokens with formant values more than 2.5 sd from their speaker’s mean.
sd_filter <- function(in_data, sd_limit = 2.5) {
vowels_all_summary <- in_data %>%
# Remove tokens at +/- sd limit (default 2.5 sd)
group_by(Speaker, Vowel) %>%
summarise(
#calculate the summary statistics required for the outlier removal.
n = n(),
mean_F1 = mean(F1_50, na.rm = TRUE),
mean_F2 = mean(F2_50, na.rm = TRUE),
sd_F1 = sd(F1_50, na.rm = TRUE),
sd_F2 = sd(F2_50, na.rm = TRUE),
# Calculate cut off values.
max_F1 = mean(F1_50) + sd_limit*(sd(F1_50)),
min_F1 = mean(F1_50) - sd_limit*(sd(F1_50)),
max_F2 = mean(F2_50) + sd_limit*(sd(F2_50)),
min_F2 = mean(F2_50) - sd_limit*(sd(F2_50))
)
#this is the main outlier filtering step.
out_data <- in_data %>%
inner_join(vowels_all_summary) %>%
mutate(
outlier = ifelse(
F1_50 > min_F1 &
F1_50 < max_F1 &
F2_50 > min_F2 &
F2_50 < max_F2,
FALSE,
TRUE
)
) %>%
group_by(Speaker, Vowel) %>%
filter(outlier == FALSE) %>%
ungroup() %>%
select(
-c(
outlier, n, mean_F1, mean_F2, sd_F1,
sd_F2, max_F1, min_F1, max_F2, min_F2,
)
)
}
tails_filter <- function(in_data, tails = 0.05) {
out_data <- in_data %>%
group_by(Speaker, Vowel) %>%
mutate(
# calculate 5% tails for F1 and F2
F1_quant5 = quantile(F1_50, tails),
F2_quant5 = quantile(F2_50, tails),
F1_quant95 = quantile(F1_50, 1 - tails),
F2_quant95 = quantile(F2_50, 1 - tails),
# Determine tokens in 5% tails for either F1 or F2
outlier_quant = ifelse(
F1_50 > F1_quant95 |
F2_50 > F2_quant95 |
F1_50 < F1_quant5 |
F2_50 < F2_quant5,
TRUE,
FALSE
)
) %>%
ungroup() %>%
# Remove outliers
filter(outlier_quant == FALSE) %>%
# Tidy up by removing auxiliary columns generated during calculation.
select(-contains("quant"))
}
QB1 <- QB1 %>%
sd_filter(sd_limit = 2.5)
# down to 156112
QB1 <- QB1 %>%
tails_filter(tails = 0.05)
# Down to 121884
QB1_outlier_count <- nrow(QB1)
QB2 <- QB2 %>%
sd_filter(sd_limit=2.5) %>%
tails_filter(tails = 0.05)
QB2_outlier_count <- nrow(QB2)
Finally, in order to match Brand et al. 2021 (henceforth ‘B21’), we remove foot. We first output the counts for each vowel to indicate that we have the same relative lack of this vowel type found in B21.
vowel_counts <- function(in_data) {
in_data %>%
group_by(Vowel) %>%
summarise(n = n()) %>%
mutate(
rate = n / sum(.$n) * 100
) %>%
arrange(n)
}
QB1 %>% vowel_counts()
foot makes up 2.3% of the QB1 data. This is roughly the rate reported in B21’s supplementary material (3%).
QB2 %>% vowel_counts()
The rate is also similar for the QB2 data.
We continue by removing foot from both data sets.
QB1 <- QB1 %>%
filter(
Vowel != "FOOT"
)
QB2 <- QB2 %>%
filter(
Vowel != "FOOT"
)
QB1_foot_count <- nrow(QB1)
QB2_foot_count <- nrow(QB2)
The final filtering step, again following B21, is to remove any speakers who have fewer than 5 tokens for any of the remaining vowels.
low_n_filter <- function(in_data) {
low_speakers <- in_data %>%
mutate(
Vowel = droplevels(Vowel)
) %>%
# .drop is required to get situations in which a speaker has NO tokens for
# a given vowel. It ensures a group for all levels of the vowel factor
# (including those not present in the data for a given speaker)
group_by(Speaker, Vowel, .drop = FALSE) %>%
count() %>%
ungroup() %>%
filter(n < 5) %>%
pull(Speaker)
out_data <- in_data %>%
ungroup() %>%
filter(!Speaker %in% low_speakers)
}
QB1 <- QB1 %>%
low_n_filter()
QB2 <- QB2 %>%
low_n_filter()
QB1_low_n_count <- nrow(QB1)
QB2_low_n_count <- nrow(QB2)
Following B21, we check for speakers whose vowels overlap one another. If all vowels overlap, then we likely have some kind of systematic vowel tracking problem.
The data from QB1 and QB2 is much cleaner than that which is present in the ONZE corpus used in B21. Systematic vowel tracking problems are much less likely in our data.
We talk through the steps for the QB1 data and then repeat them for QB2 using the functions defined in the following code blocks.
We first find the mean and standard deviations for each speaker for the first and second formant of each monophthong. We then calculate the distances between vowels within each speaker’s vowel space.
speaker_means <- function(in_data) {
out_data <- in_data %>%
group_by(Speaker, Vowel) %>%
summarise(
n = n(),
F1_mean = mean(F1_50),
F2_mean = mean(F2_50),
F1_sd = sd(F1_50),
F2_sd = sd(F2_50)
)
}
speaker_distances <- function(in_data) {
out_data <- in_data %>%
mutate(
#calculate the euclidean distance matrix between the vowel means for each
#speaker
Dist = colMeans(
as.matrix(dist(cbind(F1_mean, F2_mean)))
)
) %>%
ungroup() %>%
#calculate the mean distance across all vowels (by speaker)
group_by(Speaker) %>%
summarise(
mean_dist = mean(Dist)
) %>%
#create a new variable for use in plot labelling.
mutate(
Speaker_dist = paste(round(mean_dist, 2), Speaker)
)
}
We apply the above functions and plot the distribution of distances between vowels for QB1.
QB1_means <- QB1 %>%
speaker_means()
QB1_distances <- QB1_means %>%
speaker_distances()
distance_plot <- function(in_data) {
out_plot <- in_data %>%
ggplot(
aes(x = mean_dist)
) +
geom_density() +
geom_vline(
xintercept = mean(in_data$mean_dist),
linetype = 1
) +
geom_vline(
xintercept = mean(in_data$mean_dist) +
2*sd(in_data$mean_dist),
linetype = 2
) +
geom_vline(
xintercept = mean(in_data$mean_dist) -
2*sd(in_data$mean_dist),
linetype = 2,
colour = "red") +
theme_bw() +
labs(
title = "Distribution of Mean Euclidean Distance Between Vowels for Each Speaker",
xlab = "Mean Euclidian Distance",
ylab = "Density"
)
}
QB1_dist_plot <- QB1_distances %>%
distance_plot()
QB1_dist_plot
Figure 2.1: Distribution of Mean Speaker Distances (QB1)
We’re going to be looking at speakers who have a mean distance between vowels which falls to the left of the red dashed line in Figure 2.1.
outlier_speakers <- function(mean_data, dist_data) {
out_data <- mean_data %>%
inner_join(dist_data) %>%
filter(
mean_dist < mean(dist_data$mean_dist) -
2*sd(dist_data$mean_dist)
)
}
outlier_vs_plot <- function(vowel_data, outlier_data) {
out_plot <- vowel_data %>%
filter(Speaker %in% outlier_data$Speaker) %>%
left_join(., outlier_data, by = c("Speaker", "Vowel")) %>%
arrange(mean_dist) %>%
ggplot(
aes(x = F2_50, y = F1_50, colour = Vowel)
) +
geom_point(size = 0.2, alpha = 0.5) +
stat_ellipse(level = 0.67) +
geom_text(
data = outlier_data,
aes(
x = F2_mean,
y = F1_mean,
label = Vowel
),
size = 2) +
scale_x_reverse() +
scale_y_reverse() +
theme_bw() +
theme(legend.position = "none") +
labs(
title = "Low Mean Euclidean Distance Vowel Spaces"
)
}
QB1_outlier_speakers <- outlier_speakers(QB1_means, QB1_distances)
QB1_outlier_plots <- outlier_vs_plot(QB1, QB1_outlier_speakers)
QB1_outlier_plots + facet_wrap_paginate(
vars(Speaker_dist),
scales="free"
)
Figure 2.2: Low Mean Distance Speakers (QB1)
Figure 2.2 has six speakers. We can instantly see that these are not the kind of global formant tracking failures picked up in B21. But there are still some data issues coming up.
The speaker QB_NZ_F_466
seems to take almost any F2 value as
acceptable measurements for their vowels. Manual inspection suggests
systematic errors in formant tracking for this speaker, so they will be
excluded.
qb1_speakers_to_exclude <- c("QB_NZ_F_466")
QB1 <- QB1 %>%
filter(
!Speaker %in% qb1_speakers_to_exclude
)
QB1_outlier_vs_count <- nrow(QB1)
We perform the same steps for QB2:
QB2_means <- QB2 %>%
speaker_means()
QB2_distances <- QB2_means %>%
speaker_distances()
QB2_dist_plot <- QB2_distances %>%
distance_plot()
QB2_dist_plot
Figure 2.3: Distribution of Mean Speaker Distances (QB2)
The distribution looks somewhat different, but not worryingly so.
We look at the low mean distance speakers:
QB2_outlier_speakers <- outlier_speakers(QB2_means, QB2_distances)
# If no speakers then the plot gives error. So avoid this.
if (nrow(QB2_outlier_speakers) > 0) {
QB2_outlier_plots <- outlier_vs_plot(QB2, QB2_outlier_speakers)
QB2_outlier_plots + facet_wrap(vars(Speaker_dist), scales="free")
} else {
print("No speakers two standard deviations below mean.")
}
Figure 2.4: Low Mean Distance Speakers (QB2)
Following the same reasoning laid out above, we keep this speaker in our data.
QB2_outlier_vs_count <- nrow(QB2)
We wish to compare speakers while abstracting away from physiological
differences. We use the Lobanov 2.0 normalisation, which was developed
for B21, and works effectively in situations with large differences in
token counts across vowels.2 We use the implementation in the
nzilbb.vowels
package.
# The Lobanov 2.0 function requires that the data be ordered with Speaker id,
# vowel name, F1 (Hz) and F2 (Hz) as the first four columns.
QB1 <- QB1 %>%
# Reorder the data into the column order required by the Lobanov 2 function.
relocate(Speaker, Vowel, F1_50, F2_50) %>%
lobanov_2()
QB2 <- QB2 %>%
relocate(Speaker, Vowel, F1_50, F2_50) %>%
lobanov_2()
We now produce post-filtering-and-normalisation plots.
First, QB1:
QB1 %>%
ggplot(
aes(
x = F1_lob2,
y = F2_lob2,
colour = Gender
)
) +
geom_point(alpha=0.4) +
facet_wrap(vars(Vowel))
Figure 2.5: QB1 Formant Data After Filtering
And QB2:
QB2 %>%
ggplot(
aes(
x = F1_lob2,
y = F2_lob2,
colour = Gender
)
) +
geom_point(alpha = 0.4) +
facet_wrap(vars(Vowel))
Figure 2.6: QB2 Formant Data After Filtering
The vowel thought has the most trouble. There is a high F2 cluster of female tokens in QB1 and of male tokens in the QB2 data. There is some trouble with lot in QB2. Visually, this only looks like a handful of tokens.
Let’s look at these
QB2 %>%
filter(
Vowel == "THOUGHT",
F2_lob2 > 1
) %>%
count(
Speaker
)
Almost all from one speaker QB_NZ_M_126
.
What about QB1?
QB1 %>%
filter(
Vowel == "THOUGHT",
F2_lob2 > 1
) %>%
count(
Speaker
)
More speakers coming up here. One, in particular, is quite strong:
QB_NZ_F_266
.
It is easiest to filter these out.
QB1 <- QB1 %>%
filter(
!(Vowel == "THOUGHT" & F2_lob2 > 1)
)
QB2 <- QB2 %>%
filter(
!(Vowel == "THOUGHT" & F2_lob2 > 1)
)
QB1_thought_count <- nrow(QB1)
QB2_thought_count <- nrow(QB2)
We save the data in its processed state. We relevel factor variables to ensure that factor levels with no tokens are not present in the data.
QB1 <- QB1 %>%
mutate(
across(any_of(factor_vars), droplevels),
Age = droplevels(Age)
)
QB2 <- QB2 %>%
mutate(
across(any_of(factor_vars), droplevels),
Age = droplevels(Age)
)
write_rds(QB1, here('data', 'QB1.rds'))
write_rds(QB2, here('data', 'QB2.rds'))
We visualise the filtering process as a flow chart. The initial count of tokens extracted from LaBB-CAT is hard coded for both QB1 and QB2 as the anonymised data already has stopwords and hesitations removed.
First for QB1:
# Flow chart generated using the DiagrammeR package.
QB1_flow_chart <- mermaid(
diagram = glue("
graph TB
A(Extract data from LaBB-CAT: 707287 tokens) -->
B(Filter stopwords and hesitations: {QB1_initial_count} tokens)
B --> C(Filter common errors: {QB1_error_count} tokens)
C --> D(Filter unstressed tokens and tokens preceding liquids: {QB1_stressliquid} tokens)
D --> E(Filter outliers: {QB1_outlier_count} tokens)
E --> F(Filter FOOT tokens: {QB1_foot_count} tokens)
F --> G(Filter low-n speakers: {QB1_low_n_count} tokens)
G --> H(Filter speakers with vowel space overlaps: {QB1_outlier_vs_count})
H --> I(Filter bad THOUGHT tokens: {QB1_thought_count})
"),
height = 800
)
QB1_flow_chart
Figure 2.7: Filtering flow chart (QB1)
Now for QB2:
# Flow chart generated using the DiagrammeR package.
QB2_flow_chart <- mermaid(
diagram = glue("
graph TB
A(Extract data from LaBB-CAT: 287334 tokens) -->
B(Filter stopwords and hesitations: {QB2_initial_count} tokens)
B --> C(Filter common errors: {QB2_error_count} tokens)
C --> D(Filter unstressed tokens and tokens preceding liquids: {QB2_stressliquid} tokens)
D --> E(Filter outliers: {QB2_outlier_count} tokens)
E --> F(Filter FOOT tokens: {QB2_foot_count} tokens)
F --> G(Filter low-n speakers: {QB2_low_n_count} tokens)
G --> H(Filter speakers with vowel space overlaps: {QB2_outlier_vs_count})
H --> I(Filter bad THOUGHT tokens: {QB2_thought_count})
"),
height = 800
)
QB2_flow_chart
Figure 2.8: Filtering flow chart (QB2)
We have a quick look at the descriptive stats for the relevant data.
First, the full QB1 data set:
The counts below are in tokens.
QB1 %>% summary()
## Speaker Vowel F1_50 F2_50 Gender
## Length:104468 DRESS :18838 Min. : 206.0 Min. : 471 F:71640
## Class1:glue KIT :14734 1st Qu.: 398.0 1st Qu.:1407 M:32828
## Class2:character FLEECE :13922 Median : 470.0 Median :1748
## Mode :character STRUT :12720 Mean : 503.6 Mean :1746
## LOT :10192 3rd Qu.: 585.0 3rd Qu.:2121
## TRAP :10033 Max. :1073.0 Max. :2909
## (Other):24029
## Age Ethnicity participant_grew_up
## 46-55 :26103 Maori : 9210 Length:104468
## 56-65 :25595 Non-Maori:95258 Class :character
## 36-45 :20932 Mode :character
## 66-75 :10931
## 18-25 :10641
## 26-35 : 7566
## (Other): 2700
## participant_grew_up_region participant_duration transcript_duration
## Length:104468 Min. : 176.4 Min. : 150.0
## Class :character 1st Qu.: 481.8 1st Qu.: 414.7
## Mode :character Median : 870.6 Median : 633.9
## Mean :1153.8 Mean : 765.1
## 3rd Qu.:1608.5 3rd Qu.:1036.4
## Max. :6476.8 Max. :2272.8
##
## MatchId Word VowelDISC VowelStart
## Length:104468 Length:104468 Length:104468 Min. : 0.51
## Class :character Class1:glue Class :character 1st Qu.: 190.44
## Mode :character Class2:character Mode :character Median : 385.34
## Mode :character Mean : 523.32
## 3rd Qu.: 695.02
## Max. :3344.12
##
## VowelEnd stress VowelMid
## Min. : 0.6 _ : 5251 Min. : 0.555
## 1st Qu.: 190.5 ' :96846 1st Qu.: 190.477
## Median : 385.5 " : 2070 Median : 385.392
## Mean : 523.4 + : 286 Mean : 523.360
## 3rd Qu.: 695.1 Copy from layer: 97: 15 3rd Qu.: 695.072
## Max. :3344.2 Max. :3344.160
##
## previous_segment following_segment articulation_rate speech_rate
## Length:104468 Length:104468 Min. :1.210 Min. :0.6795
## Class :character Class :character 1st Qu.:4.282 1st Qu.:2.9830
## Mode :character Mode :character Median :4.864 Median :3.6381
## Mean :4.897 Mean :3.6749
## 3rd Qu.:5.487 3rd Qu.:4.3253
## Max. :9.396 Max. :9.1324
## NA's :1127 NA's :1127
## VowelDur following_segment_category F1_lob2
## Min. :0.03000 liquid: 0 Min. :-2.74766
## 1st Qu.:0.05000 velar :15974 1st Qu.:-0.81454
## Median :0.08000 labial:19459 Median :-0.33830
## Mean :0.08877 other :69035 Mean :-0.07376
## 3rd Qu.:0.11000 3rd Qu.: 0.54919
## Max. :1.96000 Max. : 4.45733
##
## F2_lob2
## Min. :-3.7561
## 1st Qu.:-0.5600
## Median : 0.2551
## Mean : 0.1535
## 3rd Qu.: 1.0264
## Max. : 2.7060
##
Second, the QB1 data set filter to include only speakers who appear in both QB1 and QB2.
QB1 %>%
filter(
Speaker %in% QB2$Speaker
) %>%
summary()
## Speaker Vowel F1_50 F2_50 Gender
## Length:22529 DRESS :4018 Min. : 234.0 Min. : 471 F:16626
## Class1:glue KIT :3235 1st Qu.: 401.0 1st Qu.:1426 M: 5903
## Class2:character FLEECE :3005 Median : 474.0 Median :1767
## Mode :character STRUT :2630 Mean : 507.7 Mean :1766
## LOT :2249 3rd Qu.: 592.0 3rd Qu.:2154
## TRAP :2179 Max. :1069.0 Max. :2894
## (Other):5213
## Age Ethnicity participant_grew_up
## 46-55 :5575 Maori : 3301 Length:22529
## 36-45 :5561 Non-Maori:19228 Class :character
## 56-65 :5401 Mode :character
## 66-75 :3761
## 18-25 :1383
## 26-35 : 547
## (Other): 301
## participant_grew_up_region participant_duration transcript_duration
## Length:22529 Min. : 935.3 Min. : 150.0
## Class :character 1st Qu.:1608.5 1st Qu.: 458.8
## Mode :character Median :1953.1 Median : 666.3
## Mean :2148.8 Mean : 710.0
## 3rd Qu.:2579.5 3rd Qu.: 934.2
## Max. :6476.8 Max. :1296.3
##
## MatchId Word VowelDISC VowelStart
## Length:22529 Length:22529 Length:22529 Min. : 0.59
## Class :character Class1:glue Class :character 1st Qu.: 190.71
## Mode :character Class2:character Mode :character Median : 394.10
## Mode :character Mean : 484.29
## 3rd Qu.: 689.03
## Max. :1819.45
##
## VowelEnd stress VowelMid
## Min. : 0.7 _ : 1013 Min. : 0.655
## 1st Qu.: 190.8 ' :20935 1st Qu.: 190.750
## Median : 394.2 " : 530 Median : 394.125
## Mean : 484.4 + : 51 Mean : 484.333
## 3rd Qu.: 689.1 Copy from layer: 97: 0 3rd Qu.: 689.090
## Max. :1819.5 Max. :1819.495
##
## previous_segment following_segment articulation_rate speech_rate
## Length:22529 Length:22529 Min. :1.442 Min. :0.9915
## Class :character Class :character 1st Qu.:4.194 1st Qu.:2.8969
## Mode :character Mode :character Median :4.762 Median :3.5260
## Mean :4.811 Mean :3.5549
## 3rd Qu.:5.402 3rd Qu.:4.1766
## Max. :9.132 Max. :9.1324
## NA's :486 NA's :486
## VowelDur following_segment_category F1_lob2
## Min. :0.03000 liquid: 0 Min. :-2.31143
## 1st Qu.:0.05000 velar : 3382 1st Qu.:-0.80522
## Median :0.08000 labial: 4242 Median :-0.33917
## Mean :0.09273 other :14905 Mean :-0.06557
## 3rd Qu.:0.11000 3rd Qu.: 0.55523
## Max. :1.96000 Max. : 4.29760
##
## F2_lob2
## Min. :-3.7561
## 1st Qu.:-0.5521
## Median : 0.2491
## Mean : 0.1484
## 3rd Qu.: 1.0364
## Max. : 2.5144
##
Finally, the QB2 data set:
QB2 %>% summary()
## Speaker Vowel F1_50 F2_50 Gender
## Length:43403 DRESS :7807 Min. : 194.0 Min. : 490 F:31469
## Class1:glue KIT :7164 1st Qu.: 392.0 1st Qu.:1419 M:11934
## Class2:character FLEECE :5542 Median : 462.0 Median :1730
## Mode :character STRUT :5302 Mean : 496.5 Mean :1727
## LOT :4008 3rd Qu.: 578.0 3rd Qu.:2066
## TRAP :3982 Max. :1072.0 Max. :2868
## (Other):9598
## Age Ethnicity participant_grew_up participant_grew_up_region
## 18-25: 5434 Maori : 4366 Length:43403 Length:43403
## 26-35: 1689 Non-Maori:39037 Class :character Class :character
## 36-45: 8147 Mode :character Mode :character
## 46-55:10861
## 56-65:10482
## 66-75: 5886
## 76-85: 904
## participant_duration transcript_duration MatchId Word
## Min. : 574.9 Min. : 161.6 Length:43403 Length:43403
## 1st Qu.:1421.6 1st Qu.: 972.6 Class :character Class1:glue
## Median :1982.2 Median :1357.6 Mode :character Class2:character
## Mean :2418.8 Mean :1753.2 Mode :character
## 3rd Qu.:2995.4 3rd Qu.:1913.0
## Max. :6476.8 Max. :5756.2
##
## VowelDISC VowelStart VowelEnd stress
## Length:43403 Min. : 3.785 Min. : 3.845 _: 2489
## Class :character 1st Qu.: 320.798 1st Qu.: 320.870 ':39886
## Mode :character Median : 649.346 Median : 649.415 ": 1006
## Mean : 884.392 Mean : 884.488 +: 22
## 3rd Qu.:1109.870 3rd Qu.:1109.992
## Max. :5798.027 Max. :5798.117
##
## VowelMid previous_segment following_segment articulation_rate
## Min. : 3.815 Length:43403 Length:43403 Min. : 1.587
## 1st Qu.: 320.838 Class :character Class :character 1st Qu.: 4.157
## Median : 649.371 Mode :character Mode :character Median : 4.751
## Mean : 884.440 Mean : 4.821
## 3rd Qu.:1109.935 3rd Qu.: 5.402
## Max. :5798.072 Max. :10.169
##
## speech_rate VowelDur following_segment_category
## Min. :0.5517 Min. :0.0300 liquid: 0
## 1st Qu.:3.0889 1st Qu.:0.0600 labial: 8100
## Median :3.6762 Median :0.0800 velar : 6847
## Mean :3.7456 Mean :0.0953 other :28456
## 3rd Qu.:4.3216 3rd Qu.:0.1200
## Max. :8.0808 Max. :1.1900
##
## F1_lob2 F2_lob2
## Min. :-2.70793 Min. :-3.8618
## 1st Qu.:-0.82636 1st Qu.:-0.5144
## Median :-0.33119 Median : 0.2408
## Mean :-0.07701 Mean : 0.1664
## 3rd Qu.: 0.54397 3rd Qu.: 0.9868
## Max. : 4.39026 Max. : 2.7753
##
In tabular form:
column_descriptions_initial <- c(
"Anonymised speaker code",
"Vowel (Wells Lexical Sets)",
"First formant midpoint reading (Hz)",
"Second formant midpoint reading (Hz)",
"Speaker gender (binary)",
"Speaker age category",
"Speaker ethnicity (at QB1)",
"Country speaker grew up",
"Region speaker grew up",
"Duration of speakers' speech (seconds)",
"Duration of transcript (seconds)",
"LaBB-CAT token identifier",
"Anonymised word identifier",
"Vowel category (DISC)",
"Onset of vowel token (seconds)",
"Offset of vowel token (seconds)",
"Stress code (primary, secondary, etc)",
"Vowel midpoint (seconds)",
"Previous segment (DISC)",
"Following segment (DISC)",
"Utterance articulation rate",
"Utterance speech rate",
"Vowel duration (seconds)",
"Following segment category",
"First formant midpoint reading (Lobanov 2.0)",
"Second formant midpoint reading (Lobanov 2.0)"
)
postfilter_table <- tibble(
Name = names(QB1),
Type = reframe(QB1, across(everything(), class))[2,] %>% as_vector(),
Description = column_descriptions_initial
)
postfilter_table <- postfilter_table %>%
gt() %>%
tab_header(
title = "Variables after preprocessing"
)
postfilter_table
Variables after preprocessing | ||
Name | Type | Description |
---|---|---|
Speaker | character | Anonymised speaker code |
Vowel | factor | Vowel (Wells Lexical Sets) |
F1_50 | numeric | First formant midpoint reading (Hz) |
F2_50 | numeric | Second formant midpoint reading (Hz) |
Gender | factor | Speaker gender (binary) |
Age | factor | Speaker age category |
Ethnicity | factor | Speaker ethnicity (at QB1) |
participant_grew_up | character | Country speaker grew up |
participant_grew_up_region | character | Region speaker grew up |
participant_duration | numeric | Duration of speakers' speech (seconds) |
transcript_duration | numeric | Duration of transcript (seconds) |
MatchId | character | LaBB-CAT token identifier |
Word | character | Anonymised word identifier |
VowelDISC | character | Vowel category (DISC) |
VowelStart | numeric | Onset of vowel token (seconds) |
VowelEnd | numeric | Offset of vowel token (seconds) |
stress | factor | Stress code (primary, secondary, etc) |
VowelMid | numeric | Vowel midpoint (seconds) |
previous_segment | character | Previous segment (DISC) |
following_segment | character | Following segment (DISC) |
articulation_rate | numeric | Utterance articulation rate |
speech_rate | numeric | Utterance speech rate |
VowelDur | numeric | Vowel duration (seconds) |
following_segment_category | factor | Following segment category |
F1_lob2 | numeric | First formant midpoint reading (Lobanov 2.0) |
F2_lob2 | numeric | Second formant midpoint reading (Lobanov 2.0) |
The article requires a table of token counts and a histogram of speakers by gender and age group.
counts_data <- function(in_data) {
out_data <- in_data %>%
group_by(
Vowel
) %>%
summarise(
Tokens = n()
) %>%
mutate(
`%` = signif(Tokens/sum(.$Tokens) * 100, 2)
) %>%
arrange(
as.character(Vowel)
) %>%
bind_rows(
tibble(
"Vowel" = c("Total"), "Tokens" = nrow(in_data), "%" = 100
)
)
}
QB1_counts <- counts_data(QB1)
QB1_counts %>%
kable(caption = "All QB1 Speakers") %>%
kable_paper(bootstrap_options = "striped", full_width = F)
Vowel | Tokens | % |
---|---|---|
DRESS | 18838 | 18.0 |
FLEECE | 13922 | 13.0 |
GOOSE | 5982 | 5.7 |
KIT | 14734 | 14.0 |
LOT | 10192 | 9.8 |
NURSE | 4672 | 4.5 |
START | 5426 | 5.2 |
STRUT | 12720 | 12.0 |
THOUGHT | 7949 | 7.6 |
TRAP | 10033 | 9.6 |
Total | 104468 | 100.0 |
We also produce a table for the QB1 and QB2 data of the speakers in both corpora.
QB1_sub <- QB1 %>%
filter(
Speaker %in% QB2$Speaker
)
QB1_sub_counts <- counts_data(QB1_sub)
QB2_counts <- counts_data(QB2)
QBBoth_counts <- bind_cols(
QB1_sub_counts %>%
rename(
`Tokens (QB1)` = Tokens,
`% (QB1)` = `%`
),
QB2_counts %>%
select(-Vowel) %>%
rename(
`Tokens (QB2)` = Tokens,
`% (QB2)` = `%`
)
)
QBBoth_counts %>%
kable(caption = "Speakers in Both Corpora") %>%
kable_paper(bootstrap_options = "striped", full_width = F) %>%
row_spec(11, bold=T)
Vowel | Tokens (QB1) | % (QB1) | Tokens (QB2) | % (QB2) |
---|---|---|---|---|
DRESS | 4018 | 18.0 | 7807 | 18.0 |
FLEECE | 3005 | 13.0 | 5542 | 13.0 |
GOOSE | 1311 | 5.8 | 2594 | 6.0 |
KIT | 3235 | 14.0 | 7164 | 17.0 |
LOT | 2249 | 10.0 | 4008 | 9.2 |
NURSE | 1014 | 4.5 | 2259 | 5.2 |
START | 1218 | 5.4 | 2036 | 4.7 |
STRUT | 2630 | 12.0 | 5302 | 12.0 |
THOUGHT | 1670 | 7.4 | 2709 | 6.2 |
TRAP | 2179 | 9.7 | 3982 | 9.2 |
Total | 22529 | 100.0 | 43403 | 100.0 |
We generate distributions of speakers by age and gender.
QB1 %>%
group_by(Age, Gender) %>%
summarise(
speakers = n_distinct(Speaker)
) %>%
ggplot(
aes(
x = Age,
y = speakers,
fill = Gender,
)
) +
geom_col() +
scale_fill_manual(values = c("F" = "#E69F00", "M" = "#009E73")) +
labs(
title = "QB1 Distribution of Speakers by Age and Gender"
)
Interpretation of our comparisons between QB1 and QB2 rely on a sense of the overall pattern of change in the community. In order to do this, we first fit generalised additive mixed models (GAMMs) using all speakers at QB1 (not just the speakers who are in both QB1 and QB2). This gives us a sense of patterns in change across apparent time. We then look at changes in the pattern in real time (for the speakers who are in both QB1 and QB2).
We fit models following the pattern of B21: a model is fit to each vowel and formant, i.e., separate models for dress F1, dress F2, nurse F1, etc.
These models predict Lobanov 2.0 normalised formant values with a parametric term for gender and a four-knot smooth for age category at both levels of gender. We also fit articulation rate as a control variable and random effect intercepts for speaker and word.
The model structure matches B21 with two exceptions. First, B21 uses a difference smooth structure for gender. We instead fit a separate smooth for the males and females in the data. The resulting p-values tell us whether the smooth for the given gender is significant rather than whether the models for males and females are significantly different from one another. Second, QB data does not have year of birth. Instead, we have age categories. We fit a smooth through the age categories with four knots rather than the default 10.
The other difference is that we fit the model using a scaled t distribution as our error model. This model structure allows for heavier tails in the response variable. This is often the case with formant data and models which were fit using the standard normal error distribution performed worse than the models reported here.
The pattern in the code block below, of nesting data, fitting models, and extracting intercepts, is set out here.
We’ll need a numeric representation of the age categories and to convert speaker, gender, and word to factor variables.
# Check that age values in the correct order.
levels(QB1$Age)
## [1] "18-25" "26-35" "36-45" "46-55" "56-65" "66-75" "76-85" "85+"
# output: "18-25" "26-35" "36-45" "46-55" "56-65" "66-75" "76-85" "85+" "na"
# Correct!
# Convert to integer and do to factor conversions for speaker and word.
QB1 <- QB1 %>%
mutate(
age_category_numeric = as.integer(Age),
Speaker = as.factor(Speaker),
Gender = as.factor(Gender),
Word = as.factor(Word)
)
We count how many speakers are in each category to check whether any age categories should be excluded:
QB1 %>%
group_by(Age) %>%
summarise(
speaker_count = n_distinct(Speaker)
)
We’ve only got one 85+ speaker, so we remove the 85+ category.
QB1_models <- QB1 %>%
filter(
Age != "85+"
) %>%
select(
-F1_50, -F2_50
) %>%
pivot_longer(
cols = F1_lob2:F2_lob2,
names_to = "formant_type",
values_to = "formant_value"
) %>%
group_by(Vowel, formant_type) %>%
nest() %>%
mutate(
model = map(
data,
~ bam(
formant_value ~ Gender + s(age_category_numeric, k = 4, by = Gender) + # main effects
s(articulation_rate, k = 3) + # control vars
s(Speaker, bs='re') + s(Word, bs='re'), # random effects
discrete = TRUE,
nthreads = 16, # This value is too high for most situations.
family = mgcv::scat(link ='identity'),
data = .x
)
),
# Extract p-values. Comment out these lines if you are still experimenting
# with getting the models to fit. Generating summaries can take as long as
# fitting a GAM.
model_smooth_pvs = map(
model,
~ summary(.x)$s.pv
),
genderF_smooth_pv = map_dbl(
model_smooth_pvs,
~ .x[[1]]
),
genderM_smooth_pv = map_dbl(
model_smooth_pvs,
~ .x[[2]]
)
)
write_rds(QB1_models, here('models', 'QB1_models.rds'), compress = "gz")
The following code block outputs model checks for each of the vowel-formant models by means of a qqplot.
walk2(
str_c(QB1_models$Vowel, '_', QB1_models$formant_type),
QB1_models$model,
~ {
qq.gam(.y, sub=.x)
}
)
Fuller output is available by using the following code block, but the output makes the html excessively large.
walk2(
str_c(QB1_models$Vowel, '_', QB1_models$formant_type),
QB1_models$model,
~ {
print(.x)
gam.check(.y, sub=.x)
}
)
There is no evidence of serious problems with model fit here.
We generate predictions from the models in order plot them. We generate predictions for each gender and each age category. At the same time, we collect p-value information for the formant smooths for each gender from the model summaries.
to_predict <- list(
"age_category_numeric" = seq(from=1, to=7, by=1), # All age categories.
"Gender" = c("M", "F")
)
QB1_preds <- QB1_models %>%
mutate(
prediction = map(
model,
~ get_predictions(model = .x, cond = to_predict, print.summary = FALSE)
)
) %>%
select(
Vowel, formant_type, prediction, genderF_smooth_pv, genderM_smooth_pv
) %>%
unnest(prediction) %>%
# This step is important. It ensures that, when we plot,
# arrows go from oldest to youngest speakers.
arrange(
desc(age_category_numeric)
)
QB1_preds <- QB1_preds %>%
mutate(
p_value = if_else(Gender == "F", genderF_smooth_pv, genderM_smooth_pv)
) %>%
select( # Remove unneeded variables
-articulation_rate,
-genderF_smooth_pv,
-genderM_smooth_pv
) %>%
pivot_wider( # Pivot
names_from = formant_type,
values_from = c(fit, p_value, CI)
) %>%
rename(
F1_lob2 = fit_F1_lob2,
F2_lob2 = fit_F2_lob2
) %>%
mutate(
# A variable for statistical significance of smooths - either F1 or F2
# significant with bonferonni correction.
is_sig = p_value_F1_lob2 < 0.025 | p_value_F2_lob2 < 0.025
)
We look at the p-values for each smooth.
QB1_preds %>%
group_by(Vowel, Gender) %>%
summarise(
F1=first(p_value_F1_lob2),
F2=first(p_value_F2_lob2),
is_sig = first(is_sig)
) %>%
kable() %>%
kable_styling(full_width = TRUE)
Vowel | Gender | F1 | F2 | is_sig |
---|---|---|---|---|
TRAP | M | 0.2513332 | 0.5662457 | FALSE |
TRAP | F | 0.1250439 | 0.0454166 | FALSE |
START | M | 0.8647379 | 0.0000000 | TRUE |
START | F | 0.0993974 | 0.0000000 | TRUE |
THOUGHT | M | 0.1839985 | 0.8761409 | FALSE |
THOUGHT | F | 0.1043083 | 0.0050031 | TRUE |
NURSE | M | 0.0000011 | 0.1435383 | TRUE |
NURSE | F | 0.0000000 | 0.0000009 | TRUE |
DRESS | M | 0.0000009 | 0.0000000 | TRUE |
DRESS | F | 0.0000000 | 0.0000000 | TRUE |
FLEECE | M | 0.0244654 | 0.1027110 | TRUE |
FLEECE | F | 0.0005731 | 0.0104060 | TRUE |
KIT | M | 0.0168849 | 0.0545227 | TRUE |
KIT | F | 0.0000155 | 0.0027582 | TRUE |
LOT | M | 0.1980505 | 0.1278668 | FALSE |
LOT | F | 0.0613967 | 0.0279618 | FALSE |
GOOSE | M | 0.1991470 | 0.7556077 | FALSE |
GOOSE | F | 0.0003627 | 0.5781464 | TRUE |
STRUT | M | 0.2759764 | 0.0017502 | TRUE |
STRUT | F | 0.8190211 | 0.0778254 | FALSE |
The column is_sig
in the above table tracks significance using a
Bonferroni correction of the 0.05 significance threshold. The basic idea
is that the hypothesis we are ‘testing’, for each vowel and gender, is
that the vowel moves in the vowel space as we go up the age categories.
This combines two p-values: the p-value for the F1 model and the
p-value for the F2 model. Consequently, we divide 0.05 by 2, and test
whether the F1 or the F2 model reaches this new, lower, threshold.
In all cases where a threshold is adjusted, it is worth looking at what would happen if an alternative threshold were used. For instance, if we had not used the Bonferroni correction, we see that lot would come out as significant for female speakers.
We now plot these
# Tol colours, designed to be colour blind friendly.
vowel_colours <- c(
DRESS = "#777777", # This is the 'bad data' colour for maps
FLEECE = "#882E72",
GOOSE = "#4EB265",
KIT = "#7BAFDE",
LOT = "#DC050C",
TRAP = "#878100", # "#F7F056",
START = "#1965B0",
STRUT = "#F4A736",
THOUGHT = "#72190E",
NURSE = "#E8601C",
FOOT = "#5289C7"
)
# Some alternative vowel colours
# vowel_colours <- c(
# DRESS = "#9590FF",
# FLEECE = "#D89000",
# GOOSE = "#A3A500",
# KIT = "#39B600",
# LOT = "#00BF7D",
# NURSE = "#00BFC4",
# START = "#00B0F6",
# STRUT = "#F8766D",
# THOUGHT = "#E76BF3",
# TRAP = "#FF62BC"
# )
# We use the following vector to reorder the vowel vector and thus control
# the order in which vowels appear in the legend of our plots.
vowels = c(
"FLEECE", "GOOSE", "DRESS", "NURSE", "THOUGHT", "KIT", "TRAP", "LOT", "STRUT",
"START"
)
# Set up some functions for often-repeated plotting code.
qb_mapping <- aes(
x = F2_lob2,
y = F1_lob2,
colour = Vowel,
label = vowel_lab,
group = Vowel
)
vs_layers <- function(in_plot) {
out_plot <- in_plot +
geom_path(
arrow = arrow(
ends = "last",
type="closed",
length = unit(2, "mm")
)
) +
geom_point(
data = ~ .x %>%
filter(
!vowel_lab == ""
),
show.legend = FALSE,
size = 1.5
) +
geom_label_repel(
min.segment.length = 0, seed = 42,
show.legend = FALSE,
fontface = "bold",
size = 10 / .pt,
label.padding = unit(0.2, "lines"),
alpha = 1,
max.overlaps = Inf
)
}
vs_scales <- function(in_plot) {
out_plot <- in_plot +
scale_x_reverse(expand = expansion(mult = 0.2), position = "top") +
scale_y_reverse(expand = expansion(mult = 0.1), position = "right") +
#scale_linetype_manual(values = c('FALSE' = "dotted", 'TRUE' = "solid")) +
#scale_linewidth_manual(values = c('FALSE' = 0.5, 'TRUE' = 1)) +
scale_colour_manual(values = vowel_colours) +
facet_grid(
cols = vars(Gender)
)
}
vs_theme <- function(in_plot) {
out_plot <- in_plot +
labs(
x = "F2 (normalised)",
y = "F1 (normalised)"
) +
theme_bw() +
theme(
plot.title = element_text(face="bold"),
legend.position="none"
)
}
add_labels <- function(in_data, is_sig=FALSE) {
# if is_sig=TRUE assume there is a column called is_sig
# which says whether a smooth is significant.
out_data <- in_data %>%
group_by(Vowel, Gender) %>%
mutate(
vowel_lab = if_else(
age_category_numeric == max(age_category_numeric),
Vowel,
""
)
) %>%
ungroup()
if (is_sig) {
out_data <- out_data %>%
mutate(
vowel_lab = if_else(
is_sig,
str_c(vowel_lab, "*"),
str_c(vowel_lab, " (n.s.)")
),
vowel_lab = if_else(
vowel_lab %in% c("*", " (n.s.)"),
"",
vowel_lab
)
)
}
out_data
}
qb_plot <- function(in_data) {
out_plot <- in_data %>%
add_labels() %>%
ggplot(
mapping = qb_mapping
) %>%
vs_layers() %>%
vs_scales() %>%
vs_theme()
}
# The complexity of this code block is due to addition
# of non-significant smooths to the plot.
all_plot <- QB1_preds %>%
mutate(
Gender = if_else(Gender == "F", "W", Gender),
Gender = factor(Gender, levels = c("M", "W"))
) %>%
add_labels(is_sig=TRUE) %>%
ggplot(
mapping = aes(
x = F2_lob2,
y = F1_lob2,
colour = Vowel,
label = vowel_lab,
group = Vowel,
linewidth = is_sig,
alpha = is_sig
)
) %>%
vs_layers() %>%
vs_scales() %>%
vs_theme() +
scale_linewidth_manual(
values = c(
'TRUE' = 1, 'FALSE' = 0.25
)
) +
scale_alpha_manual(
values = c(
'TRUE' = 1, 'FALSE' = 0.5
)
) +
labs(
title = "QB1 Change Over Apparent Time by Gender",
subtitle = "From 76–85 to 18-25",
)
all_plot
We also want to see the detail of the high front vowels in the female speakers (as hard to see in these plots).
detail_plot <- QB1_preds %>%
filter(
Vowel %in% c("FLEECE", "GOOSE", "DRESS", "NURSE", "KIT"),
) %>%
filter(is_sig) %>%
qb_plot() +
labs(
title = "QB1 Change Over Apparent Time by Gender",
subtitle = "From 76–85 to 18-25",
)
detail_plot
We can look at this plot interactively, as follows:
plotly_plot <- QB1_preds %>%
# for labelling, we restore the original Age variable.
left_join(
QB1 %>%
select(Age, age_category_numeric) %>%
distinct()
) %>%
mutate(
Vowel = factor(Vowel, levels = vowels),
Age = droplevels(Age)
) %>%
ggplot(
aes(
x = F2_lob2,
y = F1_lob2,
colour = Vowel,
label = Vowel,
group = Vowel,
alpha = is_sig,
frame = Age
)
) +
geom_text(
size = 3
) +
scale_x_reverse(expand = expansion(mult = 0.1), position = "top") +
scale_y_reverse(position = "right") +
scale_alpha_manual(values = c('FALSE' = 0.4, 'TRUE' = 1)) +
scale_colour_manual(values = vowel_colours) +
facet_grid(
cols = vars(Gender)
) +
labs(
x = "F2 (normalised)",
y = "F1 (normalised)"
) +
theme_bw() +
theme(plot.title = element_text(face="bold"))
ggplotly(plotly_plot) %>% layout(height=900, showlegend=FALSE)
We also need to plot these with the results from B21. We plot both male and female speakers. All smooths were found to be significant in the original paper. To match the previous plots for this paper, we manually add a “*” to the labels to indicate this.
Here we just copy from the function defined above and modify variable names.
b21_preds <- read_rds(here('data', 'mod_pred_yobgender_values.rds'))
b21_preds <- b21_preds %>%
select(-Speech_rate, -Word, -ll, -ul) %>%
# filter(participant_year_of_birth > 1925) %>% #Turn on to restrict to speakers overlapping in age with QB.
pivot_wider(
names_from = Formant,
values_from = c(fit, CI)
) %>%
mutate(
Gender = if_else(Gender == "F", "W", Gender),
Vowel = factor(Vowel, levels = vowels),
Gender = factor(Gender, levels = c("M", "W"))
) %>%
group_by(Vowel, Gender) %>%
# Again, important to use _maximum_ here to get the oldest speakers to be
# our _first_ observation. (the origin of the paths in the plot)
mutate(
vowel_lab = if_else(
participant_year_of_birth == min(participant_year_of_birth),
str_c(Vowel, "*"),
""
)
) %>%
ungroup()
min_year <- min(b21_preds$participant_year_of_birth)
max_year <- max(b21_preds$participant_year_of_birth)
b21_plot <- b21_preds %>%
ggplot(
aes(
x = fit_F2,
y = fit_F1,
colour = Vowel,
label = vowel_lab,
group = Vowel
)
) %>%
vs_layers() %>%
vs_scales() %>%
vs_theme() +
labs(
title = "ONZE Change Over Real Time by Gender",
subtitle = glue("Year of birth {min_year} to {max_year}")
)
b21_plot
It is also useful to get a sense of what PCA looks like on the complete QB1 corpus.
QB1_intercepts_complete <- QB1_models %>%
mutate(
random_intercepts = map(
model,
~ get_random(.x)$`s(Speaker)` %>%
as_tibble(rownames = "Speaker") %>%
rename(intercept = value)
)
) %>%
select(
Vowel, formant_type, random_intercepts
) %>%
unnest(random_intercepts) %>%
arrange(as.character(Vowel))
QB1_intercepts_complete <- QB1_intercepts_complete %>%
# Create a column to store our eventual column names
mutate(
# Combine the 'vowel' and 'formant_type' columns as a string.
vowel_formant = str_c(Vowel, '_', formant_type),
# Remove '_lob2' for cleaner column names
vowel_formant = str_replace(vowel_formant, '_lob2', '')
) %>%
ungroup() %>%
# Remove old 'vowel' and 'formant_type' columns
select(-c(Vowel, formant_type)) %>%
# Make data 'wider', by...
pivot_wider(
names_from = vowel_formant, # naming columns using 'vowel_formant'...
values_from = intercept # and values from intercept column
)
This step can reveal missing values if a speaker has not been able to have an intercept estimated in a particular model.
Are there missing values?
QB1_intercepts_complete[
which(is.na(QB1_intercepts_complete), arr.ind=TRUE)[,1],
]
We remove this speaker.
QB1_intercepts_complete <- QB1_intercepts_complete %>%
filter(
!is.na(START_F1)
)
Let’s look at a pairwise correlation plot:
QB1_intercepts_complete %>%
select(-Speaker) %>%
as.matrix() %>%
cor() %>%
corrplot()
Generating confidence intervals for our PCA by bootstrapping takes a bit of time. So we load in previous computer results to speed up the knitting of this document.
QB1_pca_complete <- pca_test(
QB1_intercepts_complete %>% select(-Speaker),
n = 1000
)
write_rds(
QB1_pca_complete,
here('models', 'QB1_pca_complete.rds'),
compress="gz"
)
plot_variance_explained(QB1_pca_complete, pc_max = 6)
We’ll look at the first four PCs, as they appear outside the NULL distribution. However, PC2 through PC4’s confidence intervals intersect. We will need to filter out bootstrapped analyses to deal with PC instability.
First PC1:
plot_loadings(QB1_pca_complete, pc_no=1)
Figure 3.1: Index loadings for PC1 (QB1).
NB: high PC1 scores are conservative with respect to the New Zealand English short vowel shift.
Now, PC2:
plot_loadings(QB1_pca_complete, pc_no=2, filter_boots=TRUE)
Figure 3.2: Index loadings for PC2 (QB1).
PC2 represents the thought, start, strut relationship found in ONZE data by B21 (although without thought F2 involved.
We now turn to PC3 and PC4.
plot_loadings(QB1_pca_complete, pc_no=3, filter_boots=TRUE)
Figure 3.3: Index loadings for PC3 (QB1).
plot_loadings(QB1_pca_complete, pc_no=4, filter_boots=TRUE)
Figure 3.4: Index loadings for PC4 (QB1).
We are starting to get PCs which only reliably capture single variables from our original data. We’ll ignore PC3 and PC4.
We here export speaker scores for a distinct project.
QB1_comp_pc_scores <- prcomp(
QB1_intercepts_complete %>% select(-Speaker),
scale=TRUE
)$x
QB1_comp_pc_scores <- as_tibble(QB1_comp_pc_scores) %>%
mutate(
Speaker = QB1_intercepts_complete$Speaker
) %>%
relocate(Speaker)
write_rds(QB1_comp_pc_scores, here('data', 'QB1_complete_scores.rds'))
write_csv(QB1_comp_pc_scores, here('data', 'QB1_complete_scores.csv'))
In order to work out the differences between QB1 and QB2, we restrict ourselves to the speakers which are in both data sets. Since the speakers of QB2 are a subset of the speakers of QB1, we can simple filter out the speakers who are not in QB2 from the QB1 data set and produce a data set with data from both corpora.
We now fit a series of GAMMs to determine the whether the smooths or intercept terms differ between QB1 and QB2.
We will evaluate difference smooths, which require us to represent
corpus as an ordered factor. The eventual fitting of difference smooths
is determined by setting the corpus
variable to an ordered factor with
treatment contrasts.
QB2 <- QB2 %>%
mutate(
age_category_numeric = as.integer(Age),
Speaker = as.factor(Speaker),
Gender = as.factor(Gender),
Word = as.factor(Word)
)
QBBoth <- bind_rows(
"qb1" = QB1 %>%
filter(
Speaker %in% QB2$Speaker
) %>%
# Removing Age due to difficulty in binding ordered factors
select(-Age),
"qb2" = QB2 %>%
select(-Age),
.id = "corpus"
)
QBBoth <- QBBoth %>%
mutate(
corpus = as.ordered(corpus)
)
contrasts(QBBoth$corpus) <- "contr.treatment"
As with our previous models, we lengthen the data so that there is a row for each formant type.
The model structure will be
formant_value ~ corpus + s(age_category_numeric) +
s(age_category_numeric, by = corpus) + s(transcript_articulation_rate) +
s(Speaker, bs='re') + s(Word, bs='re')
. That is, a smooth for
articulation rate as a control, random intercepts for speaker and word,
and a difference smooth structure to investigate the affect of corpus on
the age category smooth. We will fit each gender separately.
QBBoth_models <- QBBoth %>%
filter(
age_category_numeric != 8
) %>%
select(
-F1_50, -F2_50
) %>%
pivot_longer(
cols = F1_lob2:F2_lob2,
names_to = "formant_type",
values_to = "formant_value"
) %>%
group_by(Vowel, formant_type, Gender) %>%
nest() %>%
mutate(
model = map(
data,
~ bam(
formula = formant_value ~ corpus + s(age_category_numeric, k = 4) +
s(age_category_numeric, by = corpus, k = 4) +
s(articulation_rate, k = 3) + s(Speaker, bs='re') + s(Word, bs='re'),
family = mgcv::scat(link ='identity'),
discrete = TRUE,
nthreads = 16,
data = .x
)
),
model_summary = map(model, summary),
base_s_pv = map_dbl(model_summary, ~ .x$s.pv[[1]]),
diff_s_pv = map_dbl(model_summary, ~ .x$s.pv[[2]]),
p_pv = map_dbl(model_summary, ~ .x$p.pv[[2]]) #parametric term for corpus p value
)
# Save models
write_rds(QBBoth_models, here('models', 'QBBoth_models.rds'), compress="gz")
The following code block outputs model checks for each of the vowel-formant models.
walk2(
str_c(
QBBoth_models$Vowel, '_',
QBBoth_models$formant_type, '_',
QBBoth_models$Genderdocs
),
QBBoth_models$model,
~ {
qq.gam(.y, sub=.x)
}
)
Fit is acceptable. We can see evidence of formant tracking errors, but
this is hard to avoid. We also see that the estimated degrees of freedom
are getting close to k'
. Again, this is hard to avoid given the amount
of data we have.
There are two coefficients of interest to us for each of these models. The parametric difference term gives us an estimate of the overall difference in height between two smooths. We look at these terms first:
QBBoth_models <- QBBoth_models %>%
mutate(
QB1_parametric = map_dbl(
model_summary,
~ .x$p.coef[[1]]
),
QB2_parametric = map2_dbl(
model_summary,
QB1_parametric,
~ .y + .x$p.coef[[2]]
)
)
parametric_difference_plot <- QBBoth_models %>%
select(
Vowel, formant_type, Gender, p_pv, QB1_parametric, QB2_parametric
) %>%
pivot_longer(
cols = QB1_parametric:QB2_parametric,
names_to = "corpus",
values_to = "parametric"
) %>%
mutate(
corpus = str_sub(corpus, end = 3L),
formant_type = str_sub(formant_type, end = 2L)
) %>%
filter(
p_pv < 0.025 | corpus == "QB1"
) %>%
select(
- p_pv
) %>%
pivot_wider(
names_from = formant_type,
values_from = parametric
) %>%
# LOCF for missing QB2 values (to give them the QB1 value for non-sig diffs).
fill(
F1:F2,
.direction = "down"
) %>%
mutate(
Gender= if_else(Gender == "F", "W", Gender),
Gender = factor(Gender, levels = c("M", "W"))
) %>%
ggplot(
aes(
x = F2,
y = F1,
colour = corpus,
group = Vowel,
label = Vowel
)
) +
# geom_path(
# arrow = arrow(
# ends = "last",
# type="closed",
# length = unit(1, "mm")
# )
# ) +
geom_text(size = 3, alpha = 1, fontface="bold") +
scale_x_reverse(position = "top", limits = c(2, -2.5)) +
scale_y_reverse(position = "right") +
#scale_colour_manual(values = vowel_colours) +
scale_colour_manual(values = c("QB1" = "#E69F00", "QB2" = "#009E73")) +
facet_grid(cols = vars(Gender)) +
labs(
title = "Statistically Significant Changes in Parametric Term, QB1 v. QB2",
subtitle = "Alpha = 0.025",
y = "F1 (normalised)",
x = "F2 (normalised)",
colour = "Corpus"
) +
theme(
plot.title = element_text(face="bold"),
)
parametric_difference_plot
Figure 3.5: Significant differences in parametric term between QB1 and QB2.
Figure 3.5 shows the difference in overall height or backness between the two corpora if they are statistically significant. We apply a small Bonferroni correction, assuming that we are treating each difference in either F1 or F2 for each vowel as our hypothesis. The plot shows the predicted values for each vowel in QB1 in orange and any significant departures from this in the QB2 data in green.
For the female speakers, we see a slight backing and lowering of fleece, a raising a backing of dress, raising of thought and nurse, lowering of goose, and very slight fronting of strut.
We look at the smooths where there is either a significant parametric or difference smooth term.
sig_models <- QBBoth_models %>%
filter(p_pv < 0.025 | diff_s_pv < 0.025)
walk2(
sig_models$model,
str_c(sig_models$Vowel, '_', sig_models$formant_type, '_', sig_models$Gender),
~ plot_smooth(
.x,
view = "age_category_numeric",
plot_all = 'corpus',
main = .y,
xlab = "QB1 age category from youngest (18-25) to oldest (76-85)"
)
)
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 5.0157.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_85620. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 5.0157.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_85620. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.679.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_84594. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 5.0388.
## * Speaker : factor; set to the value(s): QB_NZ_M_115. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_86747. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 5.0388.
## * Speaker : factor; set to the value(s): QB_NZ_M_115. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_86747. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6931.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_88358. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 5.0505.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_88358. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6709.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_56517. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.9774.
## * Speaker : factor; set to the value(s): QB_NZ_M_718. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_56517. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6632.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_65462. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.9822.
## * Speaker : factor; set to the value(s): QB_NZ_M_115. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_65462. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6852.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_58075. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6852.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_58075. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 5.0258.
## * Speaker : factor; set to the value(s): QB_NZ_M_718. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_58075. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 5.0258.
## * Speaker : factor; set to the value(s): QB_NZ_M_718. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_58075. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.577.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_52231. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.878.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_52231. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.878.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_52231. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6053.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81584. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6053.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81584. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.646.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_64297. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.646.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_64297. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.9689.
## * Speaker : factor; set to the value(s): QB_NZ_M_718. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_64297. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 6.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.5932.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81223. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 5.0347.
## * Speaker : factor; set to the value(s): QB_NZ_M_115. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_65034. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * corpus : factor; set to the value(s): qb1, qb2.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 5.0347.
## * Speaker : factor; set to the value(s): QB_NZ_M_115. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_65034. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
It is worth noting that difference in the parametric term can be due to either across the board change or age grading effects. For instance nurse F1 in the female speakers indicates a kind of age grading, in which the youngest speakers have become more conservative. fleece F1 in the female speakers, on the other hand, is an example of an across the board change between the first and second recordings.
QBBoth %>%
filter(
age_category_numeric == 1,
Gender == "F"
) %>%
pull(Speaker) %>%
unique()
## [1] QB_NZ_F_997 QB_NZ_F_952
## 255 Levels: QB_NZ_F_103 QB_NZ_F_107 QB_NZ_F_121 QB_NZ_F_131 ... QB_NZ_M_969
The portion of this plot in the first age category is operating off two speakers.
This section repeats the B21 style analysis on the subset of the QB1 data restricted to speakers in QB2 and to the QB2 data. It shows that there is consistency in the vowels grouped together when applying the method to ONZE (in B21), and QB1, and QB2 (in this paper). It also shows that speaker scores are strongly correlated across the QB1 and QB2 analyses. This indicates that the structure picked out by PCA applied to F1 and F2 midpoint readings is stable across the span of time covered by QB1 and QB2.
At this point it is good to free up some RAM by removing the models from the previous sections.
rm(QB1_models, QBBoth_models)
We need to rerun models of the sort fit in Section 3.1.1.1, but this time with just the speakers who are in QB1 and QB2. The following code block does this for both the QB1 and QB2 data.
Analysis1_models <- QBBoth %>%
filter(
age_category_numeric != 8
) %>%
select(
-F1_50, -F2_50
) %>%
pivot_longer(
cols = F1_lob2:F2_lob2,
names_to = "formant_type",
values_to = "formant_value"
) %>%
group_by(corpus, Vowel, formant_type) %>%
nest() %>%
mutate(
model = map(
data,
~ bam(
formant_value ~ Gender + s(age_category_numeric, k = 4, by = Gender) + # main effects
s(articulation_rate, k = 3) + # control vars
s(Speaker, bs='re') + s(Word, bs='re'), # random effects
discrete = TRUE,
nthreads = 16, # This value is too high for most situations.
family = mgcv::scat(link ='identity'),
data = .x
)
),
model_summary = map(model, summary),
random_intercepts = map(
model,
~ get_random(.x)$`s(Speaker)` %>%
as_tibble(rownames = "Speaker") %>%
rename(intercept = value)
)
)
# Save models
write_rds(
Analysis1_models,
here('models', 'Analysis1_models.rds'),
compress="gz"
)
walk2(
str_c(
Analysis1_models$corpus, '_',
Analysis1_models$Vowel, '_',
Analysis1_models$formant_type
),
Analysis1_models$model,
~ {
qq.gam(.y, sub=.x)
}
)
Same basic story as above. High values of lot and thought are a struggle.
walk2(
Analysis1_models$model,
str_c(Analysis1_models$corpus, '_', Analysis1_models$Vowel, '_', Analysis1_models$formant_type),
~ plot_smooth(
.x,
view = "age_category_numeric",
plot_all = 'Gender',
main = .y,
xlab = "Age category from youngest (18-25) to oldest (76-85)"
)
)
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7936.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_85620. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7936.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_85620. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7727.
## * Speaker : factor; set to the value(s): QB_NZ_F_983. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_84594. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7727.
## * Speaker : factor; set to the value(s): QB_NZ_F_983. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_84594. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.8023.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_88358. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.8023.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_88358. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7407.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_56517. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7407.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_56517. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7541.
## * Speaker : factor; set to the value(s): QB_NZ_F_575. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_65462. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7541.
## * Speaker : factor; set to the value(s): QB_NZ_F_575. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_65462. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7967.
## * Speaker : factor; set to the value(s): QB_NZ_F_751. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_58075. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7967.
## * Speaker : factor; set to the value(s): QB_NZ_F_751. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_58075. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7009.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_98917. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7009.
## * Speaker : factor; set to the value(s): QB_NZ_M_233. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_98917. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6587.
## * Speaker : factor; set to the value(s): QB_NZ_F_983. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81584. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6587.
## * Speaker : factor; set to the value(s): QB_NZ_F_983. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81584. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7184.
## * Speaker : factor; set to the value(s): QB_NZ_F_983. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_62180. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7184.
## * Speaker : factor; set to the value(s): QB_NZ_F_983. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_62180. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6961.
## * Speaker : factor; set to the value(s): QB_NZ_F_759. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81223. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6961.
## * Speaker : factor; set to the value(s): QB_NZ_F_759. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81223. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7753.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_86747. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7753.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_86747. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6365.
## * Speaker : factor; set to the value(s): QB_NZ_F_952. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_52231. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6365.
## * Speaker : factor; set to the value(s): QB_NZ_F_952. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_52231. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7368.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_65462. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7368.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_65462. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6715.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81223. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6715.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81223. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.8035.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_88358. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.8035.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_88358. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7927.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_85620. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7927.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_85620. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6474.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81584. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.6474.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_81584. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7297.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_64297. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7297.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_64297. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7423.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_58075. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7423.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_58075. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7565.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_56517. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
## Summary:
## * Gender : factor; set to the value(s): F, M.
## * age_category_numeric : numeric predictor; with 30 values ranging from 1.000000 to 7.000000.
## * articulation_rate : numeric predictor; set to the value(s): 4.7565.
## * Speaker : factor; set to the value(s): QB_NZ_F_431. (Might be canceled as random effect, check below.)
## * Word : factor; set to the value(s): word_56517. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(Speaker),s(Word)
##
It is worth noting that these models are less powerful than the full QB1 data set models. The relationships here are not incompatible with those found in the QB1 data.
What do these models say? To see this we can do another vowel space plot.
Analysis1_preds <- Analysis1_models %>%
mutate(
prediction = map(
model,
~ get_predictions(model = .x, cond = to_predict, print.summary = FALSE)
),
model_smooth_pvs = map(
model_summary,
~ .x$s.pv
),
genderF_smooth_pv = map_dbl(
model_smooth_pvs,
~ .x[[1]]
),
genderM_smooth_pv = map_dbl(
model_smooth_pvs,
~ .x[[2]]
)
) %>%
select(
Vowel, formant_type, prediction, genderF_smooth_pv, genderM_smooth_pv
) %>%
unnest(prediction) %>%
# This step is important. It ensures that, when we plot,
# arrows go from oldest to youngest speakers.
arrange(
desc(age_category_numeric)
) %>%
mutate(
p_value = if_else(Gender == "F", genderF_smooth_pv, genderM_smooth_pv)
) %>%
select( # Remove unneeded variables
-articulation_rate,
-genderF_smooth_pv,
-genderM_smooth_pv
) %>%
pivot_wider( # Pivot
names_from = formant_type,
values_from = c(fit, p_value, CI)
) %>%
rename(
F1_lob2 = fit_F1_lob2,
F2_lob2 = fit_F2_lob2
) %>%
mutate(
# A variable for statistical significance of smooths - either F1 or F2
# significant with bonferonni correction.
is_sig = p_value_F1_lob2 < 0.025 | p_value_F2_lob2 < 0.025
)
The QB1 smooths look like this:
qb1_vsplot <- Analysis1_preds %>%
filter(
corpus == "qb1"
) %>%
qb_plot() +
labs(
title = "QB1 smooths fit on speakers who are in both corpora."
)
qb1_vsplot
Figure 4.1: QB1 smooths fit on speakers who are in both corpora.
qb2_vsplot <- Analysis1_preds %>%
filter(
corpus == "qb2"
) %>%
qb_plot() +
labs(
title = "QB2 smooths fit on speakers who are in both corpora."
)
qb2_vsplot
Figure 4.2: QB2 smooths fit on speakers who are in both corpora.
Figures 4.1 and 4.2 are not inconsistent with those for the entire QB1 corpus. That is, there is nothing coming out as significant here which goes in a radically different direction from what we have seen above.
We perform principal component analyses on QB1 and QB2 separately (see the tabs immediately below).
We collect the random intercepts for the QB1 models.
QB1_intercepts <- Analysis1_models %>%
ungroup() %>%
filter(
corpus == "qb1"
) %>%
select(Vowel, formant_type, random_intercepts) %>%
unnest(random_intercepts) %>%
arrange(as.character(Vowel))
QB1_intercepts
This data frame gives intercepts for each speaker and model. We need to reshape this data so that each column is a vowel-formant pair.
QB1_intercepts <- QB1_intercepts %>%
# Create a column to store our eventual column names
mutate(
# Combine the 'vowel' and 'formant_type' columns as a string.
vowel_formant = str_c(Vowel, '_', formant_type),
# Remove '_lob2' for cleaner column names
vowel_formant = str_replace(vowel_formant, '_lob2', '')
) %>%
# Remove old 'vowel' and 'formant_type' columns
select(-c(Vowel, formant_type)) %>%
# Make data 'wider', by...
pivot_wider(
names_from = vowel_formant, # naming columns using 'vowel_formant'...
values_from = intercept # and values from intercept column
)
This step can reveal missing values if a speaker has not been able to have an intercept estimated in a particular model.
Are there missing values?
QB1_intercepts[which(is.na(QB1_intercepts), arr.ind=TRUE)[,1],]
One speaker does not have intercepts for start. We will have to leave this speaker out.
QB1_intercepts <- QB1_intercepts %>%
filter(
!is.na(START_F1)
)
Let’s look at a correlation plot:
QB1_intercepts %>%
select(-Speaker) %>%
as.matrix() %>%
cor() %>%
corrplot()
We can also check whether the magnitude of these correlations should be surprising to us by a permutation test (as in Brand et al. 2021).
cor_test <- correlation_test(
QB1_intercepts %>% select(-Speaker),
n = 500,
cor.method = "pearson"
)
plot_correlation_magnitudes(cor_test)
We now apply PCA using the pca_test
function from nzilbb.vowels
.
This both applies PCA and estimates confidence intervals for PC
loadings.
QB1_pca <- pca_test(
QB1_intercepts %>% select(-Speaker),
n = 1000
)
write_rds(QB1_pca, here('models', 'QB1_pca.rds'), compress="gz")
For this PCA, the variable loadings for PC1 is in the opposite direction for Brand et al (2021). We will change it here to keep it consistent to Brand et al (2021). Recall that the direction of a PC is arbitrary.
QB1_pca <- pc_flip(QB1_pca, pc_no = 1)
QB1_plot_variance_explained <- plot_variance_explained(QB1_pca, pc_max = 5) +
labs(title = "Variance Explained by Principal Components for QB1")
QB1_plot_variance_explained
We’ll only look at the first two PCs, as they appear outside the NULL distribution. However, PC2’s confidence intervals intersect with PC3, so we will need to filter out bootstrapped analyses to deal with PC instability.
Lastly, let’s get some numbers as to how much variance each PC actually accounts for. The following are the percentages of variance explained by each PC.
QB1_pca$variance$variance_explained * 100
## [1] 33.0577245 15.9193714 9.1410553 7.6614131 6.5853065 4.7188556
## [7] 4.3009257 3.7674359 2.9496742 2.2636386 1.8839387 1.6704812
## [13] 1.3435933 1.2608031 0.8828490 0.7944989 0.6916732 0.5094777
## [19] 0.3476062 0.2496781
First PC1:
QB1_PC1_plot_loadings <- plot_loadings(QB1_pca, pc_no=1) +
labs(title = "Index Loadings for PC1 for QB1")
QB1_PC1_plot_loadings
Figure 4.3: Index loadings for PC1 (QB1).
This is roughly the same as the PC1 generated from the full QB1 corpus (Figure 3.1).
Let’s check to see which exact vowels make up 50% of this PC’s contribution
QB1_pca_simple <- prcomp(
QB1_intercepts %>% select(-Speaker),
scale=TRUE
)
QB1_pca_simple <- pc_flip(QB1_pca_simple, pc_no = 1)
pca_contrib_plot(QB1_pca_simple, pc_no = 1, cutoff = 50)
Figure 4.4: 50% loadings for PC1 (QB1).
We can see how using the 50% cut-off rule, as done in B21, which restrict (and potentially change) the interpretation of the vowel cluster here.
Now, PC2:
QB1_PC2_plot_loadings <- plot_loadings(QB1_pca, pc_no=2, filter_boots=TRUE) +
labs(title = "Index Loadings for PC2 for QB1")
QB1_PC2_plot_loadings
Figure 4.5: Index loadings for PC2 (QB1).
Again, this is roughly the same as the PC2 generated from the full QB1 corpus (Figure 3.2). thought comes through less clearly and lot is somewhat involved.
Again, let’s check which vowels contribute to 50% of PC2’s loading.
pca_contrib_plot(QB1_pca_simple, pc_no = 2, cutoff = 50)
Figure 4.6: 50% loadings for PC2 (QB1).
Before we move on to QB2, we can model the speaker scores onto the apparent vowel change over time to see how speakers axis of covariation sit among each other.
QB1_plot <- plot_pc_input(QB1_pca_simple, QB1_intercepts %>% select(-Speaker), QB1_pca)
QB1_plot
We now apply PCA to the QB2 intercepts.
We collect the random intercepts for the QB2 models.
QB2_intercepts <- Analysis1_models %>%
ungroup() %>%
filter(
corpus == "qb2"
) %>%
select(Vowel, formant_type, random_intercepts) %>%
unnest(random_intercepts) %>%
arrange(as.character(Vowel))
QB2_intercepts
This data frame gives intercepts for each speaker and model. We need to reshape this data so that each column is a vowel-formant pair.
QB2_intercepts <- QB2_intercepts %>%
# Create a column to store our eventual column names
mutate(
# Combine the 'vowel' and 'formant_type' columns as a string.
vowel_formant = str_c(Vowel, '_', formant_type),
# Remove '_lob2' for cleaner column names
vowel_formant = str_replace(vowel_formant, '_lob2', '')
) %>%
# Remove old 'vowel' and 'formant_type' columns
select(-c(Vowel, formant_type)) %>%
# Make data 'wider', by...
pivot_wider(
names_from = vowel_formant, # naming columns using 'vowel_formant'...
values_from = intercept # and values from intercept column
)
This step can reveal missing values if a speaker has not been able to have an intercept estimated in a particular model.
Are there missing values?
QB2_intercepts[which(is.na(QB1_intercepts), arr.ind=TRUE)[,1],]
No.
Let’s look at a pairwise correlation plot:
QB2_intercepts %>%
select(-Speaker) %>%
as.matrix() %>%
cor() %>%
corrplot()
Again, let’s just double-check this against permutated data of the intercepts.
cor_test <- correlation_test(
QB2_intercepts %>% select(-Speaker),
n= 500,
cor.method = "pearson"
)
plot_correlation_magnitudes(cor_test)
We now apply PCA using the pca_test
function from nzilbb.vowels
.
This both applies PCA and estimates confidence intervals for PC
loadings.
QB2_pca <- pca_test(
QB2_intercepts %>% select(-Speaker),
n = 1000
)
write_rds(QB2_pca, here('models', 'QB2_pca.rds'), compress="gz")
Again for QB2, the variable loadings for PC1 is in the opposite direction for Brand et al (2021). We will change it here to keep it consistent to Brand et al (2021).
QB2_pca <- pc_flip(QB2_pca, pc_no = 1)
QB2_plot_variance_explained <- plot_variance_explained(QB2_pca, pc_max = 5) +
labs(title = "Variance Explained by Principal Components for QB2")
QB2_plot_variance_explained
We’ll look at the first three PCs, as they appear outside the NULL distribution. PC2’s confidence intervals intersect with PC3, so we will need to filter out bootstrapped analyses to deal with PC instability for both PC2 and PC3.
Again, we output the exact amounts of variance explained by each PC:
QB2_pca$variance$variance_explained * 100
## [1] 25.7921963 14.1127874 11.4621459 8.2924871 7.0984090 6.0313902
## [7] 5.0785881 4.6870637 3.4827761 2.6363592 2.0278558 1.7098478
## [13] 1.5077131 1.4030869 1.2002238 1.1077374 0.8434046 0.5786344
## [19] 0.5712087 0.3760845
First PC1:
QB2_PC1_plot_loadings <- plot_loadings(QB2_pca, pc_no=1) +
labs(title = "Index Loadings for PC1 for QB2")
QB2_PC1_plot_loadings
Figure 4.7: Index loadings for PC1 (QB2).
This looks very much like PC1 in our previous PCA analyses. Let’s check which vowels contribute the most to this PC.
QB2_pca_simple <- prcomp(
QB2_intercepts %>% select(-Speaker),
scale = TRUE
)
QB2_pca_simple <- pc_flip(QB2_pca_simple, pc_no= 1)
pca_contrib_plot(QB2_pca_simple, pc_no = 1, cutoff = 50)
Figure 4.8: 50% loadings for PC1 (QB2).
Now, PC2:
QB2_PC2_plot_loadings <- plot_loadings(QB2_pca, pc_no=2, filter_boots=TRUE) +
labs(title = "Index Loadings for PC2 for QB2")
QB2_PC2_plot_loadings
Figure 4.9: Index loadings for PC2 (QB2).
pca_contrib_plot(QB2_pca_simple, pc_no = 2, cutoff = 50)
Figure 4.10: 50% loadings for PC2 (QB2).
The corresponding plots for the complete QB1 (Figure 3.2) and the subset of speakers in both corpora (Figure 4.5) look quite different. However, we do have a relationship where strut F2 and start F2 are patterned against thought F1.
And now, PC3:
plot_loadings(QB2_pca, pc_no=3, filter_boots=TRUE)
Figure 4.11: Index loadings for PC3 (QB2).
Even when filtering PC3 to see if one relationship stands out, we don’t get much structure. One explanation is that PC3 captures variation in lot F1, which, since it is not particularly correlated with any of our other variables, is indifferent to which other patterns of variation it appears with in a PC. In particular, we would have no reason, given this plot, to claim a genuine relationship of covariation between lot F1, thought F1, and nurse F2, even though they appear with significant loadings.
While the loadings are unstable, it is worth noting that this PC does capture some of the start, strut, thought relationship (look at the black pluses and minuses).
The main hypothesis we wish to test is whether speakers here are stable with respect to their PC scores between QB1 and QB2. To do this, we simply check the correlations of PC1 and PC2 across both QB1 and QB2.
We extract PC scores for each speaker from QB1 and QB2 and look at their level of correlation.
QB1_pc_scores <- QB1_pca_simple$x
QB2_pc_scores <- QB2_pca_simple$x
QB1_pc_scores <- QB1_intercepts %>%
bind_cols(
QB1_pc_scores[,1:2]
)
QB2_pc_scores <- QB2_intercepts %>%
bind_cols(
QB2_pc_scores[,1:3]
)
QBBoth_pc_scores <- QB1_pc_scores %>%
select(Speaker, PC1, PC2) %>%
rename(
PC1_qb1 = PC1,
PC2_qb1 = PC2
) %>%
left_join(
QB2_pc_scores %>%
select(Speaker, PC1, PC2, PC3) %>%
rename(
PC1_qb2 = PC1,
PC2_qb2 = PC2,
PC3_qb2 = PC3
)
)
cor.test(QBBoth_pc_scores$PC1_qb1, QBBoth_pc_scores$PC1_qb2, method="spearman")
##
## Spearman's rank correlation rho
##
## data: QBBoth_pc_scores$PC1_qb1 and QBBoth_pc_scores$PC1_qb2
## S = 5200, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7647059
The Spearman correlation between PC1 scores is significant (p-value < 2.2e-16) and of high strength (0.7647059).
We can plot this:
PC1_corrplot <- QBBoth_pc_scores %>%
ggplot(
aes(
x = rank(PC1_qb1),
y = rank(PC1_qb2)
)
) +
geom_point() +
geom_smooth(method='lm') +
labs(
title = "Spearman correlation of PC1 scores (QB1 and QB2)",
y = "QB2 PC1 rank",
x = "QB1 PC1 rank"
)
PC1_plot <- QBBoth_pc_scores %>%
ggplot(
aes(
x = PC1_qb1,
y = PC1_qb2
)
) +
geom_point() +
labs(
title = "PC1 scores (QB1 and QB2)",
y = "QB2 PC1 score",
x = "QB1 PC1 score"
)
PC1_plot / PC1_corrplot
While we have some outliers, this is a pretty clear relationship!
We now do the same for PC2, expecting that there will be much more noise!
cor.test(QBBoth_pc_scores$PC2_qb1, QBBoth_pc_scores$PC2_qb2, method="spearman")
##
## Spearman's rank correlation rho
##
## data: QBBoth_pc_scores$PC2_qb1 and QBBoth_pc_scores$PC2_qb2
## S = 14788, p-value = 0.01813
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.3308597
The relationship comes out as significant, but with a much larger p-value (0.01813) and a smaller strength (0.3308597).
Again, we can plot:
PC2_corrplot <- QBBoth_pc_scores %>%
ggplot(
aes(
x = rank(PC2_qb1),
y = rank(PC2_qb2)
)
) +
geom_point() +
geom_smooth(method = "lm") +
labs(
title = "Spearman correlation of PC2 scores (QB1 and QB2)",
y = "QB2 PC2 rank",
x = "QB1 PC2 rank"
)
PC2_plot <- QBBoth_pc_scores %>%
ggplot(
aes(
x = PC2_qb1,
y = PC2_qb2
)
) +
geom_point() +
labs(
title = "PC2 scores (QB1 and QB2)",
y = "QB2 PC2 score",
x = "QB1 PC2 score"
)
PC2_plot / PC2_corrplot
This is much more noisy. We can zoom in to the smaller values and see if any relationship is visually discernible.
PC2_corrplot_detail <- QBBoth_pc_scores %>%
filter(
between(PC2_qb1, -2, 2),
between(PC2_qb2, -2, 2)
) %>%
ggplot(
aes(
x = PC2_qb1,
y= PC2_qb2
)
) +
geom_point() +
geom_smooth(method="loess") +
labs(y = "QB2 PC2 scores", x = "QB1 PC2 scores")
PC2_corrplot_detail
If we restrict to PC scores between -2 and 2, there is (barely) discernible positive relationship.
Before we explore this further, let’s take a look to see if age is working in correlation with the QB1 PC scores to predict covariation stability in QB2. We can do this by running a simple linear model.
First, we need to add the age categories to the QBBoth_pc_scores
data.
speaker_ages_interim <- QB1 %>%
select(Speaker, age_category_numeric, Age) %>%
distinct()
QBBoth_pc_scores_age <- left_join(
QBBoth_pc_scores,
speaker_ages_interim,
by = "Speaker"
)
rm(speaker_ages_interim)
QBBoth_pc_scores_age <- QBBoth_pc_scores_age %>%
mutate(
Age = factor(Age, ordered = FALSE))
Linear model for PC1 stability
PC1_age_effectA <- lm(PC1_qb2 ~ PC1_qb1 * age_category_numeric, data = QBBoth_pc_scores_age)
summary(PC1_age_effectA)
##
## Call:
## lm(formula = PC1_qb2 ~ PC1_qb1 * age_category_numeric, data = QBBoth_pc_scores_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6426 -0.8593 -0.1292 0.7251 4.1784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.32823 0.56044 -0.586 0.5609
## PC1_qb1 0.21923 0.22230 0.986 0.3291
## age_category_numeric 0.13568 0.12923 1.050 0.2991
## PC1_qb1:age_category_numeric 0.13356 0.05716 2.336 0.0238 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.304 on 47 degrees of freedom
## Multiple R-squared: 0.6905, Adjusted R-squared: 0.6708
## F-statistic: 34.95 on 3 and 47 DF, p-value: 4.994e-12
PC1_age_effectB <- lm(PC1_qb2 ~ PC1_qb1, data = QBBoth_pc_scores_age)
summary(PC1_age_effectB)
##
## Call:
## lm(formula = PC1_qb2 ~ PC1_qb1, data = QBBoth_pc_scores_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8390 -0.8355 -0.0378 0.8538 4.1673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.22050 0.19169 1.150 0.256
## PC1_qb1 0.70951 0.07529 9.423 1.39e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.369 on 49 degrees of freedom
## Multiple R-squared: 0.6444, Adjusted R-squared: 0.6371
## F-statistic: 88.8 on 1 and 49 DF, p-value: 1.393e-12
#exclude the only 76-85 speaker
QBBoth_pc_scores_age <- QBBoth_pc_scores_age %>%
filter(Age != "76-85")
PC1_age_effectC <- lm(PC1_qb2 ~ PC1_qb1 * Age, data = QBBoth_pc_scores_age)
summary(PC1_age_effectC)
##
## Call:
## lm(formula = PC1_qb2 ~ PC1_qb1 * Age, data = QBBoth_pc_scores_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2974 -0.7657 -0.1082 0.8970 2.4804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7195 0.7251 0.992 0.32732
## PC1_qb1 -1.2081 0.6809 -1.774 0.08404 .
## Age26-35 -1.7824 1.0730 -1.661 0.10491
## Age36-45 -0.9844 0.8411 -1.170 0.24918
## Age46-55 -0.6916 0.8127 -0.851 0.40008
## Age56-65 -0.3341 0.8131 -0.411 0.68341
## Age66-75 -0.2018 0.8535 -0.236 0.81438
## PC1_qb1:Age26-35 1.8369 0.6988 2.629 0.01230 *
## PC1_qb1:Age36-45 1.6597 0.7043 2.356 0.02371 *
## PC1_qb1:Age46-55 2.1746 0.6949 3.129 0.00336 **
## PC1_qb1:Age56-65 1.9944 0.6948 2.870 0.00667 **
## PC1_qb1:Age66-75 2.0674 0.7875 2.625 0.01241 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.256 on 38 degrees of freedom
## Multiple R-squared: 0.7613, Adjusted R-squared: 0.6923
## F-statistic: 11.02 on 11 and 38 DF, p-value: 9.384e-09
table(QBBoth_pc_scores_age$Age)
##
## 18-25 26-35 36-45 46-55 56-65 66-75 76-85
## 3 3 11 13 12 8 0
Linear model for PC2 stability
PC2_age_effect <- lm(PC2_qb2 ~ PC2_qb1 * age_category_numeric, data = QBBoth_pc_scores_age)
summary(PC2_age_effect)
##
## Call:
## lm(formula = PC2_qb2 ~ PC2_qb1 * age_category_numeric, data = QBBoth_pc_scores_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2202 -0.3579 0.3650 0.8319 3.9927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04644 0.74939 -0.062 0.951
## PC2_qb1 0.84843 0.55649 1.525 0.134
## age_category_numeric 0.01173 0.17565 0.067 0.947
## PC2_qb1:age_category_numeric -0.15268 0.14190 -1.076 0.288
##
## Residual standard error: 1.701 on 46 degrees of freedom
## Multiple R-squared: 0.09971, Adjusted R-squared: 0.041
## F-statistic: 1.698 on 3 and 46 DF, p-value: 0.1805
PC2_age_effectC <- lm(PC2_qb2 ~ PC2_qb1 * Age, data = QBBoth_pc_scores_age)
summary(PC2_age_effectC)
##
## Call:
## lm(formula = PC2_qb2 ~ PC2_qb1 * Age, data = QBBoth_pc_scores_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9253 -0.5729 0.0799 0.8376 4.1926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3610 0.9402 -0.384 0.70312
## PC2_qb1 2.7432 0.7951 3.450 0.00139 **
## Age26-35 1.5072 1.8703 0.806 0.42535
## Age36-45 0.8371 1.0683 0.784 0.43817
## Age46-55 -0.0531 1.0449 -0.051 0.95974
## Age56-65 0.4701 1.0514 0.447 0.65731
## Age66-75 0.3987 1.1037 0.361 0.71989
## PC2_qb1:Age26-35 -4.3633 2.1332 -2.045 0.04778 *
## PC2_qb1:Age36-45 -2.5455 0.8259 -3.082 0.00382 **
## PC2_qb1:Age46-55 -2.4327 0.8247 -2.950 0.00542 **
## PC2_qb1:Age56-65 -2.6013 0.8395 -3.099 0.00365 **
## PC2_qb1:Age66-75 -1.7974 1.2591 -1.428 0.16160
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.623 on 38 degrees of freedom
## Multiple R-squared: 0.3228, Adjusted R-squared: 0.1267
## F-statistic: 1.646 on 11 and 38 DF, p-value: 0.1248
Why is it that the PC2 correlation is so small? We have already seen some indication that the QB2 version of PC2 is particularly noisy.
There are a few ways to show that the same pattern picked out in QB1 PC2 is present in the QB2 data. First, we’ll look at variable plots for both QB1 and QB2, then we’ll look at a PCA with the lot variable removed from QB2. Finally, we’ll show that PC2 and PC3 from QB2 are sufficient to produce very good predictions of QB1 PC2. That is, that PC2 and PC3 from QB2 are capturing the relationship captured by PC2 in QB1.
A variable plot shows the PC loadings for two PCs at once.
Here is a variable plot for PC2 and PC3 for QB1:
fviz_pca_var(
prcomp(QB1_intercepts %>% select(-Speaker), scale=TRUE),
axes=c(2,3),
select.var = list(contrib=10)
)
Here we see the backness of start and strut patterning against thought F1 (among other variables). The length of the arrow indicates the size of the loading in the relevant dimension, with PC2 along the x-axis and PC3 along the y-axis.
If we look at the same plot for QB2, we see:
fviz_pca_var(
prcomp(QB2_intercepts %>% select(-Speaker), scale=TRUE),
axes=c(2,3),
select.var = list(contrib=10)
)
In QB2, the inverse relationship between strut and start F2 together against thought F1 is even stronger. That is, the arrows for these variables are in directly opposite directions. However, these arrows are not directly along either PC2 or PC3. That is, the relationship is being captured by both PC2 and PC3. This is an instance of the kind of rotational instability which can occur when two PCs explain similar amounts of variance to one another. But, for our purposes, the key relationship picked out by PC2 in QB1 is also picked out by PCA applied to QB2. It is just that it is not only captured by PC2 in QB2.
We can also remove the lot variable from QB2. This might allow the strut, start, thought relationship to come through more clearly.
QB2_wlot_pca <- pca_test(
QB2_intercepts %>% select(-Speaker, -contains('LOT')),
n = 1000
)
write_rds(QB2_wlot_pca, here('models', 'QB2_wlot_pca.rds'), compress="gz")
We look at the same variance and loading plots as above.
plot_variance_explained(QB2_wlot_pca, pc_max = 5)
We’ll look at the first three PCs, as they appear outside the NULL distribution. PC2’s confidence intervals intersect with PC3, so we will need to filter out bootstrapped analyses to deal with PC instability for both PC2 and PC3.
First PC1:
plot_loadings(QB2_wlot_pca, pc_no=1)
Figure 4.12: Index loadings for PC1 (QB2).
PC1 is roughly the same as previous instances.
Now, PC2:
plot_loadings(QB2_wlot_pca, pc_no=2, filter_boots=TRUE)
Figure 4.13: Index loadings for PC2 (QB2).
Now PC2 is capturing the thought, start, strut relationship, as hoped.
And now, PC3:
plot_loadings(QB2_wlot_pca, pc_no=3, filter_boots=TRUE)
Figure 4.14: Index loadings for PC2 (QB2).
What does this mean for the correlation between PC2 in QB1 and QB2? Let’s see:
QB2_wlot_pc_scores <- prcomp(
QB2_intercepts %>%
select(-Speaker, -contains('LOT')),
scale=TRUE
)$x
QB2_wlot_pc_scores <- QB2_intercepts %>%
bind_cols(
QB2_wlot_pc_scores[,1:2]
)
QBBoth_pc_scores <- QBBoth_pc_scores %>%
left_join(
QB2_wlot_pc_scores %>%
select(
Speaker, PC1, PC2
) %>%
rename(
PC1_qb2_wlot = PC1,
PC2_qb2_wlot = PC2
)
)
cor.test(QBBoth_pc_scores$PC2_qb1, QBBoth_pc_scores$PC2_qb2_wlot)
##
## Pearson's product-moment correlation
##
## data: QBBoth_pc_scores$PC2_qb1 and QBBoth_pc_scores$PC2_qb2_wlot
## t = 3.1148, df = 49, p-value = 0.003073
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1474795 0.6134010
## sample estimates:
## cor
## 0.4065402
Now, we have a stronger correlation with smaller p-value. But we are still not quite a strong as in the case of PC1.
Let’s plot:
QBBoth_pc_scores %>%
ggplot(
aes(
x = PC2_qb1,
y = PC2_qb2_wlot
)
) +
geom_point() +
geom_smooth()
Finally, we show that a linear model of QB1 PC2 by QB2 PC2 and PC3 performs very well:
linear_fit <- lm(PC2_qb1 ~ PC2_qb2 + PC3_qb2, data=QBBoth_pc_scores)
summary(linear_fit)
##
## Call:
## lm(formula = PC2_qb1 ~ PC2_qb2 + PC3_qb2, data = QBBoth_pc_scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4061 -0.5739 0.2129 0.9816 2.1942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.003868 0.204502 0.019 0.9850
## PC2_qb2 0.266392 0.120174 2.217 0.0314 *
## PC3_qb2 0.606789 0.132799 4.569 3.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.46 on 48 degrees of freedom
## Multiple R-squared: 0.3569, Adjusted R-squared: 0.3301
## F-statistic: 13.32 on 2 and 48 DF, p-value: 2.501e-05
Indeed, it looks like QB2 PC3 is a better predictor of QB1 PC2 and QB2 PC2 is. Both together provide decent predictive power for prediction QB1 PC2.
QB1_intercepts_long <- QB1_intercepts %>%
pivot_longer(
cols = -Speaker,
names_to = "vowel_formant",
values_to = "intercept"
) %>%
mutate(
Vowel = str_extract(vowel_formant, "^[A-Z]*"),
formant = str_extract(vowel_formant, "F[1-2]$")
) %>%
left_join(
QB1 %>%
select(Speaker, age_category_numeric, Age, Gender) %>%
distinct()
) %>%
left_join(
Analysis1_preds %>%
filter(corpus == "qb1") %>%
select(Vowel, age_category_numeric, Gender, F1_lob2, F2_lob2) %>%
pivot_longer(
cols = contains('_lob2'),
names_to = "formant",
values_to = "fit"
) %>%
mutate(
formant = str_replace(formant, '_lob2', '')
)
) %>%
mutate(
fitted_value = fit + intercept
) %>%
left_join(
# Overkill: adds all PCs.
QB1_pca$loadings %>%
select(
PC, variable, loading, sig_loading
) %>%
pivot_wider(
names_from = PC,
values_from = c('loading', 'sig_loading')
) %>%
rename(vowel_formant = variable)
) %>%
left_join(
QB1_pc_scores %>%
select(Speaker, matches('PC[0-9]+'))
)
QB1_intercepts_long %>%
filter(
Gender == "F"
) %>%
ggplot(
aes(
x = age_category_numeric,
y = fitted_value,
colour = PC1
)
) +
geom_point() +
geom_line(aes(y=fit), size=1, colour='red') +
geom_rect(
data = QB1_intercepts_long %>% filter(sig_loading_PC1),
fill = NA, colour = "red", xmin = -Inf,xmax = Inf,
ymin = -Inf,ymax = Inf
) +
scale_colour_scico(palette = "hawaii") +
facet_grid(cols = vars(Vowel), rows = vars(formant)) +
labs(
title = "QB1 PC1",
subtitle = "GAMM fit + intercept"
)
And in the vowel space:
## Fit linear models
QB1_linear <- QB1_intercepts_long %>%
pivot_longer(
cols = matches("^PC[0-9]+$"),
names_to = 'PC',
values_to = 'PC_score'
) %>%
group_by(Vowel, formant, PC) %>%
nest() %>%
mutate(
model = map(data, ~lm(fitted_value ~ PC_score, data=.x)),
data = map2(data, model, ~ .x %>% mutate(linear_fit = .y[['fitted.values']]))
) %>%
select(
data
) %>%
unnest(data)
The following function is taken from Wilson Black et al. 2022.
QB1_linear %>%
filter(
PC == "PC1"
) %>%
select(
Speaker, Vowel, formant, fitted_value, linear_fit, loading_PC1, PC_score
) %>%
group_by(Vowel) %>%
mutate(
max_loading = max(abs(loading_PC1))
) %>%
ungroup() %>%
pivot_wider(
names_from = "formant",
values_from = c("fitted_value", "linear_fit", "loading_PC1"),
names_glue = "{formant}_{.value}"
) %>%
ggplot(
aes(
x = F2_fitted_value,
y = F1_fitted_value,
colour = PC_score,
group = Vowel,
label = Vowel,
size = max_loading,
fill = rescale(max_loading^2, to = c(0, 1))
)
) +
geom_point(size = 0.7, alpha = 0.5) +
geom_line(
aes(
x = F2_linear_fit,
y = F1_linear_fit,
size = max_loading
),
show.legend = FALSE
) +
geom_label_repel(
data = . %>%
group_by(Vowel) %>%
mutate(
F1_fitted_value = mean(F1_fitted_value),
F2_fitted_value = mean(F2_fitted_value)
) %>%
filter(PC_score == min(PC_score)),
show.legend = FALSE,
alpha = 0.6,
min.segment.length = 0,
force = 100,
segment.alpha = 0.5
) +
scale_x_reverse(expand = expansion(add = 0.7)) +
scale_y_reverse(position = "right") +
scale_colour_gradient2(
name = "PC score",
low = "black",
mid = "white" ,
high = "#7CAE00",
midpoint=0,
breaks = c(-3, 0, 3)
) +
scale_fill_gradient(low = "white", high = "yellow", guide = "none") +
scale_size_continuous(range = c(1.5, 4), guide = "none") +
labs(
x = "F2 (Speaker Intercept + GAMM Fit)",
y = "F1 (Speaker Intercept + GAMM Fit)",
title = "QB1 PC1",
subtitle = "GAMM fit + intercept"
)
QB1_intercepts_long %>%
filter(
Gender == "F"
) %>%
ggplot(
aes(
x = age_category_numeric,
y = fitted_value,
colour = PC2
)
) +
geom_point() +
geom_line(aes(y=fit), size=1, colour='red') +
geom_rect(
data = QB1_intercepts_long %>% filter(sig_loading_PC2),
fill = NA, colour = "red", xmin = -Inf,xmax = Inf,
ymin = -Inf,ymax = Inf
) +
scale_colour_scico(palette = "hawaii") +
facet_grid(cols = vars(Vowel), rows = vars(formant)) +
labs(
title = "QB1 PC2",
subtitle = "GAMM fit + intercept"
)
QB1_linear %>%
filter(
PC == "PC2"
) %>%
select(
Speaker, Vowel, formant, fitted_value, linear_fit, loading_PC2, PC_score
) %>%
group_by(Vowel) %>%
mutate(
max_loading = max(abs(loading_PC2))
) %>%
ungroup() %>%
pivot_wider(
names_from = "formant",
values_from = c("fitted_value", "linear_fit", "loading_PC2"),
names_glue = "{formant}_{.value}"
) %>%
ggplot(
aes(
x = F2_fitted_value,
y = F1_fitted_value,
colour = PC_score,
group = Vowel,
label = Vowel,
size = max_loading,
fill = rescale(max_loading^2, to = c(0, 1))
)
) +
geom_point(size = 0.7, alpha = 0.5) +
geom_line(
aes(
x = F2_linear_fit,
y = F1_linear_fit,
size = max_loading
),
show.legend = FALSE
) +
geom_label_repel(
data = . %>%
group_by(Vowel) %>%
mutate(
F1_fitted_value = mean(F1_fitted_value),
F2_fitted_value = mean(F2_fitted_value)
) %>%
filter(PC_score == min(PC_score)),
show.legend = FALSE,
alpha = 0.6,
min.segment.length = 0,
force = 100,
segment.alpha = 0.5
) +
scale_x_reverse(expand = expansion(add = 0.7)) +
scale_y_reverse(position = "right") +
scale_colour_gradient2(
name = "PC score",
low = "black",
mid = "white" ,
high = "#7CAE00",
midpoint=0,
breaks = c(-3, 0, 3)
) +
scale_fill_gradient(low = "white", high = "yellow", guide = "none") +
scale_size_continuous(range = c(1.5, 4), guide = "none") +
labs(
x = "F2 (Speaker Intercept + GAMM Fit)",
y = "F1 (Speaker Intercept + GAMM Fit)",
title = "QB1 PC2",
subtitle = "GAMM fit + intercept"
)
QB2_intercepts_long <- QB2_intercepts %>%
pivot_longer(
cols = -Speaker,
names_to = "vowel_formant",
values_to = "intercept"
) %>%
mutate(
Vowel = str_extract(vowel_formant, "^[A-Z]*"),
formant = str_extract(vowel_formant, "F[1-2]$")
) %>%
left_join(
QB2 %>%
select(Speaker, age_category_numeric, Age, Gender) %>%
distinct()
) %>%
left_join(
Analysis1_preds %>%
filter(corpus == "qb2") %>%
select(Vowel, age_category_numeric, Gender, F1_lob2, F2_lob2) %>%
pivot_longer(
cols = contains('_lob2'),
names_to = "formant",
values_to = "fit"
) %>%
mutate(
formant = str_replace(formant, '_lob2', '')
)
) %>%
mutate(
fitted_value = fit + intercept
) %>%
left_join(
# Overkill: adds all PCs.
QB2_pca$loadings %>%
select(
PC, variable, loading, sig_loading
) %>%
pivot_wider(
names_from = PC,
values_from = c('loading', 'sig_loading')
) %>%
rename(vowel_formant = variable)
) %>%
left_join(
QB2_pc_scores %>%
select(Speaker, matches('PC[0-9]+'))
)
QB2_intercepts_long %>%
filter(
Gender == "F"
) %>%
ggplot(
aes(
x = age_category_numeric,
y = fitted_value,
colour = PC1
)
) +
geom_point() +
geom_line(aes(y=fit), size=1, colour='red') +
geom_rect(
data = QB2_intercepts_long %>% filter(sig_loading_PC1),
fill = NA, colour = "red", xmin = -Inf,xmax = Inf,
ymin = -Inf,ymax = Inf
) +
scale_colour_scico(palette = "hawaii") +
facet_grid(cols = vars(Vowel), rows = vars(formant)) +
labs(
title = "QB2 PC1",
subtitle = "GAMM fit + intercept"
)
QB2_linear <- QB2_intercepts_long %>%
pivot_longer(
cols = matches("^PC[0-9]+$"),
names_to = 'PC',
values_to = 'PC_score'
) %>%
group_by(Vowel, formant, PC) %>%
nest() %>%
mutate(
model = map(data, ~lm(fitted_value ~ PC_score, data=.x)),
data = map2(data, model, ~ .x %>% mutate(linear_fit = .y[['fitted.values']]))
) %>%
select(
data
) %>%
unnest(data)
QB2_linear %>%
filter(
PC == "PC1"
) %>%
select(
Speaker, Vowel, formant, fitted_value, linear_fit, loading_PC1, PC_score
) %>%
group_by(Vowel) %>%
mutate(
max_loading = max(abs(loading_PC1))
) %>%
ungroup() %>%
pivot_wider(
names_from = "formant",
values_from = c("fitted_value", "linear_fit", "loading_PC1"),
names_glue = "{formant}_{.value}"
) %>%
ggplot(
aes(
x = F2_fitted_value,
y = F1_fitted_value,
colour = PC_score,
group = Vowel,
label = Vowel,
size = max_loading,
fill = rescale(max_loading^2, to = c(0, 1))
)
) +
geom_point(size = 0.7, alpha = 0.5) +
geom_line(
aes(
x = F2_linear_fit,
y = F1_linear_fit,
size = max_loading
),
show.legend = FALSE
) +
geom_label_repel(
data = . %>%
group_by(Vowel) %>%
mutate(
F1_fitted_value = mean(F1_fitted_value),
F2_fitted_value = mean(F2_fitted_value)
) %>%
filter(PC_score == min(PC_score)),
show.legend = FALSE,
alpha = 0.6,
min.segment.length = 0,
force = 100,
segment.alpha = 0.5
) +
scale_x_reverse(expand = expansion(add = 0.7)) +
scale_y_reverse(position = "right") +
scale_colour_gradient2(
name = "PC score",
low = "black",
mid = "white" ,
high = "#7CAE00",
midpoint=0,
breaks = c(-3, 0, 3)
) +
scale_fill_gradient(low = "white", high = "yellow", guide = "none") +
scale_size_continuous(range = c(1.5, 4), guide = "none") +
labs(
x = "F2 (Speaker Intercept + GAMM Fit)",
y = "F1 (Speaker Intercept + GAMM Fit)",
title = "QB2 PC1",
subtitle = "GAMM fit + intercept"
)
QB2_intercepts_long %>%
filter(
Gender == "F"
) %>%
ggplot(
aes(
x = age_category_numeric,
y = fitted_value,
colour = PC2
)
) +
geom_point() +
geom_line(aes(y=fit), size=1, colour='red') +
geom_rect(
data = QB2_intercepts_long %>% filter(sig_loading_PC2),
fill = NA, colour = "red", xmin = -Inf,xmax = Inf,
ymin = -Inf,ymax = Inf
) +
scale_colour_scico(palette = "hawaii") +
facet_grid(cols = vars(Vowel), rows = vars(formant)) +
labs(
title = "QB2 PC2",
subtitle = "GAMM fit + intercept"
)
QB2_linear %>%
filter(
PC == "PC2"
) %>%
select(
Speaker, Vowel, formant, fitted_value, linear_fit, loading_PC2, PC_score
) %>%
group_by(Vowel) %>%
mutate(
max_loading = max(abs(loading_PC2))
) %>%
ungroup() %>%
pivot_wider(
names_from = "formant",
values_from = c("fitted_value", "linear_fit", "loading_PC2"),
names_glue = "{formant}_{.value}"
) %>%
ggplot(
aes(
x = F2_fitted_value,
y = F1_fitted_value,
colour = PC_score,
group = Vowel,
label = Vowel,
size = max_loading,
fill = rescale(max_loading^2, to = c(0, 1))
)
) +
geom_point(size = 0.7, alpha = 0.5) +
geom_line(
aes(
x = F2_linear_fit,
y = F1_linear_fit,
size = max_loading
),
show.legend = FALSE
) +
geom_label_repel(
data = . %>%
group_by(Vowel) %>%
mutate(
F1_fitted_value = mean(F1_fitted_value),
F2_fitted_value = mean(F2_fitted_value)
) %>%
filter(PC_score == min(PC_score)),
show.legend = FALSE,
alpha = 0.6,
min.segment.length = 0,
force = 100,
segment.alpha = 0.5
) +
scale_x_reverse(expand = expansion(add = 0.7)) +
scale_y_reverse(position = "right") +
scale_colour_gradient2(
name = "PC score",
low = "black",
mid = "white" ,
high = "#7CAE00",
midpoint=0,
breaks = c(-3, 0, 3)
) +
scale_fill_gradient(low = "white", high = "yellow", guide = "none") +
scale_size_continuous(range = c(1.5, 4), guide = "none") +
labs(
x = "F2 (Speaker Intercept + GAMM Fit)",
y = "F1 (Speaker Intercept + GAMM Fit)",
title = "QB2 PC2",
subtitle = "GAMM fit + intercept"
)
QB2_intercepts_long %>%
filter(
Gender == "F"
) %>%
ggplot(
aes(
x = age_category_numeric,
y = fitted_value,
colour = PC3
)
) +
geom_point() +
geom_line(aes(y=fit), size=1, colour='red') +
geom_rect(
data = QB2_intercepts_long %>% filter(sig_loading_PC3),
fill = NA, colour = "red", xmin = -Inf,xmax = Inf,
ymin = -Inf,ymax = Inf
) +
scale_colour_scico(palette = "hawaii") +
facet_grid(cols = vars(Vowel), rows = vars(formant)) +
labs(
title = "QB2 PC3",
subtitle = "GAMM fit + intercept"
)
QB2_linear %>%
filter(
PC == "PC3"
) %>%
select(
Speaker, Vowel, formant, fitted_value, linear_fit, loading_PC3, PC_score
) %>%
group_by(Vowel) %>%
mutate(
max_loading = max(abs(loading_PC3))
) %>%
ungroup() %>%
pivot_wider(
names_from = "formant",
values_from = c("fitted_value", "linear_fit", "loading_PC3"),
names_glue = "{formant}_{.value}"
) %>%
ggplot(
aes(
x = F2_fitted_value,
y = F1_fitted_value,
colour = PC_score,
group = Vowel,
label = Vowel,
size = max_loading,
fill = rescale(max_loading^2, to = c(0, 1))
)
) +
geom_point(size = 0.7, alpha = 0.5) +
geom_line(
aes(
x = F2_linear_fit,
y = F1_linear_fit,
size = max_loading
),
show.legend = FALSE
) +
geom_label_repel(
data = . %>%
group_by(Vowel) %>%
mutate(
F1_fitted_value = mean(F1_fitted_value),
F2_fitted_value = mean(F2_fitted_value)
) %>%
filter(PC_score == min(PC_score)),
show.legend = FALSE,
alpha = 0.6,
min.segment.length = 0,
force = 100,
segment.alpha = 0.5
) +
scale_x_reverse(expand = expansion(add = 0.7)) +
scale_y_reverse(position = "right") +
scale_colour_gradient2(
name = "PC score",
low = "black",
mid = "white" ,
high = "#7CAE00",
midpoint=0,
breaks = c(-3, 0, 3)
) +
scale_fill_gradient(low = "white", high = "yellow", guide = "none") +
scale_size_continuous(range = c(1.5, 4), guide = "none") +
labs(
x = "F2 (Speaker Intercept + GAMM Fit)",
y = "F1 (Speaker Intercept + GAMM Fit)",
title = "QB2 PC3",
subtitle = "GAMM fit + intercept"
)
One sanity check for our analysis would be to run the same PCA process on an entirely different corpus consisting of speakers from another dialect. Doing this not only ensures that the success of this method isn’t merely NZ corpora specific, but it will also show whether the clusters that PCA return are not an artefact of some universal vowel space mechanics (i.e., mouth dimensions).
So let’s check this by using the Origins of Liverpool English (OLivE) corpus (Watson & Clark, 2017). For more information about this corpus, refer to Watson & Clark’s 2017 paper here:
Watson, K., & Clark, L. (2017). The origins of Liverpool English. Listening to the past: Audio records of accents of English, 114-141.
Like the QB analysis, the OLivE analysis underwent the same filtering and modelling process (controlling for age, gender, age by gender, and speech rate, with a random intercept for speaker and word). Let’s load in the speaker intercepts here and run PCA on it.
#load in OLivE intercepts
olive_ints <- read.csv(here("data", "OLIVE_ints.csv")) %>%
select(-X)
Let’s look at a pairwise correlation plot:
olive_ints %>%
select(-Speaker) %>%
as.matrix() %>%
cor() %>%
corrplot()
olive_pca <- pca_test(
olive_ints %>% select(-Speaker),
n = 1000
)
write_rds(olive_pca, here('models', 'olive_pca.rds'), compress="gz")
Let’s take a look at the variance explained by the PCs.
plot_variance_explained(olive_pca, pc_max = 5)
Similar to the QB data, PC1 and PC2 are the only stable PCs in the OLivE dataset.
First PC1:
plot_loadings(olive_pca, pc_no=1)
Figure 6.1: Index loadings for PC1 (OLivE).
Although the first few loadings here look very similar to the PC2 loadings in QB (and the PC1 loadings in Brand et al., (2021)), notice that the signs for the vowels are different. In the QB data, thought F1 correlated in the opposite direction to lot F2, strut F2, thought F2, and start F2. Whereas in the OLivE data, thought F1 now correlates in the same direction as those vowels, except for strut F2, which now correlates in the opposite direction. Therefore, the direction of sound change and how these vowels covary is completely unique to this data set of LivE speakers. Additionally, PC1 has significantly loaded nurse F1 and fleece F1 - vowels which were not loaded in PC2 for the QB data. Therefore, there appears to be covariation between the back vowels and high mid-front vowels in this data set of Liverpool English.
Now, PC2:
plot_loadings(olive_pca, pc_no=2, filter_boots=TRUE)
Figure 6.2: Index loadings for PC2 (OLivE).
PC2 is showing a cluster unlike anything that has shown up in any previous PCA on NZE speakers.
We can also check PC3:
plot_loadings(olive_pca, pc_no=3, filter_boots=TRUE)
Figure 6.3: Index loadings for PC3 (OLivE).
And similar to the QB data, PC3 is only reliably capturing a single variable, hence, we shall ignore them.
Let’s also check the variable plot for PC1 and PC2.
fviz_pca_var(
prcomp(
olive_ints %>% select(-Speaker), scale=TRUE
),
axes = c(1,2),
select.var = list(contrib=10)
)
In sum, the stability exploration in this paper, and the relative success in finding said stability is not the result of some NZE-specific dialect nor is the PCA finding stability in more universal vowel features.
One potential driver for our significant PC correlations across QB1 and QB2 may be due to similarity in the discourse material. Given that speakers were specifically ask ‘remind me of your earthquake story’ in the follow-up interviews, it may be the case that the correlation between the PCs are reflective of repeated narratives/words, rather than real monophthongal covariation.
To test for this, we investigate: a) seeing if all (or most) speakers have comparable earthquake stories in their QB1 to QB2 b) doing PCA on the ‘old’ (i.e., repeated) vs ‘new’ (i.e., spontaneous) data sets.
The QuakeBox corpus has tagging to overall topical categories. The QB2 taggings were completed by the first author (GH).
First, let’s load in the rds files which contains the same data we have been dealing with previous, but now they have a ‘type’ column
QB1type <- read_rds(here ("data", "QB1_topics.rds"))
QB2type <- read_rds(here ("data", "QB2_topics.rds"))
table(QB1type$type) #this will all count as 'old' data
##
## {February earthquake experience}
## 380
## {aftermath of the earthquakes}
## 14518
## {aftermath of the earthquakes}
## 30
## {Boxing Day earthquake experience]
## 5
## {Boxing Day earthquake experience}
## 62
## {December 23rd earthquake experience}
## 1008
## {December 23rd earthquakes experience}
## 99
## {February earthquake experience}
## 38483
## {February earthquake experience}
## 189
## {government response to the earthquakes}
## 2206
## {housing and insurance}
## 3725
## {June earthquake experience}
## 3895
## {June earthquakes experience}
## 831
## {other earthquake experience}
## 871
## {personal background to the earthquakes}
## 2056
## {September earthquake experience}
## 12467
## {thoughts for the future}
## 638
table(QB2type$type) #new topic types in here will count as 'new' data
##
## {aftermath of the earthquakes}
## 22515
## {Boxing Day earthquake experience}
## 32
## {December 23rd earthquake experience}
## 98
## {February earthquake experience}
## 10672
## {government response to the earthquakes}
## 1083
## {housing and insurance}
## 6223
## {June earthquake experience}
## 429
## {other earthquake experience}
## 62
## {personal background to the earthquakes}
## 15
## {September earthquake experience}
## 1773
## {thoughts for the future}
## 261
#remove NA taggings
QB1type <- QB1type %>%
filter(type != "NA")
QB2type <- QB2type %>%
filter(type != "NA")
#make all the taggings in QB1 = 'old'
QB1type <- QB1type %>%
mutate(
type=str_trim(type)
) %>%
filter(
type == "{February earthquake experience}" |
type == "{June earthquake experience}" |
type == "{other earthquake experience}" |
type == "{Boxing Day earthquake experience}" |
type == "{December 23rd earthquakes experience}" |
type == "{September earthquake experience}" |
type == "{personal background to the earthquakes}"
) %>%
add_column(retelling="old")
#make all the repeated taggings that are in QB1 = 'old' in QB2 and all other taggings are 'new'
QB2type <- QB2type %>%
mutate(
type=str_trim(type)
) %>%
mutate(
retelling = if_else(
type=="{February earthquake experience}" |
type=="{June earthquake experience}" |
type== "{other earthquake experience}" |
type== "{Boxing Day earthquake experience}" |
type== "{September earthquake experience}" |
type == "{personal background to the earthquakes}",
"old",
"new"
)
)
#let's make a new column that combines the speaker name and topic tag - this will come in handy in a bit when we want to divide the topic tags into 'old' and 'new'
QB1type$Speaker.Type <- paste(QB1type$Speaker, QB1type$type)
#repeat this process with QB2type
QB2type$Speaker.Type <- paste(QB2type$Speaker, QB2type$type)
We can now divide the QB2 ‘retelling’ column into things that have been
repeated (QB2oldtype
) and new narratives (QB2newtype
).
#now we can add in which tokens are part of 'new' narratives if they are not there in the QB1 data
QB2type[!QB2type$Speaker.Type %in% QB1type$Speaker.Type,]$retelling = "new"
#now we can divide the QB2 data into two separate dataframes - one containing only the old narratives and one containing only the new narratives
QB2oldtype <- QB2type %>%
filter(
retelling == "old"
)
QB2newtype <- QB2type %>%
filter(
retelling == "new"
)
Next, we should isolate the QB1 intercepts/PC loading scores data frames
(i.e., all the QB1 ‘old’ data) and the QB2newtype
data to contain the
same speakers that are in the QB2oldtype
.
QB1_pc_scores_old <- QB1_pc_scores %>%
filter(
Speaker %in% QB2oldtype$Speaker
)
QB2newtype <- QB2newtype %>%
filter(
Speaker %in% QB2oldtype$Speaker
)
#there was also one speaker who was manually removed before Analysis 1 due
# to odd STRUT tokens, let's remove them here too
QB2newtype <- QB2newtype %>%
filter(
Speaker != "QB_NZ_F_645"
)
QB2oldtype <- QB2oldtype %>%
filter(
Speaker != "QB_NZ_F_645"
)
#and let's double check there is the same amount of speakers
n_distinct(QB1_pc_scores_old$Speaker)
## [1] 48
n_distinct(QB2newtype$Speaker)
## [1] 48
n_distinct(QB2oldtype$Speaker)
## [1] 48
Finally, let’s remove all foot tokens as they are not a part of our analysis.
QB2newtype <- QB2newtype %>%
filter(
Vowel != "FOOT"
)
QB2oldtype <- QB2oldtype %>%
filter(
Vowel != "FOOT"
)
Now we can run the GAMMs on the QB2oldtype
and QB2newtype
data, get
the intercepts to run PCA, and plot in vowel space to compare to QB1type
(which is identical to the initial QB1tidy data sets, but now with an
inclusion of the retelling column containing ‘old’).
Like previous models, we need to make sure the relevant columns in the data frames are the right structure.
Starting with the QB2newtype
data.
# Check that age values in the correct order.
levels(QB2newtype$Age)
## [1] "18-25" "26-35" "36-45" "46-55" "56-65" "66-75" "76-85"
# output: "18-25" "26-35" "36-45" "46-55" "56-65" "66-75" "76-85"
# Correct!
# Convert to integer and do to factor conversions for speaker and word.
QB2newtype <- QB2newtype %>%
mutate(
age_category_numeric = as.integer(Age),
Speaker = as.factor(Speaker),
Gender = as.factor(Gender),
Word = as.factor(Word)
)
Now we can run the first model. We use the original (i.e., not ethnicity) models here, since we know adding in ethnicity does not significantly affect the PCA.
QB2newtype_models <- QB2newtype %>%
select(
-F1_50, -F2_50
) %>%
pivot_longer(
cols = F1_lob2:F2_lob2,
names_to = "formant_type",
values_to = "formant_value"
) %>%
group_by(Vowel, formant_type) %>%
nest() %>%
mutate(
model = map(
data,
~ bam(
formant_value ~ Gender + s(age_category_numeric, k = 4, by = Gender) + # main effects
s(articulation_rate, k = 3) + # control vars
s(Speaker, bs='re') + s(Word, bs='re'), # random effects
discrete = TRUE,
nthreads = 16, # This value is too high for most situations.
family = mgcv::scat(link ='identity'),
data = .x
)
) ,
model_summary = map(model, summary),
random_intercepts = map(
model,
~ get_random(.x)$`s(Speaker)` %>%
as_tibble(rownames = "Speaker") %>%
rename(intercept = value)
)
)
write_rds(
QB2newtype_models,
here('models', 'QB2newtype_models.rds'),
compress = "gz"
)
QB2newtype_models <- read_rds(here('models', 'QB2newtype_models.rds'))
And collect the random intercepts
QB2newtype_intercepts <- QB2newtype_models %>%
ungroup() %>%
select(Vowel, formant_type, random_intercepts) %>%
unnest(random_intercepts) %>%
arrange(as.character(Vowel))
QB2newtype_intercepts #yippie
QB2newtype_intercepts <- QB2newtype_intercepts %>%
# Create a column to store our eventual column names
mutate(
# Combine the 'vowel' and 'formant_type' columns as a string.
vowel_formant = str_c(Vowel, '_', formant_type),
# Remove '_lob2' for cleaner column names
vowel_formant = str_replace(vowel_formant, '_lob2', '')
) %>%
# Remove old 'vowel' and 'formant_type' columns
select(-c(Vowel, formant_type)) %>%
# Make data 'wider', by...
pivot_wider(
names_from = vowel_formant, # naming columns using 'vowel_formant'...
values_from = intercept # and values from intercept column
)
And now we repeat this modelling with the QB2oldtype
data.
# Check that age values in the correct order.
levels(QB2oldtype$Age)
## [1] "18-25" "26-35" "36-45" "46-55" "56-65" "66-75" "76-85"
# output: "18-25" "26-35" "36-45" "46-55" "56-65" "66-75" "76-85"
# Correct!
# Convert to integer and do to factor conversions for speaker and word.
QB2oldtype <- QB2oldtype %>%
mutate(
age_category_numeric = as.integer(Age),
Speaker = as.factor(Speaker),
Gender = as.factor(Gender),
Word = as.factor(Word)
)
Run the models.
QB2oldtype_models <- QB2oldtype%>%
select(
-F1_50, -F2_50
) %>%
pivot_longer(
cols = F1_lob2:F2_lob2,
names_to = "formant_type",
values_to = "formant_value"
) %>%
group_by(Vowel, formant_type) %>%
nest() %>%
mutate(
model = map(
data,
~ bam(
formant_value ~ Gender + s(age_category_numeric, k = 4, by = Gender) + # main effects
s(articulation_rate, k = 3) + # control vars
s(Speaker, bs='re') + s(Word, bs='re'), # random effects
discrete = TRUE,
nthreads = 16, # This value is too high for most situations.
family = mgcv::scat(link ='identity'),
data = .x
)
) ,
model_summary = map(model, summary),
random_intercepts = map(
model,
~ get_random(.x)$`s(Speaker)` %>%
as_tibble(rownames = "Speaker") %>%
rename(intercept = value)
)
)
write_rds(
QB2oldtype_models,
here('models', 'QB2oldtype_models.rds'),
compress = "gz"
)
QB2oldtype_models <- read_rds(here('models', 'QB2oldtype_models.rds'))
Collect the random intercepts:
QB2oldtype_intercepts <- QB2oldtype_models %>%
ungroup() %>%
select(Vowel, formant_type, random_intercepts) %>%
unnest(random_intercepts) %>%
arrange(as.character(Vowel))
QB2oldtype_intercepts <- QB2oldtype_intercepts %>%
# Create a column to store our eventual column names
mutate(
# Combine the 'vowel' and 'formant_type' columns as a string.
vowel_formant = str_c(Vowel, '_', formant_type),
# Remove '_lob2' for cleaner column names
vowel_formant = str_replace(vowel_formant, '_lob2', '')
) %>%
# Remove old 'vowel' and 'formant_type' columns
select(-c(Vowel, formant_type)) %>%
# Make data 'wider', by...
pivot_wider(
names_from = vowel_formant, # naming columns using 'vowel_formant'...
values_from = intercept # and values from intercept column
)
Let’s make sure the intercepts are okay for this data frame, given there is quite a few less vowel tokens, so there may be some missing intercepts.
QB2oldtype_intercepts #oh no, looks like there's a couple funky speakers! We need to remove them from all the dataframes we are dealing with.
QB2oldtype_intercepts <- QB2oldtype_intercepts %>%
filter(!(Speaker %in% c("QB_NZ_M_472", "QB_NZ_F_680", "QB_NZ_F_781")))
QB2newtype_intercepts <- QB2newtype_intercepts %>%
filter(!(Speaker %in% c("QB_NZ_M_472", "QB_NZ_F_680", "QB_NZ_F_781")))
QB1_pc_scores_old <- QB1_pc_scores_old %>%
filter(!(Speaker %in% c("QB_NZ_M_472", "QB_NZ_F_680", "QB_NZ_F_781")))
Because of this, we are now dealing with 45 speakers in this post-hoc analysis.
Now we need to run PCA on the two intercept data frames above and
extract the speaker loadings for PC1 and PC2 to compare it to the
QB1_pc_scores_old
data. If the previous stability findings between QB1
and QB2 are not the result of narratives difference, then there will
be no statistically significant correlation between speaker loadings.
Like the modelling, we can run the PCA on the data frames one at a time,
starting with the QB2newtype_intercepts.
QB2newtype_pca <- pca_test(
QB2newtype_intercepts %>% select(-Speaker),
n = 1000
)
write_rds(QB2newtype_pca, here('models', 'QB2newtype_pca.rds'), compress = "gz")
And, as per usual, a quick summary of this PCA.
plot_variance_explained(QB2newtype_pca, pc_max = 5)
Like Analysis 1 and Analysis 2, PC1 and PC2 are the most stable PCs in this data.
First PC1:
plot_loadings(QB2newtype_pca, pc_no=1)
Figure 6.4: Index loadings for PC1 (QB2 - new narratives).
This is roughly the same as the PC1 generated from the QB2 analysis (Analysis 1). The front vowels remain the predominate cluster in this data. However, there is slightly less stability with nurse and dress.
Second PC2:
plot_loadings(QB2newtype_pca, pc_no=2, filter_boots=TRUE)
Figure 6.5: Index loadings for PC2 (QB2 - new narratives).
And again, roughly the same cluster is present here as is in the QB2 PC2 (i.e., strut, thought, and start).
And let’s check PC3 while we’re here:
plot_loadings(QB2newtype_pca, pc_no=3, filter_boots=TRUE)
Figure 6.6: Index loadings for PC3 (QB2 - new narratives).
So PC3 includes the lot variables in an (unstable) pairwise relationship with nurse.
We can now run PCA on the repeated narratives (i.e.,
QB2oldtype_intercepts
).
QB2oldtype_pca <- pca_test(
QB2oldtype_intercepts %>% select(-Speaker),
n = 1000
)
write_rds(QB2oldtype_pca, here('models', 'QB2oldtype_pca.rds'), compress="gz")
And, as per usual, a quick summary of this PCA.
plot_variance_explained(QB2newtype_pca, pc_max = 5)
Great, the same PCs are most stable here.
First PC1:
plot_loadings(QB2oldtype_pca, pc_no=1)
Figure 6.7: Index loadings for PC1 (QB2 - repeated narratives).
Again, the main front vowel cluster is apparent here.
Second PC2:
plot_loadings(QB2oldtype_pca, pc_no=2, filter_boots = TRUE)
Figure 6.8: Index loadings for PC2 (QB2 - repeated narratives).
This cluster is more reflected of QB2’s PC3 than PC2 with that lot loading.
Finally. PC3:
plot_loadings(QB2oldtype_pca, pc_no=3, filter_boots = TRUE)
Figure 6.9: Index loadings for PC3 (QB2 - repeated narratives).
Now, PC3 is more reflective of the QB1 PC2.
Similarly to the correlation checks we were running earlier, we grab the speaker loading scores per PC and run a Spearman correlation test on them.
QB2newtype_pc_scores <- prcomp(
QB2newtype_intercepts %>%
select(-Speaker),
scale=TRUE
)$x
QB2oldtype_pc_scores <- prcomp(
QB2oldtype_intercepts %>%
select(-Speaker),
scale=TRUE
)$x
QB2newtype_pc_scores <- QB2newtype_intercepts %>%
bind_cols(
QB2newtype_pc_scores[,1:3] #grabbing PC3 here just in case
)
QB2oldtype_pc_scores <- QB2oldtype_intercepts %>%
bind_cols(
QB2oldtype_pc_scores[,1:3] #grabbing PC3 here just in case
)
QBBoth_type_pc_scores <- QB2newtype_pc_scores %>%
select(Speaker, PC1, PC2, PC3) %>%
rename(
PC1_qb2new = PC1,
PC2_qb2new = PC2,
PC3_qb2new = PC3,
) %>%
left_join(
QB2oldtype_pc_scores %>%
select(Speaker, PC1, PC2, PC3) %>%
rename(
PC1_qb2old = PC1,
PC2_qb2old = PC2,
PC3_qb2old = PC3
) %>%
left_join(
QB1_pc_scores_old %>%
select(Speaker, PC1, PC2) %>%
rename(
PC1_qb1old = PC1,
PC2_qb1old = PC2
)
)
)
We need to run four correlation tests - one of the QB1old - QB2old, another on QB1old - QB2new (per PC) - to answer our question here.
cor.test(QBBoth_type_pc_scores$PC1_qb1old, QBBoth_type_pc_scores$PC1_qb2old, method="spearman") # sig.
##
## Spearman's rank correlation rho
##
## data: QBBoth_type_pc_scores$PC1_qb1old and QBBoth_type_pc_scores$PC1_qb2old
## S = 4696, p-value = 4.532e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.6906456
cor.test(QBBoth_type_pc_scores$PC2_qb1old, QBBoth_type_pc_scores$PC2_qb2old, method="spearman") #not sig.
##
## Spearman's rank correlation rho
##
## data: QBBoth_type_pc_scores$PC2_qb1old and QBBoth_type_pc_scores$PC2_qb2old
## S = 11432, p-value = 0.102
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2469038
#let's check PC3 since there were significant correlations between QB1_PC2 and QB2_PC3 in Analysis 2.
cor.test(QBBoth_type_pc_scores$PC2_qb1old, QBBoth_type_pc_scores$PC3_qb2old, method="spearman") #sig
##
## Spearman's rank correlation rho
##
## data: QBBoth_type_pc_scores$PC2_qb1old and QBBoth_type_pc_scores$PC3_qb2old
## S = 8798, p-value = 0.004327
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4204216
And now the old vs new
cor.test(QBBoth_type_pc_scores$PC1_qb1old, QBBoth_type_pc_scores$PC1_qb2new, method="spearman") # sig.
##
## Spearman's rank correlation rho
##
## data: QBBoth_type_pc_scores$PC1_qb1old and QBBoth_type_pc_scores$PC1_qb2new
## S = 5760, p-value = 8.767e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.6205534
cor.test(QBBoth_type_pc_scores$PC2_qb1old, QBBoth_type_pc_scores$PC2_qb2new, method="spearman") # not sig.
##
## Spearman's rank correlation rho
##
## data: QBBoth_type_pc_scores$PC2_qb1old and QBBoth_type_pc_scores$PC2_qb2new
## S = 11980, p-value = 0.1641
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2108037
In sum, the relatively stable findings in Analysis 1 will appear regardless if whether speakers are repeating narratives or telling new ones. We find some instability between PC2 for the original vs repeated correlation test, however, this is rectified when correlated QB1 PC2 to QB2 PC3 (old), similar to the ethnicity correlation tests in Analysis 2. This could suggest that some of the clusters in PC2 for repeated QB2 narratives ‘spill over’ into PC3 (potentially due to the smaller group of speakers we are dealing with for this data set). Furthermore, we can be sure that it’s not new narratives that are driving the stability findings.
Save figures for publication.
We combine multiple plots into panels using patchwork
.
# Figure 1
figure_1 <- all_plot / b21_plot +
plot_annotation(tag_levels = 'A') +
plot_layout(guides = 'collect')
ggsave(
plot = figure_1,
here("plots", "figure_1.tiff"),
width = 200,
height = 260,
units = "mm",
dpi = 300
)
# Figure 2
ggsave(
plot = parametric_difference_plot,
here("plots", "figure_2.tiff"),
width = 200,
height = 130,
units = "mm",
dpi = 300
)
# Figure 3
figure_3 <- QB1_plot_variance_explained /
QB2_plot_variance_explained +
plot_annotation(tag_levels = 'A') +
plot_layout(guides = 'collect')
ggsave(
plot = figure_3,
here("plots", "figure_3.tiff"),
width = 200,
height = 260,
units = "mm",
dpi = 300
)
# Figure 4
figure_4 <- QB1_PC1_plot_loadings /
QB2_PC1_plot_loadings +
plot_annotation(tag_levels = 'A') +
plot_layout(guides = 'collect')
ggsave(
plot = figure_4,
here("plots", "figure_4.tiff"),
width = 200,
height = 260,
units = "mm",
dpi = 300
)
# Figure 5
ggsave(
plot = PC1_corrplot,
here("plots", "figure_5.tiff"),
width = 200,
height = 260,
units = "mm",
dpi = 300
)
# Figure 6
figure_6 <- QB1_PC2_plot_loadings /
QB2_PC2_plot_loadings +
plot_annotation(tag_levels = 'A') +
plot_layout(guides = 'collect')
ggsave(
plot = figure_6,
here("plots", "figure_6.tiff"),
width = 200,
height = 260,
units = "mm",
dpi = 300
)
# Figure 7
ggsave(
plot = PC2_corrplot,
here("plots", "figure_7.tiff"),
width = 200,
height = 260,
units = "mm",
dpi = 300
)
The removed stopwords were: ‘a’, ‘ah’, ‘ahh’, ‘am’, ‘an’, ‘and’, ‘are’, “aren’t”, ‘as’, ‘at’, ‘aw’, ‘because’, ‘but’, ‘could’, ‘do’, “don’t”, ‘eh’, ‘for’, ‘from’, ‘gonna’, ‘had’, ‘has’, ‘have’, ‘he’, “he’s”, ‘her’, ‘high’, ‘him’, ‘huh’, ‘i’, “i’ll”, “i’m”, “i’ve”, “i’d”, ‘in’, ‘into’, ‘is’, ‘it’, “it’s”, ‘its’, ‘just’, ‘mean’, ‘my’, ‘nah’, ‘not’, ‘of’, ‘oh’, ‘on’, ‘or’, ‘our’, ‘says’, ‘she’, “she’s”, ‘should’, ‘so’, ‘than’, ‘that’, “that’s”, ‘the’, ‘them’, ‘there’, “there’s”, ‘they’, ‘this’, ‘to’, ‘uh’, ‘um’, ‘up’, ‘was’, “wasn’t”, ‘we’, ‘were’, ‘what’, ‘when’, ‘which’, ‘who’, ‘with’, ‘would’, ‘yeah’, ‘you’, and “you’ve”.↩︎
5.1 Speaker PC scores and uncontrolled social variation
For posthoc analysis, we explore how ethnicity and socioeconomic factors may influence PCA. We already have ethnicity data from QB but need to gather socioeconomic data for our speakers. We calculate this by using scores for each speaker’s current place of residence and what high school they attended to give them a socioeconomic score.
5.1.1 Socioeconomic data description
The ethnicity categories we are using are the ‘NZ Ethnic’ categories reduced down to ‘Māori’ and ‘Non-Māori’. The distribution, in the speakers in both QB1 and QB2 is:
We load the socioecomonic data.
High school deciles were manually coded following the 2015 decile changes available here (note that from 2023 decile ratings are being phased out and a new Equity rating is replacing deciles)
Decile scores for last area speakers say they live in. This is done by taking ‘The New Zealand Index of Deprivation’ is a census based measure of socioeconomic deprivation. It is tied to StatsNZ Statistical Area 1 blocks. These are small areas which aim to contain around 100-200 people. See https://github.com/JoshuaWilsonBlack/fire_NZDep for the code which aggregates this information for areas corresponding to those given by QuakeBox participants.
We join this information to our data for speakers in both time points and plot the distribution.
We first plot by school deciles. High values for school decile indicate increased socioeconomic status.
Figure 5.1: Distribution of school deciles by gender.
And then by location deciles.
Figure 5.2: Distribution of school deciles by ethnicity.
We now plot location deciles. High values for location decile indicate reduced socioeconomic status.
Figure 5.3: Distribution of location deciles by gender.
Figure 5.4: Distribution of location deciles by ethnicity.
We can also look to see if there is any strong relationship between the two socioeconomic indicators:
Figure 5.5: Location decile by school decile.
There is a relationship discernible here, in which school decile decreases with location decile from location decile 5 onwards. But there is a lot of noise and one is not a particularly reliable predictor of the other.
The Spearman correlation value of the two scores is -0.3859009, which is a moderate relationship. This is a good sign that the two are providing independent evidence of socioeconomic status.
5.1.2 Modelling
We want to model speaker PC scores by these socioeconomic factors. While we only have socioeconomic factors coded for the speakers who are in both QB1 and QB2, the most stable PC scores we have are from the PCA analysis applied to QB1 as a whole. We attach the socioeconomic data to those PC scores and fit a GAM to test whether they predict PC scores.
We combine the data:
We model for PC1:
Only the parametric ethnicity term comes out as significant. The intercept term represents Māori female speakers, these come out as significantly different from female non-Māori speakers. Māori female speakers have lower PC1 scores than the rest of the population when controlling for other factors. That is, female Māori speakers are less conservative with respect to the short front vowel shift and related changes (as captured by PC1). Non-Māori females come out as more conservative.
Let’s quickly check the quality of the model.
There’s not a huge amount of data at the edges of the distribution (this is to be expected, PC scores are scaled around 0). There is nothing too worrying about these plots.
We can look at the smooths for age category, school decile, and location decile.
These are consistent with no effect (as suggested by the summary). Ignoring statistical significance and freely interpreting, the smooth for age category goes against the natural assumption that more conservative should correspond with older. For a portion of the school decile data suggests increased conservatives as location decile increases (i.e. as socioeconomic status increases) until about decile 7. School decile has an almost flat smooth.
We can do the same for PC2.
There is no obvious effect here for ethnicity, age, or socioeconomic status.