Purpose: Introduce the data and libraries used
This document provides the code used during the data processing stages of the Brand, Hay, Clark, Watson and Sóskuthy (2020) manuscript, submitted to the Journal of Phonetics. It contains all the steps the authors carried out in order to produce the final data set used in the analysis (see supplementary materials: Analysis file).
The raw data file being processed contains data extracted from the ONZE corpus using LaBB-CAT (Fromont & Hay, 2008)
The data were processed automatically with Praat and were force-aligned using the HTK toolkit
The ONZE corpus comprises 4 sub-corpora: Mobile Unit (MU), Intermediate Archive (IA), Canterbury Regional Survey (Darfield), Canterbury Corpus (CC). You can find out more on the ONZE corpus at https://www.canterbury.ac.nz/nzilbb/research/onze/
All available speakers were queried and all available tokens containing the following vowel segments were extracted (DRESS, FLEECE, FOOT, GOOSE, KIT, LOT, NURSE, SCHWA, START, STRUT, THOUGHT, TRAP)
The resulting raw unprocessed dataset is ONZE_raw.rds; the only processing this file has undergone is converting the original .csv file output from LaBB-CAT to .rds format, in order to reduce the size of the file.
We do not include the raw data file in the project repository as it contains sensitive information about the speakers that we do not wish to make publicly available. If you are interested in accessing or using the ONZE instance of LaBB-CAT, please contact nzilbb [at] canterbury [dot] ac [dot] nz
The data file that this document produces (ONZE_vowels_filtered_anon.rds) is made available in the project repository. It contains anonymised data and all the relevant variables required to reproduce the main analyses reported in the paper.
We hope that this document provides a transparent and clear means to understand each of the steps taken to obtain the final data set
First, we will load in the relevant R libraries required to run the processing procedures and then load in the data itself.
In order for the code in this document to work, the libraries must be installed and loaded into your R session. If you do not have any of the libraries installed, you can run install.packages("LIBRARY NAME") (changing "LIBRARY NAME" to the required library name, e.g. install.packages("tidyverse")), which should resolve any error messages you might get about missing packages.
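If you prefer to check programmatically, a minimal sketch (not part of the original script) that installs any of the required packages that are missing could look like this:
#minimal sketch (not part of the original script): install any required packages that are missing
required_pkgs <- c("tidyverse", "tidylog", "ggforce", "cowplot")
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)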
#load in the libraries
library(tidyverse) #various functions
library(tidylog) #gives summaries
library(ggforce) #plotting functions
library(cowplot) #plotting functions
#set the random seed: this is important for making the anonymisation of the variables reproducible
set.seed(123)
#check information about the R session; this gives details of the R setup on the authors' computer
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cowplot_1.0.0 ggforce_0.3.1 tidylog_1.0.1 forcats_0.5.0
## [5] stringr_1.4.0 dplyr_0.8.5 purrr_0.3.4 readr_1.3.1
## [9] tidyr_1.0.2 tibble_3.0.1 ggplot2_3.3.0 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] clisymbols_1.2.0 tidyselect_1.0.0 xfun_0.13 haven_2.2.0
## [5] lattice_0.20-41 colorspace_1.4-1 vctrs_0.2.4 generics_0.0.2
## [9] htmltools_0.4.0 yaml_2.2.1 rlang_0.4.5 pillar_1.4.3
## [13] glue_1.4.0 withr_2.2.0 DBI_1.1.0 tweenr_1.0.1
## [17] dbplyr_1.4.3 modelr_0.1.6 readxl_1.3.1 lifecycle_0.2.0
## [21] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5
## [25] evaluate_0.14 knitr_1.28 fansi_0.4.1 broom_0.5.6
## [29] Rcpp_1.0.4 scales_1.1.0 backports_1.1.6 jsonlite_1.6.1
## [33] farver_2.0.3 fs_1.4.1 hms_0.5.3 digest_0.6.25
## [37] stringi_1.4.6 polyclip_1.10-0 grid_3.6.3 cli_2.0.2
## [41] tools_3.6.3 magrittr_1.5 crayon_1.3.4 pkgconfig_2.0.3
## [45] MASS_7.3-51.6 ellipsis_0.3.0 xml2_1.3.2 reprex_0.3.0
## [49] lubridate_1.7.8 assertthat_0.2.1 rmarkdown_2.1 httr_1.4.1
## [53] rstudioapi_0.11 R6_2.4.1 nlme_3.1-147 compiler_3.6.3
#load in the data
vowels_all_raw <- readRDS("Filtering_data/ONZE_raw.rds")
Purpose: Make the data easier to use
Several initial tidying steps are applied to the raw data for ease of use; see the comments (lines preceded by a #) for information about what the code is doing. There is no direct modification of the actual data at this stage, but new variables are created and some variables that are not required are removed.
vowels_all <- vowels_all_raw %>%
#rename the variables
rename(TokenNum = Number,
Gender = participant_gender,
PrevWord = `Before Match`,
FollWord = `After Match`,
Word = `Target orthography`,
WordSegments = `Match segments`,
VowelDISC = `Target segments`,
VowelStart = `Target segments start`,
VowelEnd = `Target segments end`,
VowelMid = time_0.5,
F1_50 = `F1-time_0.5`,
F2_50 = `F2-time_0.5`,
F3_50 = `F3-time_0.5`) %>%
#create a Wells lexical set variable
mutate(Vowel = fct_recode(factor(VowelDISC),
FLEECE = "i",
KIT = "I",
DRESS = "E",
TRAP = "{",
SCHWA = "@",
START = "#",
LOT = "Q",
THOUGHT = "$",
NURSE = "3",
STRUT = "V",
FOOT = "U",
GOOSE = "u"),
#recode any gender values read in as FALSE as female
Gender = fct_recode(factor(Gender), "F" = "FALSE"),
#create a vowel duration variable
VowelDur = VowelEnd - VowelStart,
#create an orthographic length variable
orthographic_length = str_length(Word),
#create a phonological length variable
phonological_length = str_length(WordSegments)
) %>%
#anonymise the speaker and word variables
mutate(Speaker_anon = ifelse(is.na(Speaker), Speaker, paste0(Corpus, "_", tolower(Gender), "_", fct_anon(factor(Speaker)))),
Word_anon = ifelse(is.na(Word), Word, paste0("word_", fct_anon(factor(Word))))) %>%
#change all factors to characters
mutate_if(sapply(., is.factor), as.character)
## rename: renamed 13 variables (TokenNum, Gender, PrevWord, FollWord, Word, …)
## mutate: changed 200,000 values (10%) of 'Gender' (0 new NA)
## new variable 'Vowel' with 49 unique values and 0% NA
## new variable 'VowelDur' with 101,088 unique values and <1% NA
## new variable 'orthographic_length' with 29 unique values and <1% NA
## new variable 'phonological_length' with 37 unique values and 0% NA
## mutate: new variable 'Speaker_anon' with 636 unique values and 0% NA
## new variable 'Word_anon' with 31,431 unique values and <1% NA
## mutate_if: converted 'SearchName' from factor to character (0 new NA)
## converted 'Transcript' from factor to character (0 new NA)
## converted 'Speaker' from factor to character (0 new NA)
## converted 'Corpus' from factor to character (0 new NA)
## converted 'Gender' from factor to character (0 new NA)
## converted 'MatchId' from factor to character (0 new NA)
## converted 'TargetId' from factor to character (0 new NA)
## converted 'URL' from factor to character (0 new NA)
## converted 'PrevWord' from factor to character (0 new NA)
## converted 'Text' from factor to character (0 new NA)
## converted 'FollWord' from factor to character (0 new NA)
## converted 'Word' from factor to character (0 new NA)
## converted 'WordSegments' from factor to character (0 new NA)
## converted 'VowelDISC' from factor to character (0 new NA)
## converted 'F1_50' from factor to character (0 new NA)
## converted 'F2_50' from factor to character (0 new NA)
## converted 'F3_50' from factor to character (0 new NA)
## converted 'Error' from factor to character (0 new NA)
## converted 'Vowel' from factor to character (0 new NA)
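As an aside, the following minimal sketch (not part of the processing pipeline; the toy speaker labels are made up) illustrates how fct_anon() from the forcats package replaces factor levels with arbitrary numeric labels. The set.seed(123) call above is what makes this random relabelling reproducible for the speaker and word variables.
#illustrative sketch of fct_anon(): levels are replaced with randomly assigned numeric labels
toy_speakers <- factor(c("speaker one", "speaker two", "speaker one"))
fct_anon(toy_speakers)
#e.g. returns 2 1 2 with levels "1" "2"; the exact mapping depends on the current random seed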
#anonymise the transcript variable
anonymised <- vowels_all %>%
select(Speaker, Speaker_anon, Transcript) %>%
distinct() %>%
mutate(Speaker1 = str_replace(string = Speaker, pattern = " ", replacement = ""),
Transcript_anon = str_replace(string = Transcript, pattern = Speaker1, replacement = Speaker_anon)) %>%
select(-Speaker1)
## select: dropped 28 variables (SearchName, TokenNum, Corpus, Gender, participant_year_of_birth, …)
## distinct: removed 2,016,992 rows (>99%), 3,871 rows remaining
## mutate: new variable 'Speaker1' with 636 unique values and 0% NA
## new variable 'Transcript_anon' with 3,871 unique values and 0% NA
## select: dropped one variable (Speaker1)
#keep only the variables which are informative and drop ones that are not
vowels_all <- vowels_all %>%
left_join(., anonymised) %>%
select(TokenNum, Speaker, Speaker_anon, Transcript, Transcript_anon, Corpus, Gender, participant_year_of_birth, Line, LineEnd, MatchId, TargetId, URL, PrevWord, Text, FollWord, Word, Word_anon, WordSegments, Vowel, VowelDISC, F1_50, F2_50, VowelStart, VowelMid, VowelEnd, VowelDur, Error, orthographic_length, phonological_length)
## Joining, by = c("Transcript", "Speaker", "Speaker_anon")
## left_join: added one column (Transcript_anon)
## > rows only in x 0
## > rows only in y ( 0)
## > matched rows 2,020,863
## > ===========
## > rows total 2,020,863
## select: dropped 2 variables (SearchName, F3_50)
Purpose: filter out unsuitable data
As the raw data file contains various data points that are not suitable for our final analysis, we will filter them out from the dataset. The following steps are taken:
Remove speakers with missing gender or year of birth information (this is required for modelling)
Remove transcripts that are word lists (the data should be interview speech only)
Remove tokens where Praat was not able to extract F1/F2 (these are non-numeric values)
Remove tokens where F1 is > 1000 Hz (these are likely errors)
Remove tokens with vowel durations <= 0.01 or >= 3 (these are likely errors)
Remove tokens with a phonological length > 25 (these are likely errors)
Remove tokens with a hesitation (these will affect the production of the vowel)
Remove tokens where the word is not transcribed
Remove tokens from a list of stopwords (these are high frequency grammatical words that may not be representative of a speaker's vocalic productions)
#create a list of stopwords
stopWords <- c('a', 'ah', 'ahh', 'am', "an'", 'an', 'and', 'are', "aren't", 'as', 'at', 'aw', 'because', 'but', 'could', 'do', "don't", 'eh', 'for', 'from', 'gonna', 'had', 'has', 'have', 'he', "he's", 'her', 'high', 'him', 'huh', 'I', "I'll", "I'm", "I've", "I'd", 'in', 'into', 'is', 'it', "it's", 'its', 'just', 'mean', 'my', 'nah', 'not', 'of', 'oh', 'on', 'or', 'our', 'says', 'she', "she's", 'should', 'so', 'than', 'that', "that's", 'the', 'them', 'there', "there's", 'they', 'this', 'to', 'uh', 'um', 'up', 'was', "wasn't", 'we', 'were', 'what', 'when', 'which', 'who', 'with', 'would', 'yeah', 'you', "you've")
#apply the filtering
vowels_all <- vowels_all %>%
filter(
#filter missing gender and participant year of birth
!is.na(Gender),
!is.na(participant_year_of_birth)) %>%
filter(
#filter word lists
!grepl("-WL", Transcript_anon),
!Transcript_anon %in% c("CC_f_450-08.trs", "CC_f_546-12.trs", "CC_m_032-10.trs", "CC_m_059-08.trs",
"CC_m_204-11.trs", "CC_m_228-08.trs", "CC_m_340-06.trs", "CC_m_397-07.trs",
"CC_m_427-02.trs", "CC_m_446-10.trs", "CC_m_565-09.trs")) %>%
filter(
#filter tokens with an error or missing F1/F2
is.na(Error) |
!is.na(F1_50) |
!is.na(F2_50),
#filter tokens where F1_50 has 4 or more characters: these are either i. an error string output from Praat or ii. an F1 value greater than 1000 Hz
str_length(F1_50) < 4) %>%
filter(
#filter tokens with a very long phonological length
phonological_length < 25,
#filter tokens with hesitations
!grepl("~", Word),
#filter tokens which do not have the word transcribed
!is.na(Word)) %>%
filter(
#filter tokens with very short or long vowel durations
VowelDur >= 0.01,
VowelDur <= 3) %>%
filter(
#filter stopwords
!Word %in% stopWords
) %>%
#ensure the F1/F2 variables are numeric as any error strings are now removed
mutate(F1_50 = as.numeric(as.character(F1_50)),
F2_50 = as.numeric(as.character(F2_50)))
## filter: removed 81,040 rows (4%), 1,939,823 rows remaining
## filter: removed 91,400 rows (5%), 1,848,423 rows remaining
## filter: removed 79,618 rows (4%), 1,768,805 rows remaining
## filter: removed 12,705 rows (1%), 1,756,100 rows remaining
## filter: removed 7,246 rows (<1%), 1,748,854 rows remaining
## filter: removed 711,200 rows (41%), 1,037,654 rows remaining
## mutate: converted 'F1_50' from character to double (0 new NA)
## converted 'F2_50' from character to double (0 new NA)
Purpose: add speech rate and stress coding, filtering out unstressed tokens
As the original dataset did not include each speaker's speech rate for each transcript or the stress coding of the tokens, we have to add this information to the dataset. Additionally, as unstressed tokens are produced differently to stressed tokens, we chose to remove all SCHWA and unstressed tokens from the dataset. There is additional data loss because some tokens do not have stress coded; when this is the case, they are also removed.
Speech rate (calculated per speaker, per transcript, in syllables/second):
#load in the speech rate data
vowels_all_time_sr <- readRDS("Filtering_data/ONZE_speech_rates.rds") %>%
#remove rows with missing speech rate information or whose transcript is not present in the dataset
filter(!is.na(Speech_rate),
Transcript %in% vowels_all$Transcript)
## filter: removed 811 rows (20%), 3,156 rows remaining
#add the speech rate variable to the dataset
vowels_all <- vowels_all %>%
left_join(., vowels_all_time_sr)
## Joining, by = c("Transcript", "Corpus")
## left_join: added one column (Speech_rate)
## > rows only in x 0
## > rows only in y ( 0)
## > matched rows 1,037,654
## > ===========
## > rows total 1,037,654
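For reference, the Speech_rate values loaded above were precomputed. Purely as an illustration, a hypothetical sketch of how syllables per second could be derived from per-transcript syllable counts and durations is given below; the toy_transcripts data frame and its column names are made up for this example and are not part of the ONZE export.
#hypothetical sketch: speech rate as syllables per second
#(the syllable counts and durations here are made up; ONZE_speech_rates.rds already contains the computed rates)
toy_transcripts <- data.frame(Transcript = c("transcript_1", "transcript_2"),
                              n_syllables = c(1200, 800),
                              duration_sec = c(300, 250))
toy_transcripts$Speech_rate <- toy_transcripts$n_syllables / toy_transcripts$duration_sec
toy_transcripts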
Stress:
#load in the stress data
vowels_stress <- readRDS("Filtering_data/ONZE_stress.rds")
#remove SCHWA tokens
vowels_all2 <- vowels_all %>%
filter(Vowel != "SCHWA") %>%
left_join(., vowels_stress)
## filter: removed 175,121 rows (17%), 862,533 rows remaining
## Joining, by = c("Speaker", "Transcript", "Corpus", "Line", "LineEnd", "MatchId", "TargetId", "URL", "Text")
## left_join: added 16 columns (SearchName, Number, Before Match, After Match, Target stress, …)
## > rows only in x 33,228
## > rows only in y (1,192,435)
## > matched rows 829,305
## > ===========
## > rows total 862,533
#create list of words with stress coding
vowels_all_present_stress <- vowels_all2 %>%
filter(!is.na(`Target stress`),
`Target stress` != "0") %>%
select(WordSegments, `Target segments`, `Target stress`) %>%
distinct()
## filter: removed 386,312 rows (45%), 476,221 rows remaining
## select: dropped 44 variables (TokenNum, Speaker, Speaker_anon, Transcript, Transcript_anon, …)
## distinct: removed 457,385 rows (96%), 18,836 rows remaining
#create list of words missing stress coding
vowels_all_missing_stress <- vowels_all2 %>%
filter(is.na(`Target stress`),
`Match segments` %in% vowels_all_present_stress$WordSegments) %>%
mutate(`Target stress` = "stress")
## filter: removed 708,544 rows (82%), 153,989 rows remaining
## mutate: converted 'Target stress' from factor to character (153989 fewer NA)
#add in tokens with stress
vowels_all <- vowels_all2 %>%
filter(!is.na(`Target stress`),
`Target stress` != "0") %>%
rbind(vowels_all_missing_stress)
## filter: removed 386,312 rows (45%), 476,221 rows remaining
#clean up the data
vowels_all <- vowels_all %>%
select(TokenNum:Speech_rate, `Target stress`)
## select: dropped 15 variables (SearchName, Number, Before Match, After Match, Target stress start, …)
nrow(vowels_all)
## [1] 630210
Purpose: filter out outliers
As the data have been force-aligned and the F1/F2 values have not undergone hand correction, we implement an outlier removal step so that values that are likely to be errors are filtered from the dataset. The way we do this is:
Calculate the mean and standard deviation for F1 and F2, per vowel and per speaker
Calculate a min and max threshold for F1 and F2 values, based on the mean +/- 2.5 standard deviations
Determine whether a token falls below the min or above the max threshold, in which case it is classified as an outlier and removed from the dataset (a small numeric sketch follows this list)
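As a small numeric sketch (with made-up F1 values, not from the corpus), the thresholds for a single speaker and vowel would be computed as follows; the actual per-speaker, per-vowel computation is in the code below.
#illustrative sketch with made-up F1 values: tokens outside these thresholds would be flagged as outliers
toy_F1 <- c(350, 380, 400, 410, 420, 900)
mean(toy_F1) + 2.5 * sd(toy_F1) #max F1 threshold
mean(toy_F1) - 2.5 * sd(toy_F1) #min F1 threshold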
#outlier removal
sd_limit = 2.5
#calculate the summary statistics required for the outlier removal
vowels_all_summary <- vowels_all %>%
group_by(Speaker, Vowel) %>%
summarise(n = n(),
mean_F1 = mean(F1_50, na.rm = TRUE),
mean_F2 = mean(F2_50, na.rm = TRUE),
sd_F1 = sd(F1_50, na.rm = TRUE),
sd_F2 = sd(F2_50, na.rm = TRUE),
max_F1 = mean(F1_50) + sd_limit*(sd(F1_50)),
min_F1 = mean(F1_50) - sd_limit*(sd(F1_50)),
max_F2 = mean(F2_50) + sd_limit*(sd(F2_50)),
min_F2 = mean(F2_50) - sd_limit*(sd(F2_50)))
## group_by: 2 grouping variables (Speaker, Vowel)
## summarise: now 6,511 rows and 11 columns, one group variable remaining (Speaker)
#store the outlier tokens data
outlier_tokens <- vowels_all %>%
inner_join(., vowels_all_summary) %>%
mutate(outlier = ifelse(F1_50 > min_F1 &
F1_50 < max_F1 &
F2_50 > min_F2 &
F2_50 < max_F2, FALSE, TRUE)) %>%
group_by(Speaker, Vowel) %>%
filter(outlier == TRUE) %>%
ungroup() %>%
select(TokenNum:`Target stress`)
## Joining, by = c("Speaker", "Vowel")
## inner_join: added 9 columns (n, mean_F1, mean_F2, sd_F1, sd_F2, …)
## > rows only in x ( 0)
## > rows only in y ( 0)
## > matched rows 630,210
## > =========
## > rows total 630,210
## mutate: new variable 'outlier' with 3 unique values and <1% NA
## group_by: 2 grouping variables (Speaker, Vowel)
## filter (grouped): removed 603,324 rows (96%), 26,886 rows remaining
## ungroup: no grouping variables
## select: dropped 10 variables (n, mean_F1, mean_F2, sd_F1, sd_F2, …)
#add the summary statistics and filter out outliers
vowels_all <- vowels_all %>%
inner_join(., vowels_all_summary) %>%
mutate(outlier = ifelse(F1_50 > min_F1 &
F1_50 < max_F1 &
F2_50 > min_F2 &
F2_50 < max_F2, FALSE, TRUE)) %>%
group_by(Speaker, Vowel) %>%
filter(outlier == FALSE) %>%
ungroup() %>%
select(TokenNum:`Target stress`)
## Joining, by = c("Speaker", "Vowel")
## inner_join: added 9 columns (n, mean_F1, mean_F2, sd_F1, sd_F2, …)
## > rows only in x ( 0)
## > rows only in y ( 0)
## > matched rows 630,210
## > =========
## > rows total 630,210
## mutate: new variable 'outlier' with 3 unique values and <1% NA
## group_by: 2 grouping variables (Speaker, Vowel)
## filter (grouped): removed 26,891 rows (4%), 603,319 rows remaining
## ungroup: no grouping variables
## select: dropped 10 variables (n, mean_F1, mean_F2, sd_F1, sd_F2, …)
Purpose: remove all FOOT tokens as they have a relatively low token count
Inspecting the token counts per vowel highlights that there are considerably fewer FOOT tokens in the dataset compared to the other vowels. As we require speakers to have as large an inventory of tokens as possible for all vowels, we decided to remove all FOOT tokens from the dataset in order to minimise the likelihood that some speakers would have a small token count. This option was preferred to removing individual speakers with low FOOT counts, as that would have resulted in substantial data loss.
A table with the token counts per vowel (at this stage of the processing) is provided below.
#get token counts per vowel
vowels_all %>%
group_by(Vowel) %>%
summarise(n = n()) %>%
arrange(n)
## group_by: one grouping variable (Vowel)
## summarise: now 11 rows and 2 columns, ungrouped
#filter out FOOT tokens
vowels_all <- vowels_all %>%
filter(Vowel != "FOOT")
## filter: removed 17,241 rows (3%), 586,078 rows remaining
Purpose: filter out speakers who have low quality data
As there may be some speakers whose alignment is particularly poor (not simply in the sense of having a number of outlier tokens), we want to check the mean distance between each speaker's vowels. This is achieved by calculating the mean euclidean distance between all of the vowels' mean F1/F2 values. This essentially asks: are there speakers in the dataset whose vowel spaces are unnaturally overlapping, e.g. is it the case for any given speaker that the mean F1/F2 values for all their vowels lie within a very small overlapping space? If we do find such speakers, this may indicate issues with their automatic alignment and, thus, that their F1/F2 values are dramatically unreliable.
Note that these speakers would not have had reliable outlier removal applied in the previous section (i.e. removing tokens that were +/- 2.5 SDs from the mean, calculated per speaker, per vowel), as the mean values used in those calculations would not have been reliable in the first instance for speakers with very small euclidean distances between their mean F1/F2 values.
We then remove speakers who are more than 2 standard deviations below the mean euclidean distance, as these represent relatively overlapped vowel spaces and we define them to be outliers relative to the rest of the speakers. Doing this identifies 12 speakers, who are subsequently removed from the dataset.
To calculate the metric, we will do the following:
Calculate the mean F1/F2 values per speaker, per vowel
Create a euclidean distance matrix based on the distances between each of the vowels, i.e. each speaker will have a 10x10 matrix based on the distance in F1/F2 space between each pair of vowels; for example, we would expect FLEECE and START to have a large distance, but STRUT and START to have a small distance (see the small worked sketch after this list)
For each speaker, take the mean of the distances per vowel, i.e. the column mean for each vowel in the distance matrix; this gives the average distance of a vowel from the other vowels
For each speaker, take the mean of all of the vowel distances; this gives the average distance between all vowels for a given speaker. If this value is small, all vowels are very close to one another; if it is large, they are sparsely distributed
We can now use these values to inspect the distribution of speakers' distances. The resulting plot has the mean euclidean distance value on the x axis and the kernel density estimate on the y axis (essentially a smoothed histogram); the solid vertical line represents the mean, and the dashed vertical lines represent +/- 2 standard deviations from the mean
Remove the speakers who are more than 2 standard deviations below the mean euclidean distance; this filters out speakers with particularly overlapping vowel productions. We keep those more than 2 standard deviations above the mean, as these are not indicative of measurement issues. The vowel spaces of the removed speakers (with the mean euclidean distance and speaker name given in the facet label) are plotted for reference
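As a minimal sketch (with made-up vowel means, not from the corpus), the distance computation works as follows:
#illustrative sketch with made-up vowel means: the mean pairwise euclidean distance
#summarises how spread out a speaker's vowel space is
toy_means <- data.frame(Vowel = c("FLEECE", "START", "STRUT"),
                        F1_mean = c(400, 750, 700),
                        F2_mean = c(2200, 1200, 1300))
toy_dist <- as.matrix(dist(toy_means[, c("F1_mean", "F2_mean")]))
colMeans(toy_dist) #average distance of each vowel from the others (diagonal zeros included, as in the code below)
mean(colMeans(toy_dist)) #overall mean distance for this (hypothetical) speaker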
#calculate speaker means and sd
speaker_means <- vowels_all %>% #use the vowels_all data
group_by(Speaker, Speaker_anon, Gender, Vowel) %>% #group based on per speaker, per vowel
summarise(n = n(),
F1_mean = mean(F1_50),
F2_mean = mean(F2_50),
F1_sd = sd(F1_50),
F2_sd = sd(F2_50)) #get the mean F1/F2
## group_by: 4 grouping variables (Speaker, Speaker_anon, Gender, Vowel)
## summarise: now 5,920 rows and 9 columns, 3 group variables remaining (Speaker, Speaker_anon, Gender)
#calculate euclidean distances between vowel means
speaker_distances <- speaker_means %>%
mutate(Dist = #calculate the euclidean distance matrix between the vowel means for each speaker
colMeans(as.matrix(dist(cbind(F1_mean, F2_mean))))) %>%
ungroup() %>% #ungroup the speakers and vowels
group_by(Speaker_anon) %>% #group by just speaker
summarise(mean_dist = mean(Dist), #calculate the mean distance across all of a speaker's vowels
sd_dist = sd(Dist)) %>% #calculate the sd too
mutate(Speaker_dist = paste(round(mean_dist, 2), Speaker_anon)) #create a new variable with the mean euclidean distance and the speaker name for plotting
## mutate (grouped): new variable 'Dist' with 5,920 unique values and 0% NA
## ungroup: no grouping variables
## group_by: one grouping variable (Speaker_anon)
## summarise: now 592 rows and 3 columns, ungrouped
## mutate: new variable 'Speaker_dist' with 592 unique values and 0% NA
#plot the distribution
ggplot(speaker_distances, aes(x = mean_dist)) + #plot the distribution
geom_density() +
geom_vline(xintercept = mean(speaker_distances$mean_dist), linetype = 1) +
geom_vline(xintercept = mean(speaker_distances$mean_dist) + 2*sd(speaker_distances$mean_dist), linetype = 2) +
geom_vline(xintercept = mean(speaker_distances$mean_dist) - 2*sd(speaker_distances$mean_dist), linetype = 2, colour = "red") +
theme_bw()
#filter speakers who are -2 SDs from the mean euclidean distance
outlier_speakers <- speaker_means %>%
inner_join(., speaker_distances) %>%
filter(mean_dist < mean(speaker_distances$mean_dist) - 2*sd(speaker_distances$mean_dist))
## Joining, by = "Speaker_anon"
## inner_join: added 3 columns (mean_dist, sd_dist, Speaker_dist)
## > rows only in x ( 0)
## > rows only in y ( 0)
## > matched rows 5,920
## > =======
## > rows total 5,920
## filter (grouped): removed 5,780 rows (98%), 140 rows remaining
#plot the outlier speakers vowel spaces
outlier_speakers_plot <- vowels_all %>%
filter(Speaker_anon %in% outlier_speakers$Speaker_anon) %>%
inner_join(., outlier_speakers) %>%
arrange(mean_dist) %>%
ggplot(aes(x = F2_50, y = F1_50, colour = Vowel)) +
geom_point(size = 0.02, alpha = 0.5) +
stat_ellipse(level = 0.67) +
geom_text(data = outlier_speakers, aes(x = F2_mean, y = F1_mean, label = Vowel), size = 2) +
scale_x_reverse() +
scale_y_reverse() +
theme_bw() +
theme(legend.position = "none")
## filter: removed 574,234 rows (98%), 11,844 rows remaining
## Joining, by = c("Speaker", "Speaker_anon", "Gender", "Vowel")
## inner_join: added 8 columns (n, F1_mean, F2_mean, F1_sd, F2_sd, …)
## > rows only in x ( 0)
## > rows only in y ( 0)
## > matched rows 11,844
## > ========
## > rows total 11,844
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
#store the outlier speaker data
outlier_speakers_data <- vowels_all %>%
filter(Speaker_anon %in% outlier_speakers$Speaker_anon)
## filter: removed 574,234 rows (98%), 11,844 rows remaining
#filter out the speakers from the dataset
vowels_all <- vowels_all %>%
filter(!Speaker_anon %in% outlier_speakers$Speaker_anon)
## filter: removed 11,844 rows (2%), 574,234 rows remaining
Further inspection of the data reveals that there are considerably more speakers whose vowel spaces suggest issues with the raw data. Thus, individual vowel plots and density distributions for F1/F2 were generated for each of the speakers in the dataset. These can be found in the Speaker_vowel_plots folder of the repository for all speakers in the data (referring to the data at this stage of the filtering process).
Based on this inspection, we identified 76 further speakers whose vowel spaces appeared to contain substantially low quality F1/F2 data. These are plotted below for inspection.
Additional exploration of these speakers suggests that there is large variation in their F2 measurements, with values having substantially more spread than for those speakers not identified as being low quality. This indicates that the outlier filtering of +/- 2.5 standard deviations applied to the raw data may not be sufficient to remove unreliable measurements. Alternatively, it could be explained by the visual nature of the inspection process, with the inspection being biased towards identifying speakers with more variable F2 across certain vowels.
Whilst manually inspecting the vowel spaces and distributions may not be an optimal solution for identifying speakers with low quality data, it does highlight underlying issues with automatic extraction of formant values. Furthermore, upon listening to the recordings of some of the low quality speakers, some of these issues can be explained simply by recording quality. Given that our dataset is substantially large (for sociophonetic research), we found that a compromise between automatic extraction and manual inspection resulted in a large enough dataset that was of sufficiently high quality for our analyses.
#list of speakers with low quality vowel plots
speakers_remove <- c(
"CC_f_004", "CC_f_005", "CC_f_036", "CC_f_045", "CC_f_061", "CC_f_073", "CC_f_106", "CC_f_155", "CC_f_163", "CC_f_222", "CC_f_252", "CC_f_267", "CC_f_337", "CC_f_376", "CC_f_392", "CC_f_418", "CC_f_421", "CC_f_430", "CC_f_494", "CC_f_500", "CC_f_533", "CC_f_584", "CC_m_081", "CC_m_189", "CC_m_257", "CC_m_281", "CC_m_321", "CC_m_327", "CC_m_422", "CC_m_485", "CC_m_521", "CC_m_568", "CC_m_633", "Darfield_f_266", "Darfield_f_306", "Darfield_f_384", "Darfield_f_395", "Darfield_f_472", "Darfield_f_503", "Darfield_f_510", "Darfield_f_614", "Darfield_m_320", "Darfield_m_541", "Darfield_m_625", "IA_f_015", "IA_f_018", "IA_f_124", "IA_f_239", "IA_f_254", "IA_f_264", "IA_f_284", "IA_f_298", "IA_f_314", "IA_f_334", "IA_f_335", "IA_f_352", "IA_f_358", "IA_f_372", "IA_f_412", "IA_f_586", "IA_m_192", "IA_m_213", "IA_m_305", "IA_m_416", "IA_m_506", "IA_m_589", "MU_f_014", "MU_f_023", "MU_f_141", "MU_f_153", "MU_f_195", "MU_f_415", "MU_f_519", "MU_f_603", "MU_m_158", "MU_m_355"
)
#plot the low quality speakers' vowel spaces
low_quality_speakers_plot <- vowels_all %>%
filter(Speaker_anon %in% speakers_remove) %>%
inner_join(., speaker_means) %>%
ggplot(aes(x = F2_50, y = F1_50, colour = Vowel)) +
geom_point(size = 0.02, alpha = 0.5) +
stat_ellipse(level = 0.67) +
geom_text(aes(x = F2_mean, y = F1_mean, label = Vowel), size = 2) +
scale_x_reverse() +
scale_y_reverse() +
theme_bw() +
theme(legend.position = "none")
## filter: removed 489,344 rows (85%), 84,890 rows remaining
## Joining, by = c("Speaker", "Speaker_anon", "Gender", "Vowel")
## inner_join: added 5 columns (n, F1_mean, F2_mean, F1_sd, F2_sd)
## > rows only in x ( 0)
## > rows only in y ( 5,160)
## > matched rows 84,890
## > ========
## > rows total 84,890
#plot distributions of F1 and F2 comparing low quality speakers to those not identified as being low quality
#F1
speaker_means %>%
mutate(removed = Speaker_anon %in% speakers_remove) %>%
ggplot(aes(x = F1_sd, colour = removed, fill = removed, linetype = Gender)) +
geom_density(alpha = 0.2) +
facet_wrap(~Vowel, nrow = 2) +
ggtitle("F1 sd distirbutions") +
theme_bw() +
theme(legend.position = 'top',
legend.direction = "horizontal",
legend.box = "horizontal") +
guides(color = guide_legend(title.position = "top",
# hjust = 0.5 centres the title horizontally
title.hjust = 0.5),
linetype = guide_legend(title.position = "top",
# hjust = 0.5 centres the title horizontally
title.hjust = 0.5))
## mutate (grouped): new variable 'removed' with 2 unique values and 0% NA
#F2
speaker_means %>%
mutate(removed = Speaker_anon %in% speakers_remove) %>%
ggplot(aes(x = F2_sd, colour = removed, fill = removed, linetype = Gender)) +
geom_density(alpha = 0.2) +
facet_wrap(~Vowel, nrow = 2) +
ggtitle("F2 sd distributions") +
theme_bw() +
theme(legend.position = 'top',
legend.direction = "horizontal",
legend.box = "horizontal") +
guides(color = guide_legend(title.position = "top",
# hjust = 0.5 centres the title horizontally
title.hjust = 0.5),
linetype = guide_legend(title.position = "top",
# hjust = 0.5 centres the title horizontally
title.hjust = 0.5))
## mutate (grouped): new variable 'removed' with 2 unique values and 0% NA
#store the low quality speakers data
low_quality_speakers_data <- vowels_all %>%
filter(Speaker_anon %in% speakers_remove)
## filter: removed 489,344 rows (85%), 84,890 rows remaining
#filter out the low quality speakers from the dataset
vowels_all <- vowels_all %>%
filter(!Speaker_anon %in% speakers_remove)
## filter: removed 84,890 rows (15%), 489,344 rows remaining
In order to ensure that speakers in the dataset have sufficient numbers of tokens for each of the vowels, we will next filter out any speaker with < 5 tokens for any of the vowels. The code below will calculate the number of tokens per speaker and per vowel, then remove speakers with < 5 tokens for any vowel from the dataset (removing all of their other tokens in the process).
#count number of vowels per speaker and store list of speakers with < 5 tokens in any vowel
low_n_speakers <- vowels_all %>%
group_by(Speaker, Vowel) %>%
count() %>%
ungroup() %>%
filter(n < 5) %>%
select(Speaker)
## group_by: 2 grouping variables (Speaker, Vowel)
## count: now 5,020 rows and 3 columns, 2 group variables remaining (Speaker, Vowel)
## ungroup: no grouping variables
## filter: removed 5,011 rows (>99%), 9 rows remaining
## select: dropped 2 variables (Vowel, n)
#filter out the speakers with < 5 tokens
vowels_all <- vowels_all %>%
ungroup() %>%
filter(!Speaker %in% low_n_speakers$Speaker)
## ungroup: no grouping variables
## filter: removed 1,997 rows (<1%), 487,347 rows remaining
We also wanted to filter out any tokens that have an l or r as the segment following the vowel being analysed. Therefore, we queried LaBB-CAT for this information, which is stored in the ONZE_following_segments.rds file. We will load in these data and add the relevant information to our current dataset, then remove any tokens that have a following l or r.
#load in the following segment data
following_segment_data <- readRDS("Filtering_data/ONZE_following_segments.rds") %>%
rename(following_segment = Token.1..segments) %>%
mutate(following_segment = as.character(following_segment)) %>%
select(Speaker:following_segment, -Error, -Target.stress)
## rename: renamed one variable (following_segment)
## mutate: converted 'following_segment' from factor to character (0 new NA)
## select: dropped 5 variables (TokenNum, Error, Target.stress, Token.1..segments.start, Token.1..segments.end)
#add the following segment information to the dataset then filter out tokens with an `l` or `r` following the vowel
vowels_all <- vowels_all %>%
cbind(., following_segment_data[,c("following_segment")]) %>%
rename(following_segment = `following_segment_data[, c(\"following_segment\")]`) %>%
filter(!following_segment %in% c("l", "r"))
## rename: renamed one variable (following_segment)
## filter: removed 58,296 rows (12%), 429,051 rows remaining
In order to ensure that the speakers at the very ends of the year of birth distribution do not inadvertently affect the statistical modelling steps conducted in the main analysis (i.e. the GAMMs), we decided to remove speakers at the edges of the distribution. By removing these speakers, the data are better suited for the analysis, as there are fewer observations representing specific time points with low speaker counts.
Note, the decision to remove these speakers was made after running through an earlier version of the final analyses. Removing these speakers did not affect the pattern of results reported in the paper; we chose to implement this filtering step as it is most appropriate for the statistical modelling procedure used in the main analysis.
This removes 13 speakers in total, 3 born before 1863 and 10 born after 1983.
The distribution of year of births is given in the histogram below, with the red dashed lines indicating the position where we filtered the speakers.
#histogram of the data by year of birth and gender
vowels_all %>%
select(Speaker, Gender, participant_year_of_birth) %>%
distinct() %>%
ggplot(aes(x = participant_year_of_birth, fill = Gender, colour = Gender)) +
geom_histogram(position = "identity",
binwidth=1,
alpha = 0.8, colour = NA) +
geom_rug(alpha = 0.2) +
scale_x_continuous(breaks = seq(1860, 1990, 15)) +
scale_fill_manual(values = c("#f1a340", "#998ec3")) +
scale_color_manual(values = c("#f1a340", "#998ec3")) +
geom_vline(xintercept = 1863.5, colour = "red", linetype = 2) +
geom_vline(xintercept = 1983.5, colour = "red", linetype = 2) +
geom_label(data = vowels_all %>% select(Speaker, Gender, participant_year_of_birth) %>% distinct() %>% group_by(Gender) %>% summarise(n = n()), aes(x = 1865, y = 20, label = paste0("Before filtering:\nN female = ", n[1], "\nN male = ", n[2], "\nN total = ", sum(n))), hjust=0, inherit.aes = FALSE) +
geom_label(data = vowels_all %>% filter(participant_year_of_birth > 1863 & participant_year_of_birth < 1983) %>% select(Speaker, Gender, participant_year_of_birth) %>% distinct() %>% group_by(Gender) %>% summarise(n = n()), aes(x = 1892, y = 20, label = paste0("After filtering:\nN female = ", n[1], "\nN male = ", n[2], "\nN total = ", sum(n))), hjust=0, inherit.aes = FALSE) +
theme_bw() +
theme(legend.position = "top")
## select: dropped 30 variables (TokenNum, Speaker_anon, Transcript, Transcript_anon, Corpus, …)
## distinct: removed 428,557 rows (>99%), 494 rows remaining
## select: dropped 30 variables (TokenNum, Speaker_anon, Transcript, Transcript_anon, Corpus, …)
## distinct: removed 428,557 rows (>99%), 494 rows remaining
## group_by: one grouping variable (Gender)
## summarise: now 2 rows and 2 columns, ungrouped
## filter: removed 14,372 rows (3%), 414,679 rows remaining
## select: dropped 30 variables (TokenNum, Speaker_anon, Transcript, Transcript_anon, Corpus, …)
## distinct: removed 414,198 rows (>99%), 481 rows remaining
## group_by: one grouping variable (Gender)
## summarise: now 2 rows and 2 columns, ungrouped
#filter the speakers born at the tails
vowels_all <- vowels_all %>%
filter(participant_year_of_birth > 1863 & participant_year_of_birth < 1983)
## filter: removed 14,372 rows (3%), 414,679 rows remaining
Purpose: export the filtered data for the analysis script
Now that the data filtering steps have been carried out, the final dataset is ready to be saved and used in the analysis file. We will store this file in the Data folder of the repository.
We will keep only a subset of the variables as these are the ones used for the main analysis.
vowels_all <- vowels_all %>%
arrange(Speaker) %>%
mutate_if(is.character, as.factor) %>%
select(Speaker_anon,
Transcript_anon,
Corpus,
Gender,
participant_year_of_birth,
Word_anon,
Vowel,
F1_50,
F2_50,
Speech_rate) %>%
rename(Speaker = Speaker_anon,
Transcript = Transcript_anon,
Word = Word_anon)
## mutate_if: converted 'Speaker' from character to factor (0 new NA)
## converted 'Speaker_anon' from character to factor (0 new NA)
## converted 'Transcript' from character to factor (0 new NA)
## converted 'Transcript_anon' from character to factor (0 new NA)
## converted 'Corpus' from character to factor (0 new NA)
## converted 'Gender' from character to factor (0 new NA)
## converted 'MatchId' from character to factor (0 new NA)
## converted 'TargetId' from character to factor (0 new NA)
## converted 'URL' from character to factor (0 new NA)
## converted 'PrevWord' from character to factor (0 new NA)
## converted 'Text' from character to factor (0 new NA)
## converted 'FollWord' from character to factor (0 new NA)
## converted 'Word' from character to factor (0 new NA)
## converted 'Word_anon' from character to factor (0 new NA)
## converted 'WordSegments' from character to factor (0 new NA)
## converted 'Vowel' from character to factor (0 new NA)
## converted 'VowelDISC' from character to factor (0 new NA)
## converted 'Error' from character to factor (0 new NA)
## select: dropped 23 variables (TokenNum, Speaker, Transcript, Line, LineEnd, …)
## rename: renamed 3 variables (Speaker, Transcript, Word)
saveRDS(vowels_all, "Data/ONZE_vowels_filtered_anon.rds")