1 Overview
This file generates the similarity matrices used in the supplementary materials for the manuscript “Exploring the social meaning of the ‘leader-lagger vowels’ in New Zealand English”. The materials can be viewed here.
The script generates similarity matrices for multidimensional scaling (MDS) analysis for (a) the full data set (Section 3) and (b) a subset of the data set in which listeners used social labels to make their speaker groups (Section 4).
2 Load libraries and data
The chunk below loads the libraries used in this script.
# Data wrangling
library(tidyverse)
library(tidyr)
library(data.table)
# Other
library(here) # localised file paths
library(gt) # tables
The next chunk loads:
- The anonymised results from the online free classification task
- A data frame that contains all possible pairwise combinations of the 38 stimuli, with IDs for each pair (ordered and unordered). This data frame is used to generate the similarity matrices.
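For orientation, a table of pairwise combinations with ordered and unordered IDs could be built along the following lines. This is only a sketch in base R: the stimulus codes and the ID format (codes joined by an underscore) are assumptions for illustration, and the actual contents of FC_combinations_anon_250228.rds may differ.

```r
# Hypothetical stimulus codes (the real anonymised codes are not shown here)
stimuli <- sprintf("S%02d", 1:38)

# All ordered pairs of distinct stimuli
combos <- expand.grid(Stimuli1ID = stimuli,
                      Stimuli2ID = stimuli,
                      stringsAsFactors = FALSE)
combos <- combos[combos$Stimuli1ID != combos$Stimuli2ID, ]

# Ordered ID keeps the direction of the pair (assumed format)
combos$pair_id_ordered <- paste(combos$Stimuli1ID, combos$Stimuli2ID, sep = "_")

# Unordered ID sorts the two codes first, so A_B and B_A share one ID
combos$pair_id_unordered <- paste(pmin(combos$Stimuli1ID, combos$Stimuli2ID),
                                  pmax(combos$Stimuli1ID, combos$Stimuli2ID),
                                  sep = "_")

nrow(combos)                              # ordered pairs: 38 * 37
length(unique(combos$pair_id_unordered))  # unordered pairs: choose(38, 2)
```

With 38 stimuli there are 38 × 37 = 1406 ordered pairs and 703 unordered pairs, so each unordered ID appears twice in such a table.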
date <- '250228'
# Load filtered and cleaned data set
df <-
  read_rds(here('Data',
                'FC_filtered_cleaned_anon_250228.rds'))
# Load dataframe containing all possible pairwise combinations
# of the 38 stimuli, with IDs for each pair
# Use to create similarity matrix
combinations_anon <-
  read_rds(here("Data", "FC_combinations_anon_250228.rds"))
Table 1 describes the columns in the data frame. “Reclassified” refers to the tidied free-text responses.
Variables before matrix generation | |
Name | Description |
---|---|
workerId | Anonymised participant (listener) code |
SpeakerID | Anonymised speaker code |
trial_index | Task iteration (1:3, representing each time listeners made groups) |
ratings | Number of the group each speaker was placed into by the listener (e.g., 1 = first group made, 2 = second group made) |
label_category1 | Initial, least broad label category used to describe the group (e.g., 'young', 'old') |
label_category2 | Grouping for initial label categories (e.g., 'Age') |
label_category3 | Most broad label category ('social' versus 'speech' labels versus neither) |
f_gender | Listener gender (reclassified) |
f_ethnicity | Listener ethnicity (reclassified) |
f_ethnicity_maori | Listener ethnicity (binary distinction between Maori versus non-Maori) |
f_growup | Where listener grew up (reclassified) |
f_living | Where listener currently lives (reclassified) |
f_occupation | Listener occupation (reclassified) |
3 Full data set matrix
3.1 Filter data frames
The next blocks:
- Select relevant columns
- Exclude participants who:
  - Used labels unrelated to speech or social characteristics
  - Made more than 11 groups within the same broad category for a single iteration of the task (i.e., effectively did not make groups)
- Save corresponding data frames
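The group-count criterion can be illustrated on toy data. The sketch below uses base R rather than the dplyr pipeline in this script; the data frame `toy` and the threshold `max_groups` are hypothetical (the script itself uses a threshold of 11).

```r
# Toy data (hypothetical): two listeners, one trial, four speakers each
toy <- data.frame(
  workerId        = c(rep("w1", 4), rep("w2", 4)),
  trial_index     = 1,
  label_category3 = "social",
  ratings         = c(1, 1, 2, 2,   # w1 made two groups
                      1, 2, 3, 4),  # w2 put every speaker in its own group
  stringsAsFactors = FALSE
)

# Hypothetical threshold for the toy data; the script uses 11
max_groups <- 3

# Count distinct groups per listener, trial and broad label category
n_groups <- aggregate(ratings ~ workerId + trial_index + label_category3,
                      data = toy,
                      FUN = function(x) length(unique(x)))
names(n_groups)[names(n_groups) == "ratings"] <- "n_groups"

# Listeners exceeding the threshold in any trial/category are flagged
exclude_toy <- unique(n_groups$workerId[n_groups$n_groups > max_groups])
exclude_toy
```

Here only the second listener is flagged, because four groups for four speakers exceeds the toy threshold.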
# Subset relevant columns
fcdf_full <- df %>%
  select(
    SpeakerID,
    ratings,
    workerId,
    trial_index,
    label_category1,
    label_category2,
    label_category3
  )
# Save unfiltered df
# (paste0() has no sep argument, so none is passed)
file_name_labels_unfiltered <-
  paste0('fcdf_full_unfiltered_anon_', date, '.rds')
write_rds(fcdf_full,
          here('Data', file_name_labels_unfiltered))
# Identify participants who made groups based on colours or clip topics
exclude_IDs <- fcdf_full %>%
  group_by(workerId, label_category1) %>%
  summarise(n_labels = n()) %>%
  filter(label_category1 == "Colours" & n_labels > 0 |
           label_category1 == "ClipTopic" & n_labels > 0) %>%
  # Extract the participant IDs, keeping each ID once
  pull(workerId) %>%
  unique()
# Identify participants who made more than 11 groups within the same broad label category, for a single iteration of the task
exclude_IDs2 <- fcdf_full %>%
  filter(!workerId %in% exclude_IDs) %>%
  group_by(workerId, trial_index, label_category3) %>%
  summarise(n_groups = n_distinct(ratings)) %>%
  filter(n_groups > 11) %>%
  # Extract the participant IDs, keeping each ID once
  pull(workerId) %>%
  unique()
# Filter excluded participants
fcdf_full_filt <- fcdf_full %>%
  filter(!workerId %in% c(exclude_IDs, exclude_IDs2))
# Check participants who have been excluded
fcdf_excluded <- fcdf_full %>%
  filter(workerId %in% c(exclude_IDs, exclude_IDs2))
# Save filtered dataframe
file_name_labels <-
  paste0('fcdf_full_anon_', date, '.rds')
write_rds(fcdf_full_filt,
          here('Data', file_name_labels))
3.2 Calculate proportions
We need to calculate the number of times each pair of speakers has been placed in the same group as a proportion of the number of times the pair could have been placed in the same group (not all participants evaluated all possible combinations).
The code below:
- Calculates all possible combinations of the stimuli in each trial for each participant
- Joins these combinations with the previously generated combination IDs (a pair has the same ID regardless of the order of the stimuli within it)
- Groups the combinations by participant, trial ID (each participant completes three trials) and combination ID (bloc), then selects the top row of each group (each bloc comes up twice because there are two possible stimulus orders per combination ID)
- Joins these combinations with the group ratings from the results twice (once for Stimuli 1 and once for Stimuli 2)
- Indicates whether the rating (grouping) for Stimuli 1 is the same as for Stimuli 2 (i.e., whether they were placed in the same group)
- Calculates the number of times each combination has occurred across all participants, then the proportion of those times that the pair was placed in the same group
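The logic of these steps can be checked by hand on a very small example. The sketch below is written in base R and is not the pipeline used in this script; the toy data are hypothetical, and the column names `blocN`, `totalSameGroupN` and `propSame` are borrowed from the script for readability.

```r
# Toy data (hypothetical): 2 listeners, one trial each, 3 speakers
toy <- data.frame(
  workerId  = rep(c("w1", "w2"), each = 3),
  SpeakerID = rep(c("s1", "s2", "s3"), times = 2),
  ratings   = c(1, 1, 2,   # w1: s1 and s2 together, s3 alone
                1, 2, 2),  # w2: s2 and s3 together, s1 alone
  stringsAsFactors = FALSE
)

# All unordered speaker pairs, one row per pair
pairs <- t(combn(sort(unique(toy$SpeakerID)), 2))

# For each pair: how often it occurred, and how often in the same group
toy_props <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(i) {
  a <- toy[toy$SpeakerID == pairs[i, 1], c("workerId", "ratings")]
  b <- toy[toy$SpeakerID == pairs[i, 2], c("workerId", "ratings")]
  # Keep only listeners who heard both speakers of the pair
  both <- merge(a, b, by = "workerId", suffixes = c("1", "2"))
  data.frame(
    Stimuli1 = pairs[i, 1],
    Stimuli2 = pairs[i, 2],
    blocN = nrow(both),
    totalSameGroupN = sum(both$ratings1 == both$ratings2),
    stringsAsFactors = FALSE
  )
}))
toy_props$propSame <- toy_props$totalSameGroupN / toy_props$blocN
toy_props
```

In this toy example, s1 and s2 are grouped together by one of the two listeners (proportion 0.5), s1 and s3 by neither (0), and s2 and s3 by one of the two (0.5).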
crossing() can be used instead of expand(), but the latter is more efficient here (i.e., it respects group_by()).
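The difference between the two functions can be seen on a small example. The sketch below assumes dplyr and tidyr are installed; the tibble `toy` and the object names `within_listener` and `all_pairs` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): two listeners who heard different speakers
toy <- tibble(
  workerId  = c("w1", "w1", "w2", "w2", "w2"),
  SpeakerID = c("s1", "s2", "s2", "s3", "s4")
)

# expand() respects grouping: pairs are built within each listener
within_listener <- toy %>%
  group_by(workerId) %>%
  tidyr::expand(Stimuli1 = SpeakerID, Stimuli2 = SpeakerID) %>%
  ungroup() %>%
  filter(Stimuli1 != Stimuli2)

# crossing() takes plain vectors, so it knows nothing about the grouping
all_pairs <- crossing(Stimuli1 = toy$SpeakerID,
                      Stimuli2 = toy$SpeakerID) %>%
  filter(Stimuli1 != Stimuli2)

nrow(within_listener)  # pairs each listener could actually have made
nrow(all_pairs)        # pairs across the whole speaker set
```

In the toy data, expand() yields only the pairs that each listener actually encountered, whereas crossing() yields every pair in the speaker set, which would then need to be filtered per listener.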
# Subset data frame
fcdf_full_subset <- fcdf_full_filt %>%
  select(SpeakerID, ratings, workerId, trial_index)
fc_proportions <- fcdf_full_filt %>%
  # Calculate possible stimuli combinations per trial and participant
  group_by(trial_index, workerId) %>%
  tidyr::expand(SpeakerID, SpeakerID) %>%
  # Rename new variables
  rename(Stimuli1 = 3, Stimuli2 = 4) %>%
  # Filter out combinations of the same stimuli
  dplyr::filter(Stimuli1 != Stimuli2) %>%
  # Join with previously generated combinations and combination IDs
  right_join(combinations_anon,
             by = c("Stimuli1" = "Stimuli1ID", "Stimuli2" = "Stimuli2ID")) %>%
  group_by(workerId, trial_index, pair_id_unordered) %>%
  # Select one of two stimuli orders for each combination ID
  slice_head() %>%
  # slice(1) %>%
  select(Stimuli1, Stimuli2, pair_id_unordered, pair_id_ordered) %>%
  # Join with grouping ratings for Stimuli 1
  right_join(fcdf_full_subset,
             by = c("Stimuli1" = "SpeakerID", "trial_index", "workerId")) %>%
  rename(Stimuli1Rating = ratings) %>%
  # Join with grouping ratings for Stimuli 2
  right_join(fcdf_full_subset,
             by = c("Stimuli2" = "SpeakerID", "trial_index", "workerId")) %>%
  rename(Stimuli2Rating = ratings) %>%
  # Remove NAs
  filter(!is.na(Stimuli1)) %>%
  # Indicate if Stimuli 1 group = Stimuli 2 group
  mutate(
    SameGroup = case_when(Stimuli1Rating == Stimuli2Rating ~ TRUE,
                          TRUE ~ FALSE),
    SameGroupN = case_when(Stimuli1Rating == Stimuli2Rating ~ 1,
                           TRUE ~ 0)
  ) %>%
  # For each combination ID, calculate the number of times it occurs across all participants, and the proportion of these times the pair has been placed in the same group
  group_by(pair_id_unordered) %>%
  mutate(
    blocN = n(),
    totalSameGroupN = sum(SameGroupN),
    propSame = totalSameGroupN / blocN
  ) %>%
  ungroup()
3.3 Create matrix
The data needs to be in a square matrix format, with the stimuli as both rows and columns, to work with the smacof package. The chunk below produces that format.
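As a minimal illustration of the target format, the sketch below turns a long table of pair proportions into a symmetric matrix using base R indexing. It is not the reshape2-based approach used in this script, and the values in `toy_props` are hypothetical.

```r
# Hypothetical pair proportions for three stimuli (one row per unordered pair)
toy_props <- data.frame(
  Stimuli1 = c("s1", "s1", "s2"),
  Stimuli2 = c("s2", "s3", "s3"),
  propSame = c(0.5, 0.0, 0.25),
  stringsAsFactors = FALSE
)

# Empty square matrix with stimuli as row and column names
ids <- sort(unique(c(toy_props$Stimuli1, toy_props$Stimuli2)))
toy_matrix <- matrix(NA_real_,
                     nrow = length(ids), ncol = length(ids),
                     dimnames = list(ids, ids))

# Fill both triangles so that the matrix is symmetric;
# a two-column character matrix indexes cells by row and column name
toy_matrix[cbind(toy_props$Stimuli1, toy_props$Stimuli2)] <- toy_props$propSame
toy_matrix[cbind(toy_props$Stimuli2, toy_props$Stimuli1)] <- toy_props$propSame

toy_matrix
```

The diagonal stays NA in this sketch because a stimulus is never paired with itself.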
prop_top_half <- fc_proportions %>%
  select(Stimuli1, Stimuli2, propSame) %>%
  distinct()
prop_bottom_half <- fc_proportions %>%
  select(Stimuli1, Stimuli2, propSame) %>%
  distinct() %>%
  rename(Stimuli2 = Stimuli1, Stimuli1 = Stimuli2)
prop_combined <- rbind(prop_top_half, prop_bottom_half)
# Convert to wide df
sim_matrix <-
  reshape2::dcast(prop_combined, Stimuli1 ~ Stimuli2, value.var = 'propSame')
# Convert to matrix
sim_matrix <-
  sim_matrix %>%
  remove_rownames() %>%
  column_to_rownames(var = "Stimuli1") %>%
  as.matrix()
We can now save the output matrix.
file_name_full_matrix_anon <-
  paste0('similarity_matrix_full_anon_', date, '.rds')
write_rds(sim_matrix, here('Data', file_name_full_matrix_anon))