Generating similarity matrices for supplementary materials

Authors
Affiliations
Elena Sheard

New Zealand Institute of Language, Brain and Behaviour, University of Canterbury

Jen Hay

New Zealand Institute of Language, Brain and Behaviour, University of Canterbury

Department of Linguistics, University of Canterbury

Joshua Wilson Black

New Zealand Institute of Language, Brain and Behaviour, University of Canterbury

Robert Fromont

New Zealand Institute of Language, Brain and Behaviour, University of Canterbury

Lynn Clark

New Zealand Institute of Language, Brain and Behaviour, University of Canterbury

Department of Linguistics, University of Canterbury

Published

June 30, 2025

1 Overview

This file generates the similarity matrices used in the supplementary materials for the manuscript “Exploring the social meaning of the ‘leader-lagger vowels’ in New Zealand English”. The materials can be viewed here.

The script generates similarity matrices for multidimensional scaling (MDS) analysis of (a) the full data set (Section 3) and (b) the subset of the data in which listeners used social labels for their speaker groups (Section 4).

2 Load libraries and data

The chunk below loads the libraries used in this script.

# Data wrangling
library(tidyverse)
library(tidyr)
library(data.table)

# Other
library(here) # localised file paths
library(gt) # tables
The next chunk loads:

  • The anonymised results from the online free classification task

  • A data frame that contains all possible pairwise combinations of the 38 stimuli, with IDs for each pair (ordered and unordered). This data frame is used to generate the similarity matrices.

date <- '250228'

# Load filtered and cleaned data set 
df <-
  read_rds(here('Data',
                'FC_filtered_cleaned_anon_250228.rds'))

# Load data frame containing all possible pairwise combinations
# of the 38 stimuli, with IDs for each pair.
# Used to create the similarity matrices.
combinations_anon <-
  read_rds(here("Data", "FC_combinations_anon_250228.rds"))
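
The combinations file is loaded ready-made, so its construction is not shown in this script. As a rough illustration only, a table with the same columns (Stimuli1ID, Stimuli2ID, pair_id_ordered, pair_id_unordered) could be built as follows. The stimulus codes, the ID format and the object name combinations_toy are assumptions made for this sketch and may not match the real file.

```r
library(dplyr)
library(tidyr)

# Hypothetical stimulus codes; the real file uses anonymised speaker codes
stimuli <- paste0("S", 1:4)

combinations_toy <- crossing(Stimuli1ID = stimuli, Stimuli2ID = stimuli) %>%
  # Drop pairs of a stimulus with itself
  filter(Stimuli1ID != Stimuli2ID) %>%
  mutate(
    # Ordered ID: S1_S2 and S2_S1 are different pairs
    pair_id_ordered = paste(Stimuli1ID, Stimuli2ID, sep = "_"),
    # Unordered ID (assumed format): sort the two members so that
    # S1_S2 and S2_S1 share one ID
    pair_id_unordered = paste(pmin(Stimuli1ID, Stimuli2ID),
                              pmax(Stimuli1ID, Stimuli2ID),
                              sep = "_")
  )
```

With four toy stimuli this gives 12 ordered pairs and 6 unordered pairs. By the same arithmetic, 38 stimuli would give 38 × 37 = 1406 ordered pairs and 703 unordered pairs.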

Table 1 describes the columns in the data frame. “Reclassified” refers to the tidied free-text responses.

Variables before matrix generation
Name                Description
workerId            Anonymised participant (listener) code
SpeakerID           Anonymised speaker code
trial_index         Task iteration (1:3, representing each time listeners made groups)
ratings             Number of the group each speaker was placed into by the listener (i.e., 1 = first group made, 2 = second group made)
label_category1     Initial, least broad label category used to describe the group (e.g., 'young', 'old')
label_category2     Grouping for initial label categories (e.g., 'Age')
label_category3     Broadest label category ('social' versus 'speech' labels versus neither)
f_gender            Listener gender (reclassified)
f_ethnicity         Listener ethnicity (reclassified)
f_ethnicity_maori   Listener ethnicity (binary distinction between Maori versus non-Maori)
f_growup            Where listener grew up (reclassified)
f_living            Where listener currently lives (reclassified)
f_occupation        Listener occupation (reclassified)
Table 1: Overview of pre-processed data frame

3 Full data set matrix

3.1 Filter data frames

The next blocks:

  • Select relevant columns

  • Exclude participants who:

    • Used labels unrelated to speech or social characteristics

    • Made more than 11 groups within the same broad category for a single iteration of the task (i.e., did not make groups)

  • Save corresponding data frames

# subset relevant columns
fcdf_full <- df %>%
  select(
    SpeakerID,
    ratings,
    workerId,
    trial_index,
    label_category1,
    label_category2,
    label_category3
  )

# Save unfiltered df
file_name_labels_unfiltered <-
  paste0('fcdf_full_unfiltered_anon_', date, '.rds')

write_rds(fcdf_full,
          here('Data', file_name_labels_unfiltered))
# Identify participants who made groups based on colours or clip topics
exclude_IDs <- fcdf_full %>%
  filter(label_category1 %in% c("Colours", "ClipTopic")) %>%
  pull(workerId) %>%
  unique()

# Identify participants who made more than 11 groups within the same broad
# label category, for a single iteration of the task
exclude_IDs2 <- fcdf_full %>%
  filter(!workerId %in% exclude_IDs) %>%
  group_by(workerId, trial_index, label_category3) %>%
  summarise(n_groups = n_distinct(ratings), .groups = "drop") %>%
  filter(n_groups > 11) %>%
  pull(workerId) %>%
  unique()

# Filter excluded participants
fcdf_full_filt <- fcdf_full %>%
  filter(!workerId %in% c(exclude_IDs, exclude_IDs2))

# Check participants who have been excluded 
fcdf_excluded <- fcdf_full %>%
  filter(workerId %in% c(exclude_IDs, exclude_IDs2))

# Save filtered dataframe 
file_name_labels <-
  paste0('fcdf_full_anon_', date, '.rds')

write_rds(fcdf_full_filt,
          here('Data', file_name_labels))
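
To make the group-count exclusion concrete, the sketch below applies the same rule to a small invented data set. The listeners, ratings and the threshold of 3 (max_groups) are hypothetical and chosen only so that the toy data trigger the rule; the script itself uses a threshold of 11.

```r
library(dplyr)

# Toy data: two listeners, one trial, four speakers each
toy <- tibble(
  workerId        = c(rep("w1", 4), rep("w2", 4)),
  trial_index     = 1,
  label_category3 = "SocialFactors",
  ratings         = c(1, 1, 2, 2,   # w1 made two groups
                      1, 2, 3, 4)   # w2 put every speaker in its own group
)

# Hypothetical threshold for the toy data (the script uses 11)
max_groups <- 3

too_many <- toy %>%
  group_by(workerId, trial_index, label_category3) %>%
  # Count the distinct groups made per listener, trial and broad category
  summarise(n_groups = n_distinct(ratings), .groups = "drop") %>%
  filter(n_groups > max_groups) %>%
  pull(workerId) %>%
  unique()
```

Here too_many contains only "w2", the listener who effectively did not group the speakers.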

3.2 Calculate proportions

We need to calculate the number of times each pair of speakers was placed in the same group as a proportion of the number of times the pair could have been placed in the same group, because not every participant evaluated every possible combination.

The code below:

  • Calculates all possible combinations of the stimuli in each trial for each participant

  • Joins these combinations with the previously generated combination IDs (a pair has the same ID regardless of the order of its stimuli)

  • Groups the combinations by participant, trial ID (each participant completes three trials) and combination ID (bloc), and selects the top row of each group (each bloc comes up twice because there are two possible stimulus orders per combination ID)

  • Joins these combinations with the group ratings from the results twice (once for Stimuli 1 and once for Stimuli 2)

  • Indicates whether the rating (grouping) for Stimuli 1 is the same as for Stimuli 2 (i.e., whether they were placed in the same group)

  • Calculates the number of times each combination occurs across all participants, then the proportion of those times that the pair was placed in the same group

crossing() could be used instead of expand(), but expand() is more efficient here because it respects group_by() and so only generates combinations within each participant and trial.
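
The difference between the two functions can be seen on a small invented example. The listeners and speakers below are hypothetical, and the named arguments inside expand() are a convenience for this sketch (the script renames the columns by position instead).

```r
library(dplyr)
library(tidyr)

# Toy data: two listeners who each heard a different set of speakers
toy <- tibble(
  workerId  = c("w1", "w1", "w1", "w2", "w2"),
  SpeakerID = c("A", "B", "C", "A", "D")
)

# Grouped expand(): pairs are built separately within each listener
pairs_grouped <- toy %>%
  group_by(workerId) %>%
  expand(Stimuli1 = SpeakerID, Stimuli2 = SpeakerID) %>%
  filter(Stimuli1 != Stimuli2) %>%
  ungroup()

# crossing(): pairs are built from all speakers, regardless of who heard them
pairs_crossed <- crossing(Stimuli1 = toy$SpeakerID,
                          Stimuli2 = toy$SpeakerID) %>%
  filter(Stimuli1 != Stimuli2)
```

The grouped version returns 8 rows (6 for w1 and 2 for w2), while crossing() returns 12 rows, including pairs such as B and D that no single listener heard together.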

# Subset data frame
fcdf_full_subset <- fcdf_full_filt %>%
  select(SpeakerID, ratings, workerId, trial_index)

fc_proportions <- fcdf_full_filt %>%
  # Calculate possible stimuli combinations per trial and participant
  group_by(trial_index, workerId) %>%
  tidyr::expand(SpeakerID, SpeakerID) %>%
  # Rename new variables
  rename(Stimuli1 = 3, Stimuli2 = 4) %>%
  # Filter out combinations of the same stimuli
  dplyr::filter(Stimuli1 != Stimuli2) %>%
  # Join with previously generated combinations and combination IDs
  right_join(combinations_anon,
             by = c("Stimuli1" = "Stimuli1ID", "Stimuli2" = "Stimuli2ID")) %>%
  group_by(workerId, trial_index, pair_id_unordered) %>%
  # select one of two stimuli orders for each Combination ID
  slice_head() %>%
  # slice(1) %>%
  select(Stimuli1, Stimuli2, pair_id_unordered, pair_id_ordered) %>%
  # Join with grouping ratings for Stimuli 1
  right_join(fcdf_full_subset,
             by = c("Stimuli1" = "SpeakerID", "trial_index", "workerId")) %>%
  rename(Stimuli1Rating = ratings) %>%
  # Join with grouping ratings for Stimuli 2
  right_join(fcdf_full_subset,
             by = c("Stimuli2" = "SpeakerID", "trial_index", "workerId")) %>%
  rename(Stimuli2Rating = ratings) %>%
  # Remove NAs
  filter(!is.na(Stimuli1)) %>%
  # Indicate if Stimuli 1 group = Stimuli 2 group
  mutate(
    SameGroup = case_when(Stimuli1Rating == Stimuli2Rating ~ T,
                          TRUE ~ F),
    SameGroupN = case_when(Stimuli1Rating == Stimuli2Rating ~ 1,
                           TRUE ~ 0)
  ) %>%
  # For each combination ID, calculate the number of times it occurs across all participants,
  # and the proportion of these times the pair has been placed in the same group
  group_by(pair_id_unordered) %>%
  mutate(
    blocN = n(),
    totalSameGroupN = sum(SameGroupN),
    propSame = totalSameGroupN / blocN
  ) %>%
  ungroup()
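
As a way of checking the logic on a case small enough to verify by hand, the sketch below computes the same quantities (blocN, totalSameGroupN, propSame) for an invented data set. It uses a self-join rather than the expand-and-join pipeline above, so it is a simplified cross-check under toy assumptions, not a replacement for the script's method. The relationship argument requires dplyr 1.1.0 or later.

```r
library(dplyr)

# Toy data: one trial, two listeners, three speakers
toy <- tibble(
  workerId    = rep(c("w1", "w2"), each = 3),
  trial_index = 1,
  SpeakerID   = rep(c("A", "B", "C"), times = 2),
  ratings     = c(1, 1, 2,   # w1 groups A with B
                  1, 2, 2)   # w2 groups B with C
)

toy_props <- toy %>%
  # Pair every speaker with every speaker from the same listener and trial
  inner_join(toy, by = c("workerId", "trial_index"),
             suffix = c("1", "2"), relationship = "many-to-many") %>%
  # Keep each unordered pair once and drop self-pairs
  filter(SpeakerID1 < SpeakerID2) %>%
  group_by(SpeakerID1, SpeakerID2) %>%
  summarise(
    blocN = n(),                                  # times the pair occurred
    totalSameGroupN = sum(ratings1 == ratings2),  # times it was grouped together
    propSame = totalSameGroupN / blocN,
    .groups = "drop"
  )
```

In this toy case each pair occurs twice: A and B are grouped together by one of the two listeners (propSame = 0.5), A and C by neither (0), and B and C by one (0.5).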

3.3 Create matrix

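To work with the smacof package, the data need to be in a square, symmetric matrix with one row and one column per stimulus. The chunk below produces that format by mirroring each pair so that both halves of the matrix are filled.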

prop_top_half <- fc_proportions %>%
  select(Stimuli1, Stimuli2, propSame) %>%
  distinct()

prop_bottom_half <- fc_proportions %>%
  select(Stimuli1, Stimuli2, propSame) %>%
  distinct() %>%
  rename(Stimuli2 = Stimuli1, Stimuli1 = Stimuli2)

prop_combined <- rbind(prop_top_half, prop_bottom_half)

# Convert to wide df
sim_matrix <-
  reshape2::dcast(prop_combined, Stimuli1 ~ Stimuli2, value.var = 'propSame')

# Convert to matrix
sim_matrix <-
  sim_matrix %>% remove_rownames %>% column_to_rownames(var = "Stimuli1") %>%
  as.matrix()

We can now save the output matrix.

file_name_full_matrix_anon <-
  paste0('similarity_matrix_full_anon_', date, '.rds')

write_rds(sim_matrix, here('Data', file_name_full_matrix_anon))
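
Before a matrix like this is used for MDS, it can be worth checking its shape and converting similarities to dissimilarities. The sketch below does this on an invented 3 × 3 matrix. It assumes that the diagonal is NA (self-pairs are filtered out before the matrix is built) and that dissimilarity is taken as 1 minus similarity; the conversion actually used in the analysis is not specified in this script.

```r
# Toy similarity matrix with the same layout as sim_matrix:
# stimuli as row and column names, NA on the diagonal
toy_sim <- matrix(
  c(NA,  0.5, 0.0,
    0.5, NA,  0.5,
    0.0, 0.5, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("A", "B", "C"))
)

# Check that the matrix is square, symmetric and bounded between 0 and 1
stopifnot(
  nrow(toy_sim) == ncol(toy_sim),
  isSymmetric(toy_sim),
  all(toy_sim >= 0 & toy_sim <= 1, na.rm = TRUE)
)

# A stimulus is always in the same group as itself
diag(toy_sim) <- 1

# Assumed conversion: dissimilarity = 1 - similarity
toy_diss <- 1 - toy_sim
```

The resulting toy_diss has zeros on the diagonal and larger values for pairs that were rarely grouped together, which is the general form of input that MDS functions take.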

4 Social subset matrix

This section applies the same process as Section 3 but to the social subset of the data. Participants excluded in Section 3.1 remain excluded here.

4.1 Subset data frames

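The chunk below keeps only the groups that listeners described with social labels. Groups labelled 'Personality' (in label_category2) or 'AccentTypes' (in label_category3) are first recoded as 'SocialFactors'.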
fcdf_social <- fcdf_full_filt %>%
  mutate(
    label_category3 = case_when(
      label_category2 == c("Personality") ~ "SocialFactors",
      T ~ label_category3
    ),
    label_category3 = case_when(
      label_category3 == c("AccentTypes") ~ "SocialFactors",
      T ~ label_category3
    )
  ) %>%
  filter(label_category3 %in% c("SocialFactors"))
file_name_labels <-
  paste0('fcdf_social_anon_', date, '.rds')

write_rds(fcdf_social,
          here('Data', file_name_labels))

4.2 Calculate proportions (social subset)

The code below applies the steps described in Section 3.2 to the social subset.

fcdf_social_subset <- fcdf_social %>%
  select(SpeakerID, ratings, workerId, trial_index)

fc_social_proportions <- fcdf_social %>%
  # Calculate possible stimuli combinations per trial and participant
  group_by(trial_index, workerId) %>%
  tidyr::expand(SpeakerID, SpeakerID) %>%
  # Rename new variables
  rename(Stimuli1 = 3, Stimuli2 = 4) %>%
  # Filter out combinations of the same stimuli
  dplyr::filter(Stimuli1 != Stimuli2) %>%
  # Join with previously generated combinations and combination IDs
  right_join(combinations_anon,
             by = c("Stimuli1" = "Stimuli1ID", "Stimuli2" = "Stimuli2ID")) %>%
  group_by(workerId, trial_index, pair_id_unordered) %>%
  # select one of two stimuli orders for each Combination ID
  slice_head() %>%
  # slice(1) %>%
  select(Stimuli1, Stimuli2, pair_id_unordered, pair_id_ordered) %>%
  # Join with grouping ratings for Stimuli 1 (social subset)
  right_join(fcdf_social_subset,
             by = c("Stimuli1" = "SpeakerID", "trial_index", "workerId")) %>%
  rename(Stimuli1Rating = ratings) %>%
  # Join with grouping ratings for Stimuli 2 (social subset)
  right_join(fcdf_social_subset,
             by = c("Stimuli2" = "SpeakerID", "trial_index", "workerId")) %>%
  rename(Stimuli2Rating = ratings) %>%
  # Remove NAs
  filter(!is.na(Stimuli1)) %>%
  # Indicate if Stimuli 1 group = Stimuli 2 group
  mutate(
    SameGroup = case_when(Stimuli1Rating == Stimuli2Rating ~ T,
                          TRUE ~ F),
    SameGroupN = case_when(Stimuli1Rating == Stimuli2Rating ~ 1,
                           TRUE ~ 0)
  ) %>%
  # For each combination ID, calculate the number of times it occurs across all participants,
  # and the proportion of these times the pair has been placed in the same group
  group_by(pair_id_unordered) %>%
  mutate(
    blocN = n(),
    totalSameGroupN = sum(SameGroupN),
    propSame = totalSameGroupN / blocN
  ) %>%
  ungroup()

4.3 Create matrix (social subset)

Create similarity matrix and save output.

prop_top_half <- fc_social_proportions %>%
  select(Stimuli1, Stimuli2, propSame) %>%
  distinct()

prop_bottom_half <- fc_social_proportions %>%
  select(Stimuli1, Stimuli2, propSame) %>%
  distinct() %>%
  rename(Stimuli2 = Stimuli1, Stimuli1 = Stimuli2)

prop_combined <- rbind(prop_top_half, prop_bottom_half)

# Convert to wide df
sim_matrix <-
  reshape2::dcast(prop_combined, Stimuli1 ~ Stimuli2, value.var = 'propSame')

# Convert to matrix
sim_matrix <-
  sim_matrix %>% remove_rownames %>% column_to_rownames(var = "Stimuli1") %>%
  as.matrix()
file_name_social_matrix <-
  paste0('similarity_matrix_social_anon_', date, '.rds')

write_rds(sim_matrix, here('Data', file_name_social_matrix))