Note: An html version of this document, which includes the evaluated result of all code chunks, can be found at https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html.

# Introduction

This guide walks readers through the process of training a random-forest classifier for automated coding of sociophonetic variation, using the statistical computing language R (R Core Team, 2018) and the packages ranger (Wright & Ziegler, 2017) and caret (Kuhn, 2018). This guide is intended to be a companion to our 2020 Laboratory Phonology article “From categories to gradience: Auto-coding sociophonetic variation with random forests”, and it recreates the classifier of Southland English non-prevocalic /r/ discussed in the article. By consulting the R Markdown file that generated this document, readers can run the code on their own systems to recreate the /r/ classifier.

IMPORTANT: While you can run all this code, we don’t recommend trying to run it all at once (e.g., by ‘knitting’ the R Markdown document); the code that generated this document was tailored to a machine with 32 processors and still takes several hours to run start-to-finish. As we mention below, training a classifier is not a one-size-fits-all process.

This guide goes through the following steps:

1. Pre-process data

2. Determine classifier performance measure

3. Train an initial classifier

4. Tune hyperparameters

5. Determine effect of outliers

6. Train final classifier

7. Auto-code new tokens and extract classifier probabilities for hand-coded tokens

Perhaps the most important thing to know is that training a classifier is not a one-size-fits-all process. Your decisions about how to go about training a classifier will change depending on your data, your variables, your research questions, your standards for what counts as adequate performance, etc. This guide describes the process for one particular set of data, a particular variable, particular research questions, etc., and shouldn’t be taken as an exact template for every possible case.

# Implementation notes

This guide is intended to be accessible to intermediate to advanced R users. Some of the coding techniques used in this guide are fairly advanced, but you won’t need to understand all the code to use this guide or to train your own classifiers.

This guide also includes text that’s visible only on the R Markdown version of the document (to avoid cluttering the html version) giving more detail about the implementation of the code. Like this paragraph, R Markdown-only text in this document begins with a ‘greater than’ sign.

The steps in this guide can be used by any computer that can run R. That said, progressing through all of the steps listed here can be very time-consuming, especially with large amounts of data. As a result, all of the forests in this guide were generated on a computing cluster running Ubuntu 16.04 with 32 virtual cores and 64 GB of memory, in order to leverage R’s support for parallel computing. (Despite this powerful hardware, the code in this guide takes over 4 hours to run in terms of pure running time.) The code used here for running models in parallel (which uses the package doParallel) should work on all systems, although the user will need to adjust the number of cores to their system (you can find out how many cores are available via the detectCores() function in the parallel package). Note that train() (the function we use to train classifiers in this guide) by default uses a parallel backend when one is available (e.g., to train multiple resamples at a time), so parallelization is used even in instances where the end-user doesn’t explicitly invoke it using doParallel methods.

Since random forests are a stochastic process, results can be slightly different with each run, so this guide uses random seeds to ensure reproducibility. In single forest runs, it suffices to use set.seed() prior to running train(). In situations where doParallel methods are used to run a list of forests differing by a hyperparameter, it’s necessary to use clusterSetRNGStream() after registering the cluster and before calling foreach(); this ensures that each core starts with the same random seed. Finally, in Step 5 when we want to run lists of non-identical forests that don’t differ by any parameters, we use the seeds argument of trainControl().

Finally, our code creates ‘dummy files’, with no extension or contents, in a Model Status subfolder to help us keep track of when the sub-models finish running. This technique is especially useful when you expect some sub-models to take longer than others (e.g., repeated k-fold cross-validation will naturally take longer than non-repeated k-fold cross-validation). Although we’re only saving ‘dummy files’ here, you can also save more useful files; for example, if the overall model is too big to store in memory and the model crashes before all sub-models are complete, you can replace the second ‘dummy file’ with a command to save each sub-model when it’s finished running or a summary of the sub-model.

if (!dir.exists("Model Status")) dir.create("Model Status")

# Step 0: Get hand-coded data and load required packages

To run the code in this guide, you’ll need R and several R packages. In particular, this code relies heavily on the collection of packages called the tidyverse, which make R code faster, more readable, and safer. Though you can run this code without it, we recommend the RStudio integrated development environment for ease of understanding code. Information on the system, R version, and package versions that generated this guide can be found at the end of this guide.

Install necessary packages:

##From https://stackoverflow.com/a/9341833/5432121
pkgTest <- function(x) {
if (!require(x, character.only=TRUE)) {
install.packages(x)
}
}

pkgTest("tidyverse")
pkgTest("magrittr")
pkgTest("DMwR")
pkgTest("diptest")
Loading required package: diptest
pkgTest("parallel")
pkgTest("foreach")
pkgTest("doParallel")
pkgTest("caret")
pkgTest("ranger")

As a prerequisite to the steps we describe in this guide, you’ll also need data that has been hand-coded into two or more variants and for which you have all the acoustic measures you’ll want to enter into the classifier. In the case of our /r/ classifier (for which we provide the data as an example), we had several thousand hand-coded tokens, for which we extracted 180 acoustic measures using the “Process with Praat” function in LaBB-CAT (Fromont & Hay, 2012); these measures were motivated by previous literature on the acoustic correlates of rhoticity in English and other languages (e.g., Lawson, Stuart-Smith, & Scobbie, 2018; Heselwood, 2009). More details on the choice of these measures can be found in the main text.

Load the /r/ data, which includes 40614 rows and 217 columns, the last 180 of which are acoustic measures to be fed into the classifier.

The first 37 columns contain mostly linguistic information, plus LaBB-CAT metadata (TokenNum and MatchId) and speaker data (Speaker, which has been anonymized, and Gender). To preserve anonymity, both Speaker and Word have been anonymized to dummy factors. Stress is coded as either ’ (primary word stress), " (secondary word stress), or 0 (unstressed). Syllable and columns ending in “DISC” are phonemic strings represented in the DISC phonetic alphabet used by CELEX (see https://catalog.ldc.upenn.edu/docs/LDC96L14/eug_let.pdf); FollSegmentNoPause and FollSegment are represented in the ARPAbet codes used by the CMU pronouncing dictionary, without stress markers on vowels (see http://www.speech.cs.cmu.edu/cgi-bin/cmudict); and Vowel is represented in Wells lexical sets as they relate to New Zealand English (see https://www.phon.ucl.ac.uk/home/wells/stanlexsets.htm). HowCoded is used to separate tokens that have already been hand-coded from those to be auto-coded (which therefore have NA for Rpresent). All other columns should be self-explanatory.

trainingData <- readRDS("Data/RClassifierData_03July2019.Rds")
trainingData %>% dim()
[1] 40614   217
trainingData %>% colnames()
  [1] "TokenNum"           "Speaker"            "Gender"
[4] "MatchId"            "Stress"             "FollPause"
[7] "CelexFreq"          "FollPauseDur"       "Syllable"
[10] "SyllStart"          "SyllEnd"            "Word"
[13] "WordStart"          "WordEnd"            "TokenDISC"
[16] "TokenStart"         "TokenEnd"           "TokenGenAmDISC"
[19] "TokenGenAmStart"    "TokenGenAmEnd"      "Rpresent"
[22] "FollSegmentDISC"    "RType"              "PrevRInWord"
[25] "ContFuncWord"       "CorpusFreq"         "Vowel"
[28] "NURSEvsNonNURSE"    "FollSegmentNoPause" "FollSegment"
[31] "SyllFinal"          "WordFinal"          "HowCoded"
[34] "Initial_F1_50"      "Initial_F2_50"      "Initial_F3_50"
[37] "Initial_F4_50"      "TokenDur"           "F1_20"
[40] "diffF3F1_20"        "BW1_20"             "F1_25"
[43] "diffF3F1_25"        "BW1_25"             "F1_30"
[46] "diffF3F1_30"        "BW1_30"             "F1_35"
[49] "diffF3F1_35"        "BW1_35"             "F1_40"
[52] "diffF3F1_40"        "BW1_40"             "F1_45"
[55] "diffF3F1_45"        "BW1_45"             "F1_50"
[58] "diffF3F1_50"        "BW1_50"             "F1_55"
[61] "diffF3F1_55"        "BW1_55"             "F1_60"
[64] "diffF3F1_60"        "BW1_60"             "F1_65"
[67] "diffF3F1_65"        "BW1_65"             "F1_70"
[70] "diffF3F1_70"        "BW1_70"             "F1_75"
[73] "diffF3F1_75"        "BW1_75"             "F1_80"
[76] "diffF3F1_80"        "BW1_80"             "F1min"
[79] "F1max"              "time_F1min"         "time_F1max"
[82] "F1range"            "time_F1range"       "slopeF1"
[85] "F2_20"              "diffF3F2_20"        "BW2_20"
[88] "F2_25"              "diffF3F2_25"        "BW2_25"
[91] "F2_30"              "diffF3F2_30"        "BW2_30"
[94] "F2_35"              "diffF3F2_35"        "BW2_35"
[97] "F2_40"              "diffF3F2_40"        "BW2_40"
[100] "F2_45"              "diffF3F2_45"        "BW2_45"
[103] "F2_50"              "diffF3F2_50"        "BW2_50"
[106] "F2_55"              "diffF3F2_55"        "BW2_55"
[109] "F2_60"              "diffF3F2_60"        "BW2_60"
[112] "F2_65"              "diffF3F2_65"        "BW2_65"
[115] "F2_70"              "diffF3F2_70"        "BW2_70"
[118] "F2_75"              "diffF3F2_75"        "BW2_75"
[121] "F2_80"              "diffF3F2_80"        "BW2_80"
[124] "F2min"              "F2max"              "time_F2min"
[127] "time_F2max"         "F2range"            "time_F2range"
[130] "slopeF2"            "F3_20"              "BW3_20"
[133] "F3_25"              "BW3_25"             "F3_30"
[136] "BW3_30"             "F3_35"              "BW3_35"
[139] "F3_40"              "BW3_40"             "F3_45"
[142] "BW3_45"             "F3_50"              "BW3_50"
[145] "F3_55"              "BW3_55"             "F3_60"
[148] "BW3_60"             "F3_65"              "BW3_65"
[151] "F3_70"              "BW3_70"             "F3_75"
[154] "BW3_75"             "F3_80"              "BW3_80"
[157] "F3min"              "F3max"              "time_F3min"
[160] "time_F3max"         "F3range"            "time_F3range"
[163] "slopeF3"            "intens_F3min"       "intens_F3max"
[166] "F4_20"              "diffF4F3_20"        "BW4_20"
[169] "F4_25"              "diffF4F3_25"        "BW4_25"
[172] "F4_30"              "diffF4F3_30"        "BW4_30"
[175] "F4_35"              "diffF4F3_35"        "BW4_35"
[178] "F4_40"              "diffF4F3_40"        "BW4_40"
[181] "F4_45"              "diffF4F3_45"        "BW4_45"
[184] "F4_50"              "diffF4F3_50"        "BW4_50"
[187] "F4_55"              "diffF4F3_55"        "BW4_55"
[190] "F4_60"              "diffF4F3_60"        "BW4_60"
[193] "F4_65"              "diffF4F3_65"        "BW4_65"
[196] "F4_70"              "diffF4F3_70"        "BW4_70"
[199] "F4_75"              "diffF4F3_75"        "BW4_75"
[202] "F4_80"              "diffF4F3_80"        "BW4_80"
[205] "F4min"              "F4max"              "time_F4min"
[208] "time_F4max"         "F4range"            "time_F4range"
[211] "slopeF4"            "F0min"              "F0max"
[214] "F0rangeST"          "time_F0min"         "time_F0max"
[217] "absSlopeF0"        

# Step 1: Pre-process data: Normalization, missing value imputation, outlier marking

Due to a fair amount of measurement error in automatically extracting formant and pitch measurements, and to account for the degree to which formant frequencies are impacted by vocal tract length, we subjected /r/ measurements to pre-processing before entering data into the classifier. Your data might not need all these steps—we didn’t need to do nearly as much pre-processing for the medial /t/ data—but we’ll walk through them for your reference.

First, we removed tokens that either had bad durations or immediately preceded another /r/. Checking durations is a way to filter out tokens whose forced alignment is bad, since individual /r/ tokens are unlikely to be long (except perhaps in word lists or utterance-finally). We randomly sampled tokens in a variety of duration bands (e.g., [500 ms, 550 ms)) to check for misalignment; this procedure led to a ceiling of 450 ms. A handful of tokens somehow had zero or negative durations, so we removed these tokens, as well.

trainingData <-
trainingData %>%
filter(TokenDur > 0, TokenDur < 0.45, FollSegment!="R")

Second, we normalized F1–F4 measurements (timepoint measurements and maxima/minima) by subtracting the speaker’s mean word-initial /r/ midpoint measurement for that formant from the raw measurement. As with vowels, formant measurements for rhotics are affected by vocal tract length, but whereas normalization methods abound for vowel formant frequencies, there is no generally accepted way to normalize formants in rhotics. Our method of using each speaker’s initial /r/ formants as a baseline measure accounts for vocal tract length, following Hay and Clendon’s (2012) approach of modeling rhoticity by comparing non-prevocalic /r/ F3 against initial /r/ F3.

trainingData <-
trainingData %>%
mutate_at(vars(matches("^F1(_[2-8][50]|min|max)$")), ~ . - Initial_F1_50) %>% mutate_at(vars(matches("^F2(_[2-8][50]|min|max)$")), ~ . - Initial_F2_50) %>%
mutate_at(vars(matches("^F3(_[2-8][50]|min|max)$")), ~ . - Initial_F3_50) %>% mutate_at(vars(matches("^F4(_[2-8][50]|min|max)$")), ~ . - Initial_F4_50)

Third, we centered and scaled token duration by speaker and vowel; in cases where token duration was invariant within a speaker $$\times$$ vowel distribution, we used a value of 0.

trainingData <-
trainingData %>%
group_by(Speaker, Vowel) %>%
mutate(TokenDurSD = sd(TokenDur, na.rm=TRUE) %>% replace_na(0),
TokenDur = if_else(TokenDurSD > 0, scale(TokenDur), 0)) %>%
select(-TokenDurSD) %>%
ungroup()

Fourth, to deal with tokens that had missing pitch and/or F4 measurements (including F4 bandwidth), we imputed missing F0 and F4 measurements using k Nearest Neighbor imputation (with k = 10) via the function knnImputation() in the package DMwR (Torgo, 2010). Since random forests cannot classify observations with missing measurements, the imputation of missing measurements is common practice in machine learning contexts (e.g., Breen, Thomas, Baldwin, & Lipinska, 2019; Hudak, Crookston, Evans, Hall, & Falkowski, 2008). At the same time, this step won’t necessarily apply to every data set (we didn’t need to impute any measurements for the medial /t/ data). Note that this next step can be fairly time-consuming (on the cluster, it takes around 15 minutes).

For time-consuming steps such as imputation, we use caching to save time by avoiding re-running code. If we give the chunk a unique name and set cache=TRUE, R Markdown will save the results of the code chunk to a cache when initially knitting the file, such that on subsequent knits the results will simply be re-loaded. This can save a considerable amount of time, especially if you want to tweak text that’s outside of the cached chunks or plots that use data generated by cached chunks, etc. More info on caching can be found at https://yihui.name/knitr/demo/cache/ and in the knitr manual at https://yihui.name/knitr/demo/manual/.

##Capture current column ordering
origCols <- trainingData %>% colnames()

##Set up dataframe for imputation: Must contain only the measures to be imputed
##  and the measures being used to calculate k nearest neighbors.
imputeDF <-
trainingData %>%
##In this data, MatchId serves as a unique identifier for tokens
select(MatchId, TokenDur:absSlopeF0) %>%
select(-contains("F0"), -contains("F4"), -contains("BW4")) %>%
select_if(~ sum(is.na(.)) < 300) %>%
filter(complete.cases(.)) %>%
left_join(trainingData %>%
select(MatchId, TokenDur:absSlopeF0) %>%
select(MatchId, contains("F0"), contains("F4"), contains("BW4")),
by="MatchId")

##Perform imputation--this takes a while!
imputeDF <-
imputeDF %>%
select(-MatchId) %>%
as.data.frame() %>%
knnImputation() %>%
cbind(imputeDF %>% select(MatchId))

##Add imputed values back into data frame (MatchId is a unique identifier, so
##  we can use it to match tokens between data frames)
trainingData <-
trainingData %>%
left_join(imputeDF %>%
select(MatchId, contains("F0"), contains("F4"), contains("BW4")),
by="MatchId") %>%
select(-ends_with(".x")) %>%
rename_at(vars(ends_with(".y")), str_remove, ".y") %>%
select(!!origCols)

##Clean up
rm(imputeDF)

Fifth, we marked numerous measures for potential outlier-hood: formant timepoint measures, formant minima/maxima, pitch timepoint measures, pitch minima/maxima, intensity measures, and duration. Potential outliers were measurements that were more than 1.5 $$\times$$ the interquartile range (third quartile minus first quartile) outside the first or third quartile for that measure’s distribution by speaker and by vowel. We won’t actually exclude outliers now; instead, we’ll use outlier data in Step 5 below to figure out which outliers we can afford to keep, and which ones we’re better off discarding. As with imputation, outlier-marking isn’t necessary or appropriate for all situations; we give more details in the subsection following this one.

trainingData <-
trainingData %>%
group_by(Speaker, Vowel) %>%
mutate_at(vars(matches("^F[01234](_[2-8][50]|min|max)\$"), starts_with("intens_"), TokenDur),
list(Outlier = ~ . < quantile(., 0.25, na.rm=TRUE) - 1.5*IQR(., na.rm=TRUE) |
. > 1.5*IQR(., na.rm=TRUE) + quantile(., 0.75, na.rm=TRUE))) %>%
ungroup()

Finally, we excluded tokens that had missing measurements for measures other than F0 and F4.

trainingData <-
trainingData %>%
filter_at(vars(TokenDur:absSlopeF0), all_vars(!is.na(.))) %>%
droplevels()

Let’s save our progress by writing the training data to an RDS file.

if (!dir.exists("Data")) dir.create("Data")
trainingData %>% saveRDS("Data/trainingData.Rds")
trainingData <- readRDS("Data/trainingData.Rds")

## Do we even need to mark outliers?

Measurement error is a fact of life when using corpus methods, and of course classifiers will perform better if they have cleaner data. At the same time, not all data is appropriate for outlier-marking. Here are the conditions under which outlier-marking is appropriate:

1. Some measures include measurement error. If your data is spotless, then any outliers are ‘true’ outliers, not just mis-measurements, and your model will be better off for being able to see them.

2. For the measures with some measurement error, the distributions are either unimodal or their bi-/multimodality isn’t due to the classes of interest.

Let’s briefly investigate these conditions. First, do we have evidence of measurement error in our data? Looking at F3 minimum, we appear to have a good distribution, but the extent of it seems questionable (do we really have tokens whose F3 minimum is below 1000 Hz?). Similarly, the intensity at the time of the F3 minimum has some values that seem conspicuously low. When we hand-check these tokens, we find that these measurements are indeed erroneous, meaning there’s evidence for measurement error in our data:

trainingData %>%
filter(!is.na(F3min)) %>%
mutate(F3min_orig = F3min + Initial_F3_50) %>%
ggplot(aes(x=F3min_orig)) +
geom_histogram(binwidth=50) +
xlab("F3 minimum (Hz)")