Note: An html version of this document, which includes the evaluated result of all code chunks, can be found at https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html.
This guide walks readers through the process of training a random-forest classifier for automated coding of sociophonetic variation, using the statistical computing language R (R Core Team, 2018) and the packages ranger (Wright & Ziegler, 2017) and caret (Kuhn, 2018). This guide is intended to be a companion to our 2020 Laboratory Phonology article “From categories to gradience: Auto-coding sociophonetic variation with random forests”, and it recreates the classifier of Southland English non-prevocalic /r/ discussed in the article. By consulting the R Markdown file that generated this document, readers can run the code on their own systems to recreate the /r/ classifier. The classifier file can also be found here.
IMPORTANT: While you can run all this code, we don’t recommend trying to run it all at once (e.g., by ‘knitting’ the R Markdown document); the code that generated this document was tailored to a machine with 32 processors and still takes several hours to run start-to-finish. As we mention below, training a classifier is not a one-size-fits-all process.
This guide goes through the following steps:
Pre-process data
Determine classifier performance measure
Train an initial classifier
Tune hyperparameters
Determine effect of outliers
Train final classifier
Auto-code new tokens and extract classifier probabilities for hand-coded tokens
Perhaps the most important thing to know is that training a classifier is not a one-size-fits-all process. Your decisions about how to go about training a classifier will change depending on your data, your variables, your research questions, your standards for what counts as adequate performance, etc. This guide describes the process for one particular set of data, a particular variable, particular research questions, etc., and shouldn’t be taken as an exact template for every possible case.
This guide is intended to be accessible to intermediate to advanced R users. Some of the coding techniques used in this guide are fairly advanced, but you won’t need to understand all the code to use this guide or to train your own classifiers.
This guide also includes text that’s visible only on the R Markdown version of the document (to avoid cluttering the html version) giving more detail about the implementation of the code. Like this paragraph, R Markdown-only text in this document begins with a ‘greater than’ sign.
The steps in this guide can be carried out on any computer that can run R. That said, progressing through all of the steps listed here can be very time-consuming, especially with large amounts of data. As a result, all of the forests in this guide were generated on a computing cluster running Ubuntu 16.04 with 32 virtual cores and 64 GB of memory, in order to leverage R’s support for parallel computing. (Despite this powerful hardware, the code in this guide takes over 4 hours to run in terms of pure running time.) The code used here for running models in parallel (which uses the package doParallel) should work on all systems, although the user will need to adjust the number of cores to their system (you can find out how many cores are available via the detectCores() function in the parallel package). Note that train() (the function we use to train classifiers in this guide) by default uses a parallel backend when one is available (e.g., to train multiple resamples at a time), so parallelization is used even in instances where the end-user doesn’t explicitly invoke it using doParallel methods.
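For example, here is a minimal sketch of setting up and shutting down a doParallel backend (the number of cores is illustrative; adjust it to your own machine):
library(parallel)
library(doParallel)
##See how many cores are available, and leave one free for the rest of the
## system (a common convention rather than a requirement)
nCores <- detectCores() - 1
cl <- makeCluster(nCores)
registerDoParallel(cl)
##...run train() or foreach() %dopar% loops here...
##Shut down the cluster and return to sequential processing
stopCluster(cl)
registerDoSEQ()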
Since random forests are a stochastic process, results can be slightly different with each run, so this guide uses random seeds to ensure reproducibility. In single forest runs, it suffices to use set.seed() prior to running train(). In situations where doParallel methods are used to run a list of forests differing by a hyperparameter, it’s necessary to use clusterSetRNGStream() after registering the cluster and before calling foreach(); this ensures that each core starts with the same random seed. Finally, in Step 5, when we want to run lists of non-identical forests that don’t differ by any parameters, we use the seeds argument of trainControl().
Finally, our code creates ‘dummy files’, with no extension or contents, in a Model Status subfolder to help us keep track of when the sub-models finish running. This technique is especially useful when you expect some sub-models to take longer than others (e.g., repeated k-fold cross-validation will naturally take longer than non-repeated k-fold cross-validation). Although we’re only saving ‘dummy files’ here, you can also save more useful files; for example, if the overall model is too big to store in memory and the model crashes before all sub-models are complete, you can replace the second ‘dummy file’ with a command to save each sub-model when it’s finished running, or a summary of the sub-model.
if (!dir.exists("Model Status")) dir.create("Model Status")
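As noted above, the second ‘dummy file’ can be swapped for something more useful; here is a hypothetical sketch (mod and tuneRsmp are placeholders for objects that exist inside the foreach() loops in Step 4):
##Hypothetical alternative to the second 'dummy file': save the finished
## sub-model (or a summary of it) from inside the parallel loop
# saveRDS(mod, file.path("Model Status", paste0("Resamples_", tuneRsmp, ".Rds")))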
To run the code in this guide, you’ll need R and several R packages. In particular, this code relies heavily on the collection of packages called the tidyverse, which make R code faster, more readable, and safer. Though you can run this code without it, we recommend the RStudio integrated development environment for ease of understanding code. Information on the system, R version, and package versions that generated this guide can be found at the end of this guide.
Install necessary packages:
##From https://stackoverflow.com/a/9341833/5432121
pkgTest <- function(x) {
if (!require(x, character.only=TRUE)) {
install.packages(x)
if (!require(x, character.only=TRUE)) stop(paste("Package", x, "not found"))
}
}
pkgTest("tidyverse")
pkgTest("magrittr")
pkgTest("DMwR")
pkgTest("diptest")
Loading required package: diptest
pkgTest("parallel")
pkgTest("foreach")
pkgTest("doParallel")
pkgTest("caret")
pkgTest("ranger")
As a prerequisite to the steps we describe in this guide, you’ll also need data that has been hand-coded into two or more variants and for which you have all the acoustic measures you’ll want to enter into the classifier. In the case of our /r/ classifier (for which we provide the data as an example), we had several thousand hand-coded tokens, for which we extracted 180 acoustic measures using the “Process with Praat” function in LaBB-CAT (Fromont & Hay, 2012); these measures were motivated by previous literature on the acoustic correlates of rhoticity in English and other languages (e.g., Lawson, Stuart-Smith, & Scobbie, 2018; Heselwood, 2009). More details on the choice of these measures can be found in the main text.
Load the /r/ data, which includes 40614 rows and 217 columns, the last 180 of which are acoustic measures to be fed into the classifier.
The first 37 columns contain mostly linguistic information, plus LaBB-CAT metadata (TokenNum and MatchId) and speaker data (Speaker, which has been anonymized, and Gender). To preserve anonymity, both Speaker and Word have been anonymized to dummy factors. Stress is coded as either ’ (primary word stress), " (secondary word stress), or 0 (unstressed). Syllable and columns ending in “DISC” are phonemic strings represented in the DISC phonetic alphabet used by CELEX (see https://catalog.ldc.upenn.edu/docs/LDC96L14/eug_let.pdf); FollSegmentNoPause and FollSegment are represented in the ARPAbet codes used by the CMU pronouncing dictionary, without stress markers on vowels (see http://www.speech.cs.cmu.edu/cgi-bin/cmudict); and Vowel is represented in Wells lexical sets as they relate to New Zealand English (see https://www.phon.ucl.ac.uk/home/wells/stanlexsets.htm). HowCoded is used to separate tokens that have already been hand-coded from those to be auto-coded (which therefore have NA for Rpresent). All other columns should be self-explanatory.
trainingData <- readRDS("Data/RClassifierData_03July2019.Rds")
trainingData %>% dim()
[1] 40614 217
trainingData %>% colnames()
[1] "TokenNum" "Speaker" "Gender"
[4] "MatchId" "Stress" "FollPause"
[7] "CelexFreq" "FollPauseDur" "Syllable"
[10] "SyllStart" "SyllEnd" "Word"
[13] "WordStart" "WordEnd" "TokenDISC"
[16] "TokenStart" "TokenEnd" "TokenGenAmDISC"
[19] "TokenGenAmStart" "TokenGenAmEnd" "Rpresent"
[22] "FollSegmentDISC" "RType" "PrevRInWord"
[25] "ContFuncWord" "CorpusFreq" "Vowel"
[28] "NURSEvsNonNURSE" "FollSegmentNoPause" "FollSegment"
[31] "SyllFinal" "WordFinal" "HowCoded"
[34] "Initial_F1_50" "Initial_F2_50" "Initial_F3_50"
[37] "Initial_F4_50" "TokenDur" "F1_20"
[40] "diffF3F1_20" "BW1_20" "F1_25"
[43] "diffF3F1_25" "BW1_25" "F1_30"
[46] "diffF3F1_30" "BW1_30" "F1_35"
[49] "diffF3F1_35" "BW1_35" "F1_40"
[52] "diffF3F1_40" "BW1_40" "F1_45"
[55] "diffF3F1_45" "BW1_45" "F1_50"
[58] "diffF3F1_50" "BW1_50" "F1_55"
[61] "diffF3F1_55" "BW1_55" "F1_60"
[64] "diffF3F1_60" "BW1_60" "F1_65"
[67] "diffF3F1_65" "BW1_65" "F1_70"
[70] "diffF3F1_70" "BW1_70" "F1_75"
[73] "diffF3F1_75" "BW1_75" "F1_80"
[76] "diffF3F1_80" "BW1_80" "F1min"
[79] "F1max" "time_F1min" "time_F1max"
[82] "F1range" "time_F1range" "slopeF1"
[85] "F2_20" "diffF3F2_20" "BW2_20"
[88] "F2_25" "diffF3F2_25" "BW2_25"
[91] "F2_30" "diffF3F2_30" "BW2_30"
[94] "F2_35" "diffF3F2_35" "BW2_35"
[97] "F2_40" "diffF3F2_40" "BW2_40"
[100] "F2_45" "diffF3F2_45" "BW2_45"
[103] "F2_50" "diffF3F2_50" "BW2_50"
[106] "F2_55" "diffF3F2_55" "BW2_55"
[109] "F2_60" "diffF3F2_60" "BW2_60"
[112] "F2_65" "diffF3F2_65" "BW2_65"
[115] "F2_70" "diffF3F2_70" "BW2_70"
[118] "F2_75" "diffF3F2_75" "BW2_75"
[121] "F2_80" "diffF3F2_80" "BW2_80"
[124] "F2min" "F2max" "time_F2min"
[127] "time_F2max" "F2range" "time_F2range"
[130] "slopeF2" "F3_20" "BW3_20"
[133] "F3_25" "BW3_25" "F3_30"
[136] "BW3_30" "F3_35" "BW3_35"
[139] "F3_40" "BW3_40" "F3_45"
[142] "BW3_45" "F3_50" "BW3_50"
[145] "F3_55" "BW3_55" "F3_60"
[148] "BW3_60" "F3_65" "BW3_65"
[151] "F3_70" "BW3_70" "F3_75"
[154] "BW3_75" "F3_80" "BW3_80"
[157] "F3min" "F3max" "time_F3min"
[160] "time_F3max" "F3range" "time_F3range"
[163] "slopeF3" "intens_F3min" "intens_F3max"
[166] "F4_20" "diffF4F3_20" "BW4_20"
[169] "F4_25" "diffF4F3_25" "BW4_25"
[172] "F4_30" "diffF4F3_30" "BW4_30"
[175] "F4_35" "diffF4F3_35" "BW4_35"
[178] "F4_40" "diffF4F3_40" "BW4_40"
[181] "F4_45" "diffF4F3_45" "BW4_45"
[184] "F4_50" "diffF4F3_50" "BW4_50"
[187] "F4_55" "diffF4F3_55" "BW4_55"
[190] "F4_60" "diffF4F3_60" "BW4_60"
[193] "F4_65" "diffF4F3_65" "BW4_65"
[196] "F4_70" "diffF4F3_70" "BW4_70"
[199] "F4_75" "diffF4F3_75" "BW4_75"
[202] "F4_80" "diffF4F3_80" "BW4_80"
[205] "F4min" "F4max" "time_F4min"
[208] "time_F4max" "F4range" "time_F4range"
[211] "slopeF4" "F0min" "F0max"
[214] "F0rangeST" "time_F0min" "time_F0max"
[217] "absSlopeF0"
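As a quick orientation check (not part of the original pipeline), we can confirm that hand-coded tokens carry Absent/Present labels while tokens still to be auto-coded have NA for Rpresent:
trainingData %>%
  count(HowCoded, Rpresent)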
Due to a fair amount of measurement error in automatically extracting formant and pitch measurements, and to account for the degree to which formant frequencies are impacted by vocal tract length, we subjected /r/ measurements to pre-processing before entering data into the classifier. Your data might not need all these steps—we didn’t need to do nearly as much pre-processing for the medial /t/ data—but we’ll walk through them for your reference.
First, we removed tokens that either had bad durations or immediately preceded another /r/. Checking durations is a way to filter out tokens whose forced alignment is bad, since individual /r/ tokens are unlikely to be long (except perhaps in word lists or utterance-finally). We randomly sampled tokens in a variety of duration bands (e.g., [500 ms, 550 ms)) to check for misalignment; this procedure led to a ceiling of 450 ms. A handful of tokens somehow had zero or negative durations, so we removed these tokens, as well.
trainingData <-
trainingData %>%
filter(TokenDur > 0, TokenDur < 0.45, FollSegment!="R")
Second, we normalized F1–F4 measurements (timepoint measurements and maxima/minima) by subtracting the speaker’s mean word-initial /r/ midpoint measurement for that formant from the raw measurement. As with vowels, formant measurements for rhotics are affected by vocal tract length, but whereas normalization methods abound for vowel formant frequencies, there is no generally accepted way to normalize formants in rhotics. Our method of using each speaker’s initial /r/ formants as a baseline measure accounts for vocal tract length, following Hay and Clendon’s (2012) approach of modeling rhoticity by comparing non-prevocalic /r/ F3 against initial /r/ F3.
trainingData <-
trainingData %>%
mutate_at(vars(matches("^F1(_[2-8][50]|min|max)$")), ~ . - Initial_F1_50) %>%
mutate_at(vars(matches("^F2(_[2-8][50]|min|max)$")), ~ . - Initial_F2_50) %>%
mutate_at(vars(matches("^F3(_[2-8][50]|min|max)$")), ~ . - Initial_F3_50) %>%
mutate_at(vars(matches("^F4(_[2-8][50]|min|max)$")), ~ . - Initial_F4_50)
Third, we centered and scaled token duration by speaker and vowel; in cases where token duration was invariant within a speaker \(\times\) vowel distribution, we used a value of 0.
trainingData <-
trainingData %>%
group_by(Speaker, Vowel) %>%
mutate(TokenDurSD = sd(TokenDur, na.rm=TRUE) %>% replace_na(0),
TokenDur = if_else(TokenDurSD > 0, scale(TokenDur), 0)) %>%
select(-TokenDurSD) %>%
ungroup()
Fourth, to deal with tokens that had missing pitch and/or F4 measurements (including F4 bandwidth), we imputed missing F0 and F4 measurements using k Nearest Neighbor imputation (with k = 10) via the function knnImputation() in the package DMwR (Torgo, 2010). Since random forests cannot classify observations with missing measurements, the imputation of missing measurements is common practice in machine learning contexts (e.g., Breen, Thomas, Baldwin, & Lipinska, 2019; Hudak, Crookston, Evans, Hall, & Falkowski, 2008). At the same time, this step won’t necessarily apply to every data set (we didn’t need to impute any measurements for the medial /t/ data). Note that this next step can be fairly time-consuming (on the cluster, it takes around 15 minutes).
For time-consuming steps such as imputation, we use caching to save time by avoiding re-running code. If we give the chunk a unique name and set cache=TRUE, R Markdown will save the results of the code chunk to a cache when initially knitting the file, such that on subsequent knits the results will simply be re-loaded. This can save a considerable amount of time, especially if you want to tweak text that’s outside of the cached chunks or plots that use data generated by cached chunks, etc. More info on caching can be found at https://yihui.name/knitr/demo/cache/ and in the knitr manual at https://yihui.name/knitr/demo/manual/.
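For reference, caching is switched on in the chunk header of the R Markdown source; a minimal sketch (the chunk name here is just an illustration) looks like this:
```{r knn-imputation, cache=TRUE}
##Time-consuming code (like the imputation below) goes here
```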
##Capture current column ordering
origCols <- trainingData %>% colnames()
##Set up dataframe for imputation: Must contain only the measures to be imputed
## and the measures being used to calculate k nearest neighbors.
imputeDF <-
trainingData %>%
##In this data, MatchId serves as a unique identifier for tokens
select(MatchId, TokenDur:absSlopeF0) %>%
select(-contains("F0"), -contains("F4"), -contains("BW4")) %>%
select_if(~ sum(is.na(.)) < 300) %>%
filter(complete.cases(.)) %>%
left_join(trainingData %>%
select(MatchId, TokenDur:absSlopeF0) %>%
select(MatchId, contains("F0"), contains("F4"), contains("BW4")),
by="MatchId")
##Perform imputation--this takes a while!
imputeDF <-
imputeDF %>%
select(-MatchId) %>%
as.data.frame() %>%
knnImputation() %>%
cbind(imputeDF %>% select(MatchId))
##Add imputed values back into data frame (MatchId is a unique identifier, so
## we can use it to match tokens between data frames)
trainingData <-
trainingData %>%
left_join(imputeDF %>%
select(MatchId, contains("F0"), contains("F4"), contains("BW4")),
by="MatchId") %>%
select(-ends_with(".x")) %>%
rename_at(vars(ends_with(".y")), str_remove, ".y") %>%
select(!!origCols)
##Clean up
rm(imputeDF)
Fifth, we marked numerous measures for potential outlier-hood: formant timepoint measures, formant minima/maxima, pitch timepoint measures, pitch minima/maxima, intensity measures, and duration. Potential outliers were measurements that were more than 1.5 \(\times\) the interquartile range (third quartile minus first quartile) outside the first or third quartile for that measure’s distribution by speaker and by vowel. We won’t actually exclude outliers now; instead, we’ll use outlier data in Step 5 below to figure out which outliers we can afford to keep, and which ones we’re better off discarding. As with imputation, outlier-marking isn’t necessary or appropriate for all situations; we give more details in the subsection following this one.
trainingData <-
trainingData %>%
group_by(Speaker, Vowel) %>%
mutate_at(vars(matches("^F[01234](_[2-8][50]|min|max)$"), starts_with("intens_"), TokenDur),
list(Outlier = ~ . < quantile(., 0.25, na.rm=TRUE) - 1.5*IQR(., na.rm=TRUE) |
. > 1.5*IQR(., na.rm=TRUE) + quantile(., 0.75, na.rm=TRUE))) %>%
ungroup()
Finally, we excluded tokens that had missing measurements for measures other than F0 and F4.
trainingData <-
trainingData %>%
filter_at(vars(TokenDur:absSlopeF0), all_vars(!is.na(.))) %>%
droplevels()
Let’s save our progress by writing the training data to an RDS file.
if (!dir.exists("Data")) dir.create("Data")
trainingData %>% saveRDS("Data/trainingData.Rds")
trainingData <- readRDS("Data/trainingData.Rds")
Measurement error is a fact of life when using corpus methods, and of course classifiers will perform better if they have cleaner data. At the same time, not all data is appropriate for outlier-marking. Here are the conditions under which outlier-marking is appropriate:
Some measures include measurement error. If your data is spotless, then any outliers are ‘true’ outliers, not just mis-measurements, and your model will be better off for being able to see them.
For the measures with some measurement error, the distributions are either unimodal or their bi-/multimodality isn’t due to the classes of interest.
Let’s briefly investigate these conditions. First, do we have evidence of measurement error in our data? Looking at F3 minimum, we appear to have a good distribution, but the extent of it seems questionable (do we really have tokens whose F3 minimum is below 1000 Hz?). Similarly, the intensity at the time of the F3 minimum has some values that seem conspicuously low. When we hand-check these tokens, we find that these measurements are indeed erroneous, meaning there’s evidence for measurement error in our data:
trainingData %>%
filter(!is.na(F3min)) %>%
mutate(F3min_orig = F3min + Initial_F3_50) %>%
ggplot(aes(x=F3min_orig)) +
geom_histogram(binwidth=50) +
xlab("F3 minimum (Hz)")
trainingData %>%
filter(!is.na(intens_F3min)) %>%
ggplot(aes(x=intens_F3min)) +
geom_histogram(binwidth=1) +
xlab("Intensity at time of F3 minimum (dB)")
Once we’ve detected measurement error, we have to determine whether these measures are unimodally distributed. Rather than just eyeball the distributions, we’ll conduct Hartigan’s dip tests, which determine whether there is significant evidence for non-unimodality in a distribution (Freeman & Dale, 2013; Hartigan & Hartigan, 1985), using the diptest package (Maechler, 2016). The null hypothesis in these tests is that the distribution is unimodal, so a significant result means that we have evidence that the distribution is bimodal or multimodal. The dip test on the F3 minimum distribution indicates no evidence for non-unimodality, but (perhaps surprisingly) the dip test on the intensity at F3 minimum distribution indicates significant evidence for non-unimodality. In other words, F3 minimum has met the conditions for outlier-marking, but intensity hasn’t.
trainingData$F3min %>% dip.test()
Hartigans' dip test for unimodality / multimodality
data: .
D = 0.0010115, p-value = 0.9962
alternative hypothesis: non-unimodal, i.e., at least bimodal
trainingData$intens_F3min %>% dip.test()
Hartigans' dip test for unimodality / multimodality
data: .
D = 0.021, p-value < 2.2e-16
alternative hypothesis: non-unimodal, i.e., at least bimodal
The reason for the bimodality condition is that we don’t want to outlier-mark a bimodal distribution that is merely the combination of two unimodal distributions that each correspond to one class. If the distribution of some measure is underlyingly two fairly separate unimodal distributions by class, then outlier-marking is going to eliminate some measurements that are extreme with respect to the pooled distribution but are actually well-behaving measurements for their own class. As a result, if our bimodal distribution isn’t the result of the classes of interest, but instead the result of some other factor external to the classes, then it’s okay to outlier-mark the distribution (although it might be beneficial to normalize for the external factor if that’s possible). To that end, we can look at the by-class distribution of intensity; since we don’t have evidence for different modes for Absent vs. Present, intensity meets the conditions for outlier-marking.
trainingData %>%
filter(HowCoded=="Hand") %>%
ggplot(aes(x=intens_F3min, fill=Rpresent)) +
geom_density(alpha=0.5) +
xlab("Intensity at time of F3 minimum (dB)")
We don’t have any examples of distributions whose bimodality is due to the Absent/Present distinction in the /r/ data, but in the medial /t/ data, the center of gravity at 70% duration is clearly bimodal, and this bimodality is clearly due to Voiced tokens having a lower center of gravity than Voiceless tokens.
Now that we’ve determined that we can outlier-mark these measures, what criteria do we use to mark individual measurements as outliers? Two potential rules of thumb are to mark measurements that are more than 2 standard deviations from the mean, or to mark measurements that are more than 1.5 times the interquartile range from the first or third quartiles. The standard deviation-based rule is only appropriate if the data is normal, while the IQR-based rule is more general. In this case, since we had a lot of measures to outlier-mark, some of which are clearly non-normal (e.g., intensity at F3 minimum), we used the more general IQR rule of thumb. We also generated distributions within each speaker and vowel, since a well-behaving measurement in a rare vowel context can easily look like an outlier in the overall data.
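To make the IQR rule concrete, here is what the fences look like for one measure (F3min, purely for illustration) within each speaker \(\times\) vowel cell; a measurement below Lower or above Upper gets marked as a potential outlier:
trainingData %>%
  group_by(Speaker, Vowel) %>%
  summarise(Lower = quantile(F3min, 0.25, na.rm=TRUE) - 1.5*IQR(F3min, na.rm=TRUE),
            Upper = quantile(F3min, 0.75, na.rm=TRUE) + 1.5*IQR(F3min, na.rm=TRUE)) %>%
  ungroup()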
Before tuning hyperparameters, it’s necessary to choose a classifier performance metric that the process of hyperparameter tuning will optimize. If you simply want the most accurate classifier possible, optimize for overall accuracy; if it’s overwhelmingly important that a particular class be accurately represented, optimize for that class’s accuracy. In our case, we wanted our classifiers to be highly accurate, but we also wanted our classifiers to balance true positives and true negatives; the latter aim was especially important in light of the class imbalance between Present and Absent in the /r/ training data. As a result, we tuned hyperparameters to optimize the area under the ROC curve (AUC) \(\times\) overall accuracy (see the Methods section of the main text). We also wanted to monitor the class accuracies of Present and Absent during the tuning process, even though we didn’t choose hyperparameters to optimize class accuracies.
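To illustrate how the AUC half of this measure is computed, here is a toy example using the ROCR package (the same package the summary function below relies on); the probabilities, labels, and 0.5 cutoff are made up purely for illustration:
library(ROCR)
##Made-up classifier probabilities for 'Present' and observed classes
toyProbs <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
toyObs <- factor(c("Present", "Present", "Absent", "Present", "Absent", "Absent"),
                 levels=c("Absent", "Present"))
##AUC: area under the ROC curve traced out by sweeping the probability cutoff
toyAUC <- performance(prediction(toyProbs, toyObs), measure="auc")@y.values[[1]]
##Overall accuracy at a 0.5 cutoff
toyAcc <- mean(if_else(toyProbs >= 0.5, "Present", "Absent") == toyObs)
##The combined performance measure we optimize in this guide
toyAcc * toyAUC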
The package that we used to run the classifiers, caret (Kuhn, 2018), allows the user to specify custom summary functions to report summary statistics, which is useful for hyperparameter tuning. We used the following function (a lightly modified version of caret’s multiClassSummary()) to report several summary statistics, including overall accuracy \(\times\) AUC.
cutoffSummary <- function(data, lev=NULL, model=NULL, obsCol="obs", returnDF=FALSE) {
require(ROCR)
require(tidyverse)
require(magrittr)
if (!is.null(obsCol) & obsCol!="obs") data <- data %>% rename("obs" = obsCol)
preds <- data %$% prediction(Present, obs)
auc <-
preds %>%
performance(measure="auc") %>%
attr("y.values") %>%
extract2(1)
cutoff <-
preds %>%
performance(measure="acc") %>%
attributes() %>%
extract(c("x.values", "y.values")) %>%
map_dfc(extract2, 1) %>%
filter(y.values==max(y.values)) %>%
pull(x.values) %>%
median()
data <- data %>%
mutate(PredCutoff = if_else(Present >= cutoff, "Present", "Absent"),
PredHalf = if_else(Present >= 0.5, "Present", "Absent"),
AccCutoff = PredCutoff==obs,
AccHalf = PredHalf==obs)
class_means <-
data %>%
group_by(obs) %>%
summarise_at(vars(starts_with("Acc")), mean) %>%
gather(Type, Acc, starts_with("Acc")) %>%
mutate(Type = str_c(Type, "_", obs)) %>%
select(Type, Acc) %>%
spread(Type, Acc) %>%
rename_all(str_replace, "Acc", "ClassAccuracy") %>%
rename_all(str_remove, "Cutoff")
# confMat <- data %$% table(Predicted = PredCutoff, Actual = obs)
ret <- data.frame(AccAUC = mean(data$AccCutoff)*auc,
AUC = auc,
Accuracy = mean(data$AccCutoff),
BestCutoff = cutoff,
AccuracyHalf = mean(data$AccHalf)) %>%
cbind(class_means)
if (!returnDF) ret <- ret %>% map_dbl(I)
return(ret)
}
First, we train an initial classifier using the default settings of train(). This initial classifier will help us understand what sort of performance we can expect from our final model; although we’ll tune hyperparameters and exclude outliers to improve the model’s performance, in reality the final model’s performance won’t be drastically better than the initial model’s.
In terms of syntax, we feed train() a formula, the data, a model type, a variable importance mode, and a trainControl object that contains some more details about how the model is to be run:
The formula describes the dependent variable we’re trying to classify (Rpresent) and says that everything else in the data (.) will serve as independent variables.
For the data, we use just the hand-coded tokens from the training data (recall that we’ve already filtered out tokens with missing measurements, etc., in Step 1), and just the columns containing the dependent variable and the acoustic measures that we want to use in training our classifier (hence why the formula has . on the right-hand side).
We specify the "ranger" method for running random forests because train() can use over 200 models for classification and regression.
The variable importance mode will determine how ranger() ranks variables in terms of importance (here, "impurity" is the Gini index).
Finally, the various arguments of trainControl() tell train() to save the resampled classification predictions that the model generates for the hand-coded data, use our custom summary function (so we can use a non-default cutoff and calculate our chosen performance measure, overall accuracy \(\times\) AUC), save info on which tokens were in which resamples, and calculate not only binary classifications but also classifier probabilities (which is necessary for calculating AUC).
Because we haven’t specified anything else (e.g., number of trees, resampling method), train() will use its defaults for the various hyperparameters we’ll tune later.
##Run the model using a single core
registerDoSEQ()
##RF are a stochastic process, so ensure reproducibility by setting a seed
set.seed(302302)
fstInit <-
train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
trControl=trainControl(savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE))
Loading required package: ROCR
Loading required package: gplots
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
The fstInit model we’ve just generated has lots of information inside of it. To pull out the most salient information from the model—the resampled performance results, an average confusion matrix, the variable importance scores, and timing information—we’ll use a summary function. This summary function will also come in handy because the forest models are large files (fstInit on its own is 75.3 Mb), and they’ll add up when we’re generating large lists of them in Steps 4 and 5, so we’ll just save the summaries to disk rather than the forests.
To the last point about only saving the summaries: if we’re caching the chunks that are used to run the forests, the forests will be saved to disk as well; cache with caution. This file caches the forest-running chunks on a computing cluster with lots of extra space, but if you don’t have lots of extra space to use up on your drive, we wouldn’t recommend caching forests (especially if you set random seeds to ensure that you’d get the same results every time); the forests run in this document take up about 4 Gb.
fstSummary <- function(x) {
require(tidyverse)
##Create a simple summary with resampled performance measures, confusion
## matrix, variable importance, and timing info
simpleSummary <- function(x) {
resample <-
x %>%
extract2("resample") %>%
select(AccAUC, Accuracy, AUC, everything())
cutoff <-
resample %>%
pull(BestCutoff) %>%
mean(na.rm=TRUE)
confMat <-
x %>%
extract2("pred") %>%
mutate(pred = if_else(Present >= cutoff, "Present", "Absent")) %$%
table(Predicted = pred, Actual = obs)
varImp <-
x %>%
extract2("finalModel") %>%
extract2("variable.importance")
times <- x %>%
extract2("times") %>%
extract(-3)
list(resample = resample, confMat = confMat,
varImp = varImp, times = times)
}
##If x is a single forest, return a simple summary; if x is a list of
## forests, make each summary element a list of results for each forest
## (except for resample, which combines the results for all forests and adds
## a Model column)
if ("train" %in% class(x)) {
smry <- simpleSummary(x)
} else if (x %>% map_lgl(~ "train" %in% class(.x)) %>% all()) {
if (x %>% names() %>% is.null()) names(x) <- seq_along(x)
smry <-
x %>%
map(simpleSummary) %>%
transpose()
smry$resample <-
smry$resample %>%
imap_dfr(~cbind(Model = .y, .x, stringsAsFactors=FALSE)) %>%
mutate_at(vars(Model), factor, levels=names(x))
}
smry
}
For example, we can calculate the mean of resampled performance measures from the resample element of our summary object. Here, we see that the mean overall accuracy \(\times\) AUC of this initial classifier was 0.6972862:
smryInit <- fstInit %>% fstSummary()
smryInit$resample %>%
summarise_if(is.numeric, mean, na.rm=TRUE)
Or we can look at timing, in the times element. Here, we see that the classifier took 10 minutes to run, but if we’d run it on a single core rather than with a parallel backend, it would’ve taken 203 minutes to run:
smryInit$times
$everything
user system elapsed
12193.048 9.384 625.325
$final
user system elapsed
108.384 0.008 4.522
Of course, this is just one way to summarize a forest; depending on your needs, you might want to pull out different information. Regardless, we’ll save this summary (and not the forest itself) to the Model Status directory, and we’ll remove the forest from our workspace to save on memory:
smryInit %>% saveRDS("Model Status/SmryInit.Rds")
rm(fstInit)
In this step, we adjust several hyperparameters to find the settings that will yield the best-performing model (with “best-performing” defined with respect to our chosen performance measure, of course). These hyperparameters control how the random forest classifier is run in ranger and how caret determines classifier performance. In particular, caret takes the training data and generates training and test subsets to calculate classifier performance measures, optionally performing additional resampling on these training and test subsets.
We tune the following hyperparameters:
Additional resampling to perform on the training and test subsets (trainControl(sampling))
The method for generating training and test subsets from the training set, for the purpose of measuring performance (the method argument of trainControl())
Parameters controlling the generation of training and test subsets, depending on the method: number of folds/resampling iterations (trainControl(number)), number of repeated k-fold cross-validation iterations (trainControl(repeats)), and/or leave-p-out cross-validation training percentage (trainControl(p))
Parameters of the ranger model (controlled via the tuneGrid argument of train()): the number of variables to test at each node (mtry), node splitting rule (splitrule), and minimum node size (min.node.size)
The number of trees to generate (train(num.trees))
Because we have unbalanced classes in our hand-coded data, this additional resampling is likely to have the greatest impact on classifier performance, hence why we tune this hyperparameter first.
After constructing training and test subsets, train() can perform additional resampling to adjust for class imbalances in the training and test subsets prior to assessing classifier performance. By default, train() performs no additional resampling, but other options are downsampling the majority class to match the size of the minority class, and SMOTE sampling, a hybrid method that combines downsampling the majority class with creating synthetic members of the minority class (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Given that the data are imbalanced, it’s likely that a model that looks at the data as-is will perform well on the majority class, Absent, but poorly on the minority class, Present. For all the other hyperparameters, we start with default values (method="boot", number=25, num.trees=500, tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1)).
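To see the imbalance for yourself before running the tuning models, a quick count of the hand-coded tokens by class (a small check, separate from the tuning code itself):
trainingData %>%
  filter(HowCoded=="Hand") %>%
  count(Rpresent)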
##Define a vector with possible values of our hyperparameter
tuneResamples <- c("none", "down", "smote")
##Register a parallel cluster with 3 cores (the length of our hyperparameter
## vector)
cl <- makeCluster(tuneResamples %>% length())
registerDoParallel(cl)
##Set a parallel seed for reproducibility
clusterSetRNGStream(cl, 302302)
##Train the list of forests; the code inside the curly braces will be run in
## parallel for each value of the vector tuneResamples, outputting a classifier
## for each setting
fstListResamples <-
foreach(tuneRsmp = tuneResamples,
.packages=c("caret","dplyr")) %dopar% {
##Create an extensionless 'dummy file' to signal that the model has
## started running
file.create(paste("Model Status/Resamples", tuneRsmp, "begun", sep="_"))
##Call train()
mod <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=500,
trControl=trainControl(method="boot",
number=25,
sampling=if (tuneRsmp=="none") NULL else tuneRsmp,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1))
##Create an extensionless 'dummy file' to signal that the model has
## finished running
file.create(paste("Model Status/Resamples", tuneRsmp, "completed", sep="_"))
##Return the classifier
return(mod)
} %>%
##Name each classifier with its hyperparameter setting
set_names(tuneResamples)
##Stop the parallel cluster
stopCluster(cl)
Clear the model status dummy files:
dir("Model Status", pattern="^Resamples", full.names=TRUE) %>% file.remove()
logical(0)
Create and save a summary object for this tuning step (note that we’ve written fstSummary() to handle a list of classifiers as well as a single classifier):
smryResamples <- fstListResamples %>% fstSummary()
smryResamples %>% saveRDS("Model Status/SmryResamples.Rds")
The resampled performance measures indicate that SMOTE sampling narrowly outperforms no resampling, which outperforms downsampling.
smryResamples$resample %>%
group_by(Model) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
arrange(desc(AccAUC))
Because we tested each classifier using 25 bootstrap resamples, we can assess performance across a range of possible outcomes. The plot below shows the resampled overall accuracy \(\times\) AUC for each hyperparameter setting. The SMOTE and downsampling models are more variable in their performance than the non-resampled model; this is not surprising given that the process of resampling introduces an additional layer of randomness. Nevertheless, the SMOTE models generally outperform the downsampled and non-resampled models.
smryResamples$resample %>%
ggplot(aes(x=Model, y=AccAUC)) +
geom_violin() +
xlab("Additional resampling method") +
ylab("Overall accuracy × AUC")
Even though we’re optimizing our classifiers based on overall accuracy \(\times\) AUC, our summary function allows us to keep an eye on other performance metrics we might care about, like class accuracies. The plot below demonstrates that the choice of additional resampling method makes a considerable impact on class accuracy. As we suspected, the non-resampled model suffers from stratification between high accuracy for Absent and low accuracy for Present; compared to the non-resampled model, the SMOTE model has slightly worse accuracy for Absent, but much better accuracy for Present. This is a tradeoff we’re willing to make, and this tradeoff is reflected in overall accuracy \(\times\) AUC, where the SMOTE model edges out the non-resampled model.
smryResamples$resample %>%
gather("Class", "ClassAccuracy", starts_with("ClassAccuracy_")) %>%
mutate_at(vars(Class), str_remove, "ClassAccuracy_") %>%
ggplot(aes(x=Model, y=ClassAccuracy, fill=Class)) +
geom_violin() +
xlab("Additional resampling method") +
ylab("Class accuracy")
Again, caret automatically saves timing data in the random forest objects it creates, and the summary objects we’ve created store this timing information. Here, SMOTE takes substantially longer to run than no resampling or downsampling, thanks to the fact that it takes the extra step of creating synthetic minority class tokens. The time-saving benefits of parallel processing are also evident from the difference between user (the sum of running time on all cores) and elapsed (the actual time it took to run the models); it took 7 minutes to run the SMOTE model in parallel, but without parallel processing it would have taken nearly 25 minutes!
smryResamples$times
$none
$none$everything
user system elapsed
480.320 1.900 67.386
$none$final
user system elapsed
21.484 0.028 1.526
$down
$down$everything
user system elapsed
277.904 1.572 55.985
$down$final
user system elapsed
10.772 0.016 0.879
$smote
$smote$everything
user system elapsed
1479.396 4.672 444.371
$smote$final
user system elapsed
64.148 0.016 14.972
It’s also a good idea when you’re running these models to keep an eye on how much memory your forest lists are taking up by sitting in your R workspace. Here we see that our forest list with three different models takes up quite a bit of space. We’re running these forests on a platform where memory’s not a concern, but if it were, we’d certainly want to remove each forest list after extracting whatever summary information we want from the forest list and saving that summary to file.
fstListResamples %>% object.size() %>% format("Mb")
[1] "178.3 Mb"
The possible subset generation methods are the simple bootstrap estimator, the 0.632 bootstrap estimator, out-of-bag, k-fold cross-validation, repeated k-fold cross-validation, and leave-p-out cross-validation (aka leave-group-out cross-validation). Here, we use caret’s default parameter settings for each of these methods: 25 resamples for bootstrap, out-of-bag, and leave-p-out cross-validation; 10 folds for k-fold cross-validation and repeated k-fold cross-validation; and a training percentage of 75% for leave-p-out cross-validation. We’ll override one caret default, as we’ll use 5 repeats for repeated k-fold cross-validation (caret’s default of 1 repeat makes this method indistinguishable from non-repeated k-fold cross-validation).
Now, we generate our list of forests:
tuneMethods <- c("boot", "boot632", "cv", "repeatedcv", "LGOCV")
cl <- makeCluster(tuneMethods %>% length())
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListMethods <-
foreach(tuneMth = tuneMethods,
.packages=c("caret","dplyr")) %dopar% {
file.create(paste("Model Status/Methods", tuneMth, "begun", sep="_"))
mod <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=500,
trControl=trainControl(method=tuneMth,
repeats=if (tuneMth=="repeatedcv") 5 else NA,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1))
file.create(paste("Model Status/Methods", tuneMth, "completed", sep="_"))
return(mod)
} %>%
set_names(tuneMethods)
stopCluster(cl)
Clear the model status dummy files:
dir("Model Status", pattern="^Methods", full.names=TRUE) %>% file.remove()
logical(0)
Create and save a summary object:
smryMethods <- fstListMethods %>% fstSummary()
smryMethods %>% saveRDS("Model Status/SmryMethods.Rds")
The resampling results give repeated k-fold cross-validation as the best-performing method, although there’s not much difference between methods in terms of performance:
smryMethods$resample %>%
group_by(Model) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
arrange(desc(AccAUC))
We can also plot the resampled results to get a better sense of the range of performance for each of these methods.
smryMethods$resample %>%
ggplot(aes(x=Model, y=AccAUC)) +
geom_violin()
Now that we’ve selected repeated k-fold cross-validation as our subset generation method, what’s the optimal number of folds and the optimal number of repeats? For folds, we’ll sample a space of options around the default of 10: 4, 6, 8, 10, 12, 14. For repeats, we’ll sample a space of options around 5: 3, 5, 7.
tuneListFoldsReps <- expand.grid(Folds = seq(4, 14, by=2),
Repeats = seq(3, 7, by=2))
tuneListFoldsReps <- tuneListFoldsReps %>% split(tuneListFoldsReps %>% nrow() %>% seq_len())
tuneListFoldsReps <- tuneListFoldsReps %>% set_names(tuneListFoldsReps %>% map(str_c, collapse=" "))
cl <- makeCluster(9)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListFoldsReps <-
foreach(tuneNmRp = tuneListFoldsReps,
.packages=c("caret","dplyr")) %dopar% {
file.create(paste("Model Status/FoldsReps", paste(tuneNmRp, collapse="&"), "begun", sep="_"))
mod <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=500,
trControl=trainControl(method="repeatedcv",
number=tuneNmRp$Folds,
repeats=tuneNmRp$Repeats,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1))
file.create(paste("Model Status/FoldsReps", paste(tuneNmRp, collapse="&"), "completed", sep="_"))
return(mod)
} %>%
set_names(names(tuneListFoldsReps))
stopCluster(cl)
Clear the model status dummy files:
dir("Model Status", pattern="^FoldsReps", full.names=TRUE) %>% file.remove()
logical(0)
Create another summary object:
smryFoldsReps <- fstListFoldsReps %>% fstSummary()
smryFoldsReps$resample <- smryFoldsReps$resample %>%
separate(Model, c("NumFolds", "Repeats"), sep=" ")
smryFoldsReps %>% saveRDS("Model Status/SmryFoldsReps.Rds")
The resampling results give the model with 12 folds and 3 repeats as the best model:
smryFoldsReps$resample %>%
group_by(NumFolds, Repeats) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
arrange(desc(AccAUC))
ranger parameters (mtry, splitrule, min.node.size)
Define an initial search space for these parameters. A typical value of mtry for classification is the square root of the number of predictors, rounded down to the nearest integer, which in the case of the /r/ data is 13; so we’ll try a few mtry values around 13: 9, 11, 13, 15, 17. splitrule has two values, "gini" and "extratrees", so we’ll try both. Finally, the original randomForest package (Liaw & Wiener, 2002) uses min.node.size defaults of 1 for classification and 5 for regression, so for good measure we’ll try 1, 5, and 10. With respect to the continuous parameters (mtry and min.node.size), the best classifiers in this initial round may indicate values toward the ends of the ranges we tuned; if that’s the case, we’ll do another round of tuning these parameters with a better-informed range of values.
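As a quick check of that default (a one-liner, not part of the original tuning code), the floor of the square root of the 180 acoustic measures is indeed 13:
trainingData %>%
  select(TokenDur:absSlopeF0) %>%
  ncol() %>%
  sqrt() %>%
  floor()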
tuneList <- expand.grid(mtry=seq(9, 17, by=2),
splitrule=c("gini", "extratrees"),
min.node.size=c(1, 5, 10),
stringsAsFactors=FALSE)
tuneList <- tuneList %>% split(tuneList %>% nrow() %>% seq_len())
tuneList <- tuneList %>% set_names(tuneList %>% map(str_c, collapse=" "))
cl <- makeCluster(10)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListTuneGrid <-
foreach(tuneVals = tuneList,
.packages=c("caret","dplyr")) %dopar% {
file.create(paste("Model Status/TuneGrid",
paste(tuneVals, collapse=" "),
"begun", sep="_"))
mod <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=500,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=tuneVals)
file.create(paste("Model Status/TuneGrid",
paste(tuneVals, collapse=" "),
"completed", sep="_"))
return(mod)
} %>%
set_names(names(tuneList))
stopCluster(cl)
Clear the model status dummy files:
dir("Model Status", pattern="^TuneGrid", full.names=TRUE) %>% file.remove()
logical(0)
To assess which tuning parameters are the best, we collect some of the most salient characteristics of the model (resampling results, confusion matrix, variable importance, and running time) into a summary object. We’ll also save this summary in its own .Rds file, in case something goes awry and we lose our R session (which is a risk when you’re trying to hold large objects in memory).
smryTuneGrid <- fstListTuneGrid %>% fstSummary()
smryTuneGrid %>% saveRDS("Model Status/SmryTuneGrid.Rds")
The resampled performance measures indicate the best performance with mtry=13, splitrule="gini", and min.node.size=10:
smryTuneGrid$resample %>%
group_by(mtry, splitrule, min.node.size) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
arrange(desc(AccAUC))
Finally, we can adjust the number of trees in our forest. We’ve been using 500 trees, but it seems reasonable to try smaller or larger forests. We’ll try forests of size 250, 500, 750, and 1000; if it seems that our forests’ performance only increases with greater size, we can try larger forests still.
tuneNumTrees <- seq(250, 1000, by=250)
cl <- makeCluster(tuneNumTrees %>% length())
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListNumTrees <-
foreach(tuneNmTr = tuneNumTrees,
.packages=c("caret","dplyr")) %dopar% {
file.create(paste("Model Status/NumTrees", tuneNmTr, "begun", sep="_"))
mod <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=tuneNmTr,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
file.create(paste("Model Status/NumTrees", tuneNmTr, "completed", sep="_"))
return(mod)
} %>%
set_names(tuneNumTrees)
stopCluster(cl)
Clear the model status dummy files:
dir("Model Status", pattern="^NumTrees", full.names=TRUE) %>% file.remove()
logical(0)
Create and save another summary object:
smryNumTrees <- fstListNumTrees %>% fstSummary()
smryNumTrees %>% saveRDS("Model Status/SmryNumTrees.Rds")
The resampled performance measures are best for the model with 750 trees, so we’ll keep that setting, though we acknowledge that there is no real difference in performance between 250, 500, 750, and 1000 trees (that is, if our model took a very long time to run, we could get away with using just 250 trees).
smryNumTrees$resample %>%
group_by(Model) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
arrange(desc(AccAUC))
After the tuning process, our classifier has the following hyperparameter settings:
Additional resampling: SMOTE
Training/test subset generation method: repeated k-fold cross-validation
Training/test subset generation parameters: 12 folds, 3 repeats
ranger parameters: mtry=13, splitrule="gini", min.node.size=10
Number of trees: 750
Let’s train another classifier with these parameter settings to get a snapshot of performance after tuning hyperparameters (since we’re only training one model, we’ll dispense with the parallel backend):
registerDoSEQ()
set.seed(302302)
fstTuned <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
Create and save a summary object:
smryTuned <- fstTuned %>% fstSummary()
smryTuned %>% saveRDS("Model Status/SmryTuned.Rds")
Here are the performance measures for our tuned classifier:
smryTuned$resample %>%
summarise_if(is.numeric, mean, na.rm=TRUE)
Finally, we remove forest objects to avoid overtaxing memory (depending on your system, it might be a good idea to do this after each hyperparameter), except for fstTuned.
rm(fstListMethods, fstListFoldsReps, fstListNumTrees, fstListResamples, fstListTuneGrid)
So far, we’ve been training our classifiers on the entire set of hand-coded data available to us, without excluding any outliers, but it’s worth questioning whether this is the right approach. On the one hand, bigger is better from the perspective of both training and testing: our sample size simulations demonstrate the positive effect of training set size on classifier performance, and the more permissive we are about outliers, the fewer test set tokens we throw out as ‘uncodeable.’ On the other hand, the principle of ‘garbage in, garbage out’ warns us against trusting the predictions of a model trained on bad data.
We already know how our model performs when we’re maximally permissive about outliers. Let’s train another model where we’re maximally conservative about outliers, discarding every token with even a single outlying measurement:
registerDoSEQ()
set.seed(302302)
fstNoOutlier <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
filter_at(vars(ends_with("_Outlier")), all_vars(!.)) %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
As the performance measures below show, this no-outlier classifier improves considerably over our previous model.
fstNoOutlier$resample %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
select(AccAUC, Accuracy, AUC, everything())
rm(fstNoOutlier)
But what is the cost for this improved performance? If we only keep tokens with zero outlying measurements, we throw out about 40% of our tokens for both the hand-coded training set and the auto-coded test set. This is because we have a lot of formant timepoint measures, so the likelihood of outliers is high even if the likelihood of meaningful outliers is low. Indeed, only about 4% of all outlier-marked measurements are actually flagged as outliers (a direct consequence of our 1.5 \(\times\) IQR criterion); among tokens with at least one outlying measurement, half have fewer than 4 outliers.
trainingData %>%
mutate(PossOutlier = apply(select(., ends_with("_Outlier")), 1, any)) %$%
table(HowCoded, PossOutlier)
PossOutlier
HowCoded FALSE TRUE
Auto 17204 15915
Hand 2978 2642
trainingData %>%
select(ends_with("_Outlier")) %>%
summarise_all(mean) %>%
apply(1, mean)
[1] 0.04197241
trainingData %>%
mutate(NumOutlier = apply(select(., ends_with("_Outlier")), 1, sum)) %>%
filter(NumOutlier > 0) %>%
ggplot(aes(x=NumOutlier)) +
geom_histogram(binwidth=1) +
facet_grid(HowCoded ~ .)
In other words, where outliers are concerned, we have to achieve a balance. It’s likely that some outliers hurt our classifier’s performance so much that our classifier is better without them, and other outliers hurt performance to such a minor degree that it’s worth keeping those tokens. In addition, with so many acoustic measures, the likelihood of outliers is high, so we want to be careful about which outliers we can afford to keep and which ones are detrimental to performance.
Here is the procedure we’ll follow. We’ll train a set of 5 classifiers (to lessen statistical noise) and obtain a ranking of which measures’ outliers most hurt classifier performance. We’ll train another set of 5 classifiers where we drop outliers in the top-ranked measure, and we’ll compare performance in the classifiers that drop outliers to the ones that don’t drop outliers. If there is a significant difference in performance, we will consider the loss in data to be worth the gain in performance; we’ll obtain an updated ranking of measures from this set of classifiers, and we’ll proceed to train another set of classifiers that drops outliers in the top-ranked measure. If there is no significant difference in performance, we’ll move down to the next measure in the ranking to see if dropping its outliers leads to a significant gain in performance. Once we reach three consecutive nonsignificant results, we’ll stop. By way of this procedure, we’ll determine which outliers are most detrimental to performance and which ones we can afford to keep.
To lessen the effect of statistical noise, we’ll train 5 models for each step. We’ll use parallel processing as before, but with one additional bit of machinery. Previously, we ensured reproducibility by using parallel::clusterSetRNGStream() to pass the same random seed to each parallel process. If we did the same thing here, we would end up with 5 repetitions of the exact same model running in parallel! Not exactly the best use of parallel processing. As a result, we’ll also pass different vectors of random seeds to the seeds argument of trainControl(), ensuring that train() generates different training and test subsets from one parallel model to the next.
registerDoSEQ()
set.seed(302302)
##For repeated k-fold CV with n repeats, trainControl(seeds) must be a list of
## (n*k)+1 lists, each containing M integers (M = number of models being
## evaluated within each call to train())
seedList <- replicate(5,
replicate(37, sample.int(10000, 1), simplify=FALSE),
simplify=FALSE)
First, train a list of 5 classifiers on the full data set (i.e., without dropping any outliers):
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut0 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Extract and save a summary:
smryDropOut0 <- fstListDropOut0 %>% fstSummary()
smryDropOut0 %>% saveRDS("Model Status/SmryDropOut0.Rds")
For each forest in this list, we can assess the effect of each measure’s outliers on performance by comparing each fold’s proportion of outliers against that fold’s performance. We pull each token’s fold assignment from the pred element of each forest and add it to each token’s outlier data:
fstListDropOut0[[1]] %>%
extract2("pred") %>%
arrange(rowIndex) %>%
select(Resample) %>%
cbind(trainingData %>% filter(HowCoded=="Hand") %>% select(ends_with("_Outlier")))
From this data, we can find the proportion of outliers for each measure in each fold (within each repetition):
fstListDropOut0[[1]] %>%
extract2("pred") %>%
arrange(rowIndex) %>%
select(Resample) %>%
cbind(trainingData %>% filter(HowCoded=="Hand") %>% select(ends_with("_Outlier"))) %>%
group_by(Resample) %>%
summarise_all(mean)
We then take performance data from the resample element of the forest and join it to the outlier data:
fstListDropOut0[[1]] %>%
extract2("resample") %>%
select(AccAUC, Resample)
fstListDropOut0[[1]] %>%
extract2("pred") %>%
arrange(rowIndex) %>%
select(Resample) %>%
cbind(trainingData %>% filter(HowCoded=="Hand") %>% select(ends_with("_Outlier"))) %>%
group_by(Resample) %>%
summarise_all(mean) %>%
left_join(fstListDropOut0[[1]] %>%
extract2("resample") %>%
select(AccAUC, Resample),
by="Resample")
After we put this data together for all 5 classifiers in the list, we pass the resulting dataframe to cor() to get correlations, and we pull out just the correlations we care about: the correlations between AccAUC and the proportions of outliers. Finally, we store the data in a dataframe and, to save memory, remove the forest list from our R environment. The following function puts all these calculations together:
##Find the outliers that most hurt classifier performance by calculating the
## correlation between folds' test performance and proportion of outliers
##This code works for repeated k-fold cross-validation, but it'll need to be
## tweaked for different methods of generating training/test subsets.
outlierCors <- function(fstList, data=trainingData, dropped=NULL) {
##Get fold info (performance and proportion of outliers) for each forest; if
## outliers have already been dropped, don't include them in the calculation
if (is.null(dropped) | length(dropped)==0) {
foldInfo <- function(fst, data, dropped) {
data %>%
filter(HowCoded=="Hand") %>%
cbind(fst$pred %>% arrange(rowIndex) %>% select(Resample)) %>%
group_by(Resample) %>%
summarise_at(vars(ends_with("_Outlier")), mean, na.rm=TRUE) %>%
left_join(fst$resample %>% select(AccAUC, Resample), by="Resample")
}
} else {
foldInfo <- function(fst, data, dropped) {
data %>%
filter(HowCoded=="Hand") %>%
filter_at(vars(dropped), all_vars(!.)) %>%
cbind(fst$pred %>% arrange(rowIndex) %>% select(Resample)) %>%
group_by(Resample) %>%
summarise_at(vars(ends_with("_Outlier")), mean, na.rm=TRUE) %>%
select(-dropped) %>%
left_join(fst$resample %>% select(AccAUC, Resample), by="Resample")
}
}
##Get fold info for all forests and put them together into a dataframe
ret <-
fstList %>%
map_dfr(foldInfo, data=data, dropped=dropped) %>%
select(-Resample) %>%
select(AccAUC, everything()) %>%
##Get correlations between AccAUC and outlier proportions
cor() %>%
##Reduce correlation matrix to a vector
extract(1, -1) %>%
sort() %>%
tibble(Measure = names(.), TestCor = .)
ret
}
Running this function on our forest list with no outliers dropped, we find an initial ranking of outliers with F3_80 as the outlier that most hurts classifier performance. We’ll try removing F3_80 outliers first.
(negOutliers <- fstListDropOut0 %>% outlierCors(dropped=NULL))
rm(fstListDropOut0)
Start by initializing a character vector (empty, for now) where we’ll store measures that we decide to drop:
dropped <- character(0L)
Now train another list of 5 classifiers, this time dropping tokens with outliers in the measure with the biggest negative impact on performance, F3_80:
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut1_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Create and save a summary object:
smryDropOut1_1 <- fstListDropOut1_1 %>% fstSummary()
smryDropOut1_1 %>% saveRDS("Model Status/SmryDropOut1_1.Rds")
Now we can compare overall accuracy \(\times\) AUC for the 5 classifiers that drop no outliers to that for the 5 classifiers that drop tokens with F3_80 outliers. The latter set clearly outperforms the former, indicating that the performance gain from dropping F3_80 outliers is worth the loss in data.
list(smryDropOut0, smryDropOut1_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
Now we’ll proceed to drop outliers for one measure at a time, comparing classifier performance after each step with a Wilcoxon rank-sum test; we use this test rather than its parametric counterpart, the independent-samples t-test, because the distributions of classifier performance fail to satisfy the t-test’s normality assumption. We’ll keep dropping outliers until we get three consecutive results in which classifier performance fails to improve significantly at \(\alpha\) = .05. In this first case, we find a highly significant increase in classifier performance, which justifies dropping tokens with outlying F3_80 measurements.
list(smryDropOut0, smryDropOut1_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum test
data: AccAUC by Dropped
W = 19, p-value = 1.327e-05
alternative hypothesis: true location shift is less than 0
So we add F3_80 to our established set of dropped outliers, meaning we drop about 3% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut1_1 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F3_80 and intens_F3max:
rm(fstListDropOut1_1)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut2_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Create and save a summary object:
smryDropOut2_1 <- fstListDropOut2_1 %>% fstSummary()
smryDropOut2_1 %>% saveRDS("Model Status/SmryDropOut2_1.Rds")
The plot of performance suggests little performance gain from dropping intens_F3max outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping intens_F3max outliers, compared to our established set of dropped outliers.
list(smryDropOut1_1, smryDropOut2_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum test
data: AccAUC by Dropped
W = 96, p-value = 0.2562
alternative hypothesis: true location shift is less than 0
So we move on to the next measure in the ranking, F2_35, training a list of classifiers on data that drops outliers in F2_35 in addition to our established set of dropped outliers.
rm(fstListDropOut2_1)
tryDrop <- negOutliers$Measure[2]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut2_2 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Create and save a summary object:
smryDropOut2_2 <- fstListDropOut2_2 %>% fstSummary()
smryDropOut2_2 %>% saveRDS("Model Status/SmryDropOut2_2.Rds")
The plot of performance suggests further performance gain from dropping F2_35 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test indicates a significant performance gain from dropping F2_35 outliers, compared to our established set of dropped outliers.
list(smryDropOut1_1, smryDropOut2_2) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum test
data: AccAUC by Dropped
W = 14, p-value = 3.275e-06
alternative hypothesis: true location shift is less than 0
So we add F2_35 to our established set of dropped outliers, meaning we drop about 8% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut2_2 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in intens_F3min, in addition to our established set of dropped outliers:
rm(fstListDropOut2_2)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut3_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Create and save a summary object:
smryDropOut3_1 <- fstListDropOut3_1 %>% fstSummary()
smryDropOut3_1 %>% saveRDS("Model Status/SmryDropOut3_1.Rds")
The plot of performance suggests further performance gain from dropping intens_F3min outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test indicates a significant performance gain from dropping intens_F3min outliers, compared to our established set of dropped outliers.
list(smryDropOut2_2, smryDropOut3_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum test
data: AccAUC by Dropped
W = 29, p-value = 0.0001322
alternative hypothesis: true location shift is less than 0
So we add intens_F3min to our established set of dropped outliers, meaning we drop about 9.5% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut3_1 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F1_70, in addition to our established set of dropped outliers:
rm(fstListDropOut3_1)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut4_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Create and save a summary object:
smryDropOut4_1 <- fstListDropOut4_1 %>% fstSummary()
smryDropOut4_1 %>% saveRDS("Model Status/SmryDropOut4_1.Rds")
The plot of performance suggests further performance gain from dropping F1_70 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test indicates a significant performance gain from dropping F1_70 outliers, compared to our established set of dropped outliers.
list(smryDropOut3_1, smryDropOut4_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum test
data: AccAUC by Dropped
W = 28, p-value = 0.0001075
alternative hypothesis: true location shift is less than 0
So we add F1_70 to our established set of dropped outliers, meaning we drop about 13% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut4_1 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F1_50, in addition to our established set of dropped outliers:
rm(fstListDropOut4_1)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut5_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Create and save a summary object:
smryDropOut5_1 <- fstListDropOut5_1 %>% fstSummary()
smryDropOut5_1 %>% saveRDS("Model Status/SmryDropOut5_1.Rds")
The plot of performance suggests some further performance gain from dropping F1_50 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test indicates a significant performance gain from dropping F1_50 outliers, compared to our established set of dropped outliers.
list(smryDropOut4_1, smryDropOut5_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum test
data: AccAUC by Dropped
W = 53, p-value = 0.0064
alternative hypothesis: true location shift is less than 0
So we add F1_50 to our established set of dropped outliers, meaning we drop about 15% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut5_1 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F3min, in addition to our established set of dropped outliers:
rm(fstListDropOut5_1)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut6_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Create and save a summary object:
smryDropOut6_1 <- fstListDropOut6_1 %>% fstSummary()
smryDropOut6_1 %>% saveRDS("Model Status/SmryDropOut6_1.Rds")
The plot of performance suggests some further performance gain from dropping F3min outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test indicates a significant performance gain from dropping F3min outliers, compared to our established set of dropped outliers.
list(smryDropOut5_1, smryDropOut6_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum test
data: AccAUC by Dropped
W = 31, p-value = 0.000197
alternative hypothesis: true location shift is less than 0
So we add F3min to our established set of dropped outliers, meaning we drop about 17% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut6_1 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F4_40, in addition to our established set of dropped outliers:
rm(fstListDropOut6_1)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut7_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Create and save a summary object:
smryDropOut7_1 <- fstListDropOut7_1 %>% fstSummary()
smryDropOut7_1 %>% saveRDS("Model Status/SmryDropOut7_1.Rds")
The plot of performance suggests little performance gain from dropping F4_40 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1, smryDropOut7_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping F4_40 outliers, compared to our established set of dropped outliers.
list(smryDropOut6_1, smryDropOut7_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum test
data: AccAUC by Dropped
W = 119, p-value = 0.6126
alternative hypothesis: true location shift is less than 0
So we move on to the next measure in the ranking, F4_50, training a list of classifiers on data that drops outliers in F4_50 in addition to our established set of dropped outliers.
rm(fstListDropOut7_1)
tryDrop <- negOutliers$Measure[2]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut7_2 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Create and save a summary object:
smryDropOut7_2 <- fstListDropOut7_2 %>% fstSummary()
smryDropOut7_2 %>% saveRDS("Model Status/SmryDropOut7_2.Rds")
The plot of performance suggests little performance gain from dropping F4_50 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1, smryDropOut7_2) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping F4_50 outliers, compared to our established set of dropped outliers.
list(smryDropOut6_1, smryDropOut7_2) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum test
data: AccAUC by Dropped
W = 138, p-value = 0.8573
alternative hypothesis: true location shift is less than 0
So we move on to the next measure in the ranking, F1_60, training a list of classifiers on data that drops F1_60 outliers in addition to our established set of dropped outliers. That makes two consecutive nonsignificant results, so if the next result is also nonsignificant, we’ll stop the process.
rm(fstListDropOut7_2)
tryDrop <- negOutliers$Measure[3]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut7_3 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
Create and save a summary object:
smryDropOut7_3 <- fstListDropOut7_3 %>% fstSummary()
smryDropOut7_3 %>% saveRDS("Model Status/SmryDropOut7_3.Rds")
The plot of performance suggests little performance gain from dropping F1_60 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1, smryDropOut7_3) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping F1_60 outliers, compared to our established set of dropped outliers.
list(smryDropOut6_1, smryDropOut7_3) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum test
data: AccAUC by Dropped
W = 74, p-value = 0.05799
alternative hypothesis: true location shift is less than 0
That makes three consecutive nonsignificant results, so we stop the outlier-dropping process here.
rm(fstListDropOut7_3)
We’ve dropped outliers in six measures for which outliers significantly hurt classifier performance: F3_80, F2_35, intens_F3min, F1_70, F1_50, and F3min. We determined which outliers to drop not in terms of variable importance—most of these measures did not rank particularly high in variable importance—but in terms of which measures’ outliers were most negatively correlated with classifier performance.
dropped %>%
str_remove("_Outlier") %>%
set_names(., .) %>%
map_int(~ smryTuned$varImp %>% sort(decreasing=TRUE) %>% names() %>% equals(.x) %>% which())
F3_80 F2_35 intens_F3min F1_70 F1_50
39 114 13 50 58
F3min
18
Dropping these outliers comes at the price of about 1/6 of all auto-coded and hand-coded tokens.
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
The loss of these tokens pays off in an incremental increase in the performance of our classifier.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1) %>%
set_names(c("None", dropped) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1) %>%
set_names(c("None", dropped) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped) %>% str_remove("_Outlier")) %>%
select(-Model) %>%
group_by(Dropped) %>%
summarise_if(is.numeric, mean, na.rm=TRUE)
Next, we’ll train a final classifier with the settings we’ve tuned.
registerDoSEQ()
set.seed(302302)
finalClassifier <- train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier,
F1_70_Outlier, F1_50_Outlier, F3min_Outlier),
all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
Here are the properties of our final classifier: performance measures, confusion matrix, variable importance, and timing. We’ll save this final classifier to file.
finalClassifier %>% fstSummary()
$resample
AccAUC Accuracy AUC BestCutoff AccuracyHalf
1 0.7364007 0.8516624 0.8646626 0.5703111 0.8260870
2 0.7324460 0.8465473 0.8652157 0.6048942 0.8286445
3 0.7300310 0.8286445 0.8809942 0.4898394 0.8286445
4 0.7541890 0.8491049 0.8882165 0.5250767 0.8312020
5 0.7252607 0.8337596 0.8698679 0.5473815 0.8235294
6 0.7584978 0.8538462 0.8883307 0.5139561 0.8538462
7 0.7579272 0.8487179 0.8930260 0.6526381 0.8051282
8 0.7716437 0.8644501 0.8926410 0.6043831 0.8491049
9 0.7467352 0.8439898 0.8847680 0.5192844 0.8260870
10 0.7007671 0.8337596 0.8404906 0.4967894 0.8286445
11 0.7807526 0.8670077 0.9005140 0.5544693 0.8491049
12 0.7335236 0.8358974 0.8775282 0.4957651 0.8333333
13 0.7246340 0.8388747 0.8638168 0.5233698 0.8312020
14 0.7149154 0.8363171 0.8548377 0.6622127 0.8158568
15 0.8086528 0.8721228 0.9272236 0.5806561 0.8567775
16 0.7153387 0.8286445 0.8632637 0.6554881 0.8081841
17 0.7202507 0.8312020 0.8665170 0.5574746 0.8184143
18 0.7336072 0.8363171 0.8771878 0.5178709 0.8312020
19 0.7536270 0.8439898 0.8929338 0.5513929 0.8312020
20 0.7577129 0.8410256 0.9009391 0.6316360 0.8333333
21 0.7512366 0.8538462 0.8798266 0.6532548 0.8435897
22 0.7921895 0.8717949 0.9086879 0.5419931 0.8641026
23 0.7098668 0.8312020 0.8540243 0.5367921 0.8260870
24 0.7235885 0.8439898 0.8573427 0.5242439 0.8337596
25 0.7681710 0.8564103 0.8969661 0.6058937 0.8358974
26 0.7574047 0.8593350 0.8813846 0.5811873 0.8491049
27 0.7353567 0.8414322 0.8739345 0.6464735 0.8286445
28 0.7355146 0.8333333 0.8826175 0.4983272 0.8358974
29 0.6835013 0.8312020 0.8223046 0.6350204 0.7928389
30 0.7625674 0.8542199 0.8927061 0.5559661 0.8465473
31 0.7552387 0.8491049 0.8894528 0.5854690 0.8158568
32 0.7370354 0.8363171 0.8812870 0.5263519 0.8235294
33 0.7419239 0.8491049 0.8737719 0.5447746 0.8312020
34 0.7531500 0.8414322 0.8950810 0.5479886 0.8516624
35 0.7365899 0.8435897 0.8731613 0.4973709 0.8410256
36 0.7212472 0.8260870 0.8730887 0.6589222 0.8081841
ClassAccuracy_Absent ClassAccuracy_Present ClassAccuracyHalf_Absent
1 0.9290780 0.6513761 0.8758865
2 0.9468085 0.5871560 0.8865248
3 0.8865248 0.6788991 0.8936170
4 0.9078014 0.6972477 0.8758865
5 0.9113475 0.6330275 0.8723404
6 0.9255319 0.6666667 0.9148936
7 0.9680851 0.5370370 0.8546099
8 0.9609929 0.6146789 0.8900709
9 0.9042553 0.6880734 0.8758865
10 0.9007092 0.6605505 0.9007092
11 0.9539007 0.6422018 0.9184397
12 0.8936170 0.6851852 0.8936170
13 0.9007092 0.6788991 0.8829787
14 0.9716312 0.4862385 0.8900709
15 0.9751773 0.6055046 0.9078014
16 0.9645390 0.4770642 0.8758865
17 0.9255319 0.5871560 0.8900709
18 0.9290780 0.5963303 0.9184397
19 0.9184397 0.6513761 0.8758865
20 0.9468085 0.5648148 0.8687943
21 0.9468085 0.6111111 0.8900709
22 0.9397163 0.6944444 0.9148936
23 0.8971631 0.6605505 0.8794326
24 0.9007092 0.6972477 0.8829787
25 0.9645390 0.5740741 0.8865248
26 0.9503546 0.6238532 0.9078014
27 0.9290780 0.6146789 0.8687943
28 0.8936170 0.6759259 0.8971631
29 0.9609929 0.4954128 0.8794326
30 0.9397163 0.6330275 0.9078014
31 0.9468085 0.5963303 0.8758865
32 0.9078014 0.6513761 0.8829787
33 0.9326241 0.6330275 0.8900709
34 0.9148936 0.6513761 0.8900709
35 0.9042553 0.6851852 0.9042553
36 0.9539007 0.4954128 0.8687943
ClassAccuracyHalf_Present mtry splitrule min.node.size Resample
1 0.6972477 13 gini 10 Fold01.Rep1
2 0.6788991 13 gini 10 Fold02.Rep1
3 0.6605505 13 gini 10 Fold03.Rep1
4 0.7155963 13 gini 10 Fold04.Rep1
5 0.6972477 13 gini 10 Fold05.Rep1
6 0.6944444 13 gini 10 Fold06.Rep1
7 0.6759259 13 gini 10 Fold07.Rep1
8 0.7431193 13 gini 10 Fold08.Rep1
9 0.6972477 13 gini 10 Fold09.Rep1
10 0.6422018 13 gini 10 Fold10.Rep1
11 0.6697248 13 gini 10 Fold11.Rep1
12 0.6759259 13 gini 10 Fold12.Rep1
13 0.6972477 13 gini 10 Fold01.Rep2
14 0.6238532 13 gini 10 Fold02.Rep2
15 0.7247706 13 gini 10 Fold03.Rep2
16 0.6330275 13 gini 10 Fold04.Rep2
17 0.6330275 13 gini 10 Fold05.Rep2
18 0.6055046 13 gini 10 Fold06.Rep2
19 0.7155963 13 gini 10 Fold07.Rep2
20 0.7407407 13 gini 10 Fold08.Rep2
21 0.7222222 13 gini 10 Fold09.Rep2
22 0.7314815 13 gini 10 Fold10.Rep2
23 0.6880734 13 gini 10 Fold11.Rep2
24 0.7064220 13 gini 10 Fold12.Rep2
25 0.7037037 13 gini 10 Fold01.Rep3
26 0.6972477 13 gini 10 Fold02.Rep3
27 0.7247706 13 gini 10 Fold03.Rep3
28 0.6759259 13 gini 10 Fold04.Rep3
29 0.5688073 13 gini 10 Fold05.Rep3
30 0.6880734 13 gini 10 Fold06.Rep3
31 0.6605505 13 gini 10 Fold07.Rep3
32 0.6697248 13 gini 10 Fold08.Rep3
33 0.6788991 13 gini 10 Fold09.Rep3
34 0.7522936 13 gini 10 Fold10.Rep3
35 0.6759259 13 gini 10 Fold11.Rep3
36 0.6513761 13 gini 10 Fold12.Rep3
$confMat
Actual
Predicted Absent Present
Absent 9403 1569
Present 749 2346
$varImp
TokenDur F1_20 diffF3F1_20 BW1_20 F1_25
30.911940 16.335820 32.862408 11.985398 15.627758
diffF3F1_25 BW1_25 F1_30 diffF3F1_30 BW1_30
40.392839 12.102756 14.620130 50.362062 11.480090
F1_35 diffF3F1_35 BW1_35 F1_40 diffF3F1_40
14.949967 56.388952 10.582690 17.784323 84.201425
BW1_40 F1_45 diffF3F1_45 BW1_45 F1_50
11.121165 18.969431 84.292686 10.915955 21.197655
diffF3F1_50 BW1_50 F1_55 diffF3F1_55 BW1_55
99.537248 11.618726 19.039541 125.586082 11.333006
F1_60 diffF3F1_60 BW1_60 F1_65 diffF3F1_65
19.224371 118.264581 11.144018 19.412881 110.854734
BW1_65 F1_70 diffF3F1_70 BW1_70 F1_75
12.500780 21.375723 86.618344 12.003453 15.281521
diffF3F1_75 BW1_75 F1_80 diffF3F1_80 BW1_80
76.461505 13.435915 15.111977 59.660382 12.767858
F1min F1max time_F1min time_F1max F1range
12.147976 16.659691 10.039284 14.906209 14.658252
time_F1range slopeF1 F2_20 diffF3F2_20 BW2_20
17.867152 13.481181 13.442405 21.099381 14.527165
F2_25 diffF3F2_25 BW2_25 F2_30 diffF3F2_30
13.169765 23.269944 16.183169 15.400825 26.014706
BW2_30 F2_35 diffF3F2_35 BW2_35 F2_40
14.709718 16.154775 31.150467 14.253708 18.577094
diffF3F2_40 BW2_40 F2_45 diffF3F2_45 BW2_45
47.690684 11.710464 19.456580 44.892944 14.687540
F2_50 diffF3F2_50 BW2_50 F2_55 diffF3F2_55
24.832089 36.691517 13.905267 25.093486 61.194509
BW2_55 F2_60 diffF3F2_60 BW2_60 F2_65
14.285630 24.268267 64.969456 14.510267 22.533153
diffF3F2_65 BW2_65 F2_70 diffF3F2_70 BW2_70
60.846535 14.993224 26.894477 74.508886 13.242494
F2_75 diffF3F2_75 BW2_75 F2_80 diffF3F2_80
22.859192 59.812080 15.013879 22.105508 56.329374
BW2_80 F2min F2max time_F2min time_F2max
15.919198 20.667474 15.942155 15.483057 14.147046
F2range time_F2range slopeF2 F3_20 BW3_20
15.169717 13.754622 14.263354 14.348496 12.365233
F3_25 BW3_25 F3_30 BW3_30 F3_35
15.306057 13.304127 16.518102 12.059952 17.413337
BW3_35 F3_40 BW3_40 F3_45 BW3_45
12.477190 20.935256 13.980339 21.344514 13.120294
F3_50 BW3_50 F3_55 BW3_55 F3_60
20.098019 13.557393 21.328554 16.285327 23.948897
BW3_60 F3_65 BW3_65 F3_70 BW3_70
14.734931 23.961821 16.602846 27.458054 14.179753
F3_75 BW3_75 F3_80 BW3_80 F3min
22.262424 15.953980 23.853178 15.350797 59.767004
F3max time_F3min time_F3max F3range time_F3range
16.836315 11.275293 11.702691 26.141592 11.871325
slopeF3 intens_F3min intens_F3max F4_20 diffF4F3_20
21.232243 57.045374 52.454464 13.534569 15.746468
BW4_20 F4_25 diffF4F3_25 BW4_25 F4_30
12.677645 14.936757 15.301114 12.637193 16.506180
diffF4F3_30 BW4_30 F4_35 diffF4F3_35 BW4_35
14.470135 12.825110 20.189683 16.317236 13.488398
F4_40 diffF4F3_40 BW4_40 F4_45 diffF4F3_45
18.903320 15.196490 12.669967 16.752640 15.538213
BW4_45 F4_50 diffF4F3_50 BW4_50 F4_55
13.928383 16.726361 15.418162 13.724884 18.286778
diffF4F3_55 BW4_55 F4_60 diffF4F3_60 BW4_60
15.642251 12.994861 16.939063 15.817247 12.537453
F4_65 diffF4F3_65 BW4_65 F4_70 diffF4F3_70
16.032317 14.654391 12.618240 17.874112 13.460142
BW4_70 F4_75 diffF4F3_75 BW4_75 F4_80
12.529991 15.419088 13.987334 12.737561 15.394507
diffF4F3_80 BW4_80 F4min F4max time_F4min
15.506276 13.670787 25.102296 14.844825 14.078358
time_F4max F4range time_F4range slopeF4 F0min
13.149421 18.261516 12.922259 19.413098 44.339740
F0max F0rangeST time_F0min time_F0max absSlopeF0
39.061307 19.600931 7.389494 8.545474 12.825855
$times
$times$everything
user system elapsed
2193.552 2.880 392.333
$times$final
user system elapsed
65.940 0.032 11.271
finalClassifier %>% saveRDS("Data/ClassifierR.Rds")
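Because the classifier is saved as an .Rds file, it can be reloaded in a later R session without retraining; the following is a minimal illustration rather than part of the original pipeline:
##Illustrative: reload the saved classifier in a later session
finalClassifier <- readRDS("Data/ClassifierR.Rds")
finalClassifier$bestTune   ##the hyperparameters it was trained with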
Auto-coding is easy and fast. To get continuous probabilities, we pass the final classifier and non-hand-coded data (without the dependent variable and with relevant measures only) to predict()
with type="prob"
.
predict() is what’s called a “generic function” in R, with lots of different methods for objects of different classes that are passed to it. Since our final classifier is an object of class train (see class(finalClassifier)), calling predict(finalClassifier) passes the finalClassifier object to the predict.train() function that’s loaded with the caret package. To see all the methods for a generic function like predict(), run methods("predict").
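To see this dispatch in action for yourself (again, purely illustrative):
##Illustrative: how predict() finds the right method for our classifier
class(finalClassifier)            ##includes "train", so predict() dispatches to predict.train()
methods("predict")                ##every predict() method currently loaded
getS3method("predict", "train")   ##the caret method that handles train objects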
preds <-
predict(finalClassifier,
trainingData %>%
filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier,
F1_70_Outlier, F1_50_Outlier, F3min_Outlier),
all_vars(!.)) %>%
filter(HowCoded=="Auto") %>%
select(TokenDur:absSlopeF0),
type="prob")
preds %>% head()
Then, as in the custom prediction function we defined above, you can use if_else()
with a custom cutoff to convert continuous probabilities to binary.
bestCutoff <- mean(finalClassifier$resample$BestCutoff)
preds <- preds %>%
mutate(Rpresent = if_else(Present >= bestCutoff, "Present", "Absent") %>% factor())
preds %>% head()
In the case of our trainingData
dataframe, where hand-coded and auto-coded tokens are interspersed, an extra step is required to add the predictions back into the dataframe. Here, we use the same trick as with the imputation dataframe in Step 1, creating a temporary dataframe for prediction that includes a unique identifier column (which we sneak out of the dataframe before prediction and sneak back in after prediction), joining classifier probabilities with the original trainingData
dataframe, and recoding the Rpresent
dependent variable only for the auto-coded tokens with non-NA
classifier probabilities.
preds <-
trainingData %>%
filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier,
F1_70_Outlier, F1_50_Outlier, F3min_Outlier),
all_vars(!.)) %>%
filter(HowCoded=="Auto") %>%
select(MatchId, TokenDur:absSlopeF0)
preds <-
predict(finalClassifier, preds %>% select(-MatchId), type="prob") %>%
cbind(preds %>% select(MatchId))
trainingData <-
trainingData %>%
left_join(preds %>% select(MatchId, ProbPresent = Present), by="MatchId") %>%
mutate(Rpresent = case_when(
HowCoded=="Hand" ~ Rpresent %>% as.character(),
HowCoded=="Auto" & !is.na(ProbPresent) & ProbPresent >= bestCutoff ~ "Present",
HowCoded=="Auto" & !is.na(ProbPresent) & ProbPresent < bestCutoff ~ "Absent",
TRUE ~ NA_character_
) %>%
factor())
rm(preds)
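As a quick check on the recoding logic above (this snippet is illustrative and not part of the original pipeline), the only tokens still lacking an Rpresent code should be auto-coded tokens whose outliers we dropped, since their ProbPresent is NA:
##Illustrative check: NA codes should remain only for auto-coded outlier tokens
trainingData %>% count(HowCoded, Missing = is.na(Rpresent))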
And now we’ve got auto-coded tokens! Here’s a sampling:
set.seed(302302)
trainingData %>%
filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier,
F1_70_Outlier, F1_50_Outlier, F3min_Outlier),
all_vars(!.)) %>%
filter(HowCoded=="Auto") %>%
select(HowCoded, Rpresent, ProbPresent, TokenDur:absSlopeF0) %>%
sample_n(10)
Finally, we can extract resampled predictions (binary codes and classifier probabilities) for the hand-coded training set, as long as we’ve set savePredictions=TRUE
inside trainControl()
. Note that we do not recommend comparing classifier probabilities for hand-coded tokens to those of the auto-coded tokens, because they are generated by slightly different processes: the auto-coded tokens are scored by the final classifier trained on all of the (non-outlier) hand-coded data, whereas each hand-coded token is scored by resampled classifiers trained on only a subset of that data (the folds that excluded it). Nonetheless, obtaining the hand-coded probabilities can be useful for some applications (e.g., we used them to construct homogeneous vs. heterogeneous Present classes in the sample size simulations). The dataframe that results from the following code can be cbind()
-ed to our existing hand-coded data. (Note also that this code works for the repeated k-fold cross-validation method we used, but the code would need to be tweaked for different methods of generating training/test subsets.)
finalClassifier$pred %>%
arrange(rowIndex) %>%
group_by(rowIndex) %>%
summarise(ProbPresent = mean(Present, na.rm=TRUE))
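For example, that cbind() step might look like the following sketch. The name handCodedWithProbs is introduced here purely for illustration, and the filters must be applied in the same order as in the data passed to train() so that row order lines up with the classifier’s rowIndex.
##Illustrative: attach the resampled probabilities to the hand-coded tokens.
## The filters below must match (in the same order) the data used to train
## finalClassifier, so that rows line up with the classifier's rowIndex.
handCodedWithProbs <-
  trainingData %>%
  filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier,
                 F1_70_Outlier, F1_50_Outlier, F3min_Outlier),
            all_vars(!.)) %>%
  filter(HowCoded=="Hand") %>%
  cbind(finalClassifier$pred %>%
          arrange(rowIndex) %>%
          group_by(rowIndex) %>%
          summarise(ProbPresent = mean(Present, na.rm=TRUE)) %>%
          select(ProbPresent))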
This guide was generated in an R session with the following information:
sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
locale:
[1] LC_CTYPE=en_NZ.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_NZ.UTF-8 LC_COLLATE=en_NZ.UTF-8
[5] LC_MONETARY=en_NZ.UTF-8 LC_MESSAGES=en_NZ.UTF-8
[7] LC_PAPER=en_NZ.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel grid stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] diptest_0.75-7 doParallel_1.0.14 iterators_1.0.10
[4] foreach_1.4.4 DMwR_0.4.1 knitr_1.21
[7] caret_6.0-84 lattice_0.20-38 ranger_0.11.1
[10] magrittr_1.5 forcats_0.4.0 stringr_1.4.0
[13] dplyr_0.8.1 purrr_0.3.0 readr_1.3.1
[16] tidyr_0.8.2 tibble_2.1.1 ggplot2_3.1.0
[19] tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] httr_1.4.0 jsonlite_1.6 splines_3.5.2
[4] gtools_3.8.1 prodlim_2018.04.18 modelr_0.1.4
[7] assertthat_0.2.1 TTR_0.23-4 stats4_3.5.2
[10] cellranger_1.1.0 yaml_2.2.0 ipred_0.9-8
[13] pillar_1.3.1 backports_1.1.3 glue_1.3.1
[16] digest_0.6.18 rvest_0.3.2 colorspace_1.4-0
[19] recipes_0.1.4 htmltools_0.3.6 Matrix_1.2-15
[22] plyr_1.8.4 timeDate_3043.102 pkgconfig_2.0.2
[25] broom_0.5.1 haven_2.1.0 bookdown_0.10
[28] scales_1.0.0 gdata_2.18.0 gower_0.1.2
[31] lava_1.6.5 generics_0.0.2 withr_2.1.2
[34] ROCR_1.0-7 nnet_7.3-12 lazyeval_0.2.1
[37] cli_1.0.1 quantmod_0.4-13 survival_2.43-3
[40] crayon_1.3.4 readxl_1.3.0 evaluate_0.13
[43] nlme_3.1-137 MASS_7.3-51.1 gplots_3.0.1.1
[46] xts_0.11-2 xml2_1.2.0 class_7.3-15
[49] tools_3.5.2 data.table_1.12.0 hms_0.4.2
[52] munsell_0.5.0 e1071_1.7-0.1 compiler_3.5.2
[55] caTools_1.17.1.1 rlang_0.3.4 rstudioapi_0.9.0
[58] labeling_0.3 bitops_1.0-6 rmarkdown_1.11
[61] gtable_0.2.0 ModelMetrics_1.2.2 codetools_0.2-16
[64] abind_1.4-5 curl_3.3 reshape2_1.4.3
[67] R6_2.4.0 zoo_1.8-4 lubridate_1.7.4
[70] KernSmooth_2.23-15 stringi_1.3.1 Rcpp_1.0.1
[73] rpart_4.1-13 tidyselect_0.2.5 xfun_0.5
Breen, M. S., Thomas, K. G., Baldwin, D. S., & Lipinska, G. (2019). Modelling PTSD diagnosis using sleep, memory, and adrenergic metabolites: An exploratory machine-learning study. Human Psychopharmacology: Clinical and Experimental, 34(2), e2691. https://doi.org/10.1002/hup.2691
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
Freeman, J. B., & Dale, R. (2013). Assessing bimodality to detect the presence of a dual cognitive process. Behavior Research Methods (Online), 45(1), 83–97. https://doi.org/10.3758/s13428-012-0225-x
Fromont, R., & Hay, J. (2012). LaBB-CAT: An annotation store. Proceedings of Australasian Language Technology Association Workshop, 113–117. Retrieved from http://www.aclweb.org/anthology/U12-1015
Hartigan, J. A., & Hartigan, P. M. (1985). The Dip Test of Unimodality. Annals of Statistics, 13(1), 70–84. https://doi.org/10.1214/aos/1176346577
Hay, J., & Clendon, A. (2012). (Non)rhoticity: Lessons from New Zealand English. In T. Nevalainen & E. C. Traugott (Eds.), The Oxford Handbook of the history of English (pp. 761–772). Oxford: Oxford University Press.
Heselwood, B. (2009). Rhoticity without F3: Lowpass filtering and the perception of rhoticity in ’NORTH/FORCE,’ ’START,’ and ’NURSE’ words. Leeds Working Papers in Linguistics and Phonetics, 14, 49–64. https://doi.org/10.1.1.500.6321
Hudak, A. T., Crookston, N. L., Evans, J. S., Hall, D. E., & Falkowski, M. J. (2008). Nearest neighbor imputation of species-level, plot-scale forest structure attributes from LiDAR data. Remote Sensing of Environment, 112(5), 2232–2245. https://doi.org/https://doi.org/10.1016/j.rse.2007.10.009
Kuhn, M. (2018). Caret. Retrieved from https://CRAN.R-project.org/package=caret
Lawson, E., Stuart-Smith, J., & Scobbie, J. (2018). The role of gesture delay in coda /r/ weakening: An articulatory, auditory and acoustic study. Journal of the Acoustical Society of America, 143(3), 1646–1657. https://doi.org/10.1121/1.5027833
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22. Retrieved from https://CRAN.R-project.org/doc/Rnews/
Maechler, M. (2016). Diptest. Retrieved from https://cran.r-project.org/package=diptest
R Core Team. (2018). R: A language and environment for statistical computing. Retrieved from https://www.R-project.org/
Torgo, L. (2010). Data Mining with R, learning with case studies. Retrieved from http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR
Wright, M. N., & Ziegler, A. (2017). Ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01