forced-alignment-comparison.Rmd
LaBB-CAT integrates with several forced aligners, which automatically determine the start and end time of each word, and of each segment within each word, given each utterance's transcript and start/end time, and the corresponding wav file.
LaBB-CAT can also compare alignments using a module that maps annotations on one layer to those on another, and computes the overlap rate of the annotation pairs; i.e. a measure of how much the alignments agree with each other.
This example shows how, given a set of manual word/segment alignments (in this case, New Zealand English utterances), it’s possible to run several different forced alignment configurations, and compare them with the manual alignments, in order to determine which forced alignment configuration works best for the type of speech data being processed.
Almost all the operations needed for forced alignment comparison can be implemented directly in code. However, the annotator modules used must already be installed in LaBB-CAT. In this case, there is a local LaBB-CAT instance that already has the following annotator modules installed:

- CMU Dictionary Tagger (CMUDictionaryTagger)
- HTK Aligner (HTKAligner)
- Montreal Forced Aligner (MFA)
- BAS Annotator (BASAnnotator)
- Label Mapper (LabelMapper)
Defining the location of the LaBB-CAT server and of the transcript/media files is also required:
library(nzilbb.labbcat)
dataDir <- "data"
transcriptFiles <- dir(path = dataDir, pattern = ".*\\.TextGrid$", full.names = FALSE)
url <- Sys.getenv('TEST_ADMIN_LABBCAT_URL') # load details from .Renviron file
credentialError <- labbcatCredentials(
url, Sys.getenv('TEST_ADMIN_LABBCAT_USERNAME'), Sys.getenv('TEST_ADMIN_LABBCAT_PASSWORD'))
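Since the annotator modules listed above must already be installed, it can be worth confirming that the server recognises them before going any further. A minimal sketch, assuming (an assumption, not guaranteed by the package) that getAnnotatorDescriptor returns NULL for a module that is not installed:

# sanity check: confirm each required annotator module is installed
# (assumes getAnnotatorDescriptor returns NULL for unknown annotator IDs)
requiredAnnotators <- c(
  "CMUDictionaryTagger", "HTKAligner", "MFA", "BASAnnotator", "LabelMapper")
for (annotatorId in requiredAnnotators) {
  if (is.null(getAnnotatorDescriptor(url, annotatorId))) {
    cat(paste("Annotator module not installed:", annotatorId, "\n"))
  }
}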
This process is designed to be re-runnable, so we don’t assume that the LaBB-CAT instance is completely empty. The following code removes any previously-created configuration and data.
# delete layers
deleteLayer(url, "cmuDictPhonemes")
deleteLayer(url, "p2fa")
deleteLayer(url, "p2faPhone")
deleteLayer(url, "p2faComp")
deleteLayer(url, "mfaGAm")
deleteLayer(url, "mfaGAmPhone")
deleteLayer(url, "mfaGAmComp")
deleteLayer(url, "mfaUKE")
deleteLayer(url, "mfaUKEPhone")
deleteLayer(url, "mfaUKEComp")
deleteLayer(url, "maus")
deleteLayer(url, "mausPhone")
deleteLayer(url, "mausComp")
# delete any pre-existing transcripts
for (transcriptName in transcriptFiles) {
# if the transcript is in LaBB-CAT...
if (length(getMatchingTranscriptIds(url, paste("id = '", transcriptName, "'", sep=""))) > 0) {
# ...delete it
deleteTranscript(url, transcriptName)
}
} # next textGrid
There are several configurations for forced alignment compared here, including different forced aligners, and different configurations of the same forced aligner.
These can be compared to the manual alignments by mapping manually aligned words to automatically aligned words, and then within each word token, mapping the manually aligned phones to the automatically aligned phones.
This is performed using the Label Mapper module, which is configured to map and compare each of the forced alignment configurations.
The Hidden Markov Model Toolkit (HTK) is a speech recognition toolkit developed at Cambridge University. Integration with LaBB-CAT involves specifying the layer the orthographic transcription comes from, and another layer that provides phonemic transcriptions for each word token, in addition to setting up layers to receive the word and phoneme alignments.
We will use the CMU Pronouncing Dictionary to provide word pronunciations for our corpus, via the LaBB-CAT module that integrates with it:
getAnnotatorDescriptor(url, "CMUDictionaryTagger")$taskParameterInfo
This annotator tags words with their pronunciations according to the CMU Pronouncing dictionary, a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions.
Configuration parameters are encoded as a URL query string, e.g.
tokenLayerId=orthography&pronunciationLayerId=phonemes&encoding=DISC
Parameters are:
The cmuDictPhonemes layer is configured as follows:
cmuDictPhonemes = newLayer(
url, "cmuDictPhonemes",
description = "Phonemic transcriptions according to the CMU Pronouncing Dictionary",
alignment = 0, parent.id = "word",
annotator.id = "CMUdict",
annotator.task.parameters = paste(
"tokenLayerId=orthography", # get word tokens from orthography layer
"pronunciationLayerId=cmuDictPhonemes",
"encoding=CMU", # Use original ARPAbet encoding, not CELEX DISC encoding
"transcriptLanguageLayerId=", # not filtering tagging by language...
"phraseLanguageLayerId=",
sep = "&"))
The LaBB-CAT module that integrates with HTK can be configured in various ways, for example to train acoustic models from scratch on the corpus data itself, or to use pre-trained models from another corpus.
The module includes pre-trained acoustic models from the University of Pennsylvania Phonetics Lab Forced Aligner (P2FA). The P2FA models were trained on 25.5 hours of speech by adult American English speakers, specifically speech of eight Supreme Court Justices selected from oral arguments in the Supreme Court of the United States (SCOTUS) corpus.
The LaBB-CAT module that integrates with HTK is called HTKAligner:
getAnnotatorDescriptor(url, "HTKAligner")$taskParameterInfo
The HTK Aligner can use words with phonemic transcriptions, and the corresponding audio, to force-align words and phones; i.e. determine the start and end time of each speech sound within each word, and thus the start/end times of the words.
Configuration parameters are encoded as a URL query string, e.g.
orthographyLayerId=orthography&pauseMarkers=-&pronunciationLayerId=cmudict&noiseLayerId=noise&mainUtteranceGrouping=Speaker&otherUtteranceGrouping=Not Aligned&noisePatterns=laugh.* unclear .*noise.*&overlapThreshold=5&wordAlignmentLayerId=word&phoneAlignmentLayerId=segment&utteranceTagLayerId=htk&cleanupOption=100
NB Ensure the configuration string provided has all parts correctly URI-encoded; in particular, the space delimiter of noisePatterns should be encoded as %20 or +.
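For example, in R the noisePatterns value could be encoded with the base utils function URLencode before being pasted into the parameter string (a small illustration only; any URI-encoding function will do):

# URI-encode the space-delimited noisePatterns value for use in the query string
noisePatternsValue <- URLencode("laugh.* unclear .*noise.*", reserved = TRUE)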
Parameters are:
For the cleanupOption parameter, the possible values are:

- 75: Working files should be deleted if alignment succeeds.
- 25: Working files should be deleted if alignment fails.
- 100: Working files should be deleted whether alignment succeeds or not.
- 0: Working files should not be deleted, whether alignment succeeds or not.
The configuration for using the P2FA pre-trained models is:
p2fa <- newLayer(
url, "p2fa",
description = "Word alignments from HTK using the P2FA pretrained acoustic models.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "HTK",
annotator.task.parameters = paste(
"orthographyLayerId=orthography",
"pronunciationLayerId=cmuDictPhonemes", # pronunciations come from the CMU Dict layer
"useP2FA=on", # use pre-trained P2FA models
"overlapThreshold=5", # ignore utterances that overlap more than 5%
"wordAlignmentLayerId=p2fa", # save word alignments to this layer
"phoneAlignmentLayerId=p2faPhone", # this layer will be created by the annotator
"cleanupOption=100", sep="&"))
The Montreal Forced Aligner (MFA) is another forced alignment system, which uses the Kaldi ASR toolkit instead of HTK.
The LaBB-CAT module that integrates with the Montreal Forced Aligner is called MFA:
getAnnotatorDescriptor(url, "MFA")$taskParameterInfo
The MFA Annotator integrates with the Montreal Forced Aligner, which can use words with phonemic transcriptions, and the corresponding audio, to force-align words and phones; i.e. determine the start and end time of each speech sound within each word, and thus the start/end times of the words.
Configuration parameters are encoded as a URL query string, e.g.
orthographyLayerId=orthography&dictionaryName=english&modelsName=english&wordAlignmentLayerId=word&phoneAlignmentLayerId=segment&utteranceTagLayerId=mfa
Parameters are:
MFA provides a set of pre-trained English models, trained on 982 hours of speech by 2,484 American English speakers from the LibriSpeech corpus (Panayotov, Chen, Povey & Khudanpur, 2015, “Librispeech: an ASR corpus based on public domain audio books”, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206-5210, DOI:10.1109/ICASSP.2015.7178964), and pronunciations from an ARPAbet-encoded General American English dictionary.
A layer is set up to use this configuration:
mfaGAm <- newLayer(
url, "mfaGAm",
description = "Word alignments from MFA using an ARPAbet dictionary and pretrained models.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "MFA",
annotator.task.parameters = paste(
"orthographyLayerId=orthography",
"dictionaryName=english_us_arpa",
"modelsName=english_us_arpa",
"wordAlignmentLayerId=mfaGAm", # save word alignments to this layer
"phoneAlignmentLayerId=mfaGAmPhone", # this will be created by the annotator
sep="&"))
MFA also includes IPA-encoded models trained on 3687 hours of English speech of a number of varieties, and an IPA-encoded British English pronunciation dictionary which may perform better for our New Zealand English data, as both varieties are non-rhotic (unlike the US English dictionary used for the mfaGAm layer above).
This configuration uses pre-trained English models and pronunciations from an IPA encoded British English dictionary.
# create layer
mfaUKE <- newLayer(
url, "mfaUKE",
description = "Word alignments from MFA using an IPA dictionary and pretrained models.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "MFA",
annotator.task.parameters = paste(
"orthographyLayerId=orthography",
"dictionaryName=english_uk_mfa",
"modelsName=english_mfa",
"wordAlignmentLayerId=mfaUKE", # save word alignments to this layer
"phoneAlignmentLayerId=mfaUKEPhone", # this will be created by the annotator
sep="&"))
The MAUSBasic web service, part of CLARIN-D’s BAS Web Services suite, provides access to the MAUS forced aligner, which supports a number of languages and varieties.
LaBB-CAT integrates with the BAS Web Services using a module called BASAnnotator (which also provides access to the G2P BAS web service):
getAnnotatorDescriptor(url, "BASAnnotator")$taskParameterInfo
This annotator connects to the BAS web services - http://hdl.handle.net/11858/00-1779-0000-0028-421B-4, hosted by Ludwig Maximilians Universität München - for various annotation tasks.
Current annotation tasks include:

- MAUSBasic: forced alignment of words and phones.
- G2P: grapheme-to-phoneme conversion.
Please note that using these services requires sending transcript, annotation, and audio data over the internet to the external provider of these services.
Configuration parameters are encoded as a URL query string, e.g.
orthographyLayerId=orthography&service=MAUSBasic&forceLanguageMAUSBasic=deu-DE&phonemeEncoding=disc&wordAlignmentLayerId=word&phoneAlignmentLayerId=segment&utteranceTagLayerId=bas
Parameters are:
New Zealand English is supported explicitly by MAUSBasic, so a layer is set up using the following configuration:
maus <- newLayer(
url, "maus",
description = "Word alignments from MAUSBasic web service provided by the BAS Web Services.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "BAS",
annotator.task.parameters = paste(
"service=MAUSBasic",
"orthographyLayerId=orthography",
"forceLanguageMAUSBasic=eng-NZ", # New Zealand English
"phonemeEncoding=disc", # use DISC for phone labels
"wordAlignmentLayerId=maus", # save word alignments to this layer
"phoneAlignmentLayerId=mausPhone", # this will be created by the annotator
sep = "&"))
The speech to be force-aligned is in New Zealand English, and has been transcribed in Praat. This corpus is (very!) small, but enough to illustrate what’s possible.
TextGrid files are uploaded with their corresponding .wav files. The TextGrids include manually aligned words and segments (phones). The phone labels use the CELEX ‘DISC’ encoding, which uses exactly one character per phoneme (e.g. ‘different’ is transcribed dIfr@nt). These will be the ‘gold standard’ alignments for evaluating each forced alignment configuration.
# for each transcript
for (transcriptName in transcriptFiles) {
transcript <- file.path(dataDir, transcriptName)
# locate recording
noExtension <- substr(transcriptName, 1, nchar(transcriptName) - 9)
wav <- file.path(dataDir, paste(noExtension, ".wav", sep=""))
if (!file.exists(wav)) {
wav <- file.path(dataDir, paste(noExtension, ".WAV", sep=""))
}
if (!file.exists(wav)) cat(paste(wav, "doesn't exist\n"))
if (!file.exists(transcript)) cat(paste(transcript, "doesn't exist\n"))
# upload the transcript/recording
newTranscript(url, transcript, wav, no.progress = TRUE)
} # next transcript
cat(paste("Transcripts uploaded: ", length(transcriptFiles), "\n"))
## Transcripts uploaded: 1
At this point the manually aligned transcripts have been uploaded, and all of the forced-alignment configurations have been run.
In order to compare the word and phone alignments produced by the forced aligners with the manual alignments, we use LaBB-CAT’s Label Mapper layer manager:
getAnnotatorDescriptor(url, "LabelMapper")$info
This annotator creates a mapping between the labels of pairs of layers, by finding the minimum edit path between them.
For example, this layer manager can be used to tag each phone in a word with its likely counterpart in another phonemic transcription:
d | ɪ | f | ə | ɹ | n̩ | t |
↓ | ↓ | ↓ | ↓ | ↓ | ↓ | |
d | ɪ | f | ɹ | ən | t |
… or phonemic encoding:
d | ɪ | f | ə | ɹ | n̩ | t |
↓ | ↓ | ↓ | ↓ | ↓ | ↓ | ↓ |
D | IH1 | F | ER0 | R | AH0 N | T |
One possible use is to map non-rhotic phones to their equivalent rhotic transcription:
f | ɑɪ | ə | f | ɑɪ | t | ə |
↓ | ↓ | ↓ | ↓ | ↓ | ↓ | ↓ |
f | ɑɪ | əɹ | f | ɑɪ | t | əɹ |
The Label Mapper can also be used to compare alternative word/phone alignments - e.g. you may have manual word/phone alignments that you want to compare with word/phone alignments automatically generated by HTK, in order to measure the accuracy of the automatic alignments.
In this case, you specify two mappings: a main mapping between a pair of word alignment layers, and a sub-mapping between a pair of phone alignment layers.
| Mapping | Layer | Example: “different” | Example: “firefighter” |
---|---|---|---|
| Word Source | CMUDictWord | DIFFERENT | FIREFIGHTER |
| Word Target | orthography | different | firefighter |
| Phone Source | P2FAPhone | D IH1 F ER0 R AH0 N T | F AY1 R F AY2 T ER0 |
| Phone Target | segments | d I f r @ n t | f 2 @ f 2 t @ |
These mappings are tracked in detail, to facilitate comparison of alignment and labelling. Mapping information can be accessed via the 'extensions' page of this annotator.
The details of the Label Mapper configuration are:
getAnnotatorDescriptor(url, "LabelMapper")$taskParameterInfo
This annotator creates a mapping between the labels of two layers, a source layer and a target layer, by finding the minimum edit path between them.
A phone sub-mapping may be configured if the main mapping is between word tokens on one layer and word tokens on another layer; in this case, phone tokens on a source layer are also mapped to phone tokens on a target layer - i.e. you can compare two sets of word/phone annotations.
The words on the sourceLayerId are assumed to be divided into phones that are on the subSourceLayerId, and the words on the targetLayerId are assumed to be divided into phones that are on the subTargetLayerId.
These mappings are tracked in detail, to facilitate comparison of alignment and labelling. Mapping information can be accessed via this annotator's extensions.
Configuration parameters are encoded as a URL query string, e.g.
sourceLayerId=orthography&splitLabels=&mappingLayerId=&comparator=CharacterToCharacter&targetLayerId=mfaPretrainedARPAbet&subSourceLayerId=segment&subComparator=DISCToArpabet&subTargetLayerId=mfaPretrainedARPAbetPhone
For the comparator parameter (and likewise for subComparator), the possible values are:

- CharacterToCharacter: Orthography → Orthography
- OrthographyToDISC: Orthography → DISC
- OrthographyToArpabet: Orthography → ARPAbet
- DISCToDISC: DISC → DISC
- DISCToArpabet: DISC → ARPAbet
- ArpabetToDISC: ARPAbet → DISC
- IPAToIPA: IPA → IPA
- DISCToIPA: DISC → IPA
- IPAToDISC: IPA → DISC

For the splitLabels parameter, the possible values are:

- char: Use the most similar annotation in the scope, split its label into characters, and map each to labels on the target layer (e.g. DISC word transcriptions to phones).
- space: Use the most similar annotation in the scope, split its label on spaces, and map each to labels on the target layer (e.g. ARPAbet word transcriptions to phones).
A comparison is set up for each of the forced-alignment configurations above; i.e. each automatic word alignment is matched to a corresponding manual word alignment, and within each word, each automatic phone alignment is matched to a corresponding manual phone alignment. Once this is done, it’s possible to measure the degree to which the automatic alignments overlap with the manual ones.
# compare p2fa alignments with manual ones
p2faComp <- newLayer(
url, "p2faComp",
description = "Compare P2FA alignments with Manual alignments.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "labelmapper",
annotator.task.parameters = paste(
"sourceLayerId=orthography",
"targetLayerId=p2fa",
"splitLabels=", # no splitting; target to source annotations map 1:1
"comparator=CharacterToCharacter", # word tokens use plain orthography
"subSourceLayerId=segment",
"subTargetLayerId=p2faPhone",
"subComparator=DISCToArpabet", # phone tokens in p2faPhone use ARPAbet encoding
sep="&"))
generateLayer(url, "p2faComp")
## [1] "Finished."
# compare MFA ARPAbet alignments with manual ones
mfaGAmComp <- newLayer(
url, "mfaGAmComp",
description = "Compare MFA ARPAbet pretrained-model alignments with Manual alignments.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "labelmapper",
annotator.task.parameters = paste(
"sourceLayerId=orthography",
"targetLayerId=mfaGAm",
"splitLabels=", # no splitting; target to source annotations map 1:1
"comparator=CharacterToCharacter", # word tokens use plain orthography
"subSourceLayerId=segment",
"subTargetLayerId=mfaGAmPhone",
"subComparator=DISCToArpabet", # phone tokens in mfaGAmPhone use ARPAbet encoding
sep="&"))
generateLayer(url, "mfaGAmComp")
## [1] "Finished."
# compare MFA non-rhotic IPA alignments with manual ones
mfaUKEComp <- newLayer(
url, "mfaUKEComp",
description = "Compare MFA IPA pretrained-model alignments with Manual alignments.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "labelmapper",
annotator.task.parameters = paste(
"sourceLayerId=orthography",
"targetLayerId=mfaUKE",
"splitLabels=", # no splitting; target to source annotations map 1:1
"comparator=CharacterToCharacter", # word tokens use plain orthography
"subSourceLayerId=segment",
"subTargetLayerId=mfaUKEPhone",
"subComparator=DISCToIPA", # phone tokens in mfaUKEPhone use IPA encoding
sep="&"))
generateLayer(url, "mfaUKEComp")
## [1] "Finished."
# compare MAUSBasic alignments with manual ones
mausComp <- newLayer(
url, "mausComp",
description = "Compare MAUS Basic New Zealand English alignments with Manual alignments.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "labelmapper",
annotator.task.parameters = paste(
"sourceLayerId=orthography",
"targetLayerId=maus",
"splitLabels=", # no splitting; target to source annotations map 1:1
"comparator=CharacterToCharacter", # word tokens use plain orthography
"subSourceLayerId=segment",
"subTargetLayerId=mausPhone",
"subComparator=DISCToDISC", # phone tokens in mausPhone use DISC encoding
sep="&"))
generateLayer(url, "mausComp")
## [1] "Finished."
The LabelMapper has an extended API for exporting information about the edit-paths computed, and the resulting alignment comparisons.
getAnnotatorDescriptor(url, "LabelMapper")$extApiInfo
This annotator creates a mapping between the labels of a pair of layers, by finding the minimum edit path between them. If any sub-mappings have been configured - where two pairs of layers are mapped, two word layers and two corresponding phone layers - these mappings are tracked in detail so that alignments and label assignments can be compared.
This API provides access to tracked sub-mapping information, including raw mapping data and summary information including mean Overlap Rate.
Paulo and Oliveira (2004) devised Overlap Rate (OvR) to compare alignments, which measures how much two intervals overlap, independent of their absolute durations. OvR is calculated as follows:
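$$\textrm{OvR} = \frac{Dur_{common}}{Dur_{max}} = \frac{Dur_{common}}{Dur_{1} + Dur_{2} - Dur_{common}}$$

…where $Dur_1$ and $Dur_2$ are the durations of the two intervals being compared, $Dur_{common}$ is the duration of their overlap, and $Dur_{max}$ is the duration of the union of the two intervals (the notation here is a restatement of Paulo and Oliveira's definition).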
The result is a value between 0 and 1. A value of 0 means that the two intervals do not overlap at all, with 1 meaning they completely overlap (i.e., the alignments exactly agree). Overlap Rate has several advantages over comparing start/end offsets directly:
The extension API can be used to list and download tracked mappings, using the following endpoints, accessed with an HTTP GET request:
The mean overlap rates for the forced alignment configurations can be directly compared, by extracting the mapping summaries and comparing them:
p2faPhone <- jsonlite::fromJSON(
annotatorExt(url, "LabelMapper", "summarizeMapping", list("segment","p2faPhone")))
mfaGAmPhone <- jsonlite::fromJSON(
annotatorExt(url, "LabelMapper", "summarizeMapping", list("segment","mfaGAmPhone")))
mfaUKEPhone <- jsonlite::fromJSON(
annotatorExt(url, "LabelMapper", "summarizeMapping", list("segment","mfaUKEPhone")))
mausPhone <- jsonlite::fromJSON(
annotatorExt(url, "LabelMapper", "summarizeMapping", list("segment","mausPhone")))
knitr::kable(rbind(p2faPhone, mfaGAmPhone, mfaUKEPhone, mausPhone))
| | meanOverlapRate | sourceCount | stepCount | targetCount | utteranceCount |
---|---|---|---|---|---|
| p2faPhone | 0.572620566232039 | 182 | 187 | 186 | 3 |
| mfaGAmPhone | 0.657239987452307 | 182 | 188 | 186 | 3 |
| mfaUKEPhone | 0.628982581365989 | 182 | 186 | 178 | 3 |
| mausPhone | 0.625540272980146 | 182 | 183 | 178 | 3 |
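The same summaries can also be compared graphically; a minimal sketch using base R graphics, taking the meanOverlapRate field from the summaries retrieved above:

# plot the mean overlap rate of each forced-alignment configuration
meanOvR <- c(p2fa = as.numeric(p2faPhone$meanOverlapRate),
             mfaGAm = as.numeric(mfaGAmPhone$meanOverlapRate),
             mfaUKE = as.numeric(mfaUKEPhone$meanOverlapRate),
             maus = as.numeric(mausPhone$meanOverlapRate))
barplot(meanOvR, ylim = c(0, 1), ylab = "Mean Overlap Rate")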
The LabelMapper module can also provide fine-grained details of the mappings - i.e. exactly which manual phones were mapped to which automatic phones.
# get edit paths for manual-to-P2FA comparison
p2faPhoneEditPaths <- read.csv(
text = annotatorExt(url, "LabelMapper", "mappingToCsv", list("segment","p2faPhone")))
# show a sample of the paths to give an idea
knitr::kable(head(
p2faPhoneEditPaths[
c("sourceParentLabel", "sourceLabel", "sourceStart","sourceEnd",
"targetLabel", "targetStart","targetEnd", "overlapRate")]),
# tweak some column names for display purposes
col.names = c("Word","sourceLabel", "sourceStart","sourceEnd",
"targetLabel", "targetStart","targetEnd", "OvR"))
Word | sourceLabel | sourceStart | sourceEnd | targetLabel | targetStart | targetEnd | OvR |
---|---|---|---|---|---|---|---|
yes | j | 6.384000 | 6.622325 | Y | 6.414 | 6.644 | 0.8012497 |
yes | E | 6.622325 | 6.754000 | EH1 | 6.644 | 6.754 | 0.8353896 |
yes | s | 6.754000 | 6.854000 | S | 6.754 | 6.864 | 0.9090909 |
on | Q | 6.854000 | 6.944000 | AA1 | 6.864 | 6.954 | 0.8000000 |
on | n | 6.944000 | 7.054000 | N | 6.954 | 7.134 | 0.5263158 |
the | D | 8.311241 | 8.350198 | DH | 7.564 | 7.694 | 0.0000000 |
This allows closer analysis of the forced-alignments, beyond the overlap rate summary.
For example, it allows us to identify which phones produced by the forced aligner have no corresponding phone in the manual alignments; forced aligners that use a rhotic dictionary are expected to produce many spurious /r/ segments that are not actually present in the non-rhotic speech in the corpus:
# identify all edit steps where a phone is added - i.e. the source phone label is empty
spuriousSegments <- subset(p2faPhoneEditPaths, sourceLabel == "")
# count these spurious segments grouping by their label to see which is most common
knitr::kable(
aggregate(spuriousSegments$targetLabel,
by=list(spuriousSegments$targetLabel),
FUN=length),
col.names = c("Spurious Segment", "Count"))
Spurious Segment | Count |
---|---|
AH0 | 1 |
D | 1 |
R | 2 |
V | 1 |
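The same edit paths can be summarised in other ways; for example, a quick sketch (using the data frame already downloaded above) of mean overlap rate by manual phone label, showing which segment types P2FA aligns best and worst:

# mean overlap rate per manual (source) phone label, excluding spurious insertions
alignedPhones <- subset(p2faPhoneEditPaths, sourceLabel != "")
knitr::kable(
  aggregate(overlapRate ~ sourceLabel, data = alignedPhones, FUN = mean),
  col.names = c("Manual Phone", "Mean OvR"))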
With this detailed mapping information, it’s possible to: