forced-alignment-comparison.Rmd
LaBB-CAT integrates with several forced aligners, which automatically determine the start and end time of each word, and of each segment within each word, given each utterance's transcript and start/end time, and the corresponding wav file.
LaBB-CAT can also compare alignments using a module that maps annotations on one layer to those on another, and computes the overlap rate of the annotation pairs; i.e. a measure of how much the alignments agree with each other.
This example shows how, given a set of manual word/segment alignments (in this case, New Zealand English utterances), it’s possible to run several different forced alignment configurations, and compare them with the manual alignments, in order to determine which forced alignment configuration works best for the type of speech data being processed.
Almost all the operations needed for forced alignment comparison can be implemented directly in code. However, the annotator modules used must already be installed in LaBB-CAT. In this case, there is a local LaBB-CAT instance that already has the following annotator modules installed:

- CMU Dictionary Tagger (CMUDictionaryTagger)
- HTK Aligner (HTKAligner)
- Montreal Forced Aligner (MFA)
- BAS Annotator (BASAnnotator)
- Label Mapper (LabelMapper)
Defining the location of the LaBB-CAT server and of the transcript/media files is also required:
library(nzilbb.labbcat)
dataDir <- "data"
transcriptFiles <- dir(path = dataDir, pattern = ".*\\.TextGrid$", full.names = FALSE)
url <- Sys.getenv('TEST_ADMIN_LABBCAT_URL') # load details from .Renviron file
credentialError <- labbcatCredentials(
url, Sys.getenv('TEST_ADMIN_LABBCAT_USERNAME'), Sys.getenv('TEST_ADMIN_LABBCAT_PASSWORD'))
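Since the annotator modules listed above must already be installed, it can be worth confirming that the server recognises them before going any further. A minimal sketch, assuming (an assumption, not guaranteed by the package) that getAnnotatorDescriptor returns NULL for a module that is not installed:

# sanity check: confirm each required annotator module is installed
# (assumes getAnnotatorDescriptor returns NULL for unknown annotator IDs)
requiredAnnotators <- c(
  "CMUDictionaryTagger", "HTKAligner", "MFA", "BASAnnotator", "LabelMapper")
for (annotatorId in requiredAnnotators) {
  if (is.null(getAnnotatorDescriptor(url, annotatorId))) {
    cat(paste("Annotator module not installed:", annotatorId, "\n"))
  }
}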
This process is designed to be re-runnable, so we don’t assume that the LaBB-CAT instance is completely empty. The following code removes any previously-created configuration and data.
# delete layers
deleteLayer(url, "cmuDictPhonemes")
deleteLayer(url, "p2fa")
deleteLayer(url, "p2faPhone")
deleteLayer(url, "p2faComp")
deleteLayer(url, "mfaGAm")
deleteLayer(url, "mfaGAmPhone")
deleteLayer(url, "mfaGAmComp")
deleteLayer(url, "mfaUKE")
deleteLayer(url, "mfaUKEPhone")
deleteLayer(url, "mfaUKEComp")
deleteLayer(url, "maus")
deleteLayer(url, "mausPhone")
deleteLayer(url, "mausComp")
# delete any pre-existing transcripts
for (transcriptName in transcriptFiles) {
# if the transcript is in LaBB-CAT...
if (length(getMatchingTranscriptIds(url, paste("id = '", transcriptName, "'", sep=""))) > 0) {
# ...delete it
deleteTranscript(url, transcriptName)
}
} # next textGrid
There are several configurations for forced alignment compared here, including different forced aligners, and different configurations of the same forced aligner.
These can be compared to the manual alignments by mapping manually aligned words to automatically aligned words, and then within each word token, mapping the manually aligned phones to the automatically aligned phones.
This is performed using the Label Mapper module, which is configured to map and compare each of the forced alignment configurations.
The Hidden Markov Model Toolkit (HTK) is a speech recognition toolkit developed at Cambridge University. Integration with LaBB-CAT involves specifying the layer the orthographic transcription comes from, and another layer that provides phonemic transcriptions for each word token, in addition to setting up layers to receive the word and phoneme alignments.
We will use the CMU Pronouncing Dictionary to provide word pronunciations for our corpus, via the LaBB-CAT module that integrates with it:
getAnnotatorDescriptor(url, "CMUDictionaryTagger")$taskParameterInfo
This annotator tags words with their pronunciations according to the CMU Pronouncing dictionary, a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions.
Configuration parameters are encoded as a URL query string, e.g.
tokenLayerId=orthography&pronunciationLayerId=phonemes&encoding=DISC
Parameters are:
The cmuDictPhonemes layer is configured as follows:
cmuDictPhonemes = newLayer(
url, "cmuDictPhonemes",
description = "Phonemic transcriptions according to the CMU Pronouncing Dictionary",
alignment = 0, parent.id = "word",
annotator.id = "CMUdict",
annotator.task.parameters = paste(
"tokenLayerId=orthography", # get word tokens from orthography layer
"pronunciationLayerId=cmuDictPhonemes",
"encoding=CMU", # Use original ARPAbet encoding, not CELEX DISC encoding
"transcriptLanguageLayerId=", # not filtering tagging by language...
"phraseLanguageLayerId=",
sep = "&"))
The LaBB-CAT module that integrates with HTK can be configured in various ways, for example to train acoustic models from scratch on the corpus data itself, or to use pre-trained models from another corpus.
The module includes pre-trained acoustic models from the University of Pennsylvania Phonetics Lab Forced Aligner (P2FA). The P2FA models were trained on 25.5 hours of speech by adult American English speakers, specifically speech of eight Supreme Court Justices selected from oral arguments in the Supreme Court of the United States (SCOTUS) corpus.
The LaBB-CAT module that integrates with HTK is called HTKAligner:
getAnnotatorDescriptor(url, "HTKAligner")$taskParameterInfo
The HTK Aligner can use words with phonemic transcriptions, and the corresponding audio, to force-align words and phones; i.e. determine the start and end time of each speech sound within each word, and thus the start/end times of the words.
Configuration parameters are encoded as a URL query string, e.g.
orthographyLayerId=orthography&pauseMarkers=-&pronunciationLayerId=cmudict&noiseLayerId=noise&mainUtteranceGrouping=Speaker&otherUtteranceGrouping=Not Aligned&noisePatterns=laugh.* unclear .*noise.*&overlapThreshold=5&wordAlignmentLayerId=word&phoneAlignmentLayerId=segment&utteranceTagLayerId=htk&cleanupOption=100
NB Ensure the configuration string provided has all parts correctly URI-encoded; in particular, the space delimiter of noisePatterns should be encoded as %20 or +.
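For example, in R the noisePatterns value could be encoded with the base utils function URLencode before being pasted into the parameter string (a small illustration only; any URI-encoding function will do):

# URI-encode the space-delimited noisePatterns value for use in the query string
noisePatternsValue <- URLencode("laugh.* unclear .*noise.*", reserved = TRUE)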
Parameters are:
For the cleanupOption parameter, the possible values are:

- 75: Working files should be deleted if alignment succeeds.
- 25: Working files should be deleted if alignment fails.
- 100: Working files should be deleted whether alignment succeeds or not.
- 0: Working files should not be deleted, whether alignment succeeds or not.
The configuration for using the P2FA pre-trained models is:
p2fa <- newLayer(
url, "p2fa",
description = "Word alignments from HTK using the P2FA pretrained acoustic models.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "HTK",
annotator.task.parameters = paste(
"orthographyLayerId=orthography",
"pronunciationLayerId=cmuDictPhonemes", # pronunciations come from the CMU Dict layer
"useP2FA=on", # use pre-trained P2FA models
"overlapThreshold=5", # ignore utterances that overlap more than 5%
"wordAlignmentLayerId=p2fa", # save word alignments to this layer
"phoneAlignmentLayerId=p2faPhone", # this layer will be created by the annotator
"cleanupOption=100", sep="&"))
The Montreal Forced Aligner (MFA) is another forced alignment system, which uses the Kaldi ASR toolkit instead of HTK.
The LaBB-CAT module that integrates with the Montreal Forced Aligner is called MFA:
getAnnotatorDescriptor(url, "MFA")$taskParameterInfo
The MFA Annotator integrates with the Montreal Forced Aligner, which can use words with phonemic transcriptions, and the corresponding audio, to force-align words and phones; i.e. determine the start and end time of each speech sound within each word, and thus the start/end times of the words.
Configuration parameters are encoded as a URL query string, e.g.
orthographyLayerId=orthography&dictionaryName=english&modelsName=english&wordAlignmentLayerId=word&phoneAlignmentLayerId=segment&utteranceTagLayerId=mfa
Parameters are:
MFA provides a set of pre-trained English models, trained on 982 hours of speech by 2,484 American English speakers from the LibriSpeech corpus (Panayotov, Chen, Povey & Khudanpur, 2015, “Librispeech: an ASR corpus based on public domain audio books”, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206-5210, DOI:10.1109/ICASSP.2015.7178964), and pronunciations from an ARPAbet-encoded General American English dictionary.
A layer is set up to use this configuration:
mfaGAm <- newLayer(
url, "mfaGAm",
description = "Word alignments from MFA using an ARPAbet dictionary and pretrained models.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "MFA",
annotator.task.parameters = paste(
"orthographyLayerId=orthography",
"dictionaryName=english_us_arpa",
"modelsName=english_us_arpa",
"wordAlignmentLayerId=mfaGAm", # save word alignments to this layer
"phoneAlignmentLayerId=mfaGAmPhone", # this will be created by the annotator
sep="&"))
MFA also includes IPA-encoded models trained on 3687 hours of English speech of a number of varieties, and an IPA-encoded British English pronunciation dictionary which may perform better for our New Zealand English data, as both varieties are non-rhotic (unlike the US English dictionary used for the mfaGAm layer above).
This configuration uses pre-trained English models and pronunciations from an IPA encoded British English dictionary.
# create layer
mfaUKE <- newLayer(
url, "mfaUKE",
description = "Word alignments from MFA using an IPA dictionary and pretrained models.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "MFA",
annotator.task.parameters = paste(
"orthographyLayerId=orthography",
"dictionaryName=english_uk_mfa",
"modelsName=english_mfa",
"wordAlignmentLayerId=mfaUKE", # save word alignments to this layer
"phoneAlignmentLayerId=mfaUKEPhone", # this will be created by the annotator
sep="&"))
The MAUSBasic web service, part of CLARIN-D’s BAS Web Services suite, provides access to the MAUS forced aligner, which supports a number of languages and varieties.
LaBB-CAT integrates with the BAS Web Services using a module called BASAnnotator (which also provides access to the G2P BAS web service):
getAnnotatorDescriptor(url, "BASAnnotator")$taskParameterInfo
This annotator connects to the BAS web services - http://hdl.handle.net/11858/00-1779-0000-0028-421B-4, hosted by Ludwig Maximilians Universität München - for various annotation tasks.
Current annotation tasks include:

- MAUSBasic: forced alignment of words and phones.
- G2P: grapheme-to-phoneme conversion.
Please note that using these services requires sending transcript, annotation, and audio data over the internet to the external provider of these services.
Configuration parameters are encoded as a URL query string, e.g.
orthographyLayerId=orthography&service=MAUSBasic&forceLanguageMAUSBasic=deu-DE&phonemeEncoding=disc&wordAlignmentLayerId=word&phoneAlignmentLayerId=segment&utteranceTagLayerId=bas
Parameters are:
New Zealand English is supported explicitly by MAUSBasic, so a layer is set up using the following configuration:
maus <- newLayer(
url, "maus",
description = "Word alignments from MAUSBasic web service provided by the BAS Web Services.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "BAS",
annotator.task.parameters = paste(
"service=MAUSBasic",
"orthographyLayerId=orthography",
"forceLanguageMAUSBasic=eng-NZ", # New Zealand English
"phonemeEncoding=disc", # use DISC for phone labels
"wordAlignmentLayerId=maus", # save word alignments to this layer
"phoneAlignmentLayerId=mausPhone", # this will be created by the annotator
sep = "&"))
The speech to be force-aligned is in New Zealand English, and has been transcribed in Praat. This corpus is (very!) small, but enough to illustrate what’s possible.
TextGrid files are uploaded with their corresponding .wav files. The TextGrids include manually aligned words and segments (phones). The phone labels use the CELEX ‘DISC’ encoding, which uses exactly one character per phoneme (e.g. ‘different’ is transcribed dIfr@nt). These will be the ‘gold standard’ alignments for evaluating each forced alignment configuration.
# for each transcript
for (transcriptName in transcriptFiles) {
transcript <- file.path(dataDir, transcriptName)
# locate recording
noExtension <- substr(transcriptName, 1, nchar(transcriptName) - 9)
wav <- file.path(dataDir, paste(noExtension, ".wav", sep=""))
if (!file.exists(wav)) {
wav <- file.path(dataDir, paste(noExtension, ".WAV", sep=""))
}
if (!file.exists(wav)) cat(paste(wav, "doesn't exist\n"))
if (!file.exists(transcript)) cat(paste(transcript, "doesn't exist\n"))
# upload the transcript/recording
newTranscript(url, transcript, wav, no.progress = TRUE)
} # next transcript
cat(paste("Transcripts uploaded: ", length(transcriptFiles), "\n"))
## Transcripts uploaded: 1
At this point the manually aligned transcripts have been uploaded, and all of the forced-alignment configurations have been run.
In order to compare the word and phone alignments produced by the forced aligners with the manual alignments, we use LaBB-CAT’s Label Mapper layer manager:
getAnnotatorDescriptor(url, "LabelMapper")$info
This annotator creates a mapping between the labels of pairs of layers, by finding the minimum edit path between them.
For example, this layer manager can be used to tag each phone in a word with its likely counterpart in another phonemic transcription:
d | ɪ | f | ə | ɹ | n̩ | t |
↓ | ↓ | ↓ | ↓ | ↓ | ↓ | |
d | ɪ | f | ɹ | ən | t |
… or phonemic encoding:
d | ɪ | f | ə | ɹ | n̩ | t |
↓ | ↓ | ↓ | ↓ | ↓ | ↓ | ↓ |
D | IH1 | F | ER0 | R | AH0 N | T |
One possible use is to map non-rhotic phones to their equivalent rhotic transcription:
f | ɑɪ | ə | f | ɑɪ | t | ə |
↓ | ↓ | ↓ | ↓ | ↓ | ↓ | ↓ |
f | ɑɪ | əɹ | f | ɑɪ | t | əɹ |
The Label Mapper can also be used to compare alternative word/phone alignments - e.g. you may have manual word/phone alignments that you want to compare with word/phone alignments automatically generated by HTK, in order to measure the accuracy of the automatic alignments.
In this case, you specify two mappings: a main mapping between a pair of word alignment layers, and a sub-mapping between a pair of phone alignment layers.
| Mapping | Layer | Example: “different” | Example: “firefighter” |
---|---|---|---|
| Word Source | CMUDictWord | DIFFERENT | FIREFIGHTER |
| Word Target | orthography | different | firefighter |
| Phone Source | P2FAPhone | D IH1 F ER0 R AH0 N T | F AY1 R F AY2 T ER0 |
| Phone Target | segments | d I f r @ n t | f 2 @ f 2 t @ |
These mappings are tracked in detail, to facilitate comparison of alignment and labelling. Mapping information can be accessed via the 'extensions' page of this annotator.
The details of the Label Mapper configuration are:
getAnnotatorDescriptor(url, "LabelMapper")$taskParameterInfo
This annotator creates a mapping between the labels of two layers, a source layer and a target layer, by finding the minimum edit path between them.
A phone sub-mapping may be configured if the main mapping is between word tokens on one layer and word tokens on another layer; in this case, phone tokens on a source layer are also mapped to phone tokens on a target layer - i.e. you can compare two sets of word/phone annotations.
The words on the sourceLayerId are assumed to be divided into phones that are on the subSourceLayerId, and the words on the targetLayerId are assumed to be divided into phones that are on the subTargetLayerId.
These mappings are tracked in detail, to facilitate comparison of alignment and labelling. Mapping information can be accessed via this annotator's extensions.
Configuration parameters are encoded as a URL query string, e.g.
sourceLayerId=orthography&splitLabels=&mappingLayerId=&comparator=CharacterToCharacter&targetLayerId=mfaPretrainedARPAbet&subSourceLayerId=segment&subComparator=DISCToArpabet&subTargetLayerId=mfaPretrainedARPAbetPhone
For the comparator parameter (and likewise for subComparator), the possible values are:

- CharacterToCharacter: Orthography → Orthography
- OrthographyToDISC: Orthography → DISC
- OrthographyToArpabet: Orthography → ARPAbet
- DISCToDISC: DISC → DISC
- DISCToArpabet: DISC → ARPAbet
- ArpabetToDISC: ARPAbet → DISC
- IPAToIPA: IPA → IPA
- DISCToIPA: DISC → IPA
- IPAToDISC: IPA → DISC

For the splitLabels parameter, the possible values are:

- char: Use the most similar annotation in the scope, split its label into characters, and map each to labels on the target layer (e.g. DISC word transcriptions to phones).
- space: Use the most similar annotation in the scope, split its label on spaces, and map each to labels on the target layer (e.g. ARPAbet word transcriptions to phones).
A comparison is set up for each of the forced-alignment configurations above; i.e. each automatic word alignment is matched to a corresponding manual word alignment, and within each word, each automatic phone alignment is matched to a corresponding manual phone alignment. Once this is done, it’s possible to measure the degree to which the automatic alignments overlap with the manual ones.
# compare p2fa alignments with manual ones
p2faComp <- newLayer(
url, "p2faComp",
description = "Compare P2FA alignments with Manual alignments.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "labelmapper",
annotator.task.parameters = paste(
"sourceLayerId=orthography",
"targetLayerId=p2fa",
"splitLabels=", # no splitting; target to source annotations map 1:1
"comparator=CharacterToCharacter", # word tokens use plain orthography
"subSourceLayerId=segment",
"subTargetLayerId=p2faPhone",
"subComparator=DISCToArpabet", # phone tokens in p2faPhone use ARPAbet encoding
sep="&"))
generateLayer(url, "p2faComp")
## [1] "Finished."
# compare MFA ARPAbet alignments with manual ones
mfaGAmComp <- newLayer(
url, "mfaGAmComp",
description = "Compare MFA ARPAbet pretrained-model alignments with Manual alignments.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "labelmapper",
annotator.task.parameters = paste(
"sourceLayerId=orthography",
"targetLayerId=mfaGAm",
"splitLabels=", # no splitting; target to source annotations map 1:1
"comparator=CharacterToCharacter", # word tokens use plain orthography
"subSourceLayerId=segment",
"subTargetLayerId=mfaGAmPhone",
"subComparator=DISCToArpabet", # phone tokens in mfaGAmPhone use ARPAbet encoding
sep="&"))
generateLayer(url, "mfaGAmComp")
## [1] "Finished."
# compare MFA non-rhotic IPA alignments with manual ones
mfaUKEComp <- newLayer(
url, "mfaUKEComp",
description = "Compare MFA IPA pretrained-model alignments with Manual alignments.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "labelmapper",
annotator.task.parameters = paste(
"sourceLayerId=orthography",
"targetLayerId=mfaUKE",
"splitLabels=", # no splitting; target to source annotations map 1:1
"comparator=CharacterToCharacter", # word tokens use plain orthography
"subSourceLayerId=segment",
"subTargetLayerId=mfaUKEPhone",
"subComparator=DISCToIPA", # phone tokens in mfaUKEPhone use IPA encoding
sep="&"))
generateLayer(url, "mfaUKEComp")
## [1] "Finished."
# compare MAUSBasic alignments with manual ones
mausComp <- newLayer(
url, "mausComp",
description = "Compare MAUS Basic New Zealand English alignments with Manual alignments.",
alignment = 2, parent.id = "turn", # a phrase layer
annotator.id = "labelmapper",
annotator.task.parameters = paste(
"sourceLayerId=orthography",
"targetLayerId=maus",
"splitLabels=", # no splitting; target to source annotations map 1:1
"comparator=CharacterToCharacter", # word tokens use plain orthography
"subSourceLayerId=segment",
"subTargetLayerId=mausPhone",
"subComparator=DISCToDISC", # phone tokens in mausPhone use DISC encoding
sep="&"))
generateLayer(url, "mausComp")
## [1] "Finished."
The LabelMapper has an extended API for exporting information about the edit-paths computed, and the resulting alignment comparisons.
getAnnotatorDescriptor(url, "LabelMapper")$extApiInfo
This annotator creates a mapping between the labels of a pair of layers, by finding the minimum edit path between them. If any sub-mappings have been configured - where two pairs of layers are mapped, two word layers and two corresponding phone layers - these mappings are tracked in detail so that alignments and label assignments can be compared.
This API provides access to tracked sub-mapping information, including raw mapping data and summary information including mean Overlap Rate.
Paulo and Oliveira (2004) devised Overlap Rate (OvR) to compare alignments, which measures how much two intervals overlap, independent of their absolute durations. OvR is calculated as follows:
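$$\textrm{OvR} = \frac{Dur_{common}}{Dur_{max}} = \frac{Dur_{common}}{Dur_{1} + Dur_{2} - Dur_{common}}$$

…where $Dur_1$ and $Dur_2$ are the durations of the two intervals being compared, $Dur_{common}$ is the duration of their overlap, and $Dur_{max}$ is the duration of the union of the two intervals (the notation here is a restatement of Paulo and Oliveira's definition).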
The result is a value between 0 and 1. A value of 0 means that the two intervals do not overlap at all, with 1 meaning they completely overlap (i.e., the alignments exactly agree). Overlap Rate has several advantages over comparing start/end offsets directly:
The extension API can be used to list and download tracked mappings, using the following endpoints, accessed with an HTTP GET request:
The mean overlap rates for the forced alignment configurations can be directly compared, by extracting the mapping summaries and comparing them:
p2faPhone <- jsonlite::fromJSON(
annotatorExt(url, "LabelMapper", "summarizeMapping", list("segment","p2faPhone")))
mfaGAmPhone <- jsonlite::fromJSON(
annotatorExt(url, "LabelMapper", "summarizeMapping", list("segment","mfaGAmPhone")))
mfaUKEPhone <- jsonlite::fromJSON(
annotatorExt(url, "LabelMapper", "summarizeMapping", list("segment","mfaUKEPhone")))
mausPhone <- jsonlite::fromJSON(
annotatorExt(url, "LabelMapper", "summarizeMapping", list("segment","mausPhone")))
knitr::kable(rbind(p2faPhone, mfaGAmPhone, mfaUKEPhone, mausPhone))
| | meanOverlapRate | sourceCount | stepCount | targetCount | utteranceCount |
---|---|---|---|---|---|
| p2faPhone | 0.572620566232039 | 182 | 187 | 186 | 3 |
| mfaGAmPhone | 0.657239987452307 | 182 | 188 | 186 | 3 |
| mfaUKEPhone | 0.628982581365989 | 182 | 186 | 178 | 3 |
| mausPhone | 0.625540272980146 | 182 | 183 | 178 | 3 |
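The same summaries can also be compared graphically; a minimal sketch using base R graphics, taking the meanOverlapRate field from the summaries retrieved above:

# plot the mean overlap rate of each forced-alignment configuration
meanOvR <- c(p2fa = as.numeric(p2faPhone$meanOverlapRate),
             mfaGAm = as.numeric(mfaGAmPhone$meanOverlapRate),
             mfaUKE = as.numeric(mfaUKEPhone$meanOverlapRate),
             maus = as.numeric(mausPhone$meanOverlapRate))
barplot(meanOvR, ylim = c(0, 1), ylab = "Mean Overlap Rate")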
The LabelMapper module can also provide fine-grained details of the mappings - i.e. exactly which manual phones were mapped to which automatic phones.
# get edit paths for manual-to-P2FA comparison
p2faPhoneEditPaths <- read.csv(
text = annotatorExt(url, "LabelMapper", "mappingToCsv", list("segment","p2faPhone")))
# show a sample of the paths to give an idea
knitr::kable(head(
p2faPhoneEditPaths[
c("sourceParentLabel", "sourceLabel", "sourceStart","sourceEnd",
"targetLabel", "targetStart","targetEnd", "overlapRate")]),
# tweak some column names for display purposes
col.names = c("Word","sourceLabel", "sourceStart","sourceEnd",
"targetLabel", "targetStart","targetEnd", "OvR"))
Word | sourceLabel | sourceStart | sourceEnd | targetLabel | targetStart | targetEnd | OvR |
---|---|---|---|---|---|---|---|
yes | j | 6.384000 | 6.622325 | Y | 6.414 | 6.644 | 0.8012497 |
yes | E | 6.622325 | 6.754000 | EH1 | 6.644 | 6.754 | 0.8353896 |
yes | s | 6.754000 | 6.854000 | S | 6.754 | 6.864 | 0.9090909 |
on | Q | 6.854000 | 6.944000 | AA1 | 6.864 | 6.954 | 0.8000000 |
on | n | 6.944000 | 7.054000 | N | 6.954 | 7.134 | 0.5263158 |
the | D | 8.311241 | 8.350198 | DH | 7.564 | 7.694 | 0.0000000 |
This allows closer analysis of the forced-alignments, beyond the overlap rate summary.
For example, it allows us to identify which phones produced by the forced aligner have no corresponding phone in the manual alignments; forced aligners that use a rhotic dictionary are expected to produce many spurious /r/ segments that are not actually present in the non-rhotic speech in the corpus:
# identify all edit steps where a phone is added - i.e. the source phone label is empty
spuriousSegments <- subset(p2faPhoneEditPaths, sourceLabel == "")
# count these spurious segments grouping by their label to see which is most common
knitr::kable(
aggregate(spuriousSegments$targetLabel,
by=list(spuriousSegments$targetLabel),
FUN=length),
col.names = c("Spurious Segment", "Count"))
Spurious Segment | Count |
---|---|
AH0 | 1 |
D | 1 |
R | 2 |
V | 1 |
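The same edit paths can be summarised in other ways; for example, a quick sketch (using the data frame already downloaded above) of mean overlap rate by manual phone label, showing which segment types P2FA aligns best and worst:

# mean overlap rate per manual (source) phone label, excluding spurious insertions
alignedPhones <- subset(p2faPhoneEditPaths, sourceLabel != "")
knitr::kable(
  aggregate(overlapRate ~ sourceLabel, data = alignedPhones, FUN = mean),
  col.names = c("Manual Phone", "Mean OvR"))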
With this detailed mapping information, it’s possible to: