LaBB-CAT
Setting Up a Corpus
3 December 2024
LaBB-CAT needs:
Transcripts
what
: Orthographic transcripts (audio/video is optional)
when
: …divided into time-stamped utterances (≤ 15 words)
who
: …with some way to identify the speaker
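
The three requirements above can be pictured as one record per utterance. This is only an illustrative sketch, not LaBB-CAT's internal data model: the `Utterance` class, its field names, and the `overlong` helper are hypothetical, and the 15-word limit is taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical record: one time-stamped utterance of a transcript."""
    speaker: str   # who
    start: float   # when: start time in seconds
    end: float     # when: end time in seconds
    text: str      # what: orthographic transcription

    def word_count(self) -> int:
        # Whitespace tokenisation is an assumption made for this sketch.
        return len(self.text.split())


def overlong(utterances, max_words=15):
    """Return the utterances that exceed max_words words."""
    return [u for u in utterances if u.word_count() > max_words]
```

A check like this could be run over transcripts before upload to find utterances that should be split into shorter ones.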
Textual data is supported
- Transcript = text
- Utterances = lines
- Speaker = author
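
For text-only data, the mapping above might look like the following sketch. The function name and the dictionary keys are hypothetical and do not reflect how LaBB-CAT itself parses text.

```python
def text_to_utterances(text, author):
    """Treat each non-empty line of a text as one utterance by its author."""
    return [
        {"speaker": author, "utterance": line.strip()}
        for line in text.splitlines()
        if line.strip()  # skip blank lines
    ]
```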
Elicitation Tasks
Define a speech elicitation task
Elicitation Tasks
Recording via the web browser
Standalone
Browser and Web Server on the same computer
Layer Managers
Automated Annotation
- Porter Stemmer – English stem
- Lexicon layer managers: CELEX, CMU Dictionary, Unisyn, Flat file dictionary
- Pattern Matcher and Character Mapper – regular-expression-based processing
- Frequency – word counts in different scopes
- LIWC – percentages of words in different categories
- Statistics – aggregation of groups of tokens
- Context – previous mention, previous pause
- Partition – partitions of n tokens
- JavaScript and Python managers – arbitrary scripting
- Stanford POS Tagger – part-of-speech
- Stanford Parser – syntactic parsing
- Forced alignment …
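
As an illustration of what "word counts in different scopes" can mean, the sketch below counts word tokens per transcript and across the whole corpus. It is not the Frequency layer manager's implementation; the function name, the input shape, and the lower-casing step are assumptions made for the example.

```python
from collections import Counter


def word_counts(corpus):
    """Count words in two scopes: per transcript and across the corpus.

    corpus: dict mapping a transcript name to its list of word tokens
    (a hypothetical input shape chosen for this sketch).
    """
    per_transcript = {
        name: Counter(word.lower() for word in words)
        for name, words in corpus.items()
    }
    overall = Counter()
    for counts in per_transcript.values():
        overall.update(counts)  # add each transcript's counts to the total
    return per_transcript, overall
```

The same pattern extends to other scopes, such as per speaker, by changing the key used to group the tokens.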
Forced Alignment
Automatically locate start/end times of speech sounds
HTK / Penn Aligner
HTK software, Penn pretrained models
Montreal Forced Aligner
MFA software, Dictionaries and pretrained models downloaded from GitHub
WebMAUS
Transcripts and recordings sent over the internet
Worksheets…
Create a corpus from ELAN transcripts