LaBB-CAT

Setting Up a Corpus

Robert Fromont

3 December 2024

LaBB-CAT needs:

Transcripts

what: Orthographic transcripts (audio/video is optional)
when: …divided into time-stamped utterances (≤ 15 words)
who: …with some way to identify the speaker
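
To make the what/when/who requirements concrete, here is a minimal sketch of a pre-upload check on a single utterance. It is an illustration only, not part of LaBB-CAT: the `Utterance` class, the `problems` function and the example speaker code are hypothetical names chosen for this example, and the 15-word limit is taken from the guideline above.

```python
from dataclasses import dataclass

MAX_WORDS = 15  # guideline from the slide above; an assumption, not a hard LaBB-CAT limit


@dataclass
class Utterance:
    """Hypothetical container for one time-stamped utterance."""
    speaker: str  # who: some way to identify the speaker
    start: float  # when: start time in seconds
    end: float    # when: end time in seconds
    text: str     # what: orthographic transcription


def problems(utterance: Utterance) -> list[str]:
    """Return human-readable problems; an empty list means the utterance looks fine."""
    found = []
    if not utterance.speaker.strip():
        found.append("no speaker")
    if utterance.end <= utterance.start:
        found.append("end time is not after start time")
    if len(utterance.text.split()) > MAX_WORDS:
        found.append("more than 15 words")
    return found
```

For example, `problems(Utterance("speaker-1", 0.0, 2.5, "and then we went home"))` returns an empty list, while an utterance with a blank speaker or a reversed time span is reported.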

Textual data is supported

Transcript = text
Utterances = lines
Speaker = author
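
The mapping above can be sketched in a few lines of Python. This only illustrates the idea (one text becomes one transcript, each non-blank line an utterance, the author the speaker). It is not LaBB-CAT's own import code, and the function name and dictionary keys are hypothetical.

```python
def text_to_utterances(author: str, text: str) -> list[dict]:
    """Treat each non-blank line of a text as one utterance spoken by its author."""
    utterances = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            utterances.append({"speaker": author, "text": line})
    return utterances
```

Textual data has no recording, so this sketch carries no time stamps; only the line order is kept.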

Transcript Formats

Praat TextGrid
Transcriber
ELAN
Plain Text
CHAT (partial support)
SALT (partial support)
TEI (partial support)
VTT Subtitles

Elicitation Tasks

Define a speech elicitation task

Recording via the web browser

Meta-data

Both Transcripts and Participants can have defined ‘attributes’

Textual types:
- String - e.g. ethnicity
- Text - e.g. notes
- Email - e.g. contact address
- URL - e.g. source document
- Read Only - e.g. transcript version
Numeric types:
- Integer - e.g. age in years
- Number - e.g. syllables per minute
- Styles: Slider - e.g. rating

Temporal types:
- Date - e.g. date of birth
- Time - e.g. interview duration
- Date/Time - e.g. recording date/time
Fixed class types:
- Boolean - e.g. checked
- Select - e.g. language
  Styles:
  - Multiple - e.g. permissions
  - Other - e.g. gender

Client/Server

Standalone

Browser and Web Server on the same computer

Layer Managers

Automated Annotation

Porter Stemmer – English stem
Lexicon layer managers: CELEX, CMU Dictionary, Unisyn, Flat file dictionary
Pattern Matcher and Character Mapper – regular-expression-based processing
Frequency – word counts in different scopes
LIWC – percentages of words in different categories
Statistics – aggregation of groups of tokens
Context – previous mention, previous pause
Partition – partitions of n tokens
JavaScript and Python managers – arbitrary scripting
Stanford POS Tagger – part-of-speech
Stanford Parser – syntactic parsing
Forced alignment …

Forced Alignment

Automatically locate start/end times of speech sounds

HTK / Penn Aligner

HTK software, Penn pretrained models

Montreal Forced Aligner

MFA software; dictionaries and pretrained models downloaded from GitHub

WebMAUS

Transcripts and recordings sent over the internet

LaBB-CAT

Open Source
Cross Platform
Free to install
https://sourceforge.net/projects/labbcat

robert.fromont@canterbury.ac.nz

https://labbcat.canterbury.ac.nz

Worksheets…

Create a corpus from ELAN transcripts