LaBB-CAT

Setting Up a Corpus

Robert Fromont

3 December 2024

LaBB-CAT needs:

Transcripts

  • what: Orthographic transcripts (audio/video is optional)
  • when: …divided into time-stamped utterances (≤ 15 words)
  • who: …with some way to identify the speaker
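
As a rough illustration of the three requirements above, the sketch below models an utterance as a speaker label, a start and end time, and orthographic text, and flags utterances longer than 15 words. The `Utterance` class and the `too_long` helper are hypothetical names used for illustration only; they are not part of LaBB-CAT.

```python
from dataclasses import dataclass

MAX_WORDS = 15  # guideline taken from the list above


@dataclass
class Utterance:
    speaker: str  # who
    start: float  # when: start time in seconds
    end: float    # when: end time in seconds
    text: str     # what: orthographic transcript


def too_long(utterances, max_words=MAX_WORDS):
    """Return utterances with more than max_words whitespace-separated words."""
    return [u for u in utterances if len(u.text.split()) > max_words]


# Made-up example data: the second utterance has 16 words.
utterances = [
    Utterance("interviewer", 0.0, 2.1, "so where did you grow up"),
    Utterance("participant", 2.1, 9.8, " ".join(["word"] * 16)),
]
flagged = too_long(utterances)
```

A check like this could be run before upload to find utterances that would be worth splitting into shorter time-stamped chunks.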

Textual data is supported

  • Transcript = text
  • Utterances = lines
  • Speaker = author
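
The mapping above can be sketched in a few lines, assuming a plain-text document whose non-empty lines are each treated as one utterance attributed to the author. The function name and the dictionary keys are hypothetical and do not reflect how LaBB-CAT stores data internally.

```python
def text_to_utterances(text, author):
    """Treat each non-empty line as one utterance 'spoken' by the author."""
    return [
        {"speaker": author, "line": number, "text": line.strip()}
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip()
    ]


# Made-up example text with a blank line between two lines of content.
post = "First line of the post.\n\nSecond line, after a blank one."
utterances = text_to_utterances(post, author="author_01")
```

Line numbers stand in for time stamps here, since textual data has no recording to align with.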

Transcript Formats

  • Praat TextGrid (one tier per speaker, or one tier per layer)
  • Transcriber
  • ELAN
  • Plain Text
  • CHAT (partial support)
  • SALT (partial support)
  • TEI (partial support)
  • VTT Subtitles

Elicitation Tasks

Define a speech elicitation task

Recording via the web browser

Meta-data

Both Transcripts and Participants can have defined ‘attributes’

  • Textual types:
    • String - e.g. ethnicity
    • Text - e.g. notes
    • Email - e.g. contact address
    • URL - e.g. source document
    • Read Only - e.g. transcript version
  • Numeric types:
    • Integer - e.g. age in years
    • Number - e.g. syllables per minute
    • Styles: Slider - e.g. rating
  • Temporal types:
    • Date - e.g. date of birth
    • Time - e.g. interview duration
    • Date/Time - e.g. recording date/time
  • Fixed class types:
    • Boolean - e.g. checked
    • Select - e.g. language
      Styles:
      • Multiple - e.g. permissions
      • Other - e.g. gender
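
To make the idea of typed attributes concrete, here is a minimal sketch that checks values against a handful of the types listed above. The `CHECKS` table, the type names as lowercase strings, and the attribute names are all assumptions made for illustration; LaBB-CAT defines and validates attributes through its own interface.

```python
from datetime import date

# Hypothetical mapping from a few attribute types to simple Python checks.
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date": lambda v: isinstance(v, date),
    "boolean": lambda v: isinstance(v, bool),
}


def invalid_attributes(definitions, values):
    """Return names of attributes whose value fails the check for its type."""
    return [
        name
        for name, type_name in definitions.items()
        if name in values and not CHECKS[type_name](values[name])
    ]


# Made-up participant attributes; "age" is wrongly supplied as a string.
participant_attributes = {
    "ethnicity": "string",
    "age": "integer",
    "date_of_birth": "date",
}
values = {
    "ethnicity": "NZ European",
    "age": "42",
    "date_of_birth": date(1982, 5, 1),
}
problems = invalid_attributes(participant_attributes, values)
```

Choosing the narrowest type that fits (Integer rather than String for age, for example) is what makes this kind of checking, and later filtering by attribute, possible.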

Client/Server

Browser on Clients

Web Server

Standalone

Browser and Web Server on the same computer

Layer Managers

Automated Annotation

  • Porter Stemmer – English stem
  • Lexicon layer managers: CELEX, CMU Dictionary, Unisyn, Flat file dictionary
  • Pattern Matcher and Character Mapper – regular-expression-based processing
  • Frequency – word counts in different scopes
  • LIWC – percentages of words in different categories
  • Statistics – aggregation of groups of tokens
  • Context – previous mention, previous pause
  • Partition – partitions of n tokens
  • JavaScript and Python managers – arbitrary scripting
  • Stanford POS Tagger – part-of-speech
  • Stanford Parser – syntactic parsing
  • Forced alignment …
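
As an illustration of what "word counts in different scopes" can mean, the sketch below counts the same tokens across the whole corpus, per transcript, and per speaker. The choice of these three scopes, the token tuples, and the `frequencies` function are assumptions for this example; it is not the Frequency layer manager's actual implementation.

```python
from collections import Counter

# Made-up tokens as (transcript, speaker, word) tuples.
tokens = [
    ("t1", "anna", "the"), ("t1", "anna", "cat"), ("t1", "ben", "the"),
    ("t2", "anna", "the"), ("t2", "ben", "dog"),
]


def frequencies(tokens, scope):
    """Count words within each scope: 'corpus', 'transcript' or 'speaker'."""
    key = {
        "corpus": lambda t: "corpus",  # one bucket for everything
        "transcript": lambda t: t[0],  # one bucket per transcript
        "speaker": lambda t: t[1],     # one bucket per speaker
    }[scope]
    counts = {}
    for token in tokens:
        counts.setdefault(key(token), Counter())[token[2]] += 1
    return counts


corpus_counts = frequencies(tokens, "corpus")
transcript_counts = frequencies(tokens, "transcript")
speaker_counts = frequencies(tokens, "speaker")
```

The same word can therefore carry several frequency annotations, one for each scope in which it was counted.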

Forced Alignment

Automatically locate start/end times of speech sounds

HTK / Penn Aligner

HTK software, Penn pretrained models

Montreal Forced Aligner

MFA software; dictionaries and pretrained models downloaded from GitHub

WebMAUS

Transcripts and recordings sent over the internet

LaBB-CAT

robert.fromont@canterbury.ac.nz

https://labbcat.canterbury.ac.nz

Worksheets…

Create a corpus from ELAN transcripts