1 - Exploration

In this worksheet you will start exploring a demo LaBB-CAT corpus, to get a general idea of how to find your way around LaBB-CAT and how the language data is presented.

The demo corpus contains a collection of videos of people telling stories about their experiences during the earthquakes that struck Canterbury during 2010 and 2011. The recordings have been orthographically transcribed using a tool called ELAN, and are time-aligned to the utterance level; i.e. the start and end time of each line in the transcript has been manually synchronized with the recording. The ELAN transcripts, and their video and audio files, have been uploaded into LaBB-CAT.

LaBB-CAT is a browser-based system, so the first thing to do is access it with your web browser. Generally, any modern browser should be fine (although some features you’ll see in later worksheets are only supported by Mozilla Firefox or Google Chrome).

  1. In your web browser, type in the following URL:
    https://labbcat.canterbury.ac.nz/demo
    You will be asked for a username and password.
  2. Enter the username demo and the password demo
    The very first time you access LaBB-CAT, you will see the licence agreement for accessing the corpus data.
  3. Press I Agree to continue.
    You will see a page called “LaBB-CAT Demo” which has a menu of links along the top and a number of icons. Below the icons is some information about the corpus. This is the LaBB-CAT home page.
  4. Click the where do I start? icon on the left.
    The help page that pops up includes a brief description of LaBB-CAT and some tips for navigation and getting more information.
  5. Read through at least the top section of the page to get some helpful tips, and then close the browser tab to return to the home page.

Transcripts and Participants

First we will look at ways to manually browse the corpus data.

  1. On the LaBB-CAT home page, select the transcripts option on the menu at the top.
    You will see a list of transcripts in LaBB-CAT, together with some meta-data. The first twenty transcripts are listed, and there are controls at the bottom of the page to list others.
  2. Click the name of the first transcript listed: AP2505_Nelson.eaf
    You will see a page with transcript text, and the video appears in the top right corner of the page.
  3. Press the play button on the video.
    As the video plays, you will see the current utterance highlighted in the transcript. You will also see that the current utterance appears as closed captions in the video. You can use the video controls as normal, including the full-screen button in the bottom right, to make the video occupy the whole screen.
  4. Pause the recording.
  5. Click one of the transcript lines further down the transcript.
    A menu will appear.
  6. Select the ‘Play’ option at the bottom of the menu.
    You will see that playback starts at that line. Playback will stop when the participant finishes the utterance.
  7. Click on the formats link at the top left under the title.
    You will see a menu, which includes various formats for exporting the transcript.
  8. Select Plain Text Document
  9. Save the resulting file and then open it.
    You will see the transcript in plain-text form.

Plain text is a format supported by many language analysis tools, so exporting a transcript as plain text allows you to use your favourite tools for whatever research you’re doing.
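As a small illustration of what a plain-text export makes possible, here is a minimal sketch that counts word frequencies in Python. The sample text is a made-up excerpt, not taken from the demo corpus, and the exact layout of an exported file may differ; with a real export you would read the saved file instead of using the hard-coded string.

```python
import re
from collections import Counter

# Hypothetical excerpt standing in for the contents of an exported
# plain-text transcript (not real corpus data).
text = "and then the house started shaking and the power went out"

# Lower-case the text and pull out runs of letters (and apostrophes) as words.
words = re.findall(r"[a-z']+", text.lower())

# Tally how often each word occurs.
counts = Counter(words)

# Show the two most frequent words with their counts.
print(counts.most_common(2))
```

The same approach works for any tool that accepts plain text; the word-splitting rule here is deliberately simple and would need adjusting for real transcription conventions.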

  1. If you have Praat installed on your computer, click the formats link, and select the Praat Text Grid option. Save the resulting file on your desktop, and then open it with Praat.

You will see that the TextGrid has a couple of tiers, one for whole utterances, and one for individual words.

  1. Now select the participants option on the menu at the top.
    You will see a list that looks similar to the ‘transcripts’ list we saw earlier, but this page lists names and meta-data of speakers rather than the recordings in which they appear.

Regular Expressions

  1. On the search page, in the orthography box, prefix the word “quake” with .* so that the box contains:
    .*quake

This is a ‘regular expression’ that allows you to search for patterns instead of matching exact text:

  • . means “any letter, number, or other character”
  • * means “zero or more of the previous thing”,
    so .* means “any number of characters of any kind”
  • quake means literally the sequence of letters ‘quake’
    so .*quake means “any word ending in ‘quake’”
  1. Press Search.
    Depending on your browser, you may have to click the Display results link to see the results page.
    Now your results include all the instances of the word “earthquake”, plus instances of “quake” as well.
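To try the same pattern outside LaBB-CAT, here is a minimal sketch using Python's re module. The word list is invented for illustration, and the use of fullmatch reflects an assumption, based on the description above, that the pattern has to match a whole word rather than part of one.

```python
import re

# Hypothetical word tokens standing in for the orthography of a transcript.
words = ["the", "earthquake", "quake", "quakes", "aftershock", "earthquakes"]

# The same regular expression as entered in the orthography box.
pattern = re.compile(r".*quake")

# Assumption: the pattern must match the whole word, hence fullmatch.
matches = [w for w in words if pattern.fullmatch(w)]

print(matches)  # ['earthquake', 'quake']
```

Note that “quakes” and “earthquakes” are not matched, because they do not end in ‘quake’, while “quake” itself is matched because .* can stand for zero characters.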

Up until now, we’ve only been matching against one word at a time. Now we’re going to create a search pattern for a chain of words.

  1. Close the results tab of the previous search.
  2. Back on the search page, next to the orthography box where you entered the regular expression, there’s a button for adding a column to the ‘search matrix’. Click it.

    Now you will see that our ‘search matrix’ is two words wide.
  3. In the new orthography box on the right, enter the regular expression:
    is|was

This regular expression breaks down as follows:

  • is means “the word ‘is’”
  • | (the vertical bar character) means or
  • was means “the word ‘was’”
    so is|was means “the word ‘is’ or the word ‘was’”
  1. Press Search.
    You should see that the results are now words ending in ‘quake’ that are immediately followed by either ‘is’ or ‘was’.
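The idea of a two-column search matrix can be sketched in Python as one regular expression per column, each tested against consecutive words. This is only an illustration of the concept, not how LaBB-CAT implements its search; the sample sentence and the search function are hypothetical, and whole-word matching is again an assumption.

```python
import re

# Hypothetical sequence of word tokens (not real corpus data).
tokens = ("the earthquake was huge and the quake is over "
          "but earthquakes are rare").split()

# One pattern per column of the search matrix, left to right.
columns = [re.compile(r".*quake"), re.compile(r"is|was")]

def search(tokens, columns):
    """Return each run of adjacent words where every column matches its word."""
    hits = []
    width = len(columns)
    for i in range(len(tokens) - width + 1):
        window = tokens[i:i + width]
        # Assumption: each pattern must match its whole word.
        if all(p.fullmatch(w) for p, w in zip(columns, window)):
            hits.append(" ".join(window))
    return hits

hits = search(tokens, columns)
print(hits)  # ['earthquake was', 'quake is']
```

Adding a third pattern to the columns list corresponds to adding another column to the search matrix.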
Tip

You can get more information about regular expressions by using the online help back on the search page.


In this worksheet you have seen that:

  • LaBB-CAT is a repository for recordings and their transcripts;
  • Transcripts can be exported in a variety of formats;
  • Meta-data can be attached to transcripts (transcript attributes) and to participants (participant attributes);
  • You can filter lists of participants (or transcripts) on the basis of meta-data;
  • You can search the texts of the transcripts for patterns using ‘regular expressions’;
  • Search results can be exported to CSV files for further processing.
