Transcription Guidelines
Various tools are available for transcribing recordings, and LaBB-CAT supports the transcription file formats of the most commonly used ones. Each tool has its own capabilities for specifying speakers and metadata, and for adding annotations.
Beyond the specific tool and file format used for transcription, there are some general principles that can facilitate subsequent processing of speech data in LaBB-CAT.
Spelling
Many automatic annotation tasks involve looking words up in standard dictionaries, and words that are not found are not annotated, so it’s important to use standard spelling where possible.
- Use conventional spelling, and if you are unsure of how to spell something, look it up in a dictionary, or on a map.
- Write all numbers out in full, with spaces instead of hyphens - e.g.
- \(\times\) “1984”
- \(\times\) “nineteen-eighty-four”
- \(\checkmark\) “nineteen eighty four”
- When abbreviations are used, use capital letters with spaces in between each letter if each letter is said separately, otherwise use capitals with no spaces - e.g.
- \(\times\) “NSA”
- \(\times\) “N A S A”
- \(\checkmark\) “N S A”
- \(\checkmark\) “NASA”
- All words should be spelt out in full, like “and” and “suppose”. Final gs and ds should not be dropped from words even if that’s what the speaker says - e.g.
- \(\times\) “skippin’ an’ jumpin’ an ol’ rope”
- \(\checkmark\) “skipping and jumping an old rope”
- A single word should always be spelt as an entire word, even if there is a pause between syllables.
- Don’t tidy up the speech. Leave in the repetitions, fillers and errors.
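Some of these spelling rules can be checked mechanically before upload. The following is a minimal sketch, not part of LaBB-CAT: the function name `check_spelling` and the three patterns are assumptions made for illustration, and each hit is only a hint for a human to review (a plural possessive such as “speakers’” would also be flagged, for example).

```python
import re

# Hypothetical rule set for illustration; these are not LaBB-CAT's own checks.
RULES = [
    (re.compile(r"\d"), "number written in digits"),
    (re.compile(r"\w-\w"), "hyphen inside a word or number"),
    (re.compile(r"\w['’](?!\w)"), "word ends in an apostrophe (dropped letter?)"),
]

def check_spelling(line):
    """Return (word, message) pairs for tokens that look non-standard."""
    problems = []
    for token in line.split():
        # Remove surrounding punctuation, but keep apostrophes and hyphens.
        word = token.strip(".,?!\"“”")
        for pattern, message in RULES:
            if pattern.search(word):
                problems.append((word, message))
    return problems

for word, message in check_spelling("skippin’ an’ jumpin’ an ol’ rope"):
    print(f"{word}: {message}")
```

A line such as “nineteen eighty four” passes without hits, while “1984” and “nineteen-eighty-four” are each reported once.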
A limited set of shortened words and contractions may be defined, which is fine as long as they’re used consistently - e.g. if you use “cos” as a shortened version of “because”, always spell it “cos”, and never “cause”, nor “’cause”, nor “coz”. Such a set might include:
- gonna
- sorta
- cos
- kinda
- gotta
- dunno
- wanna
- yip
- yeah
- okay
- uh huh
- gee
- jeez
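Consistency of this kind is easy to audit with a small lookup table. The sketch below assumes a hypothetical project convention (`VARIANTS`) that maps each agreed spelling to the variants to avoid; only the “cos” entry comes from the example above, and any other entries would need to be filled in per project. Because “cause” is also an ordinary word, hits are meant for manual review rather than automatic replacement.

```python
# Hypothetical convention table: agreed spelling -> variants to avoid.
VARIANTS = {
    "cos": ["cause", "’cause", "'cause", "coz"],
}

def find_variants(line, variants=VARIANTS):
    """Return (found, preferred) pairs for discouraged spellings in a line."""
    lookup = {bad: good for good, bads in variants.items() for bad in bads}
    hits = []
    for token in line.lower().split():
        word = token.strip(".,?!")  # keep any leading apostrophe
        if word in lookup:
            hits.append((word, lookup[word]))
    return hits

print(find_variants("I left coz it was late"))
```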
SALT transcription guidelines include their own set of standardised spellings for interjections and fillers. If you’re working with SALT transcripts, you should use these instead:
- hey
- hmm
- huh
- mhm
- nope
- oops
- oopsy
- ok
- psst
- ssh
- uhhuh (indicating “yes”)
- uhuh (indicating “no”)
- yeah
- yep
Disfluencies
It’s important to be consistent with the spelling of filled pauses:
- ah
- er
- um
- mmm
The spelling of the last of these, with three m’s, is recommended because if it’s spelled with one m - “m” - this can match the name of the letter “M” in the dictionary, so the pronunciation can be tagged as /εm/, and if it’s spelled with two m’s - “mm” - this sometimes matches an alternative spelling of the word “millimeter”, so the pronunciation can be tagged /'mɪ-lɪ-"mi-tə/.
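A quick scan for the two ambiguous spellings might look like the sketch below. The names `AMBIGUOUS_PAUSES` and `flag_ambiguous_pauses` are made up for this illustration, and only lower-case tokens are flagged, on the assumption that a capital “M” is a deliberately spelled-out letter.

```python
# Spellings of the filled pause that can collide with dictionary entries.
AMBIGUOUS_PAUSES = {
    "m": "can match the name of the letter M",
    "mm": "can match an alternative spelling of millimeter",
}

def flag_ambiguous_pauses(line):
    """Return (word, reason) pairs for 'm' or 'mm' used as a filled pause."""
    flagged = []
    for token in line.split():
        word = token.strip(".,?!")
        if word in AMBIGUOUS_PAUSES:
            flagged.append((word, AMBIGUOUS_PAUSES[word]))
    return flagged

print(flag_ambiguous_pauses("mm I think so"))
```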
Unfilled pauses can be transcribed with a hyphen surrounded by whitespace. Some modules use this pause information to help with automatic annotation; forced alignment with HTK, for instance, benefits from pause annotations like this - e.g.
- \(\times\) “stop-before you begin”
- \(\checkmark\) “stop - before you begin”
Incomplete words should be marked with a tilde ~ (not a hyphen, which may be interpreted as a pause) at the end of the word - e.g.
- \(\times\) “hesi-”
- \(\checkmark\) “hesi~”
For very short hesitations - “b~ bu~ but” - some pronunciation modules can infer a pronunciation for such words, without the need for a manual pronunciation tag.
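The pause and incomplete-word conventions can be illustrated with a small token classifier. This is a sketch under simple assumptions (whitespace tokenisation, and a hypothetical `classify_tokens` helper); it does not reproduce how LaBB-CAT itself parses transcripts.

```python
def classify_tokens(line):
    """Label each whitespace-separated token in a transcript line."""
    labelled = []
    for token in line.split():
        if token == "-":
            labelled.append((token, "pause"))         # hyphen on its own
        elif token.endswith("~"):
            labelled.append((token, "incomplete"))    # e.g. hesi~
        elif "-" in token:
            labelled.append((token, "check hyphen"))  # e.g. hesi- or stop-before
        else:
            labelled.append((token, "word"))
    return labelled

for token, label in classify_tokens("stop - before you b~ bu~ but hesi-"):
    print(f"{token}\t{label}")
```

Tokens labelled “check hyphen” are the ones most likely to be a pause typed without whitespace or an incomplete word marked with the wrong character.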
Utterances/Lines
Some processes, like forced alignment, involve processing individual utterances in the recording, which correspond to lines of text in many transcription systems. Very long or very short utterances can be difficult to process.
Ideally, each line in a transcript should be 5 to 15 words long, and line breaks should be made where there are pauses in speech.
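The length guideline can be checked with a few lines of code. In this sketch the function name and the decision not to count stand-alone pause hyphens as words are assumptions; the 5 and 15 limits are the ones suggested above.

```python
def line_length_issues(lines, min_words=5, max_words=15):
    """Return (line_number, word_count, problem) for lines outside the range."""
    issues = []
    for number, line in enumerate(lines, start=1):
        # Stand-alone hyphens mark pauses, so they are not counted as words.
        count = len([token for token in line.split() if token != "-"])
        if count < min_words:
            issues.append((number, count, "too short"))
        elif count > max_words:
            issues.append((number, count, "too long"))
    return issues

transcript = [
    "yeah",
    "we went down to the river after lunch",
]
print(line_length_issues(transcript))
```

Short lines are not always avoidable (a one-word answer is still an utterance), so the output is best treated as a list of places to double-check.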
Some annotation tools allow for marking periods of simultaneous speech - i.e. periods during which there’s more than one person speaking. These periods should be aligned as accurately as possible, because some automatic processing (e.g. forced alignment) ignores simultaneous speech; short simultaneous-speech utterances ensure that as little speech as possible is ignored.
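To see why tight boundaries matter, the amount of simultaneous speech can be estimated from utterance times. The sketch below assumes each speaker's utterances are available as `(start, end)` pairs in seconds and do not overlap within a single speaker; the function name and data layout are hypothetical, not a LaBB-CAT format.

```python
def overlap_seconds(utterances_a, utterances_b):
    """Total time during which both speakers have an utterance in progress."""
    total = 0.0
    for start_a, end_a in utterances_a:
        for start_b, end_b in utterances_b:
            # Length of the intersection of the two intervals, if any.
            total += max(0.0, min(end_a, end_b) - max(start_a, start_b))
    return total

# Same recording, with the second speaker's utterance marked tightly vs. loosely.
tight = overlap_seconds([(0.0, 5.0)], [(4.0, 6.0)])
loose = overlap_seconds([(0.0, 5.0)], [(1.0, 6.0)])
print(tight, loose)
```

With the loosely marked boundary, four seconds of the first speaker's utterance count as simultaneous speech instead of one, so more speech would be skipped by processing that ignores overlaps.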