nzilbb.formatter.text (1.5.1)

Serializer/Deserializer for transcripts in plain text format.

Support for plain-text transcripts includes both import and export of transcripts as .txt files. Transcripts can be textual documents, or transcriptions of speech attributable to different speakers.

For importing/parsing files, the document structure is assumed to be a meta-data ‘header’ followed by speaker turns introduced by speaker IDs. Each line in a speaker turn is taken to be an utterance.

By default, the speaker ID of a turn is a word followed by a colon (this is configurable). It marks the beginning of utterances by that speaker, and all subsequent lines are assumed to belong to that speaker's turn until the next speaker ID is encountered.

All lines before the first speaker ID are assumed to be part of the ‘header’, each line possibly containing meta-data of the form key=value (this form is also configurable). Header lines that don't conform to the meta-data pattern are prepended to the transcript as comment annotations.

If no speaker IDs are found, the entire document is assumed to be a written text rather than a transcript of speech, and all lines are attributed to a single person called “author”.
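
The import rules above can be illustrated with a minimal Python sketch. It is not the formatter's implementation: the function name parse_transcript and both regular expressions are assumptions made for illustration, based only on the default formats described here. It also assumes that text following a speaker ID on the same line counts as an utterance, ignores the maxHeaderLines and maxParticipantLength limits, and does not handle time-stamp lines.

```python
import re

# Assumed defaults: a word followed by a colon, and key=value.
SPEAKER = re.compile(r"^(\w+):\s*(.*)$")
META = re.compile(r"^([^=]+)=(.*)$")

def parse_transcript(text):
    """Split plain text into (meta-data, header comments, turns).

    Hypothetical helper; time-stamp lines are not handled here.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    first_turn = next(
        (i for i, line in enumerate(lines) if SPEAKER.match(line)), None)
    if first_turn is None:
        # No speaker IDs: treat everything as a text by "author".
        return {}, [], [("author", lines)]

    meta, comments = {}, []
    for line in lines[:first_turn]:
        m = META.match(line)
        if m:
            meta[m.group(1).strip()] = m.group(2).strip()
        else:
            # Non-conforming header lines become comments.
            comments.append(line)

    turns = []
    for line in lines[first_turn:]:
        m = SPEAKER.match(line)
        if m:
            # A speaker ID starts a new turn.
            turns.append((m.group(1), []))
            if m.group(2):
                turns[-1][1].append(m.group(2))
        else:
            # Any other line is an utterance in the current turn.
            turns[-1][1].append(line)
    return meta, comments, turns
```

Each turn is returned as a (speaker, utterances) pair, which keeps the one-line-per-utterance rule visible in the result.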

Synchronization

Utterances can be synchronized to a separate recording by including time-stamp lines throughout the text.

A time-stamp must occur on a line by itself, and is assumed to use the following format by default (the format is configurable):
HH:mm:ss.SSS

If time-stamps are found in the text, or it is loaded with a sound file, the document is assumed to be a transcript of a recording, and offsets are in seconds. Otherwise, the document is assumed to be a written text, and offsets are in characters.
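
As a rough illustration of the default time-stamp format, the following Python sketch converts an HH:mm:ss.SSS line into an offset in seconds. The function name timestamp_to_seconds and the regular expression are assumptions for illustration only; the formatter itself may parse time-stamps differently, and other configured formats are not covered.

```python
import re

# Assumed shape of the default format: two-digit hours, minutes and
# seconds, then three-digit milliseconds.
TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})$")

def timestamp_to_seconds(line):
    """Return the offset in seconds, or None if the line is not a time-stamp."""
    m = TIMESTAMP.match(line.strip())
    if m is None:
        return None
    hours, minutes, seconds, millis = (int(g) for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000

print(timestamp_to_seconds("00:01:02.500"))  # 62.5
```

Because the pattern is anchored at both ends, a time-stamp embedded in a longer line is not recognized, in keeping with the rule that a time-stamp must occur on a line by itself.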

Annotation Conventions

A number of textual annotation conventions are supported:

Non-speech commentary
enclosed in curly braces (surrounded by spaces), e.g.
then he went {pointing} over there
Noises
enclosed in square brackets (surrounded by spaces), e.g.
now [door slamming] let's start
Pronunciation
enclosed in square brackets, immediately following the word it's annotating, with no intervening space, e.g.
this was at Wingatui[wIN@tui]
Lexical tags
enclosed in round parentheses, immediately following the word it's annotating, with no intervening space, e.g.
I can't remem~(remember)
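
The four conventions can be told apart by whether the bracketed text stands alone or is attached to a word. The Python sketch below shows one way to do that with a single regular expression. The names TOKEN and parse_line and the pattern itself are illustrative assumptions, not the formatter's own parsing code, and nested or unbalanced brackets are not handled.

```python
import re

TOKEN = re.compile(r"""
    (?<!\S)\{(?P<comment>[^}]*)\}(?!\S)   # {comment} standing alone
  | (?<!\S)\[(?P<noise>[^\]]*)\](?!\S)    # [noise] standing alone
  | (?P<word>[^\s\[\](){}]+)              # a word token ...
    (?: \[(?P<pronounce>[^\]]*)\]         # ... with an attached [pronunciation]
      | \((?P<lexical>[^)]*)\)            # ... or an attached (lexical tag)
    )?
""", re.VERBOSE)

def parse_line(line):
    """Return a list of comment, noise and word items found in the line."""
    items = []
    for m in TOKEN.finditer(line):
        if m.group("comment") is not None:
            items.append(("comment", m.group("comment")))
        elif m.group("noise") is not None:
            items.append(("noise", m.group("noise")))
        else:
            # Collect whichever tag, if any, is attached to the word.
            tags = {name: m.group(name)
                    for name in ("pronounce", "lexical")
                    if m.group(name) is not None}
            items.append(("word", m.group("word"), tags))
    return items

print(parse_line("this was at Wingatui[wIN@tui]")[-1])
# ('word', 'Wingatui', {'pronounce': 'wIN@tui'})
```

The look-arounds on the first two alternatives are what separate a noise such as [door slamming] from a pronunciation tag: the former has white space (or a line boundary) on both sides, the latter follows its word directly.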

Configuration

The following parameters can be specified for the formatter:

participantLayer
Layer for speaker/participant identification (“participant” by default)
turnLayer
Layer for speaker turns (“turn” by default)
utteranceLayer
Layer for speaker utterances or text lines (“utterance” by default)
wordLayer
Layer for individual word tokens (“word” by default)
commentLayer
Annotation layer for commentary (“comment” by default)
noiseLayer
Annotation layer for background noises (“noise” by default)
lexicalLayer
Word annotation layer for lexical tags (“lexical” by default)
pronounceLayer
Word annotation layer for non-standard pronunciation tags (“pronounce” by default)
orthographyLayer
Word token orthography layer (excluding punctuation etc.), used when exporting texts with other word tags (“orthography” by default)
useConventions
Whether to use text conventions for comment, noise, lexical, and pronounce annotations
maxParticipantLength
The maximum length of a participant name/ID, when starting a speaker turn (20 characters by default)
maxHeaderLines
The maximum number of lines in a meta-data header (50 by default)
participantFormat
Format for marking a change of turn within the transcript body - e.g. {0}:, where {0} is a place-holder for the participant ID/name
metaDataFormat
Format for a meta-data line in the header - e.g. {0}={1}, where {0} is a place-holder for the attribute name or key, and {1} is a place-holder for the attribute value
timestampFormat
Format for a time stamp - e.g. HH:mm:ss.SSS
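
To show how place-holder formats such as {0}: and {0}={1} can drive parsing, here is a small Python sketch that turns a format string into a regular expression with one named group per place-holder. The function name format_to_regex and the group names (g0, g1) are hypothetical, and the formatter may interpret these settings differently; a format that repeats the same place-holder is not handled.

```python
import re

def format_to_regex(fmt):
    """Compile a format like '{0}={1}' into a regex with groups g0, g1, ..."""
    # Splitting on a capturing group keeps the place-holder numbers:
    # '{0}={1}' becomes ['', '0', '=', '1', ''].
    parts = re.split(r"\{(\d+)\}", fmt)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)       # literal text
        else:
            pattern += f"(?P<g{part}>.+?)"   # place-holder
    return re.compile(pattern)

meta = format_to_regex("{0}={1}").fullmatch("title=Interview 1")
print(meta.group("g0"), "/", meta.group("g1"))  # title / Interview 1
```

Using fullmatch means the whole line must fit the format, so a differently configured participantFormat such as [{0}] would match the line [mary] but not mary: in this sketch.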

Exporting texts with word tags

If an annotation graph is exported with an orthography layer and one or more word-tag layers (e.g. POS tags), then in the output text each word token has its annotation tags appended, with an underscore (_) as the delimiter between the token and each of its tags.
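
The export rule can be sketched in a few lines of Python. The function name export_tokens, the space between tokens, and the POS tag values in the example are assumptions for illustration; only the underscore delimiter comes from the description above.

```python
def export_tokens(tokens, delimiter="_"):
    """Join each token's orthography and tags with the delimiter.

    tokens is a list of (orthography, [tag, ...]) pairs.
    """
    return " ".join(delimiter.join([orth, *tags]) for orth, tags in tokens)

print(export_tokens([("the", ["DT"]), ("cat", ["NN"]), ("sat", ["VBD"])]))
# the_DT cat_NN sat_VBD
```

A token with no tags is emitted unchanged, and a token with several tags gets one delimiter per tag.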