nzilbb.formatter.trmparsercsv (0.1.1)
Generates CSV files specifically for export/import of data for the trm-parser implemented by Connor Taylor-Brown for Māori data.
When serializing fragments, the following transformations are made:
- Vowels with umlauts or followed by a colon are macronized.
- English words are enclosed in square brackets.
- Utterances are split on full stops and pauses of 1000ms or longer, creating two fragments per utterance.
- All punctuation is removed.
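The macronization rule above can be illustrated with a minimal Python sketch. This is not the formatter's own code: the character table, the function name `macronize`, and the choice to handle upper-case vowels are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumed mapping tables; the formatter's real character table is not shown here.
UMLAUT_TO_MACRON = str.maketrans("äëïöüÄËÏÖÜ", "āēīōūĀĒĪŌŪ")
PLAIN_TO_MACRON = dict(zip("aeiouAEIOU", "āēīōūĀĒĪŌŪ"))


def macronize(text: str) -> str:
    """Hypothetical helper: convert umlauted vowels and vowel+colon to macrons."""
    # Compose any combining diaereses so each umlauted vowel is one code point.
    text = unicodedata.normalize("NFC", text)
    # Vowels with umlauts become vowels with macrons.
    text = text.translate(UMLAUT_TO_MACRON)
    # A plain vowel followed by a colon becomes the macronized vowel (colon dropped).
    return re.sub(r"([aeiouAEIOU]):", lambda m: PLAIN_TO_MACRON[m.group(1)], text)


print(macronize("Ma:ori"))
print(macronize("Mäori"))
```

In this sketch macronization runs before any punctuation stripping, since the colon would otherwise be lost before it could mark a long vowel.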
A CSV file is generated with the following columns:
- Document: the transcript ID.
- Speaker: the participant ID.
- MatchId: the MatchId-encoded identifier for the fragment.
- ID: the unique identifier for the fragment.
- Original: the original, unstandardized text of the fragment.
- WithPauses: the standardized fragment text with pause length (in seconds) between each token.
- Terminator: the reason for terminating the fragment, which can be:
  - `.` or `-`: there was a pause marker,
  - a number like `1.234`: there was an inter-token pause,
  - `utterance`: it was the end of the utterance, or
  - `turn`: it was the end of the speaker turn.
- Fragment: the standardized fragment text.
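As a rough illustration of how a consumer might read such a file, the Python sketch below parses the columns listed above and interprets the Terminator value. It assumes a comma-delimited file with a header row using exactly these column names; the sample rows, the helper names `classify_terminator` and `read_fragments`, and the added `TerminatorKind` key are hypothetical and not part of the formatter.

```python
import csv
import io
import re


def classify_terminator(value: str) -> str:
    """Hypothetical helper: map a Terminator value to the reason it describes."""
    value = value.strip()
    if value in (".", "-"):
        return "pause marker"
    if value == "utterance":
        return "end of utterance"
    if value == "turn":
        return "end of speaker turn"
    # A number like 1.234 is an inter-token pause length in seconds.
    if re.fullmatch(r"\d+(\.\d+)?", value):
        return "inter-token pause"
    return "unknown"


# Made-up placeholder rows, only to show the column layout.
SAMPLE = """Document,Speaker,MatchId,ID,Original,WithPauses,Terminator,Fragment
doc-1,spk-1,match-1,1,original text,standardized text,1.234,standardized text
doc-1,spk-1,match-2,2,original text,standardized text,turn,standardized text
"""


def read_fragments(stream):
    """Read fragment rows and annotate each with its classified terminator."""
    return [
        dict(row, TerminatorKind=classify_terminator(row["Terminator"]))
        for row in csv.DictReader(stream)
    ]


rows = read_fragments(io.StringIO(SAMPLE))
for row in rows:
    print(row["ID"], row["Terminator"], row["TerminatorKind"])
```

Checking the literal values before the numeric pattern keeps `.` and `-` from being mistaken for malformed numbers.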