nzilbb.formatter.trmparsercsv (0.1.1)

Generates CSV files specifically for export/import of data for the trm-parser implemented by Connor Taylor-Brown for Māori data.

When serializing fragments, the following transformations are made:

  • Vowels with umlauts or followed by a colon are macronized.
  • English words are enclosed in square brackets.
  • Utterances are split on full stops and on pauses of 1000 ms or longer, so each split divides the utterance into two fragments.
  • All punctuation is removed.
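
The text-level transformations above can be illustrated with a minimal Python sketch. This is not the formatter's own code: the function names macronize and standardize, the english_words parameter, and the order in which the steps are applied are all assumptions made for illustration. Splitting on pauses is left out because it needs token timing data.

```python
import re
import unicodedata

# Umlauted vowels mapped to their macronized equivalents.
UMLAUT_TO_MACRON = str.maketrans("äëïöüÄËÏÖÜ", "āēīōūĀĒĪŌŪ")
# Plain vowels mapped to macronized vowels, for the "vowel + colon" convention.
COLON_TO_MACRON = dict(zip("aeiouAEIOU", "āēīōūĀĒĪŌŪ"))


def macronize(text):
    """Replace umlauted vowels and vowel+colon pairs with macronized vowels."""
    # Compose any decomposed characters so that "a" + combining diaeresis becomes "ä".
    text = unicodedata.normalize("NFC", text)
    text = text.translate(UMLAUT_TO_MACRON)
    return re.sub(r"([aeiouAEIOU]):", lambda m: COLON_TO_MACRON[m.group(1)], text)


def standardize(text, english_words=frozenset()):
    """Macronize, strip punctuation, and bracket English words.

    english_words is a hypothetical set of lowercase words to treat as
    English; how the real formatter identifies English words is not
    covered here.
    """
    tokens = []
    # Macronize first, so the colon convention is handled before punctuation is stripped.
    for token in macronize(text).split():
        word = re.sub(r"[^\w]|_", "", token)  # drop every non-word character
        if not word:
            continue  # the token was punctuation only
        if word.lower() in english_words:
            word = f"[{word}]"
        tokens.append(word)
    return " ".join(tokens)
```

For example, standardize("Kei te computer ahau.", {"computer"}) gives "Kei te [computer] ahau" under these assumptions.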

A CSV file is generated with the following columns:

  • Document - the transcript ID.
  • Speaker - the participant ID.
  • MatchId - the MatchId-encoded identifier for the fragment.
  • ID - the unique identifier for the fragment.
  • Original - the original, unstandardized text of the fragment.
  • WithPauses - the standardized fragment text with the pause length (in seconds) between consecutive tokens.
  • Terminator - the reason for terminating the fragment, which can be:
    • . or - : there was a pause marker,
    • A number like 1.234 : there was an inter-token pause,
    • utterance : it was the end of the utterance, or
    • turn : it was the end of the speaker turn.
  • Fragment - the standardized fragment text.
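
As a rough illustration of consuming such a file, the following Python sketch reads rows by column name and interprets the Terminator value according to the list above. The sample rows are invented placeholders: the MatchId and ID values, the layout of WithPauses, and the column order are assumptions, and describe_terminator is a hypothetical helper, not part of the formatter.

```python
import csv
import io
import re

# Invented sample data; only the column names come from the description above.
SAMPLE = """\
Document,Speaker,MatchId,ID,Original,WithPauses,Terminator,Fragment
doc1,spk1,match-1,frag-1,"Te:na: koe.",tēnā 0.250 koe,.,tēnā koe
doc1,spk1,match-2,frag-2,e hoa,e 0.100 hoa,1.234,e hoa
doc1,spk1,match-3,frag-3,kia ora,kia 0.050 ora,turn,kia ora
"""

END_MARKERS = {"utterance": "end of utterance", "turn": "end of speaker turn"}


def describe_terminator(value):
    """Classify a Terminator cell into one of the documented reasons."""
    if value in (".", "-"):
        return "pause marker"
    if value in END_MARKERS:
        return END_MARKERS[value]
    if re.fullmatch(r"\d+(\.\d+)?", value):
        return "inter-token pause"  # the value is the pause length, e.g. 1.234
    return "unknown"


def terminator_reasons(csv_text):
    """Return (fragment ID, reason) pairs for every row of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["ID"], describe_terminator(row["Terminator"])) for row in reader]
```

Calling terminator_reasons(SAMPLE) pairs each fragment ID with the reason its fragment ended, which is a convenient first check when inspecting an exported file.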