TrsToKaldi

Converts Transcriber .trs files to corpus input files for Kaldi

The participant genders from the Transcriber transcripts are used, if present, to generate the spk2gender file. The following participant meta-data is lost during conversion:

  • dialect
  • accent
  • scope
  • version
  • version date
  • air date
  • scribe
  • language

The following Transcriber annotations are lost during conversion:

  • phrase language annotations
  • named entity annotations
  • comments
  • noises
  • lexical tags
  • pronounce tags

By default, all words are converted to lowercase, and extraneous punctuation is removed. To disable this behaviour, use the –cleanOrthography=false command line switch.

Deserializing from “Transcriber transcript” text/xml-transcriber

Command-line configuration parameters for deserialization:

--topicLayer=Layer Topic tags
--commentLayer=Layer Commentary
--noiseLayer=Layer Noise annotations
--languageLayer=Layer Inline language tags
--lexicalLayer=Layer Lexical tags
--pronounceLayer=Layer Manual pronunciation tags
--entityLayer=Layer Named entities
--scribeLayer=Layer Name of transcriber
--versionLayer=Layer Version of transcriber
--versionDateLayer=Layer Version date of transcriber
--programLayer=Layer Name of the program recorded
--airDateLayer=Layer Date the program aired
--transcriptLanguageLayer=Layer The language of the whole transcript
--participantCheckLayer=Layer Participant checked
--genderLayer=Layer Gender - participant ‘type’
--dialectLayer=Layer Participant's dialect
--accentLayer=Layer Participant's accent
--scopeLayer=Layer Participant's ‘scope’

Serializing to “Kaldi Files” text/x-kaldi-text

Command-line configuration parameters for serialization:

--orthographyLayer=Layer Orthography tags
--pronunciationLayer=Layer Pronunciation tags
--genderLayer=Layer Participant gender
--prefixUtteranceId=Boolean Whether to prefix utterance IDs with the speaker ID or not.
--wavBasePath=String Base path to prefix all wav files names.