TrsToKaldi

Converts Transcriber .trs files to corpus input files for Kaldi.

The participant genders from the Transcriber transcripts are used, if present, to generate the spk2gender file. The following participant meta-data is lost during conversion:

dialect
accent
scope
version
version date
air date
scribe
language
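
For reference, Kaldi's spk2gender file holds one `<speaker-id> <gender>` pair per line, with the gender written as `m` or `f`. The sketch below illustrates that output format only. The function name `spk2gender_lines`, the sample speaker IDs, and the mapping from Transcriber gender values (`male`, `female`) to `m`/`f` are assumptions for illustration, not the converter's actual code.

```python
def spk2gender_lines(genders):
    """Return sorted spk2gender lines for speakers whose gender is known.

    `genders` maps a speaker ID to a Transcriber gender value, assumed here
    to be "male" or "female". Speakers with any other value are skipped.
    """
    codes = {"male": "m", "female": "f"}  # assumed mapping
    lines = []
    for speaker in sorted(genders):
        code = codes.get((genders[speaker] or "").lower())
        if code:
            lines.append(f"{speaker} {code}")
    return lines


if __name__ == "__main__":
    # Hypothetical speakers; spk3 has no gender and is left out.
    sample = {"spk2": "female", "spk1": "male", "spk3": None}
    print("\n".join(spk2gender_lines(sample)))
```

Speakers without a usable gender are simply omitted in this sketch, which mirrors the "if present" condition above.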

The following Transcriber annotations are lost during conversion:

phrase language annotations
named entity annotations
comments
noises
lexical tags
pronounce tags

By default, all words are converted to lowercase, and extraneous punctuation is removed. To disable this behaviour, use the `--cleanOrthography=false` command-line switch.
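
The default cleaning described above can be pictured roughly as follows. This is a minimal sketch, not the converter's implementation: the function name `clean_orthography` is hypothetical, and the exact rule for what counts as "extraneous" punctuation (here, everything except word-internal apostrophes and hyphens) is an assumption.

```python
import re


def clean_orthography(word):
    """Lowercase a word and strip punctuation (assumed rule, see lead-in)."""
    word = word.lower()
    # Drop every character that is not a word character (letter, digit,
    # underscore), an apostrophe, or a hyphen.
    word = re.sub(r"[^\w'-]", "", word)
    # Trim apostrophes and hyphens left dangling at either edge.
    return word.strip("'-")


if __name__ == "__main__":
    for raw in ["Hello,", "don't", '"Well-known!"']:
        print(raw, "->", clean_orthography(raw))
```

Under these assumptions, contractions and hyphenated words survive intact while quotation marks, commas, and similar marks are dropped.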

Deserializing from “Transcriber transcript” text/xml-transcriber

Command-line configuration parameters for deserialization:


`--topicLayer=`Layer	Topic tags
`--commentLayer=`Layer	Commentary
`--noiseLayer=`Layer	Noise annotations
`--languageLayer=`Layer	Inline language tags
`--lexicalLayer=`Layer	Lexical tags
`--pronounceLayer=`Layer	Manual pronunciation tags
`--entityLayer=`Layer	Named entities
`--scribeLayer=`Layer	Name of transcriber
`--versionLayer=`Layer	Version of transcriber
`--versionDateLayer=`Layer	Version date of transcriber
`--programLayer=`Layer	Name of the program recorded
`--airDateLayer=`Layer	Date the program aired
`--transcriptLanguageLayer=`Layer	The language of the whole transcript
`--participantCheckLayer=`Layer	Participant checked
`--genderLayer=`Layer	Gender - participant ‘type’
`--dialectLayer=`Layer	Participant's dialect
`--accentLayer=`Layer	Participant's accent
`--scopeLayer=`Layer	Participant's ‘scope’

Serializing to “Kaldi Files” text/x-kaldi-text

Command-line configuration parameters for serialization:


`--orthographyLayer=`Layer	Orthography tags
`--pronunciationLayer=`Layer	Pronunciation tags
`--genderLayer=`Layer	Participant gender
`--prefixUtteranceId=`Boolean	Whether to prefix utterance IDs with the speaker ID or not.
`--wavBasePath=`String	Base path to prefix to all wav file names.
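
The effect of the last two options can be sketched as below. Kaldi's data preparation conventions recommend that each utterance ID start with its speaker ID so that files such as utt2spk sort consistently, and wav.scp lists one `<recording-id> <path>` pair per line. The helper names, the `-` separator, and the sample IDs and paths are assumptions for illustration; the converter's actual ID format may differ.

```python
import posixpath


def utterance_id(speaker_id, utterance, prefix_utterance_id=True):
    """Build an utterance ID, optionally prefixed with the speaker ID.

    The "-" separator is an assumption made for this sketch.
    """
    return f"{speaker_id}-{utterance}" if prefix_utterance_id else utterance


def wav_scp_line(recording_id, wav_name, wav_base_path=""):
    """Build one wav.scp line, prefixing the wav file name with a base path."""
    return f"{recording_id} {posixpath.join(wav_base_path, wav_name)}"


if __name__ == "__main__":
    # Hypothetical IDs and paths.
    print(utterance_id("spk1", "utt001"))
    print(utterance_id("spk1", "utt001", prefix_utterance_id=False))
    print(wav_scp_line("rec1", "rec1.wav", "/data/wav"))
```

With an empty base path, the wav file name is written unchanged in this sketch.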