TextGridToKaldi

Converts Praat TextGrids to corpus input files for Kaldi.

The Praat TextGrid format is extremely flexible and there are many different possible ways a transcript can be structured. This converter assumes the following principles:

the TextGrid is generally an orthographic transcription of speech
each tier is named after the speaker
all tiers are labelled intervals
the interval labels are utterance transcripts - i.e. contain multiple word orthographies

All tiers will be interpreted as transcription of participant speech. If some tiers contain other annotations, use the `--ignoreTiers` command-line switch to exclude them from the conversion using a regular expression, e.g.: `--ignoreTiers=(segments.*)|(target)`
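
As a rough illustration of how such a pattern selects tiers, the following Python sketch filters a list of tier names using the example expression above. It assumes the expression must match the whole tier name; the converter's actual matching rule is not stated here. The function name and the tier names are hypothetical.

```python
import re

# Pattern taken from the --ignoreTiers example above.
IGNORE_TIERS = re.compile(r"(segments.*)|(target)")


def tiers_to_convert(tier_names, pattern=IGNORE_TIERS):
    """Return the tier names that the pattern does not exclude.

    Assumption: the pattern has to match the entire tier name.
    """
    return [name for name in tier_names if not pattern.fullmatch(name)]


# Hypothetical tier names, for illustration only.
tiers = ["Mary", "John", "segments-Mary", "target"]
print(tiers_to_convert(tiers))
```

Under the whole-name assumption, a tier called `targets` would still be converted, because `target` only matches part of that name.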

By default, all words are converted to lowercase, and extraneous punctuation is removed. To disable this behaviour, use the `--cleanOrthography=false` command-line switch.
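
The kind of clean-up described above might look something like the following Python sketch. Exactly which punctuation the converter treats as extraneous is not specified here, so this sketch assumes that only punctuation at the edges of a word is stripped and that word-internal characters such as apostrophes are kept. The function name is hypothetical.

```python
import re


def clean_orthography(label):
    """Lowercase each word and strip punctuation from its edges.

    Assumption: characters inside a word (e.g. the apostrophe in
    "can't") are kept; the converter's real rules may differ.
    """
    words = []
    for token in label.split():
        # Remove non-word characters at the start and end of the token only.
        cleaned = re.sub(r"^\W+|\W+$", "", token.lower())
        if cleaned:  # drop tokens that were punctuation only
            words.append(cleaned)
    return " ".join(words)


print(clean_orthography("Well, I can't REMEMBER!"))
```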

Praat doesn't support participant meta-data, so the ‘spk2gender’ file is not generated.

Praat has no direct mechanism for marking non-speech annotations in their position within the transcript text. However, this converter supports the use of textual conventions in various ways to make certain annotations:

To tag a word with its pronunciation, enter the pronunciation in square brackets, directly following the word (i.e. with no intervening space), e.g.: …this was at Wingatui[wIN@tui]…
To tag a word with its full orthography (if the transcript doesn't include it), enter the orthography in round parentheses, directly following the word (i.e. with no intervening space), e.g.: …I can't remem~(remember)…
To insert a noise annotation within the text, enclose it in square brackets (surrounded by spaces so it's not taken as a pronunciation annotation), e.g.: …sometimes me [laughs] not always but sometimes…
To insert a comment annotation within the text, enclose it in curly braces (surrounded by spaces), e.g.: …beautifully warm {softly} but its…

To enable these transcription conventions, use the `--useConventions` command-line switch.
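
To make the four conventions concrete, here is a minimal Python sketch that splits an interval label into words, noises, and comments. It is not the converter's own parser: the regular expression, the function name, and the tuple layout are all assumptions made for illustration.

```python
import re

# One alternative per convention. A bracket that directly follows a word
# is read as a tag on that word; a bracket standing alone is a noise.
TOKEN = re.compile(
    r"\[(?P<noise>[^\]]*)\]"        # [noise] standing alone
    r"|\{(?P<comment>[^}]*)\}"      # {comment} standing alone
    r"|(?P<word>[^\s\[\](){}]+)"    # a word ...
    r"(?:\[(?P<pron>[^\]]*)\])?"    # ... optionally tagged [pronunciation]
    r"(?:\((?P<lex>[^)]*)\))?"      # ... and/or tagged (full orthography)
)


def parse_label(label):
    """Return a list of tuples describing the label's contents.

    Words become ("word", text, pronunciation, orthography), with None
    for absent tags; noises and comments become ("noise", text) and
    ("comment", text).
    """
    annotations = []
    for m in TOKEN.finditer(label):
        if m.group("noise") is not None:
            annotations.append(("noise", m.group("noise")))
        elif m.group("comment") is not None:
            annotations.append(("comment", m.group("comment")))
        else:
            annotations.append(
                ("word", m.group("word"), m.group("pron"), m.group("lex"))
            )
    return annotations


for item in parse_label("sometimes me [laughs] at Wingatui[wIN@tui]"):
    print(item)
```

The surrounding spaces are what distinguish a noise from a pronunciation tag: in this sketch, `me [laughs]` yields a word followed by a noise, whereas `me[laughs]` would tag the word `me` with the pronunciation `laughs`.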

Deserializing from “Praat TextGrid” text/praat-textgrid

Command-line configuration parameters for deserialization:


`--commentLayer=`Layer	Commentary
`--noiseLayer=`Layer	Noise annotations
`--lexicalLayer=`Layer	Lexical tags
`--pronounceLayer=`Layer	Manual pronunciation tags
`--renameShortNumericSpeakers=`Boolean	Short speaker names like ‘S1’ should be prefixed with the transcript name during import
`--allowPeerOverlap=`Boolean	Allows TextGrids with, for example, multiple segment tiers, if the underlying annotations are invalid and have overlapping segments.
`--utteranceThreshold=`Double	Minimum inter-word pause to trigger an utterance boundary, when no utterance layer is mapped. 0 means ‘do not infer utterance boundaries’.
`--useConventions=`Boolean	Whether to use text conventions for comment, noise, lexical, and pronounce annotations
`--ignoreLabels=`String	Regular expression for annotations to ignore, e.g. <p:> to ignore MAUS pauses
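
The effect of `--utteranceThreshold` can be pictured with the following Python sketch, which groups word intervals into utterances wherever the pause between consecutive words reaches the threshold. It assumes times and the threshold are in seconds, and it returns all words as a single group when the threshold is 0; both are assumptions, as is the function name.

```python
def infer_utterances(words, threshold):
    """Group (start, end, label) word intervals into utterances.

    A new utterance starts whenever the pause since the previous word
    is at least `threshold`. A threshold of 0 disables inference, in
    which case this sketch returns all words as one group.
    """
    if not words:
        return []
    if threshold <= 0:
        return [list(words)]
    utterances = [[words[0]]]
    for previous, current in zip(words, words[1:]):
        pause = current[0] - previous[1]  # gap between end and next start
        if pause >= threshold:
            utterances.append([current])
        else:
            utterances[-1].append(current)
    return utterances


# Hypothetical word intervals: a one-second pause before "yes".
words = [(0.0, 0.5, "hello"), (0.5, 1.0, "there"), (2.0, 2.5, "yes")]
for utterance in infer_utterances(words, 0.5):
    print(" ".join(label for _, _, label in utterance))
```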

Serializing to “Kaldi Files” text/x-kaldi-text

Command-line configuration parameters for serialization:


`--orthographyLayer=`Layer	Orthography tags
`--pronunciationLayer=`Layer	Pronunciation tags
`--genderLayer=`Layer	Participant gender
`--prefixUtteranceId=`Boolean	Whether to prefix utterance IDs with the speaker ID or not.
`--wavBasePath=`String	Base path to prefix to all wav file names.
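
The following Python sketch illustrates what `--prefixUtteranceId` and `--wavBasePath` could mean for the generated files. Kaldi's `wav.scp` file maps a recording ID to a wav path, one pair per line, and Kaldi's data preparation guidelines recommend making the speaker ID a prefix of each utterance ID so that files sort consistently. The ID layout, the function names, and the assumption that the base path is joined to the file name with a slash are all hypothetical; the converter's actual output may differ.

```python
import posixpath


def make_utterance_id(speaker_id, recording_id, index, prefix_utterance_id=True):
    """Build an utterance ID, optionally prefixed with the speaker ID.

    The "<recording>-<index>" layout is an assumption for illustration.
    """
    base = f"{recording_id}-{index:04d}"
    return f"{speaker_id}-{base}" if prefix_utterance_id else base


def wav_scp_line(recording_id, wav_name, wav_base_path=""):
    """Build one wav.scp line: '<recording-id> <path-to-wav>'.

    Assumption: the base path is joined to the file name with a slash.
    """
    return f"{recording_id} {posixpath.join(wav_base_path, wav_name)}"


# Hypothetical speaker, recording, and path names.
print(make_utterance_id("mary", "interview1", 3))
print(make_utterance_id("mary", "interview1", 3, prefix_utterance_id=False))
print(wav_scp_line("interview1", "interview1.wav", "/data/wav"))
```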