TextToEaf
Converts time-aligned plain text .txt transcripts to ELAN .eaf files.
The plain text transcript must include synchronisation information
(i.e. time codes) and must end in a time code, indicating the end time
of the last utterance.
Consecutive lines without intervening time codes will be merged into one
ELAN annotation.
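The merging rule can be sketched as follows. This is a minimal illustration, not the converter's own code: the class and method names are hypothetical, and it assumes each time code sits on a line of its own in HH:mm:ss.SSS format, which may not match your transcripts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class MergeSketch {
  // Assumption: a time code occupies its own line and looks like 00:01:02.500.
  static final Pattern TIME_CODE = Pattern.compile("\\d{2}:\\d{2}:\\d{2}\\.\\d{3}");

  // Returns one "start|end|text" string per annotation.
  static List<String> annotations(List<String> lines) {
    List<String> result = new ArrayList<>();
    String start = null;                     // most recent time code seen
    StringBuilder text = new StringBuilder(); // lines collected since then
    for (String line : lines) {
      String trimmed = line.trim();
      if (TIME_CODE.matcher(trimmed).matches()) {
        // A time code closes the pending annotation, if there is one.
        if (start != null && text.length() > 0) {
          result.add(start + "|" + trimmed + "|" + text);
        }
        start = trimmed;
        text.setLength(0);
      } else if (!trimmed.isEmpty()) {
        // Consecutive text lines are merged into a single annotation.
        if (text.length() > 0) text.append(' ');
        text.append(trimmed);
      }
    }
    // Text after the last time code has no end time and is never emitted.
    return result;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "00:00:00.000", "hello", "world", "00:00:02.000", "bye", "00:00:03.500");
    for (String annotation : annotations(lines)) {
      System.out.println(annotation);
    }
  }
}
```

In this sketch, any text after the final time code is dropped because it has no end time, which is one way to see why the transcript has to end in a time code.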
Check that the --timestampFormat setting matches your time codes.
This setting uses Java SimpleDateFormat pattern syntax:
https://docs.oracle.com/javase/8/docs/api/index.html?java/text/SimpleDateFormat.html
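One way to check a pattern against a sample time code before converting is to try it with java.text.SimpleDateFormat directly. The sketch below is only an illustration: the class and method names are hypothetical, and setting the UTC time zone is an assumption made so that the parsed value equals the offset from 00:00:00.000.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

class TimestampCheck {
  // Hypothetical helper: parses a time code with the given pattern and
  // returns its offset in seconds from 00:00:00.000.
  static double toSeconds(String pattern, String timeCode) {
    SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
    format.setTimeZone(TimeZone.getTimeZone("UTC")); // epoch is then 00:00:00.000
    format.setLenient(false); // reject out-of-range fields such as 99 minutes
    try {
      return format.parse(timeCode).getTime() / 1000.0;
    } catch (ParseException e) {
      throw new IllegalArgumentException(
          "'" + timeCode + "' does not match pattern '" + pattern + "'", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toSeconds("HH:mm:ss.SSS", "00:01:02.500")); // prints 62.5
  }
}
```

If the pattern and the time code disagree (for example a pattern of HH:mm:ss.SSS against a time code like 1:02), the helper throws an exception rather than returning a misleading value.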
Deserializing from “Plain Text Document” text/plain
Command-line configuration parameters for deserialization:
--commentLayer=Layer
    Commentary
--noiseLayer=Layer
    Background noises
--lexicalLayer=Layer
    Lexical tags
--pronounceLayer=Layer
    Non-standard pronunciation tags
--orthographyLayer=Layer
    Orthography
--useConventions=Boolean
    Whether to use text conventions for comment, noise, lexical, and pronounce annotations
--maxParticipantLength=Integer
    The maximum length of a participant name
--maxHeaderLines=Integer
    The maximum number of lines in a meta-data header
--participantFormat=String
    Format for marking a change of turn within the transcript body - e.g. {0}:, where {0} is a place-holder for the participant ID/name
--metaDataFormat=String
    Format for a meta-data line in the header - e.g. {0}={1}, where {0} is a place-holder for the attribute name or key, and {1} is a place-holder for the attribute value
--tagFormat=String
    Output format for tags - e.g. {0}_{1} for output like ‘the_DET’, where {0} is a place-holder for the word, and {1} is a place-holder for the annotation label
--includeMissingTags=Boolean
    Whether to output missing tags with Tag Format, e.g. if the word ‘the’ has no tag, setting this to true will output ‘the_’, and false will output ‘the’.
--timestampFormat=String
    Format for a time stamp - e.g. HH:mm:ss.SSS
--nonWordPattern=String
    Regular expression to identify non-word characters for joining to a neighboring word, e.g. (([0-9]+.[0-9]+)) | ([\p{Punct}&&[^_]]) - set this blank to simply tokenize on spaces.
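The {0} and {1} place-holders in tagFormat resemble java.text.MessageFormat patterns. Assuming that is the syntax in use (this document does not confirm it), the interaction between tagFormat and includeMissingTags can be sketched as below. The class and method names are hypothetical and are not part of the converter.

```java
import java.text.MessageFormat;

class TagFormatSketch {
  // Hypothetical helper mirroring --tagFormat and --includeMissingTags.
  // Assumes the format string follows java.text.MessageFormat syntax.
  static String render(String tagFormat, boolean includeMissingTags,
                       String word, String tag) {
    boolean missing = tag == null || tag.isEmpty();
    if (missing && !includeMissingTags) {
      return word; // untagged word is output on its own
    }
    // A missing tag is rendered as an empty label, giving e.g. "the_".
    return MessageFormat.format(tagFormat, word, missing ? "" : tag);
  }

  public static void main(String[] args) {
    System.out.println(render("{0}_{1}", true, "the", "DET")); // the_DET
    System.out.println(render("{0}_{1}", true, "the", null));  // the_
    System.out.println(render("{0}_{1}", false, "the", null)); // the
  }
}
```

Note that MessageFormat treats single quotes and braces specially, so a format containing those characters would need escaping under this assumption.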
Serializing to “ELAN EAF Transcript” text/x-eaf+xml
Command-line configuration parameters for serialization:
--commentLayer=Layer
    Commentary
--noiseLayer=Layer
    Noise annotations
--lexicalLayer=Layer
    Lexical tags
--pronounceLayer=Layer
    Manual pronunciation tags
--authorLayer=Layer
    Name of transcriber
--dateLayer=Layer
    Document date
--languageLayer=Layer
    The language of the whole transcript
--phraseLanguageLayer=Layer
    For tagging individual phrases with a language
--useConventions=Boolean
    Whether to use text conventions for comment, noise, lexical, and pronounce annotations
--ignoreBlankAnnotations=Boolean
    Whether to skip annotations with no label, or process them
--minimumTurnPauseLength=Double
    Minimum amount of time between two turns by the same speaker, with no intervening speaker, for which the inter-turn pause counts as a turn change boundary. If the pause is shorter than this, the turns are merged into one.
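The effect of minimumTurnPauseLength can be illustrated with a small sketch. It is not the serializer's own implementation: the Turn class and merge method are hypothetical, times are assumed to be in seconds, and the input is assumed to be sorted by start time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TurnMergeSketch {
  // Hypothetical turn representation: speaker plus start/end in seconds.
  static final class Turn {
    final String speaker;
    final double start;
    final double end;
    Turn(String speaker, double start, double end) {
      this.speaker = speaker;
      this.start = start;
      this.end = end;
    }
  }

  // Merges adjacent turns by the same speaker when the pause between them
  // is shorter than minimumTurnPauseLength. Input must be sorted by start.
  static List<Turn> merge(List<Turn> turns, double minimumTurnPauseLength) {
    List<Turn> merged = new ArrayList<>();
    for (Turn turn : turns) {
      Turn last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
      boolean sameSpeaker = last != null && last.speaker.equals(turn.speaker);
      if (sameSpeaker && turn.start - last.end < minimumTurnPauseLength) {
        // Pause too short to count as a boundary: extend the previous turn.
        merged.set(merged.size() - 1,
            new Turn(last.speaker, last.start, Math.max(last.end, turn.end)));
      } else {
        // Different speaker, or a long enough pause: start a new turn.
        merged.add(turn);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Turn> turns = Arrays.asList(
        new Turn("A", 0.0, 1.0), new Turn("A", 1.2, 2.0),
        new Turn("B", 2.0, 3.0),
        new Turn("A", 3.0, 4.0), new Turn("A", 5.0, 6.0));
    for (Turn t : merge(turns, 0.5)) {
      System.out.println(t.speaker + " " + t.start + "-" + t.end);
    }
  }
}
```

With a threshold of 0.5, the first two turns by A (a 0.2 second pause) are merged, while the last two (a 1 second pause) stay separate. The turn by B always breaks the sequence, because merging only applies when no other speaker intervenes.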
nzilbb.converter.txttoeaf