ChaToTrs

Converts CLAN CHAT transcripts to Transcriber files

Transcriber doesn't support meta-data like @Recording Quality, @Location, etc. so CHAT header meta-data is instead included as comments at the beginning of the Transcriber transcript.

This conversion will only work well for CHAT transcripts that are fully aligned; i.e. all lines include time alignment bullets.

The CLAN parser is not exhaustive; it one parses:

  • Disfluency marking with &+ - e.g. so &+sund Sunday
  • Non-standard form expansion - e.g. gonna [: going to]
  • Incomplete word completion - e.g. dinner doin(g) all
  • Acronym/proper name joining with _ - e.g. no T_V in my room
  • Retracing - e.g. <some friends and I> [//] uh or and sit [//] sets him
  • Repetition/stuttered false starts - e.g. the <picnic> [/] picnic or the Saturday [/] in the morning
  • Errors - e.g. they've <work up a hunger> [* s:r] or they got [* m] to

Deserializing from “CLAN CHAT transcript” text/x-chat

Command-line configuration parameters for deserialization:

--cUnitLayer=Layer Layer for marking c-units
--tokenLayer=Layer Output word tokens come from this layer
--disfluencyLayer=Layer Layer for disfluency annotations
--nonWordLayer=Layer Layer for non-word noises
--expansionLayer=Layer Layer for expansion annotations
--errorsLayer=Layer Layer for error annotations
--linkageLayer=Layer Layer for linkage annotations
--repetitionsLayer=Layer Layer for repetition annotations
--retracingLayer=Layer Layer for retracing annotations
--pauseLayer=Layer Layer for marking unfilled pauses
--completionLayer=Layer Layer for completion annotations
--morLayer=Layer Layer for morphosyntactic tags
--morPrefixLayer=Layer Layer for prefixes in MOR tags
--morPartOfSpeechLayer=Layer Layer for parts of speech in MOR tags
--morPartOfSpeechSubcategoryLayer=Layer Layer for subcategories of parts of speech in MOR tags
--morStemLayer=Layer Layer for stems in MOR tags
--morFusionalSuffixLayer=Layer Layer for fusional suffixes in MOR tags
--morSuffixLayer=Layer Layer for (non-fusional) suffixes in MOR tags
--morGlossLayer=Layer Layer for English glosses in MOR tags
--gemLayer=Layer Layer for gems
--transcriberLayer=Layer Layer for transcriber name
--languagesLayer=Layer Layer for transcriber language
--dateLayer=Layer Layer for date of the interaction
--locationLayer=Layer Layer for location of the interaction
--recordingQualityLayer=Layer Layer for recording quality
--roomLayoutLayer=Layer Layer for room layout
--tapeLocationLayer=Layer Layer for tape and location on the tape covered by the transcription
--targetParticipantLayer=Layer Layer for identifying target participants
--SESLayer=Layer Layer for SES
--roleLayer=Layer Layer for role
--educationLayer=Layer Layer for education
--sexLayer=Layer Layer for sex
--customLayer=Layer Layer for custom
--corpusLayer=Layer Layer for corpus
--languageLayer=Layer Layer for language
--ageLayer=Layer Layer for age
--groupLayer=Layer Layer for group
--includeTimeCodes=Boolean Include utterance sychronization information when exporting transcripts
--splitMorTagGroups=Boolean Split alternative MOR taggings into separate annotations
--splitMorWordGroups=Boolean Split MOR word morphemes (clitics, components of compounds ) into separate annotations. This is only supported when Split MOR Tag Groups is also enabled.

Serializing to “Transcriber transcript” text/xml-transcriber

Command-line configuration parameters for serialization:

--topicLayer=Layer Topic tags
--commentLayer=Layer Commentary
--noiseLayer=Layer Noise annotations
--languageLayer=Layer Inline language tags
--lexicalLayer=Layer Lexical tags
--pronounceLayer=Layer Manual pronunciation tags
--entityLayer=Layer Named entities
--scribeLayer=Layer Name of transcriber
--versionLayer=Layer Version of transcriber
--versionDateLayer=Layer Version date of transcriber
--programLayer=Layer Name of the program recorded
--airDateLayer=Layer Date the program aired
--transcriptLanguageLayer=Layer The language of the whole transcript
--participantCheckLayer=Layer Participant checked
--genderLayer=Layer Gender - participant ‘type’
--dialectLayer=Layer Participant's dialect
--accentLayer=Layer Participant's accent
--scopeLayer=Layer Participant's ‘scope’