nzilbb.formatter.clan (1.4.2)

Serialization for CHAT files produced by CLAN.

NB the current implementation is not exhaustive; it only covers:

  • Time synchronization codes, including mid-line synchronization.
    Overlapping utterances in the same speaker turn are handled as follows:

    • If overlap is partial, the start of the second utterance is set to the end of the first.
    • If overlap is total, the two utterances are chained together with a non-aligned anchor between them.
  • Disfluency marking with &+ - e.g. so &+sund Sunday

  • Non-standard form expansion - e.g. gonna [: going to]

  • Incomplete word completion - e.g. dinner doin(g) all

  • Acronym/proper name joining with _ - e.g. no T_V in my room

  • Retracing - e.g. <some friends and I> [//] uh or and sit [//] sets him

  • Repetition/stuttered false starts - e.g. the <picnic> [/] picnic or the Saturday [/] in the morning

  • Errors - e.g. they've <work up a hunger> [* s:r] or they got [* m] to

  • Pauses - untimed (e.g. (.), (...)) or timed (e.g. (0.15), (2.), (1:05.15))

  • %mor line annotations (or %pos line annotations, if present)
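
The overlap rules listed under time synchronization above can be sketched as follows. This is only an illustration in Python, not the formatter's own code: the Utterance class and resolve_overlap function are hypothetical names, "total overlap" is assumed to mean that the second utterance lies entirely within the first, and the choice to end the chained pair at the first utterance's original end time is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    text: str
    start: Optional[float]  # seconds; None stands for a non-aligned anchor
    end: Optional[float]


def resolve_overlap(first: Utterance, second: Utterance) -> None:
    """Adjust two consecutive, aligned utterances of one speaker turn in place."""
    if second.start >= first.end:
        return  # no overlap, nothing to do
    if second.end > first.end:
        # Partial overlap: the second utterance starts where the first ends.
        second.start = first.end
    else:
        # Total overlap (assumed: second lies within first): chain the two,
        # sharing a non-aligned anchor, and keep the outer times.
        second.end = first.end  # assumption: the chain keeps the first's end
        first.end = None
        second.start = None
```

For example, utterances at 0.0-2.0 and 1.5-3.0 become 0.0-2.0 and 2.0-3.0, while 0.0-5.0 and 1.0-3.0 become a chain from 0.0 to 5.0 with an unaligned boundary in the middle.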

Layers to add to fully capture supported tags

Word layers

(All Type=Text, Alignment=None)

  • completion: Incomplete word completion - e.g. dinner doin(g) all
  • disfluency: Disfluency marking with &+ - e.g. so &+sund Sunday
  • expansion: Non-standard form expansion - e.g. gonna [: going to]
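
To make the three word layers above concrete, here is a rough Python sketch of how the corresponding codes might be pulled out of an utterance. It is not the formatter's implementation: word_annotations and the two regular expressions are hypothetical, the sketch handles only these three codes, and what the formatter actually stores as each label is not stated here, so the labels chosen below are assumptions.

```python
import re

# A word, optionally followed by an expansion such as "[: going to]".
TOKEN = re.compile(r"(?P<word>[^\s\[\]]+)(?:\s+\[:\s+(?P<expansion>[^\]]+)\])?")
# Letters in parentheses inside a word, e.g. the "(g)" of "doin(g)".
COMPLETED = re.compile(r"\(([A-Za-z']+)\)")


def word_annotations(utterance):
    """Return (word, {layer: label}) pairs for one utterance."""
    results = []
    for match in TOKEN.finditer(utterance):
        word = match.group("word")
        labels = {}
        if match.group("expansion"):
            labels["expansion"] = match.group("expansion").strip()
        if word.startswith("&+"):
            labels["disfluency"] = word  # assumption: label keeps the marker
            word = word[2:]
        elif COMPLETED.search(word):
            # assumption: the label is the completed form, e.g. "doing"
            labels["completion"] = COMPLETED.sub(r"\1", word)
            word = COMPLETED.sub("", word)
        results.append((word, labels))
    return results
```

With the examples from the list, "gonna [: going to]" yields the word gonna with an expansion label "going to", and "doin(g)" yields the word doin with a completion label "doing".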

If there's a %mor layer in the transcripts:

(All Type=Text, Alignment=Intervals)

  • mor: Complete %mor tag(s) for the word token - there can be multiple tags, e.g. for contractions and cliticizations.
    These tags are then split into the following parts:
  • morFusionalSuffix
  • morGloss
  • morPOS
  • morPOSSubcategory
  • morPrefix
  • morStem
  • morSuffix
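
As an illustration of how a %mor tag breaks down into the parts above, the following Python sketch splits a tag on the usual CHAT delimiters: # after a prefix, | between part of speech and stem, : before a subcategory, & before a fusional suffix, - before a suffix, = before a gloss, and ~ between cliticized tags. The function names are hypothetical and the mapping of pieces to layer names is inferred from the layer names alone, so treat it as a sketch rather than a description of the formatter's behaviour. Compounds and other special cases are not handled.

```python
import re


def split_mor_tag(tag: str) -> dict:
    """Split one %mor tag, e.g. 'v|be&3S', into its parts."""
    *prefixes, rest = tag.split("#")  # 'un#v|tie' -> ['un'] and 'v|tie'
    pos_part, _, word_part = rest.partition("|")
    pos, _, subcategory = pos_part.partition(":")
    word_part, _, gloss = word_part.partition("=")
    stem = re.match(r"[^&-]*", word_part).group()  # up to the first affix
    affixes = word_part[len(stem):]
    return {
        "morPrefix": prefixes,
        "morPOS": pos,
        "morPOSSubcategory": subcategory,
        "morStem": stem,
        "morFusionalSuffix": re.findall(r"&([^&-]+)", affixes),
        "morSuffix": re.findall(r"-([^&-]+)", affixes),
        "morGloss": gloss,
    }


def split_mor_word(annotation: str) -> list:
    """Split a word's whole %mor annotation on the clitic marker '~'."""
    return [split_mor_tag(tag) for tag in annotation.split("~")]
```

For a contraction annotated as pro:sub|it~cop|be&3S, split_mor_word returns two entries, one with morPOS "pro" and morStem "it", the other with morPOS "cop", morStem "be" and a fusional suffix "3S".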

If there's also a %gra layer in the transcripts:

  • gra (Type=Text, Alignment=Intervals): Complete %gra tag, one for each %mor annotation. (These tags mark grammatical relations between a dependent and a head; the dependent is tagged, but it is currently not formally linked to its head.)

Phrase layers

(All Type=Text, Alignment=Intervals)

  • cunit: The grammatical unit for each utterance, labelled with the utterance terminator
  • error: Errors - e.g. they've <work up a hunger> [* s:r] or they got [* m] to
  • linkage: Multiple words in a name joined by _ - e.g. Winnie_ther_Pooh
  • pause: Pauses - untimed (e.g. (.), (...)) or timed (e.g. (0.15), (2.), (1:05.15))
  • repetition: Repetition/stuttered false starts - e.g. the <picnic> [/] picnic or the Saturday [/] in the morning
  • retrace: Retracing - e.g. <some friends and I> [//] uh or and sit [//] sets him
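
The pause codes listed above can be read mechanically. The Python sketch below turns a single pause code into a duration in seconds, returning None for untimed pauses; the function name pause_seconds and the regular expression are hypothetical and cover only the forms shown in the examples, reading (1:05.15) as minutes:seconds.

```python
import re
from typing import Optional

# Either one to three dots (untimed), or [minutes:]seconds with a decimal point.
PAUSE = re.compile(
    r"\((?:(?P<dots>\.{1,3})|(?:(?P<min>\d+):)?(?P<sec>\d+\.\d*))\)"
)


def pause_seconds(code: str) -> Optional[float]:
    """Return the pause length in seconds, or None for an untimed pause."""
    match = PAUSE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a pause code: {code!r}")
    if match.group("dots"):
        return None  # untimed: (.), (..) or (...)
    minutes = int(match.group("min") or 0)
    return minutes * 60 + float(match.group("sec"))
```

Under these assumptions (2.) is a two-second pause and (1:05.15) is a pause of one minute and 5.15 seconds.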

Span layers

  • gem (Type=Text, Alignment=Intervals): Parts of the transcript marked for separate analysis.

Participant Attributes

  • language: The language the participant speaks.
  • corpus: A one-word label for the corpus in lowercase.
  • age: The age of the speaker, using the form years;months.days as in 2;11.17 for 2 years, 11 months, and 17 days.
  • sex: The sex of the speaker (gender can be used as the attribute name).
  • group: Any single word label.
  • SES: Socio-economic status
    (e.g. WC for working class, UC for upper class, MC for middle class, LI for limited income)
  • role: The speaker's (standardized) role e.g. Target_Child, Target_Adult, Child, Mother, Father, Participant, Investigator, Adult, Friend, Unidentified, etc.
  • education: Educational level of the speaker e.g. Elem, HS, UG, Grad, Doc
  • custom: Any additional information needed for a given project.
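
The age format described above (years;months.days) is easy to misread, so here is a small Python sketch that parses it. The Age tuple and parse_age function are hypothetical helpers, and only the complete form shown in the example is accepted.

```python
import re
from typing import NamedTuple


class Age(NamedTuple):
    years: int
    months: int
    days: int


# years;months.days, e.g. 2;11.17
AGE = re.compile(r"(\d+);(\d{1,2})\.(\d{1,2})")


def parse_age(value: str) -> Age:
    """Parse an age such as '2;11.17' into years, months and days."""
    match = AGE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a years;months.days age: {value!r}")
    years, months, days = (int(part) for part in match.groups())
    return Age(years, months, days)
```

Calling parse_age("2;11.17") gives an Age of 2 years, 11 months and 17 days.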

Transcript Attributes

  • scribe: The person who transcribed the transcript.
  • language: The language(s) of the speech in the recording.
  • recordingdate: The date of the recording.
  • location: Location of the recording.
  • recordingquality: Recording quality.
  • roomlayout: Room layout.