TEI Texts
The Text Encoding Initiative (TEI - http://www.tei-c.org) is a consortium that develops guidelines for representation of texts in digital form, mainly for libraries, humanities research, social sciences, and linguistics. They have developed an XML format, which you can use to produce texts that can be uploaded into LaBB-CAT. At present, only texts that are not synchronised with audio or video are supported. However, the TEI P5 guidelines do specify methods for linking text to media, and support for this will be added to LaBB-CAT in the future.
The TEI header and Meta-data
LaBB-CAT recognises and imports certain constructions in the <teiHeader>
section of a TEI file:
- in the
<fileDesc>
subsection:- the text in
<titleStmt><title>…
is taken to be the value of the title transcript attribute - the text in
<titleStmt><respStmt><name>…
is taken to be the value of the scribe transcript attribute (i.e. the name of the transcriber) - the text in
<publicationStmt><distributor>…
is taken to be the value of the distributor transcript attribute - the text in
<publicationStmt><publisher>…
is taken to be the value of the publisher transcript attribute - the text in
<publicationStmt><availability><p>…
is taken to be the value of the availability transcript attribute - the text in
<publicationStmt><date>…
is taken to be the value of the airDate transcript attribute - the text in
<publicationStmt><distributor>…
is taken to be the value of the distributor transcript attribute - the text in
<sourceDesc><bibleStruct><monogr><author>…
is taken to be the name of the author of the text (who is created as the sole ‘participant’ of the transcript)
- the text in
- in the
<profileDesc>
subsection:- the text in
<creation><date>…
is taken to be the value of the creationDate transcript attribute - the value of the
<langUsage><language ident="…">
attribute is taken to be the value of the creationDate transcript attribute - the
<particDesc><person>
tags are taken to be participants, whose<idno>
tag specifies the participant’s identifier, and whose other tags specify the participant’s attributes named after the tag name (or optionally are added as transcript attributes). The content of the<person>
tag’s<age>
tag is converted to a single number (in years) if the text is formatted asy;m.d
or asy years m months d days
- the text in
- in the
<revisionDesc>
subsection:- the text in
<change><date>…
is taken to be the value of the versionDate transcript attribute - the text in
<change><respStmt><name>…
is taken to be the value of the scribe transcript attribute
- the text in
Arbitrary transcript/document attributes are implemented by including a <notesStmt>
within the <fileDesc>
header, containing one <note>
tag per attribute, and using the type
attribute as the attribute key and the tag content as the value - e.g.
<notesStmt>
<note type="subreddit">StrangerThings</note>
<note type="parent_id">t1_dd5f8en</note>
</notesStmt>
Participant/speaker attributes can be included in the <person>
tag, as per the TEI specification.
Arbitrary participant/speaker attributes (i.e. custom attributes or others not foreseen by the TEI specification) can be processed by including one <note>
tag per attribute within each participant’s <person>
tag, and using the type
attribute as the attribute key and the tag content as the value - e.g.
<person>
<idno>ABCD</idno>
<age>46</age>
<education>Secondary</education>
<note type="first language">English</note>
<note type="origin">Liverpool</note>
</person>
Special support for regularization is used; for a construction like this:
<choice><orig>color</orig><reg>colour</reg></choice>
This deserializer supports part of the schema for Representation of Computer-mediated Communication proposed by Michael Beißwenger, Maria Ermakova, Alexander Geyken, Lothar Lemnitzer, and Angelika Storrer (2012), with the exception of the following:
- When
<posting>
tags are sychronised to a<when>
tag inside a<timeline>
, the time synchronisation is ignored. - The
<addressingTerm>
,<addressMarker>
and<addressee>
tags supported, and mapped to “entities” layer by default, but thewho
attribute of<addressee>
is ignored. - The
type
attribute of the<div>
tag is ignored. - The
revisedWhen
,revisedBy
, andindentLevel
attributes of the<posting>
tag are ignored - The
<interactionTerm>
tag is ignored. - The
<interactionTemplate>
,<interactionWord>
, and<emoticon>
tags are not explicitly supported. - The
<autoSignature>
and<signatureContent>
tags are not explicitly supported.