Deserializer for TEI P5 XML files.
TEI tag support is basic at this stage, and many tags are not explicitly interpreted, but most can be mapped to arbitrary annotation layers.
The TEI header and Meta-data
Certain constructions in the <teiHeader> section of a TEI file are recognised:
- in the <fileDesc> subsection:
  - the text in <titleStmt><title>… is taken to be the value of the “title” transcript attribute
  - the text in <titleStmt><respStmt><name>… is taken to be the value of the “scribe” transcript attribute (i.e. the name of the transcriber)
  - the text in <publicationStmt><distributor>… is taken to be the value of the “distributor” transcript attribute
  - the text in <publicationStmt><publisher>… is taken to be the value of the “publisher” transcript attribute
  - the text in <publicationStmt><availability><p>… is taken to be the value of the “availability” transcript attribute
  - the text in <publicationStmt><date>… is taken to be the value of the “date” or “air_date” transcript attribute
  - the text in <sourceDesc><bibleStruct><monogr><author>… is taken to be the value of the “author” transcript attribute (who is created as the sole ‘participant’ of the transcript)
  - alternatively the text in <sourceDesc><titleStmt><author>… is taken to be the value of the “author” transcript attribute (who is created as the sole ‘participant’ of the transcript)
  - the text in <sourceDesc><bibl><publisher>… is taken to be the name of the author of the text (who is created as the sole ‘participant’ of the transcript)
- in the <profileDesc> subsection:
  - the text in <creation><date>… is taken to be the value of the “creation_date” transcript attribute
  - the value of the <langUsage><language ident="…"> attribute is taken to be the value of the “language” transcript attribute
  - when <creation><place><placeName placeName='nnn'><country>ccc</country>ppp</placeName> exists, nnn can be mapped to a transcript attribute called something like “place”. If the placeName attribute is absent, then ppp is used.
  - the <particDesc><person> tags are taken to be participants, whose <idno> tag specifies the participant's identifier, and whose other tags specify the participant's attributes named after the tag name (or optionally are added as transcript attributes). The content of the <person> tag's <age> tag is converted to a single number (in years) if the text is formatted as y;m.d or as y years m months d days
  - the <particDesc><person><persName> tags with a “role” attribute can be mapped to a transcript attribute, by the value of the attribute. For example, if role="Recipient" then a “recipient” transcript attribute can receive the tag content. Additionally, if the <persName> has a gender attribute, the value can be mapped to a transcript attribute for gender.
- in the <revisionDesc> subsection:
  - the text in <change><date>… is taken to be the value of the “version_date” transcript attribute
  - the text in <change><respStmt><name>… is taken to be the value of the “scribe” transcript attribute
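The <age> conversion mentioned in the <profileDesc> rules can be sketched in Python as follows. This is only an illustration, not the deserializer's own code: the function name age_in_years is hypothetical, months and days are assumed to be optional, and the weighting (12 months and 365.25 days per year) is an assumption, since the text above does not say how months and days contribute to the final number.

```python
import re

def age_in_years(text):
    """Convert <age> text to a number of years, or None if unrecognised.

    Hypothetical helper: the accepted formats come from the description
    above, but the arithmetic is an assumed approximation.
    """
    text = text.strip()
    # Compact form: y;m.d (the .d part is assumed to be optional)
    match = re.fullmatch(r"(\d+);(\d+)(?:\.(\d+))?", text)
    if match is None:
        # Verbose form: "y years m months d days" (months/days assumed optional)
        match = re.fullmatch(
            r"(\d+)\s*years?(?:\s+(\d+)\s*months?)?(?:\s+(\d+)\s*days?)?",
            text,
        )
    if match is None:
        return None  # e.g. a plain number such as "46" is left for the caller
    years, months, days = (int(g) if g else 0 for g in match.groups())
    # Assumption: 12 months and 365.25 days per year
    return years + months / 12 + days / 365.25
```

For instance, under these assumptions both "2;6.0" and "2 years 6 months 0 days" would yield 2.5.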
Arbitrary transcript/document attributes are implemented by including a <notesStmt> within the <fileDesc> header, containing one <note> tag per attribute, and using the type attribute as the attribute key and the tag content as the value, e.g.
<notesStmt>
  <note type="subreddit">StrangerThings</note>
  <note type="parent_id">t1_dd5f8en</note>
</notesStmt>
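To make the key/value convention concrete, the following Python sketch reads such <note> tags into a dictionary using the standard library's ElementTree. It illustrates the mapping only and is not the deserializer's implementation: the function name note_attributes is hypothetical, and the sample header omits the TEI namespace for brevity (a real TEI P5 file would normally declare one).

```python
import xml.etree.ElementTree as ET

# Sample header, without the TEI namespace, for illustration only
HEADER = """
<teiHeader>
  <fileDesc>
    <notesStmt>
      <note type="subreddit">StrangerThings</note>
      <note type="parent_id">t1_dd5f8en</note>
    </notesStmt>
  </fileDesc>
</teiHeader>
"""

def note_attributes(header_xml):
    """Map each <note type="key">value</note> to an attribute entry."""
    root = ET.fromstring(header_xml)
    attributes = {}
    for note in root.findall("./fileDesc/notesStmt/note"):
        key = note.get("type")
        if key:  # notes without a type attribute have no key, so skip them
            attributes[key] = (note.text or "").strip()
    return attributes

attributes = note_attributes(HEADER)
```

Here attributes would contain one entry per <note>, keyed by its type attribute.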
Participant/speaker attributes can be included in the <person> tag, as per the TEI specification.
Arbitrary participant/speaker attributes (i.e. custom attributes or others not foreseen by the TEI specification) can be processed by including one <note> tag per attribute within each participant's <person> tag, and using the type attribute as the attribute key and the tag content as the value, e.g.
<person>
  <idno>ABCD</idno>
  <age>46</age>
  <education>Secondary</education>
  <note type="first language">English</note>
  <note type="origin">Liverpool</note>
</person>
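The way a <person> tag breaks down into an identifier plus attributes can be sketched as follows. This is a rough Python illustration of the rules described above, not the deserializer's code: the function name participant_from_person is hypothetical, values are kept as raw strings, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

PERSON = """
<person>
  <idno>ABCD</idno>
  <age>46</age>
  <education>Secondary</education>
  <note type="first language">English</note>
  <note type="origin">Liverpool</note>
</person>
"""

def participant_from_person(person_xml):
    """Return (identifier, attributes) for one <person> element."""
    person = ET.fromstring(person_xml)
    participant_id = None
    attributes = {}
    for child in person:
        value = (child.text or "").strip()
        if child.tag == "idno":
            # <idno> gives the participant's identifier
            participant_id = value
        elif child.tag == "note":
            # <note type="key"> carries an arbitrary attribute
            key = child.get("type")
            if key:
                attributes[key] = value
        else:
            # any other tag is an attribute named after the tag
            attributes[child.tag] = value
    return participant_id, attributes

participant_id, attributes = participant_from_person(PERSON)
```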
Tags in the text
The P5 guidelines for TEI specify a dazzling array of tags for capturing all kinds of information about texts, only a subset of which will work well with this formatter. There is explicit support for the following TEI tags:
- <p>, <div>, and <ab> are interpreted as starting a new line
- <w> tags (for marking up words) are used for word tokenization if they are present (if they are absent, standard whitespace-based tokenization is used). Attributes of the <w> tag like lemma or type can be mapped to word layers for capturing such tagging in the text.
- the <choice><orig>…</orig><reg>…</reg></choice> construction for marking regularization of text is recognized, and the contents of the <reg>…</reg> tag can be extracted to the “lexical” layer (for single-word regularization) or a selected ‘meta’ layer (if multi-word regularization is used).
- <foreign> is recognised as marking sections of the transcript as being in another language, and so its contents are annotated on the “language” layer, using the value of the xml:lang attribute as the annotation label.
- <note> is recognised as a commentary marker, and so its contents are put on to the “comment” layer instead of being inserted into the transcript text.
- <unclear> tags can create annotations on a selected layer, and the reason and cert attributes are recognised and used in the resulting annotation label if present.
- the <placeName placeName='nnn'><country>ccc</country>ppp</placeName> construction is recognized as tagging the name of a place ppp with annotations for the correct place name nnn and country ccc - i.e. ppp is part of the text, but nnn and ccc aren't.
  - …but the <country>ccc</country> tag alone is assumed to be annotating part of the text - i.e. ccc is left in the text rather than being stripped out as it is when inside <placeName>…</placeName>.
- <hi rend='rrr'>…</hi> results in an annotation with label rrr.
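As an illustration of the <w> rule in the list above, the following Python sketch uses <w> tags for tokenization when they are present and falls back to splitting on whitespace otherwise. It is a simplified sketch rather than the deserializer's actual logic: the function name tokenize is hypothetical, only the lemma attribute is carried along, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

def tokenize(line_xml):
    """Return a list of (word, lemma) pairs for one line element."""
    element = ET.fromstring(line_xml)
    words = element.findall(".//w")
    if words:
        # <w> tags are present: each one is a token, with its lemma if given
        return [("".join(w.itertext()).strip(), w.get("lemma")) for w in words]
    # No <w> tags: fall back to whitespace-based tokenization
    return [(token, None) for token in "".join(element.itertext()).split()]

tagged = tokenize('<p><w lemma="be">was</w> <w lemma="go">going</w></p>')
untagged = tokenize("<p>was going</p>")
```

In the first call the tokens and lemmas come from the markup; in the second the text is simply split on whitespace and no lemma is available.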
Other tags are by default mapped on to the “entities” layer (in which case their tag
name, and its type attribute if present, will be used for the entity label).
Alternatively, if there is a layer that is named after the TEI tag name, then tags will
be mapped to that layer by default during upload. For example, to have all <sic>
tags extracted to their own layer (instead of the “entities” layer) by default, create
a new ‘phrase’ layer called “sic”.
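The default layer choice for otherwise unsupported tags can be summarised with a small Python sketch. Both function names are hypothetical, and the way the tag name and type are combined into a label (joined with a colon here) is purely an assumption for illustration; the text above only says that both are used.

```python
def default_layer(tag_name, layer_ids):
    """Pick the layer named after the tag if it exists, else "entities"."""
    return tag_name if tag_name in layer_ids else "entities"

def entity_label(tag_name, tag_type=None):
    """Build an entity label from the tag name and optional type.

    Assumption: the colon separator is illustrative only; the real
    label format is not specified in this document.
    """
    return f"{tag_name}:{tag_type}" if tag_type else tag_name

# With a "sic" layer defined, <sic> tags go to it; otherwise to "entities"
with_sic_layer = default_layer("sic", {"entities", "sic"})
without_sic_layer = default_layer("sic", {"entities"})
```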
This deserializer supports part of the schema for Representation of Computer-mediated Communication proposed by Michael Beißwenger, Maria Ermakova, Alexander Geyken, Lothar Lemnitzer, and Angelika Storrer (2012), with the exception of the following:
- When <posting> tags are synchronised to a <when> tag inside a <timeline>, the time synchronisation is ignored.
- The <addressingTerm>, <addressMarker> and <addressee> tags are supported, and mapped to the “entities” layer by default, but the who attribute of <addressee> is ignored.
- The type attribute of the <div> tag is ignored.
- The revisedWhen, revisedBy, and indentLevel attributes of the <posting> tag are ignored.
- The <interactionTerm> tag is ignored.
- The <interactionTemplate>, <interactionWord>, and <emoticon> tags are not explicitly supported.
- The <autoSignature> and <signatureContent> tags are not explicitly supported.