nzilbb.annotator.orthography

Orthography Standardizer

The Orthography Standardizer annotator takes tokens from the word layer and 'cleans up' their labels, creating tags on a new layer with standardized labels that are better suited to lexicon lookup, frequency computation, etc. Specifically:

the word label is converted to all lowercase,
characters sometimes produced by word-processing software are standardized - e.g. em-dashes are converted to hyphens, 'smart-quote' apostrophes ’ are converted to plain apostrophes ' , etc.,
all punctuation characters are removed except: ~ - ' :
trailing apostrophes and hyphens are removed (leaving word-internal ones intact), and
leading/trailing whitespace is trimmed off.

For example:

word		orthography
“Why	→	why
hasn’t	→	hasn't
Inés	→	inés
d~	→	d~
got —	→	got
her	→	her
X—ray	→	x-ray
yet?”	→	yet
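
The rules listed above can be sketched in a few lines of Python. This is only an illustrative approximation, not the annotator's actual implementation: the function name standardize_orthography is hypothetical, only the two character substitutions named above are included (the annotator may standardize others), and treating "punctuation" as any character whose Unicode category starts with P is an assumption.

```python
import unicodedata

# Characters the description says survive punctuation removal.
KEPT_PUNCTUATION = {"~", "-", "'", ":"}

# Only the two substitutions named in the description; the real annotator
# may standardize more characters (the description ends with "etc.").
SUBSTITUTIONS = {
    "\u2014": "-",  # em-dash -> hyphen
    "\u2019": "'",  # 'smart-quote' apostrophe -> plain apostrophe
}


def standardize_orthography(label: str) -> str:
    """Hypothetical helper applying the listed rules in the listed order."""
    # 1. Convert to lowercase.
    text = label.lower()
    # 2. Standardize word-processor characters.
    for fancy, plain in SUBSTITUTIONS.items():
        text = text.replace(fancy, plain)
    # 3. Drop punctuation except the kept set.
    #    Assumption: "punctuation" means Unicode categories beginning with "P".
    text = "".join(
        ch
        for ch in text
        if ch in KEPT_PUNCTUATION
        or not unicodedata.category(ch).startswith("P")
    )
    # 4. Remove trailing apostrophes and hyphens; word-internal ones remain.
    text = text.rstrip("'-")
    # 5. Trim leading/trailing whitespace.
    return text.strip()


if __name__ == "__main__":
    for word in ["\u201cWhy", "hasn\u2019t", "got \u2014", "X\u2014ray", "yet?\u201d"]:
        print(repr(word), "->", repr(standardize_orthography(word)))
```

Under these assumptions the sketch gives the same results as the rows in the table above, but it may differ from the annotator on characters the description does not mention.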