Porter Stemmer
The Porter Stemmer annotator uses the Porter algorithm to compute the stems of English words from their orthography. It does so by systematically stripping or rewriting suffixes, in several passes, until a 'stem' remains. Ideally, the different inflected forms of a word all reduce to the same stem.
It's important to realise that, in the words of M. F. Porter himself, "the suffixes are being removed simply to improve IR performance, and not as a linguistic exercise".
The stemmer does not always produce the linguistically correct stem. It works well for regular words, e.g.
walk | → | walk |
walks | → | walk |
walked | → | walk |
walking | → | walk |
It behaves less well for irregular cases, where forms such as 'sang' and 'sung' are not reduced to 'sing', e.g.
sing | → | sing |
sings | → | sing |
sang | → | sang |
sung | → | sung |
singing | → | sing |
On the other hand, because it relies on suffix rules rather than a word list, it will do better than lexicon-based methods when it comes to new words, e.g.
blog | → | blog |
blogs | → | blog |
blogging | → | blog |
blogged | → | blog |
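The examples above can be reproduced with a partial sketch of the algorithm's first pass. The Python below implements only steps 1a and 1b as described in Porter's 1980 paper (plural removal and -ed/-ing removal). It is not this annotator's actual implementation, and the function names (partial_stem, measure and so on) are invented for illustration. The full algorithm applies several further steps, which should leave these particular example words unchanged.

```python
# Illustrative sketch of steps 1a and 1b of the Porter algorithm only.
# All function names here are hypothetical; the real annotator may differ.

def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Porter's m: the number of vowel-to-consonant transitions."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):
    """Ends consonant-vowel-consonant, last letter not w, x or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

def step_1a(word):
    """Plural removal: sses -> ss, ies -> i, ss -> ss, s -> (nothing)."""
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def tidy(stem):
    """Clean-up applied after -ed or -ing has been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]          # e.g. blogg -> blog
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"         # e.g. fil -> file
    return stem

def step_1b(word):
    """Removal of -eed, -ed and -ing, subject to Porter's conditions."""
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and contains_vowel(stem):
            return tidy(stem)
    return word

def partial_stem(word):
    """Apply steps 1a and 1b only; the later steps are omitted."""
    word = word.lower()
    if len(word) <= 2:            # assumption of this sketch: skip very short words
        return word
    return step_1b(step_1a(word))

if __name__ == "__main__":
    for w in ["walks", "walked", "walking", "sang", "singing", "blogging", "blogged"]:
        print(w, "->", partial_stem(w))
```

Note that 'sing' is left alone because removing 'ing' would leave a stem with no vowel, whereas 'sang' and 'sung' simply carry no suffix that the rules recognise.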
For more information about the algorithm, see Porter, M. F. (1980), 'An algorithm for suffix stripping', Program, Vol. 14, No. 3, pp. 130-137, or http://www.tartarus.org/~martin/PorterStemmer