Lemmatisation
Lemmatisation is closely allied to the identification of parts of speech and involves reducing the words in a corpus to their respective lexemes. It allows the researcher to extract and examine all the variants of a particular lexeme without having to input every possible variant, and to produce frequency and distribution information for the lexeme. Although accurate software has been developed for this purpose (Beale 1987), lemmatisation has not been applied to many of the more widely available corpora. The SUSANNE corpus, however, does contain lemmatised forms of the corpus words, along with other information. In the example below, the last column of each line contains the lemmatised word:
N12:0510g - PPHS1m He he
N12:0510h - VVDv studied study
N12:0510i - AT the the
N12:0510j - NN1c problem problem
N12:0510k - IF for for
N12:0510m - DD221 a a
N12:0510n - DD222 few few
N12:0510p - NNT2 seconds second
N12:0520a - CC and and
N12:0520b - VVDv thought think
N12:0520c - IO of of
N12:0520d - AT1 a a
N12:0520e - NNc means means
N12:0520f - IIb by by
N12:0520g - DDQr which which
N12:0520h - PPH1 it it
N12:0520i - VMd might may
N12:0520j - VB0 be be
N12:0520k - VVNt solved solve
N12:0520m - YF +. -
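
The following is a minimal Python sketch of how the lemma column could be used to group word forms under their lexeme and to count lexeme frequencies. It assumes each line has exactly the five whitespace-separated fields shown in the excerpt above (reference, status, tag, word, lemma); the full corpus files may carry further fields. The names `parse_line` and `variants_by_lemma` are hypothetical and not part of any SUSANNE tool.

```python
from collections import Counter, defaultdict

# The excerpt from above, one token per line.
SAMPLE = """\
N12:0510g - PPHS1m He he
N12:0510h - VVDv studied study
N12:0510i - AT the the
N12:0510j - NN1c problem problem
N12:0510k - IF for for
N12:0510m - DD221 a a
N12:0510n - DD222 few few
N12:0510p - NNT2 seconds second
N12:0520a - CC and and
N12:0520b - VVDv thought think
N12:0520c - IO of of
N12:0520d - AT1 a a
N12:0520e - NNc means means
N12:0520f - IIb by by
N12:0520g - DDQr which which
N12:0520h - PPH1 it it
N12:0520i - VMd might may
N12:0520j - VB0 be be
N12:0520k - VVNt solved solve
N12:0520m - YF +. -
"""


def parse_line(line):
    """Split one line into (reference, tag, word, lemma).

    Assumes the five-field layout of the excerpt; the status field is dropped.
    """
    ref, _status, tag, word, lemma = line.split()
    return ref, tag, word, lemma


def variants_by_lemma(text):
    """Map each lemma to a Counter of the word forms attested for it."""
    variants = defaultdict(Counter)
    for line in text.splitlines():
        if not line.strip():
            continue
        _ref, _tag, word, lemma = parse_line(line)
        if lemma == "-":
            # Punctuation has no lemma in the excerpt; skip it.
            continue
        variants[lemma][word.lower()] += 1
    return variants


variants = variants_by_lemma(SAMPLE)

# Frequency of each lexeme, summed over all of its variant forms.
lemma_freq = {lemma: sum(forms.values()) for lemma, forms in variants.items()}

print(lemma_freq["a"])            # 2
print(sorted(variants["think"]))  # ['thought']
```

On a larger sample, a lookup such as `variants["think"]` would return every attested form of the lexeme (for instance "think", "thinks", "thought") without the researcher having to list those forms in advance.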