Lemmatisation

Lemmatisation is closely allied to the identification of parts of speech. It involves reducing the words in a corpus to their respective lexemes, so that inflected forms such as "studied" and "thought" are grouped under "study" and "think". This allows the researcher to extract and examine all the variants of a particular lexeme without having to input every possible form, and to produce frequency and distribution information for the lexeme as a whole. Although accurate software has been developed for this purpose (Beale 1987), lemmatisation has not been applied to many of the more widely available corpora. The SUSANNE corpus, however, does contain lemmatised forms of the corpus words, along with other information. In the example below, the last column of each line contains the lemmatised word:

N12:0510g - PPHS1m	He	he
N12:0510h - VVDv	studied	study
N12:0510i - AT		the	the
N12:0510j - NN1c	problem	problem
N12:0510k - IF		for	for
N12:0510m - DD221	a	a
N12:0510n - DD222	few	few
N12:0510p - NNT2	seconds	second
N12:0520a - CC		and	and
N12:0520b - VVDv	thought	think
N12:0520c - IO		of	of
N12:0520d - AT1		a	a
N12:0520e - NNc		means	means
N12:0520f - IIb		by	by
N12:0520g - DDQr	which	which
N12:0520h - PPH1	it	it
N12:0520i - VMd		might	may
N12:0520j - VB0		be	be
N12:0520k - VVNt	solved	solve
N12:0520m - YF		+.	-
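
As a rough illustration of how the lemma column supports the kind of retrieval described above, the following Python sketch groups the word forms in the excerpt under their lemmas and counts them. It assumes five whitespace-separated fields per line, exactly as displayed here (reference, status, word tag, word, lemma); the distributed corpus files may carry further fields. It also assumes that a lemma of "-" marks an item with no lemma, such as punctuation. The function names parse_line and lemma_index are hypothetical and are not part of any SUSANNE tooling.

```python
from collections import Counter, defaultdict

# The excerpt shown above, with whitespace normalised.
# Assumed field order: reference, status, word tag, word, lemma.
SAMPLE = """\
N12:0510g - PPHS1m He he
N12:0510h - VVDv studied study
N12:0510i - AT the the
N12:0510j - NN1c problem problem
N12:0510k - IF for for
N12:0510m - DD221 a a
N12:0510n - DD222 few few
N12:0510p - NNT2 seconds second
N12:0520a - CC and and
N12:0520b - VVDv thought think
N12:0520c - IO of of
N12:0520d - AT1 a a
N12:0520e - NNc means means
N12:0520f - IIb by by
N12:0520g - DDQr which which
N12:0520h - PPH1 it it
N12:0520i - VMd might may
N12:0520j - VB0 be be
N12:0520k - VVNt solved solve
N12:0520m - YF +. -
"""


def parse_line(line):
    """Split one excerpt line into its five fields (hypothetical helper)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return tuple(fields)


def lemma_index(text):
    """Map each lemma to a Counter of the lower-cased word forms found for it."""
    forms = defaultdict(Counter)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        _ref, _status, _tag, word, lemma = parse_line(line)
        if lemma == "-":
            continue  # assumption: "-" means the item has no lemma
        forms[lemma][word.lower()] += 1
    return forms


index = lemma_index(SAMPLE)

# Frequency of each lexeme, summed over all of its variant forms.
lemma_freq = {lemma: sum(counts.values()) for lemma, counts in index.items()}

# All variant forms recorded for one lexeme, without listing them by hand.
print(sorted(index["think"]))
print(lemma_freq["a"])
```

On a larger lemmatised sample the same index would gather forms such as "think", "thinks" and "thought" under the single lemma "think", which is what makes per-lexeme frequency counts possible.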
