Well-known and influential corpora: A survey
[Note: This survey is based on my (forthcoming) chapter written for A. Lüdeling,
M. Kytö & A. McEnery (eds) Handbooks of Linguistics and Communication Science Volume Corpus
Linguistics.]
2.1. The British National Corpus
2.2. The American National Corpus
2.3. The Polish National Corpus
2.4. The Czech National Corpus
2.5. The Hungarian National Corpus
2.6. The Russian Reference Corpus
2.8. The Hellenic National Corpus
2.9. The German National Corpus
2.10. The Slovak National Corpus
2.11. The Modern Chinese Language Corpus
2.12. The Sejong Balanced Corpus
3.2. The Global English Monitor Corpus
4. Corpora of the Brown family
5.1. The International Corpus of English
5.2. The Longman/Lancaster Corpus
5.3. The Longman Written American Corpus
5.4. The CREA corpus of Spanish
5.5. The LIVAC corpus of Chinese
6.1. The Helsinki Corpus of English Texts
6.3. The Lampeter Corpus of Early Modern English Tracts
6.4 The Dictionary of Old English Corpus in Electronic Form
6.5 Early English Books Online
6.6 The Corpus of Early English Correspondence
6.7. The Zurich English Newspaper Corpus
6.8. The Innsbruck Computer Archive of Machine-Readable English Texts
6.9. The Corpus of English Dialogues
6.10 A Corpus of Late Eighteenth-Century Prose
6.11 A Corpus of Late Modern English Prose
7.2. SEC, MARSEC and Aix-MARSEC
7.3. The Bergen Corpus of London Teenage Language
7.4. The Cambridge and Nottingham Corpus of Discourse in English
7.5. The Spoken Corpus of the Survey of English Dialects
7.6. The Intonational Variation in English Corpus
7.7. The Longman British Spoken Corpus
7.8. The Longman Spoken American Corpus
7.9. The Santa Barbara Corpus of Spoken American English
7.10. The Saarbrücken Corpus of Spoken English
7.12. The Wellington Corpus of Spoken New Zealand English
7.13. The Limerick corpus of Irish English
7.14. The Hong Kong Corpus of Conversational English
8. Academic and professional English corpora
8.1. The Michigan Corpus of Academic Spoken English
8.2. The British Academic Spoken English corpus
8.3. The Reading Academic Text corpus
8.5. The Corpus of Professional Spoken American English
8.6. The Corpus of Professional English
9.1. The Lancaster-Leeds Treebank
9.2. The Lancaster Parsed Corpus
9.8. Parsed historical corpora
10. Developmental and learner corpora
10.1. The Child Language Data Exchange System
10.2. The Louvain Corpus of Native English Essays
10.3. The Polytechnic of Wales corpus
10.4. The International Corpus of Learner English
10.6. The Longman Learners’ Corpus
10.7. The Cambridge Learner Corpus
11.1. The Canadian Hansard Corpus
11.2. The English-Norwegian Parallel Corpus
11.3. The English-Swedish Parallel Corpus
11.4. The Oslo Multilingual Corpus
11.5. The ET10/63 and ITU/CRATER parallel corpora
11.6. The IJS-ELAN Slovene-English Parallel Corpus
11.7. The CLUVI parallel corpus
11.8. European Corpus Initiative Multilingual Corpus I
11.11. Multilingual Corpora for Cooperation
11.13. The BFSU Chinese-English Parallel Corpus
11.14. The Babel Chinese-English Parallel Corpus
11.15. Hong Kong Parallel Text
12. Non-English monolingual corpora
12.5. The Scottish Corpus of Texts and Speech
12.6. The Prague Dependency Treebank
12.7. Academia Sinica Balanced Corpus
12.10. Spoken Chinese Corpus of Situated Discourse
13. Well-known distributors of corpus resources
As corpus building is an activity that takes time and costs money, readers may
wish to use ready-made corpora to carry out their work. However, as a corpus is
always designed for a particular purpose, the usefulness of a ready-made corpus
must be judged with regard to the purpose to which a user intends to put it.
There are thousands of corpora in the world, but most of them are created for
specific research projects and are thus not publicly available. While abundant
corpus resources for languages other than English are also available now, this
survey focuses upon major English corpora, which are grouped in terms of their
primary uses so that readers will find it easier to choose corpus resources
suitable for their particular research questions. Note, however, that overlaps
are inevitable in our classification. It is used in this survey simply to give
a better account of the primary uses of the relevant corpora.
National
corpora are normally general reference corpora which are supposed to represent
the national language of a country. They are balanced with regard to genres and
domains that typically represent the language under consideration. While an
ideal national corpus should cover proportionally both written and spoken
language, most existing national corpora and those under construction consist
only of written data, as spoken data is much more difficult and expensive to
capture than written data. This section introduces a number of major national
corpora.
2.1. The British National Corpus
The
first and best-known national corpus is perhaps the British National Corpus
(BNC), which is designed to represent as wide a range of modern British English
as possible so as to “make it possible to say something about
language in general” (Burnard 2002,
56). The BNC comprises approximately 100 million words of written texts (90%)
and transcripts of speech (10%) in modern British English. Written texts were
selected using three criteria: “domain”, “time” and “medium”. Domain refers to the content type (i.e. subject field) of
the text; time refers to the period of text production, while medium refers to
the type of text publication such as books, periodicals or unpublished
manuscripts. Table 1 summarizes the distribution of these criteria (see Aston/Burnard 1998, 29-30).
Table 1: Composition of the written BNC
| Domain | % | Date | % | Medium | % |
| Imaginative | 21.91 | 1960-74 | 2.26 | Book | 58.58 |
| Arts | 8.08 | 1975-93 | 89.23 | Periodical | 31.08 |
| Belief and thought | 3.40 | Unclassified | 8.49 | Misc. published | 4.38 |
| Commerce/Finance | 7.93 | | | Misc. unpublished | 4.00 |
| Leisure | 11.13 | | | To-be-spoken | 1.52 |
| Natural/pure science | 4.18 | | | Unclassified | 0.40 |
| Applied science | 8.21 | | | | |
| Social science | 14.80 | | | | |
| World affairs | 18.39 | | | | |
| Unclassified | 1.93 | | | | |
The
spoken data in the BNC was collected on the basis of two criteria: “demographic” and “context-governed”. The demographic component is composed of informal
encounters recorded by 124 volunteer respondents selected by age group, sex,
social class and geographical region, while the context-governed component
consists of more formal encounters such as meetings, lectures and radio
broadcasts recorded in four broad context categories. The two components of
spoken data complement each other, as many types of spoken text would not have
been covered if demographic sampling techniques alone were used in data collection.
Table 2 summarizes the composition of the spoken BNC. Note that in the table,
the first two columns apply to both demographic and context-governed components
while the third column refers to the latter component alone.
Table 2: Composition of the spoken BNC
| Region | % | Interaction type | % | Context-governed | % |
| South | 45.61 | Monologue | 18.64 | Educational/informative | 20.56 |
| | 23.33 | Dialogue | 74.87 | Business | 21.47 |
| North | 25.43 | Unclassified | 6.48 | Institutional | 21.86 |
| Unclassified | 5.61 | | | Leisure | 23.71 |
| | | | | Unclassified | 12.38 |
In
addition to part-of-speech (POS) information, the BNC is annotated with rich
metadata (i.e. contextual information) encoded according to the TEI guidelines,
using ISO standard 8879. Because of its generality, as well as the use of
internationally agreed standards for its encoding, the BNC corpus is a useful
resource for a very wide variety of research purposes, in fields as distinct as
lexicography, artificial intelligence, speech recognition and synthesis,
literary studies and, of course, linguistics. There are a number of ways one
can access the BNC corpus. It can be accessed online remotely using the BNC Online service or the BNCWeb interface.
Alternatively, if a local copy of the corpus is available, the BNC can be
explored using corpus exploration tools such as WordSmith
(Scott 1999).
The
current version of the full release of the BNC is BNC-2, the World Edition.
This version has removed a small number of texts (less than 50) which restrict
the worldwide distribution of the corpus. The BNC World has also corrected
errors relating to mislabeled texts and indeterminate
part-of-speech codes in the first version, and has included a classification
system of genre labels developed by Lee (2001) at
The
BNC model for achieving corpus balance and representativeness
has been followed by a number of national corpus projects including, for
example, the American National Corpus, the Polish National Corpus and the
Russian Reference Corpus.
2.2. The American National Corpus
The
American National Corpus (ANC) project was initiated in 1998 with the aim of
building a corpus comparable to the BNC. While the ANC follows the general
design of the BNC, there are differences with regard to its sampling period
and text categories. The ANC only samples language data produced from 1990
onwards whereas the sampling period for the BNC is 1960-1993. This time frame
has enabled the ANC to cover text categories which have developed recently and
thus were not included in the BNC, e.g. emails, web pages and chat room talks,
as shown in Table 3. In addition to the BNC-like core, the ANC will also
include specialized “satellite” corpora (cf. Reppen/Ide 2004, 106-107).
Table 3: Text categories in the ANC
| Channel | Text category | % |
| Written | Books (41% informative texts for various domains and 14% imaginative texts of various types) | 55 |
| | Newspapers, magazines and journals | 20 |
| | Electronic (emails, web pages etc) | 10 |
| | Miscellaneous (published and unpublished) | 5 |
| Spoken | Face-to-face/phone conversations, speech, meetings | 10 |
The
ANC corpus is encoded in XML, following the guidelines of the XML version of
the Corpus Encoding Standard. The standalone annotation, i.e. with the primary
data and annotations kept in separate documents but linked with pointers, has
enabled the corpus to be POS tagged using different tagsets
(e.g. Biber’s (1988) tags, the
CLAWS C5/C7 tagsets (Garside/Leech/Sampson 1987) and
the Penn tags (Marcus/Santorini/Marcinkiewicz 1993))
to suit the needs of different users.
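The standalone (stand-off) arrangement described above can be illustrated with a minimal Python sketch. It does not reproduce the ANC's actual file format: the sample text, the character offsets, the layer names and the function name are all assumptions made for illustration. The point is only that the primary text stays untouched while each annotation layer points into it, so that several tagsets can coexist.

```python
# Primary data: kept in its own document and never modified.
primary_text = "The cat sat"

# Each annotation layer is stored separately and points into the primary
# text by (start, end) character offsets. Layer names and the toy sentence
# are illustrative assumptions, not the ANC's actual encoding.
annotation_layers = {
    "penn": [(0, 3, "DT"), (4, 7, "NN"), (8, 11, "VBD")],
    "claws_c5": [(0, 3, "AT0"), (4, 7, "NN1"), (8, 11, "VVD")],
}


def tagged_tokens(text, layer):
    """Resolve a layer's pointers against the primary text."""
    return [(text[start:end], tag) for start, end, tag in layer]


penn_view = tagged_tokens(primary_text, annotation_layers["penn"])
c5_view = tagged_tokens(primary_text, annotation_layers["claws_c5"])
print(penn_view)
print(c5_view)
```

Adding a further tagset under this arrangement means adding one more layer; neither the primary text nor the existing layers need to change.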
The full release of the ANC is expected to be available in late 2005. At present
the first release of the corpus, which contains 11.5 million words of written
and spoken data (8.3 million words for writing and 3.2 million words for
speech, but not balanced for genre), is available from the Linguistic Data
Consortium (LDC).
2.3. The Polish National Corpus
The
Polish National Corpus (PNC) is under construction on the PELCRA (Polish and
English Language Corpora for Research and Application) project, which is
undertaken jointly by the Universities of Lodz and
Lancaster. The project aims to develop a large, fully annotated reference
corpus of native Polish, “mirroring the BNC in terms of genres and
its coverage of written and spoken language” (Lewandowska-Tomaszczyk
2003, 106). A total of 130 million words of running texts have been collected,
and part of the data (30 million words) has been compiled into a balanced
corpus, which covers genres and styles comparable in proportions to those
included in the BNC. The PNC is TEI-compliant and is annotated for part-of-speech.
Presently, a balanced PNC sampler, which contains 10 million words of both written
and spoken data reflecting proportionally the text categories in the BNC, can
be ordered from the PELCRA project site.
2.4. The Czech National Corpus
The
Czech National Corpus (CNC) consists of two sections: synchronous and
diachronic. Each section is designed to include written, spoken and dialectal
components. As some of the components are currently hardly more than blueprints
for future work (see Kučera 2002, 254), we will
only introduce the written and spoken components in the synchronous section.
The
written component of the synchronous section, which contains 100 million words,
was completed in 2000 and thus named SYN2000. SYN2000 includes both imaginative
(15%) and informative (85%) texts, each being divided into a number of text
categories, as shown in Table 4 (see Kučera
2002, 247-248). The technical and specialized texts in the corpus
proportionally cover nine domains: lifestyle (5.55%), technology (4.61%),
social sciences (3.67%), arts (3.48%), natural sciences (3.37%),
economics/management (2.27%), law/security (0.82%), belief/religion (0.74%) and
administrative texts (0.49%).
Table 4: Design of SYN2000
| Major category | Genre | % |
| Imaginative (15%) | Fiction | 11.02 |
| | Poetry | 0.81 |
| | Drama | 0.21 |
| | Other literary texts | 0.36 |
| | Transitional text types | 2.6 |
| Informative (85%) | Journal | 60 |
| | Technical/specialized texts | 25 |
Table 5: Sampling frame of the Prague Spoken Corpus
| Criteria | Type | Proportion |
| Speaker sex | Male | 50% |
| | Female | 50% |
| Speaker age | 21-35 | 50% |
| | 35+ | 50% |
| Education level | Secondary school | 50% |
| | University | 50% |
| Discourse type | Formal | 50% |
| | Informal | 50% |
The
spoken component of the synchronous section, the so-called Prague Spoken Corpus
(PMK), contains 800,000 words of transcription of authentic spoken language
sampled in a balanced way according to four sociolinguistic criteria: speaker
sex, age, educational level and discourse type, as shown in Table 5. The data
contained in the Prague Spoken Corpus consists exclusively of impromptu spoken
language (roughly equivalent to the demographically sampled component in the
BNC). Texts representing various blends of written and spoken language such as
lectures, political speeches and play scripts are included in a special section
in the written corpus (cf. Kučera 2002, 248,
253).
Both
SYN2000 and the Prague Spoken Corpus are marked up in TEI-compliant SGML and
tagged to show part-of-speech categories. SYN2000 is licensed free of charge
for non-commercial use. A scaled-down version of SYN2000, PUBLIC, which
contains 20 million words with the same genre distribution, is accessible
online at the corpus website. The
tagged version of the Prague Spoken Corpus will also be made publicly available
in the near future.
2.5. The Hungarian National Corpus
The
Hungarian National Corpus (HNC) is a balanced reference corpus of present-day
Hungarian. The corpus contains 153.7 million words of texts produced from the
mid-1990s onwards, which are divided into five subcorpora,
each representing a written text type: media (52.7%), literature (9.43%),
scientific texts (13.34%), official documents (12.95%) and informal texts (e.g.
electronic forum discussion, 11.58%). The size of the literary subcorpus is expected to increase from the current 14.5
million words to approximately 40 million words (see Váradi
2002, 386). The HNC is encoded in SGML in compliance with Corpus Encoding
Standard (CES) and annotated for part-of-speech. The corpus can be accessed
free of charge after registration via the online query system at
the corpus site.
2.6. The Russian Reference Corpus
The
Russian Reference Corpus (BOKR) is designed as a Russian match for the BNC. The
corpus contains 100 million words of modern Russian, following the general
sampling frame of the BNC, as shown in Table 6 (see Sharoff
2004).
Table 6: Sampling frame of the Russian Reference Corpus
| Text category | Proportion |
| Spoken | 5% |
| Life (Imaginative texts in the BNC) | 30% |
| Natural sciences | 5% |
| Applied sciences | 10% |
| Social sciences | 12% |
| Politics (World affairs in the BNC) | 15% |
| Commerce | 5% |
| Arts | 5% |
| Religion and philosophy (Belief and thought in the BNC) | 3% |
| Leisure | 10% |
The BOKR corpus is encoded in TEI-compliant SGML and annotated for part-of-speech.
As Russian is a highly inflected language, the technique used in annotating
English corpora with complex POS tags is impractical for Russian, because it
would entail thousands of tags, which would make corpus exploration ineffective,
if not impossible. Hence in the Russian Reference Corpus, each word is
annotated with a bundle of lexical and syntactic features such as
part-of-speech, aspect, transitivity, voice, gender, number and tense. Individual
features from the bundle associated with each word can be selected in a
window in the query interface. The corpus is under construction and its final release is expected
by the end of 2004 (cf. Sharoff 2004).
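As a rough illustration of the feature-bundle idea, the sketch below stores each word's annotation as a set of attribute-value pairs and lets a query constrain any subset of them. The example words, the feature names and values, and the `select` helper are assumptions made for illustration; they do not reproduce the actual BOKR annotation scheme or its query interface.

```python
# Each token carries a bundle of independent features instead of one
# composite tag. The words and feature values are illustrative only.
tokens = [
    {"form": "читала", "pos": "verb", "aspect": "imperfective",
     "tense": "past", "gender": "feminine", "number": "singular"},
    {"form": "прочитал", "pos": "verb", "aspect": "perfective",
     "tense": "past", "gender": "masculine", "number": "singular"},
    {"form": "книгу", "pos": "noun", "gender": "feminine",
     "number": "singular"},
]


def select(tokens, **constraints):
    """Return the forms whose bundle matches every requested feature."""
    return [
        tok["form"]
        for tok in tokens
        if all(tok.get(name) == value for name, value in constraints.items())
    ]


past_verbs = select(tokens, pos="verb", tense="past")
feminine = select(tokens, gender="feminine")
print(past_verbs)
print(feminine)
```

A composite tagset would need a separate tag for every attested combination of these values, whereas here a query names only the features it cares about.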
The CORIS (Corpus di Italiano Scritto) corpus is a general reference corpus of
present-day Italian. It contains 100 million words of written Italian sampled
from six text categories, which constitute six subcorpora,
as shown in Table 7.
Table 7: Components of the CORIS corpus
| Category | Subcategory | Proportion |
| Press | Newspapers, periodicals, supplement | 38% |
| Fiction | Novel, short stories | 25% |
| Academic prose | Human sciences, natural sciences, physics, experimental sciences | 12% |
| Legal and administrative prose | Legal, bureaucratic, administrative documents | 10% |
| Miscellaneous | Books on religion, travel, cookery, hobbies, etc. | 10% |
| Ephemeral | Letters, leaflets, instructions | 5% |
Unlike most national corpora, which are sample corpora, the CORIS corpus follows a
dynamic corpus model: it will be updated every two years by means of a
built-in monitor corpus (Rossini Favretti/Tamburini/de
Santis 2004). The current version of the corpus can
be accessed online free of charge via a web-based query system at the corpus website.
2.8. The Hellenic National Corpus
The
Hellenic National Corpus is a 32-million-word corpus of written Modern Greek
sampled from several publication media covering various genres (articles,
essays, literary works, reports, biographies etc.) and domains (economy,
medicine, leisure, art, human sciences etc.) published from 1976 onwards. Of
the four types of medium, books account for 15.75% of the total texts,
newspapers 69.01%, periodicals 6.97% and miscellaneous (correspondence,
electronic texts, ephemera, and hand-written/typed material) 8.27%. The text classification
with regard to medium, genre and domain follows the PAROLE
standards. This taxonomy information, together with the bibliographic
information, is encoded in TEI-compliant SGML (cf. Hatzigeorgiu/Gavrilidou/Piperidis
et al 2000, 1737). The corpus can be accessed online at the corpus site, where users can make
queries concerning the lexicon, morphology, syntax and usage of Modern Greek (e.g.
words, lemmas, part-of-speech categories or combinations of the three).
2.9. The German National Corpus
The
German National Corpus is a product of the DWDS (Digital Dictionary of the 20th
Century German Language) project. The corpus is divided into two parts, a
100-million-word balanced core and a much larger opportunistic subcorpus. This section introduces the core corpus, which
is roughly comparable to the British National Corpus, covering the whole 20th
century (1900-2000). Table 8 shows the text categories covered in the corpus.
Table 8: Design criteria of the German National Corpus
| Text category | Proportion |
| Literature | 25% |
| Journalistic prose | 25% |
| Scientific texts | 20% |
| Specialized texts (advert, manuals, etc) | 20% |
| Spoken (everyday language, televised debates, dialect, etc) | 10% |
The
metadata such as genre information is encoded in XML. Linguistic annotation
consists basically of lemmatization, part-of-speech and semantic annotation on the
word level, as well as prepositional phrase and noun phrase recognition on the
phrase level (Cavar/Geyken/Neumann 2000). The core
corpus is available for online search at the corpus site after free-of-charge
registration.
2.10. The Slovak National Corpus
The
Slovak National Corpus is presently under construction. The project aims to
create a 200-million-word corpus of the Slovak language. The first phase of the
project has produced a corpus containing 30 million words of written texts
published between 1990 and 2003, which will be expanded to other periods
of the contemporary language (1955-2005) to reach the target
size in the second phase of the project (2003-2006). The final corpus will also
include diachronic and dialectological texts.
At
present the 30-million-word part of the corpus has been annotated with
lemmatization, morphological and source (bibliographical
and style-genre) information. Users can access the corpus using a simple online query system at the corpus
website. More complex searches require the “corpus manager”, which supports regular expressions and can be downloaded
at the same site.
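To give a sense of what a regular-expression query adds over a simple word search, here is a small sketch using Python's standard `re` module. It is not the query syntax of the Slovak corpus manager; the sample sentence and the pattern are assumptions chosen only to show that a single pattern can retrieve several related word forms at once.

```python
import re

# A toy "corpus" line; the sentence is invented for illustration.
text = "The corpus was tagged, the tagger tags words, and tagging took days."

# One pattern retrieves every word form that starts with "tag".
pattern = re.compile(r"\btag\w*", re.IGNORECASE)

hits = pattern.findall(text)
print(hits)
```

A simple query for the single form "tags" would have returned only one of these four hits.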
We
have so far introduced national corpora for European languages. The next two
sections will introduce two national corpora of Asian languages, namely Chinese
and Korean.
2.11. The Modern Chinese Language Corpus
The
Modern Chinese Language Corpus (MCLC) is
Table 9: Components of the MCLC corpus
| Category | Subcategory | Proportion |
| Humanities and social sciences (8 subcategories) | Politics and laws, history, society, economics, arts, literature, military and physical education, life | 59.6% |
| Natural sciences (6 subcategories) | Mathematics and physics, biology and chemistry, astronomy and geography, oceanology and meteorology, agriculture and forestry, medical and health | 17.24% |
| Miscellaneous (6 subcategories) | Official documents, regulations, judicial documents, business documents, ceremonial speech, ephemera | 9.36% |
| Newspapers | | 13.79% |
A scaled-down version of the corpus, the core, which
contains 20 million characters proportionally sampled from the larger corpus,
is tokenized and tagged with part-of-speech categories. The MCLC license can be
purchased from the National
Language Committee of China.
2.12. The Sejong Balanced Corpus
The 21st Century Sejong
project was launched in 1998 as a ten-year development project to build various
kinds of language resources including Korean corpora and Korean electronic
dictionaries. One of the goals of the project is to construct a balanced
national corpus (300 million words and phrases from modern Korean, spoken
materials, North Korean language, words of foreign origin, etc.), comparable to
the BNC. By 2003 a raw corpus of modern Korean had been compiled, containing 57
million words, with a further 75 million words of already existing electronic texts
being processed and standardized. The corpus also includes around 3 million
words of spoken data.
The markup scheme used in
the Sejong Corpus is TEI-compliant. As of 2003, 10
million words have been morphologically annotated, 5.5 million words sense
tagged, and 150,000 words treebanked (see Kang/Kim
2004, 1747). The corpus is accessible over the Internet after registration at the corpus site.
In addition to those introduced above, there are a
number of nation-level corpora which are either already available or are under
construction. They include, for example, the FRANTEXT Database,
the Croatian National Corpus
(30 million words), Korpus 2000 for Danish (28 million words) and the National Corpus of Irish
(15 million words). A number of corpora representing other national languages
are also under construction, including, for example, Norwegian (Choukri 2003), Dutch (Wittenburg/Brugman/Broeder 2000), Maltese (Dalli 2001), Basque (Aduriz/Aldezabal/Alegria et al 2003), Kurdish (Gautier 1998), Nepali (Glover 1998), Tamil (Malten 1998) and
While most of the national corpora introduced in
section 2 follow a static sample corpus model, there are also corpora which are constantly
updated to track rapid language change, such as the development and the life
cycle of neologisms. Corpora of this type are referred to as monitor
corpora.
The best-known monitor corpus is the Bank of English
(BoE), which was initiated in 1991 on the COBUILD (Collins Birmingham University International Language Database) project. The corpus was designed to represent standard
English as it was relevant to the needs of learners, teachers and other users,
while also being of use to researchers of present-day English. Written
texts (75%) come from newspapers, magazines, fiction and non-fiction books,
brochures, reports, and websites while spoken data (25%) consists of
transcripts of television and radio broadcasts, meetings, interviews,
discussions, and conversations. The majority of the material in the corpus
represents British English (70%) while American English and other varieties
account for 20% and 10% respectively. Presently the BoE
contains 524 million words of written and spoken English. The corpus keeps
growing with the constant addition of new material.
The BoE corpus is
particularly useful for lexical and lexicographic studies, for example,
tracking new words, new uses or meanings of old words, and words falling out of
use. A 56-million-word sampler of the corpus can be accessed online free of
charge at the corpus
website. Access to larger corpora is granted by special arrangement.
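The kind of lexical tracking a monitor corpus supports can be sketched in a few lines of Python. This is a toy illustration, not a description of how the BoE is actually processed: the two dated text slices are invented, and a real study would compare frequencies over many large slices rather than simple presence or absence.

```python
from collections import Counter

# Two dated slices of a toy monitor corpus; the texts are invented.
slices = {
    1995: "we sent a letter and a fax to the office".split(),
    2005: "we sent an email and a text to the office blog".split(),
}

# Word frequencies per slice (only presence/absence is used below).
earlier = Counter(slices[1995])
later = Counter(slices[2005])

# Candidates for new words: attested in the later slice only.
new_words = sorted(word for word in later if word not in earlier)
# Candidates for words falling out of use: attested in the earlier slice only.
lost_words = sorted(word for word in earlier if word not in later)
print(new_words, lost_words)
```

With slices this small the candidate lists are noisy (the function word "an" shows up as "new"), which is one reason such tracking needs a very large and continually growing corpus.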
3.2. The Global English Monitor Corpus
Another corpus of the monitor type is the Global English Monitor
Corpus, which was started in late 2001 as an electronic archive of the
world’s leading newspapers in English. The corpus aims
at monitoring language use and semantic change in English as reflected in newspapers
so as to allow for research into whether the English language discourses in
4. Corpora of the Brown family
The first modern corpus of English, the Brown University Standard Corpus of
Present-day American English (i.e. the Brown corpus, see Kučera/Francis
1967), was built in the early 1960s for written American English. The
population from which samples for this pioneering corpus were drawn was written
English text published in the
The
Brown corpus was constructed with comparative studies in mind, in the hope of
setting the standard for the preparation and presentation of further bodies of
data in English or in other languages. This expectation has now been realized.
Since its completion, the Brown corpus model has been followed in the
construction of a number of corpora for synchronic and diachronic studies as
well as for cross-linguistic contrast. Table 10 shows a brief comparison of
these corpora.
Table 10: Corpora of the Brown family
| Corpus | Language variety | Period | Samples | Words (million) |
| Brown | American English | 1961 | 500 | 1 |
| Frown | American English | 1991-1992 | 500 | 1 |
| LOB | British English | 1961 | 500 | 1 |
| Pre-LOB | British English | 1931 +/- 3 years | 500 | 1 |
| FLOB | British English | 1991-1992 | 500 | 1 |
| Kolhapur | Indian English | 1978 | 500 | 1 |
| ACE | Australian English | 1986 | 500 | 1 |
| WWC | New Zealand English | 1986-1990 | 500 | 1 |
| LCMC | Mandarin Chinese | 1991 +/- 3 years | 500 | 1 |
As
can be seen, these corpora are roughly comparable but have sampled different
languages or language varieties. Their sampling periods are either similar for
the purposes of synchronic comparison or distanced by about three decades for
the purposes of diachronic comparison. For example, the Brown and LOB (the
Lancaster-Oslo-Bergen corpus of British English, see Johansson/Leech/Goodluck 1978) can be used to compare American and British
English as used in the early 1960s. The updated versions of the two corpora,
Frown (see Hundt/Sand/Skandera 1999) and FLOB (see Hundt/Sand/Siemund 1998) can be used to compare the two
major varieties of English as used in the early 1990s. Other corpora of
a similar sampling period, such as ACE (the Australian Corpus of English, also
known as the Macquarie corpus), WWC (the Wellington Corpus of Written New
Zealand English) and Kolhapur (the Kolhapur Corpus of Indian English), together with FLOB and
Frown, allow for comparison of “world Englishes”. For diachronic studies, the Brown and Frown corpora on the one
hand, and the Pre-LOB, LOB and FLOB corpora on the other, provide a reliable basis for tracking
recent language change over 30-year periods. The LCMC corpus (the Lancaster
Corpus of Mandarin Chinese, see McEnery/Xiao/Mo
2003), when used in combination with FLOB/Frown corpora, provides a valuable
resource for contrastive studies between Chinese and two major varieties of
English.
In
comparing these corpora synchronically, caution must be exercised to ensure
that the sampling periods are similar. For example, comparing the Brown corpus
with FLOB would involve not only language varieties but also language change.
Also, as the Brown model may have been modified slightly in some of these
corpora, account must be taken of such variation in comparing these corpora
across text category by normalizing the raw frequencies to a common basis.
Table 11 compares the text categories and number of samples for each category
in these corpora.
Table 11: Text categories in the corpora of the Brown family
| Code | Text category | Brown | Frown | LOB | FLOB | Pre-LOB | Kolhapur | ACE | WWC | LCMC |
| A | Press reportage | 44 | 44 | 44 | 44 | 44 | 44 | 44 | 44 | 44 |
| B | Press editorials | 27 | 27 | 27 | 27 | 27 | 27 | 27 | 27 | 27 |
| C | Press reviews | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
| D | Religion | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
| E | Skills, trades and hobbies | 36 | 36 | 38 | 38 | 38 | 38 | 38 | 38 | 38 |
| F | Popular lore | 48 | 48 | 44 | 44 | 44 | 44 | 44 | 44 | 44 |
| G | Biographies and essays | 75 | 75 | 77 | 77 | 77 | 70 | 77 | 77 | 77 |
| H | Miscellaneous (reports, official documents) | 30 | 30 | 30 | 30 | 30 | 37 | 30 | 30 | 30 |
| J | Science (academic prose) | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 |
| K | General fiction | 29 | 29 | 29 | 29 | 29 | 59 | 29 | 29 | 29 |
| L | Mystery and detective fiction | 24 | 24 | 24 | 24 | 24 | 24 | 15 | 24 | 24 |
| M | Science fiction | 6 | 6 | 6 | 6 | 6 | 2 | 7 | 6 | 6 |
| N | Western and adventure fiction | 29 | 29 | 29 | 29 | 29 | 15 | 8 | 29 | 29 |
| P | Romantic fiction | 29 | 29 | 29 | 29 | 29 | 18 | 15 | 29 | 29 |
| R | Humour | 9 | 9 | 9 | 9 | 9 | 9 | 15 | 9 | 9 |
| S | Historical fiction | - | - | - | - | - | - | 22 | - | - |
| W | Women’s fiction | - | - | - | - | - | - | 15 | - | - |
It
can be seen from the table that the two American English corpora (Brown and Frown)
have the same numbers of samples for each of the 15 text categories while the
British English corpora share the same proportions. The two groups differ in
the numbers of samples for categories E, F, and G. The WWC and LCMC corpora
follow the model of FLOB. There are important differences between the
With
the exceptions of the Pre-LOB corpus, which is under construction, and LCMC,
which is distributed by the European Language Resources Association (ELRA), all of the corpora of the Brown family
are available from the International Computer Archive of Modern and Medieval
English (ICAME).
The
corpora of the Brown family are balanced corpora representing a static snapshot
of a language or language variety in a certain period. While they can be used
for synchronic and diachronic studies, more appropriate resources for these
kinds of research are synchronic and diachronic corpora, which will be
introduced in the following two sections.
While
the corpora of the Brown family are generally good for comparing language
varieties such as world Englishes, the results from
such a comparison must be interpreted with caution when the corpora under
examination were built for different periods or the Brown model has been
modified. A more reliable basis for comparing language varieties is a
synchronic corpus.
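Where Brown-family corpora are nevertheless compared across text categories, the normalization to a common basis mentioned in section 4 amounts to dividing each raw count by the size of the category it came from and scaling to a fixed base. The sketch below is only an illustration: the raw counts are hypothetical, the function name is invented, and category sizes are approximated as 2,000 words per sample (one million words divided by 500 samples). Only the numbers of samples for category G (75 in Brown, 77 in LOB) come from Table 11.

```python
WORDS_PER_SAMPLE = 2_000  # approximation: one million words / 500 samples


def per_million(raw_count, n_samples, words_per_sample=WORDS_PER_SAMPLE):
    """Convert a raw count into a frequency per million words."""
    category_size = n_samples * words_per_sample
    return raw_count * 1_000_000 / category_size


# Category G (biographies and essays): 75 samples in Brown, 77 in LOB.
# The raw counts are hypothetical, chosen only to show the effect.
brown_g = per_million(raw_count=300, n_samples=75)
lob_g = per_million(raw_count=308, n_samples=77)
print(brown_g, lob_g)
```

The two raw counts differ (300 vs. 308), but once the different category sizes are taken into account the two rates are the same, so the apparent difference is an artefact of the sampling frame.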
5.1. The International Corpus of English
A
typical corpus of this type is the International Corpus of English (ICE), which
is specifically designed for the synchronic study of world Englishes.
The ICE corpus consists of a collection of twenty corpora of one million words
each, each composed of written and spoken English produced during 1990-1994 in
countries or regions in which English is a first or official language (e.g.
Table 12: Corpus design of ICE
| Spoken (300) | Dialogues (180) | Private | Conversations (90) |
| | | Public | Class lessons (20) |
| | Monologues (120) | Unscripted | Commentaries (20) |
| | | Scripted | Broadcast news (20) |
| Written | Non-printed (50) | Student writing | Student essays (10) |
| | | Letters | Social letters (15) |
| | Printed | Academic | Humanities (10) |
| | | Popular | Humanities (10) |
| | | Reportage | Press reports (20) |
| | | Instructional | Administrative writing (10) |
| | | Persuasive | Editorials (10) |
| | | Creative | Novels (20) |
The ICE corpora are marked up and annotated at various levels. In written texts,
features of the original layout are marked, including sentence and paragraph
boundaries, headings, deletions, and typographic features, while spoken texts
are transcribed orthographically and marked for pauses, overlapping
strings, discourse phenomena such as false starts and hesitations, and speaker
turns. The bibliographic markup, which gives a
complete description (e.g. text category, date, and publisher) of each text, is
stored in the corpus header of each file. The level of annotation varies
across the ICE corpora. Some of them are POS tagged and parsed (e.g.
the British component ICE-GB) while others are currently available as unannotated lexical corpora (e.g. the components for
5.2. The Longman/Lancaster Corpus
The Longman/Lancaster Corpus consists of about 30 million words of published
English. British data account for 50% and American data for 40%, while the remaining 10%
represents other varieties such as Australian, African
and Irish English. One half of the samples were selected
randomly (“microcosmic texts”) and the other half by a panel of experts (“selective texts”). Most texts in the
corpus are about 40,000 words long, but no whole texts are used.
Both imaginative and informative text categories are included. Imaginative texts
come from well-known literary works and works randomly sampled from books in
print; informative texts come from the natural and social sciences, world
affairs, commerce and finance, the arts, leisure, and so on. Imaginative texts
are mainly works of fiction in book form, while informative texts comprise
books, newspapers and journals, unpublished material and ephemera. Four external
criteria have been used in text selection (see Holmes-Higgin/Abidi/Ahmad
1994): “region” (language
varieties), “time” (1900s-1980s), “medium” (books 80%, periodicals 13.3% and
ephemera 6.7%), and “level” (literary, middle and popular for imaginative texts,
and technical, lay and popular for
informative texts). As part of the Longman Corpus
Network, the Longman/Lancaster Corpus is not available for public access.
5.3. The Longman Written American Corpus
The
Longman Written American Corpus currently contains over 100 million words of
running texts taken from newspapers, journals, magazines, best-selling novels,
technical and scientific writing, and coffee-table books. The design of the
Longman Written American Corpus is based on the general design principles of
the Longman/Lancaster Corpus and the written section of the BNC. The corpus is
dynamically refined and keeps growing with the constant addition of new
materials. Like the other components of the Longman Corpus
Network, this corpus does not appear to allow public access.
5.4. The CREA corpus of Spanish
The
CREA (Corpus de Referencia
The
CREA was designed as a monitor corpus which is continually updated so that it
always represents the last twenty-five years of the history of Spanish. New
data is added proportionally to maintain the corpus balance and to ensure that
the various trends in current Spanish are represented. Texts for 2000-2004 are
currently being incorporated (Sánchez 2002).
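The rolling twenty-five-year window described above can be illustrated with a minimal sketch. It is not the CREA project's actual update procedure: the `Text` record, the `monitor_window` function and the sample holdings are all hypothetical, and the sketch ignores the proportional balancing of new data that the text mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """A corpus text reduced to the two fields this sketch needs (hypothetical)."""
    title: str
    year: int


def monitor_window(texts, current_year, span=25):
    """Keep only texts from the last `span` years, counting `current_year` itself."""
    first_year = current_year - span + 1
    return [t for t in texts if first_year <= t.year <= current_year]


# Invented sample records, not CREA data.
holdings = [Text("a", 1975), Text("b", 1980), Text("c", 1999), Text("d", 2004)]
window_2004 = monitor_window(holdings, current_year=2004)
```

Under these assumptions, moving `current_year` forward drops the oldest texts from the window while newly added texts enter it, which is the sense in which a monitor corpus of this kind keeps tracking the most recent period.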
The CREA corpus is marked up in SGML. Bibliographic and taxonomic information is
encoded in the corpus header of each file. For written texts, both structural
(paragraph and page number) and intratextual (notes,
formulas, tables, quotations, foreign words, etc.) features are encoded. For spoken
texts, the markup scheme indicates structural (speech
turns) and non-structural (overlapping, hesitation, anacoluthon, etc.) features
(cf. Guerra 1998).
The
modular structure of the CREA corpus allows
for flexible searches using geographical, generic, temporal, and thematic
criteria. The corpus is accessible on the Internet.
5.5. The LIVAC corpus of Chinese
The
LIVAC (Linguistic Variation in Chinese Speech Communities) project started in 1993
with the aim of building a synchronous corpus for studying varieties of
Mandarin Chinese. For this purpose, data has been collected regularly and
simultaneously, once every four days since July 1995, from representative
Mandarin Chinese newspapers and the electronic media of six Chinese speaking
communities:
All of the corpus texts in LIVAC are segmented automatically and checked by hand.
In addition to the corpus, a lexical database is derived from the segmented texts,
which includes, apart from ordinary words, those expressing new concepts or
undergoing sense shifts, as well as region-specific words from the six
communities. The database is thus a rich resource for research into
linguistics, sociolinguistics, and Chinese language and society.
As
LIVAC captures the social, cultural, and linguistic developments of the six
Chinese speaking communities within a decade, it allows for a wide range of
comparative studies on linguistic variation in Mandarin Chinese. The corpus
also provides an important resource for tracking lexical development such as
the evolution of new concepts and their expressions in present-day Chinese. A
sample of the corpus (data covering the period from July 1995 to June 1996) can
be accessed using the online query system at the corpus site, which shows
KWIC concordances as well as frequency distribution across the six speech
communities.
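As a rough illustration of the two kinds of output mentioned above, the following sketch builds KWIC (keyword in context) lines and a per-community frequency count from texts that are already segmented into words. It is not the LIVAC query system: the function names, the community labels `community_A` and `community_B`, and the toy tokens are all invented for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit in a token list."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def frequency_by_community(corpus, keyword):
    """Count occurrences of the keyword in each community's token list."""
    return {name: tokens.count(keyword) for name, tokens in corpus.items()}


# Toy, pre-segmented data with placeholder community labels (not LIVAC data).
corpus = {
    "community_A": ["new", "phone", "card", "new", "phone"],
    "community_B": ["phone", "call", "new"],
}
hits = kwic(corpus["community_A"], "phone", width=1)
counts = frequency_by_community(corpus, "phone")
```

Because the input is a list of tokens rather than raw text, the sketch presupposes the segmentation step described earlier; for Chinese, which is written without spaces between words, that step is what makes word-based concordancing possible.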
Another way to explore language variation is from a diachronic perspective using
diachronic corpora. A diachronic (or historical) corpus contains texts from the
same language gathered from different time periods. Typically the time span covered is
far more extensive than that covered by Brown/Frown and LOB/FLOB or a monitor
corpus such as the Bank of English. Diachronic corpora are used to track
language change over time. This section introduces a number of corpora of
this kind.
6.1. The Helsinki Corpus of English Texts
Perhaps
the best-known historical corpus is the diachronic part of the Helsinki Corpus
of English Texts (i.e. the Helsinki corpus), which consists of approximately
1.5 million words of English in the form of 400 text samples, dating from the 8th
to 18th centuries. The corpus is divided into three periods (Old,
Middle, and Early Modern English) and eleven subperiods,
as shown in Table 13 (cf. Kytö 1996).
Table 13: Periods covered in the Helsinki Diachronic Corpus
| Period | Subperiod | Words | Percent | Overall |
| --- | --- | --- | --- | --- |
| Old English | I. –850 | 2,190 | 0.5 | 413,250 |
| | II. 850-950 | 92,050 | 22.3 | |
| | III. 950-1050 | 251,630 | 60.9 | |
| | IV. 1050-1150 | 67,380 | 16.3 | |
| | Total | 413,250 | 100 | 26.27% |
| Middle English | I. 1150-1250 | 113,010 | 18.6 | 608,570 |
| | II. 1250-1350 | 97,480 | 16.0 | |
| | III. 1350-1420 | 184,230 | 30.3 | |
| | IV. 1420-1500 | 213,850 | 35.1 | |
| | Total | 608,570 | 100 | 38.70% |
| Early Modern English | I. 1500-1570 | 190,160 | 34.5 | 551,000 |
| | II. 1570-1640 | 189,800 | 34.5 | |
| | III. 1640-1710 | 171,040 | 31.0 | |
| | Total | 551,000 | 100 | 35.03% |
| Total | | 1,572,820 | | 100% |
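Table 13 reports two kinds of percentage: each subperiod's share of its own period, and each period's share of the whole corpus. The sketch below recomputes both from the word counts in the table; the variable and function names are illustrative only, and the word counts are the only values taken from the source.

```python
# Word counts per subperiod, copied from Table 13.
HELSINKI_WORDS = {
    "Old English": [2190, 92050, 251630, 67380],
    "Middle English": [113010, 97480, 184230, 213850],
    "Early Modern English": [190160, 189800, 171040],
}


def shares(counts, digits=1):
    """Percentage of the group total contributed by each count."""
    total = sum(counts)
    return [round(100 * c / total, digits) for c in counts]


# Period totals, the corpus total, and the two kinds of percentage.
period_totals = {period: sum(counts) for period, counts in HELSINKI_WORDS.items()}
corpus_total = sum(period_totals.values())
subperiod_shares = {period: shares(counts) for period, counts in HELSINKI_WORDS.items()}
overall_shares = dict(zip(period_totals, shares(list(period_totals.values()), digits=2)))
```

The recomputed totals agree with the table. Individual percentages can differ from the printed figures in the last digit, presumably because of rounding in the original.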
In
addition to the basic selection of texts as indicated in the table, there is a
supplementary part in the
As the Helsinki corpus not only samples different periods covering one millennium
but also encodes genre and sociolinguistic information, it allows
researchers to go beyond simply dating and reporting language change by
combining diachronic, sociolinguistic and genre studies. The
ARCHER, an acronym for “A Representative Corpus of Historical
English Registers”, contains 1.7 million words of data in
the form of 1,037 texts sampled from seven 50-year historical periods between
1650 and 1990, spanning Early Modern to present-day English. The corpus is designed as a balanced
representation of seven written genres (journal-diaries, letters, fiction, news,
science, etc.) and three speech-based genres (fictional conversation, drama and
sermons-homilies) in British English (two thirds of the corpus) and American English
(one third, with data available only for the periods 1750-1799, 1850-1899 and
1950-1990). Each 50-year subcorpus includes
20,000-30,000 words per register, typically containing ten texts of
approximately 2,000-3,000 words each (cf. Biber/Finegan/Atkinson
1994). ARCHER is tagged for grammatical/functional categories. It allows for a
wide variety of investigations of recent linguistic change and of change in
discourse and genre conventions. The corpus is presently being expanded with
more American texts to make the American and British data comparable (see ARCHER 2).
The expanded version will also enable a systematic diachronic comparison of the two
varieties of English. However, because of copyright restrictions,
ARCHER is not publicly available at the moment. Readers interested in using
this corpus can contact Douglas Biber.
In
addition to the
6.3. The Lampeter Corpus of Early Modern English Tracts
The
Lampeter Corpus of Early Modern English Tracts is a
balanced corpus covering one century between 1640 and 1740, which is divided
into ten decades. Each decade consists of data sampled from six domains
(religion, politics, economics/trade, science, law and miscellaneous). Two
complete texts, ranging from 3,000 to 20,000 words, are included for each
domain within each decade, totaling approximately 1.1
million words (Schmied 1994).
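The sampling frame just described (ten decades, six domains, two texts per cell) can be enumerated in a few lines. This is only a sketch of the design grid, not of the corpus files themselves: the decade labels are an assumption about how the decade boundaries are drawn, and all names are illustrative.

```python
from itertools import product

# Ten decade labels from "1640s" to "1730s" (assumed labelling of the period).
DECADES = [f"{start}s" for start in range(1640, 1740, 10)]
DOMAINS = ["religion", "politics", "economics/trade", "science", "law", "miscellaneous"]
TEXTS_PER_CELL = 2


def sampling_frame():
    """List every (decade, domain, text number) slot in the design grid."""
    return [
        (decade, domain, n)
        for decade, domain in product(DECADES, DOMAINS)
        for n in range(1, TEXTS_PER_CELL + 1)
    ]


slots = sampling_frame()
# Average text length implied by the approximate corpus size given in the text.
mean_words_per_text = 1_100_000 / len(slots)
```

The grid yields 120 text slots, so the stated total of about 1.1 million words implies an average of roughly 9,000 words per text, which falls within the 3,000 to 20,000 word range given above.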
The Lampeter corpus is encoded in TEI-compliant SGML. The
TEI headers provide the framework for historical, sociolinguistic and
stylistic investigations, including information regarding authors (name, age,
sex, place of residence, education, social status, political affiliation),
printers/publishers, place and date of print, publication format, text
characteristics and bibliographical sources. As the corpus includes whole texts
rather than smaller samples, it is also useful for the study of textual
organization in Early Modern English. The Lampeter
corpus can be ordered from ICAME
or OTA.
6.4. The Dictionary of Old English Corpus in Electronic Form
The Dictionary of Old English Corpus in Electronic Form (DOEC, the 2000 release)
contains 3,037 texts of Old English, totaling over
three million words, in addition to two million words of Latin. The corpus
comprises practically all extant Old English writings. It
includes at least one copy of each surviving text in Old English, and more than
one copy where a further copy is significant because of dialect or date.
These texts cover six text categories: poetry, prose, interlinear
glosses, glossaries, runic inscriptions, and inscriptions in the Latin
alphabet. In the prose category in particular, a wide range of text types is
covered, including, for example, saints’ lives, sermons,
biblical translations, penitential writings, laws, charters and wills, records
(of manumissions, land grants, land sales, land surveys), chronicles, a set of
tables for computing the moveable feasts of the Church calendar and for
astrological calculations, medical texts, prognostics (the Anglo-Saxon
equivalent of the horoscope), charms (such as those for a toothache or for an
easy labour), and even cryptograms (cf. the corpus website). The texts in the
corpus are encoded in TEI-compliant SGML. The DOEC corpus can be ordered on CDs
or accessed online by institutional site license at the corpus website. The
web-based query system allows for searches by single words, word combinations,
word proximity and bibliographic sources.
6.5. Early English Books Online
Early
English Books Online (EEBO) is a joint effort launched in 1999 between the
6.6. The Corpus of Early English Correspondence
The
Corpus of Early English Correspondence (CEEC, the 1998 version) consists of 96
collections of ca. 6,000 personal letters written by 778 people (women
accounting for 20%) between 1417 and 1681, totaling
2.7 million words. The corpus is accompanied by a sender database, which offers
users easy access to various sociolinguistic variables, including writer age,
gender, place of birth, education, occupation, social rank, domicile and the
relationship with the addressee. CEEC is a balanced corpus which can be neatly
divided into two parts, both covering chronologically fairly equal periods: the
first from ca. 1417 to 1550 and the second from 1551 to 1680 (cf. Laitinen 2002). Table 14 shows the proportions in terms of
writers’ social ranks and domiciles (see Nevalainen 2000: 40). The CEEC corpus is currently being
expanded with personal letters written between 1682 and 1800 to cover the
18th century.
Table 14: The CEEC corpus by rank and domicile
| Rank (percent) | Domicile (percent) |
| --- | --- |
| Royalty: 2.4 | Court: 7.8 |
| Nobility: 14.7 | |
| Gentry: 39.3 | |
| Clergy: 13.6 | North: 12.5 |
| Professionals: 11.2 | Other regions: 48.6 |
| Merchants: 8.4 | |
| Other nongentry: 9.4 | |
As copyright restrictions have prevented public access to the full release of the
CEEC corpus, a CEEC sampler (CEECS) has been published by ICAME, which represents the
non-copyrighted materials included in CEEC. The sampler reflects the structure of
the full CEEC only in some respects. The time covered is nearly the same
(1418-1680) and is divided into two parts: CEECS1 (246,055 words) covers the
15th and 16th centuries while CEECS2 (204,030 words) covers the 17th century.
The sampler corpus consists of 23 collections of 1,147 letters with 194
informants, totaling 450,085 words. The CEEC sampler is
available from ICAME or OTA.
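The sampler figures above can be cross-checked and turned into a few simple averages. The sketch below uses only the numbers quoted in the text; the variable names are illustrative, and the averages are derived here rather than reported in the source.

```python
# Figures quoted in the text for the CEEC sampler.
CEECS_WORDS = {"CEECS1": 246_055, "CEECS2": 204_030}
LETTERS = 1_147
INFORMANTS = 194

# Total size and each part's share of it.
total_words = sum(CEECS_WORDS.values())
part_shares = {name: 100 * words / total_words for name, words in CEECS_WORDS.items()}

# Derived averages (not reported in the source).
words_per_letter = total_words / LETTERS
letters_per_informant = LETTERS / INFORMANTS
```

The two parts add up to the stated total of 450,085 words, with CEECS1 contributing somewhat more than half.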
6.7. The Zurich English Newspaper Corpus
The Zurich English Newspaper Corpus (ZEN) is a 1.2-million-word collection of
early English newspapers, covering 120 years (from 1671 to 1791) of British
newspaper history. To achieve representative coverage, a wide variety of
newspapers were included. Up to ten issues per newspaper were selected at
ten-year intervals throughout the whole period. With the exception of
stock market reports, lottery figures, long lists of names and poetry, the
newspapers were included in their entirety. The news stories are grouped into
two major categories, foreign news and home news, with each category
further classified according to its own text genre definition (cf.
Fries/Schneider 2000). The corpus is split into four 30-year periods in order
to track potential language change, as shown in Table 15 (see Schneider 2002:
202).
Table 15: The ZEN corpus
| Section | Period | Words | Sentences |
| --- | --- | --- | --- |
| A | 1670-1709 | 242,758 | 7,642 |
| B | 1710-1739 | 347,825 | 12,163 |
| C | 1740-1769 | 339,362 | 14,112 |
| D | 1770-1799 | 298,249 | 11,843 |
| Total | | 1,228,194 | 45,760 |
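The word and sentence counts in Table 15 allow a simple derived measure, mean sentence length per section, which is one way such a periodized design can be used to look for change over time. The sketch below is illustrative only: the names are invented, and the averages are computed here rather than taken from the source.

```python
# (words, sentences) per section, copied from Table 15.
ZEN_SECTIONS = {
    "A": (242_758, 7_642),
    "B": (347_825, 12_163),
    "C": (339_362, 14_112),
    "D": (298_249, 11_843),
}

# Column totals, to be compared with the Total row of the table.
total_words = sum(words for words, _ in ZEN_SECTIONS.values())
total_sentences = sum(sentences for _, sentences in ZEN_SECTIONS.values())

# Mean sentence length in words for each section (derived, not from the source).
words_per_sentence = {
    section: words / sentences
    for section, (words, sentences) in ZEN_SECTIONS.items()
}
```

Both column totals match the Total row of the table. Any difference in mean sentence length between sections would need to be interpreted with care, since it depends on how sentences were delimited in the corpus.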
The ZEN corpus is SGML-conformant. It not only allows for linguistic analysis of
different types of news stories in the 17th and 18th
centuries but also makes it possible to compare early English news texts
with modern newspaper language. The ZEN
query system allows restricted access to the online database.
6.8. The Innsbruck Computer Archive of Machine-Readable English Texts
The Innsbruck Computer Archive of Machine-Readable English Texts (ICAMET) contains ca.
500 Middle English texts totaling 5.7 million words.
The database comprises three parts, namely the Prose
Corpus (129 texts written during 1100-1500, accounting for two thirds of the
total), the Letter Corpus (254 letters written during 1386-1688, arranged in
chronological order), and the Prose Varia Corpus (mainly
translations or normalized versions of Middle English texts). An advantage of ICAMET is that the database consists of
complete texts instead of extracts, which allows literary, historical and
topical analyses of various kinds, particularly studies of cultural history (Marcus
1999). Nevertheless, copyright restrictions have
limited public access to many prose texts in the corpus. A sampler
containing half of the prose texts and all letters is available from ICAME.
6.9. The Corpus of English Dialogues
The Corpus of English Dialogues (CED) contains 1.3 million words of Early Modern
English dialogue texts produced over a 200-year time span between 1560 and
1760. While the spoken language of the past is not directly accessible to modern
speakers, it is recorded in speech-related texts. The CED corpus samples
six such text categories: trial proceedings, witness depositions,
drama, handbooks in dialogue form, fictional
dialogues, and language teaching books (cf. Culpeper/Kytö 1997).
The focus on dialogue will allow insight into the nature of impromptu speech and interactive two-way communication in the Early Modern English period - aspects which have r