Well-known and influential corpora: A survey
[Note: This survey is based on my (forthcoming) chapter written for A. Lüdeling,
M. Kyto & A. McEnery (eds) Handbooks of Linguistics and Communication Science Volume Corpus
Linguistics.]
2.1. The British National Corpus
2.2. The American National Corpus
2.3. The Polish National Corpus
2.4. The Czech National Corpus
2.5. The Hungarian National Corpus
2.6. The Russian Reference Corpus
2.8. The Hellenic National Corpus
2.9. The German National Corpus
2.10. The Slovak National Corpus
2.11. The Modern Chinese Language Corpus
2.12. The Sejong Balanced Corpus
3.2. The Global English Monitor Corpus
4. Corpora of the Brown family
5.1. The International Corpus of English
5.2. The Longman/Lancaster Corpus
5.3. The Longman Written American Corpus
5.4. The CREA corpus of Spanish
5.5. The LIVAC corpus of Chinese
6.1. The Helsinki Corpus of English Texts
6.3. The Lampeter Corpus of Early Modern English Tracts
6.4 The Dictionary of Old English Corpus in Electronic Form
6.5 Early English Books Online
6.6 The Corpus of Early English Correspondence
6.7. The Zurich English Newspaper Corpus
6.8. The Innsbruck Computer Archive of Machine-Readable English Texts
6.9. The Corpus of English Dialogues
6.10 A Corpus of Late Eighteenth-Century Prose
6.11 A Corpus of Late Modern English Prose
7.2. SEC, MARSEC and Aix-MARSEC
7.3. The Bergen Corpus of London Teenage Language
7.4. The Cambridge and Nottingham Corpus of Discourse in English
7.5. The Spoken Corpus of the Survey of English Dialects
7.6. The Intonational Variation in English Corpus
7.7. The Longman British Spoken Corpus
7.8. The Longman Spoken American Corpus
7.9. The Santa Barbara Corpus of Spoken American English
7.10. The Saarbrücken Corpus of Spoken English
7.12. The Wellington Corpus of Spoken New Zealand English
7.13. The Limerick corpus of Irish English
7.14. The Hong Kong Corpus of Conversational English
8. Academic and professional English corpora
8.1. The Michigan Corpus of Academic Spoken English
8.2. The British Academic Spoken English corpus
8.3. The Reading Academic Text corpus
8.5. The Corpus of Professional Spoken American English
8.6. The Corpus of Professional English
9.1. The Lancaster-Leeds Treebank
9.2. The Lancaster Parsed Corpus
9.8. Parsed historical corpora
10. Developmental and learner corpora
10.1. The Child Language Data Exchange System
10.2. The Louvain Corpus of Native English Essays
10.3. The Polytechnic of Wales corpus
10.4. The International Corpus of Learner English
10.6. The Longman Learners’ Corpus
10.7. The Cambridge Learner Corpus
11.1. The Canadian Hansard Corpus
11.2. The English-Norwegian Parallel Corpus
11.3. The English-Swedish Parallel Corpus
11.4. The Oslo Multilingual Corpus
11.5. The ET10/63 and ITU/CRATER parallel corpora
11.6. The IJS-ELAN Slovene-English Parallel Corpus
11.7. The CLUVI parallel corpus
11.8. European Corpus Initiative Multilingual Corpus I
11.11. Multilingual Corpora for Cooperation
11.13. The BFSU Chinese-English Parallel Corpus
11.14. The Babel Chinese-English Parallel Corpus
11.15. Hong Kong Parallel Text
12. Non-English monolingual corpora
12.5. The Scottish Corpus of Texts and Speech
12.6. The Prague Dependency Treebank
12.7. Academia Sinica Balanced Corpus
12.10. Spoken Chinese Corpus of Situated Discourse
13. Well-known distributors of corpus resources
As corpus building is an activity that takes time and costs money, readers may
wish to use ready-made corpora to carry out their work. However, as a corpus is
always designed for a particular purpose, the usefulness of a ready-made corpus
must be judged with regard to the purpose to which a user intends to put it.
There are thousands of corpora in the world, but most of them are created for
specific research projects and are thus not publicly available. While abundant
corpus resources for languages other than English are also available now, this
survey focuses upon major English corpora, which are grouped in terms of their
primary uses so that readers will find it easier to choose corpus resources
suitable for their particular research questions. Note, however, that overlaps
are inevitable in our classification. It is used in this survey simply to give
a better account of the primary uses of the relevant corpora.
National
corpora are normally general reference corpora which are supposed to represent
the national language of a country. They are balanced with regard to genres and
domains that typically represent the language under consideration. While an
ideal national corpus should cover proportionally both written and spoken
language, most existing national corpora and those under construction consist
only of written data, as spoken data is much more difficult and expensive to
capture than written data. This section introduces a number of major national
corpora.
2.1. The British National Corpus
The
first and best-known national corpus is perhaps the British National Corpus
(BNC), which is designed to represent as wide a range of modern British English
as possible so as to “make it possible to say something about
language in general” (Burnard 2002,
56). The BNC comprises approximately 100 million words of written texts (90%)
and transcripts of speech (10%) in modern British English. Written texts were
selected using three criteria: “domain”, “time” and “medium”. Domain refers to the content type (i.e. subject field) of
the text; time refers to the period of text production, while medium refers to
the type of text publication such as books, periodicals or unpublished
manuscripts. Table 1 summarizes the distribution of these criteria (see Aston/Burnard 1998, 29-30).
Table 1: Composition of the written BNC
| Domain | % | Date | % | Medium | % |
|---|---|---|---|---|---|
| Imaginative | 21.91 | 1960-74 | 2.26 | Book | 58.58 |
| Arts | 8.08 | 1975-93 | 89.23 | Periodical | 31.08 |
| Belief and thought | 3.40 | Unclassified | 8.49 | Misc. published | 4.38 |
| Commerce/Finance | 7.93 | | | Misc. unpublished | 4.00 |
| Leisure | 11.13 | | | To-be-spoken | 1.52 |
| Natural/pure science | 4.18 | | | Unclassified | 0.40 |
| Applied science | 8.21 | | | | |
| Social science | 14.80 | | | | |
| World affairs | 18.39 | | | | |
| Unclassified | 1.93 | | | | |
The
spoken data in the BNC was collected on the basis of two criteria: “demographic” and “context-governed”. The demographic component is composed of informal
encounters recorded by 124 volunteer respondents selected by age group, sex,
social class and geographical region, while the context-governed component
consists of more formal encounters such as meetings, lectures and radio
broadcasts recorded in four broad context categories. The two components of
spoken data complement each other, as many types of spoken text would not have
been covered if demographic sampling techniques alone were used in data collection.
Table 2 summarizes the composition of the spoken BNC. Note that in the table,
the first two columns apply to both demographic and context-governed components
while the third column refers to the latter component alone.
Table 2: Composition of the spoken BNC
| Region | % | Interaction type | % | Context-governed | % |
|---|---|---|---|---|---|
| South | 45.61 | Monologue | 18.64 | Educational/informative | 20.56 |
| | 23.33 | Dialogue | 74.87 | Business | 21.47 |
| North | 25.43 | Unclassified | 6.48 | Institutional | 21.86 |
| Unclassified | 5.61 | | | Leisure | 23.71 |
| | | | | Unclassified | 12.38 |
In
addition to part-of-speech (POS) information, the BNC is annotated with rich
metadata (i.e. contextual information) encoded according to the TEI guidelines,
using ISO standard 8879. Because of its generality, as well as the use of
internationally agreed standards for its encoding, the BNC corpus is a useful
resource for a very wide variety of research purposes, in fields as distinct as
lexicography, artificial intelligence, speech recognition and synthesis,
literary studies and, of course, linguistics. There are a number of ways one
can access the BNC corpus. It can be accessed online remotely using the BNC Online service or the BNCWeb interface.
Alternatively, if a local copy of the corpus is available, the BNC can be
explored using corpus exploration tools such as WordSmith
(Scott 1999).
The
current version of the full release of the BNC is BNC-2, the World Edition.
This version has removed a small number of texts (fewer than 50) which restricted
the worldwide distribution of the corpus. The BNC World has also corrected
errors relating to mislabeled texts and indeterminate
part-of-speech codes in the first version, and has included a classification
system of genre labels developed by Lee (2001) at
The
BNC model for achieving corpus balance and representativeness
has been followed by a number of national corpus projects including, for
example, the American National Corpus, the Polish National Corpus and the
Russian Reference Corpus.
2.2. The American National Corpus
The
American National Corpus (ANC) project was initiated in 1998 with the aim of
building a corpus comparable to the BNC. While the ANC follows the general
design of the BNC, there are differences with regard to its sampling period
and text categories. The ANC only samples language data produced from 1990
onwards whereas the sampling period for the BNC is 1960-1993. This time frame
has enabled the ANC to cover text categories which have developed recently and
thus were not included in the BNC, e.g. emails, web pages and chat room talks,
as shown in Table 3. In addition to the BNC-like core, the ANC will also
include specialized “satellite” corpora (cf. Reppen/Ide 2004, 106-107).
Table 3: Text categories in the ANC
| Channel | Text category | % |
|---|---|---|
| Written | Books (41% informative texts for various domains and 14% imaginative texts of various types) | 55 |
| | Newspapers, magazines and journals | 20 |
| | Electronic (emails, web pages etc) | 10 |
| | Miscellaneous (published and unpublished) | 5 |
| Spoken | Face-to-face/phone conversations, speech, meetings | 10 |
The
ANC corpus is encoded in XML, following the guidelines of the XML version of
the Corpus Encoding Standard. The standalone annotation, i.e. with the primary
data and annotations kept in separate documents but linked with pointers, has
enabled the corpus to be POS tagged using different tagsets
(e.g. Biber’s (1988) tags, the
CLAWS C5/C7 tagsets (Garside/Leech/Sampson 1987) and
the Penn tags (Marcus/Santorini/Marcinkiewicz 1993))
to suit the needs of different users.
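The idea of keeping primary data and annotations apart can be illustrated with a small sketch. It is not the ANC's actual file format: the `Annotation` class, the character-offset pointers and the helper functions are assumptions made for illustration, and the tags are hand-assigned examples in the style of the Penn and CLAWS C5 tagsets rather than real tagger output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotation pointing into the primary text (hypothetical format)."""
    start: int  # character offset of the first character (inclusive)
    end: int    # character offset just past the last character (exclusive)
    tag: str


# Primary data: stored once and never modified by any annotation layer.
PRIMARY_TEXT = "The corpus is encoded in XML"


def token_spans(text):
    """Return (start, end) offsets of whitespace-separated tokens."""
    spans, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        spans.append((start, end))
        pos = end
    return spans


def make_layer(text, tags):
    """Build one annotation layer: one tag per token, linked by offsets."""
    spans = token_spans(text)
    if len(spans) != len(tags):
        raise ValueError("need exactly one tag per token")
    return [Annotation(s, e, t) for (s, e), t in zip(spans, tags)]


def resolve(text, layer):
    """Follow the pointers back into the primary text."""
    return [(text[a.start:a.end], a.tag) for a in layer]


# Two independent layers over the same, untouched primary text.
PENN_LAYER = make_layer(PRIMARY_TEXT, ["DT", "NN", "VBZ", "VBN", "IN", "NNP"])
C5_LAYER = make_layer(PRIMARY_TEXT, ["AT0", "NN1", "VBZ", "VVN", "PRP", "NP0"])

if __name__ == "__main__":
    print(resolve(PRIMARY_TEXT, PENN_LAYER))
    print(resolve(PRIMARY_TEXT, C5_LAYER))
```

Because each layer only stores pointers, a user can load whichever tagset suits the task, or add a new layer, without touching the text or the other layers.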
The
full release of the ANC is expected to be available in late 2005. At present
the first release of the corpus, which contains 11.5 million words of written
and spoken data (8.3 million words for writing and 3.2 million words for
speech, but not balanced for genre), is available from the Linguistic Data
Consortium (LDC).
2.3. The Polish National Corpus
The
Polish National Corpus (PNC) is under construction on the PELCRA (Polish and
English Language Corpora for Research and Application) project, which is
undertaken jointly by the Universities of Lodz and
Lancaster. The project aims to develop a large, fully annotated reference
corpus of native Polish, “mirroring the BNC in terms of genres and
its coverage of written and spoken language” (Lewandowska-Tomaszczyk
2003, 106). A total of 130 million words of running texts have been collected,
and part of the data (30 million words) has been compiled into a balanced
corpus, which covers genres and styles comparable in proportion to those
included in the BNC. The PNC is TEI-compliant and is annotated for part-of-speech.
Presently, a balanced PNC sampler, which contains 10 million words of both written
and spoken data reflecting proportionally the text categories in the BNC, can
be ordered from the PELCRA project site.
2.4. The Czech National Corpus
The
Czech National Corpus (CNC) consists of two sections: synchronous and
diachronic. Each section is designed to include written, spoken and dialectal
components. As some of the components are currently hardly more than blueprints
for future work (see Kučera 2002, 254), we will
only introduce the written and spoken components in the synchronous section.
The
written component of the synchronous section, which contains 100 million words,
was completed in 2000 and thus named SYN2000. SYN2000 includes both imaginative
(15%) and informative (85%) texts, each being divided into a number of text
categories, as shown in Table 4 (see Kučera
2002, 247-248). The technical and specialized texts in the corpus
proportionally cover nine domains: lifestyle (5.55%), technology (4.61%),
social sciences (3.67%), arts (3.48%), natural sciences (3.37%),
economics/management (2.27%), law/security (0.82%), belief/religion (0.74%) and
administrative texts (0.49%).
Table 4: Design of SYN2000
| Major category | Genre | % |
|---|---|---|
| Imaginative (15%) | Fiction | 11.02 |
| | Poetry | 0.81 |
| | Drama | 0.21 |
| | Other literary texts | 0.36 |
| | Transitional text types | 2.6 |
| Informative (85%) | Journal | 60 |
| | Technical/specialized texts | 25 |
Table 5: Sampling frame of the Prague Spoken Corpus
| Criteria | Type | Proportion |
|---|---|---|
| Speaker sex | Male | 50% |
| | Female | 50% |
| Speaker age | 21-35 | 50% |
| | 35+ | 50% |
| Education level | Secondary school | 50% |
| | University | 50% |
| Discourse type | Formal | 50% |
| | Informal | 50% |
The
spoken component of the synchronous section, the so-called Prague Spoken Corpus
(PMK), contains 800,000 words of transcription of authentic spoken language
sampled in a balanced way according to four sociolinguistic criteria: speaker
sex, age, educational level and discourse type, as shown in Table 5. The data
contained in the Prague Spoken Corpus consists exclusively of impromptu spoken
language (roughly equivalent to the demographically sampled component in the
BNC). Texts representing various blends of written and spoken language such as
lectures, political speeches and play scripts are included in a special section
in the written corpus (cf. Kučera 2002, 248,
253).
Both
SYN2000 and the Prague Spoken Corpus are marked up in TEI-compliant SGML and
tagged to show part-of-speech categories. SYN2000 is licensed free of charge
for non-commercial use. A scaled-down version of SYN2000, PUBLIC, which
contains 20 million words with the same genre distribution, is accessible
online at the corpus website. The
tagged version of the Prague Spoken Corpus will also be made publicly available
in the near future.
2.5. The Hungarian National Corpus
The
Hungarian National Corpus (HNC) is a balanced reference corpus of present-day
Hungarian. The corpus contains 153.7 million words of texts produced from the
mid-1990s onwards, which are divided into five subcorpora,
each representing a written text type: media (52.7%), literature (9.43%),
scientific texts (13.34%), official documents (12.95%) and informal texts (e.g.
electronic forum discussion, 11.58%). The size of the literary subcorpus is expected to increase from the current 14.5
million words to approximately 40 million words (see Váradi
2002, 386). The HNC is encoded in SGML in compliance with Corpus Encoding
Standard (CES) and annotated for part-of-speech. The corpus can be accessed
free of charge after registration via the online query system at
the corpus site.
2.6. The Russian Reference Corpus
The
Russian Reference Corpus (BOKR) is designed as a Russian match for the BNC. The
corpus contains 100 million words of modern Russian, following the general
sampling frame of the BNC, as shown in Table 6 (see Sharoff
2004).
Table 6: Sampling frame of the Russian Reference Corpus
| Text category | Proportion |
|---|---|
| Spoken | 5% |
| Life (Imaginative texts in the BNC) | 30% |
| Natural sciences | 5% |
| Applied sciences | 10% |
| Social sciences | 12% |
| Politics (World affairs in the BNC) | 15% |
| Commerce | 5% |
| Arts | 5% |
| Religion and philosophy (Belief and thought in the BNC) | 3% |
| Leisure | 10% |
The
BOKR corpus is encoded in TEI-compliant SGML and annotated for part-of-speech.
As Russian is a highly inflected language, the technique used in annotating
English corpora with complex POS tags is impractical for Russian, because that
would entail thousands of tags which would make corpus exploration ineffective,
if not impossible. Hence in the Russian Reference Corpus, each word is
annotated with a bundle of lexical and syntactic features such as
part-of-speech, aspect, transitivity, voice, gender, number and tense. Separate
features from a feature bundle associated with each word can be selected in a
window in the query
interface. The corpus is under construction and its final release is expected
by the end of 2004 (cf. Sharoff 2004).
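A feature bundle of this kind can be sketched as a small data structure. The sketch below is not the BOKR query interface: the `Token` class, the feature names and values, and the `query` helper are all hypothetical, chosen only to show how separate features can be selected without one monolithic tag per combination.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A word form with its bundle of features (hypothetical schema)."""
    form: str
    features: dict = field(default_factory=dict)


def matches(token, **wanted):
    """True if the token carries every requested feature value."""
    return all(token.features.get(name) == value
               for name, value in wanted.items())


def query(tokens, **wanted):
    """Return the forms of all tokens matching the selected features."""
    return [t.form for t in tokens if matches(t, **wanted)]


# Hand-annotated toy sample; the feature names are illustrative only.
SAMPLE = [
    Token("читала", {"pos": "verb", "aspect": "imperfective", "tense": "past",
                     "gender": "feminine", "number": "singular"}),
    Token("прочитал", {"pos": "verb", "aspect": "perfective", "tense": "past",
                       "gender": "masculine", "number": "singular"}),
    Token("книгу", {"pos": "noun", "gender": "feminine",
                    "number": "singular"}),
]

if __name__ == "__main__":
    # Select on one feature, ignoring the rest of the bundle.
    print(query(SAMPLE, aspect="perfective"))
    # Combine features freely, as in a query window with several fields.
    print(query(SAMPLE, pos="verb", tense="past"))
```

A query constrains only the features it names, so a user does not need to know, or enumerate, every full combination of aspect, gender, number and tense.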
The
CORIS (Corpus di Italiano Scritto) corpus is a general reference corpus of
present-day Italian. It contains 100 million words of written Italian sampled
from six text categories, which constitute six subcorpora,
as shown in Table 7.
Table 7: Components of the CORIS corpus
| Category | Subcategory | Proportion |
|---|---|---|
| Press | Newspapers, periodicals, supplement | 38% |
| Fiction | Novel, short stories | 25% |
| Academic prose | Human sciences, natural sciences, physics, experimental sciences | 12% |
| Legal and administrative prose | Legal, bureaucratic, administrative documents | 10% |
| Miscellaneous | Books on religion, travel, cookery, hobbies, etc. | 10% |
| Ephemeral | Letters, leaflets, instructions | 5% |
Unlike
most national corpora that are sample corpora, the CORIS corpus follows a
dynamic corpus model, which will be updated every two years by means of a
built-in monitor corpus (Rossini Favretti/Tamburini/de
Santis 2004). The current version of the corpus can
be accessed online free of charge via a web-based query system at the corpus website.
2.8. The Hellenic National Corpus
The
Hellenic National Corpus is a 32-million-word corpus of written Modern Greek
sampled from several publication media covering various genres (articles,
essays, literary works, reports, biographies etc.) and domains (economy,
medicine, leisure, art, human sciences etc.) published from 1976 onwards. Of
the four types of medium, books account for 15.75% of the total texts,
newspapers 69.01%, periodicals 6.97% and miscellaneous (correspondence,
electronic texts, ephemera, and hand-written/typed material) 8.27%. The text classification
with regard to medium, genre and domain follows the PAROLE
standards. This taxonomy information, together with the bibliographic
information, is encoded in TEI-compliant SGML (cf. Hatzigeorgiu/Gavrilidou/Piperidis
et al 2000, 1737). The corpus can be accessed online at the corpus site, where users can make
queries concerning the lexicon, morphology, syntax and usage of Modern Greek (e.g.
words, lemmas, part-of-speech categories or combinations of the three).
2.9. The German National Corpus
The
German National Corpus is a product of the DWDS (Digital Dictionary of the 20th
Century German Language) project. The corpus is divided into two parts, a
100-million-word balanced core and a much larger opportunistic subcorpus. This section introduces the core corpus, which
is roughly comparable to the British National Corpus, covering the whole 20th
century (1900-2000). Table 8 shows the text categories covered in the corpus.
Table 8: Design criteria of the German National Corpus
| Text category | Proportion |
|---|---|
| Literature | 25% |
| Journalistic prose | 25% |
| Scientific texts | 20% |
| Specialized texts (advert, manuals, etc) | 20% |
| Spoken (everyday language, televised debates, dialect, etc) | 10% |
The
metadata such as genre information is encoded in XML. Linguistic annotation
consists basically of lemmatization, part-of-speech and semantic annotation on the
word level, as well as prepositional phrase and noun phrase recognition on the
phrase level (Cavar/Geyken/Neumann 2000). The core
corpus is available for online search at the corpus site after free-of-charge
registration.
2.10. The Slovak National Corpus
The
Slovak National Corpus is presently under construction. The project aims to
create a 200-million-word corpus of the Slovak language. The first phase of the
project has produced a corpus containing 30 million words of written texts
published between 1990 and 2003, which will be expanded to cover other periods
of the contemporary language (1955-2005) and to reach the target
size in the second phase of the project (2003-2006). The final corpus will also
include diachronic and dialectological texts.
At
present the 30-million-word part of the corpus has been annotated with
lemmatization, morphological and source (bibliographical
and style-genre) information. Users can access the corpus using a simple online query system at the corpus
website. More complex searches require the “corpus manager”, which supports regular expressions and can be downloaded
at the same site.
We
have so far introduced national corpora for European languages. The next two
sections will introduce two national corpora of Asian languages, namely Chinese
and Korean.
2.11. The Modern Chinese Language Corpus
The
Modern Chinese Language Corpus (MCLC) is
Table 9: Components of the MCLC corpus
| Category | Subcategory | Proportion |
|---|---|---|
| Humanities and social sciences (8 subcategories) | Politics and laws, history, society, economics, arts, literature, military and physical education, life | 59.6% |
| Natural sciences (6 subcategories) | Mathematics and physics, biology and chemistry, astronomy and geography, oceanology and meteorology, agriculture and forestry, medical and health | 17.24% |
| Miscellaneous (6 subcategories) | Official documents, regulations, judicial documents, business documents, ceremonial speech, ephemera | 9.36% |
| Newspapers | | 13.79% |
A scaled-down version of the corpus, the core, which
contains 20 million characters proportionally sampled from the larger corpus,
is tokenized and tagged with part-of-speech categories. The MCLC license can be
purchased from the National
Language Committee of China.
2.12. The Sejong Balanced Corpus
The 21st Century Sejong
project was launched in 1998 as a ten-year development project to build various
kinds of language resources including Korean corpora and Korean electronic
dictionaries. One of the goals of the project is to construct a balanced
national corpus (300 million words and phrases from modern Korean, spoken
materials, North Korean language, words of foreign origin, etc.), comparable to
the BNC. By 2003 a raw corpus of modern Korean had been compiled, containing 57
million words, with 75 million more words of already existing electronic texts
being processed and standardized. The corpus also includes around 3 million
words of spoken data.
The markup scheme used in
the Sejong Corpus is TEI-compliant. As of 2003, 10
million words have been morphologically annotated, 5.5 million words sense
tagged, and 150,000 words treebanked (see Kang/Kim
2004, 1747). The corpus is accessible over the Internet after registration at the corpus site.
In addition to those introduced above, there are a
number of nation-level corpora which are either already available or are under
construction. They include, for example, the FRANTEXT Database,
the Croatian National Corpus
(30 million words), Korpus 2000 for Danish (28 million words) and the National Corpus of Irish
(15 million words). A number of corpora representing other national languages
are also under construction, including, for example, Norwegian (Choukri 2003), Dutch (Wittenburg/Brugman/Broeder 2000), Maltese (Dalli 2001), Basque (Aduriz/Aldezabal/Alegria et al 2003), Kurdish (Gautier 1998), Nepali (Glover 1998), Tamil (Malten 1998) and
While most of the national corpora introduced in
section 2 follow a static sample corpus model, there are also corpora which are constantly
updated to track rapid language change, such as the development and the life
cycle of neologisms. Corpora of this type are referred to as monitor
corpora.
The best-known monitor corpus is the Bank of English
(BoE), which was initiated in 1991 on the COBUILD (Collins Birmingham University International Language Database) project. The corpus was designed to represent standard
English as it was relevant to the needs of learners, teachers and other users,
while also being of use to researchers of present-day English. Written
texts (75%) come from newspapers, magazines, fiction and non-fiction books,
brochures, reports, and websites while spoken data (25%) consists of
transcripts of television and radio broadcasts, meetings, interviews,
discussions, and conversations. The majority of the material in the corpus
represents British English (70%) while American English and other varieties
account for 20% and 10% respectively. Presently the BoE
contains 524 million words of written and spoken English. The corpus keeps
growing with the constant addition of new material.
The BoE corpus is
particularly useful for lexical and lexicographic studies, for example,
tracking new words, new uses or meanings of old words, and words falling out of
use. A 56-million-word sampler of the corpus can be accessed online free of
charge at the corpus
website. Access to larger corpora is granted by special arrangement.
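The kind of lexical tracking a monitor corpus supports can be sketched by comparing two time slices. This is only a toy illustration and not how the Bank of English is actually queried: the function names, the whitespace tokenization and the `min_count` threshold are assumptions, and real studies would work with lemmatized data and much larger slices.

```python
from collections import Counter


def word_counts(text):
    """Count lower-cased, whitespace-separated word forms in a time slice."""
    return Counter(text.lower().split())


def new_words(earlier, later, min_count=2):
    """Candidate neologisms: absent earlier, reasonably frequent later."""
    before, after = word_counts(earlier), word_counts(later)
    return sorted(w for w, n in after.items()
                  if n >= min_count and w not in before)


def dropped_words(earlier, later, min_count=2):
    """Candidates for words falling out of use: the mirror-image test."""
    return new_words(later, earlier, min_count)


if __name__ == "__main__":
    # Two invented time slices standing in for successive corpus updates.
    slice_1 = "the fax arrived and the fax was read"
    slice_2 = "the blog post and the blog reply"
    print(new_words(slice_1, slice_2))
    print(dropped_words(slice_1, slice_2))
```

The threshold guards against treating one-off forms as new vocabulary; because a monitor corpus keeps growing, the same comparison can be rerun on each new slice.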
3.2. The Global English Monitor Corpus
Another corpus of the monitor type is the Global English Monitor
Corpus, which was started in late 2001 as an electronic archive of the
world’s leading newspapers in English. The corpus aims
at monitoring language use and semantic change in English as reflected in newspapers
so as to allow for research into whether the English language discourses in
4. Corpora of the Brown family
The
first modern corpus of English, the Brown University Standard Corpus of
Present-day American English (i.e. the Brown corpus, see Kučera/Francis
1967), was built in the early 1960s for written American English. The
population from which samples for this pioneering corpus were drawn was written
English text published in the
The
Brown corpus was constructed with comparative studies in mind, in the hope of
setting the standard for the preparation and presentation of further bodies of
data in English or in other languages. This expectation has now been realized.
Since its completion, the Brown corpus model has been followed in the
construction of a number of corpora for synchronic and diachronic studies as
well as for cross-linguistic contrast. Table 10 shows a brief comparison of
these corpora.
Table 10: Corpora of the Brown family
| Corpus | Language variety | Period | Samples | Words (Million) |
|---|---|---|---|---|
| Brown | American English | 1961 | 500 | One |
| Frown | American English | 1991-1992 | 500 | One |
| LOB | British English | 1961 | 500 | One |
| Pre-LOB | British English | 1931+/- 3 years | 500 | One |
| FLOB | British English | 1991-1992 | 500 | One |
| Kolhapur | Indian English | 1978 | 500 | One |
| ACE | Australian English | 1986 | 500 | One |
| WWC | New Zealand English | 1986-1990 | 500 | One |
| LCMC | Mandarin Chinese | 1991+/- 3 years | 500 | One |
As
can be seen, these corpora are roughly comparable but have sampled different
languages or language varieties. Their sampling periods are either similar for
the purposes of synchronic comparison or distanced by about three decades for
the purposes of diachronic comparison. For example, the Brown and LOB (the
Lancaster-Oslo-Bergen corpus of British English, see Johansson/Leech/Goodluck 1978) can be used to compare American and British
English as used in the early 1960s. The updated versions of the two corpora,
Frown (see Hundt/Sand/Skandera 1999) and FLOB (see Hundt/Sand/Siemund 1998) can be used to compare the two
major varieties of English as used in the early 1990s. Other corpora with a
similar sampling period, such as ACE (the Australian Corpus of English, also
known as the Macquarie corpus), WWC (the Wellington Corpus of Written New
Zealand English) and Kolhapur (the Kolhapur Corpus of Indian English), together with FLOB and
Frown, allow for comparison of “world Englishes”. For diachronic studies, the Brown vs. Frown on the one
hand, and the Pre-LOB,
LOB and FLOB corpora on the other hand, provide a reliable basis for tracking
recent language change over 30-year periods. The LCMC corpus (the Lancaster
Corpus of Mandarin Chinese, see McEnery/Xiao/Mo
2003), when used in combination with FLOB/Frown corpora, provides a valuable
resource for contrastive studies between Chinese and two major varieties of
English.
In
comparing these corpora synchronically, caution must be exercised to ensure
that the sampling periods are similar. For example, comparing the Brown corpus
with FLOB would involve not only language varieties but also language change.
Also, as the Brown model may have been modified slightly in some of these
corpora, account must be taken of such variation in comparing these corpora
across text category by normalizing the raw frequencies to a common basis.
Table 11 compares the text categories and number of samples for each category
in these corpora.
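Normalizing to a common basis simply means dividing each raw frequency by the size of the text it came from and multiplying by a fixed base. A minimal sketch follows. The sample size of about 2,000 words is an assumption derived from the 500 samples and one million words per corpus given in Table 10, the raw hit count is invented for illustration, and the function names are hypothetical.

```python
def normalize(raw_count, corpus_size, base=1_000_000):
    """Convert a raw frequency into a rate per `base` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * base


# Assumption: each sample holds roughly 2,000 words
# (500 samples making up about one million words).
WORDS_PER_SAMPLE = 2_000


def category_size(n_samples, words_per_sample=WORDS_PER_SAMPLE):
    """Approximate number of words in a text category."""
    return n_samples * words_per_sample


if __name__ == "__main__":
    # Category E has 36 samples in Brown but 38 in LOB, so the same raw
    # count corresponds to different rates in the two corpora.
    raw_hits = 150  # invented figure, for illustration only
    brown_rate = normalize(raw_hits, category_size(36), base=10_000)
    lob_rate = normalize(raw_hits, category_size(38), base=10_000)
    print(f"Brown E: {brown_rate:.2f} per 10,000 words")
    print(f"LOB E:   {lob_rate:.2f} per 10,000 words")
```

The choice of base (per 10,000 or per million words) is a matter of convenience, as long as the same base is used for every corpus or category being compared.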
Table 11: Text categories in the corpora of the Brown family
| Code | Text category | Brown | Frown | LOB | FLOB | Pre-LOB | Kolhapur | ACE | WWC | LCMC |
|---|---|---|---|---|---|---|---|---|---|---|
| A | Press reportage | 44 | 44 | 44 | 44 | 44 | 44 | 44 | 44 | 44 |
| B | Press editorials | 27 | 27 | 27 | 27 | 27 | 27 | 27 | 27 | 27 |
| C | Press reviews | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
| D | Religion | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
| E | Skills, trades and hobbies | 36 | 36 | 38 | 38 | 38 | 38 | 38 | 38 | 38 |
| F | Popular lore | 48 | 48 | 44 | 44 | 44 | 44 | 44 | 44 | 44 |
| G | Biographies and essays | 75 | 75 | 77 | 77 | 77 | 70 | 77 | 77 | 77 |
| H | Miscellaneous (reports, official documents) | 30 | 30 | 30 | 30 | 30 | 37 | 30 | 30 | 30 |
| J | Science (academic prose) | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 |
| K | General fiction | 29 | 29 | 29 | 29 | 29 | 59 | 29 | 29 | 29 |
| L | Mystery and detective fiction | 24 | 24 | 24 | 24 | 24 | 24 | 15 | 24 | 24 |
| M | Science fiction | 6 | 6 | 6 | 6 | 6 | 2 | 7 | 6 | 6 |
| N | Western and adventure fiction | 29 | 29 | 29 | 29 | 29 | 15 | 8 | 29 | 29 |
| P | Romantic fiction | 29 | 29 | 29 | 29 | 29 | 18 | 15 | 29 | 29 |
| R | Humour | 9 | 9 | 9 | 9 | 9 | 9 | 15 | 9 | 9 |
| S | Historical fiction | - | - | - | - | - | - | 22 | - | - |
| W | Women’s fiction | - | - | - | - | - | - | 15 | - | - |
It
can be seen from the table that the two American English corpora (Brown and Frown)
have the same numbers of samples for each of the 15 text categories while the
British English corpora share the same proportions. The two groups differ in
the numbers of samples for categories E, F, and G. The WWC and LCMC corpora
follow the model of FLOB. There are important differences between the
With
the exceptions of the Pre-LOB corpus, which is under construction, and LCMC,
which is distributed by the European Language Resources Association (ELRA), all of the corpora of the Brown family
are available from the International Computer Archive of Modern and Medieval
English (ICAME).
The
corpora of the Brown family are balanced corpora representing a static snapshot
of a language or language variety in a certain period. While they can be used
for synchronic and diachronic studies, more appropriate resources for these
kinds of research are synchronic and diachronic corpora, which will be
introduced in the following two sections.
While
the corpora of the Brown family are generally good for comparing language
varieties such as world Englishes, the results from
such a comparison must be interpreted with caution when the corpora under
examination were built for different periods or the Brown model has been
modified. A more reliable basis for comparing language varieties is a
synchronic corpus.
5.1. The International Corpus of English
A
typical corpus of this type is the International Corpus of English (ICE), which
is specifically designed for the synchronic study of world Englishes.
The ICE corpus consists of a collection of twenty corpora of one million words
each, each composed of written and spoken English produced during 1990-1994 in
countries or regions in which English is a first or official language (e.g.
Table 12: Corpus design of ICE
| Spoken (300) | Dialogues (180) | Private | Conversations (90) |
| | | Public | Class lessons (20) |
| | Monologues (120) | Unscripted | Commentaries (20) |
| | | Scripted | Broadcast news (20) |
| Written | Non-printed (50) | Student writing | Student essays (10) |
| | | Letters | Social letters (15) |
| | Printed | Academic | Humanities (10) |
| | | Popular | Humanities (10) |
| | | Reportage | Press reports (20) |
| | | Instructional | Administrative writing (10) |
| | | Persuasive | Editorials (10) |
| | | Creative | Novels (20) |
The
ICE corpora are marked up and annotated at various levels. In written texts,
features of the original layout are marked, including sentence and paragraph
boundaries, headings, deletions, and typographic features while spoken texts
are transcribed orthographically, and are marked for pauses, overlapping
strings, discourse phenomena such as false starts and hesitations, and speaker
turns. The bibliographic markup, which gives a
complete description (e.g. text category, date, and publisher) of each text, is
stored in the corpus header of each file. Different levels of annotation are
undertaken for the ICE corpora. Some of them are POS tagged and parsed (e.g.
the British component ICE-GB) while others are currently available as unannotated lexical corpora (e.g. the components for
5.2. The Longman/Lancaster Corpus
The
Longman/Lancaster Corpus consists of about 30 million words of published
English. British data takes up 50% and American data 40% while the other 10%
represents other varieties such as Australian, African
and Irish English. One half of the samples were selected
randomly (“microcosmic texts”) and the other half selected by a panel of experts (“selective texts”). Most texts in the
corpus are about 40,000 words long but no whole texts are used.
Both
imaginative and informative text categories are included. Imaginative texts
come from well-known literary works and works randomly sampled from books in
print; informative texts come from the natural and social sciences, world
affairs, commerce and finance, the arts, leisure, and so on. Imaginative texts
are mainly works of fiction in book form while informative texts comprise
books, newspapers and journals, unpublished materials and ephemera. Four external
criteria have been used in text selection (see Holmes-Higgin/Abidi/Ahmad
1994): “region” (language
varieties), “time” (1900s-1980s), “medium” (books 80%, periodicals 13.3% and
ephemera 6.7%), and “level” (literary, middle
and popular for imaginative texts, and technical, lay and popular for
informative texts). As part of the Longman Corpus
Network, the Longman/Lancaster Corpus is not available for public access.
5.3. The Longman Written American Corpus
The
Longman Written American Corpus currently contains over 100 million words of
running texts taken from newspapers, journals, magazines, best-selling novels,
technical and scientific writing, and coffee-table books. The design of the
Longman Written American Corpus is based on the general design principles of
the Longman/Lancaster Corpus and the written section of the BNC. The corpus is
dynamically refined and keeps growing with the constant addition of new
materials. Like the other components of the Longman Corpus
Network, this corpus does not appear to allow public access.
5.4. The CREA corpus of Spanish
The
CREA (Corpus de Referencia
The
CREA was designed as a monitor corpus which is continually updated so that it
always represents the last twenty-five years of the history of Spanish. New
data is added proportionally to maintain the corpus balance and to ensure that
the various trends in current Spanish are represented. Texts for 2000-2004 are
currently being incorporated (Sánchez 2002).
The
CREA corpus is marked in SGML. Bibliographic and taxonomic information is
encoded in the corpus header of each file. For written texts, both structural
(paragraph and page number) and intratextual (notes,
formulas, tables, quotations, foreign words etc.) marks are encoded. For spoken
texts, the markup scheme indicates structural (speech
turns) and non-structural (overlapping, tottering, anacoluthon, etc.) marks
(cf. Guerra 1998).
The
modular structure of the CREA corpus allows
for flexible searches using geographical, generic, temporal, and thematic
criteria. The corpus is accessible on the Internet.
5.5. The LIVAC corpus of Chinese
The
LIVAC (Linguistic Variation in Chinese Speech Communities) project started in 1993
with the aim of building a synchronous corpus for studying varieties of
Mandarin Chinese. For this purpose, data has been collected regularly and
simultaneously, once every four days since July 1995, from representative
Mandarin Chinese newspapers and the electronic media of six Chinese speaking
communities:
All
of the corpus texts in LIVAC are segmented automatically and checked by hand.
In addition to the corpus, a lexical database is derived from the segmented texts,
which includes, apart from ordinary words, those expressing new concepts or
undergoing sense shifts, as well as region-specific words from the six
communities. The database is thus a rich resource for research into
linguistics, sociolinguistics, and Chinese language and society.
As
LIVAC captures the social, cultural, and linguistic developments of the six
Chinese speaking communities within a decade, it allows for a wide range of
comparative studies on linguistic variation in Mandarin Chinese. The corpus
also provides an important resource for tracking lexical development such as
the evolution of new concepts and their expressions in present-day Chinese. A
sample of the corpus (data covering the period from July 1995 to June 1996) can
be accessed using the online query system at the corpus site, which shows
KWIC concordances as well as frequency distribution across the six speech
communities.
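The KWIC (keyword-in-context) display mentioned above can be illustrated with a minimal sketch. The function name `kwic`, the fixed context width, and the toy English sentence are all assumptions made for illustration; the LIVAC query system works on segmented Chinese text, and its actual implementation is not described in this survey.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy, pre-segmented input (hypothetical; not taken from LIVAC).
sample = "the corpus shows how the word spreads as the community changes".split()

for left, kw, right in kwic(sample, "the", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} | {kw} | {right}")
```

Because the function takes a list of tokens rather than a raw string, the same sketch would apply to any text that has already been segmented into words.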
Another
way to explore language variation is from a diachronic perspective using
diachronic corpora. A diachronic (or historical) corpus contains texts from the
same language gathered from different time periods. Typically that period is
far more extensive than that covered by Brown/Frown and LOB/FLOB or a monitor
corpus such as the Bank of English. Diachronic corpora are used to track
changes in language evolution. This section introduces a number of corpora of
this kind.
6.1. The Helsinki Corpus of English Texts
Perhaps
the best-known historical corpus is the diachronic part of the Helsinki Corpus
of English Texts (i.e. the Helsinki corpus), which consists of approximately
1.5 million words of English in the form of 400 text samples, dating from the 8th
to 18th centuries. The corpus is divided into three periods (Old,
Middle, and Early Modern English) and eleven subperiods,
as shown in Table 13 (cf. Kytö 1996).
Table 13: Periods covered in the Helsinki Diachronic Corpus
| Period | Subperiod | Words | Percent | Overall |
|---|---|---|---|---|
| Old English | I. –850 | 2,190 | 0.5 | 413,250 |
| | II. 850-950 | 92,050 | 22.3 | |
| | III. 950-1050 | 251,630 | 60.9 | |
| | IV. 1050-1150 | 67,380 | 16.3 | |
| | Total | 413,250 | 100 | 26.27% |
| Middle English | I. 1150-1250 | 113,010 | 18.6 | 608,570 |
| | II. 1250-1350 | 97,480 | 16.0 | |
| | III. 1350-1420 | 184,230 | 30.3 | |
| | IV. 1420-1500 | 213,850 | 35.1 | |
| | Total | 608,570 | 100 | 38.70% |
| Early Modern English | I. 1500-1570 | 190,160 | 34.5 | 551,000 |
| | II. 1570-1640 | 189,800 | 34.5 | |
| | III. 1640-1710 | 171,040 | 31.0 | |
| | Total | 551,000 | 100 | 35.03% |
| Total | | 1,572,820 | | 100% |
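The period totals and overall shares in Table 13 can be recomputed from the subperiod word counts. The following is only a sketch: the dictionary name `helsinki` and the output format are illustrative choices, and the figures are copied from the table. Shares recomputed this way may differ from the printed ones in the last decimal place because of rounding.

```python
# Word counts per subperiod, copied from Table 13.
helsinki = {
    "Old English": [2190, 92050, 251630, 67380],
    "Middle English": [113010, 97480, 184230, 213850],
    "Early Modern English": [190160, 189800, 171040],
}

# Sum the subperiods to obtain each period total and the corpus total.
period_totals = {name: sum(counts) for name, counts in helsinki.items()}
grand_total = sum(period_totals.values())

for name, total in period_totals.items():
    share = 100 * total / grand_total
    print(f"{name}: {total:,} words ({share:.2f}% of the corpus)")
print(f"Total: {grand_total:,} words")
```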
In
addition to the basic selection of texts as indicated in the table, there is a
supplementary part in the
As the Helsinki corpus not only samples different periods covering one millennium
but also encodes genre and sociolinguistic information, it allows
researchers to go beyond simply dating and reporting language change by
combining diachronic, sociolinguistic and genre studies. The
ARCHER,
an acronym for “A Representative Corpus of Historical
English Registers”, contains 1.7 million words of data in
the form of 1,037 texts sampled from seven 50-year historical periods covering
Early Modern English (1650-1990). The corpus is designed as a balanced
representation of seven written (journal-diaries, letters, fiction, news, and
science, etc.) and three speech-based (fictional conversation, drama and
sermons-homilies) genres in British (two thirds of the corpus) and American
(one third, data available only for the periods 1750-1799, 1850-1899,
1950-1990) English. Each 50-year subcorpus includes
20,000-30,000 words per register, typically containing ten texts of
approximately 2,000-3,000 words each (cf. Biber/Finegan/Atkinson
1994). ARCHER is tagged for grammatical/functional categories. It allows for a
wide variety of investigations on recent linguistic change and change in
discourse and genre conventions. The corpus is presently being expanded with
more American texts to make the American and British data comparable (see ARCHER 2).
The expanded version will also enable a systematic comparison of the two
varieties of English diachronically. However, because of copyright restrictions,
ARCHER is not publicly available at the moment. Readers interested in using
this corpus can contact Douglas Biber.
In
addition to the
6.3. The Lampeter Corpus of Early Modern English Tracts
The
Lampeter Corpus of Early Modern English Tracts is a
balanced corpus covering one century between 1640 and 1740, which is divided
into ten decades. Each decade consists of data sampled from six domains
(religion, politics, economics/trade, science, law and miscellaneous). Two
complete texts, ranging from 3,000 to 20,000 words, are included for each
domain within each decade, totaling approximately 1.1
million words (Schmied 1994).
The Lampeter corpus is encoded in TEI-compliant SGML. The TEI headers provide
the framework for historical, sociolinguistic and stylistic investigations,
including information regarding authors (name, age, sex, place of residence,
education, social status, political affiliation), printers/publishers, place
and date of print, publication format, text characteristics and
bibliographical sources. As the corpus includes whole texts rather than
smaller samples, it is also useful for the study of textual organization in
Early Modern English. The Lampeter corpus can be ordered from ICAME or OTA.
6.4. The Dictionary of Old English Corpus in Electronic Form
The
Dictionary of Old English Corpus in Electronic Form (DOEC, the 2000 release)
contains 3,037 texts of Old English, totaling over
three million words, in addition to two million words of Latin. The texts in
the corpus are practically all extant Old English writings. The DOEC corpus
includes at least one copy of each surviving text in Old English while in cases
where it is significant because of dialect or date, more than one copy is
included. These texts cover six text categories: poetry, prose, interlinear
glosses, glossaries, runic inscriptions, and inscriptions in the Latin
alphabet. In the prose category in particular, a wide range of text types are
covered which include, for example, saints’ lives, sermons,
biblical translations, penitential writings, laws, charters and wills, records
(of manumissions, land grants, land sales, land surveys), chronicles, a set of
tables for computing the moveable feasts of the Church calendar and for
astrological calculations, medical texts, prognostics (the Anglo-Saxon
equivalent of the horoscope), charms (such as those for a toothache or for an
easy labour), and even cryptograms (cf. the corpus website). The texts in the
corpus are encoded in TEI-compliant SGML. The DOEC corpus can be ordered on CDs
or accessed online by institutional site license at the corpus website. The
web-based query system allows for searches by single words, word combinations,
word proximity and bibliographic sources.
6.5. Early English Books Online
Early
English Books Online (EEBO) is a joint effort launched in 1999 between the
6.6. The Corpus of Early English Correspondence
The
Corpus of Early English Correspondence (CEEC, the 1998 version) consists of 96
collections of ca. 6,000 personal letters written by 778 people (women
accounting for 20%) between 1417 and 1681, totaling
2.7 million words. The corpus is accompanied by a sender database, which offers
users easy access to various sociolinguistic variables, including writer age,
gender, place of birth, education, occupation, social rank, domicile and the
relationship with the addressee. CEEC is a balanced corpus which can be neatly
divided into two parts, both covering chronologically fairly equal periods: the
first from ca. 1417 to 1550 and the second from 1551 to 1680 (cf. Laitinen 2002). Table 14 shows the proportions in terms of
writers’ social ranks and domiciles (see Nevalainen 2000: 40). The CEEC corpus is currently being
expanded with personal letters written between 1682 and 1800 to cover the
18th century.
Table 14: The CEEC corpus by rank and domicile
| Rank (percent) | Domicile (percent) |
|---|---|
| Royalty: 2.4 | Court: 7.8 |
| Nobility: 14.7 | |
| Gentry: 39.3 | |
| Clergy: 13.6 | North: 12.5 |
| Professionals: 11.2 | Other regions: 48.6 |
| Merchants: 8.4 | |
| Other nongentry: 9.4 | |
As copyright restrictions have prevented public access to the full release of the
CEEC corpus, a CEEC sampler (CEECS) has been published by ICAME, which represents the
non-copyrighted materials included in CEEC. The sampler reflects the structure of
the full CEEC only in some respects. The time covered is nearly the same
(1418-1680), which is divided into two parts. CEECS1 (246,055 words) covers the
15th and 16th centuries while CEECS2 (204,030 words) covers the 17th century.
The sampler corpus consists of 23 collections of 1,147 letters with 194
informants, totaling 450,085 words. The CEEC sampler is
available from ICAME or OTA.
6.7. The Zurich English Newspaper Corpus
The
Zurich English Newspaper Corpus (ZEN) is a 1.2-million-word collection of
newspapers in Early English, covering 120 years (from 1671 to 1791) of British
newspaper history. To achieve a representative coverage, a wide variety of
newspapers were included. Up to ten issues per newspaper were selected at
ten-year intervals throughout the whole period. With the exception of
stock market reports, lottery figures, long lists of names and poetry, the
whole newspapers were included in the corpus. The news stories are grouped into
two major categories: foreign news and home news, with each news category
further classified according to its own text genre definition (cf.
Fries/Schneider 2000). The corpus is split into four 30-year periods in order
to track potential language change, as shown in Table 15 (see Schneider 2002:
202).
Table 15: The ZEN corpus
| Section | Period | Words | Sentences |
|---|---|---|---|
| A | 1670-1709 | 242,758 | 7,642 |
| B | 1710-1739 | 347,825 | 12,163 |
| C | 1740-1769 | 339,362 | 14,112 |
| D | 1770-1799 | 298,249 | 11,843 |
| Total | | 1,228,194 | 45,760 |
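Since Table 15 reports both word and sentence counts, an average sentence length per section can be derived from it. The sketch below does this; the names `zen` and `words_per_sentence` are illustrative, the counts are copied from the table, and the averages depend entirely on how sentences were delimited when the corpus was compiled.

```python
# (words, sentences) per section, copied from Table 15.
zen = {
    "A": (242758, 7642),
    "B": (347825, 12163),
    "C": (339362, 14112),
    "D": (298249, 11843),
}


def words_per_sentence(words, sentences):
    """Mean number of words per sentence."""
    return words / sentences


# Cross-check the column totals against the table's Total row.
total_words = sum(w for w, _ in zen.values())
total_sentences = sum(s for _, s in zen.values())

for section, (words, sentences) in zen.items():
    avg = words_per_sentence(words, sentences)
    print(f"Section {section}: {avg:.1f} words per sentence")
print(f"Whole corpus: {words_per_sentence(total_words, total_sentences):.1f}")
```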
The ZEN corpus is SGML-conformant. It not only allows for linguistic analysis of
different types of news stories in the 17th and 18th centuries but also makes
it possible to compare news texts in Early English with modern newspaper
language. The ZEN query system allows restricted access to the online database.
6.8. The Innsbruck Computer Archive of Machine-Readable English Texts
The
Innsbruck Computer Archive of Machine-Readable English Texts (ICAMET) contains ca.
500 Middle English texts totaling 5.7 million words.
The database comprises three parts, namely, the Prose
Corpus (129 texts written during 1100-1500, accounting for two thirds of the
total), the Letter Corpus (254 letters written during 1386-1688, arranged in
diachronic order), and the Prose Varia Corpus (mainly
translations or normalized versions of Middle English texts). An advantage of ICAMET is that the database consists of
complete texts instead of extracts, which allows literary, historical and
topical analyses of various kinds, particularly studies of cultural history (Marcus
1999). Nevertheless, the copyright issue has
restricted public access to many prose texts in the corpus. A sampler
containing half of the prose texts and all letters is available from ICAME.
6.9. The Corpus of English Dialogues
The
Corpus of English Dialogues (CED) contains 1.3 million words of Early Modern
English dialogue texts produced over a 200-year time span between 1560 and
1760. While the spoken language of the past is not directly accessible to modern
speakers, it is recorded in speech-related texts. The CED corpus samples
six such text categories, including trial proceedings, witness depositions,
drama, handbooks in dialogue form, fictional
dialogues, and language teaching books (cf. Culpeper/Kytö 1997).
The
focus on dialogue will allow insight into the nature of impromptu speech and
interactive two-way communication in the Early Modern English period - aspects
which have received little research attention. The CED corpus is currently
under construction by the Universities of Lancaster and
6.10. A Corpus of Late Eighteenth-Century Prose
A
Corpus of Late Eighteenth-Century Prose contains 30,000 words of unpublished
letters transcribed from the originals dated from the period 1761-1790. The
corpus is distributed in both plain text (extended ASCII) and HTML versions.
The text version can be used with a concordancer
while the HTML version facilitates viewing the corpus in a browser. The plain
text version is marked up in the
6.11. A Corpus of Late Modern English Prose
A Corpus of Late Modern English Prose contains about 100,000 words of informal
private letters written by British writers between 1861 and 1919. All decades
in this period are represented, with about 6,000 words for the decade
1880-1889, 13,000 words for 1890-1899 and 20,000 words for each of the other
four decades. These blocks of texts are sampled from five sources.
Stored
in seven extended (8-bit) ASCII text files, the corpus is marked up following
the conventions used in the Helsinki corpus, with information on writer,
recipient, relationship, date, genre, and page etc. encoded in COCOA-style
brackets (see Denison 1994). The corpus can be ordered at no cost from the
Oxford Text Archive.
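COCOA-style references take the general form of a category code followed by a value inside angle brackets. A minimal sketch of how such brackets might be extracted is given below. The regular expression assumes single-letter codes, and the codes `A`, `D` and `G` and the sample text are placeholders invented for illustration; they are not the parameter codes actually used in this corpus or in the Helsinki corpus.

```python
import re

# One letter code, whitespace, then the value, all inside angle brackets.
# Single-letter codes are an assumption made for this sketch.
COCOA = re.compile(r"<([A-Za-z])\s+([^<>]*)>")


def parse_cocoa(text):
    """Return (code, value) pairs for every COCOA-style bracket in text."""
    return [(code, value.strip()) for code, value in COCOA.findall(text)]


# Hypothetical header lines followed by running text.
sample = "<A WRITER NAME>\n<D 1875>\n<G LETTER>\nDear friend, ..."

for code, value in parse_cocoa(sample):
    print(code, "=", value)
```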
In
addition to the diachronic corpora introduced in the previous sections, there
are a number of online databases which are accessible on the Internet, for
example, Michigan Early Modern
English Materials, the Corpus of Middle English Prose and Verse (CME), the Middle English Collection
(MidEng),
and the Korpus
of Early Modern Playtexts in English.
While
general corpora like national corpora may contain spoken material, there are a
number of well-known publicly available spoken English corpora, which will be
introduced in this section.
The
London-Lund Corpus (LLC), as the first electronic corpus of spontaneous
language, is a corpus of spoken British English recorded between 1953 and 1987. The
corpus derived from two projects: the Survey of English Usage (SEU) at
University College London and the Survey of Spoken English (SSE) at
7.2. SEC, MARSEC and Aix-MARSEC
The
Lancaster/IBM Spoken English Corpus (SEC) consists of approximately 53,000
words of spoken British English, mainly taken from radio broadcasts dating
between 1984 and 1991. For a corpus of this size, it is impossible to include
samples of every style of spoken English. The SEC corpus has been designed to
cover speech categories suitable for speech synthesis, as shown in Table 16
(see Taylor/Knowles 1988).
Table 16: The SEC categories
| Code | Category | Words | Proportion |
|---|---|---|---|
| A | Commentary | 9066 | 17% |
| B | News broadcast | 5235 | 10% |
| C | Lecture aimed at general audience | 4471 | 8% |
| D | Lecture aimed at restricted audience | 7451 | 14% |
| E | Religious broadcast including liturgy | | |
| F | Magazine-style reporting | 4170 | 9% |
| G | Fiction | 7299 | 14% |
| H | Poetry | 1292 | 2% |
| J | Dialogue | 6826 | 13% |
| K | Propaganda | 1432 | 3% |
| M | Miscellaneous | 3352 | 6% |
| Total | | 52637 | c. 100% |
In
the SEC corpus, efforts have been made to achieve a balance between the highly stylized
texts (e.g. poetry, religious broadcast, propaganda) and dialogue, and between
male and female speakers. Of the 53 speakers in the corpus, 17 are female,
representing 30% of the corpus. The higher proportions of male speakers in the
news and commentary categories reflect the tendency of the BBC to use mainly
male speakers in these types of programmes.
SEC
is available in orthographic, prosodic, grammatically tagged and treebank versions, which should prove most useful to those
who research in the speech synthesis or speech recognition fields. The corpus
can be ordered from ICAME.
The
Machine Readable Spoken English Corpus (MARSEC) is an
extension of SEC in which the original acoustic recordings were digitalized, and word-level time-alignment between the
transcripts and the acoustic signals was included. Tonetic
stress marks were also converted into ASCII symbols to make the corpus
machine-readable. The prosodically annotated
word-level alignment files are available at the MARSEC website.
The
Aix-MARSEC database is a further development of MARSEC. The database consists
of two major components: the digitalized recordings from MARSEC and the
annotations. Annotations have so far been undertaken at nine levels such as
phonemes, syllables, words, stress feet, rhythm units, minor
and major turn units. Two supplementary levels, the grammatical annotation by
CLAWS and a Property Grammar system developed at
7.3. The Bergen Corpus of London Teenage Language
The
Bergen Corpus of London Teenage Language (COLT) is the first large English
corpus focusing on the speech of teenagers. It contains half a million words
(about 55 hours of recording) of orthographically transcribed spontaneous
teenage talk recorded in 1993 by 31 volunteer recruits from five socially
different school boroughs. The speakers in the corpus are classified into six
age groups: preadolescence (0-9 years old), early adolescence (10-13), middle
adolescence (14-16), late adolescence (17-19), young adults (20-29) and older
adults (30+). As the name of the corpus suggests, the core of the corpus represents
teenagers. The early, middle and late adolescence groups account respectively
for 24%, 61% and 9%, totaling 94% of the corpus. The
older adult group, mostly parents and teachers, takes up 6%. As regards speaker
gender, girls and boys contributed roughly the same amount of text: the male
speakers about 51.8% (230,616 words) and the female speakers 48.2% (214,215
words). In terms of social class, only about 50% of the corpus material can be
assigned a social group value. The material that has been classified is evenly
distributed across the three social groups: high, middle, and low. While a wide
range of settings are present in the COLT corpus, settings in connection with
school (48%) and home (32%) are the most common. Such speaker-specific
information (speaker age, gender, social class, etc.) and conversation-specific
information (location and setting) is encoded in the header of each corpus
text. In the body of the text, paralinguistic features and non-verbal sounds
are also marked up (cf. Haslerud/Stenström 1995).
The corpus constitutes part of the British National Corpus. In addition, COLT is
released in both orthographically transcribed (pure text) and tagged versions
(using the CLAWS C7 tagset). A prosodically annotated version (a
representative selection amounting to approximately 150,000 words) is also
available. The corpus is for non-commercial purposes and can be accessed
online by registered users or ordered from ICAME.
7.4. The Cambridge and Nottingham Corpus of Discourse in English
The
A
unique feature of CANCODE is that the corpus has been coded with information
pertaining to the relationship between the speakers: whether they are intimates
(living together), casual acquaintances, colleagues at work, or strangers. For
this purpose, CANCODE is organized along two main axes: context-type and
interaction-type. The context-type axis runs along a cline from “public” to “private”: transactional, professional, socializing and intimate.
The interaction-type axis, on the cline from “collaborative” to “non-collaborative”, comprises information provision,
collaborative idea, and collaborative work. The interactions between the two
axes, together with typical settings, are shown in Table 17 (see
Carter/McCarthy 2004, 67). This coding allows users to look more closely at how
different levels of familiarity (formality) affect the way in which people
speak to each other. The corpus is not currently available to the public.
Table 17: CANCODE text types
| Context-type | Interaction-type: Information provision | Interaction-type: Collaborative idea | Interaction-type: Collaborative work |
|---|---|---|---|
| Transactional | commentary by museum guide | chatting with hairdresser | choosing and buying a television |
| Professional | oral report at group meeting | planning meeting at place of work | colleagues window-dressing |
| Socializing | telling jokes to friends | reminiscing with friends | friends working together |
| Intimate | partner relating the story to a film seen | siblings discussing their childhood | couple decorating a room |
7.5. The Spoken Corpus of the Survey of English Dialects
A
corpus that was built specifically for the study of English dialects is the
spoken corpus of the Survey of English Dialects (SED, see Beare/Scott
1999). The Survey of English Dialects was started in 1948 by Harold Orton at
the
The
spoken corpus derived from SED consists of transcripts of 314 recordings from
289 (out of the 313) SED localities in
While
the spoken corpus of SED comprises data invariably produced by elderly people,
as the survey was conducted nationwide, covering every
7.6. The Intonational Variation in English Corpus
The
Intonational Variation in English (IviE) corpus was constructed for the investigation of
cross-varietal and stylistic variation in British
English intonation, focusing on nine urban varieties of English spoken in the
British Isles, i.e.
7.7. The Longman British Spoken Corpus
The
Longman British Spoken Corpus contains 10 million words of natural, spontaneous
conversations from a representative sample of the population in terms of speaker
age, gender, social group and region, and from the language of lectures,
business meetings, after dinner speeches and chat shows. The design criteria
are discussed in detail in Crowdy (1993). The Longman
British Spoken Corpus is the first large scale attempt to collect spoken data
in a systematic way. The corpus is part of the spoken section of the British
National Corpus (see section 2.1).
7.8. The Longman Spoken American Corpus
The
Longman Spoken American Corpus comprises five million words of spoken data
collected from everyday conversations of more than 1,000 Americans of various
age groups, levels of education, and ethnicity from over 30 US States. Equal
numbers of participants were chosen from each region, and a balance was struck
between the numbers of participants from rural and city areas within those
regions. Recordings were made of four-hour chunks of the normal daily
conversations of each participant over periods of at least four days. The
participants were chosen to be representative for gender, age, ethnicity and
education, as shown by the latest US demographic census statistics (Table 18,
see Stern 1997). As part of the Longman Corpus
Network, the Longman Spoken American Corpus is a property of the Longman
publishers for in-house use only.
Table 18: Demographic distribution of the Longman Spoken American Corpus
| Variable | Proportions |
|---|---|
| Gender | Male: 50%; Female: 50% |
| Age | 18-24: 20%; 25-34: 20%; 35-44: 20%; 45-60: 20%; 60+: 20% |
| Ethnicity | White: 75%; Black: 13%; Hispanic: 8%; Asian: 4% |
| Education | Degree/Higher degree: 33%; College: 33%; High school: 33% |
7.9. The Santa Barbara Corpus of Spoken American English
The
Santa Barbara Corpus of Spoken American English (SBCSAE) is based on hundreds
of recordings of spontaneous speech from all over the
The
corpus is particularly useful for research into speech recognition as each
speech file is accompanied by a transcript in which phrases are time stamped to
allow them to be linked with the audio recording from which the transcription
was produced. Personal names, place names, phone numbers, etc., in the
transcripts have been altered to preserve the anonymity of the speakers and
their acquaintances and the audio files have been filtered to make these portions
of the recordings unrecognisable. The SBCSAE corpus is distributed by the LDC in five parts, the first three of
which have been released to date.
7.10. The Saarbrücken Corpus of Spoken English
The
Saarbrücken Corpus of Spoken English (SCoSE) consists of three parts: stories, jokes and
interviews. The first two parts comprise excerpts transcribed from audio-taped
talk recorded by researchers and students at
The
Switchboard Corpus (SWB) is a corpus of 2,438 spontaneous telephone
conversations, averaging 6 minutes in length, recorded from over 542 speakers of
both sexes from every major dialect of American English in the early 1990s. The
transcripts total three million words (over 240 hours of recordings). Information
relevant to speakers' sex, year of birth, education level and dialect region is
available in the documentation accompanying the corpus. Table 19 shows the
distribution of major sociolinguistic variables (see Godfrey/Holliman 1997).
Table 19: The Switchboard corpus
| Dialect | Speaker age | Speaker sex | Education |
|---|---|---|---|
| Western (85); Northern (75); Southern (56); NYC (33); Mixed (26) | 20-29 (140); 30-39 (179); 40-49 (112); 50-59 (87); 60-69 (13) | Male (292); Female (239) | High school - (14); College - (39); College (309); College + (176); Unknown (4) |
As
each transcript in the corpus is time-aligned at the word level, the corpus is useful
for sociolinguistic studies as well as for speech recognition. The corpus is
distributed by the LDC. It can also be
downloaded from the Switchboard
website or accessed via the LDC
Online.
7.12. The Wellington Corpus of Spoken New Zealand English
The
Wellington Corpus of Spoken New Zealand English (WSC) comprises one million
words of spoken New Zealand English in the form of 551 2,000-word extracts
collected between 1988 and 1994 (99% of the data from 1990–1994, the exception being eight private interviews). A very
stringent criterion was adopted to ensure the integrity of the
Table 20: Composition of the WSC corpus
| Category | Text category | Words |
|---|---|---|
| Monologue: | Broadcast news | 28,929 |
| | Broadcast monologue | 11,205 |
| | Broadcast weather | 3,641 |
| Monologue: | Sports commentary | 26,010 |
| | Judge's summation | 4,489 |
| | Lecture | 30,406 |
| | Teacher monologue | 12,496 |
| Dialogue: | Conversation | 500,363 |
| | Telephone conversation | 70,156 |
| | Oral history interview | 21,972 |
| | Social dialect interview | 31,058 |
| Dialogue: | Radio talkback | 84,321 |
| | Broadcast interview | 96,775 |
| | Parliamentary debate | 22,446 |
| | Transactions and meetings | 102,332 |
| Total | | 1,046,599 |
The
formal speech section (12%) in the WSC corpus includes all monologue categories
and “parliamentary debate” in the public dialogue
category. The semi-formal section (13%) includes the three types of interview
(both public and private). All of the other text categories
make up the informal speech section (75%), with private conversation alone
accounting for 50% of the corpus. In terms of speaker gender, women
contributed 52% and men 48% of the final transcribed words, reflecting the
The
unusually high proportion of private material and the rich sociolinguistic
variation make the WSC corpus a valuable resource for research into informal
spoken registers as well as for sociolinguistic studies. The corpus is
available from ICAME.
7.13. The Limerick corpus of Irish English
The
Limerick corpus of Irish English (L-CIE) comprises one million words in the form
of 375 transcripts of naturally occurring conversations recorded in a wide
variety of speech contexts throughout
Table 21: Design of the L-CIE corpus
| | Information provision | Collaborative idea | Collaborative task |
|---|---|---|---|
| Pedagogic | 80,253 words e.g. linguistics lecture | 60,473 words e.g. English poetry tutorial | 10,000 words e.g. one-to-one computer lesson |
| Professional | 145,000 words e.g. real-estate office talk | 100,000 words e.g. team meeting | 60,000 words e.g. waitresses washing dishes |
| Socialising | 50,000 words e.g. describing a new bar | 54,356 words e.g. friends discussing college | 30,000 words e.g. friends assembling a bed |
| Intimate | 60,000 words e.g. mother storytelling | 266,000 words e.g. partners making holiday plans | 60,000 words e.g. family preparing dinner |
| Transactional | 5,000 words e.g. product presentation | 10,000 words e.g. chatting in a taxi | 1,000 words e.g. eye examination |
While
it is not designed to be geographically representative –
it does
not include data from every county in the
7.14. The Hong Kong Corpus of Conversational English
The
Hong Kong Corpus of Conversational English (HKCCE) comprises 50 hours of
recordings made up of 130 separate conversations involving a total of 341
participants. The conversations range in length from 2 minutes 49 seconds to
1 hour 15 minutes, averaging about 23 minutes. The corpus is
divided into four subcorpora (conversations, academic
discourses, business discourses and public discourses), amounting to
approximately 500,000 words. The recordings were made in the mid-1990s of
conversations between Hong Kong Chinese and non-Cantonese speakers
(mostly native speakers of English). Table 22 shows the distribution of the
data across various design criteria (cf. Cheng/Warren 1999, 13-16).
Table 22: Design criteria of HKCCE
| Criterion | Type | Proportion |
| Gender | Male (Native speaker of English) | 34% |
| | Female (Native speaker of English) | 18% |
| | Male (Non-native English) | 24% |
| | Female (Non-native English) | 24% |
| Age | 18-29 | 40% |
| | 30-39 | 21% |
| | 40-49 | 27% |
| | 50-59 | 10% |
| | 60+ | 2% |
| Education | Form 5 (17 years) | 8% |
| | Form 7 (19 years) | 5% |
| | University | 82% |
| | Other | 5% |
| Domain | Education | 35% |
| | Business | 23% |
| | Administration | 11% |
| | Engineering | 8% |
| | Service sector | 7% |
| | Arts | 5% |
| | Law | 3% |
| | Media | 2% |
| | Airline industry | 2% |
| | Medical | 2% |
| | Not employed | 2% |
| Number of speakers | 2 | 58% |
| | 3 | 23% |
| | 4 | 10% |
| | 4+ | 9% |
The
corpus has not only facilitated sociolinguistic research in English spoken in
8. Academic and professional English corpora
As
language may vary considerably across genre and domain, specialized corpora
provide valuable resources for investigations in the relevant genres and
domains. Unsurprisingly, there has recently been much interest in the creation
and exploitation of specialized corpora in academic or professional settings.
This section introduces a number of well-known English corpora of this kind.
8.1. The Michigan Corpus of Academic Spoken English
The
Michigan Corpus of Academic Spoken English (MICASE) contains approximately 1.8
million words in the form of 152 transcripts of nearly 200 hours of recordings
of 1,571 speakers, focusing on contemporary university speech within the domain
of the
Table 23: The MICASE corpus
| Criterion | Distribution |
| Speaker gender | Male (46%), Female (54%) |
| Academic role | Faculty (49%), Students (44%) |
| Language status | Native speakers (88%), Non-native speakers (12%) |
| Academic division | Humanities & Arts (26%), Social Sciences & Education (25%), Biological & Health Sciences (19%), Physical Sciences & Engineering (21%), Other (9%) |
| Primary discourse mode | Monologue (33%), Panel (8%), Interactive (42%), Mixed (17%) |
| Speech event type | Advising (3.5%), Colloquia (8.9%), Discussion sections (4.4%), Dissertation defenses (3.4%), Interviews (0.8%), Labs (4.4%), Large lectures (15.2%), Small lectures (18.9%), Meetings (4.1%), Office hours (7.1%), Seminars (8.9%), Study groups (7.7%), Student presentations (8.5%), Service encounters (1.5%) |
In
the MICASE corpus, speakers are divided into four age groups: 17-23, 24-30,
31-50, and 51+. In terms of academic role, they are classified into a number of
categories: junior and senior undergraduates, junior and senior postgraduates,
junior and senior faculty and researchers, etc. The language status can be
native speaker (North American English), other native speaker (non-American
English), near native speaker, and non-native speaker.
The
MICASE corpus was originally marked up in TEI-compliant SGML. All of the SGML
files have now been converted to the XML format in order to meet the
requirements for further corpus development including a web-based search
interface and the streaming web delivery of the sound recordings, synchronized
with the transcripts. At present, only the orthographically transcribed version
of the corpus is available, though future releases will include various kinds
of annotations such as parts-of-speech, lemmas and discourse-pragmatic
categories. The MICASE
corpus can be searched online free of charge or ordered at a nominal fee at the
corpus website.
8.2. The British Academic Spoken English corpus
The
British Academic Spoken English (BASE) corpus, which is designed as a British
counterpart to MICASE, is under construction at the Universities of Reading
and Warwick. The corpus currently comprises a collection of recordings and
marked-up transcripts of 160 lectures (63 from Reading and 97 from Warwick,
totaling 127 recording hours) and 39 seminars (from
Warwick, 32 hours). The lectures and seminars are spread evenly across four subject
areas, as shown in Table 24 (cf. the corpus website).
Table 24: Components of the BASE corpus
| Subject area | Lectures | Seminars |
| Arts and Humanities | 42 | 10 |
| Social Studies and Sciences | 40 | 11 |
| Physical Sciences | 40 | 8 |
| Life and Medical Sciences | 38 | 10 |
| Total | 160 | 39 |
Unlike
MICASE, the BASE corpus only covers two types of speech event, lectures and
seminars. Most of the recordings were made on digital video instead of
audiotapes. At the moment, the majority of these recordings have been
transcribed (157 lectures and 22 seminars) and marked up in TEI-compliant SGML
(114 lectures and 3 seminars). The corpus will not only enable research into
spoken academic English at the lexical and structural levels but will also make
it possible, when used in combination with MICASE, to compare academic spoken
English in British and US university settings. When it is complete, the BASE
corpus will be published on CD-ROM, with transcripts linked to edited
video/audio files.
8.3. The
The
Reading Academic Text (RAT) corpus is a collection of academic texts written by
academic staff and research students at the
The
Academic Corpus is a written corpus of academic English developed at Victoria
University of Wellington. The corpus contains approximately 3.5 million words,
covering 28 subject areas from four faculty sections (arts, commerce, law, and
science), as shown in Table 25 (cf. Coxhead 2000,
220).
Table 25: Subject areas in the Academic Corpus
| Faculty | Arts | Commerce | Law | Science | Total |
| Texts | 122 | 107 | 72 | 113 | 414 |
| Words | 883,214 | 879,547 | 874,723 | 875,846 | 3,513,330 |
| Subject areas | Education, History, Linguistics, Philosophy, Politics, Psychology, Sociology | Accounting, Economics, Finance, Industrial relations, Management, Marketing, Public policy | Constitutional law, Criminal law, Family law and medicolegal, International law, Pure commercial law, Quasi-commercial law, Rights and remedies | Biology, Chemistry, Computer science, Geography, Geology, Mathematics, Physics | |
Each
of these faculty sections is divided into seven subject areas of ca. 125,000
words, totaling 875,000 words for each section. The
corpus comprises 414 academic texts by more than 400 authors which were sampled
from journal articles, book surveys, course workbooks, laboratory manuals,
course notes and the Internet. With the exception of 41 excerpts from the
Brown corpus, 31 excerpts from LOB and 42 excerpts from the Wellington Corpus
of Written New Zealand English, the texts from other sources were included in
full (excluding bibliographies). The majority of the texts were written for an international
audience, with 64% sourced in
The
corpus has been used to develop an Academic Word List (AWL) containing
570 word families (see Coxhead 2000), which is
available at the AWL
site.
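The size figures given for the Academic Corpus can be checked with a few lines of arithmetic. The sketch below simply re-adds the per-faculty counts listed in Table 25 and compares each section with the stated target of seven subject areas of ca. 125,000 words; the names `ACADEMIC_CORPUS`, `corpus_totals` and `deviation_from_target` are my own and purely illustrative.

```python
# Per-faculty counts copied from Table 25.
ACADEMIC_CORPUS = {
    "Arts": {"texts": 122, "words": 883_214},
    "Commerce": {"texts": 107, "words": 879_547},
    "Law": {"texts": 72, "words": 874_723},
    "Science": {"texts": 113, "words": 875_846},
}

# Seven subject areas of ca. 125,000 words per faculty section.
TARGET_WORDS_PER_FACULTY = 7 * 125_000


def corpus_totals(corpus):
    """Return (total texts, total words) across all faculty sections."""
    texts = sum(section["texts"] for section in corpus.values())
    words = sum(section["words"] for section in corpus.values())
    return texts, words


def deviation_from_target(corpus):
    """Return how far each section's word count is from the 875,000 target."""
    return {
        faculty: section["words"] - TARGET_WORDS_PER_FACULTY
        for faculty, section in corpus.items()
    }


if __name__ == "__main__":
    texts, words = corpus_totals(ACADEMIC_CORPUS)
    print(f"{texts} texts, {words:,} words")
    for faculty, diff in deviation_from_target(ACADEMIC_CORPUS).items():
        print(f"{faculty}: {diff:+,} words relative to target")
```

The four sections add up to 414 texts and just over 3.5 million words, in line with the approximate size quoted above.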
8.5. The Corpus of Professional Spoken American English
The
Corpus of Professional Spoken American English (CPSAE) has been constructed
using a selection of transcripts of interactions of various types occurring in
professional settings recorded during 1994-1998. The corpus contains two
million words of speech involving over 400 speakers. The CPSAE corpus has two
main components. The first component is made up of transcripts (0.9 million words)
of press conferences from the White House, which consist almost exclusively of
question and answer sessions in addition to some policy statements by
politicians and White House officials. The second component consists of
transcripts (1.1 million words) of faculty meetings and committee meetings
related to national tests, which involve statements and discussions as well as
questions (see Barlow 1998).
The
transcripts in the corpus have been marked up in a minimal but consistent way.
The markup scheme indicates only speech turns,
identifying the last name of the speaker (or VOICE if the name is unknown) with
the <SP> element, and places non-verbal events such as laughter in
brackets. Two versions of the corpus are available: a raw text version and an
annotated version tagged with the Lancaster CLAWS tagger. Both versions can be ordered
from the corpus website.
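To give an idea of how such minimal markup can be processed, the following sketch splits a transcript into speaker turns and separates out the bracketed non-verbal events. The exact line layout of the CPSAE files is not described here, so the layout assumed below (`<SP>NAME</SP>` followed by the speech on the same line, with events in square brackets) is hypothetical, as are the sample lines and the function name `parse_turns`.

```python
import re

# Assumed (hypothetical) layout: "<SP>NAME</SP> speech text [Event]".
TURN = re.compile(r"<SP>(?P<speaker>[^<]+)</SP>\s*(?P<speech>.*)")
# Non-verbal events are assumed to appear in square brackets.
EVENT = re.compile(r"\[([^\]]+)\]")


def parse_turns(lines):
    """Return one dict per speech turn: speaker, cleaned text, events."""
    turns = []
    for line in lines:
        match = TURN.match(line.strip())
        if match is None:
            continue  # skip lines that do not open a speech turn
        speech = match.group("speech")
        events = EVENT.findall(speech)
        # Remove the bracketed events and normalise whitespace.
        text = " ".join(EVENT.sub(" ", speech).split())
        turns.append({
            "speaker": match.group("speaker").strip(),
            "text": text,
            "events": events,
        })
    return turns


if __name__ == "__main__":
    sample = [
        "<SP>VOICE</SP> Is that final? [Laughter]",
        "<SP>SMITH</SP> Yes, it is.",
    ]
    for turn in parse_turns(sample):
        print(turn)
```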
8.6. The Corpus of Professional English
A
much more ambitious project has been initiated by the Professional English
Research Consortium (PERC), which aims to create a 100-million-word Corpus of
Professional English (CPE). The corpus is expected to include both spoken and
written discourse used by working professionals and professionals-in-training
and covering a wide range of domains such as science, engineering, technology,
law, medicine, finance and other professions. The CPE corpus is designed as a
balanced representation of professional English via texts published between
1995 and 2001 by over 1,000 major review and research journals, trade
magazines, and textbooks, in American and British English, selected on the
basis of impact factors provided by the Journal Citation Reports and other pertinent criteria.
The
Corpus of Professional English is marked up in XML. The contextual information
such as author's name, title, publication year and
journal title is stored in the corpus header. The structural information is
also encoded to show paragraphs, sections, headings and similar features in
written texts. Linguistic annotations such as POS and semantic tagging will be
carried out on the corpus using tools developed at
The
CPE corpus can be used for linguistic research as well as for the development
of educational resources, such as specialized dictionaries, handbooks, language
tests, and other materials that will be useful to working professionals and
professionals-in-training. The corpus, when completed, will be made available
to consortium members for online access at the PERC website.
Parsing,
also called treebanking, is a form of corpus
annotation. It is independent of corpus design criteria. Hence, a corpus,
whether balanced or specialized, whether written or spoken, can be
syntactically parsed. However, as parsing is a much more challenging task which
often necessitates human correction, parsed corpora are typically very small in
size. Of the corpora we have introduced so far, only ICE-GB is parsed. This section
introduces a number of well-known parsed corpora.
9.1. The Lancaster-Leeds Treebank
The
Lancaster-Leeds Treebank is perhaps the first syntactically parsed corpus. The
corpus is a subset of 45,000 words taken from all text categories in the LOB
corpus which was parsed manually by Geoffrey Sampson using a specially devised
surface-level phrase structure grammar compatible with the CLAWS word-tagging
scheme (cf. Sampson 1987). The annotation scheme used in the Lancaster-Leeds
Treebank, which consisted of 47 labels for daughter nodes (14 phrase and clause
classes, 28 word classes and five classes of punctuation mark), represented
surface grammar only, without indications of logical form. This hand-crafted treebank provided training data for the automatic probabilistic
parser which was used to analyze the Lancaster Parsed Corpus. The corpus was
not published but is available from UCREL
at
9.2. The
The
Lancaster Parsed Corpus (LPC) is a much larger sample of approximately
144,000 words taken from the LOB corpus that has been parsed. Except for
categories M (science fiction, six samples) and R (humor,
nine samples), which are all included, LPC takes the first 10 samples from each
of the other 13 text categories in LOB, totaling 145
files which account for 13.29% of the full LOB corpus. Even in these 145
samples, longer sentences have been excluded from the parsed corpus because the
parser was unable to process sentences over 20-25 words in length, with the
result that the parsed corpus no longer contains LOB text extracts in their
entirety. The errors resulting from automatic parsing were corrected by hand to
ensure the corpus is reasonably error free (cf. Garside/Leech/Váradi 1992).
The
Lancaster Parsed Corpus can be regarded as a treebank
broadly representative of the syntax of written English across a great variety
of styles and text types. It provides a testbed for
wide-coverage general-purpose grammars and parsers of English and a valuable
resource for quantitative linguistic studies of English syntax. The corpus is
available through ICAME.
The
SUSANNE corpus (an acronym for “surface and underlying structural
analysis of natural English”) is a 130,000-word subsample taken
from the Brown corpus of American English that has been parsed. The parsed
corpus comprises 64 text samples, with 16 taken from each of the four text
categories: A (press reportage), G (belles lettres, biography and memoir), J
(learned writing) and N (adventure and Western fiction).
The
parsing was largely undertaken manually in accordance with the SUSANNE analytic
scheme developed by Geoffrey Sampson in collaboration with Geoffrey Leech on
the basis of samples from written British and American English. In SUSANNE, a
parse tree is represented as a bracketed string, with the labels of
non-terminal nodes inserted between opening and closing brackets. There are
three types of information in the parsing scheme: a form tag, a function tag
and an index. The hierarchy of form tag ranks (word, phrase, clause and root)
defines the shape of a parse tree. The function tags identify surface roles
such as surface and logical subject, agent of passive, and time and place
adjuncts. An index shows referential identity between nodes (cf. Sampson 1995).
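The idea of a parse tree encoded as a labelled bracketed string can be illustrated with a small reader. The sketch below uses a deliberately simplified notation of my own, in which the label follows the opening bracket and a colon separates the form tag from an optional function tag; it is not the actual SUSANNE file format, and it ignores the index. The names `Node`, `parse_bracketed` and `leaves` are likewise only illustrative.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    form: str                      # form tag, e.g. a phrase or clause class
    function: str = ""             # optional function tag, e.g. subject
    children: list = field(default_factory=list)  # Node objects or word strings


# A token is an opening bracket, a closing bracket, or any other run of characters.
TOKEN = re.compile(r"\[|\]|[^\s\[\]]+")


def parse_bracketed(text):
    """Build a tree from a string such as '[S [Ns:s the dog] [Vd barked]]'."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "[":
            raise ValueError("expected '['")
        pos += 1
        if pos >= len(tokens) or tokens[pos] in "[]":
            raise ValueError("expected a node label")
        form, _, function = tokens[pos].partition(":")
        pos += 1
        node = Node(form, function)
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                node.children.append(read_node())
            else:
                node.children.append(tokens[pos])  # a terminal word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ']'")
        pos += 1  # consume the closing bracket
        return node

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def leaves(node):
    """Return the words of the tree in left-to-right order."""
    words = []
    for child in node.children:
        words.extend(leaves(child) if isinstance(child, Node) else [child])
    return words


if __name__ == "__main__":
    tree = parse_bracketed("[S [Ns:s the dog] [Vd barked]]")
    print(tree.form, [c.form for c in tree.children], leaves(tree))
```

Because the nesting of brackets mirrors the hierarchy of form tag ranks, the shape of the tree can be recovered from the string alone, which is what makes this kind of representation convenient for searching.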
The
SUSANNE corpus was first released in 1992 and its latest version, Release 5,
was published in 2000. Each successive release has corrected errors found in
earlier releases. The latest release, together with the documentation
accompanying the corpus, is distributed free of charge at Sampson’s website.
The
CHRISTINE corpus is a spoken counterpart to SUSANNE, developed by Geoffrey
Sampson and his team. It is one of the first treebanks
of spontaneous speech. The CHRISTINE analytic scheme includes explicit
extensions to the SUSANNE annotation which are designed to handle speech
phenomena such as pauses, discourse items and speech repairs. The first stage
of CHRISTINE (CHRISTINE/I), which was released in 1999, is based on 40 extracts
chosen at random from the demographically sampled component in the spoken BNC
and other sources, totaling approximately 80,500
words of spoken data representing 147 identified speakers in addition to a
great number of unidentifiable speakers. The information about speakers and the
metadata originally contained in the BNC corpus header were converted into
database files accompanying the corpus (cf. Sampson 2000).
The
full version of the CHRISTINE corpus includes 66 further texts drawn from the
spoken BNC and other sources. The overall proportion of the BNC data accounts
for 50% of the full CHRISTINE corpus, with 40% from the London-Lund corpus and
10% from the Reading Emotional Speech Corpus (see Stibbard 2001 for a description). The full release also
incorporates a minor change in the distribution of analytic information between
the fields to make it more compatible with SUSANNE and easier to read. This
version became available in 2000. At present only CHRISTINE/I can be
downloaded at Sampson’s website.
The
LUCY corpus is the third in Sampson’s series of treebanks. This corpus represents written English in modern
There
are 239 text files in LUCY, amounting to 165,000 words. The corpus consists of
three sections: polished writing (41 text files, 102,000 words), young adult
writing (48 text files, 33,000 words), and child writing (150 files, 30,000
words). The polished texts are taken from both informative and imaginative
categories in the written section of the British National Corpus. The young
adult writing comprises three groups, namely, A-level general study scripts,
access-course coursework, and first-year undergraduate essays. The child writing
section is composed of material from the Nuffield corpus, a collection of
writing by children aged between 9 and 12 years in 1965.
In
addition to providing a valuable source of information on the realities of
skilled written usage in modern
The
British component of the International Corpus of English (ICE-GB) is the first
corpus that has been completed in the ICE series. Like all of the ICE
components, ICE-GB comprises 300 spoken and 200 written texts from 32
categories, amounting to one million words. As noted in section 5.1, this
corpus is not only POS tagged but also fully parsed and hand checked. The
corpus contains 83,394 parse trees, including 59,640 in the spoken part of the
corpus. Each node in the tree is labelled with up to three types of
information: word class/syntactic category, syntactic function and features
(e.g. transitivity), the latter being optional (cf. Nelson/Wallis/Aarts 2002).
Unlike
the SUSANNE, CHRISTINE and LUCY corpora, which come without retrieval software,
ICE-GB is distributed together with a utility program, ICECUP, which allows very
complex queries of various kinds, e.g. markup
queries, exact and inexact grammatical node queries, text fragment queries,
Fuzzy Tree Fragment (FTF) queries, and sociolinguistic variable queries.
The
first full release of the corpus and ICECUP can be ordered on CD-ROM from the ICE-GB website. The
ICE-GB sampler, which includes ICECUP and ten ICE-GB texts, is also available
free of charge at the site. Release 2 will include the digitized speech
recordings of the spoken part of the corpus, aligned with the text. This will
allow researchers to hear the original source of what they see on-screen. In
addition to the online help included in ICECUP, Nelson/Wallis/Aarts (2002) provides a comprehensive reference guide to
both corpus and software.
The
Penn Treebank (PTB) is an example of skeleton parsing. Three releases of the treebank have so far been published by the LDC. The original release (Penn Treebank
I, 1992) contains over 4.5 million words of American English data. The whole
corpus is POS tagged while two thirds of the data is parsed. All of this material
has been corrected by hand after automatic processing. Table 26 shows the
components of Penn Treebank I (cf. Marcus/Santorini/Marcinkiewicz
1993).
Table 26: Penn Treebank Release 1
| Component | Tagged words | Parsed words |
| Dow Jones news stories | 3,065,776 | 1,061,166 |
| Brown corpus retagged | 1,172,041 | 1,172,041 |
| Dept. of Energy abstracts | 231,404 | 231,404 |
| MUC-3 messages | 111,828 | 111,828 |
| Library of | 105,652 | 105,652 |
| IBM manuals | 89,121 | 89,121 |
| Dept. of Agriculture bulletins | 78,555 | 78,555 |
| ATIS sentences | 19,832 | 19,832 |
| WBUR radio transcripts | 11,589 | 11,589 |
| Total | 4,885,798 | 2,881,188 |
Penn
Treebank I applies a parsing scheme which is extended and modified on the basis
of the
Penn
Treebank Release 2, which was published in 1995, features the new Treebank II
bracketing style. The new bracketing style is designed to facilitate the
extraction of simple predicate/argument structure (see Bies/Ferguson/Katz
et al. 1994). Penn Treebank II contains one million words of
9.8. Parsed historical corpora
In
addition to the treebanks of present-day English introduced
above, this section introduces a number of parsed historical corpora. These
corpora are largely based on the diachronic part of the Helsinki Corpus.
The
Penn-Helsinki Parsed Corpus of Middle English version 2 (PPCME2) is a corpus of
prose text samples of Middle English, annotated for syntactic structure to
allow searching not only for words and word sequences but also for syntactic
structure. Based on the Middle English section of the
The
York-Helsinki Parsed Corpus of Old English Poetry is a selection of poetic
texts from the Old English Section of the Helsinki Corpus which have been
annotated to facilitate searches on lexical items and syntactic structure. The
corpus contains 71,490 words of Old English text samples ranging from 4,000 to
17,000 words in length. The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of
Old English is a selection of texts from the Old English Section of the
Helsinki Corpus of English Texts. The corpus contains 106,210 words of Old English
text samples, ranging from 5,000 to 10,000 words in length, which represent a range
of dates of composition, authors and genres. A much larger corpus with much
more detailed annotation is the York-Toronto-Helsinki Parsed Corpus of Old
English Prose (YCOE), which contains 1.5 million words of Old English prose
texts taken from the Toronto Dictionary of Old English Corpus, with special
formatting which has made it possible to search conveniently for syntactic
structure using a computer search engine. These corpora apply the PPCME2
annotation scheme. They are available at no cost for non-commercial use at the corpus website or
via OTA.
10. Developmental and learner corpora
Two
types of corpora are particularly relevant to language learning: developmental
corpora and learner corpora. A learner corpus is a collection of the writing or
speech produced by learners acquiring a second language (L2). The term is used
here as opposed to a developmental corpus, which consists of data produced by
children acquiring their first language (L1). This section introduces
well-known corpora of these two types.
10.1. The Child Language Data Exchange System
The
Child Language Data Exchange System (CHILDES) is an international database
organized for the study of first and second language acquisition. The system
consists of three parts: Codes for the Human Analysis of Transcripts (CHAT),
the Computerized Language Analysis (CLAN) programs, and the database itself. The CHILDES database
contains transcripts of data collected from children and adults who are
learning first or second languages. The total size of the database is now
approximately 180 million characters (ca. 20 million words), covering 25
languages. The database is divided into six major components: English,
non-English, narratives, language impairments, bilingual acquisition, and
books. Some files have associated audio and video recordings. The transcripts
from normal English-speaking children constitute about half of the total
CHILDES database. All of the data is transcribed in the CHAT format and can be
analyzed using the CLAN programs, which support four basic types of linguistic
analysis: lexical analysis, morpho-syntactic
analysis, discourse analysis, and phonological analysis (cf. MacWhinney 1995).
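The following sketch illustrates the kind of lexical analysis that becomes straightforward once transcripts share a common format. It assumes the basic CHAT conventions that header lines begin with `@`, main speaker tiers with `*` plus a speaker code, and dependent tiers with `%`; it is a much simplified stand-in for a frequency count and not a reimplementation of any CLAN program. The sample transcript and the function name `word_frequencies` are invented for illustration, and real CHAT files contain continuation lines and special codes that are ignored here.

```python
import re
from collections import Counter

# Main tier lines look like "*CHI:<tab>utterance"; "@" and "%" lines are skipped.
MAIN_TIER = re.compile(r"^\*(?P<speaker>[A-Z0-9]{1,7}):\s+(?P<utterance>.*)$")

# An invented miniature transcript in CHAT-like form.
SAMPLE = "\n".join([
    "@Begin",
    "@Participants:\tCHI Target_Child, MOT Mother",
    "*MOT:\twhat is that ?",
    "*CHI:\tdoggie .",
    "%mor:\tn|doggie .",
    "*CHI:\tmy doggie .",
    "@End",
])


def word_frequencies(chat_text, speaker):
    """Count word forms on the main tiers of one speaker."""
    counts = Counter()
    for line in chat_text.splitlines():
        match = MAIN_TIER.match(line)
        if match is None or match.group("speaker") != speaker:
            continue
        for token in match.group("utterance").split():
            word = token.strip(".,?!").lower()
            if word:  # drop tokens that were pure punctuation
                counts[word] += 1
    return counts


if __name__ == "__main__":
    print(word_frequencies(SAMPLE, "CHI").most_common())
```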
The
CHILDES database has been used in a wide range of research on normal and
abnormal child language. The database and computer programs are freely
available for research at the CHILDES
website.
10.2. The
The
Louvain Corpus of Native English Essays (LOCNESS) is
a corpus of argumentative essays on a great variety of topics written by native
British and American university students (cf. Granger/Tyson 1996). The LOCNESS
corpus comprises three parts, British pupils’ 114 A-Level essays
(60,209 words), British university students’ 90 essays (95,695
words), and American university students’ 232 essays (168,400
words), totaling 324,304 words. As the age group of
those students is comparable to that of the non-native EFL students in the
International Corpus of Learner English (ICLE, see section 10.4), LOCNESS
provides control data for comparing the writing of native and non-native students.
The corpus can be ordered from the Centre for English Corpus Linguistics (CECL) at the
10.3. The Polytechnic of
The
Polytechnic of Wales (POW) corpus contains 65,000 words of informal
conversations of about 120 6-to-12-year-old children, which were collected
between 1978 and 1984 in
10.4. The International Corpus of Learner English
The
first and best-known learner corpus is the International Corpus of Learner
English (ICLE). The corpus comprises argumentative essays written by
advanced learners of English, i.e. university students of English as a foreign
language (EFL) in their 3rd or 4th year of study. The primary goal of ICLE is
the investigation of the interlanguage of the foreign
language learner (cf. Granger 2003).
ICLE
version 1.1, published on CD-ROM in 2002, contained over 2.5 million words in
the form of 3,640 texts ranging from 500 to 1,000 words in length written by
EFL learners from 11 mother tongue backgrounds, namely, Bulgarian, Czech, Dutch,
Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish. The
corpus is still expanding with additional subcorpora
(each containing 200,000 words) of eight other L1 backgrounds including
Brazilian, Chinese, Japanese, Lithuanian, Norwegian, Portuguese, South African
(Setswana) and Turkish (see the ICLE
website for the current state of affairs). The version of ICLE published on CD-ROM is not
tagged for parts of speech or learner errors. An error-tagged and POS-tagged version
of the corpus will be available in the near future.
In
addition to allowing the comparison of the writing of learners from different
backgrounds, the corpus can be used in combination with LOCNESS to compare
native and learner English. The ICLE corpus is available for linguistic
research but cannot be used for commercial purposes. Orders can be placed via i6Doc.
The
Louvain International Database of Spoken English Interlanguage (LINDSEI) is a spoken counterpart to ICLE.
Each subcorpus represents an L2 background and
comprises transcripts of fifty 15-minute interviews with 3rd and 4th
year university students. The first component of LINDSEI contains transcripts
of interviews with 30 female and 20 male French learners of English, totaling ca. 100,000 words. The database is currently being
expanded with additional components representing other mother tongues
including Bulgarian, Chinese, Italian, Japanese, Spanish, and Swedish. As most
learner corpora have used written data only, this type of data allows new
research into a wide range of features of oral interlanguage.
See the LINDSEI
website for the latest developments of the corpus.
10.6. The Longman Learners’ Corpus
The
Longman Learners’ Corpus contains ten million words of
essays written during 1990-2002 by students of English at a range of levels of
proficiency from 20 different L1 backgrounds. The elicitation tasks varied,
ranging from in-class essays with or without the use of a dictionary to exam
essays or assignments. Each script in the corpus is coded for the student’s L1 background, proficiency level, text type (essay,
letter, exam script, etc.), target variety (British, American or Australian
English), and for the country of residence. This corpus has been designed to
provide balanced and representative coverage for each of these categories (cf.
Gillard/Gadsby 1998, 160). Taken as a whole it offers
a multi-faceted picture of interlanguage, which can
be explored in a variety of ways. The Longman Learners’ Corpus is not POS
tagged, but part of the corpus has been error-tagged manually, although this
portion is only for internal use by the Longman publishers. The Longman Learners’ Corpus is a commercial corpus, but it is also available for
academic use. At present around 10 million words can be supplied. Users can
also order a subcorpus for a certain proficiency
level or L1 background. For details, see the Longman website.
10.7. The
As
part of the Cambridge International Corpus (CIC), the Cambridge Learner
Corpus (CLC) is a
large collection of examples of English writing from learners of English all
over the world. It contains over 20 million words and is expanding continually.
The English in the CLC comes from anonymized exam
scripts written by students taking Cambridge ESOL English exams worldwide. The
corpus currently contains 50,000 scripts from 150 countries (100 different L1
backgrounds). Each script is coded with information about the student’s first language, nationality, level of English, age, etc.
Over eight million words (or about 25,000 scripts) have been coded for errors
using the Learner Error Coding system developed by Cambridge University Press.
CLC is a commercial corpus. Currently the corpus can only be accessed by
authors and writers working for Cambridge University Press and by members of
staff at Cambridge ESOL.
10.8. Other learner corpora
In
addition to the corpora which cover multiple L1 backgrounds as introduced
above, there are a number of learner corpora specific to one particular mother
tongue.
The HKUST Corpus of Learner English is one such
example. The
corpus contains 25 million words of essays and exam scripts of upper-secondary
and tertiary-level Chinese learners of English in
The
Chinese Learner English Corpus (CLEC) contains one million words of writing
produced by Chinese learners of English at five proficiency levels: middle
school students, junior and senior non-English majors, and junior and senior
English majors. The five types of learners are equally represented in the
corpus. The CLEC material includes writings for tests, guided writings and free
writings. The corpus is not POS tagged, but it is fully annotated with learner
errors using an annotation scheme which consists of 61 error types clustered in
11 categories (see Gui/Yang 2001). The CLEC corpus,
together with a companion book, can be ordered from Shanghai Foreign Language
Education Press (SFLEP).
The corpus can also be searched online at the CLEC website.
The
JEFLL (Japanese EFL Learner) corpus contains one million words. It has three
components. The most important part is the written (ca. 400,000 words) and
spoken (ca. 50,000 words) data produced by Japanese EFL learners from Years
7-12 in secondary schools. The second component is an L2 target language subcorpus which contains 150,000 words of English textbook
material used in
The
Standard Speaking Test (SST) corpus, also known as the NICT JLE (Japanese
Learner English) Corpus, contains one million words of error-tagged spoken English
produced by Japanese learners. Based entirely upon the audio recordings of an
English oral proficiency interview test called the Standard Speaking Test
(SST), the corpus comprises 1,200 samples transcribed from 15-minute oral
interview tests (around 300 hours of recording in total). This is the largest
spoken learner corpus which has been built to date. The subjects are classified
into nine SST
proficiency levels, thus making it possible to compare speech across different
learner proficiency groups. Two types of tagging have been used in the SST
corpus: discourse tagging and error tagging. The tags are XML-compliant. More
than 30 basic tags are used to mark up discourse phenomena in the learners’ utterances, which are clustered into four main categories:
tags for representing the structure of the entire transcription file, tags for
the interviewee’s profile, tags for speaker turns, and
tags for representing utterance phenomena such as fillers and repetitions (see Izumi/Uchimoto/Isahara 2004, 34).
The error tagging scheme consists of 47 tags. Each tag shows three types of
information: part of speech, a grammatical/lexical rule, and a corrected form
(cf. Izumi/Isahara 2004). The SST corpus CD and a companion
book can be ordered from the publisher’s website.
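The three kinds of information carried by an error tag can be illustrated with a short sketch. The element name `n_num` and the attribute `crr` below are hypothetical placeholders chosen for illustration, as is the sample utterance; the actual tag inventory is documented in Izumi/Isahara (2004) and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical error-tagged utterance. The element name "n_num" (noun, number)
# and the attribute "crr" (corrected form) are illustrative assumptions only.
SAMPLE = '<utt>I have two <n_num crr="cats">cat</n_num> at home.</utt>'


def extract_errors(xml_text):
    """Return one dict per element that carries a corrected form."""
    root = ET.fromstring(xml_text)
    errors = []
    for elem in root.iter():
        if "crr" not in elem.attrib:
            continue  # not an error tag in this sketch
        # Split the tag name into a part-of-speech prefix and a rule suffix.
        pos, _, rule = elem.tag.partition("_")
        errors.append({
            "pos": pos,
            "rule": rule,
            "erroneous": elem.text or "",
            "corrected": elem.attrib["crr"],
        })
    return errors


errors = extract_errors(SAMPLE)
```

Under these assumptions, each extracted record pairs the learner's original form with its correction, which is the kind of lookup an error-tagged learner corpus is meant to support.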
The Thai
English Learner Corpus (TELC) currently contains 1.5 million words of writings
by Thai learners of English. Two thirds of the materials were taken from
university entrance exams at the Institute for English Language Education
(IELE, Assumption University) and one third came from writings
by undergraduate students at various stages during the four-year EFL course.
The corpus continues to grow with the constant addition of new data. The
whole TELC corpus is tagged for part of speech and lemma. A demonstration
version of the corpus can be accessed at the TELC website. The query system
displays a maximum of 50 concordance lines, although it also reports the total
number of matches in the whole corpus.
The
Uppsala Student English (USE) corpus contains 1.2 million words in the form of
1,489 essays written during 1999-2001 by 440 Swedish university students of
English at three different levels, the majority in their first term of
full-time studies. These essays were written out of class, against a deadline
of 2-3 weeks, with length limits imposed (usually 700-800 words) and a suitable
text structure suggested. There are a variety of essay types in the corpus,
including evaluation, argumentation, and discussion. The corpus is
available for non-commercial use only. It can be accessed at the USE site or ordered via OTA.
The
Polish Learner English Corpus is designed by the PELCRA project (see section
2.3) as a half-a-million-word corpus of written learner data produced by Polish
learners of English from a range of learner styles at different proficiency
levels, from beginning learners to post-advanced learners (cf. Lewandowska-Tomaszczyk 2003, 107). The data was collected
between 1998 and 2000 from the exam essays of Polish learners of English at the
The JPU (
We
have so far introduced major monolingual corpora of English and a number of
other languages. This section introduces multilingual corpora. The term
multilingual is used here in a broad sense to include bilingual corpora.
Multilingual corpora can be parallel or comparable. Corpora of this kind are
particularly useful in translation and contrastive studies.
11.1. The Canadian Hansard Corpus
The
earliest and perhaps best-known parallel corpus is the Canadian Hansard Corpus, which consists
of debates from the Canadian Parliament published in the country’s official languages, English and
French. While
its content is limited to legislative discourse, the corpus covers a broad
range of topics and styles, e.g. spontaneous discussion, written
correspondence, as well as prepared speeches.
There
are several versions of the Canadian Hansard parallel
corpus. The USC
version comprises 1.3 million pairs of aligned text chunks (i.e. sentences
or smaller fragments) from the official records (Hansards)
of the 36th Canadian Parliament (1997-2000) with ca. 2 million words
in English and French each. This version is freely downloadable at the USC site
(USC Hansard, see Appendix). TransSearch
offers an online service which allows subscribed users to access all of the Hansard texts from 1986 to February 2003 (approximately 235
million words). The LDC released a
collection of Hansard parallel texts in 1995,
covering a time span from the mid-1970s through 1988. This version is
available on CDROM from the LDC. The Canadian Hansard
Treebank contains 750,000 words of skeleton-parsed texts from
proceedings in the Canadian Parliament, which is available from UCREL of
11.2. The English-Norwegian Parallel Corpus
The
English-Norwegian Parallel Corpus (ENPC) is one of the earliest and best-known
parallel corpora. The corpus is bi-directional in that it contains both
original and translated texts in the two languages. ENPC consists of 100
original texts of between 10,000 and 15,000 words in length in English and
Norwegian together with their corresponding translations in the two languages, totaling 2.6 million words. Unlike most parallel corpora
which are limited to a particular domain or text type, efforts have been made
to balance the ENPC corpus. Both fiction (30 originals plus translations in
each language) and non-fiction (20 originals plus translations in each
language) texts are sampled. Fiction texts include children’s fiction, detective fiction and general fiction.
Non-fiction texts cover religion, social sciences,
law, natural sciences, medicine, arts, and geography/history (see Johansson/Ebeling/Oksefjell 2002). ENPC is marked up in TEI-compliant
SGML. The English texts in the corpus are POS tagged and lemmatized while the
Norwegian part has also been tagged recently. The corpus is aligned at the
sentence level. The ENPC corpus is available for non-commercial research.
Registered users can access the corpus online. See the corpus homepage for details of
registration.
11.3. The English-Swedish Parallel Corpus
The
English-Swedish Parallel Corpus (ESPC) follows ENPC in its design. The corpus
consists of 64 English text samples and their translations into Swedish and 72
Swedish text samples and their translations into English, amounting to 2.8
million words. The samples from each language have been drawn from two main
text categories, fiction and non-fiction. The fiction categories include
children’s fiction, crime and mystery fiction,
and general fiction while non-fiction texts cover memoirs and biography, geography,
humanities, natural sciences, social sciences, applied sciences, legal
documents, and prepared speech. The text types of the originals from both
languages are comparable in terms of genre, subject matter, type of audience
and register (cf. Altenberg/Aijmer/Svensson 2001).
ESPC is aligned at the sentence level and marked up in TEI-compliant SGML. The
corpus is for non-commercial research and only registered users can access the
corpus. See the ESPC site
for contact details.
11.4. The
The
Oslo Multilingual Corpus (OMC) is an extension of ENPC which covers more
languages: in addition to English and Norwegian, it includes German, French,
Swedish, Dutch, Finnish and Portuguese. The corpus is composed of many subcorpora that differ in composition with regard to
languages and number of texts included. OMC is a corpus under construction.
Apart from ENPC and ESPC, the corpus currently includes an English-German subcorpus (1.3 million words), a French-Norwegian subcorpus (0.5 million words), a
German-Norwegian subcorpus (1.5
million words), a Norwegian-English-German subcorpus
(289,230 words of Norwegian original texts, 419,500 words of English original
texts, and 220,600 words of English original texts, plus the translations in
the other two languages), an English-Dutch subcorpus
(0.3 million words), an English-Norwegian-Portuguese subcorpus (0.6 million words), a Norwegian-French-German
subcorpus (1.5 million words), a Norwegian-English-French-German
subcorpus (1.7 million words), and an English-Finnish
subcorpus (0.3 million words).
OMC
has been constructed following the same principles as ENPC; and like ENPC, the
corpus is coded and marked up in TEI-compliant SGML. The OMC corpus is for
academic, non-commercial purposes but it can be accessed only by registered
users. See the OMC
homepage for the current status of the corpus.
11.5. The ET10/63 and ITU/CRATER parallel corpora
ET10/63 is a bilingual parallel corpus of English and
French, containing ca. one million words of EC official documents on
telecommunications in each language. The corpus is POS tagged and also lemmatized. This
bilingual parallel corpus has been extended to include Spanish on the Corpus
Resources and Terminology Extraction project. The extension is thus named the
CRATER parallel corpus, which contains one million words in each of the three
languages. The corpus is sentence aligned and POS tagged in all three languages (cf. Garside/Hutchinson/Leech et al.
1994). An expanded version of the CRATER corpus, CRATER 2, has increased the
size of the English and French components of the parallel corpus from one
million to 1.5 million words. Both versions of CRATER are available via ELRA. The corpus can also be accessed online
or downloaded via FTP at the CRATER site.
11.6. The IJS-ELAN Slovene-English Parallel Corpus
The
Slovene-English Parallel Corpus (IJS-ELAN) contains one million words from 15 terminology-rich bilingual texts produced in the 1990s. One half of the
corpus (in terms of the text size) consists of 11 Slovene texts and their
English translations while the other half comprises four English texts and
their Slovene translations. The corpus is aligned at the sentence level (cf. Erjavec 2002). Two versions of the IJS-ELAN corpus are
available, with one version marked up in TEI-compliant SGML and the other
encoded in XML and lemmatized and POS tagged. Both versions are freely
available for downloading at the corpus
website, which also allows free online access.
11.7. The CLUVI parallel corpus
The
CLUVI (Linguistic Corpus of the
11.8. European Corpus Initiative Multilingual Corpus I
European
Corpus Initiative Multilingual Corpus I (ECI/MCI) was released in 1994 by ELSNET (see section 13). The corpus contains
98 million words of texts from 27 languages, covering most of the major
European languages as well as some non-European languages such as Chinese,
Japanese and Malay. The corpus has 48 components, 12 of which are parallel
corpora composed of 2-9 subcorpora.
It also
includes a great diversity of text types such as newspapers, novels and
stories, technical papers and dictionaries and wordlists, though most
components are quite homogeneous in contents (cf. Thompson/Armstrong-Warwick/McKelvie et al. 1994).
ECI/MCI
is marked up in TEI P2 conformant SGML, but the markup
has been undertaken in such a way that users can also get easy access to the
source text without markup. The corpus is available
from ELSNET or the LDC.
Multilingual
Tools and Corpora (MULTEXT) is a series of projects whose aims are to develop
standards and specifications for the encoding and processing of linguistic
corpora, and to develop tools, corpora and linguistic resources embodying these
standards. The multilingual corpus used for developing linguistic tools is the
JOC (Official Journal of European
Community) corpus, which comprises 40 files in five languages: English,
German, Italian, Spanish and French. Of these, ten files in the five languages
are POS tagged and ten files in
four language pairs (English-French, English-German, English-Italian and
English-Spanish) are aligned at the sentence level. The corpus is conformant
with the Corpus Encoding Standard. The availability of the corpus is unknown,
but some samples can be downloaded at the MULTEXT website.
MULTEXT-East
is a project which is intended to extend the scope of MULTEXT by transferring
MULTEXT’s expertise, methodologies, and tools to
Central and Eastern European countries, thus enabling the extension and
validation of these methodologies and tools on a new range of languages. The MULTEXT-East parallel corpus consists of the English
original of George Orwell's Nineteen
Eighty-Four (100,000 words) together with its translations into the nine
project languages: Bulgarian, Czech, Estonian, Hungarian, Lithuanian, Romanian,
Russian, Serbian, and Slovene. The corpus contains extensive
CES-compliant headers and markup for document structure, sentences, and various
sub-sentence annotations. The translations of Nineteen
Eighty-Four are automatically POS tagged and
sentence aligned with the English original, with the alignments validated by
hand. The MULTEXT-East multilingual comparable corpus comprises a fiction
subset and a news subset of at least 100,000 words each, for each of the six
project languages (Bulgarian, Czech, Estonian, Hungarian, Romanian and
Slovene). Each language component is comparable in terms of the number and size
of texts. The multilingual comparable corpus is marked up in CES format with
over 40 different elements (see Erjavec 2004). The parallel and comparable corpora, together with other
MULTEXT-East language resources, are mounted on the Web. The corpora are
restricted to research only. Registered users can browse or download full
resources. Registrations can be made on the MULTEXT-East
website.
PAROLE
(Preparatory Action for Linguistic Resources Organization for Language
Engineering) represents a large-scale harmonized effort to create comparable
text corpora and lexica for EU languages. Fourteen languages are involved in
the PAROLE project, including Belgian French, Catalan, Danish, Dutch, English,
French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and
Swedish. Corpora containing 20 million words and lexica containing 20,000
entries were constructed for each of these languages using the same design and
composition principles during 1996-1998. These corpora all include specific
proportions of texts from the categories book (20%), newspaper (65%),
periodical (5%) and miscellaneous (10%) within a settled range.
The
PAROLE corpora are marked up according to the CES-conformant PAROLE DTD (Document
Type Definition). An equal proportion of the texts (up to 250,000 running
words) in each PAROLE corpus were POS tagged according to a common PAROLE tagset and morpho-syntactic
annotation standards. Part of the tagged data was validated: 50,000 words
checked for maximum granularity and 200,000 for part of speech. For some PAROLE
corpora, only a copyright-free subset is available to the public. The PAROLE
corpora that are currently available are distributed by ELRA.
11.11. Multilingual Corpora for Cooperation
Multilingual
Corpora for Cooperation (MLCC) is a corpus acquisition project which aims to
collect a set of texts representing a substantial improvement in range,
quantity and quality of corpus material available. The MLCC multilingual data
consists of the Multilingual Parallel Corpus and the comparable Polylingual Document Collection. The parallel corpus comprises
translated data in nine European languages: Danish, Dutch, English, French,
German, Greek, Italian, Portuguese and Spanish. This corpus has two datasets,
with one set taken from the Official
Journal of the European Commission, C Series: Written Questions 1993, totaling approximately 10.2 million words (1.1 million
words per language), and the other set taken from the Official Journal of the European Commission, Annex: Debates of the
European Parliament 1992-1994, with 5-8 million words for each language. The
comparable corpus includes financial newspaper articles from the early 1990s in
six European languages: Dutch (8.5 million words), English (30 million words),
French (10 million words), German (33 million words), Italian (1.88 million
words), and Spanish (10 million words). The MLCC parallel and comparable
corpora are marked up in TEI-compliant SGML (cf. Armstrong/Kempen/McKelvie
et al. 1998). The resources are available via ELRA.
We
have so far introduced multilingual corpora of European languages. The
following sections are concerned with corpora involving other languages.
The
EMILLE Corpus is a product of the Enabling Minority Language Engineering
project which develops language resources for South Asian languages. Two
versions of the EMILLE Corpus are available: the EMILLE/CIIL Corpus distributed
free of charge for non-commercial research, and the EMILLE/Lancaster Corpus for
commercial use only.
The
EMILLE/CIIL Corpus consists of three components: monolingual, parallel and
annotated corpora. There are fourteen monolingual corpora, including both
written and (for some languages) spoken data for fourteen South Asian
languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam,
Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu. The EMILLE monolingual corpora contain
approximately 92,799,000 words (including 2,627,000 words of transcribed spoken
data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus
consists of 200,000 words of text in English and its accompanying translations
in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component includes
the Urdu monolingual and parallel corpora annotated for part of speech,
together with twenty written Hindi corpus files annotated to show the nature of
demonstrative use. The EMILLE/Lancaster Corpus consists of three components:
monolingual, parallel and annotated corpora. This version differs from the
EMILLE/CIIL Corpus in its monolingual component, which consists of monolingual
corpora covering seven South Asian languages (Bengali, Gujarati, Hindi,
Punjabi, Sinhala, Tamil, and Urdu), totaling approximately 58,880,000 words (including
2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi,
Punjabi and Urdu). The parallel and annotated components are the same as in the
EMILLE/CIIL Corpus (cf. Baker/Hardie/McEnery et al.
2004).
The
EMILLE Corpus is marked up using CES-compliant SGML, and encoded using Unicode.
More information about the corpus is available on the EMILLE corpus site.
Both versions of the corpus are distributed via ELRA.
11.13. The BFSU Chinese-English Parallel Corpus
The
BFSU (
11.14. The
The
Babel Chinese-English Parallel Corpus contains 20 million Chinese characters
and 10 million English words of bilingual texts sampled from a great variety of
text categories including government documents, news, academic prose, fiction,
play scripts, and speech. Babel is designed as a balanced corpus covering
three styles (literature, practical writing and news), six fields (arts,
business/economics, politics, science, sports, and society/culture), two modes
(written, spoken), and four periods (ancient, early modern, modern, and
contemporary for Chinese texts, and Old English, Middle English, Early Modern
English and present-day English for English texts). Presently only
contemporary/present-day written texts are included, and about 400,000 sentence
pairs have been aligned (cf. Bai/Chang/Zhan 2002).
The
11.15. Hong Kong Parallel Text
Hong
Kong Parallel Text is a large parallel corpus released by the LDC in 2004. The corpus contains
approximately 59 million English words and 49 million Chinese words (or 98
million Chinese characters). It consists of the updates of three parallel
corpora published in 2000: Hong Kong Hansards, Hong
Kong Laws, and Hong Kong News. The Hong Kong Hansards
component contains excerpts from the Official Record of Proceedings of the
Legislative Council of the HKSAR from October 1985 to April 2003, totaling 36,140,737 English words and 56,618,181 Chinese
characters. The Hong Kong Laws component contains statute laws of Hong Kong in
English and Chinese, constitutional instruments, national laws and other
relevant instruments published by the Department of Justice of the HKSAR up to
year 2000, amounting to 8,396,243 English words and 14,868,621 Chinese
characters. The Hong Kong News component contains press releases from the
Information Services Department of the HKSAR between July 1997 and October
2003, amounting to 14,798,671 English words and 26,677,514 Chinese characters.
All three components in the Hong Kong Parallel Text corpus are aligned
at the sentence level. The English and Chinese texts are kept in separate
files, with alignment indicated by corresponding sentence numbers. The corpus
is available from the LDC.
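Alignment by corresponding sentence numbers can be sketched as follows. The line format assumed here (a sentence number, a tab, then the sentence) is a hypothetical simplification for illustration; the actual file layout of the LDC release may differ.

```python
def index_by_number(lines):
    """Map sentence number -> sentence text for lines like '12<TAB>text'."""
    table = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        number, _, text = line.partition("\t")
        table[int(number)] = text.strip()
    return table


def pair_by_sentence_number(english_lines, chinese_lines):
    """Return (number, English, Chinese) triples for numbers present in both files."""
    en = index_by_number(english_lines)
    zh = index_by_number(chinese_lines)
    shared = sorted(en.keys() & zh.keys())
    return [(n, en[n], zh[n]) for n in shared]


# Toy data in the assumed format; not taken from the corpus itself.
english = ["1\tGood morning.", "2\tThe meeting is open.", "3\tThank you."]
chinese = ["1\t早安。", "2\t會議開始。"]
pairs = pair_by_sentence_number(english, chinese)
```

Keeping the two languages in separate files and joining them on a shared key means that either side can be processed monolingually without disturbing the alignment.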
12. Non-English monolingual corpora
We
have so far been concerned with well-known and influential English corpora and
multilingual corpora involving English, in addition to some national corpora.
This section introduces a number of major monolingual corpora of other
languages.
COSMAS
(Corpus Search, Management and Analysis System) is a large collection of German
text corpora developed at the Mannheim IDS (Institut für deutsche Sprache). With a
size of almost two billion words, this is the world’s largest,
ever-growing collection of German online corpora for
linguistic research. The collection covers a wide variety of sources, e.g.
classic literary texts, national and regional newspapers, transcribed spoken
language, morpho-syntactically annotated texts and
several unique corpora.
The copyright-free part of the COSMAS collection (over 1.1
billion words) is publicly available free of charge for searching via the COSMAS online toolbox, which
allows complex queries, collocation analysis, clustering, and virtual corpus
composition, etc. The COSMAS corpora are only available for non-commercial use
and anonymous COSMAS sessions are limited to 60
minutes.
12.2. The CETEMPúblico Corpus
The
CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público) corpus includes the text of around 2,600 editions of the
Portuguese daily newspaper Público, published between 1991 and 1999,
amounting to approximately 180 million words. The corpus is marked up in SGML.
Having removed some repeated extracts from version 1.0, CETEMPúblico
version 1.7 consists of over 1.5 million extracts. The first million words
(8,043 extracts) have been parsed. This subset represents a balanced selection
from the whole period (1991-1999) rather than early years alone. It also covers
all of the categories included in the full corpus (cf. Santos/Rocha 2001). CETEMPúblico can be used for research and technological
development, but direct commercial exploitation is not permitted. There are a
number of ways to access the corpus: CDROM from the LDC, FTP download, and online access at
the corpus
website.
The
Institute for Dutch Lexicology (INL) has offered three corpora over the Web. The
5 Million Words Corpus 1994 has a varied composition. It comprises texts of
present-day Dutch derived from 17 text sources dating from 1989-1994, including
books, magazines, newspapers and TV broadcasts which cover topics such as
journalism, politics, environment, linguistics, leisure and business/employment
(see Kruyt 1995). The 27 Million
Words Dutch Newspaper Corpus 1995 consists of newspaper texts derived
from issues published in 1994-1995 by a major national newspaper, NRC (see Kruyt/Raaijmakers/van der Kamp et al. 1996). The 38 Million Words Corpus 1996 has
three main components: a component with varied composition (books, magazines,
newspaper texts, TV broadcasts, parliamentary reports, 1970-1995, 12.7 million
words), a newspaper component (Meppeler
Courant, 1992-1995, 12.4 million words), and a legal component (Dutch legal
texts operative in 1989, with some dating back as early as 1814, 12.9 million
words) (see Kruyt/Dutilh 1997). All three corpora are
lemmatized and tagged for part of speech and users can define subcorpora using the parameters encoded therein. They are
available for non-commercial research purposes only. Access to these corpora is
free of charge but subject to an individual user agreement, which can be
obtained from the INL website.
The
CEG (Cronfa Electroneg o Gymraeg) corpus contains one million words of written Welsh
prose. The corpus is designed as a Welsh parallel to the Brown and LOB corpora,
consisting of five hundred 2,000-word samples selected from a representative
range of text types to illustrate modern (mainly post 1970) Welsh prose
writing. However, the text categories and their proportions in the corpus are
different from those in Brown and LOB. The texts in CEG are grouped into two
broad categories: factual prose and fiction. There are seven types of fiction
such as novels and short stories while the factual prose is further divided
into 22 categories such as various types of press material, administrative
documents, academic texts and biography (see Ellis/O'Dochartaigh/Hicks
et al. 2001).
The
corpus is of value for lexical and syntactic analyses of modern Welsh prose. It
is available as both raw and annotated texts. Annotations include lemmatization
and POS tagging. Both versions are available at the CEG website.
12.5. The Scottish Corpus of Texts and Speech
The
Scottish Corpus of Texts and Speech (SCOTS) is an ongoing project which aims to
build a large electronic corpus of both written and spoken texts for the
languages of
The
SCOTS corpus is marked up in SGML. Extensive
sociolinguistic metadata are recorded, including, for example, resource type, text
type, setting, medium, audience, text details, author/speaker details (gender,
age, geographic region, education, occupation, religious background, languages
used, etc.), and copyright information (see Anderson 2005). The current version
of the SCOTS Corpus is not linguistically annotated, but
the transcripts of spoken data are aligned with digital audio/video recordings.
The available texts can be browsed and downloaded at the project site.
12.6. The
The
Prague Dependency Treebank (PDT, version 1.0) contains 1.8 million words of
texts drawn from the Czech National Corpus (see section 2.4) which have been
annotated morphologically and syntactically. Of the texts included in the treebank, general newspaper articles related to politics,
sports, culture, hobbies, etc. account for 60%, economic news and analyses 20%,
and popular science magazines 20%. PDT version 1.0 is marked up in SGML; the markup
has been migrated to XML in version 2.0. The annotation scheme consists of three
levels. The morphological level assigns a lemma and a morphological tag to each
token. The analytical level uses dependency syntax to annotate the structure of
the parse tree and the analytical function of every node, which determines the
relationship between the dependent node and its governing node one level higher
in the tree. The highest level of annotation, the tectogrammatical
level, uses the dependency framework to describe the linguistic meaning of a
sentence (see Bohmova/Hajic/Hajicova et al. 2001). In
version 1.0, only the first two levels have been annotated. The third level
annotation is undertaken in PDT version 2.0. The same texts are annotated on
all three levels, but the amount of annotated material decreases with the
complexity of the levels, specifically about 1.8 million tokens at the morphological
level, about 1.3 million tokens at the analytical level, and one million tokens
at the tectogrammatical level. The
Prague Dependency Treebank version 1.0 is available on CDROM from the LDC. It can also be accessed at the PDT website using an online tool
which allows users to search and view parse trees.
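The first two annotation levels described above can be pictured as a simple data structure: each token carries a lemma and a morphological tag, plus a pointer to its governing node and an analytical function. The sketch below is an illustration only; the field names, the English example sentence and the tag labels are assumptions and do not reproduce the PDT file format or tagset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    form: str
    lemma: str            # morphological level
    tag: str              # morphological level (placeholder labels)
    head: Optional[int]   # analytical level: index of governing node, None for the root
    afun: str             # analytical level: analytical function (illustrative labels)


def depth(tokens: List[Token], i: int) -> int:
    """Count the dependency edges between token i and the root of the tree."""
    d = 0
    while tokens[i].head is not None:
        i = tokens[i].head
        d += 1
    return d


# Toy sentence "dogs bark loudly": the verb is the root and governs the other two tokens.
sentence = [
    Token("dogs", "dog", "NOUN", 1, "Sb"),
    Token("bark", "bark", "VERB", None, "Pred"),
    Token("loudly", "loudly", "ADV", 1, "Adv"),
]
depths = [depth(sentence, i) for i in range(len(sentence))]
```

Storing the governing node as an index keeps every dependent exactly one level below its governor, which mirrors the relationship the analytical level is described as encoding.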
12.7. Academia Sinica Balanced Corpus
Academia
Sinica Balanced Corpus (ASBC) is the first annotated
corpus of modern Chinese. The corpus is a representative sample of Mandarin
Chinese as used in
Table 27: Composition of ASBC
| Criterion | Proportions |
| Genre | Press reportage: 56.25%, Press review: 10.01%, Advert: 0.59%, Letter: 1.29%, Fiction: 10.12%, Essay: 8.48%, Biography and diary: 0.50%, Poetry: 0.29%, Quotes: 0.03%, Manual: 2.03%, Play script: 0.05%, Public speech: 8.19%, Conversation: 1.34%, Meeting minutes: 0.11% |
| Style | Narrative texts: 70.66%, Argumentative texts: 12.24%, Expository texts: 14.72%, Descriptive texts: 2.83% |
| Mode | Written: 90.14%, Written-to-be-read: 1.38%, Written-to-be-spoken: 0.82%, Spoken: 7.29%, Spoken-to-be-read: 0.35% |
| Topic | Philosophy: 8.68%, Natural science: 12.97%, Social science: 34.99%, Arts: 9.28%, General/leisure: 17.89%, Literature: 16.20% |
| Source | Newspaper: 31.28%, General magazine: 29.18%, Academic journal: 0.70%, Textbook: 4.08%, Reference book: 0.13%, Thesis: 1.36%, General book: 8.45%, Audio/video medium: 22.83%, Conversation/interview: 1.63%, Public speech: 0.25% |
The
values of these parameters, together with bibliographic information, are
encoded at the beginning of each text in the corpus. The whole corpus is tagged
for part of speech and a range of linguistic features such as nominalization
and reduplication. The Sinica corpus is accessible
online at the ASBC website
using the query system which also allows users to define subcorpora.
12.8. Sinica Treebank
Sinica Treebank (version 2.1) contains 23 texts (290,114 words)
extracted from the ASBC corpus, covering subject areas such as politics,
travelling, sports, finance and society. There are 54,902 structural trees in
the treebank. As in the Prague Dependency Treebank,
the thematic relation between a predicate and an argument is marked in addition
to grammatical categories in Sinica Treebank. Six
non-terminal phrasal categories are annotated in the treebank:
S (a complete tree headed by a predicate), VP (a verb phrase headed by a
predicate), NP (a noun phrase headed by a noun), GP (a phrase headed by a locational noun or locational
adjunct), PP (a prepositional phrase headed by a preposition), and XP (a
conjunctive phrase that is headed by a conjunction). There are three different
kinds of grammatical heads: Head, head and DUMMY. Head indicates a grammatical
head in a phrasal category; head indicates a semantic head which does not
simultaneously function as a syntactic head; and DUMMY indicates the semantic
head(s) whose categorical or thematic identity cannot be locally determined. A
total of 63 thematic roles are annotated in the treebank
including, for example, agent, causer, condition and instrument for verbs, and
time and location for nouns (see Huang/Chen/Chen et al. 2000). Sinica Treebank
can be accessed online
using the Web-based interface which allows users to search the treebank and view diagrammatical parse trees.
Penn
Chinese Treebank (CTB version 4.0) contains 404,156 words in the form of 838
data files sampled from three newswire sources: 698 articles from the Xinhua News Agency (1994-1998), 55 articles from the Information
Services Department of HKSAR (1997), and 85 articles from Sinorama
magazine,
12.10. Spoken Chinese Corpus of Situated Discourse
Spoken
Chinese Corpus of Situated Discourse (SCCSD) is an ongoing project under the
auspices of the
Table 28: Discourse types in SCCSD
| Category | Subcategory | Example |
| Societal | Major activities of organisation | government and political discourse, business discourse, educational and academic discourse, legal and mediatory discourse, mass media discourse, discourse of medicine and health, discourse of sports, public service discourse, public welfare discourse, religious and superstitious discourse |
| | Activities common to organization | administrative discourse, banquet discourse, discourse of celebration and ceremony, discourse of entertainment and leisure, office discourse, political study discourse, telephone discourse |
| | Special discourse | pathological discourse, criminal discourse, military discourse, miscellaneous |
| Familial discourse | Family discourse in a metropolis | family of high-ranking officials, family of entrepreneurs, family of businessmen, family of academics, family of white collar, family of blue collar, family of suburb farmers, family of immigrant labor |
| | Family discourse in a small town | family of academics, family of white collar |
The
corpus is presently being transcribed and annotated, with segmented audio/video
chunks linked to the corresponding transcripts. When the corpus is completed, about
50-100 hours will be mounted at the SCCSD website
and made available on the Internet in a multimedia form.
13. Well-known distributors of corpus resources
While
many corpora introduced in this survey are made available at individual project
or corpus websites, there are a number of organisations which aim at creating,
collecting and distributing corpus resources. The best-known of these include
CSLU, ELRA/ELDA, ELSNET, ENABLER, ICAME, the LDC, OTA, and TELRI/TRACTOR.
CSLU (Centre for Spoken Language Understanding) is a research
centre at Oregon Graduate Institute of Science and Technology (OGI) that
focuses on spoken language technologies. The centre offers a range of products
and services. For non-commercial purposes (educational, research, personal and
evaluation), most products are freely available. Some products (generally
source code) are also available for commercial use via a membership agreement.
CSLU has created, collected and distributed speech corpora in over 20 languages
for use in the area of voice processing. The corpora currently available from
the centre are described on the CSLU website.
ELRA (The European Language Resources Association) is a
non-profit organization established in 1995 with
the goal of promoting the creation, validation, and distribution of language
resources (LRs) for the Human Language
Technology (HLT) community, and evaluating language engineering technologies.
Many of these tasks are carried out by ELRA’s operational body, ELDA.
ELSNET (European Network in Language and Speech) is a Europe-based
forum which aims to advance human language technologies in a broad sense by
bringing together
The ENABLER (European National
Activities for Basic Language Resources) Network aims at improving cooperation
among the national activities which provide language resources for their
respective languages. ENABLER has worked in close collaboration with ELSNET to
develop the Language Resources Roadmap and the Language Resources Landscape.
Resources offered by ENABLER include written, spoken, and multimodal corpora as
well as lexical resources. See the ENABLER catalogue for a
list of available corpora.
ICAME (International Computer Archive of Modern and Medieval
English) is an international organization of linguists and information
scientists working with English corpora. The aim of the organization is to
collect and distribute information on English language material available for
computer processing and on linguistic research completed or in progress on the
material, to compile an archive of English text corpora in machine-readable
form, and to make material available to research institutions. About 20 corpora
amounting to 17 million words are currently available on CDs from ICAME.
The LDC (Linguistic Data Consortium) is an open consortium of universities, companies and government
research laboratories which creates, collects and distributes speech and text
databases, lexicons, and other resources for research and development purposes.
The LDC is the largest distributor of corpus resources, but most LDC resources
are specialized corpora which are more geared towards language engineering than
linguistic analysis. See the LDC
catalogue for a list of available corpora.
OTA (Oxford Text Archive) is one of the oldest and best-known electronic text centres in
the world. It works closely with members of the Arts and Humanities academic
community to collect, catalogue, and preserve high-quality electronic texts for
research and teaching. OTA currently distributes more than 2,500 resources in
over 25 different languages, which include a great variety of language corpora
in addition to electronic editions of works by individual authors, manuscript
transcriptions and reference works. See the OTA
catalogue for available resources.
TRACTOR is the TELRI (Trans-European Language Resources
Infrastructure) Research Archive of Computational Tools and Resources, which
aims at collecting, promoting, and making available monolingual and
multilingual language resources and tools for the extraction of language data
and linguistic knowledge, with a special focus on Central and Eastern European
languages. The TRACTOR archive features monolingual and multilingual corpora as
well as lexicons in a wide variety of languages, currently including Bulgarian,
Croatian, Czech, Dutch, English, Estonian, French, German, Greek, Hungarian,
Italian, Latvian, Lithuanian, Romanian, Russian, Serbian, Slovak, Slovenian,
Swedish, Turkish, Ukrainian and Uzbek. Resources distributed through TRACTOR are available for non-commercial use
only, but TRACTOR aims to promote and foster commercial links between academic
and industrial researchers.
This survey has introduced well-known and influential corpora for various research
purposes, including national corpora, monitor corpora, corpora of the Brown
family, synchronic corpora, diachronic corpora, spoken corpora,
academic/professional corpora, parsed corpora, developmental/learner corpora,
multilingual corpora, and non-English monolingual corpora. The discussion,
however, covers only a very small proportion of the available corpus resources.
The classification used here is for illustrative purposes only, and its
distinctions have been forced for the sake of this introduction: in practice,
a given corpus is often a blend of many of the features introduced here.
Aduriz, I./Aldezabal, I./Alegria, I./Arriola, J./Diaz de Ilarraza, A./Ezeiza, N./Gojenola, K. (2003), Finite State Applications for Basque.
In: Proceedings of EACL'2003 Workshop on Finite-State Methods in Natural
Language Processing.
Altenberg, B./Aijmer, K./Svensson, M. (2001), The English-Swedish Parallel Corpus
(ESPC): Manual of enlarged version. Universities of
Anderson, W. (2005), The SCOTS Corpus: a resource for
language contact study. In: Ureland, S. & Pugh,
S. (eds) Symposium Logos
Series Studies in Eurolinguistics Vol. 4.
Armstrong, S./Kempen, M./McKelvie, D./ Petitpierre, D./Rapp, R./Thompson, H. (1998), HCRC Publication: Multilingual Corpora
for Cooperation. In: LREC 1998 Proceedings.
Aston, G./Burnard, L. (1998), The BNC Handbook.
Auran, C./Bouzon, C./Hirst, D. (2004), The aix-MARSEC
project: an evolutive database of spoken British
English. In: SP-2004, 561-564.
Bai, X./Chang, B./Zhan, W. (2002), Building a
large Chinese-English parallel corpus. In Huang, H. (ed)
Proceedings of the National Symposium on Machine Translation 2002.
Baker, P./Hardie, A./McEnery, T./Xiao, R./Bontcheva,
K./Cunningham, H./Gaizauskas, R./Hamza,
O./Maynard, D./Tablan, V./Ursu,
C./Jayaram, B./Leisher, M.
(2004), Corpus linguistics and South Asian languages: Corpus creation and tool
development. In: Literary and Linguistic Computing 19(4), 509-524.
Barlow, M. (1998), A Corpus of
Spoken Professional American English.
Beare, J./Scott, B. (1999), The Spoken Corpus of
the Survey of English Dialects: language variation and oral history. In: Proceedings
of ALLC/ACH 1999.
Berglund, Y./Burnard, L./Wynne, M. (2004), BNC-baby: using corpora in
the virtual classroom. In: Proceedings of the 6th Teaching and Language Corpora
conference.
Biber, D. (1988), Variation Across Speech and
Writing.
Biber, D./Finegan,
E./Atkinson, D. (1994), ARCHER and its challenges: compiling and exploring a
representative corpus of historical English registers. In: Fries, U., Tottie, G. & Schneider, P. (eds) Creating and Using English Language Corpora.
Bies, A./Ferguson, M./Katz, K./Schasberger,
B. (1994), The Penn Treebank: annotating predicate-argument structure. In: ARPA
Human Language Technology Workshop.
Bohmova, A./Hajic, J./Hajicova, E./Hladka, B. (2001),
The Prague Dependency Treebank: three-Level Annotation Scenario. In: Abeille, A. (ed) Treebanks: Building and Using Syntactically Annotated
Corpora.
Burnard, L. (2002), Where did we go wrong? A retrospective look at the British National Corpus. In: Ketterman, B. & Marko, G. (eds)
Teaching and Learning by doing Corpus Analysis: Proceedings of the Fourth
International TALC.
Carter, R./McCarthy, M.
(2004), Talking, creating: interactional language,
creativity, and context. In: Applied Linguistics 25(1), 62-88.
Cavar, D./Geyken,
A./Neumann, G. (2000), Digital Dictionary of the 20th Century German Language.
In: Erjavec, T. & Gros,
J. (eds) Proceedings of the
Language Technologies Conference.
Cheng, W./Warren, M. (1999),
Facilitating a description of intercultural conversations: the Hong Kong Corpus
of Conversational English. In: ICAME Journal 23, 5-20.
Choukri, K. (2003), Brief overview of activities in
Coxhead, A. (2000), A new academic word list. In:
TESOL Quarterly 34(2), 213-238.
Crowdy, S. (1993), Spoken corpus design. In:
Literary and Linguistic Computing 8(4), 259-265.
Culpeper, J./Kytö, M.
(1997), Towards a corpus of dialogues, 1550-1750. In: Ramisch,
H. & Wynne, K. (eds)
Language in Time and Space: Studies in Honour of Wolfgang Viereck
on the Occasion of his 60th Birthday.
Dalli, A. (2001), Interoperable extensible linguistics databases. In:
Proceedings of IRCS Workshop on Linguistic Databases.
Dubois, J./Chafe, W./Meyer,
C./Thompson, S. (2000-2004),
Ellis, N./O'Dochartaigh, C./Hicks, W./ Morgan, M./Laporte,
N. (2001), Cronfa Electroneg
o Gymraeg (CEG): a 1 million word lexical database
and frequency count for Welsh. Accessed online on 8 December 2004 at http://www.bangor.ac.uk/ar/cb/ceg/ceg_eng.html
English Language Institute (2003), MICASE Manual: The
Michigan Corpus of Academic Spoken English (version 1.1).
Erjavec, T. (2002), The IJS-ELAN Slovene-English Parallel Corpus.
In: International Journal of Corpus Linguistics 7(1), 1-20.
Erjavec, T. (2004), MULTEXT-East Version 3: Multilingual Morpho-syntactic Specifications, Lexicons and Corpora. In:
LREC 2004 Proceedings.
Farr,
F./Murphy, B./O’Keeffe, A.
(forthcoming), The Limerick Corpus of Irish English: design, description and
application. In: Teanga 21.
Fries,
U./Schneider, P. (2000), ZEN: Preparing the
Garside,
R./Leech, G./Váradi, T.
(1992), Manual of Information for the
Garside,
R./Hutchinson, J./ Leech, G./McEnery,
A./Oakes, M. (1994), The exploitation of parallel corpora in projects ET10/63
and CRATER. In: New Methods in Language Processing.
Garside,
R./Leech, G./Sampson, G. (eds)
(1987), The Computational Analysis of English: A Corpus-Based Approach.
Gautier,
G. (1998), Building a Kurdish language corpus. Paper presented at ICEMCO 98 6th
International Conference and Exhibition on Multilingual Computing. Cambridge,
April 1998. Accessed online on 6 December 2004 at http://ggautierk.free.fr/e/icem_98.htm.
Gillard,
P./Gadsby, A. (1998), Using
a learners’ corpus in compiling ELT dictionaries.
In: Granger, S. (ed) Learner English on Computer.
Glover, W. (1998), Toward a Nepali national corpus. In: Yadava, P & Kansakar, T. (eds) Lexicography in
Godfrey,
J./Holliman, E. (1997), The Switchboard-1 Telephone
Speech Corpus. Linguistic Data Consortium.
Gomez Guinovart,
X./Sacau Fontenla,
E.
(2004), Parallel corpora for the Galician language: building and processing of
the CLUVI (Linguistic Corpus of the
Grabe, E./Post, B./Nolan, F. (2001),
The IViE Corpus. Department of
Linguistics,
Granger,
S. (2003), The International Corpus of Learner English: a new resource for
foreign language learning and teaching and second language acquisition
research. In: TESOL Quarterly 37(3), 538-546.
Granger,
S./Tyson, s. (1996), Connector usage in the English
essay writing of native and non-native EFL speakers of English. In: World Englishes 15(1), 17-27.
Greenbaum, S./Svartvik,
J. (1990), The London-Lund Corpus of Spoken English. In: Svartvik,
J. (ed) The London Lund Corpus of Spoken English:
Description and Research [Lund Studies in English 82].
Gu, Y. (2003),
Exploring multi-modal corpus segmentation and annotation. Talk
given at the Corpus Research Group,
Guerra,
L. (1998), Research in language and literature: Old problems, new solutions.
Paper presented at the conference of the Future of Humanities in the Digital
Age. Bergen, 25-26 September 1998. Accessed online on 6
December 2004 at http://ultibase.rmit.edu.au/Articles/dec98/guerra1.htm.
Gui, S./Yang, H. (2002), Chinese
Learner English Corpus. Shanghai: Shanghai Foreign Language Education Press.
Haslerud, V./Stenström,
A. (1995), The
Hatzigeorgiu, N./Gavrilidou, M./Piperidis, S./Carayannis, G./Papakostopoulou, A./Spiliotopoulou,
A./Vacalopoulou, A./Labropoulou,
P./Mantzari, E./Papageorgiou,
H./Demiros, I. (2000), Design and Implementation of
the Online ILSP Greek Corpus. In: Proceedings of LREC 2000.
Holmes,
J./Vine, B./Johnson, G. (1998), Guide to the
Holmes-Higgin, P./Abidi,
S./Ahmad, K. (1994), A description of texts in a corpus: “Virtual” and “Real” corpora. In: Martin, W., Meijs,
W. Moerland, M. ten Pas, E., van Sterkenburg,
P. & Vossen, P. (eds)
EURALEX'94 Proceedings.
Horvath,
J. (1999), Advanced Writing in English as a Foreign Language, A Corpus-based
Study of Processes and Products. PhD thesis.
Huang,
C./Chen, K. (1995/1998), CKIP Technical Report
95-02/98-04.
Huang,
C./Chen, F./Chen, K./Gao,
Z./Chen, K. (2000), Sinica Treebank: design criteria,
annotation guidelines, and on-line interface. In: Bagga,
A., Pustejovsky, J. & Zadrozny,
W. (eds) Proceedings of
NAACL-ANLP 2000 Workshop: Syntactic and Semantic Complexity in Natural Language
Processing Systems.
Hundt, M./Sand, A./Siemund, R. (1998),
Manual of Information to Accompany the Freiburg-LOB Corpus of British English (“FLOB”). Accessed online on
6 December 2004 at http://www.hit.uib.no/icame/flob/index.htm.
Hundt, M./Sand, A./Skandera, P. (1999),
Manual of Information to Accompany the Freiburg-Brown Corpus of American
English (“Frown”). Accessed
online on 6 December 2004 at http://khnt.hit.uib.no/icame/manuals/frown/INDEX.HTM.
Izumi,
E./Isahara, H. (2004),
Investigation into language learners' acquisition order based on the error analysis
of the learner corpus. Paper presented at IWLeL
2004.
Izumi,
E./Uchimoto, K./Isahara, H. (2004), SST speech corpus of Japanese learners’ English and automatic detection of learners’ errors. In: ICAME Journal 28, 31-48.
Johansson,
S./Ebeling, J./Oksefjell, S. (2002), English-Norwegian Parallel Corpus:
Manual.
Johansson,
S./Leech, G./Goodluck, H.
(1978), Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of
British English, for Use with Digital Computers.
Kang,
B./King, H. (2004), Sejong
Korean corpus in the making. In: Proceedings of LREC 2004. 1747-1750.
Kruyt, J. (1995), Nationale tekstcorpora in internationaal perspectief. In: Forum der Letteren 36(1), 47-58.
Kruyt, J./Dutilh,
M. (1997) A 38 million words Dutch text corpus and its users. In: Lexikos 7 (Afrilex-reeks/series
7: 1997), 229-244.
Kruyt, J./Raaijmakers,
S./van der Kamp, P./van Strien, R. (1996), Language resources for language
technology. In: Proceedings of the first TELRI European Seminar. Tihany, 173-178.
Kucěra, H./Francis, W. (1967),
Computational Analysis of Present-day English.
Kučera, K. (2002), The Czech National Corpus: principles, design,
and results. In: Literary and Linguistic computing 17(2), 245-257.
Kytö, M. (1996), Manual to the Diachronic Part of the
Laitinen, M. (2002), Extending the Corpus of Early English
Correspondence to the 18th century. In:
Lee,
D. (2001), Genres, registers, text types, domains, and styles: clarifying the
concepts and navigating a path through the BNC jungle. In: Language Learning
and Technology 5(3), 37-72.
Lewandowska-Tomaszczyk, B. (2003), The PELCRA project – state of art. In: Lewandowska-Tomaszczyk,
B. (ed) Practical Applications in Language and
Computers.
MacWhinney, B. (1995), The CHILDES project: Tools for Analyzing Talk.
Malten, T. (1998), Tamil studies in
Marcus,
M./Santorini, B./Marcinkiewicz, M. (1993), Building a large annotated corpus
of English: The Penn Treebank. In: Computational Linguistics 19, 313-330.
Marcus,
M./Kim, G./Marcinkiewicz,
M./MacIntyre, R./Bies,
A./Ferguson, M./ Katz, K./Schasberger, B. (1994), The
Penn Treebank: Annotating predicate-argument structure. In: ARPA Human Language
Technology Workshop.
McEnery, A./Xiao Z./Mo L. (2003), Aspect marking
in English and Chinese: using the Lancaster Corpus of Mandarin Chinese for
contrastive language study. In: Literary and Linguistic Computing 18(4),
361-378.
Milton,
J./Chowdhury, N. (1994),
Tagging the interlanguage of Chinese learners of
English. In: Flowerdew, L. & Tong, A. (eds) Entering Text. Hong Kong: The
Nelson,
G. (1996), The design of the corpus. In: Greenbaum, S. (ed) Comparing
English Worldwide: the International Corpus of English.
Nelson,
G./Wallis, S./Aarts, B. (ed)
(2002), Exploring Natural Language: Working with the British Component of the
International Corpus of English.
Nevalainen, T. (2000), Gender differences in the evolution of standard English. In: Journal of English Linguistics 28(1),
38-59.
Reppen, R./Ide,
N. (2004), The American National Corpus: overall goals and the first release.
In: Journal of English Linguistics 32(2), 105-113.
Rissanen, M. (2000), The world of English
historical corpora. In: Journal of English Linguistics 8(1), 7-20.
Riza, H. (1999), The
Rossini
Favretti, R./Tamburini, F./De Santis, C.
(2004), A corpus of written Italian: a defined and a dynamic model. In: Wilson,
A., Rayson, P. & McEnery,
T. (eds) A Rainbow of
Corpora: Corpus Linguistics and the Languages of the World.
Sampson,
G. (1987), The grammatical database and parsing
scheme. In Garside, R., Leech, G. & Sampson, G. (eds) The Computational Analysis of English.
Sampson,
G. (1995), English for the Computer: The SUSANNE Corpus and Analytic Scheme.
Sampson,
G. (2000), CHRISTINE Corpus, Stage I: Documentation.
Sampson,
G. (2003), The LUCY Corpus: Documentation.
Sánchez, M. (2002), CREA: Reference corpora for current Spanish.
In: Proceedings of Language Corpora: Present and Future. Donostia,
24-25 October 2002.
Schmied, J. (1994), The Lampeter Corpus of Early Modern English Tracts. In: Kytö, M., Rissanen, M. &
Wright, S. (eds) Corpora
Across the Centuries: Proceedings of the First International Colloquium on
English Diachronic Corpora. St Catharine's College Cambridge, 25-27 March 1993.
Schneider,
P. (2002), Computer assisted spelling normalization of 18th century
English. In Peters, P., Collins, P. & Smith, a. (eds) New Frontiers of Corpus research.
Scott, M. (1999), WordSmith
Tools.
Sharoff,
S. (2004), Methods and tools for development of the Russian Reference Corpus. In: Archer, D.,
Wilson, A. & Rayson, P. (eds) Corpus Linguistics Around the World.
Souter, C. (1993), Towards a standard
format for parsed corpora. In: Aarts, J., Haan, P. & Oostdijk, N. (eds) English Language Corpora:
Design, Analysis and Exploitation.
Stern,
K. (1997), The Longman Spoken American Corpus: providing an in-depth analysis
of everyday English. In: Longman Language Review Issue 3. Accessed
online on 7 December 2004 at http://www.longman.com/dictionaries/llreview/r3stern.html.
Stibbard, R. (2001), Vocal Expression of Emotions in Non-laboratory
Speech: An Investigation of the Reading/Leeds Emotion in Speech Project
Annotation Data. PhD thesis.
Taylor,
A. (2000), The Penn-Helsinki Parsed Corpus of Middle
English 2.
Taylor,
L./Knowles, G. (1988), Manual of Information to
Accompany the SEC Corpus: The machine readable corpus of spoken English.
Thompson,
H./Armstrong-Warwick, S./McKelvie,
D./Petitpierre, D. (1994), Data in your Language: the
ECI Multilingual Corpus 1. In: Proceedings of the International Workshop on
Shareable Natural Language Resources.
Tsou, B./Tsoi,
W./Lai, T./Hu, J./Chan, S. (2000), LIVAC, a Chinese
synchronous corpus, and some applications. In: Proceedings of the ICCLC
International Conference on Chinese Language Computing. Chicago. 233-238.
van Bergen, L./Denison, D. (2004), A corpus of late eighteenth
century prose. In Beal, J., Corrigan, K. & Mosil,
H. (eds) Models and Methods
in the Handling of Unconventional Digital Corpora vol. 2. Diachronic
Corpora. Palgrave.
Váradi, T. (2002), The Hungarian
National Corpus. In: Proceedings of the Third International Conference on
Language Resources and Evaluation. Las Palmas, Spain. 385-389.
Wang,
J. (2001), Recent progress in corpus linguistics in
Wang,
K. (ed.) (2004), The Development of the Compilation
and Application of Parallel Corpora.
Wittenburg, P./Brugman,
H./Broeder, D. (2000), Summary. In: Proceedings of
LREC 2000 Pre-conference Workshop on Meta-Descriptions for Multi-media Language
Resources.
Xue, N./Xia,
F./Chiou, F./Palmer, M. (2004), The Penn Chinese TreeBank: phrase structure annotation of a large corpus.
In: Natural Language Engineering 10(4), 1-30.