Faculty of Arts and Social Sciences
Lancaster University

UCREL CRS: Wrangling large-scale data for specialised corpora

Date: 7 March 2013 Time: 2:00-3:00 pm

Venue: FASS Meeting Room 3

UCREL Corpus Research Seminar

Wrangling large-scale data for specialised corpora

Andrew Hardie (LAEL, Lancaster University)

With vast amounts of text now readily accessible via the web, a "specialised corpus" need not be a "small corpus". However, the immensity of web resources presents challenges. Automatically-spidered data comes with none of the structure that characterises carefully-constructed corpora; when the research goal is to approach language of a very specific type, a "flat" corpus of this kind will typically not be satisfactory.

The ESRC-funded project "Metaphor in End-of-Life Care" (MELC) aims to examine metaphoricity in language associated with terminal illness - the language not only of patients but also of carers and medics. We mass-downloaded message-boards amounting to 50 million words, but then faced a number of problems: (1) structuring the data in a way that reflects the conceptual divisions of the original message-board; (2) allowing analysts routes of access into this dataset; (3) labelling the different classes of participant.

By using a bespoke spidering program, rather than an off-the-shelf mass-download tool, we made a single message-board thread correspond to a single corpus text. Within each thread/text, the mark-up identifies different posts, as well as the user responsible for each. A relational database was created in which all threads, users and posts are represented and cross-linked. A web-interface to this database allowed us to annotate user types - identifying users as patients, carers etc. by examining how they identify themselves in their first posts. We can then extract specified "slices" of the corpus for detailed analysis.
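The thread/user/post structure described above can be sketched as a small relational schema. This is only an illustration under stated assumptions, not the project's actual database: the table names, column names, the `annotate_user`, `first_post` and `extract_slice` helpers, and the placeholder rows are all hypothetical, and SQLite is used simply because it ships with Python. The sketch also assumes that `post_id` increases in posting order.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE threads (
    thread_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    username  TEXT NOT NULL,
    user_type TEXT                -- e.g. 'patient' or 'carer'; NULL until annotated
);
CREATE TABLE posts (
    post_id   INTEGER PRIMARY KEY,
    thread_id INTEGER NOT NULL REFERENCES threads(thread_id),
    user_id   INTEGER NOT NULL REFERENCES users(user_id),
    position  INTEGER NOT NULL,   -- order of the post within its thread
    body      TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with invented placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO threads VALUES (?, ?)",
        [(1, "Thread A"), (2, "Thread B")],
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(1, "user_a", None), (2, "user_b", None)],
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, 1, 1, "placeholder post one"),
            (2, 1, 2, 2, "placeholder post two"),
            (3, 2, 1, 1, "placeholder post three"),
        ],
    )
    return conn


def first_post(conn, user_id):
    """Return the body of a user's earliest post (assumes post_id follows posting order)."""
    row = conn.execute(
        "SELECT body FROM posts WHERE user_id = ? ORDER BY post_id LIMIT 1",
        (user_id,),
    ).fetchone()
    return row[0] if row else None


def annotate_user(conn, user_id, user_type):
    """Record an analyst's decision about which class of participant a user is."""
    conn.execute(
        "UPDATE users SET user_type = ? WHERE user_id = ?",
        (user_type, user_id),
    )


def extract_slice(conn, user_type):
    """Return (thread_id, position, body) for every post by users of one type."""
    rows = conn.execute(
        """
        SELECT p.thread_id, p.position, p.body
        FROM posts AS p
        JOIN users AS u ON u.user_id = p.user_id
        WHERE u.user_type = ?
        ORDER BY p.thread_id, p.position
        """,
        (user_type,),
    )
    return rows.fetchall()
```

In this sketch an analyst would read `first_post(conn, user_id)`, call `annotate_user` with the chosen label, and later call `extract_slice(conn, "patient")` to pull out every post by that class of participant, grouped by thread. Keeping the user type on the `users` table means a single annotation decision applies to all of that user's posts.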

These techniques illustrate how a very large web-derived corpus can be made tractable as a resource for detailed analysis such as the investigation of metaphor, whilst respecting the conceptual structure of the original online resource.

Event website: http://ucrel.lancs.ac.uk/crs

Who can attend: Anyone

Further information

Associated staff: Andrew Hardie

Associated projects: Metaphor in End-of-Life Care

Organising departments and research centres: Computing and Communications, Linguistics and English Language, University Centre for Computer Corpus Research on Language (UCREL)

