
The Lancaster Speech, Writing and Thought Presentation Written Corpus

 

 

A Handbook to the Lancaster Speech, Writing and Thought Presentation Written Corpus

1. Introduction

This handbook details the construction of the Lancaster Speech, Writing and Thought Presentation Written Corpus and discusses some of the issues involved. For more detailed information see Semino and Short (forthcoming).

We built a corpus of around 250,000 words and annotated it for categories of speech and thought presentation (also known as speech and thought reporting or representation) using a tagset which has been developed by Mick Short, Elena Semino, Jonathan Culpeper and Martin Wynne at Lancaster University. This tagset is an extension of the model of speech and thought presentation (SW&TP) proposed in Leech and Short (1981), which posits a continuum of categories along an axis representing degrees of narrator's intervention.

Originally a pilot corpus of some 40,000 words of fiction texts was compiled and annotated in 1994. A parallel pilot sample of 40,000 words of newspaper text was then added in 1994 and 1995. This work was done with funds provided by the Faculty of Social Sciences at Lancaster University.

Following the award of a major British Academy research project grant, this 80,000-word pilot corpus was expanded in 1996 and 1997 to a nominally 240,000-word corpus. The fiction and newspaper sections were doubled in size, and a new section of biography and autobiography texts was added. 

Analysis of the corpus is ongoing. 


2. The composition of the corpus

2.1 Structure

There are approximately 250,000 words of text in the corpus, made up of 120 samples of about 2,000 words each. The final count is somewhat in excess of the nominal 240,000 because the texts were sampled so as to begin and end at fairly 'natural' breaks, allowing a reader of the corpus to see enough of the relevant context to understand the narrative. It was usually preferred to find such a break after rather than before the 2,000-word mark, but as close to it as possible (for more on sampling strategies see 2.3 below).

The primary classification of the corpus is into three sections relating to narrative genres. These genres are: (i) fiction, (ii) newspaper news reports and (iii) biography and autobiography. There are a minimum of 80,000 words in each section.

Within each genre there is a division between 'serious' and 'popular' texts. While such a division is inevitably difficult to some extent, this classification was made on the basis of what would commonly be held to be the case by the average educated reader. This will enable the testing of such preconceptions by the analysis of the actual texts.

In the fiction and biography sections there is also a binary division (cutting across the popular/serious division) between first and third person narratives. In the biography section this creates a division between biography and autobiography.


2.2 List of texts sampled


Serious fiction

Amis, M. (1984) Money, London: Penguin.
Atkinson, K. (1984) Behind the Scenes at the Museum, London: Penguin.
Ballard, J.G. (1984) Empire of the Sun, London: Panther.
Barnes, J. (1989) A History of the World in 10½ Chapters, London: Picador.
Byatt, A.S. (1991) Possession, London: Vintage.
Carter, A. (1967) The Magic Toyshop, London: Heinemann.
Drabble, M. (1969) Jerusalem the Golden, Harmondsworth: Penguin.
Fowles, J. (1963) The Collector, London: Vintage.
Gardam, J. (1992) Queen of the Tambourine, London: Abacus.
Golding, W. (1980) Rites of Passage, London: Faber & Faber.
Greene, G. (1943) Brighton Rock, London: Penguin.
Huxley, A. (1928) Point Counter Point, London: Chatto & Windus.
Lawrence, D.H. (1955) ‘Tickets Please’, in The Complete Short Stories (Vol. II), London: Heinemann.
Lessing, D. (1974) The Memoirs of a Survivor, London: Octagon.
Lowry, M. (1969) ‘Gin and Goldenrod’, in Hear us O Lord from Heaven thy Dwelling Place, Harmondsworth: Penguin.
Maugham, S. (1935) The Moon and Sixpence, London: Heinemann.
Murdoch, I. (1961) A Severed Head, London: Chatto & Windus.
Rushdie, S. (1995) The Moor’s Last Sigh, London: Jonathan Cape.
Wells, H.G. (1953) Tono-Bungay, London: Collins.
Woolf, V. (1919) Night and Day, London: The Hogarth Press.


Popular fiction

Adler, E. (1986) Peach, London: Hodder & Stoughton.
Bow, J. (1991) Jane's Journey, Sussex: The Book Guild Ltd.
Burley, W.J. (1978) Wycliffe and the Scapegoat, London: Gollancz.
Conran, S. (1982) Lace, Harmondsworth: Penguin.
Cookson, C. (1984) Hamilton, London: Heinemann.
Dibdin, M. (1991) Dirty Tricks, London: Faber & Faber.
Francis, D. (1988) The Edge, London: Michael Joseph.
Higgins, J. (1991) The Eagle Has Flown, London: Pan.
Holt, V. (1991) Daughter of Deceit, London: Harper Collins.
Lewis, T. (1992) Get Carter, London: Allison & Busby.
MacLean, A. (1986) Santorini, London: Collins.
Maitland, S. (1990) Three Times Table, London: Chatto & Windus.
McDermid, V. (1992) Dead Beat, London: Gollancz.
McDowell, C. (1991) A Woman of Style, London: Century Group.
Nabb, M. (1989) Death in Springtime, London: Fontana.
Peters, E. (1992) The Holy Thief: The Nineteenth Chronicle of Brother Cadfael, London: Headline.
Seymour, G. (1992) Archangel, London: Fontana.
Smith, W. (1987) The Eye of the Tiger, London: Heinemann.
Taylor, A. (1986) The Raven on the Water, London: Harper Collins.
Thomson, R. (1991) The Five Gates to Hell, London: Bloomsbury.


Popular (auto)biography

Bannister, J. (1994) Lara: the story of a record-breaking year, London: Stanley Paul.
Beck, S. (1995) Queen of the Street: The Amazing Life of Julie Goodyear, London: Blake.
Bergan, R. (1991) Dustin Hoffman, London: Virgin.
Black, C. (1985) Step Inside, London: Dent.
Caine, M. (1992) What's it all about?, London: Century.
Cherrington, J. (1993) On the Smell of an Oily Rag: My Fifty Years in Farming, Ipswich: John Farming Press Books.
Christie, L. (with Ward, T.) (1989) Linford Christie: An Autobiography, London: Paul.
Dimbleby, J. (1994) The Prince of Wales: A Biography, London: Little, Brown.
Dorman, L.S. and Rawlins, C.L. (1990) Leonard Cohen: Prophet of the Heart, London: Omnibus.
Henry, A. (1994) From Zero to Hero: Damon Hill, Yeovil: Patrick Stephens Limited.
Juby, K. (ed.) (1986) In other words – David Bowie, London: Omnibus Press.
Miller, J. (with Brown, J.) (1989) Former Soldier Seeks Employment, London: MacMillan.
Milligan, S. (1976) Monty – his part in my victory, London: Penguin.
Morton, A. (1993) Diana: Her True Story, London: O’Mara.
Phoenix, P. (1983) Love, Curiosity, Freckles and Doubt, London: Arlington Books.
Smith, J. (1988) The Benny Hill Story, London: W H Allen.
Stokes, D. (with Dearsley, L.) (1987) Joyful Voices, London: Macdonald.
Stone, S. (1990) Kylie Minogue: The Superstar Next Door, London: Omnibus.
Whitbread, F. (with Blue, A.) (1988) Fatima, London: Pelham Books.
Windsor, B. (with Flory, J.) (1990) Barbara: The Laughter and Tears of a Cockney Sparrow, London: Century.


Serious (auto)biography

Baker, K. (1993) The Turbulent Years, London: Faber and Faber.
Critchley, J. (1995) A Bag of Boiled Sweets, London: Faber and Faber.
Glasser, R. (1986) Growing Up in the Gorbals, London: Chatto and Windus.
Isherwood, C. (1980) My Guru and His Disciple, London: Magnum.
Kennedy, L. (1989) On my way to the club: The Autobiography of Ludovic Kennedy, London: Collins.
Lee, L. (1969) As I Walked Out One Midsummer Morning, London: Andre Deutsch.
Worsthorne, P. (1993) Tricks of Memory. An Autobiography: Peregrine Worsthorne, London: Weidenfeld & Nicholson.
Spark, M. (1992) Curriculum Vitae, London: Constable.
Stalker, J. (1988) Stalker, London: Harrap.
Thatcher, M. (1993) The Downing Street Years, London: Harper Collins.
Carpenter, H. (1983) W. H. Auden, London: Unwin.
Adams, J. (1992) Tony Benn, London: MacMillan.
Bragg, M. (1988) Rich - The life of Richard Burton, London: Hodder and Stoughton.
Ponting, C. (1994) Churchill, London: Sinclair-Stevenson.
Wilson, A.N. (1990) C.S. Lewis, a biography, London: Collins.
Sherry, N. (1989) The Life of Graham Greene, London: Penguin.
Rose, J. (1990) Modigliani: The Pure Bohemian, London: Constable.
Ackroyd, P. (1984) T.S. Eliot, London: Hamilton.
Hodges, A. (1983) Alan Turing: The Enigma of Intelligence, London: Unwin.
Callow, S. (1990) Vincent Van Gogh - a Life, London: Allison & Busby.


Broadsheets (Serious newspapers)

The Daily Telegraph 
The Guardian
The Independent 
The Independent on Sunday 
The Times 


Tabloids (Popular newspapers)

The Express
The Mirror
The News of the World
The Star
The Sun

Today (samples from 1994 only, as the paper ceased publication before the 1996 sample was taken)


2.3 Sampling Strategies


Fiction

The decision as to what counted as 'high' literature was made by nine members of Lancaster University's Stylistics Research Group, who were given a list of authors whose works were available in electronic form in the Oxford Text Archive. Authors whom six or more informants judged to be 'high' literature were selected. The extracts which we took constituted relatively independent units (e.g. chapters, sections or short stories). The popular fiction extracts consisted of eight third person narratives taken from the relevant category of the British National Corpus, to which we added two first person narratives, so that we would have a greater range of narrative styles. Six extracts were from romantic novels and four from action novels.

In the fiction section a further subdivision was made within each text type between texts with first and third person narrators. This is paralleled in the biography/autobiography section, where the autobiography texts are all first person narratives and the biography texts are all third person narratives.


News

All the press data was taken from articles published in British national daily newspapers. Only newspapers that were felt to be prototypical members of the broadsheet or tabloid categories were selected. Newspapers were taken from the same or consecutive days in four samples: 4-5 December 1994, 11-12 December 1994, 28-29 April 1996 and 12-13 May 1996. This enabled us to select articles that covered the same story, and thereby facilitated comparisons between different newspaper styles (work which we hope to carry out in later phases of the project). News stories rather than editorials or magazine-style articles were chosen so that the press data would be as similar as possible in type to the narrative fiction data. The main criterion for selecting articles was that they should appear in at least three newspapers.


(Auto)biography

It was less clear-cut how to make a serious/popular distinction in the biography/autobiography section. It was decided to rely on the perceived seriousness of the subject, so politicians, serious writers and artists are considered 'serious' and TV stars, royalty and sports people are considered 'popular'. At the same time some attention was paid to the writing style of the biography in question so as not to include problematic cases, such as particularly badly written autobiographies of serious politicians, or highbrow biographies of pop stars, for example.


2.4 Text markup

The following SGML elements were used to mark up the text:


div1 text divisions - fiction (types: serious, popular), newspapers (types: broadsheet, tabloid), biography (types: serious, popular)
div2 sample (c.2000 words)
div3 (in newspapers) articles; other subdivisions of samples
edit a note to the text editor indicating what stage of processing the text is at
head a text heading (e.g. newspaper headline or a chapter heading)
header bibliographical information and the list of speakers in the text
note a note indicating additional information about the SW&TP tagging
p paragraph break
pb page break
sptag Speech, Writing and Thought Presentation category tag



3. Speech, Writing and Thought Presentation annotation

3.1 SW&TP categories

Here we show the acronyms used in the tags and their accompanying definitions. For full definitions of the SW&TP categories see Short et al. (1998).

 

N Narrative
NRS Narrative Report of Speech
NRW Narrative Report of Writing
NRT Narrative Report of Thought
NI Narrative Report of Internal State
NV Narrative Report of Voice
NRSA Narrative Report of Speech Act
NRWA Narrative Report of Writing Act
NRTA Narrative Report of Thought Act
NRSAP Narrative Report of Speech Act with Topic
NRWAP Narrative Report of Writing Act with Topic
NRTAP Narrative Report of Thought Act with Topic
IS Indirect Speech
IW Indirect Writing
IT Indirect Thought
FIS Free Indirect Speech
FIW Free Indirect Writing
FIT Free Indirect Thought
DS Direct Speech
DW Direct Writing
DT Direct Thought
FDS Free Direct Speech
FDW Free Direct Writing
FDT Free Direct Thought

 

Affixes:


e embedded
q with quote
h hypothetical
i inferred (see section on NIi below)
+ speech summary (not used)

 
e.g. NRSAPq is "Narrative Report of Speech Act with Topic with an embedded quotation".
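
The affix scheme above can be decoded mechanically. The following is a minimal sketch in Python, not part of the project's own tooling: the names decode_tag, BASE_TAGS and AFFIXES are hypothetical. It assumes that lowercase affix letters may precede a base tag (as in eNRSAPQ) and that affix letters in either case may follow it (as in NRSAPq, NRSAPQ or NIi). Both assumptions are inferred from the examples in this handbook rather than stated in it.

```python
# Base categories from the table in section 3.1 (N is plain narrative).
BASE_TAGS = {
    "N", "NRS", "NRW", "NRT", "NI", "NV",
    "NRSA", "NRWA", "NRTA", "NRSAP", "NRWAP", "NRTAP",
    "IS", "IW", "IT", "FIS", "FIW", "FIT",
    "DS", "DW", "DT", "FDS", "FDW", "FDT",
}

# Affix letters and their meanings; "+" is omitted because it is not used.
AFFIXES = {"e": "embedded", "q": "with quote", "h": "hypothetical", "i": "inferred"}


def decode_tag(tag):
    """Split one tag into (base category, list of lowercase affix letters).

    Assumption: affixes before the base tag are lowercase; affixes after
    it may be written in either case.
    """
    prefix = []
    rest = tag
    # Peel off leading lowercase affix letters (base tags are upper case).
    while rest and rest[0] in AFFIXES:
        prefix.append(rest[0])
        rest = rest[1:]
    # Try the longest matching base tag first, so that NIi is NI + i
    # rather than N + two affixes.
    candidates = [base for base in BASE_TAGS if rest.startswith(base)]
    for base in sorted(candidates, key=len, reverse=True):
        suffix = rest[len(base):].lower()
        if all(ch in AFFIXES for ch in suffix):
            return base, prefix + list(suffix)
    raise ValueError(f"unrecognised tag: {tag!r}")


for example in ("NRSAPq", "NIi", "eNRSAPQ", "DS"):
    base, affixes = decode_tag(example)
    print(example, base, [AFFIXES[a] for a in affixes])
```

Trying the longest base tag first is a design choice made for this sketch; it keeps NI and NRSAP from being misread as N or NRSA followed by stray letters.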

Notes

#
is used to flag problems for discussion, i.e. things that we weren't sure how to analyse. May be used in conjunction with a portmanteau tag to indicate choices (see below)

Portmanteau tagging
two tags joined by a hyphen are used for genuine ambiguity, where it is preferable to indicate two possible interpretations (e.g. IS-IT, NV-NRSA).

e
embedded SW&TP is indented on the page to make it easier to read

Line breaks
all SW&TP tags are printed on a line on their own, to make it easier to extract, sort, count etc.

Wordcounts
the unit used is the orthographic word, simply defined as a string of alphanumeric characters surrounded by spaces or punctuation. Hyphenated and contracted words count as one unit and genitive diacritics are ignored. e.g. "she's a man-eater" (3 words)

Scare quotes
these are tagged with a <note>
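
The word-count rule given under "Wordcounts" above can be approximated with a regular expression. This is a sketch under stated assumptions, not the procedure actually used for the corpus: the name count_words is hypothetical, and the treatment of apostrophes (kept when word-internal, ignored otherwise) is one reading of the rule.

```python
import re

# One orthographic word: a run of alphanumeric characters, optionally
# continued by an internal hyphen or apostrophe plus more alphanumerics,
# so that "man-eater" and "she's" each count as a single unit.
WORD = re.compile(r"[^\W_]+(?:[-'\u2019][^\W_]+)*")


def count_words(text):
    """Count orthographic words in a stretch of text."""
    return len(WORD.findall(text))


print(count_words("she's a man-eater"))
```

Applied to the tagged stretches quoted elsewhere in this handbook, the same function gives 10 for 'A criminal offence under the Defence of the Realm Act,' and 3 for "I told her.", matching the w values in those tags.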

 

3.2 Tagging Guidelines

When tagging, consideration is taken of all three of these levels of analysis. For more details of how this is done, it is necessary to refer to the guidelines for the tagging of particular categories in the different text types.

Some problems arise because there are fuzzy areas on the boundary between the presentation of linguistic and non-linguistic acts. There are also mentions of written language where the focus is not on the production or reception of the text, but merely on its existence. Such cases were not annotated as writing presentation.


3.2.1 NIi

The NI category was invented to cover cases in fiction where an omniscient narrator is able to report on the internal states of characters, e.g.:

Jed's heart lifted in his ribs.
(Rupert Thomson, The Five Gates to Hell)

For a moment she didn't know where she was.
(Graham Greene, Brighton Rock)

In texts which are not fiction with an omniscient narrator, NI (and all categories of thought presentation) are only used where the character in question has access to the thoughts and internal states which are reported. This usually means that only states and thoughts of the reporter are tagged as NI (or thought).

Often passages tagged as N-NI are in fact inferences based on what someone has said, and may even be quite close to the form of the original utterance. For example:

<sptag cat=N-NI who=S next=NRSAP whonext=B s=1 w=16>

The Palace was keen that the Prime Minister should continue until a successor had been elected.
(Baker)

This is presented formally as the report of an internal state ('was keen that'), but the reader will infer that it reports something that was probably said by a spokesman for the Palace. However, it is impossible to tell what type of speech report this might be. It could be FIS, if the original utterance was something like 'The Palace is keen that...'; it could be NRS followed by IS if the original utterance was something like 'The Prime Minister should continue...'; or it could be NRSAP if this report is a summary of what was said, possibly on different occasions and by different people, with the words here bearing little relation to the actual words said.

Given the impossibility of classifying the type of speech report, such examples are tagged as NIi, since they are formally presentations of internal states, even though pragmatically they can function as speech presentation.

In all cases where there is no omniscient narrator then, what is formally presented as the narration of internal states or thought is tagged as ambiguous between N and the relevant category of thought presentation, e.g. N-NI, N-NRT, N-IT.

 

3.2.2 NRSAP

The analysis of the press data highlighted the existence of particular variants of existing categories, which appear to be typical of newspaper reporting. An example of this is the use of extremely long and detailed NRSAs, such as those given below:

Mr Major warned yesterday of the dangers of Britain being left behind if a group of European Union members pushed ahead with a single currency.
(The Independent on Sunday, "Blair Puts Labour Troops on Alert for Snap Election")

Labour called last night for a streamlined Scandinavian style monarchy to banish Britain's class-ridden society.
(The Daily Mirror, "Cut the Royals Down to Size")

In both cases the reporter spells out the speech act that the original speaker is supposed to have performed (warned, called), and then goes on to provide details of the content of the utterance in the form of lengthy and complex noun phrases. Clearly, such instances are not fully accounted for by the original definition of the NRSA category, which aimed to capture those cases where little more than the speech act is provided.

They are therefore tagged as NRSAP, or 'narrator's representation of speech act with topic', as in the following example:

<sptag cat=NRSAP who=M next=NRSA whonext=B s=0.67 w=18>

However, when he invited Beatrice Hastings to come and model
for him nude early on in their affair,

<sptag cat=NRSA who=B next=N s=0.07 w=2>
Modigliani objected
<sptag cat=N next=NRS whonext=M s=0.26 w=10>
and she failed to keep the appointment. This happened twice.
(June Rose, Modigliani)

 

3.2.3 NV

Both the fictional and newspaper data contained instances of minimal speech presentation, which could not easily be accounted for by Leech and Short's categories. Consider the emboldened parts in the examples below:

"Don't you love Barrie's plays?" she asked. "I'm so fond of them". She talked on. Rampion made no comment.
(Aldous Huxley, Point Counter Point)

We spoke to vice madam Michaela Hamilton from Bullwell, Notts, who arranged girls for a Hudson orgy at the Sanam curry house in Stoke.
(The News of the World, "Hudson Fixed Sex Orgies as his Charity Fund Collapsed")

In both cases we are informed that someone engaged in verbal activity, but we are not given any explicit indication even as to what speech acts were performed, let alone what the form and content of the utterances were. In other words, we are faced with a form of speech presentation that is even more minimal, both formally and functionally, than that captured by the NRSA category, where the narrator specifies the illocutionary force of the utterance, and, possibly, its topic. We classified instances like these as Narrator's Report of Voice, and tagged them with the acronym NV.

 

3.3 The tagging process

All texts were tagged manually by the present author using a version of the Emacs text editor under the Unix operating system. All tagged texts were then checked by both Mick Short and Elena Semino; any problems were discussed in detail and any necessary changes made. Additionally, others have been involved in tagging and checking areas of the corpus, and numerous further checks have been applied in order to ensure global consistency, to enforce evolving guidelines and to identify and correct typographical and other errors. Checking and refinement of the tagging is still in progress.

 

3.4 The tagging format

Formally an sptag can be defined as follows:

<sptag cat=[tag](-[tag]) (who=[A-Z]) next=[tag](-[tag]) (whonext=[A-Z]) s=n(+n(+n)) w=n>

where elements in round brackets are optional; tag is an SW&TP tag from the tagset (e.g. N, NRS, FDS, etc); and n is a number. For example:

<sptag cat=DS who=B next=NRS whonext=B s=0.77 w=10>
'A criminal offence under the Defence of the Realm Act,'
<sptag cat=NRS who=B next=FDS whonext=C s=0.23 w=3>
I told her.

Here the tags tell us that there is a sequence of direct speech spoken by speaker B which is 10 words long (comprising 77% of the sentence), followed by a reporting clause reporting speech by speaker B which is 3 words long (comprising 23% of the sentence).
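
Because every sptag sits on a line of its own and follows the format defined above, the attributes can be pulled out with a short script. The following Python sketch is illustrative only: parse_sptag and the variable names are hypothetical, and the regular expressions assume the unquoted attribute style shown in the examples in this handbook.

```python
import re
from collections import Counter

# Opening sptag only; a closing tag such as </sptag level=1> does not match.
SPTAG = re.compile(r"<sptag\s+([^>]*)>")
ATTR = re.compile(r"(\w+)=([^\s>]+)")


def parse_sptag(line):
    """Return the attributes of an sptag line as a dict, or None."""
    match = SPTAG.search(line)
    if match is None:
        return None
    attrs = dict(ATTR.findall(match.group(1)))
    # cat and next may be portmanteau tags such as IS-IT.
    for key in ("cat", "next"):
        if key in attrs:
            attrs[key] = attrs[key].split("-")
    # s may span more than one sentence, written n+n(+n).
    if "s" in attrs:
        attrs["s"] = [float(part) for part in attrs["s"].split("+")]
    if "w" in attrs:
        attrs["w"] = int(attrs["w"])
    return attrs


sample = [
    "<sptag cat=DS who=B next=NRS whonext=B s=0.77 w=10>",
    "'A criminal offence under the Defence of the Realm Act,'",
    "<sptag cat=NRS who=B next=FDS whonext=C s=0.23 w=3>",
    "I told her.",
]

# Tally the number of words annotated under each category.
words_by_cat = Counter()
for line in sample:
    tag = parse_sptag(line)
    if tag is not None:
        words_by_cat["-".join(tag["cat"])] += tag["w"]

print(words_by_cat)
```

Keeping cat and next as lists means that a portmanteau tag can be counted either as a single ambiguous category, as in the tally here, or once under each of its components.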

 

3.5 Who's who

Where possible, speakers are explicitly named persons. However, it is sometimes necessary to attribute SW&TP to somewhat vaguer entities, such as groups of people or institutions, and sometimes the speaker is unknown. Occasionally it has been necessary to indicate the medium rather than the speaker, where this is the only information given, for example in The Prince of Wales by Jonathan Dimbleby:

F is the avalanche bulletin
P is the Sun

in the examples:

<sptag cat=eNRSAPQ-eNRWAPQ level=2 who=F s=0.16 w=10>
the avalanche bulletin warned of 'a considerable local avalanche danger'
</sptag level=1>
<sptag cat=NRW who=P next=DW whonext=P s=0.88 w=7>
On 18 March, the <hi r=it>Sun</hi> headline read 
<sptag cat=DW who=P next=NW whonext=X s=0.13+1 w=8>
'ACCUSED. Official: Charles DID cause the killer avalanche'.

First person narrators are always coded as B, and unknown speakers always as X. The main protagonists of third person narratives have also been coded as B. It may be preferable to change this so that all and only first person narrators are B.

 

3.6 Boundary problems

Where the reported speech takes the form of a nominalised clause, it is tagged as NRSAP. Independent clauses and "to" and "-ing" clauses are tagged as IS. Semantic criteria are sometimes relevant, however; such cases are flagged with a hash (#), notably all "how" clauses.

NQ
Tag the whole sentence as NQ.


N-NRS
Also relates to DS-FDS ambiguities. Where a passage is not formally a reporting clause but functions to introduce SW&TP, a note has been inserted saying 'functions as NRS'.


NRS-NRSA
Where an NRSA is syntactically embedded within an NRS, if possible it should be disentangled and tagged separately, e.g.:

<sptag cat=NRS who=B next=NRSA whonext=B s=0.25 w=5>
I put forward the view
<sptag cat=NRSA who=B next=IS whonext=B s=0.4 w=8>
which I had earlier expressed to John Wakeham,
<sptag cat=IS who=B next=NRSAP whonext=B s=0.35 w=7>
that Margaret should not be so definite.


NRSAP-NRSAPQ
Q does not necessarily have to be in quotation marks, or even formally marked at all, e.g.:
<sptag cat=N next=NRSAPQ whonext=G s=1+0.70 w=41>
When the boys came at her she attacked them with a ferocity that easily overcame their theoretical advantages of strength and size. Her gifts of war came down to her from some unknown ancestor; and though her adversaries grabbed her hair
<sptag cat=NRSAPQ who=G next=N s=0.15 w=4>
and called her Jewess


Narrative/Thought in non-fiction
In general, what appears formally to be thought presentation in non-fiction should be tagged as ambiguous between N and the relevant category of TP.


N-NV-NRSA
N words: phone, news
NV words: interview, comments, chatted, talks, speak, conversation, "a lie detector test", "cheering and singing", "Robbie Williams made sure he was never knowingly underquoted".
NRSA words: questions, quiz, request, swore, warn, greet, bust-up, blasted, complaints, threatening, "called off the search", "Police were called", "further tests were ordered", "joined the condolences in a message to...", condemned, security warnings, "urging/cheering the sides on", "crack bad jokes", "delivered a defiant message".
Not speech events: "positive media coverage", "meeting", "Richard had been permanently excluded from school".


N-NW
"I got you a flag with 'Champions' on last time,"


IS-FIS-DS
Decide
'Decide' is tagged as ambiguous between narrative and thought presentation, e.g.:

<sptag cat=N-NRT next=N-IT s=0.17 w=4>
So the Mirror decided
<sptag cat=N-IT next=NRSA s=0.83 w=20>
to confront 48-year-old OJ...

 


4. Findings

The findings from our work on the Written Corpus are available in published form (see the project's Publications page for more details), and a monograph on the project is due to be published in 2003 (Semino and Short forthcoming).

 


References

Leech, G. N. and Short, M. H. (1981) Style in Fiction. London: Longman.

Semino, E. and Short, M. (forthcoming) Corpus Stylistics: A Corpus-based Study of Speech, Writing and Thought Presentation in Narratives. London: Routledge.

Short, M., Wynne, M. and Semino, E. (1998) ‘Reading reports: discourse presentation in a corpus of narratives, with special reference to news reports.’ Anglistik & Englischunterricht. 39-65.





Last updated:
Dan McIntyre