Since the 1950s, research in the field of corpus linguistics has developed an array of methodologies for the analysis of both the linguistic form and the content of very large collections of texts. Corpora ranging from the very small (tens of thousands of words) to the very large (hundreds of millions or billions of running words) have been constructed and approached through techniques such as:
- Concordances - extracting examples of usage in context
- Collocations - statistical analysis of co-occurrence patterns among words or categories of words
- Key words and key items - significantly frequent items in one dataset compared to a reference dataset
- Part-of-speech annotation - grammatical labelling of the words in a corpus
- Semantic tagging - automatic grouping of words into categories based on meaning
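To make the first of these techniques concrete, a concordance can be sketched in a few lines of Python. This is a minimal illustration only, not the software used in the project: the function name `kwic`, the `width` parameter, and the sample sentence are all invented for the example, and tokenisation is reduced to a simple regular expression.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, node word, right context) for each hit.

    Illustrative sketch: tokens are runs of word characters, and matching
    is case-insensitive. Real concordancers use far richer tokenisation.
    """
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample text, used only to show the shape of the output.
sample = "Cholera spread through London in 1854. Many fled London for the coast."
lines = kwic(sample, "London", width=2)
for left, node, right in lines:
    print(f"{left:>20} | {node} | {right}")
```

Each hit is shown with a fixed number of words on either side, which is the "keyword in context" layout that analysts read through when interpreting usage.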
Many of these techniques are fundamentally quantitative, in that their outputs are generated purely by statistical processing of corpus data. However, quantitative and qualitative approaches are closely intertwined in corpus linguistics: the analyst interprets quantitative results in a qualitative fashion, and qualitative statements are in turn always formulated in light of the available quantitative data.
The application of corpus-based methods has led to dramatic advances in fields such as lexicography, descriptive grammar, language teaching, and literary stylistics. But to date, relatively little work has sought to add a spatial dimension to corpus analysis - despite the clear coherence of the corpus-based approach with the ideas underlying the field of Geographical Information Systems (GIS). In this project, we are working to bridge that gap.
In particular, we see three techniques of corpus linguistics as key to a successful integration of corpus data into GIS analysis:
First, automated grammatical analysis (popularly known as part-of-speech tagging) allows all instances of place-names in the texts of a corpus to be identified. This is one of a range of techniques associated with the field of named entity recognition. The resulting data, when geo-referenced, provides the basis of a GIS - allowing the spatial scope of the content of the corpus to be visualised.
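The step from tagged text to geo-referenced data might look roughly like the sketch below. It rests on several assumptions that are not part of the project description: the input is a list of (token, tag) pairs, the tag `NP` stands for a proper noun, and `GAZETTEER` is a tiny hand-made lookup table with approximate coordinates. A real system would rely on an actual tagger and a full gazetteer.

```python
# Hypothetical tagger output: (token, tag) pairs. The tag "NP" is assumed
# here to mark proper nouns; real tagsets differ.
tagged = [
    ("Cholera", "NN"), ("reached", "VBD"), ("London", "NP"),
    ("and", "CC"), ("Edinburgh", "NP"), ("said", "VBD"), ("Smith", "NP"),
]

# Hypothetical mini-gazetteer: place-name -> approximate (latitude, longitude).
GAZETTEER = {
    "London": (51.5074, -0.1278),
    "Edinburgh": (55.9533, -3.1883),
}

def georeference(tagged_tokens, gazetteer):
    """Keep proper nouns found in the gazetteer and attach coordinates."""
    points = []
    for position, (token, tag) in enumerate(tagged_tokens):
        if tag == "NP" and token in gazetteer:
            lat, lon = gazetteer[token]
            points.append(
                {"place": token, "position": position, "lat": lat, "lon": lon}
            )
    return points

points = georeference(tagged, GAZETTEER)
for p in points:
    print(p)
```

Note that "Smith" is tagged as a proper noun but dropped because it is not in the gazetteer; the grammatical tag narrows the candidates, and the gazetteer lookup decides which of them are places. Keeping the token position lets each mapped point be traced back to its context in the corpus.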
Second, collocation analysis allows us to examine on a very large scale what words and topics are being discussed in relation to different instances of place-names in a corpus. It is a common finding of corpus linguistics that collocation is utterly pervasive in language: the meaning of every word depends to some extent on the words and patterns with which it characteristically co-occurs. But to date, collocation analysis has not been linked to spatial visualisation.
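As a rough illustration of collocation around a place-name, the sketch below simply counts the words that fall within a fixed window of each mention. The function name `place_collocates`, the `window` parameter, and the sample text are assumptions made for the example. Raw counts stand in here for the association statistics that a full collocation analysis would apply.

```python
import re
from collections import Counter

def place_collocates(text, place, window=4):
    """Count words occurring within `window` tokens of each mention of `place`.

    Illustrative only: raw co-occurrence counts, with no significance
    testing and no filtering of function words.
    """
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    target = place.lower()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Invented sample text.
sample = "Fever raged in London. The fever in London killed many. Edinburgh stayed calm."
counts = place_collocates(sample, "London", window=2)
print(counts.most_common(3))
```

Running the same count for every geo-referenced place-name would give one collocate profile per location, which is the kind of data that could then be attached to points on a map.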
Third, semantic tagging allows us to operate the collocational analysis at a higher level of generality. Instead of just looking at the words that collocate with a place-name such as London or Edinburgh, we can specify a topic-category such as War or Disease and identify all places discussed in relation to any word tagged as relating to that topic. In this way, spatial analysis of specific subject domains can be undertaken.
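The idea of querying by topic-category rather than by individual word can be sketched as follows. Everything named here is hypothetical: `SEMANTIC_LEXICON` is a toy word-to-category table standing in for the output of a semantic tagger, `PLACES` is a toy set of place-names, and the sample text is invented.

```python
import re
from collections import defaultdict

# Toy stand-in for semantic tagger output: word -> topic-category.
SEMANTIC_LEXICON = {
    "cholera": "Disease", "fever": "Disease",
    "siege": "War", "battle": "War",
}

# Toy set of known place-names (lower-cased).
PLACES = {"london", "edinburgh"}

def places_by_topic(text, topic, window=4):
    """Count, per place, nearby words whose semantic category is `topic`."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    hits = defaultdict(int)
    for i, tok in enumerate(tokens):
        if tok in PLACES:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            hits[tok] += sum(1 for w in context if SEMANTIC_LEXICON.get(w) == topic)
    # Keep only places with at least one topic word nearby.
    return {place: n for place, n in hits.items() if n > 0}

# Invented sample text.
sample = (
    "Cholera struck London in summer. "
    "The battle near Edinburgh ended quickly that year. "
    "Fever returned to London."
)
disease_places = places_by_topic(sample, "Disease", window=3)
war_places = places_by_topic(sample, "War", window=3)
print(disease_places, war_places)
```

Here "cholera" and "fever" are different words, but both count towards London under the Disease category, which is the gain in generality that semantic tagging offers over word-level collocation.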
