Researchers on both sides of the Atlantic have recently turned their attention to converting corpora of unstructured texts, such as books, official papers or newspapers, into a format suitable for analysis within a GIS. Most of these approaches use automated language processing techniques to identify and extract proper nouns from the text, and then use gazetteers and filters to narrow these down so that only place-names remain. The gazetteer then provides the coordinates that allow the GIS to be created. This work has demonstrated that the approach is feasible. The challenge, however, is not simply to geo-reference texts in this way but also to develop techniques that give us a better understanding of the geographies within them and of the realities that underlie these geographies.
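The geo-referencing step described above can be illustrated with a minimal sketch. It is not the method of any particular project: a regular expression for capitalised words stands in for proper-noun detection, and the gazetteer is a hypothetical three-entry dictionary with approximate, illustrative coordinates. The names GAZETTEER, CANDIDATE and georeference are assumptions introduced for this example only.

```python
import re

# Hypothetical toy gazetteer: place-name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Keswick": (54.60, -3.13),
    "Ambleside": (54.43, -2.96),
    "Grasmere": (54.46, -3.02),
}

# Crude stand-in for proper-noun detection: any single capitalised word.
# A real pipeline would use a part-of-speech tagger or named-entity
# recogniser, and would also handle multi-word names.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+\b")


def georeference(text):
    """Return (name, lat, lon, char_offset) for each candidate in the gazetteer."""
    records = []
    for match in CANDIDATE.finditer(text):
        name = match.group()
        # The gazetteer acts as the filter: candidates that are not
        # known place-names (e.g. "We", "Mary") are discarded.
        if name in GAZETTEER:
            lat, lon = GAZETTEER[name]
            records.append((name, lat, lon, match.start()))
    return records


sample = "We left Keswick at dawn and reached Grasmere by noon. Mary was tired."
records = georeference(sample)
for record in records:
    print(record)
```

Each record pairs a place-name with coordinates and its position in the text, which is the minimum needed to load the mentions into a GIS as a point layer. Ambiguous names (places that share a name, or place-names that are also personal names) are not handled here and would need the additional filters mentioned above.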
By integrating corpus linguistics and spatial technologies, we are developing exploratory and formal methods to bring texts into GIS and carry out spatial analysis on them.
These techniques now allow us to ask questions such as "what places is this corpus talking about?", "what is being said about different places?", and "how has the way that places are represented in the corpus changed over time?" The techniques will be piloted on a range of sources including literature, guide books, newspapers, and government reports. They will also be applicable to a wide range of other sources, including 'born-digital' material and e-resources, potentially including Google Books. We will therefore allow GIS to be applied to a far wider range of sources and infrastructures than is currently possible.
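As a rough illustration of how the three questions above might be approached computationally, the following sketch counts place-name mentions per year and collects the words that occur near each place-name, a simple form of the collocation analysis used in corpus linguistics. The place list, stopword list, sample documents and the function place_profile are all hypothetical, and a fixed token window is only one of several possible ways to define "near".

```python
import re
from collections import Counter, defaultdict

PLACES = {"Keswick", "Grasmere"}  # hypothetical place-name list
STOPWORDS = {"the", "a", "of", "and", "in", "is", "was", "to", "at", "with"}


def tokenize(text):
    """Split text into alphabetic tokens, keeping the original case."""
    return re.findall(r"[A-Za-z]+", text)


def place_profile(docs, window=3):
    """For (year, text) pairs, count mentions per place per year and
    count the words found within `window` tokens of each place-name."""
    mentions = defaultdict(Counter)    # place -> Counter(year -> mentions)
    collocates = defaultdict(Counter)  # place -> Counter(word -> count)
    for year, text in docs:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token not in PLACES:
                continue
            mentions[token][year] += 1
            # The window ignores sentence boundaries in this sketch.
            nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            for word in nearby:
                lowered = word.lower()
                if lowered not in STOPWORDS and word not in PLACES:
                    collocates[token][lowered] += 1
    return mentions, collocates


# Invented sample sentences, for illustration only.
docs = [
    (1800, "Keswick is a wild and gloomy town."),
    (1850, "Keswick was crowded with tourists. Grasmere was quiet."),
]
mentions, collocates = place_profile(docs)
print(dict(mentions["Keswick"]))
print(collocates["Keswick"].most_common())
```

Here the keys of mentions address which places the corpus talks about, collocates gives a first indication of what is being said about each place, and the per-year counts allow changes over time to be compared. Joining these tables to gazetteer coordinates would let the results be mapped within a GIS.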