47 research outputs found
From digital resources to historical scholarship with the British Library 19th Century Newspaper Collection
It is increasingly acknowledged that the Digital Humanities have placed too much emphasis on data creation and that the major priority should be turning digital sources into contributions to knowledge. While this sounds relatively simple, doing it involves intermediate stages of research that enhance digital sources, develop new methodologies and explore their potential to generate new knowledge from the source. While these stages are familiar in the social sciences they are less so in the humanities. In this paper we explore these stages based on research on the British Library’s Nineteenth Century Newspaper Collection, a corpus of many billion words that has much to offer to our understanding of the nineteenth century but whose size and complexity makes it difficult to work with
Suomenkielisen geojäsentimen kehittäminen: kuinka hankkia sijaintitietoa jäsentelemättömistä tekstiaineistoista
Alati enemmän aineistoa tuotetaan ja jaetaan internetin kautta. Aineistot ovat vaihtelevia muodoiltaan, kuten verkkoartikkelien ja sosiaalisen media julkaisujen kaltaiset digitaaliset tekstit, ja niillä on usein spatiaalinen ulottuvuus. Teksteissä geospatiaalisuutta ilmaistaan paikannimien kautta, mutta tavanomaisilla paikkatietomenetelmillä ei kyetä käsittelemään tietoa epätäsmällisessä kielellisessä asussaan. Tämä on luonut tarpeen muuntaa tekstimuotoisen sijaintitiedon näkyvään muotoon, koordinaateiksi. Ongelmaa ratkaisemaan on kehitetty geojäsentimiä, jotka tunnistavat ja paikantavat paikannimet vapaista teksteistä, ja jotka oikein toimiessaan voisivat toimia paikkatiedon lähteenä maantieteellisessä tutkimuksessa. Geojäsentämistä onkin sovellettu katastrofihallinnasta kirjallisuudentutkimukseen. Merkittävässä osassa geojäsentämisen tutkimusta tutkimusaineiston kielenä on ollut englanti ja geojäsentimetkin ovat kielikohtaisia – tämä jättää pimentoon paitsi geojäsentimien kehitykseen vaikuttavat havainnot pienemmistä kielistä myös kyseisten kielten puhujien näkemykset.
Maisterintutkielmassani pyrin vastaamaan kolmeen tutkimuskysymykseen: Mitkä ovat edistyneimmät geojäsentämismenetelmät? Mitkä kielelliset ja maantieteelliset monitulkintaisuudet vaikeuttavat tämän monitahoisen ongelman ratkaisua? Ja miten arvioida geojäsentimien luotettavuutta ja käytettävyyttä? Tutkielman soveltavassa osuudessa esittelen Fingerin, geojäsentimen suomen kielelle, ja kuvaan sen kehitystä sekä suorituskyvyn arviointia. Arviointia varten loin kaksi testiaineistoa, joista toinen koostuu Twitter-julkaisuista ja toinen uutisartikkeleista. Finger-geojäsennin, testiaineistot ja relevantit ohjelmakoodit jaetaan avoimesti.
Geojäsentäminen voidaan jakaa kahteen alitehtävään: paikannimien tunnistamiseen tekstivirrasta ja paikannimien ratkaisemiseen oikeaan koordinaattipisteeseen mahdollisesti useasta kandidaatista. Molemmissa vaiheissa uusimmat metodit nojaavat syväoppimismalleihin ja -menetelmiin, joiden syötteinä ovat sanaupotusten kaltaiset vektorit. Geojäsentimien suoriutumista testataan aineistoilla, joissa paikannimet ja niiden koordinaatit tiedetään. Mittatikkuna tunnistamisessa on vastaavuus ja ratkaisemisessa etäisyys oikeasta sijainnista.
Finger käyttää paikannimitunnistinta, joka hyödyntää suomenkielistä BERT-kielimallia, ja suoraviivaista tietokantahakua paikannimien ratkaisemiseen. Ohjelmisto tuottaa taulukkomuotoiseksi jäsenneltyä paikkatietoa, joka sisältää syötetekstit ja niistä mahdollisesti tunnistetut paikannimet koordinaattisijainteineen. Testiaineistot eroavat aihepiireiltään, mutta Finger suoriutuu niillä likipitäen samoin, ja suoriutuu englanninkielisillä aineistoilla tehtyihin arviointeihin suhteutettuna kelvollisesti. Virheanalyysi paljastaa useita virhelähteitä, jotka johtuvat kielten tai maantieteellisen todellisuuden luontaisesta epäselvyydestä tai ovat prosessoinnin aiheuttamia, kuten perusmuotoistamisvirheet.
Kaikkia osia Fingerissä voidaan parantaa, muun muassa kehittämällä kielellistä käsittelyä pidemmälle ja luomalla kattavampia testiaineistoja. Samoin tulevaisuuden geojäsentimien tulee kyetä käsittelemään monimutkaisempia kielellisiä ja maantieteellisiä kuvaustapoja kuin pelkät paikannimet ja koordinaattipisteet. Finger ei nykymuodossaan tuota valmista paikkatietoa, jota kannattaisi kritiikittä käyttää. Se on kuitenkin lupaava ensiaskel suomen kielen geojäsentimille ja astinlauta vastaisuuden soveltavalle tutkimukselle.Ever more data is available and shared through the internet. The big data masses often have a spatial dimension and can take many forms, one of which are digital texts, such as articles or social media posts. The geospatial links in these texts are made through place names, also called toponyms, but traditional GIS methods are unable to deal with the fuzzy linguistic information. This creates the need to transform the linguistic location information to an explicit coordinate form. Several geoparsers have been developed to recognize and locate toponyms in free-form texts: the task of these systems is to be a reliable source of location information. Geoparsers have been applied to topics ranging from disaster management to literary studies. Major language of study in geoparser research has been English and geoparsers tend to be language-specific, which threatens to leave the experiences provided by studying and expressed in smaller languages unexplored.
This thesis seeks to answer three research questions related to geoparsing: What are the most advanced geoparsing methods? What linguistic and geographical features complicate this multi-faceted problem? And how to evaluate the reliability and usability of geoparsers? The major contributions of this work are an open-source geoparser for Finnish texts, Finger, and two test datasets, or corpora, for testing Finnish geoparsers. One of the datasets consists of tweets and the other of news articles. All of these resources, including the relevant code for acquiring the test data and evaluating the geoparser, are shared openly.
Geoparsing can be divided into two sub-tasks: recognizing toponyms amid text flows and resolving them to the correct coordinate location. Both tasks have seen a recent turn to deep learning methods and models, where the input texts are encoded as, for example, word embeddings. Geoparsers are evaluated against gold standard datasets where toponyms and their coordinates are marked. Performance is measured on equivalence and distance-based metrics for toponym recognition and resolution respectively.
Finger uses a toponym recognition classifier built on a Finnish BERT model and a simple gazetteer query to resolve the toponyms to coordinate points. The program outputs structured geodata, with input texts and the recognized toponyms and coordinate locations. While the datasets represent different text types in terms of formality and topics, there is little difference in performance when evaluating Finger against them. The overall performance is comparable to the performance of geoparsers of English texts. Error analysis reveals multiple error sources, caused either by the inherent ambiguousness of the studied language and the geographical world or are caused by the processing itself, for example by the lemmatizer.
Finger can be improved in multiple ways, such as refining how it analyzes texts and creating more comprehensive evaluation datasets. Similarly, the geoparsing task should move towards more complex linguistic and geographical descriptions than just toponyms and coordinate points. Finger is not, in its current state, a ready source of geodata. However, the system has potential to be the first step for geoparsers for Finnish and it can be a steppingstone for future applied research
Further frontiers in GIS: Extending Spatial Analysis to Textual Sources in Archaeology
Although the use of Geographic Information Systems (GIS) has a long history in archaeology, spatial technologies have been rarely used to analyse the content of textual collections. A newly developed approach termed Geographic Text Analysis (GTA) is now allowing the semi-automated exploration of large corpora incorporating a combination of Natural Language Processing techniques, Corpus Linguistics, and GIS. In this article we explain the development of GTA, propose possible uses of this methodology in the field of archaeology, and give a summary of the challenges that emerge from this type of analysis.The research leading to these results has received funding from the European Research Council (ERC) under the European Union’s Seventh Framework Programme (FP7/2007-2013) / ERC grant “Spatial Humanities: Texts, GIS, places” (agreement number 283850)
Qualitative Geographies in Digital Texts:Representing historical spatial identities in the Lake District
Techniques for extracting place names (toponyms) from texts and using them to conduct analyses of the geographies within the texts are becoming reasonably well established. These are generally referred to as Geographical Text Analysis (GTA) and allow us to ask questions about the geographies within a corpus. The limitation of this approach is that the geographies that can be uncovered are solely associated with toponyms for which a coordinate-based location can be assigned. While this method is valuable, it is effectively a quantitative representation of the geographies associated with named places. Other representations of geography are ignored. To complement GTA, we need to develop techniques that are capable of representing the more qualitative representations of geography that are found within texts. Drawing on the Corpus of Lake District Writing, this paper presents some initial ideas about how this can be achieved, primarily by using techniques from corpus linguistics
Ensemble Named Entity Recognition (NER):Evaluating NER Tools in the Identification of Place Names in Historical Corpora
The field of Spatial Humanities has advanced substantially in the past years. The identification and extraction of toponyms and spatial information mentioned in historical text collections has allowed its use in innovative ways, making possible the application of spatial analysis and the mapping of these places with geographic information systems. For instance, automated place name identification is possible with Named Entity Recognition (NER) systems. Statistical NER methods based on supervised learning, in particular, are highly successful with modern datasets. However, there are still major challenges to address when dealing with historical corpora. These challenges include language changes over time, spelling variations, transliterations, OCR errors, and sources written in multiple languages among others. In this article, considering a task of place name recognition over two collections of historical correspondence, we report an evaluation of five NER systems and an approach that combines these through a voting system. We found that although individual performance of each NER system was corpus dependent, the ensemble combination was able to achieve consistent measures of precision and recall, outperforming the individual NER systems. In addition, the results showed that these NER systems are not strongly dependent on preprocessing and translation to Modern English
Recommended from our members
Where are you talking about? Advances and Challenges of Geographic Analysis of Text with Application to Disease Monitoring
The Natural Language Processing task we focus on in this thesis is Geoparsing. Geoparsing is the process of extraction and grounding of toponyms (place names). Consider this sentence: "The victims of the Spanish earthquake off the coast of Malaga were of American and Mexican origin." Four toponyms will be extracted (called Geotagging) and grounded to their geographic coordinates (called Toponym Resolution). However, our research goes further than any previous work by showing how to distinguish the literal place(s) of the event (Spain, Malaga) from other linguistic types/uses such as nationalities (Mexican, American), improving downstream task accuracy. We consolidate and extend the Standard Evaluation Framework, discuss key research problems, then present concrete solutions in order to advance each stage of geoparsing. For geotagging, as well as training a SOTA neural Location-NER tagger, we simplify Metonymy Resolution with a novel minimalist feature extraction combined with an LSTM-based classifier, matching SOTA results. For toponym resolution, we deploy the latest deep learning methods to achieve SOTA performance by augmenting neural models with hitherto unused geographic features called Map Vectors. With each research project, we provide high-quality datasets and system prototypes, further building resources in this field. We then show how these geoparsing advances coupled with our proposed Intra-Document Analysis can be used to associate news articles with locations in order to monitor the spread of public health threats. To this end, we evaluate our research contributions with production data from a real-time downstream application to improve geolocation of news events for disease monitoring. The data was made available to us by the Joint Research Centre (JRC), which operates one such system called MediSys that processes incoming news articles in order to monitor threats to public health and make these available to a variety of governmental, business and non-profit organisations. We also discuss steps towards an end-to-end, automated news monitoring system and make actionable recommendations for future work. In summary, the thesis aims are twofold: (1) Generate original geoparsing research aimed at advancing each stage of the pipeline by addressing pertinent challenges with concrete solutions and actionable proposals. (2) Demonstrate how this research can be applied to news event monitoring to increase the efficacy of existing biosurveillance systems, e.g. European Commission’s MediSys.I was generously funded by DREAM CDT, which was funded by NERC of UKRI
Chapter 12 Landscape Appreciation in the English Lake District
"The relational complexity of urban and rural landscapes in space and in time.
The development of historical geographical information systems (HGIS) and other methods from the digital humanities have revolutionised historical research on cultural landscapes. Additionally, the opening up of increasingly diverse collections of source material, often incomplete and difficult to interpret, has led to methodologically innovative experiments. One of today’s major challenges, however, concerns the concepts and tools to be deployed for mapping processes of transformation—that is, interpreting and imagining the relational complexity of urban and rural landscapes, both in space and in time, at micro- and macro-scale.
Mapping Landscapes in Transformation gathers experts from different disciplines, active in the fields of historical geography, urban and landscape history, archaeology and heritage conservation. They are specialised in a wide variety of space-time contexts, including regions within Europe, Asia, and the Americas, and periods from antiquity to the 21st century.
Uncovering Environmental Change in the English Lake District:Using Computational Techniques to Trace Shifting Practice in the Historical Documentation of Flora
There is a lack of concrete knowledge about floristic change in Britain before the mid-twentieth century. Relevant evidence is available, but it is principally contained in disparate historical sources, and this is a major impediment to further research. In this article we demonstrate how these sources can be efficiently collated and analysed through the implementation of state-of-the-art computational-linguistic and historical-GIS techniques. We do so through a case study which focuses on the floristic history of the English Lake District. This region has been selected because of its outstanding cultural and environmental value and because it has been extensively and continuously documented since the late seventeenth century. We outline how Natural language processing (NLP) techniques can be integrated with Kew’s Plants of the World Online (POWO) database to enable temporal shifts in plant-naming conventions to be more accurately traced across a heterogeneous corpus of texts published between 1682 and 1904. Through collocate analysis and automated geoparsing techniques, the geographies associated with these plant names is then identified and extracted. Finally, we use geographic information systems (GIS) to demonstrate the potential of this dataset for geo-temporal analysis and for revealing the historical distribution of Lake District flora. In outlining our methodology, this article indicates how the spatial and digital humanities can benefit research both in environmental history and in the environmental sciences more widely
What’s missing in geographical parsing?
Geographical data can be obtained by converting place names from free-format text into geographical coordinates. The ability to geo-locate events in textual reports represents a valuable source of information in many real-world applications such as emergency responses, real-time social media geographical event analysis, understanding location instructions in auto-response systems and more. However, geoparsing is still widely regarded as a challenge because of domain language diversity, place name ambiguity, metonymic language and limited leveraging of context as we show in our analysis. Results to date, whilst promising, are on laboratory data and unlike in wider NLP are often not cross-compared. In this study, we evaluate and analyse the performance of a number of leading geoparsers on a number of corpora and highlight the challenges in detail. We also publish an automatically geotagged Wikipedia corpus to alleviate the dearth of (open source) corpora in this domain.We gratefully acknowledge the funding support of the Natural Environment Research Council (NERC) Ph.D. Studentship NE/M009009/1 (MG) and EPSRC (NC and NL: Grant No. EP/M005089/1