296 research outputs found

    Suomenkielisen geojäsentimen kehittäminen: kuinka hankkia sijaintitietoa jäsentelemättömistä tekstiaineistoista

    Get PDF
    Alati enemmän aineistoa tuotetaan ja jaetaan internetin kautta. Aineistot ovat vaihtelevia muodoiltaan, kuten verkkoartikkelien ja sosiaalisen media julkaisujen kaltaiset digitaaliset tekstit, ja niillä on usein spatiaalinen ulottuvuus. Teksteissä geospatiaalisuutta ilmaistaan paikannimien kautta, mutta tavanomaisilla paikkatietomenetelmillä ei kyetä käsittelemään tietoa epätäsmällisessä kielellisessä asussaan. Tämä on luonut tarpeen muuntaa tekstimuotoisen sijaintitiedon näkyvään muotoon, koordinaateiksi. Ongelmaa ratkaisemaan on kehitetty geojäsentimiä, jotka tunnistavat ja paikantavat paikannimet vapaista teksteistä, ja jotka oikein toimiessaan voisivat toimia paikkatiedon lähteenä maantieteellisessä tutkimuksessa. Geojäsentämistä onkin sovellettu katastrofihallinnasta kirjallisuudentutkimukseen. Merkittävässä osassa geojäsentämisen tutkimusta tutkimusaineiston kielenä on ollut englanti ja geojäsentimetkin ovat kielikohtaisia – tämä jättää pimentoon paitsi geojäsentimien kehitykseen vaikuttavat havainnot pienemmistä kielistä myös kyseisten kielten puhujien näkemykset. Maisterintutkielmassani pyrin vastaamaan kolmeen tutkimuskysymykseen: Mitkä ovat edistyneimmät geojäsentämismenetelmät? Mitkä kielelliset ja maantieteelliset monitulkintaisuudet vaikeuttavat tämän monitahoisen ongelman ratkaisua? Ja miten arvioida geojäsentimien luotettavuutta ja käytettävyyttä? Tutkielman soveltavassa osuudessa esittelen Fingerin, geojäsentimen suomen kielelle, ja kuvaan sen kehitystä sekä suorituskyvyn arviointia. Arviointia varten loin kaksi testiaineistoa, joista toinen koostuu Twitter-julkaisuista ja toinen uutisartikkeleista. Finger-geojäsennin, testiaineistot ja relevantit ohjelmakoodit jaetaan avoimesti. Geojäsentäminen voidaan jakaa kahteen alitehtävään: paikannimien tunnistamiseen tekstivirrasta ja paikannimien ratkaisemiseen oikeaan koordinaattipisteeseen mahdollisesti useasta kandidaatista. Molemmissa vaiheissa uusimmat metodit nojaavat syväoppimismalleihin ja -menetelmiin, joiden syötteinä ovat sanaupotusten kaltaiset vektorit. Geojäsentimien suoriutumista testataan aineistoilla, joissa paikannimet ja niiden koordinaatit tiedetään. Mittatikkuna tunnistamisessa on vastaavuus ja ratkaisemisessa etäisyys oikeasta sijainnista. Finger käyttää paikannimitunnistinta, joka hyödyntää suomenkielistä BERT-kielimallia, ja suoraviivaista tietokantahakua paikannimien ratkaisemiseen. Ohjelmisto tuottaa taulukkomuotoiseksi jäsenneltyä paikkatietoa, joka sisältää syötetekstit ja niistä mahdollisesti tunnistetut paikannimet koordinaattisijainteineen. Testiaineistot eroavat aihepiireiltään, mutta Finger suoriutuu niillä likipitäen samoin, ja suoriutuu englanninkielisillä aineistoilla tehtyihin arviointeihin suhteutettuna kelvollisesti. Virheanalyysi paljastaa useita virhelähteitä, jotka johtuvat kielten tai maantieteellisen todellisuuden luontaisesta epäselvyydestä tai ovat prosessoinnin aiheuttamia, kuten perusmuotoistamisvirheet. Kaikkia osia Fingerissä voidaan parantaa, muun muassa kehittämällä kielellistä käsittelyä pidemmälle ja luomalla kattavampia testiaineistoja. Samoin tulevaisuuden geojäsentimien tulee kyetä käsittelemään monimutkaisempia kielellisiä ja maantieteellisiä kuvaustapoja kuin pelkät paikannimet ja koordinaattipisteet. Finger ei nykymuodossaan tuota valmista paikkatietoa, jota kannattaisi kritiikittä käyttää. Se on kuitenkin lupaava ensiaskel suomen kielen geojäsentimille ja astinlauta vastaisuuden soveltavalle tutkimukselle.Ever more data is available and shared through the internet. The big data masses often have a spatial dimension and can take many forms, one of which are digital texts, such as articles or social media posts. The geospatial links in these texts are made through place names, also called toponyms, but traditional GIS methods are unable to deal with the fuzzy linguistic information. This creates the need to transform the linguistic location information to an explicit coordinate form. Several geoparsers have been developed to recognize and locate toponyms in free-form texts: the task of these systems is to be a reliable source of location information. Geoparsers have been applied to topics ranging from disaster management to literary studies. Major language of study in geoparser research has been English and geoparsers tend to be language-specific, which threatens to leave the experiences provided by studying and expressed in smaller languages unexplored. This thesis seeks to answer three research questions related to geoparsing: What are the most advanced geoparsing methods? What linguistic and geographical features complicate this multi-faceted problem? And how to evaluate the reliability and usability of geoparsers? The major contributions of this work are an open-source geoparser for Finnish texts, Finger, and two test datasets, or corpora, for testing Finnish geoparsers. One of the datasets consists of tweets and the other of news articles. All of these resources, including the relevant code for acquiring the test data and evaluating the geoparser, are shared openly. Geoparsing can be divided into two sub-tasks: recognizing toponyms amid text flows and resolving them to the correct coordinate location. Both tasks have seen a recent turn to deep learning methods and models, where the input texts are encoded as, for example, word embeddings. Geoparsers are evaluated against gold standard datasets where toponyms and their coordinates are marked. Performance is measured on equivalence and distance-based metrics for toponym recognition and resolution respectively. Finger uses a toponym recognition classifier built on a Finnish BERT model and a simple gazetteer query to resolve the toponyms to coordinate points. The program outputs structured geodata, with input texts and the recognized toponyms and coordinate locations. While the datasets represent different text types in terms of formality and topics, there is little difference in performance when evaluating Finger against them. The overall performance is comparable to the performance of geoparsers of English texts. Error analysis reveals multiple error sources, caused either by the inherent ambiguousness of the studied language and the geographical world or are caused by the processing itself, for example by the lemmatizer. Finger can be improved in multiple ways, such as refining how it analyzes texts and creating more comprehensive evaluation datasets. Similarly, the geoparsing task should move towards more complex linguistic and geographical descriptions than just toponyms and coordinate points. Finger is not, in its current state, a ready source of geodata. However, the system has potential to be the first step for geoparsers for Finnish and it can be a steppingstone for future applied research

    Periods, Organized (PeriodO): A gazetteer of period assertions for linking and visualizing periodized data

    Get PDF
    The PeriodO project seeks to create an online gazetteer of authoritative assertions about the chronological and geographic extent of historical and archaeological periods. Starting with a trial dataset related to Classical antiquity, this gazetteer will combine period thesauri used by museums and cultural heritage bodies with published assertions about the dates and locations of periods in authoritative print sources. These assertions will be modeled in a Linked Data format (JSON-LD, a serialization of RDF). They will be given Uniform Resource Identifiers (URIs) and served from a public GitHub repository, where they can act as a shared reference point to describe data in datasets with periodized information. We will also create a search and visualization tool to view the temporal and geographic extent of an assertion and compare it with others. Authoritative users will be able to add their own period assertions

    Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

    Get PDF
    Named entity recognition (NER), search, classification and tagging of names and name like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent and also used entity categories vary [1]. The most general set of named entities is usually some version of three partite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data out of a digitized Finnish historical newspaper collection Digi. Digi collection contains 1,960,921 pages of newspaper material from years 1771– 1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 74–75 % [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We show also results of limited category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of the Aalto University. FiNER is able to achieve up to 60.0 F-score with named entities in the evaluation data. Seco’s tools achieve 30.0–60.0 F-score with locations and persons. Performance of FiNER and SeCo’s tools with the data shows that at best about half of named entities can be recognized even in a quite erroneous OCRed textNamed entity recognition (NER), search, classification and tagging of names and name like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent and also used entity categories vary [1]. The most general set of named entities is usually some version of three partite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data out of a digitized Finnish historical newspaper collection Digi. Digi collection contains 1,960,921 pages of newspaper material from years 1771– 1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 74–75 % [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We show also results of limited category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of the Aalto University. FiNER is able to achieve up to 60.0 F-score with named entities in the evaluation data. Seco’s tools achieve 30.0–60.0 F-score with locations and persons. Performance of FiNER and SeCo’s tools with the data shows that at best about half of named entities can be recognized even in a quite erroneous OCRed text.Peer reviewe

    “All the world’s a stage”: A GIS framework for recreating personal time-space from qualitative and quantitative sources

    Get PDF
    This article presents a methodological model for the study of the space‐time patterns of everyday life. The framework utilizes a wide range of qualitative and quantitative sources to create two environmental stages, social and built, which place and contextualize the daily mobilities of individuals as they traverse urban environments. Additionally, this study outlines a procedure to fully integrate narrative sources in a GIS. By placing qualitative sources, such as narratives, within a stage‐based GIS, researchers can begin to tell rich spatial stories about the lived experiences of segregation, social interaction, and environmental exposure. The article concludes with a case study utilizing the diary of a postal clerk to outline the wide applicability of this model for space‐time GIS research

    Geospatial Analysis and Modeling of Textual Descriptions of Pre-modern Geography

    Get PDF
    Textual descriptions of pre-modern geography offer a different view of classical geography. The descriptions have been produced when none of the modern geographical concepts and tools were available. In this dissertation, we study pre-modern geography by primarily finding the existing structures of the descriptions and different cases of geographical data. We first explain four major geographical cases in pre-modern Arabic sources: gazetteer, administrative hierarchies, routes, and toponyms associated with people. Focusing on hierarchical divisions and routes, we offer approaches for manual annotation of administrative hierarchies and route sections as well as a semi-automated toponyms annotation. The latter starts with a fuzzy search of toponyms from an authority list and applies two different extrapolation models to infer true or false values, based on the context, for disambiguating the automatically annotated toponyms. Having the annotated data, we introduce mathematical models to shape and visualize regions based on the description of administrative hierarchies. Moreover, we offer models for comparing hierarchical divisions and route networks from different sources. We also suggest approaches to approximate geographical coordinates for places that do not have geographical coordinates - we call them unknown places - which is a major issue in visualization of pre-modern places on map. The final chapter of the dissertation introduces the new version of al-Ṯurayyā, a gazetteer and a spatial model of the classical Islamic world using georeferenced data of a pre-modern atlas with more than 2, 000 toponyms and routes. It offers search, path finding, and flood network functionalities as well as visualizations of regions using one of the models that we describe for regions. However the gazetteer is designed using the classical Islamic world data, the spatial model and features can be used for similarly prepared datasets.:1 Introduction 1 2 Related Work 8 2.1 GIS 8 2.2 NLP, Georeferencing, Geoparsing, Annotation 10 2.3 Gazetteer 15 2.4 Modeling 17 3 Classical Geographical Cases 20 3.1 Gazetteer 21 3.2 Routes and Travelogues 22 3.3 Administrative Hierarchy 24 3.4 Geographical Aspects of Biographical Data 25 4 Annotation and Extraction 27 4.1 Annotation 29 4.1.1 Manual Annotation of Geographical Texts 29 4.1.1.1 Administrative Hierarchy 30 4.1.1.2 Routes and Travelogues 32 4.1.2 Semi-Automatic Toponym Annotation 34 4.1.2.1 The Annotation Process 35 4.1.2.2 Extrapolation Models 37 4.1.2.2.1 Frequency of Toponymic N-grams 37 4.1.2.2.2 Co-occurrence Frequencies 38 4.1.2.2.3 A Supervised ML Approach 40 4.1.2.3 Summary 45 4.2 Data Extraction and Structures 45 4.2.1 Administrative Hierarchy 45 4.2.2 Routes and Distances 49 5 Modeling Geographical Data 51 5.1 Mathematical Models for Administrative Hierarchies 52 5.1.1 Sample Data 53 5.1.2 Quadtree 56 5.1.3 Voronoi Diagram 58 5.1.4 Voronoi Clippings 62 5.1.4.1 Convex Hull 62 5.1.4.2 Concave Hull 63 5.1.5 Convex Hulls 65 5.1.6 Concave Hulls 67 5.1.7 Route Network 69 5.1.8 Summary of Models for Administrative Hierarchy 69 5.2 Comparison Models 71 5.2.1 Hierarchical Data 71 5.2.1.1 Test Data 73 5.2.2 Route Networks 76 5.2.2.1 Post-processing 81 5.2.2.2 Applications 82 5.3 Unknown Places 84 6 Al-Ṯurayyā 89 6.1 Introducing al-Ṯurayyā 90 6.2 Gazetteer 90 6.3 Spatial Model 91 6.3.1 Provinces and Administrative Divisions 93 6.3.2 Pathfinding and Itineraries 93 6.3.3 Flood Network 96 6.3.4 Path Alignment Tool 97 6.3.5 Data Structure 99 6.3.5.1 Places 100 6.3.5.2 Routes and Distances 100 7 Conclusions and Further Work 10

    Context and Relationships: Ireland and Irish Studies

    Get PDF
    Students, editors, researchers, and the public are necessarily concerned with the context and relationships of names and topics found when reading texts. What other documents relate this topic? Where and when did this happen? What else was going on around that time and place? Who were the people and institutions mentioned? How were they related? What else did they do? This project will develop three tools: A Context Finder will find information in reference works about names, words, or phrases found when reading; a Context Builder will annotate texts with the source of information for the next reader; and reference works will be enriched by Context Provider to show where the names or topics were mentioned in texts. An international collaboration between the University of California, Berkeley, and the Queen's University, Belfast will develop these tools for texts relating to Ireland

    APREGOAR: Development of a geospatial database applied to local news in Lisbon

    Get PDF
    Project Work presented as the partial requirement for obtaining a Master's degree in Geographic Information Systems and ScienceHá informações valiosas em formato de texto não estruturado sobre a localização, calendarização e a essências dos eventos disponíveis no conteúdo de notícias digitais. Vários trabalhos em curso já tentam extrair detalhes de eventos de fontes de notícias digitais, mas muitas vezes não com a nuance necssária para representar com precisão onde as coisas realmente acontecem. Alternativamente, os jornalistas poderiam associar manualmente atributos a eventos descritos nos seus artigos enquanto publicam, melhorando a exatidão e a confiança nestes atributos espaciais e temporais. Estes atributos poderiam então estar imediatamente disponíveis para avaliar a cobertura temática, temporal e espacial do conteúdo de uma agência, bem como melhorar a experiência do utilizador na exploração do conteúdo, fornecendo dimensões adicionais que podem ser filtradas. Embora a tecnologia de atribuição de dimensões geoespaciais e temporais para o emprego de aplicaçãoes voltadas para o consumidor não seja novidade, tem ainda de ser aplicada à escala das notícias. Além disso, a maioria dos sistemas existentes suporta apenas uma definição pontual da localização dos artigos, que pode não representar bem o(s) local(is) real(ais) dos eventos descritos. Este trabalho define uma aplicação web de código aberto e uma base de dados espacial subjacente que suporta i) a associação de múltiplos polígonos a representar o local onde cada evento ocorre, os prazos associados aos eventos, em linha com os atributos temáticos tradicionais associados aos artigos de notícias; ii) a contextualização de cada artigo através da adição de mapas de eventos em linha para esclarecer aos leitores onde os eventos do artigo ocorrem; e iii) a exploração dos corpora adicionados através de filtros temáticos, espaciais e temporais que exibem os resultados em mapas de cobertura interactivos e listas de artigos e eventos. O projeto foi aplicado na área da grande Lisboa de Portugal. Para além da funcionalidade acima referida, este projeto constroi gazetteers progressivos que podem ser reutilizados como associações de lugares, ou para uma meta-análise mais aprofundada do lugar, tal como é percebido coloquialmente. Demonstra a facilidade com que estas dimensões adicionais podem ser incorporadas com grade confiança na precisão da definição, geridas, e alavancadas para melhorar a gestão de conteúdo das agências noticiosas, a compreensão dos leitores, a exploração dos investigadores, ou extraídas para combinação com outros conjuntos dos dados para fornecer conhecimentos adicionais.There is valuable information in unstructured text format about the location, timing, and nature of events available in digital news content. Several ongoing efforts already attempt to extract event details from digital news sources, but often not with the nuance needed to accurately represent the where things actually happen. Alternatively, journalists could manually associate attributes to events described in their articles while publishing, improving accuracy and confidence in these spatial and temporal attributes. These attributes could then be immediately available for evaluating thematic, temporal, and spatial coverage of an agency’s content, as well as improve the user experience of content exploration by providing additional dimensions that can be filtered. Though the technology of assigning geospatial and temporal dimensions for the employ of consumer-facing applications is not novel, it has yet to be applied at scale to the news. Additionally, most existing systems only support a single point definition of article locations, which may not well represent the actual place(s) of events described within. This work defines an open source web application and underlying spatial database that supports i) the association of multiple polygons representing where each event occurs, time frames associated with the events, inline with the traditional thematic attributes associated with news articles; ii) the contextualization of each article via the addition of inline event maps to clarify to readers where the events of the article occur; and iii) the exploration of the added corpora via thematic, spatial, and temporal filters that display results in interactive coverage maps and lists of articles and events. The project was applied to the greater Lisbon area of Portugal. In addition to the above functionality, this project builds progressive gazetteers that can be reused as place associations, or for further meta analysis of place as it is colloquially understood. It demonstrates the ease of which these additional dimensions may be incorporated with a high confidence in definition accuracy, managed, and leveraged to improve news agency content management, reader understanding, researcher exploration, or extracted for combination with other datasets to provide additional insights

    Spatializing and Analyzing Digital Texts: Corpora, GIS, and Places

    Get PDF
    This paper argues for Geographical Text Analysis (GTA), a new approach based on combining techniques from corpus linguistics and Geographical Information Systems (GIS). First these allow a text to be converted into a GIS. Corpus linguistics allows place-names to be extracted from texts. These place-names can then be matched to a gazetteer to provide grid references which subsequently allow the place-names to be converted into a GIS layer. Once this has been done, GTA also allows the text to summarize and analyze the geographies within the texts. These approaches can be applied to very large bodies of text, potentially millions or billions of words. We argue that this approach does not replace close reading which is more commonly used in the study of texts. Instead it allows very large volumes of text to be summarized and the close reader’s attention to be drawn to the parts of the text that are most relevant to their interest in particular places or the themes associated with these places. The implications of this approach in relation to deep mapping and literary studies are discussed
    corecore