6,305 research outputs found

    Transforming texts to maps : geovisualizing topics in texts

    Get PDF
    Dissertation submitted in partial fulfilment of the requirements for the degree of Master of Science in Geospatial TechnologiesUnstructured textual data is one of the most dominant forms of communication. Especially after the adoption of Web 2.0, there has been a massive surge in the rate of generation of unstructured textual data. While a large amount of information is intuitively better for proper decision-making, it also means that it becomes virtually impossible to manually process, discover and extract useful information from textual data. Several supervised and unsupervised techniques in text mining have been developed to classify, cluster and extract information from texts. While text data mining provides insight to the contents of the texts, these techniques do not provide insights to the location component of the texts. In simple terms, text data mining addresses “What is the text about?” but fails to answer the “Where is the text about?” Since textual data have a large amount of geographic content (estimates of about 80%), it can be safely reasoned that answering “Where is the text about?” adds significant insights about the texts. In this study, a collection of news articles from the year 2017 were analyzed using topic modelling, an unsupervised text mining technique. Topics were discovered from the text collections using Latent Dirichlet Allocation method, a popular topic modelling technique. Topics are probability distribution of words which correspond to one of the concepts covered in the text. Spatial locations were extracted from text documents by geoparsing them. Topics were geovisualized as interactive maps according to the probability of each spatial location word which contributed to the corresponding topic. This is analogous to thematic mapping in Geographical Information System. Coordinates obtained from geoparsed words provide basis for georeferencing the topics while the probability of such location words corresponding to the particular topics provide the attribute value for thematic mapping. An interactive geovisualization of Choropleth maps at the level of country was constructed using the Leaflet visualization library. A comparative analysis between the maps and corresponding topics was made to see if the maps provided spatial context to the topics

    Content Analysis of 150 Years of British Periodicals

    Get PDF
    Previous studies have shown that it is possible to detect macroscopic patterns of cultural change over periods of centuries by analyzing large textual time series, specifically digitized books. This method promises to empower scholars with a quantitative and data-driven tool to study culture and society, but its power has been limited by the use of data from books and simple analytics based essentially on word counts. This study addresses these problems by assembling a vast corpus of regional newspapers from the United Kingdom, incorporating very fine-grained geographical and temporal information that is not available for books. The corpus spans 150 years and is formed by millions of articles, representing 14% of all British regional outlets of the period. Simple content analysis of this corpus allowed us to detect specific events, like wars, epidemics, coronations, or conclaves, with high accuracy, whereas the use of more refined techniques from artificial intelligence enabled us to move beyond counting words by detecting references to named entities. These techniques allowed us to observe both a systematic underrepresentation and a steady increase of women in the news during the 20th century and the change of geographic focus for various concepts. We also estimate the dates when electricity overtook steam and trains overtook horses as a means of transportation, both around the year 1900, along with observing other cultural transitions. We believe that these data-driven approaches can complement the traditional method of close reading in detecting trends of continuity and change in historical corpora

    APREGOAR: Development of a geospatial database applied to local news in Lisbon

    Get PDF
    Project Work presented as the partial requirement for obtaining a Master's degree in Geographic Information Systems and ScienceHá informações valiosas em formato de texto não estruturado sobre a localização, calendarização e a essências dos eventos disponíveis no conteúdo de notícias digitais. Vários trabalhos em curso já tentam extrair detalhes de eventos de fontes de notícias digitais, mas muitas vezes não com a nuance necssária para representar com precisão onde as coisas realmente acontecem. Alternativamente, os jornalistas poderiam associar manualmente atributos a eventos descritos nos seus artigos enquanto publicam, melhorando a exatidão e a confiança nestes atributos espaciais e temporais. Estes atributos poderiam então estar imediatamente disponíveis para avaliar a cobertura temática, temporal e espacial do conteúdo de uma agência, bem como melhorar a experiência do utilizador na exploração do conteúdo, fornecendo dimensões adicionais que podem ser filtradas. Embora a tecnologia de atribuição de dimensões geoespaciais e temporais para o emprego de aplicaçãoes voltadas para o consumidor não seja novidade, tem ainda de ser aplicada à escala das notícias. Além disso, a maioria dos sistemas existentes suporta apenas uma definição pontual da localização dos artigos, que pode não representar bem o(s) local(is) real(ais) dos eventos descritos. Este trabalho define uma aplicação web de código aberto e uma base de dados espacial subjacente que suporta i) a associação de múltiplos polígonos a representar o local onde cada evento ocorre, os prazos associados aos eventos, em linha com os atributos temáticos tradicionais associados aos artigos de notícias; ii) a contextualização de cada artigo através da adição de mapas de eventos em linha para esclarecer aos leitores onde os eventos do artigo ocorrem; e iii) a exploração dos corpora adicionados através de filtros temáticos, espaciais e temporais que exibem os resultados em mapas de cobertura interactivos e listas de artigos e eventos. O projeto foi aplicado na área da grande Lisboa de Portugal. Para além da funcionalidade acima referida, este projeto constroi gazetteers progressivos que podem ser reutilizados como associações de lugares, ou para uma meta-análise mais aprofundada do lugar, tal como é percebido coloquialmente. Demonstra a facilidade com que estas dimensões adicionais podem ser incorporadas com grade confiança na precisão da definição, geridas, e alavancadas para melhorar a gestão de conteúdo das agências noticiosas, a compreensão dos leitores, a exploração dos investigadores, ou extraídas para combinação com outros conjuntos dos dados para fornecer conhecimentos adicionais.There is valuable information in unstructured text format about the location, timing, and nature of events available in digital news content. Several ongoing efforts already attempt to extract event details from digital news sources, but often not with the nuance needed to accurately represent the where things actually happen. Alternatively, journalists could manually associate attributes to events described in their articles while publishing, improving accuracy and confidence in these spatial and temporal attributes. These attributes could then be immediately available for evaluating thematic, temporal, and spatial coverage of an agency’s content, as well as improve the user experience of content exploration by providing additional dimensions that can be filtered. Though the technology of assigning geospatial and temporal dimensions for the employ of consumer-facing applications is not novel, it has yet to be applied at scale to the news. Additionally, most existing systems only support a single point definition of article locations, which may not well represent the actual place(s) of events described within. This work defines an open source web application and underlying spatial database that supports i) the association of multiple polygons representing where each event occurs, time frames associated with the events, inline with the traditional thematic attributes associated with news articles; ii) the contextualization of each article via the addition of inline event maps to clarify to readers where the events of the article occur; and iii) the exploration of the added corpora via thematic, spatial, and temporal filters that display results in interactive coverage maps and lists of articles and events. The project was applied to the greater Lisbon area of Portugal. In addition to the above functionality, this project builds progressive gazetteers that can be reused as place associations, or for further meta analysis of place as it is colloquially understood. It demonstrates the ease of which these additional dimensions may be incorporated with a high confidence in definition accuracy, managed, and leveraged to improve news agency content management, reader understanding, researcher exploration, or extracted for combination with other datasets to provide additional insights

    Location Reference Recognition from Texts: A Survey and Comparison

    Full text link
    A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to recognizing location references from texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of its specific applications is still missing. Further, there is a lack of a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and core step of geoparsing. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing these approaches into four groups based on their underlying functional principle: rule-based, gazetteer matching–based, statistical learning-–based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition based on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references worldwide. Results from this thorough evaluation can help inform future methodological developments and can help guide the selection of proper approaches based on application needs

    DARIAH and the Benelux

    Get PDF

    Semantic Enrichment of a Multilingual Archive with Linked Open Data

    Get PDF
    This paper introduces MERCKX, a Multilingual Entity/Resource Combiner & Knowledge eXtractor. A case study involving the semantic enrichment of a multilingual archive is presented with the aim of assessing the relevance of natural language processing techniques such as named-entity recognition and entity linking for cultural heritage material. In order to improve the indexing of historical collections, we map entities to the Linked Open Data cloud using a language-independent method. Our evaluation shows that MERCKX outperforms similar tools on the task of place disambiguation and linking, achieving over 80% precision despite lower recall scores. These results are encouraging for small and medium-size cultural institutions since they demonstrate that semantic enrichment can be achieved with limited resources.Peer reviewe

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Get PDF
    Based on the information provided by European projects and national initiatives related to multimedia search as well as domains experts that participated in the CHORUS Think-thanks and workshops, this document reports on the state of the art related to multimedia content search from, a technical, and socio-economic perspective. The technical perspective includes an up to date view on content based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark inititiatives to measure the performance of multimedia search engines. From a socio-economic perspective we inventorize the impact and legal consequences of these technical advances and point out future directions of research
    corecore