Visualising the South Yorkshire floods of '07
This paper describes initial work on developing an information system to gather, process and visualise various multimedia data sources related to the South Yorkshire (UK) floods of 2007. The work is part of the Memoir project, which aims to investigate how technology can help people create and manage long-term personal memories. We are using maps to aggregate multimedia data and to stimulate remembering of past events. The paper describes an initial prototype, the challenges faced so far, and planned future work.
A Web-based Geo-resolution Annotation and Evaluation Tool
In this paper we present the Edinburgh Geo-annotator, a web-based annotation tool for the manual geo-resolution of location mentions in text using a gazetteer. The annotation tool has an inter-linked text and map interface which lets annotators pick correct candidates within the gazetteer more easily. The geo-annotator can be used to correct the output of a geoparser or to create gold standard geo-resolution data. We include accompanying scoring software for geo-resolution evaluation.
Named entity recognition for sensitive data discovery in Portuguese
The process of protecting sensitive data is continually growing in importance, especially as a result of the directives and laws imposed by the European Union. The effort to create automatic systems is continuous but, in most cases, the processes behind them are still manual or semi-automatic. In this work, we have developed a component that can extract and classify sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and comply with legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques, such as rule-based/lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a set of specific classes. For the remaining classes of entities, two statistical models were tested (Conditional Random Fields and Random Forest) and, finally, a Bidirectional LSTM approach was experimented with. Regarding the statistical models, we found that Conditional Random Fields obtains the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.
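The rule-based/lexical-based pass described above can be illustrated with a minimal sketch. Note this is an assumption about how such a component might look, not the paper's implementation: the entity classes, patterns and example values here are hypothetical.

```python
import re

# Hypothetical patterns for a rule-based pass over Portuguese text;
# the paper's actual rules and entity classes may differ.
PATTERNS = {
    "NIF": re.compile(r"\b\d{9}\b"),                  # Portuguese tax number: 9 digits
    "POSTAL_CODE": re.compile(r"\b\d{4}-\d{3}\b"),    # e.g. 1000-001
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def extract_sensitive(text):
    """Return (label, match, start, end) tuples for every rule hit."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group(), m.start(), m.end()))
    # Sort by character offset so hits read left to right.
    return sorted(hits, key=lambda h: h[2])

hits = extract_sensitive("Contacto: ana@example.pt, NIF 123456789, CP 1000-001.")
```

Rules like these are attractive for well-formatted classes (identifiers, codes, e-mail addresses) precisely because, as the abstract notes, the statistical and neural models are reserved for the classes that lack such regular surface forms.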
Linking archival data to location: A case study at the UK National Archives
Purpose
The National Archives (TNA) is the UK Government's official archive. It stores and maintains records spanning over 1,000 years in both physical and digital form. Much of the information held by TNA includes references to place, and user queries to TNA's online catalogue frequently involve searches for location. The purpose of this paper is to illustrate how TNA have extracted the geographic references in their historic data to improve access to the archives.
Design/methodology/approach
To be able to quickly enhance the existing archival data with geographic information, existing technologies from Natural Language Processing (NLP) and Geographical Information Retrieval (GIR) have been utilised and adapted to historical archives.
Findings
Enhancing the archival records with geographic information has enabled TNA to quickly develop a number of case studies highlighting how geographic information can improve access to large-scale archival collections. The use of existing methods from the GIR domain and of technologies such as OpenLayers enabled this process to be implemented quickly, in a way that is easily transferable to other institutions.
Practical implications
The methods and technologies described in this paper can be adapted by other archives to similarly enhance access to their historic data. The data-sharing methods described can also be used to enable the integration of knowledge held at different archival institutions.
Originality/value
Place is one of the core dimensions for TNA's archival data. Many of the records held make reference to place data (wills, legislation, court cases), and approximately one fifth of users' searches involve place names. However, there are still a number of open questions regarding the adaptation of existing GIR methods to the history domain. This paper presents an overview of available GIR methods and the challenges in applying them to historical data.
Toponym detection in the bio-medical domain: A hybrid approach with deep learning
This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.
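The gazetteer string-matching step described above can be sketched as follows. This is an illustrative assumption, not the paper's algorithm: after a classifier flags a sentence as containing a location, the sentence is scanned against a gazetteer to recover the exact character offsets of each toponym. The tiny gazetteer here is hypothetical.

```python
# Hypothetical toy gazetteer; a real one (e.g. GeoNames) holds millions of names.
GAZETTEER = {"Sheffield", "South Yorkshire", "Edinburgh"}

def find_toponyms(sentence, gazetteer=GAZETTEER):
    """Return (toponym, start, end) spans found in the sentence."""
    spans = []
    # Try longer gazetteer entries first so a multi-word name such as
    # "South Yorkshire" wins over any shorter entry it contains.
    for name in sorted(gazetteer, key=len, reverse=True):
        start = sentence.find(name)
        while start != -1:
            end = start + len(name)
            # Skip candidates that overlap an already-accepted span.
            if not any(s <= start < e or s < end <= e for _, s, e in spans):
                spans.append((name, start, end))
            start = sentence.find(name, start + 1)
    return sorted(spans, key=lambda s: s[1])

spans = find_toponyms("Flooding hit Sheffield and much of South Yorkshire in 2007.")
```

Recovering exact character indices matters for this task because SemEval-2019 Task 12 scores toponym detection at the span level, not merely per sentence.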
Location Reference Recognition from Texts: A Survey and Comparison
A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to recognizing location references from texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of its specific applications is still missing. Further, there is a lack of a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and core step of geoparsing. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing these approaches into four groups based on their underlying functional principle: rule-based, gazetteer matching-based, statistical learning-based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition based on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references worldwide. Results from this thorough evaluation can help inform future methodological developments and can help guide the selection of proper approaches based on application needs.
Final report: PATTON Alliance gazetteer evaluation project.
In 2005 the National Ground Intelligence Center (NGIC) proposed that the PATTON Alliance provide assistance in evaluating and obtaining the Integrated Gazetteer Database (IGDB), developed for the Naval Space Warfare Command Research group (SPAWAR) under Advanced Research and Development Activity (ARDA) funds by MITRE Inc., fielded to the text-based search tool GeoLocator, currently in use by NGIC. We met with the developers of GeoLocator and identified their requirements for a better gazetteer. We then validated those requirements by reviewing the technical literature, meeting with other members of the intelligence community (IC), and talking with both the United States Geological Survey (USGS) and the National Geospatial Intelligence Agency (NGA), the authoritative sources for official geographic name information. We thus identified 12 high-level requirements from users and the broader intelligence community. The IGDB satisfies many of these requirements. We identified gaps and proposed ways of closing them. Three important needs have not been addressed but are critical future needs for the broader intelligence community: standardization of gazetteer data; a web feature service for gazetteer information that is maintained by NGA and USGS but accessible to users; and a common forum that brings together IC stakeholders and federal agency representatives to provide input to these activities over the next several years. Establishing a robust gazetteer web feature service that is available to all IC users may go a long way toward resolving the gazetteer needs within the IC. Without a common forum to provide input and feedback, community adoption may take significantly longer than anticipated, with resulting risks to the war fighter.
Extracting metadata for spatially-aware information retrieval on the internet
This paper presents methods used to extract geospatial information from web pages for use in SPIRIT, a new Geographic Information Retrieval (GIR) system for the web. The resulting geospatial markup tools have been used to annotate around 900,000 web pages taken from a 1TB web crawl, focused o