16 research outputs found

    On assigning place names to geography related web pages

    Get PDF
    In this paper, we attempt to give spatial semantics to web pages by assigning them place names. The entire assignment task is divided into three sub-problems, namely place name extraction, place name disambiguation and place name assignment, and we propose approaches to each. In particular, we have modified GATE, a well-known named entity extraction tool, to perform place name extraction using a US Census gazetteer. A rule-based place name disambiguation method and a place name assignment method capable of assigning place names to web page segments have also been proposed. We have evaluated the proposed disambiguation and assignment methods on a web page collection referenced by the DLESE metadata collection, comparing their output with manually disambiguated place names and manual place name assignments. The results show that our place name disambiguation method works well for geo/geo ambiguities, and the preliminary results of our place name assignment method are promising despite the presence of geo/non-geo ambiguities among place names.
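    The extraction–disambiguation–assignment pipeline described above can be pictured with a minimal sketch. The tiny gazetteer, the whole-word matching, and the "prefer a candidate whose state is also mentioned" rule are illustrative assumptions, not the modified GATE pipeline or the authors' actual disambiguation rules.

        import re

        # Hypothetical miniature gazetteer: place name -> candidate (state, lat, lon) readings.
        GAZETTEER = {
            "Springfield": [("IL", 39.80, -89.65), ("MA", 42.10, -72.59)],
            "Boston": [("MA", 42.36, -71.06)],
            "Illinois": [("IL", 40.00, -89.00)],
        }

        def extract_place_names(text):
            """Return gazetteer names that occur as whole words in the text."""
            return [name for name in GAZETTEER
                    if re.search(r"\b" + re.escape(name) + r"\b", text)]

        def disambiguate(names):
            """Toy rule: prefer the reading whose state is itself mentioned on the page."""
            anchor_states = {GAZETTEER[n][0][0] for n in names if len(GAZETTEER[n]) == 1}
            resolved = {}
            for name in names:
                candidates = GAZETTEER[name]
                resolved[name] = next((c for c in candidates if c[0] in anchor_states),
                                      candidates[0])
            return resolved

        page = "Springfield is the capital of Illinois."
        print(disambiguate(extract_place_names(page)))   # Springfield resolves to the IL reading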

    One model to rule them all: unified classification model for geotagging websites

    Full text link
    The paper presents a novel approach to finding the regional scopes (geotagging) of websites. It relies on a single binary classification model per region type to perform the multi-label classification and uses a variety of features that have not yet been used together for machine-learning based regional classification of websites. The evaluation demonstrates the advantage of our one-model-per-region-type method over the traditional one-model-per-region approach.

    A unified model for the geoclassification of websites

    Get PDF
    The paper presents a novel approach to finding the regional scopes (geotagging) of websites. Unlike traditional approaches, which generally involve training a separate classification model for each class (region), the proposed method is based on training a single model that is used for all regions of the same type (e.g. cities). This approach is made possible by the use of "relative" features, which indicate how a selected region compares to other regions for a given website. The classification system uses a variety of features of different natures that have not yet been used together for machine-learning based regional classification of websites. The evaluation demonstrates the advantage of our "one model per region type" method over the traditional "one model per region" approach. A separate experiment demonstrates the ability of the proposed classifier to successfully detect regions that were not present in the training set, which is impossible for traditional approaches.
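    A minimal sketch of the "one model per region type" idea follows. The relative features (a region's share of a site's place-name mentions and its mention rank) and the toy data are illustrative assumptions; the paper's actual feature set is considerably richer.

        from sklearn.linear_model import LogisticRegression

        def relative_features(mention_counts, region):
            """How the candidate region compares to the other regions mentioned on the site."""
            total = sum(mention_counts.values()) or 1
            count = mention_counts.get(region, 0)
            ranked = sorted(mention_counts.values(), reverse=True)
            rank = ranked.index(count) + 1 if count in ranked else len(ranked) + 1
            return [count / total, 1.0 / rank]

        # Hypothetical training data: per-site mention counts and the true region label.
        sites = [
            ({"Paris": 8, "Lyon": 1}, "Paris"),
            ({"Berlin": 5, "Munich": 4}, "Berlin"),
            ({"Rome": 2, "Milan": 7}, "Milan"),
        ]
        X, y = [], []
        for counts, true_region in sites:
            for region in counts:                      # one example per (site, region) pair
                X.append(relative_features(counts, region))
                y.append(1 if region == true_region else 0)

        model = LogisticRegression().fit(X, y)

        # The single model can now score a region that never appeared in training.
        new_counts = {"Oslo": 6, "Bergen": 1}
        print(model.predict_proba([relative_features(new_counts, "Oslo")])[0][1])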

    Improving the geospatial consistency of digital libraries metadata

    Get PDF
    Consistency is an essential aspect of the quality of metadata. Inconsistent metadata records are harmful: given a themed query, the set of retrieved metadata records would contain descriptions of unrelated or irrelevant resources, and may not even contain some resources considered obvious. This is even worse when the description of the location is inconsistent. Inconsistent spatial descriptions may yield invisible or hidden geographical resources that cannot be retrieved by means of spatially themed queries. Therefore, ensuring spatial consistency should be a primary goal when reusing, sharing and developing georeferenced digital collections. We present a methodology able to detect geospatial inconsistencies in metadata collections based on the combination of spatial ranking, reverse geocoding, geographic knowledge organization systems and information-retrieval techniques. This methodology has been applied to a collection of metadata records describing maps and atlases belonging to the Library of Congress. The proposed approach was able to automatically identify inconsistent metadata records (870 out of 10,575) and propose fixes to most of them (91.5%). These results support the ability of the proposed methodology to assess the impact of spatial inconsistency on the retrievability and visibility of metadata records and to improve their spatial consistency.
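    A minimal sketch of the inconsistency check, assuming a tiny in-memory gazetteer in place of the spatial-ranking, reverse-geocoding and knowledge-organization services combined in the paper: a record is flagged when a place named in its description falls outside its declared bounding box.

        GAZETTEER = {"Madrid": (40.42, -3.70), "Zaragoza": (41.65, -0.89)}

        def bbox_contains(bbox, point):
            """bbox = (min_lat, min_lon, max_lat, max_lon); point = (lat, lon)."""
            min_lat, min_lon, max_lat, max_lon = bbox
            lat, lon = point
            return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

        def find_inconsistencies(record):
            """Place names mentioned in the record that lie outside its declared bbox."""
            return [name for name, point in GAZETTEER.items()
                    if name in record["description"]
                    and not bbox_contains(record["bbox"], point)]

        record = {"description": "Map of Madrid and Zaragoza",
                  "bbox": (39.9, -4.6, 40.9, -3.0)}     # covers Madrid only
        print(find_inconsistencies(record))              # ['Zaragoza']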

    Linking archival data to location A case study at the UK National Archives

    Get PDF
    Purpose: The National Archives (TNA) is the UK Government's official archive. It stores and maintains records spanning over 1,000 years in both physical and digital form. Much of the information held by TNA includes references to place, and user queries to TNA's online catalogue frequently involve searches for location. The purpose of this paper is to illustrate how TNA have extracted the geographic references in their historic data to improve access to the archives. Design/methodology/approach: To quickly enhance the existing archival data with geographic information, existing technologies from Natural Language Processing (NLP) and Geographical Information Retrieval (GIR) have been utilised and adapted to historical archives. Findings: Enhancing the archival records with geographic information has enabled TNA to quickly develop a number of case studies highlighting how geographic information can improve access to large-scale archival collections. The use of existing GIR methods and technologies, such as OpenLayers, made it possible to implement this process quickly and in a way that is easily transferable to other institutions. Practical implications: The methods and technologies described in this paper can be adapted by other archives to similarly enhance access to their historic data. The data-sharing methods described can also be used to enable the integration of knowledge held at different archival institutions. Originality/value: Place is one of the core dimensions of TNA's archival data. Many of the records held make reference to place (wills, legislation, court cases), and approximately one fifth of users' searches involve place names. However, a number of open questions remain regarding the adaptation of existing GIR methods to the history domain. This paper presents an overview of available GIR methods and the challenges in applying them to historical data.
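    The enrichment step can be pictured with a minimal sketch: place mentions found in catalogue descriptions are turned into GeoJSON features that a map client such as OpenLayers can display. The gazetteer, the catalogue reference and the property names are illustrative assumptions, not TNA's actual data model.

        import json

        GAZETTEER = {"York": (53.96, -1.08), "Canterbury": (51.28, 1.08)}

        def to_geojson(records):
            """Emit a GeoJSON FeatureCollection of place mentions found in record descriptions."""
            features = []
            for rec in records:
                for name, (lat, lon) in GAZETTEER.items():
                    if name in rec["description"]:
                        features.append({
                            "type": "Feature",
                            "geometry": {"type": "Point", "coordinates": [lon, lat]},
                            "properties": {"catalogue_ref": rec["ref"], "place": name},
                        })
            return {"type": "FeatureCollection", "features": features}

        records = [{"ref": "PROB 11/123", "description": "Will proved at Canterbury"}]
        print(json.dumps(to_geojson(records), indent=2))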

    A Survey of Location Prediction on Twitter

    Full text link
    Locations, e.g., countries, states, cities, and points of interest, are central to news, emergency events, and people's daily lives. Automatic identification of locations associated with or mentioned in documents has been explored for decades. As one of the most popular online social network platforms, Twitter has attracted a large number of users who send millions of tweets on a daily basis. Due to the worldwide coverage of its users and the real-time freshness of tweets, location prediction on Twitter has gained significant attention in recent years. Research efforts have been devoted to the new challenges and opportunities brought by the noisy, short, and context-rich nature of tweets. In this survey, we aim to offer an overall picture of location prediction on Twitter. Specifically, we concentrate on the prediction of user home locations, tweet locations, and mentioned locations. We first define the three tasks and review the evaluation metrics. By summarizing Twitter network, tweet content, and tweet context as potential inputs, we then structurally highlight how the problems depend on these inputs. Each dependency is illustrated by a comprehensive review of the corresponding strategies adopted in state-of-the-art approaches. In addition, we briefly review two related problems, i.e., semantic location prediction and point-of-interest recommendation. Finally, we list future research directions. Comment: Accepted to TKDE. 30 pages, 1 figure.
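    Of the task/input combinations the survey reviews, home-location prediction from tweet content is perhaps the simplest to sketch. The toy tweets, cities, and the bag-of-words classifier below are illustrative assumptions, not any particular surveyed system.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        # Hypothetical training data: tweet text labelled with the author's home city.
        tweets = ["stuck on the tube again", "cable car was packed this morning",
                  "fog over the bay all week", "mind the gap announcements all day"]
        homes = ["London", "San Francisco", "San Francisco", "London"]

        model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(tweets, homes)
        print(model.predict(["so much fog over the bay today"]))   # likely ['San Francisco']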

    Extracting Geospatial Entities from Wikipedia

    Full text link
    This paper addresses the challenge of extracting geospatial data from the article text of the English Wikipedia. In the first phase of our work, we create a training corpus and select a set of word-based features to train a Support Vector Machine (SVM) for the task of geospatial named entity recognition. For testing, we target a corpus of Wikipedia articles about battles and wars, as these have a high incidence of geospatial content. The SVM recognizes place names in the corpus with a very high recall, close to 100%, with an acceptable precision. The set of geospatial NEs is then fed into a geocoding and resolution process, whose goal is to determine the correct coordinates for each place name. As many place names are ambiguous and do not immediately geocode to a single location, we present a data structure and algorithm to resolve ambiguity based on sentence and article context, so the correct coordinates can be selected. We achieve an f-measure of 82%, and create a set of geospatial entities for each article, combining the place names, spatial locations, and an assumed point geometry. These entities can enable geospatial search on, and geovisualization of, Wikipedia.
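    The context-based resolution step can be sketched as follows: an ambiguous name is resolved to the candidate closest to the centroid of the unambiguous places found in the same article. The tiny gazetteer and the candidate coordinates are illustrative assumptions, not the paper's data structure.

        import math

        # Hypothetical candidate readings per place name (ambiguous names have several).
        GAZETTEER = {
            "Gettysburg": [(39.83, -77.23), (43.34, -77.74)],
            "Pennsylvania": [(40.90, -77.80)],
            "Washington": [(38.90, -77.04), (47.50, -120.50)],
        }

        def centroid(points):
            return (sum(p[0] for p in points) / len(points),
                    sum(p[1] for p in points) / len(points))

        def resolve(names):
            """Pick, for each name, the candidate nearest the centroid of unambiguous names."""
            anchors = [GAZETTEER[n][0] for n in names if len(GAZETTEER[n]) == 1]
            if not anchors:
                return {n: GAZETTEER[n][0] for n in names}
            c = centroid(anchors)
            return {n: min(GAZETTEER[n], key=lambda p: math.dist(p, c)) for n in names}

        print(resolve(["Gettysburg", "Pennsylvania", "Washington"]))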

    Automatic Generation of Geospatial Metadata for Web Resources

    Get PDF
    Web resources that are not part of any Spatial Data Infrastructure can be an important source of information. However, the incorporation of Web resources within a Spatial Data Infrastructure requires a significant effort to create metadata. This work presents an extensible architecture for the automatic characterisation of Web resources and a strategy for the assignment of their geographic scope. The implemented prototype automatically generates geospatial metadata for Web pages. The metadata model conforms to the Common Element Set, a set of core properties encouraged by the OGC Catalogue Service Specification to permit a minimal implementation of a catalogue service independent of an application profile. The experiments performed consisted of creating metadata for the Web pages of providers of geospatial Web resources; the pages were gathered by a Web crawler focused on OGC Web Services. Manual revision of the results has shown that the applied coverage estimation method produces acceptable results for more than 80% of the tested Web resources.
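    A minimal sketch of the coverage-estimation idea: place names resolved against a gazetteer yield a bounding box that can populate a coverage-style metadata field. The gazetteer and the field names are illustrative assumptions and do not reproduce the Common Element Set mapping used in the prototype.

        GAZETTEER = {"Zaragoza": (41.65, -0.89), "Huesca": (42.14, -0.41), "Teruel": (40.34, -1.11)}

        def estimate_coverage(page_text):
            """Bounding box over the coordinates of all gazetteer names found on the page."""
            points = [coord for name, coord in GAZETTEER.items() if name in page_text]
            if not points:
                return None
            lats = [p[0] for p in points]
            lons = [p[1] for p in points]
            return {"west": min(lons), "east": max(lons), "south": min(lats), "north": max(lats)}

        page = "Cartographic services covering Zaragoza and Huesca"
        print({"title": "Provider home page", "coverage": estimate_coverage(page)})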

    Enriching the Digital Library Experience: Innovations With Named Entity Recognition and Geographic Information System Technologies

    Get PDF
    Digital libraries are seeking innovative ways to share their resources and enhance the user experience. To this end, numerous openly available technologies can be exploited. For this project, NER technology was applied to a subset of the Documenting the American South (DocSouth) digital collections. Personal and location names were hand-annotated to create a gold standard, and GATE, a text engineering tool, was run under two conditions: a defaults baseline and a test run that included gazetteers built from DocSouth's Colonial and State Records collection. Overall, GATE performance is promising, and numerous strategies for improvement are discussed. Next, the derived location annotations were georeferenced and stored in a geodatabase through automated processes, and a prototype for a web-based map search was developed using the Google Maps API. This project showcases innovations with automated NER coupled with GIS technologies, and strongly supports further investment in applying these techniques across DocSouth and other digital libraries.
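    The storage side of such a pipeline can be sketched with SQLite standing in for the project's geodatabase: derived location annotations are stored per document so that a map front end (for example, one built on the Google Maps API) can query and plot them. Table and column names are illustrative assumptions.

        import sqlite3

        # Hypothetical NER output: (document id, place name, latitude, longitude).
        annotations = [("doc_001", "Chapel Hill", 35.91, -79.06),
                       ("doc_001", "Raleigh", 35.78, -78.64)]

        conn = sqlite3.connect(":memory:")
        conn.execute("""CREATE TABLE location_annotation
                        (doc_id TEXT, place TEXT, lat REAL, lon REAL)""")
        conn.executemany("INSERT INTO location_annotation VALUES (?, ?, ?, ?)", annotations)

        # Fetch all annotated places for one document, as a map page would before plotting markers.
        for row in conn.execute("SELECT place, lat, lon FROM location_annotation WHERE doc_id = ?",
                                ("doc_001",)):
            print(row)
        conn.close()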