16 research outputs found
On assigning place names to geography related web pages
In this paper, we attempt to give spatial semantics to web pages by assigning them place names. The assignment task is divided into three sub-problems: place name extraction, place name disambiguation, and place name assignment. We propose approaches to each of these sub-problems. In particular, we have modified GATE, a well-known named entity extraction tool, to perform place name extraction using a US Census gazetteer. We have also proposed a rule-based place name disambiguation method and a place name assignment method capable of assigning place names to web page segments. We evaluated the proposed disambiguation and assignment methods on a web page collection referenced by the DLESE metadata collection, comparing the results with manually disambiguated and manually assigned place names. The results show that the proposed disambiguation method works well for geo/geo ambiguities, and the preliminary results of the assignment method are promising given the existence of geo/non-geo ambiguities among place names.
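A rule-based geo/geo disambiguation step of the kind described above can be sketched as follows. This is an illustration of the general technique, not the paper's actual rules; the mini-gazetteer and the "containing state mentioned nearby" heuristic are assumptions for the example.

```python
# Illustrative geo/geo disambiguation rule (not the paper's exact method):
# when a place name matches several gazetteer entries, prefer the candidate
# whose containing state is also mentioned in the surrounding text.

# Hypothetical mini-gazetteer: place name -> list of (state, lat, lon) candidates.
GAZETTEER = {
    "Springfield": [
        ("Illinois", 39.80, -89.64),
        ("Massachusetts", 42.10, -72.59),
        ("Missouri", 37.21, -93.29),
    ],
}

def disambiguate(place, context_text):
    """Return the (state, lat, lon) candidate whose state appears in the
    surrounding text, falling back to the first gazetteer entry."""
    candidates = GAZETTEER.get(place, [])
    for state, lat, lon in candidates:
        if state.lower() in context_text.lower():
            return (state, lat, lon)
    return candidates[0] if candidates else None
```

For example, `disambiguate("Springfield", "near Boston, Massachusetts")` resolves to the Massachusetts candidate, while a context with no state cue falls back to the first gazetteer entry.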
One model to rule them all: unified classification model for geotagging websites
The paper presents a novel approach to finding the regional scopes (geotagging) of websites. It relies on a single binary classification model per region type to perform the multi-label classification and uses a variety of features that have not yet been used together for machine-learning based regional classification of websites. The evaluation demonstrates the advantage of our "one model per region type" method versus the traditional "one model per region" approach.
The paper presents a novel approach to finding the regional scopes (geotagging) of websites. Unlike traditional approaches, which generally train a separate classification model for each class (region), the proposed method trains a single model that is used for all regions of the same type (e.g. cities). This is made possible by "relative" features, which indicate how a selected region compares to the other candidate regions for a given website. The classification system uses a variety of features of different natures that have not yet been used together for machine-learning based regional classification of websites. The evaluation demonstrates the advantage of the "one model per region type" method over the traditional "one model per region" approach. A separate experiment demonstrates that the proposed classifier can successfully detect regions that were not present in the training set, which is impossible for traditional approaches.
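The "relative features" idea can be sketched in a few lines. This is my reading of the general mechanism, not the authors' implementation: each (website, region) pair gets features comparing that region to the other candidates for the same website, so one binary model can score any region, including regions unseen at training time.

```python
# Sketch of the "one model per region type" idea (assumed details, not the
# authors' system): relative features let a single binary classifier score
# every candidate region of a given type for a website.

def relative_features(mention_counts, region):
    """mention_counts: dict region -> how often it is mentioned on the site.
    Returns features describing `region` relative to the other candidates."""
    total = sum(mention_counts.values()) or 1
    best = max(mention_counts.values(), default=0) or 1
    c = mention_counts.get(region, 0)
    return {
        "share_of_mentions": c / total,   # fraction of all region mentions
        "ratio_to_best": c / best,        # 1.0 iff this is the top region
    }

def is_in_scope(features, threshold=0.5):
    """Stand-in for the trained binary model: one decision rule applied
    uniformly to every region of the same type."""
    return features["ratio_to_best"] >= threshold

counts = {"Moscow": 8, "Kazan": 2, "Perm": 1}
```

Because the features are relative rather than tied to a region identity, the same rule applies unchanged to a city that never occurred in training data, which is the property the separate experiment above tests.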
Improving the geospatial consistency of digital libraries metadata
Consistency is an essential aspect of the quality of metadata. Inconsistent metadata records are harmful: given a themed query, the set of retrieved metadata records would contain descriptions of unrelated or irrelevant resources, and may even omit some resources considered obvious. This is even worse when the description of the location is inconsistent. Inconsistent spatial descriptions may yield invisible or hidden geographical resources that cannot be retrieved by means of spatially themed queries. Therefore, ensuring spatial consistency should be a primary goal when reusing, sharing and developing georeferenced digital collections. We present a methodology able to detect geospatial inconsistencies in metadata collections based on the combination of spatial ranking, reverse geocoding, geographic knowledge organization systems and information-retrieval techniques. This methodology has been applied to a collection of metadata records describing maps and atlases belonging to the Library of Congress. The proposed approach was able to automatically identify inconsistent metadata records (870 out of 10,575) and propose fixes to most of them (91.5%). These results support the ability of the proposed methodology to assess the impact of spatial inconsistency on the retrievability and visibility of metadata records and to improve their spatial consistency.
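One elementary kind of geospatial inconsistency check can be illustrated concretely. This is a toy example of the category of problem, not the paper's full methodology: a record is flagged when its point coordinates fall outside its own declared bounding box (field names are assumptions).

```python
# Toy geospatial consistency check (an illustration, not the paper's full
# pipeline): flag a metadata record whose point coordinates lie outside
# its own declared bounding box.

def bbox_consistent(record):
    """record: dict with 'lat', 'lon' and a bounding box
    (west, south, east, north) in decimal degrees."""
    west, south, east, north = record["bbox"]
    return west <= record["lon"] <= east and south <= record["lat"] <= north

records = [
    {"id": "map-001", "lat": 38.9, "lon": -77.0, "bbox": (-79.5, 36.5, -75.2, 39.5)},
    {"id": "map-002", "lat": 48.8, "lon": 2.35, "bbox": (-79.5, 36.5, -75.2, 39.5)},
]
inconsistent = [r["id"] for r in records if not bbox_consistent(r)]
```

The methodology described above goes much further (spatial ranking, reverse geocoding, knowledge organization systems) in order to detect subtler mismatches between textual place descriptions and coordinates, not just box containment.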
Linking archival data to location: A case study at the UK National Archives
Purpose
The National Archives (TNA) is the UK Government's official archive. It stores and maintains records spanning over 1,000 years in both physical and digital form. Much of the information held by TNA includes references to place, and user queries to TNA's online catalogue frequently involve searches for location. The purpose of this paper is to illustrate how TNA have extracted the geographic references in their historic data to improve access to the archives.
Design/methodology/approach
To be able to quickly enhance the existing archival data with geographic information, existing technologies from Natural Language Processing (NLP) and Geographical Information Retrieval (GIR) have been utilised and adapted to historical archives.
Findings
Enhancing the archival records with geographic information has enabled TNA to quickly develop a number of case studies highlighting how geographic information can improve access to large-scale archival collections. The use of existing methods from the GIR domain and technologies such as OpenLayers made it possible to quickly implement this process in a way that is easily transferable to other institutions.
Practical implications
The methods and technologies described in this paper can be adapted by other archives to similarly enhance access to their historic data. Also, the data-sharing methods described can be used to enable the integration of knowledge held at different archival institutions.
Originality/value
Place is one of the core dimensions of TNA's archival data. Many of the records held make reference to place data (wills, legislation, court cases), and approximately one fifth of users' searches involve place names. However, there are still a number of open questions regarding the adaptation of existing GIR methods to the history domain. This paper presents an overview of available GIR methods and the challenges in applying them to historical data.
A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and points-of-interest, are central to news, emergency events, and people's daily lives. Automatic identification of locations associated with or mentioned in documents has been explored for decades. As one of the most popular online social network platforms, Twitter has attracted a large number of users who send millions of tweets on a daily basis. Due to the worldwide coverage of its users and the real-time freshness of tweets, location prediction on Twitter has gained significant attention in recent years. Research efforts have been devoted to the new challenges and opportunities brought by the noisy, short, and context-rich nature of tweets. In this survey, we aim to offer an overall picture of location prediction on Twitter. Specifically, we concentrate on the prediction of user home locations, tweet locations, and mentioned locations. We first define the three tasks and review the evaluation metrics. By summarizing the Twitter network, tweet content, and tweet context as potential inputs, we then structurally highlight how the problems depend on these inputs. Each dependency is illustrated by a comprehensive review of the corresponding strategies adopted in state-of-the-art approaches. In addition, we briefly review two related problems, i.e., semantic location prediction and point-of-interest recommendation. Finally, we list future research directions. (Accepted to TKDE; 30 pages, 1 figure.)
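To make the network-based input family concrete, here is a classic baseline for the user home-location task: predict the most common known location among a user's neighbours in the follow/mention graph. This is a toy sketch of a well-known strategy family such surveys cover, not any specific paper's model.

```python
from collections import Counter

# Classic network-based home-location baseline (toy sketch): a user's home
# is predicted as the majority location among neighbours whose locations
# are already known.

def predict_home(user, graph, known_locations):
    """graph: user -> set of neighbours; known_locations: user -> city."""
    votes = Counter(
        known_locations[n] for n in graph.get(user, ()) if n in known_locations
    )
    return votes.most_common(1)[0][0] if votes else None

graph = {"alice": {"bob", "carol", "dave"}}
known = {"bob": "London", "carol": "London", "dave": "Paris"}
```

Content-based and context-based strategies would instead draw on the text of the tweets and their metadata, which is exactly the input decomposition the survey organizes its review around.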
Extracting Geospatial Entities from Wikipedia
This paper addresses the challenge of extracting geospatial data from the article text of the English Wikipedia. In the first phase of our work, we create a training corpus and select a set of word-based features to train a Support Vector Machine (SVM) for the task of geospatial named entity recognition. As a test corpus, we target Wikipedia articles about battles and wars, as these have a high incidence of geospatial content. The SVM recognizes place names in the corpus with a very high recall, close to 100%, with an acceptable precision. The set of geospatial NEs is then fed into a geocoding and resolution process, whose goal is to determine the correct coordinates for each place name. As many place names are ambiguous, and do not immediately geocode to a single location, we present a data structure and algorithm to resolve ambiguity based on sentence and article context, so the correct coordinates can be selected. We achieve an f-measure of 82%, and create a set of geospatial entities for each article, combining the place names, spatial locations, and an assumed point geometry. These entities can enable geospatial search on, and geovisualization of, Wikipedia.
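Context-based toponym resolution of the kind described above can be sketched as follows. The details here are assumptions for illustration, not the authors' algorithm: among a place name's candidate coordinates, choose the one closest on average to the places already resolved in the same article.

```python
import math

# Sketch of context-based toponym resolution (assumed details): pick the
# candidate coordinate pair closest on average to the already-resolved
# places in the same article.

def dist(a, b):
    # Rough planar distance in degrees; adequate for ranking candidates
    # in a toy example, though real systems use great-circle distance.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def resolve(candidates, context_points):
    """candidates: list of (lat, lon); context_points: resolved (lat, lon)s."""
    if not context_points:
        return candidates[0]
    return min(
        candidates,
        key=lambda c: sum(dist(c, p) for p in context_points) / len(context_points),
    )

# "Springfield" in an article that already resolved a place near Boston:
candidates = [(39.80, -89.64), (42.10, -72.59)]  # Illinois vs Massachusetts
context = [(42.36, -71.06)]
```

With a context point near Boston, the Massachusetts candidate wins; with no context, the function falls back to the first candidate.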
Automatic Generation of Geospatial Metadata for Web Resources
Web resources that are not part of any Spatial Data Infrastructure can be an important source of information. However, the incorporation of Web resources within a Spatial Data Infrastructure requires a significant effort to create metadata. This work presents an extensible architecture for the automatic characterisation of Web resources and a strategy for the assignment of their geographic scope. The implemented prototype automatically generates geospatial metadata for Web pages. The metadata model conforms to the Common Element Set, a set of core properties encouraged by the OGC Catalogue Service Specification to permit the minimal implementation of a catalogue service independent of an application profile. The experiments consisted of creating metadata for the Web pages of providers of geospatial Web resources; the pages were gathered by a Web crawler focused on OGC Web Services. Manual revision of the results showed that the applied coverage estimation method produces acceptable results for more than 80% of the tested Web resources.
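A minimal form of geographic coverage estimation can be sketched like this. It is a toy stand-in for the prototype's method, with assumed inputs: take the coordinates of the place names found on a page and emit a bounding box usable as the coverage element of a metadata record.

```python
# Toy coverage estimation (assumed details, not the prototype's method):
# derive a bounding box from the coordinates of place names found on a page.

def coverage_bbox(points):
    """points: (lat, lon) pairs extracted from the page. Returns
    (west, south, east, north) in decimal degrees, or None if empty."""
    if not points:
        return None
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (min(lons), min(lats), max(lons), max(lats))

# Page mentioning, say, Zaragoza and Madrid:
bbox = coverage_bbox([(41.65, -0.88), (40.42, -3.70)])
```

A real estimator would additionally weight mentions, discard outliers, and map the result into the Common Element Set coverage property, but the bounding-box output is the shape of the artifact being produced.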
Enriching the Digital Library Experience: Innovations With Named Entity Recognition and Geographic Information System Technologies
Digital libraries are seeking innovative ways to share their resources and enhance user experience. To this end, numerous openly available technologies can be exploited. For this project, NER technology was applied to a subset of the Documenting the American South (DocSouth) digital collections. Personal and location names were hand-annotated to produce a gold standard, and GATE, a text engineering tool, was run under two conditions: a default-settings baseline and a test run that included gazetteers built from DocSouth's Colonial and State Records collection. Overall, GATE's performance is promising, and numerous strategies for improvement are discussed. Next, the derived location annotations were georeferenced and stored in a geodatabase through automated processes, and a prototype for a web-based map search was developed using the Google Maps API. This project showcases innovations in automated NER coupled with GIS technologies, and strongly supports further investment in applying these techniques across DocSouth and other digital libraries.
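The georeferencing-to-map step in a pipeline like this can be sketched as a small data transformation. Field names here are illustrative assumptions: location annotations that already carry coordinates are turned into GeoJSON point features, a format that web maps built on the Google Maps API, Leaflet, or OpenLayers can display.

```python
# Sketch of turning georeferenced NER annotations into map-ready features
# (illustrative field names, not the project's actual schema).

def to_geojson(annotations):
    """annotations: list of dicts with 'name', 'lat', 'lon'.
    Returns a GeoJSON FeatureCollection of point features."""
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                # GeoJSON orders coordinates as [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [a["lon"], a["lat"]]},
                "properties": {"name": a["name"]},
            }
            for a in annotations
        ],
    }

fc = to_geojson([{"name": "Raleigh", "lat": 35.78, "lon": -78.64}])
```

Serialized with `json.dumps`, such a collection can be served directly to the map front end, which is essentially what the automated geodatabase-to-web-map step accomplishes at scale.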