Historical collaborative geocoding
The latest developments in digital technology have provided large data sets that can
increasingly easily be accessed and used. These data sets often contain
indirect localisation information, such as historical addresses. Historical
geocoding is the process of transforming the indirect localisation information
to direct localisation that can be placed on a map, which enables spatial
analysis and cross-referencing. Many efficient geocoders exist for current
addresses, but they do not deal with the temporal aspect and are based on a
strict hierarchy (..., city, street, house number) that is hard or impossible
to use with historical data. Indeed, historical data are full of uncertainties
(temporal aspect, semantic aspect, spatial precision, confidence in the historical
source, ...) that cannot be resolved, as there is no way to go back in time to
check. We propose an open source, open data, extensible solution for geocoding
that is based on the building of gazetteers composed of geohistorical objects
extracted from historical topographical maps. Once the gazetteers are
available, geocoding an historical address is a matter of finding the
geohistorical object in the gazetteers that is the best match to the historical
address. The matching criteria are customisable and include several dimensions
(fuzzy semantic, fuzzy temporal, scale, spatial precision ...). As the goal is
to facilitate historical work, we also propose web-based user interfaces that
help geocode (one address or batch mode) and display over current or historical
topographical maps, so that they can be checked and collaboratively edited. The
system is tested on the city of Paris for the 19th-20th centuries, shows a high
return rate, and is fast enough to be used interactively. Comment: working paper
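The multi-dimensional matching the abstract describes could be sketched as follows; the class fields, scoring weights, and 50-year temporal tolerance are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class GeohistoricalObject:
    name: str            # name as it appears in the historical gazetteer
    valid_from: int      # first year the object is attested
    valid_to: int        # last year the object is attested
    precision_m: float   # spatial precision of the source map, in metres

def match_score(query_name: str, query_year: int, obj: GeohistoricalObject,
                w_sem: float = 0.6, w_time: float = 0.3, w_prec: float = 0.1) -> float:
    # Fuzzy semantic similarity between the historical address and the object name.
    sem = SequenceMatcher(None, query_name.lower(), obj.name.lower()).ratio()
    # Fuzzy temporal match: full score inside the validity interval, decaying outside.
    if obj.valid_from <= query_year <= obj.valid_to:
        time = 1.0
    else:
        gap = min(abs(query_year - obj.valid_from), abs(query_year - obj.valid_to))
        time = max(0.0, 1.0 - gap / 50.0)  # 50-year tolerance, an arbitrary choice
    # Prefer spatially precise objects (penalise coarse ones).
    prec = 1.0 / (1.0 + obj.precision_m / 100.0)
    return w_sem * sem + w_time * time + w_prec * prec

def geocode(query_name: str, query_year: int, gazetteer: list) -> GeohistoricalObject:
    # Geocoding a historical address = finding the best-matching gazetteer object.
    return max(gazetteer, key=lambda o: match_score(query_name, query_year, o))
```

The weighted sum is just one way to combine the dimensions; the point is that each criterion is scored separately, so the weights can be customised per use case.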
Improving the geospatial consistency of digital libraries metadata
Consistency is an essential aspect of the quality of metadata. Inconsistent metadata records are harmful: given a themed query, the set of retrieved metadata records would contain descriptions of unrelated or irrelevant resources, and may even omit resources whose relevance is obvious. This is even worse when the description of the location is inconsistent. Inconsistent spatial descriptions may yield invisible or hidden geographical resources that cannot be retrieved by means of spatially themed queries. Therefore, ensuring spatial consistency should be a primary goal when reusing, sharing and developing georeferenced digital collections. We present a methodology able to detect geospatial inconsistencies in metadata collections, based on the combination of spatial ranking, reverse geocoding, geographic knowledge organization systems and information-retrieval techniques. This methodology has been applied to a collection of metadata records describing maps and atlases belonging to the Library of Congress. The proposed approach automatically identified inconsistent metadata records (870 out of 10,575) and proposed fixes for most of them (91.5%). These results support the ability of the proposed methodology to assess the impact of spatial inconsistency on the retrievability and visibility of metadata records and to improve their spatial consistency.
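One elementary form of the consistency check described above is verifying that a record's coordinates fall inside the footprint of the place it declares. This is a hedged sketch with toy bounding boxes; the paper's actual pipeline combines reverse geocoding, spatial ranking and knowledge organization systems, none of which are reproduced here:

```python
# Toy bounding boxes (min_lon, min_lat, max_lon, max_lat); in a real system
# these would come from a gazetteer or geographic knowledge organization system.
PLACE_BBOX = {
    "Paris":  (2.22, 48.81, 2.47, 48.90),
    "Madrid": (-3.84, 40.31, -3.52, 40.56),
}

def is_consistent(record: dict) -> bool:
    # A record is flagged only when we know the declared place and the
    # coordinates fall outside its bounding box.
    bbox = PLACE_BBOX.get(record["place"])
    if bbox is None:
        return True  # unknown place: cannot judge, so do not flag
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= record["lon"] <= max_lon and min_lat <= record["lat"] <= max_lat

def find_inconsistent(records: list) -> list:
    # Return the subset of metadata records whose spatial description conflicts
    # with their declared place name.
    return [r for r in records if not is_consistent(r)]
```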
Improvements in the geocoding process in organizational environments
Dissertation for obtaining the Master's degree in Computer Engineering (Engenharia Informática).
The current geocoding technologies are only able to handle addresses which fit a general
case for the location in question. Edge-case addresses are mostly ignored
or wrongly geocoded, leading to imprecision and errors in the results obtained. To
try to overcome this problem, current geocoding services accompany their results with
confidence values, but the values and scales used vary between services, are hard for
users without knowledge of the area to understand, and, as we discovered, cannot
truly be trusted.
Novabase aims to make available to organizations a geocoding service which allows
the improvement of the quality of the results obtained by mainstream geocoding services
such as Google and Bing. The objective is to give quality results in the cases where we
can act and, when we cannot, to fall back to the results of other geocoding services.
We intend to handle addresses in areas where results are of inferior quality, either
because the areas are not fully covered by the services or because those same services are
not prepared to handle address formats which do not match the general case (one
example is addresses which are numbered by the use of Lotes).
The geocoding is executed in two steps. The first one matches the address with a
knowledge base owned by the organization, whose quality we fully trust. If
the knowledge base returns a valid result, it is output with maximum confidence. When
it fails, we fall back to the mainstream geocoding services and use their results as the
output.
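The two-step process above amounts to a simple lookup-then-fallback flow. A minimal sketch, assuming the knowledge base is a mapping from address strings to coordinates and the fallback service is any callable returning a geocoding result (all names and the sample address are hypothetical):

```python
def geocode(address: str, knowledge_base: dict, fallback_service) -> dict:
    # Step 1: the organizational knowledge base, which is fully trusted,
    # so a hit is returned with maximum confidence.
    if address in knowledge_base:
        lat, lon = knowledge_base[address]
        return {"lat": lat, "lon": lon, "confidence": 1.0, "source": "kb"}
    # Step 2: fall back to a mainstream geocoding service and pass its
    # result (including its own confidence value) through unchanged.
    result = fallback_service(address)
    result["source"] = "fallback"
    return result
```

Tagging each result with its source keeps the two confidence scales distinguishable downstream, which matters given the abstract's point that mainstream confidence values are not comparable across services.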
Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use
of wrappers for extracting data from Web sources. While most of the previous
research has focused on quick and efficient generation of wrappers, the
development of tools for wrapper maintenance has received less attention. This
is an important research problem because Web sources often change in ways that
prevent the wrappers from extracting data correctly. We present an efficient
algorithm that learns structural information about data from positive examples
alone. We describe how this information can be used for two wrapper maintenance
applications: wrapper verification and reinduction. The wrapper verification
system detects when a wrapper is not extracting correct data, usually because
the Web source has changed its format. The reinduction algorithm automatically
recovers from changes in the Web source by identifying data on Web pages so
that a new wrapper may be generated for this source. To validate our approach,
we monitored 27 wrappers over a period of a year. The verification algorithm
correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes,
resulting in precision of 0.73 and recall of 0.95. We validated the reinduction
algorithm on ten Web sources. We were able to successfully reinduce the
wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data
extraction task.
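The precision and recall figures reported above follow the standard information-retrieval definitions, which can be stated as a small helper (how the paper maps its detected changes and mistakes onto these counts is left to the paper itself):

```python
def precision(true_positives: int, false_positives: int) -> float:
    # Fraction of raised alarms that correspond to real changes.
    return true_positives / (true_positives + false_positives)

def recall(true_positives: int, false_negatives: int) -> float:
    # Fraction of real changes that were actually detected.
    return true_positives / (true_positives + false_negatives)
```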
Automatic Generation of Geospatial Metadata for Web Resources
Web resources that are not part of any Spatial Data Infrastructure can be an important source of information. However, the incorporation of Web resources within a Spatial Data Infrastructure requires a significant effort to create metadata. This work presents an extensible architecture for automatic characterisation of Web resources and a strategy for assignment of their geographic scope. The implemented prototype automatically generates geospatial metadata for Web pages. The metadata model conforms to the Common Element Set, a set of core properties which is encouraged by the OGC Catalogue Service Specification to permit the minimal implementation of a catalogue service independent of an application profile. The performed experiments consisted of the creation of metadata for Web pages of providers of geospatial Web resources. The Web pages were gathered by a Web crawler focused on OGC Web Services. Manual revision of the results has shown that the coverage estimation method applied produces acceptable results for more than 80% of the tested Web resources.
A large-scale study of fashion influencers on Twitter
The rise of social media has changed the nature of the fashion industry. Influence is no longer concentrated in the hands of an elite few: social networks distribute power across a broad set of tastemakers; trends are driven bottom-up and top-down; and designers, retailers, and consumers are regularly inundated with new styles and looks.
This thesis presents a large-scale study of fashion influencers on Twitter and proposes a fashion graph visualization dashboard to explore the social interactions between these Twitter accounts. Leveraging a dataset of 11.5k Twitter fashion accounts, a content-based classifier was trained to predict which accounts are fashion-centric. With the classifier, I identified more than 300k fashion-related accounts through snowball crawling and then defined a stable group of 1,000 influencers as the fashion core. I further human-labeled these influencers’ Twitter accounts and mined their recent tweets. Finally, I built a fashion graph visualization dashboard that allows users to visualize the interactions and relationships between individuals, brands, and media influencers.
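The snowball crawl described above can be sketched as a breadth-first expansion gated by the classifier; the function names, graph accessor, and cap of 1,000 accounts are illustrative assumptions, not the thesis's actual crawler:

```python
from collections import deque

def snowball(seeds, get_neighbors, is_fashion, max_accounts=1000):
    # Breadth-first crawl: start from seed accounts and keep expanding through
    # the follow graph, retaining only accounts the classifier labels
    # fashion-centric, until the target core size is reached.
    seen, frontier, fashion = set(seeds), deque(seeds), []
    while frontier and len(fashion) < max_accounts:
        account = frontier.popleft()
        if is_fashion(account):
            fashion.append(account)
            # Only fashion-centric accounts contribute new crawl candidates.
            for nxt in get_neighbors(account):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return fashion
```

Expanding only through accounts the classifier accepts is one plausible design choice here: it keeps the crawl focused on the fashion community instead of drifting into the general follow graph.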