A Holistic Solution for Duplicate Entity Identification in Deep Web Data Integration
The proliferation of the deep Web offers users a great opportunity to search for high-quality information on the Web. As a necessary step in deep Web data integration, duplicate entity identification aims to discover duplicate records in the integrated Web databases for further applications (e.g. price-comparison services). However, most existing work addresses this issue only between two data sources, which is not practical for deep Web data integration systems: a duplicate entity matcher trained over two specific Web databases cannot be applied to other Web databases. In addition, the cost of preparing the training set for n Web databases is C(n,2) times higher than that for two Web databases. In this paper, we propose a holistic solution to address the new challenges posed by the deep Web, whose goal is to build one duplicate entity matcher over multiple Web databases. Extensive experiments on two domains show that the proposed solution is highly effective for deep Web data integration.
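As a rough illustration of why pairwise training scales poorly, the sketch below assumes (this is not stated in the abstract) that one matcher is trained per unordered pair of Web databases, so n databases need n(n-1)/2 training sets. The source names are hypothetical, and the code does not represent the authors' holistic matcher.

```python
from itertools import combinations
from math import comb


def pairwise_matchers(databases):
    """Return every unordered pair of databases (one matcher per pair)."""
    return list(combinations(databases, 2))


# Hypothetical Web database names, used only for illustration.
sources = ["shop_a", "shop_b", "shop_c", "shop_d"]
pairs = pairwise_matchers(sources)

# The number of pairs equals the binomial coefficient C(n, 2).
assert len(pairs) == comb(len(sources), 2)
```

Under this assumption, a single matcher shared across all databases needs one training set where the pairwise approach needs one per pair.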
Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art
Information extraction, data integration, and uncertain data management are distinct research areas that have received considerable attention over the last two decades. Many studies have tackled these areas individually. However, information extraction systems should be integrated with data integration methods to make use of the extracted information. Handling uncertainty in the extraction and integration process is important for enhancing the quality of the data in such integrated systems. This article presents the state of the art in these research areas, identifies their common ground, and shows how to integrate information extraction and data integration under the umbrella of uncertainty management.
The DIGMAP geo-temporal web gazetteer service
This paper presents the DIGMAP geo-temporal Web gazetteer service, a system providing access to names of places, historical periods, and associated geo-temporal information. Within the DIGMAP project, this gazetteer serves as the unified repository of geographic and temporal information, assisting in the recognition and disambiguation of geo-temporal expressions over text, as well as in resource searching and indexing. We describe the data integration methodology, the handling of temporal information and some of the applications that use the gazetteer. Initial evaluation results show that the proposed system can adequately support several tasks related to geo-temporal information extraction and retrieval.
Data-Driven Application Maintenance: Views from the Trenches
In this paper we present our experience during design, development, and pilot
deployments of a data-driven machine learning based application maintenance
solution. We implemented a proof of concept to address a spectrum of
interrelated problems encountered in application maintenance projects including
duplicate incident ticket identification, assignee recommendation, theme
mining, and mapping of incidents to business processes. In the context of IT
services, these problems are frequently encountered, yet there is a gap in
bringing automation and optimization. Despite long-standing research around
mining and analysis of software repositories, such research outputs are not
adopted well in practice due to the constraints these solutions impose on the
users. We discuss the need for designing pragmatic solutions with low barriers to
adoption and addressing the right level of complexity of problems with respect to
underlying business constraints and nature of data.
Comment: Earlier version of paper appearing in proceedings of the 4th
International Workshop on Software Engineering Research and Industrial
Practice (SER&IP), IEEE Press, pp. 48-54, 201
A Data-Driven Approach for Tag Refinement and Localization in Web Videos
Tagging of visual content is becoming more and more widespread as web-based
services and social networks have popularized tagging functionalities among
their users. These user-generated tags are used to ease browsing and
exploration of media collections, e.g. using tag clouds, or to retrieve
multimedia content. However, not all media are equally tagged by users. With
current systems it is easy to tag a single photo, and even tagging a part of a
photo, like a face, has become common in sites like Flickr and Facebook. On the
other hand, tagging a video sequence is more complicated and time consuming, so
that users just tag the overall content of a video. In this paper we present a
method for automatic video annotation that increases the number of tags
originally provided by users, and localizes them temporally, associating tags
to keyframes. Our approach exploits collective knowledge embedded in
user-generated tags and web sources, and visual similarity of keyframes and
images uploaded to social sites like YouTube and Flickr, as well as web sources
like Google and Bing. Given a keyframe, our method is able to select on the fly
from these visual sources the training exemplars that should be the most
relevant for this test sample, and proceeds to transfer labels across similar
images. Compared to existing video tagging approaches that require training
classifiers for each tag, our system has few parameters, is easy to implement
and can deal with an open vocabulary scenario. We demonstrate the approach on
tag refinement and localization on DUT-WEBV, a large dataset of web videos, and
show state-of-the-art results.
Comment: Preprint submitted to Computer Vision and Image Understanding (CVIU)
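The idea of transferring labels from visually similar images to a keyframe can be sketched as a nearest-neighbour vote. This is a simplified illustration and not the authors' method: the feature vectors, the cosine similarity measure, and the parameters `k` and `min_votes` are all assumptions introduced here.

```python
from collections import Counter
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def transfer_tags(keyframe_vec, exemplars, k=3, min_votes=2):
    """Pick the k exemplars most similar to the keyframe and keep the tags
    that at least min_votes of them share.

    exemplars: list of (feature_vector, set_of_tags) pairs.
    """
    ranked = sorted(exemplars,
                    key=lambda e: cosine(keyframe_vec, e[0]),
                    reverse=True)
    votes = Counter(tag for _, tags in ranked[:k] for tag in tags)
    return sorted(tag for tag, n in votes.items() if n >= min_votes)


# Hypothetical toy exemplars: 2-D "visual features" with user tags.
exemplars = [
    ([1.0, 0.0], {"cat", "pet"}),
    ([0.9, 0.1], {"cat"}),
    ([0.0, 1.0], {"car"}),
    ([0.8, 0.2], {"pet", "indoor"}),
]
tags = transfer_tags([1.0, 0.0], exemplars, k=3, min_votes=2)
```

Because neighbours are selected per keyframe at query time, a scheme of this kind needs no per-tag classifier and can handle tags it has not seen before, which matches the open-vocabulary property the abstract describes.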
The Pan-STARRS Moving Object Processing System
We describe the Pan-STARRS Moving Object Processing System (MOPS), a modern
software package that produces automatic asteroid discoveries and
identifications from catalogs of transient detections from next-generation
astronomical survey telescopes. MOPS achieves > 99.5% efficiency in producing
orbits from a synthetic but realistic population of asteroids whose
measurements were simulated for a Pan-STARRS4-class telescope. Additionally,
using a non-physical grid population, we demonstrate that MOPS can detect
populations of currently unknown objects such as interstellar asteroids.
MOPS has been adapted successfully to the prototype Pan-STARRS1 telescope
despite differences in expected false detection rates, fill-factor loss and
relatively sparse observing cadence compared to a hypothetical Pan-STARRS4
telescope and survey. MOPS remains >99.5% efficient at detecting objects on a
single night but drops to 80% efficiency at producing orbits for objects
detected on multiple nights. This loss is primarily due to configurable MOPS
processing limits that are not yet tuned for the Pan-STARRS1 mission.
The core MOPS software package is the product of more than 15 person-years of
software development and incorporates countless additional years of effort in
third-party software to perform lower-level functions such as spatial searching
or orbit determination. We describe the high-level design of MOPS and its essential
subcomponents, discuss the suitability of MOPS for other survey programs, and suggest a
road map for future MOPS development.
Comment: 57 Pages, 26 Figures, 13 Tables