Nonparametric Bayesian Modeling for Automated Database Schema Matching
The problem of merging databases arises in many government and commercial
applications. Schema matching, a common first step, identifies equivalent
fields between databases. We introduce a schema matching framework that builds
nonparametric Bayesian models for each field and compares them by computing the
probability that a single model could have generated both fields. Our
experiments show that our method is more accurate and faster than the existing
instance-based matching algorithms in part because of the use of nonparametric
Bayesian models.
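The comparison described in this abstract, the probability that a single model could have generated both fields, can be illustrated with a small sketch. This is not the authors' method: it swaps the nonparametric Bayesian models for a much simpler symmetric Dirichlet-categorical model over observed field values, and the function names, the `alpha` concentration, and the `prior_same` prior are all assumptions made for illustration.

```python
from collections import Counter
from math import exp, lgamma, log


def log_marginal(counts, vocab, alpha=1.0):
    """Log marginal likelihood of a value sequence under a symmetric
    Dirichlet(alpha) prior on a categorical distribution over `vocab`."""
    n = sum(counts.get(v, 0) for v in vocab)
    k = len(vocab)
    total = lgamma(k * alpha) - lgamma(k * alpha + n)
    for v in vocab:
        total += lgamma(alpha + counts.get(v, 0)) - lgamma(alpha)
    return total


def same_model_probability(field_a, field_b, alpha=1.0, prior_same=0.5):
    """Posterior probability that both fields were drawn from one shared
    categorical model rather than from two independent ones."""
    ca, cb = Counter(field_a), Counter(field_b)
    vocab = set(ca) | set(cb)
    log_same = log_marginal(ca + cb, vocab, alpha)  # one model for all values
    log_diff = log_marginal(ca, vocab, alpha) + log_marginal(cb, vocab, alpha)
    log_odds = log_same - log_diff + log(prior_same / (1.0 - prior_same))
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + exp(-log_odds))
    z = exp(log_odds)
    return z / (1.0 + z)


# Hypothetical field samples from two databases.
state_a = ["CA"] * 40 + ["NY"] * 40 + ["TX"] * 20
state_b = ["NY"] * 40 + ["TX"] * 20 + ["CA"] * 40
zip_b = ["10001"] * 40 + ["94105"] * 40 + ["60601"] * 20

p_match = same_model_probability(state_a, state_b)
p_mismatch = same_model_probability(state_a, zip_b)
```

Under these assumptions, fields with similar value distributions score close to 1 and fields with disjoint values score close to 0; a real matcher would also need to handle numeric and free-text fields.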
SODA: Generating SQL for Business Users
The purpose of data warehouses is to enable business analysts to make better
decisions. Over the years the technology has matured and data warehouses have
become extremely successful. As a consequence, more and more data has been
added to the data warehouses and their schemas have become increasingly
complex. These systems still work well for generating pre-canned
reports. However, given their current complexity, they tend to be a poor match
for non-tech-savvy business analysts who need answers to ad-hoc queries that
were not anticipated. This paper describes the design, implementation, and
experience of the SODA system (Search over DAta Warehouse). SODA bridges the
gap between the business needs of analysts and the technical complexity of
current data warehouses. SODA enables a Google-like search experience for data
warehouses by taking keyword queries of business users and automatically
generating executable SQL. The key idea is to use a graph pattern matching
algorithm that uses the metadata model of the data warehouse. Our results with
real data from a global player in the financial services industry show that
SODA produces queries with high precision and recall, and makes it much easier
for business users to interactively explore highly-complex data warehouses.
Comment: VLDB201
XML Matchers: approaches and challenges
Schema Matching, i.e. the process of discovering semantic correspondences
between concepts adopted in different data source schemas, has been a key topic
in Database and Artificial Intelligence research areas for many years. In the
past, it was investigated mainly for classical database models
(e.g., E/R schemas and relational databases). However, in recent years,
the widespread adoption of XML across many application fields has pushed
a growing number of researchers to design XML-specific Schema Matching
approaches, called XML Matchers, aiming at finding semantic matchings between
concepts defined in DTDs and XSDs. XML Matchers do not just take well-known
techniques originally designed for other data models and apply them on
DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical
structure of a DTD/XSD) to improve the performance of the Schema Matching
process. The design of XML Matchers is currently a well-established research
area. The main goal of this paper is to provide a detailed description and
classification of XML Matchers. We first describe to what extent the
specificities of DTDs/XSDs affect the Schema Matching task. Then we
introduce a template, called XML Matcher Template, that describes the main
components of an XML Matcher, their role and behavior. We illustrate how each
of these components has been implemented in some popular XML Matchers. We
consider our XML Matcher Template as the baseline for objectively comparing
approaches that, at first glance, might appear as unrelated. The introduction
of this template can be useful in the design of future XML Matchers. Finally,
we analyze commercial tools implementing XML Matchers and introduce two
challenging issues strictly related to this topic, namely XML source clustering
and uncertainty management in XML Matchers.
Comment: 34 pages, 8 tables, 7 figures
Towards information profiling: data lake content metadata management
There is currently a burst of Big Data (BD) processed and stored in huge raw data repositories, commonly called Data Lakes (DL). These BD require new techniques of data integration and schema alignment in order to make the data usable by its consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe their content. However, there is currently no systematic approach to this kind of metadata discovery and management. Thus, we propose a framework for the profiling of informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to effectively handle this. We demonstrate the alternative techniques and performance of our process using a prototype implementation handling a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach.
ART-Ada design project, phase 2
Interest in deploying expert systems in Ada has increased. An Ada-based expert system tool called ART-Ada is described, which was built to support research into the language and methodological issues of expert systems in Ada. ART-Ada allows applications of an existing expert system tool called ART-IM (Automated Reasoning Tool for Information Management) to be deployed in various Ada environments. ART-IM, a C-based expert system tool, is used to generate Ada source code, which is compiled and linked with an Ada-based inference engine to produce an Ada executable image. ART-Ada is being used to implement several expert systems for NASA's Space Station Freedom Program and the U.S. Air Force.
The IPAC Image Subtraction and Discovery Pipeline for the intermediate Palomar Transient Factory
We describe the near real-time transient-source discovery engine for the
intermediate Palomar Transient Factory (iPTF), currently in operation at the
Infrared Processing and Analysis Center (IPAC), Caltech. We call this system
the IPAC/iPTF Discovery Engine (or IDE). We review the algorithms used for
PSF-matching, image subtraction, detection, photometry, and machine-learned
(ML) vetting of extracted transient candidates. We also review the performance
of our ML classifier. For a limiting signal-to-noise ratio of 4 in relatively
unconfused regions, "bogus" candidates from processing artifacts and imperfect
image subtractions outnumber real transients by ~ 10:1. This can be
considerably higher for image data with inaccurate astrometric and/or
PSF-matching solutions. Despite this occasionally high contamination rate, the
ML classifier is able to identify real transients with an efficiency (or
completeness) of ~ 97% for a maximum tolerable false-positive rate of 1% when
classifying raw candidates. All subtraction-image metrics, source features, ML
probability-based real-bogus scores, contextual metadata from other surveys,
and possible associations with known Solar System objects are stored in a
relational database for retrieval by the various science working groups. We
review our efforts in mitigating false-positives and our experience in
optimizing the overall system in response to the multitude of science projects
underway with iPTF.
Comment: 66 pages, 21 figures, 7 tables, accepted by PAS
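The operating point quoted in this abstract (completeness at a maximum tolerable false-positive rate) can be read as a threshold choice on the real-bogus score. The sketch below is a generic illustration, not the IDE pipeline's code: the function names are hypothetical, and the scores are synthetic placeholders rather than iPTF data.

```python
def threshold_at_fpr(bogus_scores, max_fpr=0.01):
    """Return a score threshold such that at most max_fpr of the bogus
    candidates score strictly above it."""
    ranked = sorted(bogus_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # false positives we may accept
    if allowed >= len(ranked):
        return float("-inf")
    # Everything strictly above the (allowed + 1)-th highest bogus score
    # includes at most `allowed` bogus candidates.
    return ranked[allowed]


def rate_above(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Synthetic real-bogus scores (placeholders, not survey data).
bogus = [i / 2000 for i in range(1000)]      # 0.0000 .. 0.4995
real = [0.45 + i / 200 for i in range(100)]  # 0.450 .. 0.945

thr = threshold_at_fpr(bogus, max_fpr=0.01)
false_positive_rate = rate_above(bogus, thr)
completeness = rate_above(real, thr)
```

Candidates scoring above `thr` are classified as real; lowering `max_fpr` raises the threshold and generally trades completeness for purity.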