A Large Scale Dataset for the Evaluation of Ontology Matching Systems
Recently, the number of ontology matching techniques and systems has increased significantly. This makes the issue of their evaluation and comparison more pressing. One of the challenges of ontology matching evaluation is building large scale evaluation datasets. In fact, the number of possible correspondences between two ontologies grows quadratically with respect to the number of entities in these ontologies. This often makes the manual construction of evaluation datasets demanding to the point of being infeasible for large scale matching tasks. In this paper we present an ontology matching evaluation dataset composed of thousands of matching tasks, called TaxME2. It was built semi-automatically out of the Google, Yahoo and Looksmart web directories. We evaluated TaxME2 by exploiting the results of almost two dozen state-of-the-art ontology matching systems. The experiments indicate that the dataset possesses the desired key properties, namely, it is error-free, incremental, discriminative, monotonic, and hard for state-of-the-art ontology matching systems. The paper has been accepted for publication in "The Knowledge Engineering Review", Cambridge University Press (ISSN: 0269-8889, EISSN: 1469-8005)
XML Matchers: approaches and challenges
Schema Matching, i.e. the process of discovering semantic correspondences
between concepts adopted in different data source schemas, has been a key topic
in Database and Artificial Intelligence research areas for many years. In the
past, it was largely investigated especially for classical database models
(e.g., E/R schemas, relational databases, etc.). However, in recent years,
the widespread adoption of XML in the most disparate application fields pushed
a growing number of researchers to design XML-specific Schema Matching
approaches, called XML Matchers, aiming at finding semantic matchings between
concepts defined in DTDs and XSDs. XML Matchers do not just take well-known
techniques originally designed for other data models and apply them to
DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical
structure of a DTD/XSD) to improve the performance of the Schema Matching
process. The design of XML Matchers is currently a well-established research
area. The main goal of this paper is to provide a detailed description and
classification of XML Matchers. We first describe to what extent the
specificities of DTDs/XSDs affect the Schema Matching task. Then we
introduce a template, called XML Matcher Template, that describes the main
components of an XML Matcher, their role and behavior. We illustrate how each
of these components has been implemented in some popular XML Matchers. We
consider our XML Matcher Template as the baseline for objectively comparing
approaches that, at first glance, might appear as unrelated. The introduction
of this template can be useful in the design of future XML Matchers. Finally,
we analyze commercial tools implementing XML Matchers and introduce two
challenging issues strictly related to this topic, namely XML source clustering
and uncertainty management in XML Matchers.
Comment: 34 pages, 8 tables, 7 figure
Multilingual Schema Matching for Wikipedia Infoboxes
Recent research has taken advantage of Wikipedia's multilingualism as a
resource for cross-language information retrieval and machine translation, as
well as proposed techniques for enriching its cross-language structure. The
availability of documents in multiple languages also opens up new opportunities
for querying structured Wikipedia content, and in particular, to enable answers
that straddle different languages. As a step towards supporting such queries,
in this paper, we propose a method for identifying mappings between attributes
from infoboxes that come from pages in different languages. Our approach finds
mappings in a completely automated fashion. Because it does not require
training data, it is scalable: not only can it be used to find mappings between
many language pairs, but it is also effective for languages that are
under-represented and lack sufficient training samples. Another important
benefit of our approach is that it does not depend on syntactic similarity
between attribute names, and thus, it can be applied to language pairs that
have distinct morphologies. We have performed an extensive experimental
evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and
English. The results show that not only does our approach obtain high precision
and recall, but it also outperforms state-of-the-art techniques. We also
present a case study which demonstrates that the multilingual mappings we
derive lead to substantial improvements in answer quality and coverage for
structured queries over Wikipedia content.
Comment: VLDB201
Nearly Optimal Sparse Group Testing
Group testing is the process of pooling arbitrary subsets from a set of $n$
items so as to identify, with a minimal number of tests, a "small" subset of $d$
defective items. In "classical" non-adaptive group testing, it is known
that when $d$ is substantially smaller than $n$, $\Theta(d\log(n))$ tests are
both information-theoretically necessary and sufficient to guarantee recovery
with high probability. Group testing schemes in the literature meeting this
bound require most items to be tested $\Omega(\log(n))$ times, and most tests
to incorporate $\Omega(n/d)$ items.
Motivated by physical considerations, we study group testing models in which
the testing procedure is constrained to be "sparse". Specifically, we consider
(separately) scenarios in which (a) items are finitely divisible and hence may
participate in at most $\gamma \in o(\log(n))$ tests; or (b) tests are
size-constrained to pool no more than $\rho \in o(n/d)$ items per test. For both
scenarios we provide information-theoretic lower bounds on the number of tests
required to guarantee high probability recovery. In both scenarios we provide
both randomized constructions (under both $\epsilon$-error and zero-error
reconstruction guarantees) and explicit constructions of designs with
computationally efficient reconstruction algorithms that require a number of
tests that is optimal up to constant or small polynomial factors in some
regimes of $n$, $d$, $\gamma$, and $\rho$. The randomized design/reconstruction
algorithm in the $\rho$-sized test scenario is universal -- independent of the
value of $d$, as long as $\rho \in o(n/d)$. We also investigate the effect of
unreliability/noise in test outcomes. For the full abstract, please see the
full text PDF
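
The size-constrained setting described in this abstract can be illustrated with a minimal sketch. It is not the construction from the paper: it assumes a simple random design in which every test pools exactly `rho` items, noiseless outcomes, and the standard COMP decoder (any item that appears in a negative test is cleared, everything else is declared defective). The function names and the parameter values are hypothetical and chosen only for illustration.

```python
import random


def random_sparse_design(n, num_tests, rho, seed=0):
    """Build num_tests pools, each holding rho distinct items drawn at random."""
    rng = random.Random(seed)
    return [set(rng.sample(range(n), rho)) for _ in range(num_tests)]


def run_tests(design, defectives):
    """Noiseless outcomes: a test is positive iff its pool contains a defective."""
    return [bool(pool & defectives) for pool in design]


def comp_decode(n, design, outcomes):
    """COMP decoding: clear every item seen in a negative test, keep the rest."""
    candidates = set(range(n))
    for pool, positive in zip(design, outcomes):
        if not positive:
            candidates -= pool
    return candidates


# Assumed toy instance: 200 items, 3 defectives, at most 20 items per test.
n, rho = 200, 20
defectives = {5, 77, 150}
design = random_sparse_design(n, num_tests=150, rho=rho)
outcomes = run_tests(design, defectives)
estimate = comp_decode(n, design, outcomes)
print(sorted(estimate))
```

With noiseless outcomes COMP never clears a true defective, so the estimate always contains the defective set; whether it also contains false positives depends on the number of tests and on `rho`.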
ELUCID - Exploring the Local Universe with reConstructed Initial Density field III: Constrained Simulation in the SDSS Volume
A method we developed recently for the reconstruction of the initial density
field in the nearby Universe is applied to the Sloan Digital Sky Survey Data
Release 7. A high-resolution N-body constrained simulation (CS) of the
reconstructed initial condition, with $3072^3$ particles evolved in a 500 Mpc/h
box, is carried out and analyzed in terms of the statistical properties of the
final density field and its relation with the distribution of SDSS galaxies. We
find that the statistical properties of the cosmic web and the halo populations
are accurately reproduced in the CS. The galaxy density field is strongly
correlated with the CS density field, with a bias that depends on both galaxy
luminosity and color. Our further investigations show that the CS provides
robust quantities describing the environments within which the observed
galaxies and galaxy systems reside. Cosmic variance is greatly reduced in the
CS so that the statistical uncertainties can be controlled effectively even for
samples of small volumes.
Comment: submitted to ApJ, 19 pages, 22 figures. Please download the
high-resolution version at http://staff.ustc.edu.cn/~whywang/paper
The Globular Cluster System of the Coma cD Galaxy NGC 4874 from Hubble Space Telescope ACS and WFC3/IR Imaging
We present new HST optical and near-infrared (NIR) photometry of the rich
globular cluster (GC) system of NGC 4874, the cD galaxy in the core of the Coma
cluster (Abell 1656). NGC 4874 was observed with the HST Advanced Camera for
Surveys in the F475W (g) and F814W (I) passbands and the Wide Field Camera 3 IR
Channel in F160W (H). The GCs in this field exhibit a bimodal optical color
distribution with more than half of the GCs falling on the red side at g-I > 1.
Bimodality is also present, though less conspicuously, in the optical-NIR I-H
color. Consistent with past work, we find evidence for nonlinearity in the g-I
versus I-H color-color relation. Our results thus underscore the need for
understanding the detailed form of the color-metallicity relations in
interpreting observational data on GC bimodality. We also find a very strong
color-magnitude trend, or "blue tilt," for the blue component of the optical
color distribution of the NGC 4874 GC system. A similarly strong trend is
present for the overall mean I-H color as a function of magnitude; for M_814 <
-10 mag, these trends imply a steep mass-metallicity scaling with $Z \propto M_{\rm GC}^{1.4\pm0.4}$, but the scaling is not a simple power law and becomes
much weaker at lower masses. As in other similar systems, the spatial
distribution of the blue GCs is more extended than that of the red GCs, partly
because of blue GCs associated with surrounding cluster galaxies. In addition,
the center of the GC system is displaced by 4+/-1 kpc towards the southwest
from the luminosity center of NGC 4874, in the direction of NGC 4872. Finally,
we remark on a dwarf elliptical galaxy with a noticeably asymmetrical GC
distribution. Interestingly, this dwarf has a velocity of nearly -3000 km/s
with respect to NGC 4874; we suggest it is on its first infall into the cluster
core and is undergoing stripping of its GC system by the cluster potential.
Comment: 24 pages, 20 figures, accepted for publication in Ap