A Large Scale Dataset for the Evaluation of Ontology Matching Systems

Author: Avesani Paolo
Giunchiglia Fausto
Shvaiko Pavel
Yatskevich Mikalai
Publication date: 01/01/2008

Recently, the number of ontology matching techniques and systems has increased significantly. This makes the issue of their evaluation and comparison more severe. One of the challenges of ontology matching evaluation is building large-scale evaluation datasets. In fact, the number of possible correspondences between two ontologies grows quadratically with respect to the number of entities in these ontologies. This often makes the manual construction of evaluation datasets demanding to the point of being infeasible for large-scale matching tasks. In this paper we present an ontology matching evaluation dataset composed of thousands of matching tasks, called TaxME2. It was built semi-automatically out of the Google, Yahoo and Looksmart web directories. We evaluated TaxME2 by exploiting the results of almost two dozen state-of-the-art ontology matching systems. The experiments indicate that the dataset possesses the desired key properties, namely it is error-free, incremental, discriminative, monotonic, and hard for state-of-the-art ontology matching systems. The paper has been accepted for publication in "The Knowledge Engineering Review", Cambridge University Press (ISSN: 0269-8889, EISSN: 1469-8005)
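The quadratic growth mentioned in this abstract can be illustrated with a minimal sketch. It assumes that every entity of one ontology may be paired with every entity of the other; the function name `candidate_pairs` and the entity counts are hypothetical and do not come from the TaxME2 paper.

```python
def candidate_pairs(n_source: int, n_target: int) -> int:
    """Number of entity pairs a reviewer would have to consider (assumed all-pairs)."""
    return n_source * n_target

# Hypothetical ontology sizes, chosen only to show the growth rate.
for size in (10, 100, 1_000, 10_000):
    print(f"{size:>6} entities per ontology -> {candidate_pairs(size, size):>12} candidate pairs")
```

Multiplying both ontology sizes by ten multiplies the candidate count by one hundred, which is why manual construction of a reference alignment stops being practical at scale.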

CiteSeerX

Archivio della ricerca - Fondazione Bruno Kessler

Unitn-eprints Research

XML Matchers: approaches and challenges

Author: Agreste Santa
De Meo Pasquale
Ferrara Emilio
Ursino Domenico
Publication venue: 'Elsevier BV'
Publication date: 10/07/2014

Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in the Database and Artificial Intelligence research areas for many years. In the past, it was investigated mainly for classical database models (e.g., E/R schemas, relational databases, etc.). However, in recent years, the widespread adoption of XML in the most disparate application fields has pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aiming at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them to DTDs/XSDs; they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs affect the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers. Comment: 34 pages, 8 tables, 7 figures

arXiv.org e-Print Archive

IRIS Università Politecnica delle Marche

Multilingual Schema Matching for Wikipedia Infoboxes

Author: Freire Juliana
Moreira Viviane
Nguyen Hoa
Nguyen Huong
Nguyen Thanh
Publication date: 01/01/2011

Recent research has taken advantage of Wikipedia's multilingualism as a resource for cross-language information retrieval and machine translation, and has proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content and, in particular, for enabling answers that straddle different languages. As a step towards supporting such queries, in this paper we propose a method for identifying mappings between attributes from infoboxes that come from pages in different languages. Our approach finds mappings in a completely automated fashion. Because it does not require training data, it is scalable: not only can it be used to find mappings between many language pairs, but it is also effective for languages that are under-represented and lack sufficient training samples. Another important benefit of our approach is that it does not depend on syntactic similarity between attribute names, and thus it can be applied to language pairs that have distinct morphologies. We have performed an extensive experimental evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and English. The results show that not only does our approach obtain high precision and recall, but it also outperforms state-of-the-art techniques. We also present a case study which demonstrates that the multilingual mappings we derive lead to substantial improvements in answer quality and coverage for structured queries over Wikipedia content. Comment: VLDB201

arXiv.org e-Print Archive

CiteSeerX

Putting Context into Schema Matching

Author: Bohannon Philip
Elnahrawy Eiman
Fan Wenfei
Flaster Michael
Publication date: 01/01/2006

Edinburgh Research Explorer

Nearly Optimal Sparse Group Testing

Author: Gandikota Venkata
Grigorescu Elena
Jaggi Sidharth
Zhou Samson
Publication date: 19/09/2018

Group testing is the process of pooling arbitrary subsets from a set of $n$ items so as to identify, with a minimal number of tests, a "small" subset of $d$ defective items. In "classical" non-adaptive group testing, it is known that when $d$ is substantially smaller than $n$, $\Theta(d\log(n))$ tests are both information-theoretically necessary and sufficient to guarantee recovery with high probability. Group testing schemes in the literature meeting this bound require most items to be tested $\Omega(\log(n))$ times, and most tests to incorporate $\Omega(n/d)$ items. Motivated by physical considerations, we study group testing models in which the testing procedure is constrained to be "sparse". Specifically, we consider (separately) scenarios in which (a) items are finitely divisible and hence may participate in at most $\gamma \in o(\log(n))$ tests; or (b) tests are size-constrained to pool no more than $\rho \in o(n/d)$ items per test. For both scenarios we provide information-theoretic lower bounds on the number of tests required to guarantee high probability recovery. In both scenarios we provide both randomized constructions (under both $\epsilon$-error and zero-error reconstruction guarantees) and explicit constructions of designs with computationally efficient reconstruction algorithms that require a number of tests that are optimal up to constant or small polynomial factors in some regimes of $n, d, \gamma,$ and $\rho$. The randomized design/reconstruction algorithm in the $\rho$-sized test scenario is universal -- independent of the value of $d$, as long as $\rho \in o(n/d)$. We also investigate the effect of unreliability/noise in test outcomes. For the full abstract, please see the full text PDF
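As a rough illustration of the classical, unconstrained non-adaptive setting that this abstract starts from, the sketch below pools items at random and decodes with the standard COMP rule (any item that appears in a negative test is cleared). This is not the sparse construction proposed in the paper. The Bernoulli design with inclusion probability 1/d, the function names, and the values of `n`, `d` and `num_tests` are all assumptions made for the example, and outcomes are taken to be noiseless.

```python
import random

def random_design(n, num_tests, d, rng):
    """Bernoulli design: each item joins each test independently with probability 1/d."""
    return [[i for i in range(n) if rng.random() < 1.0 / d] for _ in range(num_tests)]

def run_tests(pools, defectives):
    """Noiseless outcomes: a test is positive iff its pool contains a defective item."""
    return [any(i in defectives for i in pool) for pool in pools]

def comp_decode(n, pools, outcomes):
    """COMP: clear every item seen in a negative test; declare the rest defective."""
    cleared = set()
    for pool, positive in zip(pools, outcomes):
        if not positive:
            cleared.update(pool)
    return set(range(n)) - cleared

def max_tests_per_item(n, pools):
    """Largest number of tests any single item takes part in (the quantity gamma bounds)."""
    counts = [0] * n
    for pool in pools:
        for i in pool:
            counts[i] += 1
    return max(counts)

rng = random.Random(0)            # fixed seed so the run is repeatable
n, d, num_tests = 200, 4, 120     # hypothetical sizes, not taken from the paper
defectives = set(rng.sample(range(n), d))
pools = random_design(n, num_tests, d, rng)
outcomes = run_tests(pools, defectives)
estimate = comp_decode(n, pools, outcomes)

print("true defectives:", sorted(defectives))
print("COMP estimate  :", sorted(estimate))
print("largest pool size (rho bounds this):", max(len(p) for p in pools))
print("most tests for one item (gamma bounds this):", max_tests_per_item(n, pools))
```

With noiseless outcomes COMP never misses a true defective, because a defective item can only appear in positive tests; it may still report extra items if too few tests are run. The last two printed values are the quantities that the sparse models in the abstract cap at $\rho$ and $\gamma$.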

arXiv.org e-Print Archive

Explore Bristol Research

ELUCID - Exploring the Local Universe with reConstructed Initial Density field III: Constrained Simulation in the SDSS Volume

Author: Gao Yang
Jing Y. P.
Kang Xi
Li Shijie
Liu Chengze
Mo H. J.
Shi JingJing
Wang Huiyuan
Yang Xiaohu
Zhang Youcai
Publication venue: 'American Astronomical Society'
Publication date: 30/08/2016

A method we developed recently for the reconstruction of the initial density field in the nearby Universe is applied to the Sloan Digital Sky Survey Data Release 7. A high-resolution N-body constrained simulation (CS) of the reconstructed initial condition, with $3072^3$ particles evolved in a 500 Mpc/h box, is carried out and analyzed in terms of the statistical properties of the final density field and its relation with the distribution of SDSS galaxies. We find that the statistical properties of the cosmic web and the halo populations are accurately reproduced in the CS. The galaxy density field is strongly correlated with the CS density field, with a bias that depends on both galaxy luminosity and color. Our further investigations show that the CS provides robust quantities describing the environments within which the observed galaxies and galaxy systems reside. Cosmic variance is greatly reduced in the CS so that the statistical uncertainties can be controlled effectively even for samples of small volumes. Comment: submitted to ApJ, 19 pages, 22 figures. Please download the high-resolution version at http://staff.ustc.edu.cn/~whywang/paper

arXiv.org e-Print Archive

Shanghai Astronomical Observatory,Chinese Academy of Sciences

The Globular Cluster System of the Coma cD Galaxy NGC 4874 from Hubble Space Telescope ACS and WFC3/IR Imaging

Author: Blakeslee John P.
Chies-Santos Ana L.
Cho Hyejeon
Jee M. James
Jensen Joseph B.
Lee Young-Wook
Peng Eric W.
Publication venue: 'American Astronomical Society'
Publication date: 01/01/2016

We present new HST optical and near-infrared (NIR) photometry of the rich globular cluster (GC) system of NGC 4874, the cD galaxy in the core of the Coma cluster (Abell 1656). NGC 4874 was observed with the HST Advanced Camera for Surveys in the F475W (g) and F814W (I) passbands and the Wide Field Camera 3 IR Channel in F160W (H). The GCs in this field exhibit a bimodal optical color distribution with more than half of the GCs falling on the red side at g-I > 1. Bimodality is also present, though less conspicuously, in the optical-NIR I-H color. Consistent with past work, we find evidence for nonlinearity in the g-I versus I-H color-color relation. Our results thus underscore the need for understanding the detailed form of the color-metallicity relations in interpreting observational data on GC bimodality. We also find a very strong color-magnitude trend, or "blue tilt," for the blue component of the optical color distribution of the NGC 4874 GC system. A similarly strong trend is present for the overall mean I-H color as a function of magnitude; for M_814 < -10 mag, these trends imply a steep mass-metallicity scaling with $Z\propto M_{\rm GC}^{1.4\pm0.4}$, but the scaling is not a simple power law and becomes much weaker at lower masses. As in other similar systems, the spatial distribution of the blue GCs is more extended than that of the red GCs, partly because of blue GCs associated with surrounding cluster galaxies. In addition, the center of the GC system is displaced by 4+/-1 kpc towards the southwest from the luminosity center of NGC 4874, in the direction of NGC 4872. Finally, we remark on a dwarf elliptical galaxy with a noticeably asymmetrical GC distribution. Interestingly, this dwarf has a velocity of nearly -3000 km/s with respect to NGC 4874; we suggest it is on its first infall into the cluster core and is undergoing stripping of its GC system by the cluster potential. Comment: 24 pages, 20 figures, accepted for publication in Ap
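To give a sense of how steep the quoted scaling is, the following worked step takes the relation at face value over the bright range where the abstract says it applies ($M_{814} < -10$ mag). The factor-of-ten mass ratio is a hypothetical choice for illustration, and the abstract itself notes that the relation is not a simple power law at lower masses.

```latex
% Assumed: a pure power law Z \propto M_{\rm GC}^{1.4 \pm 0.4} over the bright range only.
\[
  \frac{Z_2}{Z_1} = \left(\frac{M_{{\rm GC},2}}{M_{{\rm GC},1}}\right)^{1.4}
  \quad\Longrightarrow\quad
  \frac{M_{{\rm GC},2}}{M_{{\rm GC},1}} = 10
  \;\Rightarrow\;
  \frac{Z_2}{Z_1} = 10^{1.4} \approx 25 .
\]
% The quoted uncertainty on the exponent (1.0 to 1.8) brackets this ratio:
\[
  10^{1.0} = 10 \;\lesssim\; \frac{Z_2}{Z_1} \;\lesssim\; 10^{1.8} \approx 63 .
\]
```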

arXiv.org e-Print Archive

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Lume 5.8