13,493 research outputs found

    A Large Scale Dataset for the Evaluation of Ontology Matching Systems

    Get PDF
    Recently, the number of ontology matching techniques and systems has increased significantly. This makes the issue of their evaluation and comparison more severe. One of the challenges of the ontology matching evaluation is in building large scale evaluation datasets. In fact, the number of possible correspondences between two ontologies grows quadratically with respect to the numbers of entities in these ontologies. This often makes the manual construction of the evaluation datasets demanding to the point of being infeasible for large scale matching tasks. In this paper we present an ontology matching evaluation dataset composed of thousands of matching tasks, called TaxME2. It was built semi-automatically out of the Google, Yahoo and Looksmart web directories. We evaluated TaxME2 by exploiting the results of almost two dozen of state of the art ontology matching systems. The experiments indicate that the dataset possesses the desired key properties, namely it is error-free, incremental, discriminative, monotonic, and hard for the state of the art ontology matching systems. The paper has been accepted for publication in "The Knowledge Engineering Review", Cambridge Universty Press (ISSN: 0269-8889, EISSN: 1469-8005)

    XML Matchers: approaches and challenges

    Full text link
    Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was largely investigated especially for classical database models (e.g., E/R schemas, relational databases, etc.). However, in the latest years, the widespread adoption of XML in the most disparate application fields pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aiming at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them on DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs impact on the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear as unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers.Comment: 34 pages, 8 tables, 7 figure

    Multilingual Schema Matching for Wikipedia Infoboxes

    Full text link
    Recent research has taken advantage of Wikipedia's multilingualism as a resource for cross-language information retrieval and machine translation, as well as proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content, and in particular, to enable answers that straddle different languages. As a step towards supporting such queries, in this paper, we propose a method for identifying mappings between attributes from infoboxes that come from pages in different languages. Our approach finds mappings in a completely automated fashion. Because it does not require training data, it is scalable: not only can it be used to find mappings between many language pairs, but it is also effective for languages that are under-represented and lack sufficient training samples. Another important benefit of our approach is that it does not depend on syntactic similarity between attribute names, and thus, it can be applied to language pairs that have distinct morphologies. We have performed an extensive experimental evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and English. The results show that not only does our approach obtain high precision and recall, but it also outperforms state-of-the-art techniques. We also present a case study which demonstrates that the multilingual mappings we derive lead to substantial improvements in answer quality and coverage for structured queries over Wikipedia content.Comment: VLDB201

    Nearly Optimal Sparse Group Testing

    Full text link
    Group testing is the process of pooling arbitrary subsets from a set of nn items so as to identify, with a minimal number of tests, a "small" subset of dd defective items. In "classical" non-adaptive group testing, it is known that when dd is substantially smaller than nn, Θ(dlog(n))\Theta(d\log(n)) tests are both information-theoretically necessary and sufficient to guarantee recovery with high probability. Group testing schemes in the literature meeting this bound require most items to be tested Ω(log(n))\Omega(\log(n)) times, and most tests to incorporate Ω(n/d)\Omega(n/d) items. Motivated by physical considerations, we study group testing models in which the testing procedure is constrained to be "sparse". Specifically, we consider (separately) scenarios in which (a) items are finitely divisible and hence may participate in at most γo(log(n))\gamma \in o(\log(n)) tests; or (b) tests are size-constrained to pool no more than ρo(n/d)\rho \in o(n/d)items per test. For both scenarios we provide information-theoretic lower bounds on the number of tests required to guarantee high probability recovery. In both scenarios we provide both randomized constructions (under both ϵ\epsilon-error and zero-error reconstruction guarantees) and explicit constructions of designs with computationally efficient reconstruction algorithms that require a number of tests that are optimal up to constant or small polynomial factors in some regimes of n,d,γ,n, d, \gamma, and ρ\rho. The randomized design/reconstruction algorithm in the ρ\rho-sized test scenario is universal -- independent of the value of dd, as long as ρo(n/d)\rho \in o(n/d). We also investigate the effect of unreliability/noise in test outcomes. For the full abstract, please see the full text PDF

    ELUCID - Exploring the Local Universe with reConstructed Initial Density field III: Constrained Simulation in the SDSS Volume

    Full text link
    A method we developed recently for the reconstruction of the initial density field in the nearby Universe is applied to the Sloan Digital Sky Survey Data Release 7. A high-resolution N-body constrained simulation (CS) of the reconstructed initial condition, with 307233072^3 particles evolved in a 500 Mpc/h box, is carried out and analyzed in terms of the statistical properties of the final density field and its relation with the distribution of SDSS galaxies. We find that the statistical properties of the cosmic web and the halo populations are accurately reproduced in the CS. The galaxy density field is strongly correlated with the CS density field, with a bias that depend on both galaxy luminosity and color. Our further investigations show that the CS provides robust quantities describing the environments within which the observed galaxies and galaxy systems reside. Cosmic variance is greatly reduced in the CS so that the statistical uncertainties can be controlled effectively even for samples of small volumes.Comment: submitted to ApJ, 19 pages, 22 figures. Please download the high-resolution version at http://staff.ustc.edu.cn/~whywang/paper

    The Globular Cluster System of the Coma cD Galaxy NGC 4874 from Hubble Space Telescope ACS and WFC3/IR Imaging

    Get PDF
    We present new HST optical and near-infrared (NIR) photometry of the rich globular cluster (GC) system of NGC 4874, the cD galaxy in the core of the Coma cluster (Abell 1656). NGC 4874 was observed with the HST Advanced Camera for Surveys in the F475W (g) and F814W (I) passbands and the Wide Field Camera 3 IR Channel in F160W (H). The GCs in this field exhibit a bimodal optical color distribution with more than half of the GCs falling on the red side at g-I > 1. Bimodality is also present, though less conspicuously, in the optical-NIR I-H color. Consistent with past work, we find evidence for nonlinearity in the g-I versus I-H color-color relation. Our results thus underscore the need for understanding the detailed form of the color-metallicity relations in interpreting observational data on GC bimodality. We also find a very strong color-magnitude trend, or "blue tilt," for the blue component of the optical color distribution of the NGC 4874 GC system. A similarly strong trend is present for the overall mean I-H color as a function of magnitude; for M_814 < -10 mag, these trends imply a steep mass-metallicity scaling with ZMGC1.4±0.4Z\propto M_{\rm GC}^{1.4\pm0.4}, but the scaling is not a simple power law and becomes much weaker at lower masses. As in other similar systems, the spatial distribution of the blue GCs is more extended than that of the red GCs, partly because of blue GCs associated with surrounding cluster galaxies. In addition, the center of the GC system is displaced by 4+/-1 kpc towards the southwest from the luminosity center of NGC 4874, in the direction of NGC 4872. Finally, we remark on a dwarf elliptical galaxy with a noticeably asymmetrical GC distribution. Interestingly, this dwarf has a velocity of nearly -3000 km/s with respect to NGC 4874; we suggest it is on its first infall into the cluster core and is undergoing stripping of its GC system by the cluster potential.Comment: 24 pages, 20 figures, accepted for publication in Ap
    corecore