
    SiGMa: Simple Greedy Matching for Aligning Large Knowledge Bases

    The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm that leverages both the structural information from the relationship graph and flexible similarity measures between entity properties in a greedy local search, which makes it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches in both accuracy and efficiency.
    Comment: 10 pages + 2 pages appendix; 5 figures; initial preprint
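    A minimal sketch of the greedy propagation idea the abstract describes, not the authors' implementation: candidate entity pairs are scored by a weighted mix of property similarity and the fraction of already-matched neighbours, the best pair is fixed greedily, and fresh candidates are pushed from its neighbourhood. All names, the alpha weighting, and the seed-match interface are illustrative assumptions.

    import heapq
    from itertools import count

    def greedy_align(neighbors_a, neighbors_b, prop_sim, seeds, alpha=0.5):
        # Sketch of SiGMa-style greedy alignment (the scoring mix is an
        # assumption, not the paper's exact formulation).
        matched = dict(seeds)                 # entity in A -> entity in B
        matched_rev = {b: a for a, b in matched.items()}

        def graph_score(a, b):
            # Fraction of a's neighbours whose match lies among b's neighbours.
            na = neighbors_a[a]
            if not na:
                return 0.0
            hits = sum(1 for x in na if matched.get(x) in neighbors_b[b])
            return hits / len(na)

        def score(a, b):
            return alpha * prop_sim(a, b) + (1 - alpha) * graph_score(a, b)

        heap, tie = [], count()

        def push_neighbours(a, b):
            # Propagate: neighbours of a fresh match become new candidates.
            for x in neighbors_a[a]:
                for y in neighbors_b[b]:
                    if x not in matched and y not in matched_rev:
                        heapq.heappush(heap, (-score(x, y), next(tie), x, y))

        for a, b in list(matched.items()):
            push_neighbours(a, b)

        while heap:
            _, _, a, b = heapq.heappop(heap)
            if a in matched or b in matched_rev:
                continue                      # stale candidate, skip
            matched[a] = b
            matched_rev[b] = a
            push_neighbours(a, b)
        return matched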

    Searching by approximate personal-name matching

    We discuss the design, building, and evaluation of a method to access the information about a person using his name as a search key, even if the name has deformations. We present a similarity function, the DEA function, based on the probabilities of the edit operations according to the letters involved and their position, and using a variable threshold. The efficacy of DEA is evaluated quantitatively, without human relevance judgments, and proves far superior to that of known methods. A very efficient approximate-search technique for the DEA function, based on a compacted trie structure, is also presented.
    Postprint (published version)
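    The DEA probability tables are not reproduced here; the sketch below only shows the general shape of such a function, a dynamic-programming edit distance whose operation costs depend on the letters involved. The cost functions and the confusable-letter table are placeholders.

    def weighted_edit_distance(s, t, sub_cost, ins_cost, del_cost):
        # DP edit distance with per-letter operation costs, in the spirit
        # of the DEA function (cost functions are caller-supplied).
        m, n = len(s), len(t)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + del_cost(s[i - 1])
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + ins_cost(t[j - 1])
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0.0 if s[i - 1] == t[j - 1] else sub_cost(s[i - 1], t[j - 1])
                d[i][j] = min(d[i - 1][j - 1] + sub,
                              d[i - 1][j] + del_cost(s[i - 1]),
                              d[i][j - 1] + ins_cost(t[j - 1]))
        return d[m][n]

    # Confusable letter pairs substitute cheaply (an illustrative table).
    CONFUSABLE = {("b", "v"), ("v", "b"), ("s", "z"), ("z", "s")}
    dist = weighted_edit_distance(
        "gonzalez", "gonsales",
        sub_cost=lambda a, b: 0.3 if (a, b) in CONFUSABLE else 1.0,
        ins_cost=lambda c: 1.0,
        del_cost=lambda c: 1.0)               # -> 0.6, two cheap substitutions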

    Capturing place semantics on the GeoSocial web


    Uso de multi termos em pesquisa textual jurídica (Use of multi-word terms in legal text retrieval)

    Master's dissertation, Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação.
    Multi-word-term search supports querying of textual databases by combining the words found in each document and producing an index ranked by the frequency of occurrence of each generated term. Applying multi-word terms to legal research proves highly efficient. The study finds that using multi-word terms returns fewer documents per query, with a higher level of quality. Generation of the search index is optimized by excluding words of very high or very low frequency, and by limiting the number of words that form each term.
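    A hedged sketch of the indexing scheme the abstract describes: word n-grams up to a fixed length are counted across the collection, and terms outside a frequency band are excluded. All thresholds and names are illustrative, not the dissertation's values.

    from collections import Counter

    def build_multiword_index(documents, max_words_per_term=3,
                              min_freq=2, max_freq=1000):
        # Index of multi-word terms (word n-grams) ranked by frequency.
        counts = Counter()
        postings = {}
        for doc_id, text in documents.items():
            words = text.lower().split()
            for n in range(1, max_words_per_term + 1):
                for i in range(len(words) - n + 1):
                    term = " ".join(words[i:i + n])
                    counts[term] += 1
                    postings.setdefault(term, set()).add(doc_id)
        # Exclude very rare and very common terms, as the study suggests.
        return {t: postings[t] for t, c in counts.items()
                if min_freq <= c <= max_freq}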

    Constructing Features and Pseudo-intersections to Map Unreliable Domain Specific Data Items Found in Disjoint Sets

    This research studies the problem of identifying related tuples from two disjoint sets A and B of tuples of aircraft part data. The tuples in set B are defined as unique classifications or candidates to which tuples from set A map. The mapping studied is a many-to-one mapping. A context-free grammar (CFG) based on a subset of the data tuples being processed is used to construct relevant features from a single attribute field within the tuples. The notion of discovery items is introduced to assist in feature construction. Once constructed, features are assigned weight values. A sum-ordering feature-weighting approach is presented that systematically computes weight values corresponding to analyst-defined ranks and constraints. A series of record comparisons is conducted, and an Object Translation Score (OTS), based on weight values, is computed with each comparison. The OTS is a quality-of-match score. Record Objects and the OTS are introduced to establish a method of quantifying the relationships, thus providing a mathematical means to measure and validate them. To boost a tuple's probability of registering an optimal OTS, learned data as well as checkpoint data are introduced; these data items are denoted Enhancement data. Findings and Conclusions: A new algorithm was introduced and compared to the popular EM-based probabilistic record linkage algorithm. The new algorithm outperformed the EM-based algorithm; however, it made some incorrect mappings as a result of poorly cleaned data, incorrectly classified terms, and the use of an inefficient string comparison model. One difference between our approach and most traditional approaches is that each feature contained multiple values, whereas in traditional record linkage solutions there is normally a single value associated with each feature. Our approach creates features from one record field, in this case the part description field. In addition, no training data was needed, and external data was used to make optimal record mappings.
    Computer Science Department
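    The sketch below only illustrates the weighted, multi-valued feature comparison behind a quality-of-match score of this kind; the paper's discovery items, sum-ordering weighting, and Enhancement data are not reproduced, and all names and thresholds are assumptions.

    def object_translation_score(record, candidate, weights):
        # Quality-of-match sketch: sum the weights of features whose value
        # sets intersect (features hold multiple values, as in the paper).
        score = 0.0
        for feature, values in record.items():
            if values & candidate.get(feature, set()):
                score += weights.get(feature, 0.0)
        return score

    def map_many_to_one(records_a, candidates_b, weights, threshold=1.0):
        # Many-to-one mapping: each A-record goes to its best-scoring
        # B-candidate, if that score clears an (illustrative) threshold.
        mapping = {}
        for rid, feats in records_a.items():
            scored = {cid: object_translation_score(feats, cfeats, weights)
                      for cid, cfeats in candidates_b.items()}
            best = max(scored, key=scored.get)
            if scored[best] >= threshold:
                mapping[rid] = best
        return mapping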

    Applications of Approximate Word Matching in Information Retrieval

    As more online databases are integrated into digital libraries, the issue of quality control of the data becomes increasingly important, especially as it relates to the effective retrieval of information. The need to discover and reconcile variant forms of strings in bibliographic entries, i.e., authority work, will become more critical in the future. Spelling variants, misspellings, and transliteration differences will all increase the difficulty of retrieving information. Approximate string matching has traditionally been used to help with this problem. In this paper we introduce the notion of approximate word matching and show how it can be used to improve the detection and categorization of variant forms.
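    As a stand-in for the paper's approximate word matching, a small sketch that groups variant string forms by a similarity ratio; difflib is a substitute technique, and the cutoff is arbitrary.

    import difflib

    def group_variants(words, cutoff=0.8):
        # Group strings whose similarity to a group's representative
        # clears the cutoff (greedy single-pass clustering).
        groups = []
        for w in sorted(words):
            for g in groups:
                if difflib.SequenceMatcher(None, w, g[0]).ratio() >= cutoff:
                    g.append(w)
                    break
            else:
                groups.append([w])
        return groups

    # Variant name forms collapse into a single group:
    print(group_variants(["Dostoevsky", "Dostoyevsky", "Dostoevskij", "Tolstoy"]))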