SiGMa: Simple Greedy Matching for Aligning Large Knowledge Bases
The Internet has enabled the creation of a growing number of large-scale
knowledge bases in a variety of domains containing complementary information.
Tools for automatically aligning these knowledge bases would make it possible
to unify many sources of structured knowledge and answer complex queries.
However, the efficient alignment of large-scale knowledge bases still poses a
considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a
simple algorithm for aligning knowledge bases with millions of entities and
facts. SiGMa is an iterative propagation algorithm that leverages both the
structural information from the relationship graph and flexible similarity
measures between entity properties in a greedy local search, which keeps it
scalable. Despite its greedy nature, our experiments indicate that
SiGMa can efficiently match some of the world's largest knowledge bases with
high precision. We provide additional experiments on benchmark datasets which
demonstrate that SiGMa can outperform state-of-the-art approaches both in
accuracy and efficiency.
Comment: 10 pages + 2 pages appendix; 5 figures -- initial preprint
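The greedy propagation loop the abstract describes can be sketched as a best-first search over candidate pairs, where matching a pair boosts the priority of its neighbour pairs. This is a minimal illustration, not the paper's algorithm: the all-pairs seeding, the `sim` callback, and the fixed propagation bonus are assumptions (SiGMa itself avoids quadratic seeding through candidate generation).

```python
import heapq

def sigma_align(entities_a, entities_b, neighbors_a, neighbors_b, sim):
    """Greedy one-to-one alignment sketch: repeatedly take the
    best-scoring unmatched pair, then re-queue neighbour pairs of each
    new match with a bonus (the propagation step)."""
    # Seed with property-similarity scores (negated for the min-heap).
    heap = [(-sim(a, b), a, b) for a in entities_a for b in entities_b]
    heapq.heapify(heap)
    matched_a, matched_b, matches = set(), set(), {}
    while heap:
        _, a, b = heapq.heappop(heap)
        if a in matched_a or b in matched_b:
            continue  # one side already aligned; stale entry, skip it
        matches[a] = b
        matched_a.add(a)
        matched_b.add(b)
        # Propagation: neighbours of a matched pair become likelier
        # matches, so push them back with an (assumed) fixed bonus.
        for na in neighbors_a.get(a, ()):
            for nb in neighbors_b.get(b, ()):
                if na not in matched_a and nb not in matched_b:
                    heapq.heappush(heap, (-(sim(na, nb) + 1.0), na, nb))
    return matches
```

The heap makes the local search greedy in the paper's sense: once a pair is matched it is never revisited, and structural evidence only ever raises the priority of its neighbours.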
Searching by approximate personal-name matching
We discuss the design, construction, and evaluation of a method for retrieving
the information of a person using the name as a search key, even when the name
contains deformations. We present a similarity function, the DEA function,
based on the probabilities of the edit operations according to the letters
involved and their positions, and using a variable threshold. The efficacy
of DEA, evaluated quantitatively without human relevance judgments, is far
superior to that of known methods. A very efficient approximate-search
technique for the DEA function, based on a compacted trie structure, is also
presented.
Postprint (published version)
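The letter- and position-dependent edit costs behind DEA can be illustrated with a weighted Levenshtein distance. The `sub_cost` callback below is a placeholder for DEA's operation probabilities, which the abstract does not specify; here only substitution cost varies, while insertions and deletions keep unit cost.

```python
def weighted_edit_distance(s, t, sub_cost):
    """Levenshtein variant where substitution cost may depend on the
    letters involved and their position, loosely in the spirit of the
    DEA function (its actual probabilities are not reproduced here)."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + 1.0          # delete s[i-1]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + 1.0          # insert t[j-1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,           # deletion
                d[i][j - 1] + 1.0,           # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1], i - 1),
            )
    return d[m][n]
```

With a unit substitution cost this reduces to the classical edit distance; a DEA-style cost table would instead make, say, a substitution between visually or phonetically confusable letters cheap at positions where such errors are probable.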
Use of Multi-Terms in Legal Text Search
Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Graduate Program in Computer Science. Multi-term search supports querying textual databases by combining the words occurring in each document and producing an index ranked by the frequency of occurrence of each generated term. Applying the methodology to legal research proves highly efficient. The study finds that multi-term queries return fewer documents, with a higher level of quality. The generation of search indexes is optimized by excluding words of very high or very low frequency, as well as by limiting the number of words that form each term.
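The indexing scheme described, combining words into multi-terms and pruning by frequency, can be sketched as adjacent n-gram extraction with a frequency band. The parameter names, defaults, and the simple band filter are illustrative assumptions, not the dissertation's exact rules.

```python
from collections import Counter

def build_multiterm_index(docs, n=2, min_freq=2, max_freq=None):
    """Sketch of a multi-term index: extract adjacent word n-grams per
    document, count their corpus frequency, and drop terms outside a
    frequency band (hypothetical parameters)."""
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    if max_freq is None:
        max_freq = float("inf")
    # Keep only terms whose frequency falls inside the band, mirroring
    # the exclusion of very high- and very low-frequency words.
    return {term: c for term, c in counts.items() if min_freq <= c <= max_freq}
```

On a toy corpus of three short "documents", only the bigram shared by two of them survives the `min_freq=2` cut, illustrating how the pruning shrinks the index while keeping the discriminative terms.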
A study on Analysis and Utilization of Crowd-sourced Spatio-temporal Contexts from Social Media
Graduate School, University of Hyogo, 201
Constructing Features and Pseudo-intersections to Map Unreliable Domain Specific Data Items Found in Disjoint Sets
This research studies the problem of identifying related tuples from two disjoint sets A and B of tuples of aircraft part data. The tuples in set B are defined as unique classifications, or candidates, to which tuples from set A map. The mapping studied is a many-to-one mapping. A context-free grammar (CFG) based on a subset of the data tuples being processed is used to construct relevant features from a single attribute field within the tuples. The notion of discovery items is introduced to assist in feature construction. Once constructed, features are assigned weight values. A sum-ordering feature-weighting approach to systematically compute weight values corresponding to analyst-defined ranks and constraints is presented. A series of record comparisons is conducted, and an Object Translation Score (OTS), based on the weight values, is computed with each comparison. The OTS is a quality-of-match score. Record Objects and the OTS are introduced to establish a method of quantifying the relationships, thus providing a mathematical means to measure and validate relationships. To boost a tuple's probability of registering an optimal OTS, learned data as well as checkpoint data are introduced; these data items are denoted as Enhancement data.
Findings and Conclusions: A new algorithm was introduced and compared to the popular EM-based probabilistic record linkage algorithm. The new algorithm outperformed the EM-based algorithm; however, it made some incorrect mappings as a result of poorly cleaned data, incorrectly classified terms, and the use of an inefficient string-comparison model. One difference between our approach and most traditional approaches is that each feature contained multiple values, whereas in traditional record linkage solutions there is normally a single value associated with each feature. Our approach creates features from one record field, in this case the part description field. In addition, no training data was needed, and external data was used to make optimal record mappings.
Computer Science Department
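The Object Translation Score, a sum of feature weights over matching multi-valued features, can be sketched as follows. The set-intersection match rule and the weight table are assumptions inferred from the description above, not the thesis's actual scoring rules.

```python
def object_translation_score(features_a, features_b, weights):
    """Hypothetical quality-of-match score between two records whose
    features each hold a SET of values (as in the described approach,
    where one record field yields multiple values per feature).
    A feature contributes its weight when the records share a value."""
    score = 0.0
    for feat, weight in weights.items():
        if features_a.get(feat, set()) & features_b.get(feat, set()):
            score += weight  # at least one shared value: feature matches
    return score
```

Because each feature carries a set of values, a single shared value is enough to trigger the feature's weight; analyst-defined ranks would be encoded by choosing larger weights for more discriminative features.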
Applications of Approximate Word Matching in Information Retrieval
As more online databases are integrated into digital libraries, the issue of quality control of the data becomes increasingly important, especially as it relates to the effective retrieval of information. The need to discover and reconcile variant forms of strings in bibliographic entries, i.e., authority work, will become more critical in the future. Spelling variants, misspellings, and transliteration differences will all increase the difficulty of retrieving information. Approximate string matching has traditionally been used to help with this problem. In this paper we introduce the notion of approximate word matching and show how it can be used to improve detection and categorization of variant forms.
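As a rough illustration of word-level variant detection (not the paper's method), the standard library's difflib can rank candidate variant spellings of a word against a vocabulary:

```python
import difflib

def variant_forms(word, vocabulary, cutoff=0.8):
    """Return vocabulary entries similar to `word`, best match first.
    difflib's similarity ratio stands in for the paper's approximate
    word-matching technique; the cutoff is an illustrative choice."""
    return difflib.get_close_matches(word, vocabulary, n=5, cutoff=cutoff)
```

For example, given a vocabulary containing the misspelling "transllteration", a query for "transliteration" would surface it ahead of less similar words, which is the kind of variant detection and grouping authority work requires.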