
    Query-Driven Sampling for Collective Entity Resolution

    Probabilistic databases play a preeminent role in the processing and management of uncertain data. Recently, many database research efforts have integrated probabilistic models into databases to support tasks such as information extraction and labeling. Many of these efforts rely on batch-oriented inference, which inhibits a real-time workflow. One important task is entity resolution (ER), the process of determining which records (mentions) in a database correspond to the same real-world entity. Traditional pairwise ER methods can lead to inconsistencies and low accuracy because they make localized decisions. Leading ER systems solve this problem by collectively resolving all records using a probabilistic graphical model and Markov chain Monte Carlo (MCMC) inference. For large datasets, however, this is an extremely expensive process. A key observation is that such an exhaustive ER process incurs a huge up-front cost, which is wasteful in practice because most users are interested in only a small subset of entities. In this paper, we advocate pay-as-you-go entity resolution by developing a number of query-driven collective ER techniques. We introduce two classes of SQL queries that involve ER operators: selection-driven ER and join-driven ER. We implement novel variations of the MCMC Metropolis-Hastings algorithm to generate biased samples, and selectivity-based scheduling algorithms to support the two classes of ER queries. Finally, we show that query-driven ER algorithms can converge and return results within minutes over a database populated with extractions from a newswire dataset containing 71 million mentions.
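    To make the idea of query-biased sampling concrete, here is a minimal sketch of a Metropolis-Hastings sampler over entity assignments whose proposals favor mentions touched by the user's query. The data layout, the `similarity` scoring function, and the `bias` parameter are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def mh_er_sample(mentions, similarity, query_ids, steps=10000, temp=1.0, bias=0.9):
    """Toy query-driven Metropolis-Hastings sampler for collective ER.

    mentions: list of mention ids; similarity(a, b) -> float in [0, 1];
    query_ids: mentions the user's query touches (proposals favor these).
    Returns a clustering as a dict mapping mention -> entity label.
    """
    entity = {m: m for m in mentions}           # start: every mention is its own entity

    def score(m, e):
        # Compatibility of mention m with entity e (sum of pairwise similarities).
        members = [x for x in mentions if entity[x] == e and x != m]
        return sum(similarity(m, x) for x in members)

    for _ in range(steps):
        # Biased proposal: with probability `bias`, mutate a query-relevant mention,
        # so sampling effort concentrates on the part of the model the query needs.
        pool = query_ids if random.random() < bias else mentions
        m = random.choice(pool)
        old_e = entity[m]
        new_e = entity[random.choice(mentions)]  # propose moving m to another entity
        delta = score(m, new_e) - score(m, old_e)
        # Metropolis acceptance: always accept improvements, sometimes accept worse moves.
        if delta >= 0 or random.random() < math.exp(delta / temp):
            entity[m] = new_e
    return entity
```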

    String Indexing for Top-k Close Consecutive Occurrences

    The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i, j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$; its distance is defined as $j - i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give two time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\epsilon \in (0, 1]$. Our first result achieves $O(n \log n)$ space and optimal query time $O(m + k)$; our second achieves linear space and query time $O(m + k^{1+\epsilon})$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.
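    For intuition about the problem definition (not the paper's data structure), a naive, non-indexed reference computation is easy to write: find all occurrences of $P$ in $S$, pair up adjacent ones, and keep the $k$ pairs of smallest distance. This rescans $S$ on every query, so it is nowhere near the paper's query bounds.

```python
import heapq

def top_k_close_consecutive(S, P, k):
    """Naive reference for SITCCO: report the top-k consecutive occurrences
    of P in S with minimal distance j - i. An index avoids rescanning S."""
    # Collect all (possibly overlapping) occurrence positions, left to right.
    occ, i = [], S.find(P)
    while i != -1:
        occ.append(i)
        i = S.find(P, i + 1)
    # Consecutive occurrences are adjacent positions in occ.
    pairs = [(occ[t + 1] - occ[t], (occ[t], occ[t + 1]))
             for t in range(len(occ) - 1)]
    return [p for _, p in heapq.nsmallest(k, pairs)]

# Example: occurrences of "ab" in "abxababab" are at 0, 3, 5, 7;
# the two closest consecutive pairs are (3, 5) and (5, 7).
print(top_k_close_consecutive("abxababab", "ab", 2))  # [(3, 5), (5, 7)]
```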

    Image Characterization and Classification by Physical Complexity

    We present a method for estimating the complexity of an image based on Bennett's concept of logical depth. Bennett identified logical depth as the appropriate measure of organized complexity, and hence as better suited to evaluating the complexity of objects in the physical world. Its use yields a different, and in some sense finer, characterization than is obtained through the application of Kolmogorov complexity alone. We use this measure to classify images by their information content. The method provides a means of classifying and evaluating the complexity of objects by way of their visual representations. To the authors' knowledge, the method and application inspired by the concept of logical depth presented herein are proposed and implemented for the first time.
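    The practical recipe behind such estimates is that Kolmogorov complexity is approximated by compressed size, while logical depth is approximated by the time needed to decompress the near-minimal representation. The sketch below uses zlib as a stand-in compressor and a simple averaged timer; it illustrates the idea only and is not the authors' exact experimental setup.

```python
import time
import zlib

def kolmogorov_estimate(data: bytes) -> int:
    """Proxy for Kolmogorov complexity: size of the compressed representation."""
    return len(zlib.compress(data, level=9))

def logical_depth_estimate(data: bytes, repeats: int = 50) -> float:
    """Proxy for Bennett's logical depth: average decompression time of the
    compressed representation, which stands in for the running time of a
    near-minimal program that reproduces the data."""
    compressed = zlib.compress(data, level=9)
    start = time.perf_counter()
    for _ in range(repeats):
        zlib.decompress(compressed)
    return (time.perf_counter() - start) / repeats

# Random data and highly organized data can have similar compressed sizes
# yet differ in decompression time; that gap is what logical depth captures.
```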

    An examination of heuristic algorithms for the travelling salesman problem

    The role of heuristics in combinatorial optimization is discussed. Published heuristics for the Travelling Salesman Problem (TSP) were reviewed, and morphological boxes were used to develop new heuristics for the TSP. New and published heuristics were programmed for symmetric TSPs in which the triangle inequality holds, and were tested on a microcomputer. The best of the quickest heuristics was the furthest insertion heuristic, finding tours 3 to 9% above the best known solutions (2 minutes for 100 nodes). Better results were found by longer-running heuristics, e.g. the cheapest angle heuristic (CCAO), 0 to 6% above best (80 minutes for 100 nodes). The savings heuristic found the best results overall, but took more than 2 hours to complete. Of the new heuristics, the MST path algorithm at times improved on the results of the furthest insertion heuristic while taking the same time as the CCAO. The study indicated that there is little likelihood of improving on present methods unless a fundamentally new approach is discovered. Finally, a case study using TSP heuristics to aid the planning of grid surveys is described.
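    Since the furthest insertion heuristic is the study's best quick method, a short sketch of it may help: repeatedly pick the unvisited city furthest from the current tour and insert it where it lengthens the tour the least. This is a standard Euclidean formulation, not the study's original microcomputer code.

```python
import math

def furthest_insertion(points):
    """Furthest insertion heuristic for the symmetric Euclidean TSP.
    points: list of (x, y) tuples. Returns a tour as a list of indices."""
    def d(a, b):
        return math.dist(points[a], points[b])

    n = len(points)
    # Start with the two mutually furthest cities.
    a, b = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: d(*p))
    tour, remaining = [a, b], set(range(n)) - {a, b}
    while remaining:
        # Pick the remaining city furthest from the current tour.
        c = max(remaining, key=lambda r: min(d(r, t) for t in tour))
        # Insert it at the position of cheapest added length (cyclic tour).
        pos = min(range(len(tour)),
                  key=lambda i: d(tour[i], c) + d(c, tour[(i + 1) % len(tour)])
                              - d(tour[i], tour[(i + 1) % len(tour)]))
        tour.insert(pos + 1, c)
        remaining.remove(c)
    return tour

# Example: a small instance in the plane.
print(furthest_insertion([(0, 0), (1, 0), (1, 1), (0, 1), (2, 2)]))
```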

    10091 Abstracts Collection -- Data Structures

    From February 28th to March 5th, 2010, Dagstuhl Seminar 10091 "Data Structures" was held at Schloss Dagstuhl - Leibniz Center for Informatics. It brought together 45 international researchers to discuss recent developments concerning data structures, both in terms of research and in terms of new technologies that affect how data can be stored, updated, and retrieved. During the seminar, a fair number of participants presented their current research, and open problems were discussed. This document first briefly describes the seminar topics and then gives the abstracts of the presentations given during the seminar.

    Finding a Cluster in Incomplete Data

    We study two variants of the fundamental problem of finding a cluster in incomplete data. In the problems under consideration, we are given a multiset of incomplete d-dimensional vectors over the binary domain and integers k and r, and the goal is to complete the missing vector entries so that the multiset of complete vectors contains either (i) a cluster of k vectors of radius at most r, or (ii) a cluster of k vectors of diameter at most r. We give tight characterizations of the parameterized complexity of both problems with respect to the parameters k, r, and a third parameter that captures the missing vector entries.
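    To make the radius variant concrete, here is a brute-force checker under simplifying assumptions of my own: missing entries are `None`, a missing coordinate can always be completed to agree with the center, and candidate centers are restricted to completions of the chosen vectors. This is only a sanity-check sketch, not the paper's parameterized algorithms.

```python
from itertools import combinations

def min_dist(u, v):
    """Hamming distance over {0, 1, None}, counting only coordinates where
    both entries are known: a None can always be completed to agree."""
    return sum(1 for a, b in zip(u, v)
               if a is not None and b is not None and a != b)

def has_radius_cluster(vectors, k, r):
    """Brute force: do some k vectors admit completions lying within Hamming
    radius r of a common binary center? Centers are drawn from completions
    of the chosen vectors themselves (a simplification)."""
    for subset in combinations(vectors, k):
        for c in subset:
            # Candidate center: c with its missing entries set to 0.
            center = [x if x is not None else 0 for x in c]
            if all(min_dist(v, center) <= r for v in subset):
                return True
    return False

# Two vectors with one missing entry each form a radius-1 cluster.
print(has_radius_cluster([[0, None, 1], [0, 0, None], [1, 1, 1]], k=2, r=1))
```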

    Convex hull ranking algorithm for multi-objective evolutionary algorithms

    Due to the many applications of multi-objective evolutionary algorithms to real-world optimization problems, several studies have been conducted to improve these algorithms in recent years. Since most multi-objective evolutionary algorithms are based on the non-dominance principle, and their complexity depends on finding non-dominated fronts, this paper introduces a new method for ranking the solutions of an evolutionary algorithm's population. First, we investigate the relation between the convex hull and non-dominated solutions, and discuss the time complexity of the convex hull and non-dominated sorting problems. Then, we use convex hull concepts to present a new ranking procedure for multi-objective evolutionary algorithms. The proposed algorithm is well suited to convex multi-objective optimization problems. Finally, we apply this method as an alternative ranking procedure to NSGA-II for non-dominated comparisons, and test it on some benchmark problems.
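    One way to picture convex-hull-based ranking, for the bi-objective minimization case, is to peel successive lower convex hull layers of the objective vectors, analogously to how non-dominated sorting peels Pareto fronts. The sketch below is my own illustration of that idea (points as `(f1, f2)` tuples), not the paper's procedure.

```python
def lower_hull(points):
    """Lower convex hull of 2-D objective vectors (both minimized), via the
    monotone chain method; for convex problems this layer contains the
    Pareto-optimal points."""
    pts = sorted(set(points))
    hull = []
    for p in pts:
        # Pop while the last two hull points and p do not make a left turn.
        while len(hull) >= 2 and (
            (hull[-1][0] - hull[-2][0]) * (p[1] - hull[-2][1])
            - (hull[-1][1] - hull[-2][1]) * (p[0] - hull[-2][0]) <= 0
        ):
            hull.pop()
        hull.append(p)
    return hull

def convex_hull_ranking(points):
    """Rank solutions by peeling successive lower-hull layers, a stand-in
    for non-dominated sorting on convex bi-objective problems."""
    remaining, rank, layer = list(points), {}, 0
    while remaining:
        front = set(lower_hull(remaining))
        for p in front:
            rank[p] = layer
        remaining = [p for p in remaining if p not in front]
        layer += 1
    return rank

# (1,5), (2,2), (5,1) land on layer 0; (3,3), (4,4) on layer 1.
print(convex_hull_ranking([(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]))
```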

    Clustering and Validation of Microarray Data Using Consensus Clustering

    Clustering is a popular method for gleaning useful information from microarray data. Unfortunately, the results obtained from common clustering algorithms are not consistent, and even with multiple runs of different algorithms a further validation step is required. Due to the absence of well-defined class labels and an unknown number of clusters, the unsupervised learning problem of finding an optimal clustering is hard. Obtaining a consensus of judiciously obtained clusterings not only provides stable results but also lends a high level of confidence in the quality of the results. Several runs of base algorithms are used to generate clusterings, and a co-association matrix over pairs of points is obtained using a configurable majority criterion. Using this consensus as a similarity measure, we generate a clustering with four algorithms. Synthetic as well as real-world datasets are used in the experiments, and the results obtained are compared using various internal and external validity measures. Results on real-world datasets showed a marked improvement over those obtained by other researchers with the same datasets.
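    The co-association step is the mechanical core of this pipeline: entry (i, j) of the matrix is the fraction of base runs that placed points i and j in the same cluster, and a configurable threshold turns that into a majority-vote consensus. A minimal sketch, with label arrays and the 0.5 threshold as assumed inputs rather than the paper's exact configuration:

```python
import numpy as np

def co_association(labelings):
    """Co-association matrix from several base clusterings.
    labelings: array-like of shape (n_runs, n_points) of cluster labels.
    Entry (i, j) is the fraction of runs that co-cluster points i and j."""
    labelings = np.asarray(labelings)
    n_runs, n_points = labelings.shape
    co = np.zeros((n_points, n_points))
    for run in labelings:
        co += (run[:, None] == run[None, :])   # 1 where the pair co-clusters
    return co / n_runs

def consensus_links(co, threshold=0.5):
    """Majority criterion: keep a link only if the pair co-clustered in more
    than `threshold` of the runs; the result feeds the final clustering as
    a similarity (here shown as a boolean link matrix)."""
    return co > threshold

# Three base runs over 4 points; points 0 and 1 always co-cluster.
runs = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
co = co_association(runs)
print(co[0, 1])            # 1.0 -> strong consensus link
print(consensus_links(co))
```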