
    Query-Driven Sampling for Collective Entity Resolution

    Probabilistic databases play a preeminent role in the processing and management of uncertain data. Recently, many database research efforts have integrated probabilistic models into databases to support tasks such as information extraction and labeling. Many of these efforts rely on batch-oriented inference, which inhibits a real-time workflow. One important task is entity resolution (ER), the process of determining which records (mentions) in a database correspond to the same real-world entity. Traditional pairwise ER methods can lead to inconsistencies and low accuracy because they make localized decisions. Leading ER systems solve this problem by collectively resolving all records using a probabilistic graphical model and Markov chain Monte Carlo (MCMC) inference. For large datasets, however, this is an extremely expensive process. A key observation is that such an exhaustive ER process incurs a huge up-front cost, which is wasteful in practice because most users are interested in only a small subset of entities. In this paper, we advocate pay-as-you-go entity resolution by developing a number of query-driven collective ER techniques. We introduce two classes of SQL queries that involve ER operators: selection-driven ER and join-driven ER. We implement novel variations of the MCMC Metropolis-Hastings algorithm to generate biased samples, and selectivity-based scheduling algorithms to support the two classes of ER queries. Finally, we show that query-driven ER algorithms can converge and return results within minutes over a database populated with extractions from a newswire dataset containing 71 million mentions.
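    To make the idea of query-biased sampling concrete, here is a minimal sketch of a Metropolis-Hastings sampler over entity assignments whose proposals favor mentions touched by the user's query. The data layout, the `similarity` scoring function, and the `bias` parameter are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def mh_er_sample(mentions, similarity, query_ids, steps=10000, temp=1.0, bias=0.9):
    """Toy query-driven Metropolis-Hastings sampler for collective ER.

    mentions: list of mention ids; similarity(a, b) -> float in [0, 1];
    query_ids: mentions the user's query touches (proposals favor these).
    Returns a clustering as a dict mapping mention -> entity label.
    """
    entity = {m: m for m in mentions}           # start: every mention is its own entity

    def score(m, e):
        # Compatibility of mention m with entity e (sum of pairwise similarities).
        members = [x for x in mentions if entity[x] == e and x != m]
        return sum(similarity(m, x) for x in members)

    for _ in range(steps):
        # Biased proposal: with probability `bias`, mutate a query-relevant mention,
        # so sampling effort concentrates on the part of the model the query needs.
        pool = query_ids if random.random() < bias else mentions
        m = random.choice(pool)
        old_e = entity[m]
        new_e = entity[random.choice(mentions)]  # propose moving m to another entity
        delta = score(m, new_e) - score(m, old_e)
        # Metropolis acceptance: always accept improvements, sometimes accept worse moves.
        if delta >= 0 or random.random() < math.exp(delta / temp):
            entity[m] = new_e
    return entity
```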

    String Indexing for Top-k Close Consecutive Occurrences

    The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i, j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$; its distance is defined as $j - i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give two time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\epsilon \in (0, 1]$. Our first result achieves $O(n \log n)$ space and optimal query time $O(m + k)$; our second achieves linear space and query time $O(m + k^{1+\epsilon})$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.
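    For intuition about the problem definition (not the paper's data structure), a naive, non-indexed reference computation is easy to write: find all occurrences of $P$ in $S$, pair up adjacent ones, and keep the $k$ pairs of smallest distance. This rescans $S$ on every query, so it is nowhere near the paper's query bounds.

```python
import heapq

def top_k_close_consecutive(S, P, k):
    """Naive reference for SITCCO: report the top-k consecutive occurrences
    of P in S with minimal distance j - i. An index avoids rescanning S."""
    # Collect all (possibly overlapping) occurrence positions, left to right.
    occ, i = [], S.find(P)
    while i != -1:
        occ.append(i)
        i = S.find(P, i + 1)
    # Consecutive occurrences are adjacent positions in occ.
    pairs = [(occ[t + 1] - occ[t], (occ[t], occ[t + 1]))
             for t in range(len(occ) - 1)]
    return [p for _, p in heapq.nsmallest(k, pairs)]

# Example: occurrences of "ab" in "abxababab" are at 0, 3, 5, 7;
# the two closest consecutive pairs are (3, 5) and (5, 7).
print(top_k_close_consecutive("abxababab", "ab", 2))  # [(3, 5), (5, 7)]
```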

    Image Characterization and Classification by Physical Complexity

    We present a method for estimating the complexity of an image based on Bennett's concept of logical depth. Bennett identified logical depth as the appropriate measure of organized complexity, and hence as better suited to evaluating the complexity of objects in the physical world. Its use yields a different, and in some sense finer, characterization than is obtained through the application of Kolmogorov complexity alone. We use this measure to classify images by their information content. The method provides a means of classifying and evaluating the complexity of objects by way of their visual representations. To the authors' knowledge, the method and application inspired by the concept of logical depth presented herein are proposed and implemented for the first time.
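    The practical recipe behind such estimates is that Kolmogorov complexity is approximated by compressed size, while logical depth is approximated by the time needed to decompress the near-minimal representation. The sketch below uses zlib as a stand-in compressor and a simple averaged timer; it illustrates the idea only and is not the authors' exact experimental setup.

```python
import time
import zlib

def kolmogorov_estimate(data: bytes) -> int:
    """Proxy for Kolmogorov complexity: size of the compressed representation."""
    return len(zlib.compress(data, level=9))

def logical_depth_estimate(data: bytes, repeats: int = 50) -> float:
    """Proxy for Bennett's logical depth: average decompression time of the
    compressed representation, which stands in for the running time of a
    near-minimal program that reproduces the data."""
    compressed = zlib.compress(data, level=9)
    start = time.perf_counter()
    for _ in range(repeats):
        zlib.decompress(compressed)
    return (time.perf_counter() - start) / repeats

# Random data and highly organized data can have similar compressed sizes
# yet differ in decompression time; that gap is what logical depth captures.
```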

    An examination of heuristic algorithms for the travelling salesman problem

    The role of heuristics in combinatorial optimization is discussed. Published heuristics for the Travelling Salesman Problem (TSP) were reviewed, and morphological boxes were used to develop new heuristics for the TSP. New and published heuristics were programmed for symmetric TSPs in which the triangle inequality holds, and were tested on a microcomputer. The best of the quickest heuristics was the furthest insertion heuristic, finding tours 3 to 9% above the best known solutions (2 minutes for 100 nodes). Better results were found by longer-running heuristics, e.g. the cheapest angle heuristic (CCAO), 0 to 6% above best (80 minutes for 100 nodes). The savings heuristic found the best results overall, but took more than 2 hours to complete. Of the new heuristics, the MST path algorithm at times improved on the results of the furthest insertion heuristic while taking the same time as the CCAO. The study indicated that there is little likelihood of improving on present methods unless a fundamentally new approach is discovered. Finally, a case study using TSP heuristics to aid the planning of grid surveys is described.
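    Since the furthest insertion heuristic is the study's best quick method, a short sketch of it may help: repeatedly pick the unvisited city furthest from the current tour and insert it where it lengthens the tour the least. This is a standard Euclidean formulation, not the study's original microcomputer code.

```python
import math

def furthest_insertion(points):
    """Furthest insertion heuristic for the symmetric Euclidean TSP.
    points: list of (x, y) tuples. Returns a tour as a list of indices."""
    def d(a, b):
        return math.dist(points[a], points[b])

    n = len(points)
    # Start with the two mutually furthest cities.
    a, b = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: d(*p))
    tour, remaining = [a, b], set(range(n)) - {a, b}
    while remaining:
        # Pick the remaining city furthest from the current tour.
        c = max(remaining, key=lambda r: min(d(r, t) for t in tour))
        # Insert it at the position of cheapest added length (cyclic tour).
        pos = min(range(len(tour)),
                  key=lambda i: d(tour[i], c) + d(c, tour[(i + 1) % len(tour)])
                              - d(tour[i], tour[(i + 1) % len(tour)]))
        tour.insert(pos + 1, c)
        remaining.remove(c)
    return tour

# Example: a small instance in the plane.
print(furthest_insertion([(0, 0), (1, 0), (1, 1), (0, 1), (2, 2)]))
```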

    10091 Abstracts Collection -- Data Structures

    From February 28th to March 5th, 2010, Dagstuhl Seminar 10091 "Data Structures" was held at Schloss Dagstuhl - Leibniz Center for Informatics. It brought together 45 international researchers to discuss recent developments concerning data structures, both in terms of research and in terms of new technologies that affect how data can be stored, updated, and retrieved. During the seminar, a fair number of participants presented their current research, and open problems were discussed. This document first briefly describes the seminar topics and then gives the abstracts of the presentations given during the seminar.

    Finding a Cluster in Incomplete Data

    We study two variants of the fundamental problem of finding a cluster in incomplete data. In the problems under consideration, we are given a multiset of incomplete d-dimensional vectors over the binary domain and integers k and r, and the goal is to complete the missing vector entries so that the multiset of complete vectors contains either (i) a cluster of k vectors of radius at most r, or (ii) a cluster of k vectors of diameter at most r. We give tight characterizations of the parameterized complexity of both problems with respect to the parameters k, r, and a third parameter that captures the missing vector entries.
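    To make the radius variant concrete, here is a brute-force checker under simplifying assumptions of my own: missing entries are `None`, a missing coordinate can always be completed to agree with the center, and candidate centers are restricted to completions of the chosen vectors. This is only a sanity-check sketch, not the paper's parameterized algorithms.

```python
from itertools import combinations

def min_dist(u, v):
    """Hamming distance over {0, 1, None}, counting only coordinates where
    both entries are known: a None can always be completed to agree."""
    return sum(1 for a, b in zip(u, v)
               if a is not None and b is not None and a != b)

def has_radius_cluster(vectors, k, r):
    """Brute force: do some k vectors admit completions lying within Hamming
    radius r of a common binary center? Centers are drawn from completions
    of the chosen vectors themselves (a simplification)."""
    for subset in combinations(vectors, k):
        for c in subset:
            # Candidate center: c with its missing entries set to 0.
            center = [x if x is not None else 0 for x in c]
            if all(min_dist(v, center) <= r for v in subset):
                return True
    return False

# Two vectors with one missing entry each form a radius-1 cluster.
print(has_radius_cluster([[0, None, 1], [0, 0, None], [1, 1, 1]], k=2, r=1))
```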

    Convex hull ranking algorithm for multi-objective evolutionary algorithms

    Due to the many applications of multi-objective evolutionary algorithms to real-world optimization problems, several studies have been conducted to improve these algorithms in recent years. Since most multi-objective evolutionary algorithms are based on the non-dominance principle, and their complexity depends on finding non-dominated fronts, this paper introduces a new method for ranking the solutions of an evolutionary algorithm's population. First, we investigate the relation between the convex hull and non-dominated solutions, and discuss the time complexity of the convex hull and non-dominated sorting problems. Then, we use convex hull concepts to present a new ranking procedure for multi-objective evolutionary algorithms. The proposed algorithm is well suited to convex multi-objective optimization problems. Finally, we apply this method as an alternative ranking procedure to NSGA-II for non-dominated comparisons, and test it on some benchmark problems.
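    One way to picture convex-hull-based ranking, for the bi-objective minimization case, is to peel successive lower convex hull layers of the objective vectors, analogously to how non-dominated sorting peels Pareto fronts. The sketch below is my own illustration of that idea (points as `(f1, f2)` tuples), not the paper's procedure.

```python
def lower_hull(points):
    """Lower convex hull of 2-D objective vectors (both minimized), via the
    monotone chain method; for convex problems this layer contains the
    Pareto-optimal points."""
    pts = sorted(set(points))
    hull = []
    for p in pts:
        # Pop while the last two hull points and p do not make a left turn.
        while len(hull) >= 2 and (
            (hull[-1][0] - hull[-2][0]) * (p[1] - hull[-2][1])
            - (hull[-1][1] - hull[-2][1]) * (p[0] - hull[-2][0]) <= 0
        ):
            hull.pop()
        hull.append(p)
    return hull

def convex_hull_ranking(points):
    """Rank solutions by peeling successive lower-hull layers, a stand-in
    for non-dominated sorting on convex bi-objective problems."""
    remaining, rank, layer = list(points), {}, 0
    while remaining:
        front = set(lower_hull(remaining))
        for p in front:
            rank[p] = layer
        remaining = [p for p in remaining if p not in front]
        layer += 1
    return rank

# (1,5), (2,2), (5,1) land on layer 0; (3,3), (4,4) on layer 1.
print(convex_hull_ranking([(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]))
```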

    Clustering and Validation of Microarray Data Using Consensus Clustering

    Clustering is a popular method for gleaning useful information from microarray data. Unfortunately, the results obtained from common clustering algorithms are not consistent, and even with multiple runs of different algorithms a further validation step is required. Due to the absence of well-defined class labels and an unknown number of clusters, the unsupervised learning problem of finding an optimal clustering is hard. Obtaining a consensus of judiciously obtained clusterings not only provides stable results but also lends a high level of confidence in the quality of the results. Several runs of base algorithms are used to generate clusterings, and a co-association matrix over pairs of points is obtained using a configurable majority criterion. Using this consensus as a similarity measure, we generate a clustering with four algorithms. Synthetic as well as real-world datasets are used in the experiments, and the results obtained are compared using various internal and external validity measures. Results on real-world datasets showed a marked improvement over those obtained by other researchers with the same datasets.
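    The co-association step is the mechanical core of this pipeline: entry (i, j) of the matrix is the fraction of base runs that placed points i and j in the same cluster, and a configurable threshold turns that into a majority-vote consensus. A minimal sketch, with label arrays and the 0.5 threshold as assumed inputs rather than the paper's exact configuration:

```python
import numpy as np

def co_association(labelings):
    """Co-association matrix from several base clusterings.
    labelings: array-like of shape (n_runs, n_points) of cluster labels.
    Entry (i, j) is the fraction of runs that co-cluster points i and j."""
    labelings = np.asarray(labelings)
    n_runs, n_points = labelings.shape
    co = np.zeros((n_points, n_points))
    for run in labelings:
        co += (run[:, None] == run[None, :])   # 1 where the pair co-clusters
    return co / n_runs

def consensus_links(co, threshold=0.5):
    """Majority criterion: keep a link only if the pair co-clustered in more
    than `threshold` of the runs; the result feeds the final clustering as
    a similarity (here shown as a boolean link matrix)."""
    return co > threshold

# Three base runs over 4 points; points 0 and 1 always co-cluster.
runs = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
co = co_association(runs)
print(co[0, 1])            # 1.0 -> strong consensus link
print(consensus_links(co))
```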