11,177 research outputs found

BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction

Author: Ari Eszter
Horváth Arnold
Ittzés Péter
Jakó Éena
Podani János
Publication venue: 'Elsevier BV'
Publication date: 01/01/2009

A novel discrete mathematical approach is proposed as an additional tool for molecular systematics which does not require prior statistical assumptions concerning the evolutionary process. The method is based on algorithms that generate mathematical representations directly from DNA/RNA or protein sequences and then output numerical (scalar or vector) and visual (graph) characteristics. The binary encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to get an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called Boolean analysis, or BOOL-AN.
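The step from binary encoded sequences to a distance matrix can be illustrated with a minimal Python sketch. It does not compute the ICF, which the abstract does not specify. It assumes a hypothetical two-bit code per nucleotide and uses the Hamming distance as a stand-in for the distances the method derives. The names `CODE`, `encode`, `hamming` and `distance_matrix` are illustrative only.

```python
from itertools import combinations

# Assumed two-bit encoding per nucleotide (illustrative, not taken from BOOL-AN).
CODE = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def encode(seq):
    """Turn a DNA string into a flat list of bits using the assumed code."""
    bits = []
    for base in seq.upper():
        bits.extend(CODE[base])
    return bits


def hamming(u, v):
    """Count the positions at which two equal-length bit vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(u, v))


def distance_matrix(seqs):
    """Build a symmetric distance matrix (dict of dicts) from named sequences."""
    names = list(seqs)
    vectors = {name: encode(seqs[name]) for name in names}
    dist = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = hamming(vectors[a], vectors[b])
        dist[a][b] = dist[b][a] = d
    return dist


if __name__ == "__main__":
    example = {"x": "ACGT", "y": "ACGA", "z": "TCGA"}
    for name, row in distance_matrix(example).items():
        print(name, row)
```

A matrix of this shape is what a neighbor-joining or UPGMA routine would take as input.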

Crossref

Repository of the Academy's Library

Optimal-Time Text Indexing in BWT-runs Bounded Space

Author: Gagie Travis
Navarro Gonzalo
Prezza Nicola
Publication date: 11/07/2017

Indexing highly repetitive texts (such as genomic databases, software repositories and versioned text collections) has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits. Within $O(r\log(n/r))$ space, our index can also count in optimal time $O(m)$. Raising the space to $O(r w\log_\sigma(n/r))$, we support count and locate in $O(m\log(\sigma)/w)$ and $O(m\log(\sigma)/w+occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r)+\ell\log(\sigma)/w)$. (...continues...)
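To make the measure $r$ concrete, the following Python sketch computes the BWT of a short text by sorting all rotations and then counts its runs of equal symbols. It is a naive illustration under stated assumptions, not the index described in the abstract: the sentinel `$` is assumed to be absent from the text and to sort before every other symbol, and the function names `bwt` and `count_runs` are hypothetical.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler Transform of text via sorted rotations.

    Quadratic-space illustration only; practical indexes build the BWT
    from a suffix array or with dedicated algorithms.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Return the number of maximal runs of equal symbols in s."""
    if not s:
        return 0
    return 1 + sum(1 for prev, cur in zip(s, s[1:]) if prev != cur)


if __name__ == "__main__":
    for text in ("banana", "abababababababab"):
        transformed = bwt(text)
        print(text, transformed, "r =", count_runs(transformed), "n =", len(transformed))
```

For repetitive inputs the run count tends to stay small relative to the text length $n$, which is the property the space bounds above rely on.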

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

Author: A Elmagarmid
G Baudat
G Navarro
G Salton
IP Fellegi
J Bleiholder
L Bertossi
O Benjelloun
P Christen
S Ceri
TM Cover
TN Herzog
V Rastogi
W Fan
Publication date: 24/08/2015

Entity resolution (ER), an important and common data cleaning problem, is about detecting duplicate data representations of the same external entities and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques; (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) the use of the declarative language LogiQL (an extended form of Datalog supported by the LogicBlox platform) for data processing, and for the specification and enforcement of MDs. Comment: To appear in Proc. SUM, 201

arXiv.org e-Print Archive

Crossref

Artificial Intelligence: An introductory course

Author: Bundy Alan
Burstall R.
Weir S.
Young R.
Publication venue: 'Edinburgh University Press'
Publication date: 01/01/1978

Edinburgh Research Explorer

CERN Document Server

Measuring the Propagation of Information in Partial Evaluation

Author: Rohde Henning Korsholm
Publication venue: 'Aarhus University Library'
Publication date: 11/08/2005

We present the first measurement-based analysis of the information propagated by a partial evaluator. Our analysis measures implementations of string-matching algorithms, relying on the observation that the sequence of character comparisons accurately reflects the information maintained. Notably, we can easily prove matchers to be different, and we show that they display more variety and finesse than previously believed. As a consequence, we are able to pinpoint differences and inaccuracies in many results previously considered equivalent. Our analysis includes a framework that lets us obtain string matchers, notably the family of Boyer-Moore algorithms, in a systematic, formalism-independent way from a few information-propagation primitives. By leveraging the existing research in string matching, we show that the landscape of information propagation is non-trivial, in the sense that small changes in information propagation may dramatically change the properties of the resulting string matchers. We thus expect that this work will prove useful as a test and feedback mechanism for information propagation in the development of advanced program transformations, such as GPC or Supercompilation.
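The idea of comparing matchers by their sequence of character comparisons can be sketched in Python. This is not the framework from the work above, and it does not involve a partial evaluator or the Boyer-Moore family. As a simpler stand-in, it instruments a naive matcher and a Knuth-Morris-Pratt matcher so that each records every (text index, pattern index) comparison it performs. All function names are hypothetical, and a non-empty pattern is assumed.

```python
def naive_trace(pattern, text):
    """Naive matcher; returns (match positions, list of comparisons made)."""
    matches, trace = [], []
    for i in range(len(text) - len(pattern) + 1):
        for j in range(len(pattern)):
            trace.append((i + j, j))  # compare text[i + j] with pattern[j]
            if text[i + j] != pattern[j]:
                break
        else:
            matches.append(i)
    return matches, trace


def failure_table(pattern):
    """KMP failure table: longest proper border length of each prefix."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_trace(pattern, text):
    """KMP matcher; returns (match positions, list of comparisons made)."""
    fail = failure_table(pattern)
    matches, trace = [], []
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while True:
            trace.append((i, j))  # compare text[i] with pattern[j]
            if ch == pattern[j]:
                j += 1
                break
            if j == 0:
                break
            j = fail[j - 1]  # reuse what is already known about the text
        if j == len(pattern):
            matches.append(i - j + 1)
            j = fail[j - 1]
    return matches, trace


if __name__ == "__main__":
    for matcher in (naive_trace, kmp_trace):
        found, comparisons = matcher("aab", "aaaab")
        print(matcher.__name__, found, len(comparisons), comparisons)
```

Both matchers report the same occurrences, but their traces differ, and that difference is what shows how much information each one carries from one comparison to the next.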

Tidsskrift.dk (Det Kongelige Bibliotek)