1,728 research outputs found

Tree Compression with Top Trees Revisited

Author: F Wang
G Busatto
JI Munro
M Charikar
M Hirakawa
M Lohrey
NJ Larsson
P Ferragina
PJ Downey
S Alstrup
S Gog
S Maneth
S Maruyama
Publication date: 01/01/2015

We revisit tree compression with top trees (Bille et al., ICALP'13) and present several improvements to the compressor and its analysis. By significantly reducing the amount of information stored and guiding the compression step using a RePair-inspired heuristic, we obtain a fast compressor achieving good compression ratios, addressing an open problem posed by Bille et al. We show how, with relatively small overhead, the compressed file can be converted into an in-memory representation that supports basic navigation operations in worst-case logarithmic time without decompression. We also show a much improved worst-case bound on the size of the output of top-tree compression (answering an open question posed in a talk on this algorithm by Weimann in 2012). Comment: SEA 201

arXiv.org e-Print Archive

Crossref

KITopen

Leicester Research Archive

A data science approach to pattern discovery in complex structures with applications in bioinformatics

Author: Hua Lei
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/2016

Pattern discovery aims to find interesting, non-trivial, implicit, previously unknown and potentially useful patterns in data. This dissertation presents a data science approach for discovering patterns or motifs from complex structures, particularly complex RNA structures. RNA secondary and tertiary structure motifs are very important in biological molecules, which play multiple vital roles in cells. A lot of work has been done on RNA motif annotation. However, pattern discovery in RNA structure is less studied. In the first part of this dissertation, an ab initio algorithm, named DiscoverR, is introduced for pattern discovery in RNA secondary structures. This algorithm works by representing RNA secondary structures as ordered labeled trees and performs tree pattern discovery using a quadratic time dynamic programming algorithm. The algorithm is able to identify and extract the largest common substructures from two RNA molecules of different sizes, without prior knowledge of locations and topologies of these substructures. One application of DiscoverR is to locate the RNA structural elements in genomes. Experimental results show that this tool complements the currently used approaches for mining conserved structural RNAs in the human genome. DiscoverR can also be extended to find repeated regions in an RNA secondary structure. Specifically, this extended method is used to detect structural repeats in the 3'-untranslated region of a protein kinase gene
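The abstract above mentions representing RNA secondary structures as ordered labeled trees. The sketch below shows one way such a tree might be built. It assumes the structure is given in dot-bracket notation, which the abstract does not state, and the labels `"P"` (base pair), `"U"` (unpaired base), and `"root"`, as well as the function names, are hypothetical. It does not reproduce DiscoverR or its dynamic programming step.

```python
def dot_bracket_to_tree(structure):
    """Build an ordered labeled tree from a dot-bracket string.

    Each node is a (label, children) tuple. The encoding is an assumption
    made for illustration: "P" = base pair, "U" = unpaired base.
    """
    root = ("root", [])
    stack = [root]                      # path from the root to the open pair
    for ch in structure:
        if ch == "(":                   # opening a base pair: new inner node
            node = ("P", [])
            stack[-1][1].append(node)
            stack.append(node)
        elif ch == ")":                 # closing the innermost open pair
            if len(stack) == 1:
                raise ValueError("unbalanced ')' in structure")
            stack.pop()
        elif ch == ".":                 # unpaired base: leaf node
            stack[-1][1].append(("U", []))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced '(' in structure")
    return root


def tree_size(node):
    """Count the nodes of a (label, children) tree, including the root."""
    return 1 + sum(tree_size(child) for child in node[1])
```

Child order follows the left-to-right order of the input string, which is what makes the tree ordered.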

Digital Commons @ New Jersey Institute of Technology (NJIT)

Prospects and limitations of full-text index structures in genome analysis

Author: Dawyndt Peter
De Baets Bernard
Fack Veerle
Vyverman Michaël
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2012

The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

Ghent University Academic Bibliography

PubMed Central

Tree-Independent Dual-Tree Algorithms

Author: Anderson David V.
Curtin Ryan R.
Gray Alexander G.
Isbell Jr Charles L.
March William B.
Ram Parikshit
Publication date: 16/04/2013

Dual-tree algorithms are a widely used class of branch-and-bound algorithms. Unfortunately, developing dual-tree algorithms for use with different trees and problems is often complex and burdensome. We introduce a four-part logical split: the tree, the traversal, the point-to-point base case, and the pruning rule. We provide a meta-algorithm which allows development of dual-tree algorithms in a tree-independent manner and easy extension to entirely new types of trees. Representations are provided for five common algorithms; for k-nearest neighbor search, this leads to a novel, tighter pruning bound. The meta-algorithm also allows straightforward extensions to massively parallel settings. Comment: accepted in ICML 201
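The four-part split named in the abstract (tree, traversal, base case, pruning rule) can be illustrated with a small sketch for 1-D nearest-neighbor search. Everything here is an assumption made for illustration: the `Node` class, `min_dist`, `dual_tree`, and `nearest_neighbors` are hypothetical names, the tree is a plain binary split of sorted points, and the pruning rule is a simple bound rather than the tighter bound the paper reports. Both point sets are assumed non-empty.

```python
import math


class Node:
    """Tree: a binary split over sorted 1-D points (assumed non-empty)."""

    def __init__(self, points, leaf_size=2):
        self.points = points                    # sorted points under this node
        self.lo, self.hi = points[0], points[-1]
        self.children = []
        if len(points) > leaf_size:
            mid = len(points) // 2
            self.children = [Node(points[:mid], leaf_size),
                             Node(points[mid:], leaf_size)]


def min_dist(a, b):
    """Smallest possible distance between a point in a and a point in b."""
    return max(0.0, a.lo - b.hi, b.lo - a.hi)


def dual_tree(q, r, base_case, prune):
    """Traversal: depth-first over pairs of (query node, reference node)."""
    if prune(q, r):
        return
    if not q.children and not r.children:
        for x in q.points:
            for y in r.points:
                base_case(x, y)
        return
    for qc in (q.children or [q]):
        for rc in (r.children or [r]):
            dual_tree(qc, rc, base_case, prune)


def nearest_neighbors(queries, references):
    best = {x: math.inf for x in queries}

    def base_case(x, y):                        # point-to-point base case
        best[x] = min(best[x], abs(x - y))

    def prune(q, r):                            # pruning rule
        # No point in r can improve any query in q.
        return min_dist(q, r) > max(best[x] for x in q.points)

    dual_tree(Node(sorted(queries)), Node(sorted(references)),
              base_case, prune)
    return best
```

Because `dual_tree` only touches the tree through `children`, `points`, and the two callbacks, a different tree type or a different problem could in principle be swapped in by replacing one of the four parts.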

arXiv.org e-Print Archive

CiteSeerX

Minimal Suffix and Rotation of a Substring in Optimal Time

Author: Kociumaka Tomasz
Publication date: 01/01/2016

For a text given in advance, the substring minimal suffix queries ask to determine the lexicographically minimal non-empty suffix of a substring specified by the location of its occurrence in the text. We develop a data structure answering such queries optimally: in constant time after linear-time preprocessing. This improves upon the results of Babenko et al. (CPM 2014), whose trade-off solution is characterized by $\Theta(n\log n)$ product of these time complexities. Next, we extend our queries to support concatenations of $O(1)$ substrings, for which the construction and query time is preserved. We apply these generalized queries to compute lexicographically minimal and maximal rotations of a given substring in constant time after linear-time preprocessing. Our data structures mainly rely on properties of Lyndon words and Lyndon factorizations. We combine them with further algorithmic and combinatorial tools, such as fusion trees and the notion of order isomorphism of strings
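To make the query semantics in the abstract above concrete, here is a minimal baseline sketch. It is not the constant-time data structure the abstract describes: each query recomputes a Lyndon factorization with Duval's algorithm, relying on the standard fact that the last Lyndon factor of a string is its lexicographically minimal non-empty suffix. The function names and the half-open `[i, j)` indexing are assumptions made for illustration.

```python
def lyndon_factorization(s):
    """Duval's algorithm: factor s into a non-increasing list of Lyndon words."""
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:                       # emit repeated copies of one factor
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def minimal_suffix(text, i, j):
    """Minimal non-empty suffix of text[i:j] (assumes i < j), linear per query."""
    return lyndon_factorization(text[i:j])[-1]


def minimal_rotation(text, i, j):
    """Minimal rotation of text[i:j], by brute force over all rotations."""
    sub = text[i:j]
    return min(sub[k:] + sub[:k] for k in range(len(sub)))
```

For example, `lyndon_factorization("banana")` yields the factors `b`, `an`, `an`, `a`, so the minimal suffix of the whole string is `a`.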

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries

Author: C Hohlweg
CSJA Nash-Williams
D Kosolobov
GS Brodal
H Barcelo
J Fischer
M Crochemore
M Giraud
SJ Puglisi
W Rytter
Publication date: 01/01/2016

Longest common extension queries (LCE queries) and runs are ubiquitous in algorithmic stringology. Linear-time algorithms computing runs and preprocessing for constant-time LCE queries have been known for over a decade. However, these algorithms assume a linearly-sortable integer alphabet. A recent breakthrough paper by Bannai et al. (SODA 2015) showed a link between the two notions: all the runs in a string can be computed via a linear number of LCE queries. The first to consider these problems over a general ordered alphabet was Kosolobov (Inf. Process. Lett., 2016), who presented an $O(n (\log n)^{2/3})$-time algorithm for answering $O(n)$ LCE queries. This result was improved by Gawrychowski et al. (accepted to CPM 2016) to $O(n \log \log n)$ time. In this work we note a special non-crossing property of LCE queries asked in the runs computation. We show that any $n$ such non-crossing queries can be answered on-line in $O(n \alpha(n))$ time, which yields an $O(n \alpha(n))$-time algorithm for computing runs
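As a rough illustration of the two notions in the abstract above, the sketch below defines a naive LCE query by direct character comparison and uses it to enumerate runs by brute force over all periods. This is a slow baseline written only to show what the objects are. It does not use the non-crossing property, and it is unrelated to the near-linear algorithms cited. The function names and the `(start, end, period)` output format with half-open intervals are assumptions.

```python
def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:] (naive scan)."""
    n, k = len(s), 0
    while i + k < n and j + k < n and s[i + k] == s[j + k]:
        k += 1
    return k


def runs(s):
    """All runs of s as sorted (start, end, period) triples, end exclusive.

    A run is a maximal substring whose smallest period p satisfies
    length >= 2 * p. Periods are tried in increasing order, so the first
    period recorded for an interval is its smallest one.
    """
    n, found = len(s), {}
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            length = lce(s, i, i + p)        # how far s[k] == s[k + p] holds
            if length >= p:                  # interval covers at least 2 * p
                found.setdefault((i, i + p + length), p)
            i += length + 1                  # restart just after the mismatch
    return sorted((a, b, p) for (a, b), p in found.items())
```

Skipping ahead by `length + 1` keeps each reported interval left-maximal, since the scan always restarts immediately after a position where the period-`p` match broke.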

arXiv.org e-Print Archive

Crossref

King's Research Portal

Hal-Diderot

HAL - UPEC / UPEM

Efficient Detection of Repeating Sites to Accelerate Phylogenetic Likelihood Calculations

Author: Flouri T
Kobert K
Stamatakis A
Publication date: 24/08/2016

The phylogenetic likelihood function (PLF) is the major computational bottleneck in several applications of evolutionary biology such as phylogenetic inference, species delimitation, model selection, and divergence times estimation. Given the alignment, a tree and the evolutionary model parameters, the likelihood function computes the conditional likelihood vectors for every node of the tree. Vector entries for which all input data are identical result in redundant likelihood operations which, in turn, yield identical conditional values. Such operations can be omitted for improving run-time and, using appropriate data structures, reducing memory usage. We present a fast, novel method for identifying and omitting such redundant operations in phylogenetic likelihood calculations, and assess the performance improvement and memory savings attained by our method. Using empirical and simulated data sets, we show that a prototype implementation of our method yields up to 12-fold speedups and uses up to 78% less memory than one of the fastest and most highly tuned implementations of the PLF currently available. Our method is generic and can seamlessly be integrated into any phylogenetic likelihood implementation
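The abstract above observes that alignment sites with identical input data under a node yield identical conditional likelihood values. The sketch below shows one simple way such repeated sites might be detected. It illustrates the general idea only and is not the authors' method or implementation. The tree encoding (nested 2-tuples with taxon names at the leaves), the alignment dictionary, and the function name `site_classes` are all assumptions.

```python
def site_classes(tree, alignment):
    """Assign each alignment site a class id for the subtree rooted at `tree`.

    Two sites get the same id exactly when every taxon under this node has
    the same character at both sites, so a conditional likelihood entry
    would only need to be computed once per distinct id.

    tree: a taxon name (leaf) or a (left, right) tuple (inner node).
    alignment: dict mapping taxon name to an aligned sequence string.
    """
    table = {}
    if isinstance(tree, str):                       # leaf: class = character
        keys = alignment[tree]
    else:                                           # inner node: class = pair
        left, right = tree
        keys = zip(site_classes(left, alignment),
                   site_classes(right, alignment))
    # setdefault numbers the distinct keys in order of first appearance.
    return [table.setdefault(key, len(table)) for key in keys]
```

A usage example under the same assumptions:

```python
alignment = {"A": "ACAC", "B": "GGGG", "C": "TTAT"}
inner = site_classes(("A", "B"), alignment)         # [0, 1, 0, 1]
root = site_classes((("A", "B"), "C"), alignment)   # [0, 1, 2, 1]
```

Here the inner node has two distinct site classes among four sites, so half of its per-site work would be redundant in this toy case.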

ZENODO

Dryad Digital Repository (Duke University)

UCL Discovery

Electronic Archiving System

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY