SEED: efficient clustering of next-generation sequences.
Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem for studying the population sizes of DNA/RNA molecules and for reducing the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it reduced the time and memory requirements of the Velvet/Oases assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. Availability: The SEED software can be downloaded for free from http://manuals.bioinformatics.ucr.edu/home/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
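The hashing idea behind this style of clustering can be illustrated with a toy sketch: reads are bucketed by the characters at the unmasked positions of a spaced-seed pattern, one read per bucket is taken as a "virtual center", and reads within a mismatch budget of the center join its cluster. The pattern, function names, and center choice below are all invented for illustration; SEED's actual block spaced seeds and center selection are more involved.

```python
from collections import defaultdict

def spaced_seed_key(read, pattern):
    # Keep only the characters at positions where the pattern has a '1'
    # (a "spaced seed"), so reads differing at masked positions collide.
    return "".join(b for b, p in zip(read, pattern) if p == "1")

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(reads, pattern="110110110110", max_mismatches=3):
    buckets = defaultdict(list)
    for r in reads:
        buckets[spaced_seed_key(r, pattern)].append(r)
    clusters = []
    for bucket in buckets.values():
        center = bucket[0]  # toy choice of a "virtual center"
        members = [r for r in bucket if hamming(r, center) <= max_mismatches]
        clusters.append((center, members))
    return clusters
```

Note that two reads only share a bucket when their mismatches fall on masked ('0') positions of the pattern, which is exactly why spaced seeds tolerate scattered mismatches better than contiguous k-mers.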
Random Sampling in Computational Algebra: Helly Numbers and Violator Spaces
This paper transfers a randomized algorithm, originally used in geometric
optimization, to computational problems in commutative algebra. We show that
Clarkson's sampling algorithm can be applied to two problems in computational
algebra: solving large-scale polynomial systems and finding small generating
sets of graded ideals. The cornerstone of our work is showing that the theory
of violator spaces of Gärtner et al. applies to polynomial ideal problems.
To show this, one utilizes a Helly-type result for algebraic varieties. The
resulting algorithms have expected runtime linear in the number of input
polynomials, making the ideas interesting for handling systems with very large
numbers of polynomials, but whose rank in the vector space of polynomials is
small (e.g., when the number of variables and the degree are constant).
Comment: Minor edits, added two references; results unchanged.
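The flavor of Clarkson-style sampling can be shown on a toy LP-type problem: the smallest interval covering a set of numbers, whose basis (combinatorial dimension) is just the min and max. The sketch solves on a small random sample, collects violated constraints, and iterates. This is a simplification, with all names invented for illustration; Clarkson's actual algorithm re-samples with multiplicative weights rather than accumulating all violators.

```python
import random

def basis_of(sample):
    # Basis for the toy problem "smallest interval covering the inputs":
    # the min and max of the sample (combinatorial dimension 2).
    return (min(sample), max(sample))

def violators(points, basis):
    lo, hi = basis
    return [p for p in points if p < lo or p > hi]

def clarkson_sample(points, sample_size=6, seed=0):
    # Clarkson-style iteration: solve on a small random sample, then add
    # the violated constraints and repeat until nothing is violated.
    rng = random.Random(seed)
    working = rng.sample(points, min(sample_size, len(points)))
    while True:
        basis = basis_of(working)
        v = violators(points, basis)
        if not v:
            return basis
        working += v
```

The appeal in the polynomial setting is the same as in the geometric one: each round touches only a small working set, so the expected work is linear in the number of input constraints (here, points; in the paper, polynomials).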
Cayley graphs of order kp are hamiltonian for k < 48
We provide a computer-assisted proof that if G is any finite group of order
kp, where k < 48 and p is prime, then every connected Cayley graph on G is
hamiltonian (unless kp = 2). As part of the proof, it is verified that every
connected Cayley graph of order less than 48 is either hamiltonian connected or
hamiltonian laceable (or has valence less than three).
Comment: 16 pages. GAP source code is available in the ancillary file.
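A hedged sketch of the elementary check underlying such computer-assisted results: build a Cayley graph (here only for cyclic groups Z_n, for brevity) and search for a Hamiltonian cycle by backtracking. The real proof works in GAP over all groups of the relevant orders with far more case analysis; this toy only illustrates the objects involved.

```python
from itertools import product

def cayley_graph(n, gens):
    # Cayley graph of the cyclic group Z_n with connection set gens and
    # their inverses (so the graph is undirected).
    adj = {v: set() for v in range(n)}
    for v, g in product(range(n), gens):
        adj[v].add((v + g) % n)
        adj[v].add((v - g) % n)
    return adj

def has_hamiltonian_cycle(adj):
    # Backtracking search for a cycle visiting every vertex exactly once.
    n = len(adj)
    start = 0
    path, used = [start], {start}
    def extend():
        if len(path) == n:
            return start in adj[path[-1]]  # close the cycle
        for w in adj[path[-1]]:
            if w not in used:
                used.add(w)
                path.append(w)
                if extend():
                    return True
                path.pop()
                used.discard(w)
        return False
    return extend()
```

For example, Z_6 with generator 1 is the 6-cycle (hamiltonian), while generator 2 alone gives a disconnected Cayley graph, which of course has no Hamiltonian cycle.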
Finding Simple Shortest Paths and Cycles
The problem of finding multiple simple shortest paths in a weighted directed
graph has many applications, and is considerably more difficult than
the corresponding problem when cycles are allowed in the paths. Even for a
single source-sink pair, it is known that two simple shortest paths cannot be
found in time polynomially smaller than n^3 (where n is the number of vertices)
unless the All-Pairs Shortest Paths problem can be solved in a similar time bound. The
latter is a well-known open problem in algorithm design. We consider the
all-pairs version of the problem, and we give a new algorithm to find k
simple shortest paths for all pairs of vertices. For k = 2, our algorithm runs
in O(mn + n^2 log n) time (where m is the number of edges), which is almost the same bound as
for the single-pair case, and for k > 2 we improve earlier bounds. Our approach
is based on forming suitable path extensions to find simple shortest paths;
this method is different from the `detour finding' technique used in most of
the prior work on simple shortest paths, replacement paths, and distance
sensitivity oracles.
Enumerating simple cycles is a well-studied classical problem. We present new
algorithms for generating simple cycles and simple paths in a weighted directed graph in
non-decreasing order of their weights; the algorithm for generating simple
paths is much faster, and uses another variant of path extensions. We also give
hardness results for sparse graphs, relative to the complexity of computing a
minimum weight cycle in a graph, for several variants of problems related to
finding simple paths and cycles.
Comment: The current version includes new results for undirected graphs. In Section 4, the notion of an (m,n) reduction is generalized to an f(m,n) reduction.
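The idea of ordering paths by extensions can be illustrated with a simple best-first enumeration of simple s-t paths: partial simple paths sit in a priority queue and are extended one edge at a time, so with non-negative weights complete paths pop off in non-decreasing weight order. This exponential-worst-case sketch only illustrates extension-based ordering; it is not the paper's algorithm, and all names are invented.

```python
import heapq

def simple_paths_by_weight(graph, s, t):
    # graph: dict mapping vertex -> list of (neighbor, weight) pairs.
    # With non-negative weights, extending a path never decreases its
    # weight, so s-t paths are yielded sorted by total weight.
    queue = [(0, [s])]
    while queue:
        w, path = heapq.heappop(queue)
        if path[-1] == t:
            yield w, path
            continue
        for v, wt in graph.get(path[-1], []):
            if v not in path:  # keep the path simple
                heapq.heappush(queue, (w + wt, path + [v]))
```

For instance, on a four-vertex graph the generator yields the weight-3 path before the weight-5 and weight-6 alternatives, without ever revisiting a vertex within one path.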
Admissibility in Finitely Generated Quasivarieties
Checking the admissibility of quasiequations in a finitely generated (i.e.,
generated by a finite set of finite algebras) quasivariety Q amounts to
checking validity in a suitable finite free algebra of the quasivariety, and is
therefore decidable. However, since free algebras may be large even for small
sets of small algebras and very few generators, this naive method for checking
admissibility in Q is not computationally feasible. In this paper,
algorithms are introduced that generate a minimal (with respect to a multiset
well-ordering on their cardinalities) finite set of algebras such that the
validity of a quasiequation in this set corresponds to admissibility of the
quasiequation in Q. In particular, structural completeness (validity and
admissibility coincide) and almost structural completeness (validity and
admissibility coincide for quasiequations with unifiable premises) can be
checked. The algorithms are illustrated with a selection of well-known finitely
generated quasivarieties, and are adapted to also handle admissibility of rules in
finite-valued logics.
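The primitive that the generated sets of algebras reduce everything to, checking validity of a quasiequation in a finite algebra, can be done by brute force over all valuations. A minimal sketch, using the two-element Boolean lattice as the example algebra; the function name and term encoding (pairs of Python callables for the two sides of each equation) are invented for illustration.

```python
from itertools import product

def valid_in(size, nvars, premises, conclusion):
    # A quasiequation (premises => conclusion) is valid in a finite algebra
    # of the given size iff every valuation of the variables that satisfies
    # all premise equations also satisfies the conclusion equation.
    for val in product(range(size), repeat=nvars):
        if all(lhs(*val) == rhs(*val) for lhs, rhs in premises):
            lc, rc = conclusion
            if lc(*val) != rc(*val):
                return False
    return True
```

In the two-element lattice, the quasiequation "x ∧ y = x implies x ∨ y = y" is valid, while "x ∧ y = x implies x ∨ y = x" fails at the valuation x = 0, y = 1, which is exactly the kind of countermodel the brute-force loop finds.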
A method for dense packing discovery
The problem of packing a system of particles as densely as possible is
foundational in the field of discrete geometry and is a powerful model in the
material and biological sciences. As packing problems retreat from the reach of
solution by analytic constructions, the importance of an efficient numerical
method for conducting \textit{de novo} (from-scratch) searches for dense
packings becomes crucial. In this paper, we use the \textit{divide and concur}
framework to develop a general search method for the solution of periodic
constraint problems, and we apply it to the discovery of dense periodic
packings. An important feature of the method is the integration of the unit
cell parameters with the other packing variables in the definition of the
configuration space. The method we present led to improvements in the
densest-known tetrahedron packing which are reported in [arXiv:0910.5226].
Here, we use the method to reproduce the densest known lattice sphere packings
and the best known lattice kissing arrangements in up to 14 and 11 dimensions
respectively (the first such numerical evidence for their optimality in some of
these dimensions). For non-spherical particles, we report a new dense packing
of regular four-dimensional simplices, with a structure similar to that of the
densest known tetrahedron packing.
Comment: 15 pages, 5 figures
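The two elementary projections behind divide and concur can be sketched for disk centres: the "divide" step enforces one pairwise non-overlap constraint on its own replica of the variables, and the "concur" step averages the replicas back into a single configuration. This toy omits the periodic unit cell, the variable cell parameters, and the difference-map iteration that the actual search uses; the function names are invented.

```python
import math

def divide_project(p, q, min_dist):
    # "Divide" projection for one pair constraint |p - q| >= min_dist:
    # move both centres symmetrically along their separation direction
    # until they are exactly min_dist apart (a nearest-point projection).
    dx, dy = q[0] - p[0], q[1] - p[1]
    d = math.hypot(dx, dy)
    if d >= min_dist or d == 0:
        return p, q
    push = (min_dist - d) / 2
    ux, uy = dx / d, dy / d
    return ((p[0] - push * ux, p[1] - push * uy),
            (q[0] + push * ux, q[1] + push * uy))

def concur(replicas):
    # "Concur" projection: all replicas of one variable are replaced by
    # their average, restoring agreement across constraints.
    n = len(replicas)
    return (sum(r[0] for r in replicas) / n,
            sum(r[1] for r in replicas) / n)
```

Because both steps are projections onto constraint sets, they can be composed inside a difference-map style iteration, which is what makes the framework usable as a general search method for periodic constraint problems.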
A Novel Approach for Ellipsoidal Outer-Approximation of the Intersection Region of Ellipses in the Plane
In this paper, a novel technique for tight outer-approximation of the
intersection region of a finite number of ellipses in 2-dimensional (2D) space
is proposed. First, the vertices of a tight polygon that contains the convex
intersection of the ellipses are found in an efficient manner. To do so, the
intersection points of the ellipses that fall on the boundary of the
intersection region are determined, and a set of points is generated on the
elliptic arcs connecting every two neighbouring intersection points. By finding
the tangent lines to the ellipses at the extended set of points, a set of
half-planes is obtained, whose intersection forms a polygon. To find the
polygon more efficiently, the points are given an order and the intersection of
the half-planes corresponding to every two neighbouring points is calculated.
If the polygon is convex and bounded, these calculated points together with the
initially obtained intersection points will form its vertices. If the polygon
is non-convex or unbounded, we can detect this situation and then generate
additional discrete points only on the elliptical arc segment causing the
issue, and restart the algorithm to obtain a bounded and convex polygon.
Finally, the smallest-area ellipse that contains the vertices of the polygon is
obtained by solving a convex optimization problem. Through numerical
experiments, it is illustrated that the proposed technique returns a tighter
outer-approximation of the intersection of multiple ellipses than
conventional techniques, at only slightly higher computational cost.
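The tangent half-plane construction is easy to make concrete for an axis-aligned ellipse: at the boundary point (a cos t, b sin t) the tangent line is x·x0/a² + y·y0/b² = 1, the whole ellipse lies on one side of it, and polygon vertices arise by intersecting tangent lines at neighbouring points. A minimal sketch under those assumptions (general rotated ellipses and the final minimum-area covering ellipse are omitted):

```python
import math

def tangent_halfplane(a, b, theta):
    # Tangent to the axis-aligned ellipse (x/a)^2 + (y/b)^2 = 1 at the
    # boundary point (a*cos(theta), b*sin(theta)): the line
    # x*x0/a^2 + y*y0/b^2 = 1.  Returned as coefficients (cx, cy, rhs);
    # the ellipse lies where cx*x + cy*y <= rhs.
    x0, y0 = a * math.cos(theta), b * math.sin(theta)
    return (x0 / a**2, y0 / b**2, 1.0)

def inside(halfplane, pt, tol=1e-9):
    cx, cy, rhs = halfplane
    return cx * pt[0] + cy * pt[1] <= rhs + tol

def line_intersection(h1, h2):
    # A vertex of the outer polygon: the intersection of two tangent
    # lines, via Cramer's rule (assumes the lines are not parallel).
    a1, b1, c1 = h1
    a2, b2, c2 = h2
    det = a1 * b2 - a2 * b1
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)
```

For the ellipse with a = 2, b = 1, the tangents at t = 0 and t = π/2 meet at (2, 1), a corner of the tight bounding box, and the ellipse centre satisfies both half-planes, matching the containment property the paper relies on.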