721 research outputs found
Prospects and limitations of full-text index structures in genome analysis
The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article provides a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, and their practical limitations are explained and compared.
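As a rough illustration of what a full-text index structure does, the sketch below builds a suffix array, one common member of this family, and uses binary search to locate every occurrence of a pattern. The abstract does not say which structures the article covers in detail, so the choice of a suffix array, the naive construction, and the function names `build_suffix_array` and `find_occurrences` are assumptions made for this example only.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction for illustration: it materialises every suffix while
    sorting, so it is only suitable for short strings, not whole genomes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of pattern in text, in increasing order."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in [first, lo) start with pattern.
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    genome = "GATTACAGATTACA"  # toy sequence, not real data
    sa = build_suffix_array(genome)
    print(find_occurrences(genome, sa, "ATTA"))
```

The index costs one integer per text position, and each query needs a number of string comparisons that is logarithmic in the text length. This is a simple instance of the memory-time trade-off the abstract refers to.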
Polyhedral Combinatorics of UPGMA Cones
Distance-based methods such as UPGMA (Unweighted Pair Group Method with
Arithmetic Mean) continue to play a significant role in phylogenetic research.
We use polyhedral combinatorics to analyze the natural subdivision of the
positive orthant induced by classifying the input vectors according to tree
topologies returned by the algorithm. The partition lattice informs the study
of UPGMA trees. We give a closed form for the extreme rays of UPGMA cones on n
taxa, and compute the normalized volumes of the UPGMA cones for small n.
Keywords: phylogenetic trees, polyhedral combinatorics, partition lattice
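For readers unfamiliar with the algorithm, the following is a minimal sketch of textbook UPGMA clustering, not the authors' code. It takes a symmetric distance matrix and returns the tree topology as nested tuples together with the merge heights. The function name `upgma`, the nested-tuple representation, and the toy distances are assumptions made for illustration.

```python
def upgma(labels, matrix):
    """Cluster taxa with UPGMA.

    labels: list of taxon names.
    matrix: symmetric list-of-lists of pairwise distances (same order as labels).
    Returns (tree, heights): tree is a nested tuple of labels, heights lists
    the height of each merge in the order the merges happened.
    """
    n = len(labels)
    trees = {i: labels[i] for i in range(n)}   # cluster id -> subtree
    sizes = {i: 1 for i in range(n)}           # cluster id -> number of leaves
    # Distances are keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    heights = []

    while len(trees) > 1:
        # Pick the closest pair of clusters (ties: first pair in dict order).
        (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])
        new = next_id
        next_id += 1

        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean of the two old distances.
        for c in trees:
            if c in (a, b):
                continue
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            dist[(c, new)] = (sizes[a] * d_ac + sizes[b] * d_bc) / (sizes[a] + sizes[b])
        del dist[(a, b)]

        trees[new] = (trees.pop(a), trees.pop(b))
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        heights.append(d_ab / 2)

    (root,) = trees.values()
    return root, heights


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D"]
    distances = [
        [0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0],
    ]
    print(upgma(taxa, distances))
```

Each input matrix corresponds to a vector of pairwise distances in the positive orthant. The set of vectors for which a procedure like this returns one fixed topology is the kind of region the abstract calls a UPGMA cone.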
A design space for RDF data representations
RDF triplestores' ability to store and query knowledge bases augmented with semantic annotations has attracted the attention of both research and industry. A multitude of systems offer varying data representation and indexing schemes. However, as recently shown for designing data structures, many design choices are biased by outdated considerations and may not result in the most efficient data representation for a given query workload. To overcome this limitation, we identify a novel three-dimensional design space. Within this design space, we map the trade-offs between different RDF data representations employed as part of an RDF triplestore and identify unexplored solutions. We complement the review with an empirical evaluation of ten standard SPARQL benchmarks to examine the prevalence of these access patterns in synthetic and real query workloads. We find some access patterns to be both prevalent in the workloads and under-supported by existing triplestores. This shows that our model can be used by RDF store designers to reason about different design choices, and that it allows a (possibly artificially intelligent) designer to evaluate the fit between a given system design and a query workload.
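To make the idea of an access pattern concrete, here is a small sketch of one widely used RDF representation: the same triples stored under three permutation indexes (SPO, POS, OSP), with the index chosen according to which positions of a triple pattern are bound. This is not the design space proposed in the paper. The class name `TinyTripleStore`, its methods, and the sample triples are hypothetical and exist only for this example.

```python
from collections import defaultdict


class TinyTripleStore:
    """In-memory triple store with three permutation indexes (illustrative only)."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects
        self.osp = defaultdict(lambda: defaultdict(set))  # object -> subject -> predicates

    def add(self, s, p, o):
        # Every triple is stored three times: faster lookups, more memory.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Yield triples matching the pattern; None marks an unbound position."""
        if s is not None:
            # Subject bound: walk the SPO index.
            for p2, objects in self.spo.get(s, {}).items():
                if p is None or p == p2:
                    for o2 in objects:
                        if o is None or o == o2:
                            yield (s, p2, o2)
        elif p is not None:
            # Predicate bound, subject unbound: walk the POS index.
            for o2, subjects in self.pos.get(p, {}).items():
                if o is None or o == o2:
                    for s2 in subjects:
                        yield (s2, p, o2)
        elif o is not None:
            # Only the object is bound: walk the OSP index.
            for s2, predicates in self.osp.get(o, {}).items():
                for p2 in predicates:
                    yield (s2, p2, o)
        else:
            # Nothing bound: full scan.
            for s2, by_predicate in self.spo.items():
                for p2, objects in by_predicate.items():
                    for o2 in objects:
                        yield (s2, p2, o2)


if __name__ == "__main__":
    store = TinyTripleStore()
    store.add("alice", "knows", "bob")
    store.add("bob", "knows", "carol")
    store.add("alice", "worksAt", "acme")
    print(sorted(store.match(p="knows")))
```

A pattern such as `(?s, knows, ?o)` is answered directly from the POS index, while a store holding only SPO would have to scan every subject. Whether the extra indexes pay for their memory depends on how often such patterns occur in the workload, which is the kind of question the abstract's empirical evaluation addresses.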
A Tree Logic with Graded Paths and Nominals
Regular tree grammars and regular path expressions constitute core constructs
widely used in programming languages and type systems. Nevertheless, there has
been little research so far on reasoning frameworks for path expressions where
node cardinality constraints occur along a path in a tree. We present a logic
capable of expressing deep counting along paths which may include arbitrary
recursive forward and backward navigation. The counting extensions can be seen
as a generalization of graded modalities that count immediate successor nodes.
While the combination of graded modalities, nominals, and inverse modalities
yields undecidable logics over graphs, we show that these features can be
combined in a tree logic decidable in exponential time.
Inference of Many-Taxon Phylogenies
Phylogenetic trees are tree topologies that represent the evolutionary history of a set of organisms. In this thesis, we address computational challenges related to the analysis of large-scale datasets with Maximum Likelihood based phylogenetic inference. We have approached this using different strategies: reduction of memory requirements, reduction of running time, and reduction of man-hours.
Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku
The increasing availability of machines relying on non-GPU architectures,
such as ARM A64FX in high-performance computing, provides a set of interesting
challenges to application developers. In addition to requiring code portability
across different parallelization schemes, programs targeting these
architectures have to be highly adaptable in terms of compute kernel sizes to
accommodate different execution characteristics for various heterogeneous
workloads. In this paper, we demonstrate an approach to code and performance
portability that is based entirely on established standards in the industry. In
addition to applying Kokkos as an abstraction over the execution of compute
kernels on different heterogeneous execution environments, we show that the use
of standard C++ constructs as exposed by the HPX runtime system enables superb
portability in terms of code and performance based on the real-world Octo-Tiger
astrophysics application. We report our experience with porting Octo-Tiger to
the ARM A64FX architecture provided by Stony Brook's Ookami and RIKEN's
Supercomputer Fugaku and compare the resulting performance with that achieved
on well-established GPU-oriented HPC machines such as ORNL's Summit, NERSC's
Perlmutter and CSCS's Piz Daint systems. Octo-Tiger scaled well on
Supercomputer Fugaku without any major code changes due to the abstraction
levels provided by HPX and Kokkos. Adding vectorization support for ARM's SVE
to Octo-Tiger was trivial thanks to using standard C++