Prospects and limitations of full-text index structures in genome analysis
The combination of incessant advances in sequencing technology, which produce large amounts of data, and innovative bioinformatics approaches designed to cope with this data flood has led to interesting new results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less well understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article provides a comprehensive, state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, and their practical limitations are explained and compared.
Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures
Feltor is a modular and free scientific software package. It allows
developing platform-independent code that runs on a variety of parallel
computer architectures ranging from laptop CPUs to multi-GPU distributed memory
systems. Feltor consists of both a numerical library and a collection of
application codes built on top of the library. Its main targets are two- and
three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin
methods as the main numerical discretization technique. We observe that
numerical simulations of a recently developed gyro-fluid model produce
non-deterministic results in parallel computations. First, we show how we
restore accuracy and bitwise reproducibility algorithmically and
programmatically. In particular, we adopt an implementation of the exactly
rounded dot product based on long accumulators, which avoids accuracy losses
especially in parallel applications. However, reproducibility and accuracy
alone fail to indicate correct simulation behaviour. In fact, in the physical
model slightly different initial conditions lead to vastly different end
states. This behaviour translates to its numerical representation. Pointwise
convergence, even in principle, becomes impossible for long simulation times.
In a second part, we explore important performance tuning considerations. We
identify latency and memory bandwidth as the main performance indicators of our
routines. Based on these, we propose a parallel performance model that predicts
the execution time of algorithms implemented in Feltor and test our model on a
selection of parallel hardware architectures. We are able to predict the
execution time with a relative error of less than 25% for problem sizes between
0.1 and 1000 MB. Finally, we find that the product of latency and bandwidth
gives a minimum array size per compute node to achieve a scaling efficiency
above 50% (both strong and weak).
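The reproducibility problem the abstract addresses can be shown in a few lines (a minimal sketch, not Feltor's long-accumulator implementation; Python's `math.fsum` plays the role of the exactly rounded sum here):

```python
import math
import random

# Floating-point addition is not associative, so parallel reductions that
# change the summation order can change the result bitwise.
random.seed(0)
x = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
     for _ in range(10_000)]

s_forward = sum(x)           # one summation order
s_sorted = sum(sorted(x))    # another order, as a parallel reduction might use

# math.fsum returns the exactly rounded sum, independent of summation order;
# this is the same effect a long accumulator achieves for dot products.
r_forward = math.fsum(x)
r_sorted = math.fsum(sorted(x))
print(s_forward == s_sorted, r_forward == r_sorted)
```

The naive sums typically disagree in the last bits, while the exactly rounded sums are bitwise identical regardless of order, which is precisely what restores reproducibility across parallel runs.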
Removing batch effects for prediction problems with frozen surrogate variable analysis
Batch effects are responsible for the failure of promising genomic prognostic signatures, major ambiguities in published genomic results, and retractions of widely publicized findings. Batch effect corrections have been developed to remove these artifacts, but they are designed to be used in population studies. However, genomic technologies are beginning to be used in clinical applications where samples are analyzed one at a time for diagnostic, prognostic, and predictive applications. There are currently no batch correction methods that have been developed specifically for prediction. In this paper, we propose a new method called frozen surrogate variable analysis (fSVA) that borrows strength from a training set for individual-sample batch correction. We show that fSVA improves prediction accuracy in simulations and in public genomic studies. fSVA is available as part of the sva Bioconductor package.
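The "frozen" idea — estimate batch-like directions once on a training set, then correct each new sample individually — can be sketched as follows. This is a hypothetical simplification using a plain SVD, not the sva package's actual fSVA algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_train = 100, 30

# Training expression matrix (genes x samples) with an additive batch direction.
batch_dir = rng.normal(size=n_genes)
train = (rng.normal(size=(n_genes, n_train))
         + np.outer(batch_dir, rng.normal(size=n_train)))

# "Freeze" the dominant batch-like gene-space direction from the training data.
u, s, vt = np.linalg.svd(train, full_matrices=False)
frozen = u[:, :1]  # orthonormal column(s) to remove from future samples

def correct(sample):
    """Remove the frozen direction's contribution from a single new sample."""
    return sample - frozen @ (frozen.T @ sample)

# A new sample arriving alone, contaminated by the same batch direction.
y = rng.normal(size=n_genes) + 3.0 * batch_dir
y_corrected = correct(y)

# After correction, the sample has (near-)zero component along the frozen
# direction, so it can be fed to a predictor trained on corrected data.
print(float(np.abs(frozen.T @ y_corrected).max()))
```

The key property mirrored here is that no population of new samples is needed: the projection is fixed from training data, so correction works one sample at a time, as the clinical setting requires.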
Lattice-Boltzmann and finite-difference simulations for the permeability for three-dimensional porous media
Numerical micropermeametry is performed on three-dimensional porous samples
having a linear size of approximately 3 mm and a resolution of 7.5 μm. One
of the samples is a microtomographic image of Fontainebleau sandstone. Two of
the samples are stochastic reconstructions with the same porosity, specific
surface area, and two-point correlation function as the Fontainebleau sample.
The fourth sample is a physical model which mimics the processes of
sedimentation, compaction and diagenesis of Fontainebleau sandstone. The
permeabilities of these samples are determined by numerically solving at low
Reynolds numbers the appropriate Stokes equations in the pore spaces of the
samples. The physical diagenesis model appears to reproduce the permeability of
the real sandstone sample quite accurately, while the permeabilities of the
stochastic reconstructions deviate from the latter by at least an order of
magnitude. This finding confirms earlier qualitative predictions based on local
porosity theory. Two numerical algorithms were used in these simulations. One
is based on the lattice-Boltzmann method, and the other on conventional
finite-difference techniques. The accuracy of these two methods is discussed
and compared, also with experiment. (Comment: to appear in Phys. Rev. E (2002), 32 pages, LaTeX, 1 figure.)
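A standard accuracy check for Stokes solvers of either kind is the plane channel, whose permeability is known analytically to be k = h²/12. The following sketch (not the paper's code, and using a 1D finite-difference Poiseuille problem rather than a 3D pore-space solve) illustrates how such a validation works:

```python
import numpy as np

# Solve -mu * u''(y) = dp/dx on [0, h] with no-slip walls u(0) = u(h) = 0,
# then recover the permeability from Darcy's law, k = mu * <u> / (dp/dx).
h, mu, grad_p = 1.0, 1.0, 1.0      # channel width, viscosity, pressure gradient
n = 401                            # interior grid points
dy = h / (n + 1)

# Tridiagonal matrix for the central second-difference operator, -mu * u''.
A = (mu / dy**2) * (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
u = np.linalg.solve(A, grad_p * np.ones(n))

u_full = np.concatenate(([0.0], u, [0.0]))                 # add wall values
mean_u = 0.5 * dy * (u_full[:-1] + u_full[1:]).sum() / h   # trapezoidal mean
k_numeric = mu * mean_u / grad_p
k_exact = h**2 / 12.0
print(abs(k_numeric - k_exact) / k_exact)  # small relative error
```

Comparing the recovered permeability against the analytic value quantifies the discretization error, which is the kind of accuracy assessment the abstract describes for the lattice-Boltzmann and finite-difference schemes.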