15,784 research outputs found
Multidimensional Range Queries on Modern Hardware
Range queries over multidimensional data are an important part of database
workloads in many applications. Their execution may be accelerated by using
multidimensional index structures (MDIS), such as kd-trees or R-trees. As for
most index structures, the usefulness of this approach depends on the
selectivity of the queries, and common wisdom told that a simple scan beats
MDIS for queries accessing more than 15%-20% of a dataset. However, this wisdom
is largely based on evaluations that are almost two decades old, performed on
data being held on disks, applying IO-optimized data structures, and using
single-core systems. The question is whether this rule of thumb still holds
when multidimensional range queries (MDRQ) are performed on modern
architectures with large main memories holding all data, multi-core CPUs and
data-parallel instruction sets. In this paper, we study the question whether
and how much modern hardware influences the performance ratio between index
structures and scans for MDRQ. To this end, we conservatively adapted three
popular MDIS, namely the R*-tree, the kd-tree, and the VA-file, to exploit
features of modern servers and compared their performance to different flavors
of parallel scans using multiple (synthetic and real-world) analytical
workloads over multiple (synthetic and real-world) datasets of varying size,
dimensionality, and skew. We find that all approaches benefit considerably from
using main memory and parallelization, yet to varying degrees. Our evaluation
indicates that, on current machines, scanning should be favored over parallel
versions of classical MDIS even for very selective queries
The potential for liquid biopsies in the precision medical treatment of breast cancer.
Currently the clinical management of breast cancer relies on relatively few prognostic/predictive clinical markers (estrogen receptor, progesterone receptor, HER2), based on primary tumor biology. Circulating biomarkers, such as circulating tumor DNA (ctDNA) or circulating tumor cells (CTCs) may enhance our treatment options by focusing on the very cells that are the direct precursors of distant metastatic disease, and probably inherently different than the primary tumor's biology. To shift the current clinical paradigm, assessing tumor biology in real time by molecularly profiling CTCs or ctDNA may serve to discover therapeutic targets, detect minimal residual disease and predict response to treatment. This review serves to elucidate the detection, characterization, and clinical application of CTCs and ctDNA with the goal of precision treatment of breast cancer
Distinguishing low frequency mutations from RT-PCR and sequence errors in viral deep sequencing data
There is a high prevalence of coronary artery disease (CAD) in patients with left bundle branch block (LBBB); however there are many other causes for this electrocardiographic abnormality. Non-invasive assessment of these patients remains difficult, and all commonly used modalities exhibit several drawbacks. This often leads to these patients undergoing invasive coronary angiography which may not have been necessary. In this review, we examine the uses and limitations of commonly performed non-invasive tests for diagnosis of CAD in patients with LBBB
GPU-Accelerated BWT Construction for Large Collection of Short Reads
Advances in DNA sequencing technology have stimulated the development of
algorithms and tools for processing very large collections of short strings
(reads). Short-read alignment and assembly are among the most well-studied
problems. Many state-of-the-art aligners, at their core, have used the
Burrows-Wheeler transform (BWT) as a main-memory index of a reference genome
(typical example, NCBI human genome). Recently, BWT has also found its use in
string-graph assembly, for indexing the reads (i.e., raw data from DNA
sequencers). In a typical data set, the volume of reads is tens of times of the
sequenced genome and can be up to 100 Gigabases. Note that a reference genome
is relatively stable and computing the index is not a frequent task. For reads,
the index has to computed from scratch for each given input. The ability of
efficient BWT construction becomes a much bigger concern than before. In this
paper, we present a practical method called CX1 for constructing the BWT of
very large string collections. CX1 is the first tool that can take advantage of
the parallelism given by a graphics processing unit (GPU, a relative cheap
device providing a thousand or more primitive cores), as well as simultaneously
the parallelism from a multi-core CPU and more interestingly, from a cluster of
GPU-enabled nodes. Using CX1, the BWT of a short-read collection of up to 100
Gigabases can be constructed in less than 2 hours using a machine equipped with
a quad-core CPU and a GPU, or in about 43 minutes using a cluster with 4 such
machines (the speedup is almost linear after excluding the first 16 minutes for
loading the reads from the hard disk). The previously fastest tool BRC is
measured to take 12 hours to process 100 Gigabases on one machine; it is
non-trivial how BRC can be parallelized to take advantage a cluster of
machines, let alone GPUs.Comment: 11 page
khmer: Working with Big Data in Bioinformatics
We introduce design and optimization considerations for the 'khmer' package.Comment: Invited chapter for forthcoming book on Performance of Open Source
Application
{BiQ} Analyzer {HiMod}: An Interactive Software Tool for High-throughput Locus-specific Analysis of 5-Methylcytosine and its Oxidized Derivatives
Recent data suggest important biological roles for oxidative modifications of methylated cytosines, specifically hydroxymethylation, formylation and carboxylation. Several assays are now available for profiling these DNA modifications genome-wide as well as in targeted, locus-specific settings. Here we present BiQ Analyzer HiMod, a user-friendly software tool for sequence alignment, quality control and initial analysis of locus-specific DNA modification data. The software supports four different assay types, and it leads the user from raw sequence reads to DNA modification statistics and publication-quality plots. BiQ Analyzer HiMod combines well-established graphical user interface of its predecessor tool, BiQ Analyzer HT, with new and extended analysis modes. BiQ Analyzer HiMod also includes updates of the analysis workspace, an intuitive interface, a custom vector graphics engine and support of additional input and output data formats. The tool is freely available as a stand-alone installation package from http://biq-analyzer-himod.bioinf.mpi-inf.mpg.de/
Efficient Processing of Range Queries in Main Memory
Datenbanksysteme verwenden Indexstrukturen, um Suchanfragen zu beschleunigen. Im Laufe der letzten Jahre haben Forscher verschiedene Ansätze zur Indexierung von Datenbanktabellen im Hauptspeicher entworfen. Hauptspeicherindexstrukturen versuchen möglichst häufig Daten zu verwenden, die bereits im Zwischenspeicher der CPU vorrätig sind, anstatt, wie bei traditionellen Datenbanksystemen, die Zugriffe auf den externen Speicher zu optimieren. Die meisten vorgeschlagenen Indexstrukturen für den Hauptspeicher beschränken sich jedoch auf Punktabfragen und vernachlässigen die ebenso wichtigen Bereichsabfragen, die in zahlreichen Anwendungen, wie in der Analyse von Genomdaten, Sensornetzwerken, oder analytischen Datenbanksystemen, zum Einsatz kommen.
Diese Dissertation verfolgt als Hauptziel die Fähigkeiten von modernen Hauptspeicherdatenbanksystemen im Ausführen von Bereichsabfragen zu verbessern. Dazu schlagen wir zunächst die Cache-Sensitive Skip List, eine neue aktualisierbare Hauptspeicherindexstruktur, vor, die für die Zwischenspeicher moderner Prozessoren optimiert ist und das Ausführen von Bereichsabfragen auf einzelnen Datenbankspalten ermöglicht. Im zweiten Abschnitt analysieren wir die Performanz von multidimensionalen Bereichsabfragen auf modernen Serverarchitekturen, bei denen Daten im Hauptspeicher hinterlegt sind und Prozessoren über SIMD-Instruktionen und Multithreading verfügen. Um die Relevanz unserer Experimente für praktische Anwendungen zu erhöhen, schlagen wir zudem einen realistischen Benchmark für multidimensionale Bereichsabfragen vor, der auf echten Genomdaten ausgeführt wird. Im letzten Abschnitt der Dissertation präsentieren wir den BB-Tree als neue, hochperformante und speichereffziente Hauptspeicherindexstruktur. Der BB-Tree ermöglicht das Ausführen von multidimensionalen Bereichs- und Punktabfragen und verfügt über einen parallelen Suchoperator, der mehrere Threads verwenden kann, um die Performanz von Suchanfragen zu erhöhen.Database systems employ index structures as means to accelerate search queries. Over the last years, the research community has proposed many different in-memory approaches that optimize cache misses instead of disk I/O, as opposed to disk-based systems, and make use of the grown parallel capabilities of modern CPUs. However, these techniques mainly focus on single-key lookups, but neglect equally important range queries. Range queries are an ubiquitous operator in data management commonly used in numerous domains, such as genomic analysis, sensor networks, or online analytical processing.
The main goal of this dissertation is thus to improve the capabilities of main-memory database systems with regard to executing range queries. To this end, we first propose a cache-optimized, updateable main-memory index structure, the cache-sensitive skip list, which targets the execution of range queries on single database columns. Second, we study the performance of multidimensional range queries on modern hardware, where data are stored in main memory and processors support SIMD instructions and multi-threading. We re-evaluate a previous rule of thumb suggesting that, on disk-based systems, scans outperform index structures for selectivities of approximately 15-20% or more. To increase the practical relevance of our analysis, we also contribute a novel benchmark consisting of several realistic multidimensional range queries applied to real- world genomic data. Third, based on the outcomes of our experimental analysis, we devise a novel, fast and space-effcient, main-memory based index structure, the BB- Tree, which supports multidimensional range and point queries and provides a parallel search operator that leverages the multi-threading capabilities of modern CPUs
- …