5 research outputs found

    Sequence Alignment Tools: One Parallel Pattern to Rule Them All?


    Parallel and Scalable Short-Read Alignment on Multi-Core Clusters Using UPC++

    The growth of next-generation sequencing (NGS) datasets poses a challenge to the alignment of reads to reference genomes in terms of alignment quality and execution speed. Some available aligners have been shown to obtain high-quality mappings at the expense of long execution times. Finding fast yet accurate software solutions is of high importance to research, since the availability and size of NGS datasets continue to increase. In this work we present an efficient parallelization approach for NGS short-read alignment on multi-core clusters. Our approach takes advantage of a distributed shared-memory programming model based on the new UPC++ language. Experimental results using the CUSHAW3 aligner show that our implementation, based on dynamic scheduling, obtains good scalability on multi-core clusters. Through our evaluation, we are able to complete the single-end and paired-end alignments of 246 million reads of length 150 base pairs in 11.54 and 16.64 minutes, respectively, using 32 nodes with four AMD Opteron 6272 16-core CPUs per node. In contrast, the multi-threaded original tool needs 2.77 and 5.54 hours to perform the same alignments on the 64 cores of one node. The source code of our parallel implementation is publicly available at the CUSHAW3 homepage (http://cushaw3.sourceforge.net).
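    The abstract describes a dynamic-scheduling scheme in which UPC++ processes pull chunks of reads from a shared work counter instead of receiving a fixed static partition. Below is a minimal sketch of that idea written against the current UPC++ v1.0 API; it is not the paper's implementation, and alignChunk and CHUNK_SIZE are hypothetical placeholders for the aligner call and the input partitioning.

        #include <upcxx/upcxx.hpp>
        #include <algorithm>
        #include <cstdint>

        // Hypothetical placeholders for the real aligner logic and input partitioning.
        constexpr std::int64_t CHUNK_SIZE = 100000;                 // reads per work unit
        void alignChunk(std::int64_t firstRead, std::int64_t lastRead) {
            // Align this block of reads with the underlying aligner (omitted in this sketch).
        }

        int main() {
            upcxx::init();

            const std::int64_t totalReads = 246000000;              // dataset size quoted in the abstract
            const std::int64_t numChunks  = (totalReads + CHUNK_SIZE - 1) / CHUNK_SIZE;

            // Rank 0 owns a global chunk counter; every rank obtains a pointer to it.
            upcxx::global_ptr<std::int64_t> counter;
            if (upcxx::rank_me() == 0) counter = upcxx::new_<std::int64_t>(0);
            counter = upcxx::broadcast(counter, 0).wait();

            // An atomic fetch-and-add on the shared counter implements dynamic scheduling:
            // each process keeps grabbing the next unprocessed chunk until none remain.
            upcxx::atomic_domain<std::int64_t> ad({upcxx::atomic_op::fetch_add});
            for (;;) {
                std::int64_t c = ad.fetch_add(counter, 1, std::memory_order_relaxed).wait();
                if (c >= numChunks) break;
                std::int64_t first = c * CHUNK_SIZE;
                std::int64_t last  = std::min(first + CHUNK_SIZE, totalReads) - 1;
                alignChunk(first, last);
            }

            ad.destroy();
            upcxx::barrier();
            upcxx::finalize();
            return 0;
        }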

    PiCo: A Domain-Specific Language for Data Analytics Pipelines

    In the world of Big Data analytics, there is a series of tools aiming at simplifying the programming of applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only informal (and often confusing) semantics is generally provided, all share a common underlying model, namely the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects of Big Data analytics tools from a high-level perspective. This analysis can be considered a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low-level with high-level aspects, as is often seen in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we think that a clearer and simpler semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, sitting on top of a stack of layers that builds a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As a result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm, and we show how the analyzed tools fit in each level. Second, we propose a programming environment based on such a layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user a unique interface for both stream and batch processing, completely hiding data management and focusing only on operations, which are represented by Pipeline stages. Our DSL is built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world.
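    As a rough illustration of the user-facing style the abstract describes (a Pipeline composed of operator stages, with data management hidden behind the framework), here is a hypothetical C++ mini-DSL in the same spirit. The names Pipeline, map and reduce below are illustrative only and are not PiCo's actual API; PiCo's real Pipelines are DAGs of operators over both streams and batches.

        #include <cstddef>
        #include <functional>
        #include <iostream>
        #include <numeric>
        #include <string>
        #include <vector>

        // Hypothetical mini-DSL: a Pipeline is a linear chain of stages over a collection.
        template <typename T>
        class Pipeline {
        public:
            explicit Pipeline(std::vector<T> data) : data_(std::move(data)) {}

            // Add a per-element transformation stage.
            template <typename U>
            Pipeline<U> map(std::function<U(const T&)> f) const {
                std::vector<U> out;
                out.reserve(data_.size());
                for (const auto& x : data_) out.push_back(f(x));
                return Pipeline<U>(std::move(out));
            }

            // Collapse the collection with a binary combiner (a terminal stage here).
            T reduce(std::function<T(const T&, const T&)> f, T init) const {
                return std::accumulate(data_.begin(), data_.end(), std::move(init), f);
            }

        private:
            std::vector<T> data_;
        };

        int main() {
            // Word-length sum: map each word to its length, then reduce by addition.
            Pipeline<std::string> p({"big", "data", "analytics"});
            std::size_t total =
                p.map<std::size_t>([](const std::string& w) { return w.size(); })
                 .reduce([](const std::size_t& a, const std::size_t& b) { return a + b; }, 0);
            std::cout << total << "\n";   // prints 16
            return 0;
        }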

    Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity

    The implementation of DNA alignment tools for bioinformatics leads to several problems that hurt performance. A single alignment takes an amount of time that is not predictable, and different factors can affect performance: for instance, the length of the sequences determines the computational grain of the task, and mismatches or insertions/deletions (indels) increase the time needed to complete an alignment. Moreover, alignment is a strongly memory-bound problem because of irregular memory access patterns and limitations in memory bandwidth. Over the years, many alignment tools have been implemented. A concrete example is Bowtie2, one of the fastest (concurrent, Pthread-based) state-of-the-art non-GPU-based alignment tools. Bowtie2 exploits concurrency by instantiating a pool of threads, which have access to a global input dataset, share the reference genome, and have access to different objects for collecting alignment results. In this paper a modified implementation of Bowtie2 is presented, in which the concurrency structure has been changed. The proposed implementation exploits the task-farm skeleton pattern implemented as a Master-Worker. The Master-Worker pattern delegates dataset reading to the Master thread alone and makes private to each Worker the data structures that are shared in the original version; only the reference genome is left shared. As a further optimisation, the Master and each Worker are pinned on cores and the reference genome is allocated interleaved among memory nodes. The proposed implementation gains up to 10 speedup points over the original implementation.
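    A minimal sketch of the concurrency structure described above: a Master thread that alone reads the input and enqueues work, Workers that keep private result buffers, and every thread pinned to a core. This is an illustrative shared-memory skeleton, not the paper's Bowtie2 patch; the queue here is a simple mutex-protected one rather than a lock-less channel, and NUMA-interleaved allocation of the reference (e.g. libnuma's numa_alloc_interleaved) is only noted in a comment.

        #include <condition_variable>
        #include <iostream>
        #include <mutex>
        #include <optional>
        #include <pthread.h>
        #include <queue>
        #include <sched.h>
        #include <string>
        #include <thread>
        #include <vector>

        // Pin the calling thread to a single core (Linux-specific; errors ignored in this sketch).
        static void pinToCore(int core) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }

        // Mutex-protected work queue (the paper relies on lock-less channels instead).
        class WorkQueue {
        public:
            void push(std::string item) {
                { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
                cv_.notify_one();
            }
            void close() {
                { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
                cv_.notify_all();
            }
            std::optional<std::string> pop() {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return closed_ || !q_.empty(); });
                if (q_.empty()) return std::nullopt;          // closed and drained
                std::string item = std::move(q_.front());
                q_.pop();
                return item;
            }
        private:
            std::mutex m_;
            std::condition_variable cv_;
            std::queue<std::string> q_;
            bool closed_ = false;
        };

        int main() {
            // The reference genome stays shared (and, in the paper, is allocated
            // interleaved across NUMA nodes, e.g. with numa_alloc_interleaved).
            const int numWorkers = 4;
            WorkQueue queue;

            // Workers: each keeps a private buffer of results, so only the queue
            // and the reference are shared among threads.
            std::vector<std::thread> workers;
            for (int w = 0; w < numWorkers; ++w) {
                workers.emplace_back([&queue, w] {
                    pinToCore(w + 1);                          // Master sits on core 0
                    std::vector<std::string> privateResults;   // per-Worker output
                    while (auto read = queue.pop())
                        privateResults.push_back("aligned:" + *read);  // stand-in for alignment
                    std::cout << "worker " << w << " aligned "
                              << privateResults.size() << " reads\n";
                });
            }

            // Master: the only thread that touches the input dataset.
            pinToCore(0);
            for (int i = 0; i < 20; ++i)
                queue.push("read_" + std::to_string(i));
            queue.close();

            for (auto& t : workers) t.join();
            return 0;
        }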

    Insights into protein-RNA complexes from computational analyses of iCLIP experiments

    RNA-binding proteins (RBPs) are the primary regulators of all aspects of posttranscriptional gene regulation. In order to understand how RBPs perform their function, it is important to identify their binding sites. Recently, new techniques have been developed to employ high-throughput sequencing to study protein-RNA interactions in vivo, including the individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP). iCLIP identifies sites of protein-RNA crosslinking with nucleotide resolution in a transcriptome-wide manner. It is composed of over 60 steps, which can be modified, but it is not clear how variations in the method affect the assignment of RNA binding sites. This is even more pertinent given that several variants of iCLIP have been developed. A central question of my research is how to correctly assign binding sites to RBPs using the data produced by iCLIP and similar techniques. I first focused on the technical analyses and solutions for the iCLIP method. I examined cDNA deletions and crosslink-associated motifs to show that the starts of cDNAs are appropriate to assign the crosslink sites in all variants of CLIP, including iCLIP, eCLIP and irCLIP. I also showed that the non-coinciding cDNA-starts are caused by technical conditions in the iCLIP protocol that may lead to sequence constraints at cDNA-ends in the final cDNA library. I also demonstrated the importance of fully optimizing the RNase and purification conditions in iCLIP to avoid these cDNA-end constraints. Next, I developed CLIPo, a computational framework that assesses various features of iCLIP data to provide quality control standards, which reveal how technical variations between experiments affect the specificity of assigned binding sites. I used CLIPo to compare multiple PTBP1 experiments produced by iCLIP, eCLIP and irCLIP, to reveal major effects of sequence constraints at cDNA-ends or starts, cDNA length distribution and non-specific contaminants. Moreover, I assessed how the variations between these methods influence the mechanistic conclusions. Thus, CLIPo presents the quality control standards for transcriptome-wide assignment of protein-RNA binding sites. I continued with analyses of RBP complexes by using data from spliceosome-iCLIP. This method simultaneously detects crosslink sites of small nuclear ribonucleoproteins (snRNPs) and auxiliary splicing factors on pre-mRNAs. I demonstrated that the high resolution of spliceosome-iCLIP allows for distinction between multiple proximal RNA binding sites, which can be valuable for transcriptome-wide studies of large ribonucleoprotein complexes. Moreover, I showed that spliceosome-iCLIP can experimentally identify over 50,000 human branch points. In summary, I detected technical biases from iCLIP data, and demonstrated how such biases can be avoided, so that cDNA-starts appropriately assign the RNA binding sites. CLIPo analysis proved a useful quality control tool that evaluates data specificity across different methods, and I applied it to iCLIP, irCLIP and ENCODE eCLIP datasets. I presented how spliceosome-iCLIP data can be used to study the splicing machinery on pre-mRNAs and how to use constrained cDNAs from spliceosome-iCLIP data to identify branch points on a genome-wide scale. Taken together, these studies provide new insights into the field of RNA biology and can be used for future studies of iCLIP and related methods.
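    As context for the cDNA-start rule mentioned above: in iCLIP-style analyses the crosslink site is conventionally assigned to the nucleotide immediately upstream (5') of the aligned cDNA start. The small C++ helper below only illustrates that convention under simple assumptions; the coordinate scheme and field names are hypothetical and are not taken from the thesis.

        #include <iostream>
        #include <string>

        // Minimal description of an aligned cDNA (read): 0-based genomic interval
        // [start, end) on a chromosome, plus the strand it maps to.
        struct AlignedRead {
            std::string chrom;
            long start;    // leftmost aligned position, 0-based
            long end;      // one past the rightmost aligned position
            char strand;   // '+' or '-'
        };

        // Assign the crosslink site to the nucleotide immediately 5' of the cDNA start:
        // one base before 'start' on the + strand, the base at 'end' on the - strand.
        long crosslinkSite(const AlignedRead& r) {
            return (r.strand == '+') ? r.start - 1 : r.end;
        }

        int main() {
            AlignedRead plus  {"chr1", 1000, 1035, '+'};
            AlignedRead minus {"chr1", 2000, 2035, '-'};
            std::cout << crosslinkSite(plus)  << "\n";   // 999
            std::cout << crosslinkSite(minus) << "\n";   // 2035
            return 0;
        }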