
    Accelerating pairwise sequence alignment on GPUs using the Wavefront Algorithm

    Advances in genomics and sequencing technologies demand faster and more scalable analysis methods that can process longer sequences with higher accuracy. However, classical pairwise alignment methods based on dynamic programming (DP) impose impractical computational requirements when aligning long and noisy sequences like those produced by PacBio and Nanopore technologies. The recently proposed Wavefront Alignment (WFA) algorithm paves the way for more efficient alignment tools, improving time and memory complexity over previous methods. Beyond these algorithmic advantages, modern high-performance computing (HPC) platforms rely on accelerator-based architectures that exploit parallel computing resources to outperform classical CPUs. Hence, a GPU-enabled implementation of the WFA could exploit the hardware resources of modern GPUs and further accelerate sequence alignment in current genome analysis pipelines. This thesis presents two GPU-accelerated implementations based on the WFA for fast pairwise DNA sequence alignment: eWFA-GPU and WFA-GPU. Our first proposal, eWFA-GPU, computes the exact edit-distance alignment between two short sequences (up to a few thousand bases), taking full advantage of the massively parallel capabilities of modern GPUs. We propose a succinct representation of the alignment data that successfully reduces the overall amount of memory required, allowing the exploitation of the fast on-chip memory of a GPU. Our results show that eWFA-GPU outperforms the edit-distance WFA implementation running on a 20-core machine by 3-9X. Compared to other state-of-the-art tools computing the edit distance, eWFA-GPU is up to 265X faster than CPU tools and up to 56X faster than other GPU-enabled implementations. Our second contribution, the WFA-GPU tool, extends the work of eWFA-GPU to compute the exact gap-affine distance (i.e., a more general alignment problem) between arbitrarily long sequences. In this work, we propose a CPU-GPU co-design capable of performing inter- and intra-sequence parallel alignment of multiple sequences, combining a succinct WFA data representation with an efficient GPU implementation. As a result, we demonstrate that our implementation outperforms the original WFA implementation by 1.5-7.7X when computing the alignment path and by 2.6-16X when computing only the alignment score. Moreover, compared to other state-of-the-art tools, WFA-GPU is up to 26.7X faster than other GPU implementations and up to four orders of magnitude faster than other CPU implementations.
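    To make the wavefront approach concrete, the sketch below is a minimal Python reimplementation of the WFA recurrence for plain edit distance (unit-cost mismatches and gaps). It is illustrative only, not the eWFA-GPU or WFA-GPU code; the function name and data layout are our own. Each wavefront stores, per diagonal, the furthest text offset reachable with the current score, and matches are extended for free before the score is incremented.

```python
def wfa_edit_distance(pattern: str, text: str) -> int:
    """Wavefront computation of the edit distance (illustrative sketch).

    Diagonals are k = h - v, where h is the offset in `text` and v the
    offset in `pattern`; wf[k] holds the furthest h reachable on diagonal
    k with the current score. Matching characters are consumed for free,
    so the work grows with the alignment score rather than with m * n.
    """
    m, n = len(pattern), len(text)
    k_final = n - m                 # diagonal containing the (m, n) cell
    NEG = -(10 ** 9)                # stand-in for "diagonal not reached yet"
    wf = {0: 0}
    score = 0
    while True:
        # Extend every diagonal over exact matches (cost-free moves).
        for k in wf:
            h = wf[k]
            v = h - k
            while v < m and h < n and pattern[v] == text[h]:
                v += 1
                h += 1
            wf[k] = h
        if wf.get(k_final, NEG) >= n:
            return score            # reached the end of both sequences
        # Build the next wavefront: mismatch, insertion, deletion cost 1 each.
        score += 1
        nxt = {}
        for k in range(-score, score + 1):
            h = max(wf.get(k - 1, NEG) + 1,   # insertion (consume a text char)
                    wf.get(k, NEG) + 1,       # mismatch (consume one of each)
                    wf.get(k + 1, NEG))       # deletion (consume a pattern char)
            if 0 <= h <= n and 0 <= h - k <= m:
                nxt[k] = h
        wf = nxt


# Tiny usage example: the two sequences differ by one substitution.
assert wfa_edit_distance("GATTACA", "GATCACA") == 1
```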

    Algorithm-Hardware Co-Design for Performance-driven Embedded Genomics

    Genomics includes the development of techniques for the diagnosis, prognosis and therapy of over 6000 known genetic disorders. It is a major driver in the transformation of medicine from its reactive form to the personalized, predictive, preventive and participatory (P4) form. The availability of the genome is an essential prerequisite to genomics and is obtained from the sequencing and analysis pipelines of whole genome sequencing (WGS). The advent of second-generation sequencing (SGS) significantly reduced sequencing costs, leading to voluminous research in genomics. SGS technologies, however, generate massive volumes of data in the form of reads, which are fragments of the real genome. The performance requirements associated with mapping reads to the reference genome (RG), in order to reassemble the original genome, now stand disproportionate to the available computational capabilities. Conventionally, the hardware resources used are homogeneous many-core architectures employing complex general-purpose CPU cores. Although these cores provide high performance, a data-centric approach is required to identify alternative hardware systems more suitable for affordable and sustainable genome analysis. Most state-of-the-art genomic tools are performance-oriented and do not address the crucial aspect of energy consumption. Although algorithmic innovations have reduced runtime on conventional hardware, energy consumption has scaled poorly. The associated monetary and environmental costs have made it a major bottleneck to translational genomics. This thesis is concerned with the development and validation of read mappers for the embedded genomics paradigm, aiming to provide a portable and energy-efficient hardware solution to the reassembly pipeline. It applies the algorithm-hardware co-design approach to bridge the saturation point reached by algorithmic innovations with emerging low-power/energy heterogeneous embedded platforms. Essential to the embedded paradigm is the ability to use heterogeneous hardware resources. Graphics processing units (GPUs) are often available in most modern devices alongside the CPU, but state-of-the-art read mappers are conventionally not tuned to use both together. The first part of the thesis develops a Cross-platfOrm Read mApper using opencL (CORAL) that can distribute the workload across all available devices for high performance. The OpenCL framework mitigates the need for designing separate kernels for the CPU and GPU. CORAL implements a verification-aware filtration algorithm for rapid pruning and identification of candidate locations for mapping reads to the RG. Mapping reads on embedded platforms decreases performance due to architectural differences such as limited on-chip/off-chip memory, smaller bandwidths and simpler cores. To mitigate this performance degradation, in the second part of the thesis we propose a REad maPper for heterogeneoUs sysTEms (REPUTE), which uses an efficient dynamic programming (DP) based filtration methodology. Using algorithm-hardware co-design and kernel-level optimizations to reduce its memory footprint, REPUTE demonstrates significant energy savings on the HiKey970 embedded platform with acceptable performance. The third part of the thesis concentrates on mapping the whole genome on an embedded platform. We propose a Pyopencl based tooL for gEnomic workloaDs tarGeting Embedded platfoRms (PLEDGER), which includes two novel contributions. The first is a preprocessing strategy that generates low-memory-footprint (LMF) data structures able to fit all human chromosomes at some cost in performance. The second is an LMF DP-based filtration method that works in conjunction with the proposed data structures. To mitigate performance degradation, the kernel employs several optimisations, including extensive use of bit-vector operations. Extensive experiments using real human reads were carried out with state-of-the-art read mappers on 5 different platforms for CORAL, REPUTE and PLEDGER. The results show that embedded genomics provides significant energy savings with similar performance compared to conventional CPU-based platforms.
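    As an illustration of the kind of bit-vector DP that such filtration and verification kernels typically rely on, here is a minimal single-word Python sketch of Myers' (1999) bit-parallel edit-distance computation. It is not code from CORAL, REPUTE or PLEDGER; the function name, the semi-global/global switch and the acceptance check at the end are our own assumptions about how a verifier might use it.

```python
def myers_edit_distance(pattern: str, text: str, semi_global: bool = True) -> int:
    """Bit-parallel DP in the style of Myers (1999), one word per text column.

    The m cell-to-cell differences of a DP column are packed into bit-vectors
    (vp/vn: vertical +1/-1, ph/mh: horizontal +1/-1), so each text character
    is processed with a handful of word operations instead of m cell updates.
    semi_global=True returns the best score over all end positions in `text`
    (the usual mode for verifying a candidate mapping location);
    semi_global=False returns the plain global edit distance.
    """
    m = len(pattern)
    if m == 0:
        return 0 if semi_global else len(text)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    peq = {}                              # peq[c]: bit i set iff pattern[i] == c
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    vp, vn = mask, 0
    score = best = m
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | vn
        xh = ((((eq & vp) + vp) & mask) ^ vp) | eq
        ph = vn | (~(xh | vp) & mask)
        mh = vp & xh
        if ph & high:
            score += 1
        if mh & high:
            score -= 1
        ph = ((ph << 1) | (0 if semi_global else 1)) & mask
        mh = (mh << 1) & mask
        vp = mh | (~(xv | ph) & mask)
        vn = ph & xv
        best = min(best, score)
    return best if semi_global else score


# Hypothetical verification step: a candidate location is kept only if the
# read aligns to the reference window within the error budget.
# accept = myers_edit_distance(read, ref_window) <= max_errors
```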

    Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons

    The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and the sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain the similarity among all pairs of a set of large samples of genomes. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
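    For intuition, here is a minimal single-node Python/SciPy sketch of the matrix encoding: samples become rows of a sparse binary matrix, intersection sizes for all pairs come from one sparse matrix product, and unions follow from the row sums via |A ∪ B| = |A| + |B| - |A ∩ B|. The communication-avoiding distributed formulation is the paper's contribution; the representation below (hashed k-mer identifiers as set elements, a dense output matrix) is an assumption made for brevity.

```python
import numpy as np
from scipy.sparse import csr_matrix


def pairwise_jaccard(samples, universe_size):
    """All-pairs Jaccard similarity via one sparse matrix product.

    `samples` is a list of sets of hashed k-mer IDs in [0, universe_size).
    Row i of the binary matrix A marks the k-mers present in sample i, so
    (A @ A.T)[i, j] is the intersection size |S_i & S_j|, and the union
    size follows from the per-row totals.
    """
    rows, cols = [], []
    for i, s in enumerate(samples):
        rows.extend([i] * len(s))
        cols.extend(s)
    data = np.ones(len(rows), dtype=np.int64)
    A = csr_matrix((data, (rows, cols)), shape=(len(samples), universe_size))
    inter = (A @ A.T).toarray()                 # pairwise intersection sizes
    sizes = np.asarray(A.sum(axis=1)).ravel()   # |S_i| for every sample
    union = sizes[:, None] + sizes[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, inter / union, 1.0)


# Example: {1,2,3} vs {2,3,4} -> intersection 2, union 4 -> J = 0.5
print(pairwise_jaccard([{1, 2, 3}, {2, 3, 4}], universe_size=8)[0, 1])
```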

    NASA Tech Briefs, September 2012

    Topics covered include: Beat-to-Beat Blood Pressure Monitor; Measurement Techniques for Clock Jitter; Lightweight, Miniature Inertial Measurement System; Optical Density Analysis of X-Rays Utilizing Calibration Tooling to Estimate Thickness of Parts; Fuel Cell/Electrochemical Cell Voltage Monitor; Anomaly Detection Techniques with Real Test Data from a Spinning Turbine Engine-Like Rotor; Measuring Air Leaks into the Vacuum Space of Large Liquid Hydrogen Tanks; Antenna Calibration and Measurement Equipment; Glass Solder Approach for Robust, Low-Loss, Fiber-to-Waveguide Coupling; Lightweight Metal Matrix Composite Segmented for Manufacturing High-Precision Mirrors; Plasma Treatment to Remove Carbon from Indium UV Filters; Telerobotics Workstation (TRWS) for Deep Space Habitats; Single-Pole Double-Throw MMIC Switches for a Microwave Radiometer; On Shaft Data Acquisition System (OSDAS); ASIC Readout Circuit Architecture for Large Geiger Photodiode Arrays; Flexible Architecture for FPGAs in Embedded Systems; Polyurea-Based Aerogel Monoliths and Composites; Resin-Impregnated Carbon Ablator: A New Ablative Material for Hyperbolic Entry Speeds; Self-Cleaning Particulate Prefilter Media; Modular, Rapid Propellant Loading System/Cryogenic Testbed; Compact, Low-Force, Low-Noise Linear Actuator; Loop Heat Pipe with Thermal Control Valve as a Variable Thermal Link; Process for Measuring Over-Center Distances; Hands-Free Transcranial Color Doppler Probe; Improving Balance Function Using Low Levels of Electrical Stimulation of the Balance Organs; Developing Physiologic Models for Emergency Medical Procedures Under Microgravity; PMA-Linked Fluorescence for Rapid Detection of Viable Bacterial Endospores; Portable Intravenous Fluid Production Device for Ground Use; Adaptation of a Filter Assembly to Assess Microbial Bioburden of Pressurant Within a Propulsion System; Multiplexed Force and Deflection Sensing Shell Membranes for Robotic Manipulators; Whispering Gallery Mode Optomechanical Resonator; Vision-Aided Autonomous Landing and Ingress of Micro Aerial Vehicles; Self-Sealing Wet Chemistry Cell for Field Analysis; General MACOS Interface for Modeling and Analysis for Controlled Optical Systems; Mars Technology Rover with Arm-Mounted Percussive Coring Tool, Microimager, and Sample-Handling Encapsulation Containerization Subsystem; Fault-Tolerant, Real-Time, Multi-Core Computer System; Water Detection Based on Object Reflections; SATPLOT for Analysis of SECCHI Heliospheric Imager Data; Plug-in Plan Tool v3.0.3.1; Frequency Correction for MIRO Chirp Transformation Spectroscopy Spectrum; Nonlinear Estimation Approach to Real-Time Georegistration from Aerial Images; Optimal Force Control of Vibro-Impact Systems for Autonomous Drilling Applications; Low-Cost Telemetry System for Small/Micro Satellites; Operator Interface and Control Software for the Reconfigurable Surface System Tri-ATHLETE; and Algorithms for Determining Physical Responses of Structures Under Load

    Genome sequence alignment in processing-in-memory architectures

    Finally, we also carry out an experimental study of several architectures featuring different memory technologies (DDR and HBM) and processing cores of various types, in some cases exploiting processing-in-memory (PIM). The reference application is Bowtie2, a complete application for genome sequence alignment. These architectures are implemented and evaluated using an architectural simulator based on gem5. The combination of an emerging bottleneck in data access and the growing importance of data-intensive applications, which are heavily constrained by the memory system, creates a significant problem that must be addressed. In this thesis we therefore set out to tackle this problem and reduce its effect as far as possible. The main objective of this thesis is the design of new architectural and algorithmic solutions to overcome the bottleneck known as the memory wall and to improve the performance of memory-intensive applications that cannot benefit sufficiently from current memory hierarchies. In addition, we believe it is essential to focus on energy efficiency, a factor whose importance grows every day and one of the most limiting factors in high-performance computing. The main contributions of this thesis are the following. First, we analyse the behaviour of applications with random memory accesses, which do not properly exploit modern memory architectures with deep cache hierarchies. Specifically, we analyse the FM-index data structure and a sequence-search algorithm based on it, widely used in genome sequence alignment. After this analysis, and having gained a more detailed understanding of the memory bottleneck, we propose a new version of the FM-index that reduces memory bandwidth consumption and thereby significantly improves computational performance. We then propose a new energy-efficient architecture based on a 3D-stacked memory cube to which simple low-power cores are added in its logic layer. This architecture enables execution close to the data (near-data processing).
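    To make the analysed access pattern concrete, here is a minimal, naive Python sketch of FM-index construction and backward search (the counting step behind seed searching). It is illustrative only: a real index stores compressed rank structures rather than full per-symbol prefix-sum tables, and the function names are ours. The near-random rank lookups in the search loop are exactly the kind of accesses that defeat deep cache hierarchies.

```python
def build_fm_index(text):
    """Naive FM-index (suffix array by sorting; fine for a small sketch)."""
    text += "$"                                   # unique terminator
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)        # Burrows-Wheeler transform
    counts = {}
    for c in bwt:
        counts[c] = counts.get(c, 0) + 1
    C, total = {}, 0                              # C[c]: #symbols smaller than c
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, b in enumerate(bwt):                   # occ[c][i]: #c in bwt[:i]
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if b == c else 0)
    return C, occ, len(bwt)


def fm_count(pattern, C, occ, n):
    """Backward search: two rank queries per pattern symbol, right to left.

    [lo, hi) is the suffix-array interval of suffixes prefixed by the part
    of the pattern processed so far; each occ lookup lands at an essentially
    random memory position, which is the bottleneck analysed in the thesis.
    """
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


C, occ, n = build_fm_index("abracadabra")
print(fm_count("abra", C, occ, n))   # -> 2 occurrences
```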

    Efficient algorithms and architectures for protein 3-D structure comparison

    Protein structure comparison (PSC) is a well-developed field of computational proteomics that attracts active interest, since it is widely used in structural biology and drug discovery. The fast-increasing computational demand of all-to-all protein structure comparison is the result of mainly three factors: rapidly expanding structural proteomics databases, the high computational complexity of pairwise PSC algorithms, and the trend towards using multiple comparison criteria and combining their results (multi-criteria protein structure comparison, MCPSC), since no single PSC method is commonly accepted as the best. In this thesis we have developed a software framework that exploits many-core and multi-core CPUs to implement efficient parallel MCPSC schemes in modern processors, based on three popular PSC methods, namely TMalign, CE, and USM. We evaluate and compare the performance and efficiency of two parallel MCPSC implementations using Intel's experimental many-core, network-on-chip Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multi-core processor. Further, we have developed a dataset processing pipeline and implemented it in a Python utility, called pyMCPSC, allowing users to perform MCPSC efficiently on multi-core CPUs. pyMCPSC, which combines five PSC methods and five different consensus scoring schemes, facilitates the analysis of similarities in protein domain datasets and can easily be extended to incorporate more PSC methods in the consensus scoring as they become available.
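    As a rough illustration of what a consensus scheme can look like, the sketch below z-score normalises each method's pairwise similarity scores and averages them. This is a generic way of combining differently scaled PSC measures, not necessarily one of the five schemes implemented in pyMCPSC, and the function name is ours.

```python
import numpy as np


def consensus_scores(method_scores):
    """Combine per-method similarity scores into one consensus score.

    `method_scores` maps a PSC method name (e.g. 'TMalign', 'CE', 'USM') to
    an array of similarity scores over the same list of domain pairs. Each
    method is z-score normalised so that differently scaled measures become
    comparable; the mean of the normalised scores is returned as consensus.
    """
    normalised = []
    for scores in method_scores.values():
        s = np.asarray(scores, dtype=float)
        std = s.std()
        normalised.append((s - s.mean()) / std if std > 0 else np.zeros_like(s))
    return np.mean(normalised, axis=0)
```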

    Approximation and Relaxation Approaches for Parallel and Distributed Machine Learning

    Large-scale machine learning requires tradeoffs. Commonly, this tradeoff has led practitioners to choose simpler, less powerful models, e.g. linear models, in order to process more training examples in a limited time. In this work, we introduce parallelism to the training of non-linear models by leveraging a different tradeoff: approximation. We demonstrate various techniques by which non-linear models can be made amenable to larger data sets and significantly more training parallelism by strategically introducing approximation in certain optimization steps. For gradient boosted regression tree ensembles, we replace precise selection of tree splits with a coarse-grained, approximate split selection, yielding both faster sequential training and a significant increase in parallelism, in the distributed setting in particular. For metric learning with nearest neighbor classification, rather than explicitly training a neighborhood structure, we leverage the implicit neighborhood structure induced by task-specific random forest classifiers, yielding a highly parallel method for metric learning. For support vector machines, we follow existing work to learn a reduced basis set with extremely high parallelism, particularly on GPUs, via existing linear algebra libraries. We believe these optimization tradeoffs are widely applicable wherever machine learning is put into practice in large-scale settings. By carefully introducing approximation, we also introduce significantly higher parallelism and consequently can process more training examples for more iterations than competing exact methods. While seemingly learning the model with less precision, this tradeoff often yields noticeably higher accuracy under a restricted training time budget.
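    For a concrete picture of the boosted-tree tradeoff, the sketch below shows coarse-grained split selection for a single feature: candidate thresholds are restricted to quantile-bin edges, and the per-bin statistics can be accumulated on separate data partitions and merged, which is what buys the extra parallelism. It is a generic histogram-based scheme under a squared-error criterion, not the dissertation's exact implementation; all names are illustrative.

```python
import numpy as np


def approx_best_split(x, residuals, n_bins=32):
    """Histogram-based (approximate) split selection for one feature.

    Instead of evaluating every distinct feature value as a split point,
    feature values are bucketed into n_bins quantile bins and only the bin
    edges are considered as candidate thresholds. The per-bin sums below
    are simple reductions, so they can be computed per data partition and
    merged before scanning for the best split.
    Returns (threshold, gain) for a squared-error criterion.
    """
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x, side="right")       # bin id per sample
    sum_r = np.bincount(bins, weights=residuals, minlength=n_bins)
    cnt = np.bincount(bins, minlength=n_bins).astype(float)
    tot_r, tot_n = residuals.sum(), float(len(x))
    best_gain, best_thr = -np.inf, None
    left_r = left_n = 0.0
    for b in range(n_bins - 1):                          # split after bin b
        left_r += sum_r[b]
        left_n += cnt[b]
        right_n = tot_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_r = tot_r - left_r
        # Variance-reduction proxy: weighted sum of squared side means.
        gain = left_r**2 / left_n + right_r**2 / right_n - tot_r**2 / tot_n
        if gain > best_gain:
            best_gain, best_thr = gain, edges[b]
    return best_thr, best_gain
```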

    Models, Algorithms, and Architectures for Scalable Packet Classification

    The growth and diversification of the Internet imposes increasing demands on the performance and functionality of network infrastructure. Routers, the devices responsible for the switching and directing of traffic in the Internet, are being called upon not only to handle increased volumes of traffic at higher speeds, but also to impose tighter security policies and provide support for a richer set of network services. This dissertation addresses the searching tasks performed by Internet routers in order to forward packets and apply network services to packets belonging to defined traffic flows. As these searching tasks must be performed for each packet traversing the router, the speed and scalability of the solutions to the route lookup and packet classification problems largely determine the realizable performance of the router, and hence the Internet as a whole. Despite the energetic attention of the academic and corporate research communities, there remains a need for search engines that scale to support faster communication links, larger route tables and filter sets, and increasingly complex filters. The major contributions of this work include the design and analysis of a scalable hardware implementation of a Longest Prefix Matching (LPM) search engine for route lookup, a survey and taxonomy of packet classification techniques, a thorough analysis of packet classification filter sets, the design and analysis of a suite of performance evaluation tools for packet classification algorithms and devices, and a new packet classification algorithm that scales to support high-speed links and large filter sets classifying on additional packet fields.
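    To illustrate the route-lookup problem itself, here is a minimal binary-trie sketch of Longest Prefix Matching over IPv4 destination addresses. It only captures the matching semantics; the dissertation's contribution is a scalable hardware search engine, and real designs would replace this per-bit pointer chasing with multi-bit (stride) or pipelined structures. Class and interface names below are our own.

```python
class _Node:
    __slots__ = ("children", "next_hop")

    def __init__(self):
        self.children = [None, None]   # one child per address bit (0 / 1)
        self.next_hop = None           # set only if a prefix ends at this node


class LpmTable:
    """Longest Prefix Matching over 32-bit IPv4 addresses with a binary trie."""

    def __init__(self):
        self.root = _Node()

    def insert(self, prefix, length, next_hop):
        """Add `prefix`/`length`, with the prefix given as a 32-bit integer."""
        node = self.root
        for i in range(length):
            bit = (prefix >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = _Node()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(self, addr):
        """Walk at most 32 nodes, remembering the deepest stored next hop."""
        node, best = self.root, self.root.next_hop
        for i in range(32):
            node = node.children[(addr >> (31 - i)) & 1]
            if node is None:
                break
            if node.next_hop is not None:
                best = node.next_hop
        return best


table = LpmTable()
table.insert(0x0A000000, 8, "if0")        # 10.0.0.0/8
table.insert(0x0A010000, 16, "if1")       # 10.1.0.0/16
print(table.lookup(0x0A010203))           # 10.1.2.3 -> "if1" (the /16 wins)
```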