50 research outputs found

    Accelerated Profile HMM Searches

    Get PDF
    Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches

    SIMD code generation in data-parallel programming

    Get PDF
    Today';s desktop PCs feature a variety of parallel processing units. Developing applications that exploit this parallelism is a demanding task, and a programmer has to obtain detailed knowledge about the hardware for efficient implementation. CGiS is a data-parallel programming language providing a unified abstraction for two parallel processing units: graphics processing units (GPUs) and the vector processing units of CPUs. The CGiS compiler framework fully virtualizes the differences in capability and accessibility by mapping an abstract data-parallel programming model on those targets. The applicability of CGiS for GPUs has been shown in previous work; this work focuses on applying the abstract programming model of CGiS to CPUs with SIMD (Single Instruction Multiple Data) instruction sets. We have identified, adapted and implemented a set of program analyses to expose and access the available parallelism. The code generation phase is based on selected optimization algorithms tailored to SIMD code generation. Via code generation profiles, it is possible to adapt the code generation strategy to different target architectures. To assess the effectiveness of our approach, we have implemented backends for the two most widespread SIMD instruction sets, namely Intel';s Streaming SIMD Extensions and Freescale';s AltiVec. Additionally, we integrated a prototypical backend for the Cell Broadband Engine as an example for a multi-core architecture. Our experimental results show excellent average performance gains by a factor of 3 compared to standard scalar C++ implementations and underline the viability of this approach: real-world applications can be implemented easily with CGiS and result in efficient code.Parallelverarbeitung wird heutzutage in handelsüblichen PCs von einer Reihe verschiedener Komponenten unterstützt. Grafikprozessoren (GPUs) und Vektoreinheiten in CPUs sind zwei dieser Komponenten. Da die Entwicklung von Anwendungen, die diese Parallelität nutzen, eine anspruchsvolle Aufgabe ist, muss sich ein Programmierer detaillierte Kenntnisse der internen Hardwarestruktur aneignen. Mit CGiS stellen wir eine datenparallele Programmiersprache vor, die eine gemeinsame Abstraktion für Grafikprozessoren und Vektoreinheiten in CPUs bietet und ein einheitliches Programmiermodell für beide bereitstellt. In vorherigen Arbeiten haben wir bereits die Nutzbarkeit von CGiS für GPUs gezeigt. In der vorliegenden Arbeit bilden wir das abstrakte Programmiermodel von CGiS auf CPUs mit SIMD (Single Instruction Multiple Data) Instruktionssatz ab. Wir haben eine Reihe relevanter Programmanalysen angepasst und implementiert um Parallelität aufzudecken und zu nutzen. Die Codegenerierungsphase basiert auf ausgewählten Optimierungsalgorithmen, die speziell auf die Generierung von SIMD-Code zugeschnitten sind. Durch Profile für verschiedene Architekturen ist es uns möglich, die Codegenierung zu steuern. Um die Effektivität unseres Ansatzes unter Beweis zu stellen, haben wir Backends für die beiden am weitesten verbreiteten SIMD-Instruktionssätze implementiert: Die "Streaming SIMD Extensions" von Intel und AltiVec von Freescale. Zusätzlich haben wir ein prototypisches Backend für den Cell Prozessor von IBM, als Beispiel für eine Multi-Core-Architektur, integriert. Die Ergebnisse unserer Experimente belegen eine ausgezeichnete durchschnittliche Beschleunigung um einen Faktor von 3 im Vergleich zu handgeschriebenen C++-Implementierungen. Diese Resultate untermauern unseren Ansatz: Mittels CGiS lässt sich leistungsstarker Code für SIMD- und Multi-Core-Applikationen generieren

    Meta-Programming and Policy-Based Design as a Technique of Architecting Modular and Efficient DSP Algorithm Implementations

    Get PDF
    Meta-programming paradigm and policy-based design are less known programming techniques in Digital Signal Processing (DSP) community, used to coding in pure C or assembly language. Major software components, like C++ STL, have proven usefulness of such paradigms in providing top performance of highly optimised native code, along with abstraction and modularity necessary in complex software projects. This paper describes composition of DSP code using these techniques, bringing as an example implementation of Feedback Delay Network (FDN) artificial reverberation algorithm. The proposed approach was proven to be practical, especially in case of prototyping computationally intense algorithms. To provide further performance insight, we discuss the techniques in context of other optimisation methods, like Single Instruction Multiple Data (SIMD) instruction sets usage and exploitation of superscalar architecture capabilities

    Classification algorithms on the cell processor

    Get PDF
    The rapid advancement in the capacity and reliability of data storage technology has allowed for the retention of virtually limitless quantity and detail of digital information. Massive information databases are becoming more and more widespread among governmental, educational, scientific, and commercial organizations. By segregating this data into carefully defined input (e.g.: images) and output (e.g.: classification labels) sets, a classification algorithm can be used develop an internal expert model of the data by employing a specialized training algorithm. A properly trained classifier is capable of predicting the output for future input data from the same input domain that it was trained on. Two popular classifiers are Neural Networks and Support Vector Machines. Both, as with most accurate classifiers, require massive computational resources to carry out the training step and can take months to complete when dealing with extremely large data sets. In most cases, utilizing larger training improves the final accuracy of the trained classifier. However, access to the kinds of computational resources required to do so is expensive and out of reach of private or under funded institutions. The Cell Broadband Engine (CBE), introduced by Sony, Toshiba, and IBM has recently been introduced into the market. The current most inexpensive iteration is available in the Sony Playstation 3 ® computer entertainment system. The CBE is a novel multi-core architecture which features many hardware enhancements designed to accelerate the processing of massive amounts of data. These characteristics and the cheap and widespread availability of this technology make the Cell a prime candidate for the task of training classifiers. In this work, the feasibility of the Cell processor in the use of training Neural Networks and Support Vector Machines was explored. In the Neural Network family of classifiers, the fully connected Multilayer Perceptron and Convolution Network were implemented. In the Support Vector Machine family, a Working Set technique known as the Gradient Projection-based Decomposition Technique, as well as the Cascade SVM were implemented

    Doctor of Philosophy

    Get PDF
    dissertationThe embedded system space is characterized by a rapid evolution in the complexity and functionality of applications. In addition, the short time-to-market nature of the business motivates the use of programmable devices capable of meeting the conflicting constraints of low-energy, high-performance, and short design times. The keys to achieving these conflicting constraints are specialization and maximally extracting available application parallelism. General purpose processors are flexible but are either too power hungry or lack the necessary performance. Application-specific integrated circuits (ASICS) efficiently meet the performance and power needs but are inflexible. Programmable domain-specific architectures (DSAs) are an attractive middle ground, but their design requires significant time, resources, and expertise in a variety of specialties, which range from application algorithms to architecture and ultimately, circuit design. This dissertation presents CoGenE, a design framework that automates the design of energy-performance-optimal DSAs for embedded systems. For a given application domain and a user-chosen initial architectural specification, CoGenE consists of a a Compiler to generate execution binary, a simulator Generator to collect performance/energy statistics, and an Explorer that modifies the current architecture to improve energy-performance-area characteristics. The above process repeats automatically until the user-specified constraints are achieved. This removes or alleviates the time needed to understand the application, manually design the DSA, and generate object code for the DSA. Thus, CoGenE is a new design methodology that represents a significant improvement in performance, energy dissipation, design time, and resources. This dissertation employs the face recognition domain to showcase a flexible architectural design methodology that creates "ASIC-like" DSAs. The DSAs are instruction set architecture (ISA)-independent and achieve good energy-performance characteristics by coscheduling the often conflicting constraints of data access, data movement, and computation through a flexible interconnect. This represents a significant increase in programming complexity and code generation time. To address this problem, the CoGenE compiler employs integer linear programming (ILP)-based 'interconnect-aware' scheduling techniques for automatic code generation. The CoGenE explorer employs an iterative technique to search the complete design space and select a set of energy-performance-optimal candidates. When compared to manual designs, results demonstrate that CoGenE produces superior designs for three application domains: face recognition, speech recognition and wireless telephony. While CoGenE is well suited to applications that exhibit a streaming behavior, multithreaded applications like ray tracing present a different but important challenge. To demonstrate its generality, CoGenE is evaluated in designing a novel multicore N-wide SIMD architecture, known as StreamRay, for the ray tracing domain. CoGenE is used to synthesize the SIMD execution cores, the compiler that generates the application binary, and the interconnection subsystem. Further, separating address and data computations in space reduces data movement and contention for resources, thereby significantly improving performance compared to existing ray tracing approaches

    Characterizing and Accelerating Bioinformatics Workloads on Modern Microarchitectures

    Get PDF
    Bioinformatics, the use of computer techniques to analyze biological data, has been a particularly active research field in the last two decades. Advances in this field have contributed to the collection of enormous amounts of data, and the sheer amount of available data has started to overtake the processing capability possible with current computer systems. Clearly, computer architects need to have a better understanding of how bioinformatics applications work and what kind of architectural techniques could be used to accelerate these important scientific workloads on future processors. In this dissertation, we develop a bioinformatic benchmark suite and provide a detailed characterization of these applications in common use today from a computer architect's point of view. We analyze a wide range of detailed execution characteristics including instruction mix, IPC measurements, L1 and L2 cache misses on a real architecture; and proceed to analyze the workloads' memory access characteristics. We then concentrate on accelerating a particularly computationally intensive bioinformatics workload on the novel Cell Broadband Engine multiprocessor architecture. The HMMER workload is used for protein profile searching using hidden Markov models, and most of its execution time is spent running the Viterbi algorithm. We parallelize and partition the HMMER application to implement it on the Cell Broadband Engine. In order to run the Viterbi algorithm on the 256KB local stores of the Cell BE synergistic processing units (SPEs), we present a method to develop a fast SIMD implementation of the Viterbi algorithm that reduces the storage requirements significantly. Our HMMER implementation for the Cell BE architecture, Cell-HMMER, exploits the multiple levels of parallelism inherent in this application, and can run protein profile searches up to 27.98 times faster than a modern dual-core x86 microprocessor

    Efficient FFT Algorithms for Mobile Devices

    Get PDF
    Increased traffic on wireless communication infrastructure has exacerbated the limited availability of radio frequency ({RF}) resources. Spectrum sharing is a possible solution to this problem that requires devices equipped with Cognitive Radio ({CR}) capabilities. A widely employed technique to enable {CR} is real-time {RF} spectrum analysis by applying the Fast Fourier Transform ({FFT}). Today’s mobile devices actually provide enough computing resources to perform not only the {FFT} but also wireless communication functions and protocols by software according to the software-defined radios paradigm. In addition to that, the pervasive availability of mobile devices make them powerful computing platform for new services. This thesis studies the feasibility of using mobile devices as a novel spectrum sensing platform with focus on {FFT}-based spectrum sensing algorithms. We benchmark several open-source {FFT} libraries on an Android smartphone. We relate the efficiency of calculating the {FFT} to both algorithmic and implementation-related aspects. The benchmark results also show the clear potential of special {FFT} algorithms that are tailored for sparse spectrum detection
    corecore