617 research outputs found

    Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor

    Full text link
    With the ease-of-programming, flexibility and yet efficiency, MapReduce has become one of the most popular frameworks for building big-data applications. MapReduce was originally designed for distributed-computing, and has been extended to various architectures, e,g, multi-core CPUs, GPUs and FPGAs. In this work, we focus on optimizing the MapReduce framework on Xeon Phi, which is the latest product released by Intel based on the Many Integrated Core Architecture. To the best of our knowledge, this is the first work to optimize the MapReduce framework on the Xeon Phi. In our work, we utilize advanced features of the Xeon Phi to achieve high performance. In order to take advantage of the SIMD vector processing units, we propose a vectorization friendly technique for the map phase to assist the auto-vectorization as well as develop SIMD hash computation algorithms. Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce to improve the resource utilization. We also eliminate multiple local arrays but use low cost atomic operations on the global array for some applications, which can improve the thread scalability and data locality due to the coherent L2 caches. Finally, for a given application, our framework can either automatically detect suitable techniques to apply or provide guideline for users at compilation time. We conduct comprehensive experiments to benchmark the Xeon Phi and compare our optimized MapReduce framework with a state-of-the-art multi-core based MapReduce framework (Phoenix++). By evaluating six real-world applications, the experimental results show that our optimized framework is 1.2X to 38X faster than Phoenix++ for various applications on the Xeon Phi

    A Multi-layer Fpga Framework Supporting Autonomous Runtime Partial Reconfiguration

    Get PDF
    Partial reconfiguration is a unique capability provided by several Field Programmable Gate Array (FPGA) vendors recently, which involves altering part of the programmed design within an SRAM-based FPGA at run-time. In this dissertation, a Multilayer Runtime Reconfiguration Architecture (MRRA) is developed, evaluated, and refined for Autonomous Runtime Partial Reconfiguration of FPGA devices. Under the proposed MRRA paradigm, FPGA configurations can be manipulated at runtime using on-chip resources. Operations are partitioned into Logic, Translation, and Reconfiguration layers along with a standardized set of Application Programming Interfaces (APIs). At each level, resource details are encapsulated and managed for efficiency and portability during operation. An MRRA mapping theory is developed to link the general logic function and area allocation information to the device related physical configuration level data by using mathematical data structure and physical constraints. In certain scenarios, configuration bit stream data can be read and modified directly for fast operations, relying on the use of similar logic functions and common interconnection resources for communication. A corresponding logic control flow is also developed to make the entire process autonomous. Several prototype MRRA systems are developed on a Xilinx Virtex II Pro platform. The Virtex II Pro on-chip PowerPC core and block RAM are employed to manage control operations while multiple physical interfaces establish and supplement autonomous reconfiguration capabilities. Area, speed and power optimization techniques are developed based on the developed Xilinx prototype. Evaluations and analysis of these prototype and techniques are performed on a number of benchmark and hashing algorithm case studies. The results indicate that based on a variety of test benches, up to 70% reduction in the resource utilization, up to 50% improvement in power consumption, and up to 10 times increase in run-time performance are achieved using the developed architecture and approaches compared with Xilinx baseline reconfiguration flow. Finally, a Genetic Algorithm (GA) for a FPGA fault tolerance case study is evaluated as a ultimate high-level application running on this architecture. It demonstrated that this is a hardware and software infrastructure that enables an FPGA to dynamically reconfigure itself efficiently under the control of a soft microprocessor core that is instantiated within the FPGA fabric. Such a system contributes to the observed benefits of intelligent control, fast reconfiguration, and low overhead

    Algorithm-Hardware Co-Design for Performance-driven Embedded Genomics

    Get PDF
    PhD ThesisGenomics includes development of techniques for diagnosis, prognosis and therapy of over 6000 known genetic disorders. It is a major driver in the transformation of medicine from the reactive form to the personalized, predictive, preventive and participatory (P4) form. The availability of genome is an essential prerequisite to genomics and is obtained from the sequencing and analysis pipelines of the whole genome sequencing (WGS). The advent of second generation sequencing (SGS), significantly, reduced the sequencing costs leading to voluminous research in genomics. SGS technologies, however, generate massive volumes of data in the form of reads, which are fragmentations of the real genome. The performance requirements associated with mapping reads to the reference genome (RG), in order to reassemble the original genome, now, stands disproportionate to the available computational capabilities. Conventionally, the hardware resources used are made of homogeneous many-core architecture employing complex general-purpose CPU cores. Although these cores provide high-performance, a data-centric approach is required to identify alternate hardware systems more suitable for affordable and sustainable genome analysis. Most state-of-the-art genomic tools are performance oriented and do not address the crucial aspect of energy consumption. Although algorithmic innovations have reduced runtime on conventional hardware, the energy consumption has scaled poorly. The associated monetary and environmental costs have made it a major bottleneck to translational genomics. This thesis is concerned with the development and validation of read mappers for embedded genomics paradigm, aiming to provide a portable and energy-efficient hardware solution to the reassembly pipeline. It applies the algorithmhardware co-design approach to bridge the saturation point arrived in algorithmic innovations with emerging low-power/energy heterogeneous embedded platforms. Essential to embedded paradigm is the ability to use heterogeneous hardware resources. Graphical processing units (GPU) are, often, available in most modern devices alongside CPU but, conventionally, state-of-the-art read mappers are not tuned to use both together. The first part of the thesis develops a Cross-platfOrm Read mApper using opencL (CORAL) that can distribute workload on all available devices for high performance. OpenCL framework mitigates the need for designing separate kernels for CPU and GPU. It implements a verification-aware filtration algorithm for rapid pruning and identification of candidate locations for mapping reads to the RG. Mapping reads on embedded platforms decreases performance due to architectural differences such as limited on-chip/off-chip memory, smaller bandwidths and simpler cores. To mitigate performance degradation, in second part of the thesis, we propose a REad maPper for heterogeneoUs sysTEms (REPUTE) which uses an efficient dynamic programming (DP) based filtration methodology. Using algorithm-hardware co-design and kernel level optimizations to reduce its memory footprint, REPUTE demonstrated significant energy savings on HiKey970 embedded platform with acceptable performance. The third part of the thesis concentrates on mapping the whole genome on an embedded platform. We propose a Pyopencl based tooL for gEnomic workloaDs tarGeting Embedded platfoRms (PLEDGER) which includes two novel contributions. The first one proposes a novel preprocessing strategy to generate low-memory footprint (LMF) data structure to fit all human chromosomes at the cost of performance. Second contribution is LMF DP-based filtration method to work in conjunction with the proposed data structures. To mitigate performance degradation, the kernel employs several optimisations including extensive usage of bit-vector operations. Extensive experiments using real human reads were carried out with state-of-the-art read mappers on 5 different platforms for CORAL, REPUTE and PLEDGER. The results show that embedded genomics provides significant energy savings with similar performance compared to conventional CPU-based platforms

    Real-Time Lossless Compression of SoC Trace Data

    Get PDF
    Nowadays, with the increasing complexity of System-on-Chip (SoC), traditional debugging approaches are not enough in multi-core architecture systems. Hardware tracing becomes necessary for performance analysis in these systems. The problem is that the size of collected trace data through hardware-based tracing techniques is usually extremely large due to the increasing complexity of System-on-Chips. Hence on-chip trace compression performed in hardware is needed to reduce the amount of transferred or stored data. In this dissertation, the feasibility of different types of lossless data compression algorithms in hardware implementation are investigated and examined. A lossless data compression algorithm LZ77 is selected, analyzed, and optimized to Nexus traces data. In order to meet the hardware cost and compression performances requirements for the real-time compression, an optimized LZ77 compression algorithm is proposed based on the characteristics of Nexus trace data. This thesis presents a hardware implementation of LZ77 encoder described in Very High Speed Integrated Circuit Hardware Description Language (VHDL). Test results demonstrate that the compression speed can achieve16 bits/clock cycle and the average compression ratio is 1.35 for the minimal hardware cost case, which is a suitable trade-off between the hardware cost and the compression performances effectively

    FPGA acceleration of DNA sequence alignment: design analysis and optimization

    Get PDF
    Existing FPGA accelerators for short read mapping often fail to utilize the complete biological information in sequencing data for simple hardware design, leading to missed or incorrect alignment. In this work, we propose a runtime reconfigurable alignment pipeline that considers all information in sequencing data for the biologically accurate acceleration of short read mapping. We focus our efforts on accelerating two string matching techniques: FM-index and the Smith-Waterman algorithm with the affine-gap model which are commonly used in short read mapping. We further optimize the FPGA hardware using a design analyzer and merger to improve alignment performance. The contributions of this work are as follows. 1. We accelerate the exact-match and mismatch alignment by leveraging the FM-index technique. We optimize memory access by compressing the data structure and interleaving the access with multiple short reads. The FM-index hardware also considers complete information in the read data to maximize accuracy. 2. We propose a seed-and-extend model to accelerate alignment with indels. The FM-index hardware is extended to support the seeding stage while a Smith-Waterman implementation with the affine-gap model is developed on FPGA for the extension stage. This model can improve the efficiency of indel alignment with comparable accuracy versus state-of-the-art software. 3. We present an approach for merging multiple FPGA designs into a single hardware design, so that multiple place-and-route tasks can be replaced by a single task to speed up functional evaluation of designs. We first experiment with this approach to demonstrate its feasibility for different designs. Then we apply this approach to optimize one of the proposed FPGA aligners for better alignment performance.Open Acces
    • …
    corecore