2,734 research outputs found

    On the orderability problem for PLA folding

    Get PDF
    AbstractAn interesting graph theoretic problem, called the orderability problem for programmable logic array (PLA) folding, was formulated and shown to be NP-complete. We show that a closely related problem, called the orderability problem for bipartite folding, is solvable in linear time

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Full text link
    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    Get PDF
    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    Programming self developing blob machines for spatial computing.

    Get PDF

    Parallelization of dynamic programming recurrences in computational biology

    Get PDF
    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays: FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130x faster than a modern dual core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3 GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms

    A Novel Frequency Based Current-to-Digital Converter with Programmable Dynamic Range

    Get PDF
    This work describes a novel frequency based Current to Digital converter, which would be fully realizable on a single chip. Biological systems make use of delay line techniques to compute many things critical to the life of an animal. Seeking to build up such a system, we are adapting the auditory localization circuit found in barn owls to detect and compute the magnitude of an input current. The increasing drive to produce ultra low-power circuits necessitates the use of very small currents. Frequently these currents need to accurately measured, but current solutions typically involve off-chip measurements. These are usually slow, and moving a current off chip increases noise to the system. Moving a system such as this completely on chip will allow for precise measurement and control of bias currents, and it will allow for better compensation of some common transistor mismatch issues. This project affords an extremely low power (100s nW) converter technology that is also very space efficient. The converter is completely asynchronous which yields ultra-low power standby operation [1]

    A Motion Planning Processor on Reconfigurable Hardware

    Get PDF
    Motion planning algorithms enable us to find feasible paths for moving objects. These algorithms utilize feasibility checks to differentiate valid paths from invalid ones. Unfortunately, the computationally expensive nature of such checks reduces the effectiveness of motion planning algorithms. However, by using hardware acceleration to speed up the feasibility checks, we can greatly enhance the performance of the motion planning algorithms. Of course, such acceleration is not limited to feasibility checks; other components of motion planning algorithms can also be accelerated using specially designed hardware. A Field Programmable Gate Array (FPGA) is a great platform to support such an acceleration. An FPGA is a collection of digital gates which can be reprogrammed at run time, i.e., it can be used as a CPU that reconfigures itself for a given task. In this paper, we study the feasibility of an FPGA based motion planning processor and evaluate its performance. In order to leverage its highly parallel nature and its modular structure, our processor utilizes the probabilistic roadmap method at its core. The modularity enables us to replace the feasibility criteria with other ones. The reconfigurability lets us run our processor in different roles, such as a motion planning co-processor, an autonomous motion planning processor or dedicated collision detection chip. Our experiments show that such a processor is not only feasible but also can greatly increase the performance of current algorithms
    corecore