1,253 research outputs found

    Beyond the arithmetic constraint: depth-optimal mapping of logic chains in reconfigurable fabrics

    Get PDF
    Look-up table based FPGAs have migrated from a niche technology for design prototyping to a valuable end-product component and, in some cases, a replacement for general purpose processors and ASICs alike. One way architects have bridged the performance gap between FPGAs and ASICs is through the inclusion of specialized components such as multipliers, RAM modules, and microcontrollers. Another dedicated structure that has become standard in reconfigurable fabrics is the arithmetic carry chain. Currently, it is only used to map arithmetic operations as identified by HDL macros. For non-arithmetic operations, it is an idle but potentially powerful resource.;Obstacles to using the carry chain for generic logic operations include lack of architectural and computer-aided design support. Current carry-select architectures facilitate carry chain reuse, although they do so only for (K-1)-input operations. Additionally, hardware description language (HDL) macros are the only recourse for a designer wishing to map generic logic chains in a carry-select architecture. A novel architecture that allows the full K-input operational capacity of the carry chain to be harnessed is presented as a solution to current architectural limitations. It is shown to have negligible impact on logic element area and delay. Using only two additional 2:1 pass transistor multiplexers, it enables the transmission of a K-input operation to the carry chain and general routing simultaneously. To successfully identify logic chains in an arbitrary Boolean network, ChainMap is presented as a novel technology mapping algorithm. ChainMap creates delay-optimal generic logic chains in polynomial time without HDL macros. It maps both arithmetic and non-arithmetic logic chains whenever depth increasing nodes, which increase logic depth but not routing depth, are encountered. Use of the chain is not reserved for arithmetic, but rather any set of gates exhibiting similar characteristics. By using the carry chain as a generic, near zero-delay adjacent cell interconnection structure a potential average optimal speedup of 1.4x is revealed. Post place and route experiments indicate that ChainMap solutions perform similarly to HDL chains when cluster resources are abundant and significantly better in cluster-constrained arrays

    Efficient machine learning: models and accelerations

    Get PDF
    One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models typically consist of multiple cascaded layers such as deep neural networks, and at least millions to hundreds of millions of parameters (i.e., weights) for the entire model. The larger-scale model tend to enable the extraction of more complex high-level features, and therefore, lead to a significant improvement of the overall accuracy. On the other side, the layered deep structure and large model sizes also demand to increase computational capability and memory requirements. In order to achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interests. The first trend is the acceleration while the second is the model compression. The underlying goal of these two trends is the high quality of the models to provides accurate predictions. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems. To explore in these two domains, this thesis first presents the cogent confabulation network for sentence completion problem. We use Chinese language as a case study to describe our exploration of the cogent confabulation based text recognition models. The exploration and optimization of the cogent confabulation based models have been conducted through various comparisons. The optimized network offered a better accuracy performance for the sentence completion. To accelerate the sentence completion problem in a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduce runtime, improve the recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintain a balanced progress in status update among all neurons. A lexicon scheduling algorithm is presented to further improve the model performance. As deep neural networks have been proven effective to solve many real-life applications, and they are deployed on low-power devices, we then investigated the acceleration for the neural network inference using a hardware-friendly computing paradigm, stochastic computing. It is an approximate computing paradigm which requires small hardware footprint and achieves high energy efficiency. Applying this stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss due to the approximation. The synthesis results show that the proposed design achieves the remarkable low hardware cost and power/energy consumption. Modern neural networks usually imply a huge amount of parameters which cannot be fit into embedded devices. Compression of the deep learning models together with acceleration attracts our attention. We introduce the structured matrices based neural network to address this problem. Circulant matrix is one of the structured matrices, where a matrix can be represented using a single vector, so that the matrix is compressed. We further investigate a more flexible structure based on circulant matrix, called block-circulant matrix. It partitions a matrix into several smaller blocks and makes each submatrix is circulant. The compression ratio is controllable. With the help of Fourier Transform based equivalent computation, the inference of the deep neural network can be accelerated energy efficiently on the FPGAs. We also offer the optimization for the training algorithm for block circulant matrices based neural networks to obtain a high accuracy after compression

    Scalable and deterministic timing-driven parallel placement for FPGAs

    Full text link

    New FPGA design tools and architectures

    Get PDF

    High performance graph analysis on parallel architectures

    Get PDF
    PhD ThesisOver the last decade pharmacology has been developing computational methods to enhance drug development and testing. A computational method called network pharmacology uses graph analysis tools to determine protein target sets that can lead on better targeted drugs for diseases as Cancer. One promising area of network-based pharmacology is the detection of protein groups that can produce better e ects if they are targeted together by drugs. However, the e cient prediction of such protein combinations is still a bottleneck in the area of computational biology. The computational burden of the algorithms used by such protein prediction strategies to characterise the importance of such proteins consists an additional challenge for the eld of network pharmacology. Such computationally expensive graph algorithms as the all pairs shortest path (APSP) computation can a ect the overall drug discovery process as needed network analysis results cannot be given on time. An ideal solution for these highly intensive computations could be the use of super-computing. However, graph algorithms have datadriven computation dictated by the structure of the graph and this can lead to low compute capacity utilisation with execution times dominated by memory latency. Therefore, this thesis seeks optimised solutions for the real-world graph problems of critical node detection and e ectiveness characterisation emerged from the collaboration with a pioneer company in the eld of network pharmacology as part of a Knowledge Transfer Partnership (KTP) / Secondment (KTS). In particular, we examine how genetic algorithms could bene t the prediction of protein complexes where their removal could produce a more e ective 'druggable' impact. Furthermore, we investigate how the problem of all pairs shortest path (APSP) computation can be bene ted by the use of emerging parallel hardware architectures as GPU- and FPGA- desktop-based accelerators. In particular, we address the problem of critical node detection with the development of a heuristic search method. It is based on a genetic algorithm that computes optimised node combinations where their removal causes greater impact than common impact analysis strategies. Furthermore, we design a general pattern for parallel network analysis on multi-core architectures that considers graph's embedded properties. It is a divide and conquer approach that decomposes a graph into smaller subgraphs based on its strongly connected components and computes the all pairs shortest paths concurrently on GPU. Furthermore, we use linear algebra to design an APSP approach based on the BFS algorithm. We use algebraic expressions to transform the problem of path computation to multiple independent matrix-vector multiplications that are executed concurrently on FPGA. Finally, we analyse how the optimised solutions of perturbation analysis and parallel graph processing provided in this thesis will impact the drug discovery process.This research was part of a Knowledge Transfer Partnership (KTP) and Knowledge Transfer Secondment (KTS) between e-therapeutics PLC and Newcastle University. It was supported as a collaborative project by e-therapeutics PLC and Technology Strategy boar

    CAD Techniques for Robust FPGA Design Under Variability

    Get PDF
    The imperfections in the semiconductor fabrication process and uncertainty in operating environment of VLSI circuits have emerged as critical challenges for the semiconductor industry. These are generally termed as process and environment variations, which lead to uncertainty in performance and unreliable operation of the circuits. These problems have been further aggravated in scaled nanometer technologies due to increased process variations and reduced operating voltage. Several techniques have been proposed recently for designing digital VLSI circuits under variability. However, most of them have targeted ASICs and custom designs. The flexibility of reconfiguration and unknown end application in FPGAs make design under variability different for FPGAs compared to ASICs and custom designs, and the techniques proposed for ASICs and custom designs cannot be directly applied to FPGAs. An important design consideration is to minimize the modifications in architecture and circuit to reduce the cost of changing the existing FPGA architecture and circuit. The focus of this work can be divided into three principal categories, which are, improving timing yield under process variations, improving power yield under process variations and improving the voltage profile in the FPGA power grid. The work on timing yield improvement proposes routing architecture enhancements along with CAD techniques to improve the timing yield of FPGA designs. The work on power yield improvement for FPGAs selects a low power dual-Vdd FPGA design as the baseline FPGA architecture for developing power yield enhancement techniques. It proposes CAD techniques to improve the power yield of FPGAs. A mathematical programming technique is proposed to determine the parameters of the buffers in the interconnect such as the sizes of the transistors and threshold voltage of the transistors, all within constraints, such that the leakage variability is minimized under delay constraints. Two CAD techniques are investigated and proposed to improve the supply voltage profile of the power grids in FPGAs. The first technique is a place and route technique and the second technique is a logic clustering technique to reduce IR-drops and spatial variation of supply voltage in the power grid
    • …
    corecore