
    Design and resource management of reconfigurable multiprocessors for data-parallel applications

    FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors for accelerating a wide range of computation-intensive applications. Most often they are Application-Specific Programmable Circuits (ASPCs), which are developer-programmable rather than user-programmable. The major disadvantages of ASPCs are minimal programmability and the significant time and energy overheads of the hardware reconfiguration required when the problem size exceeds the available reconfigurable resources; these problems are expected to become more serious as FPGA chip sizes increase. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems. This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance, low-cost computing, along with a resource management framework that addresses performance, power consumption and energy issues. These semi-customized systems significantly reduce runtime device reconfiguration by employing user-programmable processing elements that are reusable for different tasks in large, complex applications. For illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core, and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine. In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages over ASPCs for our data-parallel applications: they are simpler and more scalable, and incur less verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are presented for the two MPoPCs. Performance is evaluated with large sparse real-world matrices, primarily from power engineering. We expect even further performance gains on MPoPCs in the near future from ever-improving FPGAs. The innovative nature of this work has the potential to guide research in the emerging field of high-performance, low-cost reconfigurable computing. The greatest advantage of reconfigurable logic lies in its high degree of hardware customization and reconfiguration, which allows resources to be reused to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, such as HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the w-TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performance-energy objectives are proposed.
    A system-level energy model for HERA, based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton's method is proposed and employed to verify the methodology.
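    To make the compile-time mapping step concrete, here is a minimal, hypothetical sketch of an ILP of the kind described above, written in Python with the PuLP solver library: binary variables assign each w-TFG task to a processing element configured in its preferred mode (SIMD or MIMD), minimizing total weighted cost. The task names, PE counts, capacity limit, and cost values are illustrative assumptions, not data from the thesis.

```python
# Hypothetical sketch of the compile-time mapping step: assign each task in a
# weighted task-flow graph (w-TFG) to a processing element (PE) configured in
# its preferred mode, minimizing total weighted execution cost.
# Task names, PE counts, and cost numbers are made up for illustration.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

tasks = {            # task -> (preferred mode, weight from the w-TFG)
    "mmm_block": ("SIMD", 8.0),
    "lu_pivot":  ("MIMD", 3.0),
    "lu_update": ("SIMD", 6.0),
}
pes = {"pe0": "SIMD", "pe1": "MIMD", "pe2": "SIMD"}   # PE -> configured mode
cost = {(t, p): w * (1.0 if mode == pes[p] else 2.5)  # penalty for mode mismatch
        for t, (mode, w) in tasks.items() for p in pes}

prob = LpProblem("wTFG_mapping", LpMinimize)
x = LpVariable.dicts("map", cost.keys(), cat=LpBinary)   # x[t, p] = 1 if t runs on p

prob += lpSum(cost[k] * x[k] for k in cost)              # total weighted cost
for t in tasks:                                          # every task mapped exactly once
    prob += lpSum(x[(t, p)] for p in pes) == 1
for p in pes:                                            # crude per-PE capacity limit
    prob += lpSum(x[(t, p)] for t in tasks) <= 2

prob.solve(PULP_CBC_CMD(msg=False))
print({t: next(p for p in pes if x[(t, p)].value() > 0.5) for t in tasks})
```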

    Choose-Your-Own Adventure: A Lightweight, High-Performance Approach To Defect And Variation Mitigation In Reconfigurable Logic

    For field-programmable gate arrays (FPGAs), fine-grained pre-computed alternative configurations, combined with simple test-based selection, produce limited per-chip specialization to counter the yield loss, increased delay, and increased energy costs that come from fabrication defects and variation. This lightweight approach achieves much of the benefit of knowledge-based full specialization while reducing, to practical and palatable levels, the computational, testing, and load-time costs that obstruct the application of the knowledge-based approach. In practice this may more than double the power-limited computational capabilities of dies fabricated with 22nm technologies. Contributions of this work:
    • Choose-Your-Own-Adventure (CYA), a novel, lightweight, scalable methodology for defect and variation mitigation
    • An implementation of CYA, including the preparatory components (generation of diverse alternative paths) and the FPGA load-time components
    • A detailed performance characterization of CYA, including:
      – comparison to conventional loading and to dynamic frequency and voltage scaling (DFVS)
      – limit studies to characterize the quality of the CYA implementation and identify potential areas for further optimization
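    As a rough illustration of the load-time flow implied above, the sketch below assumes the problem reduces to running a simple test on each pre-computed alternative configuration for a region and keeping the best one that passes; the Alternative record, the run_simple_test hook, and the delay fields are hypothetical placeholders, not the actual CYA interfaces.

```python
# Hypothetical sketch of load-time selection among pre-computed alternative
# configurations ("adventures") for one region of an FPGA design. The
# Alternative type and run_simple_test() hook are illustrative only.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Alternative:
    bitstream_id: str          # which pre-computed partial configuration to load
    nominal_delay_ns: float    # delay predicted at compile time

def choose_alternative(alts: List[Alternative],
                       run_simple_test: Callable[[Alternative], Optional[float]]
                       ) -> Alternative:
    """Pick the fastest alternative that passes its simple on-chip test.

    run_simple_test loads the alternative and returns its measured delay in ns,
    or None if the test fails (e.g., a defect falls on a resource it uses).
    """
    best, best_delay = None, float("inf")
    for alt in sorted(alts, key=lambda a: a.nominal_delay_ns):
        measured = run_simple_test(alt)
        if measured is not None and measured < best_delay:
            best, best_delay = alt, measured
    if best is None:
        raise RuntimeError("no alternative passed its test for this region")
    return best
```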

    IMPLEMENTATION OF NOISE CANCELLATION WITH HARDWARE DESCRIPTION LANGUAGE

    The objective of this project is to implement a noise cancellation technique on an FPGA using a Hardware Description Language. The performance of several adaptive algorithms is compared to determine the most suitable algorithm for an adaptive noise cancellation system. The project focuses on the implementation of an adaptive filter with the least-mean-squares (LMS) algorithm or the normalized least-mean-squares (NLMS) algorithm to cancel acoustic noise. This noise consists of extraneous or unwanted waveforms that can interfere with communication. Due to the simplicity and effectiveness of the adaptive noise cancellation technique, it is used to remove the noise component from the desired signal. The project is divided into four main parts: research, Matlab simulation, ModelSim simulation and hardware implementation. The project starts with research on several noise cancellation techniques; then, using Matlab code, Simulink and the FDA tool, the adaptive noise cancellation system is designed with the LMS algorithm, the NLMS algorithm and the recursive-least-squares (RLS) algorithm to remove the interfering noise. Using the Matlab code and Simulink, the noise that corrupts a sinusoidal signal and a music recording can be removed, and the original signal can in turn be retrieved from the noise-corrupted signal by changing the filter coefficients. Since the filter is the central component of the adaptive filtering process, the filter is designed first before the adaptive algorithm is added. A Finite Impulse Response (FIR) filter is designed, and the desired functional and timing simulation results are obtained using ModelSim, the Integrated Software Environment (ISE) software and FPGA implementation. Finally the adaptive algorithm is added to the filter and implemented on the FPGA. The noise is greatly reduced in the Matlab simulation, the functional simulation and the timing simulation. Hence the results of this project show that noise cancellation with an adaptive filter is feasible.
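    The core of such an adaptive noise canceller can be sketched in a few lines; the following NumPy version of the LMS/NLMS update is a simplified software model of the filter described above. The tap count, step size, and toy signals are arbitrary choices, and the actual project implements the filter in HDL rather than Python.

```python
# Minimal NumPy sketch of an adaptive noise canceller: d[n] = signal + noise,
# x[n] = correlated noise reference; the FIR filter adapts with LMS (or NLMS)
# so that the error e[n] approximates the clean signal.
import numpy as np

def lms_cancel(d, x, num_taps=32, mu=0.01, normalized=True, eps=1e-6):
    w = np.zeros(num_taps)            # adaptive FIR coefficients
    e = np.zeros(len(d))              # error = estimate of the clean signal
    for n in range(num_taps, len(d)):
        xn = x[n - num_taps:n][::-1]  # most recent reference samples
        y = w @ xn                    # filter output = noise estimate
        e[n] = d[n] - y
        step = mu / (eps + xn @ xn) if normalized else mu   # NLMS vs. plain LMS
        w += step * e[n] * xn         # coefficient update
    return e, w

# toy usage: a sinusoid corrupted by filtered white noise
t = np.arange(8000) / 8000.0
noise = np.random.randn(len(t))
d = np.sin(2 * np.pi * 50 * t) + np.convolve(noise, [0.5, -0.3, 0.1], "same")
clean_estimate, _ = lms_cancel(d, noise)
```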

    Dataflow-based Design and Implementation of Image Processing Applications

    Dataflow is a well-known computational model and is widely used for expressing the functionality of digital signal processing (DSP) applications, such as audio and video data stream processing, digital communications, and image processing. These applications usually require real-time processing capabilities and have critical performance constraints. Dataflow provides a formal mechanism for describing specifications of DSP applications, imposes minimal data-dependency constraints in specifications, and is effective in exposing and exploiting task- or data-level parallelism for achieving high-performance implementations. To demonstrate dataflow-based design methods in a manner that is concrete and easily adapted to different platforms and back-end design tools, we present in this report a number of case studies based on the lightweight dataflow (LWDF) programming methodology. LWDF is designed as a "minimalistic" approach for integrating coarse-grain dataflow programming structures into arbitrary simulation- or platform-oriented languages, such as C, C++, CUDA, MATLAB, SystemC, Verilog, and VHDL. In particular, LWDF requires minimal dependence on specialized tools or libraries. This feature, together with the rigorous adherence to dataflow principles throughout the LWDF design framework, allows designers to integrate dataflow modeling approaches into existing design methodologies and processes, and to experiment with them, relatively quickly and flexibly.
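    To give a concrete feel for the kind of actor contract this description suggests, the sketch below models a dataflow actor with enable/invoke methods and token FIFOs in Python; the class names, FIFO type, and scheduler loop are illustrative and are not the actual LWDF APIs (which target C, HDL, and other platform languages).

```python
# Sketch of a coarse-grain dataflow actor: enable() checks whether enough
# tokens are available for a firing, and invoke() consumes/produces tokens.
from collections import deque

class Fifo(deque):
    def population(self):
        return len(self)

class DownsampleActor:
    """Consumes `factor` input tokens per firing and produces one output token."""
    def __init__(self, inp: Fifo, out: Fifo, factor: int = 2):
        self.inp, self.out, self.factor = inp, out, factor

    def enable(self) -> bool:
        return self.inp.population() >= self.factor

    def invoke(self) -> None:
        block = [self.inp.popleft() for _ in range(self.factor)]
        self.out.append(block[0])     # keep every `factor`-th sample

# simple scheduler loop: fire the actor while it is enabled
inp, out = Fifo(range(10)), Fifo()
actor = DownsampleActor(inp, out)
while actor.enable():
    actor.invoke()
print(list(out))   # [0, 2, 4, 6, 8]
```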

    TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference

    Automated co-design of machine learning models and evaluation hardware is critical for efficiently deploying such models at scale. Despite the state-of-the-art performance of transformer models, they are not yet ready for execution on resource-constrained hardware platforms. The high memory requirements and low parallelizability of the transformer architecture exacerbate this problem. Recently proposed accelerators attempt to optimize the throughput and energy consumption of transformer models. However, such works are either limited to a one-sided search of the model architecture or to a restricted set of off-the-shelf devices. Furthermore, previous works only accelerate model inference and not training, which requires substantially more memory and compute resources, making the problem even more challenging. To address these limitations, this work proposes a dynamic training framework, called DynaProp, that speeds up the training process and reduces memory consumption. DynaProp is a low-overhead pruning method that prunes activations and gradients at runtime. To effectively execute this method on hardware for a diverse set of transformer architectures, we propose ELECTOR, a framework that simulates transformer inference and training on a design space of accelerators. We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models with high accuracy on the given task while minimizing latency, energy consumption, and chip area. The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair while incurring 5.2× lower latency and 3.0× lower energy consumption.
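    The abstract does not state DynaProp's exact pruning rule, so the sketch below assumes a simple magnitude-based criterion applied at runtime to activation and gradient tensors; the 50% keep ratio, the NumPy setting, and the tensor shapes are illustrative assumptions only, not the paper's method.

```python
# Illustrative sketch of runtime activation/gradient pruning in the spirit of
# the description above: zero out the smallest-magnitude entries of a tensor
# before it is kept in memory, reducing memory traffic during training.
import numpy as np

def prune_tensor(t: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Zero out the smallest-magnitude entries, keeping roughly `keep_ratio` of them."""
    k = max(1, int(keep_ratio * t.size))
    thresh = np.partition(np.abs(t).ravel(), t.size - k)[t.size - k]
    return np.where(np.abs(t) >= thresh, t, 0.0)

# during training, both forward activations and backward gradients of a layer
# could be pruned this way before being written to (or read from) memory:
acts = np.random.randn(4, 128)          # activations of one transformer layer
grads = np.random.randn(4, 128)         # gradients flowing back into the layer
sparse_acts, sparse_grads = prune_tensor(acts), prune_tensor(grads)
print(1.0 - np.count_nonzero(sparse_acts) / sparse_acts.size)  # ~0.5 sparsity
```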

    HARDWARE-ACCELERATED AUTOMATIC 3D NONRIGID IMAGE REGISTRATION

    Software implementations of 3D nonrigid image registration, an essential tool in medical applications such as radiotherapy and image-guided surgery, run excessively slowly on traditional computers. These algorithms can be accelerated in hardware by exploiting parallelism at different levels of the algorithm. We present here an implementation of a free-form deformation-based algorithm on a field-programmable gate array (FPGA) with a customized, parallel and pipelined architecture. We overcome the performance bottlenecks and obtain speedups of up to 40x over traditional computers while achieving accuracy comparable to software implementations. In this work, we also present a method to optimize the deformation field with a gradient descent-based optimization scheme and to resolve mesh folding, a problem commonly encountered during registration with free-form deformations, through a set of linear constraints. Finally, we present the use of novel dataflow modeling tools to automatically map registration algorithms to hardware such as FPGAs while allowing for dynamic reconfiguration.
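    As a rough, much-simplified software analogue of the registration loop described above, the 2D sketch below interpolates a coarse control-point displacement grid to a dense field, warps the moving image, and updates the control points by gradient descent on a sum-of-squared-differences metric. Bilinear interpolation and numerical gradients stand in for the cubic B-spline free-form deformation and analytic gradients of the actual, FPGA-accelerated implementation; grid size, step size, and iteration count are arbitrary choices.

```python
# Simplified free-form-deformation registration sketch (2D, bilinear, SSD metric).
import numpy as np
from scipy.ndimage import zoom, map_coordinates

def warp(moving, ctrl_dx, ctrl_dy):
    """Warp `moving` by a dense field upsampled from the control-point grid."""
    H, W = moving.shape
    dx = zoom(ctrl_dx, (H / ctrl_dx.shape[0], W / ctrl_dx.shape[1]), order=1)
    dy = zoom(ctrl_dy, (H / ctrl_dy.shape[0], W / ctrl_dy.shape[1]), order=1)
    yy, xx = np.mgrid[0:H, 0:W].astype(float)
    return map_coordinates(moving, [yy + dy, xx + dx], order=1, mode="nearest")

def register(fixed, moving, grid=(6, 6), iters=20, lr=0.5, h=0.5):
    """Gradient descent on the control-point displacements (numerical gradients)."""
    dx, dy = np.zeros(grid), np.zeros(grid)
    for _ in range(iters):
        base = np.sum((warp(moving, dx, dy) - fixed) ** 2)
        gdx, gdy = np.zeros(grid), np.zeros(grid)
        for i in range(grid[0]):
            for j in range(grid[1]):
                dx[i, j] += h
                gdx[i, j] = (np.sum((warp(moving, dx, dy) - fixed) ** 2) - base) / h
                dx[i, j] -= h
                dy[i, j] += h
                gdy[i, j] = (np.sum((warp(moving, dx, dy) - fixed) ** 2) - base) / h
                dy[i, j] -= h
        dx -= lr * gdx / (np.abs(gdx).max() + 1e-8)   # crude normalized step
        dy -= lr * gdy / (np.abs(gdy).max() + 1e-8)
    return dx, dy
```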