495 research outputs found

    Empowering parallel computing with field programmable gate arrays

    Get PDF
    After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements

    CPU, GPU i FPGA implementacija MALD algoritma za otkrivanje nepravilnosti na površini keramičkih pločica

    Get PDF
    This paper addresses adjustments, implementation and performance comparison of the Moving Average with Local Difference (MALD) method for ceramic tile surface defects detection. Ceramic tile production process is completely autonomous, except the final stage where human eye is required for defects detection. Recent computational platform development and advances in machine vision provides us with several options for MALD algorithm implementation. In order to exploit the shortest execution time for ceramic tile production process, the MALD method is implemented on three different platforms: CPU, GPU and FPGA, and it is implemented on each platform in at least two ways. Implementations are done in MATLAB’s MEX/C++, C++, CUDA/C++, VHDL and Assembly programming languages. Execution times are measured and compared for different algorithms and their implementations on different computational platforms.U ovom radu razmatra se prilagodba, implementacija i usporedba performansi metode pomičnog usrednjavanja s lokalnom diferencijom (MALD) s primjenom u otkrivanju površinskih nedostataka na keramičkim pločicama. Proizvodna linija keramičkih pločica je autonomna sve do zadnje faze u kojoj je potreban ljudski vid kako bi se otkrili eventualni nedostaci na keramičkim pločicama. Nedavnim razvojem računalnih platformi i razvojem metoda računalnog vida omogućena je implementacija MALD metode na nekoliko načina. U nastojanju skraćenja vremena potrebnog za proizvodnju keramičkih pločica, MALD metoda je implementirana u trima različitim platformama: CPU (central processing unit), GPU (graphic processing unit) i FPGA (field programmable gate array), te s barem dva različita algoritma. Implementacija je izvršena sa MATLAB MEX/C++, C++, CUDA/C++, VHDL te Asembler programskim jezicima. Izmjerena vremena obrade su me.usobno uspore.ena za različite algoritme i njihove implementacije na različitim računalnim platformama

    A Proposed Approach to Hybrid Software-Hardware Application Design for Enhanced Application Performance

    Get PDF
    One important aspect of many commercial computer systems is their performance; therefore, system designers seek to improve the performance next-generation systems with respect to previous generations. This could mean improved computational performance, reduced power consumption leading to better battery life in mobile devices, smaller form factors, or improvements in many areas. In terms of increased system speed and computation performance, processor manufacturers have been able to increase the clock frequency of processors up to a point, but now it is more common to seek performance gains through increased parallelism (such as a processor having more processor cores on a single chip or an increased number of physical processors in the system) rather than increased processor frequency. This paper proposes an additional strategy for increasing system performance by allowing application developers to design custom circuitry for either their whole application or at least some portions of it that are on the critical path or that are computationally expensive. This will improve application performance by avoiding executing application instructions on a general-purpose CPU and instead executing on a specialized circuit designed to perform the desired algorithm, offloading tasks from the CPU (thereby reducing the overall system load) and enabling true parallelism for applications that can take advantage of a more parallel architecture. For this paper, an Altera Cyclone II FPGA is used, but in practice, a higher performance and higher density FPGA should be used

    Design of a Five Stage Pipeline CPU with Interruption System

    Get PDF
    A central processing unit (CPU), also referred to as a central processor unit, is the hardware within a computer that carries out the instructions of a computer program by performing the basic arithmetical, logical, and input/output operations of the system. The term has been in use in the computer industry at least since the early 1960s.The form, design, and implementation of CPUs have changed over the course of their history, but their fundamental operation remains much the same. A computer can have more than one CPU; this is called multiprocessing. All modern CPUs are microprocessors, meaning contained on a single chip. Some integrated circuits (ICs) can contain multiple CPUs on a single chip; those ICs are called multi-core processors. An IC containing a CPU can also contain peripheral devices, and other components of a computer system; this is called a system on a chip (SoC).Two typical components of a CPU are the arithmetic logic unit (ALU), which performs arithmetic and logical operations, and the control unit (CU), which extracts instructions from memory and decodes and executes them, calling on the ALU when necessary. Not all computational systems rely on a central processing unit. An array processor or vector processor has multiple parallel computing elements, with no one unit considered the "center". In the distributed computing model, problems are solved by a distributed interconnected set of processors

    SMaRT: Small Machine for Research and Teaching

    Get PDF
    We introduce SMaRT, a 16-bit single-cycle RISC-type processor with 16-bit-wide instructions. SMaRT features the novel concept of 2.5-address instructions to avoid the data loss that inherently exists in 2-address processors. Additionally, SMaRT’s short-branch instructions take advantage of the temporal locality of reference in accessing the upper or lower halves of the CPU’s 16x16 orthogonal register file. This allows SMaRT to significantly extend the range of the short-branch instructions. We show that these novelties are achieved at almost no performance cost and negligible hardware cost. SMaRT has four operation modes, namely Single-Step, to execute one instruction at a time, Manual, to display and inspect individual locations of data memory, Run, to run the whole code nonstop, and Init, to copy a read-only memory to the data memory for initialization purposes. We also implement and present an input/output port and a sorting coprocessor, and then hook it up to SMaRT through the port as an example. We have successfully synthesized the combined SMaRT and the sorting coprocessor into the Altera Cyclone II FPGA chip, and tested them

    Learning to infer: RL-based search for DNN primitive selection on Heterogeneous Embedded Systems

    Full text link
    Deep Learning is increasingly being adopted by industry for computer vision applications running on embedded devices. While Convolutional Neural Networks' accuracy has achieved a mature and remarkable state, inference latency and throughput are a major concern especially when targeting low-cost and low-power embedded platforms. CNNs' inference latency may become a bottleneck for Deep Learning adoption by industry, as it is a crucial specification for many real-time processes. Furthermore, deployment of CNNs across heterogeneous platforms presents major compatibility issues due to vendor-specific technology and acceleration libraries. In this work, we present QS-DNN, a fully automatic search based on Reinforcement Learning which, combined with an inference engine optimizer, efficiently explores through the design space and empirically finds the optimal combinations of libraries and primitives to speed up the inference of CNNs on heterogeneous embedded devices. We show that, an optimized combination can achieve 45x speedup in inference latency on CPU compared to a dependency-free baseline and 2x on average on GPGPU compared to the best vendor library. Further, we demonstrate that, the quality of results and time "to-solution" is much better than with Random Search and achieves up to 15x better results for a short-time search

    High Speed 3D Tomography on CPU, GPU, and FPGA

    Get PDF
    12 pages; 50% d'acceptationInternational audienceBack-projection (BP) is a costly computational step in tomography image reconstruction such as positron emission tomography (PET). To reduce the computation time, this paper presents a pipelined, prefetch, and parallelized architecture for PET BP (3PA-PET). The key feature of this architecture is its original memory access strategy, masking the high latency of the external memory. Indeed, the pattern of the memory references to the data acquired hinders the processing unit. The memory access bottleneck is overcome by an efficient use of the intrinsic temporal and spatial locality of the BP algorithm. A loop reordering allows an efficient use of general purpose processor's caches, for software implementation, as well as the 3D predictive and adaptive cache (3D-AP cache), when considering hardware implementations. Parallel hardware pipelines are also efficient thanks to a hierarchical 3D-AP cache: each pipeline performs a memory reference in about one clock cycle to reach a computational throughput close to 100%. The 3PA-PET architecture is prototyped on a system on programmable chip (SoPC) to validate the system and to measure its expected performances. Time performances are compared with a desktop PC, a workstation, and a graphic processor unit (GPU)

    DES and TDES Performance Evaluation for Non-pipelined and Pipelined Implementations in VHDL Using the Cyclone II FPGA Technology

    Get PDF
    Two ongoing issues that engineers must face in the new era of data analytics are performance and security. Field Programmable Gate Arrays (FPGAs) offer a new solution for optimizing the performance of applications while the Data Encryption Standard (DES) and the Triple Data Encryption Standard (TDES) offer a mean to secure information. In this thesis we present a Non-Pipelined and Pipelined, in Electronic Code Book (EBC) mode, implementations in VHDL of these two commonly utilized cryptography schemes. Using Altera Cyclone II FPGA as our platform, we design and verify the implementations with the EDA tools provided by Altera. We gather cost and throughput information from the synthesis and timing results and compare the cost and performance of our designs to those presented in other literatures. Our designs achieve a throughput of 3.2 Gbps with a 50 MHz clock and our cost triples from DES to TDES
    corecore