168,387 research outputs found

    Compiling Deep Learning Models for Custom Hardware Accelerators

    Full text link
    Convolutional neural networks (CNNs) are the core of most state-of-the-art deep learning algorithms specialized for object detection and classification. CNNs are both computationally complex and embarrassingly parallel. Two properties that leave room for potential software and hardware optimizations for embedded systems. Given a programmable hardware accelerator with a CNN oriented custom instructions set, the compiler's task is to exploit the hardware's full potential, while abiding with the hardware constraints and maintaining generality to run different CNN models with varying workload properties. Snowflake is an efficient and scalable hardware accelerator implemented on programmable logic devices. It implements a control pipeline for a custom instruction set. The goal of this paper is to present Snowflake's compiler that generates machine level instructions from Torch7 model description files. The main software design points explored in this work are: model structure parsing, CNN workload breakdown, loop rearrangement for memory bandwidth optimizations and memory access balancing. The performance achieved by compiler generated instructions matches against hand optimized code for convolution layers. Generated instructions also efficiently execute AlexNet and ResNet18 inference on Snowflake. Snowflake with 256256 processing units was synthesized on Xilinx's Zynq XC7Z045 FPGA. At 250250 MHz, AlexNet achieved in 93.693.6 frames/s and 1.21.2 GB/s of off-chip memory bandwidth, and 21.421.4 frames/s and 2.22.2 GB/s for ResNet18. Total on-chip power is 55 W.Comment: 8 page

    Design space exploration tools for the ByoRISC configurable processor family

    Full text link
    In this paper, the ByoRISC (Build your own RISC) configurable application-specific instruction-set processor (ASIP) family is presented. ByoRISCs, as vendor-independent cores, provide extensive architectural parameters over a baseline processor, which can be customized by application-specific hardware extensions (ASHEs). Such extensions realize multi-input multi-output (MIMO) custom instructions with local state and load/store accesses to the data memory. ByoRISCs incorporate a true multi-port register file, zero-overhead custom instruction decoding, and scalable data forwarding mechanisms. Given these design decisions, ByoRISCs provide a unique combination of features that allow their use as architectural testbeds and the seamless and rapid development of new high-performance ASIPs. The performance characteristics of ByoRISCs, implemented as vendor-independent cores, have been evaluated for both ASIC and FPGA implementations, and it is proved that they provide a viable solution in FPGA-based system-on-a-chip design. A case study of an image processing pipeline is also presented to highlight the process of utilizing a ByoRISC custom processor. A peak performance speedup of up to 8.5×\times can be observed, whereas an average performance speedup of 4.4×\times on Xilinx Virtex-4 targets is achieved. In addition, ByoRISC outperforms an experimental VLIW architecture named VEX even in its 16-wide configuration for a number of data-intensive application kernels.Comment: 12 pages, 14 figures, 7 tables. Unpublished paper on ByoRISC, an extensible RISC with MIMO CIs that can outperform most mid-range VLIWs. Unfortunately Prof. Jorg Henkel destroyed the potential of this submission by using immoral tactics (neglecting his conflict of interest, changing reviewers accepting the paper, and requesting impossible additions for the average lifetime of an Earthlin

    Accelerating the Development of Software-Defined Network Optimization Applications Using SOL

    Full text link
    Software-defined networking (SDN) can enable diverse network management applications such as traffic engineering, service chaining, network function outsourcing, and topology reconfiguration. Realizing the benefits of SDN for these applications, however, entails addressing complex network optimizations that are central to these problems. Unfortunately, such optimization problems require significant manual effort and expertise to express and non-trivial computation and/or carefully crafted heuristics to solve. Our vision is to simplify the deployment of SDN applications using general high-level abstractions for capturing optimization requirements from which we can efficiently generate optimal solutions. To this end, we present SOL, a framework that demonstrates that it is indeed possible to simultaneously achieve generality and efficiency. The insight underlying SOL is that SDN applications can be recast within a unifying path-based optimization abstraction, from which it efficiently generates near-optimal solutions, and device configurations to implement those solutions. We illustrate the generality of SOL by prototyping diverse and new applications. We show that SOL simplifies the development of SDN-based network optimization applications and provides comparable or better scalability than custom optimization solutions

    Generating and evaluating application-specific hardware extensions

    Full text link
    Modern platform-based design involves the application-specific extension of embedded processors to fit customer requirements. To accomplish this task, the possibilities offered by recent custom/extensible processors for tuning their instruction set and microarchitecture to the applications of interest have to be exploited. A significant factor often determining the success of this process is the utomation available in application analysis and custom instruction generation. In this paper we present YARDstick, a design automation tool for custom processor development flows that focuses on generating and evaluating application-specific hardware extensions. YARDstick is a building block for ASIP development, integrating application analysis, custom instruction generation and selection with user-defined compiler intermediate representations. In a YARDstick-enabled environment, practical issues in traditional ASIP design are confronted efficiently; the exploration infrastructure is liberated from compiler and simulator idiosyncrasies, since the ASIP designer is empowered with the freedom of specifying the target architectures of choice and adding new implementations of analyses and custom instruction generation/selection methods. To illustrate the capabilities of the YARDstick approach, we present interesting exploration scenarios: quantifying the effect of machine-dependent compiler optimizations and the selection of the target architecture in terms of operation set and memory model on custom instruction generation/selection under different input/output constraints.Comment: 11 pages, 15 figures, 5 tables. An unpublished journal paper presenting the YARDstick custom instruction generation environmen

    Mask Editor : an Image Annotation Tool for Image Segmentation Tasks

    Full text link
    Deep convolutional neural network (DCNN) is the state-of-the-art method for image segmentation, which is one of key challenging computer vision tasks. However, DCNN requires a lot of training images with corresponding image masks to get a good segmentation result. Image annotation software which is easy to use and allows fast image mask generation is in great demand. To the best of our knowledge, all existing image annotation software support only drawing bounding polygons, bounding boxes, or bounding ellipses to mark target objects. These existing software are inefficient when targeting objects that have irregular shapes (e.g., defects in fabric images or tire images). In this paper we design an easy-to-use image annotation software called Mask Editor for image mask generation. Mask Editor allows drawing any bounding curve to mark objects and improves efficiency to mark objects with irregular shapes. Mask Editor also supports drawing bounding polygons, drawing bounding boxes, drawing bounding ellipses, painting, erasing, super-pixel-marking, image cropping, multi-class masks, mask loading, and mask modifying

    Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays

    Full text link
    FPGA vendors have recently started focusing on OpenCL for FPGAs because of its ability to leverage the parallelism inherent to heterogeneous computing platforms. OpenCL allows programs running on a host computer to launch accelerator kernels which can be compiled at run-time for a specific architecture, thus enabling portability. However, the prohibitive compilation times (specifically the FPGA place and route times) are a major stumbling block when using OpenCL tools from FPGA vendors. The long compilation times mean that the tools cannot effectively use just-in-time (JIT) compilation or runtime performance scaling. Coarse-grained overlays represent a possible solution by virtue of their coarse granularity and fast compilation. In this paper, we present a methodology for run-time compilation of OpenCL kernels to a DSP block based coarse-grained overlay, rather than directly to the fine-grained FPGA fabric. The proposed methodology allows JIT compilation and on-demand resource-aware kernel replication to better utilize available overlay resources, raising the abstraction level while reducing compile times significantly. We further demonstrate that this approach can even be used for run-time compilation of OpenCL kernels on the ARM processor of the embedded heterogeneous Zynq device.Comment: Presented at 3rd International Workshop on Overlay Architectures for FPGAs (OLAF 2017) arXiv:1704.0880

    Aashiyana: Design and Evaluation of a Smart Demand-Response System for Highly-stressed Grids

    Full text link
    This paper targets the unexplored problem of demand response within the context of power-grids that are allowed to regularly enforce blackouts as a mean to balance supply with demand:highly-stressed grids. Currently these utilities use as a cyclic and binary (power/no-power) schedule over consumer groups leading to significant wastage of capacity and long hours of no-power. We present here a novel building DLC system, Aashiyana, that can enforce several user-defined low-power states. We evaluate distributed and centralized load-shedding schemes using Aashiyana that can, compared to current load-shedding strategy, reduce the number of homes with no power by 80% for minor change in the fraction of homes with full-power

    DNNVM : End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators

    Full text link
    The convolutional neural network (CNN) has become a state-of-the-art method for several artificial intelligence domains in recent years. The increasingly complex CNN models are both computation-bound and I/O-bound. FPGA-based accelerators driven by custom instruction set architecture (ISA) achieve a balance between generality and efficiency, but there is much on them left to be optimized. We propose the full-stack compiler DNNVM, which is an integration of optimizers for graphs, loops and data layouts, and an assembler, a runtime supporter and a validation environment. The DNNVM works in the context of deep learning frameworks and transforms CNN models into the directed acyclic graph: XGraph. Based on XGraph, we transform the optimization challenges for both the data layout and pipeline into graph-level problems. DNNVM enumerates all potentially profitable fusion opportunities by a heuristic subgraph isomorphism algorithm to leverage pipeline and data layout optimizations, and searches for the best choice of execution strategies of the whole computing graph. On the Xilinx ZU2 @330 MHz and ZU9 @330 MHz, we achieve equivalently state-of-the-art performance on our benchmarks by na\"ive implementations without optimizations, and the throughput is further improved up to 1.26x by leveraging heterogeneous optimizations in DNNVM. Finally, with ZU9 @330 MHz, we achieve state-of-the-art performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38 TOPs/s for ResNet50 and 1.41 TOPs/s for GoogleNet.Comment: 18 pages, 9 figures, 5 table

    Simulink model code generation for motor control applications

    Get PDF
    This article is focused on the embedded software development using code generation. C-code is generated from Simulink model. Article describes hardware interface of Simulink model for AC motor control applications code generation. Results are presented on platform with ARM Cortex-M4 micro-controller, inverter and permanent magnet synchronous machine. Measurements of speed control loop on a real machine are presented and utilization of used micro-controller are discussed in conclusion

    Effective Extensible Programming: Unleashing Julia on GPUs

    Full text link
    GPUs and other accelerators are popular devices for accelerating compute-intensive, parallelizable applications. However, programming these devices is a difficult task. Writing efficient device code is challenging, and is typically done in a low-level programming language. High-level languages are rarely supported, or do not integrate with the rest of the high-level language ecosystem. To overcome this, we propose compiler infrastructure to efficiently add support for new hardware or environments to an existing programming language. We evaluate our approach by adding support for NVIDIA GPUs to the Julia programming language. By integrating with the existing compiler, we significantly lower the cost to implement and maintain the new compiler, and facilitate reuse of existing application code. Moreover, use of the high-level Julia programming language enables new and dynamic approaches for GPU programming. This greatly improves programmer productivity, while maintaining application performance similar to that of the official NVIDIA CUDA toolkit
    corecore