    Superscalar RISC-V Processor with SIMD Vector Extension

    With the increasing number of digital products in the market, the need for robust and highly configurable processors rises. This demand is met by the stable and extensible open-source RISC-V instruction set architecture. RISC-V processors are becoming popular in many fields of application and research. This thesis presents a dual-issue superscalar RISC-V processor design with dynamic execution. The proposed design employs the global sharing scheme for branch prediction and the Tomasulo algorithm for out-of-order execution. The processor is capable of speculative execution with five checkpoints. Data flow in the instruction dispatch and commit stages is optimized to achieve higher instruction throughput. The superscalar processor is extended with a customized vector instruction set for single-instruction-multiple-data computations, specifically to improve performance on machine learning tasks. Following the definition of the proposed vector instruction set, a scratchpad memory and element-wise arithmetic units are implemented in the vector co-processor. Different test programs are evaluated on the fully tested superscalar processor. Compared to the reference work, the proposed design improves average instruction throughput by 18.9% and average prediction hit rate by 4.92%, with a 16.9% higher operating clock frequency when synthesized on the Intel Arria 10 FPGA board. The forward propagation of a convolutional neural network model is evaluated on the standalone superscalar processor and on its integration with the vector co-processor. The vector program with software-level optimizations achieves a 9.53× improvement in instruction throughput and a 10.18× improvement in real-time throughput. Moreover, the integration also provides 2.22× the energy efficiency of the superscalar processor alone.
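
    As a concrete reference for the branch prediction scheme named above, the following is a minimal Python sketch of a gshare ("global sharing") predictor in its textbook formulation: the global history register is XORed with low PC bits to index a table of 2-bit saturating counters. The table size, history length, and class name are illustrative assumptions, not the thesis's actual parameters.

        # Minimal sketch of a gshare ("global sharing") branch predictor.
        # Table size and history length are illustrative assumptions.
        class GsharePredictor:
            def __init__(self, index_bits=10):
                self.mask = (1 << index_bits) - 1
                self.history = 0                      # global branch history register
                # 2-bit saturating counters, initialized to "weakly taken" (2)
                self.counters = [2] * (1 << index_bits)

            def _index(self, pc):
                # XOR global history with low PC bits to reduce aliasing
                return ((pc >> 2) ^ self.history) & self.mask

            def predict(self, pc):
                return self.counters[self._index(pc)] >= 2   # True = predict taken

            def update(self, pc, taken):
                i = self._index(pc)
                if taken:
                    self.counters[i] = min(3, self.counters[i] + 1)
                else:
                    self.counters[i] = max(0, self.counters[i] - 1)
                # shift the resolved outcome into the global history register
                self.history = ((self.history << 1) | int(taken)) & self.mask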

    Machine Learning in Resource-constrained Devices: Algorithms, Strategies, and Applications

    The ever-increasing growth of technologies is changing people's everyday life. As a major consequence: 1) the amount of available data is growing and 2) several applications rely on battery-supplied devices that are required to process data in real time. In this scenario, the need for ad-hoc strategies for the development of low-power and low-latency intelligent systems, capable of learning inductive rules from data using a modest amount of computational resources, is becoming vital. At the same time, one needs to develop specific methodologies to manage complex patterns such as text and images. This thesis presents different approaches and techniques for the development of fast learning models explicitly designed to be hosted on embedded systems. The proposed methods proved able to achieve state-of-the-art performance in terms of the trade-off between generalization capabilities and area requirements when implemented in low-cost digital devices. In addition, advanced strategies for efficient sentiment analysis in text and images are proposed.

    A Survey on Compiler Autotuning using Machine Learning

    Since the mid-1990s, researchers have been trying to use machine-learning-based approaches to solve a number of different compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order of applying optimizations). The compiler optimization space continues to grow due to the advancement of applications, the increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classifies the recent advances in using machine learning for the compiler optimization field, particularly on the two major problems of (1) selecting the best optimizations and (2) the phase-ordering of optimizations. The survey highlights the approaches taken so far, the obtained results, the fine-grain classification among different approaches and, finally, the influential papers of the field.
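
    To make the first of the two problems concrete, here is a toy sketch of optimization selection as it is typically framed in this literature: static program features feed a learned model that predicts whether a given pass pays off. The feature set, the labels, and the choice of a random-forest classifier are illustrative assumptions, not a method from any particular surveyed paper.

        # Toy illustration of ML-based compiler optimization selection:
        # static program features -> predict whether to enable a pass.
        # Features, labels, and model choice are illustrative assumptions.
        from sklearn.ensemble import RandomForestClassifier

        # Hypothetical static features per program:
        # [num_loops, avg_loop_depth, num_branches, num_mem_ops, num_fp_ops]
        train_features = [
            [12, 2.5, 40, 300, 120],
            [ 2, 1.0, 15,  80,   0],
            [ 8, 3.0, 22, 210, 340],
        ]
        # Label: did enabling loop vectorization speed this program up?
        train_labels = [1, 0, 1]

        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(train_features, train_labels)

        new_program = [[10, 2.0, 30, 250, 200]]
        if model.predict(new_program)[0]:
            print("enable loop vectorization for this program")
        else:
            print("keep the pass disabled")

    Phase-ordering is the harder variant of the problem, since the model must emit a sequence of passes rather than a per-pass yes/no decision.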

    Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters

    High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture, due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power tightly-coupled processor clusters, where a relatively large cache (L1.5) is shared by L1 private caches through a two-cycle latency interconnect. To address the performance loss caused by L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications.
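
    The prefetching mechanism described above condenses into a short sketch: on each fetch, the sequentially next line is prefetched from the shared L1.5 only if a tag-only probe of the L1 reports it absent, so redundant refills are filtered out. The cache organization, FIFO eviction, and interface names below are simplifying assumptions, not the paper's RTL.

        # Sketch of a next-line prefetcher with cache probe filtering (CPF):
        # the next line is prefetched from the shared L1.5 only if an L1
        # probe says it is absent. Sizes and organization are assumptions.
        LINE_BYTES = 16

        class L1ICache:
            def __init__(self, num_lines=64):
                self.num_lines = num_lines
                self.lines = {}          # line address -> data (insertion-ordered)

            def probe(self, line_addr):
                # CPF: a tag-only lookup that does not disturb the cache
                return line_addr in self.lines

            def fill(self, line_addr, data):
                if line_addr not in self.lines and len(self.lines) >= self.num_lines:
                    self.lines.pop(next(iter(self.lines)))   # evict oldest (FIFO)
                self.lines[line_addr] = data

        def fetch(pc, l1, l15_read):
            line = pc // LINE_BYTES
            if not l1.probe(line):                  # L1 miss: fill from L1.5
                l1.fill(line, l15_read(line))
            nxt = line + 1
            if not l1.probe(nxt):                   # probe filter: skip if present
                l1.fill(nxt, l15_read(nxt))         # next-line prefetch from L1.5
            return l1.lines[line]

        l1 = L1ICache()
        fetch(0x40, l1, lambda ln: f"line{ln}")     # fills line 4, prefetches line 5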

    Fast simulation techniques for microprocessor design space exploration

    Designing a microprocessor is extremely time-consuming. Computer architects heavily rely on architectural simulators, e.g., to drive high-level design decisions during early-stage design space exploration. The benefit of architectural simulators is that they yield relatively accurate performance results, are highly parameterizable, and are very flexible to use. The downside, however, is that they are at least three or four orders of magnitude slower than real hardware execution. The current trend towards multicore processors exacerbates the problem: as the number of cores on a multicore processor increases, simulation speed becomes a major concern in computer architecture research and development. In this dissertation, we propose and evaluate two simulation techniques that reduce simulation time significantly: statistical simulation and interval simulation. Statistical simulation speeds up simulation by reducing the number of dynamically executed instructions. First, we collect a number of program execution characteristics into a statistical profile. From this profile we can generate a synthetic trace that exhibits the same execution behavior but has a much shorter trace length than the original trace. Simulating this synthetic trace then yields a performance estimate. Interval simulation raises the level of abstraction in architectural simulation; it replaces the core-level cycle-accurate simulation model with a mechanistic analytical model. The analytical model builds on insights from interval analysis: miss events divide the smooth streaming of instructions into so-called intervals. The model drives the timing by analyzing the type of the miss events and their latencies, instead of tracking the individual instructions as they propagate through the pipeline stages.
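
    The interval model lends itself to a back-of-the-envelope rendering: between miss events the core sustains its dispatch width, and each miss event adds its penalty on top. The dispatch width and penalty figures below are placeholder assumptions, not numbers from the dissertation.

        # Back-of-the-envelope interval-simulation model: instructions stream
        # at the dispatch width between miss events, and each miss event adds
        # its latency. All numbers are illustrative placeholders.
        DISPATCH_WIDTH = 4          # instructions per cycle with no miss events

        PENALTY = {                 # cycles per miss event (assumed values)
            "branch_mispredict": 15,
            "l1i_miss": 10,
            "l2_miss": 40,
            "dram_access": 200,
        }

        def estimate_cycles(num_instructions, miss_events):
            """miss_events: mapping from event type to event count."""
            base = num_instructions / DISPATCH_WIDTH     # smooth streaming
            stalls = sum(PENALTY[ev] * n for ev, n in miss_events.items())
            return base + stalls

        cycles = estimate_cycles(1_000_000, {"branch_mispredict": 5_000,
                                             "l2_miss": 1_200})
        print(f"estimated CPI: {cycles / 1_000_000:.2f}")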

    DeSyRe: on-Demand System Reliability

    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect- and fault-free system would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.
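
    The central idea, a small fault-free part managing the fault-prone remainder, can be reduced to a toy scheduler: the manager keeps a fault map of the processing elements and dispatches work only to healthy ones. The class, the fault-reporting interface, and the placement policy are invented for illustration and are not the DeSyRe framework's API.

        # Toy rendering of the DeSyRe idea: a small fault-free manager keeps
        # a fault map of the SoC's processing elements and dispatches work
        # only to healthy ones. Names and task model are illustrative.
        class FaultAwareManager:
            def __init__(self, num_cores):
                self.healthy = set(range(num_cores))

            def report_fault(self, core_id):
                # online test/diagnosis flags a core as faulty
                self.healthy.discard(core_id)

            def dispatch(self, task):
                if not self.healthy:
                    raise RuntimeError("no healthy cores left")
                core = min(self.healthy)          # trivial placement policy
                print(f"running {task!r} on core {core}")

        mgr = FaultAwareManager(num_cores=4)
        mgr.report_fault(2)
        mgr.dispatch("ecg_filter")                # lands on a healthy core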

    NEUROSim: Naturally Extensible, Unique RISC Operation Simulator

    The NEUROSim framework consists of a compiler, assembler, and cycle-accurate processor simulator to facilitate computer architecture research. This framework provides a core instruction set common to many applications and a simulated datapath capable of executing these instructions. However, the core contribution of NEUROSim is its flexible and extensible design, allowing for the addition of instructions and architecture changes which target a specific application. The NEUROSim framework is presented through the analysis of many system design decisions, including execution forwarding, control change detection, FPU configuration, loop unrolling, recursive functions, self-modifying code, branch predictors, and cache architectures. To demonstrate its flexible nature, the NEUROSim framework is applied to specific domains, including a modulo instruction intended for use in encryption applications, a multiply-accumulate instruction analyzed in the context of digital signal processing, Taylor series expansion and lookup table instructions applied to mathematical expression approximation, and an atomic compare-and-swap instruction used for sorting.
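
    The extensibility claim can be illustrated with a dispatch-table simulator in which a custom multiply-accumulate instruction is registered alongside a core operation. The opcode names and the registration mechanism are assumptions for illustration, not NEUROSim's actual interface.

        # Illustration of the extensibility claim: a simulator built around
        # an opcode dispatch table, with a custom multiply-accumulate (MAC)
        # instruction registered next to the core set. Opcode names and the
        # registration mechanism are assumptions, not NEUROSim's interface.
        regs = [0] * 32
        handlers = {}

        def instruction(opcode):
            def register(fn):
                handlers[opcode] = fn
                return fn
            return register

        @instruction("add")
        def op_add(rd, rs1, rs2):
            regs[rd] = regs[rs1] + regs[rs2]

        @instruction("mac")                 # custom DSP-style extension
        def op_mac(rd, rs1, rs2):
            regs[rd] += regs[rs1] * regs[rs2]

        def execute(program):
            for opcode, *operands in program:
                handlers[opcode](*operands)

        regs[1], regs[2] = 3, 4
        execute([("mac", 0, 1, 2), ("mac", 0, 1, 2)])
        print(regs[0])                      # 24: two accumulated 3*4 products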