942 research outputs found

    Hardware optimizations of dense binary hyperdimensional computing: Rematerialization of hypervectors, binarized bundling, and combinational associative memory

    Get PDF
    Brain-inspired hyperdimensional (HD) computing models neural activity patterns of the very size of the brain's circuits with points of a hyperdimensional space, that is, with hypervectors. Hypervectors are Ddimensional (pseudo)random vectors with independent and identically distributed (i.i.d.) components constituting ultra-wide holographic words: D = 10,000 bits, for instance. At its very core, HD computing manipulates a set of seed hypervectors to build composite hypervectors representing objects of interest. It demands memory optimizations with simple operations for an efficient hardware realization. In this article, we propose hardware techniques for optimizations of HD computing, in a synthesizable open-source VHDL library, to enable co-located implementation of both learning and classification tasks on only a small portion of Xilinx UltraScale FPGAs: (1)We propose simple logical operations to rematerialize the hypervectors on the fly rather than loading them from memory. These operations massively reduce the memory footprint by directly computing the composite hypervectors whose individual seed hypervectors do not need to be stored in memory. (2) Bundling a series of hypervectors over time requires a multibit counter per every hypervector component. We instead propose a binarized back-to-back bundling without requiring any counters. This truly enables onchip learning with minimal resources as every hypervector component remains binary over the course of training to avoid otherwise multibit components. (3) For every classification event, an associative memory is in charge of finding the closest match between a set of learned hypervectors and a query hypervector by using a distance metric. This operator is proportional to hypervector dimension (D), and hence may take O(D) cycles per classification event. Accordingly, we significantly improve the throughput of classification by proposing associative memories that steadily reduce the latency of classification to the extreme of a single cycle. (4) We perform a design space exploration incorporating the proposed techniques on FPGAs for a wearable biosignal processing application as a case study. Our techniques achieve up to 2.39 7 area saving, or 2,337 7 throughput improvement. The Pareto optimal HD architecture is mapped on only 18,340 configurable logic blocks (CLBs) to learn and classify five hand gestures using four electromyography sensors

    Signature-Based Protection from Code Reuse Attacks

    Full text link
    Abstract—Code Reuse Attacks (CRAs) recently emerged as a new class of security exploits. CRAs construct malicious programs out of small fragments (gadgets) of existing code, thus eliminating the need for code injection. Existing defenses against CRAs often incur large performance overheads or require extensive binary rewriting and other changes to the system software. In this paper, we examine a signature-based detection of CRAs, where the attack is detected by observing the behavior of programs and detecting the gadget execution patterns. We first demonstrate that naive signature-based defenses can be defeated by introducing special “delay gadgets ” as part of the attack. We then show how a software-configurable signature-based approach can be designed to defend against such stealth CRAs, including the attacks that manage to use longer-length gadgets. The proposed defense (called SCRAP) can be implemented entirely in hardware using simple logic at the commit stage of the pipeline. SCRAP is realized with minimal performance cost, no changes to the software layers and no implications on binary compatibility. Finally, we show that SCRAP generates no false alarms on a wide range of applications.

    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    Get PDF
    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large scale chip multiprocessors (CMPs). Our work is motivated by large asymmetry in cache sets usages. CE decouples the physical locations of cache blocks from their addresses for the sake of reducing misses caused by destructive interferences. Temporal pressure at the on-chip last-level cache, is continuously collected at a group (comprised of cache sets) granularity, and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at a cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, evaluations manifested the outperformance of CE versus related CMP cache designs

    Dynamic Hardware Resource Management for Efficient Throughput Processing.

    Full text link
    High performance computing is evolving at a rapid pace, with throughput oriented processors such as graphics processing units (GPUs), substituting for traditional processors as the computational workhorse. Their adoption has seen a tremendous increase as they provide high peak performance and energy efficiency while maintaining a friendly programming interface. Furthermore, many existing desktop, laptop, tablet, and smartphone systems support accelerating non-graphics, data parallel workloads on their GPUs. However, the multitude of systems that use GPUs as an accelerator run different genres of data parallel applications, which have significantly contrasting runtime characteristics. GPUs use thousands of identical threads to efficiently exploit the on-chip hardware resources. Therefore, if one thread uses a resource (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource. This contention will eventually saturate the performance of the GPU due to contention for the bottleneck resource,leaving other resources underutilized at the same time. Traditional policies of managing the massive hardware resources work adequately, on well designed traditional scientific style applications. However, these static policies, which are oblivious to the application’s resource requirement, are not efficient for the large spectrum of data parallel workloads with varying resource requirements. Therefore, several standard hardware policies such as using maximum concurrency, fixed operational frequency and round-robin style scheduling are not efficient for modern GPU applications. This thesis defines dynamic hardware resource management mechanisms which improve the efficiency of the GPU by regulating the hardware resources at runtime. The first step in successfully achieving this goal is to make the hardware aware of the application’s characteristics at runtime through novel counters and indicators. After this detection, dynamic hardware modulation provides opportunities for increased performance, improved energy consumption, or both, leading to efficient execution. The key mechanisms for modulating the hardware at runtime are dynamic frequency regulation, managing the amount of concurrency, managing the order of execution among different threads and increasing cache utilization. The resultant increased efficiency will lead to improved energy consumption of the systems that utilize GPUs while maintaining or improving their performance.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113356/1/asethia_1.pd

    RTN: Reparameterized Ternary Network

    Full text link
    To deploy deep neural networks on resource-limited devices, quantization has been widely explored. In this work, we study the extremely low-bit networks which have tremendous speed-up, memory saving with quantized activation and weights. We first bring up three omitted issues in extremely low-bit networks: the squashing range of quantized values; the gradient vanishing during backpropagation and the unexploited hardware acceleration of ternary networks. By reparameterizing quantized activation and weights vector with full precision scale and offset for fixed ternary vector, we decouple the range and magnitude from the direction to extenuate the three issues. Learnable scale and offset can automatically adjust the range of quantized values and sparsity without gradient vanishing. A novel encoding and computation pat-tern are designed to support efficient computing for our reparameterized ternary network (RTN). Experiments on ResNet-18 for ImageNet demonstrate that the proposed RTN finds a much better efficiency between bitwidth and accuracy, and achieves up to 26.76% relative accuracy improvement compared with state-of-the-art methods. Moreover, we validate the proposed computation pattern on Field Programmable Gate Arrays (FPGA), and it brings 46.46x and 89.17x savings on power and area respectively compared with the full precision convolution.Comment: To appear at AAAI-2

    Ultra low power cooperative branch prediction

    Get PDF
    Branch Prediction is a key task in the operation of a high performance processor. An inaccurate branch predictor results in increased program run-time and a rise in energy consumption. The drive towards processors with limited die-space and tighter energy requirements will continue to intensify over the coming years, as will the shift towards increasingly multicore processors. Both trends make it increasingly important and increasingly difficult to find effective and efficient branch predictor designs. This thesis presents savings in energy and die-space through the use of more efficient cooperative branch predictors achieved through novel branch prediction designs. The first contribution is a new take on the problem of a hybrid dynamic-static branch predictor allocating branches to be predicted by one of its sub-predictors. A new bias parameter is introduced as a mechanism for trading off a small amount of performance for savings in die-space and energy. This is achieved by predicting more branches with the static predictor, ensuring that only the branches that will most benefit from the dynamic predictor’s resources are predicted dynamically. This reduces pressure on the dynamic predictor’s resources allowing for a smaller predictor to achieve very high accuracy. An improvement in run-time of 7-8% over the baseline BTFN predictor is observed at a cost of a branch predictor bits budget of much less than 1KB. Next, a novel approach to branch prediction for multicore data-parallel applications is presented. The Peloton branch prediction scheme uses a pack of cyclists as an illustration of how a group of processors running similar tasks can share branch predictions to improve accuracy and reduce runtime. The results show that sharing updates for conditional branches across the existing interconnect for I-cache and D-cache updates results in a reduction of mispredictions of up to 25% and a reduction in run-time of up to 6%. McPAT is used to present an energy model that suggests the savings are achieved at little to no increase in energy required. The technique is then extended to architectures where the size of the branch predictors may differ between cores. The results show that such heterogeneity can dramatically reduce the die-space required for an accurate branch predictor while having little impact on performance and up to 9% energy savings. The approach can be combined with the Peloton branch prediction scheme for reduction in branch mispredictions of up to 5%

    Neural Methods for Resolving Hard-to-Predict Branches

    Get PDF
    This work presents a new category of branch predictors designed to be addendums to existing state of the art prediction mechanisms. We call these neural network inspired predictors Shallow Online Neural (SON) Predictors as they utilize easily quantizable shallow networks and exhibit online training as opposed to other related works. This predictor is apt as both a branch prediction scheme and as a TAGE confidence predictor.M.S

    Superscalar RISC-V Processor with SIMD Vector Extension

    Get PDF
    With the increasing number of digital products in the market, the need for robust and highly configurable processors rises. The demand is convened by the stable and extensible open-sourced RISC-V instruction set architecture. RISC-V processors are becoming popular in many fields of applications and research. This thesis presents a dual-issue superscalar RISC-V processor design with dynamic execution. The proposed design employs the global sharing scheme for branch prediction and Tomasulo algorithm for out-of-order execution. The processor is capable of speculative execution with five checkpoints. Data flow in the instruction dispatch and commit stages is optimized to achieve higher instruction throughput. The superscalar processor is extended with a customized vector instruction set of single-instruction-multiple-data computations to specifically improve the performance on machine learning tasks. According to the definition of the proposed vector instruction set, the scratchpad memory and element-wise arithmetic units are implemented in the vector co-processor. Different test programs are evaluated on the fully-tested superscalar processor. Compared to the reference work, the proposed design improves 18.9% on average instruction throughput and 4.92% on average prediction hit rate, with 16.9% higher operating clock frequency synthesized on the Intel Arria 10 FPGA board. The forward propagation of a convolution neural network model is evaluated by the standalone superscalar processor and the integration of the vector co-processor. The vector program with software-level optimizations achieves 9.53× improvement on instruction throughput and 10.18× improvement on real-time throughput. Moreover, the integration also provides 2.22× energy efficiency compared with the superscalar processor along
    • 

    corecore