2,057 research outputs found

    BinRec:Atack surface reduction through dynamic binary recovery

    Get PDF
    Compile-time specialization and feature pruning through static binary rewriting have been proposed repeatedly as techniques for reducing the attack surface of large programs, and for minimizing the trusted computing base. We propose a new approach to attack surface reduction: dynamic binary lifting and recompilation. We present BinRec, a binary recompilation framework that lifts binaries to a compiler-level intermediate representation (IR) to allow complex transformations on the captured code. After transformation, BinRec lowers the IR back to a "recovered" binary, which is semantically equivalent to the input binary, but has its unnecessary features removed. Unlike existing approaches, which are mostly based on static analysis and rewriting, our framework analyzes and lifts binaries dynamically. The crucial advantage is that we can not only observe the full program including all of its dependencies, but we can also determine which program features the end-user actually uses. We evaluate the correctness and performance of Bin-Rec, and show that our approach enables aggressive pruning of unwanted features in COTS binaries

    Wrong-Path-Aware Entangling Instruction Prefetcher

    Get PDF
    © 2023.IEEE. This document is made available under the CC-BY 4.0 license http://creativecommons.org/licenses/by /4.0/ This document is the Accepted version of a Published Work that appeared in final form in IEEE Transactions on Computers. To access the final edited and published work see DOI 10.1109/TC.2023.3337308Instruction prefetching is instrumental for guaranteeing a high flow of instructions through the processor front end for applications whose working set does not fit in the lowerlevel caches. Examples of such applications are server workloads, whose instruction footprints are constantly growing. There are two main techniques to mitigate this problem: fetch directed prefetching (or decoupled front end) and instruction cache (L1I) prefetching. This work extends the state-of-the-art Entangling prefetcher to avoid training during wrong-path execution. Our Entangling wrong-path-aware prefetcher is equipped with microarchitectural techniques that eliminate more than 99% of wrong-path pollution, thus reaching 98.9% of the performance of an ideal wrongpath-aware solution. Next, we propose two microarchitectural optimizations able to further increase performance benefits by 1.8%, on average. All this is achieved with just 304 bytes. Finally, we study the interplay between the L1I prefetcher and a decoupled front end. Our analysis shows that due to pollution caused by wrong-path instructions, the degree of decoupling cannot be increased unlimitedly without negative effects on the energy-delay product (EDP). Furthermore, the closer to ideal is the L1I prefetcher, the less decoupling is required. For example, our Entangling prefetcher reaches an optimal EDP with a decoupling degree of 64 instructions

    Vitruvius+: An area-efficient RISC-V decoupled vector coprocessor for high performance computing applications

    Get PDF
    The maturity level of RISC-V and the availability of domain-specific instruction set extensions, like vector processing, make RISC-V a good candidate for supporting the integration of specialized hardware in processor cores for the High Performance Computing (HPC) application domain. In this article,1 we present Vitruvius+, the vector processing acceleration engine that represents the core of vector instruction execution in the HPC challenge that comes within the EuroHPC initiative. It implements the RISC-V vector extension (RVV) 0.7.1 and can be easily connected to a scalar core using the Open Vector Interface standard. Vitruvius+ natively supports long vectors: 256 double precision floating-point elements in a single vector register. It is composed of a set of identical vector pipelines (lanes), each containing a slice of the Vector Register File and functional units (one integer, one floating point). The vector instruction execution scheme is hybrid in-order/out-of-order and is supported by register renaming and arithmetic/memory instruction decoupling. On a stand-alone synthesis, Vitruvius+ reaches a maximum frequency of 1.4 GHz in typical conditions (TT/0.80V/25°C) using GlobalFoundries 22FDX FD-SOI. The silicon implementation has a total area of 1.3 mm2 and maximum estimated power of ~920 mW for one instance of Vitruvius+ equipped with eight vector lanes.This research has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 (European Processor Initiative) and Specific Grant Agreement No 101036168 (EPI SGA2). The JU receives support from the European Union’s Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland. The EPI-SGA2 project, PCI2022-132935 is also co-funded by MCIN/AEI/10.13039/501100011033 and by the UE NextGen- erationEU/PRTR. This work has also been partially supported by the Spanish Ministry of Science and Innovation (PID2019-107255GB-C21/AEI/10.13039/501100011033).Peer ReviewedPostprint (author's final draft

    Techniques for enlarging instruction streams

    Get PDF
    This work presents several techniques for enlarging instruction streams. We call stream to a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks. The long size of instruction streams makes it possible for a fetch engine based on streams to provide high fetch bandwidth, which leads to obtaining performance results comparable to a trace cache. The long size of streams also enables the next stream predictor to tolerate the prediction table access latency. Therefore, enlarging instruction streams will improve the behavior of a fetch engine based on streams. We provide a comprehensive analysis of dynamic instruction streams, showing that focusing on particular kinds of stream is not a good strategy due to Amdahl's law. Consequently, we propose the multiple stream predictor, a novel mechanism that deals with all kinds of streams by combining single streams into long virtual streams. We show that our multiple stream predictor is able to tolerate the prediction access latency without requiring the complexity caused by additional hardware mechanisms like prediction overriding.Postprint (published version

    Domain-Specific Computing Architectures and Paradigms

    Full text link
    We live in an exciting era where artificial intelligence (AI) is fundamentally shifting the dynamics of industries and businesses around the world. AI algorithms such as deep learning (DL) have drastically advanced the state-of-the-art cognition and learning capabilities. However, the power of modern AI algorithms can only be enabled if the underlying domain-specific computing hardware can deliver orders of magnitude more performance and energy efficiency. This work focuses on this goal and explores three parts of the domain-specific computing acceleration problem; encapsulating specialized hardware and software architectures and paradigms that support the ever-growing processing demand of modern AI applications from the edge to the cloud. This first part of this work investigates the optimizations of a sparse spatio-temporal (ST) cognitive system-on-a-chip (SoC). This design extracts ST features from videos and leverages sparse inference and kernel compression to efficiently perform action classification and motion tracking. The second part of this work explores the significance of dataflows and reduction mechanisms for sparse deep neural network (DNN) acceleration. This design features a dynamic, look-ahead index matching unit in hardware to efficiently discover fine-grained parallelism, achieving high energy efficiency and low control complexity for a wide variety of DNN layers. Lastly, this work expands the scope to real-time machine learning (RTML) acceleration. A new high-level architecture modeling framework is proposed. Specifically, this framework consists of a set of high-performance RTML-specific architecture design templates, and a Python-based high-level modeling and compiler tool chain for efficient cross-stack architecture design and exploration.PHDElectrical and Computer EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/162870/1/lchingen_1.pd

    Reducing fetch architecture complexity using procedure inlining

    Get PDF
    Fetch engine performance is seriously limited by the branch prediction table access latency. This fact has lead to the development of hardware mechanisms, like prediction overriding, aimed to tolerate this latency. However, prediction overriding requires additional support and recovery mechanisms, which increases the fetch architecture complexity. In this paper, we show that this increase in complexity can be avoided if the interaction between the fetch architecture and software code optimizations is taken into account. We use aggressive procedure inlining to generate long streams of instructions that are used by the fetch engine as the basic prediction unit. We call instruction stream to a sequence of instructions from the target of a taken branch to the next taken branch. These instruction streams are long enough to feed the execution engine with instructions during multiple cycles, while a new stream prediction is being generated, and thus hiding the prediction table access latency. Our results show that the length of instruction streams compensates the increase in the instruction cache miss rate caused by inlining. We show that, using procedure inlining, the need for a prediction overriding mechanism is avoided, reducing the fetch engine complexity.Peer ReviewedPostprint (published version

    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    Get PDF
    The sustained demand for faster, more powerful chips has been met by the availability of chip manufacturing processes allowing for the integration of increasing numbers of computation units onto a single die. The resulting outcome, especially in the embedded domain, has often been called SYSTEM-ON-CHIP (SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MP-SoC). MPSoC design brings to the foreground a large number of challenges, one of the most prominent of which is the design of the chip interconnection. With a number of on-chip blocks presently ranging in the tens, and quickly approaching the hundreds, the novel issue of how to best provide on-chip communication resources is clearly felt. NETWORKS-ON-CHIPS (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they guarantee a structured answer to present and future communication requirements. The point-to-point connection and packet switching paradigms they involve are also of great help in minimizing wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline. Several main areas of interest require deep investigation for NoCs to become viable solutions: • The design of the NoC architecture needs to strike the best tradeoff among performance, features and the tight area and power constraints of the onchip domain. • Simulation and verification infrastructure must be put in place to explore, validate and optimize the NoC performance. • NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters. Design tools are needed to prune this space and pick the best solutions. • Even more so given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to evaluate their suitability for next-generation designs and their area and power costs. This dissertation performs a design space exploration of network-on-chip architectures, in order to point-out the trade-offs associated with the design of each individual network building blocks and with the design of network topology overall. The design space exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics with themselves and with early networkon- chip prototypes. The ultimate objective is to point out the key advantages that NoC realizations provide with respect to state-of-the-art communication infrastructures and to point out the challenges that lie ahead in order to make this new interconnect technology come true. Among these latter, technologyrelated challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy. In particular, leakage power dissipation, containment of process variations and of their effects. The achievement of the above objectives was enabled by means of a NoC simulation environment for cycleaccurate modelling and simulation and by means of a back-end facility for the study of NoC physical implementation effects. Overall, all the results provided by this work have been validated on actual silicon layout

    TREBUCHET: Fully Homomorphic Encryption Accelerator for Deep Computation

    Full text link
    Secure computation is of critical importance to not only the DoD, but across financial institutions, healthcare, and anywhere personally identifiable information (PII) is accessed. Traditional security techniques require data to be decrypted before performing any computation. When processed on untrusted systems the decrypted data is vulnerable to attacks to extract the sensitive information. To address these vulnerabilities Fully Homomorphic Encryption (FHE) keeps the data encrypted during computation and secures the results, even in these untrusted environments. However, FHE requires a significant amount of computation to perform equivalent unencrypted operations. To be useful, FHE must significantly close the computation gap (within 10x) to make encrypted processing practical. To accomplish this ambitious goal the TREBUCHET project is leading research and development in FHE processing hardware to accelerate deep computations on encrypted data, as part of the DARPA MTO Data Privacy for Virtual Environments (DPRIVE) program. We accelerate the major secure standardized FHE schemes (BGV, BFV, CKKS, FHEW, etc.) at >=128-bit security while integrating with the open-source PALISADE and OpenFHE libraries currently used in the DoD and in industry. We utilize a novel tile-based chip design with highly parallel ALUs optimized for vectorized 128b modulo arithmetic. The TREBUCHET coprocessor design provides a highly modular, flexible, and extensible FHE accelerator for easy reconfiguration, deployment, integration and application on other hardware form factors, such as System-on-Chip or alternate chip areas.Comment: 6 pages, 5figures, 2 table
    • …
    corecore