2,057 research outputs found
BinRec:Atack surface reduction through dynamic binary recovery
Compile-time specialization and feature pruning through static binary rewriting have been proposed repeatedly as techniques for reducing the attack surface of large programs, and for minimizing the trusted computing base. We propose a new approach to attack surface reduction: dynamic binary lifting and recompilation. We present BinRec, a binary recompilation framework that lifts binaries to a compiler-level intermediate representation (IR) to allow complex transformations on the captured code. After transformation, BinRec lowers the IR back to a "recovered" binary, which is semantically equivalent to the input binary, but has its unnecessary features removed. Unlike existing approaches, which are mostly based on static analysis and rewriting, our framework analyzes and lifts binaries dynamically. The crucial advantage is that we can not only observe the full program including all of its dependencies, but we can also determine which program features the end-user actually uses. We evaluate the correctness and performance of Bin-Rec, and show that our approach enables aggressive pruning of unwanted features in COTS binaries
Wrong-Path-Aware Entangling Instruction Prefetcher
© 2023.IEEE. This document is made available under the CC-BY 4.0 license http://creativecommons.org/licenses/by /4.0/
This document is the Accepted version of a Published Work that appeared in final form in IEEE Transactions on Computers. To access the final edited and published work see DOI 10.1109/TC.2023.3337308Instruction prefetching is instrumental for guaranteeing a high flow of instructions through the processor front
end for applications whose working set does not fit in the lowerlevel caches. Examples of such applications are server workloads,
whose instruction footprints are constantly growing. There are
two main techniques to mitigate this problem: fetch directed
prefetching (or decoupled front end) and instruction cache (L1I)
prefetching.
This work extends the state-of-the-art Entangling prefetcher
to avoid training during wrong-path execution. Our Entangling
wrong-path-aware prefetcher is equipped with microarchitectural
techniques that eliminate more than 99% of wrong-path pollution, thus reaching 98.9% of the performance of an ideal wrongpath-aware solution. Next, we propose two microarchitectural
optimizations able to further increase performance benefits by
1.8%, on average. All this is achieved with just 304 bytes.
Finally, we study the interplay between the L1I prefetcher and
a decoupled front end. Our analysis shows that due to pollution
caused by wrong-path instructions, the degree of decoupling
cannot be increased unlimitedly without negative effects on
the energy-delay product (EDP). Furthermore, the closer to
ideal is the L1I prefetcher, the less decoupling is required. For
example, our Entangling prefetcher reaches an optimal EDP with
a decoupling degree of 64 instructions
Vitruvius+: An area-efficient RISC-V decoupled vector coprocessor for high performance computing applications
The maturity level of RISC-V and the availability of domain-specific instruction set extensions, like vector processing, make RISC-V a good candidate for supporting the integration of specialized hardware in processor cores for the High Performance Computing (HPC) application domain. In this article,1 we present Vitruvius+, the vector processing acceleration engine that represents the core of vector instruction execution in the HPC challenge that comes within the EuroHPC initiative. It implements the RISC-V vector extension (RVV) 0.7.1 and can be easily connected to a scalar core using the Open Vector Interface standard. Vitruvius+ natively supports long vectors: 256 double precision floating-point elements in a single vector register. It is composed of a set of identical vector pipelines (lanes), each containing a slice of the Vector Register File and functional units (one integer, one floating point). The vector instruction execution scheme is hybrid in-order/out-of-order and is supported by register renaming and arithmetic/memory instruction decoupling. On a stand-alone synthesis, Vitruvius+ reaches a maximum frequency of 1.4 GHz in typical conditions (TT/0.80V/25°C) using GlobalFoundries 22FDX FD-SOI. The silicon implementation has a total area of 1.3 mm2 and maximum estimated power of ~920 mW for one instance of Vitruvius+ equipped with eight vector lanes.This research has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 (European Processor Initiative) and Specific Grant Agreement No 101036168 (EPI SGA2). The JU receives support from the European Union’s Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland. The EPI-SGA2 project, PCI2022-132935 is also co-funded by MCIN/AEI/10.13039/501100011033 and by the UE NextGen- erationEU/PRTR. This work has also been partially supported by the Spanish Ministry of Science and Innovation (PID2019-107255GB-C21/AEI/10.13039/501100011033).Peer ReviewedPostprint (author's final draft
Techniques for enlarging instruction streams
This work presents several techniques for enlarging instruction streams. We call stream to a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks. The long size of instruction streams makes it possible for a fetch engine based on streams to provide high fetch bandwidth, which leads to obtaining performance results comparable to a trace cache. The long size of streams also enables the next stream predictor to tolerate the prediction table access latency. Therefore, enlarging instruction streams will improve the behavior of a fetch engine based on streams. We provide a comprehensive analysis of dynamic instruction streams, showing that focusing on particular kinds of stream is not a good strategy due to Amdahl's law. Consequently, we propose the multiple stream predictor, a novel mechanism that deals with all kinds of streams by combining single streams into long virtual streams. We show that our multiple stream predictor is able to tolerate the prediction access latency without requiring the complexity caused by additional hardware mechanisms like prediction overriding.Postprint (published version
Domain-Specific Computing Architectures and Paradigms
We live in an exciting era where artificial intelligence (AI) is fundamentally shifting the dynamics of industries and businesses around the world. AI algorithms such as deep learning (DL) have drastically advanced the state-of-the-art cognition and learning capabilities. However, the power of modern AI algorithms can only be enabled if the underlying domain-specific computing hardware can deliver orders of magnitude more performance and energy efficiency. This work focuses on this goal and explores three parts of the domain-specific computing acceleration problem; encapsulating specialized hardware and software architectures and paradigms that support the ever-growing processing demand of modern AI applications from the edge to the cloud.
This first part of this work investigates the optimizations of a sparse spatio-temporal (ST) cognitive system-on-a-chip (SoC). This design extracts ST features from videos and leverages sparse inference and kernel compression to efficiently perform action classification and motion tracking.
The second part of this work explores the significance of dataflows and reduction mechanisms for sparse deep neural network (DNN) acceleration. This design features a dynamic, look-ahead index matching unit in hardware to efficiently discover fine-grained parallelism, achieving high energy efficiency and low control complexity for a wide variety of DNN layers.
Lastly, this work expands the scope to real-time machine learning (RTML) acceleration. A new high-level architecture modeling framework is proposed. Specifically, this framework consists of a set of high-performance RTML-specific architecture design templates, and a Python-based high-level modeling and compiler tool chain for efficient cross-stack architecture design and exploration.PHDElectrical and Computer EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/162870/1/lchingen_1.pd
Reducing fetch architecture complexity using procedure inlining
Fetch engine performance is seriously limited by the branch prediction table access latency. This fact has lead to the development of hardware mechanisms, like prediction overriding, aimed to tolerate this latency. However, prediction overriding requires additional support and recovery mechanisms, which increases the fetch architecture complexity. In this paper, we show that this increase in complexity can be avoided if the interaction between the fetch architecture and software code optimizations is taken into account. We use aggressive procedure inlining to generate long streams of instructions that are used by the fetch engine as the basic prediction unit. We call instruction stream to a sequence of instructions from the target of a taken branch to the next taken branch. These instruction streams are long enough to feed the execution engine with instructions during multiple cycles, while a new stream prediction is being generated, and thus hiding the prediction table access latency. Our results show that the length of instruction streams compensates the increase in the instruction cache miss rate caused by inlining. We show that, using procedure inlining, the need for a prediction overriding mechanism is avoided, reducing the fetch engine complexity.Peer ReviewedPostprint (published version
Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip
The sustained demand for faster, more powerful chips has been met by the
availability of chip manufacturing processes allowing for the integration of increasing
numbers of computation units onto a single die. The resulting outcome,
especially in the embedded domain, has often been called SYSTEM-ON-CHIP
(SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MP-SoC).
MPSoC design brings to the foreground a large number of challenges, one of
the most prominent of which is the design of the chip interconnection. With a
number of on-chip blocks presently ranging in the tens, and quickly approaching
the hundreds, the novel issue of how to best provide on-chip communication
resources is clearly felt.
NETWORKS-ON-CHIPS (NoCs) are the most comprehensive and scalable
answer to this design concern. By bringing large-scale networking concepts to
the on-chip domain, they guarantee a structured answer to present and future
communication requirements. The point-to-point connection and packet switching
paradigms they involve are also of great help in minimizing wiring overhead
and physical routing issues. However, as with any technology of recent inception,
NoC design is still an evolving discipline. Several main areas of interest
require deep investigation for NoCs to become viable solutions:
• The design of the NoC architecture needs to strike the best tradeoff among
performance, features and the tight area and power constraints of the onchip
domain.
• Simulation and verification infrastructure must be put in place to explore,
validate and optimize the NoC performance.
• NoCs offer a huge design space, thanks to their extreme customizability in
terms of topology and architectural parameters. Design tools are needed
to prune this space and pick the best solutions.
• Even more so given their global, distributed nature, it is essential to evaluate
the physical implementation of NoCs to evaluate their suitability for
next-generation designs and their area and power costs.
This dissertation performs a design space exploration of network-on-chip architectures,
in order to point-out the trade-offs associated with the design of
each individual network building blocks and with the design of network topology
overall. The design space exploration is preceded by a comparative analysis
of state-of-the-art interconnect fabrics with themselves and with early networkon-
chip prototypes. The ultimate objective is to point out the key advantages
that NoC realizations provide with respect to state-of-the-art communication
infrastructures and to point out the challenges that lie ahead in order to make
this new interconnect technology come true. Among these latter, technologyrelated
challenges are emerging that call for dedicated design techniques at all
levels of the design hierarchy. In particular, leakage power dissipation, containment
of process variations and of their effects. The achievement of the above
objectives was enabled by means of a NoC simulation environment for cycleaccurate
modelling and simulation and by means of a back-end facility for the
study of NoC physical implementation effects. Overall, all the results provided
by this work have been validated on actual silicon layout
TREBUCHET: Fully Homomorphic Encryption Accelerator for Deep Computation
Secure computation is of critical importance to not only the DoD, but across
financial institutions, healthcare, and anywhere personally identifiable
information (PII) is accessed. Traditional security techniques require data to
be decrypted before performing any computation. When processed on untrusted
systems the decrypted data is vulnerable to attacks to extract the sensitive
information. To address these vulnerabilities Fully Homomorphic Encryption
(FHE) keeps the data encrypted during computation and secures the results, even
in these untrusted environments. However, FHE requires a significant amount of
computation to perform equivalent unencrypted operations. To be useful, FHE
must significantly close the computation gap (within 10x) to make encrypted
processing practical. To accomplish this ambitious goal the TREBUCHET project
is leading research and development in FHE processing hardware to accelerate
deep computations on encrypted data, as part of the DARPA MTO Data Privacy for
Virtual Environments (DPRIVE) program. We accelerate the major secure
standardized FHE schemes (BGV, BFV, CKKS, FHEW, etc.) at >=128-bit security
while integrating with the open-source PALISADE and OpenFHE libraries currently
used in the DoD and in industry. We utilize a novel tile-based chip design with
highly parallel ALUs optimized for vectorized 128b modulo arithmetic. The
TREBUCHET coprocessor design provides a highly modular, flexible, and
extensible FHE accelerator for easy reconfiguration, deployment, integration
and application on other hardware form factors, such as System-on-Chip or
alternate chip areas.Comment: 6 pages, 5figures, 2 table
- …