Branch prediction apparatus, systems, and methods
An apparatus and a system, as well as a method and article, may operate to predict a branch within a first operating context, such as a user context, using a first strategy; and to predict a branch within a second operating context, such as an operating system context, using a second strategy. In some embodiments, apparatus and systems may comprise one or more first storage locations to store branch history information associated with a first operating context, and one or more second storage locations to store branch history information associated with a second operating context.
Board of Regents, University of Texas System
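As a rough illustration of the idea (not the patent's specified mechanism), the sketch below models a gshare-style predictor in Python that keeps a separate branch-history register and counter table per operating context, so user-mode and kernel-mode branches never pollute each other's history. The context identifiers, table size, and indexing scheme are all hypothetical.

```python
# Minimal sketch, assuming a gshare-style design: one history register and
# one 2-bit-counter table per operating context (user vs. kernel).

USER, KERNEL = 0, 1          # hypothetical context identifiers
TABLE_BITS = 12
TABLE_SIZE = 1 << TABLE_BITS

class ContextSplitPredictor:
    def __init__(self):
        self.history = [0, 0]                                  # per-context BHR
        self.counters = [[2] * TABLE_SIZE, [2] * TABLE_SIZE]   # weakly taken

    def _index(self, ctx, pc):
        # gshare indexing: hash PC with this context's own history.
        return ((pc >> 2) ^ self.history[ctx]) & (TABLE_SIZE - 1)

    def predict(self, ctx, pc):
        return self.counters[ctx][self._index(ctx, pc)] >= 2   # True = taken

    def update(self, ctx, pc, taken):
        i = self._index(ctx, pc)
        c = self.counters[ctx][i]
        self.counters[ctx][i] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history[ctx] = ((self.history[ctx] << 1) | taken) & (TABLE_SIZE - 1)

p = ContextSplitPredictor()
p.update(KERNEL, 0x8000_01f0, taken=True)
print(p.predict(KERNEL, 0x8000_01f0))   # consults kernel-context state only
print(p.predict(USER,   0x8000_01f0))   # user-context tables are untouched
```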
Mechanistic modeling of architectural vulnerability factor
Soft-error reliability is a significant design challenge in modern microprocessors, owing to the exponential increase in the number of on-chip transistors and the reduction in operating voltages with each process generation. Architectural Vulnerability Factor (AVF) modeling using microarchitectural simulators enables architects to make informed performance, power, and reliability tradeoffs. However, such simulators are time-consuming and do not reveal the microarchitectural mechanisms that influence AVF. In this article, we present an accurate first-order mechanistic analytical model to compute AVF, developed from the first principles of out-of-order superscalar execution. This model provides insight into the fundamental interactions between the workload and the microarchitecture that together determine AVF. We use the model to perform design-space exploration, parametric sweeps, and workload characterization for AVF.
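For context, the standard definition of AVF from the soft-error literature (not the paper's specific first-order mechanistic model) reduces to a one-line computation: the fraction of a structure's bit-cycles that hold ACE (Architecturally Correct Execution) state, i.e. bits whose corruption would be architecturally visible. A minimal sketch with made-up numbers:

```python
# AVF = ACE bit-cycles / (structure size in bits * cycles observed).
# Standard textbook definition; the numbers below are hypothetical.

def avf(ace_bit_cycles, total_bits, total_cycles):
    """Fraction of bit-cycles holding ACE state."""
    return ace_bit_cycles / (total_bits * total_cycles)

# Hypothetical example: a 64-entry, 64-bit issue queue observed for 1e6
# cycles, during which profiling reports 9.8e8 ACE bit-cycles.
print(avf(ace_bit_cycles=9.8e8, total_bits=64 * 64, total_cycles=1e6))
# -> ~0.239, i.e. roughly 24% of raw bit flips in this structure would matter.
```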
Exploiting compiler-generated schedules for energy savings in high-performance processors
This paper develops a technique that uniquely combines the advantages of static scheduling and dynamic scheduling to reduce the energy consumed in modern superscalar processors with out-of-order issue logic. In this Hybrid-Scheduling paradigm, regions of the application containing large amounts of parallelism visible at compile time completely bypass the dynamic scheduling logic and execute in a low-power static mode. Simulation studies using the Wattch framework on several media and scientific benchmarks demonstrate large improvements in overall energy consumption, of 43% in kernels and 25% in full applications, with only a 2.8% performance degradation on average.
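A loose software analogy of the Hybrid-Scheduling idea (the paper targets real issue logic, not an interpreter): compiler-marked regions are issued in program order with the dynamic scheduler idle, while unmarked regions take the usual out-of-order path. The toy IR and all names below are hypothetical.

```python
# Hypothetical IR: (region_is_static, [instructions]) pairs produced by a
# compiler pass that marks regions whose schedule is fully known statically.
program = [
    (True,  ["mul v0, a, b", "mul v1, c, d", "add v2, v0, v1"]),  # static schedule
    (False, ["load r0, [p]", "add r1, r0, 1", "store [p], r1"]),   # pointer-dependent
]

def issue(insn, mode):
    print(f"[{mode}] {insn}")

def run(program):
    for is_static, insns in program:
        # Static mode issues in the compiler's order, so the wakeup/select
        # logic is never exercised; that is where the energy saving comes from.
        mode = "static" if is_static else "dynamic"
        for insn in insns:
            issue(insn, mode)

run(program)
```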
Recommended from our members
Apparatus and method for accelerating java translation
An apparatus and method for accelerating Java translation are provided. The apparatus includes a lookup table which stores arrangements of bytecodes and the native codes corresponding to those bytecodes; a decoder which generates a pointer to the native code in the lookup table corresponding to a fed bytecode; a parameterized-bytecode processing unit which detects parameterized bytecodes among the fed bytecodes and generates a pointer to the native code in the lookup table required for constant embedding; a constant embedding unit which embeds constants into the native code using the pointer generated by the parameterized-bytecode processing unit; and a native code buffer which stores the native code generated by the decoder or the constant embedding unit.
Board of Regents, University of Texas System
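A minimal software model of the translation path described above, assuming hypothetical native-code templates (the patent describes hardware units; the opcode values for iconst_0, bipush, and iadd are real JVM encodings):

```python
# Hypothetical bytecode -> (native template, needs_constant) lookup table.
LUT = {
    0x03: ("push #0",   False),                                  # iconst_0
    0x10: ("push #{k}", True),                                   # bipush k
    0x60: ("pop r1; pop r0; add r0, r0, r1; push r0", False),    # iadd
}

def translate(bytecodes):
    """Decode a bytecode stream and fill a native-code buffer."""
    native_buffer = []
    i = 0
    while i < len(bytecodes):
        template, needs_constant = LUT[bytecodes[i]]
        if needs_constant:
            # Constant-embedding step: splice the operand into the template.
            native_buffer.append(template.format(k=bytecodes[i + 1]))
            i += 2
        else:
            native_buffer.append(template)
            i += 1
    return native_buffer

print(translate([0x10, 5, 0x03, 0x60]))  # bipush 5; iconst_0; iadd
```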
HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis
Machine Learning (ML) has been widely adopted in design-space exploration using high-level synthesis (HLS) to deliver better and faster performance, resource, and power estimation at very early stages of FPGA-based design. To perform prediction accurately, high-quality and large-volume datasets are required for training ML models. This paper presents a dataset for ML-assisted FPGA design using HLS, called HLSDataset. The dataset is generated from widely used HLS C benchmarks, including Polybench, MachSuite, CHStone, and Rosetta. The Verilog samples are generated with a variety of directives, including loop unrolling, loop pipelining, and array partitioning, to ensure that optimized and realistic designs are covered. The total number of generated Verilog samples is nearly 9,000 per FPGA type. To demonstrate the effectiveness of our dataset, we undertake case studies performing power estimation and resource-usage estimation with ML models trained on our dataset. All the code and the dataset are public at the GitHub repo. We believe that HLSDataset can save valuable time for researchers by avoiding the tedious process of running tools and scripting and parsing files to generate the dataset, and enable them to spend more time where it counts: in training ML models.
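As a hedged sketch of the kind of case study described, the snippet below fits a simple least-squares power model from resource-style features. The feature names and all numbers are invented for illustration, not taken from HLSDataset:

```python
import numpy as np

# Hypothetical rows: [LUTs, FFs, DSPs, BRAMs] per design sample.
X = np.array([
    [1200,  900,  4,  2],
    [5400, 4100, 16,  8],
    [ 300,  250,  0,  1],
    [2500, 2000,  8,  4],
    [8800, 7600, 32, 12],
], dtype=float)
y = np.array([0.18, 0.74, 0.05, 0.33, 1.21])   # measured power (W), made up

# Least-squares linear model with a bias term.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

new_design = np.array([2000.0, 1500.0, 8.0, 4.0, 1.0])   # features + bias
print(f"predicted power: {new_design @ coef:.3f} W")
```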
PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation
Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerator architectures targeting parallel workloads such as Deep Learning (DL), because of its capability to achieve massive data parallelism at a low area overhead and to provide orders-of-magnitude data-movement savings by moving computational resources closer to the data. While many PIM architectures have been proposed, improvements are needed in communicating intermediate results to consumer kernels, in communication between tiles at scale, in reduction operations, and in efficiently performing bit-serial operations with constants.

We present PIMSAB, a scalable architecture that provides a spatially-aware communication network for efficient intra-tile and inter-tile data movement, and efficient computation support for generally inefficient bit-serial compute patterns. Our architecture consists of a massive hierarchical array of compute-enabled SRAMs (CRAMs) and is co-designed with a compiler to achieve high utilization. The key novelties of our architecture are: (1) efficient support for spatially-aware communication, via a local H-tree network for reductions, explicit hardware for shuffling operands, and systolic broadcasting; and (2) exploitation of the divisible nature of bit-serial computations through adaptive precision, bit-slicing, and efficient handling of constant operations.

Compared against a similarly provisioned modern Tensor Core GPU (NVIDIA A100), across common DL kernels and an end-to-end DL network (ResNet18), PIMSAB outperforms the GPU by 3x and reduces energy by 4.2x. We also compare PIMSAB with a similarly provisioned state-of-the-art SRAM PIM (Duality Cache) and DRAM PIM (SIMDRAM) and observe speedups of 3.7x and 3.88x, respectively.
Aman Arora and Jian Weng are co-first authors with equal contribution.
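To make the bit-serial compute pattern concrete, here is a generic bit-serial addition over bit-planes in Python (an illustration of the paradigm, not PIMSAB's actual CRAM microarchitecture): an N-bit add takes N full-adder steps, but each step operates on every lane at once.

```python
import numpy as np

BITS, LANES = 8, 16
rng = np.random.default_rng(0)
a = rng.integers(0, 128, LANES)
b = rng.integers(0, 128, LANES)

# Transpose operands to bit-planes: plane[i] holds bit i of every lane,
# analogous to one SRAM row per bit position.
a_planes = [(a >> i) & 1 for i in range(BITS)]
b_planes = [(b >> i) & 1 for i in range(BITS)]

# Bit-serial ripple add: one full-adder step per bit position, all lanes at once.
carry = np.zeros(LANES, dtype=np.int64)
sum_planes = []
for i in range(BITS):
    s = a_planes[i] ^ b_planes[i] ^ carry
    carry = (a_planes[i] & b_planes[i]) | (carry & (a_planes[i] ^ b_planes[i]))
    sum_planes.append(s)

result = sum(plane << i for i, plane in enumerate(sum_planes))
assert np.array_equal(result, a + b)   # 127 + 127 = 254 still fits in 8 bits
print(result[:4], (a + b)[:4])
```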