Dynamically allocating processor resources between nearby and distant ILP
Journal Article
Modern superscalar processors use wide instruction issue widths and out-of-order execution to increase instruction-level parallelism (ILP). Because instructions must be committed in program order to guarantee precise exceptions, increasing ILP implies increasing the sizes of structures such as the register file, issue queue, and reorder buffer. At the same time, cycle time constraints limit the sizes of these structures, resulting in conflicting design requirements. In this paper, we present a novel microarchitecture designed to overcome the limitations of a register file size dictated by cycle time constraints. Available registers are dynamically allocated between the primary program thread and a future thread. The future thread executes instructions when the primary thread is limited by resource availability. The future thread is not constrained by in-order commit requirements; it is therefore able to examine a much larger instruction window and jump far ahead to execute ready instructions. Results are communicated back to the primary thread by warming up the register file, instruction cache, data cache, and instruction reuse buffer, and by resolving branch mispredicts early. The proposed microarchitecture achieves an overall speedup of 1.17 over the base processor for our benchmark set, with speedups of up to 1.64.
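The core idea of splitting one register pool between a primary and a future thread can be sketched abstractly. This is a hypothetical toy model, not the paper's mechanism: the function name and the greedy allocation policy are invented for illustration.

```python
# Toy sketch of dynamic register partitioning between a primary thread and a
# speculative "future" thread. Policy is illustrative: the primary thread gets
# what it demands (capped at the pool size); the remainder goes to the future
# thread, which runs ahead when the primary thread stalls on resources.

def allocate_registers(total_regs, primary_demand):
    """Return (primary_share, future_share) from a fixed register pool."""
    primary = min(primary_demand, total_regs)
    future = total_regs - primary
    return primary, future

print(allocate_registers(128, 80))   # (80, 48): future thread gets the slack
print(allocate_registers(128, 200))  # (128, 0): primary saturates the pool
```

When the primary thread's demand exceeds the pool, the future thread gets nothing, which mirrors the abstract's observation that the future thread only runs when the primary thread is resource-limited.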
MLPerf Inference Benchmark
Machine-learning (ML) hardware and software system demand is burgeoning.
Driven by ML applications, the number of different ML inference systems has
exploded. Over 100 organizations are building ML inference chips, and the
systems that incorporate existing models span at least three orders of
magnitude in power consumption and five orders of magnitude in performance;
they range from embedded devices to data-center solutions. Fueling the hardware
are a dozen or more software frameworks and libraries. The myriad combinations
of ML hardware and ML software make assessing ML-system performance in an
architecture-neutral, representative, and reproducible manner challenging.
There is a clear need for industry-wide standard ML benchmarking and evaluation
criteria. MLPerf Inference answers that call. In this paper, we present our
benchmarking method for evaluating ML inference systems. Driven by more than 30
organizations as well as more than 200 ML engineers and practitioners, MLPerf
prescribes a set of rules and best practices to ensure comparability across
systems with wildly differing architectures. The first call for submissions
garnered more than 600 reproducible inference-performance measurements from 14
organizations, representing over 30 systems that showcase a wide range of
capabilities. The submissions attest to the benchmark's flexibility and
adaptability. Comment: ISCA 202
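The kind of per-query latency measurement that underlies inference benchmarking can be sketched in a few lines. This is a minimal illustration of the general idea, not MLPerf's LoadGen or its rules; the stand-in model and names are invented.

```python
import statistics
import time

# Minimal sketch of per-query inference latency measurement: run one query at
# a time, record wall-clock latency, and report summary statistics. Real
# benchmark harnesses also control query arrival patterns and accuracy checks.

def measure_latencies(infer, inputs):
    """Run one inference per input; return per-query latencies in ms."""
    latencies = []
    for x in inputs:
        t0 = time.perf_counter()
        infer(x)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    return latencies

dummy_model = lambda x: x * x        # stand-in for a real model
lats = measure_latencies(dummy_model, range(1000))
print(f"p50={statistics.median(lats):.4f} ms, max={max(lats):.4f} ms")
```

Reporting percentiles rather than means matters because tail latency, not average latency, typically bounds what a serving system can promise.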
Neural network accelerator for quantum control
Efficient quantum control is necessary for practical quantum computing
implementations with current technologies. Conventional algorithms for
determining optimal control parameters are computationally expensive, largely
excluding them from use outside of simulation. Existing hardware solutions
structured as lookup tables are imprecise and costly. By designing a machine
learning model to approximate the results of traditional tools, a more
efficient method can be produced. Such a model can then be synthesized into a
hardware accelerator for use in quantum systems. In this study, we demonstrate
a machine learning algorithm for predicting optimal pulse parameters. This
algorithm is lightweight enough to fit on a low-resource FPGA and perform
inference with a latency of 175 ns and pipeline interval of 5 ns with 0.99
gate fidelity. In the long term, such an accelerator could be used near quantum
computing hardware where traditional computers cannot operate, enabling quantum
control at a reasonable cost at low latencies without incurring large data
bandwidths outside the cryogenic environment. Comment: 7 pages, 10 figures
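The surrogate-model idea, replacing an expensive optimal-control solver with a cheap learned approximation, can be illustrated without any quantum machinery. This sketch substitutes a stand-in "expensive" function and a fitted polynomial for the paper's neural network; all names and the control model are invented.

```python
import numpy as np

# Sketch of the surrogate-model pattern: precompute outputs of a slow solver,
# fit a cheap approximator to them, then use the approximator at runtime.
# Here the "solver" is an arbitrary smooth function and the surrogate is a
# degree-5 polynomial rather than the paper's neural network.

def expensive_optimal_amplitude(theta):
    """Stand-in for a slow optimal-control routine: pulse amplitude
    needed to realize a rotation by angle theta (invented model)."""
    return 0.5 * theta + 0.1 * np.sin(theta)

thetas = np.linspace(0.0, np.pi, 200)
amps = expensive_optimal_amplitude(thetas)      # slow, done offline

coeffs = np.polyfit(thetas, amps, deg=5)        # "train" the surrogate
fast_model = np.poly1d(coeffs)                  # cheap inference

err = np.max(np.abs(fast_model(thetas) - amps))
print(f"max approximation error: {err:.1e}")
```

In the paper's setting the same pattern is pushed further: the surrogate is small enough to synthesize onto an FPGA, so inference runs in nanoseconds next to the quantum hardware.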
A transprecision floating-point cluster for efficient near-sensor data analytics
Recent applications in the domain of near-sensor computing require the
adoption of floating-point arithmetic to reconcile high precision results with
a wide dynamic range. In this paper, we propose a multi-core computing cluster
that leverages the fine-grained tunable principles of transprecision computing
to provide support to near-sensor applications at a minimum power budget. Our
design - based on the open-source RISC-V architecture - combines
parallelization and sub-word vectorization with near-threshold operation,
leading to a highly scalable and versatile system. We perform an exhaustive
exploration of the design space of the transprecision cluster on a
cycle-accurate FPGA emulator, with the aim of identifying the most efficient
configurations in terms of performance, energy efficiency, and area efficiency.
We also provide a full-fledged software stack support, including a parallel
runtime and a compilation toolchain, to enable the development of end-to-end
applications. We perform an experimental assessment of our design on a set of
benchmarks representative of the near-sensor processing domain, complementing
the timing results with a post place-&-route analysis of the power consumption.
Finally, a comparison with the state-of-the-art shows that our solution
outperforms the competitors in energy efficiency, reaching a peak of 97
Gflop/s/W on single-precision scalars and 162 Gflop/s/W on half-precision
vectors.
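The basic transprecision trade-off, shrinking the number format when the application tolerates it, can be demonstrated in software. This is an illustrative sketch, not the paper's hardware: it stores vectors in half precision to halve memory and bandwidth, accumulates in single precision, and checks the resulting error; the workload is invented.

```python
import numpy as np

# Transprecision-style trade-off in software: half-precision storage (half
# the bits per element, so half the memory traffic) with single-precision
# accumulation, compared against a full single-precision reference.

x32 = np.linspace(0.0, 1.0, 1024, dtype=np.float32)
x16 = x32.astype(np.float16)                     # 2 bytes/element vs 4

dot32 = float(np.dot(x32, x32))                  # full-precision reference
dot16 = float(np.dot(x16.astype(np.float32),     # values quantized to fp16,
                     x16.astype(np.float32)))    # accumulated in fp32

rel_err = abs(dot16 - dot32) / dot32
print(f"storage: {x32.nbytes} B -> {x16.nbytes} B, relative error {rel_err:.1e}")
```

For this well-scaled input the relative error stays far below a percent, which is the kind of headroom that lets a transprecision system claim the higher Gflop/s/W of half-precision vector units.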
A VHDL model of a superscalar implementation of the DLX instruction set architecture
The complexity of today's microprocessors demands that designers have an extensive knowledge of superscalar design techniques; this knowledge is difficult to acquire outside of a professional design team. Presently, there are a limited number of adequate resources available to students, in both textual and model form. This scarcity emphasizes the need for more models and simulators that give students the opportunity to learn about superscalar designs before entering the workforce. This thesis details the design and implementation of a superscalar version of the DLX instruction set architecture in behavioral VHDL. The branch prediction strategy, instruction issue model, and hazard avoidance techniques are all critical to superscalar processor design and are studied in this thesis. Preliminary test results demonstrate that the performance advantage of the superscalar processor holds even for short test sequences. Initial findings show a performance improvement of 26% to 57% for instruction sequences under 150 instructions.
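The hazard-avoidance logic such a design must implement can be sketched abstractly. This is a hypothetical simplification, not the thesis's VHDL: a check for the classic data hazards (RAW, WAW, WAR) that decides whether two instructions may issue together; the instruction encoding is invented.

```python
# Toy sketch of the hazard check a superscalar issue stage performs before
# dual-issuing two instructions. Each instruction is modeled as
# (dest_reg, [src_regs]); real hardware does this with comparators on
# register specifiers, not Python lists.

def can_dual_issue(i1, i2):
    """True if i2 may issue in the same cycle as the earlier i1."""
    d1, s1 = i1
    d2, s2 = i2
    raw = d1 in s2          # read-after-write: i2 reads what i1 writes
    waw = d1 == d2          # write-after-write: same destination
    war = d2 in s1          # write-after-read: i2 clobbers an i1 source
    return not (raw or waw or war)

print(can_dual_issue((1, [2, 3]), (4, [5, 6])))  # True: independent
print(can_dual_issue((1, [2, 3]), (4, [1, 6])))  # False: RAW on r1
```

Register renaming removes the WAW and WAR cases by mapping architectural to physical registers, which is why out-of-order designs pair renaming with issue logic like this.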
Error-efficient computing systems
This survey explores the theory and practice of techniques that make computing systems faster or more energy-efficient by allowing them to make controlled errors. In the same way that systems which use only as much energy as necessary are called energy-efficient, the class of systems addressed by this survey can be thought of as error-efficient: they prevent only as many errors as they need to. What constitutes an error varies across the parts of a system, and which errors are acceptable depends on the application at hand. In computing systems, making errors when behaving correctly would be too expensive can conserve resources. The resource conserved may be time: by making some errors, systems may run faster. It may be energy: a system may draw less power from its batteries or from the electrical grid by avoiding only certain errors while tolerating benign errors associated with reduced power consumption. It may even be a more abstract quantity, such as consistency in the ordering of a system's outputs. This survey is for anyone interested in an end-to-end view of techniques that make computing systems more efficient by trading errors for improved efficiency.
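One concrete technique in this space is loop perforation: skipping a fraction of loop iterations to save work at the cost of a controlled output error. The sketch below is illustrative of that general idea, not of any specific system from the survey; the workload and stride are invented.

```python
# Loop perforation sketch: sample every `stride`-th element instead of all of
# them. Work drops by roughly the stride factor; for well-behaved inputs the
# output error stays small and bounded -- an error-efficiency trade.

def mean_exact(xs):
    return sum(xs) / len(xs)

def mean_perforated(xs, stride=4):
    """~stride-fold less work, with a small error on smooth inputs."""
    sampled = xs[::stride]
    return sum(sampled) / len(sampled)

data = [float(i % 100) for i in range(10_000)]
exact = mean_exact(data)        # 49.50
approx = mean_perforated(data)  # 48.00: ~3% error for ~4x less work
print(f"exact={exact:.2f}, perforated={approx:.2f}")
```

Whether a 3% error is acceptable is exactly the application-dependent judgment the survey describes: an error-efficient system prevents only the errors the application actually cannot tolerate.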