274 research outputs found
Pairing Software-Managed Caching with Decay Techniques to Balance Reliability and Static Power in Next-Generation Caches
Since array structures represent well over half the area and transistors on-chip, maintaining their ability to scale is crucial for overall technology scaling. Shrinking transistor sizes are resulting in increased probabilities of single events causing single- and multi-bit upsets, which require the adoption of more complex and power-hungry error detection and correction codes (ECC) in hardware. At the same time, SRAM leakage energy is increasing, partly due to technology trends and partly due to the growing number of transistors present.
This paper proposes and evaluates methods of reducing the static power requirements of caches while also maintaining high reliability. In particular, we propose applying reduced ECC techniques to data that has been identified (by the programmer or compiler) as error-tolerant. This segregation, in turn, makes both the default data and the error-tolerant data more amenable to decay-based techniques for leakage control. We examine the potential of this split memory hierarchy along several dimensions, considering in particular the power and reliability issues inherent in the approach. Overall, we show that our approach allows the ECC requirements of future applications and caches to be met while also reducing leakage energy.
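To make the ECC cost concrete, the sketch below implements a classic Hamming(7,4) single-error-correcting code, the kind of hardware ECC whose complexity and power the abstract seeks to reduce for error-tolerant data. This is an illustrative example of ECC in general, not the paper's specific scheme; the function names are made up for this sketch.

```python
def hamming74_encode(d):
    """Encode 4 data bits d = [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Return the corrected codeword; any single-bit upset is repaired."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return c

cw = hamming74_encode([1, 0, 1, 1])
cw[4] ^= 1                             # inject a single-bit upset
fixed = hamming74_correct(cw)
```

Error-tolerant data could instead carry a single parity bit (detect-only) or no protection at all, which is what makes the split hierarchy cheaper.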
QDB: From Quantum Algorithms Towards Correct Quantum Programs
With the advent of small-scale prototype quantum computers, researchers can now code and run quantum algorithms that were previously proposed but not fully implemented. In support of this growing interest in quantum computing experimentation, programmers need new tools and techniques to write and debug QC code. In this work, we implement a range of QC algorithms and programs in order to discover what types of bugs occur and what defenses against those bugs are possible in QC programs. We conduct our study by running small-sized QC programs in QC simulators in order to replicate published results in QC implementations. Where possible, we cross-validate results from programs written in different QC languages for the same problems and inputs. Drawing on this experience, we provide a taxonomy for QC bugs, and we propose QC language features that would aid in writing correct code.
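One defense in the spirit of the cross-validation the abstract describes is checking a small QC program against a direct statevector calculation. Below is a hedged, stdlib-only sketch (not the paper's toolchain) that prepares a Bell state; its amplitudes can then be compared against an independently written implementation of the same circuit. All function names are illustrative.

```python
import math

def apply_h(state, q):
    """Apply a Hadamard gate to qubit q of a statevector (qubit 0 = LSB)."""
    s = 1 / math.sqrt(2)
    out = list(state)
    for i in range(len(state)):
        if not (i >> q) & 1:               # visit each |..0..>/|..1..> pair once
            j = i | (1 << q)
            a0, a1 = state[i], state[j]
            out[i] = s * (a0 + a1)
            out[j] = s * (a0 - a1)
    return out

def apply_cnot(state, ctrl, tgt):
    """Apply a CNOT: flip qubit tgt wherever qubit ctrl is 1."""
    out = list(state)
    for i in range(len(state)):
        if (i >> ctrl) & 1:
            out[i] = state[i ^ (1 << tgt)]
    return out

# Bell-state preparation: H on qubit 0, then CNOT(0 -> 1), starting from |00>.
state = [1 + 0j, 0j, 0j, 0j]
state = apply_cnot(apply_h(state, 0), 0, 1)
```

The expected result has amplitude 1/sqrt(2) on |00> and |11> and zero elsewhere; a mismatch against a second implementation flags a bug in one of them.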
TransForm: Formally Specifying Transistency Models and Synthesizing Enhanced Litmus Tests
Memory consistency models (MCMs) specify the legal ordering and visibility of
shared memory accesses in a parallel program. Traditionally, instruction set
architecture (ISA) MCMs assume that relevant program-visible memory ordering
behaviors only result from shared memory interactions that take place between
user-level program instructions. This assumption fails to account for virtual
memory (VM) implementations that may result in additional shared memory
interactions between user-level program instructions and both 1) system-level
operations (e.g., address remappings and translation lookaside buffer
invalidations initiated by system calls) and 2) hardware-level operations
(e.g., hardware page table walks and dirty bit updates) during a user-level
program's execution. These additional shared memory interactions can impact the
observable memory ordering behaviors of user-level programs. Thus, memory
transistency models (MTMs) have been coined as a superset of MCMs to
additionally articulate VM-aware consistency rules. However, no prior work has
enabled formal MTM specifications, nor methods to support their automated
analysis.
To fill the above gap, this paper presents the TransForm framework. First,
TransForm features an axiomatic vocabulary for formally specifying MTMs.
Second, TransForm includes a synthesis engine to support the automated
generation of litmus tests enhanced with MTM features (i.e., enhanced litmus
tests, or ELTs) when supplied with a TransForm MTM specification. As a case
study, we formally define an estimated MTM for Intel x86 processors, called
x86t_elt, that is based on observations made by an ELT-based evaluation of an
Intel x86 MTM implementation from prior work and available public
documentation. Given x86t_elt and a synthesis bound as input, TransForm's
synthesis engine successfully produces a set of ELTs including relevant ELTs
from prior work.

Comment: This is an updated version of the TransForm paper that features updated results reflecting performance optimizations and software bug fixes. 14 pages, 11 figures. Proceedings of the 47th Annual International Symposium on Computer Architecture (ISCA).
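To ground the litmus-test idea the abstract builds on, here is a hedged sketch (not TransForm's actual axiomatic vocabulary or synthesis engine) that enumerates all sequentially consistent interleavings of the classic store-buffering litmus test and collects the reachable outcomes. The event encoding is invented for this illustration.

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving; each event is a (thread, op) pair."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for _thread, op in schedule:
        if op[0] == "st":                  # ("st", addr, value)
            mem[op[1]] = op[2]
        else:                              # ("ld", addr, reg)
            regs[op[2]] = mem[op[1]]
    return (regs["r0"], regs["r1"])

# Store-buffering (SB) litmus test: each thread stores to one location,
# then loads the other.
t0 = [(0, ("st", "x", 1)), (0, ("ld", "y", "r0"))]
t1 = [(1, ("st", "y", 1)), (1, ("ld", "x", "r1"))]

outcomes = set()
for perm in permutations(t0 + t1):
    # Keep program order within each thread (the SC constraint on schedules).
    if (perm.index(t0[0]) < perm.index(t0[1])
            and perm.index(t1[0]) < perm.index(t1[1])):
        outcomes.add(run(perm))
```

Under SC the relaxed outcome (r0, r1) = (0, 0) is unreachable, whereas real x86 hardware can produce it via store buffers; enhanced litmus tests (ELTs) extend such tests with the system- and hardware-level VM events (e.g., TLB invalidations, page table walks) that plain MCM litmus tests omit.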
SignalGuru: Leveraging mobile phones for collaborative traffic signal schedule advisory
While traffic signals are necessary to safely control competing flows of traffic, they inevitably enforce a stop-and-go movement pattern that increases fuel consumption, reduces traffic flow and causes traffic jams. These side effects can be alleviated by providing drivers and their onboard computational devices (e.g., vehicle computer, smartphone) with information about the schedule of the traffic signals ahead. Based on when the signal ahead will turn green, drivers can then adjust speed so as to avoid coming to a complete halt. Such information is called Green Light Optimal Speed Advisory (GLOSA). Alternatively, the onboard computational device may suggest an efficient detour that will save the driver from stops and long waits at red lights ahead.
This paper introduces and evaluates SignalGuru, a novel software service that relies solely on a collection of mobile phones to detect and predict traffic signal schedules, enabling GLOSA and other novel applications. SignalGuru leverages windshield-mounted phones to opportunistically detect the current state of traffic signals with their cameras, to collaboratively communicate and learn traffic signal schedule patterns, and to predict future signal schedules.
Results from two deployments of SignalGuru, using iPhones in cars in Cambridge (MA, USA) and Singapore, show that traffic signal schedules can be predicted accurately: on average, SignalGuru's predictions come within 0.66s for pre-timed traffic signals and within 2.45s for traffic-adaptive traffic signals. Feeding SignalGuru's predicted traffic schedule to our GLOSA application, our vehicle fuel consumption measurements show savings of 20.3% on average.

National Science Foundation (U.S.) (Grant number CSR-EHS-0615175); Singapore-MIT Alliance for Research and Technology Center, Future Urban Mobility.
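The GLOSA computation itself is simple: given the predicted time until the signal turns green, advise a speed at which the driver arrives just as it changes. The sketch below is a hypothetical illustration of that idea, not SignalGuru's code; the speed limits and function name are assumptions.

```python
def glosa_advisory_speed(distance_m, time_to_green_s,
                         min_speed_mps=5.0, max_speed_mps=13.9):
    """Advised speed in m/s (13.9 m/s ~ 50 km/h), or None if a stop is unavoidable."""
    if time_to_green_s <= 0:
        return max_speed_mps            # signal already green: proceed normally
    ideal = distance_m / time_to_green_s
    if ideal > max_speed_mps:
        return None                     # cannot legally arrive after the change
    return max(ideal, min_speed_mps)    # never crawl below a practical floor

# 200 m from the light, green predicted in 20 s -> cruise at 10 m/s (36 km/h).
advised = glosa_advisory_speed(200, 20)
```

A real advisory would also account for queued vehicles, acceleration limits, and prediction uncertainty (the 0.66s/2.45s errors above), but the core arithmetic is this division.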
Noise-Adaptive Compiler Mappings for Noisy Intermediate-Scale Quantum Computers
A massive gap exists between current quantum computing (QC) prototypes and
the size and scale required for many proposed QC algorithms. Current QC
implementations are prone to noise and variability which affect their
reliability, and yet with fewer than 80 quantum bits (qubits) total, they are
too resource-constrained to implement error correction. The term Noisy
Intermediate-Scale Quantum (NISQ) refers to these current and near-term systems
of 1000 qubits or fewer. Given NISQ's severe resource constraints, low
reliability, and high variability in physical characteristics such as coherence
time or error rates, it is of pressing importance to map computations onto them
in ways that use resources efficiently and maximize the likelihood of
successful runs.
This paper proposes and evaluates backend compiler approaches to map and
optimize high-level QC programs to execute with high reliability on NISQ
systems with diverse hardware characteristics. Our techniques all start from an
LLVM intermediate representation of the quantum program (such as would be
generated from high-level QC languages like Scaffold) and generate QC
executables runnable on the IBM Q public QC machine. We then use this framework
to implement and evaluate several optimal and heuristic mapping methods. These
methods vary in how they account for the availability of dynamic machine
calibration data, the relative importance of various noise parameters, the
different possible routing strategies, and the relative importance of
compile-time scalability versus runtime success. Using real-system measurements, we show that fine-grained spatial and temporal variations in hardware parameters can be exploited to obtain an average x (and up to x) improvement in program success rate over the industry-standard IBM Qiskit compiler.

Comment: To appear in ASPLOS'1
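One simple noise-adaptive policy consistent with the abstract is to score candidate qubit mappings by the product of the success probabilities of the gates they would use, taken from calibration data, and pick the best-scoring mapping. The sketch below illustrates this with invented calibration numbers; it is not the paper's optimal or heuristic mappers.

```python
# Hypothetical per-edge two-qubit (CX) gate error rates from one calibration
# snapshot of a 4-qubit device; values are illustrative, not real IBM Q data.
cx_error = {(0, 1): 0.010, (1, 2): 0.050, (2, 3): 0.008, (0, 3): 0.020}

def success_prob(mapping, program_edges):
    """Estimate run success as the product of per-gate success probabilities."""
    p = 1.0
    for a, b in program_edges:
        qa, qb = mapping[a], mapping[b]
        edge = (min(qa, qb), max(qa, qb))
        p *= 1.0 - cx_error[edge]
    return p

# Program with one two-qubit gate between logical qubits "a" and "b":
program_edges = [("a", "b")]

# Consider placing the logical pair on each hardware edge and keep the best.
candidates = [{"a": u, "b": v} for (u, v) in cx_error]
best = max(candidates, key=lambda m: success_prob(m, program_edges))
```

Because calibration data drifts between runs, recomputing this score per execution is what makes the mapping "noise-adaptive" rather than fixed at compile time.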
Using LLMs to Facilitate Formal Verification of RTL
Formal property verification (FPV) has existed for decades and has been shown
to be effective at finding intricate RTL bugs. However, formal properties, such
as those written as SystemVerilog Assertions (SVA), are time-consuming and
error-prone to write, even for experienced users. Prior work has attempted to
lighten this burden by raising the abstraction level so that SVA is generated
from high-level specifications. However, this does not eliminate the manual
effort of reasoning and writing about the detailed hardware behavior. Motivated
by the increased need for FPV in the era of heterogeneous hardware and the
advances in large language models (LLMs), we set out to explore whether LLMs
can capture RTL behavior and generate correct SVA properties.
First, we design an FPV-based evaluation framework that measures the
correctness and completeness of SVA. Then, we evaluate GPT4 iteratively to
craft the set of syntax and semantic rules needed to prompt it toward creating
better SVA. We extend the open-source AutoSVA framework by integrating our
improved GPT4-based flow to generate safety properties, in addition to
facilitating their existing flow for liveness properties. Lastly, our use cases
evaluate (1) the FPV coverage of GPT4-generated SVA on complex open-source RTL
and (2) using generated SVA to prompt GPT4 to create RTL from scratch.
Through these experiments, we find that GPT4 can generate correct SVA even
for flawed RTL, without mirroring design errors. Notably, it generated SVA
that exposed a bug in the RISC-V CVA6 core that eluded the prior work's
evaluation.

Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Magic-State Functional Units: Mapping and Scheduling Multi-Level Distillation Circuits for Fault-Tolerant Quantum Architectures
Quantum computers have recently made great strides and are on a long-term
path towards useful fault-tolerant computation. A dominant overhead in
fault-tolerant quantum computation is the production of high-fidelity encoded
qubits, called magic states, which enable reliable error-corrected computation.
We present the first detailed designs of hardware functional units that
implement space-time optimized magic-state factories for surface code
error-corrected machines. Interactions among distant qubits require surface
code braids (physical pathways on-chip), which must be routed. Magic-state
factories are circuits composed of a complex set of braids that is more
difficult to route than the quantum circuits considered in previous work [1]. This
paper explores the impact of scheduling techniques, such as gate reordering and
qubit renaming, and we propose two novel mapping techniques: braid repulsion
and dipole moment braid rotation. We combine these techniques with graph
partitioning and community detection algorithms, and further introduce a
stitching algorithm for mapping subgraphs onto a physical machine. Our results
show a factor of 5.64 reduction in space-time volume compared to the best-known
previous designs for magic-state factories.

Comment: 13 pages, 10 figures.
Resource Optimized Quantum Architectures for Surface Code Implementations of Magic-State Distillation
Quantum computers capable of solving classically intractable problems are
under construction, and intermediate-scale devices are approaching completion.
Current efforts to design large-scale devices require allocating immense
resources to error correction, with the majority dedicated to the production of
high-fidelity ancillary states known as magic-states. Leading techniques focus
on dedicating a large, contiguous region of the processor as a single
"magic-state distillation factory" responsible for meeting the magic-state
demands of applications. In this work we design and analyze a set of optimized
factory architectural layouts that divide a single factory into spatially
distributed factories located throughout the processor. We find that
distributed factory architectures minimize the space-time volume overhead
imposed by distillation. Additionally, we find that the number of distributed
components in each optimal configuration is sensitive to application
characteristics and underlying physical device error rates. More specifically,
we find that the rate at which T-gates are demanded by an application has a
significant impact on the optimal distillation architecture. We develop an
optimization procedure that discovers the optimal number of factory
distillation rounds and number of output magic states per factory, as well as
an overall system architecture that interacts with the factories. This yields
between a 10x and 20x resource reduction compared to commonly accepted single
factory designs. Performance is analyzed across representative application
classes such as quantum simulation and quantum chemistry.

Comment: 16 pages, 14 figures.
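The resource arithmetic behind choosing the number of distillation rounds can be sketched from the standard 15-to-1 protocol, which consumes 15 magic states of error rate p and outputs one with error roughly 35*p**3: each extra round cubes (roughly) the error while multiplying the raw-state cost by 15. This is a generic illustration of that trade-off, not the paper's optimization procedure; the parameter values are made up.

```python
def rounds_needed(p_in, p_target):
    """15-to-1 rounds required to push error from p_in below p_target."""
    rounds, p = 0, p_in
    while p > p_target:
        p = 35 * p ** 3        # standard 15-to-1 output error estimate
        rounds += 1
    return rounds, p

def raw_states_required(rounds):
    """Each output magic state consumes 15**rounds raw input states."""
    return 15 ** rounds

# E.g., physical error 1e-2 and a target of 1e-10:
rounds, p_final = rounds_needed(1e-2, 1e-10)
```

Because the required rounds depend on both the device error rate and the application's T-gate demand rate, the optimal split between few large factories and many small distributed ones shifts with exactly the characteristics the abstract identifies.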