Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques
The rapid growth of demanding applications in domains applying multimedia
processing and machine learning has marked a new era for edge and cloud
computing. These applications involve massive data and compute-intensive tasks,
and thus, typical computing paradigms in embedded systems and data centers are
stressed to meet the worldwide demand for high performance. Concurrently, the
landscape of the semiconductor field over the last 15 years has established power
as a first-class design concern. As a result, the computing systems community
has been forced to seek alternative design approaches that enable
high-performance and/or power-efficient computing. Among the examined
solutions, Approximate Computing has attracted an ever-increasing interest,
with research works applying approximations across the entire traditional
computing stack, i.e., at software, hardware, and architectural levels. Over
the last decade, a plethora of approximation techniques has emerged in software
(programs, frameworks, compilers, runtimes, languages), hardware (circuits,
accelerators), and architectures (processors, memories). The current article is
Part I of our comprehensive survey on Approximate Computing; it reviews the
motivation, terminology, and principles of the field, and classifies and
presents the technical details of state-of-the-art software and hardware
approximation techniques.
Comment: Under Review at ACM Computing Surveys
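A canonical software-level approximation technique from this family is loop perforation, which skips a fraction of loop iterations to trade result quality for time and energy. The sketch below is an illustrative toy (the function name and numbers are ours, not from the survey):

```python
def perforated_mean(data, skip=2):
    """Approximate the mean by sampling every `skip`-th element
    (loop perforation): fewer iterations, bounded accuracy loss."""
    sampled = data[::skip]
    return sum(sampled) / len(sampled)

exact = sum(range(100)) / 100                        # 49.5
approx = perforated_mean(list(range(100)), skip=2)   # 49.0
```

With `skip=2` only half the elements are touched, and the error here stays below one percent; the acceptable `skip` value is application-specific.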
Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators
Analog in-memory computing (AIMC) -- a promising approach for
energy-efficient acceleration of deep learning workloads -- computes
matrix-vector multiplications (MVMs) but only approximately, due to
nonidealities that often are non-deterministic or nonlinear. This can adversely
impact the achievable deep neural network (DNN) inference accuracy as compared
to a conventional floating point (FP) implementation. While retraining has
previously been suggested to improve robustness, prior work has explored only a
few DNN topologies, using disparate and overly simplified AIMC hardware models.
Here, we use hardware-aware (HWA) training to systematically examine the
accuracy of AIMC for multiple common artificial intelligence (AI) workloads
across multiple DNN topologies, and investigate sensitivity and robustness to a
broad set of nonidealities. By introducing a new and highly realistic AIMC
crossbar model, we improve significantly on earlier retraining approaches. We
show that many large-scale DNNs of various topologies, including convolutional
neural networks (CNNs), recurrent neural networks (RNNs), and transformers, can
in fact be successfully retrained to show iso-accuracy on AIMC. Our results
further suggest that AIMC nonidealities that add noise to the inputs or
outputs, not the weights, have the largest impact on DNN accuracy, and that
RNNs are particularly robust to all nonidealities.
Comment: 35 pages, 7 figures, 5 tables
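Hardware-aware training of this kind typically injects a model of the hardware's noise into the forward pass so that the learned weights become robust to it. A minimal sketch, assuming a simple additive Gaussian output-noise model (the noise form and magnitude are illustrative, not the paper's crossbar model):

```python
import numpy as np

def noisy_mvm(weights, x, out_noise_std=0.02, rng=None):
    """Analog-style matrix-vector multiply with additive Gaussian
    output noise, a toy stand-in for AIMC crossbar nonidealities
    injected during hardware-aware training."""
    rng = rng if rng is not None else np.random.default_rng(0)
    y = weights @ x                               # ideal MVM result
    scale = out_noise_std * (np.abs(y).max() + 1e-12)
    return y + rng.normal(0.0, scale, size=y.shape)

W = np.array([[1.0, -2.0], [0.5, 3.0]])
x = np.array([0.2, 0.4])
y_noisy = noisy_mvm(W, x)   # close to W @ x, but perturbed
```

Training against such a perturbed forward pass (rather than the exact product) is what pushes the network toward iso-accuracy on the noisy hardware.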
DOTA: A Dynamically-Operated Photonic Tensor Core for Energy-Efficient Transformer Accelerator
The wide adoption and significant computing resource consumption of
attention-based Transformers, e.g., Vision Transformer and large language
models, have driven the demands for efficient hardware accelerators. While
electronic accelerators have been commonly used, there is a growing interest in
exploring photonics as an alternative technology due to its high energy
efficiency and ultra-fast processing speed. Optical neural networks (ONNs) have
demonstrated promising results for convolutional neural network (CNN) workloads
that only require weight-static linear operations. However, they fail to
efficiently support Transformer architectures with attention operations because
they cannot process dynamic, full-range tensor multiplication. In
this work, we propose a customized high-performance and energy-efficient
photonic Transformer accelerator, DOTA. To overcome the fundamental limitation
of existing ONNs, we introduce a novel photonic tensor core, consisting of a
crossbar array of interference-based optical vector dot-product engines, that
supports highly-parallel, dynamic, and full-range matrix-matrix multiplication.
Our comprehensive evaluation demonstrates that DOTA achieves a >4x energy and a
>10x latency reduction compared to prior photonic accelerators, and delivers
over 20x energy reduction and 2 to 3 orders of magnitude lower latency compared
to the electronic Transformer accelerator. Our work highlights the immense
potential of photonic computing for efficient hardware accelerators,
particularly for advanced machine learning workloads.
Comment: The short version is accepted by Next-Gen AI System Workshop at MLSys 202
Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures
With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources such as the cache are usually shared to maximize utilization, but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches. However, microarchitectural optimizations, which were long assumed to be fundamentally secure, can be used in side-channel attacks to exploit secrets such as cryptographic keys. Timing-based side-channels exploit predictable timing variations caused by the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to leverage microarchitectural optimizations for performance without compromising security.

This thesis contributes three adaptive microarchitectural resource management optimizations to improve the security and/or performance of multi-core architectures, along with a systematization-of-knowledge of timing-based side-channel attacks. We observe that high-performance cache partitioning in a multi-core system has three requirements: i) fine-grained partitions, ii) locality-aware placement, and iii) frequent changes. These requirements lead to high overheads for current centralized partitioning solutions, especially as the number of cores in the system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. Allocations occur through core-to-core challenges, in which applications with a larger performance benefit gain cache capacity.
The solution is implementable in hardware due to its low computational complexity and can scale to large core counts. According to our analysis, better performance can be achieved by coordinating multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but this is challenging due to the increased number of possible allocations that need to be evaluated. Based on these observations, we present a solution (CBP) for coordinated management of three optimizations: cache partitioning, bandwidth partitioning, and prefetching. Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space.

The continuously growing number of side-channel attacks leveraging microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify four root causes of timing-based side-channel attacks: determinism, sharing, access violation, and information flow. Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection. Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities. Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance. To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness and provides quantifiable, information-theoretic security guarantees using differential privacy. It closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications.
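The core-to-core challenge idea can be caricatured in a few lines: cores compare estimated marginal benefits, and cache ways migrate toward the winner of each challenge. The sketch below is our own simplification for illustration, not the published DELTA algorithm:

```python
def challenge(alloc, benefit):
    """One round of core-to-core cache-way challenges: the core with
    the highest estimated marginal benefit takes one way from the core
    with the lowest, if it stands to gain more. Toy sketch only."""
    winner = max(benefit, key=benefit.get)
    loser = min(benefit, key=benefit.get)
    if winner != loser and benefit[winner] > benefit[loser] and alloc[loser] > 0:
        alloc[loser] -= 1    # loser gives up one cache way
        alloc[winner] += 1   # winner gains it
    return alloc

alloc = {"core0": 4, "core1": 4}
benefit = {"core0": 0.9, "core1": 0.1}
challenge(alloc, benefit)   # {"core0": 5, "core1": 3}
```

Repeating such local challenges asynchronously, rather than solving one global allocation problem, is what keeps the overhead low as core counts grow.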
Spatial-temporal domain charging optimization and charging scenario iteration for EV
Environmental problems have become increasingly serious around the world. With their lower carbon emissions, Electric Vehicles (EVs) have been adopted on a large scale over the past few years. However, EVs are limited by battery capacity and require frequent charging; currently, they suffer from long charging times and charging congestion. Therefore, EV charging optimization is vital to ensure drivers' mobility. This study first presents a literature analysis of the current charging-mode taxonomy to elucidate the advantages and disadvantages of different charging modes. For the plug-in charging mode, an Urgency First Charging (UFC) scheduling policy is proposed with collaborative optimization in the spatial-temporal domain. The UFC policy allows EVs with charging urgency to receive preempted charging services. As the conventional plug-in charging mode is limited by the deployment of Charging Stations (CSs), this study further introduces and optimizes Vehicle-to-Vehicle (V2V) charging, aiming to maximize the utilization of charging infrastructure and to balance the grid load. The proposed reservation-based V2V charging scheme optimizes the pair matching of EVs based on minimized distance. Meanwhile, this V2V scheme allows more EVs to get fully charged via parking-lot allocation based on minimized waiting time. Constrained by shortcomings (the rigid location of CSs and the slow charging power of V2V converters), a single charging mode can hardly meet a large number of parallel charging requests. Thus, this study further proposes a hybrid charging mode that utilizes the advantages of the plug-in and V2V modes to alleviate the pressure on the grid. Finally, this study addresses the potential problems of EV charging with a view to further optimizing EV charging in subsequent studies.
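The distance-minimizing pair matching at the heart of the V2V scheme can be sketched as a greedy nearest-pair assignment between supplier and demander EVs. The snippet below is a simplified stand-in (names and coordinates are illustrative, not from the study):

```python
from itertools import product
from math import dist

def match_v2v_pairs(suppliers, demanders):
    """Greedily pair supplier EVs with demander EVs by smallest
    Euclidean distance. Inputs are dicts mapping EV id -> (x, y)."""
    pairs = []
    free_s, free_d = set(suppliers), set(demanders)
    while free_s and free_d:
        # pick the closest remaining supplier-demander pair
        s, d = min(product(free_s, free_d),
                   key=lambda p: dist(suppliers[p[0]], demanders[p[1]]))
        pairs.append((s, d))
        free_s.remove(s)
        free_d.remove(d)
    return pairs

suppliers = {"s1": (0, 0), "s2": (10, 10)}
demanders = {"d1": (9, 9), "d2": (2, 2)}
match_v2v_pairs(suppliers, demanders)   # [("s2", "d1"), ("s1", "d2")]
```

A greedy nearest-pair rule is a common baseline for such matching; the study's reservation-based scheme additionally accounts for waiting times and parking-lot allocation.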
Serial Biasing Technique for Rapid Single Flux Quantum Circuits
Superconductor electronics based on the Single Flux Quantum (SFQ) technology are considered a strong contender for the ‘beyond CMOS’ future of digital circuits because of the high speed and low power dissipation associated with them. In fact, digital operations beyond tens of GHz have been routinely demonstrated in the SFQ technology. These circuits have widespread applications such as high-speed analog-to-digital conversion, digital signal processing, high speed computing and in emerging topics such as control circuitry for superconducting quantum computing.
Rapid Single Flux Quantum (RSFQ) circuits have emerged as a promising candidate within the SFQ technology, with information encoded in picosecond wide, milli-volt voltage pulses. As is the case with any integrated circuit technology, scalability of RSFQ circuits is essential to realizing their applications. These circuits, based on the Josephson junction, require a DC bias current for the correct operation. The DC bias current requirement increases with circuit complexity, and this has multiple implications on circuit operation. Large currents produce magnetic fields that can interfere with logic operation. Furthermore, the heat load delivered to the superconducting chip also increases with current which could result in the circuit becoming ‘normal’ and not superconducting. These problems make reduction of the bias current necessary.
Serial Biasing (SB) is a bias current reduction technique that has been proposed in the past. In this technique, a digital circuit is partitioned into multiple identical islands, and bias current is provided to each island in a serial manner. While this scheme is promising, there are multiple challenges, such as designing a driver-receiver pair circuit that yields robust and wide operating bias margins, managing current on the floating islands, etc.
This thesis investigates SB in a systematic manner, focusing on the design and measurement of the fundamental components of this technique, with an emphasis on reliability and scalability. It presents circuit techniques that achieve high-speed serially biased RSFQ circuits with robust operating margins, along with experimental evidence supporting these ideas. It develops a framework for serial biasing that could be used by electronic design tools to automate the design and synthesis of complex RSFQ circuits. It also investigates Passive Transmission Lines (PTLs) for use as passive interconnects between library cells in a complex design, reducing the DC bias current required by the active circuitry.
Toward Fault-Tolerant Applications on Reconfigurable Systems-on-Chip
The abstract is in the attachment.
Sparse-firing regularization methods for spiking neural networks with time-to-first spike coding
The training of multilayer spiking neural networks (SNNs) using the error
backpropagation algorithm has made significant progress in recent years. Among
the various training schemes, the error backpropagation method that directly
uses the firing time of neurons has attracted considerable attention because it
can realize ideal temporal coding. This method uses time-to-first spike (TTFS)
coding, in which each neuron fires at most once, and this restriction on the
number of firings enables information to be processed at a very low firing
frequency. This low firing frequency increases the energy efficiency of
information processing in SNNs, which is important not only because of its
similarity with information processing in the brain, but also from an
engineering point of view. However, TTFS coding only imposes an upper limit on
the number of firings, and the information-processing capability of SNNs at
even lower firing frequencies has not been fully investigated. In this paper, we propose
two spike timing-based sparse-firing (SSR) regularization methods to further
reduce the firing frequency of TTFS-coded SNNs. The first is the membrane
potential-aware SSR (M-SSR) method, which has been derived as an extreme form
of the loss function of the membrane potential value. The second is the firing
condition-aware SSR (F-SSR) method, which is a regularization function obtained
from the firing conditions. Both methods are characterized by the fact that
they only require information about the firing timing and associated weights.
The effects of these regularization methods were investigated on the MNIST,
Fashion-MNIST, and CIFAR-10 datasets using multilayer perceptron networks and
convolutional neural network structures.
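For readers unfamiliar with TTFS coding, the idea that each neuron fires at most once, with stronger inputs firing earlier, can be sketched as follows (a minimal illustration of the coding only; the SSR regularizers themselves are not reproduced here, and the mapping is our own choice):

```python
def ttfs_encode(intensities, t_max=10.0):
    """Time-to-first-spike encoding: each input fires at most once,
    with larger intensities mapped to earlier spike times. Inputs at
    or below zero never fire (None), giving very sparse activity."""
    times = []
    for x in intensities:
        if x <= 0:
            times.append(None)                      # neuron stays silent
        else:
            times.append(t_max * (1.0 - min(x, 1.0)))  # x = 1 fires at t = 0
    return times

ttfs_encode([1.0, 0.5, 0.0])   # [0.0, 5.0, None]
```

Because information lives in *when* a spike occurs rather than *how many* spikes occur, pushing neurons toward silence, as the SSR regularizers do, lowers energy without necessarily destroying the code.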