Terrestrial Cosmic Ray Induced Soft Errors and Large-Scale FPGA Systems in the Cloud
Radiation from outer space can cause soft errors in microelectronic devices deployed at terrestrial altitudes. Cosmic rays entering the Earth's atmosphere create a complex cascade of secondary particles, and at ground level the particles most likely to cause soft errors in microelectronics are neutrons. SRAM-based FPGAs are susceptible to these terrestrial cosmic ray induced soft errors. For a single device at terrestrial altitudes, such soft errors occur infrequently; when many FPGAs are deployed in a large-scale system, however, their impact on reliability can be significant. This study examines terrestrial cosmic ray induced soft errors and the effects they can have on large-scale deployments of FPGAs in cloud computing. Fifteen data-center-like designs were tested for sensitivity through fault injection. Sensitivities ranged from less than 1% to about 12% of randomly injected faults resulting in unacceptable behavior. A hypothetical but realistic large-scale FPGA system, with 100,000 nodes deployed at high altitude and running the most sensitive design, would experience the dominant failure mode of silent data corruption every 3.8 hours on average. Such a system would maintain a reliability level above 0.99 for only about two minutes. Some soft error detection and recovery approaches are discussed.
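Under a standard exponential failure model, the "about two minutes above 0.99 reliability" figure follows directly from the 3.8-hour MTTF. A minimal sketch (the exponential model is a common reliability assumption, not stated explicitly in the abstract):

```python
import math

# Figures from the study: most sensitive design on a hypothetical
# 100,000-node, high-altitude deployment.
mttf_hours = 3.8  # mean time to silent data corruption, system-wide

def reliability(t_hours, mttf=mttf_hours):
    """Exponential reliability model: R(t) = exp(-t / MTTF)."""
    return math.exp(-t_hours / mttf)

# Time until reliability falls below a target level R: t = -MTTF * ln(R)
t_hours = -mttf_hours * math.log(0.99)
print(f"R(t) stays above 0.99 for {t_hours * 60:.1f} minutes")  # ~2.3 minutes
```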
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This approach allows us to study the effects of our
undervolting technique under both software and hardware variability. We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that is set by the FPGA vendor to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%.
Comment: To appear at the DSN 2020 conference.
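As a rough illustration of the GOPs/W power-efficiency metric, a sketch with hypothetical throughput and power numbers (only the "more than 3X" overall gain is reported in the abstract; the specific GOPs and watt values below are invented):

```python
# Hypothetical numbers chosen for illustration; only the >3X overall
# power-efficiency gain is reported in the abstract.
gops = 400.0              # sustained CNN throughput, GOPs
power_nominal_w = 20.0    # board power at nominal voltage
power_undervolt_w = 6.0   # board power after undervolting

eff_nominal = gops / power_nominal_w      # 20.0 GOPs/W
eff_undervolt = gops / power_undervolt_w  # ~66.7 GOPs/W
gain = eff_undervolt / eff_nominal        # ~3.3x, i.e. "more than 3X"
```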
Towards Quantum Belief Propagation for LDPC Decoding in Wireless Networks
We present Quantum Belief Propagation (QBP), a Quantum Annealing (QA) based
decoder design for Low Density Parity Check (LDPC) error control codes, which
have found many useful applications in Wi-Fi, satellite communications, mobile
cellular systems, and data storage systems. QBP reduces the LDPC decoding to a
discrete optimization problem, then embeds that reduced design onto quantum
annealing hardware. QBP's embedding design can support LDPC codes of block
length up to 420 bits on real state-of-the-art QA hardware with 2,048 qubits.
We evaluate performance on real quantum annealer hardware, performing
sensitivity analyses on a variety of parameter settings. Our design achieves a
bit error rate of in 20 s and a 1,500 byte frame error rate of
in 50 s at SNR 9 dB over a Gaussian noise wireless channel.
Further experiments measure performance over real-world wireless channels,
requiring 30 s to achieve a 1,500 byte 99.99% frame delivery rate at
SNR 15-20 dB. QBP achieves a performance improvement over an FPGA based soft
belief propagation LDPC decoder, by reaching a bit error rate of and
a frame error rate of at an SNR 2.5--3.5 dB lower. In terms of
limitations, QBP currently cannot realize practical protocol-sized
( Wi-Fi, WiMax) LDPC codes on current QA processors. Our
further studies in this work present future cost, throughput, and QA hardware
trend considerations.
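The reduction of LDPC decoding to a discrete optimization problem can be sketched with a brute-force energy minimizer over a toy parity-check code (the matrix and penalty weight here are illustrative, not the paper's embedding; a real QA formulation would express this energy as an Ising/QUBO model and embed it on annealing hardware):

```python
from itertools import product

# Toy parity-check matrix H for a length-6 code. Illustrative only; the
# paper's design targets real LDPC codes on QA hardware.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def energy(c, r, penalty=2.0):
    """Weighted count of unsatisfied parity checks plus Hamming distance to r."""
    unsat = sum(sum(h * ci for h, ci in zip(row, c)) % 2 for row in H)
    dist = sum(ci != ri for ci, ri in zip(c, r))
    return penalty * unsat + dist

received = [1, 0, 1, 1, 0, 0]  # hard-decision channel output, one bit flipped
best = min(product([0, 1], repeat=6), key=lambda c: energy(c, received))
print(best)  # decoded codeword: (1, 0, 1, 1, 1, 0)
```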
Belle II Technical Design Report
The Belle detector at the KEKB electron-positron collider has collected
almost 1 billion Y(4S) events in its decade of operation. Super-KEKB, an
upgrade of KEKB, is under construction to increase the luminosity by two orders
of magnitude after a three-year shutdown, with an ultimate goal of 8E35 /cm^2/s
luminosity. To exploit the increased luminosity, an upgrade of the Belle
detector has been proposed. A new international collaboration, Belle II, is
being formed. The Technical Design Report presents the physics motivation, basic
methods of the accelerator upgrade, as well as key improvements of the
detector.
Comment: Edited by Z. Doležal and S. Un
VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing
The hardware implementation of deep neural networks (DNNs) has recently
received tremendous attention: many applications in fact require high-speed
operations that suit a hardware implementation. However, numerous elements and
complex interconnections are usually required, leading to a large area
occupation and copious power consumption. Stochastic computing has shown
promising results for low-power area-efficient hardware implementations, even
though existing stochastic algorithms require long streams that cause long
latencies. In this paper, we propose an integer form of stochastic computation
and introduce some elementary circuits. We then propose an efficient
implementation of a DNN based on integral stochastic computing. The proposed
architecture has been implemented on a Virtex7 FPGA, resulting in 45% and 62%
average reductions in area and latency compared to the best reported
architecture in the literature. We also synthesize the circuits in a 65 nm CMOS
technology and we show that the proposed integral stochastic architecture
results in up to 21% reduction in energy consumption compared to the binary
radix implementation at the same misclassification rate. Due to the
fault-tolerant nature of stochastic architectures, we also consider a
quasi-synchronous implementation which yields a 33% reduction in energy
consumption w.r.t. the binary radix implementation without any compromise on
performance.
Comment: 11 pages, 12 figures
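The core idea of conventional (unipolar) stochastic computing, which the integral form generalizes to integer-valued streams, can be sketched as multiplication via a single AND gate per bit. The values, stream length, and seed below are illustrative:

```python
import random

random.seed(0)  # fixed seed so the estimate is reproducible

def stream(p, n):
    """Unipolar stochastic stream: each bit is 1 with probability p in [0, 1]."""
    return [1 if random.random() < p else 0 for _ in range(n)]

def sc_multiply(a, b, n=100_000):
    """Multiply two values in [0, 1] using one AND gate per bit position."""
    sa, sb = stream(a, n), stream(b, n)
    return sum(x & y for x, y in zip(sa, sb)) / n

est = sc_multiply(0.5, 0.8)  # converges to 0.4 as n grows
```

The long streams needed for accurate estimates are exactly the latency drawback the paper addresses; integral stochastic computing shortens them by letting each stream element carry an integer rather than a single bit.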
New Design Techniques for Dynamic Reconfigurable Architectures
The abstract is in the attachment.
Single Event Effects Assessment of UltraScale+ MPSoC Systems under Atmospheric Radiation
The AMD UltraScale+ XCZU9EG device is a Multi-Processor System-on-Chip
(MPSoC) with embedded Programmable Logic (PL) that excels in many Edge (e.g.,
automotive or avionics) and Cloud (e.g., data centres) terrestrial
applications. However, it incorporates a large amount of SRAM cells, making the
device vulnerable to Neutron-induced Single Event Upsets (NSEUs) or otherwise
soft errors. Semiconductor vendors incorporate soft error mitigation mechanisms
to recover memory upsets (i.e., faults) before they propagate to the
application output and become an error. But how effective are the MPSoC's
mitigation schemes? Can they effectively recover upsets in high-altitude or
large-scale applications under different workloads? This article answers these
research questions through a study that combines accelerated neutron radiation
testing with dependability analysis. We test the device on a broad
range of workloads, like multi-threaded software used for pose estimation and
weather prediction or a software/hardware (SW/HW) co-design image
classification application running on the AMD Deep Learning Processing Unit
(DPU). Assuming a one-node MPSoC system in New York City (NYC) at 40k feet, all
tested software applications achieve a Mean Time To Failure (MTTF) greater than
148 months, which shows that upsets are effectively recovered in the processing
system of the MPSoC. However, the SW/HW co-design (i.e., DPU) in the same
one-node system at 40k feet has an MTTF = 4 months due to the high failure rate
of its PL accelerator, which emphasises that some MPSoC workloads may require
additional NSEU mitigation schemes. Nevertheless, we show that the MTTF of the
DPU can increase to 87 months without any overhead if one disregards the
failure rate of tolerable errors since they do not affect the correctness of
the classification output.
Comment: This manuscript is under review at IEEE Transactions on Reliability.
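The relationship between the two DPU MTTF figures can be made concrete with a constant-failure-rate sketch (the 4- and 87-month figures are from the abstract; treating the failure rates as simply additive is an assumption):

```python
# MTTF figures reported for the DPU in a one-node system at 40k feet.
mttf_all_months = 4.0        # counting every observed failure
mttf_critical_months = 87.0  # counting only errors that change the output

# Constant-rate assumption: rate = 1 / MTTF, and tolerable errors account
# for the difference between the two rates.
rate_all = 1.0 / mttf_all_months
rate_critical = 1.0 / mttf_critical_months
tolerable_fraction = 1.0 - rate_critical / rate_all  # ~0.95
```

Under this reading, roughly 95% of the DPU's observed errors were tolerable, which is why disregarding them raises the MTTF from 4 to 87 months without any mitigation overhead.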
METICULOUS: An FPGA-based Main Memory Emulator for System Software Studies
Due to the scaling problem of DRAM technology, non-volatile memory
devices, which are based on different principles of operation than DRAM, are now
being intensively developed to expand the main memory of computers.
Disaggregated memory is also drawing attention as an emerging technology to
scale up the main memory. Although system software studies need to discuss
management mechanisms for the new main memory designs incorporating such
emerging memory systems, there are no feasible memory emulation mechanisms that
efficiently work for large-scale, privileged programs such as operating systems
and hypervisors. In this paper, we propose an FPGA-based main memory emulator
for system software studies on new main memory systems. It can emulate the main
memory incorporating multiple memory regions with different performance
characteristics. For the address region of each memory device, it emulates the
latency, bandwidth, and bit-flip error rate of read and write operations. The
emulator is implemented as a hardware module of an off-the-shelf FPGA
System-on-Chip board. Any privileged or unprivileged software program running
on its 64-bit CPU cores can access emulated main memory devices at a practical
speed through exactly the same interface as normal DRAM main memory. We
confirmed that the emulator transparently worked
for CPU cores and successfully changed the performance of a memory region
according to given emulation parameters; for example, the latencies measured by
CPU cores were exactly proportional to the latencies inserted by the emulator,
with a minimum overhead of approximately 240 ns. As a preliminary use
case, we confirmed that the emulator allows us to change the bandwidth limit
and the inserted latency individually for unmodified software programs, making
discussions on latency sensitivity much easier.
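The proportionality result can be captured by a simple additive latency model (the ~240 ns overhead is from the text; the base DRAM latency and the inserted delays are hypothetical):

```python
# Minimal model of the emulator's latency behavior: observed latency =
# native DRAM latency + fixed emulator overhead + inserted delay.
base_ns = 100.0      # hypothetical native DRAM access latency
overhead_ns = 240.0  # minimum emulator overhead reported in the study

def observed_latency(inserted_ns):
    """Latency a CPU core would measure for a given inserted delay."""
    return base_ns + overhead_ns + inserted_ns

# Measured latencies grow one-for-one with the inserted delay:
deltas = [observed_latency(d) - observed_latency(0) for d in (100, 200, 400)]
print(deltas)  # [100.0, 200.0, 400.0]
```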