Synthetic Aperture Radar Algorithms on Transport Triggered Architecture Processors using OpenCL
Live SAR imaging from small UAVs is an emerging field. On-board processing of the radar data requires high-performance and energy-efficient platforms. One class of candidates is Transport Triggered Architecture (TTA) processors. We implement Backprojection and Backprojection Autofocus using OpenCL on a TTA processor specially adapted for this task. The resulting implementation is compared to other platforms in terms of energy efficiency. We find that the TTA is on par with embedded GPUs and surpasses other OpenCL-based platforms. It is outperformed only by a dedicated FPGA implementation.
© 2023 IEEE.
A database accelerator for energy-efficient query processing and optimization
Data processing on a continuously growing amount of information and the increasing power restrictions have become a ubiquitous challenge in our world today. Besides parallel computing, a promising approach to improve the energy efficiency of current systems is to integrate specialized hardware. This paper presents a Tensilica RISC processor extended with an instruction set to accelerate basic database operators frequently used in modern database systems. The core was taped out in a 28 nm SLP CMOS technology and allows energy-efficient query processing as well as query optimization by applying selectivity estimation techniques. Our chip measurements show a 1000x energy improvement on selected database operators compared to state-of-the-art systems.
FireFly: A High-Throughput and Reconfigurable Hardware Accelerator for Spiking Neural Networks
Spiking neural networks (SNNs) have been widely used due to their strong
biological interpretability and high energy efficiency. With the introduction
of the backpropagation algorithm and surrogate gradient, the structure of
spiking neural networks has become more complex, and the performance gap with
artificial neural networks has gradually decreased. However, most SNN hardware
implementations for field-programmable gate arrays (FPGAs) cannot meet
arithmetic or memory efficiency requirements, which significantly restricts the
development of SNNs. They do not delve into the arithmetic operations between
the binary spikes and synaptic weights or assume unlimited on-chip RAM
resources by using overly expensive devices on small tasks. To improve
arithmetic efficiency, we analyze the neural dynamics of spiking neurons,
generalize the SNN arithmetic operation to the multiplex-accumulate operation,
and propose a high-performance implementation of such operation by utilizing
the DSP48E2 hard block in Xilinx Ultrascale FPGAs. To improve memory
efficiency, we design a memory system to enable efficient synaptic weights and
membrane voltage memory access with reasonable on-chip RAM consumption.
Combining the above two improvements, we propose an FPGA accelerator that can
process spikes generated by the firing neuron on-the-fly (FireFly). FireFly is
implemented on several FPGA edge devices with limited resources but still
guarantees a peak performance of 5.53 TSOP/s at 300 MHz. As a lightweight
accelerator, FireFly achieves the highest computational density efficiency
compared with existing research using large FPGA devices.
Architecture and Circuit Design Optimization for Compute-In-Memory
The objective of the proposed research is to optimize computing-in-memory (CIM) design for accelerating Deep Neural Network (DNN) algorithms. As compute peripheries such as analog-to-digital converters (ADCs) introduce significant overhead in CIM inference design, the research first focuses on circuit optimization for inference acceleration and proposes a resistive random access memory (RRAM) based ADC-free in-memory compute scheme. We comprehensively explore the trade-offs involving different types of ADCs and investigate a new ADC design especially suited for CIM, which performs the analog shift-add for multiple weight significance bits, improving the throughput and energy efficiency under similar area constraints. Furthermore, we prototype an ADC-free CIM inference chip design with fully analog data processing between sub-arrays, which can significantly improve the hardware performance over conventional CIM designs and achieve near-software classification accuracy on the ImageNet and CIFAR-10/-100 datasets. Secondly, the research focuses on hardware support for CIM on-chip training. To maximize hardware reuse of the CIM weight-stationary dataflow, we propose CIM training architectures with a transpose weight mapping strategy. The cell design and periphery circuitry are modified to efficiently support bi-directional compute. A novel solution for signed number multiplication is also proposed to handle negative inputs in backpropagation. Finally, we propose an SRAM-based CIM training architecture and comprehensively explore the system-level hardware performance for DNN on-chip training based on silicon measurement results.
Ph.D.
Towards Fast and Scalable Private Inference
Privacy and security have rapidly emerged as first order design constraints.
Users now demand more protection over who can see their data (confidentiality)
as well as how it is used (control). Here, existing cryptographic techniques
for security fall short: they secure data when stored or communicated but must
decrypt it for computation. Fortunately, a new paradigm of computing exists,
which we refer to as privacy-preserving computation (PPC). Emerging PPC
technologies can be leveraged for secure outsourced computation or to enable
two parties to compute without revealing either user's secret data. Despite
their phenomenal potential to revolutionize user protection in the digital age,
the realization has been limited due to exorbitant computational,
communication, and storage overheads.
This paper reviews recent efforts on addressing various PPC overheads using
private inference (PI) in neural networks as a motivating application. First,
the problem and various technologies, including homomorphic encryption (HE),
secret sharing (SS), garbled circuits (GCs), and oblivious transfer (OT), are
introduced. Next, a characterization of their overheads when used to implement
PI is covered. The characterization motivates the need for both GCs and HE
accelerators. Then two solutions are presented: HAAC for accelerating GCs and
RPU for accelerating HE. To conclude, results and effects are shown with a
discussion on what future work is needed to overcome the remaining overheads of
PI.
Comment: Appear in the 20th ACM International Conference on Computing
Frontier
Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques
The rapid growth of demanding applications in domains applying multimedia
processing and machine learning has marked a new era for edge and cloud
computing. These applications involve massive data and compute-intensive tasks,
and thus, typical computing paradigms in embedded systems and data centers are
stressed to meet the worldwide demand for high performance. Concurrently, the
landscape of the semiconductor field in the last 15 years has constituted power
as a first-class design concern. As a result, the community of computing
systems is forced to find alternative design approaches to facilitate
high-performance and/or power-efficient computing. Among the examined
solutions, Approximate Computing has attracted an ever-increasing interest,
with research works applying approximations across the entire traditional
computing stack, i.e., at software, hardware, and architectural levels. Over
the last decade, there has been a plethora of approximation techniques in software
(programs, frameworks, compilers, runtimes, languages), hardware (circuits,
accelerators), and architectures (processors, memories). The current article is
Part I of our comprehensive survey on Approximate Computing, and it reviews its
motivation, terminology and principles, and classifies and presents the
technical details of the state-of-the-art software and hardware approximation
techniques.
Comment: Under Review at ACM Computing Survey
NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs
Spawning duplicate requests, called cloning, is a powerful technique to
reduce tail latency by masking service-time variability. However, traditional
client-based cloning is static and harmful to performance under high load,
while a recent coordinator-based approach is slow and not scalable. Both
approaches are insufficient to serve modern microsecond-scale Remote Procedure
Calls (RPCs). To this end, we present NetClone, a request cloning system that
performs cloning decisions dynamically within nanoseconds at scale. Rather than
the client or the coordinator, NetClone performs request cloning in the network
switch by leveraging the capability of programmable switch ASICs. Specifically,
NetClone replicates requests based on server states and blocks redundant
responses using request fingerprints in the switch data plane. To realize the
idea while satisfying the strict hardware constraints, we address several
technical challenges when designing a custom switch data plane. NetClone can be
integrated with emerging in-network request schedulers like RackSched. We
implement a NetClone prototype with an Intel Tofino switch and a cluster of
commodity servers. Our experimental results show that NetClone can improve the
tail latency of microsecond-scale RPCs for synthetic and real-world application
workloads and is robust to various system conditions.
Comment: 13 pages, ACM SIGCOMM 202
The cusp plasma imaging detector (CuPID) cubesat observatory: instrumentation
The Cusp Plasma Imaging Detector (CuPID) CubeSat observatory is a 6U CubeSat designed to observe solar wind charge exchange in magnetospheric cusps to test competing theories of magnetic reconnection at the Earth's magnetopause. The CuPID is equipped with three instruments, namely, a wide field-of-view (4.6° × 4.6°) soft x-ray telescope, a micro-dosimeter suite, and an engineering magnetometer optimized for the science operation. The instrument suite has been tested and calibrated in relevant environments, demonstrating successful design. The testing and calibration of these instruments produced metrics and coefficients that will be used to create the CuPID mission's data product.
NNX16AJ73G - NASA
Published version
A Security RISC: Microarchitectural Attacks on Hardware RISC-V CPUs
Microarchitectural attacks threaten the security of computer systems even in the absence of software vulnerabilities. Such attacks are well explored on x86 and ARM CPUs, with a wide range of proposed but not-yet-deployed hardware countermeasures. With the standardization of the RISC-V instruction set architecture and the announcement of support for the architecture by major processor vendors, RISC-V CPUs are on the verge of becoming ubiquitous. However, the microarchitectural attack surface of the first commercially available RISC-V hardware CPUs is not yet explored. This paper analyzes the two commercially available off-the-shelf 64-bit RISC-V (hardware) CPUs used in most RISC-V systems running a full-fledged commodity Linux system. We evaluate the microarchitectural attack surface, which leads to the introduction of 3 new microarchitectural attack techniques: Cache+Time, a novel cache-line-granular cache attack without shared memory; Flush+Fault, exploiting the Harvard cache architecture for Flush+Reload; and CycleDrift, exploiting unprivileged access to instruction-retirement information. Additionally, we show that many known attacks are applicable to these RISC-V CPUs, mainly due to nonexistent hardware countermeasures and instruction-set subtleties that do not consider the microarchitectural attack surface. We demonstrate our attacks in 6 case studies, including the first RISC-V-specific microarchitectural KASLR break and a CycleDrift-based method for detecting kernel activity. Based on our analysis, we stress the need to consider the microarchitectural attack surface during every step of a CPU design, including custom instruction-set extensions.
- …