PIRM: Processing In Racetrack Memories
The growth in data needs of modern applications has created significant
challenges for modern systems, leading to a "memory wall." Spintronic Domain Wall
Memory (DWM), related to Spin-Transfer Torque Memory (STT-MRAM), provides
near-SRAM read/write performance, energy savings and nonvolatility, potential
for extremely high storage density, and does not have significant endurance
limitations. However, DWM's benefits cannot address data access latency and
throughput limitations of memory bus bandwidth. We propose PIRM, a DWM-based
in-memory computing solution that leverages the properties of DWM nanowires and
allows them to serve as polymorphic gates. While normally DWM is accessed by
applying spin polarized currents orthogonal to the nanowire at access points to
read individual bits, transverse access along the DWM nanowire allows the
differentiation of the aggregate resistance of multiple bits in the nanowire,
akin to a multilevel cell. PIRM leverages this transverse reading to directly
provide bulk-bitwise logic of multiple adjacent operands in the nanowire,
simultaneously. Based on this in-memory logic, PIRM provides a technique to
conduct multi-operand addition and two operand multiplication using transverse
access. PIRM provides a 1.6x speedup compared to the leading DRAM PIM technique
for query applications that leverage bulk bitwise operations. Compared to the
leading PIM technique for DWM, PIRM improves performance by 6.9x and 2.3x and
energy by 5.5x and 3.4x for 8-bit addition and multiplication, respectively. For
arithmetic heavy benchmarks, PIRM reduces access latency by 2.1x, while
decreasing energy consumption by 25.2x for a reasonable 10% area overhead
versus non-PIM DWM. Comment: This paper is accepted to the IEEE/ACM Symposium on
Microarchitecture, October 2022 under the title "CORUSCANT: Fast Efficient
Processing-in-Racetrack Memories"
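As a rough illustration of the transverse-read idea in the PIRM abstract above: if a transverse access reports how many of the adjacent bits are 1 (the aggregate-resistance level), several bulk-bitwise functions follow from thresholding that count. The sketch below is a software model only. The function names and the threshold mapping are illustrative assumptions, not the circuit described in the paper.

```python
def transverse_read(bits):
    """Model a transverse read as the number of 1s among adjacent bits.

    Assumption: the aggregate resistance maps one-to-one onto this count.
    """
    return sum(bits)


def logic_from_count(count, n):
    """Derive multi-operand logic results from a ones-count over n operands."""
    return {
        "and": int(count == n),     # all operands are 1
        "or": int(count >= 1),      # at least one operand is 1
        "xor": count % 2,           # an odd number of operands are 1
        "maj": int(2 * count > n),  # a strict majority of operands are 1
    }


def bulk_bitwise(operands, op):
    """Apply `op` column-wise across equal-length bit rows (hypothetical helper)."""
    n = len(operands)
    return [logic_from_count(transverse_read(col), n)[op] for col in zip(*operands)]


# Three operands stored "adjacently"; each column is sensed as one count.
rows = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
]
and_row = bulk_bitwise(rows, "and")
xor_row = bulk_bitwise(rows, "xor")
```

A single count per column yields AND, OR, XOR, and majority at once, which is the sense in which one sensed level can serve several gate functions.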
A new approach to improve ill-conditioned parabolic optimal control problem via time domain decomposition
In this paper we present a new steepest-descent type algorithm for convex
optimization problems. Our algorithm partitions the unknown into sub-blocks of
unknowns and considers a partial optimization over each sub-block. In quadratic
optimization, our method uses Newton's technique to compute the step-lengths
for the descent directions resulting from the sub-blocks. Our optimization method is
fully parallel and easily implementable. We first present it in a general
linear algebra setting, then we highlight its applicability to a parabolic
optimal control problem, where we consider the blocks of unknowns with respect
to the time dependency of the control variable. The parallel tasks, in the last
problem, turn "on" the control during a specific time-window and turn it "off"
elsewhere. We show that our algorithm significantly improves the computational
time compared with recognized methods. Convergence analysis of the new optimal
control algorithm is provided for an arbitrary choice of partition. Numerical
experiments are presented to illustrate the efficiency and the rapid
convergence of the method. Comment: 28 pages
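One possible reading of the general linear-algebra setting in the abstract above, sketched for the quadratic objective f(x) = 0.5 x'Ax - b'x with A symmetric positive definite. Each block contributes the gradient restricted to its indices as a descent direction, and one step-length per block is found by solving a small linear system, which is an exact Newton step because the step-length problem is itself quadratic. The abstract gives no formulas, so the objective, the helper names (`matvec`, `solve`, `block_descent`), and the test matrix are all assumptions made for illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def solve(H, c):
    """Solve the small dense system H t = c by Gaussian elimination with pivoting."""
    n = len(c)
    M = [row[:] + [ci] for row, ci in zip(H, c)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    t = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][j] * t[j] for j in range(k + 1, n))
        t[k] = (M[k][n] - s) / M[k][k]
    return t


def block_descent(A, b, blocks, iters=200, tol=1e-12):
    """Minimize 0.5*x'Ax - b'x using one step-length per block of unknowns."""
    x = [0.0] * len(b)
    for _ in range(iters):
        g = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        if sum(gi * gi for gi in g) < tol:
            break
        # One descent direction per block: the gradient restricted to that block.
        dirs = []
        for blk in blocks:
            d = [g[i] if i in blk else 0.0 for i in range(len(b))]
            if any(v != 0.0 for v in d):
                dirs.append(d)
        Ad = [matvec(A, d) for d in dirs]  # independent per block, so parallelizable
        H = [[sum(u * v for u, v in zip(di, Adj)) for Adj in Ad] for di in dirs]
        c = [sum(u * v for u, v in zip(di, g)) for di in dirs]
        theta = solve(H, c)  # exact Newton step on the quadratic in the step-lengths
        for t, d in zip(theta, dirs):
            x = [xi - t * di for xi, di in zip(x, d)]
    return x


# Hypothetical 4x4 symmetric positive definite test problem with two blocks.
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 3.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = block_descent(A, b, [[0, 1], [2, 3]])
```

With a single block this reduces to ordinary steepest descent with exact line search. In the optimal control setting the blocks would correspond to time windows of the control variable.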
FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication
The Discrete Fourier Transform (DFT) is essential for various applications
ranging from signal processing to convolution and polynomial multiplication.
The groundbreaking Fast Fourier Transform (FFT) algorithm reduces DFT time
complexity from the naive O(n^2) to O(n log n), and recent works have sought
further acceleration through parallel architectures such as GPUs.
Unfortunately, accelerators such as GPUs cannot exploit their full computing
capabilities as memory access becomes the bottleneck. Therefore, this paper
accelerates the FFT algorithm using digital Processing-in-Memory (PIM)
architectures that shift computation into the memory by exploiting physical
devices capable of storage and logic (e.g., memristors). We propose an O(log n)
in-memory FFT algorithm that can also be performed in parallel across multiple
arrays for high-throughput batched execution, supporting both fixed-point and
floating-point numbers. Through the convolution theorem, we extend this
algorithm to O(log n) polynomial multiplication - a fundamental task for
applications such as cryptography. We evaluate FourierPIM on a
publicly-available cycle-accurate simulator that verifies both correctness and
performance, and demonstrate 5-15x throughput and 4-13x energy improvement over
the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial
multiplication.
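For reference, the convolution-theorem route to polynomial multiplication mentioned in the abstract above can be sketched with a conventional sequential FFT. This is the textbook O(n log n) radix-2 Cooley-Tukey scheme, not the in-memory O(log n) algorithm that FourierPIM proposes. The function names are illustrative.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two.

    With invert=True the result is the unnormalized inverse transform.
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)  # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def poly_multiply(p, q):
    """Multiply integer-coefficient polynomials given lowest degree first."""
    size = len(p) + len(q) - 1
    n = 1
    while n < size:
        n *= 2
    fp = fft(list(p) + [0] * (n - len(p)))
    fq = fft(list(q) + [0] * (n - len(q)))
    # Pointwise product in the frequency domain equals convolution of coefficients.
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    return [round((v / n).real) for v in prod[:size]]


# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
product = poly_multiply([1, 2, 3], [4, 5])
```

Rounding to the nearest integer is safe only while floating-point error stays well below 0.5, which holds for small inputs like these.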
Low power In Memory Computation with Reciprocal Ferromagnet/Topological Insulator Heterostructures
The surface state of a 3D topological insulator (3DTI) is a spin-momentum
locked conductive state, whose large spin Hall angle can be used for the
energy-efficient spin orbit torque based switching of an overlying ferromagnet
(FM). Conversely, the gated switching of the magnetization of a separate FM in
or out of the TI surface plane can turn on and off the TI surface current. The
gate tunability of the TI Dirac cone gap helps reduce its sub-threshold swing.
By exploiting this reciprocal behaviour, we can use two FM/3DTI
heterostructures to design a 1-Transistor 1-magnetic tunnel junction random
access memory unit (1T1MTJ RAM) for an ultra-low-power Processing-in-Memory
(PiM) architecture. Our calculation involves combining the Fokker-Planck
equation with the Non-Equilibrium Green's Function (NEGF) based flow of
conduction electrons and Landau-Lifshitz-Gilbert (LLG) based dynamics of
magnetization. Our combined approach allows us to connect device performance
metrics with underlying material parameters, which can guide proposed
experimental and fabrication efforts. Comment: 5 pages, 4 figures
3D microwave tomography with Huber regularization applied to realistic numerical breast phantoms
Quantitative active microwave imaging for breast cancer screening and therapy monitoring applications requires adequate reconstruction algorithms, in particular with regard to the nonlinearity and ill-posedness of the inverse problem. We employ a fully vectorial three-dimensional nonlinear inversion algorithm for reconstructing complex permittivity profiles from multi-view single-frequency scattered field data, which is based on a Gauss-Newton optimization of a regularized cost function. We previously tested it with various types of regularizing functions for piecewise-constant objects from Institut Fresnel and with a quadratic smoothing function for a realistic numerical breast phantom. In the present paper we adopt a cost function that includes a Huber function in its regularization term, relying on a Markov Random Field approach. The Huber function favors spatial smoothing within homogeneous regions while preserving discontinuities between contrasted tissues. We illustrate the technique with 3D reconstructions from synthetic data at 2 GHz for realistic numerical breast phantoms from the University of Wisconsin-Madison UWCEM online repository: we compare Huber regularization with a multiplicative smoothing regularization and show reconstructions for various positions of a tumor, for multiple tumors and for different tumor sizes, from a sparse and from a denser data configuration.
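To make the edge-preserving behavior described in the abstract above concrete, here is the standard textbook form of the Huber function: quadratic for small arguments (smoothing within homogeneous regions) and linear for large ones (a milder penalty on sharp jumps). The threshold name `delta`, the one-dimensional neighbor differences, and the helper `huber_penalty` are assumptions for illustration. The paper's exact parameterization and Markov Random Field neighborhood are not given in the abstract.

```python
def huber(t, delta=1.0):
    """Standard Huber function: quadratic near zero, linear in the tails."""
    a = abs(t)
    if a <= delta:
        return 0.5 * t * t
    return delta * (a - 0.5 * delta)


def huber_penalty(values, delta=1.0):
    """Sum the Huber function over differences between neighboring values."""
    return sum(huber(b - a, delta) for a, b in zip(values, values[1:]))


# A profile with one sharp jump of height 3 between otherwise flat regions.
profile = [1.0, 1.0, 4.0, 4.0]
huber_cost = huber_penalty(profile)                                   # linear regime
quadratic_cost = sum(0.5 * (b - a) ** 2 for a, b in zip(profile, profile[1:]))
```

For this profile the Huber cost of the jump is lower than the purely quadratic cost, so a regularizer built on it is less inclined to blur the discontinuity.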
Private and Public-Key Side-Channel Threats Against Hardware Accelerated Cryptosystems
Modern side-channel attacks (SCA) have the ability to reveal sensitive data from non-protected hardware implementations of cryptographic accelerators, whether they be private or public-key systems. These protocols include but are not limited to symmetric, private-key encryption using AES-128, -192, or -256, or public-key cryptosystems using elliptic curve cryptography (ECC). Traditionally, scalar point (SP) operations are compelled to be high-speed at any cost to reduce point multiplication latency. The majority of high-speed architectures of contemporary elliptic curve protocols rely on non-secure SP algorithms. This thesis delivers a novel design, analysis, and successful results from a custom differential power analysis attack on AES-128. The resulting SCA can break any 16-byte master key the sophisticated cipher uses, and its direct applications towards public-key cryptosystems will become clear. Further, the architecture of an SCA-resistant scalar point algorithm accompanied by an implementation of an optimized serial multiplier will be constructed. The optimized hardware design of the multiplier is highly modular and can use either of the NIST-approved 233- and 283-bit Koblitz curves utilizing a polynomial basis. The proposed architecture will be implemented on a Kintex-7 FPGA to later be integrated with the ARM Cortex-A9 processor on the Zynq-7000 AP SoC (XC7Z045) for seamless data transfer and analysis of the vulnerabilities SCAs can exploit.
- …