    PIRM: Processing In Racetrack Memories

    The growth in data needs of modern applications has created significant challenges for modern systems, leading to a "memory wall." Spintronic Domain Wall Memory (DWM), related to Spin-Transfer Torque Memory (STT-MRAM), provides near-SRAM read/write performance, energy savings and nonvolatility, and potential for extremely high storage density, and it does not have significant endurance limitations. However, DWM's benefits cannot address the data access latency and throughput limitations imposed by memory bus bandwidth. We propose PIRM, a DWM-based in-memory computing solution that leverages the properties of DWM nanowires and allows them to serve as polymorphic gates. While DWM is normally accessed by applying spin-polarized currents orthogonal to the nanowire at access points to read individual bits, transverse access along the DWM nanowire allows the differentiation of the aggregate resistance of multiple bits in the nanowire, akin to a multilevel cell. PIRM leverages this transverse reading to directly provide bulk-bitwise logic of multiple adjacent operands in the nanowire, simultaneously. Based on this in-memory logic, PIRM provides a technique to conduct multi-operand addition and two-operand multiplication using transverse access. PIRM provides a 1.6x speedup compared to the leading DRAM PIM technique for query applications that leverage bulk bitwise operations. Compared to the leading PIM technique for DWM, PIRM improves performance by 6.9x and 2.3x and energy by 5.5x and 3.4x for 8-bit addition and multiplication, respectively. For arithmetic-heavy benchmarks, PIRM reduces access latency by 2.1x, while decreasing energy consumption by 25.2x for a reasonable 10% area overhead versus non-PIM DWM. Comment: This paper is accepted to the IEEE/ACM Symposium on Microarchitecture, October 2022 under the title "CORUSCANT: Fast Efficient Processing-in-Racetrack Memories".
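
    The idea of deriving bulk-bitwise logic from one transverse read can be sketched in software. This is only a toy model, not the PIRM circuit: it assumes the sensed aggregate resistance level maps directly to the number of 1-bits among the stacked operands, and the names `transverse_read` and `bulk_logic` are hypothetical.

```python
def transverse_read(column_bits):
    """Toy model of a transverse read (assumption: the sensed level
    equals the number of 1-bits in the segment)."""
    return sum(column_bits)


def bulk_logic(operands):
    """Derive several bitwise results from one count per bit position.

    operands: list of equal-length bit lists, one per stacked operand.
    """
    k = len(operands)
    # One "read" per bit position, across all operands at once.
    counts = [transverse_read(col) for col in zip(*operands)]
    return {
        "and": [int(c == k) for c in counts],      # all operands are 1
        "or": [int(c >= 1) for c in counts],       # at least one is 1
        "xor": [c & 1 for c in counts],            # odd number of 1s
        "maj": [int(2 * c > k) for c in counts],   # strict majority
    }


result = bulk_logic([[1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 0, 1]])
```

    In this model a single count per bit position is enough to select AND, OR, XOR, or majority, which is one way to read the abstract's "polymorphic gates".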

    A new approach to improve ill-conditioned parabolic optimal control problem via time domain decomposition

    In this paper we present a new steepest-descent type algorithm for convex optimization problems. Our algorithm splits the unknown into sub-blocks of unknowns and considers a partial optimization over each sub-block. In quadratic optimization, our method uses Newton's technique to compute the step lengths for the descent directions resulting from the sub-blocks. Our optimization method is fully parallel and easily implementable. We first present it in a general linear algebra setting, then we highlight its applicability to a parabolic optimal control problem, where we consider the blocks of unknowns with respect to the time dependency of the control variable. The parallel tasks, in the latter problem, turn "on" the control during a specific time window and turn it "off" elsewhere. We show that our algorithm significantly improves the computational time compared with recognized methods. Convergence analysis of the new optimal control algorithm is provided for an arbitrary choice of partition. Numerical experiments are presented to illustrate the efficiency and the rapid convergence of the method. Comment: 28 pages
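
    For the quadratic case, the block idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective is taken to be f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite, each block's descent direction is the residual restricted to that block, and all step lengths are obtained together from a small Newton system. The function name and test matrix are hypothetical.

```python
import numpy as np


def block_steepest_descent(A, b, blocks, iters=200, tol=1e-10):
    """Minimize 0.5*x^T A x - b^T x using one descent direction per block."""
    x = np.zeros(len(b), dtype=float)
    for _ in range(iters):
        r = b - A @ x                      # negative gradient (residual)
        if np.linalg.norm(r) < tol:
            break
        # Column j holds the residual restricted to block j, zero elsewhere.
        D = np.zeros((len(b), len(blocks)))
        for j, idx in enumerate(blocks):
            D[idx, j] = r[idx]
        # Newton step on the step lengths theta: minimize f(x + D @ theta).
        H = D.T @ A @ D                    # small p-by-p Hessian
        g = D.T @ r
        # lstsq tolerates a singular H (e.g. a block with zero residual).
        theta, *_ = np.linalg.lstsq(H, g, rcond=None)
        x = x + D @ theta
    return x


A = np.array([[4.0, 1.0, 0.0, 0.0],
              [1.0, 3.0, 1.0, 0.0],
              [0.0, 1.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])
x = block_steepest_descent(A, b, blocks=[[0, 1], [2, 3]])
```

    With a single block this reduces to classical steepest descent with exact line search. The columns of D can be built independently, which is where the parallelism mentioned in the abstract would enter.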

    Characterization and Acceleration of High Performance Compute Workloads


    FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication

    The Discrete Fourier Transform (DFT) is essential for various applications ranging from signal processing to convolution and polynomial multiplication. The groundbreaking Fast Fourier Transform (FFT) algorithm reduces DFT time complexity from the naive O(n^2) to O(n log n), and recent works have sought further acceleration through parallel architectures such as GPUs. Unfortunately, accelerators such as GPUs cannot exploit their full computing capabilities as memory access becomes the bottleneck. Therefore, this paper accelerates the FFT algorithm using digital Processing-in-Memory (PIM) architectures that shift computation into the memory by exploiting physical devices capable of storage and logic (e.g., memristors). We propose an O(log n) in-memory FFT algorithm that can also be performed in parallel across multiple arrays for high-throughput batched execution, supporting both fixed-point and floating-point numbers. Through the convolution theorem, we extend this algorithm to O(log n) polynomial multiplication, a fundamental task for applications such as cryptography. We evaluate FourierPIM on a publicly available cycle-accurate simulator that verifies both correctness and performance, and demonstrate 5-15x throughput and 4-13x energy improvement over the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial multiplication.
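
    The link between the FFT and polynomial multiplication through the convolution theorem can be illustrated with a plain software sketch. This is the textbook recursive radix-2 FFT running in O(n log n) on a CPU, not the in-memory O(log n) algorithm the abstract proposes. Integer coefficients are assumed so that the result can be rounded, and the function names are illustrative.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)  # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def poly_multiply(p, q):
    """Multiply integer-coefficient polynomials (lowest degree first)."""
    size = len(p) + len(q) - 1
    n = 1
    while n < size:                     # pad to a power of two
        n *= 2
    fp = fft([complex(c) for c in p] + [0j] * (n - len(p)))
    fq = fft([complex(c) for c in q] + [0j] * (n - len(q)))
    # Convolution theorem: pointwise product in the frequency domain.
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    return [round((v / n).real) for v in prod[:size]]


# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
coeffs = poly_multiply([1, 2, 3], [4, 5])
```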

    Low power In Memory Computation with Reciprocal Ferromagnet/Topological Insulator Heterostructures

    The surface state of a 3D topological insulator (3DTI) is a spin-momentum locked conductive state, whose large spin Hall angle can be used for the energy-efficient spin-orbit-torque-based switching of an overlying ferromagnet (FM). Conversely, the gated switching of the magnetization of a separate FM in or out of the TI surface plane can turn the TI surface current on and off. The gate tunability of the TI Dirac cone gap helps reduce its sub-threshold swing. By exploiting this reciprocal behaviour, we can use two FM/3DTI heterostructures to design a 1-Transistor 1-magnetic tunnel junction random access memory unit (1T1MTJ RAM) for an ultra-low-power Processing-in-Memory (PiM) architecture. Our calculation combines the Fokker-Planck equation with the Non-Equilibrium Green's Function (NEGF) based flow of conduction electrons and the Landau-Lifshitz-Gilbert (LLG) based dynamics of magnetization. Our combined approach allows us to connect device performance metrics with underlying material parameters, which can guide proposed experimental and fabrication efforts. Comment: 5 pages, 4 figures

    3D microwave tomography with Huber regularization applied to realistic numerical breast phantoms

    Quantitative active microwave imaging for breast cancer screening and therapy monitoring applications requires adequate reconstruction algorithms, in particular with regard to the nonlinearity and ill-posedness of the inverse problem. We employ a fully vectorial three-dimensional nonlinear inversion algorithm for reconstructing complex permittivity profiles from multi-view single-frequency scattered field data, which is based on a Gauss-Newton optimization of a regularized cost function. We previously tested it with various types of regularizing functions for piecewise-constant objects from Institut Fresnel and with a quadratic smoothing function for a realistic numerical breast phantom. In the present paper we adopt a cost function that includes a Huber function in its regularization term, relying on a Markov Random Field approach. The Huber function favors spatial smoothing within homogeneous regions while preserving discontinuities between contrasted tissues. We illustrate the technique with 3D reconstructions from synthetic data at 2 GHz for realistic numerical breast phantoms from the University of Wisconsin-Madison UWCEM online repository: we compare Huber regularization with a multiplicative smoothing regularization and show reconstructions for various positions of a tumor, for multiple tumors and for different tumor sizes, from a sparse and from a denser data configuration.
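
    The edge-preserving behaviour of the Huber function can be seen in a small sketch. It uses the standard textbook form (quadratic below a threshold, linear above it) and applies it to neighbour differences of a 1D profile. The exact scaling and neighbourhood system used in the paper are not given in the abstract, so the threshold `delta` and the helper `huber_penalty` are assumptions for illustration.

```python
def huber(t, delta):
    """Standard Huber function: quadratic for |t| <= delta, linear beyond."""
    a = abs(t)                          # abs() also handles complex values
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def huber_penalty(profile, delta):
    """Sum of Huber costs over differences between neighbouring cells."""
    return sum(huber(b - a, delta) for a, b in zip(profile, profile[1:]))


# A profile with one sharp jump of height 4 between homogeneous regions.
profile = [0.0, 0.0, 4.0, 4.0]
penalty = huber_penalty(profile, delta=1.0)
quadratic = sum(0.5 * (b - a) ** 2 for a, b in zip(profile, profile[1:]))
```

    Here the jump costs 3.5 under the Huber penalty versus 8.0 under a purely quadratic one, so a sharp tissue boundary is penalized less severely while small fluctuations are still smoothed quadratically.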

    Private and Public-Key Side-Channel Threats Against Hardware Accelerated Cryptosystems

    Modern side-channel attacks (SCA) have the ability to reveal sensitive data from non-protected hardware implementations of cryptographic accelerators, whether they be private or public-key systems. These protocols include but are not limited to symmetric, private-key encryption using AES-128, AES-192, and AES-256, or public-key cryptosystems using elliptic curve cryptography (ECC). Traditionally, scalar point (SP) operations are compelled to be high-speed at any cost to reduce point multiplication latency. The majority of high-speed architectures of contemporary elliptic curve protocols rely on non-secure SP algorithms. This thesis delivers a novel design, analysis, and successful results from a custom differential power analysis attack on AES-128. The resulting SCA can break any 16-byte master key the sophisticated cipher uses, and its direct applications towards public-key cryptosystems will become clear. Further, the architecture of an SCA-resistant scalar point algorithm accompanied by an implementation of an optimized serial multiplier will be constructed. The optimized hardware design of the multiplier is highly modular and can use either of the NIST-approved 233-bit and 283-bit Koblitz curves utilizing a polynomial basis. The proposed architecture will be implemented on a Kintex-7 FPGA to later be integrated with the ARM Cortex-A9 processor on the Zynq-7000 AP SoC (XC7Z045) for seamless data transfer and analysis of the vulnerabilities SCAs can exploit.