110 research outputs found

    Data processing and information classification— an in-memory approach

    Get PDF
    9noTo live in the information society means to be surrounded by billions of electronic devices full of sensors that constantly acquire data. This enormous amount of data must be processed and classified. A solution commonly adopted is to send these data to server farms to be remotely elaborated. The drawback is a huge battery drain due to high amount of information that must be exchanged. To compensate this problem data must be processed locally, near the sensor itself. But this solution requires huge computational capabilities. While microprocessors, even mobile ones, nowadays have enough computational power, their performance are severely limited by the Memory Wall problem. Memories are too slow, so microprocessors cannot fetch enough data from them, greatly limiting their performance. A solution is the Processing-In-Memory (PIM) approach. New memories are designed that can elaborate data inside them eliminating the Memory Wall problem. In this work we present an example of such a system, using as a case of study the Bitmap Indexing algorithm. Such algorithm is used to classify data coming from many sources in parallel. We propose a hardware accelerator designed around the Processing-In-Memory approach, that is capable of implementing this algorithm and that can also be reconfigured to do other tasks or to work as standard memory. The architecture has been synthesized using CMOS technology. The results that we have obtained highlights that, not only it is possible to process and classify huge amount of data locally, but also that it is possible to obtain this result with a very low power consumption.openopenAndrighetti, M. .; Turvani, G.; Santoro, G.; Vacca, M.; Marchesin, A.; Ottati, F.; Roch, M.R.; Graziano, M.; Zamboni, M.Andrighetti, M.; Turvani, G.; Santoro, G.; Vacca, M.; Marchesin, A.; Ottati, F.; Roch, M. R.; Graziano, M.; Zamboni, M

    New Logic-In-Memory Paradigms: An Architectural and Technological Perspective

    Get PDF
    Processing systems are in continuous evolution thanks to the constant technological advancement and architectural progress. Over the years, computing systems have become more and more powerful, providing support for applications, such as Machine Learning, that require high computational power. However, the growing complexity of modern computing units and applications has had a strong impact on power consumption. In addition, the memory plays a key role on the overall power consumption of the system, especially when considering data-intensive applications. These applications, in fact, require a lot of data movement between the memory and the computing unit. The consequence is twofold: Memory accesses are expensive in terms of energy and a lot of time is wasted in accessing the memory, rather than processing, because of the performance gap that exists between memories and processing units. This gap is known as the memory wall or the von Neumann bottleneck and is due to the different rate of progress between complementary metal-oxide semiconductor (CMOS) technology and memories. However, CMOS scaling is also reaching a limit where it would not be possible to make further progress. This work addresses all these problems from an architectural and technological point of view by: (1) Proposing a novel Configurable Logic-in-Memory Architecture that exploits the in-memory computing paradigm to reduce the memory wall problem while also providing high performance thanks to its flexibility and parallelism; (2) exploring a non-CMOS technology as possible candidate technology for the Logic-in-Memory paradigm

    A Survey on Approximate Multiplier Designs for Energy Efficiency: From Algorithms to Circuits

    Full text link
    Given the stringent requirements of energy efficiency for Internet-of-Things edge devices, approximate multipliers, as a basic component of many processors and accelerators, have been constantly proposed and studied for decades, especially in error-resilient applications. The computation error and energy efficiency largely depend on how and where the approximation is introduced into a design. Thus, this article aims to provide a comprehensive review of the approximation techniques in multiplier designs ranging from algorithms and architectures to circuits. We have implemented representative approximate multiplier designs in each category to understand the impact of the design techniques on accuracy and efficiency. The designs can then be effectively deployed in high-level applications, such as machine learning, to gain energy efficiency at the cost of slight accuracy loss.Comment: 38 pages, 37 figure

    Encryption AXI Transaction Core for Enhanced FPGA Security

    Get PDF
    The current hot topic in cyber-security is not constrained to software layers. As attacks on electronic circuits have become more usual and dangerous, hardening digital System-on-Chips has become crucial. This article presents a novel electronic core to encrypt and decrypt data between two digital modules through an Advanced eXtensible Interface (AXI) connection. The core is compatible with AXI and is based on a Trivium stream cipher. Its implementation has been tested on a Zynq platform. The core prevents unauthorized data extraction by encrypting data on the fly. In addition, it takes up a small area—242 LUTs—and, as the core’s AXI to AXI path is fully combinational, it does not interfere with the system’s overall performance, with a maximum AXI clock frequency of 175 MHz.This work has been supported within the fund for research groups of the Basque university system IT1440-22 by the Department of Education and within the PILAR ZE-2020/00022 and COMMUTE ZE-2021/00931 projects by the Hazitek program, both of the Basque Government, the latter also by the Ministerio de Ciencia e InnovaciĂłn of Spain through the Centro para el Desarrollo TecnolĂłgico Industrial (CDTI) within the project IDI-20201264 and IDI-20220543 and through the Fondo Europeo de Desarrollo Regional 2014–2020 (FEDER funds)

    Sequence-To-Sequence Neural Networks Inference on Embedded Processors Using Dynamic Beam Search

    Get PDF
    Sequence-to-sequence deep neural networks have become the state of the art for a variety of machine learning applications, ranging from neural machine translation (NMT) to speech recognition. Many mobile and Internet of Things (IoT) applications would benefit from the ability of performing sequence-to-sequence inference directly in embedded devices, thereby reducing the amount of raw data transmitted to the cloud, and obtaining benefits in terms of response latency, energy consumption and security. However, due to the high computational complexity of these models, specific optimization techniques are needed to achieve acceptable performance and energy consumption on single-core embedded processors. In this paper, we present a new optimization technique called dynamic beam search, in which the inference complexity is tuned to the difficulty of the processed input sequence at runtime. Results based on measurements on a real embedded device, and on three state-of-the-art deep learning models, show that our method is able to reduce the inference time and energy by up to 25% without loss of accuracy

    Side-channel security of superscalar CPUs: Evaluating the Impact of Micro-architectural Features

    Get PDF
    Side-channel attacks are performed on increasingly complex targets, starting to threaten superscalar CPUs supporting a complete operating system. The difficulty of both assessing the vulnerability of a device to them, and validating the effectiveness of countermeasures is increasing as a consequence. In this work we prove that assessing the side-channel vulnerability of a software implementation running on a CPU should take into account the microarchitectural features of the CPU itself. We characterize the impact of microarchitectural features and prove the effectiveness of such an approach attacking a dual-core superscalar CPU

    Understanding Timing Error Characteristics From Overclocked Systolic Multiply–Accumulate Arrays in FPGAs

    Get PDF
    Artificial Intelligence (AI) hardware accelerators have seen tremendous developments in recent years due to the rapid growth of AI in multiple fields. Many such accelerators comprise a Systolic Multiply–Accumulate Array (SMA) as its computational brain. In this paper, we investigate the faulty output characterization of an SMA in a real silicon FPGA board. Experiments were run on a single Zybo Z7-20 board to control for process variation at nominal voltage and in small batches to control for temperature. The FPGA is rated up to 800 MHz in the data sheet due to the max frequency of the PLL, but the design is written using Verilog for the FPGA and C++ for the processor and synthesized with a chosen constraint of a 125 MHz clock. We then operate the system at a frequency range of 125 MHz to 450 MHz for the FPGA and the nominal 667 MHz for the processor core to produce timing errors in the FPGA without affecting the processor. Our extensive experimental platform with a hardware–software ecosystem provides a methodological pathway that reveals fascinating characteristics of SMA behavior under an overclocked environment. While one may intuitively expect that timing errors resulting from overclocked hardware may produce a wide variation in output values, our post-silicon evaluation reveals a lack of variation in erroneous output values. We found an intriguing pattern where error output values are stable for a given input across a range of operating frequencies far exceeding the rated frequency of the FPGA
    • …
    corecore