19 research outputs found
DRC 2 : Dynamically Reconfigurable Computing Circuit based on Memory Architecture
International audienceThis paper presents a novel energy-efficient and Dynamically Reconfigurable Computing Circuit (DRC²) concept based on memory architecture for data-intensive (imaging, …) and secure (cryptography, …) applications. The proposed computing circuit is based on a 10-Transistor (10T) 3-Port SRAM bitcell array driven by a peripheral circuitry enabling all basic operations that can be traditionally performed by an ALU. As a result, logic and arithmetic operations can be entirely executed within the memory unit leading to a significant reduction in power consumption related to the data transfer between memories and computing units. Moreover, the proposed computing circuit can perform extremely-parallel operations enabling the processing of large volume of data. A test case based on image processing application and using the saturating increment function is analytically modeled to compare conventional and DRC²-based approaches. It is demonstrated that DRC²-based approach provides a reduction of clock cycle number of up to 2x. Finally, potential applications and must-be-considered changes at different design levels are discussed
X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random Access Memories
Silicon-based Static Random Access Memories (SRAM) and digital Boolean logic
have been the workhorse of the state-of-art computing platforms. Despite
tremendous strides in scaling the ubiquitous metal-oxide-semiconductor
transistor, the underlying \textit{von-Neumann} computing architecture has
remained unchanged. The limited throughput and energy-efficiency of the
state-of-art computing systems, to a large extent, results from the well-known
\textit{von-Neumann bottleneck}. The energy and throughput inefficiency of the
von-Neumann machines have been accentuated in recent times due to the present
emphasis on data-intensive applications like artificial intelligence, machine
learning \textit{etc}. A possible approach towards mitigating the overhead
associated with the von-Neumann bottleneck is to enable \textit{in-memory}
Boolean computations. In this manuscript, we present an augmented version of
the conventional SRAM bit-cells, called \textit{the X-SRAM}, with the ability
to perform in-memory, vector Boolean computations, in addition to the usual
memory storage operations. We propose at least six different schemes for
enabling in-memory vector computations including NAND, NOR, IMP (implication),
XOR logic gates with respect to different bit-cell topologies the 8T cell
and the 8T Differential cell. In addition, we also present a novel
\textit{`read-compute-store'} scheme, wherein the computed Boolean function can
be directly stored in the memory without the need of latching the data and
carrying out a subsequent write operation. The feasibility of the proposed
schemes has been verified using predictive transistor models and Monte-Carlo
variation analysis.Comment: This article has been accepted in a future issue of IEEE Transactions
on Circuits and Systems-I: Regular Paper
High Performance Pre-computation based Self-Controlled Precharge-Free Content-Addressable Memory
Content-addressable memory (CAM) is a special type of memory used in networking applications for very-high-speed searching operation. It compares input search data with the table of stored data, and returns the address of matching data in a parallel search method. Also the use of parallel comparison results in reduced search time, it also significantly increases power consumption when compared to precharge based CAM. The low-power NAND-type and high-speed NOR-type CAM methods require the precharge prior to the search. This PF phase leads to increase the settling time of the output and also reduce the speed of the search operation. In this paper, a High performance Pre-computation Based Self-Controlled Precharge-Free CAM (PB-SCPF CAM) structure is proposed for high-speed applications which reduce the settling time as well as improve the speed of the search. Where search time is very important for designing larger word lengths, SCPF architecture is efficacious in applications. The experimental results show that PB-SCPF approach can attain on average 32% in power reduction and 80% in delay reduction. The most important contribution of this project is that it offers theoretical and practical proofs to verify that our suggested PB-SCPF CAM system can achieve greater power reduction without the requirement of special CAM cell design. This shows that the approach which we have used is more flexible and adaptive for general designs and high speed applications
BLADE: A BitLine Accelerator for Devices on the Edge
The increasing ubiquity of edge devices in the consumer market, along with their ever more computationally expensive workloads, necessitate corresponding increases in computing power to support such workloads. In-memory computing is attractive in edge devices as it reuses preexisting memory elements, thus limiting area overhead. Additionally, in-SRAM Computing (iSC) efficiently performs computations on spatially local data found in a variety of emerging edge device workloads. We therefore propose, implement, and benchmark BLADE, a BitLine Accelerator for Devices on the Edge. BLADE is an iSC architecture that can perform massive SIMD-like complex operations on hundreds to thousands of operands simultaneously. We implement BLADE in 28nm CMOS and demonstrate its functionality down to 0.6V, lower than any conventional state-of-the-art iSC architecture. We also benchmark BLADE in conjunction with a full Linux software stack in the gem5 architectural simulator, providing a robust demonstration of its performance gain in comparison to an equivalent embedded processor equipped with a NEON SIMD co-processor. We benchmark BLADE with three emerging edge device workloads, namely cryptography, high efficiency video coding, and convolutional neural networks, and demonstrate 4x, 6x, and 3x performance improvement, respectively, in comparison to a baseline CPU/NEON processor at an equivalent power budget
Custom Memory Design for Logic-in-Memory: Drawbacks and Improvements over Conventional Memories
The speed of modern digital systems is severely limited by memory latency (the “Memory Wall” problem). Data exchange between Logic and Memory is also responsible for a large part of the system energy consumption. Logic-in-Memory (LiM) represents an attractive solution to this problem. By performing part of the computations directly inside the memory the system speed can be improved while reducing its energy consumption. LiM solutions that offer the major boost in performance are based on the modification of the memory cell. However, what is the cost of such modifications? How do these impact the memory array performance? In this work, this question is addressed by analysing a LiM memory array implementing an algorithm for the maximum/minimum value computation. The memory array is designed at physical level using the FreePDK 45nm CMOS process, with three memory cell variants, and its performance is compared to SRAM and CAM memories. Results highlight that read and write operations performance is worsened but in-memory operations result to be very efficient: a 55.26% reduction in the energy-delay product is measured for the AND operation with respect to the SRAM read one. Therefore, the LiM approach represents a very promising solution for low-density and high-performance memories