This paper describes a multi-functional deep in-memory processor for inference applications. Deep in-memory processing is achieved by embedding pitch-matched, low-SNR analog processing into a standard 6T 16 KB SRAM array in 65 nm CMOS. Four applications are demonstrated. Compared with the conventional architecture, the prototype achieves up to 5.6× energy savings (9.7× estimated for a multi-bank scenario) with negligible (≤1%) accuracy degradation in all four applications.
The read/write circuitry (lower) and the in-memory processing blocks (upper) are physically separated to maintain functionality. Functional WL drivers generate the PWM-WL signals, while the reconfiguration word (RCFG) reconfigures the local controllers in the CTRL. The in-memory processing chain is pipelined so that the BLs can be precharged as soon as the MR-FR step is complete. The architecture processes 128 8-b words per access cycle, so a 256-dimensional vector with 8-b elements requires two consecutive access cycles. The two consecutive CBLP outputs are therefore sampled on different sampling capacitors and charge-shared before conversion by the ADC. Four 8-b single-slope ADCs, which are slow but energy-efficient, operate in parallel to process 36 128-dimensional vectors/µs.
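The two-cycle aggregation described above can be illustrated with an idealized charge-sharing model. This is only a sketch under stated assumptions: each CBLP output is modeled as an equal-weight average over the 128 columns of one access cycle, and the two sampling capacitors are assumed equal. The function names `charge_share` and `two_cycle_aggregate` are hypothetical and do not come from the paper.

```python
def charge_share(voltages, caps=None):
    """Ideal charge sharing: final voltage = total charge / total capacitance."""
    if caps is None:
        caps = [1.0] * len(voltages)  # assumption: identical capacitors
    if len(caps) != len(voltages):
        raise ValueError("need one capacitance per voltage")
    total_charge = sum(c * v for c, v in zip(caps, voltages))
    return total_charge / sum(caps)


def two_cycle_aggregate(column_values, words_per_cycle=128):
    """Aggregate a 256-element vector as two 128-column access cycles."""
    if len(column_values) != 2 * words_per_cycle:
        raise ValueError("expected two access cycles worth of values")
    # Assumed CBLP behavior: equal-weight average across the columns of a cycle.
    first = charge_share(column_values[:words_per_cycle])
    second = charge_share(column_values[words_per_cycle:])
    # The two CBLP outputs sit on separate (assumed equal) sampling capacitors
    # and are charge-shared before the ADC conversion.
    return charge_share([first, second])


values = [float(i) for i in range(256)]
result = two_cycle_aggregate(values)
```

Under these assumptions the charge-shared result equals the average over all 256 elements, which is why a single conversion can serve both access cycles.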
The MR-FR step (Fig. 3) generates a BL swing ΔVBL proportional to the binary-weighted bits (di) in a column [2] through the use of PWM-WL signals. An 8-b precision is chosen for both the array data (D) and the streamed input (P) to satisfy the requirements of many inference applications. The longest PWM-WL pulse width with VWL < VDD is chosen to be less than 40% of the BL RC time constant in order to ensure sufficient linearity and prevent a destructive read [2]. The shortest pulse width needs to be <250 ps while driving a large-RC WL, which is challenging due to the row pitch-matching constraints. Hence, a sub-ranged read is proposed, in which the 4 MSBs and 4 LSBs are stored in adjacent columns (a column pair) and read simultaneously on BLMSB and BLLSB. The charge on BLMSB is then shared with 1/16 of the BLLSB charge via switches Øcon and Ømerge.
Capacitors attached to the BLs enable fine-tuning of the 1/16 capacitance ratio. The sub-ranged MR-FR (Fig. 3) achieves a maximum INL of 0.03 LSB.
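The arithmetic behind the sub-ranged read can be sketched with an idealized model. It assumes a perfectly linear BL swing per nibble and an exact 16:1 weighting between the MSB and LSB bit lines during charge sharing. The names `pwm_bl_swing`, `subranged_read`, and `unit_swing` are hypothetical, and circuit non-idealities such as the reported INL are not modeled.

```python
def pwm_bl_swing(nibble, unit_swing=1.0):
    """Idealized BL swing for a 4-b value read with binary-weighted PWM-WL pulses."""
    if not 0 <= nibble < 16:
        raise ValueError("nibble must be a 4-b value")
    # Bit i is read with a pulse width proportional to 2**i, so its
    # contribution to the swing is weighted by 2**i (assumed perfectly linear).
    return unit_swing * sum(((nibble >> i) & 1) << i for i in range(4))


def subranged_read(word, unit_swing=1.0):
    """Merge the MSB and LSB nibble swings with an assumed ideal 16:1 weighting."""
    if not 0 <= word < 256:
        raise ValueError("word must be an 8-b value")
    v_msb = pwm_bl_swing((word >> 4) & 0xF, unit_swing)
    v_lsb = pwm_bl_swing(word & 0xF, unit_swing)
    # Charge sharing with a 16:1 ratio yields a weighted average of the swings.
    return (16 * v_msb + v_lsb) / 17


merged = [subranged_read(w) for w in range(256)]
```

In this model the merged swing equals `word / 17` times the unit swing, so it stays proportional to the full 8-b value even though each bit line only resolves 4 bits. The benefit is that the pulse widths span a 1:8 range instead of 1:128.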
In the MD mode, the MR-FR enables D-to-A conversion, and the replica cell read performs word-level add/subtract by reading the streamed input P (P̅ for subtract) from the replica bit-cell array simultaneously with the array data D (Fig. 3) [2]. The replica bit-cell array stores the streamed data and can be written directly through the write BL (WBL), reducing energy and latency overheads.
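The add/subtract step can be illustrated with plain integer arithmetic. The sketch assumes that the barred operand denotes the bitwise complement of an 8-b word, which the text does not state explicitly. The names `complement8` and `add_or_subtract` are hypothetical, and how the chip handles the resulting constant offset is not modeled.

```python
WORD_BITS = 8
MASK = (1 << WORD_BITS) - 1  # 255 for 8-b words


def complement8(p):
    """Bitwise complement of an 8-b word (assumed meaning of the barred operand)."""
    if not 0 <= p <= MASK:
        raise ValueError("p must be an 8-b value")
    return ~p & MASK


def add_or_subtract(d, p, subtract=False):
    """Word-level add, or subtract realized as an add of the complemented operand."""
    if not 0 <= d <= MASK:
        raise ValueError("d must be an 8-b value")
    operand = complement8(p) if subtract else p
    return d + operand


total = add_or_subtract(100, 30)
difference_plus_offset = add_or_subtract(100, 30, subtract=True)
```

Because the complement of an 8-b word `p` equals `255 - p`, adding it yields `d - p + 255`. The difference is therefore available up to a known constant offset using the same summation as the add case.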
The BLP (Fig. 4) can be reconfigured to operate in either the DP or the MD mode, for dot product or absolute-value computation, respectively. In the MD mode, an analog comparator and a mux are used to obtain the absolute value, and the multiplier circuit is reconfigured as a BL-wise sampler. In the DP mode, the comparator is bypassed and BLB is chosen by the mux. The mixed-signal capacitive multiplier [4] uses identical bit capacitors to meet the column pitch constraints, which necessitates sequential processing of the multiplicand bits (pi) and thereby limits the throughput. Sub-ranged multiplication alleviates this problem. Much of the energy savings is due to MR-FR. The CTRL energy will be amortized in a multi-bank scenario. The measured energy in the DP (MD) mode is 5.6× (3.7×) smaller than that of the conventional architecture, with savings of up to 9.7× (5.4×) in a multi-bank scenario.
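A digital reference model of the two BLP modes can make the reconfiguration concrete. The sketch assumes that MD denotes a Manhattan-distance style sum of absolute differences, which the text does not spell out. The function names are hypothetical, and the analog behavior (noise, limited swing, ADC quantization) is not modeled.

```python
def absolute_difference(d, p):
    """Comparator-plus-mux view of |d - p|: compare, then select the positive branch."""
    d_is_larger = d >= p                     # comparator decision
    return d - p if d_is_larger else p - d   # mux selects the non-negative difference


def blp_reference(d_vec, p_vec, mode):
    """Reference result for one vector pair in 'DP' or 'MD' mode."""
    if len(d_vec) != len(p_vec):
        raise ValueError("vectors must have the same length")
    if mode == "DP":
        # Dot product: the comparator is bypassed and each column multiplies.
        return sum(d * p for d, p in zip(d_vec, p_vec))
    if mode == "MD":
        # Assumed Manhattan distance: the multiplier acts only as a sampler.
        return sum(absolute_difference(d, p) for d, p in zip(d_vec, p_vec))
    raise ValueError("mode must be 'DP' or 'MD'")


dp_out = blp_reference([1, 2, 3], [4, 5, 6], "DP")
md_out = blp_reference([1, 2, 3], [4, 5, 6], "MD")
```

Keeping both modes behind a single entry point mirrors the way one column circuit is reused for both computations, with only the per-column operation changing.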
The multi-functional IC (Fig. 6) implements four different algorithms with 2-, 4-, and 64-class decisions in a 512×256 SRAM array, achieving better decision accuracy than single-function ICs [1, 3] and a comparable EDP (scaled to 65 nm). The chip micrograph (Fig. 7) shows that the deep in-memory circuitry incurs an area overhead of 25%, not counting the CTRL.
[Figure labels: Multi-row functional READ (MR-FR); ① MULT; ③ ADD/SUBT]

