Architecture and Circuit Design Optimization for Compute-In-Memory
The objective of the proposed research is to optimize compute-in-memory (CIM) design for accelerating deep neural network (DNN) algorithms. Because compute peripherals such as analog-to-digital converters (ADCs) introduce significant overhead in CIM inference designs, the research first focuses on circuit optimization for inference acceleration and proposes a resistive random-access memory (RRAM) based, ADC-free in-memory compute scheme. We comprehensively explore the trade-offs among different types of ADCs and investigate a new ADC design especially suited to CIM, which performs analog shift-and-add over multiple weight significance bits, improving throughput and energy efficiency under similar area constraints. Furthermore, we prototype an ADC-free CIM inference chip with fully analog data processing between sub-arrays, which significantly improves hardware performance over conventional CIM designs and achieves near-software classification accuracy on the ImageNet and CIFAR-10/-100 datasets. Second, the research focuses on hardware support for CIM on-chip training. To maximize hardware reuse of the CIM weight-stationary dataflow, we propose CIM training architectures with a transpose weight-mapping strategy. The cell design and periphery circuitry are modified to efficiently support bi-directional computation. A novel scheme for signed-number multiplication is also proposed to handle negative inputs in backpropagation. Finally, we propose an SRAM-based CIM training architecture and comprehensively explore system-level hardware performance for DNN on-chip training based on silicon measurement results.
Ph.D.
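The shift-and-add over weight significance bits can be illustrated with a minimal digital model. This is a sketch only: in an actual CIM array each bit plane's dot product is computed in analog inside the memory, and the function name and shapes here are illustrative assumptions, not the chip's design.

```python
import numpy as np

def bit_slice_mac(inputs, weights, n_bits=4):
    """Illustrative model of bit-sliced shift-add: an n-bit weight is
    split into bit planes, each plane's dot product with the input is
    computed separately (in analog on a real CIM array), and the
    partial sums are recombined with binary weighting (shift-add)."""
    total = np.zeros(weights.shape[1], dtype=np.int64)
    for b in range(n_bits):
        plane = (weights >> b) & 1      # one significance bit per cell
        partial = inputs @ plane        # per-column MAC (analog in hardware)
        total += partial * (1 << b)     # shift-add across significances
    return total

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=8)          # binary input vector
W = rng.integers(0, 16, size=(8, 4))    # 4-bit unsigned weights
assert np.array_equal(bit_slice_mac(x, W), x @ W)
```

The recombined result matches a direct matrix-vector product, which is why splitting weights across bit lines costs no accuracy by itself; the overhead in real designs comes from converting each analog partial sum.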
PIM-QAT: Neural Network Quantization for Processing-In-Memory (PIM) Systems
Processing-in-memory (PIM), an increasingly studied class of neuromorphic hardware,
promises orders-of-magnitude energy and throughput improvements for deep learning
inference. By leveraging massively parallel and efficient analog computing
inside memory arrays, PIM circumvents the data-movement bottlenecks of
conventional digital hardware. However, an extra quantization step (i.e., PIM
quantization), typically with limited resolution due to hardware constraints,
is required to convert the analog computing results into the digital domain.
Moreover, non-ideal effects arising from the imperfect analog-to-digital
interface pervade PIM quantization and further compromise inference accuracy.
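This extra quantization step can be sketched with a simple digital model: analog column sums pass through a limited-resolution ADC, optionally corrupted by thermal noise, before reaching the digital domain. The `pim_quantize` helper, its parameters, and the Gaussian noise model are illustrative assumptions, not a real chip's interface.

```python
import numpy as np

def pim_quantize(partial_sums, adc_bits=3, full_scale=1.0, noise_std=0.0):
    """Model of the PIM quantization step: analog partial sums are
    mapped to one of 2**adc_bits - 1 + 1 uniform ADC codes over
    [0, full_scale]; noise_std models stochastic thermal noise at
    the analog-to-digital interface."""
    levels = 2 ** adc_bits - 1
    rng = np.random.default_rng(0)
    noisy = partial_sums + rng.normal(0.0, noise_std, partial_sums.shape)
    codes = np.clip(np.round(noisy / full_scale * levels), 0, levels)
    return codes / levels * full_scale   # de-quantized digital value

# When the ADC grid matches the signal, quantization is lossless...
ps = np.array([0.0, 1.0, 2.0, 3.0])
assert np.allclose(pim_quantize(ps, adc_bits=2, full_scale=3.0), ps)
# ...but at low resolution, distinct analog sums collapse to one code.
q = pim_quantize(np.array([0.0, 0.3, 0.7, 1.0]), adc_bits=1)
assert np.allclose(q, [0.0, 0.0, 1.0, 1.0])
```

Training that ignores this rounding sees activations the deployed network never produces, which is the mismatch PIM-QAT is designed to close.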
In this paper, we propose a method for training quantized networks to
incorporate PIM quantization, which is ubiquitous to all PIM systems.
Specifically, we propose a PIM quantization-aware training (PIM-QAT) algorithm
and, by analyzing the training dynamics, introduce rescaling techniques during
forward and backward propagation to facilitate training convergence. We also
propose two techniques, namely batch normalization (BN) calibration and
adjusted-precision training, to suppress the adverse effects of non-ideal
linearity and stochastic thermal noise in real PIM chips. Our method
is validated on three mainstream PIM decomposition schemes, and physically on a
prototype chip. Compared with directly deploying a conventionally trained
quantized model on PIM systems, an approach that ignores this extra
quantization step and therefore fails, our method provides significant
improvement. It also achieves inference accuracy on PIM systems comparable to
that of conventionally quantized models on digital hardware, across the
CIFAR-10 and CIFAR-100 datasets and various network depths for the most
popular network topology.
Comment: 25 pages, 12 figures, 8 tables
Probabilistic Compute-in-Memory Design For Efficient Markov Chain Monte Carlo Sampling
Markov chain Monte Carlo (MCMC) is a widely used sampling method in modern
artificial intelligence and probabilistic computing systems. Because it
involves repeated random number generation, it often dominates the latency of
probabilistic model computation. Hence, we propose a compute-in-memory (CIM)
based MCMC design as a hardware acceleration solution. This work investigates
SRAM bitcell stochasticity and proposes a novel "pseudo-read" operation, on
which we build a block-wise random number generation circuit scheme for fast
random number generation. Moreover, this work proposes a novel multi-stage
exclusive-OR gate (MSXOR) design method to generate strictly uniformly
distributed random numbers. The probability error deviating from a uniform
distribution is suppressed under . Also, this work presents a novel in-memory
copy circuit scheme that realizes data copying inside a CIM sub-array,
significantly reducing the use of read/write circuits for power saving.
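The bias-suppression principle behind an XOR-based design can be checked numerically: XOR-combining several independent, slightly biased bits yields an output bit whose deviation from a 50/50 distribution shrinks geometrically with the number of bits combined (the piling-up lemma). The `xor_debias` helper and its parameters are an illustrative software model, not the paper's MSXOR circuit.

```python
import numpy as np

def xor_debias(p_one=0.45, m=8, n=200_000, seed=0):
    """Draw n output bits, each the XOR of m independent raw bits
    with P(bit = 1) = p_one, and return the empirical P(output = 1).
    For bias eps = p_one - 0.5, the output bias scales as
    (2*eps)**m / 2, so a few XOR stages drive it toward zero."""
    rng = np.random.default_rng(seed)
    raw = (rng.random((m, n)) < p_one).astype(np.uint8)
    out = np.bitwise_xor.reduce(raw, axis=0)   # XOR down each column
    return out.mean()

raw_rate = xor_debias(m=1)       # single biased source: about 0.45
xored_rate = xor_debias(m=8)     # 8-way XOR: very close to 0.50
assert abs(raw_rate - 0.45) < 0.01
assert abs(xored_rate - 0.5) < 0.01
```

This is why cascading XOR stages over independently biased SRAM-derived bits can produce random numbers that are strictly uniform to within a tiny probability error.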
Evaluated in a commercial 28-nm process design kit, this CIM-based MCMC
design generates 4-bit to 32-bit samples with an energy efficiency of
~pJ/sample and a high throughput of up to M samples/s. Compared to
conventional processors, the overall energy efficiency improves
by to times.