110 research outputs found

    Energy-efficient systems for information transfer and processing

    Machine learning (ML) systems are proving highly effective at tackling the data deluge of the big data era, thanks to the exponential increase in computing power. Current ML systems adopt either centralized cloud computing or distributed edge computing, and in both, energy efficiency has been drawing increasing attention. In cloud computing, data transfer due to inter-chip, inter-board, inter-shelf, and inter-rack communications (the I/O interface) within data centers is one of the dominant energy costs, and it will intensify with the growing demand for I/O bandwidth in high-performance computing. In edge computing, energy efficiency is the primary design challenge, as mobile devices have limited energy, computation, and storage resources; this challenge is exacerbated by the need to embed ML algorithms such as convolutional neural networks (CNNs) for local on-device inference. This dissertation investigates techniques to address both challenges.

    To address the energy-efficiency challenge in data centers, the dissertation focuses on reducing the energy consumption of the I/O interface. In emerging analog-to-digital converter (ADC)-based multi-Gb/s serial link receivers, power dissipation is dominated by the ADC. ADCs in serial links are conventionally designed to signal-to-noise-and-distortion ratio (SNDR) and effective-number-of-bits (ENOB) metrics because these are the standard for generic ADC design. This dissertation instead uses an information-based metric, the bit-error rate (BER), to design a BER-optimal ADC (BOA) for serial links. First, a theoretical analysis establishes when the benefits of a BOA over a conventional uniform ADC (CUA) in a serial link receiver are substantial. Second, a 4 GS/s, 4-bit on-chip ADC in a 90 nm CMOS process is designed and integrated into a 4 Gb/s serial link receiver to verify the analysis. Measured results demonstrate that a 3-bit BOA receiver outperforms a 4-bit CUA receiver at a BER < 10^-12 while providing 50% power savings in the ADC. In the process, it is demonstrated conclusively that BER, rather than ENOB, is the better metric when designing ADCs for serial links.

    For resource-constrained computing at the edge, the dissertation tackles energy-efficient implementation of ML algorithms, particularly CNNs, which have gained considerable interest due to their record-breaking performance in many recognition tasks but whose implementation complexity hinders deployment on power-constrained embedded platforms. Two techniques for energy-efficient CNN design are developed. The first is a predictive CNN (PredictiveNet), which exploits the high sparsity of well-trained CNNs to bypass a large fraction of the power-dominant convolutions at runtime without modifying the CNN structure. Analysis supported by simulations justifies PredictiveNet's effectiveness: on both the MNIST and CIFAR-10 datasets, simulations show that PredictiveNet achieves 7.2× and 4.4× reductions in computational and representational costs, respectively, compared with a conventional CNN, and 2.5× and 1.7× reductions compared with a state-of-the-art CNN, while incurring only 0.02 loss in classification accuracy.

    The second technique is a variation-tolerant CNN architecture capable of operating in the near-threshold-voltage (NTV) regime for aggressive energy efficiency. NTV computing can achieve up to 10× energy savings but is sensitive to process, voltage, and temperature (PVT) variations, which can lead to timing errors. To leverage NTV's potential, the dissertation develops a new statistical error compensation (SEC) technique referred to as rank-decomposed SEC (RD-SEC), which exploits the inherent redundancy in CNNs to handle the timing errors caused by NTV computing. When evaluated on CNNs for both the MNIST and CIFAR-10 datasets, simulation results in 45 nm CMOS show that RD-SEC enables robust CNN operation in the NTV regime, achieving up to 11× improvement in variation tolerance and up to 113× reduction in the standard deviation of classification accuracy while incurring only marginal degradation in median classification accuracy.
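    The abstract does not detail PredictiveNet's skip mechanism, so the following NumPy sketch only illustrates the general idea of predictive computation skipping under an assumed mechanism: a cheap low-precision pass estimates each pre-activation's sign, and the full-precision multiply-accumulate is skipped wherever ReLU would zero the output anyway. All names and the low-bit predictor are illustrative, not the dissertation's actual design.

        import numpy as np

        def predictive_dot_relu(x, w, pred_bits=4):
            """Sketch of predictive skipping (illustrative, not the dissertation's
            exact algorithm): a cheap low-precision pass predicts which outputs
            ReLU would clamp to zero, and the full-precision pass is skipped there."""
            # x: (n_outputs, k) operand rows, w: (k,) filter weights
            scale = (2 ** (pred_bits - 1)) / (np.max(np.abs(np.concatenate([x.ravel(), w]))) + 1e-12)
            pred = np.round(x * scale) @ np.round(w * scale)   # low-cost sign estimate

            out = np.zeros(x.shape[0])
            skipped = 0
            for i in range(x.shape[0]):
                if pred[i] <= 0:
                    skipped += 1                   # ReLU would output 0: skip the full MAC
                else:
                    out[i] = max(x[i] @ w, 0.0)    # full-precision MAC only where needed
            return out, skipped

        # Example: with the high post-ReLU sparsity of well-trained CNNs, most rows
        # fall into the skipped branch, which is where the computational savings come from.
        rng = np.random.default_rng(0)
        y, n_skipped = predictive_dot_relu(rng.standard_normal((1000, 64)), rng.standard_normal(64))
        print(f"skipped {n_skipped} of 1000 full-precision dot products")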

    ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer

    Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for multiple vision tasks. However, both the attention and the multi-layer perceptrons (MLPs) in ViTs are not sufficiently efficient due to dense multiplications, resulting in costly training and inference. To this end, we propose to reparameterize a pre-trained ViT with a mixture of multiplication primitives, e.g., bitwise shifts and additions, towards a new type of multiplication-reduced model, dubbed ShiftAddViT, which aims for end-to-end inference speedups on GPUs without training from scratch. Specifically, all MatMuls among queries, keys, and values are reparameterized by additive kernels after mapping queries and keys to binary codes in Hamming space, and the remaining MLPs or linear layers are reparameterized by shift kernels. We use TVM to implement and optimize these customized kernels for practical hardware deployment on GPUs. We find that such a reparameterization of (quadratic or linear) attention maintains model accuracy, but inevitably causes accuracy drops when applied to MLPs. To marry the best of both worlds, we further propose a mixture-of-experts (MoE) framework that reparameterizes MLPs by taking multiplication or its primitives as experts, e.g., multiplication and shift, and design a new latency-aware load-balancing loss. This loss trains a generic router that assigns a dynamic number of input tokens to different experts according to their latency: in principle, the faster an expert runs, the more input tokens it is assigned. Extensive experiments consistently validate the effectiveness of ShiftAddViT, achieving up to 5.18× latency reduction on GPUs and 42.9% energy savings while maintaining accuracy comparable to original or efficient ViTs.
    Comment: Accepted by NeurIPS 202
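    As a rough illustration of the shift reparameterization described above (the paper's actual kernels are written in TVM and target GPUs; the function names and quantization choices below are assumptions for the sketch), a linear layer's weights can be rounded to signed powers of two so that each multiply becomes a bitwise shift:

        import numpy as np

        def to_power_of_two(w, min_exp=-8, max_exp=0):
            """Round each weight to the nearest signed power of two, so that
            w * x can be realized as a shift of x instead of a multiply."""
            sign = np.sign(w).astype(np.int64)
            exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), min_exp, max_exp).astype(int)
            return sign, exp

        def shift_linear(x_int, sign, exp):
            """Multiplication-free linear layer on integer activations:
            each weight contributes sign * (x >> -exp) to the accumulator."""
            out = np.zeros(sign.shape[0], dtype=np.int64)
            for o in range(sign.shape[0]):
                acc = 0
                for i in range(sign.shape[1]):
                    acc += int(sign[o, i]) * (int(x_int[i]) >> (-exp[o, i]))
                out[o] = acc
            return out

        # Example: quantize a random weight matrix and run the shift-only layer.
        rng = np.random.default_rng(0)
        sign, exp = to_power_of_two(rng.uniform(-1.0, 1.0, size=(8, 16)))
        y = shift_linear(rng.integers(0, 256, size=16), sign, exp)

    Per the abstract, shift kernels alone degrade MLP accuracy, which is why ShiftAddViT mixes shift and multiplication experts via the latency-aware MoE router rather than using shifts everywhere.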

    Auto-NBA: Efficient and Effective Search Over the Joint Space of Networks, Bitwidths, and Accelerators

    While maximizing deep neural networks' (DNNs') acceleration efficiency requires jointly searching over three distinct yet highly coupled aspects, the network, the bitwidths, and the accelerator, the challenges of such a joint search have not yet been fully understood and addressed. The key challenges are (1) the dilemma of either exploding memory consumption due to the huge joint space or settling for sub-optimal designs, (2) the discrete nature of the accelerator design space, which is coupled with yet different from that of the networks and bitwidths, and (3) the chicken-and-egg problem in network-accelerator co-search: co-search requires operation-wise hardware costs, but these are unavailable during the search because the optimal accelerator depends on the whole network, which is itself still being searched. To tackle these daunting challenges towards optimal and fast development of DNN accelerators, we propose a framework dubbed Auto-NBA that jointly searches for the Networks, Bitwidths, and Accelerators by efficiently localizing the optimal design within the huge joint design space for each target dataset and acceleration specification. Auto-NBA integrates a heterogeneous sampling strategy to achieve unbiased search with constant memory consumption, and a novel joint-search pipeline equipped with a generic differentiable accelerator search engine. Extensive experiments and ablation studies validate that both Auto-NBA-generated networks and accelerators consistently outperform state-of-the-art designs (including co-search/exploration techniques, hardware-aware NAS methods, and DNN accelerators) in terms of search time, task accuracy, and accelerator efficiency. Our codes are available at: https://github.com/RICE-EIC/Auto-NBA.
    Comment: Accepted at ICML 202
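    The following NumPy sketch conveys only the general flavor of a differentiable joint search (the names, the softmax relaxation, and the three-way cost table are assumptions for illustration, not Auto-NBA's actual engine or sampling strategy): discrete network, bitwidth, and accelerator choices are relaxed into probability distributions, so an expected hardware cost can be minimized jointly with the task loss.

        import numpy as np

        def softmax(logits):
            e = np.exp(logits - logits.max())
            return e / e.sum()

        def expected_hw_cost(op_logits, bit_logits, acc_logits, cost_table):
            """Expected hardware cost of one layer under relaxed (softmax)
            distributions over operator, bitwidth, and accelerator choices.
            cost_table[i, j, k] = cost of operator i at bitwidth j on accelerator k."""
            p_op, p_bit, p_acc = softmax(op_logits), softmax(bit_logits), softmax(acc_logits)
            return np.einsum('i,j,k,ijk->', p_op, p_bit, p_acc, cost_table)

        # Example: 4 candidate operators, 3 bitwidths, 5 accelerator configurations.
        rng = np.random.default_rng(0)
        cost = expected_hw_cost(rng.standard_normal(4), rng.standard_normal(3),
                                rng.standard_normal(5), rng.uniform(1, 10, size=(4, 3, 5)))
        # In a full search this term would be added to the task loss and the logits updated
        # by gradient descent; the final design takes the argmax choice of each distribution.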

    NetBooster: Empowering Tiny Deep Learning By Standing on the Shoulders of Deep Giants

    Tiny deep learning has attracted increasing attention, driven by the substantial demand for deploying deep learning on numerous intelligent Internet-of-Things devices. However, it remains challenging to unleash tiny deep learning's full potential on both large-scale datasets and downstream tasks because of the under-fitting caused by the limited model capacity of tiny neural networks (TNNs). To this end, we propose a framework called NetBooster that empowers tiny deep learning by augmenting the architectures of TNNs via an expansion-then-contraction strategy. Extensive experiments show that NetBooster consistently outperforms state-of-the-art tiny deep learning solutions.
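    The expansion-then-contraction idea can be sketched as follows (purely illustrative and heavily simplified; NetBooster's actual expansion inserts richer blocks during training and its contraction is more involved than a plain linear fold): each tiny layer is temporarily expanded into a wider two-layer block for training, then folded back so the deployed network keeps the original tiny cost.

        import numpy as np

        def expand_linear(w, width_multiplier=2, rng=None):
            """Expand a tiny linear layer (out_f x in_f) into a two-layer block
            (w2 @ w1) with a wider hidden dimension for the training phase."""
            rng = np.random.default_rng() if rng is None else rng
            out_f, in_f = w.shape
            hidden = width_multiplier * in_f
            w1 = rng.standard_normal((hidden, in_f)) * 0.1
            w2 = w @ np.linalg.pinv(w1)   # start equivalent: w2 @ w1 == w (w1 has full column rank)
            return w1, w2

        def contract_linear(w1, w2):
            """Fold the trained block back into a single layer so inference-time
            cost matches the original tiny network."""
            return w2 @ w1

        # Example round trip: the contracted layer reproduces the original weights.
        rng = np.random.default_rng(0)
        w = rng.standard_normal((8, 16))
        w1, w2 = expand_linear(w, rng=rng)
        assert np.allclose(contract_linear(w1, w2), w)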

    Robust Tickets Can Transfer Better: Drawing More Transferable Subnetworks in Transfer Learning

    Transfer learning leverages feature representations of deep neural networks (DNNs) pretrained on source tasks with rich data to enable effective finetuning on downstream tasks. However, the pretrained models needed to deliver generalizable representations are often prohibitively large, which limits their deployment on edge devices with constrained resources. To close this gap, we propose a new transfer learning pipeline built on our finding that robust tickets transfer better: subnetworks drawn with properly induced adversarial robustness achieve better transferability than vanilla lottery-ticket subnetworks. Extensive experiments and ablation studies validate that the proposed pipeline achieves enhanced accuracy-sparsity trade-offs across diverse downstream tasks and sparsity patterns, further enriching the lottery ticket hypothesis.
    Comment: Accepted by DAC 202
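    A minimal sketch of the ticket-drawing step, under the common lottery-ticket assumption of global magnitude pruning (the paper's exact procedure, including how adversarial robustness is induced during pretraining, is not detailed here, and the helper below is hypothetical):

        import numpy as np

        def draw_ticket(weights, sparsity=0.8):
            """Draw a subnetwork mask by global magnitude pruning: keep the
            largest-magnitude weights, zero out the rest. In the proposed pipeline
            the weights would come from an adversarially robust pretrained model
            rather than a vanilla one."""
            flat = np.abs(np.concatenate([w.ravel() for w in weights]))
            keep = max(1, int(round((1.0 - sparsity) * flat.size)))
            threshold = np.sort(flat)[-keep]
            return [(np.abs(w) >= threshold).astype(w.dtype) for w in weights]

        # Example: mask two pretrained layers at 80% sparsity; finetuning then updates
        # only the surviving weights (w * mask) on the downstream task.
        rng = np.random.default_rng(0)
        masks = draw_ticket([rng.standard_normal((64, 32)), rng.standard_normal((10, 64))])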