Accelerating Training of Deep Neural Networks via Sparse Edge Processing
We propose a reconfigurable hardware architecture for deep neural networks
(DNNs) capable of online training and inference, which uses algorithmically
pre-determined, structured sparsity to significantly lower memory and
computational requirements. This novel architecture introduces the notion of
edge-processing to provide flexibility and combines junction pipelining and
operational parallelization to speed up training. The overall effect is to
reduce network complexity by factors up to 30x and training time by up to 35x
relative to GPUs, while maintaining high fidelity of inference results. This
has the potential to enable extensive parameter searches and development of the
largely unexplored theoretical foundation of DNNs. The architecture
automatically adapts itself to different network sizes given available hardware
resources. As proof of concept, we show results obtained for different bit
widths.
Comment: Presented at the 26th International Conference on Artificial Neural
Networks (ICANN) 2017 in Alghero, Italy.
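To make the idea of algorithmically pre-determined, structured sparsity concrete, here is a minimal Python sketch: each output neuron connects to a fixed, small set of inputs chosen by a simple stride rule, so the connectivity can be regenerated on the fly instead of stored. The stride rule, fan-in, and layer sizes below are illustrative assumptions, not the paper's actual junction structure, and the hardware pipelining is not modeled.

```python
import numpy as np

def structured_mask(n_in, n_out, fanin, offset=7):
    """Fixed d-regular connectivity: output j takes `fanin` inputs chosen
    by a deterministic stride rule, so no connectivity table is stored."""
    mask = np.zeros((n_out, n_in), dtype=np.float32)
    for j in range(n_out):
        cols = (j * offset + np.arange(fanin)) % n_in
        mask[j, cols] = 1.0
    return mask

mask = structured_mask(n_in=256, n_out=64, fanin=8)   # 1/32 of a dense layer
w = np.random.randn(64, 256).astype(np.float32) * mask
x = np.random.randn(256).astype(np.float32)
y = w @ x                       # only the masked weights ever need memory
print(mask.sum() / mask.size)   # density: 8/256 = 0.03125
```

With a fan-in of 8 out of 256 inputs, the layer stores and multiplies 32x fewer weights than its dense counterpart, which is the kind of complexity reduction the abstract quotes.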
Self-Attentive Pooling for Efficient Deep Learning
Efficient custom pooling techniques that can aggressively trim the dimensions
of a feature map and thereby reduce inference compute and memory footprint for
resource-constrained computer vision applications have recently gained
significant traction. However, prior pooling works extract only the local
context of the activation maps, limiting their effectiveness. In contrast, we
propose a novel non-local self-attentive pooling method that can be used as a
drop-in replacement to the standard pooling layers, such as max/average pooling
or strided convolution. The proposed self-attention module uses patch
embedding, multi-head self-attention, and spatial-channel restoration, followed
by sigmoid activation and exponential soft-max. This self-attention mechanism
efficiently aggregates dependencies between non-local activation patches during
down-sampling. Extensive experiments on standard object classification and
detection tasks with various convolutional neural network (CNN) architectures
demonstrate the superiority of our proposed mechanism over the state-of-the-art
(SOTA) pooling techniques. In particular, we surpass the test accuracy of
existing pooling techniques on different variants of MobileNet-V2 on ImageNet
by an average of 1.2%. With the aggressive down-sampling of the activation maps
in the initial layers (providing up to 22x reduction in memory consumption),
our approach achieves 1.43% higher test accuracy compared to SOTA techniques
with iso-memory footprints. This enables the deployment of our models in
memory-constrained devices, such as micro-controllers (without losing
significant accuracy), because the initial activation maps consume a
significant amount of on-chip memory for high-resolution images required for
complex vision tasks. Our proposed pooling method also leverages the idea of
channel pruning to further reduce memory footprints.
Comment: 9 pages, 4 figures, conference.
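The following PyTorch sketch shows one plausible reading of the described module: patch embedding via a strided convolution, multi-head self-attention across all patches (the non-local step), and a restoration layer whose sigmoid output gates a strided average pool. All names, dimensions, and the final gating are my assumptions (the paper additionally applies an exponential soft-max, omitted here), not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePool2d(nn.Module):
    """Drop-in replacement for a stride-`stride` pooling layer that mixes
    non-local patch context before down-sampling."""
    def __init__(self, channels, stride=2, embed_dim=64, num_heads=4):
        super().__init__()
        self.stride = stride
        # Patch embedding: one token per output pixel.
        self.embed = nn.Conv2d(channels, embed_dim,
                               kernel_size=stride, stride=stride)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          batch_first=True)
        # Spatial-channel restoration back to the input channel count.
        self.restore = nn.Linear(embed_dim, channels)

    def forward(self, x):
        b, c, h, w = x.shape
        hp, wp = h // self.stride, w // self.stride
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, H'W', E)
        mixed, _ = self.attn(tokens, tokens, tokens)       # non-local mixing
        gate = torch.sigmoid(self.restore(mixed))          # (B, H'W', C)
        gate = gate.transpose(1, 2).reshape(b, c, hp, wp)
        return F.avg_pool2d(x, self.stride) * gate         # attentive pooling

x = torch.randn(1, 32, 16, 16)
print(SelfAttentivePool2d(32)(x).shape)  # torch.Size([1, 32, 8, 8])
```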
Covering conditions and algorithms for the synthesis of speed-independent circuits
Journal Article
This paper presents theory and algorithms for the synthesis of standard C-implementations of speed-independent circuits. These implementations are block-level circuits which may consist of atomic gates that perform complex functions in order to ensure hazard freedom. First, we present Boolean covering conditions that guarantee that the standard C-implementations operate correctly. Then, we present two algorithms that produce optimal solutions to the covering problem. The first algorithm is always applicable, but does not complete on large circuits. The second algorithm, motivated by our observation that our covering problem can often be solved with a single cube, finds the optimal single-cube solution when such a solution exists. When applicable, the second algorithm is dramatically more efficient than the first, more general algorithm. We present results for benchmark specifications which indicate that our single-cube algorithm is applicable on most benchmark circuits and reduces run times by over an order of magnitude. The block-level circuits generated by our algorithms are a good starting point for tools that perform technology mapping to obtain gate-level speed-independent circuits.
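The single-cube observation has a particularly clean form: any cube covering a set of required minterms must contain their supercube, so a single-cube solution exists exactly when that supercube avoids every excluded minterm. The sketch below illustrates this check in Python; the bit-string encoding and the required/excluded sets are illustrative, not the paper's exact covering conditions.

```python
def supercube(minterms):
    """Smallest cube containing all minterms; '-' marks a free literal."""
    cube = list(minterms[0])
    for m in minterms[1:]:
        cube = [a if a == b else '-' for a, b in zip(cube, m)]
    return cube

def covers(cube, minterm):
    return all(c in ('-', b) for c, b in zip(cube, minterm))

# A single cube can serve as the cover iff the supercube of the required
# states intersects none of the excluded states.
required = ['101', '111']
excluded = ['010', '000']
cube = supercube(required)                               # ['1', '-', '1']
print(cube, not any(covers(cube, m) for m in excluded))  # True: cube works
```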
Technology mapping of timed circuits
Journal Article
This paper presents an automated procedure for the technology mapping of timed circuits to practical gate libraries. Timed circuits are a class of asynchronous circuits that incorporate explicit timing information in the specification, which is used throughout the design process to optimize the implementation. Our procedure begins with a timed specification and a delay-annotated gate library description which must include 2-input AND gates, OR gates, and C-elements, but optionally can include higher-fanin gates, AND-OR-INVERT blocks, and generalized C-elements. Our procedure first generates a technology-independent timed circuit netlist composed of possibly high-fanin AND gates, OR gates, and 2-input C-elements. The procedure then investigates simultaneous decompositions of all high-fanin gates by adding state variables to the specification and performing resynthesis. Although multiple decompositions are explored, timing information is utilized to significantly reduce their number. Once all gates are sufficiently decomposed, the netlist can be mapped to the given gate library, taking advantage of any compact complex gates available. The decomposition and resynthesis steps have been fully automated within the synthesis tool ATACS, and we present results for several examples.
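As a toy illustration of one step this flow automates, the sketch below splits a high-fanin AND gate into a balanced tree of 2-input AND gates, emitting one netlist tuple per new gate. The net naming is invented, and the real procedure in ATACS also adds state variables and re-verifies timing during resynthesis, which this sketch omits.

```python
from itertools import count

def decompose_and(inputs, netlist, fresh):
    """Recursively split AND(inputs) into a balanced tree of 2-input ANDs."""
    if len(inputs) == 1:
        return inputs[0]
    mid = len(inputs) // 2
    a = decompose_and(inputs[:mid], netlist, fresh)
    b = decompose_and(inputs[mid:], netlist, fresh)
    out = f"n{next(fresh)}"              # fresh internal net name
    netlist.append((out, "AND2", a, b))
    return out

netlist = []
root = decompose_and(list("abcde"), netlist, count())
for gate in netlist:
    print(gate)   # 4 AND2 gates; `root` drives the original gate's output
```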
Morse Code Datasets for Machine Learning
We present an algorithm to generate synthetic datasets of tunable difficulty
on classification of Morse code symbols for supervised machine learning
problems, in particular, neural networks. The datasets are spatially
one-dimensional and have a small number of input features, leading to high
density of input information content. This makes them particularly challenging
when implementing network complexity reduction methods. We explore how network
performance is affected by deliberately adding various forms of noise and
expanding the feature set and dataset size. Finally, we establish several
metrics to indicate the difficulty of a dataset, and evaluate their merits. The
algorithm and datasets are open-source.
Comment: Presented at the 9th International Conference on Computing,
Communication and Networking Technologies (ICCCNT).
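As a rough picture of the kind of generator described, the sketch below renders each Morse symbol as short and long runs of ones in a 1-D frame, with random padding and additive Gaussian noise as difficulty knobs. The frame length, run widths, and noise model are my assumptions; the open-source datasets define their own parameters.

```python
import numpy as np

MORSE = {'A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.'}

def encode(symbol, frame_len=64, dot=1, dash=3, gap=1, noise=0.05, rng=None):
    """One 1-D training example: dots/dashes as runs of ones plus noise."""
    rng = rng or np.random.default_rng()
    x = np.zeros(frame_len)
    pos = int(rng.integers(0, 8))     # random left padding adds difficulty
    for mark in MORSE[symbol]:
        width = dot if mark == '.' else dash
        x[pos:pos + width] = 1.0
        pos += width + gap
    return x + rng.normal(0.0, noise, frame_len)

X = np.stack([encode('C') for _ in range(4)])    # four noisy 'C' examples
print(X.shape)                                   # (4, 64)
```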