Accelerating Training of Deep Neural Networks via Sparse Edge Processing
We propose a reconfigurable hardware architecture for deep neural networks
(DNNs) capable of online training and inference, which uses algorithmically
pre-determined, structured sparsity to significantly lower memory and
computational requirements. This novel architecture introduces the notion of
edge-processing to provide flexibility and combines junction pipelining and
operational parallelization to speed up training. The overall effect is to
reduce network complexity by factors up to 30x and training time by up to 35x
relative to GPUs, while maintaining high fidelity of inference results. This
has the potential to enable extensive parameter searches and development of the
largely unexplored theoretical foundation of DNNs. The architecture
automatically adapts itself to different network sizes given available hardware
resources. As proof of concept, we show results obtained for different bit
widths.
Comment: Presented at the 26th International Conference on Artificial Neural
Networks (ICANN) 2017 in Alghero, Italy.
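To make the idea of algorithmically pre-determined, structured sparsity concrete, here is a minimal Python sketch: each output neuron connects to a fixed, small set of inputs chosen by a simple stride rule, so the connectivity can be regenerated on the fly instead of stored. The stride rule, fan-in, and layer sizes below are illustrative assumptions, not the paper's actual junction structure, and the hardware pipelining is not modeled.

```python
import numpy as np

def structured_mask(n_in, n_out, fanin, offset=7):
    """Fixed d-regular connectivity: output j takes `fanin` inputs chosen
    by a deterministic stride rule, so no connectivity table is stored."""
    mask = np.zeros((n_out, n_in), dtype=np.float32)
    for j in range(n_out):
        cols = (j * offset + np.arange(fanin)) % n_in
        mask[j, cols] = 1.0
    return mask

mask = structured_mask(n_in=256, n_out=64, fanin=8)   # 1/32 of a dense layer
w = np.random.randn(64, 256).astype(np.float32) * mask
x = np.random.randn(256).astype(np.float32)
y = w @ x                       # only the masked weights ever need memory
print(mask.sum() / mask.size)   # density: 8/256 = 0.03125
```

With a fan-in of 8 out of 256 inputs, the layer stores and multiplies 32x fewer weights than its dense counterpart, which is the kind of complexity reduction the abstract quotes.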
Self-Attentive Pooling for Efficient Deep Learning
Efficient custom pooling techniques that can aggressively trim the dimensions
of a feature map and thereby reduce inference compute and memory footprint for
resource-constrained computer vision applications have recently gained
significant traction. However, prior pooling works extract only the local
context of the activation maps, limiting their effectiveness. In contrast, we
propose a novel non-local self-attentive pooling method that can be used as a
drop-in replacement to the standard pooling layers, such as max/average pooling
or strided convolution. The proposed self-attention module uses patch
embedding, multi-head self-attention, and spatial-channel restoration, followed
by sigmoid activation and exponential soft-max. This self-attention mechanism
efficiently aggregates dependencies between non-local activation patches during
down-sampling. Extensive experiments on standard object classification and
detection tasks with various convolutional neural network (CNN) architectures
demonstrate the superiority of our proposed mechanism over the state-of-the-art
(SOTA) pooling techniques. In particular, we surpass the test accuracy of
existing pooling techniques on different variants of MobileNet-V2 on ImageNet
by an average of 1.2%. With the aggressive down-sampling of the activation maps
in the initial layers (providing up to 22x reduction in memory consumption),
our approach achieves 1.43% higher test accuracy compared to SOTA techniques
with iso-memory footprints. This enables the deployment of our models in
memory-constrained devices, such as micro-controllers (without losing
significant accuracy), because the initial activation maps consume a
significant amount of on-chip memory for high-resolution images required for
complex vision tasks. Our proposed pooling method also leverages the idea of
channel pruning to further reduce memory footprints.
Comment: 9 pages, 4 figures, conference.
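The following PyTorch sketch shows one plausible reading of the described module: patch embedding via a strided convolution, multi-head self-attention across all patches (the non-local step), and a restoration layer whose sigmoid output gates a strided average pool. All names, dimensions, and the final gating are my assumptions (the paper additionally applies an exponential soft-max, omitted here), not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePool2d(nn.Module):
    """Drop-in replacement for a stride-`stride` pooling layer that mixes
    non-local patch context before down-sampling."""
    def __init__(self, channels, stride=2, embed_dim=64, num_heads=4):
        super().__init__()
        self.stride = stride
        # Patch embedding: one token per output pixel.
        self.embed = nn.Conv2d(channels, embed_dim,
                               kernel_size=stride, stride=stride)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          batch_first=True)
        # Spatial-channel restoration back to the input channel count.
        self.restore = nn.Linear(embed_dim, channels)

    def forward(self, x):
        b, c, h, w = x.shape
        hp, wp = h // self.stride, w // self.stride
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, H'W', E)
        mixed, _ = self.attn(tokens, tokens, tokens)       # non-local mixing
        gate = torch.sigmoid(self.restore(mixed))          # (B, H'W', C)
        gate = gate.transpose(1, 2).reshape(b, c, hp, wp)
        return F.avg_pool2d(x, self.stride) * gate         # attentive pooling

x = torch.randn(1, 32, 16, 16)
print(SelfAttentivePool2d(32)(x).shape)  # torch.Size([1, 32, 8, 8])
```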
Covering conditions and algorithms for the synthesis of speed-independent circuits
Journal Article
This paper presents theory and algorithms for the synthesis of standard C-implementations of speed-independent circuits. These implementations are block-level circuits which may consist of atomic gates that perform complex functions in order to ensure hazard freedom. First, we present Boolean covering conditions that guarantee that the standard C-implementations operate correctly. Then, we present two algorithms that produce optimal solutions to the covering problem. The first algorithm is always applicable, but does not complete on large circuits. The second algorithm, motivated by our observation that our covering problem can often be solved with a single cube, finds the optimal single-cube solution when such a solution exists. When applicable, the second algorithm is dramatically more efficient than the first, more general algorithm. We present results for benchmark specifications which indicate that our single-cube algorithm is applicable on most benchmark circuits and reduces run times by over an order of magnitude. The block-level circuits generated by our algorithms are a good starting point for tools that perform technology mapping to obtain gate-level speed-independent circuits.
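The single-cube observation has a particularly clean form: any cube covering a set of required minterms must contain their supercube, so a single-cube solution exists exactly when that supercube avoids every excluded minterm. The sketch below illustrates this check in Python; the bit-string encoding and the required/excluded sets are illustrative, not the paper's exact covering conditions.

```python
def supercube(minterms):
    """Smallest cube containing all minterms; '-' marks a free literal."""
    cube = list(minterms[0])
    for m in minterms[1:]:
        cube = [a if a == b else '-' for a, b in zip(cube, m)]
    return cube

def covers(cube, minterm):
    return all(c in ('-', b) for c, b in zip(cube, minterm))

# A single cube can serve as the cover iff the supercube of the required
# states intersects none of the excluded states.
required = ['101', '111']
excluded = ['010', '000']
cube = supercube(required)                               # ['1', '-', '1']
print(cube, not any(covers(cube, m) for m in excluded))  # True: cube works
```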
Technology mapping of timed circuits
Journal Article
This paper presents an automated procedure for the technology mapping of timed circuits to practical gate libraries. Timed circuits are a class of asynchronous circuits that incorporate explicit timing information in the specification, which is used throughout the design process to optimize the implementation. Our procedure begins with a timed specification and a delay-annotated gate library description which must include 2-input AND gates, OR gates, and C-elements, but optionally can include higher-fanin gates, AND-OR-INVERT blocks, and generalized C-elements. Our procedure first generates a technology-independent timed circuit netlist composed of possibly high-fanin AND gates, OR gates, and 2-input C-elements. The procedure then investigates simultaneous decompositions of all high-fanin gates by adding state variables to the specification and performing resynthesis. Although multiple decompositions are explored, timing information is utilized to significantly reduce their number. Once all gates are sufficiently decomposed, the netlist can be mapped to the given gate library, taking advantage of any compact complex gates available. The decomposition and resynthesis steps have been fully automated within the synthesis tool ATACS, and we present results for several examples.
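As a toy illustration of one step this flow automates, the sketch below splits a high-fanin AND gate into a balanced tree of 2-input AND gates, emitting one netlist tuple per new gate. The net naming is invented, and the real procedure in ATACS also adds state variables and re-verifies timing during resynthesis, which this sketch omits.

```python
from itertools import count

def decompose_and(inputs, netlist, fresh):
    """Recursively split AND(inputs) into a balanced tree of 2-input ANDs."""
    if len(inputs) == 1:
        return inputs[0]
    mid = len(inputs) // 2
    a = decompose_and(inputs[:mid], netlist, fresh)
    b = decompose_and(inputs[mid:], netlist, fresh)
    out = f"n{next(fresh)}"              # fresh internal net name
    netlist.append((out, "AND2", a, b))
    return out

netlist = []
root = decompose_and(list("abcde"), netlist, count())
for gate in netlist:
    print(gate)   # 4 AND2 gates; `root` drives the original gate's output
```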
Morse Code Datasets for Machine Learning
We present an algorithm to generate synthetic datasets of tunable difficulty
on classification of Morse code symbols for supervised machine learning
problems, in particular, neural networks. The datasets are spatially
one-dimensional and have a small number of input features, leading to high
density of input information content. This makes them particularly challenging
when implementing network complexity reduction methods. We explore how network
performance is affected by deliberately adding various forms of noise and
expanding the feature set and dataset size. Finally, we establish several
metrics to indicate the difficulty of a dataset, and evaluate their merits. The
algorithm and datasets are open-source.
Comment: Presented at the 9th International Conference on Computing,
Communication and Networking Technologies (ICCCNT).
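As a rough picture of the kind of generator described, the sketch below renders each Morse symbol as short and long runs of ones in a 1-D frame, with random padding and additive Gaussian noise as difficulty knobs. The frame length, run widths, and noise model are my assumptions; the open-source datasets define their own parameters.

```python
import numpy as np

MORSE = {'A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.'}

def encode(symbol, frame_len=64, dot=1, dash=3, gap=1, noise=0.05, rng=None):
    """One 1-D training example: dots/dashes as runs of ones plus noise."""
    rng = rng or np.random.default_rng()
    x = np.zeros(frame_len)
    pos = int(rng.integers(0, 8))     # random left padding adds difficulty
    for mark in MORSE[symbol]:
        width = dot if mark == '.' else dash
        x[pos:pos + width] = 1.0
        pos += width + gap
    return x + rng.normal(0.0, noise, frame_len)

X = np.stack([encode('C') for _ in range(4)])    # four noisy 'C' examples
print(X.shape)                                   # (4, 64)
```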