A Survey of Prediction and Classification Techniques in Multicore Processor Systems
In multicore processor systems, being able to accurately predict the future provides new optimization opportunities that otherwise could not be exploited. For example, an oracle able to predict a certain application's behavior running on a smart phone could direct the power manager to switch to appropriate dynamic voltage and frequency scaling modes that would guarantee minimum levels of desired performance while reducing energy consumption and thereby prolonging battery life. Using predictions enables systems to become proactive rather than continue to operate in a reactive manner. This prediction-based proactive approach has become increasingly popular in the design and optimization of integrated circuits and of multicore processor systems. Prediction has evolved from simple forecasting to sophisticated machine-learning-based prediction and classification that learns from existing data, employs data mining, and predicts future behavior. This can be exploited by novel optimization techniques that span all layers of the computing stack. In this survey paper, we present a discussion of the most popular prediction and classification techniques in the general context of computing systems, with emphasis on multicore processors. The paper is far from comprehensive, but it will help readers interested in employing prediction in the optimization of multicore processor systems.
Balanced Quantization: An Effective and Efficient Approach to Quantized Neural Networks
Quantized Neural Networks (QNNs), which use low bitwidth numbers for
representing parameters and performing computations, have been proposed to
reduce the computation complexity, storage size and memory usage. In QNNs,
parameters and activations are uniformly quantized, such that the
multiplications and additions can be accelerated by bitwise operations.
However, distributions of parameters in Neural Networks are often imbalanced,
such that the uniform quantization determined from extremal values may
underutilize the available bitwidth. In this paper, we propose a novel quantization
method that can ensure the balance of distributions of quantized values. Our
method first recursively partitions the parameters by percentiles into balanced
bins, and then applies uniform quantization. We also introduce computationally
cheaper approximations of percentiles to reduce the computation overhead
introduced. Overall, our method improves the prediction accuracies of QNNs
without introducing extra computation during inference, has negligible impact
on training speed, and is applicable to both Convolutional Neural Networks and
Recurrent Neural Networks. Experiments on standard datasets including ImageNet
and Penn Treebank confirm the effectiveness of our method. On ImageNet, the
top-5 error rate of our 4-bit quantized GoogLeNet model is 12.7%, which is
superior to the state of the art for QNNs.
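To make the balanced-quantization idea above concrete, here is a minimal NumPy sketch. This is an illustration only: the paper's recursive percentile partitioning is collapsed into a single percentile call, and the output levels in [-1, 1] are an assumption.

```python
import numpy as np

def balanced_quantize(w, bits=2):
    """Percentile-based balanced quantization (sketch).

    Percentile boundaries split the weights into 2**bits equally
    populated bins; each bin is then mapped to one of 2**bits
    uniformly spaced levels in [-1, 1].
    """
    n_bins = 2 ** bits
    # Equally populated bin edges from the empirical percentiles.
    edges = np.percentile(w, np.linspace(0, 100, n_bins + 1))
    # Bin index of every weight (0 .. n_bins - 1).
    idx = np.clip(np.searchsorted(edges, w, side="right") - 1, 0, n_bins - 1)
    # Uniformly spaced quantized levels.
    levels = np.linspace(-1.0, 1.0, n_bins)
    return levels[idx]

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)   # imbalanced (bell-shaped) weights
q = balanced_quantize(w, bits=2)  # each level now holds ~25% of the values
```

A plain uniform quantizer over [min(w), max(w)] would instead crowd most of these bell-shaped weights into the central levels and waste the extreme ones, which is exactly the imbalance the percentile step avoids.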
On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation
Machine Learning (ML) is making a strong resurgence, in step with the massive
generation of unstructured data, which in turn requires massive computational
resources. Due to the inherently compute- and power-intensive structure of
Neural Networks (NNs), hardware accelerators emerge as a promising solution.
However, with technology node scaling below 10nm, hardware accelerators become
more susceptible to faults, which in turn can impact the NN accuracy. In this
paper, we study the resilience aspects of Register-Transfer Level (RTL) model
of NN accelerators, in particular, fault characterization and mitigation. By
following a High-Level Synthesis (HLS) approach, first, we characterize the
vulnerability of various components of RTL NN. We observed that the severity of
faults depends on both i) application-level specifications, i.e., NN data
(inputs, weights, or intermediate values), NN layers, and NN activation functions, and
ii) architectural-level specifications, i.e., data representation model and the
parallelism degree of the underlying accelerator. Second, motivated by
characterization results, we present a low-overhead fault mitigation technique
that can efficiently correct bit flips, 47.3% better than state-of-the-art
methods.
Comment: 8 pages, 6 figures
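The observation that fault severity depends on the data representation can be illustrated with a toy bit-flip model. This is a hypothetical sketch, not the paper's fault-injection framework:

```python
def flip_bit(value, bit, width=8):
    """Flip one bit of a two's-complement fixed-point word and
    re-interpret the result as a signed integer (sketch)."""
    mask = (1 << width) - 1
    raw = (int(value) & mask) ^ (1 << bit)
    return raw - (1 << width) if raw >= 1 << (width - 1) else raw

# Severity depends on bit position and representation: a high-order
# flip perturbs an 8-bit weight far more than a low-order one, and
# a flip in the sign bit changes the sign entirely.
w = 12                 # 0b00001100
lo = flip_bit(w, 0)    # -> 13   (tiny perturbation)
hi = flip_bit(w, 6)    # -> 76   (large perturbation)
sg = flip_bit(w, 7)    # -> -116 (sign flipped)
```

The same flip in a floating-point exponent field would be even more disruptive, which is one intuition for why the data representation model matters in the characterization above.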
Stochastic Configuration Machines: FPGA Implementation
Neural networks for industrial applications generally have additional
constraints such as response speed, memory size and power usage. Randomized
learners can address some of these issues. However, hardware solutions can
provide better resource reduction whilst maintaining the model's performance.
Stochastic configuration networks (SCNs) are a prime choice in industrial
applications due to their merits and feasibility for data modelling. Stochastic
Configuration Machines (SCMs) extend this to focus on reducing the memory
constraints by limiting the randomized weights to a binary value with a scalar
for each node and using a mechanism model to improve the learning performance
and result interpretability. This paper aims to implement SCM models on a field
programmable gate array (FPGA) and introduce binary-coded inputs to the
algorithm. Results are reported for two benchmark and two industrial datasets,
including SCM with single-layer and deep architectures.
Comment: 19 pages, 9 figures, 8 tables
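A rough sketch of what a single SCM-style node computes, given the memory-saving constraint described above (the tanh activation and the exact weight layout are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def scm_node(x, sign_w, scale, bias):
    """One hidden node with SCM-style storage (sketch): the randomized
    weights are limited to binary {-1, +1} values plus a single scalar
    per node, so storage is one bit per weight and one float, which is
    what makes an FPGA implementation attractive."""
    return np.tanh(scale * (sign_w @ x) + bias)

x = np.array([0.2, -0.5, 1.0])
out = scm_node(x, sign_w=np.array([1, -1, 1]), scale=0.5, bias=0.0)
```

On hardware, the `sign_w @ x` dot product reduces to additions and subtractions with no multipliers, and the single multiply by `scale` is shared across the node.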
RBNN: Memory-Efficient Reconfigurable Deep Binary Neural Network with IP Protection for Internet of Things
Though deep neural network models exhibit outstanding performance for various
applications, their large model size and extensive floating-point operations
render their deployment on mobile computing platforms, and in particular on
Internet of Things devices, a major challenge. One appealing solution is model
quantization that reduces the model size and uses integer operations commonly
supported by microcontrollers. To this end, a 1-bit quantized DNN model or
deep binary neural network maximizes the memory efficiency, where each
parameter in a BNN model has only 1-bit. In this paper, we propose a
reconfigurable BNN (RBNN) to further amplify the memory efficiency for
resource-constrained IoT devices. Generally, the RBNN can be reconfigured on
demand to achieve any one of M (M>1) distinct tasks with the same parameter
set, thus only a single task determines the memory requirements. In other
words, the memory utilization is improved by a factor of M. Our extensive experiments
corroborate that up to seven commonly used tasks can co-exist (the value of M
can be larger). These tasks, which have varying numbers of classes, show no or
negligible accuracy drop-off on three popular binarized DNN architectures
including VGG, ResNet, and ReActNet. The tasks span across different domains,
e.g., computer vision and audio domains validated herein, with the prerequisite
that the model architecture can serve those cross-domain tasks. To protect the
intellectual property of an RBNN model, the reconfiguration can be controlled
by both a user key and a device-unique root key generated by the intrinsic
hardware fingerprint. By doing so, an RBNN model can only be used by a paying
user on an authorized device, thus benefiting both the user and the model provider.
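The key-controlled reconfiguration could look roughly like the following XOR-masking sketch. This is entirely hypothetical: the key-derivation scheme, the mask construction, and all function names here are assumptions, not the paper's construction.

```python
import hashlib
import numpy as np

def task_mask(user_key, device_key, n_params):
    """Hypothetical keystream derived from a user key and a
    device-unique root key (e.g. a hardware fingerprint)."""
    seed = hashlib.sha256(user_key + b"|" + device_key).digest()
    rng = np.random.default_rng(int.from_bytes(seed[:8], "little"))
    return rng.integers(0, 2, n_params, dtype=np.uint8)

def reconfigure(packed_bits, user_key, device_key):
    """XOR-unmask the stored 1-bit parameters. XOR is its own
    inverse, so the same call locks and unlocks; wrong keys yield
    a random-looking, useless parameter set."""
    return packed_bits ^ task_mask(user_key, device_key, packed_bits.size)

weights = np.random.default_rng(1).integers(0, 2, 64, dtype=np.uint8)
locked = reconfigure(weights, b"user-key", b"device-root-key")
unlocked = reconfigure(locked, b"user-key", b"device-root-key")
```

Because the mask depends on both keys, a model copied to another device (different root key) or used without the paid user key never reproduces the trained 1-bit parameters.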
DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization
Efficiently deploying deep neural networks on low-resource edge devices is
challenging due to their ever-increasing resource requirements. To address this
issue, researchers have proposed multiplication-free neural networks, such as
Power-of-Two quantization, also known as Shift networks, which aim to reduce
memory usage and simplify computation. However, existing low-bit Shift networks
are not as accurate as their full-precision counterparts, typically suffering
from limited weight range encoding schemes and quantization loss. In this
paper, we propose the DenseShift network, which significantly improves the
accuracy of Shift networks, achieving competitive performance to full-precision
networks for vision and speech applications. In addition, we introduce a method
to deploy an efficient DenseShift network using non-quantized floating-point
activations, while obtaining 1.6X speed-up over existing methods. To achieve
this, we demonstrate that zero-weight values in low-bit Shift networks do not
contribute to model capacity and negatively impact inference computation. To
address this issue, we propose a zero-free shifting mechanism that simplifies
inference and increases model capacity. We further propose a sign-scale
decomposition design to enhance training efficiency and a low-variance random
initialization strategy to improve the model's transfer learning performance.
Our extensive experiments on various computer vision and speech tasks
demonstrate that DenseShift outperforms existing low-bit multiplication-free
networks and achieves competitive performance compared to full-precision
networks. Furthermore, our proposed approach exhibits strong transfer learning
performance without a drop in accuracy. Our code was released on GitHub.
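The core power-of-two idea, including a zero-free level set in the spirit of DenseShift, can be sketched as follows (the exponent range and rounding rule here are assumptions, not the paper's exact scheme):

```python
import numpy as np

def shift_quantize(w, bits=3):
    """Power-of-two (Shift) quantization sketch: each weight becomes
    sign(w) * 2**p, so multiplication by a weight reduces to a bit
    shift. No zero level is emitted (a zero-free level set), echoing
    the observation that zero weights add no model capacity."""
    sign = np.where(w >= 0.0, 1.0, -1.0)
    p_min = -(2 ** (bits - 1)) + 1   # smallest exponent for this bitwidth
    # Round log2|w| to the nearest integer exponent, clamped to range.
    p = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), p_min, 0)
    return sign * 2.0 ** p

w = np.array([0.3, -0.9, 0.05])
q = shift_quantize(w)   # -> [0.25, -1.0, 0.125]
```

Every output magnitude is an exact power of two and strictly nonzero, so an integer accumulator needs only shifts and sign handling at inference time.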