Efficient machine learning: models and accelerations
One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models typically consist of multiple cascaded layers, such as deep neural networks, and contain millions to hundreds of millions of parameters (i.e., weights). Larger-scale models tend to enable the extraction of more complex high-level features and therefore lead to a significant improvement in overall accuracy. On the other hand, the layered deep structure and large model sizes also demand increased computational capability and memory. In order to achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interest: acceleration and model compression. The underlying goal of both trends is to preserve model quality so that predictions remain accurate. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems.
To explore these two domains, this thesis first presents a cogent confabulation network for the sentence completion problem. We use the Chinese language as a case study to describe our exploration of cogent confabulation based text recognition models. The exploration and optimization of the cogent confabulation based models have been conducted through various comparisons, and the optimized network offers better sentence completion accuracy. To accelerate sentence completion on a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduces runtime, improves recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintains balanced progress in status updates among all neurons. A lexicon scheduling algorithm is presented to further improve model performance.
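To make the recall step concrete, here is a minimal Python sketch of one confabulation recall, assuming (as a simplification of the models discussed in the thesis) that knowledge links are stored as log-conditional-probability strengths and that a tolerance band relaxes the strict maximum-excitation choice; the function and variable names are illustrative only.

```python
import math
from collections import defaultdict

# Minimal sketch of one confabulation recall step (hypothetical data layout).
# Each lexicon holds candidate symbols; knowledge links map a (source_symbol,
# target_symbol) pair to a link strength, e.g. log P(source | target).

def recall(target_candidates, active_symbols, knowledge_links, band=0.0):
    """Pick the target symbol(s) with the highest summed excitation.

    target_candidates: symbols in the lexicon being completed
    active_symbols:    symbols currently asserted in the other lexicons
    knowledge_links:   dict mapping (source, target) -> link strength
    band:              tolerance band; candidates within `band` of the maximum
                       are all kept, mimicking a relaxed evaluation order
    """
    excitation = defaultdict(float)
    for target in target_candidates:
        for source in active_symbols:
            excitation[target] += knowledge_links.get((source, target), 0.0)
    best = max(excitation.values(), default=0.0)
    return [t for t, e in excitation.items() if best - e <= band]

# Toy usage: complete the missing word given two context words.
links = {("the", "cat"): math.log(0.4), ("sat", "cat"): math.log(0.3),
         ("the", "dog"): math.log(0.2), ("sat", "dog"): math.log(0.1)}
print(recall(["cat", "dog"], ["the", "sat"], links))  # -> ['cat']
```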
As deep neural networks have proven effective for many real-life applications and are increasingly deployed on low-power devices, we then investigate accelerating neural network inference using a hardware-friendly computing paradigm, stochastic computing. It is an approximate computing paradigm that requires a small hardware footprint and achieves high energy efficiency. Applying stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss due to the approximation. The synthesis results show that the proposed design achieves remarkably low hardware cost and power/energy consumption.
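As an illustration of why stochastic computing is so hardware-friendly, the following sketch shows generic unipolar stochastic multiplication in plain Python: values in [0, 1] become random bitstreams, and a single AND gate performs the multiplication. This is a textbook-style example, not the hardware blocks designed in the thesis.

```python
import random

# Unipolar stochastic computing: a value p in [0, 1] is encoded as a bitstream
# whose probability of a 1 equals p; multiplying two values reduces to a
# bitwise AND of two independent streams.

def to_stream(p, length, rng):
    return [1 if rng.random() < p else 0 for _ in range(length)]

def from_stream(bits):
    return sum(bits) / len(bits)

rng = random.Random(0)
length = 4096                      # longer streams -> lower approximation error
a, b = 0.6, 0.5
sa = to_stream(a, length, rng)
sb = to_stream(b, length, rng)
product = [x & y for x, y in zip(sa, sb)]   # AND gate performs multiplication
print(from_stream(product))        # close to a * b = 0.30
```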
Modern neural networks usually contain a huge number of parameters that cannot fit into embedded devices, so compressing deep learning models together with accelerating them attracts our attention. We introduce structured-matrix-based neural networks to address this problem. A circulant matrix is one such structured matrix: the whole matrix can be represented by a single vector, so the matrix is compressed. We further investigate a more flexible structure based on the circulant matrix, called the block-circulant matrix. It partitions a matrix into several smaller blocks and makes each submatrix circulant, so the compression ratio is controllable. With the help of Fourier-transform-based equivalent computation, the inference of the deep neural network can be accelerated energy-efficiently on FPGAs. We also optimize the training algorithm for block-circulant-matrix-based neural networks to obtain high accuracy after compression.
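The following sketch, which is our own illustrative example rather than the thesis implementation, shows how a block-circulant layer stores only the first column of each b x b block and computes the matrix-vector product with FFTs, then checks the result against the equivalent dense matrix.

```python
import numpy as np

# Block-circulant matrix-vector product via FFTs. The weight matrix is
# partitioned into b x b blocks; each block is circulant and stored as a
# single length-b vector (its first column), so storage drops from
# p*q*b*b values to p*q*b values.

def block_circulant_matvec(block_vectors, x, b):
    """block_vectors: array of shape (p, q, b), first column of each block
    x:              input vector of length q * b
    returns         output vector of length p * b
    """
    p, q, _ = block_vectors.shape
    x_fft = np.fft.fft(x.reshape(q, b), axis=1)      # FFT of every input block
    w_fft = np.fft.fft(block_vectors, axis=2)        # FFT of every weight block
    # Elementwise multiply in the frequency domain, sum over input blocks,
    # then transform back: circulant matvec == circular convolution.
    y_fft = np.einsum('pqb,qb->pb', w_fft, x_fft)
    return np.fft.ifft(y_fft, axis=1).real.reshape(p * b)

# Toy check against the dense equivalent.
rng = np.random.default_rng(0)
p, q, b = 2, 3, 4
w = rng.standard_normal((p, q, b))
x = rng.standard_normal(q * b)
dense = np.block([[np.stack([np.roll(w[i, j], k) for k in range(b)], axis=1)
                   for j in range(q)] for i in range(p)])
print(np.allclose(dense @ x, block_circulant_matvec(w, x, b)))  # True
```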
State of the Art of Deep Learning Technology and its Next Generation Architecture
A short-term memory AI model shows that, during the silent period of memory, the brain can use the short-term plasticity of synaptic connections between neurons to memorize information. CGRA computing energy efficiency can reach 1000 times that of CPU architectures, 100-1000 times that of GPU architectures, and more than 100 times that of FPGA architectures. FpgaConvNet, ALAMO, and Snowflake are mainly concerned with the feature extractor part of CNNs. DeepBurning and FP-DNN support recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. In a paper in Physical Review X, MIT researchers describe a new photonic accelerator that uses optical components and optical signal processing technology to reduce chip size, which would allow the chip to scale to neural networks several orders of magnitude larger than electronic chips support. By taking hardware performance and power consumption as indicators during the training phase, hardware-adjustable parameters, model weights, and topology can be jointly modified in the optimization process to co-optimize application-level accuracy with the required inference execution time and power consumption. Artificial intelligence with deep learning architectures is still in its infancy, but it has already brought great benefits to mankind.
Pruning random resistive memory for optimizing analogue AI
The rapid advancement of artificial intelligence (AI) has been marked by large language models exhibiting human-like intelligence. However, these models also present unprecedented challenges to energy consumption and environmental sustainability. One promising solution is to revisit analogue computing, a technique that predates digital computing and exploits emerging analogue electronic devices, such as resistive memory, which features in-memory computing, high scalability, and nonvolatility. However, analogue computing still faces the same challenges as before: programming nonidealities and expensive programming due to the underlying device physics. Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning to optimize the topology of a randomly weighted analogue resistive memory neural network. Software-wise, the topology of a randomly weighted neural network is optimized by pruning connections rather than precisely tuning resistive memory weights. Hardware-wise, we reveal the physical origin of the programming stochasticity using transmission electron microscopy, which is leveraged for large-scale and low-cost implementation of an overparameterized random neural network containing high-performance sub-networks. We implemented the co-design on a 40 nm 256K resistive memory macro, observing 17.3% and 19.9% accuracy improvements in image and audio classification on the FashionMNIST and Spoken Digits datasets, as well as a 9.8% (2%) improvement in PR (ROC) in image segmentation on the DRIVE dataset, respectively. This is accompanied by 82.1%, 51.2%, and 99.8% improvements in energy efficiency thanks to analogue in-memory computing. By embracing the intrinsic stochasticity and in-memory computing, this work may solve the biggest obstacle of analogue computing systems and thus unleash their immense potential for next-generation AI hardware.
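As a rough illustration of the pruning-instead-of-programming idea (a simplified strong-lottery-ticket-style sketch, not the paper's code), the snippet below keeps the random weights frozen, learns only a score per connection, and selects the top-scoring edges in each forward pass via a straight-through update.

```python
import numpy as np

# Sketch of training a randomly weighted layer by pruning: the random weights
# stay fixed (as if frozen in resistive memory); only per-connection scores
# are learned, and the forward pass keeps the top-k scored edges.

rng = np.random.default_rng(0)
n_in, n_out, keep = 64, 8, 0.5          # keep 50% of the random connections
W = rng.standard_normal((n_out, n_in))  # fixed random weights (never updated)
scores = rng.standard_normal((n_out, n_in)) * 0.01
lr = 0.1

def top_k_mask(s, frac):
    thresh = np.quantile(s, 1.0 - frac)
    return (s >= thresh).astype(W.dtype)

# Toy regression task: approximate a random target mapping.
X = rng.standard_normal((256, n_in))
Y = X @ rng.standard_normal((n_in, n_out))

for step in range(200):
    mask = top_k_mask(scores, keep)
    W_eff = W * mask                    # pruned network, weights untouched
    pred = X @ W_eff.T
    err = pred - Y                      # dL/dpred for 0.5 * MSE
    grad_W_eff = err.T @ X / len(X)     # gradient w.r.t. effective weights
    # Straight-through estimator: push the gradient onto the scores,
    # scaled by the frozen weight values.
    scores -= lr * grad_W_eff * W

print(float(np.mean((X @ (W * top_k_mask(scores, keep)).T - Y) ** 2)))
```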
Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions
In this paper we provide the largest published comparison of translation quality for phrase-based SMT and neural machine translation across 30 translation directions. For ten directions we also include hierarchical phrase-based MT. Experiments are performed on the recently published United Nations Parallel Corpus v1.0 and its large six-way sentence-aligned subcorpus. In the second part of the paper we investigate aspects of translation speed, introducing AmuNMT, our efficient neural machine translation decoder. We demonstrate that current neural machine translation could already be used for in-production systems when comparing words-per-second ratios.
Comment: Accepted for presentation at IWSLT 2016, Seattle
Performance optimization of convolution calculation by blocking and sparsity on GPU
Convolutional neural networks (CNNs) play a paramount role in machine learning and have made significant contributions in medical image classification, natural language processing, recommender systems, and so on. A successful convolutional neural network should achieve excellent performance with fast execution time, and the convolution operation dominates its total operation time. Therefore, in this paper, we propose a novel convolution method for Graphics Processing Units (GPUs), which reduces the convolution operation time and improves the execution speed by approximately 2X over the state-of-the-art convolution algorithm. Our work is based on the observation that the sparsity of the input feature map of the convolution operation is relatively large, and the zero values of the feature map are redundant for the convolution result. Therefore, we skip the zero-value calculations and improve speed by compressing the feature map. Besides, the feature maps of deeper layers of the network are small and the number of threads is limited; with a limited number of threads, it is necessary to reduce the amount of calculation to increase the calculation speed. Our algorithm works well for the convolution of deep-layer feature maps with large sparsity and small size.
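To illustrate the zero-skipping idea (as CPU-side pseudocode rather than the paper's GPU kernel), the sketch below gathers only the nonzero activations of the input feature map and scatters their contributions into the output, so the work grows with the number of nonzeros rather than the full map size.

```python
import numpy as np

# Zero-skipping convolution: compress the input feature map to its nonzero
# coordinates, then let each nonzero activation scatter its contribution into
# every output element it participates in.

def sparse_conv2d(x, k):
    """x: (H, W) input feature map, k: (kh, kw) kernel, 'valid' correlation."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    ys, xs = np.nonzero(x)                   # keep nonzero coordinates only
    for y, xc in zip(ys, xs):
        v = x[y, xc]
        for dy in range(kh):
            for dx in range(kw):
                oy, ox = y - dy, xc - dx
                if 0 <= oy < out.shape[0] and 0 <= ox < out.shape[1]:
                    out[oy, ox] += v * k[dy, dx]
    return out

# Check against a dense reference on a sparse feature map (ReLU-like zeros).
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
x[x < 0.8] = 0.0                             # ~80% zeros, typical after ReLU
k = rng.standard_normal((3, 3))
dense = np.array([[np.sum(x[i:i+3, j:j+3] * k) for j in range(14)]
                  for i in range(14)])
print(np.allclose(dense, sparse_conv2d(x, k)))  # True
```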
Artificial Intelligence Technology
This open access book aims to give our readers a basic outline of today's research and technology developments on artificial intelligence (AI), help them to have a general understanding of this trend, and familiarize them with the current research hotspots, as well as part of the fundamental and common theories and methodologies that are widely accepted in AI research and application. This book is written in comprehensible and plain language, featuring clearly explained theories and concepts and extensive analysis and examples. Some of the traditional findings are skipped in narration on the premise of a relatively comprehensive introduction to the evolution of artificial intelligence technology. The book provides a detailed elaboration of the basic concepts of AI, machine learning, as well as other relevant topics, including deep learning, deep learning framework, Huawei MindSpore AI development framework, Huawei Atlas computing platform, Huawei AI open platform for smart terminals, and Huawei CLOUD Enterprise Intelligence application platform. As the world's leading provider of ICT (information and communication technology) infrastructure and smart terminals, Huawei's products range from digital data communication, cyber security, wireless technology, data storage, cloud computing, and smart computing to artificial intelligence.
FPGA Implementation for Real-Time Background Subtraction Based on Horprasert Model
Background subtraction is considered the first processing stage in video surveillance systems and consists of detecting moving objects in a scene captured by a static camera. It is an intensive task with a high computational cost. This work proposes a novel embedded architecture on FPGA which is able to extract the background in resource-limited environments and exhibits low degradation (produced by the hardware-friendly model modification). In addition, the original model is extended in order to detect shadows and improve the quality of the segmentation of moving objects. We have analyzed the resource consumption and performance on Spartan-3 Xilinx FPGAs and compared them with other works available in the literature, showing that the current architecture is a good trade-off in terms of accuracy, performance, and resource utilization. With less than 65% of the resources of a XC3SD3400 FPGA of the low-cost Spartan-3A family, the system achieves a frequency of 66.5 MHz, reaching 32.8 fps at a resolution of 1,024 × 1,024 pixels, with an estimated power consumption of 5.76 W.
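For readers unfamiliar with the Horprasert model, the sketch below gives a simplified software version of its brightness/chromaticity test; the thresholds and per-pixel statistics are illustrative placeholders, not the calibrated values of the FPGA design.

```python
import numpy as np

# Simplified Horprasert-style test: each observed pixel colour I is compared
# with the expected background colour E by factoring the difference into a
# brightness distortion alpha (scaling along E) and a chromaticity distortion
# CD (distance from the scaled E), then thresholding into four classes.

def classify(I, E, tau_cd=15.0, alpha_lo=0.6, alpha_hi=1.2):
    """I, E: (H, W, 3) float arrays; returns an (H, W) label map.
    0 = background, 1 = shadow, 2 = highlight, 3 = foreground.
    Thresholds are illustrative, not the paper's calibrated values."""
    norm_E = np.sum(E * E, axis=2) + 1e-6
    alpha = np.sum(I * E, axis=2) / norm_E                  # brightness distortion
    cd = np.linalg.norm(I - alpha[..., None] * E, axis=2)   # chromaticity distortion
    labels = np.full(I.shape[:2], 3, dtype=np.uint8)        # default: foreground
    low_cd = cd < tau_cd
    labels[low_cd & (alpha >= alpha_lo) & (alpha <= alpha_hi)] = 0
    labels[low_cd & (alpha < alpha_lo)] = 1
    labels[low_cd & (alpha > alpha_hi)] = 2
    return labels

# Toy usage: a darker copy of the background is mostly labelled as shadow.
rng = np.random.default_rng(0)
E = rng.uniform(50, 200, size=(4, 4, 3))
I = 0.5 * E + rng.normal(0, 1, size=E.shape)
print(classify(I, E))
```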
- …