
    BinaryConnect: Training Deep Neural Networks with binary weights during propagations

    Deep Neural Networks (DNN) have achieved state-of-the-art results in a wide range of tasks, with the best results obtained with large training sets and large models. In the past, GPUs enabled these breakthroughs because of their greater computational speed. In the future, faster computation at both training and test time is likely to be crucial for further progress and for consumer applications on low-power devices. As a result, there is much interest in research and development of dedicated hardware for Deep Learning (DL). Binary weights, i.e., weights which are constrained to only two possible values (e.g. -1 or 1), would bring great benefits to specialized DL hardware by replacing many multiply-accumulate operations with simple accumulations, as multipliers are the most space and power-hungry components of the digital implementation of neural networks. We introduce BinaryConnect, a method which consists of training a DNN with binary weights during the forward and backward propagations, while retaining the precision of the stored weights in which gradients are accumulated. Like other dropout schemes, we show that BinaryConnect acts as a regularizer and we obtain near state-of-the-art results with BinaryConnect on the permutation-invariant MNIST, CIFAR-10 and SVHN.
    Comment: Accepted at NIPS 2015, 9 pages, 3 figures
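    To make the training scheme concrete, below is a minimal NumPy sketch of a BinaryConnect-style update for a single linear layer: the weights are binarized for the forward and backward passes, while the gradient is accumulated in the real-valued stored weights. The layer shape, the learning rate, and the purely deterministic sign binarization are assumptions for illustration; this is a sketch of the idea, not the authors' implementation.

        import numpy as np

        rng = np.random.default_rng(0)
        W_real = rng.uniform(-1, 1, size=(784, 256)).astype(np.float32)  # stored real-valued weights
        lr = 0.01  # assumed learning rate

        def binarize(w):
            # Deterministic binarization: constrain weights to {-1, +1}.
            return np.where(w >= 0, 1.0, -1.0).astype(np.float32)

        def train_step(x, upstream_grad):
            """One update: x is (batch, 784), upstream_grad is dLoss/dOutput, (batch, 256)."""
            global W_real
            W_bin = binarize(W_real)           # binary weights used during propagation
            y = x @ W_bin                      # forward pass with binary weights
            grad_W = x.T @ upstream_grad       # backward pass uses the binary weights as well
            grad_x = upstream_grad @ W_bin.T   # gradient passed to the previous layer
            W_real -= lr * grad_W              # accumulate the gradient in the real-valued weights
            np.clip(W_real, -1.0, 1.0, out=W_real)  # keep stored weights bounded
            return y, grad_x

        x = rng.random((32, 784)).astype(np.float32)
        g = rng.random((32, 256)).astype(np.float32)
        y, gx = train_step(x, g)
        print(y.shape, gx.shape)  # (32, 256) (32, 784)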

    Training deep neural networks with low precision multiplications

    Multipliers are the most space and power-hungry arithmetic operators of the digital implementation of deep neural networks. We train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those formats, we assess the impact of the precision of the multiplications on the final error after training. We find that very low precision is sufficient not just for running trained networks but also for training them. For example, it is possible to train Maxout networks with 10-bit multiplications.
    Comment: 10 pages, 5 figures, Accepted as a workshop contribution at ICLR 2015
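    As a rough illustration of what low-precision multiplications mean here, the sketch below emulates a fixed-point matrix product in NumPy: both operands and each individual product are rounded to a 10-bit format (1 sign, 2 integer, 7 fractional bits, chosen to echo the 10-bit example above), while the accumulation of the products is kept at full precision. The exact bit split and the float-array emulation are assumptions for illustration; the paper's dynamic fixed point format additionally shares and adapts a scaling factor per group of values, as sketched after the thesis entry below.

        import numpy as np

        def to_fixed_point(x, integer_bits=2, fractional_bits=7):
            """Round to a signed fixed-point value: 1 sign bit, `integer_bits` integer
            and `fractional_bits` fractional bits (10 bits total with the defaults)."""
            scale = 2.0 ** fractional_bits
            lo = -(2.0 ** integer_bits)
            hi = (2.0 ** integer_bits) - 1.0 / scale
            return np.clip(np.round(x * scale) / scale, lo, hi)

        def low_precision_matmul(a, b):
            """Matrix product with low-precision operands and products,
            accumulated at full precision."""
            aq = to_fixed_point(a)                                   # (m, k)
            bq = to_fixed_point(b)                                   # (k, n)
            prods = to_fixed_point(aq[:, :, None] * bq[None, :, :])  # (m, k, n) low-precision products
            return prods.sum(axis=1)                                 # accumulate at full precision

        a = np.random.default_rng(0).standard_normal((4, 8)) * 0.5
        b = np.random.default_rng(1).standard_normal((8, 3)) * 0.5
        print(np.max(np.abs(low_precision_matmul(a, b) - a @ b)))    # small quantization error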

    VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing

    The hardware implementation of deep neural networks (DNNs) has recently received tremendous attention: many applications in fact require high-speed operations that suit a hardware implementation. However, numerous elements and complex interconnections are usually required, leading to a large area occupation and copious power consumption. Stochastic computing has shown promising results for low-power, area-efficient hardware implementations, even though existing stochastic algorithms require long streams that cause long latencies. In this paper, we propose an integer form of stochastic computation and introduce some elementary circuits. We then propose an efficient implementation of a DNN based on integral stochastic computing. The proposed architecture has been implemented on a Virtex7 FPGA, resulting in 45% and 62% average reductions in area and latency compared to the best reported architecture in the literature. We also synthesize the circuits in a 65 nm CMOS technology and we show that the proposed integral stochastic architecture results in up to a 21% reduction in energy consumption compared to the binary radix implementation at the same misclassification rate. Due to the fault-tolerant nature of stochastic architectures, we also consider a quasi-synchronous implementation which yields a 33% reduction in energy consumption w.r.t. the binary radix implementation without any compromise on performance.
    Comment: 11 pages, 12 figures
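    The circuits are the paper's contribution, but the representational idea can be checked numerically. In conventional stochastic computing a value in [0, 1] is carried by a Bernoulli bit stream; an integral stochastic stream, as in the sketch below, is the element-wise sum of several such bit streams, so it can carry values larger than 1 without lengthening the stream, and element-wise products of independent streams estimate products of the encoded values. The stream length and example values are arbitrary assumptions, and this sketch says nothing about the paper's actual hardware.

        import numpy as np

        rng = np.random.default_rng(0)
        LENGTH = 4096  # stream length (assumption); longer streams reduce estimation variance

        def binary_stream(p, length=LENGTH):
            """Conventional stochastic stream: each bit is 1 with probability p."""
            return (rng.random(length) < p).astype(np.int64)

        def integral_stream(value, m, length=LENGTH):
            """Integral stochastic stream: element-wise sum of m binary streams encoding value/m.
            Elements are integers in [0, m] whose mean is `value`."""
            return sum(binary_stream(value / m, length) for _ in range(m))

        a = integral_stream(1.6, m=2)   # 1.6 > 1, so it needs an integral (two-level) stream
        b = binary_stream(0.3)          # 0.3 fits in an ordinary stochastic stream
        print(np.mean(a * b))           # estimates 1.6 * 0.3 = 0.48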

    Reducing the precision and the number of multiplications needed to train a neural network (original title: Réduire la précision et le nombre des multiplications nécessaires à l'entraînement d'un réseau de neurones)

    Deep Neural Networks (DNNs) have achieved state-of-the-art results in a wide range of tasks, with the best results obtained with large training sets and large models. In the past, GPUs enabled these breakthroughs because of their greater computational speed. In the future, faster computation at both training and test time is likely to be crucial for further progress and for consumer applications on low-power devices. As a result, there is much interest in research and development of dedicated hardware for Deep Learning (DL). Computer hardware is mainly made out of memories and arithmetic operators. Multipliers are by far the most space and power-hungry arithmetic operators of the digital implementation of neural networks. In our first article, we train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those formats, we assess the impact of the precision of the multiplications on the final error after training. We find that very low precision is sufficient not just for running trained networks but also for training them. For example, it is possible to train Maxout networks with 10-bit multiplications.
    Binary weights, i.e., weights which are constrained to only two possible values (e.g. -1 or 1), would greatly reduce the number of multiplications required to train a DNN. In our second article, we introduce BinaryConnect, a method which consists of training a DNN with binary weights during the forward and backward propagations, while retaining the precision of the stored weights in which gradients are accumulated. Like other dropout schemes, we show that BinaryConnect acts as a regularizer and we obtain near state-of-the-art results with BinaryConnect on the permutation-invariant MNIST.
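    The thesis abstract mentions the dynamic fixed point format alongside floating point and fixed point. The sketch below illustrates the usual idea behind it: a whole group of values shares one position of the radix point, and that shared scaling is nudged according to how often values overflow the representable range. The 10-bit width, the 1% overflow target, and the update rule are assumptions for illustration, not the thesis' exact policy.

        import numpy as np

        def quantize_dynamic_fixed_point(x, frac_bits, total_bits=10, overflow_target=0.01):
            """Quantize a tensor with one shared radix-point position and adapt it.

            Returns the quantized tensor and the (possibly adjusted) number of
            fractional bits to use for this group of values at the next step.
            """
            scale = 2.0 ** frac_bits
            hi = (2.0 ** (total_bits - 1) - 1) / scale   # largest representable value
            lo = -(2.0 ** (total_bits - 1)) / scale      # smallest representable value
            overflow_rate = np.mean((x > hi) | (x < lo))
            if overflow_rate > overflow_target:
                frac_bits -= 1   # too many overflows: widen the integer part
            elif overflow_rate == 0.0:
                frac_bits += 1   # plenty of headroom: reclaim a fractional bit
            xq = np.clip(np.round(x * scale) / scale, lo, hi)
            return xq, frac_bits

        grads = np.random.default_rng(0).standard_normal(10_000) * 0.05
        q, next_frac_bits = quantize_dynamic_fixed_point(grads, frac_bits=7)
        print(next_frac_bits, np.max(np.abs(q - grads)))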

    A highly parameterizable framework for Conditional Restricted Boltzmann Machine based workloads accelerated with FPGAs and OpenCL

    © 2020 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
    The Conditional Restricted Boltzmann Machine (CRBM) is a promising candidate for multidimensional system modeling that can learn a probability distribution over a set of data. It is a specific type of artificial neural network with one input (visible) and one output (hidden) layer. Recently published works demonstrate that the CRBM is a suitable mechanism for modeling multidimensional time series such as human motion, workload characterization, and city traffic analysis. The process of learning and inference in these systems relies on linear algebra functions like matrix-matrix multiplication, and for larger data sets they are very compute-intensive. In this paper, we present a configurable framework for CRBM-based workloads for arbitrarily large models. We show how to accelerate the learning process of the CRBM with FPGAs and OpenCL, and we conduct an extensive scalability study for different model sizes and system configurations. We show significant improvement in performance/Watt for large models and batch sizes (from 1.51x up to 5.71x depending on the host configuration) when we use FPGA and OpenCL for the acceleration, and limited benefits for small models compared to the state-of-the-art CPU solution.
    This work was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595); the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya, Spain under contract 2014SGR1051; the ICREA, Spain Academia program; the BSC-CNS Severo Ochoa program, Spain (SEV-2015-0493); and Intel Corporation, United States.
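    For context on why the workload is dominated by matrix-matrix products, below is a minimal NumPy sketch of the hidden-layer conditional of a CRBM in the common formulation where hidden units are conditioned on the current visible vector and a window of past visible frames. The dimensions, variable names, and the dense layout are assumptions for illustration and are unrelated to the framework's actual API.

        import numpy as np

        rng = np.random.default_rng(0)

        # Assumed dimensions for illustration.
        n_visible, n_hidden, n_history, batch = 64, 128, 6, 32

        W = rng.standard_normal((n_visible, n_hidden)) * 0.01               # visible-to-hidden weights
        B = rng.standard_normal((n_visible * n_history, n_hidden)) * 0.01   # history-to-hidden weights
        c = np.zeros(n_hidden)                                              # hidden biases

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def hidden_probabilities(v, history):
            """P(h = 1 | v, history) for a CRBM-style model.

            v:       current visible frames, shape (batch, n_visible)
            history: concatenated past frames, shape (batch, n_visible * n_history)

            Both terms are dense matrix-matrix products, which is why learning
            and inference are dominated by GEMM-like kernels.
            """
            return sigmoid(v @ W + history @ B + c)

        v = rng.random((batch, n_visible))
        hist = rng.random((batch, n_visible * n_history))
        print(hidden_probabilities(v, hist).shape)  # (32, 128)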