Wireless Chip-Scale Communications for Neural Network Accelerators
Wireless on-chip communications have been proposed as a complement to conventional Network-on-Chip (NoC) paradigms in manycore processors. In massively parallel architectures, the fast broadcast and reconfigurability capabilities of the wireless plane open the door to new scalable and adaptive architectures with significant impact on a plethora of fields. This thesis aims to explore such impact in the all-pervasive field of AI accelerators, designing and evaluating new accelerators augmented with wireless on-chip communication.
The last decade has witnessed explosive growth in the use of Deep Neural Networks in fields such as computer vision, natural language processing, medicine, or economics. Their accuracy across so many relevant and different applications shows the enormous potential of this disruptive technology. However, this unprecedented performance is closely tied to the fact that new designs contain much deeper and bigger layer sets, forcing them to manage millions, and in some cases even billions, of parameters. This comes at a high computational and communication cost at the processor level, which has prompted the development of new hardware aimed at handling such large computing expense more efficiently, the so-called Deep Neural Network (DNN) accelerators. This work explores the potential of enhancing the performance of these accelerators by introducing Wireless Networks-on-Chip in their design, a novel interconnect paradigm proposed by the research community to overcome some of the communication challenges that manycore systems face. Specifically, both on-chip and off-chip wireless interconnect implementations have been studied and evaluated.
In the off-chip case, a theoretical improvement of 13X in the runtime has been achieved, but at the expense of some area and power overheads.
FLAT: An Optimized Dataflow for Mitigating Attention Performance Bottlenecks
Attention mechanisms form the backbone of state-of-the-art machine learning
models for a variety of tasks. Deploying them on deep neural network (DNN)
accelerators, however, is prohibitively challenging especially under long
sequences, as this work identifies. This is due to operators in attention
layers exhibiting limited reuse opportunities and quadratic growth in memory
footprint, leading to severe memory-boundedness. To address this, we introduce
a new attention-tailored dataflow, termed FLAT, which identifies fusion
opportunities within the attention layer, and implements an on-chip
memory-aware interleaved execution and tiling mechanism. FLAT increases the
effective memory bandwidth by efficiently utilizing the high-bandwidth,
low-capacity on-chip buffer and thus achieves better run time and compute
resource utilization. In our evaluation, FLAT achieves 1.94x and 1.76x speedup
and 49% and 42% energy reduction compared to baseline execution over
state-of-the-art edge and cloud accelerators.
Accelerating Fully Connected Neural Network on Optical Network-on-Chip (ONoC)
Fully Connected Neural Networks (FCNNs) are a class of Artificial Neural
Networks widely used in computer science and engineering, but their training
process can take a long time with large datasets on existing many-core systems.
Optical Network-on-Chip (ONoC), an emerging chip-scale optical interconnection
technology, has great potential to accelerate the training of FCNN with low
transmission delay, low power consumption, and high throughput. However,
existing methods based on Electrical Network-on-Chip (ENoC) cannot fit in ONoC
because of the unique properties of ONoC. In this paper, we propose a
fine-grained parallel computing model for accelerating FCNN training on ONoC
and derive the optimal number of cores for each execution stage with the
objective of minimizing the total amount of time to complete one epoch of FCNN
training. To allocate the optimal number of cores for each execution stage, we
present three mapping strategies and compare their advantages and disadvantages
in terms of hotspot level, memory requirement, and state transitions.
Simulation results show that the average prediction error for the optimal
number of cores in NN benchmarks is within 2.3%. We further carry out extensive
simulations which demonstrate that FCNN training time can be reduced by 22.28%
and 4.91% on average using our proposed scheme, compared with traditional
parallel computing methods that either allocate a fixed number of cores or
allocate as many cores as possible, respectively. Compared with ENoC,
simulation results show that under batch sizes of 64 and 128, ONoC reduces
training time by 21.02% and 12.95% and saves 47.85% and 39.27% of energy on
average, respectively.
Comment: 14 pages, 10 figures. This paper is under the second review of IEEE
Transactions on Computers.
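Why an optimal core count per stage exists at all can be seen in a toy sketch. The cost model below is an assumption made for illustration and is not the model derived in the paper: compute time shrinks as `work / cores`, while a communication term grows as `comm_per_core * cores`. The function names and parameters are hypothetical.

```python
def stage_time(work, comm_per_core, cores):
    """Assumed toy model: parallel compute time plus a communication cost
    that grows linearly with the number of cores used by the stage."""
    return work / cores + comm_per_core * cores

def best_core_count(work, comm_per_core, max_cores):
    """Exhaustively search 1..max_cores for the allocation with minimal time."""
    return min(range(1, max_cores + 1),
               key=lambda m: stage_time(work, comm_per_core, m))

def epoch_plan(stages, max_cores):
    """Pick a core count per stage and return (allocations, total epoch time).
    `stages` is a list of (work, comm_per_core) pairs, one per execution stage."""
    cores = [best_core_count(w, c, max_cores) for w, c in stages]
    total = sum(stage_time(w, c, m) for (w, c), m in zip(stages, cores))
    return cores, total
```

Under this assumed model, allocating as many cores as possible is not always best: for a stage with `work=100` and `comm_per_core=1`, the search settles on 10 cores even when 64 are available.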
Computing graph neural networks: A survey from algorithms to accelerators
Graph Neural Networks (GNNs) have exploded onto the machine learning scene in recent years owing to their capability to model and learn from graph-structured data. Such an ability has strong implications in a wide variety of fields whose data are inherently relational, for which conventional neural networks do not perform well. Indeed, as recent reviews can attest, research in the area of GNNs has grown rapidly and has led to the development of a variety of GNN algorithm variants as well as to the exploration of ground-breaking applications in chemistry, neurology, electronics, or communication networks, among others. At the current stage of research, however, the efficient processing of GNNs is still an open challenge for several reasons. Besides their novelty, GNNs are hard to compute due to their dependence on the input graph, their combination of dense and very sparse operations, or the need to scale to huge graphs in some applications. In this context, this article aims to make two main contributions. On the one hand, a review of the field of GNNs is presented from the perspective of computing. This includes a brief tutorial on the GNN fundamentals, an overview of the evolution of the field in the last decade, and a summary of operations carried out in the multiple phases of different GNN algorithm variants. On the other hand, an in-depth analysis of current software and hardware acceleration schemes is provided, from which a hardware-software, graph-aware, and communication-centric vision for GNN accelerators is distilled.
This work is possible thanks to funding from the European Union’s Horizon 2020 research and innovation programme under Grant No. 863337 (WiPLASH project) and the Spanish Ministry of Economy and Competitiveness under contract TEC2017-90034-C2-1-R (ALLIANCE project) that receives funding from FEDER.
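The "combination of dense and very sparse operations" mentioned in this abstract can be made concrete with a minimal sketch of one generic message-passing layer. It is an assumed mean-aggregation variant and not a specific algorithm from the survey: the aggregation step walks an adjacency list (sparse, graph-dependent), and the transform step is an ordinary matrix product with ReLU (dense). The function name and data layout are hypothetical.

```python
def gnn_layer(features, neighbors, weight):
    """One message-passing layer on plain Python lists.

    features:  list of per-node feature vectors
    neighbors: dict mapping a node index to the list of its neighbor indices
    weight:    in_dim x out_dim matrix given as a list of rows
    """
    # Sparse phase: each node averages its own vector with its neighbors'.
    # The access pattern depends entirely on the input graph.
    aggregated = []
    for node, feat in enumerate(features):
        group = [feat] + [features[n] for n in neighbors.get(node, [])]
        aggregated.append([sum(col) / len(group) for col in zip(*group)])

    # Dense phase: the same linear transform plus ReLU for every node.
    columns = list(zip(*weight))
    return [[max(0.0, sum(x * w for x, w in zip(vec, col))) for col in columns]
            for vec in aggregated]
```

The two phases stress hardware differently: the first is dominated by irregular memory accesses, the second by regular multiply-accumulate work.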
Design Space Exploration of Sparsity-Aware Application-Specific Spiking Neural Network Accelerators
Spiking Neural Networks (SNNs) offer a promising alternative to Artificial
Neural Networks (ANNs) for deep learning applications, particularly in
resource-constrained systems. This is largely due to their inherent sparsity,
influenced by factors such as the input dataset, the length of the spike train,
and the network topology. While a few prior works have demonstrated the
advantages of incorporating sparsity into the hardware design, especially in
terms of reducing energy consumption, the impact on hardware resources has not
yet been explored. This is where design space exploration (DSE) becomes
crucial, as it allows for the optimization of hardware performance by tailoring
both the hardware and model parameters to suit specific application needs.
However, DSE can be extremely challenging given the potentially large design
space and the interplay of hardware architecture design choices and
application-specific model parameters.
In this paper, we propose a flexible hardware design that leverages the
sparsity of SNNs to identify highly efficient, application-specific accelerator
designs. We develop a high-level, cycle-accurate simulation framework for this
hardware and demonstrate the framework's benefits in enabling detailed and
fine-grained exploration of SNN design choices, such as the layer-wise
logical-to-hardware ratio (LHR). Our experimental results show that our design
can (i) achieve up to reduction in hardware resources and (ii) deliver a
speed increase of up to , while requiring fewer hardware
resources compared to sparsity-oblivious designs. We further showcase the
robustness of our framework by varying spike train lengths with different
neuron population sizes to find the optimal trade-off points between accuracy
and hardware latency.
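The benefit of exploiting spike sparsity can be sketched with an event-driven layer that only touches the weights of inputs that actually spiked in a time step. This is an illustrative model under stated assumptions (integrate-and-fire neurons with reset to zero, a hypothetical `run_layer` function, and operation counting as a stand-in for hardware cost), not the accelerator or simulator proposed in the paper.

```python
def run_layer(spike_trains, weights, threshold=1.0):
    """Simulate one fully connected spiking layer.

    spike_trains[t] lists the input indices that spiked at time step t.
    weights[i][j] connects input i to output neuron j.
    Returns (output spikes per step, sparse op count, dense op count).
    """
    n_in, n_out = len(weights), len(weights[0])
    potential = [0.0] * n_out
    out_spikes, sparse_ops = [], 0
    for active in spike_trains:
        # Event-driven accumulation: silent inputs cost nothing.
        for i in active:
            for j in range(n_out):
                potential[j] += weights[i][j]
            sparse_ops += n_out
        # Fire and reset every neuron that reached the threshold.
        fired = [j for j in range(n_out) if potential[j] >= threshold]
        for j in fired:
            potential[j] = 0.0
        out_spikes.append(fired)
    # A sparsity-oblivious design would visit every synapse at every step.
    dense_ops = len(spike_trains) * n_in * n_out
    return out_spikes, sparse_ops, dense_ops
```

The gap between the two operation counts grows with the fraction of silent inputs, which in turn depends on the dataset, the spike train length, and the network topology, the same factors this abstract lists.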