MultiTASC: A Multi-Tenancy-Aware Scheduler for Cascaded DNN Inference at the Consumer Edge
Cascade systems comprise a two-model sequence, with a lightweight model
processing all samples and a heavier, higher-accuracy model conditionally
refining harder samples to improve accuracy. By placing the light model on the
device side and the heavy model on a server, model cascades constitute a widely
used distributed inference approach. With the rapid expansion of intelligent
indoor environments, such as smart homes, the new setting of Multi-Device
Cascade is emerging where multiple and diverse devices are to simultaneously
use a shared heavy model on the same server, typically located within or close
to the consumer environment. This work presents MultiTASC, a
multi-tenancy-aware scheduler that adaptively controls the forwarding decision
functions of the devices in order to maximize the system throughput, while
sustaining high accuracy and low latency. By explicitly considering device
heterogeneity, our scheduler improves the latency service-level objective (SLO)
satisfaction rate by 20-25 percentage points (pp) over state-of-the-art cascade
methods in highly heterogeneous setups, while serving over 40 devices,
showcasing its scalability.
Comment: Accepted at the 28th IEEE Symposium on Computers and Communications
(ISCC), 202
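The forwarding decision in a two-model cascade can be illustrated with a minimal sketch. It assumes the device forwards a sample whenever the light model's top softmax probability falls below a threshold. The names `cascade_predict` and `adjust_threshold`, the `step` parameter, and the latency-driven update rule are hypothetical illustrations, not the MultiTASC algorithm described in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(light_logits, heavy_model, sample, threshold):
    """Return (label, used_heavy) for one sample.

    The light model's answer is kept when its confidence reaches the
    threshold; otherwise the sample is forwarded to the heavy model.
    """
    probs = softmax(light_logits)
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), False
    return heavy_model(sample), True


def adjust_threshold(threshold, observed_latency, slo, step=0.05):
    """Hypothetical controller, assumed for illustration only.

    Lowering the threshold makes the device accept more of its own
    predictions, so fewer samples reach the shared server.
    """
    if observed_latency > slo:
        return max(0.0, threshold - step)
    return min(1.0, threshold + step)
```

In a multi-device setting, one such threshold per device would let a scheduler throttle each device separately, which is one plausible way to account for device heterogeneity.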
Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices
Deep learning (DL) is characterised by its dynamic nature, with new deep
neural network (DNN) architectures and approaches emerging every few years,
driving the field's advancement. At the same time, the ever-increasing use of
mobile devices (MDs) has resulted in a surge of DNN-based mobile applications.
Although traditional architectures, like CNNs and RNNs, have been successfully
integrated into MDs, this is not the case for Transformers, a relatively new
model family that has achieved new levels of accuracy across AI tasks, but
poses significant computational challenges. In this work, we aim to make steps
towards bridging this gap by examining the current state of Transformers'
on-device execution. To this end, we construct a benchmark of representative
models and thoroughly evaluate their performance across MDs with different
computational capabilities. Our experimental results show that Transformers are
not accelerator-friendly and indicate the need for software and hardware
optimisations to achieve efficient deployment.
Comment: Accepted at the 3rd IEEE International Workshop on Distributed
Intelligent Systems (DistInSys), 202
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
Approximate FPGA-based LSTMs under Computation Time Constraints
Recurrent Neural Networks and in particular Long Short-Term Memory (LSTM)
networks have demonstrated state-of-the-art accuracy in several emerging
Artificial Intelligence tasks. However, the models are becoming increasingly
demanding in terms of computational and memory load. Emerging latency-sensitive
applications including mobile robots and autonomous vehicles often operate
under stringent computation time constraints. In this paper, we address the
challenge of deploying computationally demanding LSTMs at a constrained time
budget by introducing an approximate computing scheme that combines iterative
low-rank compression and pruning, along with a novel FPGA-based LSTM
architecture. Combined in an end-to-end framework, the approximation method's
parameters are optimised and the architecture is configured to address the
problem of high-performance LSTM execution in time-constrained applications.
Quantitative evaluation on a real-life image captioning application indicates
that the proposed methods required up to 6.5x less time to achieve the same
application-level accuracy compared to a baseline method, while achieving an
average of 25x higher accuracy under the same computation time constraints.
Comment: Accepted at the 14th International Symposium in Applied
Reconfigurable Computing (ARC) 201
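As a rough illustration of the two approximation ingredients named in the abstract, the sketch below computes a rank-1 approximation of a weight matrix by power iteration and then applies magnitude pruning to the resulting factors. This is a generic textbook construction under stated assumptions, not the paper's iterative scheme. The function names `rank1_approx` and `prune` and the `keep` parameter are hypothetical, and the matrix is assumed to be non-zero with a dominant singular vector that is not orthogonal to the all-ones starting vector.

```python
import math


def rank1_approx(W, iters=50):
    """Rank-1 factors (u, v) with W ~ u v^T, via power iteration on W^T W.

    Assumes W is a non-empty, non-zero list of equal-length rows.
    v has unit norm; u absorbs the singular value.
    """
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return u, v


def prune(vec, keep):
    """Magnitude pruning: zero all but the `keep` largest-magnitude entries."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    kept = set(order[:keep])
    return [x if i in kept else 0.0 for i, x in enumerate(vec)]


# Example: a matrix that is exactly rank 1, so the factors reproduce it.
W = [[2.0, 4.0], [1.0, 2.0]]
u, v = rank1_approx(W)
W_hat = [[ui * vj for vj in v] for ui in u]
```

Storing `u` and `v` takes `rows + cols` values instead of `rows * cols`, and pruning the factors reduces the number of multiplications further; both trade accuracy for computation time.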
A methodological approach to BISDN signalling performance
Sophisticated signalling protocols are required to properly handle the complex multimedia, multiparty services supported by the forthcoming BISDN. The implementation feasibility of these protocols should be evaluated during their design phase, so that possible performance bottlenecks are identified and removed. In this paper we present a methodology for evaluating the performance of BISDN signalling systems under design. New performance parameters are introduced and their network-dependent values are extracted through a message flow model which has the capability to describe the impact of call and bearer control separation on the signalling performance. Signalling protocols are modelled through a modular decomposition of the seven OSI layers, including the service user, into three submodels. The workload model is user-descriptive in the sense that it does not approximate the direct input traffic required for evaluating the performance of a layer protocol; instead, through a multi-level approach, it describes the actual implications of user signalling activity for the general signalling traffic. The signalling protocol model is derived from the global functional model of the signalling protocols and information flows using a network of queues incorporating synchronization and dependency functions. The same queueing approach is followed for the signalling transfer network, which is used to define processing speed and signalling bandwidth requirements and to identify possible performance bottlenecks stemming from the realization of the related protocols.
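To make the idea of locating bottlenecks with a queueing model concrete, here is a minimal sketch that treats each processing stage as an independent M/M/1 queue. This is a deliberate simplification chosen for illustration: the abstract describes a network of queues with synchronization and dependency functions, which this sketch does not model. The stage names and rates in the example are invented.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Utilisation, mean time in system, and mean number in system
    for a stable M/M/1 queue (arrival_rate < service_rate)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = arrival_rate / service_rate
    mean_delay = 1.0 / (service_rate - arrival_rate)
    mean_in_system = rho / (1.0 - rho)
    return rho, mean_delay, mean_in_system


def bottleneck(stages):
    """Name of the stage with the highest utilisation.

    `stages` maps a stage name to (arrival_rate, service_rate).
    """
    return max(stages, key=lambda name: stages[name][0] / stages[name][1])


# Hypothetical stages, rates in messages per second.
stages = {
    "call_control": (80.0, 100.0),
    "bearer_control": (80.0, 200.0),
    "signalling_link": (80.0, 90.0),
}
worst = bottleneck(stages)
```

The stage with the highest utilisation saturates first as the signalling load grows, so it is the natural candidate for additional processing speed or bandwidth.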
Preprocessing algorithm for source localisation in a multipath environment
Several methods have been developed which allow the estimation of the location of an existing source with considerable accuracy in the absence of multipaths. However, if, in addition to the Line-of-Sight (LOS) path, non-LOS (NLOS) paths are also present, then all existing localisation algorithms dramatically fail to estimate the location of the source. In this paper, a passive array processing algorithm is proposed, which, if used prior to a localisation approach, suppresses all the multipath contributions in the received signal except for that of the LOS path. The performance of the proposed algorithm is evaluated through computer simulation studies.