Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence
As part of our research on very-high-performance parallel architectures, we have been investigating machine architectures specially adapted to the highly efficient implementation of artificial intelligence (AI) software. In the course of our research we designed DADO, a highly parallel, VLSI-based, tree-structured machine, and implemented a high-speed algorithm for production systems on a simulator for DADO. Subsequent research has convinced us that DADO can support many other AI applications, including the very rapid execution of PROLOG programs and a large share of the symbolic processing typical of contemporary knowledge-based systems. In this brief report, we outline the hardware design of a moderate-size DADO prototype, comprising 1023 processing elements, which is currently under construction at Columbia University. We then sketch the software base being implemented on a small 15-processing-element prototype system, including several applications written in PPL/M, a high-level language designed for specifying parallel computations on DADO.
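The broadcast-then-report cycle of a tree-structured machine like DADO can be sketched in plain Python. This is a minimal simulation, not DADO's actual instruction set: the PE count, the match predicate, and the pairwise tree reduction are illustrative assumptions.

```python
# Minimal sketch of a tree machine's broadcast/match/report cycle:
# a condition is broadcast to all processing elements (PEs), each PE
# matches against its local item in parallel, and winners are reported
# up a log-depth binary tree.

def broadcast_match_report(items, predicate):
    """One PE per item: broadcast `predicate`, match locally, report up."""
    # "Broadcast": every PE receives the same condition.
    local = [predicate(item) for item in items]      # parallel match phase
    # "Report": fold pairwise to mirror a binary-tree reduction.
    level = [(i, m) for i, m in enumerate(local)]
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level), 2):
            pair = level[j:j + 2]
            # each internal node forwards a matching child, if any
            winner = next((p for p in pair if p[1]), pair[0])
            nxt.append(winner)
        level = nxt
    return level[0] if level[0][1] else None

pes = list(range(8))  # a tiny 8-PE tree for illustration
print(broadcast_match_report(pes, lambda x: x == 5))  # -> (5, True)
```

The reduction takes a number of rounds logarithmic in the PE count, which is the appeal of the tree topology for the match/select/act cycle of production systems.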
An Efficient Hardware Implementation of the Tate Pairing in Characteristic Three
DL systems with bilinear structure have recently become an important basis for cryptographic protocols such as identity-based encryption (IBE). Since the main computational task is the evaluation of bilinear pairings over elliptic curves, known to be prohibitively expensive, efficient implementations are required to render them applicable in real-life scenarios. We present an efficient accelerator for computing the Tate pairing in characteristic 3, using the modified Duursma-Lee algorithm. Our accelerator shows that it is possible to improve the area-time product by 12 times on FPGA, compared to estimated values from one of the best known hardware architectures [6] implemented on the same type of FPGA. The computation time is also improved by up to 16 times compared to software implementations reported in [17]. In addition, we present the results of an ASIC implementation of the algorithm, the first reported to date.
Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices
Deep learning (DL) is characterised by its dynamic nature, with new deep
neural network (DNN) architectures and approaches emerging every few years,
driving the field's advancement. At the same time, the ever-increasing use of
mobile devices (MDs) has resulted in a surge of DNN-based mobile applications.
Although traditional architectures, like CNNs and RNNs, have been successfully
integrated into MDs, this is not the case for Transformers, a relatively new
model family that has achieved new levels of accuracy across AI tasks, but
poses significant computational challenges. In this work, we aim to make steps
towards bridging this gap by examining the current state of Transformers'
on-device execution. To this end, we construct a benchmark of representative
models and thoroughly evaluate their performance across MDs with different
computational capabilities. Our experimental results show that Transformers are
not accelerator-friendly and indicate the need for software and hardware
optimisations to achieve efficient deployment.
Comment: Accepted at the 3rd IEEE International Workshop on Distributed Intelligent Systems (DistInSys), 202
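The measurement methodology such on-device benchmarks rely on can be sketched as follows: warm-up runs to stabilise caches and accelerator state, then timed runs summarised by the median. The `toy_model` workload is a stand-in assumption, not an actual Transformer; on a device you would call the deployed DNN instead.

```python
# Minimal latency-benchmarking harness: warm-up iterations followed by
# timed iterations, reporting the median single-inference latency.
import time
import statistics

def benchmark(model, inp, warmup=5, runs=20):
    """Return the median latency of model(inp) in milliseconds."""
    for _ in range(warmup):          # warm caches / JIT / accelerator state
        model(inp)
    latencies = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model(inp)
        latencies.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(latencies)

# stand-in "model": a matmul-like nested loop to burn some cycles
def toy_model(x):
    return sum(i * j for i in range(x) for j in range(x))

print(f"median latency: {benchmark(toy_model, 100):.3f} ms")
```

The median (rather than the mean) is the usual choice here because mobile measurements are skewed by thermal throttling and scheduler noise.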
Complexity Analysis of Reed-Solomon Decoding over GF(2^m) Without Using Syndromes
For the majority of the applications of Reed-Solomon (RS) codes, hard
decision decoding is based on syndromes. Recently, there has been renewed
interest in decoding RS codes without using syndromes. In this paper, we
investigate the complexity of syndromeless decoding for RS codes, and compare
it to that of syndrome-based decoding. Aiming to provide guidelines to
practical applications, our complexity analysis differs in several aspects from
existing asymptotic complexity analysis, which is typically based on
multiplicative fast Fourier transform (FFT) techniques and is usually in big O
notation. First, we focus on RS codes over characteristic-2 fields, over which
some multiplicative FFT techniques are not applicable. Secondly, due to
moderate block lengths of RS codes in practice, our analysis is complete since
all terms in the complexities are accounted for. Finally, in addition to fast
implementation using additive FFT techniques, we also consider direct
implementation, which is still relevant for RS codes with moderate lengths.
Comparing the complexities of both syndromeless and syndrome-based decoding
algorithms based on direct and fast implementations, we show that syndromeless
decoding algorithms have higher complexities than syndrome-based ones for high
rate RS codes regardless of the implementation. Both errors-only and
errors-and-erasures decoding are considered in this paper. We also derive
tighter bounds on the complexities of fast polynomial multiplications based on
Cantor's approach and the fast extended Euclidean algorithm.
Comment: 11 pages, submitted to EURASIP Journal on Wireless Communications and Networking
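The first step of the syndrome-based decoding discussed above can be sketched for a toy RS code over a characteristic-2 field: evaluate the received word at consecutive powers of a primitive element alpha. The field GF(2^4) with primitive polynomial x^4 + x + 1 and the length-15 words are illustrative assumptions; an all-zero syndrome vector means the received word is a codeword.

```python
# Minimal sketch of syndrome computation for an RS code over GF(2^4).

PRIM = 0b10011  # x^4 + x + 1, primitive over GF(2)

def gf_mul(a, b):
    """Carry-less multiply in GF(2^4), reducing by PRIM on overflow."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def syndromes(received, num, alpha=2):
    """S_j = r(alpha^j) for j = 1..num; `received` is low-degree-first."""
    out = []
    for j in range(1, num + 1):
        s, x = 0, gf_pow(alpha, j)
        for c in reversed(received):     # Horner evaluation of r at alpha^j
            s = gf_mul(s, x) ^ c
        out.append(s)
    return out

zero_word = [0] * 15
print(syndromes(zero_word, 4))       # -> [0, 0, 0, 0]
corrupted = zero_word[:]
corrupted[3] = 1                     # single symbol error
print(any(syndromes(corrupted, 4)))  # -> True
```

Syndromeless decoding skips this evaluation entirely and works on the received polynomial directly, which is precisely why its complexity trades off differently against the code rate.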
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The task graph of the factorization step is made available to the two
runtimes, providing them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler and of the PaRSEC and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.
Comment: Heterogeneity in Computing Workshop (2014)
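The central idea of handing a factorization's task graph to a generic runtime can be sketched with Python's standard-library topological sorter standing in for the runtime's scheduler. The tiny four-task DAG and task names are illustrative assumptions; PaStiX's real graph covers the panels of the sparse factorization, and the runtime additionally decides CPU/GPU placement.

```python
# Minimal sketch: tasks declare their dependencies, and the "runtime"
# is free to execute them in any dependency-respecting order.
from graphlib import TopologicalSorter

def run_task_graph(deps, actions):
    """Run actions[t] for each task t in a valid topological order."""
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        actions[task]()
    return order

log = []
deps = {                         # task -> set of tasks it depends on
    "factor_panel": set(),
    "update_A": {"factor_panel"},
    "update_B": {"factor_panel"},
    "solve": {"update_A", "update_B"},
}
actions = {t: (lambda t=t: log.append(t)) for t in deps}
order = run_task_graph(deps, actions)
print(order[0], order[-1])       # -> factor_panel solve
```

Because only the dependency structure is fixed, a runtime like PaRSEC or StarPU can reorder the independent updates, run them on different devices, or overlap them with communication, without any change to the solver's task definitions.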