Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence
As part of our research on very-high-performance parallel architectures, we have been investigating machine architectures specially adapted to the highly efficient implementation of artificial intelligence (AI) software. In the course of our research we designed DADO, a highly parallel, VLSI-based, tree-structured machine, and implemented a high-speed algorithm for production systems on a simulator for DADO. Subsequent research has convinced us that DADO can support many other AI applications, including the very rapid execution of PROLOG programs and a large share of the symbolic processing typical of contemporary knowledge-based systems. In this brief report, we outline the hardware design of a moderate-size DADO prototype, comprising 1023 processing elements, which is currently under construction at Columbia University. We then sketch the software base being implemented on a small 15-processing-element prototype system, including several applications written in PPL/M, a high-level language designed for specifying parallel computations on DADO.
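The broadcast-then-report cycle of a tree-structured machine like DADO can be sketched in plain Python. This is a minimal simulation, not DADO's actual instruction set: the PE count, the match predicate, and the pairwise tree reduction are illustrative assumptions.

```python
# Minimal sketch of a tree machine's broadcast/match/report cycle:
# a condition is broadcast to all processing elements (PEs), each PE
# matches against its local item in parallel, and winners are reported
# up a log-depth binary tree.

def broadcast_match_report(items, predicate):
    """One PE per item: broadcast `predicate`, match locally, report up."""
    # "Broadcast": every PE receives the same condition.
    local = [predicate(item) for item in items]      # parallel match phase
    # "Report": fold pairwise to mirror a binary-tree reduction.
    level = [(i, m) for i, m in enumerate(local)]
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level), 2):
            pair = level[j:j + 2]
            # each internal node forwards a matching child, if any
            winner = next((p for p in pair if p[1]), pair[0])
            nxt.append(winner)
        level = nxt
    return level[0] if level[0][1] else None

pes = list(range(8))  # a tiny 8-PE tree for illustration
print(broadcast_match_report(pes, lambda x: x == 5))  # -> (5, True)
```

The reduction takes a number of rounds logarithmic in the PE count, which is the appeal of the tree topology for the match/select/act cycle of production systems.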
An Efficient Hardware Implementation of the Tate Pairing in Characteristic Three
DL systems with bilinear structure have recently become an important basis for cryptographic protocols such as identity-based encryption (IBE). Since the main computational task is the evaluation of bilinear pairings over elliptic curves, known to be prohibitively expensive, efficient implementations are required to render them applicable in real-life scenarios. We present an efficient accelerator for computing the Tate pairing in characteristic 3, using the modified Duursma-Lee algorithm. Our accelerator shows that it is possible to improve the area-time product by 12 times on FPGA, compared to estimated values from one of the best known hardware architectures [6] implemented on the same type of FPGA. The computation time is also improved by up to 16 times compared to software implementations reported in [17]. In addition, we present the results of an ASIC implementation of the algorithm, the first reported to date.
Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices
Deep learning (DL) is characterised by its dynamic nature, with new deep
neural network (DNN) architectures and approaches emerging every few years,
driving the field's advancement. At the same time, the ever-increasing use of
mobile devices (MDs) has resulted in a surge of DNN-based mobile applications.
Although traditional architectures, like CNNs and RNNs, have been successfully
integrated into MDs, this is not the case for Transformers, a relatively new
model family that has achieved new levels of accuracy across AI tasks, but
poses significant computational challenges. In this work, we aim to make steps
towards bridging this gap by examining the current state of Transformers'
on-device execution. To this end, we construct a benchmark of representative
models and thoroughly evaluate their performance across MDs with different
computational capabilities. Our experimental results show that Transformers are
not accelerator-friendly and indicate the need for software and hardware
optimisations to achieve efficient deployment.
Comment: Accepted at the 3rd IEEE International Workshop on Distributed Intelligent Systems (DistInSys), 202
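The measurement methodology such on-device benchmarks rely on can be sketched as follows: warm-up runs to stabilise caches and accelerator state, then timed runs summarised by the median. The `toy_model` workload is a stand-in assumption, not an actual Transformer; on a device you would call the deployed DNN instead.

```python
# Minimal latency-benchmarking harness: warm-up iterations followed by
# timed iterations, reporting the median single-inference latency.
import time
import statistics

def benchmark(model, inp, warmup=5, runs=20):
    """Return the median latency of model(inp) in milliseconds."""
    for _ in range(warmup):          # warm caches / JIT / accelerator state
        model(inp)
    latencies = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model(inp)
        latencies.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(latencies)

# stand-in "model": a matmul-like nested loop to burn some cycles
def toy_model(x):
    return sum(i * j for i in range(x) for j in range(x))

print(f"median latency: {benchmark(toy_model, 100):.3f} ms")
```

The median (rather than the mean) is the usual choice here because mobile measurements are skewed by thermal throttling and scheduler noise.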
Complexity Analysis of Reed-Solomon Decoding over GF(2^m) Without Using Syndromes
For the majority of the applications of Reed-Solomon (RS) codes, hard
decision decoding is based on syndromes. Recently, there has been renewed
interest in decoding RS codes without using syndromes. In this paper, we
investigate the complexity of syndromeless decoding for RS codes, and compare
it to that of syndrome-based decoding. Aiming to provide guidelines to
practical applications, our complexity analysis differs in several aspects from
existing asymptotic complexity analysis, which is typically based on
multiplicative fast Fourier transform (FFT) techniques and is usually in big O
notation. First, we focus on RS codes over characteristic-2 fields, over which
some multiplicative FFT techniques are not applicable. Secondly, due to
moderate block lengths of RS codes in practice, our analysis is complete since
all terms in the complexities are accounted for. Finally, in addition to fast
implementation using additive FFT techniques, we also consider direct
implementation, which is still relevant for RS codes with moderate lengths.
Comparing the complexities of both syndromeless and syndrome-based decoding
algorithms based on direct and fast implementations, we show that syndromeless
decoding algorithms have higher complexities than syndrome-based ones for high
rate RS codes regardless of the implementation. Both errors-only and
errors-and-erasures decoding are considered in this paper. We also derive
tighter bounds on the complexities of fast polynomial multiplications based on
Cantor's approach and the fast extended Euclidean algorithm.
Comment: 11 pages, submitted to EURASIP Journal on Wireless Communications and Networking
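The first step of the syndrome-based decoding discussed above can be sketched for a toy RS code over a characteristic-2 field: evaluate the received word at consecutive powers of a primitive element alpha. The field GF(2^4) with primitive polynomial x^4 + x + 1 and the length-15 words are illustrative assumptions; an all-zero syndrome vector means the received word is a codeword.

```python
# Minimal sketch of syndrome computation for an RS code over GF(2^4).

PRIM = 0b10011  # x^4 + x + 1, primitive over GF(2)

def gf_mul(a, b):
    """Carry-less multiply in GF(2^4), reducing by PRIM on overflow."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def syndromes(received, num, alpha=2):
    """S_j = r(alpha^j) for j = 1..num; `received` is low-degree-first."""
    out = []
    for j in range(1, num + 1):
        s, x = 0, gf_pow(alpha, j)
        for c in reversed(received):     # Horner evaluation of r at alpha^j
            s = gf_mul(s, x) ^ c
        out.append(s)
    return out

zero_word = [0] * 15
print(syndromes(zero_word, 4))       # -> [0, 0, 0, 0]
corrupted = zero_word[:]
corrupted[3] = 1                     # single symbol error
print(any(syndromes(corrupted, 4)))  # -> True
```

Syndromeless decoding skips this evaluation entirely and works on the received polynomial directly, which is precisely why its complexity trades off differently against the code rate.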
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The task graph of the factorization step is made available to the two
runtimes, providing them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler and of the PaRSEC and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.
Comment: Heterogeneity in Computing Workshop (2014)
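The central idea of handing a factorization's task graph to a generic runtime can be sketched with Python's standard-library topological sorter standing in for the runtime's scheduler. The tiny four-task DAG and task names are illustrative assumptions; PaStiX's real graph covers the panels of the sparse factorization, and the runtime additionally decides CPU/GPU placement.

```python
# Minimal sketch: tasks declare their dependencies, and the "runtime"
# is free to execute them in any dependency-respecting order.
from graphlib import TopologicalSorter

def run_task_graph(deps, actions):
    """Run actions[t] for each task t in a valid topological order."""
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        actions[task]()
    return order

log = []
deps = {                         # task -> set of tasks it depends on
    "factor_panel": set(),
    "update_A": {"factor_panel"},
    "update_B": {"factor_panel"},
    "solve": {"update_A", "update_B"},
}
actions = {t: (lambda t=t: log.append(t)) for t in deps}
order = run_task_graph(deps, actions)
print(order[0], order[-1])       # -> factor_panel solve
```

Because only the dependency structure is fixed, a runtime like PaRSEC or StarPU can reorder the independent updates, run them on different devices, or overlap them with communication, without any change to the solver's task definitions.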