564 research outputs found

    A configurable vector processor for accelerating speech coding algorithms

    Get PDF
    The growing demand for voice-over-packer (VoIP) services and multimedia-rich applications has made increasingly important the efficient, real-time implementation of low-bit rates speech coders on embedded VLSI platforms. Such speech coders are designed to substantially reduce the bandwidth requirements thus enabling dense multichannel gateways in small form factor. This however comes at a high computational cost which mandates the use of very high performance embedded processors. This thesis investigates the potential acceleration of two major ITU-T speech coding algorithms, namely G.729A and G.723.1, through their efficient implementation on a configurable extensible vector embedded CPU architecture. New scalar and vector ISAs were introduced which resulted in up to 80% reduction in the dynamic instruction count of both workloads. These instructions were subsequently encapsulated into a parametric, hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research and implementation of the vector datapath of this vector coprocessor which is tightly-coupled to a Sparc-V8 compliant CPU, the optimization and simulation methodologies employed and the use of Electronic System Level (ESL) techniques to rapidly design SIMD datapaths

    ACOTES project: Advanced compiler technologies for embedded streaming

    Get PDF
    Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

    Towards adaptive balanced computing (ABC) using reconfigurable functional caches (RFCs)

    Get PDF
    The general-purpose computing processor performs a wide range of functions. Although the performance of general-purpose processors has been steadily increasing, certain software technologies like multimedia and digital signal processing applications demand ever more computing power. Reconfigurable computing has emerged to combine the versatility of general-purpose processors with the customization ability of ASICs. The basic premise of reconfigurability is to provide better performance and higher computing density than fixed configuration processors. Most of the research in reconfigurable computing is dedicated to on-chip functional logic. If computing resources are adaptable to the computing requirement, the maximum performance can be achieved. To overcome the gap between processor and memory technology, the size of on-chip cache memory has been consistently increasing. The larger cache memory capacity, though beneficial in general, does not guarantee a higher performance for all the applications as they may not utilize all of the cache efficiently. To utilize on-chip resources effectively and to accelerate the performance of multimedia applications specifically, we propose a new architecture---Adaptive Balanced Computing (ABC). ABC uses dynamic resource configuration of on-chip cache memory by integrating Reconfigurable Functional Caches (RFC). RFC can work as a conventional cache or as a specialized computing unit when necessary. In order to convert a cache memory to a computing unit, we include additional logic to embed multi-bit output LUTs into the cache structure. We add the reconfigurability of cache memory to a conventional processor with minimal modification to the load/store microarchitecture and with minimal compiler assistance. ABC architecture utilizes resources more efficiently by reconfiguring the cache memory to computing units dynamically. The area penalty for this reconfiguration is about 50--60% of the memory cell cache array-only area with faster cache access time. In a base array cache (parallel decoding caches), the area penalty is 10--20% of the data array with 1--2% increase in the cache access time. However, we save 27% for FIR and 44% for DCT/IDCT in area with respect to memory cell array cache and about 80% for both applications with respect to base array cache if we were to implement all these units separately (such as ASICs). The simulations with multimedia and DSP applications (DCT/IDCT and FIR/IIR) show that the resource configuration with the RFC speedups ranging from 1.04X to 3.94X in overall applications and from 2.61X to 27.4X in the core computations. The simulations with various parameters indicate that the impact of reconfiguration can be minimized if an appropriate cache organization is selected

    Simulated annealing based datapath synthesis

    Get PDF

    Communication reduction techniques in numerical methods and deep neural networks

    Get PDF
    Inter-node communication has turned out to be one of the determining factors of the performance on modern HPC systems. Furthermore, the situation only gets worse with the ever-incresing size of the cores involved. Hence, this thesis explore the various possible techniques to reduce the communication during the execution of a parallel program. It turned out that there is no one-size-fit-all approach to the challenge. Despite this, the problems in each field, due to their unique characteristics, dispose of distinct opportunities for the communication reduction. The thesis, first devles into numerical linear algebra, develops an evolution of the Pipelined CG called IFCG. It eliminates the synchronizations normally take place towards the end of each iteration to increase the parallelism. Secondly, the thesis draws its attention on reducing the necessity to transfer the parameters between the CPU host and GPUs during a neural network training. It develops two routines: ADT and AWP in order to compress and decompress the weights with a reduced data representation format prior and right after the data transfer takes place. The compress rate is adjusted vis-à-vis the L2-norm of the weights of every layer. In the third contribution, the thesis diminish the communication in model parallelizing a deep neural network. Instead of splitting and distributing the neurons of each layer to the available processes on the system, now it is done every other layers. This results in a 50% percent reduction of the communication whereas it introduces 50% of extra local FP computation.La comunicació entre els nodes de computació multi-core sorgeix com un dels factors principals que impacta el rendiment d’un sistema HPC d’avui en dia. I més, mentre més core es pusa, pitjor la situació. Per tant aquesta tesi explora les possibles tècniques per a reduir la comunicació en l’execució d’un programa paral·lel. Tot i això, resulta que no existeix una sola tècnica que pugui resoldre aquest obstacle. Tot i que els problemes en cada àmbit, com que té els seus propis caracristics, disposa variosos oportunitats per la reducció de comunicació. La tesi, en primer lloc, dins de l’àmbit de l’àlgebra lineal numèriques desenvolupa un algoritme IFCG que és una evolució de Pipelined CG. IFCG elimina les sincronitzacions normalment posa cap al final de cada iteració per augmentar el paral·lelisme. En la segona contribució, la tesi dirigeix l’atenció a reduir la necessitat de transferir els paràmetres entre el CPU i els GPUs durant l’entrenament d’una xarxa neuronal. Desenvolupa rutines ADT i AWP per comprimir i descomprimir els pesos amb una representació de dades reduïda abans i just desprès de la transferència. La representació es decideix dinàmicament segons el L2-norm dels pesos a cada capa. Al final la tesi disminueix la comunicació en paral·lelitzar el model duna xarxa neurona. En lloc de distribuir les neurones de cada capa als processos disponibles en el sistema, es fa cada dues capes. Així que corta com mitja de la comunicació. En canvi, com que distribueix només cada dues capes, les capes restes es repliquen, resulta que incorre en una augmenta de 50% de computació local

    Communication reduction techniques in numerical methods and deep neural networks

    Get PDF
    Inter-node communication has turned out to be one of the determining factors of the performance on modern HPC systems. Furthermore, the situation only gets worse with the ever-incresing size of the cores involved. Hence, this thesis explore the various possible techniques to reduce the communication during the execution of a parallel program. It turned out that there is no one-size-fit-all approach to the challenge. Despite this, the problems in each field, due to their unique characteristics, dispose of distinct opportunities for the communication reduction. The thesis, first devles into numerical linear algebra, develops an evolution of the Pipelined CG called IFCG. It eliminates the synchronizations normally take place towards the end of each iteration to increase the parallelism. Secondly, the thesis draws its attention on reducing the necessity to transfer the parameters between the CPU host and GPUs during a neural network training. It develops two routines: ADT and AWP in order to compress and decompress the weights with a reduced data representation format prior and right after the data transfer takes place. The compress rate is adjusted vis-à-vis the L2-norm of the weights of every layer. In the third contribution, the thesis diminish the communication in model parallelizing a deep neural network. Instead of splitting and distributing the neurons of each layer to the available processes on the system, now it is done every other layers. This results in a 50% percent reduction of the communication whereas it introduces 50% of extra local FP computation.La comunicació entre els nodes de computació multi-core sorgeix com un dels factors principals que impacta el rendiment d’un sistema HPC d’avui en dia. I més, mentre més core es pusa, pitjor la situació. Per tant aquesta tesi explora les possibles tècniques per a reduir la comunicació en l’execució d’un programa paral·lel. Tot i això, resulta que no existeix una sola tècnica que pugui resoldre aquest obstacle. Tot i que els problemes en cada àmbit, com que té els seus propis caracristics, disposa variosos oportunitats per la reducció de comunicació. La tesi, en primer lloc, dins de l’àmbit de l’àlgebra lineal numèriques desenvolupa un algoritme IFCG que és una evolució de Pipelined CG. IFCG elimina les sincronitzacions normalment posa cap al final de cada iteració per augmentar el paral·lelisme. En la segona contribució, la tesi dirigeix l’atenció a reduir la necessitat de transferir els paràmetres entre el CPU i els GPUs durant l’entrenament d’una xarxa neuronal. Desenvolupa rutines ADT i AWP per comprimir i descomprimir els pesos amb una representació de dades reduïda abans i just desprès de la transferència. La representació es decideix dinàmicament segons el L2-norm dels pesos a cada capa. Al final la tesi disminueix la comunicació en paral·lelitzar el model duna xarxa neurona. En lloc de distribuir les neurones de cada capa als processos disponibles en el sistema, es fa cada dues capes. Així que corta com mitja de la comunicació. En canvi, com que distribueix només cada dues capes, les capes restes es repliquen, resulta que incorre en una augmenta de 50% de computació local.Postprint (published version
    corecore