473 research outputs found

    A Comprehensive Methodology for Algorithm Characterization, Regularization and Mapping Into Optimal VLSI Arrays.

    Get PDF
    This dissertation provides a fairly comprehensive treatment of a broad class of algorithms as it pertains to systolic implementation. We describe some formal algorithmic transformations that can be utilized to map regular and some irregular compute-bound algorithms into the best fit time-optimal systolic architectures. The resulted architectures can be one-dimensional, two-dimensional, three-dimensional or nonplanar. The methodology detailed in the dissertation employs, like other methods, the concept of dependence vector to order, in space and time, the index points representing the algorithm. However, by differentiating between two types of dependence vectors, the ordering procedure is allowed to be flexible and time optimal. Furthermore, unlike other methodologies, the approach reported here does not put constraints on the topology or dimensionality of the target architecture. The ordered index points are represented by nodes in a diagram called Systolic Precedence Diagram (SPD). The SPD is a form of precedence graph that takes into account the systolic operation requirements of strictly local communications and regular data flow. Therefore, any algorithm with variable dependence vectors has to be transformed into a regular indexed set of computations with local dependencies. This can be done by replacing variable dependence vectors with sets of fixed dependence vectors. The SPD is transformed into an acyclic, labeled, directed graph called the Systolic Directed Graph (SDG). The SDG models the data flow as well as the timing for the execution of the given algorithm on a time-optimal array. The target architectures are obtained by projecting the SDG along defined directions. If more than one valid projection direction exists, different designs are obtained. The resulting architectures are then evaluated to determine if an improvement in the performance can be achieved by increasing PE fan-out. If so, the methodology provides the corresponding systolic implementation. By employing a new graph transformation, the SDG is manipulated so that it can be mapped into fixed-size and fixed-depth multi-linear arrays. The latter is a new concept of systolic arrays that is adaptable to changes in the state of technology. It promises a bonded clock skew, higher throughput and better performance than the linear implementation

    Study on Low-Power Image Processing for Gastrointestinal Endoscopy

    Get PDF

    Parallelization of dynamic programming recurrences in computational biology

    Get PDF
    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays: FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130x faster than a modern dual core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3 GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms

    Systolic Array Implementations With Reduced Compute Time.

    Get PDF
    The goal of the research is the establishment of a formal methodology to develop computational structures more suitable for the changing nature of real-time signal processing and control applications. A major effort is devoted to the following question: Given a systolic array designed to execute a particular algorithm, what other algorithms can be executed on the same array? One approach for answering this question is based on a general model of array operations using graph-theoretic techniques. As a result, a systematic procedure is introduced that models array operations as a function of the compute cycle. As a consequence of the analysis, the dissertation develops the concept of fast algorithm realizations. This concept characterizes specific realizations that can be evaluated in a reduced number of cycles. It restricts the operations to remain in the same class but with reduced execution time. The concept takes advantage of the data dependencies of the algorithm at hand. This feature allows the modification of existing structures by reordering the input data. Applications of the principle allows optimum time band and triangular matrix product on arrays designed for dense matrices. A second approach for analyzing the families of algorithms implementable in an array, is based on the concept of array time constrained operation. The principle uses the number of compute cycle as an additional degree of freedom to expand the class of transformations generated by a single array. A mathematical approach, based on concepts from multilinear algebra, is introduced to model the recursive transformations implemented in linear arrays at each compute cycle. The proposed representation is general enough to encompass a large class of signal processing and control applications. A complete analytical model of the linear maps implementable by the array at each compute cycle is developed. The proposed methodology results in arrays that are more adaptable to the changing nature of operations. Lessons learned from analyzing existing arrays are used to design smart arrays for special algorithm realizations. Applications of the methodology include the design of flexible time structures and the ability to decompose a full size array into subarrays implementing smaller size problems

    NONLINEAR OPERATORS FOR IMAGE PROCESSING: DESIGN, IMPLEMENTATION AND MODELING TECHNIQUES FOR POWER ESTIMATION

    Get PDF
    1998/1999Negli ultimi anni passati le applicazioni multimediali hanno visto uno sviluppo notevole, trovando applicazione in un gran numero di campi. Applicazioni come video conferenze, diagnostica medica, telefonia mobile e applicazioni militari necessitano il trattamento di una gran mole di dati ad alta velocità. Pertanto, l'elaborazione di immagini e di dati vocali è molto importante ed è stata oggetto di numerosi sforzi, nel tentativo di trovare algoritmi sempre più veloci ed efficaci. Tra gli algoritmi proposti, noi crediamo che gli operatori razionali svolgano un ruolo molto importante, grazie alla loro versatilità ed efficacia nell'elaborazione di dati. Negli ultimi anni sono stati proposti diversi algoritmi, dimostrando che questi operatori possono essere molto vantaggiosi in diverse applicazioni, producendo buoni risultati. Lo scopo di questo lavoro è di realizzare alcuni di questi algoritmi e, quindi, dimostrare che i filtri razionali, in particolare, possono essere realizzati senza ricorrere a sistemi di grandi dimensioni e possono raggiungere frequenze operative molto alte. Una volta che il blocco fondamentale di un sistema basato su operatori razionali sia stato realizzato, esso pu6 essere riusato con successo in molte altre applicazioni. Dal punto di vista del progettista, è importante avere uno schema generale di studio, che lo renda capace di studiare le varie configurazioni del sistema da realizzare e di analizzare i compromessi tra le variabili di progetto. In particolare, per soddisfare l'esigenza di metodi versatili per la stima della potenza, abbiamo sviluppato una tecnica di macro modellizazione che permette al progettista di stimare velocemente ed accuratamente la potenza dissipata da un circuito. La tesi è organizzata come segue: Nel Capitolo 1 alcuni sono presentati alcuni algoritmi studiati per la realizzazione. Ne viene data solo una veloce descrizione, lasciando comunque al lettore interessato dei riferimenti bibliografici. Nel Capitolo 2 vengono discusse le architetture fondamentali usate per la realizzazione. Principalmente sono state usate architetture a pipeline, ma viene data anche una descrizione degli approcci oggigiorno disponibili per l'ottimizzazione delle temporizzazioni. Nel Capitolo 3 sono presentate le realizzazioni di due sistemi studiati per questa tesi. Gli approcci seguiti si basano su ASIC e FPGA. Richiedono tecniche e soluzioni diverse per il progetto del sistema, per cui é interessante vedere cosa pu6 essere fatto nei due casi. Infine, nel Capitolo 4, descriviamo la nostra tecnica di macro modellizazione per la stima di potenza, dando una breve visione delle tecniche finora proposte e facendo vedere quali sono i vantaggi che il nostro metodo comporta per il progetto.In the past few years, multimedia application have been growing very fast, being applied to a large variety of fields. Applications like video conference, medical diagnostic, mobile phones, military applications require to handle large amount of data at high rate. Images as well as voice data processing are therefore very important and they have been subjected to a lot of efforts in order to find always faster and effective algorithms. Among image processing algorithms, we believe that rational operators assume an important role, due to their versatility and effectiveness in data processing. In the last years, several algorithms have been proposed, demonstrating that these operators can be very suitable in different applications with very good results. The aim of this work is to implement some of these algorithm and, therefore, demonstrate that rational filters, in particular, can be implemented without requiring large sized systems and they can operate at very high frequencies. Once the basic building block of a rational based system has been implemented, it can be successfully reused in many other applications. From the designer point of view, it is important to have a general framework, which makes it able to study various configurations of the system to be implemented and analyse the trade-off among the design variables. In particular, to meet the need far versatile tools far power estimation, we developed a new macro modelling technique, which allows the designer to estimate the power dissipated by a circuit quickly and accurately. The thesis is organized as follows: In chapter 1 we present some of the algorithms which have been studied for implementation. Only a brief overview is given, leaving to the interested reader some references in literature. In chapter 2 we discuss the basic architectures used for the implementations. Pipelined structures have been mainly used for this thesis, but an overview of the nowaday available approaches for timing optimization is presented. In chapter 3 we present two of the implementation designed for this thesis. The approaches followed are ASIC driven and FPGA drive. They require different techniques and different solution for the design of the system, therefore it is interesting to see what can be done in both the cases. Finally, in chapter 4, we describe our macro modelling techniques for power estimation, giving a brief overview of the up to now proposed techniques and showing the advantages our method brings to the design.XII Ciclo1969Versione digitalizzata della tesi di dottorato cartacea

    Tools for efficient Deep Learning

    Get PDF
    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges for designing DNNs that are efficient in time, in resources and in power consumption. We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) stabler by layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by at most 4\%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have maximally 1\% higher accuracy than prior work. This thesis also addresses the challenges lying in the gap between DNN descriptions and executables by Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist can speed up software execution in sequential and parallel by 2.53 and 9.47 times on Polybench/C. POLSCA achieves 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators of streaming architectures with advanced pipelining techniques to address the challenges from heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvement of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, some of which have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.Open Acces

    Indexed dependence metadata and its applications in software performance optimisation

    No full text
    To achieve continued performance improvements, modern microprocessor design is tending to concentrate an increasing proportion of hardware on computation units with less automatic management of data movement and extraction of parallelism. As a result, architectures increasingly include multiple computation cores and complicated, software-managed memory hierarchies. Compilers have difficulty characterizing the behaviour of a kernel in a general enough manner to enable automatic generation of efficient code in any but the most straightforward of cases. We propose the concept of indexed dependence metadata to improve application development and mapping onto such architectures. The metadata represent both the iteration space of a kernel and the mapping of that iteration space from a given index to the set of data elements that iteration might use: thus the dependence metadata is indexed by the kernel’s iteration space. This explicit mapping allows the compiler or runtime to optimise the program more efficiently, and improves the program structure for the developer. We argue that this form of explicit interface specification reduces the need for premature, architecture-specific optimisation. It improves program portability, supports intercomponent optimisation and enables generation of efficient data movement code. We offer the following contributions: an introduction to the concept of indexed dependence metadata as a generalisation of stream programming, a demonstration of its advantages in a component programming system, the decoupled access/execute model for C++ programs, and how indexed dependence metadata might be used to improve the programming model for GPU-based designs. Our experimental results with prototype implementations show that indexed dependence metadata supports automatic synthesis of double-buffered data movement for the Cell processor and enables aggressive loop fusion optimisations in image processing, linear algebra and multigrid application case studies

    Portable Ultrasound Imaging

    Get PDF
    This PhD project investigates hardware strategies and imaging methods for hand-held ultrasound systems. The overall idea is to use a wireless ultrasound probe linked to general-purpose mobile devices for the processing and visualization. The approach has the potential to reduce the upfront costs of the ultrasound system and, consequently, to allow for a wide-scale utilization of diagnostic ultrasound in any medical specialties and out of the radiology department. The first part of the contribution deals with the study of hardware solutions for the reduction of the system complexity. Analog and digital beamforming strategies are simulated from a system-level perspective. The quality of the B-mode image is evaluated and the minimum specifications are derived for the design of a portable probe with integrated electronics in-handle. The system is based on a synthetic aperture sequential beamforming approach that allows to significantly reduce the data rate between the probe and processing unit. The second part investigates the feasibility of vector flow imaging in a hand-held ultrasound system. Vector flow imaging overcomes the limitations of conventional imaging methods in terms of flow angle compensation. Furthermore, high frame rate can be obtained by using synthetic aperture focusing techniques. A method is developed combining synthetic aperture sequential beamforming and directional transverse oscillation to achieve the wireless transmission of the data along with a relatively inexpensive 2-D velocity estimation. The performance of the method is thoroughly assessed through simulations and measurements, and in vivo investigations are carried out to show its potential in presence of complex flow dynamics. A sufficient frame rate is achieved to allow for the visualization of vortices in the carotid bifurcation. Furthermore, the method is implemented on a commercially available tablet to evaluate the real-time processing performance in the built-in GPU with concurrent wireless transmission of the data. Based on the demonstrations in this thesis, a flexible framework can be implemented with performance that can be scaled to the needs of the user and according to the computing resources available. The integration of high-frame-rate vector flow imaging in a hand-held ultrasound scanner, in addition, has the potential to improve the operator’s workflow and opens the way to new possibilities in the clinical practice

    FPGA acceleration of sequence analysis tools in bioinformatics

    Full text link
    Thesis (Ph.D.)--Boston UniversityWith advances in biotechnology and computing power, biological data are being produced at an exceptional rate. The purpose of this study is to analyze the application of FPGAs to accelerate high impact production biosequence analysis tools. Compared with other alternatives, FPGAs offer huge compute power, lower power consumption, and reasonable flexibility. BLAST has become the de facto standard in bioinformatic approximate string matching and so its acceleration is of fundamental importance. It is a complex highly-optimized system, consisting of tens of thousands of lines of code and a large number of heuristics. Our idea is to emulate the main phases of its algorithm on FPGA. Utilizing our FPGA engine, we quickly reduce the size of the database to a small fraction, and then use the original code to process the query. Using a standard FPGA-based system, we achieved 12x speedup over a highly optimized multithread reference code. Multiple Sequence Alignment (MSA)--the extension of pairwise Sequence Alignment to multiple Sequences--is critical to solve many biological problems. Previous attempts to accelerate Clustal-W, the most commonly used MSA code, have directly mapped a portion of the code to the FPGA. We use a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of from 8Ox to 190x over the CPU code (8 cores). The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics. The challenge in FPGA-based acceleration is finding a suitable application mapping. Unfortunately many software heuristics do not fall into this category and so other methods must be applied. One is restructuring: an entirely new algorithm is applied. Another is to analyze application utilization and develop accuracy/performance tradeoffs. Using our prefiltering approach and novel FPGA programming models we have achieved significant speedup over reference programs. We have applied approximation, seeding, and filtering to this end. The bulk of this study is to introduce the pros and cons of these acceleration models for biosequence analysis tools
    • …
    corecore