    Exploiting parallelism within multidimensional multirate digital signal processing systems

    The intense demand for high processing rates in practical multidimensional Digital Signal Processing (DSP) systems justifies Application-Specific Integrated Circuit (ASIC) designs and parallel processing implementations. In this dissertation, we propose novel theories, methodologies and architectures for designing high-performance VLSI implementations of general multidimensional multirate DSP systems by exploiting the parallelism within those applications. To systematically exploit the parallelism within multidimensional multirate DSP algorithms, we develop novel transformations including (1) nonlinear I/O data space transforms, (2) intercalation transforms, and (3) multidimensional multirate unfolding transforms. These transformations are applied to the algorithms, yielding systematic methodologies for high-performance architectural design. With these design methodologies, we develop several architectures with parallel and distributed processing features for implementing multidimensional multirate applications. Experimental results show that these architectures are much more efficient in terms of execution time and/or hardware cost than existing hardware implementations.
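
    The dissertation's transforms are multidimensional and not reproduced here. As a rough 1-D illustration of where parallelism comes from in multirate structures, the sketch below computes a decimated FIR output two ways: filtering at the full rate and then discarding samples, versus computing only the retained outputs, each of which is an independent dot product that could be assigned to its own processing element. The filter, signal and decimation factor are invented for the example.

```python
import numpy as np

def filter_then_decimate(x, h, M):
    """Reference: FIR filter at the full rate, then keep every M-th sample."""
    y = np.convolve(x, h, mode="full")[: len(x)]
    return y[::M]

def decimate_directly(x, h, M):
    """Compute only the surviving outputs; each is an independent dot product,
    so the loop over n can be distributed across parallel processing elements."""
    n_out = (len(x) + M - 1) // M
    y = np.zeros(n_out)
    for n in range(n_out):                      # each n is independent
        for k, hk in enumerate(h):
            i = n * M - k
            if 0 <= i < len(x):
                y[n] += hk * x[i]
    return y

x = np.random.default_rng(0).standard_normal(64)
h = np.array([0.25, 0.5, 0.25])
assert np.allclose(filter_then_decimate(x, h, 3), decimate_directly(x, h, 3))
```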

    FPGA Implementation of Data Flow Graphs for Digital Signal Processing Applications

    A rapid growth in digital signal processing applications has increased the demand for high-speed digital systems, and multiprocessor systems are a natural choice for these applications. Before a hardware implementation is produced, the operations that describe the nature of these applications must be scheduled and hardware must be allocated. This paper proposes a new scheduling technique for digital signal processing (DSP) applications represented by data flow graphs (DFGs); in addition, hardware allocation is implemented in the form of an embedded system. The proposed scheduling technique achieves an optimal schedule of a DFG at design time, where the optimality criterion is maximum throughput within the available hardware resources. Maximum throughput is achieved by arranging the DFG nodes according to their inter-related data dependencies. Two nodes can then be clustered into one compound task, reducing the overall execution time by minimizing the number of tasks to be executed and hence the number of cycles needed to execute them. Each task is then expressed as an instruction to be executed on the hardware system. The hardware system is composed of one or more homogeneous pipelined processing elements and is designed to meet the maximum-rate schedule. Two implementations of the system architecture are proposed according to the number of processing elements: a serial system and a parallel system. The serial system comprises one processing element on which all tasks are processed sequentially, whereas the parallel system has four processing elements that execute tasks concurrently. These systems consist mainly of seven units: central shared memory, state table, multiway function unit buffer, execution array, processing element(s), instruction buffer and address generation unit. The hardware components were built on an FPGA chip using Verilog HDL. In synthesis results, the parallel system achieves 25.5% better performance than the serial system, while the serial system requires a smaller area, measured by the number of slice registers and slice lookup tables (LUTs). The relationship between the number of instructions executed in both systems, the system area, and the system performance (represented by system frequency) is studied. As memory size increases in both systems, the performance of the serial system is unaffected and that of the parallel system decreases slightly, by 1.5% to 4.5%; in terms of area, both the serial and parallel systems grow and in some cases double. The proposed scheduling technique is shown to outperform the retiming technique chosen for comparison: in synthesis results the serial system achieves 19.3% higher system frequency than the retiming technique, and the parallel system achieves 51.2% higher system frequency.
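
    The paper's own scheduling algorithm is not reproduced here. The sketch below is a generic dependency-aware list scheduler over a DFG with unit-latency nodes and a fixed pool of homogeneous processing elements, shown only to illustrate the kind of throughput-oriented scheduling the abstract describes; the node names and the four-PE figure simply mirror the parallel system above.

```python
from collections import deque

def schedule(dfg, num_pes):
    """dfg: {node: [predecessor nodes]}.  Returns {node: (pe, start_cycle)}."""
    indeg = {n: len(preds) for n, preds in dfg.items()}
    succs = {n: [] for n in dfg}
    for n, preds in dfg.items():
        for p in preds:
            succs[p].append(n)
    ready = deque(n for n, d in indeg.items() if d == 0)
    finish, placement = {}, {}
    pe_free = [0] * num_pes                    # next free cycle of each PE
    while ready:
        n = ready.popleft()
        earliest = max((finish[p] for p in dfg[n]), default=0)
        pe = min(range(num_pes), key=lambda i: max(pe_free[i], earliest))
        start = max(pe_free[pe], earliest)
        placement[n] = (pe, start)
        finish[n] = start + 1                  # unit-latency tasks
        pe_free[pe] = finish[n]
        for s in succs[n]:                     # release successors
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return placement

# Small example: with 4 PEs the independent nodes b, c, d run in the same cycle.
dfg = {"a": [], "b": ["a"], "c": ["a"], "d": ["a"], "e": ["b", "c", "d"]}
print(schedule(dfg, 4))
```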

    The application of genetic algorithms to high-level synthesis

    Simulation and analytical performance studies of generic ATM switch fabrics

    As technology improves, exciting new services such as videophone become possible and economically viable, but their deployment is hampered by the inability of present networks to carry them. The long-term vision is a single network able to carry all present and future services. Asynchronous Transfer Mode (ATM) is the versatile new packet-based switching and multiplexing technique proposed for this single network. Interest in ATM is currently high as both industrial and academic institutions strive to understand more about the technique. Using both simulation and analysis, this research has investigated how the performance of ATM switches is affected by architectural variations in the switch fabric design and how the stochastic nature of ATM affects the timing of constant bit rate services. As a result, the research has contributed new ATM switch performance data, a general-purpose ATM switch simulator and analytic models that further research may utilise, and has uncovered a significant timing problem of the ATM technique. The thesis will also be of interest and assistance to anyone planning to use simulation as a research tool to model an ATM switch.
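
    The thesis' simulator and switch fabrics are not reproduced here. The toy slot-level simulation below only illustrates the timing effect the abstract refers to: a constant bit rate (CBR) source sharing one output port with random background traffic experiences variable queueing delay, so the regular spacing of its cells is disturbed. The period, load and slot count are arbitrary values chosen for the example.

```python
import random
from collections import deque

random.seed(1)
PERIOD, LOAD, SLOTS = 10, 0.7, 100_000
queue = deque()                                  # FIFO of (arrival_slot, is_cbr)
delays = []

for t in range(SLOTS):
    if t % PERIOD == 0:
        queue.append((t, True))                  # CBR cell every PERIOD slots
    if random.random() < LOAD:
        queue.append((t, False))                 # bursty background cell
    if queue:                                    # port serves one cell per slot
        arrived, is_cbr = queue.popleft()
        if is_cbr:
            delays.append(t - arrived)

print(f"CBR cells delivered: {len(delays)}, "
      f"mean delay {sum(delays)/len(delays):.2f} slots, max {max(delays)}")
```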

    Design methodology for embedded computer vision systems

    Computer vision has emerged as one of the most popular domains of embedded applications. Though various new powerful embedded platforms to support such applications have emerged in recent years, there is a distinct lack of efficient domain-specific synthesis techniques for optimized implementation of such systems. In this thesis, four different aspects that contribute to efficient design and synthesis of such systems are explored: (1) Graph Transformations: Dataflow modeling is widely used in digital signal processing (DSP) systems. However, support for dynamic behavior in such systems exists mainly at the modeling level and there is a lack of optimized synthesis techniques for these models. New transformation techniques for efficient system-on-chip (SoC) design methods are proposed and implemented for cyclo-static dataflow and its parameterized version (parameterized cyclo-static dataflow) -- two powerful models that allow dynamic reconfigurability and phased behavior in DSP systems. (2) Design Space Exploration: The broad range of target platforms along with the complexity of applications provides a vast design space, calling for efficient tools to explore this space and produce effective design choices. A novel architectural-level design methodology based on a formalism called multirate synchronization graphs is presented along with methods for performance evaluation. (3) Multiprocessor Communication Interface: Efficient code synthesis for emerging new parallel architectures is an important and sparsely explored problem. A widely encountered problem in this regard is efficient communication between processors running different sub-systems. A widely used tool in the domain of general-purpose multiprocessor clusters is MPI (Message Passing Interface); however, MPI does not scale well for embedded DSP systems. A new, powerful and highly optimized communication interface for multiprocessor signal processing systems is presented in this work, based on the integration of relevant properties of MPI with dataflow semantics. (4) Parameterized Design Framework for Particle Filters: Particle filter systems constitute an important class of applications used in a wide range of fields. An efficient design and implementation framework for such systems has been implemented based on the observation that a large number of such applications exhibit similar properties. The key properties of such applications are identified and parameterized appropriately to realize different systems that represent useful trade-off points in the space of possible implementations.
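
    The thesis targets cyclo-static and parameterized cyclo-static dataflow, which are not reproduced here. As a much simpler illustration of the underlying dataflow machinery, the sketch below computes the repetition vector of a plain synchronous dataflow (SDF) graph from its balance equations, the first step in deriving a periodic schedule; the actors and token rates are made up for the example, and a connected, consistent graph is assumed.

```python
from fractions import Fraction
from math import gcd

def repetition_vector(actors, edges):
    """edges: (src, dst, produced, consumed).  Solve q[src]*prod == q[dst]*cons
    for the smallest positive integer firing counts q (assumes connectivity)."""
    q = {a: None for a in actors}
    q[actors[0]] = Fraction(1)
    changed = True
    while changed:                              # propagate ratios over the edges
        changed = False
        for src, dst, prod, cons in edges:
            if q[src] is not None and q[dst] is None:
                q[dst] = q[src] * prod / cons
                changed = True
            elif q[dst] is not None and q[src] is None:
                q[src] = q[dst] * cons / prod
                changed = True
    lcm_den = 1                                  # scale to the smallest integers
    for f in q.values():
        lcm_den = lcm_den * f.denominator // gcd(lcm_den, f.denominator)
    return {a: int(f * lcm_den) for a, f in q.items()}

# A produces 2 tokens per firing, B consumes 3 -> A fires 3x, B fires 2x.
print(repetition_vector(["A", "B"], [("A", "B", 2, 3)]))   # {'A': 3, 'B': 2}
```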

    Collaborative rescheduling of flights in a single mega-hub network

    Traditionally, airlines have configured flight operations into a hub-and-spoke network design. Using connecting arrival/departure waves at multiple hubs, these networks achieve efficient passenger flows. Recently, there has been much growth in global single mega-hub (SMH) flight networks, which have a significantly different operating cost structure and schedule design. These are located primarily in the Middle East and are commonly referred to as the ME3. The traditionalist view is that SMH networks are money losers subsidized by sovereign funds. This research studies and analyzes SMH networks in an attempt to better understand their flight efficiency drivers. Key characteristics of SMH airports are identified as: (i) there are no peak periods, and flight activity is balanced with coordinated waves; (ii) no priority is assigned to arrival/departure times at destinations (a selfish strategy), only hub connectivity is considered; (iii) there is less than 5% origin-destination (OD) traffic at the SMH; (iv) the airline operates only non-stop flights; (v) passengers accept longer travel times in exchange for economic benefits; and (vi) the airline and airport owners work together to achieve collaborative flight schedules. This research focuses on the network structure of SMH airports to identify and optimize the operational characteristics that are the source of their advantages. A key feature of SMH airports is that the airline and airport are closely aligned in a partnership. To model this relationship, the Mega-Hub Collaborative Flight Rescheduling (MCFR) Problem is introduced. The MCFR starts with an initial flight schedule developed by the airline, then formulates a cooperative objective which is optimized iteratively by a series of reschedules. Specifically, in a network of cities i ∈ M, the decision variables are i*, the flight to be rescheduled; D_i*, the new departure time of the flight to city i*; and H_i*, the new hold time at the destination city i*. The daily passenger traffic N_ij is normally distributed with mean μ_Nij and standard deviation σ_Nij. A three-term MCFR objective function is developed to represent the intersecting scheduling decision space between airlines and airports: (i) Passenger Waiting Time, (ii) Passenger Volume in Terminal, and (iii) Ground Activity Wave Imbalance. The objective function is non-linear, as are the associated constraints and definitions. An Excel/VBA-based simulator is developed to simulate the passenger traffic flows and generate the expected cost objective for a given flight network; it can handle networks of up to M = 250 flights, tracking 6,250 passenger arcs. A simulation-optimization approach is used to solve the MCFR. A Wave Gain-Loss (WGL) strategy estimates the impact Z_i of a flight shift Δ_i on the objective. The WGL iteratively reschedules flights and is formulated as a non-linear program. It includes functions to capture the traffic-affinity-driven solution dependency between flights, the relationship between passengers-in-terminal gradients and flight shifts, and the relationship between ground-activity gradients and flight shifts. Each iteration generates a Z_i-ranked list of flights. The WGL is integrated with the Excel/VBA simulator and shown to generate significant cost reductions in efficient time. Extensive testing is done on a set of 5 flight network problems, each with 3 different passenger flow networks characterized by low, medium and high traffic concentrations.
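
    The MCFR objective has three terms; the toy sketch below illustrates only the first (Passenger Waiting Time) and is not taken from the thesis' simulator. It assumes every ordered city pair (i, j) is an itinerary connecting an inbound arrival to an outbound departure at the hub, draws daily traffic N_ij from the stated normal distribution, and estimates total waiting by Monte Carlo; the schedule, means and standard deviations are invented for the example.

```python
import numpy as np

def expected_waiting(arrival, departure, mu, sigma, n_samples=10_000, rng=None):
    """Monte-Carlo estimate of total passenger-minutes spent waiting at the hub."""
    rng = rng or np.random.default_rng(0)
    M = len(arrival)
    wait = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            if i == j:
                continue
            w = departure[j] - arrival[i]                 # connection time (minutes)
            wait[i, j] = w if w >= 0 else w + 24 * 60     # else wait for next day's wave
    traffic = np.maximum(rng.normal(mu, sigma, size=(n_samples, M, M)), 0)
    return float((traffic * wait).sum(axis=(1, 2)).mean())

arrival   = np.array([6 * 60, 6 * 60 + 30, 7 * 60])       # minutes after midnight
departure = np.array([8 * 60, 8 * 60 + 15, 9 * 60])
mu    = np.full((3, 3), 40.0)                              # mean daily traffic per itinerary
sigma = np.full((3, 3), 8.0)
print(f"expected waiting ≈ {expected_waiting(arrival, departure, mu, sigma):,.0f} pax-min")
```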

    Tools for efficient Deep Learning

    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, resources and power consumption. We first present Aegis and SPGC to address the challenges of improving the memory efficiency of DL training and inference. Aegis makes mixed-precision training (MPT) more stable through layer-wise gradient scaling; empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have up to 1% higher accuracy than prior work. This thesis also addresses the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53 and 9.47 times, respectively, on Polybench/C. POLSCA achieves a 1.5 times speedup over hardware designs generated directly from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges posed by heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power efficiency improvements of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
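
    Aegis' layer-wise scaling algorithm is not reproduced here; the short numerical sketch below only demonstrates the underlying problem mixed-precision training faces, namely that small gradients underflow when cast to float16, and that scaling before the cast and unscaling afterwards preserves them. The gradient values and scale factor are arbitrary.

```python
import numpy as np

# Gradients that are fine in float32 but below float16's smallest subnormal.
grads = np.array([2e-8, -1.5e-8, 9e-9], dtype=np.float32)

naive = grads.astype(np.float16)                 # all round to zero in float16
scale = 2.0 ** 16
scaled = (grads * scale).astype(np.float16)      # scale first, then cast: values survive
recovered = scaled.astype(np.float32) / scale    # unscale in float32 before the weight update

print("float16 directly :", naive)
print("scaled+unscaled  :", recovered)
```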