127 research outputs found

    Heuristics for offset assignment in embedded processors

    Get PDF
    This thesis deals with the optimization of program size and performance in current generation embedded digital signal processors (DSPs) by the design of optimal memory layouts for data. Given the tight constraints on the size, power consumption, cost and performance of these processors, the minimization of code size in terms of the number of instructions required and the associated reduction in execution time are important. Several DSPs provide limited addressing modes and the layout of data, known as offset assignment, plays a critical role in determining the code size and performance. Even the simplest variant of the offset assignment problem is NP-complete. Research effort in this area has focused on the design, implementation and evaluation of effective heuristics for several variants of the offset assignment problem. One of the most important factors in the determination of the size, and hence, execution time of a code is the number of instructions required to access the variables stored in the processor memory. The indirect addressing mode common in DSPs requires memory accesses to be realized through address registers that hold the address of the memory location to be accessed. The architecture provides instructions for adding to and subtracting from the values of the address registers to compute the addresses of subsequent data that need to be accessed. In addition, some DSP processors include multiple memory banks that allow increased parallelism in memory access. Proper partitioning of variables across memory banks is critical to effectively using the increased parallelism. The work reported in this thesis aims to evolve efficient methods for designing memory layouts under the conditions of availability of one address register (SOA) or of multiple address registers (GOA). It also proposes a novel technique for choosing the assignment of variables to the memory banks. This thesis motivates, proposes and evaluates heuristics for all these three problems. For the SOA and GOA problems, the heuristics are implemented and tested on different random sample inputs, and the results obtained are compared to those obtained by prior heuristics. In addition, this thesis provides some insight into the SOA, GOA and the variable partitioning problems

    Optimising algorithm and hardware for deep neural networks on FPGAs

    Get PDF
    This thesis proposes novel algorithm and hardware optimisation approaches to accelerate Deep Neural Networks (DNNs), including both Convolutional Neural Networks (CNNs) and Bayesian Neural Networks (BayesNNs). The first contribution of this thesis is to propose an adaptable and reconfigurable hardware design to accelerate CNNs. By analysing the computational patterns of different CNNs, a unified hardware architecture is proposed for both 2-Dimension and 3-Dimension CNNs. The accelerator is also designed with runtime adaptability, which adopts different parallelism strategies for different convolutional layers at runtime. The second contribution of this thesis is to propose a novel neural network architecture and hardware design co-optimisation approach, which improves the performance of CNNs at both algorithm and hardware levels. Our proposed three-phase co-design framework decouples network training from design space exploration, which significantly reduces the time-cost of the co-optimisation process. The third contribution of this thesis is to propose an algorithmic and hardware co-optimisation framework for accelerating BayesNNs. At the algorithmic level, three categories of structured sparsity are explored to reduce the computational complexity of BayesNNs. At the hardware level, we propose a novel hardware architecture with the aim of exploiting the structured sparsity for BayesNNs. Both algorithmic and hardware optimisations are jointly applied to push the performance limit.Open Acces

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Get PDF
    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve the dense or sparse linear system of equations in bioinformatics, power system and computer vision. Matrix decompositions are computationally expensive and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions: • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices. • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrary sized matrices. • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each. • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns. • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool-Latent Semantic Indexing with an FPGA-based architecture. • We present a configurable architecture to accelerate Homotopy l1-minimization, in which the modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update. Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application and dimension-dependent speedups over an optimized software implementation that range from 1.5ÃÂ to 43.6ÃÂ in terms of computation time

    Power aware data and memory management for dynamic applications

    Get PDF
    In recent years, the semiconductor industry has turned its focus towards heterogeneous multiprocessor platforms. They are an economically viable solution for coping with the growing setup and manufacturing cost of silicon systems. Furthermore, their inherent flexibility perfectly supports the emerging market of interactive, mobile data and content services. The platform’s performance and energy depend largely on how well the data-dominated services are mapped on the memory subsystem. A crucial aspect thereby is how efficient data is transferred between the different memory layers. Several compilation techniques have been developed to optimally use the available bandwidth. Unfortunately, they do not take the interaction between multiple threads into account and do not deal with the dynamic behaviour of these novel applications. The main limitations of current techniques are outlined and an approach for dealing with them is introduced

    Compilation Techniques for High-Performance Embedded Systems with Multiple Processors

    Get PDF
    Institute for Computing Systems ArchitectureDespite the progress made in developing more advanced compilers for embedded systems, programming of embedded high-performance computing systems based on Digital Signal Processors (DSPs) is still a highly skilled manual task. This is true for single-processor systems, and even more for embedded systems based on multiple DSPs. Compilers often fail to optimise existing DSP codes written in C due to the employed programming style. Parallelisation is hampered by the complex multiple address space memory architecture, which can be found in most commercial multi-DSP configurations. This thesis develops an integrated optimisation and parallelisation strategy that can deal with low-level C codes and produces optimised parallel code for a homogeneous multi-DSP architecture with distributed physical memory and multiple logical address spaces. In a first step, low-level programming idioms are identified and recovered. This enables the application of high-level code and data transformations well-known in the field of scientific computing. Iterative feedback-driven search for “good” transformation sequences is being investigated. A novel approach to parallelisation based on a unified data and loop transformation framework is presented and evaluated. Performance optimisation is achieved through exploitation of data locality on the one hand, and utilisation of DSP-specific architectural features such as Direct Memory Access (DMA) transfers on the other hand. The proposed methodology is evaluated against two benchmark suites (DSPstone & UTDSP) and four different high-performance DSPs, one of which is part of a commercial four processor multi-DSP board also used for evaluation. Experiments confirm the effectiveness of the program recovery techniques as enablers of high-level transformations and automatic parallelisation. Source-to-source transformations of DSP codes yield an average speedup of 2.21 across four different DSP architectures. The parallelisation scheme is – in conjunction with a set of locality optimisations – able to produce linear and even super-linear speedups on a number of relevant DSP kernels and applications

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    Get PDF
    The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements

    Digital mixing consoles: parallel architectures and taskforce scheduling strategies

    Get PDF
    This thesis is concerned specifically with the implementation of large-scale professional DMCs. The design of such multi-DSP audio products is extremely challenging: one cannot simply lash together n DSPs and obtain /7-times the performance of a sole device. M-P models developed here show that topology and IPC mechanisms have critical design implications. Alternative processor technologies are investigated with respect to the requirements of DMC architectures. An extensive analysis of M-P topologies is undertaken using the metrics provided by the TPG tool. Novel methods supporting DSP message-passing connectivity lead to the development of a hybrid audio M-P (HYMIPS) employing these techniques. A DMC model demonstrates the impact of task allocation on ASP M-P architectures. Five application-specific heuristics and four static-labelling schemes are developed for scheduling console taskforces on M-Ps. An integrated research framework and DCS engine enable scheduling strategies to be analysed with regard to the DMC problem domain. Three scheduling algorithms — CPM, DYN and AST — and three IPC mechanisms — FWE, NSL and NML — are investigated. Dynamic-labelling strategies and mix-bus granularity issues are further studied in detail. To summarise, this thesis elucidates those topologies, construction techniques and scheduling algorithms appropriate to professional DMC systems

    Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies

    Get PDF
    Broadband Wireless Access technologies have significant market potential, especially the WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high performance WiMAX solutions is forcing designers to seek help from multi-core processors that offer competitive advantages in terms of all performance metrics, such as speed, power and area. Through the provision of a degree of flexibility similar to that of a DSP and performance and power consumption advantages approaching that of an ASIC, coarse-grained dynamically reconfigurable processors are proving to be strong candidates for processing cores used in future high performance multi-core processor systems. This thesis investigates multi-core architectures with a newly emerging dynamically reconfigurable processor – RICA, targeting WiMAX physical layer applications. A novel master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC based simulator, called MRPSIM, is devised to model this multi-core architecture. This simulator provides fast simulation speed and timing accuracy, offers flexible architectural options to configure the multi-core architecture, and enables the analysis and investigation of multi-core architectures. Meanwhile a profiling-driven mapping methodology is developed to partition the WiMAX application into multiple tasks as well as schedule and map these tasks onto the multi-core architecture, aiming to reduce the overall system execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly integrated with the existing RICA tool flow. Based on the proposed master-slave multi-core architecture, a series of diverse homogeneous and heterogeneous multi-core solutions are designed for different fixed WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at relatively low area costs. Meanwhile a design space exploration methodology is developed to search the design space for multi-core systems to find suitable solutions under certain system constraints. Finally, laying a foundation for future multithreading exploration on the proposed multi-core architecture, this thesis investigates the porting of a real-time operating system – Micro C/OS-II to a single RICA processor. A multitasking version of WiMAX is implemented on a single RICA processor with the operating system support

    Customising compilers for customisable processors

    Get PDF
    The automatic generation of instruction set extensions to provide application-specific acceleration for embedded processors has been a productive area of research in recent years. There have been incremental improvements in the quality of the algorithms that discover and select which instructions to add to a processor. The use of automatic algorithms, however, result in instructions which are radically different from those found in conventional, human-designed, RISC or CISC ISAs. This has resulted in a gap between the hardware’s capabilities and the compiler’s ability to exploit them. This thesis proposes and investigates the use of a high-level compiler pass that uses graph-subgraph isomorphism checking to exploit these complex instructions. Operating in a separate pass permits techniques to be applied that are uniquely suited for mapping complex instructions, but unsuitable for conventional instruction selection. The existing, mature, compiler back-end can then handle the remainder of the compilation. With this method, the high-level pass was able to use 1965 different automatically produced instructions to obtain an initial average speed-up of 1.11x over 179 benchmarks evaluated on a hardware-verified cycle-accurate simulator. This result was improved following an investigation of how the produced instructions were being used by the compiler. It was established that the models the automatic tools were using to develop instructions did not take account of how well the compiler could realistically use them. Adding additional parameters to the search heuristic to account for compiler issues increased the speed-up from 1.11x to 1.24x. An alternative approach using a re-designed hardware interface was also investigated and this achieved a speed-up of 1.26x while reducing hardware and compiler complexity. A complementary, high-level, method of exploiting dual memory banks was created to increase memory bandwidth to accommodate the increased data-processing bandwidth provided by extension instructions. Finally, the compiler was considered for use in a non-conventional role where rather than generating code it is used to apply source-level transformations prior to the generation of extension instructions and thus affect the shape of the instructions that are generated

    On Design and Implementation of Generic Fuzzy Logic Controllers

    Get PDF
    Soft computing techniques, unlike traditional deterministic logic based computing techniques, sometimes also called as hard computing, are tolerant of imprecision, uncertainty, and approximation. The primary inspiration for soft computing is the human mind and its ability to address day-to-day problems. The primary constituents of soft computing techniques are Artificial Neural Network, Fuzzy Logic Systems, and Evolutionary Computing. This thesis presents design and implementation of a generic hardware architecture based Type-IMamdani fuzzy logic controller (FLC) implemented on a programmable device, which can be remotely configured in real-time over Ethernet. This reconfigurability is added as a feature to existing FLCs in literature. It enables users to change parameters (those drive the FLC systems) in real-time and eliminate repeated hardware programming whenever there is a need. Realization of these systems in real-time is difficult as the computational complexity increases exponentially with an increase in the number of inputs. Hence challenge lies in reducing the Rulebase significantly such that the inference time and the throughput time is perceivable for real-time applications. To achieve these objectives, a modified thresholded fired rules hypercube (MT-FRHC) algorithm for Rulebase reduction is proposed and implemented. MT-FRHC reduces the useful rules without compromising system accuracy and improves the cycle time in terms of fuzzy logic operations per second (FzLOPS). It is imperative to understand that there are over sixty reconfigurable parameters, and it becomes an arduous task for a user to manage them. Therefore, a genetic algorithm based parameter extraction technique is proposed. This will help to develop a course tuning and provide default parameters that can be later fine-tuned by the users remotely through the Web-based User Interface. A hardware software codesign architecture for FLC is developed on TI C6748 DSP hardware with Sys/BIOS RTOS and seamlessly integrated with a webbased user interface (WebUI) for reconfigurability. Fuzzy systems employ defuzzifier to convert the fuzzy output into the real world crisp output. Centroid of Area (CoA) method is most widely used defuzzification method for control applications. However, the prevalent method of CoA computation is based on the principle of Riemann sum which is computationally complex. A vertices based CoA (VBCoA) defuzzification method is introduced. It has been observed that the proposed VBCoA method for COA computation is faster than the Riemann sum based CoA computation. A code optimization technique, exclusive to TI DSPs, is implemented to achieve memory and machine cycle optimization. The WebUI is developed in accordance to a client–server model using ASP.NET. It acquires fuzzy parameters from users, and a server application is dedicated to handling data communication between the hardware and the server. Testing and analysis of this hardware G-FLCS has been carried out by using hardware-in-loop test to control various system models in Simulink environment which includes water level control in a two tank system, intelligent cruise control system, speed control of an armature controlled DC motor and anti-windup control. The performance of the proposed G-FLCS is compared to Fuzzy Inference System of Matlab Fuzzy Logic Toolbox and PID controller in terms of settling time, transient time and steady state error. This proposed MT-FRHC based G-FLCS with VBCoA defuzzification implemented on C6748 DSP was finally deployed to control the radial position of plasma in Aditya Tokamak fusion reactor. The proposed G-FLCS is observed to deliver a smooth and fast system response
    corecore