42 research outputs found

    Dataflow-based Design and Implementation of Image Processing Applications

    Get PDF
    Dataflow is a well known computational model and is widely used for expressing the functionality of digital signal processing (DSP) applications, such as audio and video data stream processing, digital communications, and image processing. These applications usually require real-time processing capabilities and have critical performance constraints. Dataflow provides a formal mechanism for describing specifications of DSP applications, imposes minimal data-dependency constraints in specifications, and is effective in exposing and exploiting task or data level parallelism for achieving high performance implementations. To demonstrate dataflow-based design methods in a manner that is concrete and easily adapted to different platforms and back-end design tools, we present in this report a number of case studies based on the lightweight dataflow (LWDF) programming methodology. LWDF is designed as a "minimalistic" approach for integrating coarse grain dataflow programming structures into arbitrary simulation- or platform-oriented languages, such as C, C++, CUDA, MATLAB, SystemC, Verilog, and VHDL. In particular, LWDF requires minimal dependence on specialized tools or libraries. This feature --- together with the rigorous adherence to dataflow principles throughout the LWDF design framework --- allows designers to integrate and experiment with dataflow modeling approaches relatively quickly and flexibly into existing design methodologies and processes

    Real-time video scene analysis with heterogeneous processors

    Get PDF
    Field-Programmable Gate Arrays (FPGAs) and General Purpose Graphics Processing Units (GPUs) allow acceleration and real-time processing of computationally intensive computer vision algorithms. The decision to use either architecture in any application is determined by task-specific priorities such as processing latency, power consumption and algorithm accuracy. This choice is normally made at design time on a heuristic or fixed algorithmic basis; here we propose an alternative method for automatic runtime selection. In this thesis, we describe our PC-based system architecture containing both platforms; this provides greater flexibility and allows dynamic selection of processing platforms to suit changing scene priorities. Using the Histograms of Oriented Gradients (HOG) algorithm for pedestrian detection, we comprehensively explore algorithm implementation on FPGA, GPU and a combination of both, and show that the effect of data transfer time on overall processing performance is significant. We also characterise performance of each implementation and quantify tradeoffs between power, time and accuracy when moving processing between architectures, then specify the optimal architecture to use when prioritising each of these. We apply this new knowledge to a real-time surveillance application representative of anomaly detection problems: detecting parked vehicles in videos. Using motion detection and car and pedestrian HOG detectors implemented across multiple architectures to generate detections, we use trajectory clustering and a Bayesian contextual motion algorithm to generate an overall scene anomaly level. This is in turn used to select the architectures to run the compute-intensive detectors for the next frame on, with higher anomalies selecting faster, higher-power implementations. Comparing dynamic context-driven prioritisation of system performance against a fixed mapping of algorithms to architectures shows that our dynamic mapping method is 10% more accurate at detecting events than the power-optimised version, at the cost of 12W higher power consumption

    High performance bioinformatics and computational biology on general-purpose graphics processing units

    Get PDF
    Bioinformatics and Computational Biology (BCB) is a relatively new multidisciplinary field which brings together many aspects of the fields of biology, computer science, statistics, and engineering. Bioinformatics extracts useful information from biological data and makes these more intuitive and understandable by applying principles of information sciences, while computational biology harnesses computational approaches and technologies to answer biological questions conveniently. Recent years have seen an explosion of the size of biological data at a rate which outpaces the rate of increases in the computational power of mainstream computer technologies, namely general purpose processors (GPPs). The aim of this thesis is to explore the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high performance and efficient implementation of BCB applications in order to meet the demands of biological data increases at affordable cost. The thesis presents detailed design and implementations of GPU solutions for a number of BCB algorithms in two widely used BCB applications, namely biological sequence alignment and phylogenetic analysis. Biological sequence alignment can be used to determine the potential information about a newly discovered biological sequence from other well-known sequences through similarity comparison. On the other hand, phylogenetic analysis is concerned with the investigation of the evolution and relationships among organisms, and has many uses in the fields of system biology and comparative genomics. In molecular-based phylogenetic analysis, the relationship between species is estimated by inferring the common history of their genes and then phylogenetic trees are constructed to illustrate evolutionary relationships among genes and organisms. However, both biological sequence alignment and phylogenetic analysis are computationally expensive applications as their computing and memory requirements grow polynomially or even worse with the size of sequence databases. The thesis firstly presents a multi-threaded parallel design of the Smith- Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A novel technique is put forward to solve the restriction on the length of the query sequence in previous GPU-based implementations of the SW algorithm. Based on this implementation, the difference between two main task parallelization approaches (Inter-task and Intra-task parallelization) is presented. The resulting GPU implementation matches the speed of existing GPU implementations while providing more flexibility, i.e. flexible length of sequences in real world applications. It also outperforms an equivalent GPPbased implementation by 15x-20x. After this, the thesis presents the first reported multi-threaded design and GPU implementation of the Gapped BLAST with Two-Hit method algorithm, which is widely used for aligning biological sequences heuristically. This achieved up to 3x speed-up improvements compared to the most optimised GPP implementations. The thesis then presents a multi-threaded design and GPU implementation of a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and multiple sequence alignment (MSA). This achieves 8x-20x speed up compared to an equivalent GPP implementation based on the widely used ClustalW software. The NJ method however only gives one possible tree which strongly depends on the evolutionary model used. A more advanced method uses maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte Carlo (MCMC)-based Bayesian inference. The latter was the subject of another multi-threaded design and GPU implementation presented in this thesis, which achieved 4x-8x speed up compared to an equivalent GPP implementation based on the widely used MrBayes software. Finally, the thesis presents a general evaluation of the designs and implementations achieved in this work as a step towards the evaluation of GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field Programmable Gate Arrays (FPGA) technology

    High-level FPGA accelerator design for structured-mesh-based numerical solvers

    Get PDF
    Field Programmable Gate Arrays (FPGAs) have become highly attractive as accelerators due to their low power consumption and re-programmability. However, a key limitation is the time and know-how required to program them. Even with high-level synthesis tools, they still require significant hand-tuned/low-level customizations and design space exploration to gain good performance. The need to program FPGAs using the dataflow programming model, much less well known and practised by the high-performance computing (HPC) community, is a major barrier for adoption for HPC. The underlying motivation of this work is to bridge this gap - attaining near-optimal performance vs the ease of programming. To this end, we target the important class of applications based on structured meshes, focusing on numerical algorithms based on explicit and implicit techniques. We leverage the main characteristics of the application class, its computation-communication pattern and the hardware features. For explicit schemes, characterized by stencil computations, we unify the state-of-the-art techniques such as vectorization and unrolling with a number of new high-gain optimizations such as creating perfect data reuse data-paths, batching and tiling. A key new feature is their applicability to multiple stencil loops enabling the development of real-world workloads. For implicit schemes, we re-evaluate the characteristics of the tridiagonal system solver algorithms for FPGAs and develop a new high throughput batched multi-dimensional tridiagonal system solver library with orders of magnitude better performance than the state-of-the-art. New analytic models are developed to support the solvers, elucidating and modelling the critical path of execution and parameterizing the design. This together with the optimal designs and new library lead to a unified design work-flow for synthesis on FPGAs. The new workflow is used to implement a range of applications, from simple single stencil designs, multiple stencil loops to solvers with real-world utility. They are synthesized on the currently dominant Xilinx and Intel FPGAs. Benchmarking indicate the FPGAs matching or outperforming the best GPU implementations, the current best traditional architecture device solution. Over 30% energy saving can also be observed. The performance model demonstrates over 85% accuracy. The thesis discusses the determinants for these applications to be amenable for FPGA implementation, providing insights into the feasibility and profitability of a design. Finally we propose initial steps in automating the workflow to be used through a DSL

    Embedded System Design

    Get PDF
    A unique feature of this open access textbook is to provide a comprehensive introduction to the fundamental knowledge in embedded systems, with applications in cyber-physical systems and the Internet of things. It starts with an introduction to the field and a survey of specification models and languages for embedded and cyber-physical systems. It provides a brief overview of hardware devices used for such systems and presents the essentials of system software for embedded systems, including real-time operating systems. The author also discusses evaluation and validation techniques for embedded systems and provides an overview of techniques for mapping applications to execution platforms, including multi-core platforms. Embedded systems have to operate under tight constraints and, hence, the book also contains a selected set of optimization techniques, including software optimization techniques. The book closes with a brief survey on testing. This fourth edition has been updated and revised to reflect new trends and technologies, such as the importance of cyber-physical systems (CPS) and the Internet of things (IoT), the evolution of single-core processors to multi-core processors, and the increased importance of energy efficiency and thermal issues

    Design and Implementation of a Scalable Hardware Platform for High Speed Optical Tracking

    Get PDF
    Optical tracking has been an important subject of research since several decades. The utilization of optical tracking systems can be found in a wide range of areas, including military, medicine, industry, entertainment, etc. In this thesis a complete hardware platform that targets high-speed optical tracking applications is presented. The implemented hardware system contains three main components: a high-speed camera which is equipped with a 1.3M pixel image sensor capable of operating at 500 frames per second, a CameraLink grabber which is able to interface three cameras, and an FPGA+Dual-DSP based image processing platform. The hardware system is designed using a modular approach. The flexible architecture enables to construct a scalable optical tracking system, which allows a large number of cameras to be used in the tracking environment. One of the greatest challenges in a multi-camera based optical tracking system is the huge amounts of image data that must be processed in real-time. In this thesis, the study on FPGA based high-speed image processing is performed. The FPGA implementation for a number of image processing operators is described. How to exploit different levels of parallelisms in the algorithm to achieve high processing throughput is explained in detail. This thesis also presents a new single-pass blob analysis algorithm. With an optimized FPGA implementation, the geometrical features of a large number of blobs can be calculated in real-time. At the end of this thesis, a prototype design which integrates all the implemented hardware and software modules is demonstrated to prove the usability of the proposed optical tracking system

    Embedded System Design

    Get PDF
    A unique feature of this open access textbook is to provide a comprehensive introduction to the fundamental knowledge in embedded systems, with applications in cyber-physical systems and the Internet of things. It starts with an introduction to the field and a survey of specification models and languages for embedded and cyber-physical systems. It provides a brief overview of hardware devices used for such systems and presents the essentials of system software for embedded systems, including real-time operating systems. The author also discusses evaluation and validation techniques for embedded systems and provides an overview of techniques for mapping applications to execution platforms, including multi-core platforms. Embedded systems have to operate under tight constraints and, hence, the book also contains a selected set of optimization techniques, including software optimization techniques. The book closes with a brief survey on testing. This fourth edition has been updated and revised to reflect new trends and technologies, such as the importance of cyber-physical systems (CPS) and the Internet of things (IoT), the evolution of single-core processors to multi-core processors, and the increased importance of energy efficiency and thermal issues

    Robust Real-time Vision-based Human Detection and Tracking

    Get PDF
    In the last couple of decades technology has made its way into our everyday lives including our homes, our offices and the vehicles we use for travelling. Many modern devices interact with humans in a more or less intuitive way and some of them use cameras for observing or interacting with humans. Nevertheless, teaching a machine to detect humans in an image or a video is a very difficult task. There are many aspects that contribute to the complexity of this task, such as the many variations in the humans' perceived appearances: their constitution, the clothes they wear and the dynamic nature of the activities performed by humans. The focus of this thesis is on the development of reliable algorithms for real-time vision-based human detection and tracking in indoor as well as in outdoor applications. In order to achieve this, the algorithms presented in this thesis were developed for traditional passive cameras, as they perform well in both environments. The novel approaches for vision-based human detection and tracking are presented for three different applications: gait analysis, pedestrian detection and human-robot interaction. All these approaches have in common the need for real-time human detection and tracking in video sequences in order to extract application-specific data regarding the tracked human. In order to cope with human detection and tracking as a computational expensive task, novel hardware-specific optimizations of the proposed image processing algorithms are presented, that allow the algorithms to run in real-time. For this purpose GPU implementations are presented for pedestrian detection and the processing times are compared to CPU and FPGA implementations. In the case of human-robot interaction the real-time human tracking is achieved by using distributed computing

    FPGA-based High Performance Diagnostics For Fusion

    Get PDF
    High performance diagnostics are an important aspect of fusion research. Increasing shot-lengths paired with the requirement for higher accuracy and speed make it mandatory to employ new technology to cope with the increasing demands on digitization and data handling. Field programmable gate arrays (FPGAs) are well known in high performance applications. Their ability to handle multiple fast data streams whilst remaining programmable make them an ideal tool for diagnostic development. Both the improvement of old and the design of new diagnostics can benefit from FPGA-technology and increase the amount of accessible physics significantly. In this work the developments on two FPGA-based diagnostics are presented. In the first part a new open-hardware low-cost FPGA-based digitizer is presented for the MAST-Upgrade (MAST-U) integral electron density interferometer. The system is shown to have an optically limited phase accuracy and a detection bandwidth of over 3.5 MHz. Data is acquired continuously at 20 MS/s and streamed to an acquisition PC via optical fiber. By employing a dual-FPGA approach real-time processing of the density signal can be achieved despite severly limited resources, thus providing a control signal for the MAST-U plasma control system system with less than 8 μs latency. Due to MAST-U being still inoperable, in-situ testing has been conducted on the ASDEX Upgrade, where fast wave physics up to 3.5 MHz could first be observed. The second part presents developments to the Synthetic Aperture Microwave Imaging (SAMI) diagnostic. In addition to improving the utilization of long shot-lengths and enabling dual-polarized acquisition the system has been enhanced to continuously acquire active probing profiles for 2D Doppler back-scattering (DBS), a technique recently developed using SAMI. The aim is to measure pitch angle profiles to derive the edge current density. SAMI has been transferred to the NSTX-Upgrade and integrated into the experiment’s infrastructure, where it has been acquiring data since May 2016. As part of this move an investigation into near-field effects on SAMI’s image reconstruction algorithms was conducted
    corecore