36 research outputs found

    Streaming Architectures for Medical Image Reconstruction

    Non-invasive imaging modalities have recently seen increased use in clinical diagnostic procedures. Unfortunately, emerging computational imaging techniques, such as those found in 3D ultrasound and iterative magnetic resonance imaging (MRI), are severely limited by the high computational requirements and poor algorithmic efficiency of current parallel hardware, often leading to significant delays before a doctor or technician can review the image, which can negatively impact patients in need of fast, highly accurate diagnosis. To make matters worse, the high raw data bandwidth of 3D ultrasound requires on-chip volume reconstruction under a tight power dissipation budget: dissipation of more than 5 W may burn the patient's skin. The tight power constraints and high volume rates required by emerging applications demand orders-of-magnitude improvement over state-of-the-art systems in both reconstruction time and energy efficiency. The goal of the research outlined in this dissertation is to reduce the time and energy required to perform medical image reconstruction through software/hardware co-design. By analyzing algorithms with a hardware-centric focus, we develop novel algorithmic improvements which simultaneously reduce computational requirements and map more efficiently to traditional hardware architectures. We then design and implement hardware accelerators which push the new algorithms to their full potential.

    In the first part of this dissertation, we characterize the performance bottlenecks of high-volume-rate 3D ultrasound imaging. By analyzing the 3D plane-wave ultrasound algorithm, we develop Delay Compression, which reduces both computational and storage requirements. Delay Compression recognizes additional symmetry in the planar transmission scheme found in 2D, 3D, and 3D-Separable plane-wave ultrasound implementations, enabling on-chip storage of the reconstruction constants for the first time and eliminating the most power-intensive component of the reconstruction process. We then design and implement Tetris, a streaming hardware accelerator for 3D-Separable plane-wave ultrasound. Tetris is enabled by the Tetris Reservation Station, a novel 2D register file that buffers incomplete voxels and eliminates the need for a traditional load-and-store memory interface. Utilizing a fully pipelined architecture, Tetris reconstructs volumes at physics-limited rates (i.e., limited by the physical propagation speed of sound through tissue).

    Next, we review a core component of several computational imaging modalities, the Non-uniform Fast Fourier Transform (NuFFT), focusing on its use in MRI reconstruction. We find that the non-uniform interpolation step accounts for over 99% of the reconstruction time due to poor spatial and temporal memory locality. While prior work has made great strides in improving the performance of the NuFFT, the most common algorithmic optimization severely limits the available parallelism, causing it to map poorly to the massively parallel processing available on modern GPUs and FPGAs. To this end, we create Slice-and-Dice, a processing model which enables efficient mapping of the NuFFT's most computationally intensive component onto traditional parallel architectures.
    We then demonstrate the full acceleration potential of Slice-and-Dice with Jigsaw, a custom hardware accelerator which performs the non-uniform interpolations found in the NuFFT in time approximately linear in the number of non-uniform samples, irrespective of sampling pattern, uniform grid size, or interpolation kernel width. The algorithms and architectures herein enable faster, more efficient medical image reconstruction without sacrificing image quality. By decreasing the time and energy required for image reconstruction, our work opens the door for future exploration into higher-resolution imaging and emerging, computationally complex reconstruction algorithms which improve the speed and quality of patient diagnosis.

    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/167986/1/westbl_1.pd
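
    The gridding (non-uniform interpolation) step identified above as the NuFFT bottleneck can be illustrated with a minimal sketch. This is not the dissertation's Slice-and-Dice or Jigsaw design; it is a naive scatter-style gridding loop, with hypothetical names (grid_samples, kernel_width) and a Gaussian stand-in for the usual Kaiser-Bessel kernel, meant only to show why every non-uniform sample touches a neighbourhood of grid points and why the data-dependent writes have poor memory locality:

```python
import numpy as np

def grid_samples(coords, values, grid_size, kernel_width=4):
    """Naive gridding: scatter non-uniform samples onto a uniform grid.

    Each sample contributes to kernel_width neighbouring grid points,
    weighted here by a Gaussian standing in for the usual Kaiser-Bessel
    kernel. The data-dependent writes to grid[] are what wreck spatial
    and temporal locality."""
    grid = np.zeros(grid_size, dtype=np.complex128)
    half = kernel_width / 2.0
    for x, v in zip(coords, values):              # x in [0, grid_size)
        lo = int(np.ceil(x - half))
        for k in range(lo, lo + kernel_width):
            w = np.exp(-(x - k) ** 2)             # stand-in window kernel
            grid[k % grid_size] += w * v          # scattered write
    return grid

# Usage: 10,000 random sample positions gridded onto 512 uniform points.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 512, size=10_000)
values = rng.standard_normal(10_000) + 1j * rng.standard_normal(10_000)
uniform = grid_samples(coords, values, 512)
```

    The work is proportional to the number of samples times the kernel width, i.e. linear in the number of non-uniform samples, which is the bound the abstract says Jigsaw meets in hardware regardless of sampling pattern; the scattered accumulations are what make this step cache-hostile on CPUs and hard to parallelize naively on GPUs.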

    A methodology for hardware-software codesign

    Thesis (Ph. D.) -- Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (pages 150-156).

    Special purpose hardware is vital to embedded systems, as it can improve performance while simultaneously reducing power consumption. The integration of special purpose hardware into applications running in software is difficult for a number of reasons. Some of the difficulty is due to the difference between the models used to program hardware and software, but great effort is also required to coordinate the simultaneous execution of the application running on the microprocessor with the accelerated kernel(s) running in hardware. To further compound the problem, current design methodologies for embedded applications require an early determination of the design partitioning, which allows hardware and software to be developed simultaneously, each adhering to a rigid interface contract. This approach is problematic because a good hardware-software decomposition is often not known until deep into the design process. Fixed interfaces and the burden of reimplementation prevent the migration of functionality motivated by repartitioning.

    This thesis presents a two-part solution to the integration of special purpose hardware into applications running in software. The first part addresses the problem of generating infrastructure for hardware-accelerated applications. We present a methodology in which the application is represented as a dataflow graph and the computation at each node is specified for execution either in software or as specialized hardware using the programmer's language of choice. An interface compiler has been implemented which takes as input the FIFO edges of the graph and generates code to connect all the different parts of the program, including those which communicate across the hardware/software boundary. This methodology, which we demonstrate on an FPGA platform, enables programmers to effectively exploit hardware acceleration without ever leaving the application space.

    The second part of this thesis presents an implementation of the Bluespec Codesign Language (BCL) to address the difficulty of experimenting with hardware/software partitioning alternatives. Based on guarded atomic actions, BCL can be used to specify both hardware and low-level software. BCL builds on Bluespec SystemVerilog (BSV), for which a commercial hardware compiler is available from Bluespec Inc., and has been augmented with extensions to support more efficient software generation. In BCL, the programmer specifies the entire design, including the partitioning, allowing the compiler to synthesize efficient software and hardware, along with transactors for communication between the partitions. The benefit of using a single language to express the entire design is that a programmer can easily experiment with many different hardware/software decompositions without needing to re-write the application code. Used together, the BCL and interface compilers represent a comprehensive solution to the task of integrating specialized hardware into an application.

    by Myron King. Ph.D.
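
    The FIFO-edge discipline described above can be sketched in software alone. The toy model below is not the thesis's interface compiler or BCL; it only illustrates, under assumed names (Node, the two-stage pipeline), why nodes that communicate exclusively through FIFOs can be repartitioned: a node's body could equally be a software function or a stub forwarding tokens to an FPGA kernel, with the rest of the graph unchanged:

```python
from queue import Queue
from threading import Thread

class Node:
    """A dataflow node that reads one token from each input FIFO and
    writes its result to every output FIFO. Because nodes couple only
    through FIFOs, a node body could equally be a software function or
    a stub forwarding tokens to an FPGA kernel; its neighbours never know."""
    def __init__(self, fn, inputs, outputs):
        self.fn, self.inputs, self.outputs = fn, inputs, outputs

    def run(self):
        while True:
            args = [q.get() for q in self.inputs]
            if any(a is None for a in args):     # None marks end-of-stream
                for q in self.outputs:
                    q.put(None)
                return
            result = self.fn(*args)
            for q in self.outputs:
                q.put(result)

# A two-stage pipeline: square (imagine this node lives in hardware),
# then increment. Repartitioning moves a Node, not the FIFO plumbing.
a, b, c = Queue(), Queue(), Queue()
stages = [Node(lambda x: x * x, [a], [b]),
          Node(lambda x: x + 1, [b], [c])]
threads = [Thread(target=n.run) for n in stages]
for t in threads:
    t.start()
for item in [1, 2, 3, None]:
    a.put(item)
results = []
while (v := c.get()) is not None:
    results.append(v)
for t in threads:
    t.join()
print(results)  # [2, 5, 10]
```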

    Image Processing Using FPGAs

    This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state of the art in image processing using FPGAs.

    Incremental parallel and distributed systems

    Incremental computation strives for efficient successive runs of applications by re-executing only those parts of the computation that are affected by a given input change, instead of recomputing everything from scratch. To realize the benefits of incremental computation, researchers and practitioners are developing new systems in which the application programmer can provide an efficient update mechanism for changing application data. Unfortunately, most existing solutions are limiting because they not only depart from existing programming models, but also require programmers to devise an incremental update mechanism (or a dynamic algorithm) on a per-application basis. In this thesis, we present incremental parallel and distributed systems that enable existing real-world applications to automatically benefit from efficient incremental updates. Our approach requires neither a departure from current models of programming, nor the design and implementation of dynamic algorithms. To achieve these goals, we have designed and built the following incremental systems: (i) Incoop, a system for incremental MapReduce computation; (ii) Shredder, a GPU-accelerated system for incremental storage; (iii) Slider, a stream processing platform for incremental sliding-window analytics; and (iv) iThreads, a threading library for parallel incremental computation. Our experience with these systems shows that significant performance gains can be achieved for existing applications without requiring any additional effort from programmers.
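
    A minimal sketch of the incremental idea, in the spirit of Incoop but not its actual implementation: map-side results are memoized under a content hash of their input chunk, so a second run re-executes only the map work for chunks that changed (names such as word_count and memo are illustrative):

```python
import hashlib

# Map outputs are memoized under a content hash of their input chunk,
# so a later run re-executes only the chunks that changed. Names
# (word_count, memo) are illustrative, not Incoop's API.
memo = {}  # chunk hash -> cached per-chunk word counts

def word_count(chunks):
    totals = {}
    for chunk in chunks:
        key = hashlib.sha256(chunk.encode()).hexdigest()
        if key not in memo:                  # only changed chunks re-run
            partial = {}
            for word in chunk.split():
                partial[word] = partial.get(word, 0) + 1
            memo[key] = partial
        for word, n in memo[key].items():    # cheap "reduce" over partials
            totals[word] = totals.get(word, 0) + n
    return totals

print(word_count(["a b a", "c c"]))    # both chunks computed
print(word_count(["a b a", "c c d"]))  # only the second chunk recomputed
```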

    Parallelism and the software-hardware interface in embedded systems

    This thesis by publications addresses issues in the architecture and microarchitecture of next-generation, high-performance streaming Systems-on-Chip by quantifying the most important forms of parallelism in current and emerging embedded system workloads. The work consists of three major research tracks, relating to data-level parallelism, thread-level parallelism, and the software-hardware interface, which together reflect the research interests of the author as they have formed over the last nine years. Published works confirm that parallelism at the data level is widely accepted as the most important performance lever for the efficient execution of embedded media and telecom applications, and it has been exploited via a number of approaches, the most efficient being vector/SIMD architectures. A further, complementary and substantial form of parallelism exists at the thread level, but this has not been researched to the same extent in the context of embedded workloads. For the efficient execution of such applications, exploitation of both forms of parallelism is of paramount importance. This calls for a new architectural approach to the software-hardware interface, as its rigidity, manifested in all desktop-based and the majority of embedded CPUs, directly affects the performance of vectorized, threaded codes. The author advocates a holistic, mature approach in which parallelism is extracted via automatic means while, at the same time, the traditionally rigid hardware-software interface is optimized to match the temporal and spatial behaviour of the embedded workload. This ultimate goal calls for the precise study of these forms of parallelism for a number of applications executing on theoretical models such as instruction set simulators and parallel RAM machines, as well as the development of highly parametric microarchitectural frameworks to encapsulate that functionality.

    EThOS - Electronic Theses Online Service. GB: United Kingdom.
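
    As a toy illustration of the data-level parallelism the author highlights (not an example from the thesis): the same multiply-accumulate expressed as a whole-array operation exposes lanes that vector/SIMD hardware can exploit, whereas the element-by-element loop does not:

```python
import numpy as np

# Toy contrast between scalar and data-parallel expression of the same
# multiply-accumulate (SAXPY). The whole-array form maps directly onto
# vector/SIMD lanes; the scalar loop hides that parallelism from the
# hardware. Illustrative only; not an experiment from the thesis.

def saxpy_scalar(a, x, y):
    out = [0.0] * len(x)
    for i in range(len(x)):      # one element at a time: no visible lanes
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vector(a, x, y):
    return a * x + y             # one whole-array op: SIMD/vector friendly

n = 1 << 16
x = np.arange(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
assert np.allclose(saxpy_scalar(2.0, x, y)[:8], saxpy_vector(2.0, x, y)[:8])
```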

    A novel parallel algorithm for surface editing and its FPGA implementation

    A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy.

    Surface modelling and editing is one of the important subjects in computer graphics. Decades of research in computer graphics have been carried out on both low-level, hardware-related algorithms and high-level, abstract software. The success of computer graphics has been seen in many application areas, such as multimedia, visualisation, virtual reality and the Internet. However, the hardware realisation of the OpenGL architecture based on FPGAs (field programmable gate arrays) is beyond the scope of most computer graphics research. It is an uncultivated research area in which the OpenGL pipeline, from hardware through the whole embedded system (ES) up to applications, is implemented in an FPGA chip.

    This research proposes a hybrid approach that investigates both software and hardware methods. It aims at bridging the gap between software and hardware methods and enhancing the overall performance of computer graphics. It consists of four parts: the construction of an FPGA-based ES, a Mesa-based OpenGL implementation for FPGA-based ESs, parallel processing, and a novel algorithm for surface modelling and editing.

    The FPGA-based ES is built up. In addition to the Nios II soft processor and DDR SDRAM memory, it consists of the LCD display device, frame buffers, a video pipeline, and an algorithm-specific module to support graphics processing. Since there is no implementation of OpenGL ES available for FPGA-based ESs, a specific OpenGL implementation based on Mesa is carried out. Because of the limited FPGA resources, the implementation adopts fixed-point arithmetic, which offers faster computation and lower storage than floating-point arithmetic while retaining accuracy sufficient for 3D rendering. Moreover, the implementation includes Bézier-spline curve and surface algorithms to support surface modelling and editing.

    Pipelined parallelism and co-processors are used to accelerate graphics processing in this research. These two parallelism methods extend traditional computational parallelism to fine-grained parallel tasks in FPGA-based ESs.

    The novel algorithm for surface modelling and editing, called the Progressive and Mixing Algorithm (PAMA), is proposed and implemented on FPGA-based ESs. Compared with the two main surface editing methods, subdivision and deformation, PAMA eliminates the large storage requirements and computing cost of intermediate processes. With four independent shape parameters, PAMA can be used to freely model and edit the shape of an open or closed surface while globally maintaining zero-order geometric continuity. PAMA can be applied not only to FPGA-based ESs but also to other platforms. With its parallel processing, small size, and low costs in computation, storage and power, the FPGA-based ES provides an effective hybrid solution to surface modelling and editing.
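
    The fixed-point design choice described above can be sketched generically (this is not the thesis's Mesa port or PAMA): a cubic Bézier curve evaluated with de Casteljau's algorithm in an assumed Q16.16 integer format, the kind of datapath an FPGA without floating-point units favours:

```python
# Cubic Bezier evaluation with de Casteljau's algorithm in Q16.16
# fixed point: integer values carrying 16 fractional bits, the kind
# of integer datapath an FPGA without floating-point units favours.
FRAC = 16
ONE = 1 << FRAC

def to_fx(x):
    return int(round(x * ONE))           # float -> Q16.16

def fx_mul(a, b):
    return (a * b) >> FRAC               # Q16.16 multiply with rescale

def bezier3_fx(p, t):
    """Evaluate a cubic Bezier with control values p[0..3] at parameter t,
    all in Q16.16, by repeated linear interpolation (de Casteljau)."""
    s = ONE - t
    q = [fx_mul(s, p[i]) + fx_mul(t, p[i + 1]) for i in range(3)]
    r = [fx_mul(s, q[i]) + fx_mul(t, q[i + 1]) for i in range(2)]
    return fx_mul(s, r[0]) + fx_mul(t, r[1])

ctrl = [to_fx(v) for v in (0.0, 1.0, 2.0, 0.5)]
print(bezier3_fx(ctrl, to_fx(0.5)) / ONE)  # 1.1875, matches the float result
```

    Each interpolation stage is a pair of integer multiplies, shifts and an add, which maps naturally onto a pipelined FPGA datapath, in line with the trade-off the abstract describes.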

    Behaviour analysis in binary SoC data
