Search CORE

36 research outputs found

Streaming Architectures for Medical Image Reconstruction

Author: West Brendan
Publication venue
Publication date: 01/01/2020
Field of study

Non-invasive imaging modalities have recently seen increased use in clinical diagnostic procedures. Unfortunately, emerging computational imaging techniques, such as those found in 3D ultrasound and iterative magnetic resonance imaging (MRI), are severely limited by the high computational requirements and poor algorithmic efficiency on current parallel hardware, often leading to significant delays before a doctor or technician can review the image, which can negatively impact patients in need of fast, highly accurate diagnosis. To make matters worse, the high raw data bandwidth found in 3D ultrasound requires on-chip volume reconstruction with a tight power dissipation budget: dissipation of more than 5 W may burn the skin of the patient. The tight power constraints and high volume rates required by emerging applications require orders of magnitude improvement over state-of-the-art systems in terms of both reconstruction time and energy efficiency. The goal of the research outlined in this dissertation is to reduce the time and energy required to perform medical image reconstruction through software/hardware co-design. By analyzing algorithms with a hardware-centric focus, we develop novel algorithmic improvements which simultaneously reduce computational requirements and map more efficiently to traditional hardware architectures. We then design and implement hardware accelerators which push the new algorithms to their full potential. In the first part of this dissertation, we characterize the performance bottlenecks of high-volume-rate 3D ultrasound imaging. By analyzing the 3D plane-wave ultrasound algorithm, we reduce computational and storage requirements with Delay Compression.
Delay Compression recognizes additional symmetry in the planar transmission scheme found in 2D, 3D, and 3D-Separable plane-wave ultrasound implementations, enabling on-chip storage of the reconstruction constants for the first time and eliminating the most power-intensive component of the reconstruction process. We then design and implement Tetris, a streaming hardware accelerator for 3D-Separable plane-wave ultrasound. Tetris is enabled by the Tetris Reservation Station, a novel 2D register file that buffers incomplete voxels and eliminates the need for a traditional load-and-store memory interface. Utilizing a fully pipelined architecture, Tetris reconstructs volumes at physics-limited rates (i.e., limited by the physical propagation speed of sound through tissue). Next, we review a core component of several computational imaging modalities, the Non-uniform Fast Fourier Transform (NuFFT), focusing on its use in MRI reconstruction. We find that the non-uniform interpolation step therein requires over 99% of the reconstruction time due to poor spatial and temporal memory locality. While prior work has made great strides in improving the performance of the NuFFT, the most common algorithmic optimization severely limits the available parallelism, causing it to map poorly to the massively parallel processing available in modern GPUs and FPGAs. To this end, we create Slice-and-Dice, a processing model which enables efficient mapping of the NuFFT's most computationally intensive component onto traditional parallel architectures. We then demonstrate the full acceleration potential of Slice-and-Dice with Jigsaw, a custom hardware accelerator which performs the non-uniform interpolations found in the NuFFT in time approximately linear in the number of non-uniform samples, irrespective of sampling pattern, uniform grid size, or interpolation kernel width.
The algorithms and architectures herein enable faster, more efficient medical image reconstruction, without sacrificing image quality. By decreasing the time and energy required for image reconstruction, our work opens the door for future exploration into higher-resolution imaging and emerging, computationally complex reconstruction algorithms which improve the speed and quality of patient diagnosis.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/167986/1/westbl_1.pd
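The non-uniform interpolation step that the abstract identifies as the NuFFT bottleneck can be illustrated with a minimal 1D gridding sketch. This is only a baseline illustration, not the Slice-and-Dice model or the Jigsaw design: the function name `grid_1d`, the triangular kernel, and the default kernel width are assumptions chosen for brevity (practical implementations commonly use a Kaiser-Bessel kernel).

```python
import math

def grid_1d(coords, values, grid_size, width=4):
    """Spread non-uniform samples onto a uniform, periodic 1D grid.

    coords: sample positions in grid units (hypothetical convention).
    values: sample values, one per coordinate.
    width:  kernel support in grid points (assumed default).
    """
    grid = [0j] * grid_size
    half = width / 2.0
    for x, v in zip(coords, values):
        first = math.ceil(x - half)
        last = math.floor(x + half)
        for k in range(first, last + 1):
            # Triangular kernel: weight falls linearly to zero at distance `half`.
            w = max(0.0, 1.0 - abs(k - x) / half)
            # Each sample scatters into `width` grid cells; for arbitrary
            # sampling patterns these writes land far apart in memory.
            grid[k % grid_size] += w * v
    return grid
```

Because each sample updates a different neighbourhood of the grid, consecutive samples rarely touch nearby memory, which is one way to picture the poor locality the abstract describes. The work in this naive version also grows with the kernel width, unlike the approximately linear-in-samples behaviour the abstract reports for Jigsaw.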

Deep Blue Documents at the University of Michigan

A methodology for hardware-software codesign

Author: King Myron Decker
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2013
Field of study

Thesis (Ph.D.), Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (pages 150-156).
Special purpose hardware is vital to embedded systems as it can simultaneously improve performance while reducing power consumption. The integration of special purpose hardware into applications running in software is difficult for a number of reasons. Some of the difficulty is due to the difference between the models used to program hardware and software, but great effort is also required to coordinate the simultaneous execution of the application running on the microprocessor with the accelerated kernel(s) running in hardware. To further compound the problem, current design methodologies for embedded applications require an early determination of the design partitioning which allows hardware and software to be developed simultaneously, each adhering to a rigid interface contract. This approach is problematic because often a good hardware-software decomposition is not known until deep into the design process. Fixed interfaces and the burden of reimplementation prevent the migration of functionality motivated by repartitioning. This thesis presents a two-part solution to the integration of special purpose hardware into applications running in software. The first part addresses the problem of generating infrastructure for hardware-accelerated applications. We present a methodology in which the application is represented as a dataflow graph and the computation at each node is specified for execution either in software or as specialized hardware using the programmer's language of choice. An interface compiler has been implemented which takes as input the FIFO edges of the graph and generates code to connect all the different parts of the program, including those which communicate across the hardware/software boundary.
This methodology, which we demonstrate on an FPGA platform, enables programmers to effectively exploit hardware acceleration without ever leaving the application space. The second part of this thesis presents an implementation of the Bluespec Codesign Language (BCL) to address the difficulty of experimenting with hardware/software partitioning alternatives. Based on guarded atomic actions, BCL can be used to specify both hardware and low-level software. Based on Bluespec SystemVerilog (BSV), for which a hardware compiler by Bluespec Inc. is commercially available, BCL has been augmented with extensions to support more efficient software generation. In BCL, the programmer specifies the entire design, including the partitioning, allowing the compiler to synthesize efficient software and hardware, along with transactors for communication between the partitions. The benefit of using a single language to express the entire design is that a programmer can easily experiment with many different hardware/software decompositions without needing to rewrite the application code. Used together, the BCL and interface compilers represent a comprehensive solution to the task of integrating specialized hardware into an application.
by Myron King. Ph.D.
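The two ideas in this abstract, nodes connected by FIFO edges and computation expressed as guarded atomic actions, can be pictured with a small Python sketch. It is not BCL or BSV syntax and does not reflect the thesis's compiler; the names `Rule`, `run`, and `double_all`, and the "first enabled rule fires" policy, are assumptions made for illustration.

```python
from collections import deque

class Rule:
    """A guarded atomic action: `action` may run only while `guard()` is true."""
    def __init__(self, name, guard, action):
        self.name, self.guard, self.action = name, guard, action

def run(rules, max_steps=1000):
    """Fire one enabled rule at a time until no rule is enabled."""
    fired = []
    for _ in range(max_steps):
        enabled = [r for r in rules if r.guard()]
        if not enabled:
            break
        rule = enabled[0]   # hypothetical policy: first enabled rule wins
        rule.action()       # runs as one indivisible step
        fired.append(rule.name)
    return fired

def double_all(items, capacity=1):
    """Two nodes joined by a bounded FIFO edge: a producer and a consumer."""
    source, fifo, sink = deque(items), deque(), []
    produce = Rule("produce",
                   lambda: bool(source) and len(fifo) < capacity,
                   lambda: fifo.append(2 * source.popleft()))
    consume = Rule("consume",
                   lambda: bool(fifo),
                   lambda: sink.append(fifo.popleft()))
    fired = run([produce, consume])
    return sink, fired
```

Calling `double_all([1, 2, 3])` moves each item through the FIFO. Since the two rules interact only through the FIFO, either side could in principle be placed in hardware or software without changing the other, which is the kind of repartitioning freedom the abstract argues for.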

DSpace@MIT

Image Processing Using FPGAs

Author: Bailey Donald
Publication venue: 'MDPI AG'
Publication date: 01/01/2019
Field of study

This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state of the art in image processing using FPGAs.

Directory of Open Access Books (DOAB)

Incremental parallel and distributed systems

Author: Bhatotia Pramod Kumar
Publication venue: Sonstige Einrichtungen. Max-Planck-Institut für Informatik
Publication date: 01/01/2015
Field of study

Incremental computation strives for efficient successive runs of applications by re-executing only those parts of the computation that are affected by a given input change instead of recomputing everything from scratch. To realize the benefits of incremental computation, researchers and practitioners are developing new systems where the application programmer can provide an efficient update mechanism for changing application data. Unfortunately, most of the existing solutions are limiting because they not only depart from existing programming models, but also require programmers to devise an incremental update mechanism (or a dynamic algorithm) on a per-application basis. In this thesis, we present incremental parallel and distributed systems that enable existing real-world applications to automatically benefit from efficient incremental updates. Our approach neither requires departure from current models of programming, nor the design and implementation of dynamic algorithms. To achieve these goals, we have designed and built the following incremental systems: (i) Incoop, a system for incremental MapReduce computation; (ii) Shredder, a GPU-accelerated system for incremental storage; (iii) Slider, a stream processing platform for incremental sliding window analytics; and (iv) iThreads, a threading library for parallel incremental computation. Our experience with these systems shows that significant performance gains can be achieved for existing applications without requiring any additional effort from programmers.
German abstract (translated): Incremental computation enables more efficient execution of successive application runs by re-executing only those parts of the application that are affected by changes to the input data. This approach contrasts with the conventional one of recomputing everything from scratch. To exploit the benefits of incremental computation, both academia and industry are developing new systems in which the application programmer provides the efficient update mechanism for changes to the application data. Unfortunately, existing solutions are mostly of limited applicability because they retain the conventional programming model and thereby require the programmer to develop the incremental update mechanism (or a dynamic algorithm) anew for each application. This dissertation presents incremental parallel and distributed systems that enable existing real-world applications to benefit automatically from incremental computation. Our approach requires neither a departure from current programming models nor the design and implementation of application-specific dynamic algorithms. To achieve this goal, we have designed and implemented the following systems for incremental parallel and distributed computation: (i) Incoop, a system for incremental MapReduce programs; (ii) Shredder, a GPU-accelerated system for incremental storage; (iii) Slider, a platform for batch-based stream processing via incremental sliding-window computation; and (iv) iThreads, a threading library for parallel incremental computation. Our experience with these systems shows that our methods can deliver very good performance without additional effort from the programmer.
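The core idea of re-executing only the parts of a computation affected by an input change can be sketched with memoized per-chunk results. This is a generic illustration and not the mechanism of Incoop, Shredder, Slider, or iThreads; the class name `IncrementalWordCount`, the content-hash keying, and the `map_calls` counter are assumptions introduced for the example.

```python
import hashlib

class IncrementalWordCount:
    """Word count that reuses per-chunk results across successive runs."""

    def __init__(self):
        self._memo = {}      # chunk digest -> per-chunk word counts
        self.map_calls = 0   # how many chunks were actually recomputed

    def _map(self, chunk):
        self.map_calls += 1
        counts = {}
        for word in chunk.split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    def run(self, chunks):
        total = {}
        for chunk in chunks:
            key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            if key not in self._memo:          # only changed chunks are re-mapped
                self._memo[key] = self._map(chunk)
            for word, n in self._memo[key].items():
                total[word] = total.get(word, 0) + n
        return total
```

Running the counter on three chunks and then again with one chunk modified recomputes only the modified chunk; the application code itself (`_map`) is an ordinary non-incremental function, which mirrors the abstract's goal of not requiring a hand-written dynamic algorithm.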

Universaar

MPG.PuRe

Acronym

Guided Automatic Binary Parallelisation

Author: Zhou Ruoyu
Publication venue: University of Cambridge
Publication date: 06/04/2018
Field of study

For decades, the software industry has amassed a vast repository of pre-compiled libraries and executables which are still valuable and actively in use. However, for a significant fraction of these binaries, most of the source code is absent or is written in old languages, making it practically impossible to recompile them for new generations of hardware. As the number of cores in chip multi-processors (CMPs) continues to scale, the performance of this legacy software becomes increasingly sub-optimal. Rewriting new optimised and parallel software would be a time-consuming and expensive task. Without source code, existing automatic performance enhancing and parallelisation techniques are not applicable to legacy software or to parts of new applications linked with legacy libraries. In this dissertation, three tools are presented to address the challenge of optimising legacy binaries. The first, GBR (Guided Binary Recompilation), is a tool that recompiles stripped application binaries without the need for the source code or relocation information. GBR performs static binary analysis to determine how recompilation should be undertaken, and produces a domain-specific hint program. This hint program is loaded and interpreted by the GBR dynamic runtime, which is built on top of the open-source dynamic binary translator, DynamoRIO. In this manner, complicated recompilation of the target binary is carried out to achieve optimised execution on a real system. The problem of limited dataflow and type information is addressed through cooperation between the hint program and JIT optimisation. The utility of GBR is demonstrated by software prefetch and vectorisation optimisations that achieve performance improvements compared to the original native execution. The second tool is called BEEP (Binary Emulator for Estimating Parallelism), an extension to GBR for binary instrumentation.
BEEP is used to identify potential thread-level parallelism through static binary analysis and binary instrumentation. BEEP performs preliminary static analysis on binaries and encodes all statically-undecided questions into a hint program. The hint program is interpreted by GBR so that on-demand binary instrumentation code is inserted to answer the questions from runtime information. BEEP incorporates a few parallel cost models to evaluate identified parallelism under different parallelisation paradigms. The third tool is named GABP (Guided Automatic Binary Parallelisation), an extension to GBR for parallelisation. GABP focuses on loops from sequential application binaries and automatically extracts thread-level parallelism from them on-the-fly, under the direction of the hint program, for efficient parallel execution. It employs a range of runtime schemes, such as thread-level speculation and synchronisation, to handle runtime data dependences. GABP achieves a geometric-mean speedup of 1.91x on binaries from SPEC CPU2006 on a real x86-64 eight-core system compared to native sequential execution. Performance is obtained for SPEC CPU2006 executables compiled from a variety of source languages and by different compilers.
St John's Benefactor Scholarship, ARM Sponsorship
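The abstract summarises benchmark results as a geometric mean of speedups. A minimal sketch of that arithmetic follows; the function name `geometric_mean` is an assumption, and the thesis's per-benchmark speedups are not reproduced here, so any values passed in are purely illustrative.

```python
import math

def geometric_mean(speedups):
    """Geometric mean of positive per-benchmark speedup ratios."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive ratios")
    # exp(mean(log(s))) equals the n-th root of the product of the ratios.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 2x slowdown cancel to 1.0, whereas their arithmetic mean would be 1.25.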

Apollo (Cambridge)

Parallelism and the software-hardware interface in embedded systems

Author: Chouliaras V A
Publication venue
Publication date: 01/01/2005
Field of study

This thesis by publications addresses issues in the architecture and microarchitecture of next generation, high performance streaming Systems-on-Chip through quantifying the most important forms of parallelism in current and emerging embedded system workloads. The work consists of three major research tracks, relating to data level parallelism, thread level parallelism and the software-hardware interface, which together reflect the research interests of the author as they have been formed in the last nine years. Published works confirm that parallelism at the data level is widely accepted as the most important performance leverage for the efficient execution of embedded media and telecom applications and has been exploited via a number of approaches, the most efficient being vector/SIMD architectures. A further, complementary and substantial form of parallelism exists at the thread level but this has not been researched to the same extent in the context of embedded workloads. For the efficient execution of such applications, exploitation of both forms of parallelism is of paramount importance. This calls for a new architectural approach in the software-hardware interface as its rigidity, manifested in all desktop-based and the majority of embedded CPUs, directly affects the performance of vectorized, threaded codes. The author advocates a holistic, mature approach where parallelism is extracted via automatic means while at the same time, the traditionally rigid hardware-software interface is optimized to match the temporal and spatial behaviour of the embedded workload. This ultimate goal calls for the precise study of these forms of parallelism for a number of applications executing on theoretical models such as instruction set simulators and parallel RAM machines as well as the development of highly parametric microarchitectural frameworks to encapsulate that functionality.
EThOS - Electronic Theses Online Service, GB, United Kingdom

Loughborough University Institutional Repository

OpenGrey Repository

A novel parallel algorithm for surface editing and its FPGA implementation

Author: Liu Yukun
Publication venue: University of Bedfordshire
Publication date: 01/09/2013
Field of study

A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy.
Surface modelling and editing is one of the important subjects in computer graphics. Decades of research in computer graphics have been carried out on both low-level, hardware-related algorithms and high-level, abstract software. The success of computer graphics has been seen in many application areas, such as multimedia, visualisation, virtual reality and the Internet. However, the hardware realisation of the OpenGL architecture based on an FPGA (field programmable gate array) is beyond the scope of most computer graphics research. It is an uncultivated research area where the OpenGL pipeline, from hardware through the whole embedded system (ES) up to applications, is implemented in an FPGA chip. This research proposes a hybrid approach to investigating both software and hardware methods. It aims at bridging the gap between software and hardware methods, and enhancing the overall performance for computer graphics. It consists of four parts: the construction of an FPGA-based ES, a Mesa-OpenGL implementation for FPGA-based ESs, parallel processing, and a novel algorithm for surface modelling and editing. The FPGA-based ES is built up. In addition to the Nios II soft processor and DDR SDRAM memory, it consists of the LCD display device, frame buffers, video pipeline, and algorithm-specified module to support the graphics processing. Since there is no implementation of OpenGL ES available for FPGA-based ESs, a specific OpenGL implementation based on Mesa is carried out. Because of the limited FPGA resources, the implementation adopts fixed-point arithmetic, which offers faster computation and lower storage than floating-point arithmetic while retaining accuracy that satisfies the needs of 3D rendering. Moreover, the implementation includes Bézier-spline curve and surface algorithms to support surface modelling and editing.
Pipelined parallelism and co-processors are used to accelerate graphics processing in this research. These two parallelism methods extend the traditional computation parallelism in fine-grained parallel tasks in the FPGA-based ESs. The novel algorithm for surface modelling and editing, called Progressive and Mixing Algorithm (PAMA), is proposed and implemented on FPGA-based ESs. Compared with two main surface editing methods, subdivision and deformation, the PAMA can eliminate the large storage requirement and computing cost of intermediate processes. With four independent shape parameters, the PAMA can be used to freely model and edit the shape of an open or closed surface that globally keeps zero-order geometric continuity. The PAMA can be applied independently not only to FPGA-based ESs but also to other platforms. With parallel processing, small size, and low costs of computing, storage and power, the FPGA-based ES provides an effective hybrid solution to surface modelling and editing.
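The combination of fixed-point arithmetic and Bézier evaluation mentioned in this abstract can be sketched as follows. This is not the PAMA algorithm or the thesis's Mesa-based implementation: the Q16.16 format, the helper names, and the use of de Casteljau's algorithm are assumptions chosen to keep the illustration short.

```python
FRAC_BITS = 16            # assumed Q16.16 format: 16 integer bits, 16 fraction bits
ONE = 1 << FRAC_BITS

def to_fixed(x):
    """Convert a float to a fixed-point integer."""
    return int(round(x * ONE))

def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / ONE

def fx_mul(a, b):
    """Multiply two fixed-point values; the shift restores the scale."""
    return (a * b) >> FRAC_BITS

def bezier_fixed(ctrl, t):
    """Evaluate a Bezier curve at parameter t with de Casteljau's algorithm.

    ctrl: control values (one coordinate) in fixed point.
    t:    curve parameter in fixed point, between 0 and ONE.
    """
    pts = list(ctrl)
    while len(pts) > 1:
        # Repeated linear interpolation between neighbouring control values.
        pts = [p + fx_mul(t, q - p) for p, q in zip(pts, pts[1:])]
    return pts[0]
```

Only integer additions, multiplications, and shifts are used, which is the property that makes fixed-point evaluation attractive on resource-limited FPGA fabric. The price is a bounded rounding error per multiplication, so the number of fraction bits has to be chosen to suit the rendering precision required.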

University of Bedfordshire Repository

Behaviour analysis in binary SoC data

Author: Mcewan Dave
Publication venue
Publication date: 21/06/2022
Field of study

Explore Bristol Research

Stable Multithreading: A New Paradigm for Reliable and Secure Threads

Author: Cui Heming
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2015
Field of study

Multithreaded programs have become pervasive and critical due to the rise of multicore hardware and accelerating computational demand. Unfortunately, despite decades of research and engineering effort, these programs remain notoriously difficult to get right, and they are plagued with harmful concurrency bugs that can cause wrong outputs, program crashes, security breaches, and so on. Our research reveals that a root cause of this difficulty is that multithreaded programs have too many possible thread interleavings (or schedules) at runtime. Even given only a single input, a program may run into a great number of schedules, depending on factors such as hardware timing and OS scheduling. Considering all inputs, the number of schedules is much greater still. It is extremely challenging to understand, test, analyze, or verify this huge number of schedules for a multithreaded program and make sure that all these schedules are free of concurrency bugs. Thus, multithreaded programs are extremely difficult to get right. To reduce the number of possible schedules for all inputs, we looked into the relation between inputs and schedules of real-world programs, and made an exciting discovery: many programs need only a small set of schedules to efficiently process a wide range of inputs! Leveraging this discovery, we have proposed a new idea called Stable Multithreading (or StableMT) that reuses each schedule on a wide range of inputs, greatly reducing the number of possible schedules for all inputs. By addressing the root cause that makes multithreading difficult to get right, StableMT makes understanding, testing, analyzing, and verification of multithreaded programs much easier. To realize StableMT, we have built three StableMT systems, TERN, PEREGRINE, and PARROT, with each addressing a distinct research challenge.
Evaluation on a wide range of 108 popular multithreaded programs with our latest StableMT system, PARROT, shows that StableMT is simple, fast, and deployable. All of PARROT's source code, the entire set of benchmarks, and raw evaluation results are available at http://github.com/columbia/smt-mc. To encourage deployment, we have applied StableMT to improve several reliability techniques, including: (1) making reproducing real-world concurrency bugs much easier; (2) greatly improving the precision of static program analysis, leading to the detection of several new harmful data races in heavily tested programs; and (3) greatly increasing the coverage of model checking, a systematic testing technique, by many orders of magnitude. StableMT has attracted the research community's interest, and some techniques and ideas in our StableMT systems have been leveraged by other researchers to compute a small set of schedules to cover all or most inputs for multithreaded programs.
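The claim that multithreaded programs have too many possible schedules can be made concrete with a short counting sketch. It assumes a simplified model in which each thread executes a fixed sequence of atomic operations and any interleaving that preserves per-thread order is a distinct schedule; the function name `count_interleavings` is hypothetical, and the model is not taken from the thesis.

```python
import math

def count_interleavings(ops_per_thread):
    """Number of ways to interleave threads with the given operation counts.

    This is the multinomial coefficient (n1 + n2 + ...)! / (n1! * n2! * ...),
    computed as a product of binomial coefficients.
    """
    remaining = sum(ops_per_thread)
    count = 1
    for n in ops_per_thread:
        count *= math.comb(remaining, n)   # choose slots for this thread's ops
        remaining -= n
    return count
```

Under this model, two threads with only ten operations each already admit `count_interleavings([10, 10])` = 184756 schedules, and the count grows combinatorially with more threads or longer runs. That growth is the motivation the abstract gives for reusing a small set of schedules across many inputs.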

Columbia University Academic Commons