232 research outputs found

    Efficient parallel computation on multiprocessors with optical interconnection networks

    Get PDF
    This dissertation studies optical interconnection networks, their architecture, address schemes, and computation and communication capabilities. We focus on a simple but powerful optical interconnection network model - the Linear Array with Reconfigurable pipelined Bus System (LARPBS). We extend the LARPBS model to a simplified higher dimensional LAPRBS and provide a set of basic computation operations. We then study the following two groups of parallel computation problems on both one dimensional LARPBS\u27s as well as multi-dimensional LARPBS\u27s: parallel comparison problems, including sorting, merging, and selection; Boolean matrix multiplication, transitive closure and their applications to connected component problems. We implement an optimal sorting algorithm on an n-processor LARPBS. With this optimal sorting algorithm at disposal, we study the sorting problem for higher dimensional LARPBS\u27s and obtain the following results: • An optimal basic Columnsort algorithm on a 2D LARPBS. • Two optimal two-way merge sort algorithms on a 2D LARPBS. • An optimal multi-way merge sorting algorithm on a 2D LARPBS. • An optimal generalized column sort algorithm on a 2D LARPBS. • An optimal generalized column sort algorithm on a 3D LARPBS. • An optimal 5-phase sorting algorithm on a 3D LARPBS. Results for selection problems are as follows: • A constant time maximum-finding algorithm on an LARPBS. • An optimal maximum-finding algorithm on an LARPBS. • An O((log log n)2) time parallel selection algorithm on an LARPBS. • An O(k(log log n)2) time parallel multi-selection algorithm on an LARPBS. While studying the computation and communication properties of the LARPBS model, we find Boolean matrix multiplication and its applications to the graph are another set of problem that can be solved efficiently on the LARPBS. Following is a list of results we have obtained in this area. • A constant time Boolean matrix multiplication algorithm. • An O(log n)-time transitive closure algorithm. • An O(log n)-time connected components algorithm. • An O(log n)-time strongly connected components algorithm. The results provided in this dissertation show the strong computation and communication power of optical interconnection networks

    Performance and Power Optimization of Multi-kernel Applications on Multi-FPGA Platforms

    Get PDF
    L'abstract è presente nell'allegato / the abstract is in the attachmen

    Numerical solutions of differential equations on FPGA-enhanced computers

    Get PDF
    Conventionally, to speed up scientific or engineering (S&E) computation programs on general-purpose computers, one may elect to use faster CPUs, more memory, systems with more efficient (though complicated) architecture, better software compilers, or even coding with assembly languages. With the emergence of Field Programmable Gate Array (FPGA) based Reconfigurable Computing (RC) technology, numerical scientists and engineers now have another option using FPGA devices as core components to address their computational problems. The hardware-programmable, low-cost, but powerful “FPGA-enhanced computer” has now become an attractive approach for many S&E applications. A new computer architecture model for FPGA-enhanced computer systems and its detailed hardware implementation are proposed for accelerating the solutions of computationally demanding and data intensive numerical PDE problems. New FPGAoptimized algorithms/methods for rapid executions of representative numerical methods such as Finite Difference Methods (FDM) and Finite Element Methods (FEM) are designed, analyzed, and implemented on it. Linear wave equations based on seismic data processing applications are adopted as the targeting PDE problems to demonstrate the effectiveness of this new computer model. Their sustained computational performances are compared with pure software programs operating on commodity CPUbased general-purpose computers. Quantitative analysis is performed from a hierarchical set of aspects as customized/extraordinary computer arithmetic or function units, compact but flexible system architecture and memory hierarchy, and hardwareoptimized numerical algorithms or methods that may be inappropriate for conventional general-purpose computers. The preferable property of in-system hardware reconfigurability of the new system is emphasized aiming at effectively accelerating the execution of complex multi-stage numerical applications. Methodologies for accelerating the targeting PDE problems as well as other numerical PDE problems, such as heat equations and Laplace equations utilizing programmable hardware resources are concluded, which imply the broad usage of the proposed FPGA-enhanced computers

    Efficient reconfigurable architectures for 3D medical image compression

    Get PDF
    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Recently, the more widespread use of three-dimensional (3-D) imaging modalities, such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and ultrasound (US) have generated a massive amount of volumetric data. These have provided an impetus to the development of other applications, in particular telemedicine and teleradiology. In these fields, medical image compression is important since both efficient storage and transmission of data through high-bandwidth digital communication lines are of crucial importance. Despite their advantages, most 3-D medical imaging algorithms are computationally intensive with matrix transformation as the most fundamental operation involved in the transform-based methods. Therefore, there is a real need for high-performance systems, whilst keeping architectures exible to allow for quick upgradeability with real-time applications. Moreover, in order to obtain efficient solutions for large medical volumes data, an efficient implementation of these operations is of significant importance. Reconfigurable hardware, in the form of field programmable gate arrays (FPGAs) has been proposed as viable system building block in the construction of high-performance systems at an economical price. Consequently, FPGAs seem an ideal candidate to harness and exploit their inherent advantages such as massive parallelism capabilities, multimillion gate counts, and special low-power packages. The key achievements of the work presented in this thesis are summarised as follows. Two architectures for 3-D Haar wavelet transform (HWT) have been proposed based on transpose-based computation and partial reconfiguration suitable for 3-D medical imaging applications. These applications require continuous hardware servicing, and as a result dynamic partial reconfiguration (DPR) has been introduced. Comparative study for both non-partial and partial reconfiguration implementation has shown that DPR offers many advantages and leads to a compelling solution for implementing computationally intensive applications such as 3-D medical image compression. Using DPR, several large systems are mapped to small hardware resources, and the area, power consumption as well as maximum frequency are optimised and improved. Moreover, an FPGA-based architecture of the finite Radon transform (FRAT)with three design strategies has been proposed: direct implementation of pseudo-code with a sequential or pipelined description, and block random access memory (BRAM)- based method. An analysis with various medical imaging modalities has been carried out. Results obtained for image de-noising implementation using FRAT exhibits promising results in reducing Gaussian white noise in medical images. In terms of hardware implementation, promising trade-offs on maximum frequency, throughput and area are also achieved. Furthermore, a novel hardware implementation of 3-D medical image compression system with context-based adaptive variable length coding (CAVLC) has been proposed. An evaluation of the 3-D integer transform (IT) and the discrete wavelet transform (DWT) with lifting scheme (LS) for transform blocks reveal that 3-D IT demonstrates better computational complexity than the 3-D DWT, whilst the 3-D DWT with LS exhibits a lossless compression that is significantly useful for medical image compression. Additionally, an architecture of CAVLC that is capable of compressing high-definition (HD) images in real-time without any buffer between the quantiser and the entropy coder is proposed. Through a judicious parallelisation, promising results have been obtained with limited resources. In summary, this research is tackling the issues of massive 3-D medical volumes data that requires compression as well as hardware implementation to accelerate the slowest operations in the system. Results obtained also reveal a significant achievement in terms of the architecture efficiency and applications performance.Ministry of Higher Education Malaysia (MOHE), Universiti Tun Hussein Onn Malaysia (UTHM) and the British Counci

    Reconfigurable Architecture For H.264/avc Variable Block Size Motion Estimation Based On Motion Activity And Adaptive Search Range

    Get PDF
    Motion Estimation (ME) technique plays a key role in the video coding systems to achieve high compression ratios by removing temporal redundancies among video frames. Especially in the newest H.264/AVC video coding standard, ME engine demands large amount of computational capabilities due to its support for wide range of different block sizes for a given macroblock in order to increase accuracy in finding best matching block in the previous frames. We propose scalable architecture for H.264/AVC Variable Block Size (VBS) Motion Estimation with adaptive computing capability to support various search ranges, input video resolutions, and frame rates. Hardware architecture of the proposed ME consists of scalable Sum of Absolute Difference (SAD) arrays which can perform Full Search Block Matching Algorithm (FSBMA) for smaller 4x4 blocks. It is also shown that by predicting motion activity and adaptively adjusting the Search Range (SR) on the reconfigurable hardware platform, the computational cost of ME required for inter-frame encoding in H.264/AVC video coding standard can be reduced significantly. Dynamic Partial Reconfiguration is a unique feature of Field Programmable Gate Arrays (FPGAs) that makes best use of hardware resources and power by allowing adaptive algorithm to be implemented during run-time. We exploit this feature of FPGA to implement the proposed reconfigurable architecture of ME and maximize the architectural benefits through prediction of motion activities in the video sequences ,adaptation of SR during run-time, and fractional ME refinement. The implemented ME architecture can support real time applications at a maximum frequency of 90MHz with multiple reconfigurable regions. iv When compared to reconfiguration of complete design, partial reconfiguration process results in smaller bitstream size which allows FPGA to implement different configurations at higher speed. The proposed architecture has modular structure, regular data flow, and efficient memory organization with lower memory accesses. By increasing the number of active partial reconfigurable modules from one to four, there is a 4 fold increase in data re-use. Also, by introducing adaptive SR reduction algorithm at frame level, the computational load of ME is reduced significantly with only small degradation in PSNR (≤0.1dB)

    Doctor of Philosophy

    Get PDF
    dissertationIn-memory big data applications are growing in popularity, including in-memory versions of the MapReduce framework. The move away from disk-based datasets shifts the performance bottleneck from slow disk accesses to memory bandwidth. MapReduce is a data-parallel application, and is therefore amenable to being executed on as many parallel processors as possible, with each processor requiring high amounts of memory bandwidth. We propose using Near Data Computing (NDC) as a means to develop systems that are optimized for in-memory MapReduce workloads, offering high compute parallelism and even higher memory bandwidth. This dissertation explores three different implementations and styles of NDC to improve MapReduce execution. First, we use 3D-stacked memory+logic devices to process the Map phase on compute elements in close proximity to database splits. Second, we attempt to replicate the performance characteristics of the 3D-stacked NDC using only commodity memory and inexpensive processors to improve performance of both Map and Reduce phases. Finally, we incorporate fixed-function hardware accelerators to improve sorting performance within the Map phase. This dissertation shows that it is possible to improve in-memory MapReduce performance by potentially two orders of magnitude by designing system and memory architectures that are specifically tailored to that end

    A Model-based Design Framework for Application-specific Heterogeneous Systems

    Get PDF
    The increasing heterogeneity of computing systems enables higher performance and power efficiency. However, these improvements come at the cost of increasing the overall complexity of designing such systems. These complexities include constructing implementations for various types of processors, setting up and configuring communication protocols, and efficiently scheduling the computational work. The process for developing such systems is iterative and time consuming, with no well-defined performance goal. Current performance estimation approaches use source code implementations that require experienced developers and time to produce. We present a framework to aid in the design of heterogeneous systems and the performance tuning of applications. Our framework supports system construction: integrating custom hardware accelerators with existing cores into processors, integrating processors into cohesive systems, and mapping computations to processors to achieve overall application performance and efficient hardware usage. It also facilitates effective design space exploration using processor models (for both existing and future processors) that do not require source code implementations to estimate performance. We evaluate our framework using a variety of applications and implement them in systems ranging from low power embedded systems-on-chip (SoC) to high performance systems consisting of commercial-off-the-shelf (COTS) components. We show how the design process is improved, reducing the number of design iterations and unnecessary source code development ultimately leading to higher performing efficient systems

    Synthesis Techniques for Semi-Custom Dynamically Reconfigurable Superscalar Processors

    Get PDF
    The accelerated adoption of reconfigurable computing foreshadows a computational paradigm shift, aimed at fulfilling the need of customizable yet high-performance flexible hardware. Reconfigurable computing fulfills this need by allowing the physical resources of a chip to be adapted to the computational requirements of a specific program, thus achieving higher levels of computing performance. This dissertation evaluates the area requirements for reconfigurable processing, an important yet often disregarded assessment for partial reconfiguration. Common reconfigurable computing approaches today attempt to create custom circuitry in static co-processor accelerators. We instead focused on a new approach that synthesized semi-custom general-purpose processor cores. Each superscalar processor core's execution units can be customized for a particular application, yet the processor retains its standard microprocessor interface. We analyzed the area consumption for these computational components by studying the synthesis requirements of different processor configurations. This area/performance assessment aids designers when constraining processing elements in a fixed-size area slot, a requirement for modern partial reconfiguration approaches. Our results provide a more deterministic evaluation of performance density, hence making the area cost analysis less ambiguous when optimizing dynamic systems for coarse-grained parallelism. The results obtained showed that even though performance density decreases with processor complexity, the additional area still provides a positive contribution to the aggregate parallel processing performance. This evaluation of parallel execution density contributes to ongoing efforts in the field of reconfigurable computing by providing a baseline for area/performance trade-offs for partial reconfiguration and multi-processor systems
    • …
    corecore