62,005 research outputs found

    FPGA implementations for parallel multidimensional filtering algorithms

    PhD Thesis. One- and multi-dimensional raw data collections contain noise and artifacts, and must be recovered from these degradations by an automated filtering system before further machine analysis. Automating a wide range of filtering applications requires generic filtering architectures, together with multidimensional and extensive convolution operators. Consequently, the aim of this thesis is to investigate the automated construction of a generic parallel filtering system. To this end, performance-efficient FPGA implementation architectures are developed to realize parallel one- and multi-dimensional filtering algorithms. The proposed generic architectures provide a mechanism for fast FPGA prototyping of high-performance computations, yielding the performance indices of area, speed, dynamic power, throughput and computation rate as a complete package. These parallel filtering algorithms and their automated generic architectures tackle the major bottlenecks and limitations of existing multiprocessor systems in wordlength, input data segmentation, boundary conditions and inter-processor communication, in order to support high-data-throughput, real-time applications of low-power architectures using a Xilinx Virtex-6 FPGA board. For the one-dimensional raw signal filtering case, the mathematical model and architectural development of the generalized parallel 1-D filtering algorithms are presented using the 1-D block filtering method. Five generic architectures are implemented on a Virtex-6 ML605 board, evaluated and compared. A complete set of results on area, speed, power, throughput and computation rate is obtained and discussed as performance indices for the 1-D convolution architectures. A successful application of parallel 1-D cross-correlation is demonstrated.
For the two-dimensional greyscale/colour image processing cases, new parallel 2-D/3-D filtering algorithms are presented and mathematically modelled using input decimation and output image reconstruction by interpolation. Ten generic architectures are implemented on the Virtex-6 ML605 board, evaluated and compared. Key results on area, speed, power, throughput and computation rate are obtained and discussed as performance indices for the 2-D convolution architectures. 2-D image reconfigurable processors are developed and implemented using single, dual and quad MAC FIR units. 3-D colour image processors are devised to act as 3-D colour filtering engines. A 2-D cross-correlator parallel engine is successfully developed as a parallel 2-D matched filtering algorithm for locating any MRI slice within an MRI data stack library. Twelve 3-D MRI filtering operators are plugged in and adapted for biomedical imaging, including 3-D edge operators and 3-D noise smoothing operators. Since three-dimensional greyscale/colour volumetric image applications are computationally intensive, a new parallel 3-D/4-D filtering algorithm is presented and mathematically modelled using volumetric data image segmentation by decimation and output reconstruction by interpolation, after the 3-D filtering has been performed simultaneously and independently on each segment. Eight generic architectures are developed and implemented on the Virtex-6 board, including 3-D spatial and FFT convolution architectures. Fourteen 3-D MRI filtering operators are plugged in and adapted for this particular biomedical imaging application, including 3-D edge operators and 3-D noise smoothing operators. Three successful applications are presented: 4-D colour MRI (fMRI) filtering processors, a k-space MRI volume data filter and a 3-D cross-correlator. IRAQI Government
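
The idea behind block-based 1-D filtering can be illustrated with a minimal sketch. It assumes an overlap-add scheme, which is one common block filtering method; the abstract does not say which variant the thesis uses, and the names `convolve_direct` and `convolve_blocks` are hypothetical. Each block is convolved independently, which is the step a parallel architecture could assign to separate processing units, and the overlapping tails are then summed.

```python
def convolve_direct(x, h):
    """Plain 1-D linear convolution of sequence x with kernel h."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def convolve_blocks(x, h, block_len):
    """Overlap-add block filtering (assumed scheme, not taken from the thesis).

    The input is cut into blocks of block_len samples. Every block is
    convolved on its own, so the per-block work has no data dependencies
    and could run in parallel. The partial results overlap by
    len(h) - 1 samples and are added into the output.
    """
    y = [0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        partial = convolve_direct(block, h)  # independent per block
        for k, value in enumerate(partial):
            y[start + k] += value
    return y


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, -1, 2]
# Both calls produce the same output sequence.
print(convolve_direct(signal, kernel))
print(convolve_blocks(signal, kernel, block_len=3))
```

The block length trades the number of independent work items against the amount of overlap that has to be merged afterwards.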

    Acceleration of stereo-matching on multi-core CPU and GPU

    This paper presents an accelerated version of a dense stereo-correspondence algorithm for two different parallelism-enabled architectures, multi-core CPU and GPU. The algorithm is part of the vision system developed for a binocular robot-head in the context of the CloPeMa research project. This research project focuses on the conception of a new clothes-folding robot with real-time and high-resolution requirements for the vision system. The performance analysis shows that the parallelised stereo-matching algorithm has been significantly accelerated, with speed-ups of 12x and 176x for multi-core CPU and GPU respectively, compared with a non-SIMD single-thread CPU. To analyse the origin of the speed-up and gain a deeper understanding of the choice of the optimal hardware, the algorithm was broken into key sub-tasks and the performance was tested on four different hardware architectures.

    Fast and Scalable Architectures and Algorithms for the Computation of the Forward and Inverse Discrete Periodic Radon Transform with Applications to 2D Convolutions and Cross-Correlations

    The Discrete Radon Transform (DRT) is an essential component of a wide range of applications in image processing, e.g. image denoising, image restoration, texture analysis, line detection, encryption, compressive sensing and reconstructing objects from projections in computed tomography and magnetic resonance imaging. A popular method to obtain the DRT, or its inverse, involves the use of the Fast Fourier Transform, with the inherent approximation/rounding errors and increased hardware complexity due to the need for floating-point arithmetic implementations. An alternative implementation of the DRT is through the use of the Discrete Periodic Radon Transform (DPRT). The DPRT also exhibits discrete properties of the continuous-space Radon Transform, including the Fourier Slice Theorem and the convolution property. Unfortunately, the use of the DPRT has been limited by the need to compute a large number of additions, O(N^3), and the need for a large number of memory accesses. This PhD dissertation introduces a fast and scalable approach for computing the forward and inverse DPRT that is based on the use of: (i) a parallel array of fixed-point adder trees, (ii) circular shift registers to remove the need for accessing external memory components when selecting the input data for the adder trees, and (iii) an image block-based approach to DPRT computation that can fit the proposed architecture to available resources. As a result, for an NxN image (N prime), the proposed approach can compute up to N^2 additions per clock cycle. Compared to previous approaches, the scalable approach provides the fastest known implementations for different amounts of computational resources. For the fastest case, I introduce optimized architectures that can compute the DPRT and its inverse in just 2N + ceil(log2 N) + 1 and 2N + 3(log2 N) + B + 2 clock cycles respectively, where B is the number of bits used to represent each input pixel. 
In comparison, the prior state-of-the-art method required N^2 + N + 1 clock cycles for computing the forward DPRT. For systems with limited resources, the resource usage can be reduced to O(N) with a running time of ceil(N/2)(N + 9) + N + 2 for the forward DPRT and ceil(N/2)(N + 2) + 3ceil(log2 N) + B + 4 for the inverse. The results also have important applications in the computation of fast convolutions and cross-correlations for large and non-separable kernels. For this purpose, I introduce fast algorithms and scalable architectures to compute 2-D linear convolutions/cross-correlations using the convolution property of the DPRT and fixed-point arithmetic to simplify the 2-D problem into a 1-D problem. An alternative system is also proposed for non-separable kernels with low rank using the LU decomposition. As a result, for implementations with enough resources, for an image and convolution kernel of size PxP, linear convolutions/cross-correlations can be computed in just 6N + 4 log2 N + 17 clock cycles for N = 2P-1. Finally, I also propose parallel algorithms to compute the forward and inverse DPRT using Graphics Processing Units (GPUs) and CPUs with multiple cores. The proposed algorithms are implemented on an Nvidia Maxwell GM204 GPU with 2048 cores @ 1367 MHz, 348 KB L1 cache (24 KB per multiprocessor), 2048 KB L2 cache (512 KB per memory controller) and 4 GB device memory, and compared against a serial implementation on an Intel Xeon E5-2630 CPU with 8 physical cores (16 logical processors via hyper-threading) @ 3.2 GHz, L1 cache 512K (32 KB instruction cache, 32 KB data cache, per core), L2 cache 2 MB (256 KB per core), L3 cache 20 MB (shared among all cores) and 32 GB of system memory. For the CPU, there is a tenfold speedup using 16 logical cores versus a single-core serial implementation. For the GPU, there is a 715-fold speedup compared to the serial implementation. For real-time applications, for a 1021x1021 image, the forward DPRT takes 11.5 ms and the inverse takes 11.4 ms.
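
To make the O(N^3) addition count concrete, the following is a minimal software sketch of the forward and inverse DPRT for an NxN image with N prime. It assumes one commonly used definition of the transform, R(m, d) = sum over i of f(i, (d + m*i) mod N) for m = 0..N-1, plus one extra projection of row sums; the dissertation's exact indexing convention may differ. The function names are hypothetical, and this serial code says nothing about the proposed hardware architectures.

```python
def forward_dprt(f):
    """Forward DPRT of an N x N integer image f (N assumed prime).

    Returns N + 1 projections of N values each. Every output value is a
    sum of N pixels, so the total work is on the order of N^3 additions.
    """
    n = len(f)
    r = [[sum(f[i][(d + m * i) % n] for i in range(n)) for d in range(n)]
         for m in range(n)]
    r.append([sum(f[d]) for d in range(n)])  # extra projection: row sums
    return r


def inverse_dprt(r):
    """Inverse DPRT under the same assumed definition.

    Uses f(i, j) = (sum_m R(m, (j - m*i) mod N) - S + R(N, i)) / N,
    where S is the sum of all pixels (the sum of any one projection).
    """
    n = len(r[0])
    total = sum(r[0])
    f = []
    for i in range(n):
        row = []
        for j in range(n):
            acc = sum(r[m][(j - m * i) % n] for m in range(n))
            row.append((acc - total + r[n][i]) // n)  # exact for integer images
        f.append(row)
    return f


image = [[(3 * i + 5 * j * j + i * j) % 11 for j in range(5)] for i in range(5)]
projections = forward_dprt(image)
print(inverse_dprt(projections) == image)
```

Only integer additions and one exact division by N are needed, which is consistent with the abstract's point that the DPRT avoids floating-point arithmetic.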

    Fourier Domain Decoding Algorithm of Non-Binary LDPC codes for Parallel Implementation

    For decoding non-binary low-density parity check (LDPC) codes, logarithm-domain sum-product (Log-SP) algorithms were proposed for reducing the quantization effects of the SP algorithm in conjunction with FFT. Since FFT is not applicable in the logarithm domain, the computations required at check nodes in the Log-SP algorithms are computationally intensive. What is worse, check nodes usually have higher degree than variable nodes. As a result, most of the decoding time is spent on check node computations, which leads to a bottleneck effect. In this paper, we propose a Log-SP algorithm in the Fourier domain. With this algorithm, the roles of variable nodes and check nodes are switched. The intensive computations are spread over lower-degree variable nodes, which can be efficiently calculated in parallel. Furthermore, we develop a fast calculation method for the estimated bits and syndromes in the Fourier domain. Comment: To appear in IEICE Trans. Fundamentals, vol.E93-A, no.11 November 201
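
The reason a Fourier transform helps at check nodes can be sketched in the probability domain. The example assumes a code over GF(2^p), where field addition is a bitwise XOR of symbol labels and the matching Fourier transform is the Walsh-Hadamard transform; the abstract does not state the field, and this is not the paper's log-domain algorithm. The function names are hypothetical. Combining two incoming messages at a check node is then an XOR-indexed convolution, which becomes a pointwise product after the transform.

```python
def wht(v):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of 2)."""
    v = list(v)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for j in range(start, start + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return v


def xor_convolve_direct(a, b):
    """c[k] = sum_i a[i] * b[i XOR k], computed in O(q^2) operations."""
    n = len(a)
    return [sum(a[i] * b[i ^ k] for i in range(n)) for k in range(n)]


def xor_convolve_wht(a, b):
    """Same result via transform, pointwise product, inverse transform."""
    n = len(a)
    product = [x * y for x, y in zip(wht(a), wht(b))]
    return [x / n for x in wht(product)]  # the WHT is its own inverse up to 1/n


# Two message vectors over GF(4), written as probabilities.
msg_a = [0.7, 0.1, 0.1, 0.1]
msg_b = [0.2, 0.5, 0.2, 0.1]
print(xor_convolve_direct(msg_a, msg_b))
print(xor_convolve_wht(msg_a, msg_b))
```

The transform uses subtractions and can produce negative values, which is one way to see why it does not carry over directly to a logarithm-domain representation.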

    Bisection of Bounded Treewidth Graphs by Convolutions

    In the Bisection problem, we are given as input an edge-weighted graph G. The task is to find a partition of V(G) into two parts A and B such that ||A| - |B|| <= 1 and the sum of the weights of the edges with one endpoint in A and the other in B is minimized. We show that the complexity of the Bisection problem on trees, and more generally on graphs of bounded treewidth, is intimately linked to the (min, +)-Convolution problem. Here the input consists of two sequences (a[i])^{n-1}_{i = 0} and (b[i])^{n-1}_{i = 0}, and the task is to compute the sequence (c[i])^{n-1}_{i = 0}, where c[k] = min_{i=0,...,k}(a[i] + b[k - i]). In particular, we prove that if (min, +)-Convolution can be solved in O(tau(n)) time, then Bisection of graphs of treewidth t can be solved in time O(8^t t^{O(1)} log n * tau(n)), assuming a tree decomposition of width t is provided as input. Plugging in the naive O(n^2) time algorithm for (min, +)-Convolution yields an O(8^t t^{O(1)} n^2 log n) time algorithm for Bisection. This improves over the (dependence on n of the) O(2^t n^3) time algorithm of Jansen et al. [SICOMP 2005] at the cost of a worse dependence on t. "Conversely", we show that if Bisection can be solved in time O(beta(n)) on edge-weighted trees, then (min, +)-Convolution can be solved in O(beta(n)) time as well. Thus, obtaining a sub-quadratic algorithm for Bisection on trees is extremely challenging, and could even be impossible. On the other hand, for unweighted graphs of treewidth t, by making use of a recent algorithm for Bounded Difference (min, +)-Convolution of Chan and Lewenstein [STOC 2015], we obtain a sub-quadratic algorithm for Bisection with running time O(8^t t^{O(1)} n^{1.864} log n).
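
The (min, +)-Convolution defined above can be written down directly. The following is a minimal sketch of the naive O(n^2) algorithm mentioned in the abstract; the function name `min_plus_convolution` is hypothetical, and the two input sequences are assumed to have equal length n.

```python
def min_plus_convolution(a, b):
    """Naive (min, +)-convolution: c[k] = min over i in 0..k of a[i] + b[k - i].

    Two nested loops over sequences of length n give O(n^2) time.
    """
    n = len(a)
    return [min(a[i] + b[k - i] for i in range(k + 1)) for k in range(n)]


a = [0, 2, 5]
b = [1, 1, 4]
print(min_plus_convolution(a, b))  # prints [1, 1, 3]
```

For k = 2 the candidates are 0 + 4, 2 + 1 and 5 + 1, so c[2] = 3.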