840 research outputs found

    Synthesis, structure and power of systolic computations

    A variety of problems related to systolic architectures, systems, models and computations are discussed. The emphasis is on theoretical problems of broader interest. The main motivations and interesting/important applications are also presented. The first part is devoted to problems related to the synthesis, transformation and simulation of systolic systems and architectures. In the second part, the power and structure of tree and linear array computations are studied in detail. The goal is to survey the main research directions, problems, methods and techniques in a not-too-formal way.
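The linear array computations surveyed here can be illustrated with a toy software simulation of a one-dimensional systolic array computing a sliding dot product; the function and data below are invented for illustration and are not taken from the survey.

```python
# Toy simulation of a one-dimensional systolic array computing a sliding
# dot product (correlation). PE i holds weight w[i]; one input enters per
# clock tick and partial sums shift right through the array.
# Illustrative sketch only; all names are hypothetical.

def systolic_correlate(x, w):
    n_pe = len(w)
    stream = x + [0] * (n_pe - 1)     # flush the pipeline with zeros
    acc = [0] * n_pe                  # partial sums held in the PEs
    out = []
    for t, xin in enumerate(stream):
        acc = [0] + acc[:-1]          # clock tick: shift partial sums right
        acc = [a + wi * xin for a, wi in zip(acc, w)]
        if t >= n_pe - 1:             # first result appears after fill latency
            out.append(acc[-1])
    return out

# y[k] = sum_i w[i] * x[k+i], with zero padding past the end of x
print(systolic_correlate([1, 2, 3], [1, 1]))   # [3, 5, 3]
```

The point of the exercise is that every PE does the same local work each tick and communicates only with its neighbor, which is exactly the operating regime the survey's theoretical questions are about.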

    Parallelization of dynamic programming recurrences in computational biology

    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications that analyze them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will be the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are common in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We propose a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array; this design technique improves on the state of the art, which builds latency-optimal arrays. We also propose a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3-GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
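As a point of reference for the recurrence being accelerated, here is a plain sequential sketch of the Nussinov algorithm (maximizing base-pair count). The helper names and minimal loop length are illustrative simplifications, not the dissertation's implementation:

```python
# Sequential Nussinov RNA folding sketch: dp[i][j] is the maximum number of
# base pairs in subsequence seq[i..j]. The accelerators in this work
# parallelize exactly this dependence structure; this toy version only
# shows the recurrence.

def can_pair(a, b):
    return {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"})

def nussinov(seq, min_loop=1):
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    # Fill by increasing span; dp[i][j] depends on dp[i+1][j], dp[i][j-1],
    # dp[i+1][j-1], and the split terms dp[i][k] + dp[k+1][j].
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])
            if can_pair(seq[i], seq[j]):
                best = max(best, dp[i + 1][j - 1] + 1)
            for k in range(i + 1, j):
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

print(nussinov("GGGAAACCC"))   # 3 (three nested G-C pairs)
```

The split term makes this recurrence non-uniform, which is why the polyhedral treatment described above is needed to expose its parallelism.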

    A Comprehensive Methodology for Algorithm Characterization, Regularization and Mapping Into Optimal VLSI Arrays.

    This dissertation provides a fairly comprehensive treatment of a broad class of algorithms as it pertains to systolic implementation. We describe formal algorithmic transformations that can be used to map regular, and some irregular, compute-bound algorithms into best-fit, time-optimal systolic architectures. The resulting architectures can be one-dimensional, two-dimensional, three-dimensional or nonplanar. The methodology detailed in the dissertation employs, like other methods, the concept of the dependence vector to order, in space and time, the index points representing the algorithm. However, by differentiating between two types of dependence vectors, the ordering procedure is allowed to be flexible and time-optimal. Furthermore, unlike other methodologies, the approach reported here does not put constraints on the topology or dimensionality of the target architecture. The ordered index points are represented by nodes in a diagram called the Systolic Precedence Diagram (SPD). The SPD is a form of precedence graph that takes into account the systolic operation requirements of strictly local communications and regular data flow. Therefore, any algorithm with variable dependence vectors has to be transformed into a regularly indexed set of computations with local dependencies. This can be done by replacing variable dependence vectors with sets of fixed dependence vectors. The SPD is transformed into an acyclic, labeled, directed graph called the Systolic Directed Graph (SDG). The SDG models the data flow as well as the timing for the execution of the given algorithm on a time-optimal array. The target architectures are obtained by projecting the SDG along defined directions. If more than one valid projection direction exists, different designs are obtained. The resulting architectures are then evaluated to determine whether an improvement in performance can be achieved by increasing PE fan-out. If so, the methodology provides the corresponding systolic implementation. By employing a new graph transformation, the SDG is manipulated so that it can be mapped into fixed-size and fixed-depth multi-linear arrays. The latter is a new concept of systolic array that is adaptable to changes in the state of technology. It promises bounded clock skew, higher throughput and better performance than the linear implementation.
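The core idea of ordering index points with a linear schedule and projecting them onto processors can be sketched in a few lines. This is generic space-time mapping, with vectors chosen for illustration rather than taken from the dissertation's SPD/SDG machinery:

```python
# Toy space-time mapping of a 2-D index space onto a linear array.
# Each index point p = (i, j) executes at time tau . p on processor
# sigma . p; the vectors below are illustrative choices, not the thesis's.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

tau = (1, 1)              # scheduling (time) vector
sigma = (1, 0)            # allocation vector: project along (0, 1) onto PE i
deps = [(1, 0), (0, 1)]   # fixed dependence vectors of the algorithm

# A schedule is valid only if every dependence advances time: tau . d >= 1.
assert all(dot(tau, d) >= 1 for d in deps)

N = 3
mapping = {(i, j): (dot(sigma, (i, j)), dot(tau, (i, j)))
           for i in range(N) for j in range(N)}
# No two index points may share both a PE and a time step.
assert len(set(mapping.values())) == len(mapping)
print(mapping[(2, 1)])   # (2, 3): PE 2, time step 3
```

Different valid choices of sigma correspond to the different projection directions mentioned above, each yielding a different target architecture.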

    Computer vision algorithms on reconfigurable logic arrays


    Application of constrained optimisation techniques in electrical impedance tomography

    A constrained optimisation technique is described for the reconstruction of temporal resistivity images. The approach solves the inverse problem by optimising a cost function under constraints, in the form of normalised boundary potentials. Mathematical models have been developed for two different data collection methods for the chosen criterion. Both of these models express the reconstructed image in terms of one-dimensional (1-D) Lagrange multiplier functions. The reconstruction problem becomes one of estimating these 1-D functions from the normalised boundary potentials. These models are based on a cost criterion that minimises the variance between the reconstructed resistivity distribution and the true resistivity distribution. The methods presented in this research extend the algorithms previously developed for X-ray systems. Computational efficiency is enhanced by exploiting the structure of the associated system matrices. The structure of the system matrices is preserved in the Electrical Impedance Tomography (EIT) implementations by applying a weighting, due to the non-linear current distribution, during the backprojection of the Lagrange multiplier functions. To obtain the best possible reconstruction it is important to consider the effects of noise in the boundary data. This is achieved by using a fast algorithm which matches the statistics of the error in the approximate inverse of the associated system matrix with the statistics of the noise error in the boundary data. This yields the optimum solution with the available boundary data. Novel approaches have been developed to produce the Lagrange multiplier functions. Two alternative methods are given for the design of VLSI implementations of hardware accelerators to improve computational efficiency. These accelerators are designed to implement parallel geometries and are modelled using a verification description language to assess their performance capabilities.
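The role the Lagrange multipliers play, being estimated from the boundary data and then backprojected into an image, can be illustrated with a tiny minimum-norm least-squares toy. The matrix sizes and names are invented, and this is not the thesis's weighted EIT algorithm:

```python
# Minimum-norm reconstruction via Lagrange multipliers (illustrative toy,
# not the thesis's EIT backprojection): minimize ||x||^2 subject to A x = b.
# Stationarity of the Lagrangian gives x = A^T m with (A A^T) m = b, so the
# multipliers m are computed from the boundary data b alone and then
# "backprojected" through A^T to form the image.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 8))   # 3 boundary measurements, 8 image pixels
b = rng.standard_normal(3)        # synthetic normalised boundary potentials

m = np.linalg.solve(A @ A.T, b)   # Lagrange multipliers
x = A.T @ m                       # reconstructed image

assert np.allclose(A @ x, b)      # constraints are satisfied exactly
```

In the underdetermined setting shown here (more pixels than measurements), the multipliers are the only unknowns that must be solved for, which is the structural point the thesis exploits.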

    Optical simulation study for high resolution monolithic detector design for TB-PET

    Background: The main limitations in positron emission tomography (PET) are its limited sensitivity and relatively poor spatial resolution. The administered radioactive dose and scan time could be reduced by increasing system sensitivity with a total-body (TB) PET design. The second limitation, spatial resolution, mainly originates from the specific design of the detectors that are implemented. In state-of-the-art scanners, the detectors consist of pixelated crystal arrays, where each individual crystal is isolated from its neighbors with reflector material. To obtain higher spatial resolution the crystals can be made narrower, which inevitably leads to more inter-crystal scatter and larger dead space between the crystals. A monolithic detector design shows superior characteristics in (i) light collection efficiency (no gaps), (ii) timing, as it significantly reduces the number of reflections and therefore the path length of each scintillation photon, and (iii) spatial resolution (including better depth-of-interaction (DOI) estimation). The aim of this work is to develop a precise simulation model based on measured crystal data and to use this powerful tool to find the limits in spatial resolution for a monolithic detector for use in TB-PET.
    Materials and methods: A detector (Fig. 1) based on a monolithic 50x50x16 mm3 lutetium-(yttrium) oxyorthosilicate (L(Y)SO) scintillation crystal coupled to an 8x8 array of 6x6 mm2 silicon photomultipliers (SiPMs) is simulated with GATE. A recently implemented reflection model for scintillation light allows simulations based on measured surface data (1). The modeled surfaces include a black painted rough finish on the crystal sides (16x50 mm2) and a specular reflector attached to a polished crystal top (50x50 mm2). Maximum likelihood estimation (MLE) is used for positioning the events. Calibration data is therefore obtained by generating 3,000 photoelectric events at given calibration positions (Fig. 1). Compton scatter is not (yet) included. In a next step, the calibration data is organized into three layers based on the exact depth coordinate in the crystal (i.e. DOI assumed to be known). For evaluating the resolution, the full width at half maximum (FWHM) is estimated at the irradiated positions of Fig. 2 as a mean of all profiles in the vertical and horizontal directions. Next, uniformity is evaluated by simulating 200k events from a flood source placed in the calibrated area.
    Results: For the irradiation pattern in Fig. 2, the resolution in terms of FWHM when applying MLE is 0.86±0.13 mm (Fig. 3a). Nevertheless, there are major artifacts, also at non-irradiated positions. By positioning the events based on three DOI-based layers it can be seen that the events closest to the photodetector introduce the largest artifacts (Fig. 3b-d). The FWHM improves for Layers 1 and 2, to 0.69±0.04 mm and 0.59±0.02 mm, respectively. Layer 3 introduces major artifacts into the flood map, as events are positioned at completely different locations from the initial irradiation; a FWHM estimate is thus not useful. The uniformity (Fig. 4) degrades with proximity to the photodetector. The map in Fig. 4c shows that the positioning accuracy depends not only on DOI but also on the position in the plane parallel to the photodetector array.
    Conclusions: A simulation model for a monolithic PET detector with good characteristics for TB-PET systems was developed with GATE. A first estimate of the spatial resolution and uniformity was given, pointing out the importance of depth-dependent effects. Future studies will include several steps towards more realistic simulations, e.g. surface measurements of our specific crystals for the optical surface model and inclusion of the Compton effect.
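The FWHM figure of merit used above can be illustrated with a small profile-based estimator; the linear-interpolation scheme and the synthetic Gaussian profile below are generic choices, not necessarily the analysis used in this work:

```python
# Estimating FWHM from a 1-D response profile by linear interpolation at
# half maximum. Illustrative sketch with a synthetic Gaussian profile whose
# true FWHM is set to 0.86 mm (the value reported above).
import numpy as np

def fwhm(x, y):
    half = y.max() / 2.0
    above = np.where(y >= half)[0]
    i, j = above[0], above[-1]
    # interpolate the half-maximum crossing on each flank
    left = x[i - 1] + (half - y[i - 1]) * (x[i] - x[i - 1]) / (y[i] - y[i - 1])
    right = x[j] + (half - y[j]) * (x[j + 1] - x[j]) / (y[j + 1] - y[j])
    return right - left

x = np.linspace(-5, 5, 1001)                  # position grid in mm
sigma = 0.86 / (2 * np.sqrt(2 * np.log(2)))   # so that FWHM = 0.86 mm
y = np.exp(-x**2 / (2 * sigma**2))
print(round(fwhm(x, y), 3))   # 0.86
```

On real profiles the same estimator is simply averaged over the vertical and horizontal directions at each irradiated position.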

    Formal synthesis of control signals for systolic arrays


    An Intelligent Architecture Based on Field Programmable Gate Arrays Designed to Detect Moving Objects by Using Principal Component Analysis

    This paper presents a complete implementation of the Principal Component Analysis (PCA) algorithm on Field Programmable Gate Array (FPGA) devices, applied to high-rate background segmentation of images. The classical sequential execution of the different parts of the PCA algorithm has been parallelized. This parallelization has led to the specific development and implementation in hardware of the different stages of PCA, such as computation of the correlation matrix, matrix diagonalization using the Jacobi method, and subspace projection of images. On the application side, the paper presents a motion detection algorithm, also entirely implemented on the FPGA and based on the developed PCA core. Detection consists of dynamically thresholding the differences between the input image and its reconstruction from the PCA linear subspace previously obtained as a background model. The proposal achieves a high processing rate (up to 120 frames per second) and high-quality segmentation results, with a completely embedded and reliable hardware architecture based on commercial CMOS sensors and FPGA devices.
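A software analogue of this pipeline, learn a PCA background subspace and then threshold the reconstruction error, can be sketched in numpy. The frame sizes, threshold and helper names are invented, and the paper's Jacobi-based FPGA diagonalization is replaced here by numpy's eigensolver:

```python
# PCA background model + thresholded reconstruction error (toy numpy
# analogue of the paper's FPGA pipeline; all parameters are illustrative).
import numpy as np

rng = np.random.default_rng(1)
h = w = 16
frames = rng.normal(100.0, 2.0, (50, h * w))   # 50 synthetic background frames

mean = frames.mean(axis=0)
X = frames - mean
C = X.T @ X / len(X)                 # correlation/covariance matrix
eigval, eigvec = np.linalg.eigh(C)   # (the paper diagonalizes with Jacobi)
U = eigvec[:, -8:]                   # top-8 principal components

def segment(img, thresh=10.0):
    d = img.ravel() - mean
    recon = U @ (U.T @ d)            # reconstruction within the subspace
    return (np.abs(d - recon) > thresh).reshape(h, w)

test = frames[0].reshape(h, w).copy()
test[4:8, 4:8] += 60.0               # inject a bright "moving object"
mask = segment(test)
print(mask[4:8, 4:8].all())          # True: the object region is flagged
```

Background pixels reconstruct well inside the subspace and fall below the threshold, while the injected object produces a large residual, which is the same decision rule the paper implements in hardware.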

    Gen-NeRF: Efficient and Generalizable Neural Radiance Fields via Algorithm-Hardware Co-Design

    Novel view synthesis is an essential functionality for enabling immersive experiences in various Augmented- and Virtual-Reality (AR/VR) applications, for which generalizable Neural Radiance Fields (NeRFs) have gained increasing popularity thanks to their cross-scene generalization capability. Despite their promise, the real-device deployment of generalizable NeRFs is bottlenecked by their prohibitive complexity: the massive memory accesses required to acquire scene features make the ray marching process memory-bound. To this end, we propose Gen-NeRF, an algorithm-hardware co-design framework dedicated to generalizable NeRF acceleration, which for the first time enables real-time generalizable NeRFs. On the algorithm side, Gen-NeRF integrates a coarse-then-focus sampling strategy, leveraging the fact that different regions of a 3D scene contribute differently to the rendered pixel, to enable sparse yet effective sampling. On the hardware side, Gen-NeRF highlights an accelerator micro-architecture that maximizes data reuse opportunities among different rays by making use of their epipolar geometric relationship. Furthermore, our Gen-NeRF accelerator features a customized dataflow to enhance data locality during point-to-hardware mapping and an optimized scene feature storage strategy to minimize memory bank conflicts. Extensive experiments validate the effectiveness of our proposed Gen-NeRF framework in enabling real-time and generalizable novel view synthesis.
    Comment: Accepted by ISCA 202
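The coarse-then-focus idea, spending most samples where a cheap coarse pass says the scene contributes, can be sketched generically. This is ordinary importance resampling along a ray, not Gen-NeRF's exact algorithm, and every name and density below is invented:

```python
# Coarse-then-focus ray sampling sketch: a uniform coarse pass estimates
# where along the ray the scene contributes, then a focus pass concentrates
# extra samples there. Generic importance resampling, illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def coarse_then_focus(density, t_near, t_far, n_coarse=8, n_focus=24):
    t_c = np.linspace(t_near, t_far, n_coarse)         # coarse pass
    w = density(t_c)
    bin_w = w[:-1] + w[1:]                             # weight per coarse bin
    p = bin_w / bin_w.sum()
    bins = rng.choice(n_coarse - 1, size=n_focus, p=p) # focus pass
    width = (t_far - t_near) / (n_coarse - 1)
    t_f = t_c[bins] + rng.random(n_focus) * width
    return np.sort(np.concatenate([t_c, t_f]))

# toy density peaked near t = 2.0 (a "surface" along the ray)
density = lambda t: np.exp(-((t - 2.0) ** 2) / 0.1) + 1e-3
samples = coarse_then_focus(density, 0.0, 4.0)
print(len(samples))   # 32 samples, mostly clustered near t = 2
```

The hardware payoff described above comes from the sparsity this buys: fewer sample points means fewer scene-feature memory accesses per ray.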