
    Array-OL Revisited, Multidimensional Intensive Signal Processing Specification

    This paper presents the Array-OL specification language, a high-level visual language dedicated to multidimensional intensive signal processing applications. It allows the specification of both the task parallelism and the data parallelism of these applications by focusing on their complex multidimensional data access patterns. This presentation includes several extensions and tools developed around Array-OL during the last few years and discusses the mapping of an Array-OL specification onto a distributed heterogeneous hardware architecture.
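
    As a rough illustration of the kind of multidimensional access pattern Array-OL is built around, the sketch below emulates a tiler-style pattern extraction in NumPy. The origin/fitting/paving vocabulary follows common Array-OL usage, but the function, shapes, and toy data are assumptions for illustration, not taken from the paper.

    ```python
    # Illustrative sketch only: a minimal NumPy emulation of an Array-OL-style
    # "tiler" that extracts multidimensional patterns from an array.  The
    # names (origin, fitting, paving) follow usual Array-OL terminology; the
    # toy data and shapes below are assumptions.
    import numpy as np

    def extract_pattern(array, origin, fitting, pattern_shape, paving, repetition_index):
        """Gather one pattern for a given repetition index (modulo the array shape)."""
        ref = origin + paving @ repetition_index          # reference point of this tile
        idx = np.indices(pattern_shape).reshape(len(pattern_shape), -1)
        points = (ref[:, None] + fitting @ idx) % np.array(array.shape)[:, None]
        return array[tuple(points)].reshape(pattern_shape)

    # Toy example: extract non-overlapping 2x2 tiles from a 4x6 array.
    data = np.arange(24).reshape(4, 6)
    tile = extract_pattern(data,
                           origin=np.array([0, 0]),
                           fitting=np.eye(2, dtype=int),
                           pattern_shape=(2, 2),
                           paving=np.diag([2, 2]),
                           repetition_index=np.array([1, 2]))
    print(tile)   # the 2x2 tile whose reference point is (2, 4)
    ```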

    FAST IMPLEMENTATION TECHNIQUES OF MULTICHANNEL DIGITAL FILTERS FOR COLOR IMAGE PROCESSING USING MATRIX DECOMPOSITIONS

    For the processing of color images, multivariable 3-input, 3-output 2-D digital filters are used, considering decomposition into the R, G and B components. Assuming that the three image components are decorrelated, three independent single-input, single-output (SISO) two-dimensional (2-D) digital filters are needed for the processing of each monochromatic image. Additional processing is needed for the correlated noise components in each channel. The requirement of very fast processing dictates the use of special-purpose hardware implementations. VLSI array processors, which are special-purpose, locally interconnected computing networks, are ideally suited for the fast implementation of digital filters, since they maximize concurrency by exploiting both parallelism and pipelining. In this paper, fast implementation architectures of 3-input, 3-output 2-D digital filters for color image processing, based on matrix decompositions, are presented. The resulting structures are modular and regular, have high inherent parallelism, and are easily pipelined, so that they may be implemented via VLSI array processors.
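
    The sketch below is not one of the paper's architectures; it only illustrates the general decomposition idea: diagonalize a 3x3 channel-coupling matrix so that the multichannel 2-D filtering reduces to three independent SISO 2-D filters between two changes of basis. The coupling matrix, kernel, and test image are invented for the example.

    ```python
    # Illustrative sketch (assumptions, not the paper's design): multichannel
    # 2-D filtering factored into per-channel SISO filters via a matrix
    # decomposition of the channel coupling.
    import numpy as np
    from scipy.signal import convolve2d

    rng = np.random.default_rng(0)
    image = rng.random((64, 64, 3))                  # toy RGB image, channels last
    M = np.array([[0.8, 0.1, 0.1],                   # assumed symmetric 3x3 channel-coupling matrix
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
    h = np.outer([1, 2, 1], [1, 2, 1]) / 16.0        # spatial 2-D kernel shared by all channels

    w, V = np.linalg.eigh(M)                         # M = V diag(w) V^T, V orthogonal

    decoupled = image @ V                            # change of basis per pixel: V^T x
    filtered = np.stack(
        [w[c] * convolve2d(decoupled[..., c], h, mode="same")   # three independent SISO 2-D filters
         for c in range(3)], axis=-1)
    output = filtered @ V.T                          # back to the RGB basis
    ```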

    Communication costs in a multi-tiered MPSoC

    The amount of digital processing required for phased-array beamformers is very large. It requires many parallel processors, which can be organized in a multi-tiered structure. Communication costs differ for each of the stages in such an architecture. For example, communication from the antenna front-end to the first processing stages is costly because of the number of connections and the data rate. Furthermore, there is a trade-off between sequential processing, which exploits locality of reference, and parallel processing, which adds communication costs. Thus, the optimal architecture depends on the importance that is given to the different measures.

    A model is presented to determine the partitioning of a (beamforming) system based on communication costs. It is shown that different solutions can be explored based on the cost model and the incorporated quantitative and qualitative measures. The importance of each measure depends on the situation and application. In this work, a simple beamforming application optimised for energy efficiency is used.
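
    A minimal sketch of the kind of communication-cost bookkeeping such a partitioning model involves is shown below; the tier names, per-bit energy costs, and bit rates are invented numbers, not the paper's model or measurements.

    ```python
    # Illustrative cost bookkeeping only; all figures below are made up.
    ENERGY_PER_BIT = {            # assumed cost (pJ/bit) of moving data between tiers
        "antenna_to_tier1": 10.0, # many high-rate links: expensive
        "tier1_to_tier2":    2.0, # partially beamformed, reduced rate
        "tier2_to_backend":  0.5,
    }

    def communication_energy(partitioning):
        """Sum energy over all links; `partitioning` maps link name -> bit rate (bit/s)."""
        return sum(rate * ENERGY_PER_BIT[link] for link, rate in partitioning.items())

    # Two candidate partitionings of the same beamformer, trading locality of
    # reference against parallelism (rates in bit/s, invented).
    early_reduction = {"antenna_to_tier1": 1e9, "tier1_to_tier2": 1e8, "tier2_to_backend": 1e7}
    late_reduction  = {"antenna_to_tier1": 1e9, "tier1_to_tier2": 5e8, "tier2_to_backend": 1e7}

    print(communication_energy(early_reduction), "pJ/s")
    print(communication_energy(late_reduction), "pJ/s")
    ```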

    Automatic parallelization of nested loop programs for non-manifest real-time stream processing applications

    This thesis is concerned with the automatic parallelization of real-time stream processing applications, such that they can be executed on embedded multiprocessor systems. Stream processing applications can be found in the video and channel decoding domains. These applications often have temporal requirements and can contain non-manifest conditions and expressions, whose results cannot be evaluated at compile time. Current parallelization approaches have difficulties with the extraction of function parallelism from stream processing applications. Some of these approaches require applications with manifest behavior and affine index-expressions. For these applications, they can derive data dependencies and insert inter-task communication via FIFO buffers, but they cannot support stream processing applications with non-manifest loops. Furthermore, to the best of our knowledge, current approaches can only extract a temporal analysis model from applications with manifest behavior and without cyclic data dependencies.

    To address the issues mentioned above, we present an automatic parallelization approach to extract function parallelism from sequential descriptions of real-time stream processing applications. We introduce a language to describe stream processing applications. The key property of this language is that all dependencies can be derived at compile time. In our language we support non-manifest loops, if-statements, and index-expressions. We introduce a new buffer type that can always be used to replace the array communication. This buffer supports multiple reading and writing tasks. Because we can always derive the data dependencies and always replace the array communication by communication via a buffer, we can always extract the available function parallelism. Furthermore, our parallelization approach uses an underlying temporal analysis model in which we capture the inter-task synchronization. With this analysis model, we can compute system settings and perform optimizations. Our parallelization approach is implemented in a multiprocessor compiler. We evaluated our approach by extracting parallelism from a WLAN channel decoder application and a JPEG decoder application with our multiprocessor compiler.
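
    The sketch below is not the thesis's language, buffer type, or compiler; it only illustrates the basic idea of extracting function parallelism from a sequential loop nest with a non-manifest inner loop, with inter-task communication through a bounded FIFO buffer.

    ```python
    # Minimal sketch, assuming a toy two-stage pipeline: a producer whose inner
    # loop trip count depends on the data (non-manifest) and a consumer stage,
    # synchronized through a bounded FIFO.
    import threading, queue

    fifo = queue.Queue(maxsize=8)        # bounded buffer enforces back-pressure
    SENTINEL = object()

    def producer(samples):
        for x in samples:
            n = x % 4 + 1                # non-manifest: iteration count depends on data
            acc = 0
            for _ in range(n):           # inner loop unknown at compile time
                acc += x
            fifo.put(acc)                # inter-task communication replaces an array write
        fifo.put(SENTINEL)

    def consumer(out):
        while True:
            v = fifo.get()               # blocks until the producer has written
            if v is SENTINEL:
                break
            out.append(v * 2)            # second pipeline stage

    results = []
    t1 = threading.Thread(target=producer, args=(range(10),))
    t2 = threading.Thread(target=consumer, args=(results,))
    t1.start(); t2.start(); t1.join(); t2.join()
    print(results)
    ```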

    Omphale: Streamlining the Communication for Jobs in a Multi Processor System on Chip

    Our Multi Processor System on Chip (MPSoC) template provides processing tiles that are connected via a network on chip. A processing tile contains a processing unit and a Scratch Pad Memory (SPM). This paper presents the Omphale tool that performs the first step in mapping a job, represented by a task graph, to such an MPSoC, given the SPM sizes as constraints. Furthermore, a memory tile is introduced. The result of Omphale is a Cyclo Static DataFlow (CSDF) model and a task graph where tasks communicate via sliding windows that are located in circular buffers. The CSDF model is used to determine the size of the buffers and the communication pattern of the data. A buffer must fit in the SPM of the processing unit that is reading from it, such that low-latency access is realized with a minimal number of stall cycles. If a task and its buffer exceed the size of the SPM, the task is examined for additional parallelism, or the circular buffer is partly located in a memory tile. This results in an extended task graph that satisfies the SPM size constraints.
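
    As a rough illustration of sliding-window communication through a circular buffer, the sketch below implements a toy buffer with window reads; the window and buffer sizes and the capacity reasoning are assumptions for illustration, not Omphale's CSDF-based buffer-sizing analysis.

    ```python
    # Toy circular buffer with sliding-window reads; sizes are assumptions.
    class CircularWindowBuffer:
        def __init__(self, capacity):
            self.buf = [None] * capacity
            self.capacity = capacity
            self.written = 0                      # total tokens produced so far

        def write(self, token):
            self.buf[self.written % self.capacity] = token
            self.written += 1

        def read_window(self, start, size):
            """Read a sliding window [start, start+size) of the logical stream."""
            assert start + size <= self.written            # data must have been produced
            assert self.written - start <= self.capacity   # window still live in the buffer
            return [self.buf[i % self.capacity] for i in range(start, start + size)]

    # Producer writes one token per firing; consumer reads windows of 3 tokens
    # that slide by 2, so a capacity of window + slide = 5 is comfortably enough here.
    buf = CircularWindowBuffer(capacity=5)
    for t in range(10):
        buf.write(t)
        if t >= 2 and (t - 2) % 2 == 0:
            print(buf.read_window(start=t - 2, size=3))
    ```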

    Array languages and the N-body problem

    This paper is a description of the contributions to the SICSA multicore challenge on many-body planetary simulation made by a compiler group at the University of Glasgow. Our group is part of the Computer Vision and Graphics research group, and we have for some years been developing array compilers because we think these are a good tool both for expressing graphics algorithms and for exploiting the parallelism that computer vision applications require. We shall describe experiments using two languages on two different platforms and shall compare their performance with reference C implementations running on the same platforms. Finally, we shall draw conclusions both about the viability of the array-language approach compared to other approaches used in the challenge and about the strengths and weaknesses of the two very different processor architectures we used.

    B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives

    Previous research addressed the potential problems of the hard-disk-oriented design of DBMSs on flashSSDs. In this paper, we focus on exploiting the potential benefits of flashSSDs. First, we examine the internal parallelism of flashSSDs by conducting benchmarks on various flashSSDs. Then, we suggest algorithm-design principles in order to best benefit from the internal parallelism. We present a new I/O request concept, called psync I/O, that can exploit the internal parallelism of flashSSDs in a single process. Based on these ideas, we introduce B+-tree optimization methods in order to utilize internal parallelism. By integrating the results of these methods, we present a B+-tree variant, PIO B-tree. We confirmed that each optimization method substantially enhances the index performance. Consequently, PIO B-tree enhanced B+-tree's insert performance by a factor of up to 16.3, while improving point-search performance by a factor of 1.2. The range search of PIO B-tree was up to 5 times faster than that of the B+-tree. Moreover, PIO B-tree outperformed other flash-aware indexes in various synthetic workloads. We also confirmed that PIO B-tree outperforms the B+-tree in index traces collected inside the PostgreSQL DBMS with the TPC-C benchmark.
    Comment: VLDB201
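
    The sketch below is not the paper's psync I/O primitive or PIO B-tree; it only illustrates the underlying idea that keeping several independent page reads in flight at once lets a flashSSD's internal channels serve them in parallel. The file name, page size, and thread-pool approach are assumptions for illustration.

    ```python
    # Illustrative only: batch several B+-tree page reads so they can be served
    # concurrently by the SSD, instead of issuing them one at a time.
    import os
    from concurrent.futures import ThreadPoolExecutor

    PAGE_SIZE = 4096

    def read_page(fd, page_no):
        """Read one fixed-size page at a given page number."""
        return os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)

    def read_pages_parallel(path, page_numbers, depth=8):
        """Issue up to `depth` page reads concurrently instead of one at a time."""
        fd = os.open(path, os.O_RDONLY)
        try:
            with ThreadPoolExecutor(max_workers=depth) as pool:
                return list(pool.map(lambda p: read_page(fd, p), page_numbers))
        finally:
            os.close(fd)

    # e.g. fetching several leaf pages touched by a range scan in one batch
    # ("index.dat" is a hypothetical file name):
    # pages = read_pages_parallel("index.dat", [17, 98, 231, 4095])
    ```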