19,900 research outputs found

    Broadcast with mask on a Massively Parallel Processing on a Chip

    Get PDF
    workshop drnoc2012The delay of instructions broadcast has a significant impact on the performance of Single Instruction Multiple Data (SIMD) architecture. This is especially true for massively parallel processing Systems-on-Chip (mppSoC), where the processing stage and that of setting up the communication mechanism need several clock periods. Subnetting is the strategy used to partition a single physical network into more than one smaller logical sub-networks (subnets). This technique better controls the broadcast instructions domain and the data traffic between network nodes. Furthermore, it allows to separate synchronous communications from asynchronous processing which maintains reliable communications and rapid processing through parallel processors. This paper describes the design of a communication model called broadcast with mask. This model is dedicated to mppSoC architecture with a huge number of processor elements because it maintains performances even when the number of processors increases. Simulation results and an FPGA implementation validate our approach

    Analysing Astronomy Algorithms for GPUs and Beyond

    Full text link
    Astronomy depends on ever increasing computing power. Processor clock-rates have plateaued, and increased performance is now appearing in the form of additional processor cores on a single chip. This poses significant challenges to the astronomy software community. Graphics Processing Units (GPUs), now capable of general-purpose computation, exemplify both the difficult learning-curve and the significant speedups exhibited by massively-parallel hardware architectures. We present a generalised approach to tackling this paradigm shift, based on the analysis of algorithms. We describe a small collection of foundation algorithms relevant to astronomy and explain how they may be used to ease the transition to massively-parallel computing architectures. We demonstrate the effectiveness of our approach by applying it to four well-known astronomy problems: Hogbom CLEAN, inverse ray-shooting for gravitational lensing, pulsar dedispersion and volume rendering. Algorithms with well-defined memory access patterns and high arithmetic intensity stand to receive the greatest performance boost from massively-parallel architectures, while those that involve a significant amount of decision-making may struggle to take advantage of the available processing power.Comment: 10 pages, 3 figures, accepted for publication in MNRA

    High-throughput chromatin accessibility profiling at single-cell resolution.

    Get PDF
    Here we develop a high-throughput single-cell ATAC-seq (assay for transposition of accessible chromatin) method to measure physical access to DNA in whole cells. Our approach integrates fluorescence imaging and addressable reagent deposition across a massively parallel (5184) nano-well array, yielding a nearly 20-fold improvement in throughput (up to ~1800 cells/chip, 4-5 h on-chip processing time) and library preparation cost (~81¢ per cell) compared to prior microfluidic implementations. We apply this method to measure regulatory variation in peripheral blood mononuclear cells (PBMCs) and show robust, de novo clustering of single cells by hematopoietic cell type

    VLSI implementation of a massively parallel wavelet based zerotree coder for the intelligent pixel array

    Get PDF
    In the span of a few years, mobile multimedia communication has rapidly become a significant area of research and development constantly challenging boundaries on a variety of technologic fronts. Mobile video communications in particular encompasses a number of technical hurdles that generally steer technological advancements towards devices that are low in complexity, low in power usage yet perform the given task efficiently. Devices of this nature have been made available through the use of massively parallel processing arrays such as the Intelligent Pixel Processing Array. The Intelligent Pixel Processing array is a novel concept that integrates a parallel image capture mechanism, a parallel processing component and a parallel display component into a single chip solution geared toward mobile communications environments, be it a PDA based system or the video communicator wristwatch portrayed in Dick Tracy episodes. This thesis details work performed to provide an efficient, low power, low complexity solution surrounding the massively parallel implementation of a zerotree entropy codec for the Intelligent Pixel Array

    A high speed programmable focal-plane SIMD vision chip. Analog Integrated Circuits and Signal Processing

    Get PDF
    International audienceA high speed analog VLSI image acquisition and low-level image processing system is presented. The architecture of the chip is based on a dynamically reconfigurable SIMD processor array. The chip features a massively parallel architecture enabling the computation of programmable mask-based image processing in each pixel. Each pixel include a photodiode, an amplifier, two storage capacitors, and an analog arithmetic unit based on a four-quadrant multiplier architecture. A 64 × 64 pixel proof-of-concept chip was fabricated in a 0.35 μm standard CMOS process, with a pixel size of 35 μm × 35 μm. The chip can capture raw images up to 10,000 fps and runs low-level image processing at a framerate of 2,000–5,000 fp

    Parallel VLSI architecture emulation and the organization of APSA/MPP

    Get PDF
    The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a Vax. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms

    Macroservers: An Execution Model for DRAM Processor-In-Memory Arrays

    Get PDF
    The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in single memory accesses and supporting complex address manipulations in the memory. Furthermore, large arrays of PIMs can be arranged into a massively parallel architecture. In this report, we describe an object-based programming model based on the notion of a macroserver. Macroservers encapsulate a set of variables and methods; threads, spawned by the activation of methods, operate asynchronously on the variables' state space. Data distributions provide a mechanism for mapping large data structures across the memory region of a macroserver, while work distributions allow explicit control of bindings between threads and data. Both data and work distributuions are first-class objects of the model, supporting the dynamic management of data and threads in memory. This offers the flexibility required for fully exploiting the processing power and memory bandwidth of a PIM array, in particular for irregular and adaptive applications. Thread synchronization is based on atomic methods, condition variables, and futures. A special type of lightweight macroserver allows the formulation of flexible scheduling strategies for the access to resources, using a monitor-like mechanism

    A model for Intelligent Random Access Memory architecture (IRAM): cellular automata algorithms on the Associative String Processing machine (ASTRA)

    Get PDF
    In the near future, the computer performance will be completely determined by how long it takes to access memory. There are bottle-necks in memory latency and memory-to processor interface bandwidth. The IRAM initiative could be the answer by putting Processor-In-Memory (PIM). Starting from the massively parallel processing concept, one reached a similar conclusion. The MPPC (Massively Parallel Processing Collaboration) project and the 8K processor ASTRA machine (Associative String Test bench for Research \& Applications) developed at CERN \cite{kuala} can be regarded as a forerunner of the IRAM concept. The computing power of the ASTRA machine, regarded as an IRAM with 64 one-bit processors on a 64×\times64 bit-matrix memory chip machine, has been demonstrated by running statistical physics algorithms: one-dimensional stochastic cellular automata, as a simple model for dynamical phase transitions. As a relevant result for physics, the damage spreading of this model has been investigated

    A 10 000 fps CMOS Sensor With Massively Parallel Image Processing

    Get PDF
    International audienceA high-speed analog VLSI image acquisition and preprocessing system has been designed and fabricated in a 0.35 standard CMOS process. The chip features a massively parallel architecture enabling the computation of programmable low-level image processing in each pixel. Extraction of spatial gradients and convolutions such as Sobel or Laplacian filters are implemented on the circuit. Measured results show that the proposed sensor successfully captures raw images up to 10 000 frames per second and runs low-level image processing at a frame rate of 2000 to 5000 frames per second

    Performance analysis of massively parallel embedded hardware architectures for retinal image processing

    Get PDF
    This paper examines the implementation of a retinal vessel tree extraction technique on different hardware platforms and architectures. Retinal vessel tree extraction is a representative application of those found in the domain of medical image processing. The low signal-to-noise ratio of the images leads to a large amount of low-level tasks in order to meet the accuracy requirements. In some applications, this might compromise computing speed. This paper is focused on the assessment of the performance of a retinal vessel tree extraction method on different hardware platforms. In particular, the retinal vessel tree extraction method is mapped onto a massively parallel SIMD (MP-SIMD) chip, a massively parallel processor array (MPPA) and onto an field-programmable gate arrays (FPGA)This work is funded by Xunta de Galicia under the projects 10PXIB206168PR and 10PXIB206037PR and the program Maria BarbeitoS
    corecore