148,896 research outputs found

    A sweep algorithm for massively parallel simulation of circuit-switched networks

    A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384-processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel iPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.

    Meta-State conversion

    In MIMD (Multiple Instruction stream, Multiple Data stream) execution, each processor has its own state. Although these states are generally considered to be independent entities, it is also possible to view the set of processor states at a particular time as a single, aggregate Meta State. Once a program has been converted into a single finite automaton based on Meta States, only a single program counter is needed. Hence, it is possible to duplicate the MIMD execution using SIMD (Single Instruction stream, Multiple Data stream) hardware without the overhead of interpretation or even of having each processing element keep a copy of the MIMD code. In this paper, we present an algorithm for Meta-State Conversion (MSC) and explore some properties of the technique.
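
The idea of aggregating processor states can be illustrated with a small sketch. It is not the paper's MSC algorithm: it assumes a hypothetical control-flow graph given as a `successors` dictionary, treats every state as taking one step, and models a Meta State simply as the set of MIMD states that at least one processor currently occupies.

```python
from itertools import combinations, product


def meta_state_automaton(successors, start):
    """Build a meta-state transition table from a per-processor state graph.

    successors: dict mapping each MIMD state to a tuple of its possible
                next states (an empty tuple marks an exit state).
    start:      the state in which every processor begins.
    Returns a dict mapping each reachable meta state (a frozenset of MIMD
    states) to the set of meta states that may follow it.
    """
    start_meta = frozenset([start])
    transitions = {}
    worklist = [start_meta]
    while worklist:
        meta = worklist.pop()
        if meta in transitions:
            continue
        # For each occupied state, processors may spread over any
        # non-empty subset of that state's successors.
        options = []
        for state in sorted(meta):
            succ = successors[state]
            if not succ:
                continue  # processors in an exit state have finished
            subsets = [
                frozenset(c)
                for r in range(1, len(succ) + 1)
                for c in combinations(succ, r)
            ]
            options.append(subsets)
        nexts = set()
        if options:
            # One choice per occupied state; their union is the next meta state.
            for choice in product(*options):
                nexts.add(frozenset().union(*choice))
        transitions[meta] = nexts
        worklist.extend(nexts)
    return transitions


# Hypothetical program: A branches to B or C, both of which fall through to D.
example = {"A": ("B", "C"), "B": ("D",), "C": ("D",), "D": ()}
table = meta_state_automaton(example, "A")
```

For the example graph, the branch at `A` yields three possible follow-on meta states, since processors may all take one arm, all take the other, or split across both. A SIMD machine stepping through this table needs only one program counter, which holds the current meta state.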

    Implementing nested conditional statements in SIMD machines

    Single instruction, multiple data (SIMD) computers consist of a very large number of processors executing a common sequence of instructions. Maintaining the full speedup potential of such machines is most sensitive to conditional execution in their programs, regions of code where some processing elements (PEs) perform no useful work. Techniques are presented for efficiently implementing nested conditional statements, specifically if and case statements, in SIMD machines, while adding minimal specialized hardware
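
A common way to picture nested conditionals on SIMD hardware is a stack of per-PE activity masks. The sketch below simulates that general idea in software; it is an assumption-laden illustration, not the hardware technique the paper proposes, and the function name and the example program are hypothetical.

```python
def simd_nested_if(values):
    """Simulate PEs that all execute the same instruction stream for:

        if v > 0:
            if v % 2 == 0: v = v // 2
            else:          v = v * 3
        else:
            v = -v

    Each list element plays the role of one PE's local data.
    """
    n = len(values)
    out = list(values)
    mask = [True] * n   # which PEs are currently enabled
    stack = []          # saved (enclosing mask, condition) pairs

    def push_if(cond):
        nonlocal mask
        stack.append((mask, cond))
        mask = [m and c for m, c in zip(mask, cond)]

    def do_else():
        nonlocal mask
        outer, cond = stack[-1]
        # Enable PEs that were active outside this 'if' but failed its test.
        mask = [m and not c for m, c in zip(outer, cond)]

    def end_if():
        nonlocal mask
        mask, _ = stack.pop()

    def apply(fn):
        # Every PE "sees" the instruction; only enabled PEs commit a result.
        for i in range(n):
            if mask[i]:
                out[i] = fn(out[i])

    push_if([v > 0 for v in out])
    push_if([v % 2 == 0 for v in out])
    apply(lambda v: v // 2)
    do_else()
    apply(lambda v: v * 3)
    end_if()
    do_else()
    apply(lambda v: -v)
    end_if()
    return out


result = simd_nested_if([4, 3, -5, 0])
```

Every branch body is issued to all PEs, so the cost of the whole statement is the sum of its arms. Disabled PEs idle during arms they do not take, which is the lost work the abstract refers to.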

    Extending Static Synchronization Beyond SIMD and VLIW

    A key advantage of SIMD (Single Instruction stream, Multiple Data stream) architectures is that synchronization is effected statically at compile-time, hence the execution-time cost of synchronization between “processes” is essentially zero. VLIW (Very Long Instruction Word) machines are successful in large part because they preserve this property while providing more flexibility in terms of what kinds of operations can be parallelized. In this paper, we propose a new kind of architecture, the “static barrier MIMD” or SBM, which can be viewed as a further generalization of the parallel execution abilities of static synchronization machines. Barrier MIMDs are asynchronous Multiple Instruction stream Multiple Data stream architectures capable of parallel execution of loops, subprogram calls, and variable execution-time instructions; however, little or no run-time synchronization is needed. When a group of processors within a barrier MIMD has just encountered a barrier, any conceptual synchronizations between the processors are statically accomplished with zero cost, as in a SIMD or VLIW and using similar compiler technology. Unlike these machines, however, as execution continues the relative timing of processors may become less precisely knowable as a static, compile-time quantity. Where this imprecision becomes too large, the compiler simply inserts a synchronization barrier to ensure that timing imprecision at that point is zero, and again employs purely static, implicit synchronization. Both the architecture and the supporting compiler technology are discussed in detail.

    Superscalar RISC-V Processor with SIMD Vector Extension

    With the increasing number of digital products in the market, the need for robust and highly configurable processors rises. The demand is convened by the stable and extensible open-sourced RISC-V instruction set architecture. RISC-V processors are becoming popular in many fields of applications and research. This thesis presents a dual-issue superscalar RISC-V processor design with dynamic execution. The proposed design employs the global sharing scheme for branch prediction and Tomasulo algorithm for out-of-order execution. The processor is capable of speculative execution with five checkpoints. Data flow in the instruction dispatch and commit stages is optimized to achieve higher instruction throughput. The superscalar processor is extended with a customized vector instruction set of single-instruction-multiple-data computations to specifically improve the performance on machine learning tasks. According to the definition of the proposed vector instruction set, the scratchpad memory and element-wise arithmetic units are implemented in the vector co-processor. Different test programs are evaluated on the fully-tested superscalar processor. Compared to the reference work, the proposed design improves 18.9% on average instruction throughput and 4.92% on average prediction hit rate, with 16.9% higher operating clock frequency synthesized on the Intel Arria 10 FPGA board. The forward propagation of a convolution neural network model is evaluated by the standalone superscalar processor and the integration of the vector co-processor. The vector program with software-level optimizations achieves 9.53× improvement on instruction throughput and 10.18× improvement on real-time throughput. Moreover, the integration also provides 2.22× energy efficiency compared with the superscalar processor alone.

    Strategy of microscopic parallelism for Bitplane Image Coding

    Recent years have seen the rise of a new type of processor that relies strongly on the Single Instruction, Multiple Data (SIMD) architectural principle. The main idea behind SIMD computing is to apply a flow of instructions to multiple pieces of data in parallel and synchronously. This permits the execution of thousands of operations in parallel, achieving higher computational performance than with traditional Multiple Instruction, Multiple Data (MIMD) architectures. The level of parallelism required in SIMD computing can only be achieved in image coding systems via microscopic parallel strategies that code multiple coefficients in parallel. Until now, the only way to achieve microscopic parallelism in bitplane coding engines was by executing multiple coding passes in parallel. Such a strategy does not suit SIMD computing well because each thread executes different instructions. This paper introduces the first bitplane coding engine devised for the fine grain of parallelism required in SIMD computing. Its main insight is to allow parallel coefficient processing in a coding pass. Experimental tests show coding performance results similar to those of JPEG2000.

    Bitplane image coding with parallel coefficient processing

    Image coding systems have been traditionally tailored for multiple instruction, multiple data (MIMD) computing. In general, they partition the (transformed) image into codeblocks that can be coded in the cores of MIMD-based processors. Each core executes a sequential flow of instructions to process the coefficients in the codeblock, independently and asynchronously from the other cores. Bitplane coding is a common strategy to code such data. Most of its mechanisms require sequential processing of the coefficients. Recent years have seen the rise of processing accelerators with enhanced computational performance and power efficiency whose architecture is mainly based on the single instruction, multiple data (SIMD) principle. SIMD computing refers to the execution of the same instruction on multiple data in a lockstep, synchronous way. Unfortunately, current bitplane coding strategies cannot fully profit from such processors because their coding tasks are inherently sequential. This paper presents bitplane image coding with parallel coefficient (BPC-PaCo) processing, a coding method that can process many coefficients within a codeblock in parallel and synchronously. To this end, the scanning order, the context formation, the probability model, and the arithmetic coder of the coding engine have been re-formulated. The experimental results suggest that the penalization in coding performance of BPC-PaCo with respect to the traditional strategies is almost negligible.

    A Study of the use of SIMD instructions for two image processing algorithms

    Many media processing algorithms suffer from long execution times, which are most often not acceptable from an end user point of view. Recently, this problem has been exacerbated because media has higher resolution. One possible solution is through the use of Single Instruction Multiple Data (SIMD) architectures, such as ARM's NEON. These architectures take advantage of the parallelism in media processing algorithms by operating on multiple pieces of data with just one instruction. SIMD instructions can significantly decrease the execution time of the algorithm, but require more time to implement. This thesis studies the use of SIMD instructions on a Cortex-A8 processor with NEON SIMD coprocessor. Both image processing algorithms, bilinear interpolation and distortion, are altered to process multiple pixels or colors simultaneously using the NEON coprocessor's instruction set. The distortion algorithm is also altered at the assembly level through the removal of memory accesses and branches, adding data prefetch instructions, and interlacing ARM and NEON instructions. Altering the assembly code requires a deeper understanding of the code and more time, but allows for more control and higher speedups. The theoretical speedup for the bilinear interpolation and distortion algorithms is three and four times respectively. The actual measured speedup for the bilinear interpolation algorithm is more than two times, and for the distortion algorithm is more than three times. The results show that SIMD instructions can provide a speedup to image processing algorithms following a correct sequence of modifications of the code.