19,171 research outputs found

    Pond: A robust, scalable, massively parallel computer architecture

    A new computer architecture, intended for implementation in late and post-silicon technologies, is proposed. The architecture is a fine-grained, inherently parallel system consisting of a large grid of thousands or millions of simple atomic processors (APs) employing a simple instruction set. Each AP is configured as either a program instruction or a data storage element. These elements are organized into logical entities analogous to traditional programming functions/methods and data structures. Programming work is underway to compile and run programs from traditional sequential code, with parallelism discovered automatically at both the instruction and function level and integrated into the object code that is then sent to the processor. The result is a massively parallel architecture that fully exploits instruction- and thread-level parallelism. The architecture design is presented, in-progress work on converting existing code is discussed, and examples are shown to indicate the speedup potential of this new architecture compared to current architectures.

    Loop pipelining with resource and timing constraints

    Developing efficient programs for many of the current parallel computers is not easy due to the architectural complexity of those machines. The wide variety of machine organizations often makes it more difficult to port an existing program than to reprogram it completely. Therefore, powerful translators are necessary to generate effective code and free the programmer from concerns about the specific characteristics of the target machine. This work focuses on techniques to be used by an important class of translators, whose objective is to transform sequential programs into equivalent, more parallel programs. The transformations are performed at the instruction level in order to exploit low-level parallelism and increase memory locality. Most current applications are programmed in languages that do not allow parallelism between high-level statements to be expressed (such as Pascal, C or Fortran). Furthermore, many applications written ten or more years ago are still in use today, and it is not feasible to rewrite them, for economic as well as technical reasons. Translators enable programmers to write an application in a familiar sequential programming language without concerning themselves with the architecture of the target machine. Current compilers for parallel architectures not only translate a program written in a high-level language to the appropriate machine language, but also perform transformations on the final code so that the program executes in a more parallel way. The transformations improve the performance of the program by making use of the knowledge the compiler has about the machine architecture. The semantics of the program remain intact after any transformation. Experiments show that limiting parallelization to basic blocks not included in loops limits the maximum speedup, because loops often contain a large portion of the parallelism available in a program. For this reason, much effort has been devoted in recent years to parallelizing loop execution. Several parallel computer architectures and compilation techniques have been proposed to exploit such parallelism at different granularities. Multiprocessors exploit coarse-grained parallelism by distributing entire loop iterations to different processors. Systems oriented to the high-level synthesis (HLS) of VLSI circuits, superscalar processors and very long instruction word (VLIW) processors exploit fine-grained parallelism at the instruction level. This work addresses fine-grained parallelization of loops targeted at the HLS of VLSI circuits. Two algorithms are proposed, one for resource constraints and one for timing constraints. An algorithm to reduce the number of registers required to execute a loop on a given architecture is also proposed.
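
    The algorithms above target the high-level synthesis of VLSI circuits, but the core idea of loop pipelining is the same as in software pipelining: operations from different iterations are overlapped so that a new iteration can start every initiation interval, subject to resource and dependence constraints. The minimal C sketch below (hand-written for illustration, not produced by the proposed algorithms) rewrites a loop into prologue/kernel/epilogue form so that the load of iteration i+1 overlaps the computation of iteration i:

        /* Loop pipelining illustration: overlap operations from different
         * iterations (prologue / kernel / epilogue form). Hand-written sketch,
         * not output of the algorithms described in the abstract. */
        #include <stdio.h>

        #define N 8

        int main(void) {
            int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
            int b[N];

            /* Original sequential loop: one load, multiply and store per iteration. */
            for (int i = 0; i < N; i++)
                b[i] = a[i] * 3;

            /* Pipelined form: the load of iteration i+1 is issued while the
             * multiply/store of iteration i completes. */
            int loaded = a[0];                 /* prologue: first load           */
            for (int i = 0; i < N - 1; i++) {
                int next = a[i + 1];           /* kernel: load for iteration i+1 */
                b[i] = loaded * 3;             /* kernel: compute/store iter. i  */
                loaded = next;
            }
            b[N - 1] = loaded * 3;             /* epilogue: last compute/store   */

            for (int i = 0; i < N; i++)
                printf("%d ", b[i]);
            printf("\n");
            return 0;
        }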

    Beehive: an FPGA-based multiprocessor architecture

    In recent years, to keep pace with Moore's law, hardware and software designers have progressively focused their efforts on exploiting instruction-level parallelism. Software simulation has been essential for studying computer architecture because of its flexibility and low cost. However, users of software simulators must choose between high performance and high-fidelity emulation. This project presents an FPGA-based multiprocessor architecture to speed up multiprocessor architecture research and ease parallel software simulation.

    Exploiting different levels of parallelism in the biological sequence comparison problem

    In recent years, the fast growth of the bioinformatics field has attracted the attention of computer scientists. At the same time, the exponential growth of databases containing biological information (such as protein and DNA data) demands great efforts to improve the performance of computational platforms. In this work, we investigate how bioinformatics applications benefit from parallel architectures that combine different alternatives to exploit coarse- and fine-grain parallelism. As a case of analysis, we study the performance behavior of the Ssearch application, which implements the Smith-Waterman algorithm (SW), a dynamic programming approach that explores the similarity between a pair of sequences. The inherent large parallelism of the application makes it ideal for architectures supporting multiple dimensions of parallelism (thread-level parallelism, TLP; data-level parallelism, DLP; instruction-level parallelism, ILP). We study how this algorithm can take advantage of different parallel machines such as the SGI Altix, IBM Power6, IBM Cell BE and MareNostrum machines. Our study includes a qualitative analysis of the parallelization opportunities and a quantification of the performance in terms of speedup and execution time. These measures are collected taking into account the specific characteristics of each architecture. As an example, our results show that a shared-memory multiprocessor (SMP) like the PowerPC 970MP of the MareNostrum machine can surpass a heterogeneous multiprocessor machine like the current IBM Cell BE.
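
    The parallelism mentioned above comes from the structure of the Smith-Waterman recurrence: each cell H[i][j] depends only on its upper, left and upper-left neighbours, so all cells on the same anti-diagonal can be computed in parallel (DLP), while independent pairwise comparisons can run as separate threads (TLP). A minimal scoring-only C sketch of the recurrence follows; the match/mismatch/gap scores are illustrative assumptions, not the parameters used by Ssearch:

        /* Minimal Smith-Waterman local alignment score (linear gap penalty).
         * Sketch only: scoring parameters are illustrative assumptions. */
        #include <stdio.h>
        #include <string.h>

        #define MAX(a, b) ((a) > (b) ? (a) : (b))

        static int sw_score(const char *s, const char *t) {
            const int match = 2, mismatch = -1, gap = -1;
            int n = (int)strlen(s), m = (int)strlen(t), best = 0;
            int H[64][64] = {{0}};            /* assumes sequences shorter than 64 */

            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    int diag = H[i-1][j-1] + (s[i-1] == t[j-1] ? match : mismatch);
                    int up   = H[i-1][j] + gap;
                    int left = H[i][j-1] + gap;
                    H[i][j]  = MAX(0, MAX(diag, MAX(up, left)));
                    best     = MAX(best, H[i][j]);
                }
            }
            return best;
        }

        int main(void) {
            printf("score = %d\n", sw_score("ACACACTA", "AGCACACA"));
            return 0;
        }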

    A Horizontally Reconfigurable Architecture for Extended Precision Arithmetic (Parallel Computing, Condition Codes Factoring).

    A special computer for high-precision arithmetic and parallel processing, featuring an ALU that is dynamically reconfigurable under program control, has been designed and a prototype machine constructed. The 256-bit ALU consists of eight 32-bit slices, each of which has its own ALU operation code in each microinstruction. The slices can remain logically separated from each other, or can be dynamically connected to either or both of their neighbors under control of a segment control code that is part of each microinstruction. The result is a unique parallel architecture which provides real parallelism to user programs at the instruction level while globally retaining a sequential control structure. Management of parallelism is achieved through a two-level hierarchy of condition codes and extended instruction sets to support conditional instruction execution. New parallel microprogramming tools introduce a system for reconfiguration management and parallel programming. An assembler, debug simulator, and interactive operating environment have been implemented. An analysis of the instruction times required to execute arithmetic operations on the machine shows that it will be exceptionally fast for problems in computational number theory and the factoring of integers.
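
    As an analogy for how the eight 32-bit slices chain through their carries to act as a single 256-bit ALU, the C sketch below performs 256-bit addition over eight 32-bit limbs with an explicit carry rippling from slice to slice. It only illustrates the arithmetic, not the machine's microcode or segment-control mechanism:

        /* Extended-precision addition in the spirit of the slice-based ALU:
         * eight 32-bit "slices" chained through their carries act as one
         * 256-bit adder. Illustration only. */
        #include <stdint.h>
        #include <stdio.h>

        #define SLICES 8

        /* Little-endian limbs: a[0] is the least significant 32-bit slice. */
        static void add256(const uint32_t a[SLICES], const uint32_t b[SLICES],
                           uint32_t sum[SLICES]) {
            uint32_t carry = 0;
            for (int i = 0; i < SLICES; i++) {      /* carry ripples slice to slice */
                uint64_t s = (uint64_t)a[i] + b[i] + carry;
                sum[i] = (uint32_t)s;
                carry  = (uint32_t)(s >> 32);
            }
        }

        int main(void) {
            uint32_t a[SLICES] = {0xFFFFFFFFu, 0xFFFFFFFFu, 0, 0, 0, 0, 0, 0};
            uint32_t b[SLICES] = {1, 0, 0, 0, 0, 0, 0, 0};
            uint32_t s[SLICES];
            add256(a, b, s);
            for (int i = SLICES - 1; i >= 0; i--)   /* print most significant first */
                printf("%08X", s[i]);
            printf("\n");
            return 0;
        }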

    Describing the FPGA-Based Hardware Architecture of Systemic Computation (HAoS)

    This paper presents HAoS, the first hardware architecture for the bio-inspired computational paradigm known as Systemic Computation (SC). SC was designed to support the modelling of biological processes inherently, by defining a massively parallel non-conventional computer architecture and a model of natural behaviour. In this work we describe a novel custom digital design which addresses the parallelism requirement of the SC architecture by exploiting the inbuilt parallelism of a Field Programmable Gate Array (FPGA) and by using the highly efficient matching capability of a Ternary Content Addressable Memory (TCAM). Basic processing capabilities are embedded in HAoS in order to minimize time-demanding data transfers. Its custom instruction set can be expanded based on user requirements, since the optional use of a CPU provides high-level processing support if required. We demonstrate a functional, simulation-verified prototype, which takes programmability and scalability into consideration, and review various communication interfaces between HAoS and the CPU. Analysis shows that the proposed architecture provides an effective solution in terms of the efficiency versus flexibility trade-off and can potentially outperform prior implementations.
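
    The TCAM matching mentioned above can be pictured as a value/care-mask comparison per stored entry, with every entry compared in parallel in hardware. The C sketch below models that ternary match sequentially; the entry layout and 32-bit width are assumptions for illustration, not the actual HAoS data format:

        /* Ternary (TCAM-style) match: bits whose mask is 0 are "don't care". */
        #include <stdint.h>
        #include <stdio.h>

        struct tcam_entry {
            uint32_t value;   /* pattern bits                       */
            uint32_t mask;    /* 1 = bit must match, 0 = don't care */
        };

        /* Return index of the first matching entry, or -1 if none matches.
         * In real TCAM hardware all entries are compared in parallel. */
        static int tcam_lookup(const struct tcam_entry *t, int n, uint32_t key) {
            for (int i = 0; i < n; i++)
                if (((key ^ t[i].value) & t[i].mask) == 0)
                    return i;
            return -1;
        }

        int main(void) {
            struct tcam_entry table[] = {
                { 0xAB000000u, 0xFF000000u },   /* matches any key starting 0xAB */
                { 0x12345678u, 0xFFFFFFFFu },   /* exact match                   */
            };
            printf("%d\n", tcam_lookup(table, 2, 0xABCDEF01u));  /* -> 0 */
            printf("%d\n", tcam_lookup(table, 2, 0x12345678u));  /* -> 1 */
            return 0;
        }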

    Architecture for object-oriented programming model

    Current mainstream architectures have ISAs that are not able to maintain all the information provided by the application programmer using a high-level programming language. Typically, the information that is lost in compiling to a low-level ISA is related to parallelism and speculation [14]. For example, some loops are expressed as parallel loops by the programmer, but later the processor is unable to recover this level of parallelism; control-independent execution arising from conditional code is basically impossible to detect at execution time; and function- and object-level parallelism is lost when code is transformed into a low-level ISA that is oblivious to programmer intentions and high-level programming structures. Object-oriented programming languages are arguably the most successful programming medium because they help the programmer apply well-known practices for distributing data through the operations associated with that data. Object-oriented models therefore express data/execution locality more naturally and efficiently. Other OO software mechanisms, such as derivation and polymorphism, further help the programmer to exploit locality. Once object-oriented programs have been compiled, however, all information about data/execution locality is completely lost in current assembly code (ISA code). Maintaining this information until runtime is crucial to improving locality and security. Finally, object-oriented programming models keep the idea of (data) memory far from the programmer. These are all desirable qualities that are mostly lost in the compilation to a low-level ISA that is oblivious to the object-oriented programming model. This report considers implementing the object-oriented (OO) programming model directly in hardware, to serve as a base for exploiting object-level parallelism, speculation and heterogeneous computing. Towards this goal, we present a new computer architecture that implements the OO programming model. All its hardware structures are objects, and its instruction set operates directly on objects, completely hiding the notion of memory and other complex hardware structures. It also maintains all high-level programming language information until execution time. This enables efficient extraction of the available parallelism in OO serial or parallel code at execution time with minimal compiler support. We demonstrate the potential of this novel computer architecture through several examples.

    Thread partitioning and value prediction for exploiting speculative thread-level parallelism

    Speculative thread-level parallelism has recently been proposed as a source of parallelism to improve performance in applications where parallel threads are hard to find. However, the efficiency of this execution model strongly depends on the performance of the control and data speculation techniques. Several hardware-based schemes for partitioning the program into speculative threads are analyzed and evaluated. In general, we find that spawning threads associated with loop iterations is the most effective technique. We also show that value prediction is critical for the performance of all of the spawning policies. Thus, a new value predictor, the increment predictor, is proposed. This predictor is specially oriented to this kind of architecture and clearly outperforms adapted versions of conventional value predictors such as the last-value, stride, and context-based predictors, especially for small history tables.
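
    For context, the C sketch below shows a conventional stride value predictor, one of the baseline predictors the abstract compares against; the proposed increment predictor is a related but distinct scheme whose details are not given here. The table size and PC-based indexing are assumptions made for illustration:

        /* Minimal stride value predictor: predict last value + last observed
         * difference, indexed by instruction address. Illustrative sketch only. */
        #include <stdint.h>
        #include <stdio.h>

        #define ENTRIES 256

        struct vp_entry {
            int64_t last;    /* last value produced by this instruction */
            int64_t stride;  /* last observed difference                */
        };

        static struct vp_entry table[ENTRIES];

        static int index_of(uint64_t pc) { return (int)(pc % ENTRIES); }

        /* Prediction: last value plus the recorded stride. */
        static int64_t predict(uint64_t pc) {
            struct vp_entry *e = &table[index_of(pc)];
            return e->last + e->stride;
        }

        /* Update on commit with the actual value. */
        static void update(uint64_t pc, int64_t actual) {
            struct vp_entry *e = &table[index_of(pc)];
            e->stride = actual - e->last;
            e->last   = actual;
        }

        int main(void) {
            uint64_t pc = 0x400123;
            int64_t values[] = {10, 14, 18, 22, 26};   /* value stream with stride 4 */
            for (int i = 0; i < 5; i++) {
                printf("predicted %lld, actual %lld\n",
                       (long long)predict(pc), (long long)values[i]);
                update(pc, values[i]);
            }
            return 0;
        }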