114 research outputs found

    Beyond Dataflow

    Get PDF
    This paper presents some recent advanced dataflow architectures. While the dataflow concept offers the potential of high performance, the performance of an actual dataflow implementation can be restricted by a limited number of functional units, limited memory bandwidth, and the need to associatively match pending operations with available functional units. Since the early 1970s, there have been significant developments in both fundamental research and practical realizations of dataflow models of computation. In particular, there has been active research and development in multithreaded architectures that evolved from the dataflow model. Also some other techniques for combining control-flow and dataflow emerged, such as coarse-grain dataflow, dataflow with complex machine operations, RISC dataflow, and micro dataflow. These developments have also had certain impact on the conception of highperformance superscalar processors in the “post-RISC” era

    New Challenges in Computer Architecture Education

    Get PDF
    This study was supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia, and these results are parts of Grant No. 451-03-68/2022-14/200132 with the University of Kragujevac - Faculty of Technical Sciences Čačak.This paper provides a brief overview of the development of computer architecture and its impact on approaches to the presentation of appropriate information and computer education. The relative constancy of the concepts that were applied in the architecture of the computer influenced that the classical approaches to the appropriate education are kept until today. The changes that occurred in architecture during the development of computer technology, in conjunction with technological development, required a corresponding adjustment in the sphere of education. The turning point was the advent of the microprocessor, which introduced the x86 architecture into education. The beginning of the new century was marked by the ARM architecture. And today, the RISC-V architecture is emerging more and more as a new design challenge.Publishe

    Late allocation and early release of physical registers

    Get PDF
    The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and releasing are conservatively done, the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once registers are not used any more. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers having a value to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where the register file pressure is usually high.Peer ReviewedPostprint (published version

    A data dependency recovery system for a heterogeneous multicore processor

    Get PDF
    Multicore processors often increase the performance of applications. However, with their deeper pipelining, they have proven increasingly difficult to improve. In an attempt to deliver enhanced performance at lower power requirements, semiconductor microprocessor manufacturers have progressively utilised chip-multicore processors. Existing research has utilised a very common technique known as thread-level speculation. This technique attempts to compute results before the actual result is known. However, thread-level speculation impacts operation latency, circuit timing, confounds data cache behaviour and code generation in the compiler. We describe an software framework codenamed Lyuba that handles low-level data hazards and automatically recovers the application from data hazards without programmer and speculation intervention for an asymmetric chip-multicore processor. The problem of determining correct execution of multiple threads when data hazards occur on conventional symmetrical chip-multicore processors is a significant and on-going challenge. However, there has been very little focus on the use of asymmetrical (heterogeneous) processors with applications that have complex data dependencies. The purpose of this thesis is to: (i) define the development of a software framework for an asymmetric (heterogeneous) chip-multicore processor; (ii) present an optimal software control of hardware for distributed processing and recovery from violations;(iii) provides performance results of five applications using three datasets. Applications with a small dataset showed an improvement of 17% and a larger dataset showed an improvement of 16% giving overall 11% improvement in performance

    Scalable Parallel Computers for Real-Time Signal Processing

    Get PDF
    We assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing. We review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing. We evaluate the IBM SP2 at MHPCC, the Intel Paragon at SDSC, the Gray T3D at Gray Eagan Center, and the Gray T3E and ASCI TeraFLOP system proposed by Intel. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are also introducedpublished_or_final_versio

    Design and implementation of a multi-purpose cluster system NIU

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 209-221).by Boon Seong Ang.Ph.D

    Recursos anchos: una técnica de bajo coste para explotar paralelismo agresivo en códigos numéricos

    Get PDF
    Els bucles son la part que més temps consumeix en les aplicacions numèriques. El rendiment dels bucles està limitat tant pels recursos oferts per l'arquitectura com per les recurrències del bucle en la computació. Per executar més operacions per cicle, els processadors actuals es dissenyen amb graus creixents de replicació de recursos (tècnica de replicació) para ports de memòria i unitats funcionals. En canvi, el gran cost en termes d'àrea i temps de cicle d'aquesta tècnica limita tenir alts graus de replicació: alts valors en temps de cicle contraresten els guanys deguts al decrement en el nombre de cicles, mentre que alts valors en l'àrea requerida poden portar a configuracions impossibles d'implementar. Una alternativa a la replicació de recursos, és fer los més amples (tècnica que anomenem "widening"), i que ha estat usada en alguns dissenys recents. Amb aquesta tècnica, l'amplitud dels recursos s'amplia, fent una mateixa operació sobre múltiples dades. Per altra banda, alguns microprocessadors escalars de propòsit general han estat implementats amb unitats de coma flotants que implementen la instrucció sumar i multiplicar unificada (tècnica de fusió), el que redueix la latència de la operació combinada, tanmateix com el nombre de recursos utilitzats. A aquest treball s'avaluen un ampli conjunt d'alternatives de disseny de processadors VLIW que combinen les tres tècniques. S'efectua una projecció tecnològica de les noves generacions de processadors per predir les possibles alternatives implementables. Com a conclusió, demostrem que tenint en compte el cost, combinar certs graus de replicació i "widening" als recursos hardware és més efectiu que aplicar únicament replicació. Així mateix, confirmem que fer servir unitats que fusionen multiplicació i suma pot tenir un impacte molt significatiu en l'increment de rendiment en futures arquitectures de processadors a un cost molt raonable.Loops are the main time-consuming part of numerical applications. The performance of the loops is limited either by the resources offered by the architecture or by recurrences in the computation. To execute more operations per cycle, current processors are designed with growing degrees of resource replication (replication technique) for memory ports and functional units. However, the high cost in terms of area and cycle time of this technique precludes the use of high degrees of replication. High values for the cycle time may clearly offset any gain in terms of number of execution cycles. High values for the area may lead to an unimplementable configuration. An alternative to resource replication is resource widening (widening technique), which has also been used in some recent designs in which the width of the resources is increased (i.e., a single operation is performed over multiple data). Moreover, several general-purpose superscalar microprocessors have been implemented with multiply-add fused floating point units (fusion technique), which reduces the latency of the combined operation and the number of resources used. On this thesis, we evaluate a broad set of VLIW processor design alternatives that combine the three techniques. We perform a technological projection for the next processor generations in order to foresee the possible implementable alternatives. From this study, we conclude that if the cost is taken into account, combining certain degrees of replication and widening in the hardware resources is more effective than applying only replication. Also, we confirm that multiply-add fused units will have a significant impact in raising the performance of future processor architectures with a reasonable increase in cost
    corecore