326 research outputs found

    FIFTY YEARS OF MICROPROCESSOR EVOLUTION: FROM SINGLE CPU TO MULTICORE AND MANYCORE SYSTEMS

    Get PDF
    Nowadays microprocessors are among the most complex electronic systems that man has ever designed. One small silicon chip can contain the complete processor, large memory and logic needed to connect it to the input-output devices. The performance of today's processors implemented on a single chip surpasses the performance of a room-sized supercomputer from just 50 years ago, which cost over $ 10 million [1]. Even the embedded processors found in everyday devices such as mobile phones are far more powerful than computer developers once imagined. The main components of a modern microprocessor are a number of general-purpose cores, a graphics processing unit, a shared cache, memory and input-output interface and a network on a chip to interconnect all these components [2]. The speed of the microprocessor is determined by its clock frequency and cannot exceed a certain limit. Namely, as the frequency increases, the power dissipation increases too, and consequently the amount of heating becomes critical. So, silicon manufacturers decided to design new processor architecture, called multicore processors [3]. With aim to increase performance and efficiency these multiple cores execute multiple instructions simultaneously. In this way, the amount of parallel computing or parallelism is increased [4]. In spite of mentioned advantages, numerous challenges must be addressed carefully when more cores and parallelism are used.This paper presents a review of microprocessor microarchitectures, discussing their generations over the past 50 years. Then, it describes the currently used implementations of the microarchitecture of modern microprocessors, pointing out the specifics of parallel computing in heterogeneous microprocessor systems. To use efficiently the possibility of multi-core technology, software applications must be multithreaded. The program execution must be distributed among the multi-core processors so they can operate simultaneously. To use multi-threading, it is imperative for programmer to understand the basic principles of parallel computing and parallel hardware. Finally, the paper provides details how to implement hardware parallelism in multicore systems

    Thread partitioning and value prediction for exploiting speculative thread-level parallelism

    Get PDF
    Speculative thread-level parallelism has been recently proposed as a source of parallelism to improve the performance in applications where parallel threads are hard to find. However, the efficiency of this execution model strongly depends on the performance of the control and data speculation techniques. Several hardware-based schemes for partitioning the program into speculative threads are analyzed and evaluated. In general, we find that spawning threads associated to loop iterations is the most effective technique. We also show that value prediction is critical for the performance of all of the spawning policies. Thus, a new value predictor, the increment predictor, is proposed. This predictor is specially oriented for this kind of architecture and clearly outperforms the adapted versions of conventional value predictors such as the last value, the stride, and the context-based, especially for small-sized history tables.Peer ReviewedPostprint (published version

    Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors

    Full text link
    Los procesadores superescalares actuales utilizan un reorder buffer (ROB) para contabilizar las instrucciones en vuelo. El ROB se implementa como una cola FIFO first in first out en la que las instrucciones se insertan en orden de programa después de ser decodificadas, y de la que se extraen también en orden de programa en la etapa commit. El uso de esta estructura proporciona un soporte simple para la especulación, las excepciones precisas y la reclamación de registros. Sin embargo, el hecho de retirar instrucciones en orden puede degradar las prestaciones si una operación de alta latencia está bloqueando la cabecera del ROB. Varias propuestas se han publicado atacando este problema. La mayoría utiliza retirada de instrucciones fuera de orden de forma especulativa, requiriendo almacenar puntos de recuperación (checkpoints) para restaurar un estado válido del procesador ante un fallo de especulación. Normalmente, los checkpoints necesitan implementarse con estructuras hardware costosas, y además requieren un crecimiento de otras estructuras del procesador, lo cual a su vez puede impactar en el tiempo de ciclo de reloj. Este problema afecta a muchos tipos de procesadores actuales, independientemente del número de hilos hardware (threads) y del número de núcleos de cómputo (cores) que incluyan. Esta tesis abarca el estudio de la retirada no especulativa de instrucciones fuera de orden en procesadores superescalares, multithread y multicore.Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8535Palanci

    Data speculative multithreaded architecture

    Get PDF
    We present a novel processor microarchitecture that relieves three of the most important bottlenecks of superscalar processors: the serialization imposed by true dependences, the relatively small window size and the instruction fetch bandwidth. The new architecture executes simultaneously multiple threads of control obtained from a single program by means of control speculation techniques that do not require any compiler/user support nor any special feature in the instruction set architecture. The multiple simultaneous threads execute different iterations of the same loop, which require the same fetch bandwidth as a single thread since they share the same code. Inter-thread dependences as well as the values that flow through them are speculated by means of data prediction techniques. The preliminary evaluation results show a significant speed-up when compared with a superscalar processor. In fact, the new processor architecture can achieve an IPC (instructions per cycle) rate even larger than the peak fetch bandwidthPeer ReviewedPostprint (published version

    Beyond Dataflow

    Get PDF
    This paper presents some recent advanced dataflow architectures. While the dataflow concept offers the potential of high performance, the performance of an actual dataflow implementation can be restricted by a limited number of functional units, limited memory bandwidth, and the need to associatively match pending operations with available functional units. Since the early 1970s, there have been significant developments in both fundamental research and practical realizations of dataflow models of computation. In particular, there has been active research and development in multithreaded architectures that evolved from the dataflow model. Also some other techniques for combining control-flow and dataflow emerged, such as coarse-grain dataflow, dataflow with complex machine operations, RISC dataflow, and micro dataflow. These developments have also had certain impact on the conception of highperformance superscalar processors in the “post-RISC” era

    Thread-spawning schemes for speculative multithreading

    Get PDF
    Speculative multithreading has been recently proposed to boost performance by means of exploiting thread-level parallelism in applications difficult to parallelize. The performance of these processors heavily depends on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily-detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism to divide programs into threads by searching for those parts of the code that have certain features that could benefit from potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor and results show large performance benefits. When the proposed spawning scheme is compared with traditional heuristics, we outperform them by almost 20%. When a realistic value predictor and a 8-cycle thread initialization penalty is considered, the performance difference between them is maintained. The speed-up over a single thread execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.Peer ReviewedPostprint (published version

    An Investigation of thread scheduling heuristics for a simultaneous multithreaded processor

    Get PDF
    Over the years, the von Neumann model of computing has undergone many enhancements. These changes include an improved memory hierarchy, multiple instruction issue and branch predic tion. Since the model\u27s introduction, the performance of processors has increased at a much greater rate than that of memory. Several modifications to hide this ever widening gap in performance are being examined in current research. A very promising one is the Simultaneous Multithreaded processor. This architecture strives to further reduce the effects of long latency instructions, such as memory accesses, by allowing multiple threads of execution to be active in the processor at the same time. With the introduction of multiple active threads in a single processor, several new aspects of processor operation can have a sizeable effect on performance. One such aspect is how to choose from which thread to fetch instructions during the next cycle. For this project, three different classes of fetch scheduling mechanisms were defined and exam ples of each were either studied or proposed. The proposed mechanisms were then tested using a set of four sample programs by adding the mechanisms to a Simultaneous Multithreading sim ulator based on the Simple Scalar tool set from the University of Wisconsin-Madison. With the proper configuration, each of the proposed mechanisms improved the performance of the simulated architecture. However, the best increase in performance was produced by the Event History Table. It achieved an IPC of 2.0995 for two threads while overriding the primary scheduling mechanism only 0.070% of the time
    corecore