    Mixed Speculative Multithreaded Execution Models

    Institute for Computing Systems Architecture. The current trend toward chip multiprocessor architectures has placed great pressure on programmers and compilers to generate thread-parallel programs. Improved execution performance can no longer be obtained via traditional single-thread instruction-level parallelism (ILP), but, instead, via multithreaded execution. One notable technique that facilitates the extraction of parallel threads from sequential applications is thread-level speculation (TLS). This technique allows programmers and compilers to generate threads without checking for inter-thread data and control dependences, which are then transparently enforced by the hardware. Most prior work on TLS has concentrated on thread selection and on mechanisms to efficiently support the main TLS operations, such as squashes, data versioning, and commits. This thesis seeks to enhance TLS functionality by combining it with other speculative multithreaded execution models. The main idea is that TLS already requires extensive hardware support, which, when slightly augmented, can accommodate other speculative multithreaded techniques. Because the bottlenecks may differ between applications, or even between program phases, it is reasonable to assume that the more versatile a system is, the more efficiently it will be able to execute a given program. As mentioned above, generating thread-parallel programs is hard, and TLS has been suggested as an execution model that can speculatively exploit thread-level parallelism (TLP) even when thread independence cannot be guaranteed by the programmer or compiler. Alternatively, the helper threads (HT) execution model has been proposed, where subordinate threads are executed in parallel with a main thread in order to improve the execution efficiency (i.e., ILP) of the latter. 
Yet another execution model, runahead execution (RA), has also been proposed, where subordinate versions of the main thread are dynamically created specifically to cope with long-latency operations, again with the aim of improving the execution efficiency of the main thread (ILP). Each of these multithreaded execution models works best for different applications and application phases. We combine these three models into a single execution model and a single hardware infrastructure such that the system can dynamically adapt to find the most appropriate multithreaded execution model. More specifically, TLS is favored whenever successful parallel execution of instructions in multiple threads (i.e., TLP) is possible, and the system can seamlessly transition at run time to the other models otherwise. In order to understand the tradeoffs involved, we also develop a performance model that allows one to quantitatively attribute overall performance gains to either TLP or ILP in such a combined multithreaded execution model. Experimental results show that our combined execution model achieves speedups of up to 41.2%, with an average of 10.2%, over an existing state-of-the-art TLS system and speedups of up to 35.2%, with an average of 18.3%, over a flavor of runahead execution for a subset of the SPEC2000 Integer benchmark suite. We then investigate how a common ILP-enhancing microarchitectural feature, namely branch prediction, interacts with TLS. We show that branch prediction for TLS is even more important than it is for single-core machines. Unfortunately, branch prediction for TLS systems is also inherently harder. Code partitioning and re-executions of squashed threads pollute the branch history, making it harder for predictors to be accurate. We thus propose to augment the hardware so as to accommodate Multi-Path (MP) execution within the existing TLS protocol. Under the MP execution model, all paths following a number of hard-to-predict conditional branches are followed. 
MP execution thus removes branches that would otherwise have been mispredicted, helping the processor to exploit more ILP. We show that with only minimal hardware support, one can combine these two execution models into a unified one, which can achieve far better performance than either TLS or MP execution alone. Experimental results show that our combined execution model achieves speedups of up to 20.1%, with an average of 8.8%, over an existing state-of-the-art TLS system and speedups of up to 125%, with an average of 29.0%, when compared with multi-path execution for a subset of the SPEC2000 Integer benchmark suite. Finally, since systems that support speculative multithreading usually treat all threads equally, they are energy-inefficient. This inefficiency stems from the fact that speculation occasionally fails and, thus, power is spent on threads that will have to be discarded. We propose a profitability-based power allocation scheme, where we “steal” power from non-profitable threads and use it to speed up more useful ones. We evaluate our techniques for a state-of-the-art TLS system and show that, with minimal hardware support, we achieve improvements in ED of up to 25.5%, with an average of 18.9%, for a subset of the SPEC2000 Integer benchmark suite.
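The squash, data-versioning, and commit operations named in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism of the thesis: the names `execute` and `run_speculative` are hypothetical, epochs are run one after another against a shared snapshot to imitate parallel execution, and a violated epoch is simply re-executed at commit time.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical task signature: task(read, write), where read(key) returns a
# value and write(key, value) buffers a speculative store.
Task = Callable[[Callable[[str], int], Callable[[str, int], None]], None]


def execute(task: Task, state: Dict[str, int]) -> Tuple[Set[str], Dict[str, int]]:
    """Run one epoch against `state`, recording its read set and buffering writes."""
    reads: Set[str] = set()
    writes: Dict[str, int] = {}

    def read(key: str) -> int:
        if key in writes:          # the epoch sees its own buffered version
            return writes[key]
        reads.add(key)             # value came from outside: track the dependence
        return state[key]

    def write(key: str, value: int) -> None:
        writes[key] = value        # data versioning: not visible until commit

    task(read, write)
    return reads, writes


def run_speculative(tasks: List[Task], memory: Dict[str, int]) -> int:
    """Execute tasks speculatively, commit in program order, return squash count."""
    snapshot = dict(memory)
    # Phase 1: every epoch runs against the same snapshot, as if in parallel.
    results = [execute(task, snapshot) for task in tasks]

    squashes = 0
    dirty: Set[str] = set()        # locations written by already committed epochs
    # Phase 2: in-order commit with dependence checking.
    for task, (reads, writes) in zip(tasks, results):
        if reads & dirty:
            # The epoch consumed a stale value: squash and re-execute it
            # against the committed state.
            squashes += 1
            reads, writes = execute(task, memory)
        memory.update(writes)
        dirty |= set(writes)
    return squashes


if __name__ == "__main__":
    mem = {"a": 1, "b": 0, "c": 0, "d": 0}
    epochs: List[Task] = [
        lambda rd, wr: wr("b", rd("a") + 1),   # writes b
        lambda rd, wr: wr("c", rd("b") * 10),  # reads b: true dependence
        lambda rd, wr: wr("d", rd("a") + 5),   # independent of the others
    ]
    print(run_speculative(epochs, mem), mem)
```

In the example, only the second epoch is squashed because it read `b` before the first epoch committed its write; the third epoch commits its speculative result unchanged.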

    New scalable machine learning methods: beyond classification and regression

    Programa Oficial de Doutoramento en Computación. 5009V01. [Abstract] The recent surge in available data has spawned a new and promising age of machine learning. Success cases of machine learning are arriving at an increasing rate, as some algorithms are able to leverage immense amounts of data to produce difficult and highly accurate predictions. Still, many algorithms in the toolbox of the machine learning practitioner have been rendered useless in this new scenario due to the complications associated with large-scale learning. Handling large datasets entails logistical problems, limits the computational and spatial complexity of the algorithms used, favours methods with few or no hyperparameters to be configured, and exhibits specific characteristics that complicate learning. This thesis is centered on the scalability of machine learning algorithms, that is, their capacity to maintain their effectiveness as the scale of the data grows, and on how it can be improved. We focus on problems for which the existing solutions struggle when the scale grows. Therefore, we skip classification and regression problems and focus on feature selection, anomaly detection, graph construction, and explainable machine learning. We analyze four different strategies to obtain scalable algorithms. First, we explore distributed computation, which is used in all of the presented algorithms. Besides this technique, we also examine the use of approximate models to speed up computations, the design of new models that take advantage of a characteristic of the input data to simplify training, and the enhancement of simple models to enable them to manage large-scale learning. We have implemented four new algorithms and six versions of existing ones that tackle the mentioned problems, and for each one we report experimental results that show both their validity in comparison with competing methods and their capacity to scale to large datasets. All the presented algorithms have been made available for download and have been published in journals so that practitioners and researchers can learn about and use them.

    Mitosis based speculative multithreaded architectures

    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have introduced multi-core chips in their product lines, and these have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP), while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) place important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take advantage of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the latter partitions the instructions among the cores based on data dependences but avoids a large degree of speculation. Despite the large amount of research on both these approaches, the techniques proposed so far have shown only marginal performance improvements. 
In this thesis we propose novel schemes to speed up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique, based on pre-computation slices (p-slices), to manage inter-thread dependences. Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique fine-grain thread decomposition algorithm that adapts to the available parallelism in the application.
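The constraint that this abstract attributes to Amdahl’s law can be stated compactly. The symbols are assumptions introduced here for illustration and do not come from the thesis: $p$ is the fraction of execution time that can be parallelized and $n$ is the number of cores.

```latex
% Amdahl's law: speedup on n cores when a fraction p of the work is parallel.
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}.

% Worked example with p = 0.8 and n = 8:
S(8) = \frac{1}{0.2 + 0.8/8} = \frac{1}{0.3} \approx 3.33,
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{0.2} = 5.
```

Even with 80% of the work parallelized, the serial 20% caps the speedup at 5 regardless of core count, which is why the serial sections are the target of speculative approaches.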

    Design and validation of a simultaneous multi-threaded DLX processor

    Technical report. Modern computer systems rely on two forms of parallelism to achieve high performance: parallelism between individual instructions of a program (ILP) and parallelism between individual threads (TLP). Superscalar processors exploit ILP by issuing several instructions per clock, and multiprocessors (MP) exploit TLP by running different threads in parallel on different processors. A fundamental limitation of these approaches to exploiting parallelism is that processor resources are statically partitioned. If TLP is low, processors in an MP system will be idle, and if ILP is low, issue slots in a superscalar processor will be wasted. As a consequence, the hardware cannot adapt to changing levels of ILP and TLP, and resource utilization tends to be low. Since resource utilization is low, there is potential to achieve higher performance if useful instructions could somehow be found to fill the wasted issue slots. This paper explores a method called simultaneous multithreading (SMT) that addresses the utilization problem by letting multiple threads compete for the resources of a single processor each clock cycle, thus increasing the potential ILP available.

    Beyond Dataflow

    This paper presents some recent advanced dataflow architectures. While the dataflow concept offers the potential of high performance, the performance of an actual dataflow implementation can be restricted by a limited number of functional units, limited memory bandwidth, and the need to associatively match pending operations with available functional units. Since the early 1970s, there have been significant developments in both fundamental research and practical realizations of dataflow models of computation. In particular, there has been active research and development in multithreaded architectures that evolved from the dataflow model. Other techniques for combining control flow and dataflow have also emerged, such as coarse-grain dataflow, dataflow with complex machine operations, RISC dataflow, and micro dataflow. These developments have also had a certain impact on the conception of high-performance superscalar processors in the “post-RISC” era.

    A Survey on Thread-Level Speculation Techniques

    Producción Científica. Thread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior compile-time dependence analysis. In this work, we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field. MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H5 network (TIN2014-53522-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).

    Improving cache locality for thread-level speculation

    Investigation of a simultaneous multithreaded architecture

    Many enhancements have been made to traditional general-purpose load-store computer architectures. Among these enhancements are memory hierarchy improvements, branch prediction, and multiple-issue processors. A major problem in current microprocessor design is the disparity between the rapid increase in CPU speed and the moderate increase in the speed of main-memory access. The simultaneous multithreaded architecture is an extension of the single-threaded architecture that helps hide the performance penalty created by long-latency instructions, branch mispredictions, and memory accesses. Simultaneous multithreaded architectures use a more flexible form of parallelism, which takes advantage of both instruction-level and thread-level parallelism. The goal of this project was to design, simulate, and analyze a model of a simultaneous multithreaded architecture in order to evaluate design alternatives. The simulator was created by modifying a version of the SimpleScalar toolset, developed at the University of Wisconsin. The simulations document an overall system performance improvement for a simultaneous multithreaded architecture. In early simulation results, performed with the same number of functional units, an improvement in the number of instructions per cycle (IPC) of between 43% and 58% was found using four threads versus a single thread. The horizontal waste rate, which measures the number of unused issue slots, was reduced by between 35% and 46%. The vertical waste rate, which measures the percentage of unused issue cycles (no issue slots used in a cycle), was reduced by between 46% and 61%. These results are derived from a set of four sample programs. It was also found that increasing the number of certain functional units did not improve performance, whereas increasing the number of other types of functional units had a significant positive impact on performance.
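The two waste metrics mentioned in this abstract can be made concrete with a short calculation. This is a sketch under assumptions, not the simulator's actual accounting: the function name `waste_rates` and its inputs are hypothetical, horizontal waste is taken (following the common SMT convention) as unused slots in cycles that issued at least one instruction, expressed as a fraction of all issue slots, and vertical waste as the fraction of cycles that issued nothing.

```python
from typing import Sequence, Tuple


def waste_rates(issued_per_cycle: Sequence[int], issue_width: int) -> Tuple[float, float]:
    """Return (horizontal_waste, vertical_waste) for a per-cycle issue trace.

    issued_per_cycle[i] is the number of instructions issued in cycle i;
    issue_width is the maximum number of instructions issuable per cycle.
    """
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    if not issued_per_cycle:
        raise ValueError("trace must contain at least one cycle")
    if any(n < 0 or n > issue_width for n in issued_per_cycle):
        raise ValueError("issue count outside 0..issue_width")

    cycles = len(issued_per_cycle)
    total_slots = cycles * issue_width

    # Vertical waste: cycles in which no issue slot was used at all.
    empty_cycles = sum(1 for n in issued_per_cycle if n == 0)

    # Horizontal waste: leftover slots in cycles that issued something.
    partial_unused = sum(issue_width - n for n in issued_per_cycle if n > 0)

    return partial_unused / total_slots, empty_cycles / cycles


if __name__ == "__main__":
    # Four cycles on a 4-wide machine: full, empty, half full, one slot used.
    horizontal, vertical = waste_rates([4, 0, 2, 1], issue_width=4)
    print(f"horizontal waste: {horizontal:.4f}, vertical waste: {vertical:.4f}")
```

Running several threads in the same cycle attacks both terms at once: another thread can fill the leftover slots of a partially used cycle and can issue during a cycle in which the first thread is stalled.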

    Event Stream Processing with Multiple Threads

    Current runtime verification tools seldom make use of multi-threading to speed up the evaluation of a property on a large event trace. In this paper, we present an extension to the BeepBeep 3 event stream engine that allows the use of multiple threads during the evaluation of a query. Various parallelization strategies are presented and illustrated with simple examples. The implementation of these strategies is then evaluated empirically on a sample of problems. Compared to the previous, single-threaded version of the BeepBeep engine, allocating just a few threads to specific portions of a query provides a dramatic improvement in running time.