515 research outputs found
Mixed Speculative Multithreaded Execution Models
Institute for Computing Systems Architecture
The current trend toward chip multiprocessor architectures has placed great pressure
on programmers and compilers to generate thread-parallel programs. Improved execution
performance can no longer be obtained via traditional single-thread instruction
level parallelism (ILP), but, instead, via multithreaded execution. One notable technique
that facilitates the extraction of parallel threads from sequential applications is
thread-level speculation (TLS). This technique allows programmers/compilers to generate
threads without checking for inter-thread data and control dependences, which
are then transparently enforced by the hardware. Most prior work on TLS has concentrated
on thread selection and mechanisms to efficiently support the main TLS operations,
such as squashes, data versioning, and commits.
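The three TLS operations named above (data versioning, squashes, and commits) can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `SpecThread` and `ToyTLS` are hypothetical, real TLS systems implement these operations in hardware, and the policy of squashing the violating thread together with all younger threads is one common choice, not necessarily the one used in the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class SpecThread:
    """One speculative thread (hypothetical model, not the thesis's hardware)."""
    order: int                                   # position in sequential program order
    reads: set = field(default_factory=set)      # addresses read from older state
    writes: dict = field(default_factory=dict)   # private versioned buffer: addr -> value


class ToyTLS:
    def __init__(self, memory):
        self.memory = dict(memory)   # committed (architectural) state
        self.threads = []            # live speculative threads, oldest first
        self._next_order = 0

    def spawn(self):
        t = SpecThread(order=self._next_order)
        self._next_order += 1
        self.threads.append(t)
        return t

    def load(self, t, addr):
        # Data versioning: own buffer first, then the closest older thread's
        # buffer, then committed memory.
        if addr in t.writes:
            return t.writes[addr]
        t.reads.add(addr)
        older = [o for o in self.threads if o.order < t.order]
        for o in reversed(older):
            if addr in o.writes:
                return o.writes[addr]
        return self.memory[addr]

    def store(self, t, addr, value):
        # Buffer the write, then look for a younger thread that already read
        # this address: it consumed a stale value (dependence violation).
        t.writes[addr] = value
        younger = [y for y in self.threads if y.order > t.order]
        for i, y in enumerate(younger):
            if addr in y.reads:
                victims = younger[i:]            # assumed policy: squash it and all younger
                for v in victims:
                    v.reads.clear()
                    v.writes.clear()
                return [v.order for v in victims]
            if addr in y.writes:
                break                            # later threads see y's version instead
        return []

    def commit(self, t):
        # Commits happen in program order: only the oldest thread may commit.
        if not self.threads or t is not self.threads[0]:
            raise RuntimeError("only the oldest thread may commit")
        self.memory.update(t.writes)
        self.threads.pop(0)
```

In this model, a younger thread that reads `x` before an older thread stores to `x` is squashed, and its re-execution then picks up the forwarded value from the older thread's buffer.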
This thesis seeks to enhance TLS functionality by combining it with other speculative
multithreaded execution models. The main idea is that TLS already requires
extensive hardware support, which when slightly augmented can accommodate other
speculative multithreaded techniques. Recognizing that for different applications, or
even program phases, the application bottlenecks may be different, it is reasonable to
assume that the more versatile a system is, the more efficiently it will be able to execute
the given program.
As mentioned above, generating thread-parallel programs is hard and TLS has
been suggested as an execution model that can speculatively exploit thread-level parallelism
(TLP) even when thread independence cannot be guaranteed by the programmer/
compiler. Alternatively, the helper threads (HT) execution model has been proposed
where subordinate threads are executed in parallel with a main thread in order to
improve the execution efficiency (i.e., ILP) of the latter. Yet another execution model,
runahead execution (RA), has also been proposed where subordinate versions of the
main thread are dynamically created especially to cope with long-latency operations,
again with the aim of improving the execution efficiency of the main thread (ILP).
Each one of these multithreaded execution models works best for different applications
and application phases. We combine these three models into a single execution
model and single hardware infrastructure such that the system can dynamically adapt
to find the most appropriate multithreaded execution model. More specifically, TLS is favored whenever successful parallel execution of instructions in multiple threads
(i.e., TLP) is possible and the system can seamlessly transition at run-time to the other
models otherwise. In order to understand the tradeoffs involved, we also develop a performance
model that allows one to quantitatively attribute overall performance gains
to either TLP or ILP in such a combined multithreaded execution model.
Experimental results show that our combined execution model achieves speedups
of up to 41.2%, with an average of 10.2%, over an existing state-of-the-art TLS system
and speedups of up to 35.2%, with an average of 18.3%, over a flavor of runahead
execution for a subset of the SPEC2000 Integer benchmark suite.
We then investigate how a common ILP-enhancing microarchitectural feature, namely
branch prediction, interacts with TLS. We show that branch prediction for TLS is even
more important than it is for single-core machines. Unfortunately, branch prediction for
TLS systems is also inherently harder. Code partitioning and re-executions of squashed
threads pollute the branch history, making it harder for predictors to be accurate.
We thus propose to augment the hardware, so as to accommodate Multi-Path (MP)
execution within the existing TLS protocol. Under the MP execution model, all paths
following a number of hard-to-predict conditional branches are followed. MP execution
thus removes branches that would otherwise have been mispredicted, helping
the processor to exploit more ILP. We show that with only minimal hardware
support, one can combine these two execution models into a unified one, which can
achieve far better performance than both TLS and MP execution.
Experimental results show that our combined execution model achieves speedups of
up to 20.1%, with an average of 8.8%, over an existing state-of-the-art TLS system and
speedups of up to 125%, with an average of 29.0%, when compared with multi-path
execution for a subset of the SPEC2000 Integer benchmark suite.
Finally, since systems that support speculative multithreading usually treat all
threads equally, they are energy-inefficient. This inefficiency stems from the fact that
speculation occasionally fails and, thus, power is spent on threads that will have to
be discarded. We propose a profitability-based power allocation scheme, where we
“steal” power from non-profitable threads and use it to speed up more useful ones. We
evaluate our techniques for a state-of-the-art TLS system and show that, with minimal
hardware support, we achieve improvements in ED of up to 25.5% with an average of
18.9%, for a subset of the SPEC 2000 Integer benchmark suite.
New scalable machine learning methods: beyond classification and regression
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
The recent surge in data available has spawned a new and promising age of machine
learning. Success cases of machine learning are arriving at an increasing rate as some
algorithms are able to leverage immense amounts of data to produce difficult and highly
accurate predictions. Still, many algorithms in the toolbox of the machine learning practitioner
have been rendered useless in this new scenario due to the complications associated with
large-scale learning. Handling large datasets entails logistical problems, limits the computational
and spatial complexity of the algorithms used, favours methods with few or
no hyperparameters to be configured, and exhibits specific characteristics that complicate
learning. This thesis is centered on the scalability of machine learning algorithms,
that is, their capacity to maintain their effectiveness as the scale of the data grows, and
how it can be improved. We focus on problems for which the existing solutions struggle
when the scale grows. Therefore, we skip classification and regression problems and
focus on feature selection, anomaly detection, graph construction and explainable machine
learning. We analyze four different strategies to obtain scalable algorithms. First,
we explore distributed computation, which is used in all of the presented algorithms.
Besides this technique, we also examine the use of approximate models to speed up
computations, the design of new models that take advantage of a characteristic of the
input data to simplify training, and the enhancement of simple models to enable them
to manage large-scale learning. We have implemented four new algorithms and six
versions of existing ones that tackle the mentioned problems, and for each one we report
experimental results that show both their validity in comparison with competing
methods and their capacity to scale to large datasets. All the presented algorithms
have been made available for download and are being published in journals to enable
practitioners and researchers to use them.
[Resumen]
The recent increase in the amount of available data has given rise to a new and
promising era of machine learning. Successes in this field are occurring
at an ever-increasing pace thanks to the ability of some algorithms to exploit
immense amounts of data to produce difficult and highly accurate predictions.
However, many of the algorithms available to data scientists until now
have lost their effectiveness in this new scenario because of the complications associated
with large-scale learning. Working with large datasets entails
logistical problems, limits the computational and spatial complexity of the algorithms
used, favours methods with few or no hyperparameters to configure, and
presents specific complications that hinder learning. This thesis focuses on
the scalability of machine learning algorithms, that is, on their capacity to
maintain their effectiveness as the scale of the dataset increases. We
focus on problems whose current solutions struggle as the scale increases.
Therefore, setting aside classification and regression, we concentrate on feature
selection, anomaly detection, graph construction, and explainable machine
learning. We analyze four different strategies to obtain scalable
algorithms. First, we explore distributed computation, which is used in
all of the presented algorithms. Besides this technique, we also examine the use
of approximate models to speed up computations, the design of models that exploit
a particularity of the input data to simplify training, and the
enhancement of simple models to adapt them to large-scale learning. We have
implemented four new algorithms and six versions of existing algorithms that
address the mentioned problems, and for each one we detail experimental results
that show both its validity in comparison with previously available methods
and its capacity to scale to large datasets. All the presented algorithms
have been made available to the reader for download and
have been disseminated through publications in scientific journals so that both
researchers and data scientists can learn about them and use them.
[Resumo]
The recent increase in the amount of available data has given rise to a new and promising
era of machine learning. Successes in this field are occurring at an
ever-increasing pace thanks to the ability of some algorithms to exploit immense
amounts of data to produce difficult and highly accurate predictions. However,
many of the algorithms available to data scientists until now have lost their
effectiveness in this new scenario because of the complications associated with
large-scale learning. Working with large datasets entails logistical
problems, limits the computational and spatial complexity of the algorithms used,
favours methods with few or no hyperparameters to configure, and has specific
complications that hinder learning. This thesis focuses on the scalability of
machine learning algorithms, that is, on their capacity to maintain their effectiveness
as the scale of the dataset increases. We address problems for
which the available solutions struggle when the scale grows. Therefore, setting
aside classification and regression, we concentrate on feature selection,
anomaly detection, graph construction, and explainable machine learning.
We analyze four different strategies to obtain scalable algorithms. First,
we explore distributed computation, which we use in all of the presented
algorithms. Besides this technique, we also examine the use of approximate models
to speed up computations, the design of models that exploit a particularity of the
input data to simplify training, and the enhancement of simple models
to adapt them to large-scale learning. We have implemented four new algorithms and
six versions of existing algorithms that address the mentioned problems, and for each
one we present experimental results that show both its validity in comparison
with previously available methods and its capacity to scale to
large datasets. All the presented algorithms have been made available
to the reader for download and have been disseminated through publications in scientific
journals so that both researchers and data scientists can learn about them and
use them.
Mitosis based speculative multithreaded architectures
In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream,
with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread
applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law.
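The constraint that Amdahl's law places on serial sections can be made concrete with a short calculation. The sketch below uses the standard form of the law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of execution time that can be parallelized and n is the number of cores; the function name `amdahl_speedup` is chosen here for illustration only.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of a program runs in parallel.

    parallel_fraction: share of single-core execution time that can be
    spread across cores (0.0 to 1.0); the rest stays serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Even with 90% of the work parallelized, the speedup can never reach 10x.
    for n in (2, 4, 16, 1024):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With half of the execution time serial, for example, two cores give at most a 1.33x speedup, which is why speeding up single-thread sections matters even on multi-core chips.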
Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main
directions have been followed in the research community to take advantage of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the latter partitions the instructions among the cores based on data dependences but avoids a large degree of speculation. Despite the large amount of research on
both these approaches, the proposed techniques so far have shown marginal performance improvements.
In this thesis we propose novel schemes to speed up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices).
Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique fine-grain thread decomposition algorithm that adapts to the available parallelism in the application.
Design and validation of a simultaneous multi-threaded DLX processor
Technical report
Modern-day computer systems rely on two forms of parallelism to achieve high performance: parallelism between individual instructions of a program (ILP) and parallelism between individual threads (TLP). Superscalar processors exploit ILP by issuing several instructions per clock, and multiprocessors (MP) exploit TLP by running different threads in parallel on different processors. A fundamental limitation of these approaches to exploiting parallelism is that processor resources are statically partitioned. If TLP is low, processors in an MP system will be idle, and if ILP is low, issue slots in a superscalar processor will be wasted. As a consequence, the hardware cannot adapt to changing levels of ILP and TLP, and resource utilization tends to be low. Since resource utilization is low, there is potential to achieve higher performance if useful instructions can be found to fill the wasted issue slots. This paper explores a method called simultaneous multithreading (SMT) that addresses the utilization problem by letting multiple threads compete for the resources of a single processor each clock cycle, thus increasing the potential ILP available.
Beyond Dataflow
This paper presents some recent advanced dataflow architectures. While the dataflow concept offers the potential of high performance, the performance of an actual dataflow implementation can be restricted by a limited number of functional units, limited memory bandwidth, and the need to associatively match pending operations with available functional units. Since the early 1970s, there have been significant developments in both fundamental research and practical realizations of dataflow models of computation. In particular, there has been active research and development in multithreaded architectures that evolved from the dataflow model. Some other techniques for combining control flow and dataflow have also emerged, such as coarse-grain dataflow, dataflow with complex machine operations, RISC dataflow, and micro dataflow. These developments have also had a certain impact on the conception of high-performance superscalar processors in the “post-RISC” era.
A Survey on Thread-Level Speculation Techniques
Producción Científica
Thread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior compile-time dependence analysis. In this work, we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field.
MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H5 network (TIN2014-53522-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)
Investigation of a simultaneous multithreaded architecture
Many enhancements have been made to the traditional general-purpose load-store computer architectures. Among the enhancements are memory hierarchy improvements, branch prediction, and multiple-issue processors. A major problem with current microprocessor design is the disparity between the much larger increase in CPU speed and the moderate increase in the speed of accessing main memory. The simultaneous multithreaded architecture is an extension of the single-threaded architecture that helps hide the performance penalty created by long-latency instructions, branch mispredictions, and memory accesses. Simultaneous multithreaded architectures use a more flexible parallelism, which takes advantage of both instruction-level and thread-level parallelism. The goal of this project was to design, simulate, and analyze a model of a simultaneous multithreaded architecture in order to evaluate design alternatives. The simulator was created by modifying a version of the SimpleScalar toolset, developed at the University of Wisconsin. The simulations document an overall system performance improvement for a simultaneous multithreaded architecture. In early simulation results, performed with the same number of functional units, an improvement in the number of instructions per cycle (IPC) of between 43% and 58% was found using four threads versus a single thread. The horizontal waste rate, which measures the number of unused issue slots, was reduced by between 35% and 46%. The vertical waste rate, which measures the percentage of unused issue cycles (no issue slots used in a cycle), was reduced by between 46% and 61%. These results are derived from a set of four sample programs. It was also found that increasing the number of certain functional units did not improve performance, whereas increasing the number of other types of functional units did have a significant positive impact on performance.
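The three metrics reported above (IPC, horizontal waste, and vertical waste) can be computed from a per-cycle issue trace. The following is a minimal sketch, not the project's simulator: the function name `waste_rates` and the trace format are hypothetical, and the horizontal waste rate is assumed here to count unused slots only in cycles that issued at least one instruction, normalized by the total number of issue slots.

```python
def waste_rates(issued_per_cycle, issue_width):
    """Summarize issue-slot utilization for a trace of per-cycle issue counts.

    issued_per_cycle: number of instructions issued in each cycle.
    issue_width: maximum number of instructions that can issue per cycle.
    """
    cycles = len(issued_per_cycle)
    if cycles == 0 or issue_width < 1:
        raise ValueError("need at least one cycle and a positive issue width")
    if any(n < 0 or n > issue_width for n in issued_per_cycle):
        raise ValueError("issue counts must lie between 0 and issue_width")

    # Vertical waste: cycles in which no issue slot was used at all.
    empty_cycles = sum(1 for n in issued_per_cycle if n == 0)
    # Horizontal waste (assumed definition): unused slots in partially used cycles.
    unused_in_busy_cycles = sum(issue_width - n for n in issued_per_cycle if n > 0)

    return {
        "ipc": sum(issued_per_cycle) / cycles,
        "vertical_waste": empty_cycles / cycles,
        "horizontal_waste": unused_in_busy_cycles / (cycles * issue_width),
    }


if __name__ == "__main__":
    # Four cycles on a 4-wide machine: full, empty, half full, one slot used.
    print(waste_rates([4, 0, 2, 1], issue_width=4))
```

Running more threads tends to lower both waste figures because another thread can fill slots, or whole cycles, that a stalled thread leaves empty.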
Event Stream Processing with Multiple Threads
Current runtime verification tools seldom make use of multi-threading to
speed up the evaluation of a property on a large event trace. In this paper, we
present an extension to the BeepBeep 3 event stream engine that allows the use
of multiple threads during the evaluation of a query. Various parallelization
strategies are presented and described on simple examples. The implementation
of these strategies is then evaluated empirically on a sample of problems.
Compared to the previous, single-threaded version of the BeepBeep engine, the
allocation of just a few threads to specific portions of a query provides
a dramatic improvement in terms of running time.