96 research outputs found
Hybrid Caching for Chip Multiprocessors Using Compiler-Based Data Classification
The performance delivered by modern computer systems keeps scaling with an increasing number of processors connected by a distributed on-chip network. As a result, memory access latency, largely dominated by remote data cache accesses and inter-processor communication, is becoming a critical performance bottleneck. To relieve this problem, data accesses must be localized as much as possible while keeping on-chip cache utilization efficient. Achieving this, however, is application dependent and requires keen insight into the memory access characteristics of the applications. This thesis demonstrates how fairly simple, and thus inexpensive, compiler analysis can classify memory accesses into private data accesses and shared data accesses. In addition, we introduce a third classification, named probably private access, and demonstrate the impact of this category compared with the traditional private and shared classification. The classification information from the compiler analysis is then provided to the runtime system through a modified memory allocator and page table to facilitate a hybrid private-shared caching technique. The hybrid cache mechanism is aware of the different access classifications and adopts appropriate placement and search policies accordingly to improve performance. Our analysis demonstrates that many applications have a significant amount of both private and shared data, and that compiler analysis can identify the private data effectively for many applications. Experimental results show that the implemented hybrid caching scheme achieves a 4.03% performance improvement over state-of-the-art NUCA-based caching.
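The class-aware placement idea in this abstract can be illustrated with a minimal sketch. The abstract does not spell out the actual policies, so everything here is an assumption made for illustration: the names `AccessClass`, `home_slice`, and `needs_fallback_search`, the choice to place private and probably-private blocks in the requester's local slice, and the address-interleaved placement of shared blocks.

```python
from enum import Enum


class AccessClass(Enum):
    PRIVATE = "private"
    PROBABLY_PRIVATE = "probably_private"
    SHARED = "shared"


def home_slice(access_class, requester_core, block_addr, num_slices):
    """Pick the cache slice for a block (hypothetical placement policy)."""
    if access_class in (AccessClass.PRIVATE, AccessClass.PROBABLY_PRIVATE):
        # Assumed: data believed to be private stays in the local slice.
        return requester_core % num_slices
    # Assumed: shared data is address-interleaved across all slices.
    return block_addr % num_slices


def needs_fallback_search(access_class):
    """Assumed: only a 'probably private' guess may be wrong, so a local
    miss on such a block would also have to search the shared location."""
    return access_class is AccessClass.PROBABLY_PRIVATE
```

For example, under these assumptions a private block requested by core 3 lands in slice 3, while a shared block at address 0x1234 on a 16-slice chip lands in slice 4.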
Improving virtual memory performance in virtualized environments
Virtual memory is a major system performance bottleneck in virtualized environments. In addition to expensive address translations, frequent virtual machine context switches are common in virtualized environments, resulting in increased TLB miss rates, subsequent expensive page walks, and data cache contention due to incoming page table entries evicting useful data. Orthogonally, translation coherence, which is currently an expensive operation implemented in software, can consume up to 50% of the runtime of an application executing on the guest. To improve the performance of virtual memory in virtualized environments, this thesis proposes two solutions. (1) Context Switch Aware Large TLB (CSALT) is an architecture that addresses the problem of increased TLB miss rates and their adverse impact on data caches. CSALT copes with the increased demand of context switches by storing a large number of TLB entries. It mitigates data cache contention by employing a novel TLB-aware cache partitioning scheme. On 8-core systems that switch between two virtual machine contexts executing multi-threaded workloads, CSALT achieves an average performance improvement of 85% over a baseline with conventional L1-L2 TLBs and 25% over a baseline that has a large L3 TLB. (2) Translation Coherence using Addressable TLBs (TCAT) is a hardware translation coherence scheme that eliminates almost all of the overheads associated with address translation coherence. TCAT overlays translation coherence atop cache coherence to accurately identify slave cores. It then leverages the addressable Part-Of-Memory TLB (POM-TLB) to eliminate expensive Inter Processor Interrupts (IPI) and achieve precise invalidations on the slave core. On 8-core systems with one virtual machine context executing multi-threaded workloads, TCAT achieves an average performance improvement of 13% over the kvmtlb baseline.
Electrical and Computer Engineerin
Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors
Most of the data referenced by sequential and parallel applications running in current chip multiprocessors are referenced by a single thread, i.e., they are private. Recent proposals leverage this observation to improve many aspects of chip multiprocessors, such as reducing coherence overhead or the access latency to distributed caches. The effectiveness of those proposals depends to a large extent on the amount of detected private data. However, the mechanisms proposed so far either do not consider thread migration or the private use of data within different application phases, or entail high overhead. As a result, a considerable amount of private data is not detected. In order to increase the detection of private data, this thesis proposes a TLB-based mechanism that is able to account for both thread migration and private application phases with low overhead. Classification status in the proposed TLB-based classification mechanisms is determined by the presence of the page translation in other cores' TLBs. The classification schemes are analyzed in multilevel TLB hierarchies, for systems with both private and distributed shared last-level TLBs.
This thesis introduces a page classification approach based on inspecting other cores' TLBs upon every TLB miss. In particular, the proposed classification approach is based on the exchange and counting of tokens. Token counting on TLBs is a natural and efficient way of classifying memory pages. It does not require complex and undesirable persistent requests or arbitration, since when two or more TLBs race to access a page, tokens are appropriately distributed, classifying the page as shared.
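A minimal sketch of token-based page classification follows. The abstract does not give the protocol details, so the following are assumptions made for illustration: each page has one token per core, a TLB holding all of a page's tokens treats it as private, a new requester takes one token from the current largest holder, and an evicting TLB hands its tokens to a remaining holder. The class `TokenClassifier` and its methods are hypothetical names.

```python
class TokenClassifier:
    """Toy model of token counting across per-core TLBs (assumed protocol)."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.holdings = {}  # page -> {core: number of tokens held}

    def tlb_miss(self, page, core):
        holders = self.holdings.setdefault(page, {})
        if not holders:
            # No TLB holds the translation: the requester gets every token.
            holders[core] = self.num_cores
        elif core not in holders:
            # Another TLB holds the page: it gives up one token.
            donor = max(holders, key=holders.get)
            holders[donor] -= 1
            holders[core] = 1
        return self.classify(page, core)

    def tlb_evict(self, page, core):
        holders = self.holdings.get(page, {})
        tokens = holders.pop(core, 0)
        if holders and tokens:
            # Hand the evicted tokens to a TLB that still holds the page.
            receiver = next(iter(holders))
            holders[receiver] += tokens

    def classify(self, page, core):
        tokens = self.holdings.get(page, {}).get(core, 0)
        return "private" if tokens == self.num_cores else "shared"
```

In this model a page becomes private again once every other TLB has evicted its translation, which mirrors the abstract's point about tracking private application phases.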
However, the ability of TLB-based mechanisms to classify private pages is strongly dependent on TLB size, as it relies on the presence of a page translation in the system TLBs. To overcome that, different TLB usage predictors (UP) have been proposed, which allow a page classification unaffected by TLB size. Specifically, this thesis introduces a predictor that obtains system-wide page usage information by either employing a shared last-level TLB structure (SUP) or cooperative TLBs working together (CUP).
Esteve García, A. (2017). Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86136
Parallel and Distributed Computing
The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (graphics processing units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. The articles included in this book thus constitute an excellent reference for engineers and researchers with particular interests in any of these topics in parallel and distributed computing.
Application of HPC in eddy current electromagnetic problem solution
As engineering problems are becoming more and more advanced, the size of an average model solved by partial differential equations is rapidly growing and, in order to keep simulation times within reasonable bounds, both faster computers and more efficient software implementations are needed.
In the first part of this thesis, the full potential of simulation software has been exploited through high performance parallel computing techniques. In particular, the simulation of induction heating processes is accomplished within reasonable solution times by implementing different parallel direct solvers for large sparse linear systems in the solution process of a commercial software package. The performance of this library on shared memory systems has been remarkably improved by implementing a multithreaded version of the MUMPS (MUltifrontal Massively Parallel Solver) library, which has been tested on benchmark matrices arising from typical induction heating process simulations.
A new multithreading approach and a low rank approximation technique have been implemented and developed by the MUMPS team in Lyon and Toulouse. In the context of a collaboration between the MUMPS team and DII-University of Padova, a preliminary version of these functionalities could be tested on induction heating benchmark problems, and a substantial reduction of the computational cost and memory requirements could be achieved.
In the second part of this thesis, some examples of design methodology by virtual prototyping have been described. Complex multiphysics simulations involving electromagnetic, circuital, thermal and mechanical problems have been performed by exploiting parallel solvers, as developed in the first part of this thesis. Finally, multiobjective stochastic optimization algorithms have been applied to multiphysics 3D model simulations in search of a set of improved induction heating device configurations.
Doctor of Philosophy
Stochastic methods, dense free-form mapping, atlas construction, and total variation are examples of advanced image processing techniques which are robust but computationally demanding. These algorithms often require a large amount of computational power as well as massive memory bandwidth. These requirements used to be fulfilled only by supercomputers. The development of heterogeneous parallel subsystems and computation-specialized devices such as Graphic Processing Units (GPUs) has brought the requisite power to commodity hardware, opening up opportunities for scientists to experiment and evaluate the influence of these techniques on their research and practical applications. However, harnessing the processing power from modern hardware is challenging. The differences between multicore parallel processing systems and conventional models are significant, often requiring algorithms and data structures to be redesigned significantly for efficiency. It also demands in-depth knowledge about modern hardware architectures to optimize these implementations, sometimes on a per-architecture basis. The goal of this dissertation is to introduce a solution for this problem based on a 3D image processing framework, using high performance APIs at the core level to utilize the parallel processing power of the GPUs. The design of the framework facilitates an efficient application development process, which does not require scientists to have extensive knowledge about GPU systems, and encourages them to harness this power to solve their computationally challenging problems.
To present the development of this framework, four main problems are described, and the solutions are discussed and evaluated: (1) essential components of a general 3D image processing library: data structures and algorithms, as well as how to implement these building blocks on the GPU architecture for optimal performance; (2) an implementation of unbiased atlas construction algorithms, an illustration of how to solve a highly complex and computationally expensive algorithm using this framework; (3) an extension of the framework to account for geometry descriptors to solve registration challenges with large scale shape changes and high intensity-contrast differences; and (4) an out-of-core streaming model, which enables developers to implement multi-image processing techniques on commodity hardware.
Improving Energy and Area Scalability of the Cache Hierarchy in CMPs
As the core counts increase in each chip multiprocessor generation, CMPs should improve scalability in performance, area, and energy consumption to meet the demands of
larger core counts. Directory-based protocols constitute the most scalable alternative.
A conventional directory, however, suffers from an inefficient use of storage and energy.
First, the large, non-scalable, sharer vectors consume unnecessary area and leakage, especially considering that most of the blocks tracked in a directory are cached by a single
core. Second, although increasing directory size and associativity could boost system
performance by reducing the coverage misses, it would come at the expense of area and
energy consumption.
This thesis focuses on and exploits the important differences in behavior between private
and shared blocks from the directory point of view. These differences call for a separate
management of both types of blocks at the directory. First, we propose the PS-Directory,
a two-level directory cache that keeps the reduced number of frequently accessed shared
entries in a small and fast first-level cache, namely Shared Directory Cache, and uses
a larger and slower second-level Private Directory Cache to track the large amount of
private blocks. Experimental results show that, compared to a conventional directory, the PS-Directory improves performance while also reducing silicon area and energy consumption.
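The two-level organization described above can be sketched as follows. This is only an illustrative model under stated assumptions, not the thesis design: the class name `PSDirectory`, the LRU replacement, and the rule that a block moves from the private structure to the shared one when a second core touches it are all hypothetical.

```python
from collections import OrderedDict


class PSDirectory:
    """Toy two-level directory: small shared cache, larger private cache."""

    def __init__(self, shared_entries, private_entries):
        self.shared = OrderedDict()   # block -> set of sharer cores
        self.private = OrderedDict()  # block -> single owner core
        self.shared_cap = shared_entries
        self.private_cap = private_entries

    def access(self, block, core):
        # First level: the small, fast Shared Directory Cache.
        if block in self.shared:
            self.shared[block].add(core)
            self.shared.move_to_end(block)
            return "shared"
        # Second level: the larger Private Directory Cache.
        owner = self.private.get(block)
        if owner is None or owner == core:
            self.private[block] = core
            self.private.move_to_end(block)
            self._trim(self.private, self.private_cap)
            return "private"
        # Assumed: a second core's access promotes the entry to shared.
        del self.private[block]
        self.shared[block] = {owner, core}
        self._trim(self.shared, self.shared_cap)
        return "shared"

    @staticmethod
    def _trim(table, capacity):
        # Assumed LRU eviction of the oldest entries beyond capacity.
        while len(table) > capacity:
            table.popitem(last=False)
```

The private structure stores only an owner rather than a full sharer set, which is where the area saving over a conventional directory would come from in this model.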
In this thesis we also show that the shared/private ratio of entries in the directory varies
across applications and across different execution phases within the applications, which
encourages us to propose Dynamic Way Partitioning (DWP) Directory. DWP-Directory
reduces the number of ways with storage for shared blocks and it allows this storage to be
powered off or on at run-time according to the dynamic requirements of the applications
following a repartitioning algorithm. Results show performance similar to that of a traditional
directory with high associativity, and area requirements similar to those of recent state-of-the-art schemes. In addition, DWP-Directory achieves notable static and dynamic power
consumption savings.
This dissertation also deals with the scalability issues in terms of power found
in processor caches. A significant fraction of the total power budget is consumed by
on-chip caches which are usually deployed with a high associativity degree (even L1
caches are being implemented with eight ways) to enhance the system performance. On
a cache access, each way in the corresponding set is accessed in parallel, which is costly
in terms of energy. This thesis presents the PS-Cache architecture, an energy-efficient
cache design that reduces the number of accessed ways without hurting the performance.
The PS-Cache takes advantage of the private-shared knowledge of the referenced block
to reduce energy by accessing only those ways holding the kind of block looked up.
Results show significant dynamic power consumption savings.
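The way-selection step of the PS-Cache can be illustrated with a very small sketch. It assumes, for illustration only, that each way in a set carries a flag saying whether it holds a private or a shared block and that the type of the looked-up block is known before the access; `ways_to_probe` is a hypothetical helper name.

```python
def ways_to_probe(way_is_private, lookup_is_private):
    """Return the indices of the ways whose block type matches the lookup."""
    return [
        way
        for way, is_private in enumerate(way_is_private)
        if is_private == lookup_is_private
    ]
```

With an eight-way set in which five ways hold private blocks, a lookup for a shared block would probe only the remaining three ways instead of all eight.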
Finally, we propose an energy-efficient architectural design that can be effectively applied
to any kind of set-associative cache memory, not only to processor caches. The proposed
approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target
cache set, and just a few ways are searched in the tag and data arrays. This allows the
approach to reduce the dynamic energy consumption of caches without hurting their
access time. For this purpose, the proposed architecture holds the X least significant
bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter
the ways where the least significant bits of the tag do not match with the bits in the
X-bit array. Experimental results show that this filtering mechanism achieves energy
consumption in set-associative caches similar to that of direct-mapped ones.
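The filtering step of the Tag Filter can be sketched as below. This is a software model under assumptions, not the hardware design: the function names are hypothetical, and the low bits are derived from the full tags on the fly, whereas the architecture described above keeps them in a separate X-bit-wide array.

```python
def filter_ways(set_tags, lookup_tag, x_bits):
    """Keep only the ways whose X least significant tag bits match."""
    mask = (1 << x_bits) - 1
    return [
        way
        for way, tag in enumerate(set_tags)
        if tag is not None and (tag & mask) == (lookup_tag & mask)
    ]


def lookup(set_tags, lookup_tag, x_bits):
    """Return (hit way or None, number of ways actually probed)."""
    candidates = filter_ways(set_tags, lookup_tag, x_bits)
    for way in candidates:
        # Only the surviving ways get a full tag comparison.
        if set_tags[way] == lookup_tag:
            return way, len(candidates)
    return None, len(candidates)
```

For a four-way set holding tags 0b101100, 0b011101, 0b110100 and 0b000111, a lookup of 0b110100 with X = 2 probes only ways 0 and 2 and hits in way 2.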
Experimental results show that the proposals presented in this thesis offer a good tradeoff
among the three major design axes of performance, area, and energy consumption.
Valls Mompó, JJ. (2017). Improving Energy and Area Scalability of the Cache Hierarchy in CMPs [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79551
New hardware support transactional memory and parallel debugging in multicore processors
This thesis contributes to the area of hardware support for parallel programming by introducing new hardware elements in multicore processors, with the aim of improving the performance of, and optimizing, new tools, abstractions and applications related to parallel programming, such as transactional memory and data race detectors. Specifically, we configure a hardware transactional memory system with signatures as part of the hardware support, and we develop a new hardware filter for reducing the signature size. We also develop the first hardware asymmetric data race detector (which is also able to tolerate data races), likewise based on hardware signatures. Finally, we propose a new hardware signature module that solves some of the problems we found in the previous tools related to the lack of flexibility in hardware signatures.
PS-Architecture: A scalable and energy-efficient architecture for CMP NUCAs
As the number of cores increases in both upcoming and future shared-memory chip-multiprocessor (CMP) generations, coherence protocols and all elements in the cache hierarchy must scale to sustain performance. In this work we attack the scalability problem in CMPs by studying and proposing improvements for two of those elements, namely the directory and the data caches. Each of these two structures has its particular issues, which we try to solve with mechanisms that exploit the different types of blocks found in parallel workloads.
We introduce the PS directory, a directory cache that uses two different cache structures, each one tailored to one of these types of blocks (i.e., private and shared). The Shared directory cache, which tracks shared blocks, is small and fast, with low associativity. The Private directory cache is aimed at tracking private blocks, which are highly dominant in current workloads. This structure does not store the sharer vector, is larger than the shared cache, and has higher associativity.
We also introduce the PS cache, an energy-efficient cache design which only accesses a subset of the set ways without hurting performance.
Valls Mompó, JJ. (2013). PS-Architecture: A scalable and energy-efficient architecture for CMP NUCAs. http://hdl.handle.net/10251/44383