38 research outputs found
Hybrid Caching for Chip Multiprocessors Using Compiler-Based Data Classification
The high performance delivered by modern computer systems keeps scaling with an increasing number of processors connected through a distributed on-chip network. As a result, memory access latency, largely dominated by remote cache accesses and inter-processor communication, is becoming a critical performance bottleneck. To relieve this problem, data accesses must be localized as much as possible while keeping on-chip cache memory utilization efficient. Achieving this, however, is application dependent and requires keen insight into the memory access characteristics of the applications. This thesis demonstrates how fairly simple, and thus inexpensive, compiler analysis can classify memory accesses into private data accesses and shared data accesses. In addition, we introduce a third classification, named probably private access, and demonstrate the impact of this category compared to the traditional private/shared classification. The classification information from the compiler analysis is then provided to the runtime system through a modified memory allocator and page table to facilitate a hybrid private-shared caching technique. The hybrid cache mechanism is aware of the different data access classes and adopts appropriate placement and search policies accordingly to improve performance. Our analysis demonstrates that many applications have a significant amount of both private and shared data and that compiler analysis can identify the private data effectively for many applications. Experimental results show that the implemented hybrid caching scheme achieves a 4.03% performance improvement over state-of-the-art NUCA-based caching.
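The classification-driven placement described above can be sketched in a few lines. The class names, the three-way split, and the interleaving policy below are illustrative assumptions, not the thesis's actual implementation:

```python
# Hypothetical sketch of classification-guided block placement in a
# hybrid private/shared cache (class names and policy are assumptions).

PRIVATE, PROBABLY_PRIVATE, SHARED = "private", "probably_private", "shared"

def home_tile(addr, requester, n_tiles, classification):
    """Pick the L2 slice that should hold a block.

    Private and probably-private data are kept in the requesting
    tile's local slice; shared data is address-interleaved across
    all slices, as in a conventional shared NUCA cache.
    """
    cls = classification.get(addr, SHARED)  # unclassified -> treat as shared
    if cls in (PRIVATE, PROBABLY_PRIVATE):
        return requester             # localize: place in the local slice
    return (addr >> 6) % n_tiles     # 64-byte blocks, interleave by address

classification = {0x1000: PRIVATE, 0x2000: SHARED}
assert home_tile(0x1000, requester=3, n_tiles=16, classification=classification) == 3
assert home_tile(0x2000, requester=3, n_tiles=16, classification=classification) == 0
```

A runtime system could populate `classification` from the compiler's output via the modified memory allocator the abstract mentions.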
Latency reduction techniques in chip multiprocessor cache systems
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 117-122). Single-chip multiprocessors (CMPs) solve several bottlenecks facing chip designers today. Compared to traditional superscalars, CMPs deliver higher performance at lower power for thread-parallel workloads. In this thesis, we consider tiled CMPs, a class of CMPs where each tile contains a slice of the total on-chip L2 cache storage, and tiles are connected by an on-chip network. Two basic schemes are currently used to manage L2 slices. First, each slice can be used as a private L2 for its tile. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, because each tile creates a local copy of any block it touches. Second, all slices can be aggregated to form a single large L2 shared by all tiles. A shared L2 cache increases the effective cache capacity for shared data but incurs longer hit latencies when L2 data is on a remote tile. In practice, either private or shared works better for a given workload. We present two new policies, victim replication and victim migration, both of which combine the advantages of the private and shared designs. They are variants of the shared scheme which attempt to keep copies of local L1 cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefits of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of single-threaded, multi-threaded, and multi-programmed workloads running on an eight-processor tiled CMP. We show that both techniques achieve significant performance improvement over the baseline private and shared schemes for these workloads. By Michael Zhang. Ph.D.
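The victim-replication idea can be illustrated with a minimal sketch. The replacement preference here (an invalid way first, then an existing replica, never a primary copy) is a simplified assumption about the policy, not the thesis's exact mechanism:

```python
# Hedged sketch of victim replication: on an L1 eviction, try to keep a
# copy of the victim in the local L2 slice so later hits avoid a remote
# access. The eviction preference order is an illustrative assumption.

class L2Slice:
    def __init__(self, ways):
        # each way holds None (invalid) or {'addr': ..., 'is_replica': ...}
        self.lines = [None] * ways

    def replicate_victim(self, addr):
        """Install an L1 victim as a local replica if a cheap slot exists."""
        for i, line in enumerate(self.lines):
            if line is None:                        # prefer invalid ways
                self.lines[i] = {"addr": addr, "is_replica": True}
                return True
        for i, line in enumerate(self.lines):
            if line["is_replica"]:                  # then evict another replica
                self.lines[i] = {"addr": addr, "is_replica": True}
                return True
        return False  # never displace a primary copy just to hold a replica

slice_ = L2Slice(ways=2)
slice_.lines[0] = {"addr": 0xA0, "is_replica": False}   # primary copy
assert slice_.replicate_victim(0xB0) is True            # fills the invalid way
assert slice_.replicate_victim(0xC0) is True            # replaces the 0xB0 replica

full = L2Slice(ways=2)
full.lines = [{"addr": 0xA0, "is_replica": False},
              {"addr": 0xA1, "is_replica": False}]
assert full.replicate_victim(0xB0) is False             # only primary copies present
```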
Software-Oriented Distributed Shared Cache Management for Chip Multiprocessors
This thesis proposes a software-oriented distributed shared cache management approach for chip multiprocessors (CMPs). Unlike hardware-based schemes, our approach offloads the cache management task to a trace analysis phase, allowing flexible management strategies. For single-threaded programs, a static 2D page coloring scheme is proposed that utilizes oracle trace information to derive an optimal data placement scheme for a program. In addition, a dynamic 2D page coloring scheme is proposed as a practical solution that tries to approach the performance of the static scheme. The evaluation results show that the static scheme achieves a 44.7% performance improvement over the conventional shared cache scheme on average, while the dynamic scheme performs 32.3% better than the shared cache scheme. For latency-oriented multithreaded programs, a pattern recognition algorithm based on the K-means clustering method is introduced. The algorithm tries to identify data access patterns that can be utilized to guide the placement of private data and the replication of shared data. The experimental results show that data placement and replication based on these access patterns lead to a 19% performance improvement over the shared cache scheme. The reduced remote cache accesses and aggregate cache miss rate result in much lower bandwidth requirements for the on-chip network and the off-chip main memory bus. Lastly, for throughput-oriented multithreaded programs, we propose a hint-guided data replication scheme that identifies memory instructions of a target program that access data with a high reuse property. The derived hints are then used to guide data replication at run time. By balancing the amount of data replication against local cache pressure, the proposed scheme has the potential to achieve performance comparable to the best existing hardware-based schemes. Our proposed software-oriented shared cache management approach is an effective way to manage program performance on CMPs.
This approach provides an alternative direction for research on the distributed cache management problem. Given the known difficulties (e.g., scalability and design complexity) we face with hardware-based schemes, this software-oriented approach may receive serious consideration from researchers in the future. In this respect, the thesis provides valuable contributions to the computer architecture research community.
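The K-means-based pattern recognition mentioned above can be sketched with a tiny one-dimensional clustering. The single feature used here (fraction of a page's accesses coming from its busiest core) is an illustrative assumption, not the thesis's actual feature set:

```python
# Hedged sketch: cluster pages by their per-core access mix with a tiny
# k-means to separate "mostly-private" from "widely-shared" pages.

def kmeans_1d(xs, k=2, iters=20):
    """Plain 1D k-means; initial centers are the extreme values."""
    centers = [min(xs), max(xs)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in xs:
            i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            groups[i].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def dominant_fraction(access_counts):
    """Fraction of a page's accesses coming from its busiest core."""
    total = sum(access_counts)
    return max(access_counts) / total if total else 0.0

pages = {
    "p0": [100, 0, 0, 0],    # touched by one core  -> private-like
    "p1": [25, 25, 25, 25],  # touched evenly       -> shared-like
    "p2": [90, 5, 3, 2],
}
feats = {p: dominant_fraction(c) for p, c in pages.items()}
lo, hi = sorted(kmeans_1d(list(feats.values())))
# pages nearer the high center are candidates for local placement
private_like = {p for p, f in feats.items() if abs(f - hi) < abs(f - lo)}
assert private_like == {"p0", "p2"}
```

The resulting clusters could then guide where private data is placed and which shared data is worth replicating.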
Improving Energy and Area Scalability of the Cache Hierarchy in CMPs
As the core counts increase in each chip multiprocessor generation, CMPs should improve scalability in performance, area, and energy consumption to meet the demands of
larger core counts. Directory-based protocols constitute the most scalable alternative.
A conventional directory, however, suffers from an inefficient use of storage and energy.
First, the large, non-scalable, sharer vectors consume unnecessary area and leakage, especially considering that most of the blocks tracked in a directory are cached by a single
core. Second, although increasing directory size and associativity could boost system
performance by reducing the coverage misses, it would come at the expense of area and
energy consumption.
This thesis focuses on and exploits the important behavioral differences between private and shared blocks from the directory's point of view. These differences call for separate management of both types of blocks at the directory. First, we propose the PS-Directory,
a two-level directory cache that keeps the reduced number of frequently accessed shared
entries in a small and fast first-level cache, namely Shared Directory Cache, and uses
a larger and slower second-level Private Directory Cache to track the large amount of
private blocks. Experimental results show that, compared to a conventional directory, the PS-Directory improves performance while also reducing silicon area and energy consumption.
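The two-level lookup can be sketched as follows; the probe order, the latency figures, and the entry formats are assumptions for illustration only:

```python
# Illustrative sketch of a PS-Directory lookup: a small, fast first
# level for shared entries is probed before the larger, slower private
# level. The latencies (1 vs. 3 cycles) are made-up assumptions.

def ps_directory_lookup(addr, shared_dir, private_dir):
    """Return (entry, latency_cycles); entry is None on a directory miss."""
    if addr in shared_dir:           # small, fast Shared Directory Cache
        return shared_dir[addr], 1
    if addr in private_dir:          # larger, slower Private Directory Cache
        return private_dir[addr], 3
    return None, 3                   # miss after probing both levels

shared_dir = {0x40: {"sharers": {0, 2, 5}}}
private_dir = {0x80: {"owner": 1}}
assert ps_directory_lookup(0x40, shared_dir, private_dir)[1] == 1
assert ps_directory_lookup(0x80, shared_dir, private_dir)[1] == 3
```

Since most tracked blocks are private, the common case avoids wasting fast first-level capacity on single-sharer entries.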
In this thesis we also show that the shared/private ratio of entries in the directory varies
across applications and across different execution phases within the applications, which
encourages us to propose the Dynamic Way Partitioning (DWP) Directory. DWP-Directory reduces the number of ways that store shared blocks and allows this storage to be powered on or off at run time according to the dynamic requirements of the applications, following a repartitioning algorithm. Results show performance similar to that of a traditional directory with high associativity and area requirements similar to those of recent state-of-the-art schemes. In addition, DWP-Directory achieves notable static and dynamic power consumption savings.
This dissertation also deals with the scalability issues in terms of power found
in processor caches. A significant fraction of the total power budget is consumed by
on-chip caches which are usually deployed with a high associativity degree (even L1
caches are being implemented with eight ways) to enhance the system performance. On
a cache access, each way in the corresponding set is accessed in parallel, which is costly
in terms of energy. This thesis presents the PS-Cache architecture, an energy-efficient cache design that reduces the number of accessed ways without hurting performance. The PS-Cache takes advantage of the private-shared knowledge of the referenced block to reduce energy by accessing only those ways that hold the kind of block being looked up.
Results show significant dynamic power consumption savings.
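The PS-Cache way-filtering step can be sketched very simply; the per-way class tags below are an assumed bookkeeping structure:

```python
# Hedged sketch of PS-Cache way filtering: each way records whether its
# block is private or shared, and a lookup only probes ways whose class
# matches the known class of the referenced block.

def filtered_ways(way_classes, block_class):
    """Indices of the ways worth probing for a block of a known class."""
    return [w for w, c in enumerate(way_classes) if c == block_class]

way_classes = ["private", "shared", "private", "private",
               "shared", "private", "shared", "private"]
# A lookup for a block known to be shared probes 3 of 8 ways.
assert filtered_ways(way_classes, "shared") == [1, 4, 6]
```

Fewer probed ways means fewer tag and data array reads per access, which is where the dynamic energy savings come from.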
Finally, we propose an energy-efficient architectural design that can be effectively applied
to any kind of set-associative cache memory, not only to processor caches. The proposed
approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target
cache set, and just a few ways are searched in the tag and data arrays. This allows the
approach to reduce the dynamic energy consumption of caches without hurting their
access time. For this purpose, the proposed architecture holds the X least significant
bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter out the ways whose stored low-order tag bits do not match those of the incoming tag. Experimental results show that this filtering mechanism brings the energy consumption of set-associative caches close to that of direct-mapped ones.
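The Tag Filter mechanism can be sketched in a few lines; the choice X = 4 and the particular tag values are illustrative assumptions:

```python
# Sketch of the Tag Filter idea: keep the X least-significant tag bits
# in a small side array and probe only the ways whose stored bits match
# the low bits of the incoming tag. X = 4 is an assumed parameter.

X = 4
LOW_MASK = (1 << X) - 1

def ways_to_probe(set_low_bits, tag):
    """Return the way indices whose X low tag bits match the request."""
    want = tag & LOW_MASK
    return [w for w, bits in enumerate(set_low_bits) if bits == want]

# One 8-way set: the side array holds the low 4 bits of each way's tag.
stored_tags = [0x1A3, 0x2B7, 0x0F3, 0x391, 0x143, 0x255, 0x3C3, 0x0A1]
set_low_bits = [t & LOW_MASK for t in stored_tags]

probe = ways_to_probe(set_low_bits, tag=0x1F3)   # low tag bits are 0x3
assert probe == [0, 2, 4, 6]   # only 4 of the 8 ways are fully searched
```

Only the filtered ways then read their full tag and data arrays, so access time is unaffected while dynamic energy drops.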
Experimental results show that the proposals presented in this thesis offer a good tradeoff
among these three major design axes: performance, area, and energy consumption.

Valls Mompó, JJ. (2017). Improving Energy and Area Scalability of the Cache Hierarchy in CMPs [Unpublished doctoral dissertation]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79551
Architectural Support for Efficient Communication in Future Microprocessors
Traditionally, the microprocessor design has focused on the computational aspects
of the problem at hand. However, as the number of components on a single chip
continues to increase, the design of communication architecture has become a crucial
and dominating factor in defining performance models of the overall system. On-chip
networks, also known as Networks-on-Chip (NoCs), have recently emerged as a promising architecture for coordinating chip-wide communication.
Although there are numerous interconnection network studies in an inter-chip
environment, an intra-chip network design poses a number of substantial challenges
to this well-established interconnection network field. This research investigates designs
and applications of on-chip interconnection networks in next-generation microprocessors
for optimizing performance, power consumption, and area cost. First,
we present domain-specific NoC designs targeted to large-scale and wire-delay dominated
L2 cache systems. The domain-specifically designed interconnect shows 38%
performance improvement and uses only 12% of the mesh-based interconnect. Then,
we present a methodology for characterizing communication in parallel programs and apply the characterization results to long-channel reconfiguration. Reconfigured long channels suited to the observed communication patterns reduce the latency of the mesh network by 16% and 14% in 16-core and 64-core systems, respectively. Finally, we discuss an adaptive data compression technique that builds a network-wide frequent-value pattern map and reduces packet size. In the two multi-core systems examined, cache traffic is 69% compressible and shows high value sharing among flows. The compression-enabled NoC improves latency by up to 63% and reduces energy consumption by up to 12%.
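The frequent-value compression idea can be sketched as follows; the table contents and the two-field flit encoding are assumptions for illustration, not the dissertation's wire format:

```python
# Illustrative sketch of frequent-value packet compression: values found
# in a network-wide frequent-value table (FVT) travel as short indices,
# the rest as full words. Table contents are an assumed example.

FVT = [0x00000000, 0xFFFFFFFF, 0x00000001]  # shared frequent-value table

def compress(words):
    """Encode each 32-bit word as a table index when possible."""
    out = []
    for w in words:
        if w in FVT:
            out.append(("idx", FVT.index(w)))   # a few bits instead of 32
        else:
            out.append(("raw", w))              # uncompressible word
    return out

def decompress(flits):
    """Recover the original words from the encoded flits."""
    return [FVT[v] if kind == "idx" else v for kind, v in flits]

payload = [0x0, 0xDEADBEEF, 0xFFFFFFFF]
assert decompress(compress(payload)) == payload   # lossless round trip
```

Keeping the table consistent network-wide is what makes the index meaningful to every receiver, which is why the map is built cooperatively across flows.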
Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors
Current multicores face the challenge of sharing resources among the different processor cores.
Two main shared resources act as major performance bottlenecks in current designs: the off-chip main memory bandwidth and the last level cache.
Additionally, as the core count grows, the network on-chip is also becoming a potential performance bottleneck, since traditional designs may find scalability issues in the near future.
Memory hierarchies communicated through fast interconnects are implemented in almost every current design, as they reduce both the number of off-chip accesses and the overall memory access latency.
Main memory, caches, and interconnection resources, together with other widely-used techniques like prefetching, help alleviate the huge memory access latencies and limit the impact of the core-memory speed gap.
However, sharing these resources brings several concerns, being one of the most challenging the management of the inter-application interference.
Since almost every running application needs to access main memory, all of them are exposed to interference from their co-runners on the way to the memory controller.
For this reason, making an efficient use of the available cache space, together with achieving fast and scalable interconnects, is critical to sustain the performance in current and future designs.
This dissertation analyzes and addresses the most important shortcomings of two major shared resources: the Last Level Cache (LLC) and the Network on Chip (NoC).
First, we study the scalability of both electrical and optical NoCs for future multicores and many-cores.
To perform this study, we model optical interconnects in a cycle-accurate multicore simulation framework. A proper model is required; otherwise, important performance deviations may be observed in the evaluation results.
The study reveals that, as the core count grows, the effect of distance on the end-to-end latency can negatively impact processor performance.
In contrast, the study also shows that silicon nanophotonics is a viable solution to the mentioned latency problems.
This dissertation is also motivated by important design concerns related to current memory hierarchies, like the oversizing of private cache space, data replication overheads, and lack of flexibility regarding sharing of cache structures.
These issues, which can be overcome in high performance processors by virtue of huge LLCs, can compromise performance in low power processors.
To address these issues we propose a more efficient cache hierarchy organization that leverages optical interconnects.
The proposed architecture is conceived as an optically interconnected two-level cache hierarchy composed of multiple cache modules that can be dynamically turned on and off independently.
Experimental results show that, compared to conventional designs, static energy consumption is improved by up to 60% while achieving similar performance results.
Finally, we extend the proposal to support both sequential and parallel applications.
This extension is required since the proposal adapts to the dynamic cache space needs of the running applications, and multithreaded applications' behaviors differ widely from those of single-threaded programs.
In addition, coherence management is also addressed, which is challenging since each cache module can be assigned to any core at a given time in the proposed approach.
For parallel applications, the evaluation shows that the proposal achieves up to 78% static energy savings.
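The dynamic on/off management of cache modules can be sketched with a simple demand-driven policy; the occupancy thresholds and single-module step are assumed values, not the dissertation's algorithm:

```python
# Minimal sketch, under assumed thresholds, of powering cache modules
# on or off to track an application's working-set demand.

def adjust_modules(active, demand_ratio, total, low=0.4, high=0.9):
    """Power a module on when occupancy is high, off when it is low."""
    if demand_ratio > high and active < total:
        return active + 1   # working set overflows: enable one more module
    if demand_ratio < low and active > 1:
        return active - 1   # underused: power one down, save static energy
    return active           # demand within bounds: no change

assert adjust_modules(active=2, demand_ratio=0.95, total=4) == 3
assert adjust_modules(active=2, demand_ratio=0.20, total=4) == 1
```

Powered-down modules stop leaking, which is consistent with the static energy savings the evaluation reports.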
In summary, this thesis tackles major challenges originated by the sharing of on-chip caches and communication resources in current multicores, and proposes new cache hierarchy organizations leveraging optical interconnects to address them.
The proposed organizations reduce both static and dynamic energy consumption compared to conventional approaches while achieving similar performance, which results in better energy efficiency.

Puche Lara, J. (2021). Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors [Doctoral dissertation]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/165254
Software-Oriented Data Access Characterization for Chip Multiprocessor Architecture Optimizations
The integration of an increasing amount of on-chip hardware in chip multiprocessors (CMPs) poses the challenge of using on-chip resources efficiently to maximize performance. Prior research proposals largely rely on additional hardware support to achieve the desired tradeoffs. However, these purely hardware-oriented mechanisms are typically more generic but less efficient. A newer trend is to design adaptive systems that exploit application-level information. In this work a wide range of applications is analyzed, and notable data access behaviors and patterns are identified that are useful for architectural and system optimizations. In particular, this dissertation introduces software-based techniques that extract data access characteristics for cross-layer optimizations of performance and scalability. The collected information is used to guide cache data placement, network configuration, coherence operations, address translation, memory configuration, and more. Specifically, an approach is proposed to classify data blocks into different categories to optimize an on-chip coherent cache organization. For applications with compile-time-deterministic data access locality, a compiler technique
is proposed that determines data partitions, which guide last-level cache data placement and expose communication patterns for network configuration. A page-level data classification is also demonstrated to improve address-translation performance. The successful use of data access characteristics on traditional CMP architectures shows that the proposed approach is promising and generic, and could be applied to future CMP architectures built with emerging technologies such as spin-transfer torque RAM (STT-RAM).
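The block-classification idea above can be illustrated with a toy classifier over an access trace. The two categories and the placement rule below are simplifying assumptions for illustration; the dissertation derives such classifications through compiler analysis rather than a runtime trace:

```python
from collections import defaultdict


def classify_blocks(accesses):
    """Classify blocks by observed sharing.
    accesses: iterable of (thread_id, block_addr) pairs.
    Returns {block_addr: "private" | "shared"} (illustrative categories)."""
    owners = defaultdict(set)
    for tid, block in accesses:
        owners[block].add(tid)
    return {b: ("private" if len(tids) == 1 else "shared")
            for b, tids in owners.items()}


def place_block(block, category, home_of, requester):
    """Toy placement policy: private data stays in the requester's local
    cache slice; shared data goes to an address-interleaved home slice."""
    return requester if category == "private" else home_of(block)
```

Keeping private data local removes remote-slice hops for the common case, while interleaving shared data preserves a single, easily located copy, the same tradeoff the hybrid private-shared schemes above exploit.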
Improving heterogeneous system efficiency : architecture, scheduling, and machine learning
Computer architects are beginning to embrace heterogeneous systems as an effective way to turn increasing transistor densities into performance for a diverse range of workloads under varying performance and energy constraints. As heterogeneous systems become more ubiquitous, architects will need to develop novel CPU scheduling techniques capable of exploiting the diversity of computational resources. By recognizing hardware diversity, state-of-the-art heterogeneous schedulers produce significant performance improvements over their predecessors and enable more flexible system designs. Nearly all of them, however, are unable to efficiently identify the mapping scheme that will yield the highest system performance.
Accurately estimating the performance of applications on different heterogeneous resources gives a scheduler a significant advantage in identifying a performance-maximizing mapping scheme. Recent advances in machine learning, including artificial neural networks, have produced powerful and practical prediction models in a variety of fields. To date, however, machine learning has seen little use in heterogeneous scheduling for maximizing system throughput.
The core issue we address is how to understand and exploit the rise of heterogeneous architectures, the benefits of heterogeneous scheduling, and the promise of machine learning techniques for maximizing system performance. We present studies that promote a future computing model capable of supporting massive hardware diversity, discuss the constraints faced by designers of heterogeneous systems, explore the advantages and shortcomings of conventional heterogeneous schedulers, and pioneer the application of machine learning to optimizing mapping and system throughput. The goal of this thesis is to highlight the importance of efficiently exploiting heterogeneity and to validate the opportunities that machine learning offers across several areas of computer architecture.
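Predictor-guided mapping in the spirit of this abstract can be sketched in a few lines. The linear IPC model, the feature vectors, and the exhaustive search over mappings are illustrative assumptions (a real scheduler would use a learned model and a heuristic search):

```python
from itertools import permutations


def predict_ipc(weights, features):
    """Dot-product performance model for one (core type, application) pair.
    A stand-in for the learned predictors the thesis advocates."""
    return sum(w * f for w, f in zip(weights, features))


def best_mapping(core_models, app_features):
    """Exhaustively map one app per core and return the mapping with the
    highest total predicted throughput (fine for small systems only).
    core_models[i] is the weight vector for core i's type;
    app_features[j] is the feature vector for application j."""
    best, best_tput = None, float("-inf")
    for perm in permutations(range(len(app_features))):
        tput = sum(predict_ipc(core_models[core], app_features[app])
                   for core, app in enumerate(perm))
        if tput > best_tput:
            best, best_tput = perm, tput
    return best, best_tput
```

With a big core that rewards compute-bound code and punishes memory intensity, and a little core that is indifferent, the search correctly sends the compute-bound application to the big core, exactly the mapping decision a diversity-blind scheduler would get wrong.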
Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures
The growing core counts and caches of modern processors make data access latency a function of a datum's physical location in the cache; thus, the placement of cache blocks determines the cache's performance. Reactive nonuniform cache architectures (R-NUCA) achieve near-optimal cache block placement by classifying blocks online and placing data close to the cores that use them.
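The online classification R-NUCA performs can be sketched roughly as follows; the state machine and placement rules here are simplified assumptions in the spirit of the abstract, not the paper's exact policy:

```python
class BlockDirectory:
    """Toy R-NUCA-style classifier: a block starts private to its first
    requester and becomes shared when a second tile touches it;
    instruction blocks form a separate class."""
    def __init__(self, num_tiles):
        self.num_tiles = num_tiles
        self.state = {}  # block -> ("private", owner) | ("shared",) | ("instr",)

    def access(self, block, tile, is_instruction=False):
        """Reclassify on each access and return the L2 slice to use."""
        if is_instruction:
            self.state[block] = ("instr",)
        elif block not in self.state or self.state[block][0] == "instr":
            self.state[block] = ("private", tile)
        elif self.state[block] == ("private", tile):
            pass  # still private to the same tile
        else:
            self.state[block] = ("shared",)  # a second tile touched it

        kind = self.state[block][0]
        if kind == "private":
            return tile                          # place at the requesting tile
        if kind == "shared":
            return block % self.num_tiles        # address-interleaved home
        return (tile + block) % self.num_tiles   # spread instructions nearby
```

The payoff of classifying online is that the common case (private data) gets local-slice latency without any software hints, while shared data keeps a single deterministic home that any tile can locate without a directory lookup.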