    Runtime-assisted optimizations in the on-chip memory hierarchy

    Following Moore's Law, the number of transistors on chip has been increasing exponentially, which has led to the increasing complexity of modern processors. As a result, the efficient programming of such systems has become more difficult. Many programming models have been developed to answer this issue. Of particular interest are task-based programming models that employ simple annotations to define parallel work in an application. The information available at the level of the runtime systems associated with these programming models offers great potential for improving hardware design. Moreover, due to technological limitations, Moore's Law is predicted to eventually come to an end, so novel paradigms are necessary to maintain the current performance improvement trends. The main goal of this thesis is to exploit the knowledge about a parallel application available at the runtime system level to improve the design of the on-chip memory hierarchy. The coupling of the runtime system and the microprocessor enables a better hardware design without hurting the programmability. The first contribution is a set of insertion policies for shared last-level caches that exploit information about tasks and task data dependencies. The intuition behind this proposal revolves around the observation that parallel threads exhibit different memory access patterns. Even within the same thread, accesses to different variables often follow distinct patterns. The proposed policies insert cache lines into different logical positions depending on the dependency type and task type to which the corresponding memory request belongs. The second proposal optimizes the execution of reductions, defined as a programming pattern that combines input data to form the resulting reduction variable. This is achieved with a runtime-assisted technique for performing reductions in the processor's cache hierarchy. 
The proposal's goal is to be a universally applicable solution regardless of the reduction variable type, size and access pattern. On the software level, the programming model is extended to let a programmer specify the reduction variables for tasks, as well as the desired cache level where a certain reduction will be performed. The source-to-source compiler and the runtime system are extended to translate and forward this information to the underlying hardware. On the hardware level, private and shared caches are equipped with functional units and the accompanying logic to perform reductions at the cache level. This design avoids unnecessary data movements to the core and back, as the data is operated on at the place where it resides. The third contribution is a runtime-assisted prioritization scheme for memory requests inside the on-chip memory hierarchy. The proposal is based on the notion of a critical path in the context of parallel codes and the known fact that accelerating critical tasks reduces the execution time of the whole application. In the context of this work, task criticality is observed at the level of a task type, as this enables simple annotation by the programmer. The acceleration of critical tasks is achieved by prioritizing the corresponding memory requests in the microprocessor.
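
As a rough illustration of the first contribution, the sketch below models one set of a shared last-level cache as a recency stack and chooses the insertion position from the dependency type of the task that issued the request. The mapping in `INSERT_POS`, the class name `CacheSet`, and the four-way associativity are assumptions made for illustration; the abstract does not specify the actual policies.

```python
# Hypothetical sketch: one cache set as a recency stack with
# dependency-aware insertion. Index 0 is the MRU position and the
# last index is the LRU position (the next victim).

# Assumed mapping, not taken from the thesis: fraction of the stack depth
# at which a missing line is inserted, per task dependency type.
INSERT_POS = {"out": 0.0, "inout": 0.5, "in": 1.0}


class CacheSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.stack = []  # line addresses ordered from MRU to LRU

    def access(self, addr, dep_type):
        """Return True on a hit, False on a miss (the line is then inserted)."""
        if addr in self.stack:
            # Hit: promote the line to MRU, as plain LRU would.
            self.stack.remove(addr)
            self.stack.insert(0, addr)
            return True
        if len(self.stack) == self.ways:
            self.stack.pop()  # evict the line in the LRU position
        # Miss: the insertion position depends on the dependency type.
        pos = int(INSERT_POS[dep_type] * len(self.stack))
        self.stack.insert(pos, addr)
        return False
```

With this mapping, a line requested on behalf of an `in` dependency enters at the LRU end and is evicted first unless it is reused, while an `out` line enters at MRU. A real policy would also take the task type into account, as the abstract states.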

    Low Energy Solutions for Multi- and Triple-Level Cell Non-Volatile Memories

    Due to the high refresh power and scalability issues of DRAM, non-volatile memories (NVMs) such as phase change memory (PCM) and resistive RAM (RRAM) are being actively investigated as viable replacements for DRAM. However, although these NVMs are more scalable than DRAM, they have shortcomings such as higher write energy and lower endurance. Further, the increased capacity of multi- and triple-level cells (MLC/TLC) in these NVM technologies comes at the cost of even higher write energy and lower endurance, both attributed to the MLC/TLC program-and-verify (P&V) techniques. This dissertation makes the following contributions to address the high write energy associated with MLC/TLC NVMs. First, we describe MFNW, a Flip-N-Write encoding that effectively reduces the write energy and improves the endurance of MLC NVMs. MFNW encodes an MLC/TLC word into a number of codewords and selects the one that results in the lowest write energy. Second, we present another encoding solution that is based on perfect-knowledge frequent value encoding (FVE). This encoding technique leverages machine learning to cluster a set of general-purpose applications according to their frequency profiles and generates a dedicated offline FVE for every cluster to maximize energy reduction across a broad spectrum of applications. Whereas the proposed encodings are used as an add-on layer on top of the MLC/TLC P&V solutions, the third contribution is a low latency, low energy P&V (L3EP) approach for MLC/TLC PCM. The primary motivation of L3EP is to address the problem at its source by crafting a faster programming algorithm. A reduction in write latency implies a reduction in write energy as well as an improvement in cell endurance. Directions for future research include the integration and evaluation of a software-based hybrid encoding mechanism for MLC/TLC NVMs; this is a page-level encoding that employs a DRAM cache for coding/decoding purposes.
The main challenges include how the cache block replacement algorithm can easily access the page-level auxiliary cells to encode the cache block correctly. In summary, this work presents multiple solutions to address major challenges of MLC/TLC NVMs, including write latency, write energy, and cell endurance.
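
For readers unfamiliar with the underlying idea, the following sketch shows the classic binary Flip-N-Write scheme for single-level cells, on which encodings such as MFNW build. It is not MFNW itself: the 8-bit word width, the function names, and the use of changed cells as a stand-in for write energy are assumptions made for illustration.

```python
# Sketch of classic binary Flip-N-Write for single-level cells.
# NOT the dissertation's MFNW; word width and names are assumptions.

WORD_BITS = 8
MASK = (1 << WORD_BITS) - 1


def bit_flips(a, b):
    """Number of cells whose stored bit differs between a and b."""
    return bin(a ^ b).count("1")


def fnw_write(stored, stored_flag, data):
    """Choose between storing data or its bitwise complement.

    stored and stored_flag are the current cell contents and flip flag.
    Returns (new_cells, new_flag, cells_changed); the count includes
    the flag cell when it has to change.
    """
    plain_cost = bit_flips(stored, data) + (stored_flag != 0)
    inverted = ~data & MASK
    inv_cost = bit_flips(stored, inverted) + (stored_flag != 1)
    if inv_cost < plain_cost:
        return inverted, 1, inv_cost
    return data, 0, plain_cost


def fnw_read(cells, flag):
    """Undo the inversion when the flip flag is set."""
    return ~cells & MASK if flag else cells
```

For example, overwriting an all-zero word with `0b11111110` would change seven cells if written directly, but only two (one data cell plus the flag) when the complement is stored. In the MLC/TLC setting described above, the selection criterion would be write energy per cell state transition rather than a simple count of changed bits.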

    RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)

    The RAID proposal advocated replacing large disks with arrays of PC disks, but as the capacity of small disks increased 100-fold in the 1990s, the production of large disks was discontinued. Storage dependability is increased via replication or erasure coding. Cloud storage providers store multiple copies of data, obviating the need for further redundancy. Variations of RAID based on local recovery codes and partial MDS codes reduce recovery cost. NAND flash solid-state disks (SSDs) have low latency and high bandwidth, are more reliable, consume less power, and have a lower TCO than hard disk drives (HDDs), which are nonetheless more viable for hyperscalers. Comment: Submitted to ACM Computing Surveys. arXiv admin note: substantial text overlap with arXiv:2306.0876
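
The simplest form of erasure coding used in RAID is single XOR parity, which tolerates the loss of any one block in a stripe. The sketch below illustrates this under the assumption of equally sized blocks; the function names and sample data are hypothetical and are not taken from the tutorial.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and parity."""
    return xor_blocks(surviving_blocks + [parity])


# Three data blocks of one stripe and their parity block.
data = [b"disk0---", b"disk1---", b"disk2---"]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuilding its block.
recovered = rebuild([data[0], data[2]], parity)
```

Rebuilding requires reading every surviving block in the stripe, which is the recovery cost that the local recovery code variants mentioned above aim to reduce.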


    Over the last two decades, the functions of embedded systems have evolved from simple real-time control and monitoring to more complicated services. Embedded systems equipped with powerful chips can provide the performance that computationally demanding information processing applications need. However, because of power constraints, gaining performance simply by scaling up chip frequencies is no longer feasible. Recently, low-power architectures have been the main trend in embedded system design. In this dissertation, we present our approaches to energy-related issues in embedded system design: thermal issues in the 3D chip multiprocessor (CMP), the endurance issue in phase-change memory (PCM), the battery issue in embedded systems, the impact of inaccurate information in embedded systems, and the use of cloud computing to move workload to remote facilities. We propose a real-time constrained task scheduling method to reduce peak temperature on a 3D CMP, including an online 3D CMP temperature prediction model and a set of algorithms for scheduling tasks to different cores in order to minimize the peak temperature on chip. To address the challenges of applying PCM in embedded systems, we propose a PCM main memory optimization mechanism that utilizes the scratch pad memory (SPM). Furthermore, we propose an MLC/SLC configuration optimization algorithm to enhance the efficiency of the hybrid DRAM + PCM memory. We also propose an energy-aware task scheduling algorithm for parallel computing in battery-powered mobile systems. When scheduling tasks in embedded systems, scheduling decisions are based on information such as the estimated execution time of tasks. Therefore, we design a method for evaluating the impact of inaccurate information on resource allocation in embedded systems.
Finally, in order to move workload from embedded systems to remote cloud computing facilities, we present a resource optimization mechanism for heterogeneous federated multi-cloud systems. We also propose two online dynamic algorithms for resource allocation and task scheduling that take resource contention into account.
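
To make the idea of temperature-aware scheduling concrete, here is a toy greedy sketch: each task is placed on the core whose predicted temperature after placement is lowest. The linear predictor `predict_temp`, its coefficient, and the largest-first ordering are assumptions for illustration only; they are not the dissertation's prediction model or algorithms, and real-time constraints are ignored.

```python
# Toy sketch: greedy peak-temperature-aware task placement.
# The linear predictor and all constants are illustrative assumptions.

def predict_temp(current_temp, task_power, heat_coeff=0.5):
    """Hypothetical predictor: temperature rises linearly with task power."""
    return current_temp + heat_coeff * task_power


def schedule(task_powers, core_temps):
    """Assign each task to the core with the lowest predicted temperature.

    Returns (assignment, final_temps), where assignment[i] is the core
    chosen for task i.
    """
    temps = list(core_temps)
    placed = {}
    # Place the most power-hungry tasks first (a common greedy heuristic).
    order = sorted(range(len(task_powers)), key=lambda i: -task_powers[i])
    for i in order:
        core = min(range(len(temps)),
                   key=lambda c: predict_temp(temps[c], task_powers[i]))
        temps[core] = predict_temp(temps[core], task_powers[i])
        placed[i] = core
    assignment = [placed[i] for i in range(len(task_powers))]
    return assignment, temps
```

Calling `schedule([10, 4, 6], [50.0, 40.0])` places all three tasks on the initially cooler core, because under this toy predictor its temperature never exceeds that of the other core. A 3D CMP model would additionally have to account for heat transfer between vertically stacked cores.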


    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, summarization, and clustering to the different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are examined with respect to their resource requirements and to how scalability can be enhanced on diverse computing architectures, ranging from embedded systems to large computing clusters.


    Emerging Technologies

    This monograph investigates a multitude of emerging technologies including 3D printing, 5G, blockchain, and many more to assess their potential to further humanity's shared goal of sustainable development. Through case studies detailing how these technologies are already being used at companies worldwide, author Sinan Küfeoğlu explores how emerging technologies can be used to enhance progress toward each of the seventeen United Nations Sustainable Development Goals and to guarantee economic growth even in the face of challenges such as climate change. To assemble this book, the author explored the business models of 650 companies in order to demonstrate how innovations can be converted into value to support sustainable development. To ensure practical application, only technologies currently on the market and in use at actual companies were investigated. This volume will be of great use to academics, policymakers, innovators at the forefront of green business, and anyone else who is interested in novel and innovative business models and how they could help to achieve the Sustainable Development Goals. This is an open access book.