    Memory Management for Emerging Memory Technologies

    The Memory Wall, the gap between CPU speed and main memory latency, keeps widening. The latency of Dynamic Random-Access Memory (DRAM) is now on the order of hundreds of CPU cycles. In addition, DRAM main memory faces power, performance, and capacity constraints that limit process technology scaling. Meanwhile, the workloads running on such systems are themselves changing, as virtualization and cloud computing demand more performance from data centers. These workloads not only have larger working sets but also use memory differently, resulting in more sharing and higher bandwidth demands. New Non-Volatile Memory (NVM) technologies are emerging as an answer to these main memory issues. This thesis examines the memory management issues that arise as emerging memory technologies are integrated into the memory hierarchy. We consider problems at several levels of the hierarchy: sharing of the CPU last-level cache (LLC), managing traffic to future non-volatile memories behind the LLC, and extending main memory with NVM. The first solution we propose is “Adaptive Replacement and Insertion” (ARI), an adaptive approach to last-level CPU cache management that optimizes the cache miss rate and the writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving the miss rate relative to the conventional LRU replacement policy, with minimal hardware overhead. On benchmarks from the SPEC2006 suite, ARI reduces writebacks by 32.9% on average while also decreasing misses by 4.7% on average. In a PCM-based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling. Our second proposal is “Variable-Timeslice Thread Scheduling” (VATS), an OS kernel-level approach to CPU cache sharing. 
With modern, large last-level caches (LLCs), the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta and allowing longer periods between interfering threads. Compared to conventional scheduling, our approach mitigates up to 100% of the performance loss caused by CPU LLC interference and boosts system throughput by up to 15%. As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design built with Flash transistors. TCAM is used successfully in network routing, but it can also be applied to OS virtual memory. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach. To lower the cost of main memory in systems with huge memory demand, it is becoming practical to extend the system's DRAM with less expensive NVMe Flash. However, given the relatively high access latency of Flash devices, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to recover about 50% of the performance lost to swapping into the NVM, which enables the use of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping performance comparable to that of a DRAM-only system.
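
The two metrics that ARI is described as optimizing, misses and writebacks, can be made concrete with a small simulation of the LRU baseline it is compared against. The sketch below is not ARI and is not taken from the thesis: it assumes a fully associative write-back cache, and the function name `simulate_lru` and the trace format are hypothetical choices made for illustration.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity):
    """Simulate a fully associative write-back cache with LRU replacement.

    trace: iterable of (address, is_write) pairs (hypothetical format).
    Returns a (misses, writebacks) tuple.
    """
    cache = OrderedDict()  # address -> dirty flag, ordered from LRU to MRU
    misses = 0
    writebacks = 0
    for addr, is_write in trace:
        if addr in cache:
            # Hit: promote to most recently used and keep the line dirty
            # if it was already dirty or is being written now.
            cache.move_to_end(addr)
            cache[addr] = cache[addr] or is_write
        else:
            misses += 1
            if len(cache) >= capacity:
                # Evict the least recently used line; a dirty victim
                # must be written back to main memory.
                _, dirty = cache.popitem(last=False)
                if dirty:
                    writebacks += 1
            cache[addr] = is_write
    return misses, writebacks


if __name__ == "__main__":
    # Three lines cycling through a two-line cache; only line 0 is ever
    # dirty when evicted.
    trace = [(0, True), (1, False), (2, False),
             (0, False), (1, True), (2, False)]
    print(simulate_lru(trace, capacity=2))
```

A policy that prefers evicting clean lines would lower the writeback count on such a trace, possibly at the cost of extra misses; that trade-off is the one the abstract says ARI balances.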

    Hardware thread scheduling algorithms for single-ISA asymmetric CMPs

    Over the past several decades, following Moore's law, the semiconductor industry doubled the number of transistors on a single chip roughly every eighteen months. For a long time, this continuous increase in the transistor budget drove performance gains, as processors continued to exploit the instruction-level parallelism (ILP) of sequential programs. This pattern hit a wall in the early years of the twenty-first century, when designing larger and more complex cores became difficult because of power and complexity constraints. Computer architects responded by integrating many cores on the same die, thereby creating Chip Multiprocessors (CMPs). In the last decade, computing technology has developed tremendously, and CMPs have expanded from symmetric, homogeneous designs to asymmetric, heterogeneous ones. Having cores of different types in a single processor makes it possible to optimize performance, power, and energy efficiency for a wider range of workloads. It lets chip designers employ specialization: each type of core can be used for the type of computation where it delivers the best performance/energy trade-off. The benefits of Asymmetric Chip Multiprocessors (ACMPs) are intuitive, since it is well known that different workloads have different resource requirements. CMPs improve application performance by exploiting Thread-Level Parallelism (TLP). Parallel applications that rely on multiple threads must be managed and dispatched efficiently if that parallelism is to be properly exploited. As more and more applications become multi-threaded, we expect a growing number of threads to execute on a machine. Consequently, the operating system will require increasingly large amounts of CPU time to schedule these threads efficiently. 
Dynamic thread scheduling techniques are therefore of paramount importance in ACMP designs, since they can make or break the performance benefits derived from asymmetric hardware or parallel software. Several thread scheduling methods have been proposed and applied to ACMPs. In this thesis, we first study state-of-the-art thread scheduling techniques and identify the main factors that limit thread-level parallelism in ACMP systems. We then propose three novel approaches, implemented in hardware, to schedule and manage threads and exploit thread-level parallelism, instead of perpetuating the trend of performing ever more complex thread scheduling in the operating system. Our first goal is to improve the performance of ACMP systems by improving thread scheduling at the hardware level. We also show that hardware thread scheduling reduces the energy consumption of ACMP systems by allowing better utilization of the underlying hardware.

    DeNovo: rethinking the memory hierarchy for disciplined parallelism

    As multicore systems become widespread, both software and hardware face a major challenge in efficiently exploiting and implementing parallelism. While shared memory remains a popular programming model due to its global address space, it is plagued by undisciplined programming practices that allow implicit communication and unstructured non-determinism. Such “wild” shared-memory behavior not only makes software difficult to test and maintain but also complicates hardware, preventing it from scaling in a power-efficient manner. Recent research has proposed replacing wild shared-memory programming models with a more disciplined approach. The DeNovo project asks the following question: if software is more disciplined, can we build more power-, performance-, and complexity-efficient shared-memory hardware? Focusing on deterministic programs as the discipline that drives DeNovo, we first show that coherence and communication can be made much simpler and more efficient than the current state of the art. The resulting protocol has no transient states, invalidation traffic, directory sharer lists, or false sharing, all of which are significant sources of inefficiency in existing protocols. Widening the software space further, we then show how DeNovo can support software with disciplined non-determinism without giving up its benefits for deterministic programs. The remaining challenge is supporting synchronization accesses, which are inherently “racy”, on DeNovo without writer-initiated invalidation. We show that arbitrary synchronization can be supported on DeNovo with a simple yet efficient hardware mechanism, a big step toward our eventual goal of supporting legacy programs. Finally, we explore the potential for a comprehensive coherence solution that merges all previous DeNovo coherence mechanisms and adaptively switches between them depending on the level of “discipline” of the software. 
In summary, DeNovo shows the potential for commercially viable software-driven shared-memory systems with higher complexity-, performance-, and energy-efficiency than today’s software-oblivious hardware.

    Understanding the behavior and implications of context switch misses

