23 research outputs found

    HPC memory systems: Implications of system simulation and checkpointing

    Get PDF
    The memory system is a significant contributor for most of the current challenges in computer architecture: application performance bottlenecks and operational costs in large data-centers as HPC supercomputers. With the advent of emerging memory technologies, the exploration for novel designs on the memory hierarchy for HPC systems is an open invitation for computer architecture researchers to improve and optimize current designs and deployments. System simulation is the preferred approach to perform architectural explorations due to the low cost to prototype hardware systems, acceptable performance estimates, and accurate energy consumption predictions. Despite the broad presence and extensive usage of system simulators, their validation is not standardized; either because the main purpose of the simulator is not meant to mimic real hardware, or because the design assumptions are too narrow on a particular computer architecture topic. This thesis provides the first steps for a systematic methodology to validate system simulators when compared to real systems. We unveil real-machine´s micro-architectural parameters through a set of specially crafted micro-benchmarks. The unveiled parameters are used to upgrade the simulation infrastructure in order to obtain higher accuracy in the simulation domain. To evaluate the accuracy on the simulation domain, we propose the retirement factor, an extension to a well-known application´s performance methodology. Our proposal provides a new metric to measure the impact simulator´s parameter-tuning when looking for the most accurate configuration. We further present the delay queue, a modification to the memory controller that imposes a configurable delay for all memory transactions that reach the main memory devices; evaluated using the retirement factor, the delay queue allows us to identify the sources of deviations between the simulator infrastructure and the real system. Memory accesses directly affect application performance, both in the real-world machine as well as in the simulation accuracy. From single-read access to a unique memory location up to simultaneous read/write operations to a single or multiple memory locations, HPC applications memory usage differs from workload to workload. A property that allows to glimpse on the application´s memory usage is the workload´s memory footprint. In this work, we found a link between HPC workload´s memory footprint and simulation performance. Actual trends on HPC data-center memory deployments and current HPC application’s memory footprint led us to envision an opportunity for emerging memory technologies to include them as part of the reliability support on HPC systems. Emerging memory technologies such as 3D-stacked DRAM are getting deployed in current HPC systems but in limited quantities in comparison with standard DRAM storage making them suitable to use for low memory footprint HPC applications. We exploit and evaluate this characteristic enabling a Checkpoint-Restart library to support a heterogeneous memory system deployed with an emerging memory technology. Our implementation imposes negligible overhead while offering a simple interface to allocate, manage, and migrate data sets between heterogeneous memory systems. Moreover, we showed that the usage of an emerging memory technology it is not a direct solution to performance bottlenecks; correct data placement and crafted code implementation are critical when comes to obtain the best computing performance. Overall, this thesis provides a technique for validating main memory system simulators when integrated in a simulation infrastructure and compared to real systems. In addition, we explored a link between the workload´s memory footprint and simulation performance on current HPC workloads. Finally, we enabled low memory footprint HPC applications with resilience support while transparently profiting from the usage of emerging memory deployments.El sistema de memoria es el mayor contribuidor de los desafíos actuales en el campo de la arquitectura de ordenadores como lo son los cuellos de botella en el rendimiento de las aplicaciones, así como los costos operativos en los grandes centros de datos. Con la llegada de tecnologías emergentes de memoria, existe una invitación para que los investigadores mejoren y optimicen las implementaciones actuales con novedosos diseños en la jerarquía de memoria. La simulación de los ordenadores es el enfoque preferido para realizar exploraciones de arquitectura debido al bajo costo que representan frente a la realización de prototipos físicos, arrojando estimaciones de rendimiento aceptables con predicciones precisas. A pesar del amplio uso de simuladores de ordenadores, su validación no está estandarizada ya sea porque el propósito principal del simulador no es imitar al sistema real o porque las suposiciones de diseño son demasiado específicas. Esta tesis proporciona los primeros pasos hacia una metodología sistemática para validar simuladores de ordenadores cuando son comparados con sistemas reales. Primero se descubren los parámetros de microarquitectura en la máquina real a través de un conjunto de micro-pruebas diseñadas para actualizar la infraestructura de simulación con el fin de mejorar la precisión en el dominio de la simulación. Para evaluar la precisión de la simulación, proponemos "el factor de retiro", una extensión a una conocida herramienta para medir el rendimiento de las aplicaciones, pero enfocada al impacto del ajuste de parámetros en el simulador. Además, presentamos "la cola de retardo", una modificación virtual al controlador de memoria que agrega un retraso configurable a todas las transacciones de memoria que alcanzan la memoria principal. Usando el factor de retiro, la cola de retraso nos permite identificar el origen de las desviaciones entre la infraestructura del simulador y el sistema real. Todos los accesos de memoria afectan directamente el rendimiento de la aplicación. Desde el acceso de lectura a una única localidad memoria hasta operaciones simultáneas de lectura/escritura a una o varias localidades de memoria, una propiedad que permite reflejar el uso de memoria de la aplicación es su "huella de memoria". En esta tesis encontramos un vínculo entre la huella de memoria de las aplicaciones de alto desempeño y su rendimiento en simulación. Las tecnologías de memoria emergentes se están implementando en sistemas de alto desempeño en cantidades limitadas en comparación con la memoria principal haciéndolas adecuadas para su uso en aplicaciones con baja huella de memoria. En este trabajo, habilitamos y evaluamos el uso de un sistema de memoria heterogéneo basado en un sistema emergente de memoria. Nuestra implementación agrega una carga despreciable al mismo tiempo que ofrece una interfaz simple para ubicar, administrar y migrar datos entre sistemas de memoria heterogéneos. Además, demostramos que el uso de una tecnología de memoria emergente no es una solución directa a los cuellos de botella en el desempeño. La implementación es fundamental a la hora de obtener el mejor rendimiento ya sea ubicando correctamente los datos, o bien diseñando código especializado. En general, esta tesis proporciona una técnica para validar los simuladores respecto al sistema de memoria principal cuando se integra en una infraestructura de simulación y se compara con sistemas reales. Además, exploramos un vínculo entre la huella de memoria de la carga de trabajo y el rendimiento de la simulación en cargas de trabajo de aplicaciones de alto desempeño. Finalmente, habilitamos aplicaciones de alto desempeño con soporte de resiliencia mientras que se benefician de manera transparente con el uso de un sistema de memoria emergente.Postprint (published version

    Physically Dense Server Architectures.

    Full text link
    Distributed, in-memory key-value stores have emerged as one of today's most important data center workloads. Being critical for the scalability of modern web services, vast resources are dedicated to key-value stores in order to ensure that quality of service guarantees are met. These resources include: many server racks to store terabytes of key-value data, the power necessary to run all of the machines, networking equipment and bandwidth, and the data center warehouses used to house the racks. There is, however, a mismatch between the key-value store software and the commodity servers on which it is run, leading to inefficient use of resources. The primary cause of inefficiency is the overhead incurred from processing individual network packets, which typically carry small payloads, and require minimal compute resources. Thus, one of the key challenges as we enter the exascale era is how to best adjust to the paradigm shift from compute-centric to storage-centric data centers. This dissertation presents a hardware/software solution that addresses the inefficiency issues present in the modern data centers on which key-value stores are currently deployed. First, it proposes two physical server designs, both of which use 3D-stacking technology and low-power CPUs to improve density and efficiency. The first 3D architecture---Mercury---consists of stacks of low-power CPUs with 3D-stacked DRAM. The second architecture---Iridium---replaces DRAM with 3D NAND Flash to improve density. The second portion of this dissertation proposes and enhanced version of the Mercury server design---called KeyVault---that incorporates integrated, zero-copy network interfaces along with an integrated switching fabric. In order to utilize the integrated networking hardware, as well as reduce the response time of requests, a custom networking protocol is proposed. Unlike prior works on accelerating key-value stores---e.g., by completely bypassing the CPU and OS when processing requests---this work only bypasses the CPU and OS when placing network payloads into a process' memory. The insight behind this is that because most of the overhead comes from processing packets in the OS kernel---and not the request processing itself---direct placement of packet's payload is sufficient to provide higher throughput and lower latency than prior approaches.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/111414/1/atgutier_1.pd

    High-Performance Energy-Efficient and Reliable Design of Spin-Transfer Torque Magnetic Memory

    Get PDF
    In this dissertation new computing paradigms, architectures and design philosophy are proposed and evaluated for adopting the STT-MRAM technology as highly reliable, energy efficient and fast memory. For this purpose, a novel cross-layer framework from the cell-level all the way up to the system- and application-level has been developed. In these framework, the reliability issues are modeled accurately with appropriate fault models at different abstraction levels in order to analyze the overall failure rates of the entire memory and its Mean Time To Failure (MTTF) along with considering the temperature and process variation effects. Design-time, compile-time and run-time solutions have been provided to address the challenges associated with STT-MRAM. The effectiveness of the proposed solutions is demonstrated in extensive experiments that show significant improvements in comparison to state-of-the-art solutions, i.e. lower-power, higher-performance and more reliable STT-MRAM design

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    Get PDF
    abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.Dissertation/ThesisDoctoral Dissertation Computer Science 201

    A Modern Primer on Processing in Memory

    Full text link
    Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely-felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are an evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated.Comment: arXiv admin note: substantial text overlap with arXiv:1903.0398

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    Get PDF
    The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends

    Design space exploration of photonic interconnects

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.Cataloged from PDF version of thesis.Includes bibliographical references (p. 109-113).As processors scale deep into the multi-core and many-core regimes, bandwidth and energy-efficiency of the on-die interconnect network have become paramount design issues. Recognizing potential limits of electrical interconnects, emerging nanophotonic integration has been recently proposed as a potential technology option for both on-chip and chip-to-chip applications. As optical links avoid the capacitive, resistive and signal integrity limits imposed upon electrical interconnects, the introduction of integrated photonics allows for efficient realization of physical connectivity that are costly to accomplish electrically. While many recent works have since cited the potential benefits of optics, inherent design tradeoffs of photonic datapath and backend components remain relatively unknown at the system-level. This thesis develops insights regarding the behavior of electrical and hybrid optoelectrical networks and systems. We present power and area models that capture the behavior of electrical interface circuits and their interactions with optical devices. To animate these models in the context of a full system, we contribute DSENT, a novel physical modeling framework capable of estimating the costs of generalized digital electronics, mixed-signal interface circuitry, and optical links. With DSENT, we enable fast power and area evaluation of entire networks to connect the dynamics of an underlying photonics interconnect to that of an otherwise electrical system. Using our methodolody, we perform a technology-driven design space exploration of intra-chip networks and highlight the importance of thermal tuning and parasitic receiver capacitances in network power consumption. We show that the performance gains enabled by photonics-inspired architectures can enable savings in total system energy even if the network is more costly. Finally, we propose a photonically interconnected DRAM system as a solution to the core-to-DRAM bandwidth bottleneck. By attacking energy consumption at the DRAM channel, chip, and bank level with integrated photoncis, we cut the power consumption of the DRAM system by 10x while remaining area neutral when compared to a projected electrical baseline.by Chen Sun.S.M

    On the co-design of scientific applications and long vector architectures

    Get PDF
    The landscape of High Performance Computing (HPC) system architectures keeps expanding with new technologies and increased complexity. To improve the efficiency of next-generation compute devices, architects are looking for solutions beyond the commodity CPU approach. In 2021, the five most powerful supercomputers in the world use either GP-GPU (General-purpose computing on graphics processing units) accelerators or a customized CPU specially designed to target HPC applications. This trend is only expected to grow in the next years motivated by the compute demands of science and industry. As architectures evolve, the ecosystem of tools and applications must follow. The choices in the number of cores in a socket, the floating point-units per core and the bandwidth through the memory hierarchy among others, have a large impact in the power consumption and compute capabilities of the devices. To balance CPU and accelerators, designers require accurate tools for analyzing and predicting the impact of new architectural features on the performance of complex scientific applications at scale. In such a large design space, capturing and modeling with simulators the complex interactions between the system software and hardware components is a defying challenge. Moreover, applications must be able to exploit those designs with aggressive compute capabilities and memory bandwidth configurations. Algorithms and data structures will need to be redesigned accordingly to expose a high degree of data-level parallelism allowing them to scale in large systems. Therefore, next-generation computing devices will be the result of a co-design effort in hardware and applications supported by advanced simulation tools. In this thesis, we focus our work on the co-design of scientific applications and long vector architectures. We significantly extend a multi-scale simulation toolchain enabling accurate performance and power estimations of large-scale HPC systems. Through simulation, we explore the large design space in current HPC trends over a wide range of applications. We extract speedup and energy consumption figures analyzing the trade-offs and optimal configurations for each of the applications. We describe in detail the optimization process of two challenging applications on real vector accelerators, achieving outstanding operation performance and full memory bandwidth utilization. Overall, we provide evidence-based architectural and programming recommendations that will serve as hardware and software co-design guidelines for the next generation of specialized compute devices.El panorama de las arquitecturas de los sistemas para la Computación de Alto Rendimiento (HPC, de sus siglas en inglés) sigue expandiéndose con nuevas tecnologías y complejidad adicional. Para mejorar la eficiencia de la próxima generación de dispositivos de computación, los arquitectos están buscando soluciones más allá de las CPUs. En 2021, los cinco supercomputadores más potentes del mundo utilizan aceleradores gráficos aplicados a propósito general (GP-GPU, de sus siglas en inglés) o CPUs diseñadas especialmente para aplicaciones HPC. En los próximos años, se espera que esta tendencia siga creciendo motivada por las demandas de más potencia de computación de la ciencia y la industria. A medida que las arquitecturas evolucionan, el ecosistema de herramientas y aplicaciones les debe seguir. Las decisiones eligiendo el número de núcleos por zócalo, las unidades de coma flotante por núcleo y el ancho de banda a través de la jerarquía de memoría entre otros, tienen un gran impacto en el consumo de energía y las capacidades de cómputo de los dispositivos. Para equilibrar las CPUs y los aceleradores, los diseñadores deben utilizar herramientas precisas para analizar y predecir el impacto de nuevas características de la arquitectura en el rendimiento de complejas aplicaciones científicas a gran escala. Dado semejante espacio de diseño, capturar y modelar con simuladores las complejas interacciones entre el software de sistema y los componentes de hardware es un reto desafiante. Además, las aplicaciones deben ser capaces de explotar tales diseños con agresivas capacidades de cómputo y ancho de banda de memoria. Los algoritmos y estructuras de datos deberán ser rediseñadas para exponer un alto grado de paralelismo de datos permitiendo así escalarlos en grandes sistemas. Por lo tanto, la siguiente generación de dispósitivos de cálculo será el resultado de un esfuerzo de codiseño tanto en hardware como en aplicaciones y soportado por avanzadas herramientas de simulación. En esta tesis, centramos nuestro trabajo en el codiseño de aplicaciones científicas y arquitecturas vectoriales largas. Extendemos significativamente una serie de herramientas para la simulación multiescala permitiendo así obtener estimaciones de rendimiento y potencia de sistemas HPC de gran escala. A través de simulaciones, exploramos el gran espacio de diseño de las tendencias actuales en HPC sobre un amplio rango de aplicaciones. Extraemos datos sobre la mejora y el consumo energético analizando las contrapartidas y las configuraciones óptimas para cada una de las aplicaciones. Describimos en detalle el proceso de optimización de dos aplicaciones en aceleradores vectoriales, obteniendo un rendimiento extraordinario a nivel de operaciones y completa utilización del ancho de memoria disponible. Con todo, ofrecemos recomendaciones empíricas a nivel de arquitectura y programación que servirán como instrucciones para diseñar mejor hardware y software para la siguiente generación de dispositivos de cálculo especializados.Postprint (published version

    Coordinating the Design and Management of Heterogeneous Datacenter Resources

    Get PDF
    <p>Heterogeneous design presents an opportunity to improve energy efficiency but raises a challenge in management. Whereas prior work separates the two, we coordinate heterogeneous design and management. We present a market-based resource allocation mechanism that navigates the performance and power trade-offs of heterogeneous architectures. Given this management framework, we explore a design space of heterogeneous processors and show a 12x reduction in response time violations when equipping a datacenter with three processor types over a homogeneous system that consumes the same power. To better understand trade-offs in large heterogeneous design spaces, we explore dozens of design strategies and present a risk taxonomy that classifies the reasons why a deployed system may underperform relative to design targets. We propose design strategies that explicitly mitigate risk, such as a strategy that minimizes the coefficient of variation in performance. In our experiments, we find that risk-aware design accounts for more than 70% of the strategies that produce systems with the best service quality. We also present a new datacenter management mechanism that fairly allocates processors to latency-sensitive applications. Tasks express value for performance using sophisticated piecewise-linear utility functions. With fairness in market allocations, we show how datacenters can mitigate envy amongst latency-sensitive users. We quantify the price of fairness and detail efficiency-fairness trade-offs. Finally, we extend the market to fairly allocate heterogeneous processors.</p>Dissertatio

    Model-Based Performance Prediction for Concurrent Software on Multicore Architectures

    Get PDF
    Model-based performance prediction is a well-known concept to ensure the quality of software.Current approaches are based on a single-metric model, which leads to inaccurate predictions for modern architectures. This thesis presents a multi-strategies approach to extend performance prediction models to support multicore architectures.We implemented the strategies into Palladio and significantly increased the performance prediction power
    corecore