1,642 research outputs found

    HPC Accelerators with 3D Memory

    Get PDF
    Artículo invitado, publicado en las actas del congreso por IEEE Society Press. Páginas 320 a 328. ISBN: 978-1-5090-3593-9.DOI 10.1109/CSE-EUC-DCABES-2016.203After a decade evolving in the High Performance Computing arena, GPU-equipped supercomputers have con- quered the top500 and green500 lists, providing us unprecedented levels of computational power and memory bandwidth. This year, major vendors have introduced new accelerators based on 3D memory, like Xeon Phi Knights Landing by Intel and Pascal architecture by Nvidia. This paper reviews hardware features of those new HPC accelerators and unveils potential performance for scientific applications, with an emphasis on Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) used by commercial products according to roadmaps already announced.Universidad de Málaga. Campus de Excelencia Internacional Andalucia Tec

    High Performance Hybrid Memory Systems with 3D-stacked DRAM

    Get PDF
    The bandwidth of traditional DRAM is pin limited and so does not scale well with the increasing demand of data intensive workloads. 3D-stacked DRAM can alleviate this problem providing substantially higher bandwidth to a processor chip. However, the capacity of 3D-stacked DRAM is not enough to replace the bulk of the memory and therefore it is used together with off-chip DRAM in a hybrid memory system, either as a DRAM cache or as part of a flat address space with support for data migration. The performance of both above alternative designs is limited by their particular overheads. This thesis proposes new designs that improve the performance of hybrid memory systems. It does so first by alleviating the overheads of current approaches and second, by proposing a new design that combines the best attributes of DRAM caching and data migration while addressing their respective weaknesses. The first part of this thesis focuses on improving the performance of DRAM caches. Besides the unavoidable DRAM access to fetch the requested data, tag access is in the critical path adding significant latency and energy costs. Existing approaches are not able to remove these overheads and in some cases limit DRAM cache design options. To alleviate the tag access overheads of DRAM caches this thesis proposes Decoupled Fused Cache (DFC), a DRAM cache design that fuses DRAM cache tags with the tags of the on-chip Last Level Cache (LLC) to access the DRAM cache data directly on LLC misses. Compared to current state-of-the-art DRAM caches, DFC improves system performance by 11% on average. Finally, DFC reduces DRAM cache traffic by 25% and DRAM cache energy consumption by 24.5%. The second part of this thesis focuses on improving the performance of data migration. Data migration has significant performance potential, but also entails overheads which may diminish its benefits or even degrade performance. These overheads are mainly due to the high cost of swapping data between memories which also makes selecting which data to migrate critical to performance. To address these challenges of data migration this thesis proposes LLC guided Data Migration (LGM). LGM uses the LLC to predict future reuse and select memory segments for migration. Furthermore, LGM reduces the data migration traffic overheads by not migrating the cache lines of memory segments which are present in the LLC. LGM outperforms current state-of-the art data migration, improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%. DRAM caches and data migration offer different tradeoffs for the utilization of 3D-stacked DRAM but also share some similar challenges. The third part of this thesis aims to provide an alternative approach to the utilization of 3D-stacked DRAM combining the strengths of both DRAM caches and data migration while eliminating their weaknesses. To that end, this thesis proposes Hybrid2, a hybrid memory system design which uses only a small fraction of the 3D-stacked DRAM as a cache and thus does not deny valuable capacity from the memory system. It further leverages the DRAM cache as a staging area to select the data most suitable for migration. Finally, Hybrid2 alleviates the metadata overheads of both DRAM caches and migration using a common mechanism. Depending on the system configuration, Hybrid2 on average outperforms state-of-the-art migration schemes by 6.4% to 9.1%, compared to DRAM caches Hybrid2 gives away on average only 0.3%, to 5.3% of performance offering up to 24.6% more main memory capacity

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Full text link
    Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.Comment: Our open source software is available at https://github.com/CMU-SAFARI/DAMO

    Genome sequence alignment in processing-In-memory architectures

    Get PDF
    Finalmente, también realizamos un estudio experimental de varias arquitecturas con diferentes tecnologías de memoria (DDR y HBM) y núcleos de procesamiento de distintos tipos, explotando, en algunos casos, procesamiento en la memoria (PIM). La aplicación de referencia es Bowtie2, una aplicación completa para el alineamiento de secuencias en el genoma. La implementación y evaluación de estas arquitecturas se realiza utilizando un simulador arquitectural basado en gem5.La combinación de la aparición de un cuello de botella en el acceso a los datos y la creciente importancia de las aplicaciones de procesamiento intensivo de datos, muy limitadas por el sistema de memoria, crea un importante problema que debe ser abordado. Por ello, en esta tesis nos proponemos afrontar este problema e intentar reducir su efecto en la medida de lo posible. El principal objetivo de esta tesis es el diseño de nuevas soluciones arquitecturales y algorítmicas para superar el problema del cuello de botella conocido como memory-wall y mejorar el rendimiento de aplicaciones con gran uso de memoria que no son capaces de beneficiarse lo suficiente de las jerarquías de memoria actuales. Además, creemos que es esencial centrarse en la eficiencia energética, un factor cuya importancia crece cada día y uno de los factores más limitantes en la computación de alto rendimiento. Las principales contribuciones de esta tesis son: Primero, analizamos el comportamiento de aplicaciones con accesos de memoria aleatorios, que no aprovechan correctamente las nuevas arquitecturas de memoria con jerarquías cache profundas. Específicamente, analizamos la estructura de datos FM-index y un algoritmo de búsqueda de secuencias basado en esa estructura, ampliamente usado en el alineamiento de secuencias en el genoma. Después de este análisis y de obtener un conocimiento más detallado del cuello de botella de la memoria, proponemos una nueva versión de FM-index que permite reducir el consumo de ancho de banda de memoria, de forma que mejora significativamente el rendimiento computacional. Posteriormente, proponemos una nueva arquitectura energéticamente eficiente, basada en un cubo de memoria en 3D (3D-Stacked) al que añadimos unos núcleos sencillos de bajo consumo en su capa lógica. Esta arquitectura permite la ejecución cerca de los datos (near-data-processing

    Hybrid2: Combining Caching and Migration in Hybrid Memory Systems

    Get PDF
    This paper considers a hybrid memory system composed of memory technologies with different characteristics; in particular a small, near memory exhibiting high bandwidth, i.e., 3D-stacked DRAM, and a larger, far memory offering capacity at lower bandwidth, i.e., off-chip DRAM. In the past,the near memory of such a system has been used either as a DRAM cache or as part of a flat address space combined with a migration mechanism. Caches and migration offer different tradeoffs (between performance, main memory capacity, data transfer costs, etc.) and share similar challenges related todata-transfer granularity and metadata management. This paper proposes Hybrid2 , a new hybrid memory system architecture that combines a DRAM cache with a migration scheme. Hybrid 2 does not deny valuable capacity from the memory system because it uses only a small fraction of the near memory as a DRAM cache; 64MB in our experiments.It further leverages the DRAM cache as a staging area to select the data most suitable for migration. Finally, Hybrid2 alleviates the metadata overheads of both DRAM caches and migration using a common mechanism. Using near to far memory ratios of 1:16, 1:8 and 1:4 in our experiments, Hybrid2 on average outperforms current state-of-the-art migration schemes by 7.9%, 9.1% and 6.4%, respectively. In the same system configurations, compared to DRAM caches Hybrid2 gives away on average only 0.3%, 1.2%, and 5.3% of performance offering 5.9%, 12.1%, and 24.6% more main memory capacity, respectively
    • …
    corecore