186 research outputs found
Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for Hybrid Memory Cube
Three-dimensional (3D)-stacking technology, which enables the integration of
DRAM and logic dies, offers high bandwidth and low energy consumption. This
technology also empowers new memory designs for executing tasks not
traditionally associated with memories. A practical 3D-stacked memory is Hybrid
Memory Cube (HMC), which provides significant access bandwidth and low power
consumption in a small area. Although several studies have taken advantage of
the novel architecture of HMC, its characteristics in terms of latency and
bandwidth or their correlation with temperature and power consumption have not
been fully explored. This paper is the first, to the best of our knowledge, to
characterize the thermal behavior of HMC in a real environment using the AC-510
accelerator and to identify temperature as a new limitation for this
state-of-the-art design space. Moreover, besides bandwidth studies, we
deconstruct factors that contribute to latency and reveal their sources for
high- and low-load accesses. The results of this paper demonstrate essential
behaviors and performance bottlenecks for future explorations of
packet-switched and 3D-stacked memories. Comment: EEE Catalog Number: CFP17236-USB ISBN 13: 978-1-5386-1232-
Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube
Memories that exploit three-dimensional (3D)-stacking technology, which
integrates memory and logic dies in a single stack, are becoming popular. These
memories, such as Hybrid Memory Cube (HMC), utilize a network-on-chip (NoC)
design for connecting their internal structural organizations. This novel usage
of NoC, in addition to aiding processing-in-memory capabilities, enables
numerous benefits such as high bandwidth and memory-level parallelism. However,
the implications of NoCs on the characteristics of 3D-stacked memories in terms
of memory access latency and bandwidth have not been fully explored. This paper
addresses this knowledge gap by (i) characterizing an HMC prototype on the
AC-510 accelerator board and revealing its access latency behaviors, and (ii)
investigating the implications of such behaviors on system and software
designs.
Genome sequence alignment in processing-In-memory architectures
Finally, we also carry out an experimental study of several architectures with different memory technologies (DDR and HBM) and processing cores of different types, in some cases exploiting processing in memory (PIM). The reference application is Bowtie2, a complete application for genome sequence alignment. These architectures are implemented and evaluated using an architectural simulator based on gem5.
The combination of an emerging bottleneck in data access and the growing importance of data-intensive applications, which are severely limited by the memory system, creates a significant problem that must be addressed. In this thesis we therefore set out to tackle this problem and reduce its effect as far as possible.
The main goal of this thesis is to design new architectural and algorithmic solutions that overcome the bottleneck known as the memory wall and improve the performance of memory-intensive applications that cannot benefit sufficiently from current memory hierarchies. In addition, we believe it is essential to focus on energy efficiency, a factor whose importance grows every day and one of the most limiting factors in high-performance computing.
The main contributions of this thesis are:
First, we analyze the behavior of applications with random memory accesses, which do not make good use of new memory architectures with deep cache hierarchies. Specifically, we analyze the FM-index data structure and a sequence search algorithm based on that structure, widely used in genome sequence alignment.
After this analysis, and with a more detailed understanding of the memory bottleneck, we propose a new version of FM-index that reduces memory bandwidth consumption and thereby significantly improves computational performance.
Next, we propose a new energy-efficient architecture based on a 3D-stacked memory cube, to whose logic layer we add simple low-power cores. This architecture enables execution close to the data (near-data-processing
A Modern Primer on Processing in Memory
Modern computing systems are overwhelmingly designed to move data to
computation. This design choice goes directly against at least three key trends
in computing that cause performance, scalability and energy bottlenecks: (1)
data access is a key bottleneck as many important applications are increasingly
data-intensive, and memory bandwidth and energy do not scale well, (2) energy
consumption is a key limiter in almost all computing platforms, especially
server and mobile systems, (3) data movement, especially off-chip to on-chip,
is very expensive in terms of bandwidth, energy and latency, much more so than
computation. These trends are felt especially severely in the data-intensive
server and energy-constrained mobile systems of today. At the same time,
conventional memory technology is facing many technology scaling challenges in
terms of reliability, energy, and performance. As a result, memory system
architects are open to organizing memory in different ways and making it more
intelligent, at the expense of higher cost. The emergence of 3D-stacked memory
plus logic, the adoption of error correcting codes inside the latest DRAM
chips, proliferation of different main memory standards and chips, specialized
for different purposes (e.g., graphics, low-power, high bandwidth, low
latency), and the necessity of designing new solutions to serious reliability
and security issues, such as the RowHammer phenomenon, are evidence of this
trend. This chapter discusses recent research that aims to practically enable
computation close to data, an approach we call processing-in-memory (PIM). PIM
places computation mechanisms in or near where the data is stored (i.e., inside
the memory chips, in the logic layer of 3D-stacked memory, or in the memory
controllers), so that data movement between the computation units and memory is
reduced or eliminated. Comment: arXiv admin note: substantial text overlap with arXiv:1903.0398
Memory Hierarchy Design for Next Generation Scalable Many-core Platforms
Performance and energy consumption in modern computing platforms are largely dominated by the memory hierarchy. The increasing computational power of multiprocessors and accelerators, and the emergence of data-intensive workloads (e.g. large-scale graph traversal and scientific algorithms) requiring fast transfer of large volumes of data, are two main trends that intensify this problem by putting even higher pressure on the memory hierarchy. This increasing gap between computation speed and data transfer speed is commonly referred to as the “memory wall” problem. With the emergence of heterogeneous Three Dimensional (3D) Integration based on through-silicon-vias (TSV), this situation has started to improve in the past years. On one hand, it is now possible to improve memory access bandwidth and/or latency by either stacking memories directly on top of processors or through abstracted memory interfaces such as Micron’s Hybrid Memory Cube (HMC). On the other hand, near memory computation has become worthy of revisiting due to the cost-effective integration of logic and memory in 3D stacks. These two directions bring about several interesting opportunities including performance improvement, energy and cost reduction, product miniaturization, and modular design for improved time to market. In this research, we study the effectiveness of the 3D integration technology and the optimization opportunities that it can provide in the different layers of the memory hierarchy in cluster-based many-core platforms, ranging from intra-cluster L1 to inter-cluster L2 scratchpad memories (SPMs), as well as the main memory. In addition, by moving a part of the computation to where data resides, in the 3D-stacked memory context, we demonstrate further energy and performance improvement opportunities.
Design Space Exploration and Resource Management of Multi/Many-Core Systems
The increasing demand for processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips, as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations in the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends.