4,495 research outputs found
DeepNVM++: Cross-Layer Modeling and Optimization Framework of Non-Volatile Memories for Deep Learning
Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic
random access memory (STT-MRAM) and spin-orbit torque magnetic random access
memory (SOT-MRAM) have significant advantages compared to conventional SRAM due
to their non-volatility, higher cell density, and scalability features. While
previous work has investigated several architectural implications of NVM for
generic applications, in this work we present DeepNVM++, a framework to
characterize, model, and analyze NVM-based caches in GPU architectures for deep
learning (DL) applications by combining technology-specific circuit-level
models and the actual memory behavior of various DL workloads. We present both
iso-capacity and iso-area performance and energy analysis for systems whose
last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM
technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to
3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area
reduction compared to conventional SRAM, respectively. Under iso-area
assumptions, STT-MRAM and SOT-MRAM provide up to 2x and 2.3x EDP reduction and
accommodate 2.3x and 3.3x cache capacity when compared to SRAM, respectively.
We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM
achieve orders of magnitude EDP reduction when compared to SRAM for large cache
capacities. Our comprehensive cross-layer framework is demonstrated on
STT-/SOT-MRAM technologies and can be used for the characterization, modeling,
and analysis of any NVM technology for last-level caches in GPUs for DL
applications.

Comment: 12 pages, 10 figures.
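The iso-capacity comparison above boils down to an energy-delay product (EDP) computed from per-access energy and latency for each technology, weighted by the workload's read/write mix. The sketch below illustrates that style of calculation; every per-access parameter and access count is a hypothetical placeholder, not a value from the DeepNVM++ paper.

```python
# Illustrative EDP comparison between last-level cache technologies.
# All parameters are hypothetical placeholders, NOT values from DeepNVM++.

def edp(reads, writes, tech):
    """Energy-delay product for a given access mix and technology."""
    energy = reads * tech["e_read"] + writes * tech["e_write"]  # total dynamic energy (nJ)
    delay = reads * tech["t_read"] + writes * tech["t_write"]   # total access latency (ns)
    return energy * delay

# Hypothetical per-access energy (nJ) and latency (ns) for each technology.
techs = {
    "SRAM":     {"e_read": 0.50, "e_write": 0.50, "t_read": 2.0, "t_write": 2.0},
    "STT-MRAM": {"e_read": 0.20, "e_write": 1.00, "t_read": 2.5, "t_write": 8.0},
    "SOT-MRAM": {"e_read": 0.15, "e_write": 0.40, "t_read": 2.2, "t_write": 3.0},
}

# Read-dominated access mix, typical of DL inference (placeholder counts).
reads, writes = 9_000_000, 1_000_000

baseline = edp(reads, writes, techs["SRAM"])
for name, tech in techs.items():
    print(f"{name:9s} EDP reduction vs. SRAM: {baseline / edp(reads, writes, tech):.2f}x")
```

Iso-area comparisons follow the same pattern, except that the higher cell density of the MRAM technologies is translated into larger cache capacities (and thus different miss counts) rather than smaller area.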
On the Theory of Spatial and Temporal Locality
This paper studies the theory of caching and of temporal and spatial locality. We show the following results: (1) hashing can be used to guarantee that caches with limited associativity behave as well as a fully associative cache; (2) temporal locality cannot be characterized by one, or even a few, parameters; (3) temporal locality and spatial locality cannot be studied separately; and (4) unlike temporal locality, spatial locality cannot be managed efficiently online.
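Result (1) can be illustrated with a toy simulation: a strided access pattern that thrashes a set-associative cache under plain modular indexing is spread evenly across sets once the index is computed from a hash of the block address. The cache model, trace, and hash choice below are simplified assumptions for illustration, not the paper's construction.

```python
# Toy illustration of result (1): hashing the set index of a limited-associativity
# cache defeats pathological conflict patterns. Simplified sketch, not the paper's proof.
import random
from collections import OrderedDict

def simulate(trace, num_sets, ways, index_fn):
    """LRU set-associative cache; returns the number of misses on the trace."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        s = sets[index_fn(addr) % num_sets]
        if addr in s:
            s.move_to_end(addr)            # hit: refresh LRU recency
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)      # evict the least recently used block
            s[addr] = True
    return misses

num_sets, ways = 64, 4
# 16 blocks that all map to set 0 under modular indexing (addresses are multiples of 64).
working_set = [i * num_sets for i in range(4 * ways)]
trace = working_set * 100

salt = random.Random(0).getrandbits(64)
plain  = simulate(trace, num_sets, ways, lambda a: a)                # modular index: thrashes
hashed = simulate(trace, num_sets, ways, lambda a: hash((a, salt)))  # hashed index: fits easily
print(f"modular-index misses: {plain}, hashed-index misses: {hashed}")
```

With modular indexing every access conflicts in a single 4-way set and misses; with the hashed index the 16-block working set fits in the cache and essentially only the cold misses remain.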
Catalog Dynamics: Impact of Content Publishing and Perishing on the Performance of a LRU Cache
The Internet heavily relies on Content Distribution Networks and transparent
caches to cope with the ever-increasing traffic demand of users. Content,
however, is essentially transient: once published at a given time, its
popularity vanishes over time. All requests for a given document are then
concentrated between the publishing time and an effective perishing time.
In this paper, we propose a new model for the arrival of content requests,
which takes into account the dynamical nature of the content catalog. Based on
two large traffic traces collected on the Orange network, we use the
semi-experimental method and determine invariants of the content request
process. This allows us to define a simple mathematical model for content
requests; by extending the so-called "Che approximation", we then compute the
performance of an LRU cache fed with such a request process, expressed by its hit ratio. We numerically validate the accuracy of our model against trace-based simulation.

Comment: 13 pages, 9 figures. Full version of the article submitted to the ITC 2014 conference. Small corrections in the appendix from the previous version.
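For reference, the classical Che approximation that the paper extends estimates the hit ratio of an LRU cache of size C under the independent reference model: a characteristic time t_C is obtained by solving sum_i (1 - exp(-lambda_i * t_C)) = C, where lambda_i is the request rate of content i, and content i then hits with probability 1 - exp(-lambda_i * t_C). The sketch below implements this static-catalog baseline (not the paper's dynamic-catalog extension); the Zipf popularity law and the catalog and cache sizes are arbitrary example values.

```python
# Classical Che approximation for the hit ratio of an LRU cache under the
# independent reference model (IRM). The paper extends this to a dynamic
# catalog; this sketch only covers the static baseline.
import numpy as np
from scipy.optimize import brentq

def lru_hit_ratio_che(rates, cache_size):
    """Solve for the characteristic time t_C, then return the request-weighted hit ratio."""
    rates = np.asarray(rates, dtype=float)
    # t_C is defined by: sum_i (1 - exp(-rate_i * t_C)) = cache_size
    occupancy = lambda t: np.sum(1.0 - np.exp(-rates * t)) - cache_size
    t_c = brentq(occupancy, 1e-12, 1e12)
    hit_per_content = 1.0 - np.exp(-rates * t_c)
    return np.sum(rates * hit_per_content) / np.sum(rates)

# Example: 10,000 contents with Zipf(0.8) popularity, cache holding 500 of them.
n, alpha = 10_000, 0.8
popularity = 1.0 / np.arange(1, n + 1) ** alpha
print(f"Predicted LRU hit ratio: {lru_hit_ratio_che(popularity, 500):.3f}")
```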
TaskInsight: Understanding Task Schedules Effects on Memory and Performance
Recent scheduling heuristics for task-based applications have managed to improve performance by taking into account memory-related properties such as data locality and cache sharing. However, there is still a general lack of tools that can provide insight into why, and where, different schedulers improve memory behavior, and how this relates to the applications' performance.
To address this, we present TaskInsight, a technique to characterize the memory behavior of different task schedulers through the analysis of data reuse between tasks. TaskInsight provides high-level, quantitative information that can be correlated with tasks' performance variation over time to understand how scheduling choices affect data reuse through the caches. TaskInsight is useful for diagnosing and identifying which scheduling decisions affected performance, when they were taken, and why performance changed, in both single- and multi-threaded executions.
We demonstrate how TaskInsight can diagnose cases where poor scheduling caused over 10% difference in performance for tasks of the same type, due to changes in the tasks' data reuse through the private and shared caches, in single- and multi-threaded executions of the same application. This flexible insight is key for optimization in many contexts, including data locality, throughput, memory footprint, and even energy efficiency.
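The underlying analysis, classifying each data access by where the block was last touched, can be sketched in a few lines. The schedule below is a made-up example, and the three-way classification is a simplification of what the actual TaskInsight tool measures.

```python
# Toy sketch of data-reuse classification between tasks: for every block a task
# touches, record whether it is a cold access, a reuse on the same thread
# (private caches may serve it), or a reuse across threads (shared cache at best).
# Hypothetical schedule; this is not the actual TaskInsight implementation.
from collections import Counter

# (task_id, thread_id, data blocks touched) as produced by some hypothetical scheduler.
schedule = [
    ("A0", 0, {1, 2, 3}),
    ("B0", 1, {3, 4}),
    ("A1", 0, {1, 2, 5}),
    ("B1", 0, {4, 5}),
]

last_touch = {}        # block -> thread that last used it
reuse = Counter()

for task, thread, blocks in schedule:
    for block in blocks:
        if block not in last_touch:
            reuse["cold (first touch)"] += 1
        elif last_touch[block] == thread:
            reuse["same-thread reuse (private caches)"] += 1
        else:
            reuse["cross-thread reuse (shared cache)"] += 1
        last_touch[block] = thread

for kind, count in reuse.items():
    print(f"{kind}: {count}")
```

Correlating such per-task reuse counts with per-task execution times over the run is what lets a schedule-induced performance change be traced back to specific scheduling decisions.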
When parallel speedups hit the memory wall
After Amdahl's trailblazing work, many other authors proposed analytical speedup models, but none considered the limiting effect of the memory wall. These models account for aspects such as problem-size variation, memory size, communication overhead, and synchronization overhead, but they assume data-access delays to be constant. In practice, however, such delays can vary, for example,
according to the number of cores used and the ratio between processor and
memory frequencies. Given the large number of possible configurations of
operating frequency and number of cores that current architectures can offer,
suitable speedup models to describe such variations among these configurations
are quite desirable for off-line or on-line scheduling decisions. This work
proposes new parallel speedup models that account for variations of the average
data-access delay to describe the limiting effect of the memory wall on
parallel speedups. Analytical results indicate that the proposed models capture the desired behavior, and experimental results on real hardware validate them. Additionally, we show that, when accounting for parameters that reflect
the intrinsic characteristics of the applications, such as degree of
parallelism and susceptibility to the memory wall, our proposal has significant
advantages over machine-learning-based modeling. Moreover, besides being a black-box approach, conventional machine-learning modeling needs about one order of magnitude more measurements to reach the level of accuracy achieved by our models, as our experiments show.

Comment: 24 pages.
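As a rough illustration of the limiting effect described above, consider an Amdahl-style execution-time model in which the average data-access delay grows with the number of cores due to contention. The functional form and every parameter value below are assumptions made for illustration only; they are not the speedup models proposed in the paper.

```python
# Illustrative Amdahl-style speedup with a core-count-dependent memory-delay term.
# Functional form and parameters are assumptions, NOT the paper's models.

def exec_time(p, f=0.95, mem_fraction=0.3, contention=0.05):
    """Normalized execution time on p cores.

    f            -- parallelizable fraction of the computation
    mem_fraction -- share of single-core time spent waiting on memory
    contention   -- relative growth of the average data-access delay per extra core
    """
    compute = (1 - f) + f / p                             # classic Amdahl compute time
    memory = mem_fraction * (1 + contention * (p - 1))    # delay inflated by contention
    return compute + memory

def speedup(p, **params):
    return exec_time(1, **params) / exec_time(p, **params)

for p in (1, 2, 4, 8, 16, 32, 64):
    print(f"p = {p:3d}  speedup = {speedup(p):.2f}")
```

Even with a 95% parallel fraction, the memory term caps and eventually reverses the speedup as cores are added, which is the qualitative behavior such models aim to capture.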