    Emulating and evaluating hybrid memory for managed languages on NUMA hardware

    Non-volatile memory (NVM) has the potential to become a mainstream memory technology and challenge DRAM. Researchers evaluating the speed, endurance, and abstractions of hybrid memories with DRAM and NVM typically use simulation, making it easy to evaluate the impact of different hardware technologies and parameters. Simulation is, however, extremely slow, limiting the applications and datasets in the evaluation. Simulation also precludes critical workloads, especially those written in managed languages such as Java and C#. Good methodology embraces a variety of techniques for evaluating new ideas, expanding the experimental scope, and uncovering new insights. This paper introduces a platform to emulate hybrid memory for managed languages using commodity NUMA servers. Emulation complements simulation but offers richer software experimentation. We use a thread-local socket to emulate DRAM and a remote socket to emulate NVM. We use standard C library routines to allocate heap memory on the DRAM and NVM sockets for use with explicit memory management or garbage collection. We evaluate the emulator with various configurations of write-rationing garbage collectors that improve NVM lifetimes by limiting writes to NVM, across 15 applications and various datasets and workload configurations. We show that emulation and simulation confirm each other's trends in terms of writes to NVM for different software configurations, increasing our confidence in predicting future system effects. Emulation brings novel insights, such as the non-linear effects of multi-programmed workloads on NVM writes, and that Java applications write significantly more than their C++ equivalents. We make our software infrastructure publicly available to advance the evaluation of novel memory management schemes on hybrid memories.
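
    As a concrete illustration of the socket-based mechanism described above, the sketch below binds one heap region to the local socket (emulated DRAM) and one to a remote socket (emulated NVM) using libnuma. The node numbers, region size, and error handling are illustrative assumptions for a two-socket machine, not the paper's actual allocator; link with -lnuma.

        /* Minimal sketch: emulate a DRAM/NVM split by placing heap regions
         * on different NUMA nodes. Node numbers are placeholders. */
        #include <numa.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define HEAP_BYTES (64UL * 1024 * 1024)

        int main(void) {
            if (numa_available() < 0) {
                fprintf(stderr, "NUMA is not available on this system\n");
                return EXIT_FAILURE;
            }
            int dram_node = 0;                /* thread-local socket: emulated DRAM */
            int nvm_node  = numa_max_node();  /* remote socket: emulated NVM */

            void *dram_heap = numa_alloc_onnode(HEAP_BYTES, dram_node);
            void *nvm_heap  = numa_alloc_onnode(HEAP_BYTES, nvm_node);
            if (!dram_heap || !nvm_heap) {
                fprintf(stderr, "NUMA heap allocation failed\n");
                return EXIT_FAILURE;
            }
            /* A managed runtime would carve objects out of these regions,
             * e.g. keeping the frequently written nursery in dram_heap and
             * the mature space in nvm_heap for a write-rationing collector. */
            numa_free(dram_heap, HEAP_BYTES);
            numa_free(nvm_heap, HEAP_BYTES);
            return EXIT_SUCCESS;
        }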

    Towards co-designed optimizations in parallel frameworks: A MapReduce case study

    The explosion of Big Data was followed by the proliferation of numerous complex parallel software stacks that aim to tackle the challenges of the data deluge. A drawback of such a multi-layered, hierarchical deployment is the inability to maintain and delegate vital semantic information between layers in the stack. Software abstractions increase the semantic distance between an application and its generated code. However, parallel software frameworks contain inherent semantic information that general-purpose compilers are not designed to exploit. This paper presents a case study demonstrating how the specific semantic information of the MapReduce paradigm can be exploited on multicore architectures. MR4J, a MapReduce framework implemented in Java, has been evaluated against hand-optimized C and C++ equivalents. The initial observed results led to the design of a semantically aware optimizer that runs automatically without requiring modification to application code. The optimizer is able to speed up the execution of MR4J by up to 2.0x. The introduced optimization not only improves the performance of the generated code during the map phase, but also reduces pressure on the garbage collector. This demonstrates how semantic information can be harnessed without sacrificing sound software engineering practices when using parallel software frameworks.
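
    One way a framework can exploit MapReduce semantics of this kind (shown here as a generic illustration, not MR4J's actual optimizer) is in-mapper combining: when the reduce function is known to be associative and commutative, map outputs can be folded into a per-mapper table instead of allocating one intermediate record per emitted pair, which in a managed runtime directly reduces allocation and hence GC pressure. A minimal C sketch of the idea:

        /* Illustrative in-mapper combining for word count. The fixed-size
         * open-addressing table is a toy; real code would resize it. */
        #include <stdio.h>
        #include <string.h>

        #define TABLE_SIZE 1024

        struct slot { const char *key; long count; };
        static struct slot table[TABLE_SIZE];

        static unsigned long hash(const char *s) {
            unsigned long h = 5381;
            while (*s) h = h * 33 + (unsigned char)*s++;
            return h;
        }

        /* emit(key, 1) with combining: update in place, allocate nothing.
         * Assumes the key string outlives the table (true for literals). */
        static void emit_combined(const char *key) {
            unsigned long i = hash(key) % TABLE_SIZE;
            while (table[i].key && strcmp(table[i].key, key) != 0)
                i = (i + 1) % TABLE_SIZE;      /* linear probing */
            if (!table[i].key) table[i].key = key;
            table[i].count++;                  /* combine instead of emit */
        }

        int main(void) {
            const char *words[] = { "map", "reduce", "map", "shuffle", "map" };
            for (size_t w = 0; w < sizeof words / sizeof *words; w++)
                emit_combined(words[w]);
            for (size_t i = 0; i < TABLE_SIZE; i++)
                if (table[i].key)
                    printf("%s\t%ld\n", table[i].key, table[i].count);
            return 0;
        }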

    Eliminating read barriers through procrastination and cleanliness

    Managed languages use read barriers to interpret forwarding pointers introduced to keep track of copied objects. For example, in a split-heap managed runtime for a multicore environment, an object initially allocated on a local heap may be copied to a shared heap if it becomes the source of a store operation whose target location resides on the shared heap. As part of the copy operation, a forwarding pointer may be established to allow existing references to the local object to reference the copied version. In this paper, we consider the design of a managed runtime that avoids the need for read barriers. Our design is premised on the availability of a sufficient degree of concurrency to stall operations that would otherwise necessitate the copy. Stalled actions are deferred until the next local collection, avoiding exposing forwarding pointers to the mutator. In certain important cases, procrastination is unnecessary: lightweight runtime techniques can sometimes be used to allow objects to be eagerly copied when their set of incoming references is known, or when it can be determined that having multiple copies would not violate program semantics. Experimental results over a range of parallel benchmarks on a number of different architectural platforms, including an 864-core Azul Vega 3 and a 48-core Intel SCC, indicate that our approach leads to notable performance gains (20–32% on average) without incurring any additional complexity.
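
    For context, the sketch below shows the kind of read barrier such a design eliminates: every field load first chases a forwarding pointer installed when an object is copied to the shared heap. The header layout and tag bit are illustrative assumptions, not the paper's runtime representation.

        /* Illustrative read barrier over a tagged header word. */
        #include <stdint.h>
        #include <stdio.h>

        typedef struct object {
            uintptr_t header;   /* low bit set => header holds a forwarding pointer */
            intptr_t  field;
        } object;

        #define FORWARDED_BIT ((uintptr_t)1)

        /* Read barrier: chase forwarding pointers before every access. */
        static inline object *read_barrier(object *obj) {
            while (obj->header & FORWARDED_BIT)
                obj = (object *)(obj->header & ~FORWARDED_BIT);
            return obj;
        }

        /* Every field load pays the barrier cost. */
        static inline intptr_t load_field(object *obj) {
            return read_barrier(obj)->field;
        }

        int main(void) {
            object new_obj = { 0, 42 };
            object old_obj = { (uintptr_t)&new_obj | FORWARDED_BIT, 0 };
            printf("%ld\n", (long)load_field(&old_obj));  /* 42, via the barrier */
            return 0;
        }

    By stalling the operations that would trigger a copy until the next local collection, the design summarized above ensures the mutator never observes a forwarded object, so field loads can dereference the object directly and read_barrier disappears.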

    Topology-Aware Parallelism for NUMA Copying Collectors

    NUMA-aware parallel algorithms in runtime systems attempt to improve locality by allocating memory from local NUMA nodes. Researchers have suggested that the garbage collector should profile memory access patterns or use object locality heuristics to determine the target NUMA node before moving an object. However, these solutions are costly when applied to every live object in the reference graph. Our earlier research suggests that connected objects represented by rooted subgraphs provide abundant locality and are well suited to NUMA architectures. In this paper, we utilize the intrinsic locality of rooted subgraphs to improve parallel copying collector performance. Our new topology-aware parallel copying collector preserves rooted-subgraph integrity by moving connected objects as a unit to the target NUMA node. In addition, it distributes and assigns copying tasks to appropriate (i.e. NUMA-node-local) GC threads. For load balancing, our solution enforces locality on the work-stealing mechanism by stealing from local NUMA nodes only. We evaluated our approach on SPECjbb2013, DaCapo 9.12, and Neo4j. Results show an improvement in GC performance of up to 2.5x and 37% better application performance.
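
    The locality-enforced work stealing described above can be pictured as a steal loop that skips victims on other sockets. The deque layout and single-threaded steal below are illustrative stand-ins (a real collector would use concurrent lock-free deques), not the paper's implementation.

        /* Sketch of NUMA-local work stealing for GC copying tasks. */
        #include <stdio.h>

        #define MAX_THREADS 64

        struct deque {
            void *tasks[256];
            int   head, tail;   /* head: steal end, tail: owner end */
            int   node;         /* NUMA node of the owning GC thread */
        };

        static struct deque queues[MAX_THREADS];
        static int nthreads;

        /* Single-threaded stand-in; a real collector uses a lock-free deque. */
        static void *try_steal(struct deque *q) {
            if (q->head >= q->tail) return NULL;   /* empty */
            return q->tasks[q->head++];
        }

        /* Steal a copying task, honoring topology: only consider victims
         * whose deques live on this thread's own NUMA node. */
        void *steal_local_node(int self) {
            int my_node = queues[self].node;
            for (int i = 0; i < nthreads; i++) {
                if (i == self || queues[i].node != my_node)
                    continue;                      /* never steal across sockets */
                void *task = try_steal(&queues[i]);
                if (task) return task;
            }
            return NULL;  /* no node-local work left */
        }

        int main(void) {
            nthreads = 2;
            queues[0].node = queues[1].node = 0;        /* both threads on node 0 */
            queues[1].tasks[queues[1].tail++] = "obj";  /* one pending copy task */
            printf("stole: %s\n", (char *)steal_local_node(0));
            return 0;
        }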

    Effective memory management for mobile environments

    Smartphones, tablets, and other mobile devices exhibit vastly different constraints compared to classic computing environments like desktops, laptops, or servers. Mobile devices run dozens of so-called “apps” hosted by independent virtual machines (VMs). All these VMs run concurrently, and each VM deploys purely local heuristics to organize resources like memory, performance, and power. Such a design causes conflicts across all layers of the software stack, calling for the evaluation of VMs and optimization techniques specific to mobile frameworks. In this dissertation, we study the design of managed runtime systems for mobile platforms. More specifically, we deepen the understanding of interactions between garbage collection (GC) and system layers. We develop tools to monitor the memory behavior of Android-based apps and to characterize GC performance, leading to the development of new techniques for memory management that address energy constraints, time performance, and responsiveness. We implement a GC-aware frequency scaling governor for Android devices. We also explore the tradeoffs of power and performance in vivo for a range of realistic GC variants, with established benchmarks and real applications running on Android virtual machines. We control for variation due to dynamic voltage and frequency scaling (DVFS) and just-in-time (JIT) compilation, and across established dimensions of heap memory size and concurrency. Finally, we provision GC as a global service that collects statistics from all running VMs and then makes an informed decision that optimizes across all of them (and not just locally), and across all layers of the stack. Our evaluation illustrates the power of such a central coordination service and garbage collection mechanism in improving memory utilization, throughput, and adaptability to user activities. In fact, our techniques aim at a sweet spot where total on-chip energy is reduced (20–30%) with minimal impact on throughput and responsiveness (5–10%). The simplicity and efficacy of our approach reach well beyond the usual optimization techniques.
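
    One plausible mechanism behind a GC-aware frequency governor (a hedged sketch, not the dissertation's implementation) is a pair of hooks around collection that retune the CPU clock through the Linux cpufreq sysfs interface. This assumes the userspace governor and root privileges; the frequencies, the single-CPU scope, and the policy direction are placeholders, since which way to move the clock during GC is exactly the tradeoff such a governor tunes.

        /* Sketch: pin a CPU frequency around a collection via cpufreq sysfs. */
        #include <stdio.h>

        static int write_sysfs(const char *path, const char *value) {
            FILE *f = fopen(path, "w");
            if (!f) return -1;           /* needs root and cpufreq support */
            fputs(value, f);
            return fclose(f);
        }

        void on_gc_start(void) {
            /* e.g. drop cpu0 to a low clock while the memory-bound GC runs */
            write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                        "800000");       /* kHz; placeholder value */
        }

        void on_gc_end(void) {
            write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                        "1800000");      /* restore; placeholder value */
        }

        int main(void) {
            on_gc_start();   /* would be invoked from the VM's collection hooks */
            /* ... collection runs ... */
            on_gc_end();
            return 0;
        }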

    Enhancing in-memory Efficiency for MapReduce-based Data Processing

    This is a post-peer-review, pre-copyedit version of an article published in the Journal of Parallel and Distributed Computing. The final authenticated version is available online at: https://doi.org/10.1016/j.jpdc.2018.04.001

    As the memory capacity of computational systems increases, the in-memory data management of Big Data processing frameworks becomes more crucial for performance. This paper analyzes and improves the memory efficiency of Flame-MR, a framework that accelerates Hadoop applications, providing valuable insight into the impact of memory management on performance. By optimizing memory allocation, garbage collection overheads and execution times have been reduced by up to 85% and 44%, respectively, on a multi-core cluster. Moreover, different data buffer implementations are evaluated, showing that off-heap buffers achieve better results overall. Memory resources are also leveraged by caching intermediate results, improving iterative applications by up to 26%. The memory-enhanced version of Flame-MR has been compared with Hadoop and Spark on the Amazon EC2 cloud platform. The experimental results show significant performance benefits, reducing Hadoop execution times by up to 65% while providing very competitive results compared to Spark.

    Funding: Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P, AEI/FEDER/EU. Ministerio de Educación; FPU14/0280
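
    The allocation optimization behind the GC savings reported above can be illustrated by buffer recycling: rather than allocating a fresh buffer per operation, a framework reuses buffers from a pool. In Flame-MR's Java setting the analogous relief comes from pooled or off-heap (direct) buffers that the collector does not have to trace; the C sketch below shows only the generic pooling idea, with placeholder sizes.

        /* Illustrative buffer pool: acquire reuses before it allocates. */
        #include <stdlib.h>

        #define BUF_BYTES (1 << 20)   /* 1 MiB data buffers; placeholder size */
        #define POOL_MAX  32

        static void *pool[POOL_MAX];
        static int   pooled = 0;

        void *buffer_acquire(void) {
            if (pooled > 0) return pool[--pooled];  /* reuse: no new allocation */
            return malloc(BUF_BYTES);               /* may be NULL; caller checks */
        }

        void buffer_release(void *buf) {
            if (pooled < POOL_MAX) pool[pooled++] = buf;  /* keep for reuse */
            else free(buf);
        }

        int main(void) {
            void *a = buffer_acquire();   /* fresh allocation */
            buffer_release(a);            /* returned to the pool */
            void *b = buffer_acquire();   /* reused: b == a, no malloc */
            buffer_release(b);
            return 0;
        }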