
    Object co-location and memory reuse for Java programs

    Get PDF
    postprint

    Fifty Years of ISCA: A data-driven retrospective on key trends

    Full text link
    Computer Architecture, broadly, involves optimizing hardware and software for current and future processing systems. Although there are several other top venues for publishing Computer Architecture research, including ASPLOS, HPCA, and MICRO, ISCA (the International Symposium on Computer Architecture) is one of the oldest, longest-running, and most prestigious venues in the field. ISCA has been organized annually since 1973, except for 1975; accordingly, this year marks the 50th year of ISCA. Thus, we set out to analyze the past 50 years of ISCA to understand who and what have been driving and innovating computing systems thus far. Our analysis identifies several interesting trends that reflect how ISCA, and Computer Architecture in general, has grown and evolved over the past 50 years, spanning minicomputers, general-purpose uniprocessor CPUs, multiprocessor and multi-core CPUs, general-purpose GPUs, and accelerators. Comment: 17 pages, 11 figures

    Waterfall: Primitives Generation on the Fly

    No full text
    Modern languages are typically supported by managed runtimes (Virtual Machines). Since VMs have to deal with many concerns, such as memory management, an abstract execution model, and scheduling, they tend to be very complex. Additionally, VMs have to meet strong performance requirements. This demand for performance is one of the main reasons why many VMs are built statically. Thus, design decisions are frozen at compile time, preventing changes at runtime. One clear example is the impossibility of dynamically adapting or changing the VM's primitives once it has been compiled. In this work we present a toolchain that allows components such as primitives and plug-ins to be altered and configured at runtime. The main contribution is Waterfall, a dynamic and reflective translator from Slang, a restricted subset of Smalltalk, to native code. Waterfall generates primitives on demand and executes them on the fly. We validate our approach by implementing dynamic primitive modification and runtime customization of VM plug-ins.
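    A minimal sketch of the underlying idea, written in Java rather than Slang/Smalltalk: a registry whose primitives can be installed and replaced while the program keeps running. The class and method names are illustrative assumptions, not Waterfall's API.

```java
// Hypothetical sketch: a VM-like registry whose "primitives" can be swapped at runtime,
// illustrating the idea behind Waterfall (names and API are illustrative, not Waterfall's).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class PrimitiveRegistry {
    private final Map<String, Function<int[], Integer>> primitives = new ConcurrentHashMap<>();

    // Install or replace a primitive while the "VM" keeps running.
    public void install(String name, Function<int[], Integer> impl) {
        primitives.put(name, impl);
    }

    // Dispatch through the registry instead of a table frozen at compile time.
    public int call(String name, int... args) {
        Function<int[], Integer> impl = primitives.get(name);
        if (impl == null) throw new IllegalStateException("unknown primitive: " + name);
        return impl.apply(args);
    }

    public static void main(String[] args) {
        PrimitiveRegistry vm = new PrimitiveRegistry();
        vm.install("add", a -> a[0] + a[1]);
        System.out.println(vm.call("add", 2, 3));   // 5

        // Later, the primitive is regenerated on the fly, e.g. with an overflow-checked version.
        vm.install("add", a -> Math.addExact(a[0], a[1]));
        System.out.println(vm.call("add", 2, 3));   // still 5, but via the new implementation
    }
}
```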

    Characterization and reduction of memory usage in 64-bit Java Virtual Machines

    Get PDF

    HeTM: Transactional Memory for Heterogeneous Systems

    Full text link
    Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. Unfortunately, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, which we named Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages speculative techniques and aims at hiding the inherently large communication latency between CPUs and discrete GPUs and at minimizing inter-device synchronization overhead. SHeTM is based on a modular and extensible design that allows alternative TM implementations to be easily integrated on the CPU and GPU sides, providing the flexibility to adopt, on either side, the TM implementation (e.g., in hardware or software) that best fits the application's workload and the architectural characteristics of the processing unit. We demonstrate the efficiency of SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a port of a popular object-caching system. Comment: This work was accepted at the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT'19).
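    To make the programming model concrete, here is a hedged, CPU-only Java sketch of the abstraction HeTM exposes: atomic transactions over one logically shared memory region. It is a deliberately simplified single-lock software TM, not SHeTM's actual speculative CPU+GPU implementation; all names are hypothetical.

```java
// Minimal sketch (not the SHeTM API): atomic transactions over a single shared memory
// region, reduced here to a global-lock software TM with a per-transaction redo log.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class TinyTM {
    private final long[] sharedRegion;                 // the "single memory region" of the abstraction
    private final ReentrantLock commitLock = new ReentrantLock();

    public TinyTM(int size) { this.sharedRegion = new long[size]; }

    public interface Txn { void run(Context ctx); }

    // Per-transaction redo log: writes are buffered and published atomically at commit.
    public class Context {
        private final Map<Integer, Long> writeSet = new HashMap<>();
        public long read(int addr)          { return writeSet.getOrDefault(addr, sharedRegion[addr]); }
        public void write(int addr, long v) { writeSet.put(addr, v); }
    }

    public void atomic(Txn txn) {
        commitLock.lock();   // simplest correct baseline: serialize transactions with one global lock
        try {
            Context ctx = new Context();
            txn.run(ctx);                                                   // writes go to the redo log...
            ctx.writeSet.forEach((addr, v) -> sharedRegion[addr] = v);      // ...and are published at commit
        } finally {
            commitLock.unlock();
        }
    }

    public static void main(String[] args) {
        TinyTM tm = new TinyTM(16);
        tm.atomic(ctx -> ctx.write(0, ctx.read(0) + 42));      // read-modify-write, applied atomically
        tm.atomic(ctx -> System.out.println(ctx.read(0)));     // 42
    }
}
```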

    Tutorial: Stream processing optimizations

    Get PDF
    This tutorial starts with a survey of optimizations for streaming applications. The survey is organized as a catalog that introduces uniform terminology and a common categorization of optimizations across disciplines such as data management, programming languages, and operating systems. After this survey, the tutorial continues with a deep dive into the fission optimization, which automatically transforms streaming applications for data parallelism. Fission helps an application improve its throughput by taking advantage of multiple cores in a machine or, in the case of a distributed streaming engine, multiple machines in a cluster. While the survey of optimizations covers a wide range of work from the literature, the in-depth discussion of fission relies more heavily on the presenters' own research and experience in the area. The tutorial concludes with a discussion of open research challenges in the field of stream processing optimizations. Copyright © 2013 ACM.
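    As a rough illustration of what fission does (a sketch under the assumption of a stateless operator, not the presenters' system), the following Java snippet replicates an operator across worker threads, scatters tuples to the replicas, and merges the results back in order:

```java
// Hedged sketch of the fission idea: replicate a stateless operator across worker
// threads, scatter tuples to the pool, and merge results in submission order.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.function.Function;

public class Fission {
    public static <I, O> List<O> apply(List<I> stream, Function<I, O> operator, int replicas)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(replicas);
        try {
            List<Future<O>> inFlight = new ArrayList<>();
            for (I tuple : stream) {
                inFlight.add(pool.submit(() -> operator.apply(tuple)));   // scatter to any replica
            }
            List<O> out = new ArrayList<>();
            for (Future<O> f : inFlight) {
                out.add(f.get());                                         // merge in original order
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // A CPU-heavy stateless operator benefits from data parallelism across cores.
        List<Integer> tuples = List.of(1, 2, 3, 4, 5, 6, 7, 8);
        System.out.println(apply(tuples, x -> x * x, 4));   // [1, 4, 9, 16, 25, 36, 49, 64]
    }
}
```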

    Garbage collection optimization for non uniform memory access architectures

    Get PDF
    Cache-coherent non-uniform memory access (ccNUMA) architecture is a standard design pattern for contemporary multicore processors, and future generations of architectures are likely to be NUMA. NUMA architectures create new challenges for managed runtime systems. Memory-intensive applications use the system's distributed memory banks to allocate data, and the automatic memory manager collects garbage left in these memory banks. The garbage collector may need to access remote memory banks, which entails access-latency overhead and potential bandwidth saturation on the interconnect between memory banks. This dissertation makes five significant contributions to garbage collection on NUMA systems, with a case-study implementation using the HotSpot Java Virtual Machine. It empirically studies data locality for a Stop-The-World garbage collector when tracing connected objects in NUMA heaps. First, it identifies a locality richness that exists naturally in connected objects consisting of a root object and its reachable set ('rooted sub-graphs'). Second, this dissertation leverages the locality characteristic of rooted sub-graphs to develop a new NUMA-aware garbage collection mechanism: a garbage collector thread processes a local root and its reachable set, which is likely to have a large number of objects in the same NUMA node. Third, a garbage collector thread steals references from sibling threads that run on the same NUMA node to improve data locality. This research evaluates the new NUMA-aware garbage collector using seven benchmarks from the established real-world DaCapo benchmark suite; in addition, the evaluation includes the widely used SPECjbb benchmark, a Neo4j graph-database Java benchmark, and an artificial benchmark. On a multi-hop NUMA architecture, the NUMA-aware garbage collector shows an average performance improvement of 15%, and this gain is shown to result from improved NUMA memory access in a ccNUMA system. Fourth, the existing HotSpot JVM adaptive policy for configuring the number of garbage collection threads is shown to be suboptimal for current NUMA machines: it relies on outdated assumptions and generates a constant thread count, yet the HotSpot JVM still uses this policy in the production version. This research shows that the optimal number of garbage collection threads is application-specific, and configuring that optimal number yields better collection throughput than the default policy. Fifth, this dissertation designs and implements a runtime technique that uses heuristics derived from dynamic collection behavior to calculate an optimal number of garbage collector threads for each collection cycle. The results show an average improvement of 21% in garbage collection performance for the DaCapo benchmarks.
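    An illustrative Java sketch of the node-local work-stealing idea described above (not HotSpot code; the structure and names are hypothetical): each GC worker first drains its own queue of rooted sub-graphs, then steals from siblings on the same NUMA node, and only falls back to remote nodes as a last resort.

```java
// Illustrative sketch only: GC workers trace their own rooted sub-graphs first and
// steal work preferring same-NUMA-node siblings, mirroring the locality-aware policy.
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;

public class NumaAwareTracing {
    static final int NODES = 2, WORKERS_PER_NODE = 2;

    // One work deque per worker; a worker's own rooted sub-graphs land in its own deque.
    static final ConcurrentLinkedDeque<Object>[] queues =
            new ConcurrentLinkedDeque[NODES * WORKERS_PER_NODE];

    static Object claimWork(int worker) {
        Object ref = queues[worker].pollFirst();            // 1) local roots: best locality
        if (ref != null) return ref;
        int node = worker / WORKERS_PER_NODE;
        for (int w = node * WORKERS_PER_NODE; w < (node + 1) * WORKERS_PER_NODE; w++) {
            if (w != worker && (ref = queues[w].pollLast()) != null) return ref;   // 2) same-node steal
        }
        for (int w = 0; w < queues.length; w++) {
            if ((ref = queues[w].pollLast()) != null) return ref;                  // 3) remote steal, last resort
        }
        return null;                                        // no work left: tracing phase is done
    }

    public static void main(String[] args) {
        for (int i = 0; i < queues.length; i++) queues[i] = new ConcurrentLinkedDeque<>(List.of("root" + i));
        System.out.println(claimWork(0));   // root0: taken from the worker's own queue
        System.out.println(claimWork(0));   // root1: stolen from the same-node sibling
    }
}
```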

    Scalable locality-conscious multithreaded memory allocation

    Full text link

    White-box compression: Learning and exploiting compact table representations

    Get PDF
    We formulate a conceptual model for white-box compression, which represents the logical columns in tabular data as an openly defined function over some actually stored physical columns. Each block of data is thus accompanied by a header that describes this functional mapping. Because these compression functions are openly defined, database systems can exploit them during query optimization and execution, enabling, for example, better filter-predicate pushdown. In addition, we show that white-box compression is able to identify a broad variety of new opportunities for compression, leading to much better compression factors. These opportunities are identified using an automatic learning process that learns the functions from the data; we provide a recursive pattern-driven algorithm for such learning. Finally, we demonstrate the effectiveness of white-box compression on a new benchmark we contribute here: the Public BI benchmark, which provides a rich set of real-world datasets. We believe our basic prototype for white-box compression opens the way for future research into transparent compressed data representations on the one hand and database system architectures that can efficiently exploit them on the other, and should be seen as another step in the direction of data management systems that are self-learning and optimize themselves for the data they are deployed on.
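    A toy Java sketch of the core idea (not the paper's learning algorithm; the pattern and names are invented for illustration): if a logical string column matches "constant prefix + integer", store only the integers physically and keep the reconstruction function as an explicit mapping that the system can reason about, e.g. for predicate pushdown.

```java
// Toy sketch of white-box compression: the logical column is an openly defined
// function over a stored physical column, and that function is learned from the data.
import java.util.Arrays;
import java.util.List;
import java.util.function.IntFunction;
import java.util.stream.Collectors;

public class WhiteBoxColumn {
    final int[] physical;                 // actually stored column (small integers)
    final IntFunction<String> mapping;    // logical = f(physical), described in the block header

    WhiteBoxColumn(int[] physical, IntFunction<String> mapping) {
        this.physical = physical;
        this.mapping = mapping;
    }

    // "Learning" step: check whether every value matches prefix + digits; if so, compress.
    static WhiteBoxColumn learn(List<String> logical, String prefix) {
        boolean matches = logical.stream().allMatch(v -> v.startsWith(prefix)
                && v.substring(prefix.length()).chars().allMatch(Character::isDigit));
        if (!matches) {
            throw new IllegalArgumentException("pattern does not hold; fall back to black-box compression");
        }
        int[] ids = logical.stream()
                .mapToInt(v -> Integer.parseInt(v.substring(prefix.length())))
                .toArray();
        return new WhiteBoxColumn(ids, id -> prefix + id);
    }

    List<String> decompress() {
        return Arrays.stream(physical).mapToObj(mapping).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> logical = List.of("ORDER-1001", "ORDER-1002", "ORDER-1007");
        WhiteBoxColumn col = learn(logical, "ORDER-");
        // A predicate like id = 'ORDER-1002' can now be pushed down as physical = 1002.
        System.out.println(Arrays.toString(col.physical));   // [1001, 1002, 1007]
        System.out.println(col.decompress());                // [ORDER-1001, ORDER-1002, ORDER-1007]
    }
}
```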