50 research outputs found

    A 'cool' load balancer for parallel applications

    Meeting the power requirements of future exascale machines will be a major challenge. Our focus in this paper is to minimize cooling power, and we propose a technique that uses a combination of DVFS and temperature-aware load balancing to constrain core temperatures as well as save cooling energy. Our scheme is specifically designed to suit parallel applications, which are typically tightly coupled. The temperature control comes at the cost of execution time, and we try to minimize the timing penalty. We experiment with three applications (with different power utilization profiles), run on a 128-core (32-node) cluster with a dedicated air conditioning unit. We evaluate the efficacy of our scheme based on three metrics: ability to control average core temperatures, thereby avoiding hot-spot occurrence; timing penalty minimization; and cooling energy savings. Our results show cooling energy savings of up to 57% with a timing penalty mostly in the range of 2 to 20%.
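
    The control loop described above can be sketched against the stock Linux cpufreq and thermal sysfs interfaces. The following Python sketch is illustrative only: the threshold, the frequency steps, the single thermal zone, and the rebalance() hook are assumptions, not the paper's actual parameters or implementation.

        # Hedged sketch of a temperature-aware DVFS control loop (not the paper's code).
        # Assumes Linux cpufreq with the "userspace" governor and sysfs thermal zones.
        import time

        TEMP_THRESHOLD_MC = 60_000                          # 60 C in millidegrees (illustrative)
        FREQ_STEPS_KHZ = [2_400_000, 2_000_000, 1_600_000]  # illustrative P-states

        def read_temp_mc(zone=0):
            # a real scheme would map each core to its own sensor
            with open(f"/sys/class/thermal/thermal_zone{zone}/temp") as f:
                return int(f.read())

        def set_freq_khz(cpu, khz):
            path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_setspeed"
            with open(path, "w") as f:
                f.write(str(khz))

        def control_loop(cpus, rebalance, period_s=5.0):
            level = {c: 0 for c in cpus}
            while True:
                for c in cpus:
                    if read_temp_mc() > TEMP_THRESHOLD_MC and level[c] + 1 < len(FREQ_STEPS_KHZ):
                        level[c] += 1                       # step frequency down on hot cores
                        set_freq_khz(c, FREQ_STEPS_KHZ[level[c]])
                rebalance()  # migrate objects away from slowed cores (hypothetical hook)
                time.sleep(period_s)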

    Runtime Coordinated Heterogeneous Tasks in Charm++

    Effective utilization of the increasingly heterogeneous hardware in modern supercomputers is a significant challenge. Many applications have seen performance gains by using GPUs, but many implementations leave CPUs sitting idle. In this paper, we describe a runtime-managed system for coordinating heterogeneous execution. This system manages data transfers to and from GPU devices and schedules work across the computational resources of the system. The programmer need only tag methods and parameters to enable heterogeneous execution. Using this system, we observe improvements in programmer productivity and application performance. For selected benchmarks, when using heterogeneous execution we observe speedups of up to 3.09x relative to using only the host cores or only the device.
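
    The paper's actual tagging syntax is not shown in the abstract, so the Python sketch below only illustrates the general idea of tag-then-dispatch heterogeneous scheduling: a task carries CPU and GPU variants, and a tiny scheduler sends work to the device when it is free and falls back to an idle host core otherwise. The hetero() decorator and the lock standing in for device occupancy are hypothetical, not Charm++'s interface.

        # Illustrative sketch of tag-based heterogeneous dispatch (hypothetical API).
        import threading

        gpu_lock = threading.Lock()  # stands in for "GPU is busy" state

        def hetero(fn_cpu=None, fn_gpu=None):
            """Tag a task with CPU and GPU variants; the scheduler picks at runtime."""
            def run(*args):
                if fn_gpu is not None and gpu_lock.acquire(blocking=False):
                    try:
                        return fn_gpu(*args)  # device path (would also stage data)
                    finally:
                        gpu_lock.release()
                return fn_cpu(*args)          # fall back to an idle host core
            return run

        saxpy = hetero(
            fn_cpu=lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)],
            fn_gpu=lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)],  # placeholder kernel
        )

        print(saxpy(2.0, [1, 2, 3], [4, 5, 6]))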

    Object-Oriented Implementation of the NAS Parallel Benchmarks using Charm++

    This report describes experiences with implementing the NAS Computational Fluid Dynamics benchmarks using a parallel object-oriented language, Charm++. Our main objective in implementing the NAS CFD kernel benchmarks was to develop a code that could be used to easily experiment with different domain decomposition strategies and dynamic load balancing. We also wished to leverage the object orientation provided by Charm++ to develop reusable abstractions that would simplify the process of developing parallel applications. We first describe the Charm++ parallel programming model and the parallel object array abstraction, then go into detail about the Scalar Pentadiagonal (SP) and Lower/Upper Triangular (LU) benchmarks, along with performance results. Finally, we conclude with an evaluation of the methodology used.
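
    The parallel object-array abstraction can be shown in miniature with Charm4Py, the Python binding for Charm++ that also appears later in this listing. A minimal sketch, assuming a 4x4 block decomposition; the report's SP and LU codes are Charm++/C++ and considerably more involved.

        # Minimal Charm4Py sketch of a 2D chare array for block domain decomposition.
        from charm4py import charm, Chare, Array

        class Block(Chare):
            def solve(self):
                # each array element owns one block of the decomposed domain
                print('block', self.thisIndex, 'on PE', charm.myPe())

        def main(args):
            blocks = Array(Block, (4, 4))        # 4x4 object array of migratable units
            blocks.solve(awaitable=True).get()   # broadcast and wait for completion
            charm.exit()

        charm.start(main)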

    Project Final Report: HPC-Colony II

    This report recounts the HPC-Colony II project, a computer science effort funded by DOE's Advanced Scientific Computing Research office. The project included researchers from ORNL, IBM, and the University of Illinois at Urbana-Champaign. The topic of the effort was adaptive system software for extreme-scale parallel machines. A description of findings is included.

    Distributed Strategies for Topic Modeling

    Topic modeling algorithms (like Latent Dirichlet Allocation) tend to be very slow when run over large document collections. In this presentation, we discuss distributed strategies for topic modeling. We use Charm++ as our parallelization framework. Our results show that parallelization can considerably increase the efficiency of topic modeling.
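
    The presentation does not spell out the distributed strategy, so the sketch below follows a common approach (AD-LDA-style partitioning): shard the documents, run a Gibbs sweep per shard against a stale copy of the topic-word counts, then merge the count deltas. Vocabulary size, topic count, and hyperparameters are illustrative, and this is not necessarily the presentation's exact scheme.

        # Hedged sketch of partitioned (AD-LDA-style) Gibbs sampling for LDA.
        import numpy as np

        rng = np.random.default_rng(0)
        V, K, beta = 50, 4, 0.01   # vocabulary size, topics, smoothing (illustrative)

        # Two document shards: the "distributed" partition of the corpus.
        shards = [[rng.integers(V, size=20).tolist() for _ in range(10)] for _ in range(2)]

        # Global topic-word counts and per-token topic assignments.
        tw = np.zeros((K, V), dtype=int)
        ts = np.zeros(K, dtype=int)
        zs = []
        for docs in shards:
            z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    tw[z[d][i], w] += 1
                    ts[z[d][i]] += 1
            zs.append(z)

        def local_sweep(docs, z, tw_local, ts_local):
            """One Gibbs sweep over a shard against a stale local copy of the
            counts (document-topic prior omitted for brevity)."""
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]
                    tw_local[k, w] -= 1; ts_local[k] -= 1   # remove current assignment
                    p = (tw_local[:, w] + beta) / (ts_local + V * beta)
                    k = rng.choice(K, p=p / p.sum())
                    z[d][i] = k
                    tw_local[k, w] += 1; ts_local[k] += 1

        # Shards sweep independently on a snapshot, then deltas are merged;
        # in Charm++/Charm4Py the merge would be a reduction across chares.
        snap_tw, snap_ts = tw.copy(), ts.copy()
        for docs, z in zip(shards, zs):
            tw_l, ts_l = snap_tw.copy(), snap_ts.copy()
            local_sweep(docs, z, tw_l, ts_l)
            tw += tw_l - snap_tw; ts += ts_l - snap_ts
        print("tokens per topic after one distributed sweep:", ts)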

    Using an Adaptive HPC Runtime System to Reconfigure the Cache Hierarchy

    The cache hierarchy often consumes a large portion of a processor’s energy. To save energy in HPC environments, this paper proposes software-controlled reconfiguration of the cache hierarchy with an adaptive runtime system. Our approach addresses the two major limitations associated with other methods that reconfigure the caches: predicting the application’s future and finding the best cache hierarchy configuration. Our approach uses formal language theory to express the application’s pattern and help predict its future. Furthermore, it uses the prevalent Single Program Multiple Data (SPMD) model of HPC codes to quickly find the best configuration in parallel. Our experiments using cycle-level simulations indicate that 67% of the cache energy can be saved with only a 2.4% performance penalty on average. Moreover, we demonstrate that, for some applications, switching to a software-controlled reconfigurable streaming buffer configuration can improve performance by up to 30% and save 75% of the cache energy.
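
    The SPMD trick the abstract mentions, searching for the best configuration in parallel, can be sketched with mpi4py: each rank measures one candidate configuration for the next phase, and all ranks then adopt the winner. The configuration names, the scoring function, and the use of a single scalar score here are illustrative assumptions, not the paper's design.

        # Hedged mpi4py sketch of the SPMD parallel configuration search idea.
        from mpi4py import MPI
        import random

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        CONFIGS = ["L1-full", "L1-half", "L2-off", "stream-buffer"]  # hypothetical

        def measure_energy_delay(config):
            # stand-in for instrumenting one application phase under `config`
            random.seed(hash(config) % 2**32)
            return random.uniform(0.5, 1.5)

        my_config = CONFIGS[rank % len(CONFIGS)]
        my_score = measure_energy_delay(my_config)

        scores = comm.allgather((my_score, my_config))  # every rank sees every result
        best = min(scores)[1]                           # lowest energy-delay wins
        if rank == 0:
            print("next phase reconfigured to:", best)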

    Performance Evaluation of Python Parallel Programming Models: Charm4Py and mpi4py

    Python is rapidly becoming the lingua franca of machine learning and scientific computing. With the broad use of frameworks such as NumPy, SciPy, and TensorFlow, scientific computing and machine learning are seeing a productivity boost on systems without a requisite loss in performance. While high-performance libraries often provide adequate performance within a node, distributed computing is required to scale Python across nodes and make it genuinely competitive in large-scale high-performance computing. Many frameworks, such as Charm4Py, DaCe, Dask, Legate NumPy, mpi4py, and Ray, scale Python across nodes. However, little is known about these frameworks' relative strengths and weaknesses, leaving practitioners and scientists without enough information about which frameworks are suitable for their requirements. In this paper, we seek to narrow this knowledge gap by studying the relative performance of two such frameworks: Charm4Py and mpi4py. We perform a comparative performance analysis of Charm4Py and mpi4py using CPU- and GPU-based microbenchmarks and other representative mini-apps for scientific computing.
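
    A typical microbenchmark in such comparisons is a two-rank ping-pong that measures round-trip message latency. A minimal mpi4py sketch follows; the message size and iteration count are illustrative, and this is not the paper's benchmark harness.

        # Hedged sketch of a two-rank ping-pong latency microbenchmark in mpi4py.
        from mpi4py import MPI
        import numpy as np

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        buf = np.zeros(1024, dtype=np.uint8)   # 1 KiB message (illustrative size)
        iters = 1000

        comm.Barrier()
        t0 = MPI.Wtime()
        for _ in range(iters):
            if rank == 0:
                comm.Send(buf, dest=1); comm.Recv(buf, source=1)
            elif rank == 1:
                comm.Recv(buf, source=0); comm.Send(buf, dest=0)
        if rank == 0:
            print("round-trip latency:", (MPI.Wtime() - t0) / iters * 1e6, "us")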

    A study of memory-aware scheduling in message driven parallel programs

    This paper presents a simple but powerful memory-aware scheduling mechanism that adaptively schedules tasks in a message-driven distributed-memory parallel program. The scheduler adapts its behavior whenever memory usage exceeds a threshold by scheduling tasks known to reduce memory usage. The usefulness of the scheduler and its low overhead are demonstrated in the context of an LU matrix factorization program. In the LU program, only a single additional line of code is required to make use of the new general-purpose memory-aware scheduling mechanism. Without memory-aware scheduling, the LU program can only run with small problem sizes, but with the new memory-aware scheduling, the program scales to larger problem sizes.
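
    The mechanism, switching to memory-reducing tasks once usage crosses a threshold, can be sketched as a two-queue scheduler. The sketch below is a toy model in Python; the paper's implementation lives inside the Charm++ runtime scheduler, and the threshold and memory accounting here are illustrative.

        # Hedged sketch of threshold-triggered memory-aware scheduling (toy model).
        from collections import deque

        class MemoryAwareScheduler:
            def __init__(self, threshold_bytes):
                self.threshold = threshold_bytes
                self.mem_used = 0
                self.normal = deque()      # ordinary tasks (may allocate memory)
                self.reducers = deque()    # tasks known to reduce memory usage

            def submit(self, task, reduces_memory=False):
                (self.reducers if reduces_memory else self.normal).append(task)

            def run(self):
                while self.normal or self.reducers:
                    over = self.mem_used > self.threshold
                    # prefer memory-reducing tasks only while over the threshold
                    q = self.reducers if (over and self.reducers) else (self.normal or self.reducers)
                    self.mem_used += q.popleft()()   # each task returns its memory delta

        sched = MemoryAwareScheduler(threshold_bytes=100)
        for _ in range(5):
            sched.submit(lambda: 40)                        # allocates 40 "bytes"
            sched.submit(lambda: -60, reduces_memory=True)  # frees 60 "bytes"
        sched.run()
        print("final memory:", sched.mem_used)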