17 research outputs found

Recommended from our members

LUsim: A Framework for Simulation-Based Performance Modeling and Prediction of Parallel Sparse LU Factorization

Author: Baden Scott B.
Cicotti Pietro
Li Xiaoye Sherry
Univ. of California San Diego
Publication venue: Lawrence Berkeley National Laboratory
Publication date: 15/04/2008
Field of study

Sparse parallel factorization is among the most complicated and irregular algorithms to analyze and optimize. Performance depends both on system characteristics, such as the floating point rate, the memory hierarchy, and the interconnect performance, and on input matrix characteristics, such as the number and location of nonzeros. We present LUsim, a simulation framework for modeling the performance of sparse LU factorization. Our framework uses micro-benchmarks to calibrate the parameters of machine characteristics and additional tools to facilitate real-time performance modeling. We are using LUsim to analyze an existing parallel sparse LU factorization code, and to explore a latency tolerant variant. We developed and validated a model of the factorization in SuperLU_DIST, then we modeled and implemented a new variant of slud, replacing a blocking collective communication phase with a non-blocking asynchronous point-to-point one. Our strategy realized a mean improvement of 11 percent over a suite of test matrices.

UNT Digital Library

Tarragon : a programming model for latency-hiding scientific computations

Author: Cicotti Pietro
Publication venue: eScholarship, University of California
Publication date: 01/01/2011
Field of study

In supercomputing systems, architectural changes that increase computational power are often reflected in the programming model. As a result, in order to realize and sustain the potential performance of such systems, it is necessary in practice to deal with architectural details and explicitly manage the resources to an increasing extent. In particular, programmers are required to develop code that exposes a high degree of parallelism, exhibits high locality, dynamically adapts to the available resources, and hides communication latency. Hiding communication latency is crucial to realize the potential of today's distributed memory machines with highly parallel processing modules, and technological trends indicate that communication latencies will continue to be an issue as the performance gap between computation and communication widens. However, under Bulk Synchronous Parallel models, the predominant paradigm in scientific computing, scheduling is embedded into the application code. All the phases of a computation are defined and laid out as a linear sequence of operations limiting overlap and the program's ability to adapt to communication delays. This thesis proposes an alternative model, called Tarragon, to overcome the limitations of Bulk Synchronous Parallelism. Tarragon, which is based on dataflow, targets latency tolerant scientific computations. Tarragon supports a task-dependency graph abstraction in which tasks, the basic unit of computation, are organized as a graph according to their data dependencies, i.e. task precedence. In addition to the task graph, Tarragon supports metadata abstractions, annotations to the task graph, to express locality information and scheduling policies to improve performance. Tarragon's functionality and underlying programming methodology are demonstrated on three classes of computations used in scientific domains: structured grids, sparse linear algebra, and dynamic programming. 
In the application studies, Tarragon implementations achieve high performance, in many cases exceeding the performance of equivalent latency-tolerant, hard coded MPI implementations. The results presented in this dissertation demonstrate that data-driven execution, coupled with metadata abstractions, effectively supports latency tolerance. In addition, performance metadata enable performance optimization techniques that are decoupled from the algorithmic formulation and the control flow of the application code. By expressing the structure of the computation and its characteristics with metadata, the programmer can focus on the application and rely on Tarragon and its run-time system to automatically overlap communication with computation and optimize the performanc

Ezid

eScholarship - University of California

ADAMANT: Tools to Capture, Analyze, and Manage Data Movement

Author: Carrington Laura
Cicotti Pietro
Publication venue: The Author(s). Published by Elsevier B.V.
Publication date: 31/12/2016
Field of study

In the converging world of High Performance Computing and Big Data, moving data is becoming a critical aspect of performance and energy efficiency. In this paper we present the Advanced DAta Movement Analysis Toolkit (ADAMANT), a set of tools to capture and analyze data movement within an application, and to aid in understanding performance and energy efficiency in current and future systems. ADAMANT identifies all the data objects allocated by an application and uses instrumentation modules to monitor relevant events (e.g. cache misses). Finally, ADAMANT produces a per-object performance profile. In this paper we demonstrate the use of ADAMANT in analyzing three applications, BT, BFS, and Velvet, and evaluate the impact of different memory technology. With the information produced by ADAMANT we were able to model and compare different memory configurations and object placement solutions. In BFS we devised a placement which outperforms caching, while in the other two cases we were able to point out which data objects may be problematic for the configurations explored, and would require refactoring to improve performance.

Elsevier - Publisher Connector

Evaluation of emerging memory technologies for HPC, data intensive applications

Author: Amoghavarsha Suresh
Laura Carrington
Pietro Cicotti
Publication venue
Publication date: 23/04/2020
Field of study

DRAM technology has several shortcomings in terms of performance, energy efficiency and scaling. Several emerging memory technologies have the potential to compensate for the limitations of DRAM when replacing or complementing DRAM in the memory sub-system. In this paper, we evaluate the impact of emerging technologies on HPC and data-intensive workloads modeling a 5-level hybrid memory hierarchy design. Our results show that 1) an additional level of faster DRAM technology (i.e. EDRAM or HMC) interposed between the last level cache and DRAM can improve performance and energy efficiency, 2) a non-volatile main memory (i.e. PCM, STTRAM, or FeRAM) with a small DRAM acting as a cache can reduce the cost and energy consumption at large capacities, and 3) a combination of the two approaches, which essentially replaces the traditional DRAM with a small EDRAM or HMC cache between the last level cache and the non-volatile memory, can grant capacity and improved performance and energy efficiency. We also explore a hybrid DRAM-NVM design with a partitioned address space and find that this approach is marginally beneficial compared to the simpler 5-level design. Finally, we generalize our analysis and show the impact of emerging technologies for a range of latency and energy parameters.

CiteSeerX

Latency hiding and performance tuning with graph-based execution

Author: Pietro Cicotti
Scott B. Baden
Publication venue
Publication date: 01/01/2011
Field of study

In current practice, scientific programmers and HPC users are required to develop code that exposes a high degree of parallelism, exhibits high locality, dynamically adapts to the available resources, and hides communication latency. Hiding communication latency is crucial to realize the potential of today’s distributed memory machines with highly parallel processing modules, and technological trends indicate that communication latencies will continue to be an issue as the performance gap between computation and communication widens. However, under Bulk Synchronous Parallel models, the predominant paradigm in scientific computing, scheduling is embedded into the application code. All the phases of a computation are defined and laid out as a linear sequence of operations limiting overlap and the program’s ability to adapt to communication delays. In this paper we present an alternative model, called Tarragon, to overcome the limitations of Bulk Synchronous Parallelism. Tarragon, which is based on dataflow, targets latency tolerant scientific computations. Tarragon supports a task-dependency graph abstraction in which tasks, the basic unit of computation, are organized as a graph according to their data dependencies, i.e. task precedence. In addition to the task graph, Tarragon supports metadata abstractions, annotations to the task graph, to express locality information and scheduling policies to improve performance. Tarragon’s functionality and underlying programming methodology are demonstrated on three classes of computations used in scientific domains: structured grids, sparse linear algebra, and dynamic programming. In the application studies, Tarragon implementations achieve high performance, in many cases exceeding the performance of equivalent latency-tolerant, hard coded MPI implementations.

CiteSeerX

Crossref

DGMonitor: a Performance Monitoring Tool for Sandbox-based Desktop Grid Platforms

Author: Chien Andrew
Cicotti Pietro
Taufer Michela
Publication venue: eScholarship, University of California
Publication date: 24/10/2003
Field of study

Accurate and continuous monitoring and profiling are important issues of performance tuning and scheduling optimization. In desktop grid systems based on sandboxing techniques these issues are particularly challenging because (1) subjobs inside sandboxes are executed in a virtual environment and (2) sandboxes are usually reset to an initial (empty) state at each subjob termination. To address this problem, we present in this paper DGMonitor, a monitoring tool to build a global, accurate and continuous view of real resource utilization for desktop grids based on sandboxing techniques. Our monitoring tool provides unobtrusive and reliable performance measures, uses a simple performance data model and is easy to use. Our work demonstrates that DGMonitor can easily take over the monitoring of large desktop grids (up to 12000 workers) maintaining low load in terms of resource consumption due to the monitoring process (less than 0.1%) on desktop PCs. Although we use DGMonitor with the Entropia DGrid platform, it can be easily integrated in other desktop grids and its data can be used as an information source for existing information services for performance tuning and scheduling optimization. Keywords: Performance monitoring and profiling, desktop grids, sandboxing techniques, distributed computing. Pre-2018 CSE ID: CS2003-077

CiteSeerX

eScholarship - University of California

Asynchronous programming with Tarragon

Author: Pietro Cicotti
Scott B. Baden
Publication venue
Publication date
Field of study

Tarragon is an actor-based programming model and library for implementing latency tolerant asynchronous event driven simulations. It is novel in its support for meta data describing run time virtualized process structures, which may be optimized as a free-standing object. We demonstrate early results with a synthetic benchmark, and observe that Tarragon can mask communication costs with ongoing computation.

CiteSeerX

LUsim: A Framework for Simulation-Based Performance Modeling and Prediction of Parallel Sparse LU Factorization

Author: Cicotti Pietro
Univ. of California San Diego
Publication venue: eScholarship, University of California
Publication date: 08/05/2008
Field of study

eScholarship - University of California