23 research outputs found
Static and Dynamic Scheduling for Effective Use of Multicore Systems
Multicore systems have increasingly gained importance in high performance computers. Compared to the traditional microarchitectures, multicore architectures have a simpler design, higher performance-to-area ratio, and improved power efficiency. Although the multicore architecture has various advantages, traditional parallel programming techniques do not apply to the new architecture efficiently. This dissertation addresses how to determine optimized thread schedules to improve data reuse on shared-memory multicore systems and how to seek a scalable solution to designing parallel software on both shared-memory and distributed-memory multicore systems.
We propose an analytical cache model to predict the number of cache misses on the time-sharing L2 cache on a multicore processor. The model provides an insight into the impact of cache sharing and cache contention between threads. Inspired by the model, we build the framework of affinity based thread scheduling to determine optimized thread schedules to improve data reuse on all the levels in a complex memory hierarchy. The affinity based thread scheduling framework includes a model to estimate the cost of a thread schedule, which consists of three submodels: an affinity graph submodel, a memory hierarchy submodel, and a cost submodel. Based on the model, we design a hierarchical graph partitioning algorithm to determine near-optimal solutions. We have also extended the algorithm to support threads with data dependences. The algorithms are implemented and incorporated into a feedback directed optimization prototype system. The prototype system builds upon a binary instrumentation tool and can improve program performance greatly on shared-memory multicore architectures.
We also study the dynamic data-availability driven scheduling approach to designing new parallel software on distributed-memory multicore architectures. We have implemented a decentralized dynamic runtime system. The design of the runtime system is focused on the scalability metric. At any time only a small portion of a task graph exists in memory. We propose an algorithm to solve data dependences without process cooperation in a distributed manner. Our experimental results demonstrate the scalability and practicality of the approach for both shared-memory and distributed-memory multicore systems. Finally, we present a scalable nonblocking topology-aware multicast scheme for distributed DAG scheduling applications
Task-based Runtime Optimizations Towards High Performance Computing Applications
The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream advances on the hardware with many-core systems, deep hierarchical memory subsystem, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular, which fosters developersâ productivity at extreme scale by abstracting the underlying hardware complexity.
In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications., i.e., data redistribution, geospatial modeling and 3D unstructured mesh deformation here. Data redistribution aims to reshuffle data to optimize some objective for an algorithm, whose objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore reducing the time-to-solution for the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that can cause huge computational challenges in fluid-structure interaction (FSI) applications. Therefore, in this dissertation, Redistribute-PaRSEC, ExaGeoStat-PaRSEC and HiCMA-PaRSEC are proposed to efficiently tackle these HPC applications respectively at extreme scale, and they are evaluated on multiple HPC clusters, including AMD-based, Intel-based, Arm-based CPU systems and IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system for servicing the next-generation scientific applications
Analysis and Design of Communication Avoiding Algorithms for Out of Memory(OOM) SVD
Many applications â including big data analytics, information retrieval, gene expression analysis, and numerical weather prediction â require the solution of large, dense singular value decomposition (SVD). The size of matrices used in many of these applications is becoming too large to fit into into a computerâs main memory at one time, and the traditional SVD algorithms that require all the matrix components to be loaded into memory before computation starts cannot be used directly. Moving data (communication) between levels of memory hierarchy and the disk exposes extra challenges to design SVD for such big matrices because of the exponential growth in the gap between floating-point arithmetic rate and bandwidth for many different storage devices on modern high performance computers. In this dissertation, we have analyzed communication overhead on hierarchical memory systems and disks for SVD algorithms and designed communication-avoiding (CA) Out of Memory (OOM) SVD algorithms. By Out of Memory we mean that the matrix is too big to fit in the main memory and therefore must reside in external or internal storage. We have studied communication overhead for classical one-stage blocked SVD and two-stage tiled SVD algorithms and proposed our OOM SVD algorithm, which reduces the communication cost. We have presented theoretical analysis and strategies to design CA OOM SVD algorithms, developed optimized implementation of CA OOM SVD for multicore architecture, and presented its performance results.
When matrices are tall, performance of OOM SVD can be improved significantly by carrying out QR decomposition on the original matrix in the first place. The upper triangular matrix generated by QR decomposition may fit in the main memory, and in-core SVD can be used efficiently. Even if the upper triangular matrix does not fit in the main memory, OOM SVD will work on a smaller matrix. That is why we have analyzed communication reduction for OOM QR algorithm, implemented optimized OOM tiled QR for multicore systems and showed performance improvement of OOM SVD algorithms for tall matrices
Composing resilience techniques: ABFT, periodic and incremental checkpointing
An electrochemical sensor is described for the determination of L-dopa (levodopa; 3,4-dihydroxyphenylalanine). An inkjet-printed carbon nanotube (IJPCNT) electrode was modified with manganese dioxide microspheres by drop-casting. They coating was characterized by field emission scanning electron microscopy, Fourier-transform infrared spectroscopy and X-ray powder diffraction. The sensor, best operated at a working voltage of 0.3 V, has a linear response in the 0.1 to 10 \u3bcM L-dopa concentration range, a 54 nM detection limit, excellent reproducibility, repeatability and selectivity. The amperometric approach was applied to the determination of L-dopa in spiked biological fluids and displayed satisfactory accuracy and precision. Graphical abstract Schematic representation of an amperometric method for determination L-dopa. It is based on the use of inkjet-printed carbon nanotube electrode (IJPCNT) modified with manganese dioxide (MnO2)
Parallel programming systems for scalable scientific computing
High-performance computing (HPC) systems are more powerful than ever before. However, this rise in performance brings with it greater complexity, presenting significant challenges for researchers who wish to use these systems for their scientific work. This dissertation explores the development of scalable programming solutions for scientific computing. These solutions aim to be effective across a diverse range of computing platforms, from personal desktops to advanced supercomputers.To better understand HPC systems, this dissertation begins with a literature review on exascale supercomputers, massive systems capable of performing 10Âčâž floating-point operations per second. This review combines both manual and data-driven analyses, revealing that while traditional challenges of exascale computing have largely been addressed, issues like software complexity and data volume remain. Additionally, the dissertation introduces the open-source software tool (called LitStudy) developed for this research.Next, this dissertation introduces two novel programming systems. The first system (called Rocket) is designed to scale all-versus-all algorithms to massive datasets. It features a multi-level software-based cache, a divide-and-conquer approach, hierarchical work-stealing, and asynchronous processing to maximize data reuse, exploit data locality, dynamically balance workloads, and optimize resource utilization. The second system (called Lightning) aims to scale existing single-GPU kernel functions across multiple GPUs, even on different nodes, with minimal code adjustments. Results across eight benchmarks on up to 32 GPUs show excellent scalability.The dissertation concludes by proposing a set of design principles for developing parallel programming systems for scalable scientific computing. These principles, based on lessons from this PhD research, represent significant steps forward in enabling researchers to efficiently utilize HPC systems
Recommended from our members
A Parallel Direct Method for Finite Element Electromagnetic Computations Based on Domain Decomposition
High performance parallel computing and direct (factorization-based) solution methods have been the two main trends in electromagnetic computations in recent years. When time-harmonic (frequency-domain) Maxwell\u27s equation are directly discretized with the Finite Element Method (FEM) or other Partial Differential Equation (PDE) methods, the resulting linear system of equations is sparse and indefinite, thus harder to efficiently factorize serially or in parallel than alternative methods e.g. integral equation solutions, that result in dense linear systems. State-of-the-art sparse matrix direct solvers such as MUMPS and PARDISO don\u27t scale favorably, have low parallel efficiency and high memory footprint. This work introduces a new class of sparse direct solvers based on domain decomposition method, termed Direct Domain Decomposition Method (D3M), which is reliable, memory efficient, and offers very good parallel scalability for arbitrary 3D FEM problems.
Unlike recent trends in approximate/low-rank solvers, this method focuses on `numerically exact\u27 solution methods as they are more reliable for complex `real-life\u27 models. The proposed method leverages physical insights at every stage of the development through a new symmetric domain decomposition method (DDM) with one set of Lagrange multipliers. Applying a special regularization scheme at the interfaces, either artificial loss or gain is introduced to each domain to eliminate non-physical internal resonances. A block-wise recursive algorithm based on Takahashi relationship is proposed for the efficient computation of discrete Dirichlet-to-Neumann (DtN) map to reduce the volumetric problem from all domains into an auxiliary surfacial problem defined on the domain interfaces only. Numerical results show up to 50% run-time saving in DtN map computation using the proposed block-wise recursive algorithm compared to alternative approaches. The auxiliary unknowns on the domain interfaces form a considerably (approximately an order of magnitude) smaller block-wise sparse matrix, which is efficiently factorized using a customized block LDL factorization with restricted pivoting to ensure stability.
The parallelization of the proposed D3M is realized based on Directed Acyclic Graph (DAG). Recent advances in parallel dense direct solvers, have shifted toward parallel implementation that rely on DAG scheduling to achieve highly efficient asynchronous parallel execution. However, adaptation of such schemes to sparse matrices is harder and often impractical. In D3M, computation of each domain\u27s discrete DtN map ``embarrassingly parallel\u27\u27, whereas the customized block LDLT is suitable for a block directed acyclic graph (B-DAG) task scheduling, similar to that used in dense matrix parallel direct solvers. In this approach, computations are represented as a sequence of small tasks that operate on domains of DDM or dense matrix blocks of the reduced matrix. These tasks can be statically scheduled for parallel execution using their DAG dependencies and weights that depend on estimates of computation and communication costs.
Comparisons with state-of-the-art exact direct solvers on electrically large problems suggest up to 20% better parallel efficiency, 30% - 3X less memory and slightly faster in runtime, while maintaining the same accuracy
Performance Modeling and Prediction for Dense Linear Algebra
This dissertation introduces measurement-based performance modeling and
prediction techniques for dense linear algebra algorithms. As a core principle,
these techniques avoid executions of such algorithms entirely, and instead
predict their performance through runtime estimates for the underlying compute
kernels. For a variety of operations, these predictions allow to quickly select
the fastest algorithm configurations from available alternatives. We consider
two scenarios that cover a wide range of computations:
To predict the performance of blocked algorithms, we design
algorithm-independent performance models for kernel operations that are
generated automatically once per platform. For various matrix operations,
instantaneous predictions based on such models both accurately identify the
fastest algorithm, and select a near-optimal block size.
For performance predictions of BLAS-based tensor contractions, we propose
cache-aware micro-benchmarks that take advantage of the highly regular
structure inherent to contraction algorithms. At merely a fraction of a
contraction's runtime, predictions based on such micro-benchmarks identify the
fastest combination of tensor traversal and compute kernel
Fundamentals
Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters