301 research outputs found

    GLB: Lifeline-based Global Load Balancing library in X10

    Full text link
    We present GLB, a programming model and an associated implementation that can handle a wide range of irregular paral- lel programming problems running over large-scale distributed systems. GLB is applicable both to problems that are easily load-balanced via static scheduling and to problems that are hard to statically load balance. GLB hides the intricate syn- chronizations (e.g., inter-node communication, initialization and startup, load balancing, termination and result collection) from the users. GLB internally uses a version of the lifeline graph based work-stealing algorithm proposed by Saraswat et al. Users of GLB are simply required to write several pieces of sequential code that comply with the GLB interface. GLB then schedules and orchestrates the parallel execution of the code correctly and efficiently at scale. We have applied GLB to two representative benchmarks: Betweenness Centrality (BC) and Unbalanced Tree Search (UTS). Among them, BC can be statically load-balanced whereas UTS cannot. In either case, GLB scales well-- achieving nearly linear speedup on different computer architectures (Power, Blue Gene/Q, and K) -- up to 16K cores

    GSI: a GPU stall inspector to characterize the sources of memory stalls for tightly coupled GPUs

    Get PDF
    In recent years the power wall has prevented the continued scaling of single core performance. This has led to the rise of dark silicon and motivated a move toward parallelism and specialization. As a result, energy-efficient high-throughput GPU cores are increasingly favored for accelerating data-parallel applications. However, the best way to efficiently communicate and synchronize across heterogeneous cores remains an important open research question. Many methods have been proposed to improve the efficiency of heterogeneous memory systems, but current methods for evaluating the performance effects of these innovations are limited in their ability to attribute differences in execution time to sources of latency in the memory system. Performance characterization of tightly coupled CPU-GPU systems is complicated by the high levels of parallelism present in GPU codes. Existing simulation tools provide only coarse-grained metrics which can obscure the underlying memory system interactions that cause performance differences. In this thesis we introduce GPU Stall Inspector (GSI), a method for identifying and visualizing the causes of GPU stalls with a focus on a tightly coupled CPU-GPU memory subsystem. We demonstrate the utility of our approach by evaluating the sources of stalls in several recent architectural innovations for tightly coupled, heterogeneous CPU-GPU systems

    A comparison of automatic protocol generation techniques

    Get PDF
    Due to the increasing complexity of applications and the availability of high speed networks classical protocols have become the main bottleneck in communication systems. Although tailored protocols are able to respond to the needs of a given application their development is expensive in terms of time and effort. An automatic protocogeneration environment is most desirable. Two approaches currently used are the stub compilation and the runtime adaptive techniques. We have studied these two approaches and the behaviour of the resulting tailored transport protocols. Relative performance comparisons and discussions about these two approaches are presented in this paper

    Parallel Processes in HPX: Designing an Infrastructure for Adaptive Resource Management

    Get PDF
    Advancement in cutting edge technologies have enabled better energy efficiency as well as scaling computational power for the latest High Performance Computing(HPC) systems. However, complexity, due to hybrid architectures as well as emerging classes of applications, have shown poor computational scalability using conventional execution models. Thus alternative means of computation, that addresses the bottlenecks in computation, is warranted. More precisely, dynamic adaptive resource management feature, both from systems as well as application\u27s perspective, is essential for better computational scalability and efficiency. This research presents and expands the notion of Parallel Processes as a placeholder for procedure definitions, targeted at one or more synchronous domains, meta data for computation and resource management as well as infrastructure for dynamic policy deployment. In addition to this, the research presents additional guidelines for a framework for resource management in HPX runtime system. Further, this research also lists design principles for scalability of Active Global Address Space (AGAS), a necessary feature for Parallel Processes. Also, to verify the usefulness of Parallel Processes, a preliminary performance evaluation of different task scheduling policies is carried out using two different applications. The applications used are: Unbalanced Tree Search, a reference dynamic graph application, implemented by this research in HPX and MiniGhost, a reference stencil based application using bulk synchronous parallel model. The results show that different scheduling policies provide better performance for different classes of applications; and for the same application class, in certain instances, one policy fared better than the others, while vice versa in other instances, hence supporting the hypothesis of the need of dynamic adaptive resource management infrastructure, for deploying different policies and task granularities, for scalable distributed computing

    A Generalized Framework on Beamformer Design and CSI Acquisition for Single-Carrier Massive MIMO Systems in Millimeter Wave Channels

    Get PDF
    In this paper, we establish a general framework on the reduced dimensional channel state information (CSI) estimation and pre-beamformer design for frequency-selective massive multiple-input multiple-output MIMO systems employing single-carrier (SC) modulation in time division duplex (TDD) mode by exploiting the joint angle-delay domain channel sparsity in millimeter (mm) wave frequencies. First, based on a generic subspace projection taking the joint angle-delay power profile and user-grouping into account, the reduced rank minimum mean square error (RR-MMSE) instantaneous CSI estimator is derived for spatially correlated wideband MIMO channels. Second, the statistical pre-beamformer design is considered for frequency-selective SC massive MIMO channels. We examine the dimension reduction problem and subspace (beamspace) construction on which the RR-MMSE estimation can be realized as accurately as possible. Finally, a spatio-temporal domain correlator type reduced rank channel estimator, as an approximation of the RR-MMSE estimate, is obtained by carrying out least square (LS) estimation in a proper reduced dimensional beamspace. It is observed that the proposed techniques show remarkable robustness to the pilot interference (or contamination) with a significant reduction in pilot overhead

    Extensive evaluation of programming models and ISAs impact on multicore soft error reliability

    Get PDF
    To take advantage of the performance enhancements provided by multicore processors, new instruction set architectures (ISAs) and parallel programming libraries have been investigated across multiple industrial segments. It is investigated the impact of parallelization libraries and distinct ISAs on the soft error reliability of two multicore ARM processor models (i.e., Cortex-A9 and CortexA72), running Linux Kernel and benchmarks with up to 87 billion instructions. An extensive soft error evaluation with more than 1.2 million simulation hours, considering ARMv7 and ARMv8 ISAs and the NAS Parallel Benchmark (NPB) suite is presented

    The Eureka Programming Model for Speculative Task Parallelism

    Get PDF
    In this paper, we describe the Eureka Programming Model (EuPM) that simplifies the expression of speculative parallel tasks, and is especially well suited for parallel search and optimization applications. The focus of this work is to provide a clean semantics for, and efficiently support, such "eureka-style" computations (EuSCs) in general structured task parallel programming models. In EuSCs, a eureka event is a point in a program that announces that a result has been found. A eureka triggered by a speculative task can cause a group of related speculative tasks to become redundant, and enable them to be terminated at well-defined program points. Our approach provides a bound on the additional work done in redundant speculative tasks after such a eureka event occurs. We identify various patterns that are supported by our eureka construct, which include search, optimization, convergence, and soft real-time deadlines. These different patterns of computations can also be safely combined or nested in the EuPM, along with regular task-parallel constructs, thereby enabling high degrees of composability and reusability. As demonstrated by our implementation, the EuPM can also be implemented efficiently. We use a cooperative runtime that uses delimited continuations to manage the termination of redundant tasks and their synchronization at join points. In contrast to current approaches, EuPM obviates the need for cumbersome manual refactoring by the programmer that may (for example) require the insertion of if checks and early return statements in every method in the call chain. Experimental results show that solutions using the EuPM simplify programmability, achieve performance comparable to hand-coded speculative task-based solutions and out-perform non-speculative task-based solutions

    Dynamic Machine Level Resource Allocation to Improve Tasking Performance Across Multiple Processes

    Get PDF
    Across the landscape of computing, parallelism within applications is increasingly important in order to track advances in hardware capability and meet critical performance metrics. However, writing parallel applications is difficult to do in a scalable way, which has led to the creation of tasking libraries and language extensions like OpenMP, Intel Threading Building Blocks, Qthreads, and more. These tools abstract parallel execution by expressing it in terms of work units (tasks) rather than specific hardware details. This abstraction enables scaling and allows programmers to write software solutions that can leverage whatever level of parallelism is available.However, the typical task scheduler is greedy and naïve. Thus, concurrent parallel processes compete for computational resources, which results in unnecessary context switches, mis-timed synchronization, unnecessary resource contention, and the associated consequences. By providing a mechanism of communication between the task schedulers, processes can cooperate to more effectively utilize hardware and avoid the negative consequences of coarse-grained resource contention. This work uses Qthreads to demonstrate that cooperative allocation of computational resources reduces contention and decreases execution time. The overhead added for the resource allocation is shown to have minimal impact. Using the Unbalanced Tree Search (UTS) and High Performance Conjugate Gradient (HPCG) benchmarks, execution time across concurrent processes shows significant decreases across a range of machines running a variety of hardware resources and software configurations. Tests also indicate that dynamic compute-resource allocation provides a clear performance benefit even when hardware resources are oversubscribed: when there are more processes than processing units. UTS tests saw an average of 4.98% reduction in execution time in Linux compared to Qthread\u27s yielding option and an 89.32% reduction in execution time in Apple OS X. HPCG resulted in partitioning reducing execution time by an average of 22.31% compared to the default Qthreads configuration across all test platforms
    • …
    corecore