    Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures

    A new solver featuring time-space adaptation and error control has recently been introduced to tackle the numerical solution of stiff reaction-diffusion systems. Based on operator splitting, finite volume adaptive multiresolution, and high-order time integrators with stability properties tailored to each operator, this strategy yields high computational efficiency for large multidimensional computations on standard architectures such as powerful workstations. However, the data structure of the original implementation, based on trees of pointers, offers limited opportunities for efficiency enhancements while posing serious challenges for parallel programming and load balancing. The present contribution proposes a new implementation of the whole set of numerical methods, including the Radau5 and ROCK4 integrators, relying on a fully different data structure and on the TBB library for shared-memory, task-based parallelism with work-stealing. The performance of our implementation is assessed on a series of test cases of increasing difficulty in two and three dimensions on multi-core and many-core architectures, demonstrating high scalability.
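    The abstract names TBB as the task-parallel runtime. As a hedged illustration of that pattern only (not the paper's code; the Cell layout and react kernel are invented placeholders), the C++ sketch below applies a per-cell reaction update over a flat array with tbb::parallel_for, whose work-stealing scheduler supplies the kind of dynamic load balancing the strategy relies on.

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <cstddef>
#include <vector>

struct Cell { double u, v; };            // hypothetical two-species state

static void react(Cell& c, double dt) {  // placeholder reaction kernel
    c.u += dt * (c.u * c.u * c.v - c.u);
    c.v += dt * (c.u - c.u * c.u * c.v);
}

void reaction_step(std::vector<Cell>& cells, double dt) {
    // TBB splits the range into tasks; idle threads steal work, which is
    // what balances the load across cores without manual partitioning.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, cells.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                react(cells[i], dt);
        });
}

int main() {
    std::vector<Cell> cells(1 << 20, Cell{1.0, 1.0});
    reaction_step(cells, 1e-3);          // one splitting substep over all cells
}
```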

    WebWave: Globally Load Balanced Fully Distributed Caching of Hot Published Documents

    Document publication over a network as large as the Internet challenges us to harness available server and network resources to meet fast-growing demand. In this paper, we show that large-scale dynamic caching can be employed to globally minimize server idle time, and hence maximize the aggregate server throughput of the whole service. To be efficient, scalable, and robust, a successful caching mechanism must have three properties: (1) maximize the global throughput of the system, (2) find cache copies without recourse to a directory service or to a discovery protocol, and (3) be completely distributed in the sense of operating only on the basis of local information. We develop a precise definition, which we call tree load-balance (TLB), of what it means for a mechanism to satisfy these three goals. We present an algorithm that computes TLB off-line, and a distributed protocol that induces a load distribution that converges quickly to a TLB one. Both algorithms place cache copies of immutable documents on the routing tree that connects the cached document's home server to its clients, thus enabling requests to stumble on cache copies en route to the home server. Funding: Harvard University; The Saudi Cultural Mission to the U.S.A.
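    To make the "cache copies on the routing tree" idea concrete, here is a small sketch, and explicitly not the paper's TLB algorithm: a greedy bottom-up pass that places a cache wherever the request traffic forwarded toward the home server would exceed an assumed capacity threshold, so each cache absorbs its subtree's demand. All names and the threshold rule are illustrative assumptions.

```cpp
#include <vector>

struct Node {
    double demand = 0;              // requests/s originating at this node
    std::vector<int> children;      // routing-tree children (toward clients)
    bool cache = false;
};

// Returns the traffic this subtree forwards up toward the home server.
double place_caches(std::vector<Node>& t, int v, double capacity) {
    double up = t[v].demand;
    for (int c : t[v].children) up += place_caches(t, c, capacity);
    if (up > capacity) {            // node would overload the path above it:
        t[v].cache = true;          // serve the subtree locally from a cache
        up = 0;                     // nothing is forwarded past a cache copy
    }
    return up;
}

int main() {
    // Toy routing tree: node 0 is the home server.
    std::vector<Node> t(4);
    t[0].children = {1};
    t[1].children = {2, 3};
    t[2].demand = 40; t[3].demand = 30;
    place_caches(t, 0, /*capacity=*/50.0);
}
```

    A request climbing the tree is then served by the first cache it meets, which is the "stumble on cache copies en route" behavior the abstract describes.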

    An adaptive hierarchical domain decomposition method for parallel contact dynamics simulations of granular materials

    A fully parallel version of the contact dynamics (CD) method is presented in this paper. For large enough systems, 100% efficiency has been demonstrated for up to 256 processors using a hierarchical domain decomposition with dynamic load balancing. The iterative scheme to calculate the contact forces is kept domain-wise sequential, with data exchange after each iteration step, which ensures its stability. The number of additional iterations required for convergence due to the partially parallel updates at the domain boundaries becomes negligible with an increasing number of particles, which allows for an effective parallelization. Compared to the sequential implementation, we found no influence of the parallelization on the simulation results. Comment: 19 pages, 15 figures; published in Journal of Computational Physics (2011).
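    The iteration pattern described here (sequential sweeps inside each domain, with a data exchange after every global iteration) can be summarized as a schematic loop. The skeleton below is an assumption-laden sketch, not the paper's implementation: the CD kernels are stubbed out and an OpenMP pragma stands in for whatever parallel runtime the authors actually use.

```cpp
#include <vector>

struct Domain { std::vector<double> forces; };   // particles/contacts omitted

void sweep_contacts(Domain&) { /* one sequential contact-force sweep (omitted) */ }
void exchange_boundaries(std::vector<Domain>&) { /* share boundary forces (omitted) */ }
bool converged(const std::vector<Domain>&) { return true; /* residual check (omitted) */ }

void solve_contacts(std::vector<Domain>& domains, int max_iters) {
    for (int it = 0; it < max_iters; ++it) {
        #pragma omp parallel for             // domains iterate concurrently...
        for (int d = 0; d < (int)domains.size(); ++d)
            sweep_contacts(domains[d]);      // ...but sequentially inside each
        exchange_boundaries(domains);        // data exchange after each step
        if (converged(domains)) break;       // boundary-induced extra iterations
    }                                        // become negligible for large systems
}

int main() {
    std::vector<Domain> domains(4);
    solve_contacts(domains, 100);
}
```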

    Locally Optimal Load Balancing

    This work studies distributed algorithms for locally optimal load balancing: we are given a graph of maximum degree $\Delta$, and each node has up to $L$ units of load. The task is to distribute the load more evenly so that the loads of adjacent nodes differ by at most $1$. If the graph is a path ($\Delta = 2$), it is easy to solve the fractional version of the problem in $O(L)$ communication rounds, independently of the number of nodes. We show that this is tight, and that the discrete version of the problem can also be solved in $O(L)$ rounds on paths. For the general case ($\Delta > 2$), we show that fractional load balancing can be solved in $\operatorname{poly}(L, \Delta)$ rounds and discrete load balancing in $f(L, \Delta)$ rounds for some function $f$, independently of the number of nodes. Comment: 19 pages, 11 figures.
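    For intuition, a naive round-based baseline for the discrete problem on a path fits in a few lines. This is only an illustration of the problem statement, not the paper's algorithm: edges are processed by parity so the two transfers at a node never clash, and a unit moves across an edge whenever the imbalance is at least 2.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

void balance_path(std::vector<int>& load) {
    bool moved = true;
    while (moved) {
        moved = false;
        for (int parity = 0; parity < 2; ++parity)       // matching of edges
            for (std::size_t i = parity; i + 1 < load.size(); i += 2) {
                if (load[i] >= load[i + 1] + 2)      { --load[i]; ++load[i + 1]; moved = true; }
                else if (load[i + 1] >= load[i] + 2) { ++load[i]; --load[i + 1]; moved = true; }
            }
    }
}

int main() {
    std::vector<int> load = {5, 0, 0, 3, 0};
    balance_path(load);
    for (int x : load) std::printf("%d ", x);   // adjacent loads differ by <= 1
    std::printf("\n");
}
```

    Each transfer strictly decreases the sum of squared loads, so the loop terminates with every adjacent difference at most 1; its round complexity, however, may be far worse than the $O(L)$ bound the paper proves.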

    Join-Idle-Queue with Service Elasticity: Large-Scale Asymptotics of a Non-monotone System

    We consider the model of a token-based joint auto-scaling and load balancing strategy, proposed in a recent paper by Mukherjee, Dhara, Borst, and van Leeuwaarden (SIGMETRICS '17, arXiv:1703.08373), which offers an efficient scalable implementation and yet achieves asymptotically optimal steady-state delay performance and energy consumption as the number of servers $N \to \infty$. In the above work, the asymptotic results are obtained under the assumption that the queues have fixed-size finite buffers, and therefore the fundamental question of stability of the proposed scheme with infinite buffers was left open. In this paper, we address this fundamental stability question. The system stability under the usual subcritical load assumption is not automatic; moreover, stability may not even hold for all $N$. The key challenge stems from the fact that the process lacks monotonicity, which has been the powerful primary tool for establishing stability in load balancing models. We develop a novel method to prove that the subcritically loaded system is stable for large enough $N$, and establish convergence of the steady-state distributions to the optimal one as $N \to \infty$. The method goes beyond state-of-the-art techniques: it uses an induction-based idea and a "weak monotonicity" property of the model; this technique is of independent interest and may have broader applicability. Comment: 30 pages.
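    As background for the model, the basic Join-Idle-Queue dispatch rule (the plain JIQ scheme, without the service-elasticity extension analyzed in the paper) can be sketched as follows; the server and token handling here are schematic assumptions.

```cpp
#include <deque>
#include <random>
#include <vector>

struct Cluster {
    std::vector<int> queue_len;     // jobs queued at each server
    std::deque<int> idle_tokens;    // ids of servers that reported idle
    std::mt19937 rng{42};

    explicit Cluster(int n) : queue_len(n, 0) {
        for (int i = 0; i < n; ++i) idle_tokens.push_back(i);  // all start idle
    }
    void dispatch() {               // route one arriving job
        int s;
        if (!idle_tokens.empty()) { // a token guarantees an idle server
            s = idle_tokens.front(); idle_tokens.pop_front();
        } else {                    // no tokens: fall back to a random server
            std::uniform_int_distribution<int> pick(0, (int)queue_len.size() - 1);
            s = pick(rng);
        }
        ++queue_len[s];
    }
    void complete(int s) {          // one job finishes at server s
        if (queue_len[s] > 0 && --queue_len[s] == 0)
            idle_tokens.push_back(s);   // server re-registers as idle
    }
};

int main() {
    Cluster c(4);
    for (int i = 0; i < 6; ++i) c.dispatch();  // tokens first, then random
    c.complete(0);                             // server 0 may re-token
    c.dispatch();
}
```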

    Achieving High Speed CFD simulations: Optimization, Parallelization, and FPGA Acceleration for the unstructured DLR TAU Code

    Get PDF
    Today, large-scale parallel simulations are fundamental tools for handling complex problems. The number of processors in current computation platforms has recently increased, making it necessary to optimize application performance and to enhance the scalability of massively parallel systems. In addition, new heterogeneous architectures, combining conventional processors with specific hardware such as FPGAs to accelerate the most time-consuming functions, are considered a strong alternative for boosting performance. In this paper, the performance of the DLR TAU code is analyzed and optimized. The improvement of the code efficiency is addressed through three key activities: optimization, parallelization, and hardware acceleration. First, a profiling analysis of the most time-consuming processes of the Reynolds-averaged Navier-Stokes flow solver on a three-dimensional unstructured mesh is performed. Then, the scalability of the code is studied and new partitioning algorithms are tested to identify the most suitable ones for the selected applications. Finally, a feasibility study on the application of FPGAs and GPUs for the hardware acceleration of CFD simulations is presented.
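    As a concrete example of the kind of k-way graph partitioning such a scalability study compares, the sketch below calls METIS_PartGraphKway on a toy graph in CSR form; METIS is an assumed library choice for illustration, not necessarily one of the partitioners actually evaluated for TAU, and a real run would pass the dual graph of the unstructured mesh.

```cpp
#include <metis.h>
#include <cstdio>
#include <vector>

int main() {
    idx_t nvtxs = 4, ncon = 1, nparts = 2, objval = 0;
    std::vector<idx_t> xadj   = {0, 1, 3, 5, 6};     // CSR row pointers (4-vertex path)
    std::vector<idx_t> adjncy = {1, 0, 2, 1, 3, 2};  // CSR adjacency lists
    std::vector<idx_t> part(nvtxs);                  // output: partition of each vertex

    int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                                     /*vwgt=*/nullptr, /*vsize=*/nullptr,
                                     /*adjwgt=*/nullptr, &nparts,
                                     /*tpwgts=*/nullptr, /*ubvec=*/nullptr,
                                     /*options=*/nullptr, &objval, part.data());
    if (status == METIS_OK)
        for (idx_t v = 0; v < nvtxs; ++v)
            std::printf("vertex %d -> part %d\n", (int)v, (int)part[v]);
}
```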