
    Revisiting Actor Programming in C++

    The actor model of computation has gained significant popularity over the last decade. Its high level of abstraction makes it appealing for concurrent applications in parallel and distributed systems. However, designing a real-world actor framework that provides full scalability, strong reliability, and high resource efficiency requires many conceptual and algorithmic additions to the original model. In this paper, we report on designing and building CAF, the "C++ Actor Framework". CAF aims to provide a concurrent and distributed native environment that scales up to very large, high-performance applications, and equally well down to small constrained systems. We present the key specifications and design concepts (in particular a message-transparent architecture, type-safe message interfaces, and pattern matching facilities) that make native actors a viable approach for many robust, elastic, and highly distributed developments. We demonstrate the feasibility of CAF in three scenarios: first for elastic, upscaling environments, second for including heterogeneous hardware like GPGPUs, and third for distributed runtime systems. Extensive performance evaluations indicate ideal runtime behaviour for up to 64 cores at a very low memory footprint, or in the presence of GPUs. In these tests, CAF consistently outperforms the competing actor environments Erlang, Charm++, SalsaLite, Scala, ActorFoundry, and even OpenMPI.
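    To make the message-passing style concrete, the sketch below shows a minimal, framework-agnostic actor: a mailbox that queues typed messages and a loop that handles them one at a time. It is not CAF's API; the class, message types, and names are purely illustrative.

```cpp
// Minimal, framework-agnostic sketch of the actor model: an actor owns a
// mailbox and processes one message at a time. This is NOT the CAF API;
// it only illustrates the message-passing style described in the abstract.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <variant>

using message = std::variant<int, std::string>;   // a toy "typed" message set

class toy_actor {
public:
    void send(message m) {                         // enqueue from any thread
        { std::lock_guard<std::mutex> g(mtx_); mailbox_.push(std::move(m)); }
        cv_.notify_one();
    }
    void run(int n) {                              // handle n messages, then stop
        for (int i = 0; i < n; ++i) {
            std::unique_lock<std::mutex> lk(mtx_);
            cv_.wait(lk, [this] { return !mailbox_.empty(); });
            message m = std::move(mailbox_.front());
            mailbox_.pop();
            lk.unlock();
            // dispatch on the message type, loosely analogous to pattern matching
            std::visit([](auto&& v) { std::cout << "got: " << v << '\n'; }, m);
        }
    }
private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::queue<message> mailbox_;
};

int main() {
    toy_actor a;
    std::thread worker([&a] { a.run(2); });
    a.send(42);
    a.send(std::string("hello"));
    worker.join();
}
```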

    Generating and auto-tuning parallel stencil codes

    In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and its performance), and achieving high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation. The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation concisely, independently of hardware architecture-specific details. It thus increases programmer productivity by relieving the programmer of low-level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Gearing the language towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automatically adapting implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning, which essentially amounts to solving an integer programming problem whose objective function is the code's performance as a function of the parameter configuration, the system can also be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment, and a seismic application. These examples demonstrate the framework's flexibility and its ability to produce high performance code.
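    To make the notion of a stencil computation concrete, the sketch below shows a hand-written 5-point Jacobi update over a 2D grid, the kind of kernel a Patus stencil specification describes. The loop structure and constant are illustrative, not the framework's generated output; cache-block sizes, unrolling factors, and similar parameters of the generated code are what the auto-tuner would search over.

```cpp
// Hand-written 5-point Jacobi stencil sweep over a 2D grid (illustrative only).
// One sweep: u_new(x,y) = u(x,y) + alpha * (u(x-1,y) + u(x+1,y)
//                                           + u(x,y-1) + u(x,y+1) - 4*u(x,y))
#include <cstddef>
#include <vector>

void jacobi_step(const std::vector<double>& u, std::vector<double>& u_new,
                 std::size_t nx, std::size_t ny, double alpha) {
    for (std::size_t y = 1; y + 1 < ny; ++y)        // skip boundary points
        for (std::size_t x = 1; x + 1 < nx; ++x) {
            std::size_t i = y * nx + x;             // row-major index
            u_new[i] = u[i] + alpha * (u[i - 1] + u[i + 1] +
                                       u[i - nx] + u[i + nx] - 4.0 * u[i]);
        }
}
```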

    Parallel fluid dynamics for the film and animation industries

    The creation of automated fluid effects for film and media using computer simulations is popular, as artist time is reduced and greater realism can be achieved through the numerical simulation of physical equations. The fluid effects in today's films and animations involve large scenes with high detail requirements, so the time taken by such automated approaches is large. To address this, cluster environments making use of hundreds or more CPUs have been used; this overcomes the processing power and memory limitations of a single computer and allows very large scenes to be created. One of the newer methods for fluid simulation is the Lattice Boltzmann Method (LBM), a cellular automata type of algorithm which parallelizes easily. An important part of parallelization is load balancing: the distribution of computation amongst the available computing resources in the cluster. To date, parallelizations of the Lattice Boltzmann method have only made use of static load balancing. Instead, it is possible to use dynamic load balancing, which adjusts the computation distribution as the simulation progresses. Here, we investigate the use of the LBM in conjunction with a Volume of Fluid (VOF) surface representation in a parallel environment, with the aim of producing large scale scenes for the film and animation industries. The VOF method tracks mass exchange between cells of the LBM. In particular, we implement a new dynamic load balancing algorithm to improve the efficiency of the fluid simulation using this method. Fluid scenes from films and animations have two important requirements: the amount of detail and the spatial resolution of the fluid. These aspects of the VOF LBM are explored by considering the time for scene creation using single- and multi-CPU implementations of the method. The scalability of the method is studied by plotting the run time, speedup, and efficiency of scene creation against the number of CPUs. From such plots, an estimate is obtained of the feasibility of creating scenes of a given level of detail, and such estimates enable the recommendation of architectures for the creation of specific scenes. Using a parallel implementation of the VOF LBM we successfully create large scenes with great detail. In general, considering the significant amount of communication required by the parallel method, it is shown to scale well, favouring scenes with greater detail. The scalability studies show that the new dynamic load balancing algorithm improves the efficiency of the parallel implementation, but only when using lower numbers of CPUs; for larger numbers of CPUs, the dynamic algorithm reduces the efficiency. We hypothesise that the latter effect can be removed by making use of centralized load balancing decisions instead of the current decentralized approach. The use of a cluster comprising 200 CPUs is recommended for the production of large scenes of grid size 600³ in a reasonable time frame.
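    The cell-local structure that makes the LBM easy to parallelize can be seen in its collision step. The sketch below shows a generic, textbook D2Q9 BGK collision for a single lattice cell; it is not the thesis's implementation, and the VOF mass exchange and the load balancing logic are omitted.

```cpp
// Generic D2Q9 BGK collision step for one lattice cell. Collision is purely
// cell-local (streaming then only touches neighbors), which is why the LBM
// parallelizes well. Illustrative only; not the thesis's code.
#include <array>

constexpr int Q = 9;
constexpr std::array<int, Q> cx = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
constexpr std::array<int, Q> cy = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };
constexpr std::array<double, Q> w = { 4.0/9,  1.0/9,  1.0/9,  1.0/9,  1.0/9,
                                      1.0/36, 1.0/36, 1.0/36, 1.0/36 };

// f: the cell's 9 distribution values; tau: relaxation time.
void collide(std::array<double, Q>& f, double tau) {
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < Q; ++i) {          // macroscopic density and momentum
        rho += f[i];
        ux  += f[i] * cx[i];
        uy  += f[i] * cy[i];
    }
    ux /= rho; uy /= rho;
    double usq = ux * ux + uy * uy;
    for (int i = 0; i < Q; ++i) {          // relax towards local equilibrium
        double cu  = cx[i] * ux + cy[i] * uy;
        double feq = w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
        f[i] += (feq - f[i]) / tau;
    }
}
```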

    Reducing adaptive optics latency using many-core processors

    Atmospheric turbulence reduces the achievable resolution of ground based optical telescopes. Adaptive optics systems attempt to mitigate the impact of this turbulence and are required to update their corrections quickly and deterministically, i.e. in real time. The technological challenges faced by the future extremely large telescopes (ELTs) and their associated instruments are considerable; a simple extrapolation of current systems to the ELT scale is not sufficient. My thesis work consisted of identifying and examining new many-core technologies for accelerating the adaptive optics real-time control loop. I investigated the Mellanox TILE-Gx36 and the Intel Xeon Phi (5110P). The TILE-Gx36, with four 10 GbE ports and 36 processing cores, is a good candidate for fast computation of the wavefront sensor images. The Intel Xeon Phi, with 60 processing cores and high memory bandwidth, is particularly well suited to accelerating the wavefront reconstruction. Through extensive testing I have shown that the TILE-Gx can provide the performance required for the wavefront processing units of the ELT first-light instruments. The Intel Xeon Phi (Knights Corner), while providing good overall performance, does not have the required determinism. We believe that the next generation of Xeon Phi (Knights Landing) will provide the necessary determinism and increased performance. In this thesis, we show that by using currently available novel many-core processors it is possible to reach the performance required for ELT instruments.
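    Wavefront reconstruction in adaptive optics real-time controllers is commonly implemented as a large matrix-vector multiply of the measured sensor slopes with a control matrix; assuming that formulation (the thesis itself does not spell it out here), the OpenMP sketch below shows the kind of kernel that would be off-loaded to a many-core accelerator. Names and layout are illustrative.

```cpp
// Illustrative AO reconstruction kernel: commands = control_matrix * slopes.
// A minimal OpenMP sketch of the memory-bandwidth-bound matrix-vector multiply
// typically accelerated on many-core hardware; not the thesis's pipeline.
#include <cstddef>
#include <vector>

void reconstruct(const std::vector<float>& cm,      // control matrix, row-major
                 const std::vector<float>& slopes,  // wavefront sensor slopes
                 std::vector<float>& commands,      // deformable-mirror commands
                 int n_act, int n_slopes) {
    #pragma omp parallel for
    for (int i = 0; i < n_act; ++i) {               // one row per actuator
        float acc = 0.0f;
        for (int j = 0; j < n_slopes; ++j)
            acc += cm[(std::size_t)i * n_slopes + j] * slopes[j];
        commands[i] = acc;
    }
}
```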

    Algorithmic Advancements and Massive Parallelism for Large-Scale Datasets in Phylogenetic Bayesian Markov Chain Monte Carlo

    Datasets used for the inference of the "tree of life" grow at unprecedented rates, inducing a high computational burden for analytic methods. First, we introduce a scalable software package that allows us to conduct state-of-the-art Bayesian analyses on datasets of almost arbitrary size. Second, we derive a proposal mechanism for MCMC that is substantially more efficient than traditional branch length proposals. Third, we present an efficient algorithm for solving the rogue taxon problem.
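    A traditional branch length proposal of the kind the new mechanism is compared against is typically a multiplicative (scaling) move with a Metropolis-Hastings correction. The sketch below shows such a generic move; it is illustrative only, and neither the paper's improved proposal nor the phylogenetic likelihood computation is shown.

```cpp
// Generic multiplicative branch-length proposal with its Hastings correction,
// the kind of "traditional" MCMC move the paper improves upon (illustrative).
#include <cmath>
#include <random>

struct Proposal {
    double new_length;
    double log_hastings_ratio;   // log of the proposal-ratio correction term
};

// Multiply the current branch length by exp(lambda * (u - 0.5)), u ~ U(0,1).
// For this move the Hastings ratio equals the multiplier itself.
Proposal propose_branch_length(double current, double lambda, std::mt19937& rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double multiplier = std::exp(lambda * (unif(rng) - 0.5));
    return { current * multiplier, std::log(multiplier) };
}

// Metropolis-Hastings acceptance test in log space.
bool accept(double log_posterior_ratio, double log_hastings, std::mt19937& rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    return std::log(unif(rng)) < log_posterior_ratio + log_hastings;
}
```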

    Integrating Amdahl-like Laws and Divisible Load Theory
