A Fast Causal Profiler for Task Parallel Programs
This paper proposes TASKPROF, a profiler that identifies parallelism
bottlenecks in task parallel programs. It leverages the structure of a task
parallel execution to perform fine-grained attribution of work to various parts
of the program. TASKPROF's use of hardware performance counters to perform
fine-grained measurements minimizes perturbation. TASKPROF's profile execution
runs in parallel using multi-cores. TASKPROF's causal profile enables users to
estimate improvements in parallelism when a region of code is optimized even
when concrete optimizations are not yet known. We have used TASKPROF to isolate
parallelism bottlenecks in twenty-three applications that use the Intel
Threading Building Blocks library. Using TASKPROF, we have designed parallelization
techniques in five applications that increase parallelism by an order of magnitude.
Our user study indicates that developers are able to isolate
performance bottlenecks with ease using TASKPROF.
Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures
A new solver featuring time-space adaptation and error control has been
recently introduced to tackle the numerical solution of stiff
reaction-diffusion systems. Based on operator splitting, finite volume adaptive
multiresolution and high order time integrators with specific stability
properties for each operator, this strategy yields high computational
efficiency for large multidimensional computations on standard architectures
such as powerful workstations. However, the data structure of the original
implementation, based on trees of pointers, provides limited opportunities for
efficiency enhancements, while posing serious challenges in terms of parallel
programming and load balancing. The present contribution proposes a new
implementation of the whole set of numerical methods including Radau5 and
ROCK4, relying on a fully different data structure together with the use of a
specific library, TBB, for shared-memory, task-based parallelism with
work-stealing. The performance of our implementation is assessed in a series of
test-cases of increasing difficulty in two and three dimensions on multi-core
and many-core architectures, demonstrating high scalability.
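The splitting strategy summarized above can be pictured with a standard second-order Strang formula: the stiff reaction operator is advanced for a half step, diffusion for a full step, then reaction again for a half step, each sub-problem handled by its own dedicated integrator (in this line of work, Radau5 for the stiff reaction and ROCK4 for diffusion). The notation below is a generic illustration, not taken from the paper:

```latex
u^{n+1} \;=\; \mathcal{S}^{R}_{\Delta t/2} \circ \mathcal{S}^{D}_{\Delta t} \circ \mathcal{S}^{R}_{\Delta t/2}\,\bigl(u^{n}\bigr)
```

where $\mathcal{S}^{R}_{\tau}$ and $\mathcal{S}^{D}_{\tau}$ denote the reaction and diffusion sub-solvers over a time step $\tau$.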
Spatial and Temporal Cache Sharing Analysis in Tasks
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016.
Understanding the performance of large-scale multicore systems is crucial for achieving faster execution times
and optimizing workload efficiency, but it is becoming harder due to the increased complexity of hardware
architectures. Cache sharing is a key component for performance in modern architectures, and it has been
the focus of performance analysis tools and techniques in recent years. At the same time, new programming
models have been introduced to aid programmers in dealing with the complexity of large-scale systems,
simplifying the coding process and making applications more scalable regardless of resource sharing. Task-based
runtime systems, which have become popular recently, are one example. In this work we develop models
to tackle performance analysis of shared resources in the task-based context, and for that we study cache
sharing both temporally and spatially. In temporal cache sharing, the effect of data being reused over time by
the executed tasks is modeled to predict different scenarios, resulting in a tool called StatTask. In spatial
cache sharing, the effect of tasks fighting for the cache at a given point in time through their execution is
quantified and used to model their behavior on arbitrary cache sizes. Finally, we explain how these tools
set up a unique and solid platform for improving runtime-system schedulers, maximizing the performance of
large-scale task-based applications.
The work presented in this paper has been partially supported by the EU COST programme (European Cooperation in Science and Technology), Action IC1305, 'Network for Sustainable Ultrascale Computing (NESUS)', and by the Swedish Research Council, and was carried out within the Linnaeus centre of excellence UPMARC (Uppsala Programming for Multicore Architectures Research Center).
Parallel Processes in HPX: Designing an Infrastructure for Adaptive Resource Management
Advancements in cutting-edge technologies have enabled better energy efficiency and greater computational power in the latest High Performance Computing (HPC) systems. However, the complexity introduced by hybrid architectures and by emerging classes of applications leads to poor computational scalability under conventional execution models. Alternative means of computation that address these bottlenecks are therefore warranted. More precisely, dynamic adaptive resource management, from both the system's and the application's perspective, is essential for better computational scalability and efficiency. This research presents and expands the notion of Parallel Processes as a placeholder for procedure definitions targeted at one or more synchronous domains, metadata for computation and resource management, and infrastructure for dynamic policy deployment. In addition, the research presents guidelines for a resource-management framework in the HPX runtime system. It also lists design principles for the scalability of the Active Global Address Space (AGAS), a necessary feature for Parallel Processes. To verify the usefulness of Parallel Processes, a preliminary performance evaluation of different task-scheduling policies is carried out using two applications: Unbalanced Tree Search, a reference dynamic graph application implemented in HPX as part of this research, and MiniGhost, a reference stencil-based application using the bulk synchronous parallel model.
The results show that different scheduling policies provide better performance for different classes of applications; for the same application class, one policy fared better than the others in certain instances and worse in others. This supports the hypothesis that a dynamic adaptive resource-management infrastructure, capable of deploying different policies and task granularities, is needed for scalable distributed computing.
Performance analysis of a hardware accelerator of dependence management for task-based dataflow programming models
Along with the popularity of multicore and manycore processors, task-based dataflow programming models have attracted great attention for being able to extract high parallelism from applications without exposing the complexity to programmers. One of the pioneers is OpenMP Superscalar (OmpSs). By implementing dynamic task-dependence analysis, dataflow scheduling, and out-of-order execution at runtime, OmpSs achieves high performance using coarse- and
medium-granularity tasks. In theory, for the same application, the more parallel tasks that can be exposed, the higher the achievable speedup. Yet this is limited by task granularity: beyond a certain point, the runtime overhead outweighs the performance gain and slows down the application. To overcome this handicap, Picos
was proposed as a fast hardware accelerator for fine-grained task and dependence management, supporting task-based dataflow programming models such as OmpSs, and a simulator was developed to perform design-space exploration. This paper presents the very first functional hardware prototype inspired by Picos. An embedded system based on a Zynq 7000 All-Programmable SoC is developed to study its capabilities and possible bottlenecks. Initial scalability and hardware-consumption studies of different Picos designs are performed to find the one with the highest performance and lowest hardware cost. A thorough performance study is then carried out on both the prototype with the most balanced configuration and the OmpSs software-only alternative. Results show that our hardware support for the OmpSs runtime significantly outperforms the software-only implementation currently available in the runtime system for fine-grained tasks.
This work is supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through project TIN2015-65316-P, by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Research Council RoMoL Grant Agreement number 321253. We also thank the Xilinx University Program for its hardware and software donations.
Improving the efficiency of the Energy-Split tool to compute the energy of very large molecular systems
Integrated master's dissertation in Informatics Engineering.
The Energy-Split tool receives as input pieces of a very large molecular system and computes
all intra- and inter-molecular energies, separately calculating the energies of each fragment
and then the total energy of the molecule. It takes into account the connectivity information
among atoms in a molecule to compute (i) the energy of all terms involving covalently bonded
atoms, namely bonds, angles, dihedral angles, and improper angles, and (ii) the Coulomb
and Van der Waals energies, which are independent of the atoms' connections and
have to be computed for every atom in the system. The operations required to obtain the
total energy of a large molecule are computationally intensive, which requires an efficient
high-performance computing approach to obtain results in an acceptable time frame.
The original Energy-Split Tcl code was thoroughly analyzed and ported to a parallel and
more efficient C++ version. New data structures were defined with data-locality features to
take advantage of the advanced features present in current laptop or server systems. These
include the vector extensions to the scalar processors, an efficient on-chip memory hierarchy,
and the inherent parallelism in multicore devices. To improve on Energy-Split's sequential
variant, a parallel version was developed using auxiliary libraries. Both implementations
were tested on different multicore devices and optimized to take the most advantage of
high-performance computing features.
Significant results were obtained by applying professional performance-engineering approaches, namely
(i) identifying the data values that can be represented as Boolean variables (such as
variables used in auxiliary data structures of the traversal algorithm that computes the
Euclidean distance between atoms), leading to significant performance improvements due to
the reduced memory bottleneck (over 10 times faster), and (ii) using an adequate compressed
format (CSR) to represent and operate on sparse matrices (namely matrices of Euclidean
distances between atom pairs, since all distances beyond the user-defined cut-off distance
are considered zero, and these are the majority of values).
After the first code optimizations, the performance of the sequential version was improved
by around 100 times when compared to the original version on a dual-socket server. The
parallel version improved up to 24 times, depending on the molecules tested, on the same
server. The overall picture shows that the Energy-Split code is highly scalable, obtaining
better results with larger molecule files, even though the atoms' arrangement also influences
the algorithm's performance.
The Energy-Split tool receives as input a description of fragments of a very large molecular
system, in order to compute the intramolecular energy values. Separately, it also computes
the energy of each fragment and the total energy of a molecule. It takes into account the
connectivity information between the atoms of a molecule to compute (i) the energy involving
all covalently bonded atoms, namely bonds, angles, dihedral angles, and improper angles,
and (ii) the Coulomb and Van der Waals energies, which are independent of the atoms'
connections and have to be computed for every atom in the system. For each atom, Energy-Split
computes the interaction energy with all the other atoms in the system, considering the
partition of the molecule into fragments, performed in an open-source program, Visual
Molecular Dynamics.
The operations for computing these energies can be computationally very intensive, making
it necessary to use an approach that takes advantage of high-performance computing in order
to develop more efficient code. The provided Tcl code was thoroughly analyzed and converted
into a parallel and more efficient version in C++.
At the same time, new data structures were defined that exploit good data locality to take
advantage of the vector extensions present in any computer and also to exploit the parallelism
inherent to multicore machines. A parallel version of the previously converted code was then
implemented with the help of auxiliary libraries. Both versions were tested in different
multicore environments and optimized so as to take maximum advantage of high-performance
computing and obtain the best results.
Performance-engineering techniques were applied, such as (i) identifying data that could be
represented in lighter formats such as Boolean variables (for example, variables used in data
structures auxiliary to the computation of the Euclidean distance between atoms, used in the
molecule-traversal algorithm), which led to significant performance improvements (around 10
times) due to reduced memory overhead, and (ii) using an adequate format for representing
sparse matrices (namely the one holding the same Euclidean distances as in the first point,
since all distances beyond the user-defined cut-off distance are considered 0, these being
the majority of the values).
After the optimizations to the sequential version, it showed an improvement of around 100
times compared to the original version. The parallel version improved by up to 24 times,
depending on the molecules in question. Overall, the code is scalable, since it shows better
results as the size of the tested molecules increases, although it was concluded that the
arrangement of the atoms also influences the algorithm's performance.
This work was supported by FCT (Fundação para a Ciência e Tecnologia) within project RDB-TS:
Uma base de dados de reações químicas baseadas em informação de estados de transição derivados
de cálculos quânticos (Ref. BI2-2019_NORTE-01-0145-FEDER-031689_UMINHO), co-funded by the
North Portugal Regional Operational Programme, through the European Regional Development Fund.
Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform
Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang programming language has a well-established and influential model. While the Erlang model conceptually provides reliable scalability, it has some inherent scalability limits and these force developers to depart from the model at scale. This article establishes the scalability limits of Erlang systems and reports the work of the EU RELEASE project to improve the scalability and understandability of the Erlang reliable distributed actor model.
We systematically study the scalability limits of Erlang and then address the issues at the virtual machine, language, and tool levels. More specifically: (1) We have evolved the Erlang virtual machine so that it can work effectively in large-scale single-host multicore and NUMA architectures. We have made important changes and architectural improvements to the widely used Erlang/OTP release. (2) We have designed and implemented Scalable Distributed (SD) Erlang libraries to address language-level scalability issues and provided and validated a set of semantics for the new language constructs. (3) To make large Erlang systems easier to deploy, monitor, and debug, we have developed and made open source releases of five complementary tools, some specific to SD Erlang.
Throughout the article we use two case studies to investigate the capabilities of our new technologies and tools: a distributed hash table based Orbit calculation and Ant Colony Optimisation (ACO). Chaos Monkey experiments show that two versions of ACO survive random process failure and hence that SD Erlang preserves the Erlang reliability model. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6,144 cores). Even for programs with no global recovery data to maintain, SD Erlang partitions the network to reduce network traffic and hence improves performance of the Orbit and ACO benchmarks above 80 hosts. ACO measurements show that maintaining global recovery data dramatically limits scalability; however, scalability is recovered by partitioning the recovery data. We exceed the established scalability limits of distributed Erlang, and do not reach the limits of SD Erlang for these benchmarks at this scale.
Peer-to-Peer Networks and Computation: Current Trends and Future Perspectives
This research paper examines the state of the art in the area of P2P networks and computation. It attempts to identify the challenges that confront the community of P2P researchers and developers, which need to be addressed before the potential of P2P-based systems can be effectively realized beyond content distribution and file-sharing applications to build real-world, intelligent, and commercial software systems. Future perspectives and some thoughts on the evolution of P2P-based systems are also provided.