    Could be improved the efficiency of SPMD applications on heterogeneous environments?

    The goal of this work is to execute SPMD applications efficiently on heterogeneous environments. The applications used to test our work communicate through a message-passing interface and were developed to be executed on a single-core cluster. We present a methodology to execute these SPMD applications efficiently on heterogeneous architectures. SPMD applications were selected because they present a high level of synchronism and communication; both elements make it challenging to reach our objective, which is defined as improving the execution time while maintaining the efficiency level above a threshold defined by the programmer, taking into consideration the communication heterogeneities present in a multicore cluster. This objective is achieved using the mapping and scheduling strategies included in our methodology. Finally, the results obtained show an improvement of around 40% in efficiency in the best case for the SPMD applications tested when our methodology is applied. Presented at the IX Workshop Procesamiento Distribuido y Paralelo (WPDP), Red de Universidades con Carreras en Informática (RedUNCI)

    Data Science and Ebola

    Data Science---Today, everybody and everything produces data. People produce large amounts of data in social networks and in commercial transactions. Medical, corporate, and government databases continue to grow. Sensors continue to get cheaper and are increasingly connected, creating an Internet of Things, and generating even more data. In every discipline, large, diverse, and rich data sets are emerging, from astrophysics, to the life sciences, to the behavioral sciences, to finance and commerce, to the humanities and to the arts. In every discipline people want to organize, analyze, optimize and understand their data to answer questions and to deepen insights. The science that is transforming this ocean of data into a sea of knowledge is called data science. This lecture will discuss how data science has changed the way in which one of the most visible challenges to public health is handled, the 2014 Ebola outbreak in West Africa. Comment: Inaugural lecture Leiden Universit

    Aplicaciones Single Program Multiple Data (SPMD) en ambientes distribuidos

    A challenge when executing applications on a cluster is to improve performance while using resources efficiently, and this challenge is greater in a distributed environment. With this challenge in mind, this work proposes a set of rules for performing the computation on each node, based on an analysis of the computation and communication of the applications. It analyzes a cell-mapping scheme and a method for planning the execution order, taking into consideration execution by priority, where border cells have a higher priority than internal cells. The experiments show the overlap of internal computation with the communication of the border cells, with results in which the speedup increases and efficiency levels remain above 85%. Finally, gains in execution time are obtained, leading to the conclusion that an overlapping scheme can be designed that allows SPMD applications to execute efficiently on a cluster.
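
    The border-first priority rule described in the abstract above can be illustrated with a small sketch. This is only a toy model under stated assumptions: a rectangular tile of cells per core, a uniform cost per cell, and a single communication cost per iteration. The function names and the timing model are hypothetical and are not taken from the paper.

```python
def classify_cells(rows, cols):
    """Split a rows x cols tile into border cells and internal cells."""
    border, internal = [], []
    for r in range(rows):
        for c in range(cols):
            on_edge = r in (0, rows - 1) or c in (0, cols - 1)
            (border if on_edge else internal).append((r, c))
    return border, internal


def execution_order(rows, cols):
    """Border cells run first because neighbours wait for their values;
    internal cells follow and can be computed while border data is in flight."""
    border, internal = classify_cells(rows, cols)
    return border + internal


def step_time(n_border, n_internal, t_cell, t_comm, overlap=True):
    """Idealised time of one iteration on one core (assumed model).
    With overlap, the border exchange hides behind internal computation."""
    t_border = n_border * t_cell
    t_internal = n_internal * t_cell
    if overlap:
        return t_border + max(t_internal, t_comm)
    return t_border + t_internal + t_comm


def efficiency(n_border, n_internal, t_cell, t_comm, overlap=True):
    """Fraction of the iteration spent computing; 1.0 means no exposed communication."""
    compute = (n_border + n_internal) * t_cell
    return compute / step_time(n_border, n_internal, t_cell, t_comm, overlap)
```

    In this model, efficiency stays at 1.0 as long as the internal computation takes at least as long as the border exchange; once the tile is too small for that, part of the communication is exposed and efficiency drops.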

    Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects

    This paper studies application performance on systems with strongly non-uniform remote memory access. In current generation NUMAs the speed difference between the slowest and fastest link in an interconnect---the "NUMA gap"---is typically less than an order of magnitude, and many conventional parallel programs achieve good performance. We study how different NUMA gaps influence application performance, up to and including typical wide-area latencies and bandwidths. We find that for gaps larger than those of current generation NUMAs, performance suffers considerably (for applications that were designed for a uniform access interconnect). For many applications, however, performance can be greatly improved with comparatively simple changes: traffic over slow links can be reduced by making communication patterns hierarchical---like the interconnect. We find that in four out of our six applications the size of the gap can be increased by an order of magnitude or more without severel..
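
    The idea of making communication patterns hierarchical, like the interconnect, can be illustrated by counting how many messages of a broadcast cross the slow links. The sketch below is an assumption-laden illustration, not one of the paper's applications: it assumes equally sized clusters, a binomial-tree broadcast, and that any message between two different clusters uses a slow link. All function names are hypothetical.

```python
def binomial_bcast_edges(ranks):
    """Return (sender, receiver) pairs of a binomial-tree broadcast rooted at ranks[0]."""
    edges = []
    have = 1  # number of ranks that already hold the data
    while have < len(ranks):
        # every rank that has the data sends to one rank that does not
        for i in range(min(have, len(ranks) - have)):
            edges.append((ranks[i], ranks[i + have]))
        have *= 2
    return edges


def flat_bcast(n_procs):
    """Topology-unaware broadcast over all ranks."""
    return binomial_bcast_edges(list(range(n_procs)))


def hierarchical_bcast(n_procs, cluster_size):
    """Broadcast among one leader per cluster first, then inside each cluster."""
    clusters = [list(range(start, min(start + cluster_size, n_procs)))
                for start in range(0, n_procs, cluster_size)]
    leaders = [c[0] for c in clusters]
    edges = binomial_bcast_edges(leaders)      # slow links: leaders only
    for c in clusters:
        edges += binomial_bcast_edges(c)       # fast links: within a cluster
    return edges


def slow_messages(edges, cluster_of):
    """Count messages whose sender and receiver sit in different clusters."""
    return sum(1 for s, r in edges if cluster_of(s) != cluster_of(r))
```

    For 16 processes in 4 clusters of 4, both variants send 15 messages in total, but the flat tree sends 12 of them across clusters while the hierarchical variant sends only 3.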

    Optimization of MPI Collective Communication Operations

    High-performance computing (HPC) systems keep growing in scale and heterogeneity to satisfy the increasing need for computation, and this brings new challenges to the design of Message Passing Interface (MPI) libraries, especially with regard to collective operations. The implementations of state-of-the-art MPI collective operations rely heavily on synchronizations, and these implementations magnify noise across the participating processes, resulting in significant performance slowdowns. Therefore, I create a new collective communication framework in Open MPI, using an event-driven design to relax synchronizations and maintain the minimal data dependencies of MPI collective operations. The recent growth in hardware heterogeneity results in increasingly complex hardware hierarchies and larger communication performance differences. Hence, in this dissertation, I present two approaches to perform hierarchical collective operations; both can exploit the different bandwidths of hardware in heterogeneous systems and maximize concurrent communications. Finally, to provide a fast and accurate autotuning mechanism for my framework, I design a new autotuning approach by combining two existing methods. This new approach significantly reduces the search space to save autotuning time and is still able to provide accurate estimations. I evaluate my work with microbenchmarks and applications at different scales. Microbenchmark results show my work speeds up MPI_Bcast and MPI_Allreduce by up to 7.34X and 4.86X, respectively, on 4096 processes. In terms of applications, I achieve a 24.3% improvement for Horovod and a 143% improvement for ASP on 1536 processes as compared to the current Open MPI

    Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters

    Multicore or many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchies not only exist in the inter-node layer, i.e., hierarchical networks, but also exist in the internals of multicore compute nodes, e.g., Non Uniform Memory Accesses (NUMA), network-style interconnect, and memory and shared cache hierarchies. Message Passing Interface (MPI), the most widely adopted standard in the HPC communities, suffers from decreased performance and portability due to the increased hardware complexity at multiple levels. We identified three critical issues specific to collective communication. First, there is a gap between logical collective topologies and underlying hardware topologies. Second, current MPI communications lack efficient shared memory message delivery approaches. Last, on distributed memory machines, like multicore clusters, a single approach cannot encompass the extreme variations not only in the bandwidth and latency capabilities, but also in features such as the aptitude to operate multiple concurrent copies simultaneously. To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework to integrate the knowledge of hardware distance into collective algorithms in order to dynamically reshape the communication patterns to suit the hardware capabilities. Based on process distance information, we used graph partitioning techniques to organize the MPI processes in a multi-level hierarchy that maps onto the hardware characteristics. Meanwhile, we took advantage of the kernel-assisted one-sided single-copy approach (KNEM) as the default shared memory delivery method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader/non-root processes to evenly distribute copy workloads among available cores. Finally, on distributed memory machines, we developed a technique to compose multi-layered collective algorithms together to express a multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlaps between inter- and intra-node communication. Experimental results have confirmed that, by leveraging several technologies together, such as kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, not only do MPI collectives reach the potential maximum performance on a wide variation of platforms, but they also deliver a level of performance immune to modifications of the underlying process-core binding
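
    The composition of intra-node and inter-node stages mentioned in the abstract above can be sketched for a sum allreduce. This is a sequential simulation under assumptions: the lowest rank on each node acts as leader, and the three stages run one after another, so the overlap between levels that the work reports is not modelled. The function and parameter names are hypothetical and are not the Open MPI or KNEM API.

```python
def hierarchical_allreduce(values, node_of):
    """Simulate a two-level sum allreduce.

    values:  dict mapping rank -> local number
    node_of: dict mapping rank -> node identifier
    Returns a dict mapping every rank to the global sum.
    """
    # Group ranks by node; the lowest rank of each node is its leader (assumption).
    members = {}
    for rank in sorted(values):
        members.setdefault(node_of[rank], []).append(rank)

    # Stage 1 (intra-node): reduce local values onto each node leader.
    partial = {ranks[0]: sum(values[r] for r in ranks)
               for ranks in members.values()}

    # Stage 2 (inter-node): allreduce among the leaders only.
    total = sum(partial.values())

    # Stage 3 (intra-node): each leader broadcasts the result on its node.
    return {r: total for ranks in members.values() for r in ranks}
```

    Only one rank per node takes part in the inter-node stage, which is the part of the pattern that the slower network links see.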

