    An Optimized Model for MapReduce Based on Hadoop

    To address the waste of computing resources caused by the sequential control flow of the MapReduce execution mechanism on the Hadoop platform, the Fork/Join framework is introduced into the model to make full use of the CPU resources of each node. From the perspective of fine-grained parallel data processing, this paper combines MapReduce with Fork/Join, a parallel multi-threaded framework, and proposes a MapReduce+Fork/Join programming model: a distributed, parallel architecture on the Hadoop platform that combines coarse-grained and fine-grained parallelism to support two levels of parallelism on both shared-memory and distributed-memory machines. The model is tested on a Hadoop cluster of four nodes. The experimental results show that the model improves the performance and efficiency of the whole system and that it suits both data-intensive and compute-intensive tasks. It is an effective optimization of the MapReduce model for big data processing.
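
    The two levels of parallelism described above can be illustrated with a minimal Python sketch. It is not the paper's implementation, which targets Hadoop and Java's Fork/Join framework. Here the coarse level is a list of input splits, each handled by one map task, and the fine level is a thread pool that each map task uses to fork its split into slices and join the partial results. The names `count_words`, `map_task`, `reduce_task`, `run_job` and the `threshold` parameter are assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    """Leaf work: count the words in a small slice of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def map_task(split, workers=4, threshold=2):
    """One coarse-grained map task that forks fine-grained subtasks.

    The split is cut into slices of at most `threshold` lines (fork),
    the slices are counted on a thread pool, and the partial counts
    are merged (join).
    """
    slices = [split[i:i + threshold] for i in range(0, len(split), threshold)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, slices):
            total.update(partial)
    return total


def reduce_task(partials):
    """Merge the per-split counts produced by the map tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(lines, num_splits=2):
    """Split the input (coarse level) and run one map task per split.

    The map tasks run one after another here; on a cluster they would
    be distributed across nodes.
    """
    size = max(1, -(-len(lines) // num_splits))  # ceiling division
    splits = [lines[i:i + size] for i in range(0, len(lines), size)]
    return reduce_task(map_task(split) for split in splits)
```

    In CPython, threads do not speed up pure-Python CPU-bound work because of the global interpreter lock, so this sketch shows only the structure of the model, not its performance benefit.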

    Scalability of Incompressible Flow Computations on Multi-GPU Clusters Using Dual-Level and Tri-Level Parallelism

    High-performance computing using graphics processing units (GPUs) is gaining popularity in the scientific computing field, with many large compute clusters being augmented with multiple GPUs in each node. We investigate hybrid tri-level (MPI-OpenMP-CUDA) parallel implementations to explore the efficiency and scalability of incompressible flow computations on GPU clusters of up to 128 GPUs. This work details some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism using OpenMP for intra-node and MPI for inter-node communication. Comparisons between the tri-level MPI-OpenMP-CUDA and dual-level MPI-CUDA implementations are shown using large computational fluid dynamics (CFD) simulations. Our results demonstrate that a tri-level parallel implementation does not provide a significant performance advantage over the dual-level implementation; however, further research is needed to confirm this conclusion for clusters with a high density of GPUs per node or for software that can utilize OpenMP’s fine-grain parallelism more effectively.

    Real-Time Detection of Foreground in Video Surveillance Cameras Using CUDA

    The rapid growth of video processing techniques has led to remarkable contributions in several applications such as compression, filtering, segmentation and object tracking. A fundamental task of video surveillance cameras is to detect and capture major moving objects (the foreground). Processing video frame by frame is complex and difficult for real-time applications. GPUs have led to significant advancements in the field of image and video processing, especially in real-time applications. In this work, we make use of the parallel computing capacity of GPUs to speed up the runtime of a foreground detection algorithm. The focus of the thesis is to accelerate the runtime of the algorithm by parallelizing its time-consuming portions. The final goal is then to analyze and identify the parallelization technique or techniques that give the best performance.
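
    The abstract does not name the foreground detection algorithm it accelerates. As an illustration only, the sketch below assumes a simple running-average background model with per-pixel thresholding. It is written in plain Python rather than CUDA, and the names `update_background`, `foreground_mask`, `alpha` and `threshold` are hypothetical. The point is that every output pixel depends only on the matching input pixels, which is why this kind of computation maps naturally onto one GPU thread per pixel.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the running-average background model.

    `background` and `frame` are 2-D lists of grayscale intensities
    with the same shape; `alpha` is the learning rate.
    """
    return [
        [(1 - alpha) * b + alpha * f for b, f in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=30):
    """Mark a pixel as foreground (1) when it differs from the
    background model by more than `threshold`, else background (0)."""
    return [
        [1 if abs(f - b) > threshold else 0 for b, f in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]
```

    A typical loop would compute the mask for each incoming frame and then update the background model with that frame.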

    APCM: An Auto-Parallelism Computational Model: Increasing the performance of MPI applications in multi-core environments

    Given the availability of computer clusters based on multi-core processors, the hybrid programming model has become an important ally of high-performance computing users in improving the performance of their parallel applications. However, creating hybrid applications is a complex task because it requires developers to be familiar with two distinct parallel programming models. Against this background, this article introduces APCM, an auto-parallelism computational model. APCM's goal is to create hybrid parallel applications, i.e., applications that combine OpenMP (shared-memory programming) with the Message Passing Interface (MPI), from existing MPI applications. This goal is achieved in a simple, automated manner that is transparent to the user while increasing application performance. In the article's conclusion, we present consistent results that attest to the efficacy of the proposed model.

    A spatiotemporal data aggregation technique for performance analysis of large-scale execution traces

    A task-based message passing framework

    Over the past decade, it has become clear that parallel and distributed programming will occupy an increasingly large proportion of a developer's work. While numerous programming languages and libraries have been built to facilitate working with concurrency, development work remains difficult and error-prone. In this thesis, we propose a task-based message passing framework. The proposed framework combines the actor model with message passing functionality to offer a useful and efficient way to implement parallel and distributed algorithms. The framework is intended to be part of a novel C compiler that will offer built-in task and message features. Perhaps most importantly, the new framework aims to be intuitive and efficient. We have used the framework to implement a parallel sample sort and a client-server application. Our results demonstrate both strong performance for a parallel sorting algorithm and scalability that extends to thousands of concurrent messages. In addition, the client-server application highlights the intuitive nature of the development cycle for the new model. We conclude that the proposed message passing framework would be well suited to concurrent development environments and offers a simple and efficient way to build applications for the new wave of multi-core hardware platforms.
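
    The combination of tasks and message passing can be pictured with a small actor-style sketch. The thesis targets a C compiler with built-in task and message features; the Python below is only an analogy, and the `Task` class, its `send` and `stop` methods, and the `square_pipeline` helper are assumptions made for this example. Each task owns a private mailbox and handles one message at a time on its own thread, so tasks share no state and communicate only by sending messages.

```python
import queue
import threading

_STOP = object()  # sentinel message that asks a task to finish


class Task:
    """A task with a private mailbox, served by its own thread."""

    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        """Deliver a message to this task's mailbox without blocking."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the task to finish once earlier messages are handled."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        # Handle messages in arrival order until the sentinel shows up.
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                break
            self._handler(message)


def square_pipeline(numbers):
    """Two cooperating tasks: one squares numbers, one collects them."""
    results = []
    collector = Task(results.append)
    squarer = Task(lambda n: collector.send(n * n))
    for n in numbers:
        squarer.send(n)
    squarer.stop()    # returns after every square has been forwarded
    collector.stop()  # returns after every result has been stored
    return results
```

    Because each mailbox is a FIFO queue served by a single thread, the results arrive in the order in which the numbers were sent.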

    Proceedings of the 10th Atelier en Évaluation de Performances

    The Atelier en Évaluation de Performances (Performance Evaluation Workshop) is a meeting intended to let young researchers (doctoral students and recent PhD graduates) in the field of performance modeling and evaluation present their work and meet one another. The discipline is devoted to the study and optimization of stochastic and/or timed dynamic systems that arise in computer science, telecommunications, manufacturing and robotics, among other areas. Informal presentation of work, even work in progress, is encouraged in order to strengthen interactions among young researchers and to prepare submissions of new scientific projects. Survey talks on current research topics, given by established researchers in the field, reinforce the training component of the workshop.

    Early Experiments with the OpenMP/MPI Hybrid Programming Model

    The paper describes some very early experiments on new architectures that support the hybrid programming model. Our results are promising in that OpenMP threads interact with MPI as desired, allowing OpenMP-agnostic tools to be used. We explore three environments: a “typical” Linux cluster, a new large-scale machine from SiCortex, and the new IBM BG/P, which have quite different compilers and runtime systems for both OpenMP and MPI. We look at a few simple, diagnostic programs, and one “application-like” test program. We demonstrate the use of a tool that can examine the detailed sequence of events in a hybrid program and illustrate that a hybrid computation might not always proceed as expected.

    Efficient implementation of resource-constrained cyber-physical systems using multi-core parallelism

    The quest for higher performance in applications and systems has become more challenging in recent years. Especially in the cyber-physical and mobile domains, performance requirements have increased significantly. Applications previously found in the high-performance domain are emerging in the resource-constrained domain. Modern heterogeneous high-performance MPSoCs provide a solid foundation to satisfy this demand. Such systems combine general-purpose processors with specialized accelerators ranging from GPUs to machine learning chips. On the other side of the performance spectrum, the demand for small, energy-efficient systems driven by modern IoT applications has increased vastly. Developing efficient software for such resource-constrained multi-core systems is an error-prone, time-consuming and challenging task. With PA4RES, this thesis provides a holistic semiautomatic approach to parallelize and implement applications for such platforms efficiently. Our solution helps the developer find good trade-offs to meet the requirements imposed by modern applications and systems. With PICO, we propose a comprehensive approach to express parallelism in sequential applications. PICO detects data dependencies and implements the required synchronization automatically. Using a genetic algorithm, PICO optimizes the data synchronization. The evolutionary algorithm considers channel capacity, memory mapping, channel merging and the flexibility offered by the channel implementation with respect to execution time, energy consumption and memory footprint. PICO's communication optimization phase generated a speedup of almost 2 or an energy improvement of 30% for certain benchmarks. The PAMONO sensor approach enables fast detection of biological viruses using optical methods. With sophisticated virus detection software, real-time virus detection running on stationary computers was achieved. 
Within this thesis, we derived a soft real-time capable virus detection running on a high-performance embedded system of the kind commonly found in today's smartphones. This was accomplished with a smart DSE algorithm that optimizes for execution time, energy consumption and detection quality. Compared to a baseline implementation, our solution achieved a speedup of 4.1 and 87% energy savings while satisfying the soft real-time requirements. Accepting a degradation of the detection quality that is still usable in a medical context led to a speedup of 11.1. This work provides the fundamentals for a truly mobile real-time virus detection solution. The growing demand for processing power can no longer be satisfied by well-known approaches such as higher clock frequencies. These so-called performance walls pose a serious challenge to the growing performance demand. Approximate computing is a promising approach to overcome or at least shift the performance walls by accepting a degradation in output quality to gain improvements in other objectives. Especially for the safe integration of approximation into existing applications or during the development of new approximation techniques, a method to assess the impact on output quality is essential. With QCAPES, we provide a multi-metric assessment framework to analyze the impact of approximation. Furthermore, QCAPES provides useful insights into the impact of approximation on execution time and energy consumption. With ApproxPICO, we propose an extension to PICO that considers approximate computing during the parallelization of sequential applications.

    A dependency-aware parallel programming model

    Designing parallel codes is hard. One of the most important roadblocks to parallel programming is the presence of data dependencies. These restrict parallelism and, in general, working around them requires complex analysis and leads to convoluted solutions that decrease the quality of the code. This thesis proposes a solution to parallel programming that incorporates data dependencies into the model. The programming model can use that information to dynamically find parallelism that would otherwise be hard to find. This approach improves both programmability and parallelism, and thus performance. While this problem has already been solved in OpenMP 4 at the time of this publication, this research began before the problem was even being considered for OpenMP 3. In fact, some of the contributions of this thesis have influenced the approach taken in OpenMP 4. However, the contributions go beyond that and cover aspects that have not yet been considered in OpenMP 4. The approach we propose is based on function-level dependencies across disjoint blocks of contiguous memory. While finding dependencies under those constraints is simple, it is much harder to do so over strided and possibly partially overlapping sets of data. This thesis also proposes a solution to this problem. By doing so, we broaden the applicability of the original solution and hence of the programming model. OpenMP 4 does not currently cover this aspect. Finally, we present a solution that takes advantage of the performance characteristics of Non-Uniform Memory Access (NUMA) architectures. Our proposal is at the programming model level and does not require changes in the code. It automatically distributes the data and does not rely on data migration or replication. Instead, it is based exclusively on scheduling the computations. 
While this process is automatic, it can be tuned through minor changes in the code that do not require any change in the programming model. Throughout the thesis, we demonstrate the effectiveness of the proposal through benchmarks that are either hard to program using other paradigms or that have different solutions. In most cases, our solutions perform on par with or better than existing solutions, including the implementations available in well-known high-performance parallel libraries.
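
    The idea of letting the programming model derive parallelism from declared data dependencies can be sketched as follows. This is a simplified illustration, not the runtime described in the thesis: each task lists the data blocks it reads and writes, blocks are identified by plain names rather than memory regions, and ready tasks are grouped into waves instead of being scheduled dynamically. The function names `build_dependencies` and `schedule_waves` and the task tuples are assumptions made for this example.

```python
def build_dependencies(tasks):
    """Derive task dependencies from declared reads and writes.

    `tasks` is a list of (name, reads, writes) tuples in program order.
    Returns a dict that maps each task name to the set of earlier
    tasks it must wait for.
    """
    deps = {}
    last_writer = {}          # block -> last task that wrote it
    readers_since_write = {}  # block -> tasks that read it since then
    for name, reads, writes in tasks:
        wait = set()
        for block in reads:
            if block in last_writer:
                wait.add(last_writer[block])  # read-after-write
        for block in writes:
            if block in last_writer:
                wait.add(last_writer[block])  # write-after-write
            wait.update(readers_since_write.get(block, ()))  # write-after-read
        deps[name] = wait
        for block in reads:
            readers_since_write.setdefault(block, set()).add(name)
        for block in writes:
            last_writer[block] = name
            readers_since_write[block] = set()
    return deps


def schedule_waves(deps):
    """Group tasks into waves; tasks in one wave can run in parallel."""
    done, waves = set(), []
    pending = dict(deps)
    while pending:
        ready = sorted(n for n, wait in pending.items() if wait <= done)
        waves.append(ready)
        done.update(ready)
        for n in ready:
            del pending[n]
    return waves


EXAMPLE_TASKS = [
    ("init_a", [], ["a"]),
    ("init_b", [], ["b"]),
    ("sum", ["a", "b"], ["c"]),
    ("scale_a", ["a"], ["a"]),
    ("report", ["c"], []),
]
```

    For `EXAMPLE_TASKS`, the two initializations are independent of each other, `sum` waits for both, and `scale_a` and `report` can then run side by side because neither touches data the other uses.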