
    Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

    As many-core accelerators integrate ever more processing units, it becomes increasingly difficult for a parallel application to make effective use of all available resources. An effective way to improve hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks, a strategy known as heterogeneous streaming. Achieving effective heterogeneous streaming requires carefully partitioning hardware among tasks and matching the granularity of task parallelism to the resource partition. However, finding the right resource partitioning and task granularity is extremely challenging, because the space of possible solutions is large and the optimal solution varies across programs and datasets. This article presents an automatic approach to quickly derive a good hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model, which requires expert insight into low-level hardware details, we employ machine learning techniques to learn it automatically. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
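
    A minimal sketch of the runtime search the abstract describes, assuming a scikit-learn-style regressor trained offline; the function names, feature layout and candidate values below are illustrative assumptions, not the paper's actual interface.

        # Sketch: use an offline-trained performance model as a utility
        # function to pick a resource partition and task granularity.
        # `trained_model`, `program_features` and the candidate lists are
        # hypothetical placeholders.
        from itertools import product

        def best_configuration(trained_model, program_features,
                               partitions, granularities):
            """Return the (partition, granularity) pair with the lowest
            predicted runtime."""
            best, best_time = None, float("inf")
            for part, gran in product(partitions, granularities):
                features = list(program_features) + [part, gran]
                predicted = trained_model.predict([features])[0]
                if predicted < best_time:
                    best, best_time = (part, gran), predicted
            return best

        # e.g. best_configuration(model, feats, [0.25, 0.5, 0.75], [64, 128, 256])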

    Paralelización de un esferizador geométrico (Parallelization of a Geometric Spherizer)

    This article describes the parallelization of a geometric spherizer used in a hierarchical collision detector. The parallelization is based on parallel computing using the PVM (Parallel Virtual Machine) tool. The strategy used is discussed together with its implementation. Finally, experimental results are presented and discussed.
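
    The master/worker structure of such a PVM parallelization can be sketched as follows. PVM itself is a C/Fortran library with no widely used Python bindings, so this hypothetical sketch swaps in Python's multiprocessing module to show the same farm-out/gather pattern; spherize_chunk is a placeholder for the real geometric computation.

        # Master/worker decomposition in the spirit of the PVM spherizer.
        from multiprocessing import Pool

        def spherize_chunk(triangles):
            """Placeholder: fit a bounding sphere to each mesh element."""
            return [((0.0, 0.0, 0.0), 1.0) for _ in triangles]  # (center, radius)

        def parallel_spherize(mesh_chunks, workers=4):
            # The master distributes mesh chunks and gathers the spheres,
            # mirroring the pvm_send/pvm_recv round-trips of the original.
            with Pool(processes=workers) as pool:
                results = pool.map(spherize_chunk, mesh_chunks)
            return [s for chunk in results for s in chunk]

        if __name__ == "__main__":
            dummy_chunks = [[i, i + 1] for i in range(0, 16, 2)]  # 8 chunks
            print(len(parallel_spherize(dummy_chunks)))           # 16 spheres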

    On the scalability of CFD tool for supersonic jet flow configurations

    New regulations imposing noise-emission limits on the aviation industry are pushing researchers and engineers to invest effort in studying aeroacoustic phenomena. Following this trend, an in-house computational fluid dynamics tool is built to reproduce high-fidelity results of supersonic jet flows for aeroacoustic analogy applications. The solver uses a large eddy simulation formulation discretized with a finite-difference approach and explicit time integration. Numerical simulations of supersonic jet flows are very expensive and demand efficient high-performance computing. Therefore, non-blocking Message Passing Interface (MPI) protocols and parallel input/output features are implemented into the code in order to perform simulations that demand up to one billion grid points. The present work addresses the evaluation of code improvements along with the computational performance of the solver running on a computer with a maximum theoretical peak of 2.727 PFlops. Different mesh configurations, whose size varies from a few hundred thousand to approximately one billion grid points, are evaluated. Calculations are performed using different workloads in order to assess the strong and weak scalability of the parallel computational tool. Moreover, validation results for a realistic flow condition are also presented.
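
    The overlap of communication and computation that non-blocking MPI enables can be illustrated with a one-dimensional halo exchange. The sketch below uses mpi4py as a stand-in for the solver's actual implementation, which the abstract does not show; the grid size and decomposition are invented for the example.

        # Sketch: non-blocking ghost-cell exchange overlapped with interior work.
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        left, right = (rank - 1) % size, (rank + 1) % size   # periodic layout

        field = np.full(1000, float(rank))          # this rank's slab of the grid
        send_l, send_r = field[:1].copy(), field[-1:].copy()
        recv_l, recv_r = np.empty(1), np.empty(1)

        # Post sends/receives for the ghost layers, then compute on the
        # interior while the messages are in flight.
        reqs = [comm.Isend(send_l, dest=left,  tag=0),
                comm.Isend(send_r, dest=right, tag=1),
                comm.Irecv(recv_r, source=right, tag=0),
                comm.Irecv(recv_l, source=left,  tag=1)]
        interior_sum = field[1:-1].sum()
        MPI.Request.Waitall(reqs)                   # ghost cells now valid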

    Proteus: Network-aware Web Browsing on Heterogeneous Mobile Systems

    We present Proteus, a novel network-aware approach for optimizing web browsing on heterogeneous multi-core mobile systems. It employs machine learning techniques to predict which of the heterogeneous cores to use to render a given webpage and at what operating frequencies to run the processors. It achieves this by first learning, offline, a set of predictive models for a range of typical networking environments. A learnt model is then chosen at runtime to predict the optimal processor configuration, based on the web content, the network status and the optimization goal. We evaluate Proteus by implementing it in the open-source Chromium browser and testing it on two representative ARM big.LITTLE mobile multi-core platforms. We apply Proteus to the top 1,000 most popular websites across seven typical network environments. Proteus achieves over 80% of the best available performance. It obtains, on average, over 17% (up to 63%), 31% (up to 88%) and 30% (up to 91%) improvements in load time, energy consumption and the energy-delay product respectively, when compared to two state-of-the-art approaches.
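
    The runtime model selection Proteus performs can be pictured with a short sketch. Everything below is a hypothetical rendering of the idea, not Proteus's real code: the network classes, feature vectors and per-environment models are assumptions.

        # Sketch: pick the offline-trained model matching the observed
        # network, then let it predict a core/frequency configuration.

        def classify_network(bandwidth_mbps, rtt_ms):
            """Bucket the measured network into an offline-profiled class."""
            if bandwidth_mbps > 50 and rtt_ms < 30:
                return "wifi_fast"
            return "cellular" if bandwidth_mbps < 5 else "wifi_slow"

        def choose_configuration(models, page_features, net_status, goal):
            env = classify_network(*net_status)
            model = models[(env, goal)]                 # one learnt model per case
            cores, freq = model.predict([page_features])[0]
            return cores, freq                          # big/LITTLE cores + DVFS level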

    Using Machine Learning to Optimize Web Interactions on Heterogeneous Mobile Systems

    The web has become a ubiquitous application development platform for mobile systems. Yet, web access on mobile devices remains an energy-hungry activity. Prior work in the field mainly focuses on the initial page-loading stage but fails to exploit opportunities for energy-efficiency optimization while the user is interacting with a loaded page. This paper presents a novel approach for performing energy optimization for interactive mobile web browsing. At the heart of our approach is a set of machine learning models, which estimate at runtime the frames per second for a given user-interaction input when the computation-intensive web render engine runs on a specific processor core at a given clock speed. We use the learned predictive models as a utility function to quickly search for the optimal processor setting that carefully trades response time for reduced energy consumption. We integrate our techniques into the open-source Chromium browser and apply them to two representative mobile user events: scrolling and pinching (i.e., zooming in and out). We evaluate the developed system on the landing pages of the 100 most popular websites and two big.LITTLE heterogeneous mobile platforms. Our extensive experiments show that the proposed approach reduces system-wide energy consumption by over 36% on average and up to 70%. This translates to an over 17% improvement in energy efficiency over a state-of-the-art event-based web browser scheduler, but with significantly fewer violations of the quality of service.
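
    The search over processor settings can be sketched as follows, assuming the learned FPS model behaves like a scikit-learn regressor and that a per-setting power table is available; both are stand-ins invented for the illustration.

        # Sketch: pick the lowest-power (core, frequency) setting whose
        # predicted frames-per-second still meets the responsiveness target.

        def pick_setting(fps_model, power_table, event_features, min_fps=30.0):
            best, best_power = None, float("inf")
            for (core, freq), power in power_table.items():
                fps = fps_model.predict([list(event_features) + [core, freq]])[0]
                if fps >= min_fps and power < best_power:   # trade speed for energy
                    best, best_power = (core, freq), power
            return best    # None means no setting meets the FPS target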

    CrossSense: Towards Cross-Site and Large-Scale WiFi Sensing

    We present CrossSense, a novel system for scaling up WiFi sensing to new environments and larger problems. To reduce the cost of collecting training data for sensing models, CrossSense employs machine learning to train, offline, a roaming model that generates, from one set of measurements, synthetic training samples for each target environment. To scale up to larger problem sizes, CrossSense adopts a mixture-of-experts approach in which multiple specialized sensing models, or experts, capture the mapping from diverse WiFi inputs to the desired outputs. The experts are trained offline, and at runtime the appropriate expert for a given input is automatically chosen. We evaluate CrossSense by applying it to two representative WiFi sensing applications, gait identification and gesture recognition, in controlled single-link environments. We show that CrossSense boosts the accuracy of state-of-the-art WiFi sensing techniques from 20% to over 80% for gait identification and over 90% for gesture recognition, delivering consistently good performance, particularly when the problem size is significantly greater than what current approaches can effectively handle.
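
    The mixture-of-experts dispatch can be sketched in a few lines, assuming scikit-learn-style classifiers; the gating model, expert table and feature vector are illustrative assumptions rather than CrossSense's actual components.

        # Sketch: a gating model routes each WiFi input to the specialised
        # expert trained for that region of the input space.

        def sense(gate, experts, csi_features):
            expert_id = gate.predict([csi_features])[0]            # choose an expert
            return experts[expert_id].predict([csi_features])[0]  # gait/gesture label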

    Distributed Simulation of High-Level Algebraic Petri Nets

    In the field of Petri nets, simulation is an essential tool to validate and evaluate models. Conventional simulation techniques, designed for use on sequential computers, are too slow when the system to simulate is large or complex. The aim of this work is to find techniques that accelerate simulations by exploiting the parallelism available in current commercial multicomputers, and to use these techniques to study a class of Petri nets called high-level algebraic nets. These nets exploit the rich theory of algebraic specifications for high-level Petri nets: Petri nets gain a great deal of modelling power by representing dynamically changing items as structured tokens, whereas algebraic specifications have turned out to be an adequate and flexible instrument for handling structured items. In this work we focus on ECATNets (Extended Concurrent Algebraic Term Nets), whose most distinctive feature is a semantics defined in terms of rewriting logic. Nevertheless, ECATNets have two drawbacks: they obscure the notion of time and they exploit poorly the parallelism inherent in the models. Three distributed simulation techniques have been considered: asynchronous conservative, asynchronous optimistic and synchronous. These algorithms have been implemented in a multicomputer environment: a network of workstations. The influence that factors such as the characteristics of the simulated models, the organisation of the simulators and the characteristics of the target multicomputer have on the performance of the simulations has been measured and characterised. It is concluded that synchronous distributed simulation techniques are not suitable for the kind of models considered, although they may provide good performance in other settings. Conservative and optimistic distributed simulation techniques perform well, especially when the model to simulate is complex or large, precisely the worst case for traditional sequential simulators. In this way, studies previously considered unrealisable due to their exceedingly high computational cost can be performed in reasonable time. Additionally, the spectrum of uses for multicomputers can be broadened beyond numeric applications.
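
    The conservative variant evaluated here follows the classic Chandy-Misra rule: a logical process may only fire events up to the minimum time promised on its input channels, and null messages carrying a lookahead keep neighbours from deadlocking. The toy Python sketch below illustrates that rule only; it is not the thesis's implementation.

        # Toy conservative logical process: fire events only up to the
        # minimum timestamp promised by every input channel.
        import heapq

        class LogicalProcess:
            def __init__(self, input_channels, lookahead):
                self.pending = []                       # min-heap of (time, token)
                self.channel_clock = {c: 0.0 for c in input_channels}
                self.lookahead = lookahead

            def receive(self, channel, time, token=None):
                self.channel_clock[channel] = time      # token=None is a null message
                if token is not None:
                    heapq.heappush(self.pending, (time, token))

            def safe_time(self):
                # Nothing earlier than this can still arrive on any channel.
                return min(self.channel_clock.values())

            def fire_safe_events(self):
                fired = []
                while self.pending and self.pending[0][0] <= self.safe_time():
                    fired.append(heapq.heappop(self.pending))   # fire transition
                return fired

            def null_message_time(self, now):
                # Promise to successors: no output before now + lookahead.
                return now + self.lookahead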