153 research outputs found

    Achieving High Performance and High Productivity in Next Generational Parallel Programming Languages

    Processor design has turned toward parallel and heterogeneous cores to achieve performance and energy efficiency. Developers find high-level languages attractive because they use abstraction to offer productivity and portability over hardware complexities. To achieve performance, some modern implementations of high-level languages use work-stealing scheduling for load balancing of dynamically created tasks. Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism, and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. These overheads arise as a necessary side effect of the implementation and hamper parallel performance. In addition to runtime-imposed overheads, there is a substantial cognitive load associated with ensuring that parallel code is data-race free. This dissertation explores the overheads associated with achieving high-performance parallelism in modern high-level languages. My thesis is that, by exploiting existing underlying mechanisms of managed runtimes and by extending existing language design, high-level languages will be able to deliver productivity and parallel performance at the levels necessary for widespread uptake.
The key contributions of my thesis are: 1) a detailed analysis of the key sources of overhead associated with a work-stealing runtime, namely sequential and dynamic overheads; 2) novel techniques to reduce these overheads that use rich features of managed runtimes such as the yieldpoint mechanism, on-stack replacement, dynamic code-patching, exception handling support, and return barriers; 3) a comprehensive analysis of the resulting benefits, which demonstrates that work-stealing overheads can be significantly reduced, leading to substantial performance improvements; and 4) a small set of language extensions that achieve both high performance and high productivity with minimal programmer effort. A managed runtime forms the backbone of any modern implementation of a high-level language. Managed runtimes enjoy the benefits of a long history of research, and their implementations are highly optimized. My thesis demonstrates that combining these highly optimized features with the expressiveness of high-level languages gives further hope for achieving high performance and high productivity on modern parallel hardware.
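The fork/join idiom the abstract describes — the programmer marks potential parallelism and a work-stealing runtime balances the load — is available in the standard JDK as ForkJoinPool. The sketch below is a minimal illustration of that idiom, not the dissertation's runtime; the naive Fibonacci decomposition is chosen precisely because it creates many fine-grained tasks, the situation where per-task (sequential and dynamic) overheads dominate.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal work-stealing sketch using the JDK's ForkJoinPool.
// fork() publishes a task that an idle worker may steal; join()
// waits for (or directly executes) the stolen work.
public class FibTask extends RecursiveTask<Long> {
    private final int n;

    FibTask(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n < 2) return (long) n;        // sequential base case
        FibTask left = new FibTask(n - 1);
        left.fork();                        // potential parallelism: may be stolen
        FibTask right = new FibTask(n - 2);
        return right.compute() + left.join();
    }

    public static long fib(int n) {
        return ForkJoinPool.commonPool().invoke(new FibTask(n));
    }

    public static void main(String[] args) {
        System.out.println(fib(20));  // 6765
    }
}
```

Every `fork()` here pays task-creation and deque-management costs even when no steal ever happens; those are exactly the sequential overheads the thesis analyzes.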

    Understanding Spark System Performance for Image Processing in a Heterogeneous Commodity Cluster

    In recent years, Apache Spark has seen widespread adoption in industry and institutions due to its cache mechanism for faster Big Data analytics. However, the speed advantage Spark provides, especially in a heterogeneous cluster environment, is not obtainable out-of-the-box; it requires the right combination of configuration parameters from the myriad of parameters provided by Spark developers. Recognizing this challenge, this thesis undertakes a study to provide insight into Spark performance, particularly the impact of choice parameter settings. These are parameters that are critical to fast job completion and effective utilization of resources. To this end, the study focuses on two specific example applications, namely flowerCounter and imageClustering, for processing still-image datasets of Canola plants collected during the Summer of 2016 from selected plot fields using time-lapse cameras in a heterogeneous Spark cluster environment. These applications were of initial interest to the Plant Phenotyping and Imaging Research Centre (P2IRC) at the University of Saskatchewan. The P2IRC is responsible for developing systems that will aid fast analysis of large-scale seed breeding to ensure global food security. The flowerCounter application estimates the count of flowers from the images, while the imageClustering application clusters images based on physical plant attributes. Two clusters are used for the experiments: a 12-node and a 3-node cluster (each including a master node), with the Hadoop Distributed File System (HDFS) as the storage medium for the image datasets. Experiments with the two case study applications demonstrate that increasing the number of tasks does not always speed up job processing due to increased communication overheads. Findings from other experiments show that numerous tasks with one core per executor and small allocated memory limit parallelism within an executor and result in inefficient use of cluster resources.
Executors with large CPU and memory allocations, on the other hand, do not speed up analytics due to processing delays and thread concurrency overheads. Further experimental results indicate that application processing time depends on input data storage in conjunction with locality levels, and executor run time is largely dominated by disk I/O time, especially the read time cost. With respect to horizontal node scaling, Spark scales with increasing homogeneous computing nodes, but the speed-up degrades with heterogeneous nodes. Finally, this study shows that the effectiveness of speculative task execution in mitigating the impact of slow nodes varies across the applications.
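The knobs the study discusses — tasks per executor, cores and memory per executor, locality wait, and speculative execution — all map onto standard Spark configuration parameters. The fragment below is an illustrative sketch only: the parameter names are standard Spark settings, but the values, the `flowerCounter.jar` artifact name, and the HDFS path are hypothetical, not the thesis's tuned configuration.

```shell
# Illustrative submission for an 11-worker (12-node) cluster; values are
# examples, not the study's measured optimum.
spark-submit \
  --master yarn \
  --num-executors 11 \
  --executor-cores 4 \
  --executor-memory 4g \
  --conf spark.default.parallelism=88 \
  --conf spark.locality.wait=3s \
  --conf spark.speculation=true \
  flowerCounter.jar hdfs:///path/to/images/
```

Giving each executor several cores avoids the one-core-per-executor pattern the study found wasteful, while `spark.locality.wait` and `spark.speculation` correspond to the locality and straggler effects measured in the experiments.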

    Adaptive sampling-based profiling techniques for optimizing the distributed JVM runtime

    Extending the standard Java virtual machine (JVM) for cluster-awareness is a transparent approach to scaling out multithreaded Java applications. While this clustering solution has been gaining momentum in recent years, efficient runtime support for fine-grained object sharing over the distributed JVM remains a challenge. The system efficiency is strongly connected to the global object sharing profile that determines the overall communication cost. Once the sharing or correlation between threads is known, access locality can be optimized by collocating highly correlated threads via dynamic thread migrations. Although correlation tracking techniques have been studied in some page-based software DSM systems, they would entail prohibitively high overheads and low accuracy when ported to fine-grained object-based systems. In this paper, we propose a lightweight sampling-based profiling technique for tracking inter-thread sharing. To preserve locality across migrations, we also propose a stack sampling mechanism for profiling the set of objects which are tightly coupled with a migrant thread. Sampling rates in both techniques can vary adaptively to strike a balance between preciseness and overhead. Such adaptive techniques are particularly useful for applications whose sharing patterns could change dynamically. The profiling results can be exploited for effective thread-to-core placement and dynamic load balancing in a distributed object sharing environment. We present the design and preliminary performance results of our distributed JVM with the profiling implemented. Experimental results show that the profiling is able to obtain over 95% accurate global sharing profiles at a cost of only a few percent of execution time increase for fine- to medium-grained applications. © 2010 IEEE. The 24th IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2010), Atlanta, GA, 19-23 April 2010. In Proceedings of the 24th IPDPS, 2010, p. 1-1.
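The core idea — record only a random fraction of accesses and adjust that fraction to trade accuracy against overhead — can be sketched in a few lines. This is an illustrative sketch, not the paper's distributed-JVM implementation; the class name, the (threadless) per-object counter, and the halving/doubling adaptation policy are all our own simplifications.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of adaptive sampling-based access profiling.
// Each sampled access bumps a per-object counter; the sampling rate
// shrinks once enough samples are gathered and grows when data is scarce.
public class SharingProfiler {
    private final Map<Long, Long> accessCounts = new ConcurrentHashMap<>();
    private volatile double samplingRate;

    public SharingProfiler(double initialRate) {
        this.samplingRate = initialRate;
    }

    // Called on each (instrumented) object access; cheap when not sampled.
    public void onAccess(long objectId) {
        if (ThreadLocalRandom.current().nextDouble() < samplingRate) {
            accessCounts.merge(objectId, 1L, Long::sum);
        }
    }

    // Adapt the rate: halve it once the sample budget is met, else double it.
    public void adapt(long samplesTarget) {
        long total = accessCounts.values().stream()
                                 .mapToLong(Long::longValue).sum();
        samplingRate = (total >= samplesTarget)
                ? Math.max(0.001, samplingRate / 2)
                : Math.min(1.0, samplingRate * 2);
    }

    public double rate() { return samplingRate; }
}
```

In the paper's setting the recorded key would be a (thread, object) pair so that thread correlation, not just object popularity, falls out of the counts.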

    Profiling tools for Java

    Master's dissertation (Mestrado Integrado) in Informatics Engineering. Java is currently one of the most popular programming languages. This popularity is, in part, due to the portability it offers, which comes from the fact that Java source code is compiled into bytecode that can be executed by a compatible Java Virtual Machine (JVM) on any system. The JVM can then directly interpret the Java application or compile it into machine code. However, this execution on top of a virtual machine creates some obstacles for developers looking to profile their applications. Profilers are valuable tools for developers who seek to understand an application's behaviour by collecting metrics about its execution.
Obtaining accurate profiles of an application is important, but they can also be challenging to obtain and to analyse, particularly for parallel applications. This dissertation suggests an optimisation workflow to employ in the pursuit of reducing scalability bottlenecks in parallel Java applications. The workflow is designed to simplify the discovery of the performance problems affecting a given parallel application and suggest possible actions to investigate them further. The suggested workflow relies on possible speedups to quantify the impact of different performance problems. The idea of possible speedups is to estimate the speedup an application could achieve if a specific performance problem were to completely disappear. This estimation is performed using metrics collected during the profiling of the parallel application and of its sequential version. The set of performance problems considered includes workload imbalance, parallelism overhead due to an increase in the number of instructions, synchronisation overhead, memory bottlenecks, and the fraction of sequential workloads. These were deemed to be the most common causes of scalability issues in parallel applications. To further investigate the effect of these problems on a parallel application, some visualisations of the application's behaviour are suggested depending on which problem limits scalability the most. The suggested visualisations mostly consist of different flame graphs of the application's profile. Two tools were also developed to help in the application of this optimisation workflow for parallel Java applications. One of these tools relies on async-profiler to collect profiles of a given Java application. The other tool uses the profiles collected by the first tool to estimate possible speedups and produce all the visualisations mentioned in the suggested workflow. Finally, the workflow was validated on multiple case studies.
The main case study was the iterative optimisation of a K-means algorithm, starting from a sequential implementation and resulting in a gradual increase of the application's scalability. Additional case studies were also presented to illustrate possibilities not covered in the main case study.
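The possible-speedup estimate described above can be made concrete under one simplifying assumption: removing a performance problem removes its measured time cost entirely. The sketch below is ours, not the dissertation's tool; the method names and the Amdahl sanity bound are illustrative.

```java
// Hypothetical sketch of "possible speedup" estimation from profile metrics.
public class PossibleSpeedup {
    // tParallel: measured wall time of the parallel run.
    // tProblem:  wall time attributed to one problem (e.g. imbalance).
    // Assumes the problem's cost vanishes entirely when fixed.
    public static double ifRemoved(double tParallel, double tProblem) {
        return tParallel / (tParallel - tProblem);
    }

    // Amdahl's law from the sequential fraction f on p cores: a ceiling
    // that no per-problem estimate should exceed.
    public static double amdahl(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }
}
```

For example, if imbalance accounts for 2 s of a 10 s parallel run, fixing it could yield at most a 1.25x speedup — a quick way to rank which of the five problem classes deserves attention first.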

    Program Transformations for Light-Weight CPU Accounting and Control in the Java Virtual Machine - A Systematic Review

    This article constitutes a thorough presentation of an original scheme for portable CPU accounting and control in Java, which is based on program transformation techniques at the bytecode level and can be used with every standard Java Virtual Machine. In our approach, applications, middleware, and even the standard Java runtime libraries (i.e., the Java Development Kit) are modified in a fully portable way in order to expose details regarding the execution of threads. These transformations, however, incur a certain overhead at runtime. Further contributions of this article are the systematic review of the origin of such overheads and the description of a new static path prediction scheme targeted at reducing them.
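For contrast with the article's portable bytecode-rewriting scheme, the JDK already exposes a JVM-dependent form of per-thread CPU accounting through the standard `ThreadMXBean` management interface. The sketch below uses that real API only to illustrate what "CPU accounting" measures; it is not the article's mechanism, which avoids such JVM-specific facilities precisely for portability.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// JVM-dependent per-thread CPU accounting via the JDK management API.
// Returns nanoseconds of CPU time the calling thread has consumed,
// or -1 if the JVM does not support the measurement.
public class CpuAccounting {
    public static long currentThreadCpuNanos() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        return bean.isCurrentThreadCpuTimeSupported()
                ? bean.getCurrentThreadCpuTime()
                : -1L;
    }

    public static void main(String[] args) {
        long before = currentThreadCpuNanos();
        long sink = 0;
        for (int i = 0; i < 5_000_000; i++) sink += i;  // burn some CPU
        long after = currentThreadCpuNanos();
        System.out.println("CPU ns consumed: " + (after - before)
                + " (sink=" + sink + ")");
    }
}
```

Because this facility is optional and its cost varies across JVMs, a bytecode-level scheme like the article's remains the portable route to uniform accounting.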