302 research outputs found

    A Co-Processor Approach for Efficient Java Execution in Embedded Systems

    This thesis deals with a hardware accelerated Java virtual machine, named REALJava. The REALJava virtual machine is targeted for resource constrained embedded systems. The goal is to attain increased computational performance with reduced power consumption. While these objectives are often seen as trade-offs, in this context both of them can be attained simultaneously by using dedicated hardware. The target level of the computational performance of the REALJava virtual machine is initially set to be as fast as the currently available full custom ASIC Java processors. As a secondary goal all of the components of the virtual machine are designed so that the resulting system can be scaled to support multiple co-processor cores. The virtual machine is designed using the hardware/software co-design paradigm. The partitioning between the two domains is flexible, allowing customizations to the resulting system, for instance the floating point support can be omitted from the hardware in order to decrease the size of the co-processor core. The communication between the hardware and the software domains is encapsulated into modules. This allows the REALJava virtual machine to be easily integrated into any system, simply by redesigning the communication modules. Besides the virtual machine and the related co-processor architecture, several performance enhancing techniques are presented. These include techniques related to instruction folding, stack handling, method invocation, constant loading and control in time domain. The REALJava virtual machine is prototyped using three different FPGA platforms. The original pipeline structure is modified to suit the FPGA environment. The performance of the resulting Java virtual machine is evaluated against existing Java solutions in the embedded systems field. The results show that the goals are attained, both in terms of computational performance and power consumption. 
The computational performance in particular is evaluated thoroughly, and the results show that REALJava is more than twice as fast as the fastest full custom ASIC Java processor. In addition to standard Java virtual machine benchmarks, several new Java applications are designed to both verify the results and broaden the spectrum of the tests. Transferred from Doria.

    Portable and Accurate Collection of Calling-Context-Sensitive Bytecode Metrics for the Java Virtual Machine

    Calling-context profiles and dynamic metrics at the bytecode level are important for profiling, workload characterization, program comprehension, and reverse engineering. Prevailing tools for collecting calling-context profiles or dynamic bytecode metrics often provide only incomplete information or suffer from limited compatibility with standard JVMs. However, completeness and accuracy of the profiles are essential for tasks such as workload characterization, and compatibility with standard JVMs is important to ensure that complex workloads can be executed. In this paper, we present the design and implementation of JP2, a new tool that profiles both the inter- and intra-procedural control flow of workloads on standard JVMs. JP2 produces calling-context profiles preserving callsite information, as well as execution statistics at the level of individual basic blocks of code. JP2 is complemented with scripts that compute various dynamic bytecode metrics from the profiles. As a case study and tutorial on the use of JP2, we use it for cross-profiling for an embedded Java processor.

    Efficient execution of Java programs on GPU

    Master's dissertation in Informatics Engineering. With the overwhelming increase in demand for computational power from fields such as Big Data, deep machine learning and image processing, Graphics Processing Units (GPUs) have come to be seen as a valuable tool for computing the main workload involved. Nonetheless, these solutions have limited support for object-oriented languages, often requiring manual memory handling, which is an obstacle to bringing together the large community of object-oriented programmers and the high-performance computing field. In this master's thesis, different memory optimizations and their impacts were studied in a GPU Java context using Aparapi. The optimizations studied aim to resolve identifiable bottlenecks in commonly used kernels, exploiting the GPU's full capabilities by studying the GPU hardware and the techniques currently available. The results were set against commonly used C/OpenCL benchmarks and their respective optimizations, proving that high-level languages can be a solution to the demand for high-performance software.

    Profileringstechnieken voor prestatieanalyse en optimalisatie van Javaprogramma's (Profiling techniques for performance analysis and optimisation of Java programs)

    Three pitfalls in Java performance evaluation

    The Java programming language has seen remarkable growth over the last decade. This is partially due to the infrastructure required to run Java applications on general purpose microprocessors: a Java virtual machine (VM). The VM ensures that Java applications are portable across different hardware platforms, because it shelters the applications from the underlying system. Hence the motto "write once, run (almost) anywhere". Java applications are compiled to an intermediate form, called bytecode, and consist of a number of so-called class files. The virtual machine takes care of class loading, interpreting or compiling the bytecode to the native code of the underlying hardware platform, thread scheduling, garbage collection, etc. As such, during the execution of a Java application, the VM regularly intervenes to take care of housekeeping tasks and to optimise the application as it is executing. Furthermore, the specific implementation details of most virtual machines insert non-deterministic behaviour, not into the semantic part of the execution, but rather into the lower level execution. For example, to bring a Java application up to competitive speed with classical compiled programs written in languages such as C, the virtual machine needs to optimise Java bytecode. To limit the execution overhead, most virtual machines use a time sampling mechanism to determine the hot methods in the application. This introduces non-determinism, as over several runs, the methods are not always optimised at the same moment, nor is the set of optimised methods always the same. Other factors that introduce non-determinism are the thread scheduling, garbage collection, etc. It is readily seen that performance analysis of Java applications is not as simple as it seems at first, and warrants closer inspection. In this dissertation we are mainly interested in the behaviour of Java applications and their performance. 
In the course of this work, we uncovered three major pitfalls that were not taken into account by researchers when analysing Java performance prior to this work. We will briefly summarise the main achievements presented in this dissertation. The first pitfall we present involves the interaction between the virtual machine, the application and the input to the application. The performance for short running applications is shown to be mainly determined by the virtual machine. For longer running applications, this influence decreases, but remains tangible. We use statistical analysis, such as principal components analysis and cluster analysis (K-means and hierarchical clustering) to demonstrate and clarify the pitfall. By means of a large number of performance characteristics measured using hardware performance counters, five virtual machines and fourteen benchmarks with both a small and a large input size, we demonstrate that short running workloads are primarily clustered by virtual machines. Even for long running applications from the SPECjvm98 benchmark suite, the virtual machine still exerts a large influence on the observed behaviour at the microarchitectural level. This work has shown the need for both larger and longer running benchmarks than were available prior to it – this was (partially) met by the introduction of the DaCapo benchmark suite – as well as a careful consideration when setting up an experiment to avoid measuring the virtual machine, rather than the benchmark. Prior to this work, people were quite often using simulation with short running applications (to save time) for exploring Java performance. The second pitfall we uncover involves the analysis of performance numbers. During a survey of 50 papers published at premier conferences, such as OOPSLA, PLDI, CGO, ISMM and VEE, over the past seven years, we found that a variety of approaches are used, both for experimental design – for example, the input size, virtual machines, heap sizes, etc. 
– and, even more importantly, for data analysis – for example, using a best out of 3 performance number. New techniques are pitted against existing work using these prevalent approaches, and conclusions regarding their successfulness in beating prior state-of-the-art are based upon them. Given the fact that the execution of Java applications usually involves non-determinism in the virtual machine – for example, when determining which methods to optimise – it should come as no surprise that the lack of statistical rigour in these prevalent approaches leads to misleading or even incorrect conclusions. By this we mean that the conclusions are either not representative of what actually happens, or even contradict reality, as modelled in a statistical manner. To circumvent this pitfall, we propose a rigorous statistical approach that uses confidence intervals to both report and compare performance numbers. We also claim that sufficient experiments should be conducted to get a reliable performance measure. The non-determinism caused by the timer-based optimisation component in a virtual machine can be eliminated using so-called replay compilation. This technique will record a compilation plan during a first execution or profiling run of the application. During a second execution, the application is iterated twice: once to compile and optimise all methods found in the compilation plan, and a second time to perform the actual measurement. It turns out however that current practice of using either a single plan – corresponding to the best performing profiling run – or a combined plan choosing the methods that were optimised in, say, more than half the profiling runs, is no match for using multiple plans. The variability observed in the plans themselves is too large to capture in one of the current practices. Consequently, using multiple plans is definitely the better option. 
Moreover, this allows using a matched-pair approach in the data analysis, which results in tighter confidence intervals for the mean performance number. The third pitfall we examine is the usage of global performance numbers when tuning either an application or a virtual machine. We show that Java applications exhibit phase behaviour at the method level. This means that instances of the same method show more similarity to each other, behaviourwise, than to instances of other methods. A phase can then be identified as a set of sub-trees of the dynamic call-tree, with each sub-tree headed by the same method. We present a two-step algorithm that allows correlating hardware performance counter data in step 2 with the phases determined in step 1. The information obtained can be applied to show the programmer which methods perform worse than average, for example with respect to the number of cache misses they incur. In the dissertation, we pay particular attention to statistical rigour. For each pitfall, we use statistics to demonstrate its presence. Hopefully this work will encourage other researchers to use more rigour in their work as well.
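
The matched-pair analysis with confidence intervals mentioned in this abstract can be illustrated with a minimal sketch. It is not the dissertation's code: the class and method names are hypothetical, and it assumes a normal approximation (z = 1.96) for the 95% interval, which is only reasonable for roughly 30 or more runs. With fewer runs, a Student t quantile for n - 1 degrees of freedom would be needed instead.

```java
/**
 * Sketch: 95% confidence interval for a mean, and a matched-pair comparison
 * of two alternatives measured under the same conditions (for example, the
 * same compilation plan). All names here are hypothetical.
 */
final class PairedTimingComparison {

    // Assumption: normal approximation; use a Student t quantile for small samples.
    static final double Z_95 = 1.96;

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) sumSq += (x - m) * (x - m);
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    /** Returns {lower, upper} bounds of the interval around the mean. */
    static double[] interval95(double[] xs) {
        if (xs.length < 2) {
            throw new IllegalArgumentException("need at least two measurements");
        }
        double m = mean(xs);
        double half = Z_95 * sampleStdDev(xs) / Math.sqrt(xs.length);
        return new double[] { m - half, m + half };
    }

    /** Per-pair differences a[i] - b[i]; both arrays must be measured pairwise. */
    static double[] pairedDifferences(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("paired samples must have equal length");
        }
        double[] d = new double[a.length];
        for (int i = 0; i < a.length; i++) d[i] = a[i] - b[i];
        return d;
    }

    /** The difference is treated as significant only if the interval excludes zero. */
    static boolean excludesZero(double[] interval) {
        return interval[0] > 0.0 || interval[1] < 0.0;
    }
}
```

Calling `interval95(pairedDifferences(a, b))` and then `excludesZero` on the result gives a yes/no answer on whether two alternatives differ at the chosen confidence level. Pairing removes the variation that both alternatives share, which is why the resulting interval tends to be tighter.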

    Towards an embedded real-time Java virtual machine

    Most computers today are embedded, i.e. they are built into some product or system that is not perceived as a computer. It is highly desirable to use modern safe object-oriented software techniques for a rapid development of reliable systems. However, languages and run-time platforms for embedded systems have not kept up with the front line of language development. Reasons include complex and, in some cases, contradictory requirements on timing, concurrency, predictability, safety, and flexibility. A carefully tailored Java virtual machine (called IVM) is proposed as an approach to overcome these difficulties. In particular, real-time garbage collection has been considered an essential part. The set of bytecodes has been revised to require less memory and to facilitate predictable execution. To further reduce the memory footprint, the class loader can be located outside the embedded processor. Since the accomplished concurrency is crucial for the function of many embedded applications, the scheduling can be defined on the application level in Java. Finally, considering future needs for flexibility and on-line configuration of embedded systems, the IVM has a unique structure in which, for instance, methods are objects that can be replaced and garbage-collected. The approach has been experimentally verified by a full prototype implementation of such a virtual machine. By making the prototype available for development of real products, this in turn has confronted the solutions with real industrial demands. It was found that the IVM can be easily integrated in typical systems today and that the mentioned requirements are fulfilled. Based on experiences from more than 10 projects utilising the novel Java-oriented techniques, there are reasons to believe that the proposed approach is very promising for future flexible embedded systems.

    Profiling tools for Java

    Integrated master's dissertation in Informatics Engineering. Java is currently one of the most popular programming languages. This popularity is, in part, due to the portability it offers, which comes from the fact that Java source code is compiled into bytecode that can be executed by a compatible Java Virtual Machine (JVM) on any system. The JVM can then directly interpret the Java application or compile it into machine code. However, this execution on top of a virtual machine creates some obstacles for developers looking to profile their applications. Profilers are valuable tools for developers who seek to understand an application's behaviour by collecting metrics about its execution. Obtaining accurate profiles of an application is important, but they can also be challenging to obtain and to analyse, particularly for parallel applications. This dissertation suggests an optimisation workflow to employ in the pursuit of reducing scalability bottlenecks of parallel Java applications. The workflow is designed to simplify the discovery of the performance problems affecting a given parallel application and to suggest possible actions to investigate them further. The suggested workflow relies on possible speedups to quantify the impact of different performance problems. The idea of possible speedups is to estimate the speedup an application could achieve if a specific performance problem were to completely disappear. This estimation is performed using metrics collected while profiling the parallel application and its sequential version. The set of performance problems considered includes workload imbalance, parallelism overhead due to an increase in the number of instructions, synchronisation overhead, memory bottlenecks and the fraction of sequential work. These were deemed to be the most common causes of scalability issues in parallel applications. To further investigate the effect of these problems on a parallel application, some visualisations of the application's behaviour are suggested depending on which problem limits scalability the most. The suggested visualisations mostly consist of different flame graphs of the application's profile. Two tools were also developed to help in the application of this optimisation workflow for parallel Java applications. One of these tools relies on async-profiler to collect profiles of a given Java application. The other tool uses the profiles collected by the first tool to estimate possible speedups and also produce all visualisations mentioned in the suggested workflow. Finally, the workflow was validated on multiple case studies. The main case study was the iterative optimisation of a K-means algorithm, starting from a sequential implementation and resulting in the gradual increase of the application's scalability. Additional case studies were also presented in order to highlight additional paths not covered in the main case study.
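
The abstract does not state the formula behind possible speedups, so the following is only a rough sketch under an assumed model: the possible speedup for a problem is the sequential run time divided by the parallel run time with the wall-clock time attributed to that problem removed. The class, the method names and the attribution of seconds to each problem are all hypothetical and are not taken from the dissertation's tools.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a "possible speedup" estimate under an assumed, simplified model. */
final class PossibleSpeedupSketch {

    /** Speedup actually achieved by the parallel version. */
    static double speedup(double seqSeconds, double parSeconds) {
        return seqSeconds / parSeconds;
    }

    /**
     * Assumed model: if the problem vanished, the parallel run would be
     * shorter by exactly the time attributed to that problem.
     */
    static double possibleSpeedup(double seqSeconds, double parSeconds, double problemSeconds) {
        if (problemSeconds < 0 || problemSeconds >= parSeconds) {
            throw new IllegalArgumentException("problem time must be in [0, parallel time)");
        }
        return seqSeconds / (parSeconds - problemSeconds);
    }

    /** Name of the problem whose removal would yield the largest possible speedup. */
    static String mostLimiting(double seqSeconds, double parSeconds, Map<String, Double> problemSeconds) {
        String best = null;
        double bestSpeedup = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : problemSeconds.entrySet()) {
            double s = possibleSpeedup(seqSeconds, parSeconds, e.getValue());
            if (s > bestSpeedup) {
                bestSpeedup = s;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not measurements from the dissertation.
        Map<String, Double> problems = new LinkedHashMap<>();
        problems.put("workload imbalance", 15.0);
        problems.put("synchronisation overhead", 5.0);
        System.out.println("achieved: " + speedup(100.0, 40.0));
        System.out.println("most limiting: " + mostLimiting(100.0, 40.0, problems));
    }
}
```

Ranking problems by this estimate is one way to decide which visualisation to look at first, in the spirit of the workflow the abstract describes.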

    Software Performance Engineering using Virtual Time Program Execution

    In this thesis we introduce a novel approach to software performance engineering that is based on the execution of code in virtual time. Virtual time execution models the timing-behaviour of unmodified applications by scaling observed method times or replacing them with results acquired from performance model simulation. This facilitates the investigation of "what-if" performance predictions of applications comprising an arbitrary combination of real code and performance models. The ability to analyse code and models in a single framework enables performance testing throughout the software lifecycle, without the need to extract performance models from code. This is accomplished by forcing thread scheduling decisions to take into account the hypothetical time-scaling or model-based performance specifications of each method. The virtual time execution of I/O operations or multicore targets is also investigated. We explore these ideas using a Virtual EXecution (VEX) framework, which provides performance predictions for multi-threaded applications. The language-independent VEX core is driven by an instrumentation layer that notifies it of thread state changes and method profiling events; it is then up to VEX to control the progress of application threads in virtual time on top of the operating system scheduler. We also describe a Java Instrumentation Environment (JINE), demonstrating the challenges involved in virtual time execution at the JVM level. We evaluate the VEX/JINE tools by executing client-side Java benchmarks in virtual time and identifying the causes of deviations from observed real times. Our results show that VEX and JINE transparently provide predictions for the response time of unmodified applications with typically good accuracy (within 5-10%) and low simulation overheads (25-50% additional time). 
We conclude this thesis with a case study that shows how models and code can be integrated, thus illustrating our vision on how virtual time execution can support performance testing throughout the software lifecycle.
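
The core idea of scaling observed method times, or replacing them with model results, can be sketched for a single thread as follows. This is not the VEX or JINE API: the class and every method name are hypothetical, and the scheduling control that the thesis describes for multi-threaded applications is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Single-threaded sketch of a virtual clock driven by method-exit events. */
final class VirtualClockSketch {

    // Hypothetical per-method "what-if" scaling factors (1.0 means unchanged).
    private final Map<String, Double> scaleFactors = new HashMap<>();
    // Hypothetical per-method durations supplied by a performance model.
    private final Map<String, Long> modelNanos = new HashMap<>();
    private long virtualNanos = 0L;

    void setScale(String method, double factor) {
        if (factor < 0) throw new IllegalArgumentException("factor must be non-negative");
        scaleFactors.put(method, factor);
    }

    void setModelTime(String method, long nanos) {
        if (nanos < 0) throw new IllegalArgumentException("model time must be non-negative");
        modelNanos.put(method, nanos);
    }

    /** Advances virtual time for one completed invocation of the given method. */
    void record(String method, long observedNanos) {
        Long modelled = modelNanos.get(method);
        if (modelled != null) {
            // A model result replaces the observed time entirely.
            virtualNanos += modelled;
        } else {
            // Otherwise the observed time is scaled by the method's factor.
            double factor = scaleFactors.getOrDefault(method, 1.0);
            virtualNanos += Math.round(observedNanos * factor);
        }
    }

    long virtualNanos() {
        return virtualNanos;
    }
}
```

Setting a factor of 0.5 for one method then answers a question of the form "what would the total time be if this method ran twice as fast", at least for sequential code where method times simply add up.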