
    Three pitfalls in Java performance evaluation

    The Java programming language has seen remarkable growth over the last decade. This is partially due to the infrastructure required to run Java applications on general-purpose microprocessors: a Java virtual machine (VM). The VM ensures that Java applications are portable across different hardware platforms, because it shelters the applications from the underlying system; hence the motto "write once, run (almost) anywhere". Java applications are compiled to an intermediate form, called bytecode, and consist of a number of so-called class files. The virtual machine takes care of class loading, interpreting or compiling the bytecode to the native code of the underlying hardware platform, thread scheduling, garbage collection, etc. As such, during the execution of a Java application, the VM regularly intervenes to take care of housekeeping tasks and to optimise the application as it executes.

    Furthermore, the implementation details of most virtual machines introduce non-deterministic behaviour, not in the semantics of the execution, but in the lower-level execution. For example, to bring a Java application up to competitive speed with classical compiled programs written in languages such as C, the virtual machine needs to optimise the Java bytecode. To limit the execution overhead, most virtual machines use a time-based sampling mechanism to determine the hot methods in the application. This introduces non-determinism: over several runs, the methods are not always optimised at the same moment, nor is the set of optimised methods always the same. Other factors that introduce non-determinism are thread scheduling, garbage collection, etc. Performance analysis of Java applications is therefore not as simple as it first seems, and warrants closer inspection.

    In this dissertation we are mainly interested in the behaviour of Java applications and their performance. In the course of this work, we uncovered three major pitfalls that researchers analysing Java performance prior to this work did not take into account. We briefly summarise the main achievements presented in this dissertation.

    The first pitfall involves the interaction between the virtual machine, the application and the input to the application. The performance of short-running applications is shown to be mainly determined by the virtual machine; for longer-running applications, this influence decreases, but remains tangible. We use statistical analyses, such as principal components analysis and cluster analysis (K-means and hierarchical clustering), to demonstrate and clarify the pitfall. Using a large number of performance characteristics measured with hardware performance counters, five virtual machines and fourteen benchmarks with both a small and a large input size, we demonstrate that short-running workloads are primarily clustered by virtual machine. Even for long-running applications from the SPECjvm98 benchmark suite, the virtual machine still exerts a large influence on the observed behaviour at the microarchitectural level. This work has shown the need for larger and longer-running benchmarks than were available prior to it – a need (partially) met by the introduction of the DaCapo benchmark suite – as well as careful experimental setup to avoid measuring the virtual machine rather than the benchmark. Prior to this work, researchers quite often used simulation with short-running applications (to save time) when exploring Java performance.
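    A minimal sketch of this clustering methodology, assuming scikit-learn is available; the counter matrix, its dimensions and the number of clusters below are illustrative placeholders, not the dissertation's measurements:

```python
# Sketch of the clustering step: reduce hardware-counter characteristics
# with PCA, then cluster workloads with K-means. Data is a random
# placeholder for the real measurements.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# e.g. 5 VMs x 14 benchmarks = 70 workloads, 20 counter-based characteristics
counters = rng.normal(size=(70, 20))

scaled = StandardScaler().fit_transform(counters)   # normalise each characteristic
pcs = PCA(n_components=4).fit_transform(scaled)     # keep the principal components

# If short-running workloads end up clustered by VM rather than by benchmark,
# the VM dominates the measured behaviour.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(pcs)
print(labels)
```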
    The second pitfall involves the analysis of performance numbers. In a survey of 50 papers published at premier conferences such as OOPSLA, PLDI, CGO, ISMM and VEE over the past seven years, we found that a variety of approaches are used, both for experimental design – for example, the input size, virtual machines, heap sizes, etc. – and, even more importantly, for data analysis – for example, reporting a best-out-of-3 performance number. New techniques are pitted against existing work using these prevalent approaches, and conclusions about whether they beat the prior state of the art are based upon them. Given that the execution of Java applications usually involves non-determinism in the virtual machine – for example, when determining which methods to optimise – it should come as no surprise that the lack of statistical rigour in these prevalent approaches leads to misleading or even incorrect conclusions. By this we mean that the conclusions are either not representative of what actually happens, or even contradict reality, as modelled in a statistical manner. To circumvent this pitfall, we propose a rigorous statistical approach that uses confidence intervals both to report and to compare performance numbers. We also argue that sufficient experiments should be conducted to obtain a reliable performance measure.

    The non-determinism caused by the timer-based optimisation component in a virtual machine can be eliminated using so-called replay compilation. This technique records a compilation plan during a first execution or profiling run of the application. During a second execution, the application is iterated twice: once to compile and optimise all methods found in the compilation plan, and a second time to perform the actual measurement. It turns out, however, that the current practice of using either a single plan – corresponding to the best-performing profiling run – or a combined plan that selects the methods optimised in, say, more than half the profiling runs, is no match for using multiple plans. The variability observed in the plans themselves is too large to be captured by either current practice, so using multiple plans is definitely the better option. Moreover, multiple plans allow a matched-pair approach in the data analysis, which results in tighter confidence intervals for the mean performance number.

    The third pitfall we examine is the use of global performance numbers when tuning either an application or a virtual machine. We show that Java applications exhibit phase behaviour at the method level: instances of the same method are more similar to each other, behaviour-wise, than to instances of other methods. A phase can then be identified as a set of sub-trees of the dynamic call tree, with each sub-tree headed by the same method. We present a two-step algorithm that correlates hardware performance counter data (step 2) with the phases determined in step 1. The information obtained can show the programmer which methods perform worse than average, for example with respect to the number of cache misses they incur.

    Throughout the dissertation, we pay particular attention to statistical rigour. For each pitfall, we use statistics to demonstrate its presence. Hopefully this work will encourage other researchers to use more rigour in their work as well.
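    As a hedged illustration of the statistical approach advocated for the second pitfall, the sketch below reports 95% confidence intervals and a matched-pair comparison; the timings and the helper function are hypothetical, not data from the survey:

```python
# Sketch: report means with 95% confidence intervals and compare two
# alternatives via a matched-pair analysis. All timings are hypothetical.
import numpy as np
from scipy import stats

runs_a = np.array([10.2, 10.5, 9.9, 10.8, 10.1, 10.4])  # seconds, alternative A
runs_b = np.array([9.8, 10.1, 9.7, 10.3, 9.9, 10.0])    # seconds, alternative B

def mean_ci(x, confidence=0.95):
    """Confidence interval for the mean, using the Student t distribution."""
    half = stats.sem(x) * stats.t.ppf((1 + confidence) / 2, len(x) - 1)
    return x.mean() - half, x.mean() + half

print("A:    ", mean_ci(runs_a))
print("B:    ", mean_ci(runs_b))

# Matched pairs (e.g. run i of A and B use the same compilation plan):
# the interval on per-pair differences is tighter than on the raw means.
print("A - B:", mean_ci(runs_a - runs_b))  # interval excluding 0 => significant
```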

    Optimal Design of Wireless Sensor Networks

    Since their introduction, Wireless Sensor Networks (WSNs) have been proposed as a powerful support for environment monitoring, ranging from monitoring of remote or hard-to-reach locations to fine-grained control of cultivations. Development of a WSN-based application is a complex task, and challenging issues must be tackled from the first phases of the design cycle. We present here a tool supporting the design space exploration (DSE) phase, used to make architectural choices for the nodes and the network topology while taking into account target performance goals and estimated costs. When designing applications based on WSNs, the most challenging problem is energy shortage. Nodes are normally supplied by batteries, hence a limited amount of energy is available, and no breakthroughs are foreseen in the near future. In our design cycle we approach this issue through a methodology that allows analysing and optimising power performance in a hierarchical fashion, encompassing various abstraction levels.

    Data Learning Methodologies for Improving the Efficiency of Constrained Random Verification

    Functional verification continues to be one of the most time-consuming steps in the chip design cycle. Simulation-based verification is widely practised in industry thanks to its flexibility and scalability. The completeness of the verification is measured by coverage metrics, and generating effective tests to achieve a satisfactory coverage level is a difficult task. Constrained random verification is commonly used to alleviate the manual effort of producing directed tests. However, there are still many situations where unnecessary verification effort, in terms of simulation cycles and man-hours, is spent. It is also observed that much of the data generated in the existing constrained random verification process is barely analysed and then discarded after simplistic correctness checking. Based on our previous research on data mining and our exposure to the industrial verification process, we identify opportunities to extract knowledge from constrained random verification data and use it to improve verification efficiency.

    In constrained random verification, when a simulation run of tests instantiated by a test template cannot reach the coverage goal, there are two possible reasons: insufficient simulation, or improper constraints and/or biases. There are three actions a verification engineer can usually take to address the problem: simulate more tests, refine the test template, or change to a new test template. Accordingly, we propose three data learning methodologies to help engineers make more informed decisions in these three application scenarios and thus improve verification efficiency.

    The first methodology identifies important ("novel") tests before simulation, based on what has already been simulated. By simulating only those novel tests and filtering out redundant ones, tremendous resources such as simulation cycles and licenses can be saved. The second methodology extracts the unique properties of the novel tests identified in simulation and uses them to refine the test template. By leveraging the extracted knowledge, more tests similar to the novel ones are generated, and the new tests are thus more likely to activate coverage events that are otherwise difficult to hit by extensive simulation. The third methodology analyses a collection of existing test items (test templates) and identifies feasible augmentations to the test plan. By automatically adding new test items based on the data analysis, it alleviates the manual effort of closing coverage holes.

    The proposed data learning methodologies were developed and applied in the setting of verifying commercial microprocessor and SoC platform designs. The experiments in this dissertation were conducted in the verification environment of a commercial microprocessor and an SoC platform at Freescale Semiconductor Inc., in parallel with the ongoing verification efforts. The experimental results demonstrate the feasibility and effectiveness of building learning frameworks to improve verification efficiency.
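    A minimal sketch of the idea behind the first methodology, under the assumption that tests can be encoded as feature vectors; the encoding, the nearest-neighbour novelty criterion and the threshold are illustrative stand-ins for the dissertation's actual learning models:

```python
# Sketch: keep only "novel" candidate tests, i.e. those far (in feature
# space) from every already-simulated test. Encoding and threshold are
# illustrative stand-ins for the dissertation's learning models.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
simulated = rng.random((500, 16))    # feature vectors of simulated tests
candidates = rng.random((200, 16))   # feature vectors of generated tests

nn = NearestNeighbors(n_neighbors=1).fit(simulated)
dist, _ = nn.kneighbors(candidates)  # distance to the closest simulated test
novel = candidates[dist[:, 0] > 0.5] # threshold chosen for illustration only
print(f"simulating {len(novel)} of {len(candidates)} candidate tests")
```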

    An energy-efficient internet of things (IoT) architecture for preventive conservation of cultural heritage

    [EN] Internet of Things (IoT) technologies can facilitate the preventive conservation of cultural heritage (CH) by enabling the management of data collected from electronic sensors. This work presents an IoT architecture for this purpose. Firstly, we discuss the requirements from the standpoint of the artwork, data acquisition, cloud processing and data visualization for the end user. The results presented in this work focus on the most critical aspect of the architecture, the sensor nodes. We designed a solution based on LoRa and Sigfox technologies to produce the minimum impact on the artwork, achieving a lifespan of more than 10 years. The solution will be capable of scaling its processing and storage resources, deployed either in a public or an on-premise cloud, and of embedding complex predictive models. This combination of technologies can cope with different types of cultural heritage environments.

    This work was partially funded by the Generalitat Valenciana project AICO/2016/058 and by the Plan Nacional de I+D, Comisión Interministerial de Ciencia y Tecnología (FEDER-CICYT), under project HAR2013-47895-C2-1-P.

    Perles Ivars, A.; Pérez Marín, E.; Mercado Romero, R.; Segrelles Quilis, JD.; Blanquer Espert, I.; Zarzo Castelló, M.; García Diego, FJ. (2018). An energy-efficient internet of things (IoT) architecture for preventive conservation of cultural heritage. Future Generation Computer Systems. 81:566-581. https://doi.org/10.1016/j.future.2017.06.030
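    As a rough plausibility check of the quoted 10-year lifespan, the sketch below budgets the energy of a duty-cycled LoRa/Sigfox node; every figure (battery capacity, currents, airtime, message rate) is an illustrative assumption, not a value from the paper:

```python
# Energy budget for a duty-cycled sensor node; all figures are assumptions.
BATTERY_MAH = 3600   # e.g. one Li-SOCl2 AA cell
SLEEP_UA = 5         # sleep current in microamps
TX_MA = 40           # transmit current in milliamps
TX_SECONDS = 2       # radio airtime per message
MSGS_PER_DAY = 24    # one measurement per hour

tx_mah_per_day = TX_MA * TX_SECONDS * MSGS_PER_DAY / 3600
sleep_mah_per_day = SLEEP_UA / 1000 * 24
days = BATTERY_MAH / (tx_mah_per_day + sleep_mah_per_day)
print(f"estimated lifespan: {days / 365:.1f} years")  # ~15 years with these figures
```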

    Contention-aware scheduling and resource management in emerging multicore architectures

    Unpublished doctoral thesis, Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática; defended on 14-12-2021.

    Chip multicore processors (CMPs) currently constitute the architecture of choice for most general-purpose computing systems, and they will likely remain dominant in the near future. Advances in technology have made it possible to pack an increasing number of cores and bigger caches on the same chip. Nevertheless, contention for shared resources on CMPs – present since the advent of these architectures – still poses a big challenge. Cores in a CMP typically share a last-level cache (LLC) and other memory-related resources with the remaining cores, such as a DRAM controller and an interconnection network. As a result, co-running applications may compete intensively with each other for these shared resources, leading to substantial and uneven performance degradation...

    GUNDAM : A toolkit for fast spatial correlation functions in galaxy surveys

    We describe the capabilities of a new software package to calculate two-point correlation functions (2PCFs) of large galaxy samples. The code can efficiently estimate 3D/projected/angular 2PCFs with a variety of statistical estimators and bootstrap errors, and is intended to provide a complete framework (including calculation, storage, manipulation and plotting) for performing this type of spatial analysis with large redshift surveys. GUNDAM implements a very fast skip list/linked list algorithm that efficiently counts galaxy pairs and avoids the computation of unnecessary distances. It is several orders of magnitude faster than a naive pair counter, and matches or even surpasses other advanced algorithms. The implementation is also embarrassingly parallel, making full use of multicore processors or large computational clusters when available. The software is designed to be flexible, user friendly and easily extensible, integrating optimized, well-tested packages already available in the astronomy community. Out of the box, it already provides advanced features such as custom weighting schemes, fibre collision corrections and 2D correlations. GUNDAM will ultimately provide an efficient toolkit to analyse the large-scale structure 'buried' in the extremely large data sets generated by future surveys.
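    For context, a naive pair counter of the kind GUNDAM outperforms, combined with the standard Landy-Szalay estimator xi(r) = (DD - 2DR + RR) / RR (one of several estimators such a package may offer; the excerpt does not name which GUNDAM uses), can be sketched as follows:

```python
# Naive O(N^2) pair counting plus the Landy-Szalay estimator, on toy data.
import numpy as np
from scipy.spatial.distance import cdist, pdist

def pair_counts(a, b, bins):
    """Histogram of pairwise distances; b=None means auto-pairs within a."""
    d = pdist(a) if b is None else cdist(a, b).ravel()
    return np.histogram(d, bins=bins)[0]

rng = np.random.default_rng(1)
data = rng.random((2000, 3))   # toy galaxy positions
rand = rng.random((2000, 3))   # random catalogue with the same geometry
bins = np.linspace(0.01, 0.3, 15)

# Normalised pair counts: DD and RR over N(N-1)/2 pairs, DR over Nd*Nr.
dd = pair_counts(data, None, bins) / (len(data) * (len(data) - 1) / 2)
rr = pair_counts(rand, None, bins) / (len(rand) * (len(rand) - 1) / 2)
dr = pair_counts(data, rand, bins) / (len(data) * len(rand))

xi = (dd - 2 * dr + rr) / rr   # Landy-Szalay estimator of the 2PCF
print(xi)
```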

    Real-time analysis of video signals

    Many practical and experimental systems employing image processing techniques have been built by other workers for various applications. Most of these systems are computer-based, and very few operate in real time. The objective of this work is to build a microprocessor-based system for video image processing. The system is used in conjunction with an on-line TV camera, and processing is carried out in real time. The enormous storage requirement of digitized TV signals and the real-time constraint mean that some simplification of the data must take place prior to any viable processing. Data reduction is attained through the representation of objects by their edges, an approach often adopted for feature extraction in pattern recognition systems. A new technique for edge detection, which applies comparison criteria to differentials at adjacent pixels of the video image, is developed and implemented as a preprocessing hardware unit. A circuit for generating the co-ordinates of edge points is constructed to free the processing computer of this task, allowing it more time for on-line analysis of the video signals. Besides the edge detector and co-ordinate generator, the hardware consists of a microprocessor system based on a Texas Instruments TMS 9900 device, a first-in-first-out buffer store, and interface circuitry to a TV camera and display devices. All hardware modules and their power supplies are assembled in one unit to provide a standalone instrument.

    The problem chosen for investigation is the analysis of motion in a visual scene. The aspects of motion studied concern the tracking of moving objects with simple geometric shapes and the description of their motion. Particular emphasis is placed on the analysis of human eye movements and the measurement of the eye's point of regard, which has many practical applications in the fields of physiology and psychology. This study provides a basis for the design of a processing unit attached to an oculometer, replacing bulky minicomputer-based eye motion analysis systems. Programs are written for the storage, analysis and display of results in real time.
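    A software sketch of the edge-detection principle described above: comparing differentials at adjacent pixels against a threshold. The comparison criterion and threshold are illustrative; the thesis implements this logic as a hardware unit, not in software:

```python
# Mark a pixel as an edge point when the differential to an adjacent
# pixel exceeds a threshold; a cheap test suited to real-time hardware.
import numpy as np

def detect_edges(frame, threshold=20):
    f = frame.astype(np.int32)               # avoid uint8 wrap-around
    dx = np.abs(np.diff(f, axis=1))          # horizontal differentials
    dy = np.abs(np.diff(f, axis=0))          # vertical differentials
    edges = np.zeros(frame.shape, dtype=bool)
    edges[:, 1:] |= dx > threshold
    edges[1:, :] |= dy > threshold
    return edges

frame = (np.random.rand(64, 64) * 255).astype(np.uint8)  # stand-in video frame
print(detect_edges(frame).sum(), "edge points")
```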

    An overview of decision table literature 1982-1995.

    This report gives an overview of the literature on decision tables over the past 15 years. As much as possible, for each reference an author-supplied abstract, a number of keywords and a classification are provided. In some cases our own comments are added; their purpose is to show where, how and why decision tables are used. The literature is classified according to application area, theoretical versus practical character, year of publication, country of origin (not necessarily the country of publication) and the language of the document. After a description of the scope of the review, the classification results and the classification by topic are presented. The main body of the paper is the ordered list of publications with abstract, classification and comments.

    Application of clustering analysis and sequence analysis on the performance analysis of parallel applications

    High Performance Computing and Supercomputing is the high-end area of computing science that studies and develops the most powerful computers available. Current supercomputers are extremely complex, and so are the applications that run on them. To take advantage of the huge amount of computing power available, it is strictly necessary to maximize the knowledge we have about how these applications behave and perform. This is the mission of (parallel) performance analysis.

    In general, performance analysis toolkits offer only very simplistic manipulations of the performance data. First-order statistics such as the average or standard deviation are used to summarize the values of a given performance metric, in some cases hiding interesting facts present in the raw performance data. For this reason, we require Performance Analytics, i.e. the application of Data Analytics techniques to the performance analysis area. This thesis contributes two new techniques to the Performance Analytics field.

    The first contribution is the application of cluster analysis to detect the computation structure of a parallel application. Cluster analysis is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters). In this thesis we use cluster analysis to group the CPU bursts of a parallel application: the regions of each process in between communication calls or calls to the parallel runtime. The resulting clusters are the different computational trends or phases that appear in the application. These clusters are useful for understanding the behaviour of the computation part of the application and for focusing the analyses on those parts that present performance issues. We demonstrate that our approach requires different clustering algorithms from those previously used in the area.

    The second contribution of the thesis is the application of multiple sequence alignment algorithms to evaluate the computation structure detected. Multiple sequence alignment (MSA) is a technique commonly used in bioinformatics to determine the similarities across two or more biological sequences: DNA or proteins. The Cluster Sequence Score we introduce applies an MSA algorithm to evaluate the "SPMDiness" of an application, i.e. how well its computation structure represents the Single Program Multiple Data (SPMD) paradigm. We also use this score in Aggregative Cluster Refinement, a new clustering algorithm we designed that detects the SPMD phases of an application at fine grain, surpassing the clustering algorithms we used initially.

    We demonstrate the usefulness of these techniques with three practical uses. The first is an extrapolation methodology able to maximize the performance metrics that characterize the application phases detected, using a single application execution. The second is the use of the detected computation structure to speed up a multi-level simulation infrastructure. Finally, we analyse four production-class applications, using the computation characterization to study the impact of possible application improvements and ports of the applications to different hardware configurations.

    In summary, this thesis proposes the use of cluster analysis and sequence analysis to automatically detect and characterize the different computation trends of a parallel application. These techniques provide the developer / analyst with useful insight into the application's performance and ease the understanding of the application's behaviour.
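    To make the alignment idea concrete, the sketch below scores the similarity of two processes' cluster-ID sequences with a classic pairwise global alignment (Needleman-Wunsch); the thesis's Cluster Sequence Score uses a full multiple sequence alignment, so this is a simplified stand-in:

```python
# Pairwise global alignment (Needleman-Wunsch) of two processes'
# cluster-ID sequences; identical sequences indicate perfect SPMD structure.
def align_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    n, m = len(seq_a), len(seq_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap                   # leading gaps in seq_b
    for j in range(1, m + 1):
        dp[0][j] = j * gap                   # leading gaps in seq_a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # substitution/match
                           dp[i - 1][j] + gap,       # gap in seq_b
                           dp[i][j - 1] + gap)       # gap in seq_a
    return dp[n][m]

p0 = [1, 2, 1, 3, 2]   # cluster IDs seen by process 0
p1 = [1, 2, 1, 2, 2]   # cluster IDs seen by process 1
print(f"SPMD similarity: {align_score(p0, p1) / max(len(p0), len(p1)):.2f}")
```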
    The contributions of the thesis are not limited to the proposals and publications of the techniques themselves, but include practical uses that demonstrate their usefulness in the analysis task. In addition, the research carried out during these years has produced a production tool for analysing an application's structure, now part of the BSC Tools suite.