
    On the usage of the probability integral transform to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems

    We present a new distributed fuzzy partitioning method to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems. The proposed algorithm builds a fixed number of fuzzy sets for all variables and adjusts their shape and position to the real distribution of the training data. A two-step process is applied: 1) transformation of the original distribution into a standard uniform distribution by means of the probability integral transform; since the original distribution is generally unknown, the cumulative distribution function is approximated by computing the q-quantiles of the training set; 2) construction of a Ruspini strong fuzzy partition in the transformed attribute space using a fixed number of equally distributed triangular membership functions. Despite this transformation, the definition of every fuzzy set in the original space can be recovered by applying the inverse cumulative distribution function (also known as the quantile function). The experimental results reveal that the proposed methodology allows the state-of-the-art multi-way fuzzy decision tree (FMDT) induction algorithm to maintain classification accuracy with up to 6 million fewer leaves. Appeared in the 2018 IEEE International Congress on Big Data (BigData Congress).
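    The two-step construction can be made concrete with a small sketch. The code below is a minimal single-machine illustration, not the paper's distributed implementation; the class name QuantilePartition, its methods, and the parameter choices are assumptions made here for illustration. It approximates the cumulative distribution function with the q-quantiles of the training values, maps a raw value into [0, 1] via a piecewise-linear empirical CDF, and evaluates equally spaced triangular membership functions that form a Ruspini strong partition in the transformed space.

        import java.util.Arrays;

        // Illustrative sketch of the two-step partitioning idea (assumes q >= 1,
        // numFuzzySets >= 2, and a non-empty training sample).
        public class QuantilePartition {
            private final double[] quantiles; // q-quantiles approximating the CDF
            private final int numFuzzySets;   // fixed number of triangular fuzzy sets

            public QuantilePartition(double[] trainingValues, int q, int numFuzzySets) {
                double[] sorted = trainingValues.clone();
                Arrays.sort(sorted);
                this.quantiles = new double[q + 1];
                for (int i = 0; i <= q; i++) {
                    int idx = (int) Math.round((double) i / q * (sorted.length - 1));
                    quantiles[i] = sorted[idx];
                }
                this.numFuzzySets = numFuzzySets;
            }

            // Step 1: probability integral transform, approximated by the empirical CDF
            // (piecewise-linear interpolation between the stored quantiles), mapping x to [0, 1].
            public double toUniform(double x) {
                if (x <= quantiles[0]) return 0.0;
                if (x >= quantiles[quantiles.length - 1]) return 1.0;
                int i = 1;
                while (quantiles[i] < x) i++;
                double lo = quantiles[i - 1], hi = quantiles[i];
                double frac = hi > lo ? (x - lo) / (hi - lo) : 0.0;
                return (i - 1 + frac) / (quantiles.length - 1);
            }

            // Step 2: Ruspini strong partition with equally spaced triangular membership
            // functions on [0, 1]; memberships of adjacent sets sum to 1.
            public double membership(int fuzzySet, double x) {
                double u = toUniform(x);
                double width = 1.0 / (numFuzzySets - 1);
                double center = fuzzySet * width;
                double d = Math.abs(u - center);
                return d >= width ? 0.0 : 1.0 - d / width;
            }
        }

    Recovering a fuzzy set in the original attribute space then amounts to pushing the triangle's breakpoints through the inverse of toUniform (the quantile function), as described above.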

    Exploring Dynamic Compilation and Cross-Layer Object Management Policies for Managed Language Applications

    Recent years have witnessed the widespread adoption of managed programming languages that are designed to execute on virtual machines. Virtual machine architectures provide several powerful software engineering advantages over statically compiled binaries, such as portable program representations, additional safety guarantees, automatic memory and thread management, and dynamic program composition, which have largely driven their success. To support and facilitate the use of these features, virtual machines implement a number of services that adaptively manage and optimize application behavior during execution. Such runtime services often require tradeoffs between efficiency and effectiveness, and different policies can have major implications for the system's performance and energy requirements. In this work, we extensively explore policies for the two runtime services that are most important for achieving performance and energy efficiency: dynamic (or Just-In-Time (JIT)) compilation and memory management. First, we examine the properties of single-tier and multi-tier JIT compilation policies in order to find strategies that realize the best program performance for existing and future machines. Our analysis performs hundreds of experiments with different compiler aggressiveness and optimization levels to evaluate the performance impact of varying if and when methods are compiled. We later investigate the issue of how to optimize program regions to maximize performance in JIT compilation environments. For this study, we conduct a thorough analysis of the behavior of optimization phases in our dynamic compiler, and construct a custom experimental framework to determine the performance limits of phase selection during dynamic compilation. Next, we explore innovative memory management strategies to improve energy efficiency in the memory subsystem. We propose and develop a novel cross-layer approach to memory management that integrates information and analysis in the VM with fine-grained management of memory resources in the operating system. Using custom as well as standard benchmark workloads, we perform a detailed evaluation that demonstrates the energy-saving potential of our approach. We implement and evaluate all of our studies using the industry-standard Oracle HotSpot Java Virtual Machine to ensure that our conclusions are supported by widely used, state-of-the-art runtime technology.
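    As a rough illustration of what a multi-tier compilation policy decides, the sketch below promotes a method to a higher optimization tier once its invocation count crosses a per-tier threshold. The class name, the number of tiers, and the thresholds are invented here for illustration; they are not HotSpot's actual policy or values.

        import java.util.HashMap;
        import java.util.Map;

        // Toy counter-driven multi-tier JIT trigger: decides *if* and *when* a method
        // is handed to a more aggressive compiler tier.
        public class TieredCompilationPolicy {
            // Invocation-count thresholds that trigger compilation at tiers 1..3 (illustrative).
            private static final int[] TIER_THRESHOLDS = {200, 2_000, 15_000};
            private final Map<String, Integer> invocationCounts = new HashMap<>();
            private final Map<String, Integer> compiledTier = new HashMap<>();

            // Called on every method invocation; returns the tier the method should now be
            // compiled at, or -1 if this call triggers no (re)compilation.
            public int onInvocation(String methodId) {
                int count = invocationCounts.merge(methodId, 1, Integer::sum);
                int current = compiledTier.getOrDefault(methodId, 0);
                for (int tier = TIER_THRESHOLDS.length; tier > current; tier--) {
                    if (count >= TIER_THRESHOLDS[tier - 1]) {
                        compiledTier.put(methodId, tier);
                        return tier; // hand the method to the tier-'tier' compiler
                    }
                }
                return -1; // keep interpreting or keep the current compiled version
            }
        }

    Varying the thresholds (compiler aggressiveness) and the work done at each tier (optimization level) is, in essence, the policy space the first study explores.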

    A Statistical Approach for Detecting Memory Leaks in Java Applications

    Modern managed runtime environments and programming languages greatly simplify the creation and maintenance of applications. One of the best-known examples of such a managed runtime environment and language is the Java Virtual Machine together with the Java programming language. Despite the built-in garbage collector, the memory leak problem is still relevant in Java: memory is wasted because unused objects are prevented from being reclaimed. Memory leaks are especially critical for applications that are expected to run uninterrupted around the clock, since running out of memory is one of the few programming errors that can terminate the whole Java application. The best indicator of whether an object is still in use is the time of its last access; the main disadvantage of this metric, however, is the performance overhead it incurs. This thesis investigates the memory leak problem and proposes a novel approach for memory leak detection and diagnosis. It proposes an alternative way to estimate the 'unusedness' of objects. The main hypothesis is that leaked objects can be distinguished from non-leaked ones by applying statistical methods to object lifetimes, observing the ages of the population of objects grouped by their allocation points. The proposed solution is much more efficient performance-wise, since information about an object needs to be recorded only at the moment of its creation. The research conducted for the thesis has been applied in the memory leak detection tool Plumbr, which is successfully used in a number of production environments. After the introduction and an overview of the state of the art, the thesis reviews existing solutions and proposes a classification of memory leak detection approaches. Next, the statistical approach for memory leak detection is described, together with the main metric used to distinguish leaking objects from non-leaking ones, followed by an analysis of the shortcomings of this single metric. Based on this analysis, additional metrics are designed and machine learning algorithms are applied to statistical data acquired by Plumbr from real applications and production environments. Finally, case studies of real applications and a comparison with one existing memory leak detection solution are performed in order to evaluate the performance overhead of the tool.
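    A minimal sketch of the grouping-by-allocation-point idea is given below: only the creation time of each tracked object is recorded, live objects are grouped by allocation site, and a site is flagged when its live population is large and markedly older than the overall average. The class, the thresholds, and the suspiciousness heuristic are simplifications assumed here for illustration; they are not Plumbr's actual metrics.

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Per-allocation-site age statistics; only creation timestamps are stored per object.
        public class AllocationSiteStats {
            private final Map<String, List<Long>> liveCreationTimes = new HashMap<>();

            // Call when an object is allocated at the given site.
            public void recordAllocation(String allocationSite, long nowMillis) {
                liveCreationTimes.computeIfAbsent(allocationSite, k -> new ArrayList<>())
                                 .add(nowMillis);
            }

            // Call when the garbage collector reclaims an object from this site.
            public void recordCollection(String allocationSite, long creationTimeMillis) {
                List<Long> times = liveCreationTimes.get(allocationSite);
                if (times != null) times.remove(Long.valueOf(creationTimeMillis));
            }

            // Heuristic: a site is suspicious if it holds many live objects whose mean age is
            // far above the mean age across all sites (thresholds are illustrative).
            public List<String> suspiciousSites(long nowMillis, int minLiveObjects, double ageRatio) {
                double globalMeanAge = meanAge(nowMillis, null);
                List<String> suspects = new ArrayList<>();
                for (String site : liveCreationTimes.keySet()) {
                    if (liveCreationTimes.get(site).size() >= minLiveObjects
                            && meanAge(nowMillis, site) > ageRatio * globalMeanAge) {
                        suspects.add(site);
                    }
                }
                return suspects;
            }

            private double meanAge(long nowMillis, String site) {
                long total = 0, count = 0;
                for (Map.Entry<String, List<Long>> e : liveCreationTimes.entrySet()) {
                    if (site != null && !e.getKey().equals(site)) continue;
                    for (long t : e.getValue()) { total += nowMillis - t; count++; }
                }
                return count == 0 ? 0.0 : (double) total / count;
            }
        }

    The cost profile matches the hypothesis above: the per-object work happens once at allocation time, and the statistics are aggregated per site rather than per object access.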

    Efficient branch and node testing

    Software testing evaluates the correctness of a program’s implementation through a test suite. The quality of a test case or suite is assessed with a coverage metric indicating what percentage of a program’s structure was exercised (covered) during execution. Covering every execution path is impossible due to infeasible paths and to loops that yield an exponential or infinite number of paths. Instead, metrics such as the number of statements (nodes) or control-flow branches covered are used. Node and branch coverage require instrumentation probes to be present during program runtime. Traditionally, probes were statically inserted during compilation. These static probes remain even after coverage is recorded, incurring unnecessary overhead, reducing the number of tests that can be run, or requiring large amounts of memory. In this dissertation, I present three novel techniques for improving branch and node coverage performance for the Java runtime. First, Demand-driven Structural Testing (DDST) uses dynamic insertion and removal of probes so they can be removed after recording coverage, avoiding the unnecessary overhead of static instrumentation. DDST is built on Jazz, a new framework for developing and researching coverage techniques. DDST for node coverage is on average 19.7% faster than statically inserted instrumentation on an industry-standard benchmark suite, SPECjvm98. Because DDST’s probes are more expensive than static ones, no single branch coverage technique performs best on all programs or methods. To address this, I developed Hybrid Structural Testing (HST). HST combines different test techniques, including static instrumentation and DDST, into one run. HST uses a cost model for its analysis, reducing the cost of branch coverage testing by an average of 48% versus static instrumentation and 56% versus DDST on SPECjvm98. However, HST never chooses certain techniques because their analysis is too expensive. I therefore developed a third technique, Test Plan Caching (TPC), which exploits the inherent repetition in testing over a suite. TPC saves analysis results to avoid recomputation. Combined with HST, TPC produces a mix of techniques that records coverage quickly and efficiently. Together, my three techniques reduce the average cost of branch coverage by 51.6–90.8% over previous approaches on SPECjvm98, allowing twice as many test cases to run in a given time budget.
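    The core idea behind demand-driven probes can be illustrated in a few lines: each probe records its node once and then deactivates itself, so subsequent executions of the same node pay almost nothing. Real DDST removes the probe from the generated code entirely rather than testing a flag; the class and method names below are illustrative, not from Jazz.

        import java.util.Arrays;
        import java.util.BitSet;

        // Flag-based sketch of "record once, then remove" node-coverage probes.
        public class NodeCoverage {
            private final BitSet covered;
            private final boolean[] probeActive;

            public NodeCoverage(int numNodes) {
                covered = new BitSet(numNodes);
                probeActive = new boolean[numNodes];
                Arrays.fill(probeActive, true);
            }

            // Instrumentation calls this at the start of each basic block (node).
            public void probe(int nodeId) {
                if (!probeActive[nodeId]) return;   // probe already "removed"
                covered.set(nodeId);
                probeActive[nodeId] = false;        // record once, then deactivate
            }

            public double coverageRatio() {
                return (double) covered.cardinality() / probeActive.length;
            }
        }

    A hybrid scheme in the spirit of HST would pick, per method, whether such dynamic probes or cheaper static ones are expected to cost less over the whole suite.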

    Reducing energy usage in resource-intensive Java-based scientific applications via micro-benchmark based code refactorings

    In-silico research has grown considerably. Today's scientific code involves long-running computer simulations, and hence powerful computing infrastructures are needed. Traditionally, research in high-performance computing has focused on executing code as fast as possible, while energy has only recently been recognized as another goal to consider. Yet energy-driven research has mostly focused on the hardware and middleware layers; few efforts target the application level, where many energy-aware optimizations are possible. We revisit a catalog of Java primitives commonly used in OO scientific programming, or micro-benchmarks, to identify energy-friendly versions of the same primitive. We then apply the micro-benchmarks to classical scientific application kernels and machine learning algorithms, for both single-thread and multi-thread implementations, on a server. Energy usage reductions at the micro-benchmark level are substantial, while for the applications the obtained reductions range from 3.90% to 99.18%.
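    The paper's catalog of primitives is not reproduced here, but the flavor of a micro-benchmark-level refactoring can be shown with one commonly cited example: replacing an autoboxing-heavy accumulation over boxed values with a loop over primitive arrays, which reduces allocation and memory traffic and, with it, energy. The class and method names below are assumptions made for illustration.

        import java.util.List;

        // One representative "primitive swap": boxed accumulation vs. primitive arrays.
        public class DotProduct {
            // Baseline: boxed Doubles; each access unboxes an object on the heap.
            static double boxedDot(List<Double> a, List<Double> b) {
                double sum = 0.0;
                for (int i = 0; i < a.size(); i++) {
                    sum += a.get(i) * b.get(i);
                }
                return sum;
            }

            // Refactored: primitive arrays, no boxing, better locality.
            static double primitiveDot(double[] a, double[] b) {
                double sum = 0.0;
                for (int i = 0; i < a.length; i++) {
                    sum += a[i] * b[i];
                }
                return sum;
            }
        }

    Applying a handful of such swaps throughout a kernel is the kind of application-level refactoring whose aggregate energy effect the study measures.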
    • 
