
    Speculative Staging for Interpreter Optimization

    Interpreters have a bad reputation for having lower performance than just-in-time compilers. We present a new way of building high-performance interpreters that is particularly effective for executing dynamically typed programming languages. The key idea is to combine speculative staging of optimized interpreter instructions with a novel technique for incrementally and iteratively concerting them at run time. This paper introduces the concepts behind deriving optimized instructions from existing interpreter instructions---incrementally peeling off layers of complexity. When the interpreter is compiled, these optimized derivatives are compiled along with the original interpreter instructions. The technique is therefore portable by construction, since it leverages the existing compiler's backend. At run time, we substitute the interpreter's original, expensive instructions with their optimized derivatives to speed up execution. Our technique unites high performance with the simplicity and portability of interpreters: we report that our optimization makes the CPython interpreter up to more than four times faster, narrowing the gap to, and sometimes even outperforming, PyPy's just-in-time compiler.
    Comment: 16 pages, 4 figures, 3 tables. Uses CPython 3.2.3 and PyPy 1.
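    The general mechanism the abstract builds on, run-time instruction substitution (often called quickening), can be sketched in a few lines. The sketch below is a minimal illustration, not the paper's implementation: the instruction names, the function-per-opcode dispatch, and the stack layout are all assumed for the example. A generic ADD checks its operand types once, then rewrites its own code slot to point at a cheaper type-specialized derivative, which deoptimizes back if the speculation fails.

```python
# Minimal sketch of speculative instruction substitution ("quickening").
# All names here are illustrative, not taken from the paper.

class Frame:
    def __init__(self, code, stack):
        self.code, self.stack = code, stack

def op_add_generic(frame, pc):
    a, b = frame.stack[-2], frame.stack[-1]
    if type(a) is int and type(b) is int:
        # Speculate: future executions at this site will also see ints.
        frame.code[pc] = op_add_int           # install optimized derivative
        return op_add_int(frame, pc)
    frame.stack[-2:] = [a + b]                # slow, fully generic path
    return pc + 1

def op_add_int(frame, pc):
    a, b = frame.stack[-2], frame.stack[-1]
    if type(a) is not int or type(b) is not int:
        frame.code[pc] = op_add_generic       # speculation failed: revert
        return op_add_generic(frame, pc)
    frame.stack[-2:] = [a + b]                # fast path, no type dispatch
    return pc + 1

def run(frame):
    pc = 0
    while pc < len(frame.code):
        pc = frame.code[pc](frame, pc)        # dispatch on function pointers
    return frame.stack

f = Frame([op_add_generic], [1, 2])
print(run(f))        # [3]; code[0] is now the specialized op_add_int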

    PolyBench/Python: Benchmarking Python Environments With Polyhedral Optimizations

    Python has become one of the most widely used and taught languages today. Its expressiveness, cross-platform compatibility, and ease of use have made it popular in areas as diverse as finance, bioinformatics, and machine learning. However, Python programs are often significantly slower to execute than an equivalent native C implementation, especially for computation-intensive numerical kernels. This work presents PolyBench/Python, which implements in Python the 30 kernels of PolyBench/C, one of the standard benchmark suites for polyhedral optimization. In addition to the benchmark kernels, a functional wrapper has been developed that includes mechanisms for performance measurement, testing, and execution configuration. The framework supports different ways of translating C-array codes into Python, offering insight into the tradeoffs between Python lists and NumPy arrays. The benchmark's performance is thoroughly evaluated on different Python interpreters and compared against its PolyBench/C counterpart to highlight the profitability (or lack thereof) of using Python for regular numerical codes.
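    The list-vs-NumPy tradeoff the suite is designed to expose can be illustrated with a GEMM-style kernel written both ways. This is a hedged sketch: the timing harness and function names below are stand-ins for the example, not PolyBench/Python's actual wrapper API.

```python
# Illustrative GEMM-style kernel written two ways, mirroring the
# list-vs-NumPy tradeoff PolyBench/Python measures.
import time
import numpy as np

N = 200

def gemm_lists(A, B, C):
    # Direct C-style translation onto nested Python lists: interpreter-bound.
    for i in range(N):
        for j in range(N):
            acc = 0.0
            for k in range(N):
                acc += A[i][k] * B[k][j]
            C[i][j] += acc

def gemm_numpy(A, B, C):
    # Same computation on NumPy arrays: the loop nest runs in native code.
    C += A @ B

A = [[1.0] * N for _ in range(N)]
B = [[1.0] * N for _ in range(N)]
C = [[0.0] * N for _ in range(N)]
t0 = time.perf_counter()
gemm_lists(A, B, C)
t1 = time.perf_counter()

An, Bn, Cn = np.ones((N, N)), np.ones((N, N)), np.zeros((N, N))
t2 = time.perf_counter()
gemm_numpy(An, Bn, Cn)
t3 = time.perf_counter()
print(f"lists: {t1 - t0:.3f}s   numpy: {t3 - t2:.3f}s")
```

    On CPython the list version typically runs orders of magnitude slower than the NumPy one, which is exactly the kind of gap the benchmark suite quantifies across interpreters.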

    Optimizing SIMD execution in HW/SW co-designed processors

    SIMD accelerators are ubiquitous in microprocessors across computing domains. Their high compute power and hardware simplicity improve overall performance in an energy-efficient manner. Moreover, their replicated functional units and simple control mechanisms make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge since their inception. Compilers generate vector code conservatively to ensure correctness; as a result, they lose significant vectorization opportunities and fail to extract the maximum benefit from SIMD accelerators. This thesis proposes to vectorize the program binary at run time in a speculative manner, in addition to compile-time static vectorization. Several environments provide the runtime profiling and optimization support required for dynamic vectorization, the most prominent being: 1) dynamic binary translators and optimizers (DBTOs) and 2) hardware/software (HW/SW) co-designed processors. A HW/SW co-designed environment offers several advantages over DBTOs, such as transparent incorporation of new hardware features and binary compatibility. We therefore use a HW/SW co-designed environment to assess the potential of speculative dynamic vectorization. Furthermore, we analyze vector code generation for wider vector units and find that, even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector lengths is even more challenging. Two major factors impede vectorization for wider SIMD units: 1) reduced dynamic instruction stream coverage for vectorization and 2) a large number of permutation instructions. To solve the first problem, we propose Variable Length Vectorization, which iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. To reduce the number of permutation instructions, we propose Selective Writing, which selectively writes to different parts of a vector register and avoids permutations. Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume a significant amount of real estate on the chip, they become a principal source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce the leakage energy of functional units, but it carries its own energy and performance overheads. We propose to selectively devectorize the vector code when the higher SIMD lanes are used intermittently, keeping those lanes idle and power gated for the maximum duration and thereby reducing overall leakage energy.
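    The Variable Length Vectorization idea can be conveyed at a high level: instead of vectorizing only at the full SIMD width and leaving the remainder scalar, retry leftover iterations at progressively narrower widths so more of the instruction stream is covered by vector operations. The sketch below is purely conceptual and assumed for illustration; the thesis operates on program binaries inside a HW/SW co-designed processor, not on Python arrays.

```python
# Conceptual sketch of Variable Length Vectorization: widest width first,
# then narrower widths for the residue, instead of falling straight back
# to scalar code. Names and the NumPy framing are illustrative only.
import numpy as np

def vadd_variable_length(dst, a, b, widths=(8, 4, 2)):
    n, i = len(dst), 0
    for w in widths:                            # widest vector length first
        while n - i >= w:
            dst[i:i+w] = a[i:i+w] + b[i:i+w]    # one "vector op" of width w
            i += w
    while i < n:                                # residual scalar iterations
        dst[i] = a[i] + b[i]
        i += 1

a, b = np.arange(15.0), np.arange(15.0)
dst = np.empty_like(a)
vadd_variable_length(dst, a, b)
print(dst)   # 15 elements = 8 + 4 + 2 + 1: three vector ops, one scalar op
```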

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications that process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
    Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
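    The programming model itself is small enough to show with the canonical word-count example. Below is a minimal single-process sketch of the map, shuffle, and reduce phases; in a real framework these run distributed across a cluster with partitioning and fault tolerance handled for the user.

```python
# Minimal single-process sketch of the MapReduce model: user code supplies
# the map and reduce functions; the framework handles grouping, scheduling,
# and fault tolerance across a cluster. Canonical word count shown here.
from collections import defaultdict

def map_phase(document):
    # Map: emit (key, value) pairs for each input record.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key (done by the framework).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key.
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog"]
pairs = (kv for doc in docs for kv in map_phase(doc))
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)   # {'the': 2, 'quick': 1, 'brown': 1, ...}
```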

    Static analysis techniques for the synthesis of efficient asynchronous circuits

    In the context of deriving asynchronous circuits from high-level descriptions, determining whether two actions are potentially concurrent (overlapped execution) or serial (non-overlapped execution) has several advantages. This knowledge can be used to implement shared variables efficiently, support speculative guard evaluation, and optimize resources (circuitry) by sharing. In a distributed environment with several concurrent processes, automatically determining whether two actions are potentially concurrent is often difficult to formulate and computationally expensive. In this paper, we present techniques to overcome these problems. First, we present a tool called parComp, which infers the composite behavior of a collection of modules; we then present an algorithm called conCur, which analyzes the inferred behavior to detect the seriality of two actions. Simple heuristics are presented for abstracting the inferred behavioral descriptions and improving the efficiency of conCur. parComp and conCur are illustrated in the hopCP framework and implemented in Standard ML of New Jersey. Execution times of the algorithms are reported on a variety of examples, and the results are quite encouraging.
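    The question conCur answers can be phrased abstractly: two actions are serial if one always happens before the other, i.e. they are ordered in a happens-before relation; otherwise they are potentially concurrent. The sketch below illustrates that question as reachability in a happens-before graph; the graph representation is an assumption for the example, not the hopCP formalism or the conCur algorithm itself.

```python
# Conceptual seriality check: actions a and b are serial iff one is
# reachable from the other in the happens-before graph; otherwise they
# are potentially concurrent. Illustrative only, not the hopCP algorithm.

def reachable(graph, src, dst):
    # Depth-first search over happens-before edges.
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False

def serial(graph, a, b):
    # Serial iff ordered either way; otherwise potentially concurrent.
    return reachable(graph, a, b) or reachable(graph, b, a)

# a -> b -> d and a -> c: b and c lie on parallel branches.
hb = {"a": ["b", "c"], "b": ["d"]}
print(serial(hb, "a", "d"))   # True: a always precedes d
print(serial(hb, "b", "c"))   # False: potentially concurrent
```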

    Fully reflective execution environments

    Many programming languages run on top of a virtual machine (VM). VMs are complex pieces of software because they realize the language semantics and provide efficiency, portability, and security. Unfortunately, mainstream VMs are engineered as "black boxes" and provide only minimal means to expose their state and behavior at run time. In this thesis we argue that the lack of interaction between applications and VMs puts a limit on the adaptation capabilities of running applications. To overcome this situation, we introduce the notion of a fully reflective VM: a new kind of VM providing reflection not only at the application level but also at the VM level. In other words, a fully reflective VM supports its own observability and modifiability at run time and enables programming languages to interact with, and adapt, the underlying VM as requirements change. We propose a reference architecture for such VMs and discuss the performance degradation these systems may induce. We then introduce a series of optimizations targeted specifically at this kind of platform. They rest on a key assumption: that the variability of VM behavior tends to be low at run time. Accordingly, we apply standard dynamic compilation techniques such as specialization, speculation, and deoptimization to the VM code itself as a means of mitigating the overheads. To validate our claims, we built two reflective VMs, one featuring a method-based just-in-time (JIT) compiler and the other running on top of a trace-based optimizer. We start our evaluation by analyzing a series of case studies to understand how a reflective VM could deal with unanticipated adaptation scenarios on the fly. Furthermore, we compare our approach with existing language-level alternatives and discuss why a reflective VM would subsume all of them. We then show empirically that our implementations can match the peak performance of standard VMs when their reflective mechanisms are not activated, and that they incur low overheads (compared with existing alternatives) when the VM's reflective capabilities are used. We also analyze how the different compilation strategies (per-method vs. tracing) impact the overall results. Finally, we conduct a series of experiments to study the effects of opening up the compilation module of a reflective VM to applications. We conclude that this is a plausible approach that brings new opportunities for optimizing algorithms for which compiler heuristics fail to give optimal results.
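    The core idea, a VM whose own machinery is observable and modifiable by the programs it runs, can be sketched with a tiny interpreter whose dispatch table is an ordinary mutable object handed to user code. This is a toy illustration under assumed names; the thesis builds full VMs with JIT and trace-based optimizers, not a Python toy.

```python
# Toy illustration of VM-level reflection: the interpreter's dispatch
# table is first-class and mutable, so running programs can observe and
# replace VM behavior on the fly. Names are illustrative only.

class ReflectiveVM:
    def __init__(self):
        # The VM's own machinery, exposed rather than hidden in a black box.
        self.dispatch = {
            "PUSH": lambda vm, st, arg: st.append(arg),
            "ADD":  lambda vm, st, arg: st.append(st.pop() + st.pop()),
            # REFLECT hands the VM itself to user code, which may patch it.
            "REFLECT": lambda vm, st, arg: arg(vm),
        }

    def run(self, program):
        stack = []
        for opcode, arg in program:
            self.dispatch[opcode](self, stack, arg)
        return stack

def make_add_traced(vm):
    # User-level code redefining a VM instruction at run time.
    base = vm.dispatch["ADD"]
    def traced(vm, st, arg):
        print("ADD on", st[-2:])
        base(vm, st, arg)
    vm.dispatch["ADD"] = traced

vm = ReflectiveVM()
prog = [("PUSH", 1), ("PUSH", 2), ("ADD", None),
        ("REFLECT", make_add_traced),
        ("PUSH", 3), ("ADD", None)]
print(vm.run(prog))   # traces only the second ADD; result [6]
```

    Dispatch through a mutable table is what makes the adaptation cheap here; the thesis's optimizations exist precisely to recover the performance such indirection would otherwise cost.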

    Parallel programming paradigms and frameworks in big data era

    With cloud computing emerging as a promising new approach for ad-hoc parallel data processing, major companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and deploy their programs. We have entered the era of Big Data. The explosion and profusion of available data in a wide range of application domains raise new challenges and opportunities in a plethora of disciplines, ranging from science and engineering to biology and business. One major challenge is how to take advantage of the unprecedented scale of data, typically of heterogeneous nature, in order to acquire further insights and knowledge for improving the quality of the offered services. To exploit this new resource, we need to scale up and scale out both our infrastructures and our standard techniques. Our society is already data-rich, but the question remains whether we have the conceptual tools to handle it. In this paper we discuss and analyze opportunities and challenges for efficient parallel data processing. Big Data is the next frontier for innovation, competition, and productivity, and many solutions continue to appear, partly supported by the considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. We review various parallel and distributed programming paradigms, analyze how they fit into the Big Data era, and present modern emerging paradigms and frameworks. To better support practitioners interested in this domain, we end with an analysis of ongoing research challenges toward truly fourth-generation data-intensive science.