    Open Programming Language Interpreters

    Context: This paper presents the concept of open programming language interpreters and the implementation of a framework-level metaobject protocol (MOP) to support them. Inquiry: We address the problem of dynamic interpreter adaptation, both to tailor the interpreter's behavior to the task to be solved and to introduce new features that fulfill unforeseen requirements. Many languages provide a MOP that supports reflection to some degree. However, MOPs are typically language-specific, their reflective functionality is often restricted, and the adaptation and application logic are often mixed, which makes the source code harder to understand and maintain. Our system overcomes these limitations. Approach: We designed and implemented a system to support open programming language interpreters. The prototype implementation is integrated into the Neverlang framework. The system exposes the structure, behavior and runtime state of any Neverlang-based interpreter, together with the ability to modify them. Knowledge: Our system provides complete control over an interpreter's structure, behavior and runtime state. The approach is applicable to every Neverlang-based interpreter. Adaptation code can potentially be reused across different language implementations. Grounding: With a prototype implementation in hand, we focused on evaluating feasibility. The paper shows that our approach addresses problems commonly found in the research literature well. We provide a demonstration video and examples that illustrate our approach on dynamic software adaptation, aspect-oriented programming, debugging and context-aware interpreters. Importance: To our knowledge, our paper presents the first reflective approach targeting a general framework for language development. Our system provides full reflective support for free to any Neverlang-based interpreter. We are not aware of any prior application of open implementations to programming language interpreters in the sense defined in this paper. Rather than replacing other approaches, we believe our system can be used as a complementary technique in situations where other approaches present serious limitations.

    Cautiously Optimistic Program Analyses for Secure and Reliable Software

    Modern computer systems still have various security and reliability vulnerabilities. Well-known dynamic analysis solutions can mitigate them using runtime monitors that serve as lifeguards. But the additional work of enforcing these security and safety properties incurs exorbitant performance costs, and such tools are rarely used in practice. Our work addresses this problem by constructing a novel technique: Cautiously Optimistic Program Analysis (COPA). COPA is optimistic: it infers likely program invariants from dynamic observations and assumes them in its static reasoning to precisely identify and elide wasteful runtime monitors. The resulting system is fast, but it also ensures soundness by recovering to a conservatively optimized analysis in the rare case that a likely invariant fails at runtime. COPA is also cautious: by carefully restricting optimizations to only safe elisions, it greatly simplifies the recovery. It avoids unbounded rollbacks upon recovery, thereby enabling analysis of live production software. We demonstrate the effectiveness of Cautiously Optimistic Program Analyses in three areas. Information-Flow Tracking (IFT) can help prevent security breaches and information leaks, but it is rarely used in practice due to its high performance overhead (>500% for web/email servers). COPA dramatically reduces this cost by eliding wasteful IFT monitors, making IFT practical (9% overhead, 4x speedup). Automatic Garbage Collection (GC) in managed languages (e.g. Java) simplifies programming tasks while ensuring memory safety. However, there is no correct GC for weakly-typed languages (e.g. C/C++), and manual memory management is prone to errors that have been exploited in high-profile attacks. We develop the first sound GC for C/C++ and use COPA to optimize its performance (16% overhead). Sequential Consistency (SC) provides intuitive semantics for concurrent programs, which simplifies reasoning about their correctness. However, ensuring SC behavior on commodity hardware remains expensive. We use COPA to ensure SC for Java efficiently at the language level, significantly reducing its cost (from 24% down to 5% on x86). COPA provides a way to realize strong software security, reliability and semantic guarantees at practical costs.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/170027/1/subarno_1.pd

    Achieving High Performance and High Productivity in Next Generational Parallel Programming Languages

    Processor design has turned toward parallelism and heterogeneous cores to achieve performance and energy efficiency. Developers find high-level languages attractive because they use abstraction to offer productivity and portability over hardware complexities. To achieve performance, some modern implementations of high-level languages use work-stealing scheduling for load balancing of dynamically created tasks. Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism, and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. These overheads arise as a necessary side effect of the implementation and hamper parallel performance. In addition to runtime-imposed overheads, there is a substantial cognitive load associated with ensuring that parallel code is data-race free. This dissertation explores the overheads associated with achieving high-performance parallelism in modern high-level languages. My thesis is that, by exploiting existing underlying mechanisms of managed runtimes and by extending existing language design, high-level languages will be able to deliver productivity and parallel performance at the levels necessary for widespread uptake. The key contributions of my thesis are: 1) a detailed analysis of the key sources of overhead associated with a work-stealing runtime, namely sequential and dynamic overheads; 2) novel techniques to reduce these overheads that use rich features of managed runtimes such as the yieldpoint mechanism, on-stack replacement, dynamic code-patching, exception handling support, and return barriers; 3) a comprehensive analysis of the resulting benefits, which demonstrates that work-stealing overheads can be significantly reduced, leading to substantial performance improvements; and 4) a small set of language extensions that achieve both high performance and high productivity with minimal programmer effort. A managed runtime forms the backbone of any modern implementation of a high-level language. Managed runtimes enjoy the benefits of a long history of research, and their implementations are highly optimized. My thesis demonstrates that combining these highly optimized features with the expressiveness of high-level languages gives further hope for achieving high performance and high productivity on modern parallel hardware.
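
    The scheduling policy described above can be illustrated with a minimal, single-threaded sketch. It assumes the common arrangement in which each worker owns a double-ended queue, takes its own newest task from one end, and lets idle workers steal the oldest task from the other end. The names `Worker` and `run` are hypothetical, tasks are plain labels that are logged rather than executed, and none of the managed-runtime mechanisms from the thesis (yieldpoints, on-stack replacement, return barriers) are modeled.

```python
from collections import deque


class Worker:
    """A worker with its own task deque (hypothetical illustration)."""

    def __init__(self, name):
        self.name = name
        self.tasks = deque()

    def push(self, task):
        # The owner adds new work at the bottom of its deque.
        self.tasks.append(task)

    def pop(self):
        # The owner takes its most recently pushed task (LIFO).
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # A thief takes the victim's oldest task (FIFO end).
        return victim.tasks.popleft() if victim.tasks else None


def run(workers):
    """Round-robin simulation: run own work, or steal when idle."""
    log = []
    while any(w.tasks for w in workers):
        for w in workers:
            task = w.pop()
            if task is None:
                # Idle worker: try each other worker as a victim.
                for victim in workers:
                    if victim is not w:
                        task = w.steal_from(victim)
                        if task is not None:
                            break
            if task is not None:
                log.append((w.name, task))
    return log
```

    With one worker holding four tasks and a second worker idle, the idle worker repeatedly steals from the far end of the busy worker's deque, so the two never contend for the same task in this simulation.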

    On the fly type specialization without type analysis

    Dynamically typed programming languages such as JavaScript and Python defer type checking to run time. In order to maximize performance, dynamic language virtual machine implementations must attempt to eliminate redundant dynamic type checks. This is typically done using type inference analysis. However, type inference analyses are often costly and involve tradeoffs between compilation time and resulting precision. This has led to the creation of increasingly complex multi-tiered VM architectures. We introduce lazy basic block versioning, a simple just-in-time compilation technique which effectively removes redundant type checks from critical code paths. This novel approach lazily generates type-specialized versions of basic blocks on the fly while propagating context-dependent type information. It does not require the use of costly program analyses, is not restricted by the precision limitations of traditional type analyses, and avoids the implementation complexity of speculative optimization techniques. Three extensions are made to the basic block versioning technique in order to give it interprocedural optimization capabilities. Typed object shapes give it the ability to attach type information to object properties and global variables. Entry point specialization allows it to pass type information from callers to callees, and call continuation specialization makes it possible to pass return value type information back to callers without dynamic overhead. We empirically demonstrate that these extensions enable basic block versioning to exceed the capabilities of static whole-program type analyses.
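
    A toy model may help make the idea of lazily generated, type-specialized block versions concrete. The sketch below is not the thesis's compiler: it assumes straight-line chains of blocks, a single type tag (`int`), and a made-up operation format in which `("check_int", var)` stands for a dynamic type test. The class name `Versioner` and its methods are hypothetical. A version is created only when a block is first reached with a given set of known types, and tests on variables already known to be integers are left out of that version.

```python
class Versioner:
    """Toy model of lazy basic block versioning (hypothetical names)."""

    def __init__(self, blocks):
        # blocks: name -> (list of ops, successor name or None)
        self.blocks = blocks
        # (block name, frozenset of known-int variables) -> (ops, known-out)
        self.versions = {}

    def get_version(self, name, known):
        key = (name, frozenset(known))
        if key not in self.versions:  # compile lazily, on first request
            ops, _ = self.blocks[name]
            known_now = set(known)
            out = []
            for op in ops:
                if op[0] == "check_int":
                    if op[1] in known_now:
                        continue  # type already known: test is redundant
                    known_now.add(op[1])  # once the test passes, type is known
                out.append(op)
            self.versions[key] = (out, frozenset(known_now))
        return self.versions[key]

    def run(self, entry):
        """Follow the block chain; return how many type tests remain."""
        name, known, checks = entry, frozenset(), 0
        while name is not None:
            ops, known = self.get_version(name, known)
            checks += sum(1 for op in ops if op[0] == "check_int")
            name = self.blocks[name][1]
        return checks
```

    For a block `A` that tests `x` followed by a block `B` that tests `x` and `y`, the version of `B` generated for the context "`x` is an integer" keeps only the test on `y`, so two tests remain instead of three.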

    Understanding Performance Inefficiencies In Native And Managed Languages

    Production software packages have become increasingly complex with millions of lines of code, sophisticated control and data flow, and references to a hierarchy of external libraries. This complexity often introduces performance inefficiencies across software stacks, making it practically impossible for users to pinpoint them manually. Performance profiling tools (a.k.a. profilers) abound in the tools community to aid software developers in understanding program behavior. Classical profiling techniques focus on identifying hotspots. The hotspot analysis is indispensable; however, it can hardly diagnose whether a resource is being used in a productive manner that contributes to the overall efficiency of a program. Consequently, a significant burden is on developers to make a judgment call on whether there is scope to optimize a hotspot. Derived metrics, e.g., cache miss ratio, offer slightly better intuition into hotspots but are still not panaceas. Hence, there is a need for profilers that investigate resource wastage instead of usage. To overcome the critical missing pieces in prior work and complement existing profilers, we propose novel fine- and coarse-grained profilers to pinpoint varieties of performance inefficiencies and provide optimization guidance for a wide range of software covering benchmarks, enterprise applications, and large-scale parallel applications running on supercomputers and data centers. Fine-grained profilers are indispensable to understand performance inefficiencies comprehensively. We propose a whole-program profiler called LoadSpy, which works on binary executables to detect and quantify wasteful memory operations in their context and scope. 
Our observation, which is supported by numerous case studies, is that wasteful memory operations are often an indicator of various forms of performance inefficiencies, such as suboptimal choices of algorithms or data structures, missed compiler optimizations, and developers' inattention to performance. Guided by LoadSpy, we are able to optimize a large number of well-known benchmarks and real-world applications, yielding significant speedups. Despite the deep performance insights offered by fine-grained profilers, their high overhead keeps them from widespread adoption, particularly in production. By contrast, coarse-grained profilers introduce low overhead at the cost of poor performance insights. Hence, another research topic is how to benefit from both, that is, how to combine the deep insights of fine-grained profilers with the low overhead of coarse-grained ones. Our first effort in this direction is a lightweight profiler called JXPerf. It abandons heavyweight instrumentation by combining hardware performance monitoring units and debug registers available in commodity CPUs to detect wasteful memory operations. Compared with LoadSpy, JXPerf reduces the runtime overhead from 10x to 7% on average. Its lightweight nature makes it useful in production. Our second effort is a lightweight profiler called FVSampler, the first nonintrusive profiler to study function execution variance.
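
    To give a rough sense of what "wasteful memory operation" can mean, the sketch below counts one possible kind of waste in a recorded trace: a load that returns the same value as the previous load from the same address. This definition, the trace format `(op, address, value)`, and the function name `redundant_load_fraction` are assumptions made for illustration. They are not taken from LoadSpy or JXPerf, which work on binaries and hardware counters rather than Python tuples.

```python
_MISSING = object()  # sentinel: address has not been loaded before


def redundant_load_fraction(trace):
    """Fraction of loads that re-read an unchanged value.

    trace: iterable of (op, address, value) with op in {"load", "store"}.
    A load counts as redundant when the previous load from the same
    address returned the same value (an assumed, simplified definition).
    """
    last_loaded = {}
    loads = 0
    redundant = 0
    for op, addr, value in trace:
        if op != "load":
            continue  # stores are ignored; only loaded values are compared
        loads += 1
        if last_loaded.get(addr, _MISSING) == value:
            redundant += 1
        last_loaded[addr] = value
    return redundant / loads if loads else 0.0
```

    A high fraction on a hot loop would suggest, under this assumed definition, that the code keeps re-reading data it already has, which is the kind of symptom the abstract associates with poor algorithm or data structure choices.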

    Vectorization system for unstructured codes with a Data-parallel Compiler IR

    With Dennard scaling coming to an end, Single Instruction Multiple Data (SIMD) execution offers a way to improve the compute throughput of CPUs. One fundamental technique in SIMD code generators is the vectorization of data-parallel code regions. This has applications in outer-loop vectorization, whole-function vectorization and vectorization of explicitly data-parallel languages. This thesis makes contributions to the reliable vectorization of data-parallel code regions with unstructured, reducible control flow. A control-flow graph is reducible when every loop has exactly one entry point, which is the case in practice. We present P-LLVM, a novel, full-featured intermediate representation for vectorizers that provides a semantics for the code region at every stage of the vectorization pipeline. Partial control-flow linearization is a novel partial if-conversion scheme; if-conversion is an essential technique for vectorizing divergent control flow. Unlike prior techniques, partial linearization has linear running time, does not insert additional branches or blocks, and gives proven guarantees on the control flow retained. Divergence of control induces value divergence at join points in the control-flow graph (CFG). We present a novel control-divergence analysis for directed acyclic graphs with optimal running time and prove that it is correct and precise under common static assumptions. We extend this technique to obtain a quadratic-time control-divergence analysis for arbitrary reducible CFGs. For this analysis, we show on a range of realistic examples how earlier approaches are either less precise or incorrect. We present a feature-complete divergence analysis for P-LLVM programs. The analysis is the first to analyze stack-allocated objects in an unstructured control setting. Finally, we generalize single-dimensional vectorization of outer loops to multi-dimensional tensorization of loop nests.
SIMD targets benefit from tensorization through more opportunities for reuse of loaded values and more efficient memory access behavior. The techniques were implemented in the Region Vectorizer (RV) for vectorization and TensorRV for loop-nest tensorization. Our evaluation validates that the general-purpose RV vectorization system matches the performance of more specialized approaches. RV performs on par with the ISPC compiler, which only supports its structured domain-specific language, on a range of tree traversal codes with complex control flow. RV is able to outperform the loop vectorizers of state-of-the-art compilers, as we show for the SPEC2017 nab_s benchmark and the XSBench proxy application.
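
    For readers unfamiliar with if-conversion, the following sketch shows the basic transformation on a small branch, with Python lists standing in for SIMD lanes. It is plain, full if-conversion and not the partial linearization algorithm of the thesis: both sides of the branch are computed for every lane and a per-lane mask selects the result. The function names `scalar` and `if_converted` are hypothetical.

```python
def scalar(x):
    """Original code with divergent control flow."""
    if x > 0:
        return x * 2
    return x - 1


def if_converted(xs):
    """The same computation for a whole vector of lanes, without a branch."""
    mask = [x > 0 for x in xs]        # per-lane branch condition
    then_vals = [x * 2 for x in xs]   # "then" side, computed for all lanes
    else_vals = [x - 1 for x in xs]   # "else" side, computed for all lanes
    # Blend: each lane keeps the value from the side its mask selects.
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]
```

    The cost of this scheme is that every lane pays for both sides of the branch. Retaining branches where that is safe, as the abstract describes for partial linearization, is one way to avoid part of that cost.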

    Design and evaluation of a Thread-Level Speculation runtime library

    In the coming years, machines with hundreds or even thousands of processors will very likely become commonplace. To take advantage of these machines, and given the difficulty of parallel programming, it would be desirable to have compilation or runtime systems that extract all the available parallelism from existing applications. Accordingly, many parallelization techniques have been proposed in recent years. However, most of them focus on simple codes, that is, codes without dependences between their instructions. Speculative parallelization emerges as a solution for complex codes, enabling the parallel execution of any kind of code, with or without dependences. This technique optimistically assumes that the parallel execution of any code will not produce errors, and it therefore needs a mechanism that detects any kind of conflict. To this end, it relies on a monitor that continuously checks that the execution is not erroneous, ensuring that the results obtained in parallel are consistent with those of a sequential execution. If the execution turns out to be erroneous, the threads are stopped and restarted to ensure that the execution follows sequential semantics. Our contributions in this field include (1) a new, easy-to-use speculative execution library; (2) new proposals that significantly reduce the number of accesses required by speculative operations, together with advice for reducing memory usage; (3) proposals to improve scheduling methods, focused on the dynamic management of the blocks of iterations used in speculative executions; (4) a hybrid solution that uses transactional memory to implement the critical sections of a speculative parallelization library; and (5) an analysis of speculative techniques on one of the most cutting-edge devices of the moment, the Intel Xeon Phi coprocessors. As we have been able to verify, speculative parallelization is an active research field. Our results show that this technique yields performance improvements in a large number of applications. We therefore hope that this work will help ease the adoption of speculative solutions in commercial compilers and/or shared-memory parallel programming models.
    Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos
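
    The execute, check, and restart cycle described in the abstract can be sketched as a sequential simulation. The code below is an illustrative model under stated assumptions and is not the library presented in the thesis: iterations in a window all read from the same snapshot, as if they had run concurrently, and are then committed in order. An iteration that read a location written by an earlier iteration of the same window is treated as a violation and is re-executed in the next window. The name `speculative_run` and the `read`/`write` callback interface are hypothetical.

```python
def speculative_run(data, body, n_iters, window):
    """Simulate speculative execution of loop iterations in windows.

    data:   dict acting as shared memory (updated in place)
    body:   function body(i, read, write) for iteration i
    Returns the number of detected dependence violations.
    """
    violations = 0
    i = 0
    while i < n_iters:
        batch = range(i, min(i + window, n_iters))
        snapshot = dict(data)  # every iteration in the batch sees this state
        logs = []
        for it in batch:
            reads, writes = set(), {}

            def read(key, reads=reads, writes=writes):
                if key in writes:      # an iteration sees its own writes
                    return writes[key]
                reads.add(key)
                return snapshot[key]

            def write(key, value, writes=writes):
                writes[key] = value    # buffered until commit

            body(it, read, write)
            logs.append((reads, writes))

        # Commit in iteration order; stop at the first stale read.
        written = set()
        committed = 0
        for reads, writes in logs:
            if reads & written:
                violations += 1        # squash this and later iterations
                break
            data.update(writes)
            written.update(writes)
            committed += 1
        i += committed  # the first iteration always commits, so i advances
    return violations
```

    Because the first iteration of every window can never conflict, the loop always makes progress, and the final contents of `data` match a sequential run. A loop with a dependence chain between consecutive iterations degenerates to one committed iteration per window, which illustrates why the abstract stresses scheduling and access overheads.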

    Functional Programming for Embedded Systems

    Embedded systems application development has traditionally been carried out in low-level, machine-oriented programming languages like C or Assembler, which can result in unsafe, error-prone and difficult-to-maintain code. Functional programming, with features such as higher-order functions, algebraic data types, polymorphism, strong static typing and automatic memory management, appears to be an ideal candidate to address the issues with low-level languages plaguing embedded systems. However, embedded systems usually run on heavily memory-constrained devices with memory on the order of hundreds of kilobytes, and applications running on such devices share the general characteristics of being (i) I/O-bound, (ii) concurrent and (iii) timing-aware. Popular functional language compilers and runtimes either do not fare well with such scarce memory resources or do not provide high-level abstractions that address all three of the listed characteristics. This work attempts to address this gap by investigating and proposing high-level abstractions specialised for I/O-bound, concurrent and timing-aware embedded-systems programs. We implement the proposed abstractions on eagerly-evaluated, statically-typed functional languages running natively on microcontrollers. Our contributions are divided into two parts. Part 1 presents a functional reactive programming language, Hailstorm, that tracks side effects like I/O in its type system using a feature called resource types. Hailstorm's programming model is illustrated on the GRiSP microcontroller board. Part 2 comprises two papers that describe the design and implementation of Synchron, a runtime API that provides a uniform message-passing framework for the handling of software messages as well as hardware interrupts. Additionally, the Synchron API supports a novel timing operator to capture the notion of time, common in embedded applications. The Synchron API is implemented as a virtual machine, SynchronVM, that runs on the NRF52 and STM32 microcontroller boards. We present programming examples that illustrate the concurrency, I/O and timing capabilities of the VM and provide various benchmarks on the response time, memory and power usage of SynchronVM.

    Benchmark-driven Software Performance Optimization

    Software systems are an integral part of modern society. As we continue to harness software automation in all aspects of our daily lives, the runtime performance of these systems becomes increasingly important. When everything seems just a click away, performance issues that compromise the responsiveness of a system can lead to severe financial and reputational losses. Designing efficient code is critical for ensuring good and consistent performance of software systems. It requires performance expertise and encompasses a set of difficult design decisions that need to be continuously revisited throughout the evolution of the software. Developers must test the performance of their core implementations, select efficient data structures and algorithms, and explore parallel processing when it provides performance benefits, among many other aspects. Furthermore, the constant pressure for high productivity placed on developers, combined with the increasing complexity of modern software, makes designing efficient code an even more challenging endeavor. This thesis presents a series of novel approaches based on empirical insights that aim to support developers in the task of designing efficient code. We present contributions in three areas. First, we investigate the prevalence and impact of bad practices in performance benchmarks of Java-based open-source software. We show that not only do these bad practices occur frequently, they also often distort the benchmark results substantially. Moreover, we devise a tool that developers can use to automatically identify bad practices during benchmark creation. Second, we design an application-level framework that identifies suboptimal implementations and selects optimized variants at runtime, effectively optimizing the execution time and memory usage of the target application. Furthermore, we investigate the performance of data structures from several popular collection libraries. Our findings show that alternative variants can be selected for substantial performance improvement under specific usage scenarios. Third, we investigate the parallelization of object processing via Java streams. We propose a decision-support framework that leverages machine-learning models trained through a series of benchmarks to identify and report stream pipelines that should be processed in parallel for better performance.