387 research outputs found

    The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors

    Get PDF
    What fundamental opportunities for scalability are latent in interfaces, such as system call APIs? Can scalability opportunities be identified even before any implementation exists, simply by considering interface specifications? To answer these questions this paper introduces the following rule: Whenever interface operations commute, they can be implemented in a way that scales. This rule aids developers in building more scalable software starting from interface design and carrying on through implementation, testing, and evaluation. To help developers apply the rule, a new tool named Commuter accepts high-level interface models and generates tests of operations that commute and hence could scale. Using these tests, Commuter can evaluate the scalability of an implementation. We apply Commuter to 18 POSIX calls and use the results to guide the implementation of a new research operating system kernel called sv6. Linux scales for 68% of the 13,664 tests generated by Commuter for these calls, and Commuter finds many problems that have been observed to limit application scalability. sv6 scales for 99% of the tests.Engineering and Applied Science

    Extempore: The design, implementation and application of a cyber-physical programming language

    Get PDF
    There is a long history of experimental and exploratory programming supported by systems that expose interaction through a programming language interface. These live programming systems enable software developers to create, extend, and modify the behaviour of executing software by changing source code without perceptual breaks for recompilation. These live programming systems have taken many forms, but have generally been limited in their ability to express low-level programming concepts and the generation of efficient native machine code. These shortcomings have limited the effectiveness of live programming in domains that require highly efficient numerical processing and explicit memory management. The most general questions addressed by this thesis are what a systems language designed for live programming might look like and how such a language might influence the development of live programming in performance sensitive domains requiring real-time support, direct hardware control, or high performance computing. This thesis answers these questions by exploring the design, implementation and application of Extempore, a new systems programming language, designed specifically for live interactive programming

    Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform

    Get PDF
    Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang programming language has a well-established and influential model. While the Erlang model conceptually provides reliable scalability, it has some inherent scalability limits and these force developers to depart from the model at scale. This article establishes the scalability limits of Erlang systems and reports the work of the EU RELEASE project to improve the scalability and understandability of the Erlang reliable distributed actor model. We systematically study the scalability limits of Erlang and then address the issues at the virtual machine, language, and tool levels. More specifically: (1) We have evolved the Erlang virtual machine so that it can work effectively in large-scale single-host multicore and NUMA architectures. We have made important changes and architectural improvements to the widely used Erlang/OTP release. (2) We have designed and implemented Scalable Distributed (SD) Erlang libraries to address language-level scalability issues and provided and validated a set of semantics for the new language constructs. (3) To make large Erlang systems easier to deploy, monitor, and debug, we have developed and made open source releases of five complementary tools, some specific to SD Erlang. Throughout the article we use two case studies to investigate the capabilities of our new technologies and tools: a distributed hash table based Orbit calculation and Ant Colony Optimisation (ACO). Chaos Monkey experiments show that two versions of ACO survive random process failure and hence that SD Erlang preserves the Erlang reliability model. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6,144 cores). Even for programs with no global recovery data to maintain, SD Erlang partitions the network to reduce network traffic and hence improves performance of the Orbit and ACO benchmarks above 80 hosts. ACO measurements show that maintaining global recovery data dramatically limits scalability; however, scalability is recovered by partitioning the recovery data. We exceed the established scalability limits of distributed Erlang, and do not reach the limits of SD Erlang for these benchmarks at this scal

    Traçage de logiciels bénéficiant d'accélération graphique

    Get PDF
    RÉSUMÉ En programmation, les récents changements d'architecture comme les processeurs à plusieurs cœurs de calcul rendirent la synchronisation des tâches qui y sont exécuté plus complexe à analyser. Pour y remédier, des outils de traçage comme LTTng furent implémentés dans l'optique de fournir des outils d'analyse de processus tout en gardant en tête les défis qu'implique les systèmes multi-cœur. Une seconde révolution dans le monde de l'informatique, les accélérateurs graphiques, créa alors un autre besoin de traçage. Les manufacturiers d'accélérateurs graphiques fournirent alors des outils d'analyse pour accélérateurs graphiques. Ces derniers permettent d'analyser l'exécution de commandes sur les accélérateurs graphiques. Ce mémoire apporte une solution au manque d'outil de traçage unifié entre le système hôte (le processeur central (CPU)) et l'exécution de noyaux de calcul OpenCL sur le périphérique (l'accélérateur graphique (GPU)). Par unifié, nous référons à la capacité d'un outil de prise de traces à collecter une trace du noyau de l'hôte sur lequel un périphérique d'accélération graphique est présent en plus de la trace d'exécution du périphérique d'accélération graphique. L'objectif initial principal de ce mémoire avait été défini comme suit: fournir un outil de traçage et les méthodes d’analyse qui permettent d'acquérir simultanément les traces de l’accélérateur graphique et du processeur central. En plus de l'objectif principal, les objectifs secondaires ajoutaient des critères de performance et de visualisation des traces enregistrés par la solution que ce mémoire présente. Les différentes notions de recherche explorés ont permis d'établir de hypothèses de départ. Ces dernières mentionnaient que le format de trace Common Trace Format (CTF) semblait permettre l'enregistrent de traces à faible surcoût et que des travaux précédents permettront d'effectuer la synchronisation entre les différents espaces temporels du CPU et du GPU. La solution présentée, OpenCL User Space Tracepoint (CLUST) consiste en une librairie qui remplace les symboles de la librairie de calcul GPGPU OpenCL. Pour l'utiliser, elle doit être chargée dynamiquement avant de lancer le programme à tracer. Elle instrumente ensuite toutes les fonctions d'OpenCL grâce aux points de trace LTTng-UST, permettant alors d'enregistrer les appels et de gérer les événements asynchrones communs aux GPUs. La performance de la librairie faisant partie des objectifs de départ, une analyse de la performance des différents cas d'utilisation de cette dernière démontre son faible surcoût : pour les charges de travail d'une taille raisonnable, un surcoût variant entre 0.5 % et 2 % fut mesuré. Cet accomplissement ouvre la porte à plusieurs cas d'utilisation. Effectivement, considérant le faible surcoût d'utilisation, CLUST ne représente pas seulement un outil qui permet l'acquisition de traces pour aider au développement de programmes mais peut aussi servir en tant qu'enregistreur permanent dans les systèmes critiques. La fonction "d'enregistreur de vol" de LTTng permet d'enregistrer une trace au disque seulement lorsque requis : l'ajout de données concernant l'état du GPU peut se révéler être un précieux avantage pour diagnostiquer un problème sur un serveur de production. Le tout sans ralentir le système de façon significative.----------ABSTRACT In the world of computing, programmers now have to face the complex challenges that multi-core processors have brought. To address this problem, tracing frameworks such as LTTng were implemented to provide tools to analyze multi-core systems without adding a major overhead on the system. Recently, Graphical Processing Units (GPUs) started a new revolution: General Purpose Graphical Processing Unit (GPGPU) computing. This allows programs to offload their parallel computation sections to the ultra parallel architecture that GPUs offer. Unfortunately, the tracing tools that were provided by the GPU manufacturers did not interoperate with CPU tracing. We propose a solution, OpenCL User Space Tracepoint (CLUST), that enables OpenCL GPGPU computing tracing as an extension to the LTTng kernel tracer. This allows unifying the CPU trace and the GPU trace in one efficient format that enables advanced trace viewing and analysis, to include both models in the analysis and therefore provide more information to the programmer. The objectives of this thesis are to provide a low overhead unified CPU-GPU tracing extension of LTTng, the required algorithms to perform trace domain synchronization between the CPU and the GPU time source domain, and provide a visualization model for the unified traces. As foundation work, we determined that already existing GPU tracing techniques could incorporate well with LTTng, and that trace synchronization algorithms already presented could be used to synchronize the CPU trace with the GPU trace. Therefore, we demonstrate the low overhead characteristics of the CLUST tracing library for typical applications under different use cases. The unified CPU-GPU tracing overhead is also measured to be insignificant (less than 2%) for a typical GPGPU application. Moreover, we use synchronization methods to determine the trace domain synchronization value between both traces. This solution is a more complete and robust implementation that provides the programmer with the required tools, never before implemented, in the hope of helping programmers develop more efficient OpenCL applications

    Virtualization of network I/O on modern operating systems

    Get PDF
    Network I/O of modern operating systems is incomplete. In this networkage, users and their applications are still unable to control theirown traffic, even on their local host. Network I/O is a sharedresource of a host machine, and traditionally, to address problemswith a shared resource, system research has virtualized the resource.Therefore, it is reasonable to ask if the virtualization can providesolutions to problems in network I/O of modern operating systems, inthe same way as the other components of computer systems, such asmemory and CPU. With the aim of establishing the virtualization ofnetwork I/O as a design principle of operating systems, thisdissertation first presents a virtualization model, hierarchicalvirtualization of network interface. Systematic evaluation illustratesthat the virtualization model possesses desirable properties forvirtualization of network I/O, namely flexible control granularity,resource protection, partitioning of resource consumption, properaccess control and generality as a control model. The implementedprototype exhibits practical performance with expected functionality,and allowed flexible and dynamic network control by users andapplications, unlike existing systems designed solely for systemadministrators. However, because the implementation was hardcoded inkernel source code, the prototype was not perfect in its functionalcoverage and flexibility. Accordingly, this dissertation investigatedhow to decouple OS kernels and packet processing code throughvirtualization, and studied three degrees of code virtualization,namely, limited virtualization, partial virtualization, and completevirtualization. In this process, a novel programming model waspresented, based on embedded Java technology, and the prototypeimplementation exhibited the following characteristics, which aredesirable for network code virtualization. First, users program inJava to carry out safe and simple programming for packetprocessing. Second, anyone, even untrusted applications, can performinjection of packet processing code in the kernel, due to isolation ofcode execution. Third, the prototype implementation empirically provedthat such a virtualization does not jeopardize system performance.These cases illustrate advantages of virtualization, and suggest thatthe hierarchical virtualization of network interfaces can be aneffective solution to problems in network I/O of modern operatingsystems, both in the control model and in implementation

    Rethinking Timestamping: Time Stamp Counter Design for Virtualized Environment

    Full text link
    Almost every processor supports Time Stamp Counter (TSC), which is a hardware register that increments its value every clock cycle. Due to its high resolution and accessibility, TSC is now widely used for a variety tasks that need time measurements such as wall clock, code benchmarking, or metering hardware usage for account billing. However, if not carefully configured and interpreted, TSC-based time measurements can yield inaccurate readings. For instance, modern CPU may dynamically change its frequency or enter into low-power states. Also, time spent on scheduling events, system calls, page faults, etc. should be correctly accounted for. Even more complications arise when TSC measurements are done in virtual environments; virtual machines, on which TSC readings are taken, can be suspended, migrated, and scheduled on a machine with different clock rate and performance. In production virtualization systems, some management tasks are executed inside guests on behalf of the management system, effectively consuming end-user’s CPU time, which we believe should be excluded from end-user billing. We argue that the main problem with current TSC is that its hardware semantic is too vague to serve as a multi-purpose time source. In this thesis, we propose an improved TSC design, called Caviar, to address most of the issues. Caviar extends existing TSC hardware interface by adding a control-register based configuration interface through which a system can set up secondary TSCs whose behavior should be correct when accessed in a localized execution context including virtualized environment. We experimentally confirmed inaccurate readings with current TSC by conducting a series of TSC measurements on various x86 platforms, including virtualized cloud computing servers. We analyzed some of the results and argue that how our proposed solution can fix the problems. In conclusion, we believe that the simple interface of Caviar can solve most of current TSC complications, be implemented with minimal hardware cost, and be adopted easily by system software

    Least space-time first scheduling algorithm : scheduling complex tasks with hard deadline on parallel machines

    Get PDF
    Both time constraints and logical correctness are essential to real-time systems and failure to specify and observe a time constraint may result in disaster. Two orthogonal issues arise in the design and analysis of real-time systems: one is the specification of the system, and the semantic model describing the properties of real-time programs; the other is the scheduling and allocation of resources that may be shared by real-time program modules. The problem of scheduling tasks with precedence and timing constraints onto a set of processors in a way that minimizes maximum tardiness is here considered. A new scheduling heuristic, Least Space Time First (LSTF), is proposed for this NP-Complete problem. Basic properties of LSTF are explored; for example, it is shown that (1) LSTF dominates Earliest-Deadline-First (EDF) for scheduling a set of tasks on a single processor (i.e., if a set of tasks are schedulable under EDF, they are also schedulable under LSTF); and (2) LSTF is more effective than EDF for scheduling a set of independent simple tasks on multiple processors. Within an idealized framework, theoretical bounds on maximum tardiness for scheduling algorithms in general, and tighter bounds for LSTF in particular, are proven for worst case behavior. Furthermore, simulation benchmarks are developed, comparing the performance of LSTF with other scheduling disciplines for average case behavior. Several techniques are introduced to integrate overhead (for example, scheduler and context switch) and more realistic assumptions (such as inter-processor communication cost) in various execution models. A workload generator and symbolic simulator have been implemented for comparing the performance of LSTF (and a variant -- LSTF+) with that of several standard scheduling algorithms. LSTF\u27s execution model, basic theories, and overhead considerations have been defined and developed. Based upon the evidence, it is proposed that LSTF is a good and practical scheduling algorithm for building predictable, analyzable, and reliable complex real-time systems. There remain some open issues to be explored, such as relaxing some current restrictions, discovering more properties and theorems of LSTF under different models, etc. We strongly believe that LSTF can be a practical scheduling algorithm in the near future

    Monitoring Architecture for Real Time Systems

    Get PDF
    It can be hard to understand how an operating system - and software in general - reached a certain output just by looking at said output. A simple approach is to use loggers, or simple print statements on some specific critical areas, however that is an approach that does not scale very well in a consistent and manageable way. The purpose of this thesis is to propose and develop a tool - a Monitoring Tool - capable of capturing and recording the execution of a given application with minimal intrusion in the context of real-time embedded systems, namely using a space-qualified version of the RTEMS real-time operating system, and making that information available for further processing and analysis. Multicore environments are also considered. The current state of the art in monitoring and execution tracing is presented, featuring both a literature review and a discussion of existing tools and frameworks. Using an implementation of the proposed architecture, the tool was tested in both unicore and multicore configurations in both sparc and arm architectures, and was able to record execution data of a sample application, with varying degrees of verbosity.Nem sempre é fácil perceber como é que um sistema operativo - e software em geral - chegaram a determinado resultado apenas olhando para este. A abordagem normal é usar registos, ou pequenas impressões em locais estratégicos do código, no entanto esta abordagem não é escalável de forma consistente e sustentada. O propósito desta tese é o de propor e desenvolver uma ferramenta - uma ferramenta de monitorização - capaz de capturar e registar a execução de uma dada aplicação com o mínimo de impacto no contexto de sistemas embebidos de tempo-real, nomeadamente usando uma versão do sistema operativo de tempo-real Real-Time Executive for Multiprocessor Systems (RTEMS) qualificada para o espaço, e colocando essa informação à disposição para processamento e análise futura. Ambientes com múltiplos núcleos de processamento são também considerados. O atual estado da arte em monitorização e registo de execução de software é apresentado, destacando tanto exemplos da literatura como ferramentas e frameworks existentes. Usando uma implementação da arquitetura proposta, a ferramenta foi testada em configurações com um ou mais núcleos de processamento em arquiteturas sparc e arm, tendo sido capaz de registar e gravar dados da execução de uma aplicação de exemplo, como vários níveis de detalhe
    corecore