124 research outputs found

    Cooperative hierarchical resource management for efficient composition of parallel software

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 93-96). There cannot be a thriving software industry in the upcoming manycore era unless programmers can compose arbitrary parallel codes without sacrificing performance. We believe that efficient composition of parallel codes is best achieved by exposing unvirtualized hardware resources and sharing them cooperatively across the parallel codes within an application. This thesis presents Lithe, a user-level framework that enables efficient composition of parallel software components. Lithe provides the basic primitives, a standard interface, and a thin runtime that enable parallel codes to efficiently use and share processing resources. Lithe can be inserted underneath the runtimes of legacy parallel software environments to provide bolt-on composability, without changing a single line of the original application code. Lithe can also serve as the foundation for building new parallel abstractions and runtime systems that automatically interoperate with one another. We have built and ported a wide range of interoperable scheduling, synchronization, and domain-specific libraries using Lithe. We show that the modifications needed are small and impose no performance penalty when running each library standalone. We also show that Lithe improves the performance of real-world applications composed of multiple parallel libraries simply by relinking them with the new library binaries. Moreover, the Lithe version of one application even outperformed a third-party expert-tuned implementation by adapting better to different phases of the computation. By Heidi Pan, Ph.D.
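
    The cooperative-sharing idea described above can be illustrated with a toy sketch (all names here, such as Scheduler, grant, and yield_back, are hypothetical and illustrative; this is not the actual Lithe API): a parent scheduler lends hardware contexts to a child scheduler instead of oversubscribing, and the child hands them back when idle.

```python
# Toy model of cooperative hierarchical resource sharing in the spirit of
# Lithe. Names (Scheduler, grant, yield_back, "harts") are illustrative,
# not the real Lithe interface.

class Scheduler:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.harts = 0            # processing resources currently owned
        self.children = []
        if parent:
            parent.children.append(self)

    def grant(self, child, n):
        """Lend n harts to a child scheduler instead of oversubscribing."""
        n = min(n, self.harts)
        self.harts -= n
        child.harts += n
        return n

    def yield_back(self, n):
        """Return idle harts to the parent rather than spinning on them."""
        n = min(n, self.harts)
        self.harts -= n
        if self.parent:
            self.parent.harts += n
        return n

root = Scheduler("openmp")
root.harts = 8                    # e.g. 8 hardware threads from the OS
blas = Scheduler("blas", parent=root)
root.grant(blas, 4)               # an OpenMP region lends half its cores to BLAS
blas.yield_back(4)                # BLAS finishes and hands them back
assert (root.harts, blas.harts) == (8, 0)
```

    The point of the sketch is that resources are explicit and conserved: at no moment do parent and child both believe they own the same hart, which is what prevents oversubscription when libraries are composed.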

    Work Stealing Scheduler for Automatic Parallelization in Faust

    Faust 0.9.10 introduces an alternative to OpenMP-based parallel code generation, using a Work Stealing Scheduler and explicit management of worker threads. This paper explains the new option and presents some benchmarks.
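
    The work-stealing discipline the paper relies on can be sketched in a few lines (illustrative only; this is not Faust's actual scheduler): each worker owns a double-ended queue, pushing and popping tasks at one end, while idle workers steal from the opposite end of a victim's queue.

```python
from collections import deque

# Minimal illustration of work stealing: the owner works LIFO at the
# "bottom" of its deque (good cache locality), while thieves take FIFO
# from the "top" (oldest tasks, typically the largest remaining work).

class Worker:
    def __init__(self):
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)                 # bottom of the deque

    def pop(self):
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        if victim.tasks:
            return victim.tasks.popleft()       # top of the victim's deque
        return None

w0, w1 = Worker(), Worker()
for t in ["dsp_block_a", "dsp_block_b", "dsp_block_c"]:
    w0.push(t)

assert w0.pop() == "dsp_block_c"           # owner works newest-first
assert w1.steal_from(w0) == "dsp_block_a"  # thief takes oldest-first
```

    Operating on opposite ends of the deque is what keeps owner/thief contention low; a real implementation would add the synchronization this single-threaded sketch omits.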

    Argobots: A Lightweight Low-Level Threading and Tasking Framework

    In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to particular applications or architectures, or insufficiently powerful and flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework designed as a portable and performant substrate for high-level programming models and runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with a rich set of controls that allow specialization by end users or high-level programming models. We describe the design, implementation, and performance characterization of Argobots and present integrations with three high-level models: OpenMP, MPI, and colocated I/O services. Evaluations show that (1) Argobots, while providing richer capabilities, is competitive with existing simpler generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency-hiding capabilities; and (4) I/O services with Argobots reduce interference with colocated applications while achieving performance competitive with that of a Pthreads approach.
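
    The core idea of a user-level threading substrate, many lightweight threads (ULTs) multiplexed on an execution stream, can be modeled in a short sketch (this is not the Argobots C API; Python generators stand in for cooperatively scheduled ULTs):

```python
# Toy model of user-level threads (ULTs) multiplexed on an execution stream,
# in the spirit of Argobots. This is NOT the Argobots C API; generators stand
# in for lightweight, cooperatively scheduled ULTs.

def ult(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"            # each yield = a cooperative context switch

class ExecutionStream:
    def __init__(self):
        self.pool = []                 # pool of ready ULTs

    def spawn(self, gen):
        self.pool.append(gen)

    def run(self):
        trace = []
        while self.pool:
            t = self.pool.pop(0)       # pick the next ready ULT
            try:
                trace.append(next(t))  # run it until it yields
                self.pool.append(t)    # requeue: simple round-robin
            except StopIteration:
                pass                   # ULT finished
        return trace

es = ExecutionStream()
es.spawn(ult("A", 2))
es.spawn(ult("B", 1))
assert es.run() == ["A:0", "B:0", "A:1"]
```

    Because switching between ULTs is just resuming a suspended computation in user space, no OS scheduler is involved, which is where the cost advantage over OS-level threads comes from; the pool abstraction is also what lets a high-level runtime plug in its own scheduling policy.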

    Resource aggregation for task-based Cholesky Factorization on top of modern architectures

    This paper was submitted for review to the Parallel Computing special issue for the HCW and HeteroPar'16 workshops. Hybrid computing platforms are now commonplace, featuring a large number of CPU cores and accelerators. This trend makes balancing computations between these heterogeneous resources performance critical. In this paper we propose aggregating several CPU cores in order to execute larger parallel tasks and improve load balancing between CPUs and accelerators. Additionally, we present our approach to exploiting internal parallelism within tasks by combining two runtime-system schedulers: a global runtime system to schedule the main task graph and a local one to cope with internal task parallelism. We demonstrate the relevance of our approach in the context of the dense Cholesky factorization kernel implemented on top of the StarPU task-based runtime system. We present experimental results showing that our solution outperforms state-of-the-art implementations on two architectures: a modern heterogeneous machine and the Intel Xeon Phi Knights Landing.
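
    The task structure underlying a tiled Cholesky factorization can be made concrete with a small sketch. Each blocked operation below (the POTRF-, TRSM-, and SYRK/GEMM-like steps) would be one task handed to a runtime such as StarPU; here they simply run sequentially with NumPy for clarity.

```python
import numpy as np

# Blocked right-looking Cholesky factorization. In a task-based runtime each
# call on a tile becomes one task in the task graph; dependencies follow from
# which tiles each step reads and writes.

def tiled_cholesky(A, b):
    n = A.shape[0]
    L = A.copy()
    for k in range(0, n, b):
        kk = slice(k, k + b)
        L[kk, kk] = np.linalg.cholesky(L[kk, kk])          # POTRF task
        for i in range(k + b, n, b):
            ii = slice(i, i + b)
            # TRSM task: L[ii,kk] <- L[ii,kk] * L[kk,kk]^-T
            L[ii, kk] = np.linalg.solve(L[kk, kk], L[ii, kk].T).T
        for i in range(k + b, n, b):
            ii = slice(i, i + b)
            for j in range(k + b, i + b, b):
                jj = slice(j, j + b)
                # SYRK/GEMM update task on the trailing submatrix
                L[ii, jj] -= L[ii, kk] @ L[jj, kk].T
    return np.tril(L)

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)        # symmetric positive definite
L = tiled_cholesky(A, b=4)
assert np.allclose(L @ L.T, A)
```

    The paper's aggregation idea corresponds to making each of these tile tasks internally parallel, so a task can occupy several CPU cores at once rather than a single one.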

    Le problème de la composition parallèle : une approche supervisée (The parallel composition problem: a supervised approach)

    Enabling HPC applications to perform efficiently when invoking multiple parallel libraries simultaneously is a great challenge. Even if a single runtime system is used underneath, scheduling tasks or threads coming from different libraries over the same set of hardware resources introduces many issues, such as resource oversubscription, undesirable cache flushes, or memory bus contention. This paper presents an extension of StarPU, a runtime system specifically designed for heterogeneous architectures, that allows multiple parallel codes to run concurrently with minimal interference. Such parallel codes run within "scheduling contexts" that provide confined execution environments which can be used to partition computing resources. Scheduling contexts can be dynamically resized to optimize the allocation of computing resources among concurrently running libraries. We introduce a "hypervisor" that automatically expands or shrinks contexts using feedback from the runtime system (e.g., resource utilization). We demonstrate the relevance of our approach using benchmarks that invoke multiple high-performance linear algebra kernels simultaneously on top of heterogeneous multicore machines. We show that our mechanism can dramatically improve the overall application run time (-34%), most notably by reducing the average cache miss ratio (-50%).
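
    The hypervisor's feedback loop can be sketched as follows (illustrative only; not the StarPU scheduling-context API): cores migrate from the context with the lowest measured utilization to the one with the highest, subject to a floor and a hysteresis threshold.

```python
# Sketch of a feedback loop an execution-context hypervisor might run.
# The threshold (0.2) and floor are hypothetical tuning knobs, not values
# from the paper.

def rebalance(contexts, min_cores=1):
    """contexts: dict name -> {'cores': int, 'util': float in [0, 1]}."""
    busiest = max(contexts, key=lambda c: contexts[c]["util"])
    idlest = min(contexts, key=lambda c: contexts[c]["util"])
    if (contexts[idlest]["cores"] > min_cores
            and contexts[busiest]["util"] - contexts[idlest]["util"] > 0.2):
        contexts[idlest]["cores"] -= 1     # shrink the underused context
        contexts[busiest]["cores"] += 1    # expand the saturated one

ctxs = {
    "cholesky": {"cores": 6, "util": 0.95},  # saturated linear-algebra kernel
    "fft":      {"cores": 6, "util": 0.40},  # mostly idle
}
rebalance(ctxs)
assert (ctxs["cholesky"]["cores"], ctxs["fft"]["cores"]) == (7, 5)
```

    The hysteresis threshold matters: without it, small utilization fluctuations would cause cores to ping-pong between contexts, and each migration has a cost (cache refill, scheduler state).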

    CAT: Closed-loop Adversarial Training for Safe End-to-End Driving

    Driving safety is a top priority for autonomous vehicles. Orthogonal to prior work that handles accident-prone traffic events through algorithm design at the policy level, in this paper we investigate a Closed-loop Adversarial Training (CAT) framework for safe end-to-end driving through the lens of environment augmentation. CAT aims to continuously improve the safety of driving agents by training the agent on safety-critical scenarios that are dynamically generated over time. A novel resampling technique is developed to turn log-replay real-world driving scenarios into safety-critical ones via probabilistic factorization, where adversarial traffic generation is modeled as the product of standard motion-prediction sub-problems. Consequently, CAT can launch more efficient physical attacks than existing safety-critical scenario generation methods and incurs significantly less computational cost in the iterative learning pipeline. We incorporate CAT into the MetaDrive simulator and validate our approach on hundreds of driving scenarios imported from real-world driving datasets. Experimental results demonstrate that CAT can effectively generate adversarial scenarios countering the agent being trained. After training, the agent achieves superior driving safety in both log-replay and safety-critical traffic scenarios on the held-out test set. Code and data are available at https://metadriverse.github.io/cat. Comment: 7th Conference on Robot Learning (CoRL 2023).
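
    The factorized-resampling idea can be illustrated with a toy sketch (this is not the paper's implementation; the trajectories, scoring function, and weights are all hypothetical): candidate adversary trajectories from a motion-prediction prior are reweighted by how strongly they conflict with the ego plan, so the product "prior x conflict" favors scenarios that are both plausible and safety-critical.

```python
import random

# Toy illustration of sampling an adversarial trajectory in proportion to
# (motion-prediction prior) * (conflict with the ego plan). Trajectories are
# 1-D position sequences for simplicity.

def conflict_score(traj, ego, radius=2.0):
    """Fraction of timesteps where the two positions come within radius."""
    return sum(abs(a - e) < radius for a, e in zip(traj, ego)) / len(traj)

def sample_adversary(candidates, ego, seed=0):
    """candidates: list of (trajectory, prior). Returns one trajectory."""
    rng = random.Random(seed)
    weights = [prior * conflict_score(traj, ego) for traj, prior in candidates]
    total = sum(weights)
    if total == 0:                      # nothing conflicts: fall back to prior
        weights = [prior for _, prior in candidates]
        total = sum(weights)
    r, acc = rng.random() * total, 0.0
    for (traj, _), w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return traj
    return candidates[-1][0]

ego = [0, 1, 2, 3]
cands = [([10, 10, 10, 10], 0.6),   # likely under the prior but harmless
         ([0, 1, 2, 3], 0.4)]       # less likely but on a collision course
print(sample_adversary(cands, ego))  # the conflicting trajectory wins
```

    Reusing off-the-shelf motion-prediction outputs as the prior is what makes this kind of generation cheap: no separate adversarial model has to be trained inside the loop.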

    SEED: Searching Encrypted Email Dependably. A design specification for secured webmail.

    Webmail services are a convenient, internet-based access point for email management. A webmail user must trust the service provider to honor the user's individual privacy while accommodating their email contents. Webmail users are increasingly conscious of the risk to their privacy, as many webmail services have fallen victim to cyberattacks in which unwanted observers exploited server vulnerabilities to steal users' private data. The relationship of trust between webmail provider and webmail user has been further called into question by revelations of NSA snooping on user email, often with the tacit approval of the webmail provider. We augment a modern webmail service with end-to-end encryption of user email data. Our system, SEED, is designed to respect the original functionality of the webmail service. Most notably, we enable search of encrypted message bodies using the webmail service's built-in search engine. With an ancillary web browser extension called the SEED add-on, the user can manage email in the webmail client while decrypting sensitive email information in a separate local process. The browser extension manages the user's encryption keys and decrypts email ciphertext automatically, so that the user remains unaware of the underlying cryptographic implementation as they browse their email. Built upon Gmail, SEED stores a user's email data on Google's remote servers and guarantees that Google is unable to interpret it. When managing their email, the user still enjoys the full capabilities of the Gmail web client, including composing, reading, and robustly searching email by message metadata. The user can do all of this without revealing their usage habits or the contents of their emails to Google. Using SEED, the user benefits from the conveniences of webmail while preserving the integrity of their private information stored online.
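
    One common way to make encrypted mail searchable by an untrusted server's built-in search, sketched here as an assumption rather than SEED's actual scheme, is to attach deterministic keyword tokens: each keyword is replaced by an HMAC under a client-held key, so the server can match tokens without ever seeing plaintext.

```python
import hashlib
import hmac
from base64 import b64encode

# Sketch of searchable client-side encryption via deterministic keyword
# tokens (illustrative; not SEED's actual construction). The message body
# itself would be encrypted separately; only the tokens are searchable.

def keyword_token(key, word):
    digest = hmac.new(key, word.lower().encode(), hashlib.sha256).digest()
    return b64encode(digest[:9]).decode()   # short, search-friendly token

def index_message(key, body):
    """Tokens the client appends to the stored message for server search."""
    return sorted({keyword_token(key, w) for w in body.split()})

def search(key, query, stored_tokens):
    """The client tokenizes the query; the server only matches opaque tokens."""
    return keyword_token(key, query) in stored_tokens

k = b"user-secret-key"
tokens = index_message(k, "quarterly budget draft")
assert search(k, "Budget", tokens)        # case-insensitive match
assert not search(k, "salary", tokens)    # absent keyword finds nothing
```

    The trade-off is worth noting: deterministic tokens leak keyword-equality patterns to the server, which is the price paid for reusing the provider's existing search machinery.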

    Efficient Multiprogramming for Multicores with SCAF

    As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed runtime environments provide neither an interface nor a strategy for intelligently allocating hardware threads, or even for preventing oversubscription. Prior research methods either depend on profiling applications ahead of time in order to make good allocation decisions, or do not account for process efficiency at all, leading to poor performance. None of these prior methods has been adopted widely in practice. This paper presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency, without any changes to semantics, program modification, offline profiling, or recompilation. Our current implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications, without requiring application modification or recompilation. In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters, and demonstrate its effectiveness in aiding allocation decisions. We evaluated SCAF using the NAS NPB benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme.
    We measured the benefit of SCAF in terms of improvement in the sum of speedups (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently, compared to equipartitioning, the best existing competing scheme in the literature. If the sum of speedups with SCAF is within 5% of equipartitioning (i.e., 0.95X < improvement factor in sum of speedups < 1.05X), we deem SCAF to break even; less than 0.95X is considered a slowdown, and greater than 1.05X an improvement. We found that SCAF improves on equipartitioning on 4 out of 5 machines, breaking even or improving in 80-89% of pairs and showing a mean improvement of 1.11-1.22X for the benchmark pairs on which it improves, depending on the machine. Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming with unmodified OpenMP, the only environment available to end users today. SCAF improves on or breaks even with the unmodified OpenMP runtimes on all 5 machines in 72-100% of pairs, with a mean improvement of 1.27-1.7X, depending on the machine.
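
    The contrast with equipartitioning can be sketched with a toy allocation policy (illustrative; not SCAF's actual algorithm, and the efficiency numbers are made up): each malleable process reports a runtime efficiency estimate, and hardware threads are split in proportion to it, so a poorly scaling process stops hoarding cores.

```python
# Toy efficiency-driven thread allocation in the spirit of SCAF.
# Equipartitioning would always split threads evenly; here the split
# follows each process's measured parallel efficiency instead.

def allocate(total_threads, efficiencies, floor=1):
    """efficiencies: dict pid -> estimated parallel efficiency in (0, 1]."""
    weight = sum(efficiencies.values())
    alloc = {pid: max(floor, round(total_threads * e / weight))
             for pid, e in efficiencies.items()}
    # Fix rounding drift so allocations sum exactly to total_threads.
    drift = total_threads - sum(alloc.values())
    busiest = max(efficiencies, key=efficiencies.get)
    alloc[busiest] += drift
    return alloc

# A well-scaling solver and a memory-bound code sharing a 16-thread machine:
print(allocate(16, {"cg_solver": 0.9, "streamcluster": 0.3}))
# equipartitioning would give 8/8; efficiency feedback gives 12/4
```

    In SCAF itself the efficiency estimate comes from hardware counters observed at runtime, which is what removes the need for offline profiling; this sketch simply takes the estimates as given.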

    Synchrony of the Sublime: A Performer\u27s Guide to Duke Ellington\u27s Wordless Melodies for Soprano

    This monograph provides an in-depth examination of the background, musical, and performance issues related to Duke Ellington’s wordless melodies, as well as epigrammatic biographies of Ellington and three female vocalists whose voices he employed as instruments: Adelaide Hall, Kay Davis, and Alice Babs. As early as the twenties, Ellington innovatively used the voice as a wordless instrumental color—an idea he extended into both his secular and sacred works. His iconoclastic instrumentalization of the soprano voice in compositions such as “Creole Love Call”, “Minnehaha”, “Transblucency”, “On a Turquoise Cloud”, and “T.G.T.T.” merits consideration by scholars and performers alike; these artistically complex melodies offer valuable insights into Ellington’s organic and collaborative compositional process. Although Ellington’s wordless melodies for the soprano voice have fallen on the periphery of discussions of twentieth-century American music, perhaps out of sheer obscurity, the need for alternative teaching and performance materials gives rise to a host of topics for further study regarding these pieces. Assimilating Ellington’s programmatic and mood pieces for the instrumentalized soprano voice into the canon of chamber repertoire opens a new arena of scholarly and artistic endeavor for the trained singer. Therefore, central to this study are the following considerations: context; pedagogical challenges (range, tessitura, vowels, phrase length, etc.); the nature of accompaniment and instrumentation; form; and the nature of Ellington’s vocal writing as it pertains to the wordless obbligato and concert works featuring the wordless voice, including “Minnehaha,” “Transblucency,” “On A Turquoise Cloud,” and “T.G.T.T.” aka “Too Good To Title.” This study evaluates Ellington’s technique of casting the wordless female voice in unique musical contexts via musical analysis, as well as pedagogical and interpretive assessments of selected Ellington pieces.
    The resultant amalgam of musical identities, both instrumental and vocal, fostered creative polyphony and epitomized the coined “Ellington Effect.” The following analysis centers on a chronological survey of Ellington’s wordless melodies performed and recorded by Adelaide Hall, Kay Davis, and Alice Babs. The goal of this project is to present a study of the historical context and significance, style, device, and pedagogical/performance considerations of those works that combine the flexibility, technique, and aural training of the studied singer with instrumental jazz idioms in a cross-genre context.