287 research outputs found

    High performance cloud computing on multicore computers

    Get PDF
    The cloud has become a major computing platform, with virtualization being a key to allow applications to run and share the resources in the cloud. A wide spectrum of applications need to process large amounts of data at high speeds in the cloud, e.g., analyzing customer data to find out purchase behavior, processing location data to determine geographical trends, or mining social media data to assess brand sentiment. To achieve high performance, these applications create and use multiple threads running on multicore processors. However, existing virtualization technology cannot support the efficient execution of such applications on virtual machines, making them suffer poor and unstable performance in the cloud. Targeting multi-threaded applications, the dissertation analyzes and diagnoses their performance issues on virtual machines, and designs practical solutions to improve their performance. The dissertation makes the following contributions. First, the dissertation conducts extensive experiments with standard multicore applications, in order to evaluate the performance overhead on virtualization systems and diagnose the causing factors. Second, focusing on one main source of the performance overhead, excessive spinning, the dissertation designs and evaluates a holistic solution to make effective utilization of the hardware virtualization support in processors to reduce excessive spinning with low cost. Third, focusing on application scalability, which is the most important performance feature for multi-threaded applications, the dissertation models application scalability in virtual machines and analyzes how application scalability changes with virtualization and resource sharing. Based on the modeling and analysis, the dissertation identifies key application features and system factors that have impacts on application scalability, and reveals possible approaches for improving scalability. Forth, the dissertation explores one approach to improving application scalability by making fully utilization of virtual resources of each virtual machine. The general idea is to match the workload distribution among the virtual CPUs in a virtual machine and the virtual CPU resource of the virtual machine manager

    The Case for a Factored Operating System (fos)

    Get PDF
    The next decade will afford us computer chips with 1,000 - 10,000 cores on a single piece of silicon. Contemporary operating systems have been designed to operate on a single core or small number of cores and hence are not well suited to manage and provide operating system services at such large scale. Managing 10,000 cores is so fundamentally different from managing two cores that the traditional evolutionary approach of operating system optimization will cease to work. The fundamental design of operating systems and operating system data structures must be rethought. This work begins by documenting the scalability problems of contemporary operating systems. These studies are used to motivate the design of a factored operating system (fos). fos is a new operating system targeting 1000+ core multicore systems where space sharing replaces traditional time sharing to increase scalability. fos is built as a collection of Internet inspired services. Each operating system service is factored into a fleet of communicating servers which in aggregate implement a system service. These servers are designed much in the way that distributed Internet services are designed, but instead of providing high level Internet services, these servers provide traditional kernel services and manage traditional kernel data structures in a factored, spatially distributed manner. The servers are bound to distinct processing cores and by doing so do not fight with end user applications for implicit resources such as TLBs and caches. Also, spatial distribution of these OS services facilitates locality as many operations only need to communicate with the nearest server for a given service

    An Analysis of Linux Scalability to Many Cores

    Get PDF
    URL to paper from conference siteThis paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48- core computer. Except for gmake, all applications trigger scalability bottlenecks inside a recent Linux kernel. Using mostly standard parallel programming techniques— this paper introduces one new technique, sloppy counters— these bottlenecks can be removed from the kernel or avoided by changing the applications slightly. Modifying the kernel required in total 3002 lines of code changes. A speculative conclusion from this analysis is that there is no scalability reason to give up on traditional operating system organizations just yet.Quanta Computer (Firm)National Science Foundation (U.S.) (0834415)National Science Foundation (U.S.) (0915164)Microsoft Research (Fellowship)Irwin Mark Jacobs and Joan Klein Jacobs Presidential Fellowshi

    Optimistic Parallel State-Machine Replication

    Full text link
    State-machine replication, a fundamental approach to fault tolerance, requires replicas to execute commands deterministically, which usually results in sequential execution of commands. Sequential execution limits performance and underuses servers, which are increasingly parallel (i.e., multicore). To narrow the gap between state-machine replication requirements and the characteristics of modern servers, researchers have recently come up with alternative execution models. This paper surveys existing approaches to parallel state-machine replication and proposes a novel optimistic protocol that inherits the scalable features of previous techniques. Using a replicated B+-tree service, we demonstrate in the paper that our protocol outperforms the most efficient techniques by a factor of 2.4 times

    Lessons Learned from a Decade of Providing Interactive, On-Demand High Performance Computing to Scientists and Engineers

    Full text link
    For decades, the use of HPC systems was limited to those in the physical sciences who had mastered their domain in conjunction with a deep understanding of HPC architectures and algorithms. During these same decades, consumer computing device advances produced tablets and smartphones that allow millions of children to interactively develop and share code projects across the globe. As the HPC community faces the challenges associated with guiding researchers from disciplines using high productivity interactive tools to effective use of HPC systems, it seems appropriate to revisit the assumptions surrounding the necessary skills required for access to large computational systems. For over a decade, MIT Lincoln Laboratory has been supporting interactive, on-demand high performance computing by seamlessly integrating familiar high productivity tools to provide users with an increased number of design turns, rapid prototyping capability, and faster time to insight. In this paper, we discuss the lessons learned while supporting interactive, on-demand high performance computing from the perspectives of the users and the team supporting the users and the system. Building on these lessons, we present an overview of current needs and the technical solutions we are building to lower the barrier to entry for new users from the humanities, social, and biological sciences.Comment: 15 pages, 3 figures, First Workshop on Interactive High Performance Computing (WIHPC) 2018 held in conjunction with ISC High Performance 2018 in Frankfurt, German

    Runtime-guided management of scratchpad memories in multicore architectures

    Get PDF
    © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.The increasing number of cores and the anticipated level of heterogeneity in upcoming multicore architectures cause important problems in traditional cache hierarchies. A good way to alleviate these problems is to add scratchpad memories alongside the cache hierarchy, forming a hybrid memory hierarchy. This memory organization has the potential to improve performance and to reduce the power consumption and the on-chip network traffic, but exposing such a complex memory model to the programmer has a very negative impact on the programmability of the architecture. Emerging task-based programming models are a promising alternative to program heterogeneous multicore architectures. In these models the runtime system manages the execution of the tasks on the architecture, allowing them to apply many optimizations in a generic way at the runtime system level. This paper proposes giving the runtime system the responsibility to manage the scratchpad memories of a hybrid memory hierarchy in multicore processors, transparently to the programmer. In the envisioned system, the runtime system takes advantage of the information found in the task dependences to map the inputs and outputs of a task to the scratchpad memory of the core that is going to execute it. In addition, the paper exploits two mechanisms to overlap the data transfers with computation and a locality-aware scheduler to reduce the data motion. In a 32-core multicore architecture, the hybrid memory hierarchy outperforms cache-only hierarchies by up to 16%, reduces on-chip network traffic by up to 31% and saves up to 22% of the consumed power.Peer ReviewedPostprint (author's final draft
    corecore