
    An Analysis of Linux Scalability to Many Cores

    This paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48-core computer. Except for gmake, all applications trigger scalability bottlenecks inside a recent Linux kernel. Using mostly standard parallel programming techniques (this paper introduces one new technique, sloppy counters), these bottlenecks can be removed from the kernel or avoided by changing the applications slightly. Modifying the kernel required 3002 lines of code changes in total. A speculative conclusion from this analysis is that there is no scalability reason to give up on traditional operating system organizations just yet.
    Funding: Quanta Computer (Firm); National Science Foundation (U.S.) (0834415); National Science Foundation (U.S.) (0915164); Microsoft Research (Fellowship); Irwin Mark Jacobs and Joan Klein Jacobs Presidential Fellowship
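    The sloppy-counter technique is simple enough to illustrate. Below is a minimal C sketch of the idea, assuming a fixed core count and a batching threshold chosen for illustration; the names and constants are mine, not the paper's kernel code. Each core accumulates updates in its own cache line and touches the shared counter only when its local slack runs out, so the common case involves no cross-core cache traffic.

        /* Sloppy counter sketch: per-core deltas spill to a shared
         * counter in batches.  Assumes each core updates only its own
         * slot (e.g., with preemption disabled, as in a kernel). */
        #include <stdatomic.h>

        #define NCORES    48
        #define THRESHOLD 64    /* local slack before spilling */

        struct __attribute__((aligned(64))) local_count {
            long val;           /* per-core delta, no false sharing */
        };

        static atomic_long global_count;
        static struct local_count local[NCORES];

        void sloppy_inc(int core)
        {
            if (++local[core].val >= THRESHOLD) {
                atomic_fetch_add(&global_count, local[core].val);
                local[core].val = 0;
            }
        }

        long sloppy_read(void)
        {
            /* Approximate read: global value plus unspilled deltas. */
            long sum = atomic_load(&global_count);
            for (int i = 0; i < NCORES; i++)
                sum += local[i].val;
            return sum;
        }

    The read is "sloppy" in that it may miss in-flight local updates; the point is that many kernel counters tolerate exactly this kind of staleness in exchange for scalability.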

    Improving network connection locality on multicore systems

    Incoming and outgoing processing for a given TCP connection often execute on different cores: an incoming packet is typically processed on the core that receives the interrupt, while outgoing data processing occurs on the core running the relevant user code. As a result, accesses to read/write connection state (such as TCP control blocks) often involve cache invalidations and data movement between cores' caches. These can take hundreds of processor cycles, enough to significantly reduce performance. We present a new design, called Affinity-Accept, that causes all processing for a given TCP connection to occur on the same core. Affinity-Accept arranges for the network interface to determine, in a lightweight way, the core on which application processing for each new connection occurs; it adjusts the card's choices only in response to imbalances in CPU scheduling. Measurements show that for the Apache web server serving static files on a 48-core AMD system, Affinity-Accept reduces time spent in the TCP stack by 30% and improves overall throughput by 24%.
    Funding: National Science Foundation (U.S.) (Grant CNS-0834415); National Science Foundation (U.S.) (Grant CNS-0915164); Quanta Computer (Firm)
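    The paper's Affinity-Accept is a kernel modification, but the flavor of the design can be approximated in user space on modern Linux with SO_REUSEPORT plus SO_INCOMING_CPU: one listener per core, each asking the kernel to deliver connections whose receive processing happened on that core. The sketch below shows this analogue, not the paper's implementation; error handling is omitted for brevity.

        /* Core-local listener: a user-space analogue of connection
         * affinity, not the paper's Affinity-Accept itself. */
        #include <netinet/in.h>
        #include <string.h>
        #include <sys/socket.h>

        int make_core_local_listener(int core, int port)
        {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int one = 1;

            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            /* Steer connections whose receive processing ran on
             * this core to this listener (settable since Linux 4.4). */
            setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &core, sizeof(core));

            struct sockaddr_in addr;
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(port);
            bind(fd, (struct sockaddr *)&addr, sizeof(addr));
            listen(fd, 128);
            return fd;  /* accept(2) from a thread pinned to `core` */
        }

    Pairing each such listener with a thread pinned to the matching core keeps a connection's incoming and outgoing processing on one core, which is the locality property the abstract targets.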

    Self-tuning of disk input–output in operating systems

    The final publication is available via http://dx.doi.org/10.1016/j.jss.2011.07.030
    One of the most difficult tasks in computer system management is tuning kernel parameters to obtain maximum performance. Traditionally, this tuning has relied on either fixed configurations or the administrator's subjective criteria. Among the subsystems managed by the operating system, the main bottleneck is disk input/output (I/O). An evolutionary module has been developed to tune this subsystem automatically, using an adaptive and dynamic approach. The module adapts automatically and transparently to any change in the system, whether in the hardware or in the nature of the workload itself. System administrators are thus freed from this kind of task, and performance is optimized for the particular circumstances of each system. Experiments show a throughput increase in 88.2% of cases and an average improvement of 29.63% over the default configuration of the Linux operating system. Average latency decreased in 77.5% of cases, with a mean reduction of 12.79% in I/O request processing time.
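    The evolutionary approach can be sketched as a simple mutate-and-keep loop. The C below is an illustrative (1+1) strategy under my own assumptions: the tuned knob is the block-layer read_ahead_kb sysfs parameter (a stand-in; the paper tunes its own set of parameters), and the fitness function is a naive sequential-read benchmark. A real tuner would need to defeat the page cache and average several runs.

        /* Illustrative (1+1) evolutionary tuner: mutate a knob, keep
         * the mutant only if measured throughput improves. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <unistd.h>

        static void set_readahead_kb(int kb)
        {
            FILE *f = fopen("/sys/block/sda/queue/read_ahead_kb", "w");
            if (f) { fprintf(f, "%d\n", kb); fclose(f); }
        }

        static double measure_throughput(void)   /* MB/s, stand-in */
        {
            static char buf[1 << 20];
            struct timespec t0, t1;
            long n, total = 0;
            int fd = open("/tmp/testfile", O_RDONLY);

            if (fd < 0) return 0.0;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            while ((n = read(fd, buf, sizeof(buf))) > 0)
                total += n;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            close(fd);
            double s = (t1.tv_sec - t0.tv_sec)
                     + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            return s > 0 ? total / s / 1e6 : 0.0;
        }

        int main(void)
        {
            int best_kb = 128;                 /* Linux default */
            set_readahead_kb(best_kb);
            double best = measure_throughput();

            for (int gen = 0; gen < 50; gen++) {
                int cand = best_kb
                         + ((rand() & 1) ? 1 : -1) * (rand() % 64 + 1);
                if (cand < 8) cand = 8;
                set_readahead_kb(cand);
                double score = measure_throughput();
                if (score > best) { best = score; best_kb = cand; }
                else              set_readahead_kb(best_kb);  /* revert */
            }
            printf("tuned read_ahead_kb = %d (%.1f MB/s)\n", best_kb, best);
            return 0;
        }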

    Near-Memory Address Translation

    Memory and logic integration on the same chip is becoming increasingly cost effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation, MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that limiting the associativity of the virtual-to-physical mapping incurs no penalty, and that, combined with careful data placement in the MPU's memory, it can break the translate-then-fetch serialization, allowing translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages, respectively.
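    The key mechanism, restricted associativity, is easy to state in code. In the sketch below (my constants and layout, not DIPTA's actual hardware organization), a virtual page may reside only in one of WAYS frames of a set computed from the virtual page number, so every candidate physical location is known from the virtual address alone and the data fetch can be launched in parallel with resolving which way holds the page.

        /* Restricted-associativity translation sketch. */
        #include <stdint.h>

        #define WAYS  4
        #define NSETS (1u << 20)   /* total frames / WAYS */

        /* Every candidate frame is computable from the VPN alone. */
        static inline uint64_t candidate_frame(uint64_t vpn, unsigned way)
        {
            return (vpn % NSETS) * WAYS + way;
        }

        /* A near-memory inverted table answers "which way?": one entry
         * per set, tagged with the VPN resident in each way. */
        struct set_entry { uint64_t tag[WAYS]; };

        static int resolve_way(const struct set_entry *table, uint64_t vpn)
        {
            const struct set_entry *e = &table[vpn % NSETS];
            for (unsigned w = 0; w < WAYS; w++)
                if (e->tag[w] == vpn)
                    return (int)w;   /* hit: exact frame now known */
            return -1;               /* not resident: fault path */
        }

    Because the set index depends only on the VPN, fetches of all WAYS candidates can be issued while resolve_way runs, which is the serialization-breaking property the abstract describes.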

    Load Balancing in Heterogeneous Cloud Environments by Using PROMETHEE Method

    Efficient scheduling of tasks in a cloud environment improves resource utilization, thereby meeting users' requirements. One of the most important objectives of a scheduling algorithm in a cloud environment is balanced load distribution over the various resources, enhancing the overall performance of the cloud. Such scheduling is complex due to the dynamicity of resources and of incoming application specifications. In this paper, we employ the PROMETHEE decision-making model to design a scheduling algorithm, called PROMETHEE Load Balancing (PLB). The paper formulates load balancing as a multi-criteria decision-making problem and aims to achieve a well-balanced load across virtual machines, maximizing the overall throughput of the cloud. Extensive simulation results in the CloudSim environment show that the proposed algorithm outperforms existing algorithms in terms of load balancing index (LBI), VM load variation, makespan, average execution time, and waiting time.
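    The PROMETHEE machinery underlying such a scheduler fits in a few lines. The C sketch below computes PROMETHEE II net outranking flows with the "usual" preference function and dispatches to the best-ranked VM; the two criteria, the weights, and the scores are invented for illustration and are not the paper's experimental setup.

        /* PROMETHEE II net-flow ranking over a tiny VM example. */
        #include <stdio.h>

        #define NVM   3    /* alternatives (virtual machines) */
        #define NCRIT 2    /* criteria, both to maximize */

        static const double w[NCRIT] = { 0.6, 0.4 };   /* weights */
        static const double score[NVM][NCRIT] = {
            { 0.2, 0.5 },  /* VM0: free CPU, free memory (normalized) */
            { 0.7, 0.3 },  /* VM1 */
            { 0.4, 0.8 },  /* VM2 */
        };

        /* "Usual" criterion: full preference iff a strictly beats b. */
        static double pref(double a, double b) { return a > b ? 1.0 : 0.0; }

        int main(void)
        {
            double net[NVM] = { 0 };   /* phi = phi+ minus phi- */

            for (int a = 0; a < NVM; a++)
                for (int b = 0; b < NVM; b++) {
                    if (a == b) continue;
                    double pi = 0;     /* aggregated preference pi(a,b) */
                    for (int c = 0; c < NCRIT; c++)
                        pi += w[c] * pref(score[a][c], score[b][c]);
                    net[a] += pi / (NVM - 1);
                    net[b] -= pi / (NVM - 1);
                }

            int best = 0;
            for (int v = 1; v < NVM; v++)
                if (net[v] > net[best]) best = v;
            printf("dispatch next task to VM%d (net flow %.2f)\n",
                   best, net[best]);
            return 0;
        }

    On the sample data this selects VM2 (net flow 0.40); in a scheduler, the score matrix would be refreshed from live VM telemetry before each dispatch decision.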