
    Measuring Operating System Overhead on CMT Processors

    Numerous studies have shown that Operating System (OS) noise is one of the causes of significant performance degradation in clustered architectures. Although many studies examine OS noise for High Performance Computing (HPC), especially in multi-processor/core systems, most focus on 2- or 4-core systems. In this paper, we analyze the major sources of OS noise on a massively multithreaded processor, the Sun UltraSPARC T1, running Linux and Solaris. Since a real system is too complex to analyze, we compare those results with a low-overhead runtime environment, the Netra Data Plane Software Suite (Netra DPS). Our results show that the overhead introduced by the OS timer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application runs. This overhead is up to 30% when the application executes on the same hardware context as the timer interrupt handler, and up to 10% when the application and the timer interrupt handler run on different contexts of the same core. We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor.
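    To make the measurement concrete, below is a minimal fixed-work-quantum probe in the spirit of this methodology (not the paper's actual harness): it pins itself to one hardware context, runs a fixed amount of work repeatedly, and reports the spread of per-quantum times, which widens on the context that services the timer interrupt. The CPU id and loop sizes are illustrative choices, not values from the paper.

```c
/* Fixed-work-quantum OS-noise probe (Linux-specific: sched_setaffinity). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define WORK  1000000   /* fixed work per quantum (arbitrary) */
#define ITERS 10000     /* number of quanta to sample */

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                  /* pin to CPU 0; vary to probe other contexts */
    sched_setaffinity(0, sizeof(set), &set);

    volatile uint64_t sink = 0;
    uint64_t min = UINT64_MAX, max = 0;
    for (int i = 0; i < ITERS; i++) {
        uint64_t t0 = now_ns();
        for (int w = 0; w < WORK; w++) sink += w;   /* fixed work quantum */
        uint64_t dt = now_ns() - t0;
        if (dt < min) min = dt;
        if (dt > max) max = dt;
    }
    /* (max - min) / min approximates the worst-case noise overhead */
    printf("min=%llu ns max=%llu ns overhead=%.1f%%\n",
           (unsigned long long)min, (unsigned long long)max,
           100.0 * (double)(max - min) / (double)min);
    return 0;
}
```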

    Towards an Adaptive OS Noise Mitigation Technique for Microbenchmarking on Apple iPad Devices

    This study investigates levels of Operating System (OS) noise on Apple iPad mobile devices. OS noise causes variations in application performance that interfere with microbenchmark results, manifesting in collected data as extreme outliers and variations in skewness. Using our collected data, we develop an iterative, semi-automated outlier-removal process for Apple iPad OS noise profiles. The profiles generated by outlier removal represent a first step toward an adaptive noise mitigation technique, which presents opportunities for use in microbenchmarking across other mobile platforms.
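    A minimal sketch of the kind of iterative outlier removal described here, assuming a median ± k·MAD trimming rule applied until the sample set is stable; the study's actual criterion and threshold are not given in this abstract, so both are illustrative.

```c
/* Iterative outlier trimming for latency samples (median +/- k*MAD rule). */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of v[0..n-1]; sorts v in place, so pass a scratch copy. */
static double median(double *v, size_t n) {
    qsort(v, n, sizeof *v, cmp);
    return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Removes samples with |x - median| > k * MAD, repeating until stable.
 * Compacts v in place and returns the number of samples kept. */
static size_t trim_outliers(double *v, size_t n, double k) {
    for (;;) {
        double *tmp = malloc(n * sizeof *tmp);
        if (!tmp) return n;
        for (size_t i = 0; i < n; i++) tmp[i] = v[i];
        double med = median(tmp, n);
        for (size_t i = 0; i < n; i++) tmp[i] = fabs(v[i] - med);
        double mad = median(tmp, n);
        free(tmp);
        if (mad == 0.0) return n;          /* degenerate: nothing to trim */
        size_t kept = 0;
        for (size_t i = 0; i < n; i++)
            if (fabs(v[i] - med) <= k * mad) v[kept++] = v[i];
        if (kept == n) return n;           /* stable: no outliers left */
        n = kept;
    }
}

int main(void) {
    double samples[] = { 10.1, 10.3, 9.9, 10.0, 57.2, 10.2, 10.1, 88.4, 10.0 };
    size_t n = sizeof samples / sizeof *samples;
    n = trim_outliers(samples, n, 3.0);    /* k = 3 is a common default */
    printf("kept %zu samples\n", n);
    return 0;
}
```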

    Direct Inter-Process Communication (dIPC): Repurposing the CODOMs architecture to accelerate IPC

    In current architectures, page tables are the fundamental mechanism that allows contemporary OSs to isolate user processes, binding each thread to a specific page table. A thread cannot therefore directly call another process's function or access its data; instead, the OS kernel provides data communication primitives and mediates process synchronization through inter-process communication (IPC) channels, which impede system performance. Alternatively, the recently proposed CODOMs architecture provides memory protection across software modules. Threads can cross module protection boundaries inside the same process using simple procedure calls, while preserving memory isolation. We present dIPC (for "direct IPC"), an OS extension that repurposes and extends the CODOMs architecture to allow threads to cross process boundaries. It maps processes into a shared address space, and eliminates the OS kernel from the critical path of inter-process communication. dIPC is 64.12× faster than local remote procedure calls (RPCs), and 8.87× faster than IPC in the L4 microkernel. We show that applying dIPC to a multi-tier OLTP web server improves performance by up to 5.12× (2.13× on average), and reaches over 94% of the ideal system efficiency.
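    For context, here is a sketch of the kernel-mediated baseline that dIPC bypasses: a one-byte ping-pong between two processes over POSIX pipes, whose round-trip cost is dominated by the kernel IPC path. This is plain POSIX and unrelated to the CODOMs hardware itself.

```c
/* Round-trip latency of kernel-mediated IPC over a pair of pipes. */
#include <stdint.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void) {
    int ping[2], pong[2];
    char b = 0;
    pipe(ping);
    pipe(pong);
    if (fork() == 0) {                     /* child: echo every byte back */
        for (int i = 0; i < ROUNDS; i++) {
            read(ping[0], &b, 1);
            write(pong[1], &b, 1);
        }
        _exit(0);
    }
    uint64_t t0 = now_ns();
    for (int i = 0; i < ROUNDS; i++) {     /* parent: ping-pong one byte */
        write(ping[1], &b, 1);
        read(pong[0], &b, 1);
    }
    printf("round trip: %.0f ns\n", (double)(now_ns() - t0) / ROUNDS);
    wait(NULL);
    return 0;
}
```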

    Reducing Power Consumption and Latency in Mobile Devices using a Push Event Stream Model, Kernel Display Server, and GUI Scheduler

    The power consumed by mobile devices can be dramatically reduced by improving how mobile operating systems handle events and display management. Currently, mobile operating systems use a pull model that employs a polling loop to constantly ask the operating system if an event exists. This constant querying prevents the CPU from entering a deep sleep, which unnecessarily consumes power. We’ve improved this process by switching to a push model, which we refer to as the event stream model (ESM). This model leverages modern device interrupt controllers, which automatically notify an application when events occur, thus removing the need to constantly rouse the CPU in order to poll for events. Since the CPU rests while no events are occurring, power consumption is reduced. Furthermore, an application is immediately notified when an event occurs, as opposed to waiting for a polling loop to recognize that an event has occurred. This immediate notification reduces latency, which is the elapsed time between the occurrence of an event and the beginning of its processing by an application. We further improved the benefits of the ESM by moving the display server, a central piece of the graphical user interface (GUI), into the kernel. Existing display servers duplicate some of the kernel code. They contain important information about an application that can assist the kernel with scheduling, such as whether the application is visible and able to receive events. However, they do not share such information with the kernel. Our new kernel-level display server (KDS) interacts directly with the process scheduler to determine when applications are allowed to use the CPU. For example, when an application is idle and not visible on the screen, the KDS prevents that application from using the CPU, thus conserving power. These combined improvements have reduced power consumption by up to 31.2% and latency by up to 17.1 milliseconds in our experimental applications. This improvement in power consumption roughly increases battery life by one to four hours when the device is being actively used, or fifty to three hundred hours when the device is idle.
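    The pull/push distinction can be sketched at the syscall level: a polling loop keeps the CPU awake, while a blocking epoll_wait lets it sleep until the kernel delivers an event. This only illustrates the model; the paper's ESM lives in the kernel and interrupt controller, not in user space.

```c
/* Push vs pull event delivery, illustrated with Linux epoll on stdin. */
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void) {
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
    epoll_ctl(ep, EPOLL_CTL_ADD, STDIN_FILENO, &ev);

    /* Push model: the process sleeps until the kernel delivers an event. */
    struct epoll_event out;
    epoll_wait(ep, &out, 1, -1);           /* CPU may enter deep sleep here */
    printf("event on fd %d\n", out.data.fd);

    /* The pull model this replaces would look like:
     *   for (;;) { if (poll_for_event()) handle(); }   // keeps CPU awake
     */
    close(ep);
    return 0;
}
```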

    Exploiting Heterogeneous Multicore Processors Through Fine-Grained Scheduling and Low-Overhead Thread Migration

    Heterogeneous (also known as asymmetric) multicore processors (HMPs) offer significant advantages over homogeneous multicores in terms of both power and performance. Power-efficient cores can be paired with higher-performance cores to achieve advantageous power/performance tradeoffs. Particular cores can also be tailored to efficiently meet the demands of particular application domains. Unfortunately, HMPs also create unique challenges in effectively mapping running processes to cores; the greater the diversity of cores, the more complex this problem becomes.

    Existing dynamic scheduling approaches for HMPs fall into two categories: sampling and prediction. Sampling approaches permute running applications across core types to find the best-performing assignment. This sampling step hurts performance and power due to time spent migrating threads through non-optimal assignments. Alternatively, prediction-based approaches estimate the performance of each application on different types of cores and choose the schedule with the best estimated performance. Prediction eliminates the cost of sampling but may result in sub-optimal scheduling decisions.

    This dissertation introduces novel phase-identification-based online schedulers for HMPs that combine aspects of both sampling and prediction by identifying phases of execution (instruction sequences with similar behavior), sampling new phases, recognizing repeating phases, and reusing recorded phase information to predict the best-performing schedule, optimizing for either performance or power consumption. While previous approaches used only phase-change detection to begin evaluating new schedules, the proposed approaches recognize the current phase of each executing thread and reuse phase information recorded in a Signature History Table when the same or similar program behavior recurs. The dissertation further proposes machine-learning-based schedulers that learn effective scheduling policies from the same characteristics of these program phases.

    Exploiting differences between relatively short phases using the presented scheduling techniques results in frequent thread migrations that can harm performance, and operating system (OS) context switching can be time consuming. To reduce this context-switching overhead, this work further introduces a context-switching circuit that both accelerates thread switches among cores in HMPs and reduces switching cost within each core (multitasking). This novel context-switch circuit enables low-overhead hardware-level thread migration between cores on a chip and yields up to 1380× speedup compared to an OS context switch. Together with the presented scheduling approaches, this mechanism enables efficient, fine-grained scheduling for HMPs.
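    A toy sketch of the phase-reuse idea, assuming a direct-mapped Signature History Table keyed by a phase-signature hash: on a phase change, the scheduler looks the signature up, and on a hit it reuses the recorded best core type instead of re-sampling. The table size, hashing, and "similar signature" test are simplifications not taken from the dissertation.

```c
/* Direct-mapped Signature History Table: phase signature -> best core type. */
#include <stdint.h>
#include <stdio.h>

enum core { CORE_LITTLE, CORE_BIG };

#define SHT_SIZE 256

struct sht_entry {
    uint64_t signature;   /* hash identifying the execution phase */
    enum core best;       /* core type that performed best when sampled */
    int valid;
};

static struct sht_entry sht[SHT_SIZE];

/* Returns 1 and fills *best on a hit; 0 means the phase must be sampled. */
static int sht_lookup(uint64_t sig, enum core *best) {
    struct sht_entry *e = &sht[sig % SHT_SIZE];
    if (e->valid && e->signature == sig) { *best = e->best; return 1; }
    return 0;
}

static void sht_record(uint64_t sig, enum core best) {
    struct sht_entry *e = &sht[sig % SHT_SIZE];
    e->signature = sig;
    e->best = best;
    e->valid = 1;
}

int main(void) {
    enum core c;
    uint64_t phase = 0xdeadbeef;           /* stand-in phase signature */
    if (!sht_lookup(phase, &c)) {          /* miss: sample, then record */
        c = CORE_BIG;                      /* pretend sampling chose the big core */
        sht_record(phase, c);
    }
    printf("schedule phase on %s core\n", c == CORE_BIG ? "big" : "little");
    /* A recurrence of the phase hits the table and skips sampling. */
    if (sht_lookup(phase, &c))
        printf("reuse recorded decision: %s core\n",
               c == CORE_BIG ? "big" : "little");
    return 0;
}
```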