136 research outputs found

Doctor of Philosophy

Author: Burtsev Anton
Publication venue: University of Utah
Publication date: 01/05/2013

Dissertation. A modern software system is a composition of parts that are themselves highly complex: operating systems, middleware, libraries, servers, and so on. In principle, compositionality of interfaces means that we can understand any given module independently of the internal workings of other parts. In practice, however, abstractions are leaky, and with every generation, modern software systems grow in complexity. Traditional ways of understanding failures, explaining anomalous executions, and analyzing performance are reaching their limits in the face of emergent behavior, unrepeatability, cross-component execution, software aging, and adversarial changes to the system at run time. Deterministic systems analysis has the potential to change the way we analyze and debug software systems. Recorded once, the execution of the system becomes an independent artifact, which can be analyzed offline. The availability of the complete system state, the guaranteed behavior of re-execution, and the absence of limitations on the run-time complexity of analysis collectively enable the deep, iterative, and automatic exploration of the dynamic properties of the system. This work creates a foundation for making deterministic replay a ubiquitous system analysis tool. It defines design and engineering principles for building fast and practical replay machines capable of capturing the complete execution of an entire operating system with an overhead of a few percent, on a realistic workload, and with minimal installation costs. To provide an intuitive interface for constructing replay analysis tools, this work implements a powerful virtual machine introspection layer that enables an analysis algorithm to be programmed against the state of the recorded system through familiar terms of source-level variable and type names. To support performance analysis, the replay engine provides a faithful performance model of the original execution during replay.
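The record-once, replay-offline idea in this abstract can be illustrated with a toy sketch. It is not the dissertation's replay machine: it assumes that all nondeterminism reaches the workload through a single `next_input()` call, and the names `Recorder`, `Replayer`, and `workload` are hypothetical.

```python
import random
from typing import Callable, List


class Recorder:
    """Calls a live nondeterministic source and logs every value it returns."""

    def __init__(self, source: Callable[[], int]) -> None:
        self.source = source
        self.log: List[int] = []

    def next_input(self) -> int:
        value = self.source()
        self.log.append(value)
        return value


class Replayer:
    """Feeds the logged values back in order instead of calling the live source."""

    def __init__(self, log: List[int]) -> None:
        self.log = list(log)
        self.position = 0

    def next_input(self) -> int:
        value = self.log[self.position]
        self.position += 1
        return value


def workload(inputs, steps: int = 5) -> int:
    """Hypothetical workload: deterministic except for the values it reads."""
    state = 0
    for _ in range(steps):
        state = (state * 31 + inputs.next_input()) % 1_000_003
    return state


def record_then_replay(steps: int = 5):
    """Run once live while recording, then re-run from the log alone."""
    recorder = Recorder(lambda: random.randrange(1 << 16))
    original = workload(recorder, steps)
    replayed = workload(Replayer(recorder.log), steps)
    return original, replayed
```

Because the replayed run sees exactly the logged inputs, both runs end in the same state, which is what lets an analysis be repeated offline as often as needed. A full-system recorder has to capture many more event sources (interrupts, device input, timing) than this sketch does.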

The University of Utah: J. Willard Marriott Digital Library

Memory resource balancing for virtualized computing

Author: Zhao Weiming
Publication venue: Digital Commons @ Michigan Tech
Publication date: 01/01/2011

Virtualization has become a common abstraction layer in modern data centers. By multiplexing hardware resources into multiple virtual machines (VMs) and thus enabling several operating systems to run on the same physical platform simultaneously, it can effectively reduce power consumption and building size or improve security by isolating VMs. In a virtualized system, memory resource management plays a critical role in achieving high resource utilization and performance. Insufficient memory allocation to a VM will degrade its performance dramatically. Conversely, over-allocation wastes memory resources. Meanwhile, a VM’s memory demand may vary significantly. As a result, effective memory resource management calls for a dynamic memory balancer which, ideally, can adjust the memory allocation of each VM in a timely manner based on its current memory demand and thus achieve the best memory utilization and the optimal overall performance. In order to estimate the memory demand of each VM and to arbitrate possible memory resource contention, a widely proposed approach is to construct an LRU-based miss ratio curve (MRC), which provides not only the current working set size (WSS) but also the correlation between performance and the target memory allocation size. Unfortunately, the cost of constructing an MRC is nontrivial. In this dissertation, we first present a low-overhead LRU-based memory demand tracking scheme, which includes three orthogonal optimizations: AVL-based LRU organization, dynamic hot set sizing and intermittent memory tracking. Our evaluation results show that, for the whole SPEC CPU 2006 benchmark suite, after applying the three optimizing techniques, the mean overhead of MRC construction is lowered from 173% to only 2%. Based on the current WSS, we then predict its trend in the near future and take different strategies for different prediction results.
When there is a sufficient amount of physical memory on the host, the balancer distributes the host's memory among the VMs locally. Once the local memory resource is insufficient and the memory pressure is predicted to persist for a sufficiently long time, a relatively expensive solution, VM live migration, is used to move one or more VMs from the hot host to other host(s). Finally, for transient memory pressure, a remote cache is used to alleviate the temporary performance penalty. Our experimental results show that this design achieves a 49% center-wide speedup.
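To make the LRU-based miss ratio curve concrete, here is a minimal sketch of the textbook stack-distance construction. It uses a plain Python list as the LRU stack, so each access costs time linear in the number of distinct pages; it does not include the dissertation's AVL-based organization, hot set sizing, or intermittent tracking. The function names and the page-number trace format are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def lru_stack_distances(trace: Sequence[int]) -> List[int]:
    """Return the LRU stack distance of each access (-1 for a first touch)."""
    stack: List[int] = []  # most recently used page sits at the end
    distances: List[int] = []
    for page in trace:
        if page in stack:
            # Depth counted from the most recently used end, starting at 1.
            distances.append(len(stack) - stack.index(page))
            stack.remove(page)
        else:
            distances.append(-1)
        stack.append(page)
    return distances


def miss_ratio_curve(trace: Sequence[int], max_pages: int) -> Dict[int, float]:
    """Map each allocation size (in pages) to the LRU miss ratio on the trace."""
    distances = lru_stack_distances(trace)
    total = len(distances)
    if total == 0:
        return {}
    curve: Dict[int, float] = {}
    for size in range(1, max_pages + 1):
        # An access misses if it is a first touch or lies deeper than the allocation.
        misses = sum(1 for d in distances if d == -1 or d > size)
        curve[size] = misses / total
    return curve
```

For the cyclic trace `[1, 2, 3, 1, 2, 3]`, every reuse has stack distance 3, so the curve stays at a miss ratio of 1.0 for one or two pages and drops to 0.5 at three pages. A knee of this kind is what a balancer can read as the working set size.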

Michigan Technological University

A trace-driven methodology to evaluate memory management services of distributed operating systems for lightweight manycores

Author: Podestá Junior Emmanuel
Publication date: 01/01/2022

Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2022. Abstract: Lightweight manycores belong to a new class of emerging low-power processors for the Exascale era. These processors present several challenges for the development of applications, such as a distributed memory architecture, a limited amount of on-chip memory and no cache coherence. Recently, distributed Operating Systems have been proposed to address these challenges in a transparent way. In these systems, different Operating System services are deployed across the processor cores, the memory management service being one of the most important. However, the aforementioned challenges of lightweight manycores pose several obstacles to the design, implementation and future optimization of memory management services. This dissertation proposes a trace-driven methodology to evaluate and optimize features of the memory management service of distributed Operating Systems for lightweight manycores. By using a compact representation of the page access pattern of applications, the methodology is capable of mimicking the memory access pattern of the original applications on the target distributed Operating System running on a lightweight manycore. The methodology was integrated into a distributed Operating System (Nanvix) and validated using five applications from a benchmark specific to lightweight manycores (Capbench). Then, the methodology was applied to carry out a case study using a software-managed cache implementation available in Nanvix. The methodology made it possible to evaluate several configurations and different page replacement policies on the MPPA processor, even without the architectural support needed to implement them.
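The general shape of a trace-driven comparison of page replacement policies can be sketched as follows. This is a generic simulation, not Nanvix code or the dissertation's compact trace format: the trace is assumed to be a plain sequence of page numbers, and both function names are hypothetical.

```python
from collections import OrderedDict, deque
from typing import Iterable


def count_misses_fifo(trace: Iterable[int], frames: int) -> int:
    """Replay a page trace against a FIFO-managed cache of `frames` pages."""
    if frames < 1:
        raise ValueError("frames must be at least 1")
    resident = set()
    order = deque()  # pages in arrival order, oldest on the left
    misses = 0
    for page in trace:
        if page in resident:
            continue
        misses += 1
        if len(resident) == frames:
            resident.remove(order.popleft())  # evict the oldest arrival
        resident.add(page)
        order.append(page)
    return misses


def count_misses_lru(trace: Iterable[int], frames: int) -> int:
    """Replay the same trace against an LRU-managed cache of `frames` pages."""
    if frames < 1:
        raise ValueError("frames must be at least 1")
    resident: "OrderedDict[int, bool]" = OrderedDict()
    misses = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
            continue
        misses += 1
        if len(resident) == frames:
            resident.popitem(last=False)  # evict the least recently used
        resident[page] = True
    return misses
```

Replaying one recorded trace against several policies and cache sizes gives comparable miss counts without re-running the application. On the classic trace `[1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]`, FIFO misses 9 times with three frames but 10 times with four (Belady's anomaly), while LRU goes from 10 misses to 8.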

Repositório Institucional da UFSC

SimuBoost: Scalable Parallelization of Functional System Simulation

Author: Rittinghaus Marc
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 21/08/2019

Operating systems and security research frequently relies on functional system simulation to collect detailed runtime information such as memory access patterns. The simulator runs the workload under investigation in a virtual machine (VM) by interpreting instructions step by step or by translating them so that they operate on the state of the VM. This process permits extensive instrumentation and thereby yields information on runtime behavior that is not accessible on a physical machine. Although functional system simulation is considered a powerful tool, the immense slowdown caused by interpretation or translation is a substantial limitation of the approach. Compared with native execution, we measure a 30-fold slowdown for QEMU, and recording memory accesses doubles this factor. With simulators that offer more extensive instrumentation capabilities than QEMU, the slowdown can be an order of magnitude higher. This makes functional simulation unattractive for long-running, networked, or interactive workloads. Moreover, the slowdown produces unrealistic timing behavior as soon as activities outside the VM (e.g., input/output) are involved. In this thesis we present SimuBoost, a method for drastically accelerating functional system simulation. SimuBoost first runs the workload under investigation in a fast hardware-assisted virtual machine. This allows full interactivity with users and network devices. During execution, SimuBoost periodically creates checkpoints of the VM. These serve as starting points for a parallel simulation in which each interval is simulated and analyzed independently. Heterogeneous deterministic replay guarantees that, in this phase, the preceding hardware-assisted execution of each interval is reproduced exactly, including interactions and realistic timing behavior. Our prototype is able to reduce the run time of a functional system simulation significantly. Whereas conventional methods need more than 5 hours to simulate the build process of a modern Linux, SimuBoost completes the simulation in just 15 minutes. This is only 16% more time than the build takes in a fast hardware-assisted VM. SimuBoost is able to maintain this speed even with full instrumentation for recording memory accesses. This work is the first project to apply the concept of partitioning and parallelizing execution time to interactive system virtualization in a way that permits immediate parallel functional simulation. We complement the practical implementation with a mathematical model that formally describes the speedup characteristics. This makes it possible to predict the expected parallel simulation time for a given scenario and provides guidance for choosing the optimal interval length. In contrast to previous work, SimuBoost places a strong focus on scalability beyond the boundaries of a single physical system. A central key to this is the use of modern checkpointing technologies. As part of this work we present two novel methods for the efficient and effective compression of periodic system images.
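The trade-off behind choosing an interval length can be illustrated with a deliberately simplified toy model. This is not the mathematical model from the thesis, which the abstract does not spell out. It assumes an unlimited number of simulation workers, a fixed pause per checkpoint, and a constant simulation slowdown; all names and formulas below are assumptions made for this sketch.

```python
import math


def parallel_sim_time(t_vm: float, interval: float,
                      slowdown: float, checkpoint_cost: float) -> float:
    """Toy estimate of the time until the last interval has been simulated.

    t_vm            -- run time of the workload in the hardware-assisted VM (s)
    interval        -- checkpoint interval length (s)
    slowdown        -- simulation slowdown factor relative to the VM
    checkpoint_cost -- assumed VM pause per checkpoint (s)
    """
    n_checkpoints = t_vm / interval
    recording_time = t_vm + n_checkpoints * checkpoint_cost
    # With unlimited workers, only the final interval is still left to simulate
    # once recording ends; every earlier interval has already been handed out.
    last_interval_sim = slowdown * interval
    return recording_time + last_interval_sim


def best_interval(t_vm: float, slowdown: float, checkpoint_cost: float) -> float:
    """Interval minimising the toy model: d/dL (c*T/L + s*L) = 0 gives sqrt(c*T/s)."""
    return math.sqrt(checkpoint_cost * t_vm / slowdown)
```

With a hypothetical 600 s workload, a 30-fold slowdown and 0.05 s per checkpoint, the toy optimum is a 1 s interval for a total of 660 s. Both halving and doubling the interval raise the estimate to 675 s, because short intervals pay for more checkpoints and long intervals pay for a longer final simulation.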

KITopen

Service Boosters: Library Operating Systems For The Datacenter

Author: Demoulin Henri Maxime
Publication venue: ScholarlyCommons
Publication date: 01/01/2021

Cloud applications are taking an increasingly important place in our technology and economic landscape. Consequently, they are subject to stringent performance requirements. High tail latency (percentiles at the tail of the response time distribution) is a threat to these requirements. As little as 0.01% slow requests in one microservice can significantly degrade performance for the entire application. The conventional wisdom is that application-awareness is crucial to design optimized performance management systems, but comes at the cost of maneuverability. Consequently, existing execution environments are often general-purpose and ignore important application features such as the architecture of request processing pipelines or the type of requests being served. These one-size-fits-all solutions are missing crucial information to identify and remove sources of high tail latency. This thesis aims to develop a lightweight execution environment exploiting application semantics to optimize tail performance for cloud services. This system, dubbed Service Boosters, is a library operating system exposing application structure and semantics to the underlying resource management stack. Using Service Boosters, programmers use a generic programming model to build, declare and annotate their request processing pipeline, while performance engineers can program advanced management strategies. Using Service Boosters, I present three systems, FineLame, Perséphone, and DeDoS, that exploit application awareness to provide real-time anomaly detection, tail-tolerant RPC scheduling, and resource harvesting. FineLame leverages awareness of the request processing pipeline to deploy monitoring and anomaly detection probes. Using these, FineLame can detect abnormal in-flight requests whenever they depart from the expected behavior and alert other resource management modules.
Perséphone exploits an understanding of request types to dynamically allocate resources to each type and prevent pathological head-of-line blocking from heavy-tailed workloads, without the need for interrupts. Perséphone is a low-overhead solution well suited for microsecond-scale workloads. Finally, DeDoS can identify overloaded components and dynamically scale them, harvesting only the resources needed to quench the overload. Service Boosters is a powerful framework to handle tail latency in the datacenter. Service Boosters clearly separates the roles of application development and performance engineering, proposing a general-purpose application programming model while enabling the development of specialized resource management modules such as Perséphone and DeDoS.
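One common way a tiny fraction of slow requests becomes visible at the application level is fan-out: a user request that touches a service many times is slow if any one of those calls is slow. The arithmetic below is a generic illustration under an independence assumption, not necessarily the mechanism the thesis analyses, and the fan-out value is hypothetical.

```python
def prob_any_slow(p_slow: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fan_out


def example_amplification() -> float:
    """0.01% slow calls, hypothetical fan-out of 100 calls per user request."""
    return prob_any_slow(0.0001, 100)
```

Under these assumptions, roughly 1% of user requests hit at least one slow call, about a hundred times the per-call rate.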

ScholarlyCommons@Penn

Evaluating TLB (Translation Lookaside Buffer) Performance Overhead for NVM (non-volatile Memory) Hybrid System

Author: Guo Xiang
Publication venue: DigitalCommons@UMaine
Publication date: 20/12/2020

As non-volatile memory (NVM) technology offers near-DRAM performance and near-disk capacity, NVM has emerged as a new storage class. Conventional file systems, designed for hard disk drives or solid-state drives, need to be re-examined or even redesigned for NVM storage. For example, new file systems such as NOVA, HMFS, HMVFS and Ext4-DAX have been developed and implemented to fully leverage NVM’s characteristics, such as fast fine-grained access. This thesis research uses a variety of I/O workloads to evaluate the performance overhead of the TLB (translation lookaside buffer) in various file systems on emulated NVM storage systems, in which NVM resides on the memory bus. As NVM’s capacity becomes much greater than that of DRAM and applications’ footprints continue to increase rapidly, the number of TLB entries required scales up at the same pace, leading to a significant number of TLB misses. The goal of this research is to gain insights into file system optimizations on storage-class memory. Experimental results show that NVM-based file systems can have 50% more TLB overhead than conventional file systems under the same file operations. Profiling based on performance counters shows that TLB-friendly journaling/logging should be taken into consideration in future file system design.
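The scaling argument above comes down to simple arithmetic: the number of translations needed to cover a footprint is the footprint divided by the page size, while a TLB covers only its entry count times the page size. The sketch below works this through; the 64 GiB footprint and the 1536-entry TLB are hypothetical values chosen for illustration, not figures from the thesis.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(footprint_bytes: int, page_bytes: int) -> int:
    """Number of pages (and thus distinct translations) covering a footprint."""
    return -(-footprint_bytes // page_bytes)  # ceiling division


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory a TLB with `entries` entries can map at once."""
    return entries * page_bytes
```

A 64 GiB footprint needs 16,777,216 translations with 4 KiB pages but only 32,768 with 2 MiB pages, whereas a 1536-entry TLB using 4 KiB pages reaches just 6 MiB. This gap is why larger memories put more pressure on the TLB.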

University of Maine

Architectural support for message queue task parallelism

Author: Wu Qinzhe
Publication date: 04/01/2024

Scaling the number of threads is an attractive way to exploit task-level parallelism and boost performance. From the perspective of software programming, many applications (e.g., network packet processing, SQL queries) can be composed of a set of small tasks. Those tasks are arranged in a data flow graph, and each task is undertaken by some threads. Message queues are often used to coordinate the tasks among the threads. On the hardware side, thread scaling matches the trend toward more Processing Elements (PEs) in modern Chip Multiprocessors (CMPs) than ever before. This is because a single PE cannot simply run faster due to power and thermal limitations; instead, architects have to spend more transistors on an increasing number of PEs in order to improve the overall computing power of a processor. Unfortunately, this paradigm of using message queues to drive parallel tasks sometimes leads to diminishing performance returns due to issues in the architecture and system design. In particular, conventional coherent shared-memory architectures make task-parallel workloads suffer from unnecessary synchronization overhead and load-to-use latency. For instance, when passing messages through queues, multiple threads can contend for exclusive ownership of the cacheline where the shared queue data structure resides. The more threads there are, the more severe the contention, because every transition that upgrades a cacheline from the shared to the exclusive state needs to invalidate more copies in the private caches of other cores and wait for acknowledgements from more cores. Such overhead hurts the scalability of threads synchronizing via message queues. Adding to the coherence overhead, the load-to-use latency (from a consumer requesting data until the data has been moved to the consumer for use) is often on the critical path, slowing down the computation.
This is because the cache hierarchy in modern processors creates several layers of local storage that buffer data separately for different cores. Therefore, serving message queue data in an on-demand manner incurs longer load-to-use latency. It is also challenging to schedule message-driven tasks to use cores efficiently when the arrival rate and the service rate are mismatched. A runtime system wastes CPU cycles if it leaves tasks blocked on full or empty message queues, while switching tasks incurs additional scheduling overhead. Diverse system topologies further complicate the problem, as scheduling also needs to take data locality into consideration. This dissertation explores architectural support for enhancing the scalability of message queue task parallelism, reducing the load-to-use latency, and avoiding blocking. Specifically, this dissertation designs and evaluates a message queue architecture that lowers the overhead of synchronization on shared queue state, a speculation technique to hide the load-to-use latency, and a locality-aware message queue runtime system with low overhead for scheduling and buffer resizing. The first contribution of the dissertation is the Virtual-Link (VL) scalable message queue architecture. Instead of having threads access the shared queue state variables (i.e., head, tail, or lock) atomically, VL provides configurable hardware support for both data transfer and synchronization. Unlike other hardware queue architectures with a dedicated network, VL reuses the existing cache coherence network and delivers a virtualized channel as if there were a direct link (or route) between two arbitrary PEs.
VL facilitates efficient synchronized data movement between M:N producers and consumers with several benefits: (i) the number of sharers on synchronization primitives is reduced to zero, eliminating a primary bottleneck of traditional lock-free queues; (ii) memory spills, snoops, and invalidations are reduced; and (iii) data stays on the fast path (inside the interconnect) a majority of the time. Another contribution of the dissertation is the SPAMeR speculation mechanism. SPAMeR can speculatively push messages in anticipation of consumer message requests. With speculation, the latency of moving data from the source to the consumer that needs it can be partially or fully overlapped with the message processing time. Unlike prefetching approaches, which predict what addresses to fetch next, with a queue we know exactly what data is needed next but not when it is needed; SPAMeR proposes algorithms that learn from queue operation history in order to predict this. Finally, the dissertation contributes the ARMQ locality-aware runtime. ARMQ collects a set of approaches that avoid message queue blocking, ranging from the most general, yielding, to dynamically resizing the buffer and spawning helper tasks. On the one hand, ARMQ minimizes overheads (e.g., wasteful polling, context switches, memory allocation and copying) with a few techniques (e.g., userspace threading, a chunk-based ring buffer). On the other hand, ARMQ schedules the message-driven tasks precisely and opportunely, in order to maximize the data locality preserved (in favor of the cache) and balance the resource allocation.
Electrical and Computer Engineering
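For reference, the conventional software baseline this abstract argues against looks roughly like the sketch below: a bounded ring buffer whose head, tail and count are shared state guarded by one lock. It is a generic illustration in Python, not code from the dissertation, and it cannot show the cache-coherence traffic itself; the class and function names are hypothetical.

```python
import threading
from typing import Any, List, Optional


class LockedRingQueue:
    """Bounded FIFO whose head/tail/count are shared by every producer and consumer."""

    def __init__(self, capacity: int) -> None:
        self.buffer: List[Optional[Any]] = [None] * capacity
        self.capacity = capacity
        self.head = 0   # next slot to read
        self.tail = 0   # next slot to write
        self.count = 0  # number of occupied slots
        self.lock = threading.Lock()
        self.not_empty = threading.Condition(self.lock)
        self.not_full = threading.Condition(self.lock)

    def push(self, item: Any) -> None:
        with self.not_full:
            while self.count == self.capacity:
                self.not_full.wait()  # producer blocks on a full queue
            self.buffer[self.tail] = item
            self.tail = (self.tail + 1) % self.capacity
            self.count += 1
            self.not_empty.notify()

    def pop(self) -> Any:
        with self.not_empty:
            while self.count == 0:
                self.not_empty.wait()  # consumer blocks on an empty queue
            item = self.buffer[self.head]
            self.buffer[self.head] = None
            self.head = (self.head + 1) % self.capacity
            self.count -= 1
            self.not_full.notify()
            return item


def run_pipeline(n_items: int, capacity: int = 4) -> List[int]:
    """One producer and one consumer exchanging n_items through the queue."""
    queue = LockedRingQueue(capacity)
    received: List[int] = []

    def producer() -> None:
        for i in range(n_items):
            queue.push(i)

    def consumer() -> None:
        for _ in range(n_items):
            received.append(queue.pop())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Every `push` and `pop` touches the same lock and indices. On a cache-coherent machine these are the shared queue state variables whose cacheline bounces between cores as more threads join. The blocking waits correspond to the full/empty stalls the abstract says a runtime should avoid.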

Texas ScholarWorks

Hardware-assisted instruction profiling and latency detection

Authors: Dagenais Michel R.; Sharma Suchakrapani Datt
Publication venue: Institution of Engineering and Technology (IET)
Publication date: 01/08/2016

Debugging and profiling tools can alter the execution flow or timing and can induce heisenbugs, and are thus only marginally useful for debugging time-critical systems. Software tracing, however advanced it may be, consumes precious computing resources. In this study, the authors analyse state-of-the-art hardware-tracing support, as provided in modern Intel processors, and propose a new technique which uses the processor hardware for tracing without any code instrumentation or tracepoints. They demonstrate the utility of their approach with contributions in three areas: syscall latency profiling, instruction profiling and software-tracer impact detection. They present improvements in performance and in the granularity of data gathered with the hardware-assisted approach, as compared with traditional software-only tracing and profiling. The performance impact on the target system, measured as time overhead, is on average 2–3%, with the worst case being 22%. They also define a way to measure and quantify the time resolution provided by hardware tracers for trace events, and observe the effect of fine-tuning hardware tracing for optimum utilisation. As compared with other in-kernel tracers, they observed that hardware-based tracing has a much reduced overhead, while achieving greater precision. Moreover, the other tracing techniques are ineffective in certain tracing scenarios.
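Whatever the trace source, syscall latency profiling ends with a post-processing step that pairs entry and exit events. The sketch below shows that step on an assumed, already decoded event stream; the tuple layout `(timestamp_ns, thread_id, kind, syscall_name)` is hypothetical and is not the format of Intel's hardware trace packets or of any particular tracer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical decoded event: (timestamp_ns, thread_id, kind, syscall_name)
Event = Tuple[int, int, str, str]


def syscall_latencies(events: Iterable[Event]) -> Dict[str, List[int]]:
    """Pair each syscall entry with its exit on the same thread; return latencies in ns."""
    pending: Dict[Tuple[int, str], int] = {}
    latencies: Dict[str, List[int]] = defaultdict(list)
    for timestamp, tid, kind, name in events:
        key = (tid, name)
        if kind == "entry":
            pending[key] = timestamp
        elif kind == "exit" and key in pending:
            latencies[name].append(timestamp - pending.pop(key))
        # Exits without a recorded entry (e.g. trace started mid-call) are skipped.
    return dict(latencies)
```

The resulting per-syscall latency lists can then be summarised as histograms or percentiles to spot unusually slow calls.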

Crossref

Directory of Open Access Journals

PolyPublie

Monitoring and analysis system for performance troubleshooting in data centers

Author: Wang Chengwei
Publication venue: Georgia Institute of Technology
Publication date: 13/01/2014

It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing service (ELB for short), which went unnoticed at the time. The mistake first led to a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical issue in which EC2 customers were significantly affected. For example, Netflix, which was using hundreds of Amazon ELB services, experienced an extensive streaming service outage, and many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world’s attention. As this Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen.
We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of application- to system/platform-level metrics, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a ‘thin layer’ that accounts for no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via solely application-level monitoring, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.
Ph.D.
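The idea of narrowing a search by following recorded interactions can be sketched as a bounded breadth-first traversal of an interaction graph. This is only an illustration of graph-based guidance in general, not the VFocus algorithm, which the abstract does not describe; the adjacency-map format and the function name are assumptions.

```python
from collections import deque
from typing import Dict, Iterable, Mapping


def reachable_components(interactions: Mapping[str, Iterable[str]],
                         start: str, max_hops: int) -> Dict[str, int]:
    """Return components within `max_hops` interactions of `start`, with hop counts."""
    hops = {start: 0}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in interactions.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                frontier.append(neighbor)
    return hops
```

Starting from the component where a slowdown was observed, an operator could limit detailed monitoring to the returned set rather than instrumenting every node. For a hypothetical graph in which `web` calls `app1` and `app2`, which in turn call `db` and `cache`, a two-hop query from `web` returns those five components and leaves anything further downstream out.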

Scholarly Materials And Research @ Georgia Tech