
    Doctor of Philosophy

    Get PDF
    A modern software system is a composition of parts that are themselves highly complex: operating systems, middleware, libraries, servers, and so on. In principle, compositionality of interfaces means that we can understand any given module independently of the internal workings of other parts. In practice, however, abstractions are leaky, and with every generation, modern software systems grow in complexity. Traditional ways of understanding failures, explaining anomalous executions, and analyzing performance are reaching their limits in the face of emergent behavior, unrepeatability, cross-component execution, software aging, and adversarial changes to the system at run time. Deterministic systems analysis has the potential to change the way we analyze and debug software systems. Recorded once, the execution of the system becomes an independent artifact that can be analyzed offline. The availability of the complete system state, the guaranteed behavior of re-execution, and the absence of limitations on the run-time complexity of analysis collectively enable the deep, iterative, and automatic exploration of the dynamic properties of the system. This work creates a foundation for making deterministic replay a ubiquitous system-analysis tool. It defines design and engineering principles for building fast and practical replay machines capable of capturing the complete execution of an entire operating system with an overhead of a few percent, on a realistic workload, and with minimal installation costs. To give replay analysis tools an intuitive programming interface, this work implements a powerful virtual machine introspection layer that lets an analysis algorithm be programmed against the state of the recorded system in terms of familiar source-level variable and type names. To support performance analysis, the replay engine provides a faithful performance model of the original execution during replay.
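
    To make the record/replay principle concrete, the sketch below funnels every non-deterministic input through a single logging choke point, so a recorded run can be re-executed bit-for-bit and analyzed offline as an independent artifact. This is a minimal illustration of the general technique under our own simplifying assumptions (the only non-determinism modeled is rand()); it is not the dissertation's replay engine:

    /* Minimal record/replay sketch (ours, not the dissertation's engine).
     * Record mode logs every non-deterministic value; replay mode feeds the
     * log back, reproducing the run exactly. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static FILE *log_file;
    static int recording;

    /* Every non-deterministic input must pass through this choke point. */
    static int nondet_input(void) {
        int v;
        if (recording) {
            v = rand();                          /* live source */
            fwrite(&v, sizeof v, 1, log_file);
        } else if (fread(&v, sizeof v, 1, log_file) != 1) {
            abort();                             /* log exhausted */
        }
        return v;
    }

    int main(int argc, char **argv) {
        recording = (argc > 1 && strcmp(argv[1], "record") == 0);
        log_file = fopen("nondet.log", recording ? "wb" : "rb");
        if (!log_file) return 1;

        long state = 0;                          /* "system state" under analysis */
        for (int i = 0; i < 1000; i++)
            state += nondet_input() % 97;        /* deterministic given inputs */

        printf("%s: final state = %ld\n", recording ? "record" : "replay", state);
        fclose(log_file);
        return 0;
    }

    Running the program once with the argument "record" and then again without arguments prints the same final state both times; any analysis can be rerun against the log arbitrarily often.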

    A Study on a Relational Database Logging System in a Non-Volatile Memory Environment

    Get PDF
    Master's thesis -- Department of Computer Science and Engineering, College of Engineering, Seoul National University Graduate School, August 2022. Advisor: Heon Y. Yeom. Non-volatile memory (NVM) is a promising storage technology that combines not only high performance and byte-addressability (like DRAM) but also durability (like SSD). However, because existing relational database management systems (RDBMS) were designed on the assumption that all data and logs reside on high-latency, block-based devices, they cannot yet take full advantage of this new technology. Consequently, write operations under-utilize the NVM device and can degrade overall performance. In this work, we first analyze the redo-log mechanism in InnoDB and measure its impact on overall system performance. Based on this analysis, we propose NF-Log, a redesigned log critical path that eliminates redundant flushes to the device by removing the influence of the page cache and adopting the appropriate APIs. We further eliminate the remote-persistent-memory access overhead and optimize the write-ahead mechanism, reducing memory-copy overhead by adjusting the write-ahead triggering threshold and granularity. By exploiting NVM's byte-addressability and persistence, our design reduces the write size by 30% compared to baseline InnoDB and improves performance by up to 38% on write-intensive sysbench workloads and up to 16% on TPC-C.
    Contents: Chapter 1, Introduction (Motivation; Contribution; Outline; Related Work; Background). Chapter 2, Implementation and Design (Analysis of the InnoDB Log System: Concurrent Writes, Checkpoint; Design Goals; Immediate Persists; Improving Data Locality; Improving the Write-Ahead Mechanism). Chapter 3, Evaluation (Experimental Setup; Performance Breakdown: Immediate-Persist, Write-Ahead, and Data-Locality Optimizations; Overall Performance; Resource Utilization; Write Size). Chapter 4, Conclusion.
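
    As a rough illustration of the flush-once idea behind NF-Log, the sketch below appends records to a mapped log and persists only the not-yet-durable suffix, so units whose bytes are all durable are never written back again. It is our own minimal sketch, not the thesis code: it uses a regular mmap'ed file with msync at page granularity, whereas a DAX-mapped NVM log would use clwb/sfence at cache-line granularity:

    /* Sketch of a flush-once log tail (our names and constants). Only the
     * suffix [durable_lsn, write_lsn) is persisted, aligned to the
     * persistence granularity. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LOG_SIZE   (16u * 1024 * 1024)
    #define FLUSH_GRAN 4096u   /* page here; a cache line on DAX-mapped NVM */

    static uint8_t *log_base;
    static size_t write_lsn;    /* bytes appended so far */
    static size_t durable_lsn;  /* bytes already persisted */

    static void log_open(const char *path) {
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, LOG_SIZE) != 0) _exit(1);
        log_base = mmap(NULL, LOG_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        close(fd);
    }

    static void log_append(const void *rec, size_t len) {
        memcpy(log_base + write_lsn, rec, len);  /* no flush per record */
        write_lsn += len;
    }

    /* Group commit: flush only units that contain not-yet-durable bytes. */
    static void log_persist(void) {
        size_t lo = durable_lsn & ~(size_t)(FLUSH_GRAN - 1);
        size_t hi = (write_lsn + FLUSH_GRAN - 1) & ~(size_t)(FLUSH_GRAN - 1);
        msync(log_base + lo, hi - lo, MS_SYNC);  /* clwb+sfence on real NVM */
        durable_lsn = write_lsn;
    }

    int main(void) {
        log_open("redo.log");
        for (int i = 0; i < 100; i++) {
            char rec[64] = {0};
            log_append(rec, sizeof rec);
            if (i % 8 == 7) log_persist();       /* batch, not per record */
        }
        log_persist();
        return 0;
    }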

    EHAP-ORAM: Efficient Hardware-Assisted Persistent ORAM System for Non-volatile Memory

    Full text link
    Oblivious RAM (ORAM) protection of access patterns is essential for secure NVM. In an ORAM system, data and PosMap metadata are mapped in pairs to perform secure accesses, so we focus on the problem of crash consistency in the ORAM system. Unfortunately, traditional software-based support for ORAM crash consistency is not only expensive but can also leak information. To date, there is no work on crash-consistency mechanisms tailored to ORAM systems. To support crash consistency without weakening ORAM security or compromising performance, we propose EHAP-ORAM. First, we analyze the access steps of basic ORAM to derive the requirements for supporting ORAM crash consistency. Second, we improve the ORAM controller. Third, for the improved hardware, we propose several persistence protocols that support ORAM crash consistency. Finally, we compare our persistent ORAM with a system without crash-consistency support: non-recursive and recursive EHAP-ORAM incur only 3.36% and 3.65% performance overhead, respectively. The results show that EHAP-ORAM supports effective crash consistency with minimal performance and hardware overhead while remaining friendly to NVM lifetime.
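
    The pairing requirement described above can be illustrated with a write-new-then-flip update of a data block and its PosMap entry: the mapping must never become durable before the block it points to. The sketch below is our own illustration of that invariant, with a stubbed persist barrier standing in for the clwb/sfence sequence a hardware ORAM controller would issue; none of the names come from the paper:

    /* Illustration of the data/PosMap pairing invariant (illustrative names).
     * A crash may interrupt at any point: until step 3 completes, the PosMap
     * still points at the old, fully durable block, so data and mapping are
     * always consistent as a pair. */
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 64
    #define NUM_BLOCKS 1024

    static uint8_t  nvm_blocks[NUM_BLOCKS][BLOCK_SIZE]; /* stands in for NVM */
    static uint64_t posmap[NUM_BLOCKS];                 /* logical -> physical */

    static void persist_barrier(const void *addr, size_t len) {
        (void)addr; (void)len;  /* clwb over [addr, addr+len); sfence */
    }

    static void oram_update(uint64_t logical, const uint8_t *data,
                            uint64_t new_physical) {
        memcpy(nvm_blocks[new_physical], data, BLOCK_SIZE);
        persist_barrier(nvm_blocks[new_physical], BLOCK_SIZE); /* 1: data durable */
        posmap[logical] = new_physical;   /* 2: 8-byte store, atomic on x86 */
        persist_barrier(&posmap[logical], sizeof(uint64_t));   /* 3: map durable */
    }

    int main(void) {
        uint8_t buf[BLOCK_SIZE] = {42};
        oram_update(7, buf, 123);
        return 0;
    }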

    Resiliency in numerical algorithm design for extreme scale simulations

    Get PDF
    This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, which was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically. Peer reviewed. Signed by 36 authors: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik Göddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortí, Francesco Rizzi, Ulrich Rüde, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thönnes, Andreas Wagner and Barbara Wohlmuth. Postprint (author's final draft).
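
    The cost figures quoted above, made explicit, together with the classical Young/Daly estimate for the optimal checkpoint interval that the checkpointing discussion implicitly appeals to (the estimate is standard in the fault-tolerance literature, not a result of this article):

    \[
      20\,\mathrm{MW} \times 48\,\mathrm{h} = 960\,\mathrm{MWh} \approx 10^{6}\,\mathrm{kWh},
      \qquad
      10^{18}\,\mathrm{FLOP/s} \times 48 \times 3600\,\mathrm{s} \approx 1.7 \times 10^{23}\,\mathrm{FLOP},
    \]
    \[
      \tau_{\mathrm{opt}} \approx \sqrt{2\,\delta\,M},
    \]

    where \(\delta\) is the time to write one checkpoint and \(M\) the mean time between failures. As \(M\) falls toward the recovery time, the optimal interval shrinks until the machine spends most of its time checkpointing and recovering rather than computing, which is exactly the regime the abstract warns about.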

    Understanding and Optimizing Flash-based Key-value Systems in Data Centers

    Get PDF
    Flash-based key-value systems are widely deployed in today’s data centers to provide high-speed data-processing services. These systems deploy flash-friendly data structures, such as slabs and Log-Structured Merge (LSM) trees, on flash-based Solid State Drives (SSDs) and provide efficient solutions for caching and storage scenarios. As data centers evolve rapidly, plenty of challenges and opportunities for further optimization arise. In this dissertation, we focus on understanding and optimizing flash-based key-value systems from the perspectives of workloads, software, and hardware. We first propose an online compression scheme, called SlimCache, that exploits the unique characteristics of key-value workloads to virtually enlarge the cache space, increase the hit ratio, and improve cache performance. Furthermore, to appropriately configure increasingly complex modern key-value systems, which can expose more than 50 parameters in addition to hardware and system settings, we quantitatively study and compare five multi-objective optimization methods for auto-tuning an LSM-tree-based key-value store in terms of throughput, 99th-percentile tail latency, convergence time, real-time system throughput, and the iteration process. Last but not least, we conduct an in-depth, comprehensive measurement study of flash-optimized key-value stores on recently emerging 3D XPoint SSDs. We reveal several unexpected bottlenecks in current key-value store designs and present three exemplary case studies that showcase the efficacy of removing these bottlenecks with simple methods on 3D XPoint SSDs. Our experimental results show that the proposed solutions significantly outperform traditional methods. Our study also provides system implications for auto-tuning key-value systems on flash-based SSDs and optimizing them on revolutionary 3D XPoint SSDs.
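
    As a rough sketch of the online-compression idea behind SlimCache (as we read it, under our own assumptions about entry layout and threshold, not the paper's actual design), the snippet below deflates a value before caching it and falls back to the raw bytes when compression does not pay, so the same cache space holds more objects. Build with -lz:

    /* Transparent value compression in a KV cache (our layout and 10%
     * threshold). Incompressible values are stored raw. */
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    struct cache_entry {
        int compressed;          /* 1 if data holds deflated bytes */
        size_t stored_len;       /* bytes actually occupying cache space */
        size_t raw_len;          /* original value size */
        unsigned char *data;
    };

    static struct cache_entry *cache_put(const unsigned char *val, size_t len) {
        struct cache_entry *e = malloc(sizeof *e);
        uLongf clen = compressBound(len);
        unsigned char *cbuf = malloc(clen);

        if (compress(cbuf, &clen, val, len) == Z_OK && clen < len * 9 / 10) {
            e->compressed = 1;   /* saved >= 10%: keep the deflated form */
            e->stored_len = clen;
            e->data = realloc(cbuf, clen);
        } else {
            e->compressed = 0;   /* incompressible: store raw bytes */
            e->stored_len = len;
            e->data = malloc(len);
            memcpy(e->data, val, len);
            free(cbuf);
        }
        e->raw_len = len;
        return e;
    }

    static size_t cache_get(const struct cache_entry *e, unsigned char *out) {
        uLongf rlen = e->raw_len;
        if (!e->compressed) {
            memcpy(out, e->data, e->stored_len);
            return e->stored_len;
        }
        uncompress(out, &rlen, e->data, e->stored_len);  /* inflate on hit */
        return rlen;
    }

    int main(void) {
        unsigned char v[4096] = {0};   /* highly compressible value */
        struct cache_entry *e = cache_put(v, sizeof v);
        unsigned char out[4096];
        return cache_get(e, out) == sizeof v ? 0 : 1;
    }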

    APUS: Fast and Scalable PAXOS on RDMA

    Get PDF
    State machine replication (SMR) uses Paxos to enforce the same inputs for a program (e.g., Redis) replicated on a number of hosts, tolerating various types of failures. Unfortunately, traditional Paxos protocols incur prohibitive performance overhead on server programs due to their high consensus latency on TCP/IP. Worse, the consensus latency of extant Paxos protocols increases drastically when more concurrent client connections or hosts are added. This paper presents APUS, the first RDMA-based Paxos protocol that aims to be fast and scalable in both client connections and hosts. APUS intercepts the inbound socket calls of an unmodified server program, assigns a total order to all input requests, and uses fast RDMA primitives to replicate these requests concurrently. We evaluated APUS on nine widely used server programs (e.g., Redis and MySQL). APUS incurred a mean overhead of 4.3% in response time and 4.2% in throughput. We integrated APUS with the SMR system Calvin; our Calvin-APUS integration was 8.2X faster than the extant Calvin-ZooKeeper integration. The consensus latency of APUS outperformed an RDMA-based consensus protocol by 4.9X. APUS source code and raw results are released at github.com/hku-systems/apus.
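
    The interception step described above can be illustrated with an LD_PRELOAD shim that wraps recv() so every inbound request passes through a sequencer before the unmodified server sees it. The sketch below is ours, not APUS code: order_request() is a hypothetical stand-in for the real RDMA-based consensus round, which it does not implement:

    /* LD_PRELOAD shim: numbers each inbound request before the server
     * processes it. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/socket.h>

    static atomic_ulong next_seq;   /* toy stand-in for a replicated sequencer */

    static void order_request(int fd, const void *buf, ssize_t len) {
        unsigned long seq = atomic_fetch_add(&next_seq, 1);
        (void)buf;  /* a real protocol would replicate (seq, buf) here */
        fprintf(stderr, "request #%lu: fd=%d len=%zd\n", seq, fd, len);
    }

    ssize_t recv(int fd, void *buf, size_t len, int flags) {
        static ssize_t (*real_recv)(int, void *, size_t, int);
        if (!real_recv)   /* resolve the next recv() in link order, once */
            real_recv = (ssize_t (*)(int, void *, size_t, int))
                        dlsym(RTLD_NEXT, "recv");
        ssize_t n = real_recv(fd, buf, len, flags);
        if (n > 0)
            order_request(fd, buf, n);
        return n;
    }

    Compiled with gcc -shared -fPIC shim.c -o shim.so -ldl and loaded via LD_PRELOAD=./shim.so, the shim observes and numbers every request without modifying the server binary, which is what lets this style of system support unmodified programs.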

    SimuBoost: Scalable Parallelization of Functional System Simulation

    Get PDF
    In operating-systems and security research, functional full-system simulation is frequently used to collect detailed run-time information such as memory-access patterns. The simulator executes the workload under investigation in a virtual machine (VM), interpreting instructions step by step or translating them so that they operate on the VM's state. This process permits extensive instrumentation and thereby yields information about run-time behavior that is inaccessible on a physical machine. Although functional full-system simulation is considered a powerful tool, the immense slowdown caused by interpretation or translation is a substantial limitation of the approach. Compared to native execution, we measure a 30x slowdown for QEMU, and recording memory accesses doubles this factor. With simulators that offer richer instrumentation capabilities than QEMU, the slowdown can be an order of magnitude higher. This makes functional simulation unattractive for long-running, networked, or interactive workloads. Moreover, the slowdown produces unrealistic timing behavior as soon as activities outside the VM (e.g., I/O) are involved. In this work, we present SimuBoost, a method for drastically accelerating functional full-system simulation. SimuBoost first executes the workload under investigation in a fast hardware-assisted virtual machine, which permits full interactivity with users and network devices. During execution, SimuBoost periodically takes checkpoints of the VM. These serve as starting points for a parallel simulation in which each interval is simulated and analyzed independently. Heterogeneous deterministic replay guarantees that this phase exactly reproduces the preceding hardware-assisted execution of each interval, including interactions and realistic timing behavior. Our prototype substantially reduces the run time of functional full-system simulation: whereas simulating the build process of a modern Linux takes over 5 hours with conventional techniques, SimuBoost completes the simulation in just 15 minutes, only 16% more time than the build takes in a fast hardware-assisted VM. SimuBoost maintains this speed even under full instrumentation for recording memory accesses. This is the first project to apply the concept of partitioning and parallelizing execution time to interactive system virtualization in a way that permits immediate parallel functional simulation. We complement the practical implementation with a mathematical model that formally describes the speedup characteristics, making it possible to predict the expected parallel simulation time for a given scenario and providing guidance for choosing the optimal interval length. In contrast to previous work, SimuBoost places a strong focus on scalability beyond the boundaries of a single physical system, a central key to which is the use of modern checkpointing technologies. As part of this work, we present two novel methods for the efficient and effective compression of periodic system checkpoints.
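
    To make the speedup characteristics concrete, here is a simplified pipeline model in our own notation; it is a sketch consistent with the figures above, not necessarily the exact model developed in the thesis. With hardware-assisted recording time \(T\), interval length \(L\), simulation slowdown factor \(s\), and at least \(\lceil T/L \rceil\) workers, the interval recorded last becomes available at time \(T\) and finishes \(s\,L\) later:

    \[
      T_{\mathrm{par}} \;\approx\; T + s\,L,
      \qquad
      \frac{T_{\mathrm{par}} - T}{T} \;\approx\; \frac{s\,L}{T}.
    \]

    Shrinking \(L\) cuts this tail but increases checkpointing overhead during recording, which is why an optimal interval length exists. The reported numbers fit this shape: a 15-minute parallel simulation of a build that takes roughly 13 minutes (15 min / 1.16) in the hardware-assisted VM leaves a tail of about two minutes.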