108 research outputs found

    Combatting Advanced Persistent Threat via Causality Inference and Program Analysis

    Get PDF
    Cyber attackers are becoming more and more sophisticated. In particular, Advanced Persistent Threat (APT) is a new class of attack that targets a specifc organization and compromises systems over a long time without being detected. Over the years, we have seen notorious examples of APTs including Stuxnet which disrupted Iranian nuclear centrifuges and data breaches affecting millions of users. Investigating APT is challenging as it occurs over an extended period of time and the attack process is highly sophisticated and stealthy. Also, preventing APTs is diffcult due to ever-expanding attack vectors. In this dissertation, we present proposals for dealing with challenges in attack investigation. Specifcally, we present LDX which conducts precise counter-factual causality inference to determine dependencies between system calls (e.g., between input and output system calls) and allows investigators to determine the origin of an attack (e.g., receiving a spam email) and the propagation path of the attack, and assess the consequences of the attack. LDX is four times more accurate and two orders of magnitude faster than state-of-the-art taint analysis techniques. Moreover, we then present a practical model-based causality inference system, MCI, which achieves precise and accurate causality inference without requiring any modifcation or instrumentation in end-user systems. Second, we show a general protection system against a wide spectrum of attack vectors and methods. Specifcally, we present A2C that prevents a wide range of attacks by randomizing inputs such that any malicious payloads contained in the inputs are corrupted. The protection provided by A2C is both general (e.g., against various attack vectors) and practical (7% runtime overhead)

    Making non-volatile memory programmable

    Get PDF
    Byte-addressable, non-volatile memory (NVM) is emerging as a revolutionary memory technology that provides persistence, near-DRAM performance, and scalable capacity. By using NVM, applications can directly create and manipulate durable data in place without the need for serialization out to SSDs. Ideally, through NVM, persistent applications will be able to maintain crash-consistency at a minimal cost. However, before this is possible, improvements must be made at both the hardware and software level to support persistent applications. Currently, software support for NVM places too high of a burden on the developer, introducing many opportunities for mistakes while also being too rigid for compiler optimizations. Likewise, at the hardware level, too little information is passed to the processor about the instruction-level ordering requirements of persistent applications; this forces the hardware to require the use of coarse fences, which significantly slow down execution. To help realize the promise of NVM, this thesis proposes both new software and hardware support that make NVM programmable. From the software side, this thesis proposes a new NVM programming model which relieves the programmer from performing much of the accounting work in persistent applications, instead relying on the runtime to perform error-prone tasks. Specifically, within the proposed model, the user only needs to provide minimal markings to identify the persistent data set and to ensure data is updated in a crash-consistent manner. Given this new NVM programming model, this thesis next presents an implementation of the model in Java. I call my implementation AutoPersist and build my support into the Maxine research Java Virtual Machine (JVM). In this thesis I describe how the JVM can be changed to support the proposed NVM programming model, including adding new Java libraries, adding new JVM runtime features, and augmenting the behavior of existing Java bytecodes. In addition to being easy-to-use, another advantage of the proposed model is that it is amenable to compiler optimizations. In this thesis I highlight two profile-guided optimizations: eagerly allocating objects directly into NVM and speculatively pruning control flow to only include expected-to-be taken paths. I also describe how to apply these optimizations to AutoPersist and show they have a substantial performance impact. While designing AutoPersist, I often observed that dependency information known by the compiler cannot be passed down to the underlying hardware; instead, the compiler must insert coarse-grain fences to enforce needed dependencies. This is because current instruction set architectures (ISA) cannot describe arbitrary instruction-level execution ordering constraints. To fix this limitation, I introduce the Execution Dependency Extension (EDE), and describe how EDE can be added to an existing ISA as well as be implemented in current processor pipelines. Overall, emerging NVM technologies can deliver programmer-friendly high performance. However, for this to happen, both software and hardware improvements are necessary. This thesis takes steps to address current the software and hardware gaps: I propose new software support to assist in the development of persistent applications and also introduce new instructions which allow for arbitrary instruction-level dependencies to be conveyed and enforced by the underlying hardware. With these improvements, hopefully the dream of programmable high-performance NVM is one step closer to being realized

    GPU 에러 안정성 보장을 위한 컴파일러 기법

    Get PDF
    학위논문 (박사) -- 서울대학교 대학원 : 공과대학 전기·컴퓨터공학부, 2020. 8. 이재진.Due to semiconductor technology scaling and near-threshold voltage computing, soft error resilience has become more important. Nowadays, GPUs are widely used in high performance computing (HPC) because of its efficient parallel processing and modern GPUs designed for HPC use error correction code (ECC) to protect their storage including register files. However, adopting ECC in the register file imposes high area and energy overhead. To replace the expensive hardware cost of ECC, we propose Penny, a lightweight compiler-directed resilience scheme for GPU register file protection. We combine recent advances in idempotent recovery with low-cost error detection code. Our approach focuses on solving two important problems: 1. Can we guarantee correct error recovery using idempotent execution with error detection code? We show that when an error detection code is used with idempotence recovery, certain restrictions required by previous idempotent recovery schemes are no longer needed. We also propose a software-based scheme to prevent the checkpoint value from being overwritten before the end of the region where the value is required for correct recovery. 2. How do we reduce the execution overhead caused by checkpointing? In GPUs additional checkpointing store instructions inflicts considerably higher overhead compared to CPUs, due to its architectural characteristics, such as lack of store buffers. We propose a number of compiler optimizations techniques that significantly reduce the overhead.반도체 미세공정 기술이 발전하고 문턱전압 근처 컴퓨팅(near-threashold voltage computing)이 도입됨에 따라서 소프트 에러로부터의 복원이 중요한 과제가 되었다. 강력한 병렬 계산 성능을 지닌 GPU는 고성능 컴퓨팅에서 중요한 위치를 차지하게 되었고, 슈퍼 컴퓨터에서 쓰이는 GPU들은 에러 복원 코드인 ECC를 사용하여 레지스터 파일 및 메모리 등에 저장된 데이터를 보호하게 되었다. 하지만 레지스터 파일에 ECC를 사용하는 것은 큰 하드웨어나 에너지 비용을 필요로 한다. 이런 값비싼 ECC의 하드웨어 비용을 줄이기 위해 본 논문에서는 컴파일러 기반의 저비용 GPU 레지스터 파일 복원 기법인 Penny를 제안한다. 이는 최신의 멱등성(idempotency) 기반 에러 복원 기법을 저비용의 에러 검출 코드(EDC)와 결합한 것이다. 본 논문은 다음 두가지 문제를 해결하는 데에 집중한다. 1. 에러 검출 코드 기반으로 멱등성 기반 에러 복원을 사용시 소프트 에러로부터의 안전한 복원을 보장할 수 있는가?} 본 논문에서는 에러 검출 코드가 멱등성 기반 복원 기술과 같이 사용되었을 경우 기존의 복원 기법에서 필요로 했던 조건들 없이도 안전하게 에러로부터 복원할 수 있음을 보인다. 2. 체크포인팅에드는 비용을 어떻게 절감할 수 있는가?} GPU는 스토어 버퍼가 없는 등 아키텍쳐적인 특성으로 인해서 CPU와 비교하여 체크포인트 값을 저장하는 데에 큰 오버헤드가 든다. 이 문제를 해결하기 위해 본 논문에서는 다양한 컴파일러 최적화 기법을 통하여 오버헤드를 줄인다.1 Introduction 1 1.1 Why is Soft Error Resilience Important in GPUs 1 1.2 How can the ECC Overhead be Reduced 3 1.3 What are the Challenges 4 1.4 How do We Solve the Challenges 5 2 Comparison of Error Detection and Correction Coding Schemes for Register File Protection 7 2.1 Error Correction Codes and Error Detection Codes 8 2.2 Cost of Coding Schemes 9 2.3 Soft Error Frequency of GPUs 11 3 Idempotent Recovery and Challenges 13 3.1 Idempotent Execution 13 3.2 Previous Idempotent Schemes 13 3.2.1 De Kruijf's Idempotent Translation 14 3.2.2 Bolts's Idempotent Recovery 15 3.2.3 Comparison between Idempotent Schemes 15 3.3 Idempotent Recovery Process 17 3.4 Idempotent Recovery Challenges for GPUs 18 3.4.1 Checkpoint Overwriting 20 3.4.2 Performance Overhead 20 4 Correctness of Recovery 22 4.1 Proof of Safe Recovery 23 4.1.1 Prevention of Error Propagation 23 4.1.2 Proof of Correct State Recovery 24 4.1.3 Correctness in Multi-Threaded Execution 28 4.2 Preventing Checkpoint Overwriting 30 4.2.1 Register renaming 31 4.2.2 Storage Alternation by Checkpoint Coloring 33 4.2.3 Automatic Algorithm Selection 38 4.2.4 Future Works 38 5 Performance Optimizations 40 5.1 Compilation Phases of Penny 40 5.1.1 Region Formation 41 5.1.2 Bimodal Checkpoint Placement 41 5.1.3 Storage Alternation 42 5.1.4 Checkpoint Pruning 43 5.1.5 Storage Assignment 44 5.1.6 Code Generation and Low-level Optimizations 45 5.2 Cost Estimation Model 45 5.3 Region Formation 46 5.3.1 De Kruijf's Heuristic Region Formation 46 5.3.2 Region splitting and Region Stitching 47 5.3.3 Checkpoint-Cost Aware Optimal Region Formation 48 5.4 Bimodal Checkpoint Placement 52 5.5 Optimal Checkpoint Pruning 55 5.5.1 Bolt's Naive Pruning Algorithm and Overview of Penny's Optimal Pruning Algorithm 55 5.5.2 Phase 1: Collecting Global-Decision Independent Status 56 5.5.3 Phase2: Ordering and Finalizing Renaming Decisions 60 5.5.4 Effectiveness of Eliminating the Checkpoints 63 5.6 Automatic Checkpoint Storage Assignment 69 5.7 Low-Level Optimizations and Code Generation 70 6 Evaluation 74 6.1 Test Environment 74 6.1.1 GPU Architecture and Simulation Setup 74 6.1.2 Tested Applications 75 6.1.3 Register Assignment 76 6.2 Performance Evaluation 77 6.2.1 Overall Performance Overheads 77 6.2.2 Impact of Penny's Optimizations 78 6.2.3 Assigning Checkpoint Storage and Its Integrity 79 6.2.4 Impact of Optimal Checkpoint Pruning 80 6.2.5 Impact of Alias Analysis 81 6.3 Repurposing the Saved ECC Area 82 6.4 Energy Impact on Execution 83 6.5 Performance Overhead on Volta Architecture 85 6.6 Compilation Time 85 7 RelatedWorks 87 8 Conclusion and Future Works 89 8.1 Limitation and Future Work 90Docto

    Data Parallel C++

    Get PDF
    Learn how to accelerate C++ programs using data parallelism. This open access book enables C++ programmers to be at the forefront of this exciting and important new development that is helping to push computing to new levels. It is full of practical advice, detailed explanations, and code examples to illustrate key topics. Data parallelism in C++ enables access to parallel resources in a modern heterogeneous system, freeing you from being locked into any particular computing device. Now a single C++ application can use any combination of devices—including GPUs, CPUs, FPGAs and AI ASICs—that are suitable to the problems at hand. This book begins by introducing data parallelism and foundational topics for effective use of the SYCL standard from the Khronos Group and Data Parallel C++ (DPC++), the open source compiler used in this book. Later chapters cover advanced topics including error handling, hardware-specific programming, communication and synchronization, and memory model considerations. Data Parallel C++ provides you with everything needed to use SYCL for programming heterogeneous systems. What You'll Learn Accelerate C++ programs using data-parallel programming Target multiple device types (e.g. CPU, GPU, FPGA) Use SYCL and SYCL compilers Connect with computing’s heterogeneous future via Intel’s oneAPI initiative Who This Book Is For Those new data-parallel programming and computer programmers interested in data-parallel programming using C++

    Optimal Global Instruction Scheduling for the Itanium® Processor Architecture

    Get PDF
    On the Itanium 2 processor, effective global instruction scheduling is crucial to high performance. At the same time, it poses a challenge to the compiler: This code generation subtask involves strongly interdependent decisions and complex trade-offs that are difficult to cope with for heuristics. We tackle this NP-complete problem with integer linear programming (ILP), a search-based method that yields provably optimal results. This promises faster code as well as insights into the potential of the architecture. Our ILP model comprises global code motion with compensation copies, predication, and Itanium-specific features like control/data speculation. In integer linear programming, well-structured models are the key to acceptable solution times. The feasible solutions of an ILP are represented by integer points inside a polytope. If all vertices of this polytope are integral, then the ILP can be solved in polynomial time. We define two subproblems of global scheduling in which some constraint classes are omitted and show that the corresponding two subpolytopes of our ILP model are integral and polynomial sized. This substantiates that the found model is of high efficiency, which is also confirmed by the reasonable solution times. The ILP formulation is extended by further transformations like cyclic code motion, which moves instructions upwards out of a loop, circularly in the opposite direction of the loop backedges. Since the architecture requires instructions to be encoded in fixed-sized bundles of three, a bundler is developed that computes bundle sequences of minimal size by means of precomputed results and dynamic programming. Experiments have been conducted with a postpass tool that implements the ILP scheduler. It parses assembly procedures generated by Intel�s Itanium compiler and reschedules them as a whole. Using this tool, we optimize a selection of hot functions from the SPECint 2000 benchmark. The results show a significant speedup over the original code.Globale Instruktionsanordnung hat beim Itanium-2-Prozessor großen Einfluß auf die Leistung und stellt dabei gleichzeitig eine Herausforderung für den Compiler dar: Sie ist mit zahlreichen komplexen, wechselseitig voneinander abhängigen Entscheidungen verbunden, die für Heuristiken nur schwer zu beherrschen sind.Wir lösen diesesNP-vollständige Problem mit ganzzahliger linearer Programmierung (ILP), einer suchbasierten Methode mit beweisbar optimalen Ergebnissen. Das ermöglicht neben schnellerem Code auch Einblicke in das Potential der Itanium- Prozessorarchitektur. Unser ILP-Modell umfaßt globale Codeverschiebungen mit Kompensationscode, Prädikation und Itanium-spezifische Techniken wie Kontroll- und Datenspekulation. Bei ganzzahliger linearer Programmierung sind wohlstrukturierte Modelle der Schlüssel zu akzeptablen Lösungszeiten. Die zulässigen Lösungen eines ILPs werden durch ganzzahlige Punkte innerhalb eines Polytops repräsentiert. Sind die Eckpunkte dieses Polytops ganzzahlig, kann das ILP in Polynomialzeit gelöst werden. Wir definieren zwei Teilprobleme globaler Instruktionsanordnung durch Auslassung bestimmter Klassen von Nebenbedingungen und beweisen, daß die korrespondierenden Teilpolytope unseres ILP-Modells ganzzahlig und von polynomieller Größe sind. Dies untermauert die hohe Effizienz des gefundenen Modells, die auch durch moderate Lösungszeiten bestätigt wird. Das ILP-Modell wird um weitere Transformationen wie zyklische Codeverschiebung erweitert; letztere bezeichnet das Verschieben von Befehlen aufwärts aus einer Schleife heraus, in Gegenrichtung ihrer Rückwärtskanten. Da die Architektur eine Kodierung der Befehle in Dreierbündeln fester Größe vorschreibt, wird ein Bundler entwickelt, der Bündelsequenzen minimaler Länge mit Hilfe vorberechneter Teilergebnisse und dynamischer Programmierung erzeugt. Für die Experimente wurde ein Postpassoptimierer erstellt. Er liest von Intels Itanium-Compiler erzeugte Assemblerroutinen ein und ordnet die enthaltenen Instruktionen mit Hilfe der ILP-Methode neu an. Angewandt auf eine Auswahl von Funktionen aus dem Benchmark SPECint 2000 erreicht der Optimierer eine signifikante Beschleunigung gegenüber dem Originalcode

    Optimal Global Instruction Scheduling for the Itanium® Processor Architecture

    Get PDF
    On the Itanium 2 processor, effective global instruction scheduling is crucial to high performance. At the same time, it poses a challenge to the compiler: This code generation subtask involves strongly interdependent decisions and complex trade-offs that are difficult to cope with for heuristics. We tackle this NP-complete problem with integer linear programming (ILP), a search-based method that yields provably optimal results. This promises faster code as well as insights into the potential of the architecture. Our ILP model comprises global code motion with compensation copies, predication, and Itanium-specific features like control/data speculation. In integer linear programming, well-structured models are the key to acceptable solution times. The feasible solutions of an ILP are represented by integer points inside a polytope. If all vertices of this polytope are integral, then the ILP can be solved in polynomial time. We define two subproblems of global scheduling in which some constraint classes are omitted and show that the corresponding two subpolytopes of our ILP model are integral and polynomial sized. This substantiates that the found model is of high efficiency, which is also confirmed by the reasonable solution times. The ILP formulation is extended by further transformations like cyclic code motion, which moves instructions upwards out of a loop, circularly in the opposite direction of the loop backedges. Since the architecture requires instructions to be encoded in fixed-sized bundles of three, a bundler is developed that computes bundle sequences of minimal size by means of precomputed results and dynamic programming. Experiments have been conducted with a postpass tool that implements the ILP scheduler. It parses assembly procedures generated by Intel�s Itanium compiler and reschedules them as a whole. Using this tool, we optimize a selection of hot functions from the SPECint 2000 benchmark. The results show a significant speedup over the original code.Globale Instruktionsanordnung hat beim Itanium-2-Prozessor großen Einfluß auf die Leistung und stellt dabei gleichzeitig eine Herausforderung für den Compiler dar: Sie ist mit zahlreichen komplexen, wechselseitig voneinander abhängigen Entscheidungen verbunden, die für Heuristiken nur schwer zu beherrschen sind.Wir lösen diesesNP-vollständige Problem mit ganzzahliger linearer Programmierung (ILP), einer suchbasierten Methode mit beweisbar optimalen Ergebnissen. Das ermöglicht neben schnellerem Code auch Einblicke in das Potential der Itanium- Prozessorarchitektur. Unser ILP-Modell umfaßt globale Codeverschiebungen mit Kompensationscode, Prädikation und Itanium-spezifische Techniken wie Kontroll- und Datenspekulation. Bei ganzzahliger linearer Programmierung sind wohlstrukturierte Modelle der Schlüssel zu akzeptablen Lösungszeiten. Die zulässigen Lösungen eines ILPs werden durch ganzzahlige Punkte innerhalb eines Polytops repräsentiert. Sind die Eckpunkte dieses Polytops ganzzahlig, kann das ILP in Polynomialzeit gelöst werden. Wir definieren zwei Teilprobleme globaler Instruktionsanordnung durch Auslassung bestimmter Klassen von Nebenbedingungen und beweisen, daß die korrespondierenden Teilpolytope unseres ILP-Modells ganzzahlig und von polynomieller Größe sind. Dies untermauert die hohe Effizienz des gefundenen Modells, die auch durch moderate Lösungszeiten bestätigt wird. Das ILP-Modell wird um weitere Transformationen wie zyklische Codeverschiebung erweitert; letztere bezeichnet das Verschieben von Befehlen aufwärts aus einer Schleife heraus, in Gegenrichtung ihrer Rückwärtskanten. Da die Architektur eine Kodierung der Befehle in Dreierbündeln fester Größe vorschreibt, wird ein Bundler entwickelt, der Bündelsequenzen minimaler Länge mit Hilfe vorberechneter Teilergebnisse und dynamischer Programmierung erzeugt. Für die Experimente wurde ein Postpassoptimierer erstellt. Er liest von Intels Itanium-Compiler erzeugte Assemblerroutinen ein und ordnet die enthaltenen Instruktionen mit Hilfe der ILP-Methode neu an. Angewandt auf eine Auswahl von Funktionen aus dem Benchmark SPECint 2000 erreicht der Optimierer eine signifikante Beschleunigung gegenüber dem Originalcode

    Formal synthesis of control signals for systolic arrays

    Get PDF

    Data Parallel C++

    Get PDF
    Learn how to accelerate C++ programs using data parallelism. This open access book enables C++ programmers to be at the forefront of this exciting and important new development that is helping to push computing to new levels. It is full of practical advice, detailed explanations, and code examples to illustrate key topics. Data parallelism in C++ enables access to parallel resources in a modern heterogeneous system, freeing you from being locked into any particular computing device. Now a single C++ application can use any combination of devices—including GPUs, CPUs, FPGAs and AI ASICs—that are suitable to the problems at hand. This book begins by introducing data parallelism and foundational topics for effective use of the SYCL standard from the Khronos Group and Data Parallel C++ (DPC++), the open source compiler used in this book. Later chapters cover advanced topics including error handling, hardware-specific programming, communication and synchronization, and memory model considerations. Data Parallel C++ provides you with everything needed to use SYCL for programming heterogeneous systems. What You'll Learn Accelerate C++ programs using data-parallel programming Target multiple device types (e.g. CPU, GPU, FPGA) Use SYCL and SYCL compilers Connect with computing’s heterogeneous future via Intel’s oneAPI initiative Who This Book Is For Those new data-parallel programming and computer programmers interested in data-parallel programming using C++

    Indexed dependence metadata and its applications in software performance optimisation

    No full text
    To achieve continued performance improvements, modern microprocessor design is tending to concentrate an increasing proportion of hardware on computation units with less automatic management of data movement and extraction of parallelism. As a result, architectures increasingly include multiple computation cores and complicated, software-managed memory hierarchies. Compilers have difficulty characterizing the behaviour of a kernel in a general enough manner to enable automatic generation of efficient code in any but the most straightforward of cases. We propose the concept of indexed dependence metadata to improve application development and mapping onto such architectures. The metadata represent both the iteration space of a kernel and the mapping of that iteration space from a given index to the set of data elements that iteration might use: thus the dependence metadata is indexed by the kernel’s iteration space. This explicit mapping allows the compiler or runtime to optimise the program more efficiently, and improves the program structure for the developer. We argue that this form of explicit interface specification reduces the need for premature, architecture-specific optimisation. It improves program portability, supports intercomponent optimisation and enables generation of efficient data movement code. We offer the following contributions: an introduction to the concept of indexed dependence metadata as a generalisation of stream programming, a demonstration of its advantages in a component programming system, the decoupled access/execute model for C++ programs, and how indexed dependence metadata might be used to improve the programming model for GPU-based designs. Our experimental results with prototype implementations show that indexed dependence metadata supports automatic synthesis of double-buffered data movement for the Cell processor and enables aggressive loop fusion optimisations in image processing, linear algebra and multigrid application case studies

    Data structure repair using goal-directed reasoning

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.Includes bibliographical references (p. 203-207).Software errors, hardware faults, and user errors can cause data structures in running applications to become damaged so that they violate key consistency properties. As a result of this violation, the application may produce unacceptable results or even crash. This dissertation presents a new data structure repair system that accepts a specification of key data structure consistency constraints, then generates repair algorithms that dynamically detect and repair violations of these constraints, enabling the application to continue to execute productively even in the face of otherwise crippling errors. We have successfully applied our system to five benchmarks: CTAS, an air traffic control tool; AbiWord, an open source word processing application; Freeciv, an online game; a parallel x86 emulator; and a simplified Linux file system. Our experience using our system indicates that the specifications are relatively easy to develop once one understands the data structures. Furthermore, for our set of benchmark applications, our experimental results show that our system can effectively repair inconsistent data structures and enable the application to continue to operate successfully.by Brian C. Demsky.Ph.D
    corecore