363 research outputs found

    Improving processor efficiency by exploiting common-case behaviors of memory instructions

    Get PDF
    Processor efficiency can be described with the help of a number of  desirable effects or metrics, for example, performance, power, area, design complexity and access latency. These metrics serve as valuable tools used in designing new processors and they also act as  effective standards for comparing current processors. Various factors impact the efficiency of modern out-of-order processors and one important factor is the manner in which instructions are processed through the processor pipeline. In this dissertation research, we study the impact of load and store instructions (collectively known as memory instructions) on processor efficiency,  and show how to improve efficiency by exploiting common-case or  predictable patterns in the behavior of memory instructions. The memory behavior patterns that we focus on in our research are the predictability of memory dependences, the predictability in data forwarding patterns,   predictability in instruction criticality and conservativeness in resource allocation and deallocation policies. We first design a scalable  and high-performance memory dependence predictor and then apply accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multi-threaded processor. We then use predictable data forwarding patterns to eliminate power-hungry  hardware in the processor with no loss in performance.  We then move to  studying instruction criticality to improve  processor efficiency. We study the behavior of critical load instructions  and propose applications that can be optimized using  predictable, load-criticality  information. Finally, we explore conventional techniques for allocation and deallocation  of critical structures that process memory instructions and propose new techniques to optimize the same.  Our new designs have the potential to reduce  the power and the area required by processors significantly without losing  performance, which lead to efficient designs of processors.Ph.D.Committee Chair: Loh, Gabriel H.; Committee Member: Clark, Nathan; Committee Member: Jaleel, Aamer; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin S.; Committee Member: Prvulovic, Milo

    A Non-blocking Buddy System for Scalable Memory Allocation on Multi-core Machines

    Get PDF
    Common implementations of core memory allocation components handle concurrent allocation/release requests by synchronizing threads via spin-locks. This approach is not prone to scale with large thread counts, a problem that has been addressed in the literature by introducing layered allocation services or replicating the core allocators - the bottom most ones within the layered architecture. Both these solutions tend to reduce the pressure of actual concurrent accesses to each individual core allocator. In this article we explore an alternative approach to scalability of memory allocation/release, which can be still combined with those literature proposals. We present a fully non-blocking buddy-system, that allows threads to proceed in parallel, and commit their allocations/releases unless a conflict is materialized while handling its metadata. Beyond improving scalability and performance it is resilient to performance degradation in face of concurrent accesses independently of the current level of fragmentation of the handled memory blocks

    NBBS: A Non-blocking Buddy System for Multi-core Machines

    Get PDF
    Common implementations of core memory allocation components, like the Linux buddy system, handle concurrent allocation/release requests by synchronizing threads via spinlocks. This approach is not prone to scale with large thread counts, a problem that has been addressed in the literature by introducing layered allocation services or replicating the core allocators—the bottom most ones within the layered architecture. Both these solutions tend to reduce the pressure of actual concurrent accesses to each individual core allocator. In this article we explore an alternative approach to scalability of memory allocation/release, which can be still combined with those literature proposals. We present a fully non-blocking buddy-system, where threads performing concurrent allocations/releases do not undergo any spinlock based synchronization. Our solution allows threads to proceed in parallel, and commit their allocations/releases unless a conflict is materialized while handling its metadata. Conflict detection relies on conventional atomic machine instructions in the Read-Modify-Write (RMW) class. Beyond improving scalability and performance, our solution can also avoid wasting clock cycles for spin-lock operations by threads that could in principle carry out their memory allocation/release in full concurrency. Thus, it is resilient to performance degradation—in face of concurrent accesses—independently of the current level of fragmentation of the handled memory blocks

    메모리 보호를 위한 보안 정책을 시행하기 위한 코드 변환 기술

    Get PDF
    학위논문(박사)--서울대학교 대학원 :공과대학 전기·컴퓨터공학부,2020. 2. 백윤흥.Computer memory is a critical component in computer systems that needs to be protected to ensure the security of computer systems. It contains security sensitive data that should not be disclosed to adversaries. Also, it contains the important data for operating the system that should not be manipulated by the attackers. Thus, many security solutions focus on protecting memory so that sensitive data cannot be leaked out of the computer system or on preventing illegal access to computer data. In this thesis, I will present various code transformation techniques for enforcing security policies for memory protection. First, I will present a code transformation technique to track implicit data flows so that security sensitive data cannot leak through implicit data flow channels (i.e., conditional branches). Then I will present a compiler technique to instrument C/C++ program to mitigate use-after-free errors, which is a type of vulnerability that allow illegal access to stale memory location. Finally, I will present a code transformation technique for low-end embedded devices to enable execute-only memory, which is a strong security policy to protect secrets and harden the computing device against code reuse attacks.컴퓨터 메모리는 컴퓨터 시스템의 보안을 위해 보호되어야 하는 중요한 컴포넌트이다. 컴퓨터 메모리는 보안상 중요한 데이터를 담고 있을 뿐만 아니라, 시스템의 올바른 동작을 위해 공격자에 의해 조작되어서는 안되는 중요한 데이터 값들을 저장한다. 따라서 많은 보안 솔루션은 메모리를 보호하여 컴퓨터 시스템에서 중요한 데이터가 유출되거나 컴퓨터 데이터에 대한 불법적인 접근을 방지하는 데 중점을 둔다. 본 논문에서는 메모리 보호를 위한 보안 정책을 시행하기 위한 다양한 코드 변환 기술을 제시한다. 먼저, 프로그램에서 분기문을 통해 보안에 민감한 데이터가 유출되지 않도록 암시적 데이터 흐름을 추적하는 코드 변환 기술을 제시한다. 그 다음으로 C / C ++ 프로그램을 변환하여 use-after-free 오류를 완화하는 컴파일러 기술을 제시한다. 마지막으로, 중요 데이터를 보호하고 코드 재사용 공격으로부터 디바이스를 강화할 수 있는 강력한 보안 정책인 실행 전용 메모리(execute-only memory)를 저사양 임베디드 디바이스에 구현하기 위한 코드 변환 기술을 제시한다.1 Introduction 1 2 Background 4 3 A Hardware-based Technique for Efficient Implicit Information Flow Tracking 8 3.1 Introduction 8 3.2 Related Work 10 3.3 Our Approach for Implicit Flow Tracking 12 3.3.1 Implicit Flow Tracking Scheme with Program Counter Tag 12 3.3.2 tP C Management Technique 15 3.3.3 Compensation for the Untaken Path 20 3.4 Architecture Design of IFTU 22 3.4.1 Overall System 22 3.4.2 Tag Computing Core 24 3.5 Performance and Area Analysis 26 3.6 Security Analysis 28 3.7 Summary 30 4 CRCount: Pointer Invalidation with Reference Counting to Mitigate Useafter-free in Legacy C/C++ 31 4.1 Introduction 31 4.2 Related Work 36 4.3 Threat Model 40 4.4 Implicit Pointer Invalidation 40 4.4.1 Invalidation with Reference Counting 40 4.4.2 Reference Counting in C/C++ 42 4.5 Design 44 4.5.1 Overview 45 4.5.2 Pointer Footprinting 46 4.5.3 Delayed Object Free 50 4.6 Implementation 53 4.7 Evaluation 56 4.7.1 Statistics 56 4.7.2 Performance Overhead 58 4.7.3 Memory Overhead 62 4.8 Security Analysis 67 4.8.1 Attack Prevention 68 4.8.2 Security considerations 69 4.9 Limitations 69 4.10 Summary 71 5 uXOM: Efficient eXecute-Only Memory on ARM Cortex-M 73 5.1 Introduction 73 5.2 Background 78 5.2.1 ARMv7-M Address Map and the Private Peripheral Bus (PPB) 78 5.2.2 Memory Protection Unit (MPU) 79 5.2.3 Unprivileged Loads/Stores 80 5.2.4 Exception Entry and Return 80 5.3 Threat Model and Assumptions 81 5.4 Approach and Challenges 82 5.5 uXOM 85 5.5.1 Basic Design 85 5.5.2 Solving the Challenges 89 5.5.3 Optimizations 98 5.5.4 Security Analysis 99 5.6 Evaluation 100 5.6.1 Runtime Overhead 103 5.6.2 Code Size Overhead 106 5.6.3 Energy Overhead 107 5.6.4 Security and Usability 107 5.6.5 Use Cases 108 5.7 Discussion 110 5.8 Related Work 111 5.9 Summary 113 6 Conclusion and Future Work 114 6.1 Future Work 115 Abstract (In Korean) 132 Acknowlegement 133Docto

    A grammar based approach towards the automatic implementation of data communication protocols in hardware

    Get PDF

    The "MIND" Scalable PIM Architecture

    Get PDF
    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture

    Energy-Efficient Algorithms

    Full text link
    We initiate the systematic study of the energy complexity of algorithms (in addition to time and space complexity) based on Landauer's Principle in physics, which gives a lower bound on the amount of energy a system must dissipate if it destroys information. We propose energy-aware variations of three standard models of computation: circuit RAM, word RAM, and transdichotomous RAM. On top of these models, we build familiar high-level primitives such as control logic, memory allocation, and garbage collection with zero energy complexity and only constant-factor overheads in space and time complexity, enabling simple expression of energy-efficient algorithms. We analyze several classic algorithms in our models and develop low-energy variations: comparison sort, insertion sort, counting sort, breadth-first search, Bellman-Ford, Floyd-Warshall, matrix all-pairs shortest paths, AVL trees, binary heaps, and dynamic arrays. We explore the time/space/energy trade-off and develop several general techniques for analyzing algorithms and reducing their energy complexity. These results lay a theoretical foundation for a new field of semi-reversible computing and provide a new framework for the investigation of algorithms.Comment: 40 pages, 8 pdf figures, full version of work published in ITCS 201

    A new approach to reversible computing with applications to speculative parallel simulation

    Get PDF
    In this thesis, we propose an innovative approach to reversible computing that shifts the focus from the operations to the memory outcome of a generic program. This choice allows us to overcome some typical challenges of "plain" reversible computing. Our methodology is to instrument a generic application with the help of an instrumentation tool, namely Hijacker, which we have redesigned and developed for the purpose. Through compile-time instrumentation, we enhance the program's code to keep track of the memory trace it produces until the end. Regardless of the complexity behind the generation of each computational step of the program, we can build inverse machine instructions just by inspecting the instruction that is attempting to write some value to memory. Therefore from this information, we craft an ad-hoc instruction that conveys this old value and the knowledge of where to replace it. This instruction will become part of a more comprehensive structure, namely the reverse window. Through this structure, we have sufficient information to cancel all the updates done by the generic program during its execution. In this writing, we will discuss the structure of the reverse window, as the building block for the whole reversing framework we designed and finally realized. Albeit we settle our solution in the specific context of the parallel discrete event simulation (PDES) adopting the Time Warp synchronization protocol, this framework paves the way for further general-purpose development and employment. We also present two additional innovative contributions coming from our innovative reversibility approach, both of them still embrace traditional state saving-based rollback strategy. The first contribution aims to harness the advantages of both the possible approaches. We implement the rollback operation combining state saving together with our reversible support through a mathematical model. This model enables the system to choose in autonomicity the best rollback strategy, by the mutable runtime dynamics of programs. The second contribution explores an orthogonal direction, still related to reversible computing aspects. In particular, we will address the problem of reversing shared libraries. Indeed, leading from their nature, shared objects are visible to the whole system and so does every possible external modification of their code. As a consequence, it is not possible to instrument them without affecting other unaware applications. We propose a different method to deal with the instrumentation of shared objects. All our innovative proposals have been assessed using the last generation of the open source ROOT-Sim PDES platform, where we integrated our solutions. ROOT-Sim is a C-based package implementing a general purpose simulation environment based on the Time Warp synchronization protocol
    corecore