36 research outputs found

    Efficient Multiprogramming for Multicores with SCAF

    Get PDF
    As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed run-time environments provide neither an interface nor a strategy for intelligently allocating hardware threads, or even for preventing oversubscription. Prior research methods either depend on profiling applications ahead of time in order to make good allocation decisions, or do not account for process efficiency at all, leading to poor performance. None of these prior methods has been widely adopted in practice. This paper presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution which supports existing malleable applications in making intelligent allocation decisions based on observed efficiency, without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications, without requiring application modification or recompilation. In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters, and demonstrate its effectiveness in aiding allocation decisions. We evaluated SCAF using the NAS Parallel Benchmarks (NPB) on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of the improvement in sum of speedups (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently, compared to equipartitioning, the best existing competing scheme in the literature. If the sum of speedups with SCAF is within 5% of equipartitioning (i.e., the improvement factor in sum of speedups lies between 0.95X and 1.05X), we deem SCAF to break even; less than 0.95X is considered a slowdown, and greater than 1.05X an improvement. We found that SCAF improves on equipartitioning on 4 out of 5 machines, breaking even or improving in 80-89% of pairs and showing a mean improvement of 1.11-1.22X for the benchmark pairs on which it improves, depending on the machine. Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming with unmodified OpenMP, which is the only environment available to end users today. SCAF improves on or breaks even with the unmodified OpenMP runtimes on all 5 machines in 72-100% of pairs, with a mean improvement of 1.27-1.7X, depending on the machine.
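
    The break-even rule above is simple arithmetic; the following sketch (in C, not from the paper) computes the sum-of-speedups improvement factor for one benchmark pair and applies the 0.95X/1.05X thresholds quoted in the abstract. The speedup values in main() are made up for illustration.

        #include <stdio.h>

        /* Classify one benchmark pair by its sum-of-speedups improvement factor,
         * using the 0.95X / 1.05X thresholds from the abstract. */
        static const char *classify(double scaf_sum, double equi_sum) {
            double factor = scaf_sum / equi_sum;     /* improvement factor (X) */
            if (factor > 1.05) return "improvement";
            if (factor < 0.95) return "slowdown";
            return "break-even";                     /* within 5% of equipartitioning */
        }

        int main(void) {
            /* Made-up speedups for one pair of co-scheduled benchmarks. */
            double scaf_sum = 1.30 + 0.85;   /* sum of speedups under SCAF             */
            double equi_sum = 1.10 + 0.80;   /* sum of speedups under equipartitioning */
            printf("factor = %.2fX -> %s\n", scaf_sum / equi_sum,
                   classify(scaf_sum, equi_sum));
            return 0;
        }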

    Hardware-Oriented Cache Management for Large-Scale Chip Multiprocessors

    Get PDF
    One of the key requirements for obtaining high performance from chip multiprocessors (CMPs) is to effectively manage the limited on-chip cache resources shared among co-scheduled threads/processes. This thesis proposes new hardware-oriented solutions for distributed CMP caches. Computer architects are faced with growing challenges when designing cache systems for CMPs. These challenges result from non-uniform access latencies, interference misses, the bandwidth wall problem, and diverse workload characteristics. Our exploration of the CMP cache management problem suggests a CMP caching framework (CC-FR) that defines three main approaches to solve the problem: (1) data placement, (2) data retention, and (3) data relocation. We effectively implement CC-FR's components by proposing and evaluating multiple cache management mechanisms. Pressure and Distance Aware Placement (PDA) decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Flexible Set Balancing (FSB), on the other hand, reduces interference misses by extending the lifetime of cache lines, retaining some fraction of the working set in underutilized local sets to satisfy far-flung reuses. PDA implements CC-FR's data placement and relocation components, while FSB applies CC-FR's retention approach. To alleviate non-uniform access latencies and adapt to phase changes in programs, Adaptive Controlled Migration (ACM) dynamically and periodically promotes cache blocks towards L2 banks close to the requesting cores. ACM falls under CC-FR's data relocation category. Dynamic Cache Clustering (DCC), on the other hand, addresses diverse workload characteristics and the growing challenge of non-uniform access latencies by constructing a cache cluster for each core and expanding/contracting all clusters synergistically to match each core's cache demand. DCC implements CC-FR's data placement and relocation approaches. Lastly, Dynamic Pressure and Distance Aware Placement (DPDA) combines PDA and ACM to cooperatively mitigate interference misses and non-uniform access latencies, and Dynamic Cache Clustering and Balancing (DCCB) combines DCC and FSB to employ all of CC-FR's categories and achieve higher system performance. Simulation results demonstrate the effectiveness of the proposed mechanisms and show that they compare favorably with related cache designs.
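
    As an illustration of the pressure- and distance-aware idea described above (not the thesis's actual PDA algorithm), the following C sketch picks the L2 bank that minimizes a weighted combination of per-bank pressure and hop distance from the requesting core; the weights and inputs are hypothetical.

        #include <limits.h>
        #include <stddef.h>

        #define NUM_BANKS 16

        /* Illustrative placement decision in the spirit of pressure- and
         * distance-aware placement: choose the L2 bank minimizing a weighted sum
         * of its current set pressure (e.g., recent miss/eviction activity) and
         * its hop distance from the requesting core.  Weights and inputs are
         * hypothetical, not the thesis's tuned policy. */
        size_t pick_bank(const unsigned pressure[NUM_BANKS],
                         const unsigned hops[NUM_BANKS],
                         unsigned w_pressure, unsigned w_distance) {
            size_t best = 0;
            unsigned best_cost = UINT_MAX;
            for (size_t b = 0; b < NUM_BANKS; b++) {
                unsigned cost = w_pressure * pressure[b] + w_distance * hops[b];
                if (cost < best_cost) {
                    best_cost = cost;
                    best = b;   /* candidate bank for the incoming block */
                }
            }
            return best;
        }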

    GPU System Optimization for Efficient System Resource Utilization of General-Purpose Computing Applications Using GPUs in a Multitasking Environment

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: μ—Όν—Œμ˜. Recently, general-purpose GPU (GPGPU) applications have come to play key roles in many different research fields, such as high-performance computing (HPC) and deep learning (DL). The common feature of these applications is that all of them require massive computational power, which matches the high-parallelism characteristics of the graphics processing unit (GPU). However, because the resource usage pattern of each GPGPU application varies, a single application cannot fully exploit the GPU system's resources to achieve the best performance of the GPU, since the GPU system is designed to provide system-level fairness to all applications instead of being optimized for a specific type. GPU multitasking can address this issue by co-locating multiple kernels with diverse resource usage patterns so that they share the GPU resources in parallel. However, the current GPU multitasking scheme focuses only on co-launching the kernels rather than making them execute more efficiently. In addition, the current GPU multitasking scheme is not open-sourced, which makes it more difficult to optimize, since the GPGPU applications and the GPU system are unaware of each other's features. In this dissertation, we claim that using support from a framework between the GPU system and the GPGPU applications, without modifying the applications, can yield better performance. We design and implement such a framework while addressing two issues in GPGPU applications. First, we introduce a GPU memory checkpointing approach between the host memory and the device memory to address the problem that GPU memory cannot be oversubscribed in a multitasking environment. Second, we present a fine-grained GPU kernel management scheme to avoid the GPU resource under-utilization problem in a multitasking environment. We implement and evaluate our schemes on a real GPU system. The experimental results show that our proposed approaches solve the problems related to GPGPU applications better than the existing approaches while delivering better performance.

    An efficient virtual network interface in the FUGU scalable workstation, by Kenneth Martin Mackenzie

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 123-129).

    Minimal Virtual Machines on IoT Microcontrollers: The Case of Berkeley Packet Filters with rBPF

    Get PDF
    To be published in the proceedings of IFIP/IEEE PEMWN 2020. Virtual machines (VMs) are widely used to host and isolate software modules. However, extremely small memory and low energy budgets have so far prevented wide use of VMs on typical microcontroller-based IoT devices. In this paper, we explore the potential of two minimal VM approaches on such low-power hardware. We design rBPF, a register-based VM based on extended Berkeley Packet Filters (eBPF). We compare it with a stack-based VM based on WebAssembly (Wasm) adapted for embedded systems. We implement prototypes of each VM, hosted in the IoT operating system RIOT. We perform measurements on commercial off-the-shelf IoT hardware. Unsurprisingly, we observe that both the Wasm and rBPF virtual machines incur execution time and memory overhead compared to not using a VM. We show, however, that this execution time overhead is tolerable for low-throughput, low-energy IoT devices. We further show that, while using a VM based on Wasm entails doubling the memory budget for a simple networked IoT application using a 6LoWPAN/CoAP stack, using a VM based on rBPF requires only negligible memory overhead (less than 10% more memory). rBPF is thus a promising approach to host small software modules that are isolated from OS software and updatable on demand over low-power networks.
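
    To make the register-based approach concrete, the following C sketch shows a toy dispatch loop over eBPF-style 64-bit instructions with 11 registers, in the spirit of rBPF but not its actual code; only a handful of standard eBPF opcodes are handled.

        #include <stdint.h>
        #include <stddef.h>

        /* eBPF-style 64-bit instruction: opcode, dst/src registers, offset, immediate. */
        struct insn {
            uint8_t opcode;
            uint8_t dst : 4;
            uint8_t src : 4;
            int16_t offset;
            int32_t imm;
        };

        /* Toy register-based dispatch loop (11 registers, result in r0) handling a
         * few standard eBPF opcodes. */
        int64_t run(const struct insn *prog, size_t len) {
            int64_t r[11] = {0};
            for (size_t pc = 0; pc < len; pc++) {
                const struct insn *i = &prog[pc];
                if (i->dst > 10 || i->src > 10) return -1;          /* malformed program */
                switch (i->opcode) {
                case 0xb7: r[i->dst]  = i->imm;       break;        /* MOV64 dst, imm */
                case 0x07: r[i->dst] += i->imm;       break;        /* ADD64 dst, imm */
                case 0x0f: r[i->dst] += r[i->src];    break;        /* ADD64 dst, src */
                case 0x95: return r[0];                             /* EXIT           */
                default:   return -1;                               /* unsupported    */
                }
            }
            return r[0];
        }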

    Is Rust Used Safely by Software Developers?

    Full text link
    Rust, an emerging programming language with explosive growth, provides a robust type system that enables programmers to write memory-safe and data-race-free code. To allow access to a machine's hardware and to support low-level performance optimizations, a second language, Unsafe Rust, is embedded in Rust. It contains support for operations that are difficult to statically check, such as C-style pointers for access to arbitrary memory locations and mutable global variables. When a program uses these features, the compiler is unable to statically guarantee the safety properties Rust promotes. In this work, we perform a large-scale empirical study to explore how software developers are using Unsafe Rust in real-world Rust libraries and applications. Our results indicate that software engineers use the keyword unsafe in less than 30% of Rust libraries, but more than half cannot be entirely statically checked by the Rust compiler because of Unsafe Rust hidden somewhere in a library's call chain. We conclude that although the use of the keyword unsafe is limited, the propagation of unsafeness offers a challenge to the claim of Rust as a memory-safe language. Furthermore, we recommend changes to the Rust compiler and to the central Rust repository's interface to help Rust software developers be aware of when their Rust code is unsafe.

    uTango: an open-source TEE for IoT devices

    Get PDF
    Security is one of the main challenges of the Internet of Things (IoT). IoT devices are mainly powered by low-cost microcontrollers (MCUs) that typically lack basic hardware security mechanisms to separate security-critical applications from less critical components. Recently, Arm has started to release Cortex-M MCUs enhanced with TrustZone technology (i.e., TrustZone-M), a system-wide security solution aiming at providing robust protection for IoT devices. Trusted Execution Environments (TEEs) relying on TrustZone hardware have been perceived as safe havens for securing mobile devices. However, for the past few years, considerable effort has gone into unveiling hundreds of vulnerabilities and proposing a collection of relevant defense techniques to address several issues. While new TEE solutions built on TrustZone-M start flourishing, the lessons gathered from the research community appear to be falling short, as these new systems are falling into the same pitfalls of the past. In this paper, we present uTango, the first multi-world TEE for modern IoT devices. uTango proposes a novel architecture aimed at tackling the major architectural deficiencies currently affecting TrustZone(-M)-assisted TEEs. In particular, we leverage the very same TrustZone hardware primitives used by dual-world implementations to create multiple and equally secure execution environments within the normal world. We demonstrate the benefits of uTango by conducting an extensive evaluation on a real TrustZone-M hardware platform, i.e., Arm Musca-B1. uTango will be open-sourced and freely available on GitHub in hopes of engaging academia and industry in securing the foreseeable trillion IoT devices. This work was supported in part by the Fundação para a Ciência e Tecnologia (FCT) within the Research and Development Units under Grant UIDB/00319/2020, and in part by FCT within the Ph.D. Scholarship under Grant 2020.04585.BD.
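
    The TrustZone-M primitive that such TEEs build on is the Security Attribution Unit (SAU), which assigns security attributes to address ranges. The C sketch below marks one region as non-secure using CMSIS-style register and bit-mask names for Armv8-M; the device header is an assumption, and this is not uTango's configuration code.

        #include <stdint.h>
        #include "ARMCM33_DSP_FP_TZ.h"   /* assumed CMSIS device header for an Armv8-M core */

        /* Mark an address range as non-secure in the SAU, the hardware primitive a
         * multi-world TEE carves execution environments out of.  Register and
         * bit-mask names follow the CMSIS convention. */
        void sau_mark_nonsecure(uint32_t region, uint32_t base, uint32_t limit) {
            SAU->RNR  = region;                         /* select SAU region number  */
            SAU->RBAR = base & SAU_RBAR_BADDR_Msk;      /* region base address       */
            SAU->RLAR = (limit & SAU_RLAR_LADDR_Msk)    /* region limit address      */
                      | SAU_RLAR_ENABLE_Msk;            /* enable region, non-secure */
            SAU->CTRL |= SAU_CTRL_ENABLE_Msk;           /* enable the SAU itself     */
        }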

    Hybrid Post-Quantum Signatures in Hardware Security Keys

    Get PDF
    Recent advances in quantum computing are increasingly jeopardizing the security of cryptosystems currently in widespread use, such as RSA or elliptic-curve signatures. To address this threat, researchers and standardization institutes have accelerated the transition to quantum-resistant cryptosystems, collectively known as Post-Quantum Cryptography (PQC). These PQC schemes present new challenges due to their larger memory and computational footprints and their higher chance of latent vulnerabilities. In this work, we address these challenges by introducing a scheme to upgrade the digital signatures used by security keys to PQC. We introduce a hybrid digital signature scheme based on two building blocks: a classically secure scheme, ECDSA, and a post-quantum secure one, Dilithium. Our hybrid scheme maintains the guarantees of each underlying building block even if the other one is broken, thus being resistant to both classical and quantum attacks. We experimentally show that our hybrid signature scheme can successfully execute on current security keys, even though secure PQC schemes are known to require substantial resources. We publish an open-source implementation of our scheme at https://github.com/google/OpenSK/releases/tag/hybrid-pqc so that other researchers can reproduce our results on an nRF52840 development kit.
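
    The hybrid construction can be summarized as signing with both schemes and accepting only if both signatures verify, so a forgery requires breaking ECDSA and Dilithium simultaneously. The C sketch below illustrates this; all primitive function names and the struct layout are hypothetical placeholders (with representative signature sizes), not the OpenSK API, and key handling is omitted.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        /* Hypothetical primitives standing in for real ECDSA P-256 and Dilithium2
         * implementations (key handling omitted). */
        bool ecdsa_p256_sign(const uint8_t *m, size_t n, uint8_t *sig, size_t *sig_len);
        bool ecdsa_p256_verify(const uint8_t *m, size_t n, const uint8_t *sig, size_t sig_len);
        bool dilithium2_sign(const uint8_t *m, size_t n, uint8_t *sig, size_t *sig_len);
        bool dilithium2_verify(const uint8_t *m, size_t n, const uint8_t *sig, size_t sig_len);

        /* A hybrid signature carries both component signatures. */
        struct hybrid_sig {
            uint8_t ecdsa[72];       size_t ecdsa_len;      /* DER-encoded ECDSA P-256 */
            uint8_t dilithium[2420]; size_t dilithium_len;  /* Dilithium2 signature    */
        };

        bool hybrid_sign(const uint8_t *msg, size_t len, struct hybrid_sig *sig) {
            return ecdsa_p256_sign(msg, len, sig->ecdsa, &sig->ecdsa_len)
                && dilithium2_sign(msg, len, sig->dilithium, &sig->dilithium_len);
        }

        /* Verification accepts only if BOTH signatures check out, so the hybrid
         * stays unforgeable as long as at least one scheme remains unbroken. */
        bool hybrid_verify(const uint8_t *msg, size_t len, const struct hybrid_sig *sig) {
            return ecdsa_p256_verify(msg, len, sig->ecdsa, sig->ecdsa_len)
                && dilithium2_verify(msg, len, sig->dilithium, sig->dilithium_len);
        }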

    Femto-Containers: DevOps on Microcontrollers with Lightweight Virtualization & Isolation for IoT Software Modules

    Get PDF
    Development, deployment and maintenance of networked software have been revolutionized by DevOps, which has become essential to boost system software quality and to enable agile evolution. Meanwhile, the Internet of Things (IoT) connects more and more devices which are not covered by DevOps tools: low-power, microcontroller-based devices. In this paper, we contribute to bridging this gap by designing Femto-Containers, a new architecture which enables containerization, virtualization and secure deployment of software modules embedded on microcontrollers over low-power networks. As a proof of concept, we implemented and evaluated Femto-Containers on popular microcontroller architectures (Arm Cortex-M, ESP32 and RISC-V), using eBPF virtualization and RIOT, a common operating system in this space. We show that Femto-Containers can virtualize and isolate multiple software modules, executed concurrently, with very small memory footprint overhead (below 10%) and very small startup time (tens of microseconds) compared to native code execution. We show that Femto-Containers can satisfy the constraints of both low-level debug logic inserted in a hot code path and high-level business logic coded in a variety of common programming languages. Compared to prior work, Femto-Containers thus offer an attractive trade-off in terms of memory footprint, energy consumption, agility and security.