30 research outputs found

    Understanding Suspend/Resume Path of Linux Device Drivers

    Get PDF
    Suspend/Resume (S/R) refers to putting a mobile device into sleep mode and waking it up, a process heavily used in mobile devices today. Controlled by the operating system (OS), the S/R process consumes a dominant portion of a device's energy. To minimize this power consumption, we must understand what happens on the S/R path of modern device drivers so that solutions reducing the overhead of that process can be found. In a modern OS, device drivers can make up over 70% of the source code while remaining heavily dependent on the rest of the OS, which makes analyzing driver code an extremely complicated and important task. We built a static code analysis tool and used it to quantitatively analyze the S/R path of Linux device drivers. By comparing different kernel versions, we observed the evolution of the Linux S/R path over time. In this paper, we present a quantitative analysis of Linux driver code on the S/R path and show how it evolves over time.
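
    The abstract does not include the tool itself, but the flavor of the measurement can be sketched: scan a kernel tree's drivers/ directory for suspend/resume callback registrations and compare counts across versions. This is a minimal illustration only; the pattern list and the local kernel paths are assumptions, and the paper's actual static analysis is far more sophisticated.

        # Minimal sketch: count suspend/resume registration sites in Linux
        # driver sources. The real tool performs deeper static analysis;
        # this only illustrates the per-version measurement described above.
        import re
        from pathlib import Path

        # Patterns that commonly register S/R callbacks in driver code
        # (struct dev_pm_ops fields and the SIMPLE_DEV_PM_OPS helper macro).
        SR_PATTERNS = [
            re.compile(r"\.suspend\s*="),
            re.compile(r"\.resume\s*="),
            re.compile(r"\bSIMPLE_DEV_PM_OPS\b"),
        ]

        def count_sr_sites(kernel_tree: str) -> int:
            """Count suspend/resume registration sites under drivers/."""
            root = Path(kernel_tree, "drivers")
            if not root.is_dir():
                return 0
            total = 0
            for path in root.rglob("*.c"):
                try:
                    text = path.read_text(errors="ignore")
                except OSError:
                    continue
                total += sum(len(p.findall(text)) for p in SR_PATTERNS)
            return total

        if __name__ == "__main__":
            # Compare two checked-out kernel trees (hypothetical local
            # paths) to observe how the S/R path grows across versions.
            for tree in ("linux-4.9", "linux-6.1"):
                print(tree, count_sr_sites(tree))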

    Decelerating I/O Power Management

    Get PDF
    System suspend/resume is crucial to the energy proportionality of modern computers, from wearables to cloud servers. Ironically, this OS mechanism is itself slow and energy hungry. By characterizing the Linux kernel on a variety of modern system-on-chips (SoCs), we show that the major cause is slow power-state transitions of I/O devices, which keep the CPU waiting. Furthermore, we argue that this I/O wait can hardly be reduced to a satisfactory level, because most slow I/O transitions are bound by peripherals, low-speed buses, or physical factors. Therefore, the kernel execution for suspend/resume should be offloaded to a miniature core that waits more efficiently. To this end, we propose a power management core running a novel hypervisor that dynamically translates and executes power management functions. This method not only supports offloading a complex kernel subsystem but also provides forward compatibility with a commodity kernel. Based on QEMU, an open-source emulator, we implement the backend for the ARMv7-M ISA. We optimize QEMU's translation by mapping flag emulation directly to hardware. In the end, we achieve a 100% performance increase compared with QEMU's original version.
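
    As a rough illustration of the flag-emulation optimization mentioned above: QEMU-style translators typically record the operands and result of the last flag-setting instruction and derive ARM's NZCV condition flags lazily, on demand; mapping flag emulation directly to hardware lets the host ALU's own flags stand in instead. The sketch below mimics only the lazy-evaluation side in Python, with names of our own choosing; it is not QEMU code.

        # Illustrative sketch (not QEMU code): lazy emulation of ARM NZCV
        # condition flags in a dynamic binary translator. The translator
        # records the last operation's operands and result, and derives
        # the flags only when a conditional instruction needs them.
        MASK32 = 0xFFFFFFFF

        class LazyFlags:
            """Record enough state to derive N/Z/C/V on demand."""
            def __init__(self):
                self.src1 = self.src2 = self.result = 0

            def record_add(self, a: int, b: int) -> int:
                self.src1, self.src2 = a, b
                self.result = (a + b) & MASK32
                return self.result

            # Derived lazily; a hardware-mapped scheme skips this work.
            def n(self) -> int:  # negative
                return (self.result >> 31) & 1

            def z(self) -> int:  # zero
                return int(self.result == 0)

            def c(self) -> int:  # carry out of bit 31
                return int(self.src1 + self.src2 > MASK32)

            def v(self) -> int:  # signed overflow
                sa, sb, sr = ((x >> 31) & 1 for x in
                              (self.src1, self.src2, self.result))
                return int(sa == sb and sa != sr)

        flags = LazyFlags()
        flags.record_add(0xFFFFFFFF, 1)         # wraps around to 0
        print(flags.z(), flags.c(), flags.v())  # -> 1 1 0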

    Efficient Deep Speech Understanding at the Edge

    Full text link
    Contemporary speech understanding (SU) employs a sophisticated pipeline that ingests streaming voice input. The pipeline executes beam search iteratively, invoking a deep neural network to generate tentative outputs (referred to as hypotheses) in an autoregressive manner; periodically, it assesses attention and Connectionist Temporal Classification (CTC) scores. This paper aims to enhance SU performance on edge devices with limited resources. Adopting a hybrid strategy, our approach accelerates on-device execution and offloads inputs that surpass the device's capacity. While this hybrid approach is established, we tackle SU's distinctive challenges with innovative techniques: (1) Late Contextualization: executing a model's attentive encoder in parallel with input ingestion. (2) Pilot Inference: mitigating temporal load imbalances in the SU pipeline. (3) Autoregression Offramps: making offloading decisions solely on the basis of hypotheses, a novel approach. These techniques are designed to integrate seamlessly with existing speech models, pipelines, and frameworks, and can be applied independently or in combination. Collectively, they form a hybrid solution for edge SU. Our prototype, named XYZ, has been tested on Arm platforms featuring 6 to 8 cores, demonstrating state-of-the-art accuracy, a 2x reduction in end-to-end latency, and a corresponding 2x decrease in offloading requirements.
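
    To make technique (3) concrete, here is a hedged sketch of what an offramp-style decision could look like: the device offloads only when its own beam of hypotheses is ambiguous. The data structure, the score-margin heuristic, and all names are our illustrative assumptions, not the paper's actual criteria.

        # Hypothetical sketch: decide whether to offload an utterance
        # using only the hypotheses produced by on-device beam search.
        from dataclasses import dataclass

        @dataclass
        class Hypothesis:
            tokens: list   # tentative output tokens from beam search
            score: float   # combined attention + CTC log-probability

        def should_offload(beam: list, margin: float = 2.0) -> bool:
            """Offload when the on-device beam is ambiguous: if the best
            and runner-up hypotheses score too closely, the on-device
            model is unsure and the input likely exceeds its capacity."""
            if len(beam) < 2:
                return False
            best, second = sorted(beam, key=lambda h: h.score,
                                  reverse=True)[:2]
            return (best.score - second.score) < margin

        beam = [Hypothesis(["hello", "world"], -3.1),
                Hypothesis(["hollow", "world"], -3.4)]
        print(should_offload(beam))  # -> True: scores within the margin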

    Secure and Effective Data Appraisal for Machine Learning

    Full text link
    Essential to an unfettered data market is the ability to discreetly select and evaluate training data before finalizing a transaction between the data owner and the model owner. To safeguard the privacy of both the data and the model, this process involves scrutinizing the target model through Multi-Party Computation (MPC). While prior research has held that MPC-based evaluation of Transformer models is excessively resource-intensive, this paper introduces an innovative approach that renders data selection practical. The contributions of this study encompass three pivotal elements: (1) a pipeline for confidential data selection using MPC, (2) replicating intricate high-dimensional operations with simplified low-dimensional MLPs trained on a limited subset of pertinent data, and (3) implementing MPC in a concurrent, multi-phase manner. The proposed method is assessed across an array of Transformer models and NLP/CV benchmarks. Compared with direct MPC-based evaluation of the target model, our approach reduces the time required from thousands of hours to mere tens of hours, with only a nominal 0.20% dip in accuracy when training with the selected data.
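
    For a feel of the setting behind elements (1) and (3), the toy below evaluates a linear layer on additively secret-shared inputs, the basic move underlying MPC pipelines like the one described. It is a two-party sketch under strong simplifications (weights left in the clear, no secure multiplication such as Beaver triples), and every name and parameter in it is our own assumption.

        # Toy two-party sketch: evaluate a linear layer on additively
        # secret-shared data, so the model owner never sees plaintext
        # inputs. Real MPC protocols add secure multiplication, and the
        # paper's pipeline is concurrent and multi-phase on top of that.
        import numpy as np

        rng = np.random.default_rng(0)
        Q = 2**31 - 1       # arithmetic is done modulo a prime
        SCALE = 2**10       # fixed-point scaling for real-valued data

        def share(x_int):
            """Split an integer vector into two additive shares mod Q."""
            r = rng.integers(0, Q, size=x_int.shape)
            return r, (x_int - r) % Q

        def reveal(a, b):
            return (a + b) % Q

        x = rng.standard_normal(4)                 # data owner's sample
        x_int = np.round(x * SCALE).astype(np.int64) % Q
        s0, s1 = share(x_int)                      # one share per party

        W = np.round(rng.standard_normal((4, 3)) * SCALE).astype(np.int64)

        # Linear maps act locally on shares: each party multiplies its own.
        y0, y1 = (s0 @ W) % Q, (s1 @ W) % Q
        y = reveal(y0, y1)
        # Map back from the field to signed fixed point, then to floats.
        y = np.where(y > Q // 2, y - Q, y) / SCALE**2
        print(np.allclose(y, x @ (W / SCALE), atol=1e-2))  # -> True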