Understanding Suspend/Resume Path of Linux Device Drivers
Suspend/Resume (S/R) refers to putting a mobile device into sleep mode and waking it up, and the S/R process is used heavily on mobile devices today. Controlled by the operating system (OS), S/R consumes a dominant portion of system energy. To minimize this power consumption, we must understand what happens on the S/R path of modern device drivers before solutions that reduce its overhead can be developed. In a modern OS, device drivers can make up over 70% of the source code while remaining heavily dependent on the rest of the kernel, which makes analyzing driver code both important and difficult. We built a static code analysis tool and used it to quantitatively analyze the S/R path of Linux device drivers. By comparing different kernel versions, we observed how the Linux S/R path evolves over time. In this paper, we present a quantitative analysis of Linux driver code on the S/R path and show how it has evolved.
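The snippet below is not the authors' tool; it is a minimal sketch, assuming the S/R entry points of interest are the callbacks drivers register via struct dev_pm_ops, of how one might enumerate which driver source files participate in the S/R path of a kernel tree.

```python
# Minimal sketch (not the authors' analysis tool): scan Linux driver sources
# for suspend/resume callbacks declared in dev_pm_ops-style initializers.
import re
import sys
from pathlib import Path

# Matches designated initializers such as ".suspend = foo_suspend,"
CALLBACK_RE = re.compile(r"\.(suspend|resume|freeze|thaw|poweroff|restore)\s*=\s*(\w+)")

def scan_drivers(kernel_src: str) -> dict:
    """Map each driver .c file to the S/R callback names it defines."""
    stats = {}
    base = Path(kernel_src, "drivers")
    if not base.is_dir():
        return stats
    for path in base.rglob("*.c"):
        text = path.read_text(errors="ignore")
        if "dev_pm_ops" not in text and "SIMPLE_DEV_PM_OPS" not in text:
            continue
        callbacks = CALLBACK_RE.findall(text)
        if callbacks:
            stats[str(path)] = [name for _, name in callbacks]
    return stats

if __name__ == "__main__":
    result = scan_drivers(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{len(result)} driver files define S/R callbacks")
```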
Decelerating I/O Power Management
System suspend/resume is crucial to the energy proportionality of modern computers, from wearables to cloud servers. Ironically, this OS mechanism is itself slow and energy hungry. By characterizing the Linux kernel on a variety of modern system-on-chips (SoCs), we show that the major reason is slow power-state transitions of I/O, which keep the CPU waiting. Furthermore, we argue that this I/O wait can hardly be reduced to a satisfactory level, because most slow I/O transitions are bounded by peripherals, low-speed buses, or physical factors. Therefore, the kernel execution for suspend/resume should be offloaded to a miniature core that waits more efficiently. To this end, we propose a power-management core running a novel hypervisor that dynamically translates and executes power-management functions. This method not only supports offloading a complex kernel subsystem but also provides forward compatibility with a commodity kernel. Based on QEMU, an open-source hypervisor, we implement a backend for the ARMv7-M ISA. We optimize QEMU's translation by mapping flag emulation directly to hardware, and in the end achieve a 100% performance increase over QEMU's original version.
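As a rough illustration of the offloading argument (the power and timing numbers below are assumptions for illustration, not measurements from the paper): when the kernel merely waits out a peripheral-bound power-state transition, the wait time is fixed, so the energy cost is dominated by which core does the waiting.

```python
# Illustrative-only sketch: energy of waiting out a fixed I/O power-state
# transition on a big application core vs. a miniature core.
def wait_energy_mj(wait_ms: float, core_power_mw: float) -> float:
    """Energy in millijoules spent waiting for wait_ms at core_power_mw."""
    return core_power_mw * wait_ms / 1000.0

io_wait_ms = 50.0      # transition time bounded by the peripheral (assumed)
big_core_mw = 500.0    # assumed active power of an application core
tiny_core_mw = 20.0    # assumed active power of a Cortex-M-class core

print(f"big core : {wait_energy_mj(io_wait_ms, big_core_mw):.1f} mJ")
print(f"tiny core: {wait_energy_mj(io_wait_ms, tiny_core_mw):.1f} mJ")
```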
Efficient Deep Speech Understanding at the Edge
Contemporary speech understanding (SU) employs a sophisticated pipeline: it ingests streaming voice input, iteratively runs beam search that invokes a deep neural network to generate tentative outputs (referred to as hypotheses) in an autoregressive manner, and periodically assesses the hypotheses with attention and Connectionist Temporal Classification (CTC) scores.

This paper aims to enhance SU performance on edge devices with limited resources. Adopting a hybrid strategy, our approach accelerates on-device execution and offloads inputs that surpass the device's capacity. While this hybrid approach is established, we tackle SU's distinctive challenges through innovative techniques: (1) Late contextualization, which runs a model's attentive encoder in parallel with input ingestion; (2) Pilot inference, which mitigates the temporal load imbalance of the SU pipeline; and (3) Autoregression offramps, a novel approach in which offloading decisions are made solely on the basis of hypotheses.

These techniques are designed to integrate seamlessly with existing speech models, pipelines, and frameworks, and can be applied independently or in combination. Collectively, they form a hybrid solution for edge SU. Our prototype, named XYZ, has been tested on Arm platforms with 6 to 8 cores and demonstrates state-of-the-art accuracy. Notably, it achieves a 2x reduction in end-to-end latency and a corresponding 2x decrease in offloading requirements.
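To make the "offramp" idea concrete, here is a hedged sketch of a hypothesis-only offloading decision. The criterion below (agreement between consecutive top beam hypotheses) and all names are hypothetical and are not the paper's actual rule; it only illustrates a decision that consults nothing but the decoder's hypotheses.

```python
# Hedged sketch of an "offramp"-style decision (hypothetical criterion, not
# the paper's rule): offload only when the device's partial hypotheses look
# unstable across consecutive beam-search steps.
from typing import List, Tuple

Hypothesis = Tuple[int, ...]  # a partial output as a tuple of token ids

def should_offload(prev_beam: List[Hypothesis],
                   curr_beam: List[Hypothesis],
                   agreement_threshold: float = 0.9) -> bool:
    """Return True if the top hypothesis changed enough to suggest the
    on-device model is struggling and the input should be offloaded."""
    if not prev_beam or not curr_beam:
        return False
    prev_best, curr_best = prev_beam[0], curr_beam[0]
    overlap = len(set(prev_best) & set(curr_best))
    agreement = overlap / max(len(curr_best), 1)
    return agreement < agreement_threshold

# Example: the best hypothesis drifted heavily between steps -> offload.
print(should_offload([(12, 7, 3)], [(12, 99, 41)]))  # True
```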
Secure and Effective Data Appraisal for Machine Learning
Essential for an unfettered data market is the ability to discreetly select and evaluate training data before finalizing a transaction between the data owner and model owner. To safeguard the privacy of both data and model, this process scrutinizes the target model through Multi-Party Computation (MPC). While prior research has posited that MPC-based evaluation of Transformer models is excessively resource-intensive, this paper introduces an innovative approach that renders data selection practical. The contributions of this study encompass three pivotal elements: (1) a groundbreaking pipeline for confidential data selection using MPC, (2) replicating intricate high-dimensional operations with simplified low-dimensional MLPs trained on a limited subset of pertinent data, and (3) implementing MPC in a concurrent, multi-phase manner. The proposed method is assessed across an array of Transformer models and NLP/CV benchmarks. In comparison to direct MPC-based evaluation of the target model, our approach substantially reduces the required time, from thousands of hours to mere tens of hours, with only a nominal 0.20% dip in accuracy when training with the selected data.
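As a toy sketch of contribution (2) (the shapes, optimizer settings, and random placeholder data below are assumptions, not the paper's training recipe): an expensive high-dimensional module is distilled into a small low-dimensional MLP, so that the subsequent MPC evaluation can run on the cheap proxy instead of the original operation.

```python
# Toy distillation sketch: approximate an expensive high-dimensional module
# with a small low-dimensional MLP. Shapes and steps are illustrative only.
import torch
import torch.nn as nn

hidden, proxy_dim = 768, 64

# Stand-in for an expensive Transformer sub-block we want to avoid under MPC.
expensive = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden)).eval()

# Cheap proxy: project down, run a small MLP, project back up.
proxy = nn.Sequential(nn.Linear(hidden, proxy_dim), nn.ReLU(),
                      nn.Linear(proxy_dim, proxy_dim), nn.ReLU(),
                      nn.Linear(proxy_dim, hidden))

opt = torch.optim.Adam(proxy.parameters(), lr=1e-3)
for _ in range(200):                  # a few steps on a small "pertinent" subset
    x = torch.randn(32, hidden)       # placeholder for real activations
    with torch.no_grad():
        target = expensive(x)
    loss = nn.functional.mse_loss(proxy(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final distillation loss: {loss.item():.4f}")
```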