Collaborative Heterogeneity-Aware OS Scheduler for Asymmetric Multicore Processors
Funding: This work is supported in part by the China Postdoctoral Science Foundation (Grant No. 2020TQ0169), the ShuiMu Tsinghua Scholar fellowship (2019SM131), National Key R&D Program of China (2020AAA0105200), National Natural Science Foundation of China (U20A20226), Beijing Natural Science Foundation (4202031), Beijing Academy of Artificial Intelligence (BAAI), and the UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1). This work is also supported by the Royal Academy of Engineering under the Research Fellowship scheme.
Asymmetric multicore processors (AMP) offer multiple types of cores under the same programming interface. Extracting the full potential of AMPs requires intelligent scheduling decisions, matching each thread with the right kind of core, the core that will maximize performance or minimize wasted energy for this thread. Existing OS schedulers are not up to this task. While they may handle certain aspects of asymmetry in the system, none can handle all runtime factors affecting AMPs for the general case of multi-threaded multi-programmed workloads. We address this problem by introducing COLAB, a general purpose asymmetry-aware scheduler targeting multi-threaded multi-programmed workloads. It estimates the performance and power of each thread on each type of core and identifies communication patterns and bottleneck threads. With this information, the scheduler makes coordinated core assignment and thread selection decisions that still provide each application its fair share of the processor’s time. We evaluate our approach using both the GEM5 simulator on four distinct big.LITTLE configurations and a development board with ARM Cortex-A73/A53 processors, with mixed workloads composed of PARSEC and SPLASH2 benchmarks.
Compared to the state-of-the-art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25% and 5% to 15% on average, together with an average 5% energy saving, depending on the hardware setup.
Postprint. Peer reviewed.
Guiding the PLMs with Semantic Anchors as Intermediate Supervision: Towards Interpretable Semantic Parsing
The recent prevalence of pretrained language models (PLMs) has dramatically
shifted the paradigm of semantic parsing, where the mapping from natural
language utterances to structured logical forms is now formulated as a Seq2Seq
task. Despite the promising performance, previous PLM-based approaches often
suffer from hallucination problems due to their neglect of the structural
information contained in the sentence, which essentially constitutes the key
semantics of the logical forms. Furthermore, most works treat PLM as a black
box in which the generation process of the target logical form is hidden
beneath the decoder modules, which greatly hinders the model's intrinsic
interpretability. To address these two issues, we propose to combine
current PLMs with a hierarchical decoder network. By taking the first-principle
structures as the semantic anchors, we propose two novel intermediate
supervision tasks, namely Semantic Anchor Extraction and Semantic Anchor
Alignment, for training the hierarchical decoders and probing the model's
intermediate representations in a self-adaptive manner alongside the
fine-tuning process. We conduct extensive experiments on several semantic
parsing benchmarks and demonstrate that our approach can consistently
outperform the baselines. More importantly, by analyzing the intermediate
representations of the hierarchical decoders, our approach also makes a huge
step toward the intrinsic interpretability of PLMs in the domain of semantic
parsing.
PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR
Deep neural networks (DNNs) are of critical use in different domains. To
accelerate DNN computation, tensor compilers have been proposed to generate efficient
code on different domain-specific accelerators. Existing tensor compilers
mainly focus on optimizing computation efficiency. However, memory access is
becoming a key performance bottleneck because the computational performance of
accelerators is increasing much faster than memory performance. The lack of
direct description of memory access and data dependence in current tensor
compilers' intermediate representation (IR) poses significant challenges for
generating memory-efficient code.
In this paper, we propose IntelliGen, a tensor compiler that can generate
high-performance code for memory-intensive operators by considering both
computation and data movement optimizations. IntelliGen represents a DNN program
using GIR, which includes primitives indicating its computation, data movement,
and parallel strategies. This information is further composed into an
instruction-level dataflow graph to perform holistic optimizations by searching
different memory access patterns and computation operations, and generating
memory-efficient code on different hardware. We evaluate IntelliGen on NVIDIA
GPU, AMD GPU, and Cambricon MLU, showing speedups of up to 1.97x, 2.93x, and
16.91x (1.28x, 1.23x, and 2.31x on average), respectively, compared to the
current most performant frameworks.
Comment: 12 pages, 14 figures