Approximating ReLU on a Reduced Ring for Efficient MPC-based Private Inference
Secure multi-party computation (MPC) allows users to offload machine learning
inference on untrusted servers without having to share their privacy-sensitive
data. Despite its strong security properties, MPC-based private inference has
not been widely adopted in the real world due to its high communication
overhead. When evaluating ReLU layers, MPC protocols incur a significant amount
of communication between the parties, making the end-to-end execution time
multiple orders of magnitude slower than its non-private counterpart.
This paper presents HummingBird, an MPC framework that reduces the ReLU
communication overhead significantly by using only a subset of the bits to
evaluate ReLU on a smaller ring. Based on theoretical analyses, HummingBird
identifies bits in the secret share that are not crucial for accuracy and
excludes them during ReLU evaluation to reduce communication. With its
efficient search engine, HummingBird discards 87--91% of the bits during ReLU
and still maintains high accuracy. On a real MPC setup involving multiple
servers, HummingBird achieves on average 2.03--2.67x end-to-end speedup without
introducing any errors, and up to 8.64x average speedup when some amount of
accuracy degradation can be tolerated, due to its up to 8.76x communication
reduction.
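The reduced-ring idea can be illustrated with a small plaintext simulation. This is only a sketch under stated assumptions, not HummingBird's protocol: it assumes two-party additive secret sharing over a 64-bit ring, the function names and the bit window `[low, high)` are hypothetical, and the sign bit is reconstructed in the clear where a real MPC system would run a secure comparison on the reduced shares.

```python
import random

RING_BITS = 64  # assumed size of the full ring, Z_{2^64}

def share(x, rng):
    """Split a signed integer into two additive shares modulo 2**RING_BITS."""
    mask = (1 << RING_BITS) - 1
    s1 = rng.getrandbits(RING_BITS)
    s2 = (x - s1) & mask
    return s1, s2

def reduced_sign_is_negative(s1, s2, low, high):
    """Estimate the sign of the shared value from bits [low, high) of each share."""
    width = high - low
    mask = (1 << width) - 1
    r1 = (s1 >> low) & mask          # keep only the selected bits of share 1
    r2 = (s2 >> low) & mask          # keep only the selected bits of share 2
    total = (r1 + r2) & mask         # add on the smaller ring Z_{2^width}
    return (total >> (width - 1)) == 1  # top bit of the reduced ring acts as sign

def approx_relu(x, low, high, rng):
    """Plaintext stand-in: ReLU whose sign test only sees the reduced shares."""
    s1, s2 = share(x, rng)
    return 0 if reduced_sign_is_negative(s1, s2, low, high) else x
```

In this toy model, dropping only high bits (`low = 0`) is exact as long as the value fits in the reduced ring, while dropping `low` low bits can only misclassify non-negative values smaller than `2**low`, because the carry out of the discarded bits is lost.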
Minefield: A Software-only Protection for SGX Enclaves against DVFS Attacks
Modern CPUs adapt clock frequencies and voltage levels to workloads to reduce energy consumption and heat dissipation. This mechanism, dynamic voltage and frequency scaling (DVFS), is controlled from privileged software but affects all execution modes, including SGX. Prior work showed that manipulating voltage or frequency can fault instructions and thereby subvert SGX enclaves. Consequently, Intel disabled the overclocking mailbox (OCM) required for software undervolting, also preventing benign use for energy saving.
In this paper, we propose Minefield, the first software-level defense against DVFS attacks. The idea of Minefield is not to prevent DVFS faults but to deflect faults to trap instructions and handle them before they lead to harmful behavior. As groundwork for Minefield, we systematically analyze DVFS attacks and observe a timing gap of at least 57.8 us between every OCM transition, leading to random faults over at least 57000 cycles. Minefield places highly fault-susceptible trap instructions in the victim code during compilation. Like redundancy countermeasures, Minefield is scalable and enables enclave developers to choose a security parameter between 0% and almost 100%, yielding a fine-grained security-performance trade-off. Our evaluation shows that a density of 0.75, i.e., one trap after every 1-2 instructions, mitigates all known DVFS attacks in 99% of cases on Intel SGX, incurring an overhead of 148.4% on protected enclaves. However, Minefield has no performance effect on the remaining system. Thus, Minefield is a better solution than hardware- or microcode-based patches disabling the OCM interface.
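To make the trap-density parameter concrete, here is a minimal sketch of how traps might be interleaved with instructions. It assumes that density means the ratio of traps to original instructions; the function `place_traps` and the `"TRAP"` marker are hypothetical placeholders and do not represent Minefield's compiler pass or its actual trap instructions.

```python
def place_traps(instructions, density):
    """Interleave 'TRAP' markers so that traps / instructions is about `density`.

    Assumption: density is the number of traps per original instruction.
    """
    if density < 0:
        raise ValueError("density must be non-negative")
    out = []
    credit = 0.0
    for ins in instructions:
        out.append(ins)
        credit += density          # each instruction earns `density` traps
        while credit >= 1.0:       # emit a trap whenever a whole one is owed
            out.append("TRAP")
            credit -= 1.0
    return out

# With density 0.75, four instructions receive three traps, spaced one or
# two instructions apart.
print(place_traps(["i0", "i1", "i2", "i3"], 0.75))
# ['i0', 'i1', 'TRAP', 'i2', 'TRAP', 'i3', 'TRAP']
```

Under this reading, a density of 0.75 yields a trap after every one or two instructions, which matches the spacing quoted in the abstract.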
Modular Deep Learning
Transfer learning has recently become the dominant paradigm of machine
learning. Pre-trained models fine-tuned for downstream tasks achieve better
performance with fewer labelled examples. Nonetheless, it remains unclear how
to develop models that specialise towards multiple tasks without incurring
negative interference and that generalise systematically to non-identically
distributed tasks. Modular deep learning has emerged as a promising solution to
these challenges. In this framework, units of computation are often implemented
as autonomous parameter-efficient modules. Information is conditionally routed
to a subset of modules and subsequently aggregated. These properties enable
positive transfer and systematic generalisation by separating computation from
routing and updating modules locally. We offer a survey of modular
architectures, providing a unified view over several threads of research that
evolved independently in the scientific literature. Moreover, we explore
various additional purposes of modularity, including scaling language models,
causal inference, programme induction, and planning in reinforcement learning.
Finally, we report various concrete applications where modularity has been
successfully deployed such as cross-lingual and cross-modal knowledge transfer.
Talks and projects related to this survey are available at
https://www.modulardeeplearning.com/
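The route-then-aggregate pattern described in this abstract can be sketched in a few lines. This is an illustrative toy, assuming top-k routing with softmax-weighted averaging; the function name, the fixed router scores, and the list-based modules are hypothetical and are not taken from the survey.

```python
import math

def route_and_aggregate(x, modules, scores, k=2):
    """Send x to the k highest-scoring modules and average their outputs.

    modules: list of callables mapping a list of floats to a list of floats.
    scores:  one router score per module (assumed given; a real router
             would compute these from x).
    """
    # Routing: pick the indices of the k best-scoring modules.
    top = sorted(range(len(modules)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores gives the aggregation weights.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    weights = [e / sum(exps) for e in exps]
    # Computation: only the selected modules run.
    outputs = [modules[i](x) for i in top]
    # Aggregation: weighted sum of the selected outputs.
    dim = len(outputs[0])
    mixed = [sum(w * out[j] for w, out in zip(weights, outputs)) for j in range(dim)]
    return mixed, top

modules = [
    lambda v: [2 * a for a in v],   # module 0: doubles its input
    lambda v: [a + 1 for a in v],   # module 1: shifts its input
    lambda v: [-a for a in v],      # module 2: negates (not selected below)
]
mixed, chosen = route_and_aggregate([1.0, 2.0], modules, scores=[1.0, 1.0, -5.0])
print(chosen, mixed)  # [0, 1] [2.0, 3.5]
```

Keeping routing separate from the modules themselves means a module can be added or updated without touching the others, which is the locality property the abstract highlights.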
A Comprehensive Survey on Distributed Training of Graph Neural Networks
Graph neural networks (GNNs) have been demonstrated to be a powerful
algorithmic model in broad application fields for their effectiveness in
learning over graphs. To scale GNN training up for large-scale and ever-growing
graphs, the most promising solution is distributed training which distributes
the workload of training across multiple computing nodes. At present, the
volume of related research on distributed GNN training is exceptionally vast,
accompanied by an extraordinarily rapid pace of publication. Moreover, the
approaches reported in these studies exhibit significant divergence. This
situation poses a considerable challenge for newcomers, hindering their ability
to grasp a comprehensive understanding of the workflows, computational
patterns, communication strategies, and optimization techniques employed in
distributed GNN training. As a result, there is a pressing need for a survey to
provide correct recognition, analysis, and comparisons in this field. In this
paper, we provide a comprehensive survey of distributed GNN training by
investigating various optimization techniques used in distributed GNN training.
First, distributed GNN training approaches are classified into several categories
according to their workflows. In addition, their computational patterns and
communication patterns, as well as the optimization techniques proposed by
recent work, are introduced. Second, the software frameworks and hardware platforms of
distributed GNN training are also introduced for a deeper understanding. Third,
distributed GNN training is compared with distributed training of deep neural
networks, emphasizing the uniqueness of distributed GNN training. Finally,
interesting issues and opportunities in this field are discussed.
Comment: To Appear in Proceedings of the IEE
Robust and Traffic Aware Medium Access Control Mechanisms for Energy-Efficient mm-Wave Wireless Network-on-Chip Architectures
To meet performance-per-watt needs, processors with multiple processing cores on the same chip have become the de facto design choice. In such multicore systems, a Network-on-Chip (NoC) serves as the communication infrastructure for data transfer among the cores on the chip. However, conventional NoCs based on metallic interconnects are constrained by their long multi-hop latencies and high power consumption, limiting the performance gain in these systems. Among the different alternatives, a low-latency wireless interconnect operating in the millimeter-wave (mm-wave) band is a nearer-term solution to this multi-hop communication problem, owing to its CMOS compatibility and energy efficiency. This has led to the recent exploration of mm-wave wireless technologies in wireless NoC (WiNoC) architectures.
To realize the mm-wave wireless interconnect in a WiNoC, a wireless interface (WI) equipped with an on-chip antenna and a transceiver circuit operating in the 60 GHz frequency range is integrated into the ports of some NoC switches. The WIs are also equipped with a medium access control (MAC) mechanism that ensures collision-free and energy-efficient communication among the WIs located in different parts of the chip. However, due to shrinking feature sizes and complex integration in CMOS technology, high-density chips such as multicore systems are prone to manufacturing defects and dynamic faults during chip operation. Such failures can result in permanently broken wireless links or cause the MAC to malfunction in a WiNoC. Consequently, energy-efficient communication through the wireless medium will be compromised. Furthermore, the energy efficiency of wireless channel access also depends on the traffic pattern of the applications running on the multicore system. Due to the bursty and self-similar nature of NoC traffic patterns, the traffic demand of the WIs can vary both spatially and temporally. Ineffective management of such traffic variation across the WIs limits the performance and energy benefits of the novel mm-wave interconnect technology. Hence, to utilize the full potential of this technology in WiNoCs, the design of a simple, fair, robust, and efficient MAC is of paramount importance.
The main goal of this dissertation is to propose design principles for robust and traffic-aware MAC mechanisms that provide high-bandwidth, low-latency, and energy-efficient data communication in mm-wave WiNoCs. The proposed solution has two parts. In the first part, we propose a cross-layer design methodology for a robust WiNoC architecture that can minimize the effect of permanent failures of the wireless links and recover from transient failures caused by single event upsets (SEUs). Then, in the second part, we present a traffic-aware MAC mechanism that can adjust the transmission slots of the WIs based on their traffic demand. The proposed MAC is also robust against failure of the wireless access mechanism. Finally, as a future research direction, this idea of traffic awareness is extended throughout the whole NoC by enabling adaptiveness in both the wired and wireless interconnection fabrics.
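As a rough illustration of what adjusting transmission slots to traffic demand could look like, the sketch below divides a fixed frame of slots among the WIs in proportion to their demand while guaranteeing each WI a minimum share. This is an assumption-laden toy, not the dissertation's MAC: the function `allocate_slots`, the `min_slots` fairness floor, and the largest-remainder rounding are all hypothetical choices.

```python
def allocate_slots(demands, total_slots, min_slots=1):
    """Split `total_slots` among WIs in proportion to non-negative integer demands.

    Every WI first receives `min_slots` (a hypothetical fairness floor); the
    remaining slots are shared proportionally using largest-remainder rounding.
    """
    n = len(demands)
    spare = total_slots - n * min_slots
    if n == 0 or spare < 0:
        raise ValueError("not enough slots for the guaranteed minimum")
    total_demand = sum(demands)
    if total_demand == 0:
        # No WI has traffic: fall back to an even split.
        demands, total_demand = [1] * n, n
    quotas = [spare * d for d in demands]  # numerators of the exact shares
    slots = [min_slots + q // total_demand for q in quotas]
    leftover = total_slots - sum(slots)
    # Hand the slots lost to rounding to the WIs with the largest remainders.
    by_remainder = sorted(range(n), key=lambda i: quotas[i] % total_demand, reverse=True)
    for i in by_remainder[:leftover]:
        slots[i] += 1
    return slots

# Three WIs with demands 10, 30 and 60 sharing a 10-slot frame.
print(allocate_slots([10, 30, 60], 10))  # [2, 3, 5]
```

Re-running the allocation each frame with updated demand estimates would let the slot assignment follow spatial and temporal traffic variation, at the cost of distributing the demand information among the WIs.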