Efficient machine learning software stack from algorithms to compilation
Machine learning enables the extraction of knowledge from data and decision-making without explicit programming, achieving great success and revolutionizing many fields. These successes can be attributed to the continuous advancements in machine learning software and hardware, which have expanded the boundaries and facilitated breakthroughs in diverse applications. The machine learning software stack is a comprehensive collection of components used to solve problems with machine learning algorithms. It encompasses problem definitions, data processing, model and method designs, software frameworks, libraries, code optimization, and system management, supporting the entire life cycle of a machine learning project. The software stack allows the community to stand on the shoulders of previous great work and push the limits of machine learning, fostering innovation and enabling broader adoption of machine learning techniques in academia and industry.

The software stack is usually divided into algorithm and compilation layers, each with distinct design principles. Algorithm design prioritizes task-related performance, while compilation focuses on execution time and resource consumption on hardware devices. Maintaining arithmetic equivalence is optional in algorithm design but compulsory in compilation to ensure consistent results. Compilation is closer to hardware than algorithm design: compilation engineers optimize for hardware specifications, while algorithm developers usually do not prioritize hardware-friendliness. Opportunities to enhance hardware efficiency exist in algorithm and compilation designs, as well as in their interplay. Despite extensive innovations and improvements, efficiency in the machine learning software stack remains a continuing challenge. Algorithm design proposes efficient model architectures and learning algorithms, while compilation design optimizes computation graphs and simplifies operations. However, there is still a gap between the demand for efficiency and the current solutions, driven by rapidly growing workloads, limited resources in specific machine learning applications, and the need for cross-layer design. Addressing these challenges requires interdisciplinary research and collaboration. Improving efficiency in the machine learning software stack will optimize performance and enhance the accessibility and applicability of machine learning technologies.

In this dissertation, we address these efficiency challenges from the perspectives of machine learning algorithms and compilation. We introduce three novel improvements that enhance the efficiency of mainstream machine learning algorithms. Firstly, effective gradient matching for dataset condensation generates a small but informative dataset, accelerating training and related tasks. Additionally, NormSoftmax appends a normalization layer to achieve fast and stable training in Transformers and classification models. Lastly, mixed-precision hardware-aware neural architecture search combines mixed-precision quantization, neural architecture search, and hardware energy efficiency, resulting in significantly more efficient neural networks than any single method alone. However, algorithmic efficiency alone is insufficient to fully exploit the potential of the machine learning software stack. We delve into and optimize the compilation process with three techniques.
Firstly, we simplify layer normalization in influential Transformer models, obtaining two equivalent and more efficient Transformer variants with alternative normalization types. Our proposed variants enable efficient training and inference of popular models like GPT and ViT. Secondly, we formulate and solve the scheduling problem for reversible neural architectures, finding the optimal training schedule that fully leverages the computation and memory resources on hardware accelerators. Lastly, optimizer fusion allows users to accelerate the training process in the eager execution mode of machine learning frameworks. It leverages better data locality on hardware and parallelism in the computation graph. Throughout the dissertation, we emphasize the integration of efficient algorithms and compilation into a cohesive machine learning software stack. We also consider hardware properties to provide hardware-friendly software designs. We demonstrate the effectiveness of the proposed methods in algorithm and compilation through extensive experiments. Our approaches effectively reduce the time and energy required for both training and inference. Ultimately, our methods have the potential to empower machine learning practitioners and researchers to build more efficient, powerful, robust, scalable, and accessible machine learning solutions.
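To make the first algorithmic contribution concrete, below is a minimal PyTorch sketch of one gradient-matching step for dataset condensation. The dissertation's exact matching objective and training schedule are not given in this abstract, so the cosine-distance loss and all names (model, real_x, syn_x, syn_opt, ...) are illustrative assumptions.

```python
# Minimal sketch: update synthetic images so their gradients match real-data gradients.
import torch
import torch.nn.functional as F

def gradient_match_step(model, real_x, real_y, syn_x, syn_y, syn_opt):
    criterion = torch.nn.CrossEntropyLoss()
    params = list(model.parameters())

    # Gradients of the task loss on a real batch (treated as constants).
    real_grads = [g.detach() for g in
                  torch.autograd.grad(criterion(model(real_x), real_y), params)]

    # Gradients of the same loss on the synthetic batch, kept in the graph so
    # the matching loss can be differentiated w.r.t. the synthetic images.
    syn_grads = torch.autograd.grad(criterion(model(syn_x), syn_y), params,
                                    create_graph=True)

    # Matching loss: one minus cosine similarity, summed over parameter tensors.
    match = sum(1 - F.cosine_similarity(sg.flatten(), rg.flatten(), dim=0)
                for sg, rg in zip(syn_grads, real_grads))

    # Update only the synthetic images; the network weights are left untouched.
    syn_opt.zero_grad()
    match.backward()
    syn_opt.step()
    return match.item()
```

In a typical usage, syn_x would be a small learnable tensor (requires_grad=True) optimized by syn_opt, while the network is periodically re-initialized or trained on the condensed data.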
PC-SNN: Supervised Learning with Local Hebbian Synaptic Plasticity based on Predictive Coding in Spiking Neural Networks
Deemed the third generation of neural networks, event-driven Spiking Neural Networks (SNNs), combined with bio-plausible local learning rules, are promising for building low-power neuromorphic hardware. However, because of the non-linearity and discrete nature of spiking neural networks, training SNNs remains difficult and is still under discussion. Originating from gradient descent, backprop has achieved stunning success in multi-layer SNNs. Nevertheless, it is generally considered to lack biological plausibility while consuming relatively high computational resources. In this paper, we propose a novel learning algorithm inspired by predictive coding theory and show that it can perform supervised learning fully autonomously and as successfully as backprop, using only local Hebbian plasticity. Furthermore, this method achieves favorable performance compared to state-of-the-art multi-layer SNNs: test accuracy of 99.25% on the Caltech Face/Motorbike dataset, 84.25% on the ETH-80 dataset, 98.1% on the MNIST dataset, and 98.5% on the neuromorphic N-MNIST dataset. Our work also provides a new perspective on how supervised learning algorithms can be implemented directly in spiking neural circuitry, which may offer new insights into neuromorphic computation in neuroscience.
Comment: 15 pages, 11 figures
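As a rough illustration of why predictive coding needs only local signals, here is a hedged, rate-based NumPy sketch of a single-layer update. PC-SNN's spiking dynamics, neuron model, and exact update rule are not reproduced here; the sketch only shows the locality of the prediction error and the Hebbian form of the weight change.

```python
# Rate-based predictive-coding step with purely local error and Hebbian update.
import numpy as np

def pc_layer_step(x_below, x_above, W, lr_x=0.1, lr_w=0.01):
    """One inference/learning step for a single predictive-coding layer.

    x_below: activity of the lower layer (observed), shape (n_below,)
    x_above: latent activity of the upper layer, shape (n_above,)
    W: top-down prediction weights, shape (n_below, n_above)
    """
    pred = W @ x_above                        # top-down prediction of lower activity
    eps = x_below - pred                      # local prediction error
    x_above = x_above + lr_x * (W.T @ eps)    # relax latent state to reduce error
    W = W + lr_w * np.outer(eps, x_above)     # Hebbian update: error x activity
    return x_above, W, eps
```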
Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers
Transformers have achieved great success in machine learning applications.
Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root
Mean Square Normalization (RMSNorm), play a critical role in accelerating and
stabilizing the training of Transformers. While LayerNorm recenters and
rescales input vectors, RMSNorm only rescales the vectors by their RMS value.
Despite being more computationally efficient, RMSNorm may compromise the
representation ability of Transformers. There is currently no consensus
regarding the preferred normalization technique, as some models employ
LayerNorm while others utilize RMSNorm, especially in recent large language
models. It is challenging to convert Transformers with one normalization to the
other type. While there is an ongoing disagreement between the two
normalization types, we propose a solution to unify two mainstream Transformer
architectures, Pre-LN and Pre-RMSNorm Transformers. By removing the inherent
redundant mean information in the main branch of Pre-LN Transformers, we can
reduce LayerNorm to RMSNorm, achieving higher efficiency. We further propose
the Compressed RMSNorm (CRMSNorm) and Pre-CRMSNorm Transformer based on a
lossless compression of the zero-mean vectors. We formally establish the
equivalence of Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm Transformer variants in
both training and inference. This implies that Pre-LN Transformers can be substituted with their Pre-(C)RMSNorm counterparts at almost no cost, offering the same arithmetic functionality along with a free efficiency improvement. Experiments demonstrate that we can reduce the training and inference time of Pre-LN Transformers by up to 10%.
Comment: 15 pages, 5 tables, code available at https://github.com/ZixuanJiang/pre-rmsnorm-transforme
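The conversion rests on a simple identity: on a zero-mean vector, LayerNorm (without affine parameters) coincides with RMSNorm, so once the redundant mean is removed from the Pre-LN main branch, each LayerNorm can be replaced by the cheaper RMSNorm. The sketch below illustrates that identity in PyTorch; it is not the released implementation at the repository above, and shapes and tolerances are illustrative.

```python
# LayerNorm equals RMSNorm on zero-mean inputs (no affine parameters).
import torch

def layernorm(x, eps=1e-6):
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps)

def rmsnorm(x, eps=1e-6):
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

x = torch.randn(4, 512)
x = x - x.mean(dim=-1, keepdim=True)          # recenter: zero-mean vectors
assert torch.allclose(layernorm(x), rmsnorm(x), atol=1e-5)
```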
Federated Reinforcement Learning for Real-Time Electric Vehicle Charging and Discharging Control
With the recent advances in mobile energy storage technologies, electric
vehicles (EVs) have become a crucial part of smart grids. When EVs participate
in the demand response program, the charging cost can be significantly reduced
by taking full advantage of the real-time pricing signals. However, many
stochastic factors exist in the dynamic environment, making it challenging to design an optimal charging/discharging control strategy. This
paper develops an optimal EV charging/discharging control strategy for
different EV users under dynamic environments to maximize EV users' benefits.
We first formulate this problem as a Markov decision process (MDP). Then we
consider EV users with different behaviors as agents in different environments.
Furthermore, a horizontal federated reinforcement learning (HFRL)-based method
is proposed to fit various users' behaviors and dynamic environments. This
approach can learn an optimal charging/discharging control strategy without
sharing users' profiles. Simulation results illustrate that the proposed
real-time EV charging/discharging control strategy performs well under various stochastic factors.
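The privacy-preserving aspect hinges on the horizontal federated aggregation step: each EV agent trains a local control policy on its own (private) charging data, and only model parameters are shared and averaged. The sketch below shows a generic FedAvg-style aggregation in PyTorch; the paper's actual RL algorithm, network architecture, and aggregation weights are not specified here.

```python
# Generic FedAvg-style parameter averaging over local policy networks.
import copy
import torch

def federated_average(local_state_dicts):
    """Uniformly average the parameters of several local policy networks."""
    global_state = copy.deepcopy(local_state_dicts[0])
    for key in global_state:
        global_state[key] = torch.stack(
            [sd[key].float() for sd in local_state_dicts]).mean(dim=0)
    return global_state

# Typical round (pseudo-usage): each agent updates its copy of the policy on
# local transitions, then the server averages and broadcasts the result.
# global_policy.load_state_dict(
#     federated_average([agent.policy.state_dict() for agent in agents]))
```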
Comparative Synthesis: Learning Near-Optimal Network Designs by Query
When managing wide-area networks, network architects must decide how to
balance multiple conflicting metrics, and ensure fair allocations to competing
traffic while prioritizing critical traffic. The state of practice poses
challenges since architects must precisely encode their intent into formal
optimization models using abstract notions such as utility functions and ad hoc, manually tuned knobs. In this paper, we present the first effort to
synthesize optimal network designs with indeterminate objectives using an
interactive program-synthesis-based approach. We make three contributions.
First, we present comparative synthesis, an interactive synthesis framework
which produces near-optimal programs (network designs) through two kinds of
queries (Propose and Compare), without an objective explicitly given. Second,
we develop the first learning algorithm for comparative synthesis in which a
voting-guided learner picks the most informative query in each iteration. We
present a theoretical analysis of the convergence rate of the algorithm. Third,
we implemented Net10Q, a system based on our approach, and demonstrate its
effectiveness on four real-world network case studies using black-box oracles
and simulation experiments, as well as a pilot user study comprising network
researchers and practitioners. Both theoretical and experimental results show the promise of our approach.
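To give a feel for the interaction model, here is a heavily simplified, hedged Python sketch of the comparative-synthesis loop: the learner alternates Propose queries (show one candidate design) and Compare queries (ask which of two designs the architect prefers) and keeps the winner, without ever seeing an explicit objective. Net10Q's voting-guided query selection and convergence machinery are not reproduced; the random challenger below merely stands in for that choice.

```python
# Simplified Propose/Compare loop with a hidden preference oracle.
import random

def comparative_synthesis(candidates, prefer, budget=20, seed=0):
    """candidates: list of network designs; prefer(a, b) returns the preferred one."""
    rng = random.Random(seed)
    best = rng.choice(candidates)            # initial Propose query
    for _ in range(budget):
        challenger = rng.choice(candidates)  # stand-in for voting-guided selection
        best = prefer(best, challenger)      # Compare query answered by the architect
    return best

# Example with a hidden utility the learner never observes directly:
designs = list(range(100))
hidden_utility = lambda d: -(d - 37) ** 2
print(comparative_synthesis(designs, lambda a, b: max(a, b, key=hidden_utility)))
```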
SDT: A Low-cost and Topology-reconfigurable Testbed for Network Research
Network experiments are essential to network-related scientific research
(e.g., congestion control, QoS, network topology design, and traffic
engineering). However, (re)configuring various topologies on a real testbed is
expensive, time-consuming, and error-prone. In this paper, we propose
Software Defined Topology Testbed (SDT), a method for constructing a user-defined network topology using a few commodity switches. SDT is low-cost, deployment-friendly, and reconfigurable; it can run multiple sets of
experiments under different topologies by simply using different topology
configuration files at the controller we designed. We implement a prototype of
SDT and conduct numerous experiments. Evaluations show that SDT introduces at most 2% extra multi-hop latency overhead compared with full testbeds and is far more efficient than software simulators (reducing the evaluation time by up to
2899x). SDT is more cost-effective and scalable than existing Topology
Projection (TP) solutions. Further experiments show that SDT can support
various network research experiments at a low cost on topics including but not
limited to topology design, congestion control, and traffic engineering.
Comment: This paper will be published in IEEE CLUSTER 2023. Preview version online.
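Since SDT drives experiments from topology configuration files at its controller, a hedged sketch of what such a user-defined topology and its controller-side expansion might look like is given below. SDT's real configuration format and controller interface are not described in this abstract, so the schema and field names are purely illustrative.

```python
# Illustrative topology description and a controller-side adjacency expansion.
topology = {
    "nodes": ["h1", "h2", "s1", "s2"],                    # hosts and switches
    "links": [("h1", "s1"), ("s1", "s2"), ("s2", "h2")],  # emulated links
}

def adjacency(topo):
    """Group the emulated links by endpoint, as a controller might do before
    translating them into port/VLAN rules on the physical switches."""
    table = {node: [] for node in topo["nodes"]}
    for a, b in topo["links"]:
        table[a].append(b)
        table[b].append(a)
    return table

print(adjacency(topology))
# {'h1': ['s1'], 'h2': ['s2'], 's1': ['h1', 's2'], 's2': ['s1', 'h2']}
```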