84 research outputs found
Learning Sparse Neural Networks with Identity Layers
Sparsity in deep neural networks has been widely studied as a way to shrink
overparameterized networks as much as possible while preserving performance.
Existing methods focus on pruning parameters in the training process by using
thresholds and metrics. Meanwhile, feature similarity between different layers
has received little attention, even though, as this paper rigorously proves, it
is highly correlated with network sparsity. Motivated by the interlayer feature
similarity of overparameterized models, we investigate the intrinsic link
between network sparsity and interlayer feature similarity.
Specifically, using information bottleneck theory, we prove that reducing
interlayer feature similarity, as measured by Centered Kernel Alignment (CKA),
increases the sparsity of the network. Building on this result, we propose a
plug-and-play CKA-based Sparsity Regularization for sparse network training,
dubbed CKA-SR,
which utilizes CKA to reduce feature similarity between layers and increase
network sparsity. In other words, each layer of our sparse network tends to
have an identity of its own, distinct from the other layers. Experimentally, we
plug the proposed CKA-SR into the training process of several State-Of-The-Art
sparse training methods and find that it consistently improves their
performance, especially at extremely high sparsity. Code is
included in the supplementary materials
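The CKA measure that CKA-SR builds on can be sketched generically. Below is a standard linear-CKA computation plus a hypothetical pairwise regularizer illustrating the idea of penalizing interlayer similarity; it is not the authors' released code, and the `weight` parameter is an assumption.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices.

    X: (n_samples, d1), Y: (n_samples, d2) activations from two layers.
    Returns a similarity in [0, 1]; 1 means the representations match
    up to an orthogonal transform and isotropic scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def cka_regularizer(layer_feats, weight=1e-3):
    """Hypothetical sketch: sum pairwise CKA over layer activations."""
    total = 0.0
    for i in range(len(layer_feats)):
        for j in range(i + 1, len(layer_feats)):
            total += linear_cka(layer_feats[i], layer_feats[j])
    return weight * total
```

Adding such a term to the task loss pushes the layers' representations apart, which the abstract argues increases sparsity.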
A Fast Post-Training Pruning Framework for Transformers
Pruning is an effective way to reduce the huge inference cost of large
Transformer models. However, prior work on model pruning requires retraining
the model. This can add high cost and complexity to model deployment, making it
difficult to use in many practical situations. To address this, we propose a
fast post-training pruning framework for Transformers that does not require any
retraining. Given a resource constraint and a sample dataset, our framework
automatically prunes the Transformer model using structured sparsity methods.
To retain high accuracy without retraining, we introduce three novel
techniques: (i) a lightweight mask search algorithm that finds which heads and
filters to prune based on the Fisher information; (ii) mask rearrangement that
complements the search algorithm; and (iii) mask tuning that reconstructs the
output activations for each layer. We apply our method to BERT-BASE and
DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our
framework achieves up to 2.0x reduction in FLOPs and 1.56x speedup in inference
latency, while maintaining < 1% loss in accuracy. Importantly, our framework
prunes Transformers in less than 3 minutes on a single GPU, which is over two
orders of magnitude faster than existing pruning approaches that retrain. Our
code is publicly available
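The Fisher-information mask search in technique (i) can be illustrated with a hedged sketch: score each head by the empirical Fisher information of its mask variable (mean squared gradient of the loss with respect to the mask) and greedily keep the highest-scoring heads under a FLOP budget. Function names, shapes, and the greedy rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fisher_scores(mask_grads):
    """mask_grads: (n_batches, n_heads) gradients of the loss w.r.t. head masks.
    The diagonal empirical Fisher is the mean squared gradient."""
    return (mask_grads ** 2).mean(axis=0)

def search_head_mask(mask_grads, flops_per_head, flop_budget):
    """Greedily keep the most important heads that fit the FLOP budget."""
    scores = fisher_scores(mask_grads)
    order = np.argsort(-scores)            # most important head first
    mask, used = np.zeros(len(scores)), 0.0
    for h in order:
        if used + flops_per_head[h] <= flop_budget:
            mask[h] = 1.0
            used += flops_per_head[h]
    return mask
```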
Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming
Recent works on neural network pruning advocate that reducing the depth of
the network is more effective in reducing run-time memory usage and
accelerating inference latency than reducing the width of the network through
channel pruning. In this regard, some recent works propose depth compression
algorithms that merge convolution layers. However, the existing algorithms have
a constricted search space and rely on human-engineered heuristics. In this
paper, we propose a novel depth compression algorithm which targets general
convolution operations. We formulate a subset selection problem that replaces
inefficient activation layers with identity functions and optimally merges
consecutive convolution operations into shallower equivalent convolutions,
reducing end-to-end inference latency. Since the proposed
subset selection problem is NP-hard, we formulate a surrogate optimization
problem that can be solved exactly via two-stage dynamic programming within a
few seconds. We evaluate our method and the baselines using TensorRT for a fair
inference latency comparison. Our method outperforms the baseline method,
achieving higher accuracy and faster inference for MobileNetV2 on the ImageNet
dataset. Specifically, we achieve speed-up with %p accuracy gain in
MobileNetV2-1.0 on ImageNet.
Comment: ICML 2023; code at
https://github.com/snu-mllab/Efficient-CNN-Depth-Compressio
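The merging idea, replacing an activation with the identity so that two consecutive convolutions collapse into a single equivalent one, rests on the associativity of convolution. A minimal 1D NumPy sketch (ignoring channels, strides, and padding) demonstrates the equivalence:

```python
import numpy as np

def conv1d(x, k):
    """Plain 1D convolution (full mode), standing in for a conv layer."""
    return np.convolve(x, k, mode="full")

k1 = np.array([1.0, 2.0, 1.0])
k2 = np.array([0.5, -0.5])
# With the identity in between, the two layers merge into one whose
# kernel is the convolution of the two kernels.
merged = np.convolve(k1, k2, mode="full")

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
two_stage = conv1d(conv1d(x, k1), k2)   # original two-layer pipeline
one_stage = conv1d(x, merged)           # shallow equivalent convolution
```

The two outputs agree elementwise by associativity of convolution; the paper's algorithm decides which activations to drop so that such merges minimize latency.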
ECLM: Efficient Edge-Cloud Collaborative Learning with Continuous Environment Adaptation
Pervasive mobile AI applications primarily employ one of the two learning
paradigms: cloud-based learning (with powerful large models) or on-device
learning (with lightweight small models). Despite their respective advantages, neither
paradigm can effectively handle dynamic edge environments with frequent data
distribution shifts and on-device resource fluctuations, inevitably suffering
from performance degradation. In this paper, we propose ECLM, an edge-cloud
collaborative learning framework for rapid model adaptation for dynamic edge
environments. We first propose a novel block-level model decomposition design
to decompose the original large cloud model into multiple combinable modules.
By flexibly combining a subset of the modules, this design enables the
derivation of compact, task-specific sub-models for heterogeneous edge devices
from the large cloud model, and the seamless integration of new knowledge
learned on these devices into the cloud model periodically. As such, ECLM
ensures that the cloud model always provides up-to-date sub-models for edge
devices. We further propose an end-to-end learning framework that incorporates
the modular model design into an efficient model adaptation pipeline including
an offline on-cloud model prototyping and training stage, and an online
edge-cloud collaborative adaptation stage. Extensive experiments over various
datasets demonstrate that ECLM significantly improves model performance (e.g.,
18.89% accuracy increase) and resource efficiency (e.g., 7.12x communication
cost reduction) when adapting models to dynamic edge environments, through
efficient collaboration between the edge and cloud models
TopX : efficient and versatile top-k query processing for text, structured, and semistructured data
TopX is a top-k retrieval engine for text and XML data. Unlike Boolean engines, it stops query processing as soon as it can safely determine the k top-ranked result objects according to a monotonic score aggregation function with respect to a multidimensional query. The main contributions of this thesis fall into four points, confirmed by prior publications at international conferences and workshops:
• Top-k query processing with probabilistic guarantees.
• Index-access optimized top-k query processing.
• Dynamic and self-tuning, incremental query expansion for top-k query
processing.
• Efficient support for ranked XML retrieval and full-text search.
Our experiments demonstrate the viability and improved efficiency of our approach compared to existing related work across a broad variety of retrieval scenarios.
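The early-termination idea, stopping the scan once the k-th best aggregated score dominates the best possible score of any unseen object, can be sketched in the spirit of Fagin's Threshold Algorithm, which TopX-style engines build on. This is a simplified stand-in with summation as the monotonic aggregation, not TopX's actual index machinery.

```python
def topk_threshold(score_table, k):
    """Threshold-style top-k with early termination.

    score_table: dict doc_id -> tuple of per-dimension scores.
    Builds one descending list per dimension, scans them in lockstep,
    and stops as soon as the answer is provably complete.
    """
    dims = len(next(iter(score_table.values())))
    lists = [
        sorted(((s[d], doc) for doc, s in score_table.items()), reverse=True)
        for d in range(dims)
    ]
    top, seen = [], set()
    for depth in range(len(score_table)):
        last = []
        for d in range(dims):
            score, doc = lists[d][depth]
            last.append(score)                  # last score seen in this list
            if doc not in seen:
                seen.add(doc)
                top.append((sum(score_table[doc]), doc))  # random access
        top = sorted(top, reverse=True)[:k]
        # No unseen doc can aggregate above the sum of last-seen scores.
        if len(top) == k and top[-1][0] >= sum(last):
            break
    return top
```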
L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning
Data-parallel distributed training of deep neural networks (DNN) has gained
very widespread adoption, but can still experience communication bottlenecks.
To address this issue, entire families of compression mechanisms have been
developed, including quantization, sparsification, and low-rank approximation,
some of which are seeing significant practical adoption. Despite this progress,
almost all known compression schemes apply compression uniformly across DNN
layers, although layers are heterogeneous in terms of parameter count and their
impact on model accuracy. In this work, we provide a general framework for
adapting the degree of compression across the model's layers dynamically during
training, improving the overall compression, while leading to substantial
speedups, without sacrificing accuracy. Our framework, called L-GreCo, is based
on an adaptive algorithm, which automatically picks the optimal compression
parameters for model layers guaranteeing the best compression ratio while
satisfying an error constraint. Extensive experiments over image classification
and language modeling tasks show that L-GreCo is effective across all existing
families of compression methods, achieving up to 2.5x training speedup and up
to 5x compression improvement over efficient implementations of existing
approaches while recovering full accuracy.
Moreover, L-GreCo is complementary to existing adaptive algorithms, improving
their compression ratio by 50% and practical throughput by 66%
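The per-layer selection problem L-GreCo solves can be illustrated with a simple greedy stand-in: each layer offers candidate compression levels with an estimated error and a compressed size, and we pick the strongest compression per layer that still fits a global error budget. The candidate `(error, size)` pairs and the greedy rule are hypothetical; the paper's adaptive algorithm optimizes this choice more carefully.

```python
def select_levels(candidates, error_budget):
    """Pick one compression level per layer under a total error budget.

    candidates: list (one entry per layer) of (error, size) options,
    sorted so the strongest compression (smallest size) comes first.
    """
    choice, err = [], 0.0
    for options in candidates:
        for e, s in options:                # try strongest compression first
            if err + e <= error_budget:
                choice.append((e, s))
                err += e
                break
        else:
            choice.append(options[-1])      # fall back to least-lossy option
            err += options[-1][0]
    return choice
```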
Adaptive Model Pruning for Communication and Computation Efficient Wireless Federated Learning
Most existing wireless federated learning (FL) studies have focused on homogeneous model settings in which devices train identical local models. In this setting, devices with poor communication and computation capabilities may delay the global model update and degrade the performance of FL. Moreover, in homogeneous model settings, the scale of the global model is restricted by the device with the lowest capability. To tackle these challenges, this work proposes an adaptive model pruning-based FL (AMP-FL) framework, where the edge server dynamically generates sub-models by pruning the global model for devices' local training, adapting to their heterogeneous computation capabilities and time-varying channel conditions. Since involving diverse structures of devices' sub-models in the global model update may negatively affect training convergence, we propose compensating for the gradients of pruned model regions with devices' historical gradients. We then introduce an age of information (AoI) metric to characterize the staleness of local gradients and theoretically analyze the convergence behaviour of AMP-FL. The convergence bound suggests scheduling devices whose gradients have large AoI and, for each device, pruning the model regions with small AoI to improve learning performance. Inspired by this, we define a new objective function, i.e., the average AoI of local gradients, to transform the inexplicit global loss minimization problem into a tractable one for device scheduling, model pruning, and resource block (RB) allocation design. Through detailed analysis, we derive the optimal model pruning strategy and transform the RB allocation problem into an equivalent linear program that can be solved effectively. Experimental results demonstrate the effectiveness and superiority of the proposed approaches.
The proposed AMP-FL achieves 1.9x and 1.6x speedups for FL on the MNIST and CIFAR-10 datasets, respectively, compared with FL schemes using homogeneous model settings
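The gradient-compensation and AoI bookkeeping described above can be sketched per model region. The array shapes and the update rule below are illustrative assumptions, not the paper's exact formulation: regions pruned from a device's sub-model contribute their most recent historical gradient, and each region's AoI counts rounds since its gradient was last refreshed.

```python
import numpy as np

def compensate(fresh_grad, mask, history, aoi):
    """One round of gradient compensation for one device.

    fresh_grad: per-region gradients computed this round.
    mask: 1 where the region was kept in the device's sub-model, 0 if pruned.
    history: last known gradient per region; aoi: rounds since last refresh.
    """
    full = np.where(mask == 1, fresh_grad, history)   # fill pruned regions
    history = full.copy()                             # remember latest gradients
    aoi = np.where(mask == 1, 0, aoi + 1)             # reset AoI where trained
    return full, history, aoi
```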