The Expressive Power of Tuning Only the Norm Layers
Feature normalization transforms such as Batch and Layer-Normalization have
become indispensable ingredients of state-of-the-art deep neural networks.
Recent studies on fine-tuning large pretrained models indicate that just tuning
the parameters of these affine transforms can achieve high accuracy for
downstream tasks. These findings raise the question of the expressive power
of tuning only the normalization layers of frozen networks. In this work, we take
a first step towards answering this question and show that for random ReLU networks,
fine-tuning only their normalization layers can reconstruct any target network
that is $O(\sqrt{\text{width}})$ times smaller. We show that this holds even
for randomly sparsified networks, under sufficient overparameterization, in
agreement with prior empirical work.
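As a concrete illustration of the setting (not code from the paper), the following PyTorch sketch freezes every parameter of a random ReLU network except the affine scale and shift parameters of its normalization layers; the layer widths and optimizer are placeholder choices.

```python
import torch
import torch.nn as nn

# A random ReLU network with LayerNorm blocks; widths are illustrative.
model = nn.Sequential(
    nn.Linear(128, 512), nn.LayerNorm(512), nn.ReLU(),
    nn.Linear(512, 512), nn.LayerNorm(512), nn.ReLU(),
    nn.Linear(512, 10),
)

# Freeze every parameter, then re-enable only the affine parameters
# (weight = per-feature scale, bias = shift) of the normalization layers.
for param in model.parameters():
    param.requires_grad = False

norm_types = (nn.LayerNorm, nn.BatchNorm1d, nn.BatchNorm2d, nn.GroupNorm)
for module in model.modules():
    if isinstance(module, norm_types):
        for param in module.parameters():
            param.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-2)
print(f"tuning {sum(p.numel() for p in trainable)} of "
      f"{sum(p.numel() for p in model.parameters())} parameters")
```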
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
Deep Neural Networks (DNNs) have been a large driver and enabler for AI
breakthroughs in recent years. These models have been getting larger in their
attempt to become more accurate and to tackle emerging use cases, including
AR/VR and intelligent assistants. However, training such large models is
costly and time-consuming and typically yields a single model intended to fit
all targets. To mitigate this, various techniques have been
proposed in the literature, including pruning, sparsification or quantization
of the model weights and updates. While able to achieve high compression rates,
they often incur computational overheads or accuracy penalties. Alternatively,
factorization methods have been leveraged to incorporate low-rank compression
in the training process. However, such techniques (e.g., SVD) frequently rely
on the computationally expensive decomposition of layers and are potentially
sub-optimal for non-linear models such as DNNs. In this work, we take a
further step in designing efficient low-rank models and propose Maestro, a
framework for trainable low-rank layers. Instead of regularly applying a priori
decompositions such as SVD, the low-rank structure is built into the training
process through a generalized variant of Ordered Dropout. This method imposes
an importance ordering via sampling on the decomposed DNN structure. Our
theoretical analysis demonstrates that our method recovers the SVD
decomposition of a linear mapping on uniformly distributed data and PCA for
linear autoencoders. We further apply our technique on DNNs and empirically
illustrate that Maestro enables the extraction of lower-footprint models that
preserve model performance while allowing for a graceful accuracy-latency
tradeoff when deploying to devices of different capabilities.
Comment: Under review.
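A rough sketch of the core idea (not the authors' implementation): a linear layer stored as two trainable low-rank factors, where each training step samples a rank cutoff and uses only the leading components, so an importance ordering over the decomposition emerges in the spirit of Ordered Dropout; the uniform rank distribution and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OrderedLowRankLinear(nn.Module):
    """Linear layer y = (U[:, :r] @ V[:r, :]) x + b with a sampled rank cutoff r.

    During training, r is drawn uniformly from {1, ..., max_rank}, so the
    leading rank-1 components are used most often and learn to carry the most
    importance (a generalized Ordered Dropout over the decomposition).
    """
    def __init__(self, in_features, out_features, max_rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_features, max_rank) * 0.02)
        self.V = nn.Parameter(torch.randn(max_rank, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.max_rank = max_rank

    def forward(self, x, rank=None):
        if rank is None:
            # Sample a rank cutoff at train time; use the full rank at eval time.
            rank = torch.randint(1, self.max_rank + 1, ()).item() if self.training else self.max_rank
        W = self.U[:, :rank] @ self.V[:rank, :]  # low-rank weight of rank <= max_rank
        return x @ W.T + self.bias

# A smaller deployment model could later be extracted by keeping only the
# first r components of U and V.
layer = OrderedLowRankLinear(in_features=256, out_features=128, max_rank=32)
y = layer(torch.randn(4, 256))
```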
Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment
Word translation without parallel corpora has become feasible, rivaling the
performance of supervised methods. Recent findings have shown that the accuracy
and robustness of unsupervised word translation (UWT) can be improved by making
use of visual observations, which are universal representations across
languages. In this work, we investigate the potential of using not only visual
observations but also pretrained language-image models for enabling a more
efficient and robust UWT. Specifically, we develop a novel UWT method dubbed
Word Alignment using Language-Image Pretraining (WALIP), which leverages visual
observations via the shared embedding space of images and texts provided by
CLIP models (Radford et al., 2021). WALIP has a two-step procedure. First, we
retrieve word pairs with high confidence of similarity, computed using our
proposed image-based fingerprints, which define the initial pivot for the word
alignment. Second, we apply our robust Procrustes algorithm to estimate the
linear mapping between two embedding spaces, which iteratively corrects and
refines the estimated alignment. Our extensive experiments show that WALIP
improves upon the state-of-the-art performance of bilingual word alignment for
a few language pairs across different word embeddings and displays great
robustness to the dissimilarity of language pairs or training corpora for the
two word embeddings.
Comment: In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP).
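As a simplified illustration of the second step only (not the authors' code), the NumPy sketch below performs iterative orthogonal Procrustes alignment between two embedding spaces starting from an initial set of seed pairs; the score-based filtering is a crude stand-in for the robust Procrustes procedure described above, and all data here are synthetic.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Return the orthogonal map W minimizing ||X @ W.T - Y||_F for paired rows."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

def iterative_alignment(src_emb, tgt_emb, seed_pairs, n_iters=5, keep_frac=0.8):
    """Iteratively refine a linear map between two embedding spaces.

    src_emb, tgt_emb: (n_words, d) L2-normalized embedding matrices.
    seed_pairs: list of (src_idx, tgt_idx) initial high-confidence pairs.
    At each iteration, re-estimate W on the current pairs, then keep only the
    best-aligned fraction of pairs (a simple stand-in for robust filtering).
    """
    pairs = np.array(seed_pairs)
    W = None
    for _ in range(n_iters):
        X, Y = src_emb[pairs[:, 0]], tgt_emb[pairs[:, 1]]
        W = orthogonal_procrustes(X, Y)
        # Cosine alignment score of each mapped source word with its target.
        scores = np.sum((src_emb[pairs[:, 0]] @ W.T) * tgt_emb[pairs[:, 1]], axis=1)
        keep = np.argsort(-scores)[: max(1, int(keep_frac * len(pairs)))]
        pairs = pairs[keep]
    return W

# Toy usage with random data standing in for the CLIP-anchored seed pairs.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 16)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(100, 16)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
W = iterative_alignment(src, tgt, seed_pairs=[(i, i) for i in range(30)])
```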