IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization
Fine-tuning pre-trained language models (PTLMs), such as BERT and its better
variant RoBERTa, has been a common practice for advancing performance in
natural language understanding (NLU) tasks. Recent advances in representation
learning show that isotropic (i.e., unit-variance and uncorrelated) embeddings
can significantly improve performance on downstream tasks with faster
convergence and better generalization. The isotropy of the pre-trained
embeddings in PTLMs, however, is relatively under-explored. In this paper, we
analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with
straightforward visualization, and point out two major issues: high variance in
their standard deviation, and high correlation between different dimensions. We
also propose a new network regularization method, isotropic batch normalization
(IsoBN) to address the issues, towards learning more isotropic representations
in fine-tuning by dynamically penalizing dominating principal components. This
simple yet effective fine-tuning method yields about a 1.0-point absolute gain
on the average of seven NLU tasks.
Comment: AAAI 2021
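As a rough illustration of the diagnostics and whitening intuition described above (not the authors' exact IsoBN layer), the following NumPy sketch measures the two reported issues for a batch of [CLS] embeddings and applies a simple PCA-whitening rescaling that damps dominating principal components; the embeddings array of shape (batch, hidden) is an assumed input.

    import numpy as np

    def isotropy_diagnostics(embeddings):
        # embeddings: (batch, hidden) array of [CLS] vectors.
        # Issue 1: spread of the per-dimension standard deviations.
        stds = embeddings.std(axis=0)
        # Issue 2: correlation between different dimensions.
        corr = np.corrcoef(embeddings, rowvar=False)
        off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
        return stds.min(), stds.max(), np.abs(off_diag).mean()

    def pca_whiten(embeddings, eps=1e-5):
        # Rescale principal components to unit variance, damping the
        # dominating directions (a stand-in for IsoBN's learned
        # penalty, not the paper's formula).
        centered = embeddings - embeddings.mean(axis=0)
        cov = centered.T @ centered / len(centered)
        eigvals, eigvecs = np.linalg.eigh(cov)
        return (centered @ eigvecs) / np.sqrt(eigvals + eps)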
Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-word
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks.
Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19)
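A minimal NumPy sketch of the core reduced-rank ridge idea (the exact loss and preprocessing in Cr5 differ, and X, Y, lam, and rank here are illustrative): fit a ridge-regression weight matrix from bag-of-words features to concept labels, then factor it at low rank with an SVD so that documents map into a shared low-dimensional space.

    import numpy as np

    def reduced_rank_ridge(X, Y, lam, rank):
        # X: (docs, vocab) bag-of-words features for one language.
        # Y: (docs, concepts) one-hot concept labels.
        d = X.shape[1]
        # Full-rank ridge solution W: (vocab, concepts).
        W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
        # Truncated SVD factors W ~= A @ B under the rank constraint.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * s[:rank]   # vocab -> shared embedding space
        B = Vt[:rank]                # shared space -> concept scores
        return A, B

    # A document's language-independent embedding is then bow @ A,
    # using the A fitted on that document's language.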
Optimality of the Johnson-Lindenstrauss Lemma
For any integers $d, n \geq 2$ and $1/(\min\{n,d\})^{0.4999} < \varepsilon < 1$,
we show the existence of a set of $n$ vectors $X \subset \mathbb{R}^d$ such that
any embedding $f : X \to \mathbb{R}^m$ satisfying
$$\forall x, y \in X,\quad (1-\varepsilon)\|x-y\|_2^2 \le \|f(x)-f(y)\|_2^2 \le (1+\varepsilon)\|x-y\|_2^2$$
must have
$$m = \Omega(\varepsilon^{-2} \lg n).$$
This lower bound matches the upper bound given by the Johnson-Lindenstrauss
lemma [JL84]. Furthermore, our lower bound holds for nearly the full range of
$\varepsilon$ of interest, since there is always an isometric embedding into
dimension $\min\{d, n\}$ (either the identity map, or projection onto
$\mathrm{span}(X)$).
Previously such a lower bound was only known to hold against linear maps $f$,
and not for such a wide range of parameters $\varepsilon$ [LN16]. The best
previously known lower bound for general $f$ was
$m = \Omega(\varepsilon^{-2} \lg n / \lg(1/\varepsilon))$ [Wel74, Lev83, Alo03],
which is suboptimal for any $\varepsilon = o(1)$.
Comment: v2: simplified proof, also added reference to Lev83
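To see the matching upper bound in action, here is a small NumPy experiment (the sizes and the leading constant 8 are arbitrary illustrative choices): a random Gaussian map into m = O(eps^-2 log n) dimensions keeps all squared pairwise distances within a 1 +/- eps factor with high probability.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, eps = 100, 2000, 0.5
    m = int(np.ceil(8 * np.log(n) / eps**2))   # m << d

    X = rng.standard_normal((n, d))
    P = rng.standard_normal((d, m)) / np.sqrt(m)   # Gaussian JL map
    Y = X @ P

    def sq_dists(Z):
        # Squared pairwise distances via the Gram-matrix identity.
        g = Z @ Z.T
        s = np.diag(g)
        return s[:, None] + s[None, :] - 2 * g

    mask = ~np.eye(n, dtype=bool)
    ratios = sq_dists(Y)[mask] / sq_dists(X)[mask]
    print("max |ratio - 1| =", np.abs(ratios - 1).max())  # typically < eps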
Exploiting Metric Structure for Efficient Private Query Release
We consider the problem of privately answering queries defined on databases
which are collections of points belonging to some metric space. We give simple,
computationally efficient algorithms for answering distance queries defined
over an arbitrary metric. Distance queries are specified by points in the
metric space, and ask for the average distance from the query point to the
points contained in the database, according to the specified metric. Our
algorithms run efficiently in the database size and the dimension of the space,
and operate in both the online query release setting, and the offline setting
in which they must in polynomial time generate a fixed data structure which can
answer all queries of interest. This represents one of the first subclasses of
linear queries for which efficient algorithms are known for the private query
release problem, circumventing known hardness results for generic linear
queries.
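As a point of reference (this is the generic Laplace-mechanism baseline, not the paper's more accurate construction, and the Euclidean norm stands in for an arbitrary metric), a single average-distance query can be answered privately because changing one database point moves the average by at most diameter/n:

    import numpy as np

    def private_avg_distance(data, query, eps, diameter,
                             rng=np.random.default_rng()):
        # data: (n, dim) points; query: (dim,) point in the same space.
        # Replacing a single database point changes the average
        # distance by at most diameter / n, bounding the sensitivity.
        n = len(data)
        true_answer = np.linalg.norm(data - query, axis=1).mean()
        sensitivity = diameter / n
        # Laplace noise calibrated to sensitivity gives eps-privacy.
        return true_answer + rng.laplace(scale=sensitivity / eps)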
Dimension reduction by random hyperplane tessellations
Given a subset K of the unit Euclidean sphere, we estimate the minimal number
m = m(K) of hyperplanes that generate a uniform tessellation of K, in the sense
that the fraction of the hyperplanes separating any pair x, y in K is nearly
proportional to the Euclidean distance between x and y. Random hyperplanes
prove to be almost ideal for this problem; they achieve the almost optimal
bound m = O(w(K)^2) where w(K) is the Gaussian mean width of K. Using the map
that sends x in K to the sign vector with respect to the hyperplanes, we
conclude that every bounded subset K of R^n embeds into the Hamming cube {-1,
1}^m with a small distortion in the Gromov-Hausdorff metric. Since for many
sets K one has m = m(K) << n, this yields a new discrete mechanism of dimension
reduction for sets in Euclidean spaces.
Comment: 17 pages, 3 figures, minor update
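A short NumPy illustration of the sign-vector map (the sizes are arbitrary; the theorem's hyperplane count is m = O(w(K)^2)): for points on the unit sphere, the fraction of random hyperplanes through the origin separating a pair equals, in expectation, the angle between them over pi, which is comparable to their Euclidean distance.

    import numpy as np

    rng = np.random.default_rng(0)
    n, dim, m = 50, 100, 2000

    # Points on the unit Euclidean sphere.
    K = rng.standard_normal((n, dim))
    K /= np.linalg.norm(K, axis=1, keepdims=True)

    # m random hyperplanes through the origin with Gaussian normals;
    # the embedding sends each x to its sign pattern in {-1, 1}^m.
    H = rng.standard_normal((m, dim))
    signs = np.sign(K @ H.T)

    # Compare, for one pair, the fraction of separating hyperplanes
    # with its expectation, the angle between the points over pi.
    i, j = 0, 1
    frac = np.mean(signs[i] != signs[j])
    expected = np.arccos(np.clip(K[i] @ K[j], -1.0, 1.0)) / np.pi
    print(frac, expected)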
Approximate Matrix Multiplication with Application to Linear Embeddings
In this paper, we study the problem of approximately computing the product of
two real matrices. In particular, we analyze a dimensionality-reduction-based
approximation algorithm due to Sarlos [1], introducing the notion of nuclear
rank as the ratio of the nuclear norm over the spectral norm. The presented
bound has an improved dependence on the approximation error compared to
previous approaches, while the subspace onto which we project the input
matrices has dimension proportional to the maximum of their nuclear ranks and
is independent of the input dimensions. In addition, we
provide an application of this result to linear low-dimensional embeddings.
Namely, we show that any Euclidean point-set with bounded nuclear rank is
amenable to projection onto a number of dimensions that is independent of the
input dimensionality, while achieving additive error guarantees.
Comment: 8 pages, International Symposium on Information Theory
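A minimal sketch of the dimensionality-reduction-based approximation analyzed above (the Gaussian sketch and the choice of k are illustrative; the paper's guarantee is stated in terms of the nuclear rank rather than this heuristic):

    import numpy as np

    def approx_matmul(A, B, k, rng=np.random.default_rng(0)):
        # Project both inputs onto a random k-dimensional subspace and
        # multiply there: (SA)^T (SB) ~= A^T B, so the final product
        # no longer depends on the shared inner dimension n.
        n = A.shape[0]
        S = rng.standard_normal((k, n)) / np.sqrt(k)
        return (S @ A).T @ (S @ B)

    # Spectral-norm error roughly shrinks like 1/sqrt(k).
    A = np.random.default_rng(1).standard_normal((500, 30))
    B = np.random.default_rng(2).standard_normal((500, 40))
    err = np.linalg.norm(approx_matmul(A, B, 200) - A.T @ B, 2)
    print("spectral-norm error:", err)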