Correlated Attention in Transformers for Multivariate Time Series
Multivariate time series (MTS) analysis prevails in real-world applications
such as finance, climate science and healthcare. The various self-attention
mechanisms, the backbone of the state-of-the-art Transformer-based models,
efficiently discover temporal dependencies, yet cannot adequately capture the
intricate cross-correlation between different features of MTS data, which
inherently stems from complex dynamical systems in practice. To this end, we
propose a novel correlated attention mechanism, which not only efficiently
captures feature-wise dependencies, but can also be seamlessly integrated
within the encoder blocks of existing well-known Transformers to gain
efficiency improvements. In particular, correlated attention operates across
feature channels to compute cross-covariance matrices between queries and keys
with different lag values, and selectively aggregates representations at the
sub-series level. This architecture facilitates automated discovery and
representation learning of not only instantaneous but also lagged
cross-correlations, while inherently capturing time series auto-correlation.
When combined with prevalent Transformer baselines, the correlated attention
mechanism constitutes a better alternative for encoder-only architectures,
which are suitable for a wide range of tasks including imputation, anomaly
detection and classification. Extensive experiments on the aforementioned tasks
consistently underscore the advantages of the correlated attention mechanism in
enhancing base Transformer models, and demonstrate our state-of-the-art results
in imputation, anomaly detection and classification.
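The lagged cross-covariance idea above can be illustrated with a minimal numpy sketch: for each candidate lag, a feature-by-feature cross-covariance matrix between queries and time-shifted keys is turned into attention weights over channels, and the per-lag outputs are combined by softmax-weighted correlation strength. Function names, the circular shift, and the per-lag (rather than sub-series) aggregation are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correlated_attention(Q, K, V, lags=(0, 1, 2)):
    """Sketch: attention over feature channels via lagged cross-covariance.

    Q, K, V have shape (T, D): T time steps, D feature channels.
    For each lag, a D x D cross-covariance between queries and
    time-shifted keys scores feature-wise interactions; the per-lag
    outputs are then softmax-aggregated by correlation strength.
    """
    T, D = Q.shape
    outputs, strengths = [], []
    for lag in lags:
        K_lag = np.roll(K, shift=lag, axis=0)                   # circular shift as a simplification
        cov = (Q - Q.mean(0)).T @ (K_lag - K_lag.mean(0)) / T   # D x D cross-covariance
        attn = softmax(cov / np.sqrt(D), axis=-1)               # weights over feature channels
        outputs.append(V @ attn.T)                              # mix V's channels, keep shape (T, D)
        strengths.append(np.abs(cov).mean())                    # how correlated this lag is overall
    weights = softmax(np.array(strengths))                      # selective aggregation over lags
    return sum(w * o for w, o in zip(weights, outputs))

# Toy usage on random data
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((96, 8)) for _ in range(3))
out = correlated_attention(Q, K, V)    # shape (96, 8)
```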
On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms
The stochastic gradient descent (SGD) algorithm is the method of choice in many
machine learning tasks thanks to its scalability and efficiency in dealing with
large-scale problems. In this paper, we focus on the shuffling version of SGD
which matches the mainstream practical heuristics. We show the convergence to a
global solution of shuffling SGD for a class of non-convex functions under
over-parameterized settings. Our analysis employs more relaxed non-convex
assumptions than the previous literature. Nevertheless, we maintain the desired
computational complexity that shuffling SGD has achieved in the general convex
setting.
Comment: The 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
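The shuffling (random-reshuffling) scheme analysed above can be written as a short loop: each epoch draws a fresh permutation and makes exactly one pass over all component functions. The sketch and its toy least-squares usage below are illustrative; the names, step size, and problem instance are assumptions and do not reproduce the paper's over-parameterized setting.

```python
import numpy as np

def shuffling_sgd(grad_fn, w0, n_samples, lr=0.01, epochs=20, seed=0):
    """Shuffling-type SGD: each epoch visits every sample exactly once
    in a freshly shuffled order (random reshuffling).
    grad_fn(w, i) returns the gradient of the i-th component function."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        perm = rng.permutation(n_samples)   # new shuffle every epoch
        for i in perm:                      # one full pass over the data
            w -= lr * grad_fn(w, i)
    return w

# Toy usage: component losses f_i(w) = 0.5 * (x_i @ w - y_i)^2
rng = np.random.default_rng(1)
X, y = rng.standard_normal((200, 5)), rng.standard_normal(200)
grad = lambda w, i: (X[i] @ w - y[i]) * X[i]
w_hat = shuffling_sgd(grad, np.zeros(5), n_samples=200)
```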
Generalizing DP-SGD with Shuffling and Batch Clipping
Classical differentially private DP-SGD implements individual clipping with
random subsampling, which forces a mini-batch SGD approach. We provide a
general differentially private algorithmic framework that goes beyond DP-SGD and
allows any possible first-order optimizer (e.g., classical SGD and momentum-based
SGD approaches) in combination with batch clipping, which clips an
aggregate of computed gradients rather than summing clipped gradients (as is
done in individual clipping). The framework also admits sampling techniques
beyond random subsampling such as shuffling. Our DP analysis follows the f-DP
approach and introduces a new proof technique which allows us to derive simple
closed-form expressions and to also analyse group privacy. In particular, for E
epochs of work and groups of size g, we show a √(gE) DP dependency
for batch clipping with shuffling.
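The difference between individual clipping and batch clipping, together with shuffling in place of random subsampling, can be sketched in a few lines of numpy. The mean aggregate, the sigma * C noise scale, and all names below are illustrative assumptions, not the paper's framework.

```python
import numpy as np

def clip(v, C):
    """Scale v so that its L2 norm is at most C."""
    n = np.linalg.norm(v)
    return v if n <= C else v * (C / n)

def individual_clipping_step(per_example_grads, C, sigma, rng):
    """Classical DP-SGD: clip every per-example gradient, sum, add noise."""
    summed = sum(clip(g, C) for g in per_example_grads)
    return summed + sigma * C * rng.standard_normal(summed.shape)

def batch_clipping_step(per_example_grads, C, sigma, rng):
    """Batch clipping: aggregate first (a plain mean here, though any
    first-order update could produce the aggregate), then clip once."""
    aggregate = np.mean(per_example_grads, axis=0)
    return clip(aggregate, C) + sigma * C * rng.standard_normal(aggregate.shape)

# Shuffling instead of random subsampling: one shuffled pass over the data,
# split into consecutive mini-batches (illustrative only).
rng = np.random.default_rng(0)
grads = rng.standard_normal((32, 10))     # 32 per-example gradients of dim 10
order = rng.permutation(len(grads))
batches = np.array_split(order, 4)        # 4 mini-batches from a single shuffle
noisy_updates = [batch_clipping_step(grads[b], C=1.0, sigma=0.5, rng=rng)
                 for b in batches]
```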
- …