FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Convolution models with long filters have demonstrated state-of-the-art
reasoning abilities in many long-sequence tasks but lag behind the most
optimized Transformers in wall-clock time. A major bottleneck is the Fast
Fourier Transform (FFT)--which allows long convolutions to run in
O(N log N) time in sequence length but has poor hardware utilization. In this paper,
we study how to optimize the FFT convolution. We find two key bottlenecks: the
FFT does not effectively use specialized matrix multiply units, and it incurs
expensive I/O between layers of the memory hierarchy. In response, we propose
FlashFFTConv. FlashFFTConv uses a matrix decomposition that computes the FFT
using matrix multiply units and enables kernel fusion for long sequences,
reducing I/O. We also present two sparse convolution algorithms--1) partial
convolutions and 2) frequency-sparse convolutions--which can be implemented
simply by skipping blocks in the matrix decomposition, enabling further
opportunities for memory and compute savings. FlashFFTConv speeds up exact FFT
convolutions by up to 7.93× over PyTorch and achieves up to 4.4×
speedup end-to-end. Given the same compute budget, FlashFFTConv allows
Hyena-GPT-s to achieve 2.3 points better perplexity on the Pile and
M2-BERT-base to achieve 3.3 points higher GLUE score--matching models with
twice the parameter count. FlashFFTConv also achieves 96.1% accuracy on
Path-512, a high-resolution vision task where no model had previously achieved
better than 50%. Furthermore, partial convolutions enable longer-sequence
models--yielding the first DNA model that can process the longest human genes
(2.3M base pairs)--and frequency-sparse convolutions speed up pretrained models
while maintaining or improving model quality.
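The heart of the method is the classic four-step (Cooley-Tukey) factorization, which rewrites one large FFT as two batches of small DFT matrix multiplies plus an elementwise twiddle correction; this is the form that maps onto tensor cores. A minimal NumPy sketch of that decomposition (illustrative code, not FlashFFTConv's API):

import numpy as np

def dft_matrix(n):
    # Dense DFT matrix F[j, k] = exp(-2*pi*i*j*k/n).
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def fft_via_matmuls(x, n1, n2):
    # Four-step FFT: a length-(n1*n2) transform as two sets of small
    # DFT matrix multiplies plus elementwise twiddle factors.
    n = n1 * n2
    X = x.reshape(n1, n2)                  # view the sequence as n1 x n2
    X = dft_matrix(n1) @ X                 # n2 DFTs of size n1
    twiddle = np.exp(-2j * np.pi *
                     np.outer(np.arange(n1), np.arange(n2)) / n)
    X = X * twiddle                        # twiddle correction
    X = X @ dft_matrix(n2)                 # n1 DFTs of size n2
    return X.T.reshape(n)                  # reorder outputs

x = np.random.randn(1024)
assert np.allclose(fft_via_matmuls(x, 32, 32), np.fft.fft(x))

Skipping blocks of rows or columns in these matrix multiplies is exactly the freedom that the partial and frequency-sparse variants exploit.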
Cross-Modal Data Programming Enables Rapid Medical Machine Learning
Labeling training datasets has become a key barrier to building medical
machine learning models. One strategy is to generate training labels
programmatically, for example by applying natural language processing pipelines
to text reports associated with imaging studies. We propose cross-modal data
programming, which generalizes this intuitive strategy in a
theoretically-grounded way that enables simpler, clinician-driven input,
reduces required labeling time, and improves with additional unlabeled data. In
this approach, clinicians generate training labels for models defined over a
target modality (e.g. images or time series) by writing rules over an auxiliary
modality (e.g. text reports). The resulting technical challenge consists of
estimating the accuracies and correlations of these rules; we extend a recent
unsupervised generative modeling technique to handle this cross-modal setting
in a provably consistent way. Across four applications in radiography, computed
tomography, and electroencephalography, and using only several hours of
clinician time, our approach matches or exceeds the efficacy of
physician-months of hand-labeling with statistical significance, demonstrating
a fundamentally faster and more flexible way of building machine learning
models in medicine.
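As a toy illustration of the cross-modal setup, the sketch below applies keyword rules to text reports and aggregates their votes into a training label for the paired image. The rules are invented for illustration, and a simple sign-of-sum vote stands in for the paper's generative label model, which also estimates the rules' accuracies and correlations:

ABSTAIN, NORMAL, ABNORMAL = 0, -1, 1

# Illustrative clinician-written rules over the auxiliary (text) modality.
def lf_no_acute(report):
    return NORMAL if "no acute" in report.lower() else ABSTAIN

def lf_effusion(report):
    return ABNORMAL if "effusion" in report.lower() else ABSTAIN

def lf_normal_study(report):
    return NORMAL if "normal study" in report.lower() else ABSTAIN

RULES = [lf_no_acute, lf_effusion, lf_normal_study]

def weak_label(report):
    # Sign-of-sum vote; the paper instead fits a generative model to
    # weight each rule by its estimated accuracy.
    total = sum(rule(report) for rule in RULES)
    return ABSTAIN if total == 0 else (NORMAL if total < 0 else ABNORMAL)

# Each (report, image) pair then yields (image, weak_label(report)), a
# training example for a model over the target (image) modality.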
Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision
Foundation models offer an exciting new paradigm for constructing models with
out-of-the-box embeddings and a few labeled examples. However, it is not clear
how to best apply foundation models without labeled data. A potential approach
is to fuse foundation models with weak supervision frameworks, which use weak
label sources -- pre-trained models, heuristics, crowd-workers -- to construct
pseudolabels. The challenge is building a combination that best exploits the
signal available in both foundation models and weak sources. We propose Liger,
a combination that uses foundation model embeddings to improve two crucial
elements of existing weak supervision techniques. First, we produce finer
estimates of weak source quality by partitioning the embedding space and
learning per-part source accuracies. Second, we improve source coverage by
extending source votes in embedding space. Despite the black-box nature of
foundation models, we prove results characterizing how our approach improves
performance and show that lift scales with the smoothness of label
distributions in embedding space. On six benchmark NLP and video tasks, Liger
outperforms vanilla weak supervision by 14.1 points, weakly-supervised kNN and
adapters by 11.8 points, and kNN and adapters supervised by traditional hand
labels by 7.2 points.
Comment: UAI 2022 Camera Ready
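A rough sketch of Liger's two elements, under simplifying assumptions not in the abstract: embeddings are partitioned with k-means, per-part source accuracies are scored against a small labeled set for brevity (the paper estimates them without labels), and abstaining votes are filled in from each source's nearest fired neighbor in embedding space:

import numpy as np
from sklearn.cluster import KMeans

ABSTAIN = 0

def per_part_accuracies(emb, votes, y_dev, n_parts=8):
    # Element 1: partition the embedding space, then estimate each weak
    # source's accuracy separately within each part.
    parts = KMeans(n_clusters=n_parts, n_init=10).fit_predict(emb)
    acc = np.full((n_parts, votes.shape[1]), 0.5)
    for p in range(n_parts):
        for s in range(votes.shape[1]):
            fired = (parts == p) & (votes[:, s] != ABSTAIN)
            if fired.any():
                acc[p, s] = (votes[fired, s] == y_dev[fired]).mean()
    return parts, acc

def extend_votes(emb, votes):
    # Element 2: improve coverage by copying each source's vote from
    # the nearest embedding where that source fired.
    out = votes.copy()
    for s in range(votes.shape[1]):
        fired = np.flatnonzero(votes[:, s] != ABSTAIN)
        if fired.size == 0:
            continue
        for i in np.flatnonzero(votes[:, s] == ABSTAIN):
            dists = np.linalg.norm(emb[fired] - emb[i], axis=1)
            out[i, s] = votes[fired[np.argmin(dists)], s]
    return out

Extending votes this way is only safe when nearby points tend to share a label, which matches the abstract's claim that lift scales with the smoothness of label distributions in embedding space.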
Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning
An ideal learned representation should display transferability and
robustness. Supervised contrastive learning (SupCon) is a promising method for
training accurate models, but produces representations that do not capture
these properties due to class collapse -- when all points in a class map to the
same representation. Recent work suggests that "spreading out" these
representations improves them, but the precise mechanism is poorly understood.
We argue that creating spread alone is insufficient for better representations,
since spread is invariant to permutations within classes. Instead, both the
correct degree of spread and a mechanism for breaking this invariance are
necessary. We first prove that adding a weighted class-conditional InfoNCE loss
to SupCon controls the degree of spread. Next, we study three mechanisms to
break permutation invariance: using a constrained encoder, adding a
class-conditional autoencoder, and using data augmentation. We show that the
latter two encourage clustering of latent subclasses under more realistic
conditions than the former. Using these insights, we show that adding a
properly-weighted class-conditional InfoNCE loss and a class-conditional
autoencoder to SupCon achieves 11.1 points of lift on coarse-to-fine transfer
across 5 standard datasets and 4.7 points on worst-group robustness on 3
datasets, setting state-of-the-art on CelebA by 11.5 points.
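A hedged PyTorch sketch of the added spread term: the positive for each anchor is its second augmented view, and the softmax denominator is restricted to same-class examples, so minimizing the loss pushes same-class points apart and counteracts class collapse. The pairing and weighting details here are assumptions, not the paper's released code:

import torch
import torch.nn.functional as F

def class_conditional_infonce(z1, z2, y, tau=0.1):
    # z1, z2: (B, d) embeddings of two augmented views; y: (B,) labels.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                   # pairwise similarities
    same_class = y[:, None] == y[None, :]      # keep same-class entries
    logits = logits.masked_fill(~same_class, float("-inf"))
    # Anchor i's positive is its own second view, at column i; all
    # remaining candidates in the softmax share anchor i's class.
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Full objective (sketch): loss = supcon + w * class_conditional_infonce,
# where the weight w sets the degree of spread; the class-conditional
# autoencoder from the abstract would be added as a separate term.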