InstructBio: A Large-scale Semi-supervised Learning Paradigm for Biochemical Problems
A persistent and essential challenge in artificial intelligence for science is
the limited amount of labeled data available for real-world problems. The
prevailing approach is to pretrain a powerful task-agnostic model on a large
unlabeled corpus, but such a model may still struggle to transfer its
knowledge to downstream tasks. In this study, we propose InstructBio, a
semi-supervised learning algorithm that takes better advantage of unlabeled
examples. It introduces an instructor model that provides confidence ratios as
a measure of pseudo-label reliability. These confidence scores then guide
the target model to pay distinct attention to different data points, avoiding
the over-reliance on labeled data and the negative influence of incorrect
pseudo-annotations. Comprehensive experiments show that InstructBio
substantially improves the generalization ability of molecular models, not
only in molecular property prediction but also in activity cliff estimation,
demonstrating the superiority of the proposed method. Furthermore, our evidence
indicates that InstructBio can be equipped with cutting-edge pretraining
methods and used to establish large-scale and task-specific pseudo-labeled
molecular datasets, which reduces the predictive errors and shortens the
training process. Our work provides strong evidence that semi-supervised
learning can be a promising tool to overcome the data scarcity limitation and
advance molecular representation learning.
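The instructor/target interaction described above can be sketched as a confidence-weighted pseudo-label loss. The following is a minimal illustration, not the paper's exact formulation: the function names, the form of the weighting, and the toy inputs are all assumptions.

```python
import numpy as np

def cross_entropy(probs, label):
    # negative log-likelihood of the given class
    return -np.log(probs[label] + 1e-12)

def confidence_weighted_loss(labeled, unlabeled, confidences):
    """Semi-supervised loss where each pseudo-labeled example is
    down-weighted by the instructor's confidence in its pseudo-label."""
    sup = sum(cross_entropy(p, y) for p, y in labeled)
    pseudo = sum(c * cross_entropy(p, y)
                 for (p, y), c in zip(unlabeled, confidences))
    return (sup + pseudo) / (len(labeled) + len(unlabeled))

# toy example: one labeled molecule, two pseudo-labeled ones
labeled = [(np.array([0.8, 0.2]), 0)]
unlabeled = [(np.array([0.6, 0.4]), 0), (np.array([0.3, 0.7]), 1)]
confidences = [0.9, 0.2]  # the instructor trusts the first pseudo-label more
loss = confidence_weighted_loss(labeled, unlabeled, confidences)
```

The design point is that a pseudo-label with low instructor confidence contributes little to the loss, so an incorrect pseudo-annotation cannot dominate training.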
Large-scale semi-supervised learning with online spectral graph sparsification
We introduce Sparse-HFS, a scalable algorithm that computes solutions to semi-supervised learning (SSL) problems using only O(n polylog(n)) space and O(m polylog(n)) time.
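The space bound comes from never storing the full graph: edges stream in and only a small sparsifier is maintained. As a crude caricature of that idea, here is a streaming sampler that keeps each edge with a fixed probability and reweights the survivors; the real algorithm samples by effective resistance to preserve the Laplacian's spectral structure, which this sketch does not do.

```python
import random

def stream_sparsify(edge_stream, keep_prob, seed=0):
    """Toy streaming sparsifier: keep each edge with probability
    keep_prob and rescale its weight by 1/keep_prob so that expected
    edge weights are preserved. (Illustration only; Sparse-HFS
    maintains a *spectral* sparsifier via effective-resistance
    sampling.)"""
    rng = random.Random(seed)
    kept = []
    for (u, v, w) in edge_stream:  # edges arrive one at a time
        if rng.random() < keep_prob:
            kept.append((u, v, w / keep_prob))
    return kept

edges = [(i, i + 1, 1.0) for i in range(1000)]  # a weighted path graph
sparse = stream_sparsify(edges, keep_prob=0.1)
```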
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
We summarize the results of a host of efforts using giant automatic speech
recognition (ASR) models pre-trained using large, diverse unlabeled datasets
containing approximately a million hours of audio. We find that the combination
of pre-training, self-training and scaling up model size greatly increases data
efficiency, even for extremely large tasks with tens of thousands of hours of
labeled data. In particular, on an ASR task with 34k hours of labeled data, by
fine-tuning an 8 billion parameter pre-trained Conformer model we can match
state-of-the-art (SoTA) performance with only 3% of the training data and
significantly improve SoTA with the full training set. We also report on the
universal benefits gained from using big pre-trained and self-trained models
for a large set of downstream tasks that cover a wide range of speech domains
and span multiple orders of magnitude in dataset size, including obtaining
SoTA performance on many public benchmarks. In addition, we utilize the learned
representation of pre-trained networks to achieve SoTA results on non-ASR
tasks.
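The pre-training plus self-training recipe can be sketched as a teacher/student loop: train a teacher on labeled data, pseudo-label the unlabeled pool with it, and train a (typically larger) student on the union. The sketch below uses a trivial 1-D threshold classifier as a stand-in for an ASR model; every name and threshold in it is a placeholder, not BigSSL's actual setup.

```python
def train(data):
    """Fit a trivial 1-D threshold classifier: the boundary is the
    midpoint between the two class means (a stand-in for a real model)."""
    mean = lambda c: (sum(x for x, y in data if y == c)
                      / max(1, sum(1 for _, y in data if y == c)))
    return (mean(0) + mean(1)) / 2.0

def predict(threshold, x):
    # returns (label, confidence); confidence grows away from the boundary
    return int(x >= threshold), min(1.0, abs(x - threshold) / 5.0)

def self_training_round(labeled, unlabeled, min_conf=0.5):
    """Train a teacher on labeled data, pseudo-label the confident
    unlabeled points, then train a student on the combined set."""
    teacher = train(labeled)
    pseudo = []
    for x in unlabeled:
        y, conf = predict(teacher, x)
        if conf >= min_conf:  # discard low-confidence pseudo-labels
            pseudo.append((x, y))
    student = train(labeled + pseudo)
    return student, pseudo

labeled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
unlabeled = [-2.0, 0.5, 5.2, 12.0]
student, pseudo = self_training_round(labeled, unlabeled)
```

In a real system this loop is repeated, with the student becoming the next round's teacher.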
Distributed Low-rank Subspace Segmentation
Vision problems ranging from image clustering to motion segmentation to
semi-supervised learning can naturally be framed as subspace segmentation
problems, in which one aims to recover multiple low-dimensional subspaces from
noisy and corrupted input data. Low-Rank Representation (LRR), a convex
formulation of the subspace segmentation problem, is provably and empirically
accurate on small problems but does not scale to the massive sizes of modern
vision datasets. Moreover, past work aimed at scaling up low-rank matrix
factorization is not applicable to LRR given its non-decomposable constraints.
In this work, we propose a novel divide-and-conquer algorithm for large-scale
subspace segmentation that can cope with LRR's non-decomposable constraints and
maintains LRR's strong recovery guarantees. This has immediate implications for
the scalability of subspace segmentation, which we demonstrate on a benchmark
face recognition dataset and in simulations. We then introduce novel
applications of LRR-based subspace segmentation to large-scale semi-supervised
learning for multimedia event detection, concept detection, and image tagging.
In each case, we obtain state-of-the-art results and order-of-magnitude
speed-ups.
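The divide-and-conquer strategy can be caricatured as follows: split the columns of the data matrix into blocks, solve a cheap low-rank subproblem per block (independently, hence in parallel), and stitch the results back together. In the sketch below, a truncated SVD stands in for the actual LRR subproblem solver; that substitution is an assumption made purely to keep the example self-contained.

```python
import numpy as np

def low_rank_fit(block, rank):
    """Stand-in subproblem solver: truncated SVD of one column block.
    (The paper solves an LRR program here; SVD keeps the sketch simple.)"""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank]

def divide_and_conquer(X, rank, n_blocks):
    """Split the columns into blocks, solve each subproblem
    independently (in parallel, in a real system), and concatenate."""
    blocks = np.array_split(X, n_blocks, axis=1)
    return np.hstack([low_rank_fit(b, rank) for b in blocks])

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 200))  # exactly rank 5
X_hat = divide_and_conquer(X, rank=5, n_blocks=4)
```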
Lessons from Building Acoustic Models with a Million Hours of Speech
This is a report of the lessons we learned building acoustic models from one
million hours of unlabeled speech, while labeled speech was restricted to 7,000
hours. We employ student/teacher training on unlabeled data, helping scale out
target generation in comparison to confidence model based methods, which
require a decoder and a confidence model. To optimize storage and to
parallelize target generation, we store high valued logits from the teacher
model. Introducing the notion of scheduled learning, we interleave learning on
unlabeled and labeled data. To scale distributed training across a large number
of GPUs, we use BMUF with 64 GPUs, while performing sequence training only on
labeled data with gradient threshold compression SGD using 16 GPUs. Our
experiments show that extremely large amounts of data are indeed useful; with
little hyper-parameter tuning, we obtain relative WER improvements in the 10 to
20% range, with higher gains in noisier conditions.
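Storing only the teacher's high-valued logits, as described above, can be sketched as follows; the top-k cutoff, the cross-entropy renormalized over the kept classes, and all names here are illustrative assumptions, not the paper's exact recipe.

```python
import math

def top_k_logits(logits, k):
    """Store only the k highest-valued teacher logits as (index, value)
    pairs, cutting storage for large-scale target generation."""
    ranked = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)
    return dict(ranked[:k])

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, stored):
    """Cross-entropy of the student against the teacher's stored top-k
    distribution, renormalized over the kept classes only."""
    kept = sorted(stored)  # class indices the teacher kept
    t = softmax([stored[i] for i in kept])
    s = softmax([student_logits[i] for i in kept])
    return -sum(ti * math.log(si + 1e-12) for ti, si in zip(t, s))

teacher_logits = [5.0, 2.0, 0.1, -3.0, -8.0]
stored = top_k_logits(teacher_logits, k=3)  # only 3 of 5 values on disk
loss = distillation_loss([4.0, 1.5, 0.0, 0.0, 0.0], stored)
```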