Mixed Robust/Average Submodular Partitioning: Fast Algorithms, Guarantees, and Applications to Parallel Machine Learning and Multi-Label Image Segmentation
We study two mixed robust/average-case submodular partitioning problems that
we collectively call Submodular Partitioning. These problems generalize both
purely robust instances of the problem (namely, max-min submodular fair
allocation (SFA) and min-max submodular load balancing (SLB)) and also
generalize average-case instances (that is, the submodular welfare problem (SWP)
and submodular multiway partition (SMP)). While the robust versions have been
studied in the theory community, existing work has focused on tight
approximation guarantees, and the resultant algorithms are not, in general,
scalable to very large real-world applications. This is in contrast to the
average case, where most of the algorithms are scalable. In the present paper,
we bridge this gap, by proposing several new algorithms (including those based
on greedy, majorization-minimization, minorization-maximization, and relaxation
algorithms) that not only scale to large sizes but that also achieve
theoretical approximation guarantees close to the state-of-the-art, and in some
cases achieve new tight bounds. We also provide new scalable algorithms that
apply to additive combinations of the robust and average-case extreme
objectives. We show that these problems have many applications in machine
learning (ML). This includes: 1) data partitioning and load balancing for
distributed machine learning algorithms on parallel machines; 2) data clustering; and 3)
multi-label image segmentation with (only) Boolean submodular functions via
pixel partitioning. We empirically demonstrate the efficacy of our algorithms
on real-world problems involving data partitioning for distributed optimization
of standard machine learning objectives (including both convex and deep neural
network objectives), and also on purely unsupervised (i.e., no supervised or
semi-supervised learning, and no interactive segmentation) image segmentation.
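The greedy flavor of these partitioning algorithms can be illustrated with a toy sketch, assuming a hypothetical coverage-style monotone submodular objective (the data and objective below are illustrative stand-ins, not the paper's benchmarks or exact algorithms):

```python
def coverage(block, features):
    """Monotone submodular objective: number of distinct features covered."""
    covered = set()
    for v in block:
        covered |= features[v]
    return len(covered)

def greedy_fair_partition(items, features, m):
    """Greedy heuristic for max-min fair allocation: give each item to the
    block that is currently worst off, breaking ties by marginal gain."""
    blocks = [[] for _ in range(m)]
    for v in items:
        def key(b):
            val = coverage(blocks[b], features)
            gain = coverage(blocks[b] + [v], features) - val
            return (val, -gain)          # lowest value first, largest gain next
        j = min(range(m), key=key)
        blocks[j].append(v)
    return blocks

items = [0, 1, 2, 3]
features = {0: {"a", "b"}, 1: {"b"}, 2: {"c"}, 3: {"a", "c"}}
blocks = greedy_fair_partition(items, features, 2)
```

This is only the robust (max-min) direction; the paper's mixed objectives additionally weight in an average-case term across blocks.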
Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision
State-of-the-art supervised computer vision techniques are, in general, data
hungry. Their data curation poses the challenges of expensive human labeling,
inadequate computing resources, and long experiment turnaround times. Training-data
subset selection and active learning
techniques have been proposed as possible solutions to these challenges. A
special class of subset selection functions naturally models notions of
diversity, coverage, and representation, and can be used to eliminate redundancy,
thus lending itself well to training-data subset selection. These functions can also
help improve the efficiency of active learning in further reducing human
labeling efforts by selecting a subset of the examples obtained using the
conventional uncertainty sampling based techniques. In this work, we
empirically demonstrate the effectiveness of two diversity models, namely the
Facility-Location and Dispersion models for training-data subset selection and
reducing labeling effort. We demonstrate this across the board for a variety of
computer vision tasks including Gender Recognition, Face Recognition, Scene
Recognition, Object Detection and Object Recognition. Our results show that
diversity-based subset selection done in the right way can increase
accuracy by up to 5-10% over existing baselines, particularly in settings in
which less training data is available. This allows the training of complex
machine learning models like Convolutional Neural Networks with much less
training data and labeling costs while incurring minimal performance loss.
Comment: Accepted to WACV 2019. arXiv admin note: substantial text overlap with arXiv:1805.1119
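The Facility-Location model mentioned above can be sketched as a standard greedy maximization over a similarity matrix; the similarities below are synthetic stand-ins for real image features:

```python
import numpy as np

def facility_location_greedy(S, k):
    """Greedy maximization of the facility-location function
    f(A) = sum_i max_{j in A} S[i, j], a classic representation/diversity
    objective (monotone submodular, so greedy carries the usual
    (1 - 1/e) approximation guarantee)."""
    n = S.shape[0]
    selected, best = [], np.zeros(n)
    for _ in range(k):
        # marginal gain of candidate j: total improvement in the best
        # similarity each point i would see if j were added
        gains = np.maximum(S, best[:, None]).sum(axis=0) - best.sum()
        gains[selected] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, S[:, j])
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
S = np.exp(-D2)                       # RBF similarities in [0, 1]
subset = facility_location_greedy(S, 5)
```

In practice the similarity matrix would come from deep features of the training images rather than random vectors.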
Diversity in Machine Learning
Machine learning methods have achieved strong performance and have been widely
applied in various real-world applications, as they can learn models adaptively
and fit the special requirements of different tasks. Generally, a
good machine learning system is composed of plentiful training data, a good
model training process, and an accurate inference. Many factors can affect the
performance of the machine learning process, among which diversity is an
important one. Diversity helps each stage contribute to a good overall system:
diversity of the training data ensures that the data provide more
discriminative information for the model; diversity of the learned model
(in the parameters of a single model, or among different base models) lets
each parameter or model capture unique or complementary information; and
diversity in inference provides multiple choices, each corresponding to a
specific plausible locally optimal result. Even though diversity plays an
important role in the machine learning process, there has been no systematic
analysis of diversification in machine learning systems. In this paper, we
systematically summarize methods for data diversification, model
diversification, and inference diversification in the machine learning
process. In
addition, we survey typical applications in which diversity techniques have
improved machine learning performance, including remote sensing imaging tasks,
machine translation, camera relocalization, image segmentation, object
detection, topic modeling, and others. Finally, we discuss some challenges of
diversity techniques in machine learning and point out directions for future work.
Comment: Accepted by IEEE Access
Targeted Subset Selection for Limited-data ASR Accent Adaptation
We study the task of adapting an existing ASR model to a non-native accent
while being constrained by a transcription budget on the duration of utterances
selected from a large unlabeled corpus. We propose a subset selection approach
using the recently proposed submodular mutual information functions, in which
we identify a diverse set of utterances that match the target accent. This is
specified through a few target utterances and achieved by modelling the
relationship between the target and the selected subsets using these functions.
The model adapts to the accent through fine-tuning with utterances selected and
transcribed from the unlabeled corpus. We also use an accent classifier to
learn accent-aware feature representations. Our method is also able to exploit
samples from other accents to perform out-of-domain selections for low-resource
accents that are not available in these corpora. We show that the targeted
subset selection approach improves significantly upon random sampling, by
around 5% to 10% (absolute) in most cases, and is around 10x more
label-efficient. We also compare with an oracle method where we specifically
pick from the target accent; our method is comparable to the oracle in its
selections and WER performance.
Comment: Under review (INTERSPEECH 2022)
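One of the simplest submodular mutual information instantiations, the graph-cut form, reduces to ranking pool items by their total similarity to the target set. A toy sketch with made-up embeddings (random stand-ins, not the paper's ASR features or its exact SMI functions):

```python
import numpy as np

def gcmi_select(U, Q, k):
    """Graph-cut style targeted selection: score each unlabeled embedding by
    its total cosine similarity to the target (query) set, take the top k."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    scores = (U @ Q.T).sum(axis=1)     # summed similarity to all targets
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(1)
target_like = rng.normal(loc=5.0, size=(10, 4))   # same "accent" as targets
other = rng.normal(loc=-5.0, size=(10, 4))
pool = np.vstack([target_like, other])
targets = rng.normal(loc=5.0, size=(3, 4))
picked = gcmi_select(pool, targets, 5)
```

Richer SMI functions (e.g., facility-location based) additionally reward diversity among the selected items, which the purely modular scoring above does not.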
Extending Contrastive Learning to Unsupervised Coreset Selection
Self-supervised contrastive learning offers a means of learning informative
features from a pool of unlabeled data. In this paper, we delve into another
useful approach: providing a way of selecting a coreset that is entirely
unlabeled. In this regard, contrastive learning, one of a large number of
self-supervised methods, was recently proposed and has consistently delivered
the highest performance. This prompted us to choose two leading methods for
contrastive learning: the simple framework for contrastive learning of visual
representations (SimCLR) and the momentum contrastive (MoCo) learning
framework. We calculated the cosine similarities for each example of an epoch
for the entire duration of the contrastive learning process and subsequently
accumulated the cosine-similarity values to obtain the coreset score. Our
assumption was that a sample with low similarity would likely behave as a
coreset. Compared with existing coreset selection methods with labels, our
approach reduced the cost associated with human annotation. The unsupervised
method implemented in this study for coreset selection obtained improved
results over a randomly chosen subset, and were comparable to existing
supervised coreset selection on various classification datasets (e.g., CIFAR,
SVHN, and QMNIST).Comment: 11page
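The accumulated cosine-similarity scoring can be sketched as follows; the per-epoch features here are hand-made stand-ins, not actual SimCLR/MoCo outputs:

```python
import numpy as np

def coreset_scores(feature_history):
    """feature_history: list of (n, d) per-epoch feature arrays. Accumulate
    each example's mean cosine similarity to the others over epochs; a low
    accumulated score marks a likely coreset member."""
    n = feature_history[0].shape[0]
    scores = np.zeros(n)
    for F in feature_history:
        F = F / np.linalg.norm(F, axis=1, keepdims=True)
        sims = F @ F.T
        scores += (sims.sum(axis=1) - 1.0) / (n - 1)  # drop self-similarity
    return scores

def select_coreset(feature_history, k):
    """Keep the k examples with the lowest accumulated similarity."""
    return np.argsort(coreset_scores(feature_history))[:k]

# four near-duplicate examples plus one outlier (index 4)
F = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.1, 0.0],
              [1.0, 0.0, 0.1],
              [1.0, 0.1, 0.1],
              [0.0, 0.0, 1.0]])
chosen = select_coreset([F, F], 1)
```

The outlier's low similarity to the rest of the pool gives it the lowest accumulated score, so it is the first coreset pick.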
A General Framework for Edited Video and Raw Video Summarization
In this paper, we build a general summarization framework for both edited-video
and raw-video summarization. Overall, our work is threefold: 1) Four models are
designed to capture the properties of video
summaries, i.e., containing important people and objects (importance),
representative of the video content (representativeness), no similar key-shots
(diversity) and smoothness of the storyline (storyness). Specifically, these
models are applicable to both edited videos and raw videos. 2) A comprehensive
score function is built with the weighted combination of the aforementioned
four models. Note that the weights of the four models in the score function,
denoted as property-weight, are learned in a supervised manner. Besides, the
property-weights are learned for edited videos and raw videos, respectively. 3)
The training set is constructed with both edited videos and raw videos in order
to make up for the lack of training data. In particular, each training video is
equipped with a pair of mixing coefficients that reduce the structural noise
introduced into the training set by the rough mixture. We test our framework on three
datasets, including edited videos, short raw videos and long raw videos.
Experimental results have verified the effectiveness of the proposed framework.
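The comprehensive score of point 2) is a learned weighted combination of the four property scores; a minimal sketch, where the property values and weights are hypothetical placeholders rather than learned quantities:

```python
# Minimal sketch of the weighted score function from point 2); the
# property scores and weights below are hypothetical placeholders.
def summary_score(weights, props):
    """Weighted combination of importance, representativeness, diversity,
    and storyness scores for a candidate summary."""
    return sum(weights[name] * props[name] for name in weights)

weights = {"importance": 0.4, "representativeness": 0.3,
           "diversity": 0.2, "storyness": 0.1}
props = {"importance": 0.9, "representativeness": 0.8,
         "diversity": 0.5, "storyness": 0.7}
score = summary_score(weights, props)
```

In the paper the property-weights are learned in a supervised manner, separately for edited and raw videos.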
Diversity-aware Multi-Video Summarization
Most video summarization approaches have focused on extracting a summary from
a single video; we propose an unsupervised framework for summarizing a
collection of videos. We observe that each video in the collection may contain
some information that other videos do not have, and thus exploring the
underlying complementarity could be beneficial in creating a diverse
informative summary. We develop a novel diversity-aware sparse optimization
method for multi-video summarization by exploring the complementarity within
the videos. Our approach extracts a multi-video summary which is both
interesting and representative in describing the whole video collection. To
efficiently solve our optimization problem, we develop an alternating
minimization algorithm that minimizes the overall objective function with
respect to one video at a time while fixing the other videos. Moreover, we
introduce a new benchmark dataset, Tour20, that contains 140 videos with
multiple human-created summaries, which were acquired in a controlled
experiment. Finally, by extensive experiments on the new Tour20 dataset and
several other multi-view datasets, we show that the proposed approach clearly
outperforms the state-of-the-art methods on both problems: topic-oriented
video summarization and multi-view video summarization in a camera network.
Comment: IEEE Trans. on Image Processing, 2017 (In Press)
Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning
Large-scale datasets are the cornerstone of representation learning. Existing
self-supervised approaches extract learning signals by making certain
assumptions about the data, e.g., spatio-temporal continuity and multimodal
correspondence. However, finding large amounts of data that satisfy such
assumptions is not straightforward, and this forces the community to rely on
datasets collected through laborious annotation and/or manual filtering
processes. In this paper, we propose a subset optimization approach for
automatic dataset curation. Focusing on audio-visual representation learning,
we find a subset that provides the maximum mutual information between audio and
visual channels in videos. We show that self-supervised models trained on our
data, despite being automatically constructed, achieve competitive downstream
performances compared to existing datasets that require annotation and/or
manual filtering. The most significant benefit of our approach is scalability.
We release a dataset of 100M videos with high audio-visual correspondence.
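A crude stand-in for the audio-visual correspondence criterion is to score each video by the agreement between its audio and visual embeddings and keep the top-scoring subset; the embeddings below are random placeholders, and per-video cosine agreement is only a rough proxy for the paper's mutual-information objective:

```python
import numpy as np

def curate(audio_emb, visual_emb, k):
    """Keep the k videos whose audio and visual embeddings agree most
    (cosine similarity between the two modalities of the same video)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    scores = (a * v).sum(axis=1)       # per-video cross-modal agreement
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(2)
audio = rng.normal(size=(20, 8))
visual = audio.copy()
visual[10:] = rng.normal(size=(10, 8))   # second half: modalities unrelated
kept = curate(audio, visual, 5)
```

The appeal of such a criterion is that it needs no labels at all, which is what makes the curation scalable.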
Submodularity-Inspired Data Selection for Goal-Oriented Chatbot Training Based on Sentence Embeddings
Spoken language understanding (SLU) systems, such as goal-oriented chatbots
or personal assistants, rely on an initial natural language understanding (NLU)
module to determine the intent and to extract the relevant information from the
user queries they take as input. SLU systems usually help users to solve
problems in relatively narrow domains and require a large amount of in-domain
training data. This leads to significant data availability issues that inhibit
the development of successful systems. To alleviate this problem, we propose a
technique of data selection in the low-data regime that enables us to train
with fewer labeled sentences, and thus with smaller labelling costs.
We propose a submodularity-inspired data ranking function, the ratio-penalty
marginal gain, for selecting data points to label based only on the information
extracted from the textual embedding space. We show that the distances in the
embedding space are a viable source of information that can be used for data
selection. Our method outperforms two known active learning techniques and
enables cost-efficient training of the NLU unit. Moreover, our proposed
selection technique does not need the model to be retrained in between the
selection steps, making it time-efficient as well.
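The ranking idea can be sketched as a greedy utility-over-penalty score in the embedding space; the utility and penalty below are illustrative stand-ins, not the paper's exact ratio-penalty marginal gain:

```python
import numpy as np

def rank_by_ratio_penalty(E, k):
    """Greedy ranking of embedding rows: each step picks the point whose
    utility, divided by a redundancy penalty against already-selected
    points, is largest. Utility and penalty here are hypothetical."""
    n = E.shape[0]
    utility = np.linalg.norm(E - E.mean(axis=0), axis=1) + 1e-9
    selected = []
    while len(selected) < k:
        best_j, best_score = -1, -np.inf
        for j in range(n):
            if j in selected:
                continue
            if selected:
                nearest = min(np.linalg.norm(E[j] - E[s]) for s in selected)
                penalty = 1.0 + 1.0 / (nearest + 1e-9)  # punish redundancy
            else:
                penalty = 1.0
            score = utility[j] / penalty
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

rng = np.random.default_rng(1)
E = rng.normal(size=(40, 16))          # stand-in sentence embeddings
order = rank_by_ratio_penalty(E, 10)
```

Because the scores depend only on pairwise distances in the embedding space, the ranking never requires retraining the model between selection steps, matching the efficiency argument above.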
Improved Noisy Student Training for Automatic Speech Recognition
Recently, a semi-supervised learning method known as "noisy student training"
has been shown to improve image classification performance of deep networks
significantly. Noisy student training is an iterative self-training method that
leverages augmentation to improve network performance. In this work, we adapt
and improve noisy student training for automatic speech recognition, employing
(adaptive) SpecAugment as the augmentation method. We find effective methods to
filter, balance and augment the data generated in between self-training
iterations. By doing so, we are able to obtain word error rates (WERs) of
4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h
subset of LibriSpeech as the supervised set and the rest (860h) as the
unlabeled set. Furthermore, we are able to achieve WERs of 1.7%/3.4% on the
clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight
as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the
previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h
(4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
Comment: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added
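The filter-and-retrain loop at the heart of self-training can be sketched with a toy nearest-centroid model standing in for the student network; the confidence measure is a crude proxy, and there is no SpecAugment-style noise or balancing here:

```python
import numpy as np

def fit_centroids(X, y):
    """Toy "model": one centroid per class."""
    return np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])

def predict(centroids, X):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    conf = 1.0 / (1e-9 + d.min(axis=1))   # crude confidence proxy
    return d.argmin(axis=1), conf

def self_train(X_lab, y_lab, X_unlab, rounds=3, keep=0.5):
    """Iteratively pseudo-label the unlabeled pool, keep only the most
    confident fraction, and retrain on labeled + filtered pseudo-labels."""
    X, y = X_lab, y_lab
    for _ in range(rounds):
        centroids = fit_centroids(X, y)
        pseudo, conf = predict(centroids, X_unlab)
        keep_idx = np.argsort(-conf)[: int(keep * len(X_unlab))]
        X = np.vstack([X_lab, X_unlab[keep_idx]])
        y = np.concatenate([y_lab, pseudo[keep_idx]])
    return fit_centroids(X, y)

rng = np.random.default_rng(0)
a = rng.normal(loc=3.0, size=(50, 2))
b = rng.normal(loc=-3.0, size=(50, 2))
X_lab = np.vstack([a[:2], b[:2]])       # tiny labeled "100h" stand-in
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.vstack([a[2:], b[2:]])     # large unlabeled "860h" stand-in
final_centroids = self_train(X_lab, y_lab, X_unlab)
```

The confidence filtering step is where the paper's filter/balance/augment machinery would plug in between self-training iterations.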