Dropout Regularization in Hierarchical Mixture of Experts
Dropout is a very effective method for preventing overfitting and has become
the go-to regularizer for multi-layer neural networks in recent years.
Hierarchical mixture of experts is a hierarchically gated model that defines a
soft decision tree where leaves correspond to experts and decision nodes
correspond to gating models that softly choose between their children, and as
such, the model defines a soft hierarchical partitioning of the input space. In
this work, we propose a variant of dropout for hierarchical mixture of experts
that is faithful to the tree hierarchy defined by the model, as opposed to
having a flat, unitwise independent application of dropout as one has with
multi-layer perceptrons. We show that on synthetic regression data and on the
MNIST and CIFAR-10 datasets, our proposed dropout mechanism prevents
overfitting on trees with many levels, improving generalization and providing
smoother fits.
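A minimal sketch of the idea, assuming a binary soft tree of linear regression experts; the tree-structured scheme shown (occasionally dropping an entire subtree and routing all mass to its sibling) is one plausible reading of "faithful to the tree hierarchy", not the paper's exact mechanism:

```python
import torch
import torch.nn as nn

class SoftTreeNode(nn.Module):
    """One node of a hierarchical mixture of experts (binary soft tree)."""

    def __init__(self, in_dim, depth, drop_p=0.2):
        super().__init__()
        self.drop_p = drop_p
        if depth == 0:                    # leaf: a simple linear expert
            self.expert = nn.Linear(in_dim, 1)
            self.subtrees = None
        else:                             # internal node: gate + two subtrees
            self.gate = nn.Linear(in_dim, 1)
            self.subtrees = nn.ModuleList(
                SoftTreeNode(in_dim, depth - 1, drop_p) for _ in range(2)
            )

    def forward(self, x):
        if self.subtrees is None:
            return self.expert(x)
        left, right = self.subtrees
        if self.training and torch.rand(()) < self.drop_p:
            # Tree-structured dropout: drop one whole subtree and route all
            # mass to its sibling, respecting the hierarchy instead of
            # dropping units independently as in a flat MLP.
            return right(x) if torch.rand(()) < 0.5 else left(x)
        g = torch.sigmoid(self.gate(x))   # soft routing probability
        return g * left(x) + (1 - g) * right(x)
```

For example, SoftTreeNode(in_dim=16, depth=3)(torch.randn(32, 16)) yields one regression output per example from a three-level soft tree.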
Language Modeling with Sparse Product of Sememe Experts
Most language modeling methods rely on large-scale data to statistically
learn the sequential patterns of words. In this paper, we argue that words are
atomic language units but not necessarily atomic semantic units. Inspired by
HowNet, we use sememes, the minimum semantic units in human languages, to
represent the implicit semantics behind words for language modeling, named
Sememe-Driven Language Model (SDLM). More specifically, to predict the next
word, SDLM first estimates the sememe distribution given the textual context.
Afterward, it regards each sememe as a distinct semantic expert, and these
experts jointly identify the most probable senses and the corresponding word.
In this way, SDLM enables language models to work beyond word-level
manipulation to fine-grained sememe-level semantics and offers us more powerful
tools to fine-tune language models and improve the interpretability as well as
the robustness of language models. Experiments on language modeling and the
downstream application of headline generation demonstrate the significant
effect of SDLM. Source code and data used in the experiments can be accessed at
https://github.com/thunlp/SDLM-pytorch.
Comment: EMNLP 2018. The first three authors contributed equally.
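A minimal sketch of the two-stage prediction, with the sememe inventory, expert parameterization, and combination rule simplified relative to the paper's HowNet-based design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SememeDrivenOutput(nn.Module):
    """Predict a sememe distribution from the context, then let each sememe
    act as an expert over the vocabulary (combination rule simplified)."""

    def __init__(self, hidden_dim, n_sememes, vocab_size):
        super().__init__()
        self.sememe_head = nn.Linear(hidden_dim, n_sememes)
        # One expert per sememe: a vector of word logits. The full SDLM ties
        # experts to word senses through a sememe-word lexicon (HowNet).
        self.expert_logits = nn.Parameter(0.01 * torch.randn(n_sememes, vocab_size))

    def forward(self, context):                                   # (B, H)
        sememe_prob = F.softmax(self.sememe_head(context), -1)    # (B, S)
        # Product of experts in log space = probability-weighted sum of logits.
        word_logits = sememe_prob @ self.expert_logits            # (B, V)
        return F.log_softmax(word_logits, -1)
```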
Improved Language Modeling by Decoding the Past
Highly regularized LSTMs achieve impressive results on several benchmark
datasets in language modeling. We propose a new regularization method based on
decoding the last token in the context using the predicted distribution of the
next token. This biases the model towards retaining more contextual
information, in turn improving its ability to predict the next token. With
negligible overhead in the number of parameters and training time, our Past
Decode Regularization (PDR) method achieves a word level perplexity of 55.6 on
the Penn Treebank and 63.5 on the WikiText-2 datasets using a single softmax.
We also show gains by using PDR in combination with a mixture-of-softmaxes,
achieving a word level perplexity of 53.8 and 60.5 on these datasets. In
addition, our method achieves 1.169 bits-per-character on the Penn Treebank
Character dataset for character level language modeling. These results
constitute a new state-of-the-art in their respective settings.
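A minimal sketch of the regularizer under one simplified reading: the predicted next-token distribution is summarized as an expected embedding, which is then asked to decode the last context token. Function and variable names are illustrative, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def past_decode_loss(next_logits, embed_weight, prev_tokens):
    """Auxiliary PDR-style loss: the predicted next-token distribution should
    still carry enough context to decode the previous token.

    next_logits: (B, V) logits for the next token
    embed_weight: (V, D) output embedding matrix
    prev_tokens: (B,) ids of the last token in each context
    """
    p = F.softmax(next_logits, dim=-1)         # predicted distribution
    summary = p @ embed_weight                 # expected embedding, (B, D)
    past_logits = summary @ embed_weight.t()   # score each word as the past token
    return F.cross_entropy(past_logits, prev_tokens)

# Total objective: loss = lm_loss + pdr_weight * past_decode_loss(...)
```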
YouTube-8M Video Understanding Challenge Approach and Applications
This paper introduces the YouTube-8M Video Understanding Challenge hosted as
a Kaggle competition and also describes my approach to experimenting with
various models. For each of my experiments, I provide the score result as well
as possible improvements to be made. Towards the end of the paper, I discuss
the various ensemble learning techniques that I applied on the dataset which
significantly boosted my overall competition score. Finally, I discuss the
exciting future of video understanding research and the many applications
that such research could significantly improve.
Comment: YouTube-8M Workshop submission, 8 pages
Video Representation Learning and Latent Concept Mining for Large-scale Multi-label Video Classification
We report on CMU Informedia Lab's system used in Google's YouTube 8 Million
Video Understanding Challenge. In this multi-label video classification task,
our pipeline achieved 84.675% and 84.662% GAP on our evaluation split and the
official test set, respectively. We attribute the good performance to three
components: 1) refined video representation learning with residual links and
hypercolumns; 2) latent concept mining, which captures interactions among
concepts; and 3) learning with temporal segments and a weighted multi-model
ensemble. We conduct
experiments to validate and analyze the contribution of our models. We also
share some unsuccessful trials leveraging conventional approaches such as
recurrent neural networks for video representation learning for this
large-scale video dataset. All the codes to reproduce our results are publicly
available at https://github.com/Martini09/informedia-yt8m-release.
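A minimal sketch of what a latent-concept refinement stage could look like, assuming a low-rank bottleneck over initial label scores; the dimensions and residual form are illustrative, not the team's exact design:

```python
import torch
import torch.nn as nn

class LatentConceptLayer(nn.Module):
    """Refine per-video label scores through a low-rank latent bottleneck so
    that co-occurring concepts can reinforce or suppress each other."""

    def __init__(self, n_labels, n_latent=256):
        super().__init__()
        self.down = nn.Linear(n_labels, n_latent)  # mine latent concepts
        self.up = nn.Linear(n_latent, n_labels)    # project back to labels

    def forward(self, scores):                     # scores: (B, n_labels)
        latent = torch.relu(self.down(scores))
        return scores + self.up(latent)            # residual refinement
```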
An Analysis of Neural Language Modeling at Multiple Scales
Many of the leading approaches in language modeling introduce novel, complex
and specialized architectures. We take existing state-of-the-art word level
language models based on LSTMs and QRNNs and extend them to both larger
vocabularies as well as character-level granularity. When properly tuned, LSTMs
and QRNNs achieve state-of-the-art results on character-level (Penn Treebank,
enwik8) and word-level (WikiText-103) datasets, respectively. Results are
obtained in only 12 hours (WikiText-103) to 2 days (enwik8) using a single
modern GPU.
Large-Scale YouTube-8M Video Understanding with Deep Neural Networks
The video classification problem has been studied for many years. The success
of Convolutional Neural Networks (CNNs) in image recognition tasks gives a
powerful incentive for researchers to create more advanced video
classification approaches. Since video has temporal content, Long Short-Term
Memory (LSTM) networks are a handy tool for modeling long-term temporal cues.
Both approaches need a large dataset of input data. In this paper, three
models are provided to address video classification using the recently
announced YouTube-8M large-scale dataset. The first model is based on a frame
pooling approach. The two other models are based on LSTM networks. A Mixture
of Experts intermediate layer is used in the third model, allowing model
capacity to be increased without dramatically increasing computation. A set of
experiments on handling imbalanced training data has also been conducted.
Comment: 6 pages, 5 figures, 3 tables
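A minimal sketch of such a Mixture-of-Experts output layer, assuming per-label sigmoid experts blended by a learned softmax gate; the expert count and gating form are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEClassifier(nn.Module):
    """Mixture-of-Experts output layer: several expert classifiers whose
    predictions are blended by a learned gate, raising capacity without a
    proportional increase in per-example computation."""

    def __init__(self, in_dim, n_classes, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(in_dim, n_classes) for _ in range(n_experts)
        )
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, x):                                       # x: (B, D)
        weights = F.softmax(self.gate(x), dim=-1)               # (B, E)
        # Per-expert multi-label probabilities, stacked to (B, C, E).
        preds = torch.stack([torch.sigmoid(e(x)) for e in self.experts], -1)
        return (preds * weights.unsqueeze(1)).sum(-1)           # (B, C)
```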
Adaptive Input Representations for Neural Language Modeling
We introduce adaptive input representations for neural language modeling
which extend the adaptive softmax of Grave et al. (2017) to input
representations of variable capacity. There are several choices on how to
factorize the input and output layers, and whether to model words, characters
or sub-word units. We perform a systematic comparison of popular choices for a
self-attentional architecture. Our experiments show that models equipped with
adaptive embeddings are more than twice as fast to train as the popular
character-input CNN while having fewer parameters. On the
WikiText-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5
perplexity compared to the previously best published result and on the Billion
Word benchmark, we achieve 23.02 perplexity.
Comment: 12 pages
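A minimal sketch of variable-capacity input embeddings, assuming a frequency-sorted vocabulary split into bands whose embedding sizes shrink by a fixed factor; the cutoffs and factor are illustrative:

```python
import torch
import torch.nn as nn

class AdaptiveInput(nn.Module):
    """Frequent words get full-size embeddings; rarer bands get smaller
    embeddings projected up to the model dimension."""

    def __init__(self, vocab_size, d_model, cutoffs=(20000, 60000), factor=4):
        super().__init__()
        self.edges = [0, *cutoffs, vocab_size]
        self.embeds, self.projs = nn.ModuleList(), nn.ModuleList()
        for i in range(len(self.edges) - 1):
            d = d_model // (factor ** i)  # shrink capacity per band
            self.embeds.append(nn.Embedding(self.edges[i + 1] - self.edges[i], d))
            self.projs.append(nn.Linear(d, d_model, bias=False))

    def forward(self, tokens):            # tokens: (B, T), frequency-sorted ids
        out = tokens.new_zeros(*tokens.shape, self.projs[0].out_features,
                               dtype=torch.float)
        for i in range(len(self.embeds)):
            lo, hi = self.edges[i], self.edges[i + 1]
            mask = (tokens >= lo) & (tokens < hi)
            if mask.any():
                out[mask] = self.projs[i](self.embeds[i](tokens[mask] - lo))
        return out
```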
Contextual Explanation Networks
Modern learning algorithms excel at producing accurate but complex models of
the data. However, deploying such models in the real-world requires extra care:
we must ensure their reliability, robustness, and absence of undesired biases.
This motivates the development of models that are equally accurate but can be
also easily inspected and assessed beyond their predictive performance. To this
end, we introduce contextual explanation networks (CENs), a class of
architectures that learn to predict by generating and utilizing intermediate,
simplified probabilistic models. Specifically, CENs generate parameters for
intermediate graphical models which are further used for prediction and play
the role of explanations. In contrast to existing post-hoc model-explanation
tools, CENs learn to predict and to explain simultaneously. Our approach offers
two major advantages: (i) for each prediction, a valid, instance-specific
explanation is generated with no computational overhead, and (ii) prediction via
explanation acts as a regularizer and boosts performance in data-scarce
settings. We analyze the proposed framework theoretically and experimentally.
Our results on image and text classification and survival analysis tasks
demonstrate that CENs are not only competitive with the state-of-the-art
methods but also offer additional insights behind each prediction that can be
valuable for decision support. We also show that while post-hoc methods may
produce misleading explanations in certain cases, CENs are consistent and allow
such cases to be detected systematically.
Comment: 48 pages, 18 figures, to appear in JMLR
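A minimal sketch of the CEN idea, assuming the intermediate models are per-example linear classifiers over interpretable attributes; the paper's probabilistic graphical-model machinery is simplified away:

```python
import torch
import torch.nn as nn

class ContextualExplanationNet(nn.Module):
    """A deep encoder reads the raw context and emits the weights of a simple
    linear model over interpretable attributes; that linear model both makes
    the prediction and serves as the explanation."""

    def __init__(self, context_dim, n_attributes, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(context_dim, 128), nn.ReLU(),
            nn.Linear(128, n_attributes * n_classes),
        )
        self.n_attributes, self.n_classes = n_attributes, n_classes

    def forward(self, context, attributes):
        # W: per-example linear weights over interpretable attributes.
        W = self.encoder(context).view(-1, self.n_classes, self.n_attributes)
        logits = torch.bmm(W, attributes.unsqueeze(-1)).squeeze(-1)  # (B, C)
        return logits, W   # W is the instance-specific explanation
```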
Mixture Models for Diverse Machine Translation: Tricks of the Trade
Mixture models trained via EM are among the simplest, most widely used and
well understood latent variable models in the machine learning literature.
Surprisingly, these models have been hardly explored in text generation
applications such as machine translation. In principle, they provide a latent
variable to control generation and produce a diverse set of hypotheses. In
practice, however, mixture models are prone to degeneracies: often only one
component gets trained or the latent variable is simply ignored. We find that
disabling dropout noise in responsibility computation is critical to successful
training. In addition, the design choices of parameterization, prior
distribution, hard versus soft EM and online versus offline assignment can
dramatically affect model performance. We develop an evaluation protocol to
assess both quality and diversity of generations against multiple references,
and provide an extensive empirical study of several mixture model variants. Our
analysis shows that certain types of mixture models are more robust and offer
the best trade-off between translation quality and diversity compared to
variational models and diverse decoding approaches. Code to reproduce the
results in this paper is available at https://github.com/pytorch/fairseq.
Comment: ICML 2019 camera-ready
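A minimal sketch of the soft-EM E-step with dropout disabled, assuming a model(src, tgt, z) interface that returns per-token log-probabilities for mixture component z; that interface and the uniform prior are assumptions, not the paper's API:

```python
import torch
import torch.nn.functional as F

def responsibilities(model, src, tgt, n_components):
    """E-step: component posteriors ("responsibilities") computed with
    dropout disabled, which the paper finds critical to avoid degenerate
    mixtures where one component dominates."""
    model.eval()                          # dropout OFF for responsibilities
    with torch.no_grad():
        logp = torch.stack(
            [model(src, tgt, z).sum(-1) for z in range(n_components)], dim=-1
        )                                 # (B, K) sequence log-likelihoods
    model.train()                         # dropout back ON for the M-step
    return F.softmax(logp, dim=-1)        # p(z | x, y) under a uniform prior

# M-step: loss = -(r * component_logp).sum(), with r detached from the graph.
```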