Language Modeling at Scale
We show how Zipf's Law can be used to scale up language modeling (LM) to take
advantage of more training data and more GPUs. LM plays a key role in many
important natural language applications such as speech recognition and machine
translation. Scaling up LM is important since it is widely accepted by the
community that there is no data like more data. Eventually, we would like to
train on terabytes (TBs) of text (trillions of words). Modern training methods
are far from this goal, because of various bottlenecks, especially memory
(within GPUs) and communication (across GPUs). This paper shows how Zipf's Law
can address these bottlenecks by grouping parameters for common words and
character sequences, because $U \ll N$, where $U$ is the number of unique words
(types) and $N$ is the size of the training set (tokens). For a local batch
size $K$ with $G$ GPUs and a $D$-dimension embedding matrix, we reduce the
original per-GPU memory and communication asymptotic complexity from
$\Theta(UD)$ to $\Theta((GK)^{0.74}D)$. Empirically, we find $U \propto N^{0.74}$
on four publicly available large datasets. When we scale up the
number of GPUs to 64, a factor of 8, training time speeds up by factors up to
6.7 (for character LMs) and 6.3 (for word LMs) with negligible
loss of accuracy. Our weak scaling on 192 GPUs on the Tieba dataset shows a
35% improvement in LM prediction accuracy by training on 93 GB of data
(2.5× larger than the publicly available SOTA dataset), but taking only a
1.25× increase in training time, compared to 3 GB of the same dataset
running on 6 GPUs.
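As a back-of-the-envelope illustration of the type-token power law this
abstract leans on, the exponent can be estimated by regressing log U against
log N on any corpus. The sketch below is ours, not the paper's code; only the
~0.74 exponent is the paper's empirical finding.

```python
# Illustrative sketch: estimate alpha in U ~ N^alpha from a token stream
# by least-squares regression in log-log space.
import math

def type_token_curve(tokens, step=1000):
    """Sample (N, U): running token count vs. running unique-type count."""
    seen, curve = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n % step == 0:
            curve.append((n, len(seen)))
    return curve

def fit_exponent(curve):
    """Slope of log U vs. log N, i.e. the estimated power-law exponent."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(u) for _, u in curve]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# tokens = open("corpus.txt").read().split()     # hypothetical corpus file
# print(fit_exponent(type_token_curve(tokens)))  # the paper reports ~0.74
```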
Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training
We focus on the problem of language modeling for code-switched language, in
the context of automatic speech recognition (ASR). Language modeling for
code-switched language is challenging for (at least) three reasons: (1) lack of
available large-scale code-switched data for training; (2) lack of a replicable
evaluation setup that is ASR directed yet isolates language modeling
performance from the other intricacies of the ASR system; and (3) the reliance
on generative modeling. We tackle these three issues: we propose an
ASR-motivated evaluation setup which is decoupled from an ASR system and the
choice of vocabulary, and provide an evaluation dataset for English-Spanish
code-switching. This setup lends itself to a discriminative training approach,
which we demonstrate to work better than generative language modeling. Finally,
we explore a variety of training protocols and verify the effectiveness of
training with large amounts of monolingual data followed by fine-tuning with
small amounts of code-switched data, for both the generative and discriminative
cases.
Comment: EMNLP 2019
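The monolingual-pretraining-plus-code-switched-fine-tuning protocol the
abstract verifies can be sketched as two passes of ordinary LM training. The
LSTM model, batch format, and hyperparameters below are placeholder
assumptions of ours, not the authors' setup.

```python
# Hedged sketch of the pretrain-then-fine-tune protocol with a toy word-level LM.
import torch
import torch.nn as nn

class WordLM(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):                      # x: (batch, seq_len) token ids
        h, _ = self.rnn(self.emb(x))
        return self.out(h)                     # (batch, seq_len, vocab)

def train(model, batches, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in batches:                   # y is x shifted one token left
            loss = loss_fn(model(x).transpose(1, 2), y)
            opt.zero_grad(); loss.backward(); opt.step()

# model = WordLM(vocab_size=50_000)
# train(model, monolingual_batches, epochs=5, lr=1e-3)    # large monolingual text
# train(model, code_switched_batches, epochs=2, lr=1e-4)  # small CS data
```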
Language Modeling with Sparse Product of Sememe Experts
Most language modeling methods rely on large-scale data to statistically
learn the sequential patterns of words. In this paper, we argue that words are
atomic language units but not necessarily atomic semantic units. Inspired by
HowNet, we use sememes, the minimum semantic units in human languages, to
represent the implicit semantics behind words for language modeling, named
Sememe-Driven Language Model (SDLM). More specifically, to predict the next
word, SDLM first estimates the sememe distribution given the textual context.
Afterward, it regards each sememe as a distinct semantic expert, and these
experts jointly identify the most probable senses and the corresponding word.
In this way, SDLM enables language models to work beyond word-level
manipulation to fine-grained sememe-level semantics and offers us more powerful
tools to fine-tune language models and improve the interpretability as well as
the robustness of language models. Experiments on language modeling and the
downstream application of headline generation demonstrate the significant
effect of SDLM. Source code and data used in the experiments can be accessed at
https://github.com/thunlp/SDLM-pytorch.
Comment: EMNLP 2018. The first three authors contribute equally.
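The two-step prediction described above (context → sememe relevance → expert
votes over words) can be sketched as a weighted product of experts. The tensor
shapes and softmax placement below are our reading of the idea, not the
released implementation; see the linked repository for that.

```python
# Hedged sketch of a product-of-sememe-experts next-word predictor.
import torch
import torch.nn.functional as F

def sdlm_log_scores(context_vec, sememe_proj, expert_logits):
    """
    context_vec:   (dim,)              hidden state summarizing the context
    sememe_proj:   (n_sememes, dim)    maps context to per-sememe relevance
    expert_logits: (n_sememes, vocab)  each sememe expert's word scores
    """
    # Step 1: estimate the sememe distribution given the context.
    relevance = F.softmax(sememe_proj @ context_vec, dim=0)    # (n_sememes,)
    # Step 2: a weighted product of experts equals a weighted sum of per-expert
    # log-probs; SDLM additionally keeps this sparse by letting each expert
    # score only the senses/words its sememe annotates.
    expert_log_p = F.log_softmax(expert_logits, dim=1)         # (n_sememes, vocab)
    return relevance @ expert_log_p                            # (vocab,) unnormalized

# p(word | context) = softmax over sdlm_log_scores(...)
```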
Multi-scale Transformer Language Models
We investigate multi-scale transformer language models that learn
representations of text at multiple scales, and present three different
architectures that have an inductive bias to handle the hierarchical nature of
language. Experiments on large-scale language modeling benchmarks empirically
demonstrate favorable likelihood vs. memory footprint trade-offs; e.g., we show
that it is possible to train a hierarchical variant with 30 layers that has a
23% smaller memory footprint and better perplexity, compared to a vanilla
transformer with less than half the number of layers, on the Toronto
BookCorpus. We analyze the advantages of learned representations at multiple
scales in terms of memory footprint, compute time, and perplexity, which are
particularly appealing given the quadratic scaling of transformers' run time
and memory usage with respect to sequence length.
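For intuition on why coarser scales pay off, recall that self-attention over a
length-L sequence materializes an L-by-L score matrix; the arithmetic below is
our illustration of that quadratic trade-off, not the paper's exact accounting.

```latex
\[
\mathrm{cost}(L) = \Theta(L^2 d)
\quad\Longrightarrow\quad
\mathrm{cost}(L/s) = \Theta\big((L/s)^2 d\big) = \frac{\Theta(L^2 d)}{s^2},
\]
so a scale that downsamples the sequence by $s = 2$ spends roughly $4\times$
less attention time and memory, which is how a deeper hierarchical stack can
still undercut a shallower full-resolution transformer's footprint.
```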
A language model based approach towards large scale and lightweight language identification systems
Multilingual spoken dialogue systems have gained prominence in the recent
past necessitating the requirement for a front-end Language Identification
(LID) system. Most of the existing LID systems rely on modeling the language
discriminative information from low-level acoustic features. Due to the
variability of speech (speaker and emotional variability, etc.), large-scale
LID systems built on low-level acoustic features suffer degraded performance.
In this work, we instead model higher-level, language-discriminative
phonotactic information. The input speech signal is first tokenized into phone
sequences by a language-independent phone recognizer. The language
discriminative phonotactic information in the resulting phone sequences is then
modeled using statistical and recurrent neural network based language modeling
approaches. Because it relies on higher-level phonotactic information, this
approach is more robust to the variability of speech. It is also
computationally lightweight and highly scalable, and it can complement
existing LID systems.
Comment: Under review at ICASSP 201
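To make the phonotactic idea concrete, here is a toy version of the scoring
pipeline: train one phone-level LM per language and pick the language whose LM
assigns the recognized phone sequence the highest likelihood. The bigram LM
below is a stand-in of ours for the paper's statistical and RNN models.

```python
# Toy phonotactic LID: one add-alpha-smoothed bigram LM per language.
import math
from collections import defaultdict

def train_bigram_lm(phone_sequences, alpha=1.0):
    counts = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    vocab = set()
    for seq in phone_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
            totals[a] += 1
            vocab.update((a, b))
    V = max(len(vocab), 1)
    def log_likelihood(seq):
        return sum(math.log((counts[a][b] + alpha) / (totals[a] + alpha * V))
                   for a, b in zip(seq, seq[1:]))
    return log_likelihood

def identify(phone_seq, lms):
    """lms: language -> log-likelihood function; return the best language."""
    return max(lms, key=lambda lang: lms[lang](phone_seq))

# lms = {"en": train_bigram_lm(en_phone_seqs), "ta": train_bigram_lm(ta_phone_seqs)}
# identify(recognized_phones, lms)
```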
Development of simulation package for atomic processes of ultra-large-scale system based on electronic structure theory
An early-stage version of a simulation package is developed for electronic
structure calculations and the dynamics of atomic processes in large-scale
systems, particularly nm-scale or 10-nm-scale systems. We adopted an Extensible
Markup Language (XML) style for the input and output of our simulation code,
and developed modeling and analysis tools for dynamical simulations of atomic
processes. A bulk GaAs system was calculated to demonstrate that the present
code can handle systems with more than one atomic species.
Comment: 8 pages, 4 figures. A PDF file with better graphics is available at
http://fujimac.t.u-tokyo.ac.jp/lses/index_e.htm
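Since the package's I/O is XML-based, an input file can be read with any
standard XML parser. The element and attribute names below are purely
hypothetical (the actual schema ships with the package); this only illustrates
the style.

```python
# Hypothetical example of reading an XML-style simulation input.
# Element and attribute names are invented for illustration only.
import xml.etree.ElementTree as ET

def read_input(path):
    root = ET.parse(path).getroot()
    atoms = [(a.get("element"),
              tuple(float(x) for x in a.get("position").split()))
             for a in root.iter("atom")]
    steps = int(root.findtext("md_steps", default="0"))
    return atoms, steps

# atoms, steps = read_input("gaas_bulk.xml")   # hypothetical input file
```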
Multiscale sequence modeling with a learned dictionary
We propose a generalization of neural network sequence models. Instead of
predicting one symbol at a time, our multi-scale model makes predictions over
multiple, potentially overlapping multi-symbol tokens. A variation of the
byte-pair encoding (BPE) compression algorithm is used to learn the dictionary
of tokens that the model is trained with. When applied to language modelling,
our model has the flexibility of character-level models while maintaining many
of the performance benefits of word-level models. Our experiments show that
this model performs better than a regular LSTM on language modeling tasks,
especially for smaller models.
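For reference, the standard BPE merge loop that the proposed dictionary
learning varies looks as follows; this is the textbook baseline, not the
paper's modified version.

```python
# Standard BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def learn_bpe(words, num_merges):
    vocab = Counter(tuple(word) for word in words)   # words as character tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)             # most frequent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):                  # apply the merge greedily
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges                                    # the learned dictionary

# learn_bpe(["low", "lower", "lowest", "newest"] * 100, num_merges=10)
```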
Distilling Knowledge Learned in BERT for Text Generation
Large-scale pre-trained language models such as BERT have achieved great
success in language understanding tasks. However, it remains an open question
how to utilize BERT for language generation. In this paper, we present a novel
approach, Conditional Masked Language Modeling (C-MLM), to enable the
finetuning of BERT on target generation tasks. The finetuned BERT (teacher) is
exploited as extra supervision to improve conventional Seq2Seq models (student)
for better text generation performance. By leveraging BERT's idiosyncratic
bidirectional nature, distilling knowledge learned in BERT can encourage
auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level
supervision for coherent text generation. Experiments show that the proposed
approach significantly outperforms strong Transformer baselines on multiple
language generation tasks such as machine translation and text summarization.
Our proposed model also achieves new state of the art on IWSLT German-English
and English-Vietnamese MT datasets. Code is available at
https://github.com/ChenRocks/Distill-BERT-Textgen.
Comment: ACL 2020
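The teacher-as-extra-supervision idea can be sketched as a per-position
distillation term added to the usual cross-entropy. The temperature and mixing
weight below are generic knowledge-distillation defaults of ours, not the
paper's settings.

```python
# Hedged sketch of soft-target distillation for a Seq2Seq student.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """
    student_logits, teacher_logits: (batch, seq_len, vocab)
    targets:                        (batch, seq_len) gold token ids
    """
    # Hard loss: ordinary cross-entropy against the gold tokens.
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets)
    # Soft loss: match the teacher's temperature-smoothed distributions.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```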
Modeling Vocabulary for Big Code Machine Learning
When building machine learning models that operate on source code, several
decisions have to be made to model source-code vocabulary. These decisions can
have a large impact: some can lead to not being able to train models at all,
others significantly affect performance, particularly for Neural Language
Models. Yet, these decisions are not often fully described. This paper lists
important modeling choices for source code vocabulary, and explores their
impact on the resulting vocabulary on a large-scale corpus of 14,436 projects.
We show that a subset of decisions have decisive characteristics, making it
possible to train accurate Neural Language Models quickly on a large corpus of
10,106 projects.
Comment: 12 pages, 1 figure
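As a small illustration of the kind of choice the paper studies, compare the
vocabulary produced by keeping identifiers whole with the one produced by
splitting them on camelCase and underscores. This is a toy example of ours,
not the paper's pipeline.

```python
# Toy comparison of two source-code vocabulary modeling choices.
import re

def split_identifier(token):
    """Split snake_case and simple camelCase identifiers into subtokens."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", token)
    return [p.lower() for p in parts if p]

def vocab_sizes(code_tokens):
    raw = set(code_tokens)
    subtokens = {s for tok in code_tokens for s in split_identifier(tok)}
    return len(raw), len(subtokens)

# On real corpora the subtoken vocabulary is far smaller and better shared
# across projects, which is one reason such choices decide trainability.
# vocab_sizes(["getUserName", "get_user_id", "userName", "setUserName"])
```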
Quantifying Long Range Dependence in Language and User Behavior to improve RNNs
Characterizing temporal dependence patterns is a critical step in
understanding the statistical properties of sequential data. Long Range
Dependence (LRD) --- referring to long-range correlations decaying as a power
law rather than exponentially w.r.t. distance --- demands a different set of
tools for modeling the underlying dynamics of the sequential data. While it has
been widely conjectured that LRD is present in language modeling and sequential
recommendation, the amount of LRD in the corresponding sequential datasets has
not yet been quantified in a scalable and model-independent manner. We propose
a principled estimation procedure of LRD in sequential datasets based on
established LRD theory for real-valued time series and apply it to sequences of
symbols with million-item-scale dictionaries. In our measurements, the
procedure reliably estimates the LRD in the behavior of users as they write
Wikipedia articles and as they interact with YouTube. We further show that
measuring LRD better informs modeling decisions, in particular for RNNs, whose
ability to capture LRD is still an active area of research. The quantitative
measure informs new Evolutive Recurrent Neural Networks (EvolutiveRNNs)
designs, leading to state-of-the-art results on language understanding and
sequential recommendation tasks at a fraction of the computational cost.
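For intuition about what such an estimator measures, here is the textbook
aggregated-variance estimate of the Hurst exponent for a real-valued series.
It is a standard tool from the LRD theory the paper builds on, not the paper's
own procedure for symbolic data.

```python
# Aggregated-variance Hurst estimator: under LRD, the variance of block
# means decays as m^(2H - 2); H > 0.5 signals long-range dependence.
# Assumes the series is long enough to fill the largest block size twice.
import math

def hurst_aggvar(series, block_sizes=(4, 8, 16, 32, 64)):
    xs, ys = [], []
    for m in block_sizes:
        k = len(series) // m
        if k < 2:
            continue
        means = [sum(series[i * m:(i + 1) * m]) / m for i in range(k)]
        mu = sum(means) / k
        var = sum((x - mu) ** 2 for x in means) / (k - 1)
        xs.append(math.log(m))
        ys.append(math.log(var))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1.0 + slope / 2.0                 # slope = 2H - 2

# hurst_aggvar(my_series)   # ~0.5 for i.i.d. noise, > 0.5 under LRD
```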