Multilingual Unsupervised Sentence Simplification
Progress in Sentence Simplification has been hindered by the lack of
supervised data, particularly in languages other than English. Previous work
has aligned sentences from original and simplified corpora such as English
Wikipedia and Simple English Wikipedia, but this limits corpus size, domain,
and language. In this work, we propose using unsupervised mining techniques to
automatically create training corpora for simplification in multiple languages
from raw Common Crawl web data. When coupled with a controllable generation
mechanism that can flexibly adjust attributes such as length and lexical
complexity, these mined paraphrase corpora can be used to train simplification
systems in any language. We further incorporate multilingual unsupervised
pretraining methods to create even stronger models and show that by training on
mined data rather than supervised corpora, we outperform the previous best
results. We evaluate our approach on English, French, and Spanish
simplification benchmarks, reaching state-of-the-art performance with a fully unsupervised approach. We will release our models and the code to mine data in any language included in Common Crawl.
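The controllable-generation mechanism lends itself to a short illustration. Below is a minimal Python sketch of the control-token idea: target attributes such as length and lexical complexity are encoded as special tokens prepended to the input. The token names, the ratio encoding, and the helper function are illustrative assumptions, not the paper's exact interface.

```python
# Minimal sketch of controllable simplification via control tokens.
# Token names (e.g. "<LENGTH_0.80>") are illustrative assumptions.

def add_control_tokens(source: str, length_ratio: float, lexical_ratio: float) -> str:
    """Prefix the source sentence with tokens encoding target attributes.

    A seq2seq model trained on mined paraphrase pairs labelled with these
    ratios learns to condition its output on them at inference time.
    """
    return f"<LENGTH_{length_ratio:.2f}> <LEXICAL_{lexical_ratio:.2f}> {source}"

# At training time, ratios are computed from each (complex, simple) pair,
# e.g. length_ratio = len(simple) / len(complex).
# At inference time, the user picks ratios to control the output:
inp = add_control_tokens("The committee deliberated at considerable length.", 0.8, 0.7)
print(inp)
# -> "<LENGTH_0.80> <LEXICAL_0.70> The committee deliberated at considerable length."
```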
Gravity optimised particle filter for hand tracking
This paper presents a gravity optimised particle filter (GOPF) in which the magnitude of the gravitational force on every particle is proportional to its weight. GOPF attracts nearby particles and replicates new particles, in effect moving the particles towards the peak of the likelihood distribution and improving the sampling efficiency. GOPF is incorporated into a technique for hand feature tracking. A fast approach to hand feature detection and labelling using convexity defects is also presented. Experimental results show that GOPF outperforms the standard particle filter and its variants, as well as the state-of-the-art CamShift-guided particle filter, while using a significantly reduced number of particles.
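As a rough illustration of the attraction step described in the abstract, here is a minimal NumPy sketch in which each particle pulls the others with a force proportional to its normalised weight, so the cloud drifts towards the likelihood peak. The pairwise linear update, the constant g, and the 1-D toy state are assumptions for illustration (and the replication step is omitted); the paper's exact update rule may differ.

```python
import numpy as np

def gravity_step(particles: np.ndarray, weights: np.ndarray, g: float = 0.1) -> np.ndarray:
    """Attract each particle towards heavier ones.

    Every particle exerts a pull proportional to its normalised weight, so
    with weights summing to one this moves each particle a step towards the
    weighted mean, i.e. towards the peak of the likelihood distribution.
    """
    moved = particles.copy()
    for i in range(len(particles)):
        pull = np.sum(weights[:, None] * (particles - particles[i]), axis=0)
        moved[i] = particles[i] + g * pull
    return moved

# Toy 1-D example with a likelihood peaked at x = 2.0:
rng = np.random.default_rng(0)
particles = rng.normal(0.0, 3.0, size=(50, 1))
weights = np.exp(-0.5 * (particles[:, 0] - 2.0) ** 2)
weights /= weights.sum()
print(f"mean before: {particles.mean():.2f}, "
      f"after: {gravity_step(particles, weights).mean():.2f}")
# the cloud drifts towards the peak at 2.0
```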
Cross-lingual Pre-training Based Transfer for Zero-shot Neural Machine Translation
Transfer learning between different language pairs has shown its effectiveness for Neural Machine Translation (NMT) in low-resource scenarios.
However, existing transfer methods involving a common target language are far
from success in the extreme scenario of zero-shot translation, due to the
language space mismatch problem between transferor (the parent model) and
transferee (the child model) on the source side. To address this challenge, we
propose an effective transfer learning approach based on cross-lingual
pre-training. Our key idea is to make all source languages share the same
feature space and thus enable a smooth transition for zero-shot translation. To
this end, we introduce one monolingual pre-training method and two bilingual
pre-training methods to obtain a universal encoder for different languages.
Once the universal encoder is constructed, the parent model built on this encoder is trained with large-scale annotated data and then applied directly in the zero-shot translation scenario. Experiments on two public datasets show that our approach significantly outperforms a strong pivot-based baseline and various multilingual NMT approaches.
Comment: Accepted as a conference paper at AAAI 2020 (oral presentation).
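A toy NumPy sketch of the "language space mismatch" the abstract refers to: with separate per-language encoders, the same input lands in different feature spaces and the parent decoder's outputs diverge, while a shared (universal) encoder removes the mismatch by construction. All matrices and dimensions below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
decoder = rng.normal(size=(d, d))   # stands in for the parent model's decoder

# Separate per-language encoders place sources in DIFFERENT feature spaces:
enc_parent = rng.normal(size=(d, d))
enc_child = rng.normal(size=(d, d))

x = rng.normal(size=d)              # the "same meaning" in both source languages
mismatch = np.linalg.norm(decoder @ (enc_parent @ x) - decoder @ (enc_child @ x))

# A universal encoder shared across source languages removes the mismatch,
# so a decoder trained with the parent transfers to the child zero-shot:
enc_shared = rng.normal(size=(d, d))
shared_gap = np.linalg.norm(decoder @ (enc_shared @ x) - decoder @ (enc_shared @ x))

print(f"separate encoders: {mismatch:.2f}, shared encoder: {shared_gap:.2f}")
# separate encoders give a large discrepancy; the shared encoder gives 0.00
```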
Determination of the exponent gamma for SAWs on the two-dimensional Manhattan lattice
We present a high-statistics Monte Carlo determination of the exponent gamma
for self-avoiding walks on a Manhattan lattice in two dimensions. A
conservative estimate is gamma ≳ 1.3425(3), in agreement with the
universal value 43/32 on regular lattices, but in conflict with predictions
from conformal field theory and with a recent estimate from exact enumerations.
We find strong corrections to scaling that seem to indicate the presence of a
non-analytic exponent Delta < 1. If we assume Delta = 11/16 we find gamma =
1.3436(3), where the error is purely statistical.
Comment: 24 pages, LaTeX2e, 4 figures.
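For context, the exponent gamma is conventionally defined through the scaling of the number of N-step self-avoiding walks, c_N ~ A mu^N N^(gamma-1). The following sketch shows how gamma could be read off from such counts by a log-space fit; the values of A and mu and the noise-free synthetic counts are made up for illustration and are not the paper's data.

```python
# Sketch: extract gamma from walk counts assumed to scale as
# c_N ~ A * mu**N * N**(gamma - 1). All numbers below are synthetic.
import numpy as np

A, mu, gamma_true = 1.2, 2.5, 43 / 32
N = np.arange(50, 500, 10)
c_N = A * mu**N * N**(gamma_true - 1)    # idealised counts, no noise

# With mu known (or fitted jointly), gamma - 1 is the slope of
# log(c_N) - N*log(mu) against log(N):
slope, intercept = np.polyfit(np.log(N), np.log(c_N) - N * np.log(mu), 1)
print(f"estimated gamma = {slope + 1:.4f}")   # -> 1.3438 (= 43/32)
```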
Distributed representations for multilingual language processing
Distributed representations are a central element in natural language processing. Units of text such as words, n-grams, or characters are mapped to real-valued vectors so that they can be processed by computational models. Representations trained on large amounts of text, called static word embeddings, have been found to work well across a variety of tasks such as sentiment analysis or named entity recognition. More recently, pretrained language models have been used as contextualized representations, which have been found to yield even better task performance.
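As a minimal illustration of the mapping from text units to real-valued vectors, here is a toy static embedding table with cosine similarity; the three vectors are invented for illustration.

```python
import numpy as np

# Toy static embedding table: each word maps to a real-valued vector.
embeddings = {
    "cat": np.array([0.8, 0.1, 0.3]),
    "dog": np.array([0.7, 0.2, 0.35]),
    "car": np.array([0.05, 0.9, 0.6]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words end up close in the vector space:
print(cosine(embeddings["cat"], embeddings["dog"]))  # high (~0.99)
print(cosine(embeddings["cat"], embeddings["car"]))  # lower (~0.33)
```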
Multilingual representations that are invariant with respect to language are useful for multiple reasons. Models using such representations would require training data in only one language and could still generalize across multiple languages. This is especially useful for languages that suffer from data sparsity. Further, machine translation models can benefit from source and target representations in the same space. Last, knowledge extraction models could access not only English data but data in any natural language, and thus exploit a richer source of knowledge.
Given that several thousand languages exist in the world, the need for multilingual language processing seems evident. However, it is not immediately clear which properties multilingual embeddings should exhibit, how current multilingual representations work, and how they could be improved.
This thesis investigates some of these questions. In the first publication, we explore the boundaries of multilingual representation learning by creating an embedding space across more than one thousand languages. We analyze existing methods and propose concept-based embedding learning methods. The second paper investigates the differences between creating representations for one thousand languages with little data and considering a few languages with abundant data. In the third publication, we refine a method to obtain interpretable subspaces of embeddings. This method can be used to investigate the workings of multilingual representations. The fourth publication finds that multilingual pretrained language models exhibit a high degree of multilinguality, in the sense that high-quality word alignments can easily be extracted from them. The fifth paper investigates why multilingual pretrained language models are multilingual despite lacking any kind of cross-lingual supervision during training. Based on our findings, we propose a training scheme that leads to improved multilinguality. Last, the sixth paper investigates the use of multilingual pretrained language models as multilingual knowledge bases.
Three Puzzles on Mathematics, Computation, and Games
In this lecture I will talk about three puzzles involving mathematics and computation that have preoccupied me over the years. The first
puzzle is to understand the amazing success of the simplex algorithm for linear
programming. The second puzzle is about errors made when votes are counted
during elections. The third puzzle is: are quantum computers possible?
Comment: ICM 2018 plenary lecture, Rio de Janeiro, 36 pages, 7 figures.