50 research outputs found
Large-scale Hierarchical Alignment for Data-driven Text Rewriting
We propose a simple unsupervised method for extracting pseudo-parallel
monolingual sentence pairs from comparable corpora representative of two
different text styles, such as news articles and scientific papers. Our
approach does not require a seed parallel corpus, but instead relies solely on
hierarchical search over pre-trained embeddings of documents and sentences. We
demonstrate the effectiveness of our method through automatic and extrinsic
evaluation on text simplification from the normal to the Simple Wikipedia. We
show that pseudo-parallel sentences extracted with our method not only
supplement existing parallel data, but can even lead to competitive performance
on their own.
Comment: RANLP 201
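The hierarchical search described above can be sketched as a two-level nearest-neighbor procedure: match documents first, then sentences within matched document pairs. This is an illustrative sketch only; the function names and toy embeddings are hypothetical stand-ins for pre-trained document and sentence encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def nearest(query, candidates):
    """Index of the candidate embedding closest to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))

def mine_pairs(src_docs, tgt_docs, threshold=0.8):
    """Hierarchical mining: each doc is (doc_embedding, [(sentence, sent_embedding), ...]).
    First find the closest target document, then align sentences inside it,
    keeping only pairs above a similarity threshold."""
    pairs = []
    for doc_emb, src_sents in src_docs:
        j = nearest(doc_emb, [d for d, _ in tgt_docs])
        tgt_sents = tgt_docs[j][1]
        for sent, emb in src_sents:
            k = nearest(emb, [e for _, e in tgt_sents])
            score = cosine(emb, tgt_sents[k][1])
            if score >= threshold:
                pairs.append((sent, tgt_sents[k][0], score))
    return pairs
```

Restricting the sentence search to matched documents is what keeps the procedure tractable compared to an all-pairs sentence search.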
Character-level Chinese-English Translation through ASCII Encoding
Character-level Neural Machine Translation (NMT) models have recently
achieved impressive results on many language pairs. They mainly do well for
Indo-European language pairs, where the languages share the same writing
system. However, for translating between Chinese and English, the gap between
the two different writing systems poses a major challenge because of a lack of
systematic correspondence between the individual linguistic units. In this
paper, we enable character-level NMT for Chinese, by breaking down Chinese
characters into linguistic units similar to those of Indo-European languages. We
use the Wubi encoding scheme, which preserves the original shape and semantic
information of the characters, while also being reversible. We show promising
results from training Wubi-based models on the character- and subword-level
with recurrent as well as convolutional models.
Comment: 7 pages, 3 figures, 3rd Conference on Machine Translation (WMT18), 201
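The idea of a reversible character-to-ASCII encoding can be illustrated with a toy lookup table. The table below is hypothetical, not the real Wubi table (actual Wubi codes are up to five letters derived from character shape), and a real system must also handle characters outside the table; only the round-trip property is demonstrated here.

```python
# Hypothetical shape-code table in the spirit of Wubi; NOT the real table.
WUBI_LIKE = {"中": "khk", "国": "lgyi", "人": "ww"}
REVERSE = {code: ch for ch, code in WUBI_LIKE.items()}

def encode(text):
    """Map each Chinese character to its ASCII code; codes are
    space-separated so the mapping stays invertible."""
    return " ".join(WUBI_LIKE.get(ch, ch) for ch in text)

def decode(encoded):
    """Invert encode() by looking each token up in the reverse table."""
    return "".join(REVERSE.get(tok, tok) for tok in encoded.split(" "))
```

Because the mapping is lossless, a model can be trained entirely on the ASCII side and its output decoded back to Chinese characters.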
Embedding-based Scientific Literature Discovery in a Text Editor Application
Each claim in a research paper requires all relevant prior knowledge to be
discovered, assimilated, and appropriately cited. However, despite the
availability of powerful search engines and sophisticated text editing
software, discovering relevant papers and integrating the knowledge into a
manuscript remain complex tasks associated with high cognitive load. Defining
comprehensive search queries requires strong motivation from authors,
irrespective of their familiarity with the research field. Moreover, switching
between independent applications for literature discovery, bibliography
management, reading papers, and writing text burdens authors further and
interrupts their creative process. Here, we present a web application that
combines text editing and literature discovery in an interactive user
interface. The application is equipped with a search engine that couples
Boolean keyword filtering with nearest neighbor search over text embeddings,
providing a discovery experience tuned to the author's manuscript and their
interests. Our application aims to take a step towards more enjoyable and
effortless academic writing.
The demo of the application (https://SciEditorDemo2020.herokuapp.com/) and a
short video tutorial (https://youtu.be/pkdVU60IcRc) are available online.
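The coupling of Boolean keyword filtering with nearest-neighbor embedding search described above can be sketched as a filter-then-rank pipeline. The documents and vectors below are toy examples; a real deployment would use an inverted index and an approximate-nearest-neighbor structure instead of linear scans.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def search(query_terms, query_emb, corpus, top_k=3):
    """corpus: list of (text, embedding). Keep only documents containing
    all query terms (Boolean AND), then rank survivors by embedding
    similarity to the query."""
    filtered = [(t, e) for t, e in corpus
                if all(term.lower() in t.lower() for term in query_terms)]
    filtered.sort(key=lambda te: cosine(query_emb, te[1]), reverse=True)
    return [t for t, _ in filtered[:top_k]]
```

The keyword filter gives authors precise control over scope, while the embedding ranking surfaces semantically related work that shares few exact terms with the manuscript.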
Character-Level Translation with Self-attention
We explore the suitability of self-attention models for character-level
neural machine translation. We test the standard transformer model, as well as
a novel variant in which the encoder block combines information from nearby
characters using convolutions. We perform extensive experiments on WMT and UN
datasets, testing both bilingual and multilingual translation to English using
up to three input languages (French, Spanish, and Chinese). Our transformer
variant consistently outperforms the standard transformer at the
character-level and converges faster while learning more robust character-level
alignments.
Comment: ACL 202
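The encoder-block variant described above combines information from nearby characters with convolutions before self-attention. A minimal sketch of that ingredient, a width-3 one-dimensional convolution over per-character embeddings, is shown below; the averaging kernel is a toy stand-in for learned weights.

```python
def conv1d_same(seq, kernel=(0.25, 0.5, 0.25)):
    """seq: list of per-character embedding vectors (lists of floats).
    Returns vectors of the same length where each position is a weighted
    combination of its left neighbor, itself, and its right neighbor,
    with zero padding at the boundaries."""
    n, dim = len(seq), len(seq[0])
    half = len(kernel) // 2
    out = []
    for i in range(n):
        vec = [0.0] * dim
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:  # positions outside the sequence contribute zero
                for d in range(dim):
                    vec[d] += w * seq[j][d]
        out.append(vec)
    return out
```

Mixing neighboring characters this way gives each position a subword-like receptive field before attention is applied, which is one plausible reason such variants learn more robust character-level alignments.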
Improving efficiency of supercontinuum generation in photonic crystal fibers by direct degenerate four-wave-mixing
We numerically study supercontinuum (SC) generation in photonic crystal
fibers pumped with low-power 30-ps pulses close to the zero-dispersion
wavelength of 647 nm. We show how the efficiency is significantly improved by
designing the dispersion to allow widely separated spectral lines generated by
degenerate four-wave-mixing (FWM) directly from the pump to broaden and merge.
By proper modification of the dispersion profile, the generation of additional
FWM Stokes and anti-Stokes lines results in efficient generation of an 800 nm
wide SC. Simulations show that the predicted efficient SC generation is more
robust and can survive fiber imperfections modelled as random fluctuations of
the dispersion coefficients along the fiber length.
Comment: Submitted to Journal of the Optical Society of America B on 16 September 200
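For context, the degenerate four-wave-mixing process invoked above follows the standard fiber-optics relations (textbook material, stated here as background rather than taken from the paper): two pump photons at $\omega_p$ convert into one Stokes and one anti-Stokes photon,

```latex
2\omega_p = \omega_s + \omega_a ,
\qquad
\Delta\beta = \beta(\omega_s) + \beta(\omega_a) - 2\beta(\omega_p) ,
```

with efficient conversion when the total mismatch, including the nonlinear contribution of the pump power $P_0$ through the nonlinear coefficient $\gamma$, vanishes: $\Delta\beta + 2\gamma P_0 = 0$. Engineering the dispersion profile $\beta(\omega)$ is what controls where the Stokes and anti-Stokes lines appear.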
Quadratic solitons as nonlocal solitons
We show that quadratic solitons are equivalent to solitons of a nonlocal Kerr
medium. This provides new physical insight into the properties of quadratic
solitons, often believed to be equivalent to solitons of an effective saturable
Kerr medium. The nonlocal analogy also allows for novel analytical solutions
and the prediction of novel bound states of quadratic solitons.
Comment: 4 pages, 3 figures
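As a reminder of the model class involved (a standard form, given as background rather than taken from this paper), a nonlocal Kerr medium refers to a nonlinear Schrödinger equation in which the index change depends on a spatially smeared intensity:

```latex
i\,\frac{\partial \psi}{\partial z}
  + \frac{1}{2}\,\frac{\partial^2 \psi}{\partial x^2}
  + \psi \int R(x - x')\,\lvert\psi(x')\rvert^2 \, dx' = 0 ,
```

where $R$ is the nonlocal response function; the local Kerr limit is recovered for $R(x) = \delta(x)$. The equivalence claimed above maps the quadratic (second-harmonic) coupling onto a particular choice of $R$.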
The genetic history of the Southern Arc: a bridge between West Asia and Europe
By sequencing 727 ancient individuals from the Southern Arc (Anatolia and its neighbors in Southeastern Europe and West Asia) over 10,000 years, we contextualize its Chalcolithic period and Bronze Age (about 5000 to 1000 BCE), when extensive gene flow entangled it with the Eurasian steppe. Two streams of migration transmitted Caucasus and Anatolian/Levantine ancestry northward, and the Yamnaya pastoralists, formed on the steppe, then spread southward into the Balkans and across the Caucasus into Armenia, where they left numerous patrilineal descendants. Anatolia was transformed by intra–West Asian gene flow, with negligible impact of the later Yamnaya migrations. This contrasts with all other regions where Indo-European languages were spoken, suggesting that the homeland of the Indo-Anatolian language family was in West Asia, with only secondary dispersals of non-Anatolian Indo-Europeans from the steppe.
Abstractive Document Summarization in High and Low Resource Settings
Automatic summarization aims to reduce an input document to a compressed version that captures only its salient parts. It is a topic with growing importance in today's age of information overflow.
There are two main types of automatic summarization. Extractive summarization only selects salient sentences from the input, while abstractive summarization generates a summary without explicitly re-using whole sentences, resulting in summaries that are often more fluent.
State-of-the-art approaches to abstractive summarization are data-driven, relying on the availability of large collections of paired articles with summaries. The pairs are typically manually constructed, a task which is costly and time-consuming. Furthermore, when targeting a slightly different domain or summary format, a new parallel dataset is often required. This large reliance on parallel resources limits the potential impact of abstractive summarization systems in society.
In this thesis, we consider the problem of abstractive summarization from two different perspectives: high-resource and low-resource summarization.
In the first part, we compare different methods for data-driven summarization, focusing specifically on the problem of generating long, abstractive summaries, such as an abstract for a scientific journal article. We discuss the difficulties that come with abstractive generation of long summaries and propose methods for alleviating them.
In the second part of this thesis, we develop low-resource methods for abstractive text rewriting, first focusing on individual sentences and then on whole summaries. Our methods do not rely on parallel data, but instead utilize raw non-parallel text collections.
Overall, this work takes a step towards data-driven abstractive summarization for the generation of long summaries, without relying on vast amounts of parallel, manually curated data.
Abstractive Document Summarization without Parallel Data
Abstractive summarization typically relies on large collections of paired
articles and summaries. However, in many cases, parallel data is scarce and
costly to obtain. We develop an abstractive summarization system that relies
only on large collections of example summaries and non-matching articles. Our
approach consists of an unsupervised sentence extractor that selects salient
sentences to include in the final summary, and a sentence abstractor, trained
on pseudo-parallel and synthetic data, that paraphrases each of the extracted
sentences. We perform an extensive evaluation of our method: on
the CNN/DailyMail benchmark, on which we compare our approach to fully
supervised baselines, as well as on the novel task of automatically generating
a press release from a scientific journal article, which is well suited for our
system. We show promising performance on both tasks, without relying on any
article-summary pairs.
Comment: LREC 202
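The unsupervised extractor in the pipeline above can be sketched with a simple centroid-based salience score. This is only an illustration of the idea, not the paper's actual extractor; the embeddings are hypothetical stand-ins for a pre-trained sentence encoder, and the abstractor stage is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def extract(sent_embs, k=2):
    """Score each sentence by similarity to the document centroid and
    return the indices of the top-k sentences in document order."""
    dim = len(sent_embs[0])
    centroid = [sum(v[d] for v in sent_embs) / len(sent_embs) for d in range(dim)]
    ranked = sorted(range(len(sent_embs)),
                    key=lambda i: cosine(sent_embs[i], centroid), reverse=True)
    return sorted(ranked[:k])
```

In the full system, each extracted sentence would then be passed through the abstractor, a paraphrasing model trained on pseudo-parallel and synthetic sentence pairs, to produce the final summary.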