Unsupervised generation of parallel treebanks through sub-tree alignment
The need for syntactically annotated data for use in natural language processing has increased dramatically
in recent years. This is true especially for parallel treebanks, of which very few exist. The ones
that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this
paper we introduce an open-source system for fast and robust automatic generation of parallel treebanks.
We expect the opening of the presented platform to the scientific community to help boost research
in the field of data-oriented machine translation and lead to advancements in other fields where
parallel treebanks can be employed.
Matching Natural Language Sentences with Hierarchical Sentence Factorization
Semantic matching of natural language sentences or identifying the
relationship between two sentences is a core research problem underlying many
natural language tasks. Depending on whether training data is available, prior
research has proposed both unsupervised distance-based schemes and supervised
deep learning schemes for sentence matching. However, previous approaches
either omit or fail to fully utilize the ordered, hierarchical, and flexible
structures of language objects, as well as the interactions between them. In
this paper, we propose Hierarchical Sentence Factorization---a technique to
factorize a sentence into a hierarchical representation, with the components at
each different scale reordered into a "predicate-argument" form. The proposed
sentence factorization technique leads to the invention of: 1) a new
unsupervised distance metric which calculates the semantic distance between a
pair of text snippets by solving a penalized optimal transport problem while
preserving the logical relationship of words in the reordered sentences, and 2)
new multi-scale deep learning models for supervised semantic training, based on
factorized sentence hierarchies. We apply our techniques to text-pair
similarity estimation and text-pair relationship classification tasks, based on
multiple datasets such as STSbenchmark, the Microsoft Research paraphrase
identification (MSRP) dataset, the SICK dataset, etc. Extensive experiments
show that the proposed hierarchical sentence factorization can be used to
significantly improve the performance of existing unsupervised distance-based
metrics as well as multiple supervised deep learning models based on the
convolutional neural network (CNN) and long short-term memory (LSTM).
Comment: Accepted by WWW 2018, 10 pages
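The unsupervised metric described in this abstract is built on optimal transport between the words of two text snippets. As a rough illustration only, the sketch below computes a transport-based distance between two short sentences using entropy-regularized optimal transport (Sinkhorn iterations). It is not the paper's metric: it swaps in plain Sinkhorn, omits the sentence factorization and the penalty that preserves word order, and the two-dimensional word vectors in `TOY_VECTORS` are invented for the example.

```python
import math

# Hypothetical 2-D word vectors, made up for this illustration.
TOY_VECTORS = {
    "cat": (1.0, 0.0), "dog": (0.9, 0.1),
    "sits": (0.0, 1.0), "runs": (0.1, 0.9),
    "stock": (-1.0, 0.0), "falls": (0.0, -1.0),
}

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def sinkhorn_distance(words_a, words_b, reg=0.1, iters=200):
    """Entropy-regularized transport cost between two word lists.

    Each sentence is treated as a uniform distribution over its words.
    """
    vecs_a = [TOY_VECTORS[w] for w in words_a]
    vecs_b = [TOY_VECTORS[w] for w in words_b]
    n, m = len(vecs_a), len(vecs_b)
    mass_a = [1.0 / n] * n
    mass_b = [1.0 / m] * m
    cost = [[euclidean(p, q) for q in vecs_b] for p in vecs_a]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [mass_a[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [mass_b[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    # Cost of the resulting transport plan u_i * K_ij * v_j.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

near = sinkhorn_distance(["cat", "sits"], ["dog", "runs"])
far = sinkhorn_distance(["cat", "sits"], ["stock", "falls"])
```

With these toy vectors, `near` should come out smaller than `far`, since "cat sits" and "dog runs" use words with nearby vectors.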
Highlighting matched and mismatched segments in translation memory output through sub-tree alignment
In recent years, it has become increasingly clear that the
localisation industry does not have the necessary manpower to
satisfy the increasing demand for high-quality translation. This
has fuelled the search for new and existing technologies that
would increase translator throughput. As Translation Memory (TM)
systems are the tool most commonly employed by translators, a
number of enhancements are available to assist them in their job.
One such enhancement would be to show the translator which parts
of the sentence that needs to be translated match which parts of
the fuzzy match suggested by the TM. For this information to be
used, however, the translators have to carry it over to the TM
translation themselves. In this paper, we present a novel
methodology that can automatically detect and highlight
the segments that need to be modified in a TM-suggested
translation. We base it on state-of-the-art sub-tree alignment
technology (Zhechev, 2010) that can produce aligned
phrase-based-tree pairs from unannotated data. Our system
operates in a three-step process. First, the fuzzy match
selected by the TM and its translation are aligned. This
lets us know which segments of the source-language sentence
correspond to which segments in its translation. In the
second step, the fuzzy match is aligned to the input sentence that is currently being translated. This tells us
which parts of the input sentence are available in the fuzzy
match and which still need to be translated. In the third
step, the fuzzy match is used as an intermediary, through
which the alignments between the input sentence and the TM
translation are established. In this way, we can detect with
precision the segments in the suggested translation that the
translator needs to edit and highlight them appropriately to
set them apart from the segments that are already good translations for parts of the input sentence. Additionally,
we can show the alignments, as detected by our system, between
the input and the translation, which will make it even easier for
the translator to post-edit the TM suggestion. This alignment
information can additionally be used to pre-translate the
mismatched segments, further reducing the post-editing load.
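The three-step process in this abstract comes down to composing two alignments through the fuzzy match. The sketch below shows that composition under simplifying assumptions: the alignments are taken as given, as plain dictionaries between segment strings, whereas the paper derives them with sub-tree alignment. The function names and the English-French example are hypothetical and chosen only for illustration.

```python
def compose_alignments(input_to_fuzzy, fuzzy_to_translation):
    """Link input segments to TM-translation segments via the fuzzy match.

    An input segment is linked only if it aligns to a fuzzy-match segment
    that in turn aligns to a segment of the TM translation.
    """
    composed = {}
    for input_seg, fuzzy_seg in input_to_fuzzy.items():
        if fuzzy_seg in fuzzy_to_translation:
            composed[input_seg] = fuzzy_to_translation[fuzzy_seg]
    return composed

def segments_to_edit(translation_segments, composed):
    """Return the translation segments that no input segment maps to."""
    matched = set(composed.values())
    return [seg for seg in translation_segments if seg not in matched]

# Toy data (assumed): input "Click the Save button",
# fuzzy match "Click the Cancel button",
# TM translation "Cliquez sur le bouton Annuler".
fuzzy_to_translation = {            # step 1: fuzzy match <-> its translation
    "Click": "Cliquez sur",
    "the button": "le bouton",
    "Cancel": "Annuler",
}
input_to_fuzzy = {                  # step 2: input <-> fuzzy match
    "Click": "Click",
    "the button": "the button",     # "Save" has no counterpart
}
composed = compose_alignments(input_to_fuzzy, fuzzy_to_translation)  # step 3
to_highlight = segments_to_edit(
    ["Cliquez sur", "le bouton", "Annuler"], composed
)
```

Here `to_highlight` holds the segments of the suggested translation that would be flagged for the translator, while `composed` holds the input-to-translation links that could be displayed alongside them.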
DAugNet: Unsupervised, Multi-source, Multi-target, and Life-long Domain Adaptation for Semantic Segmentation of Satellite Images
Domain adaptation of satellite images has recently gained increasing
attention as a way to overcome the limited generalization abilities of
machine learning models when segmenting large-scale satellite images. Most
existing approaches seek to adapt the model from one domain to another.
However, such a single-source, single-target setting prevents these methods
from being scalable solutions, since multiple source and target domains with
different data distributions are now usually available. Besides, the
continuous proliferation of satellite images requires classifiers to adapt
to continuously increasing data. We propose a novel approach, coined
DAugNet, for unsupervised, multi-source, multi-target, and life-long domain
adaptation of satellite images. It consists of a classifier and a data
augmentor. The data augmentor, which is a shallow network, is able to
perform style transfer between multiple satellite images in an unsupervised
manner, even when new data are added over time. In each training iteration,
it provides the classifier with diversified data, which makes the classifier
robust to large differences in data distribution between the domains. Our
extensive experiments show that DAugNet generalizes significantly better to
new geographic locations than the existing approaches.
Seeding statistical machine translation with translation memory output through tree-based structural alignment
With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator
throughput, with the current focus on the use of high-quality Statistical Machine Translation (SMT) as a supplement to the established Translation Memory (TM)
technology. In this paper we present a novel modular approach that utilises state-of-the-art sub-tree alignment to pick out pre-translated segments from a TM match and seed an SMT system with them to produce a final translation. We show that the presented system can outperform pure SMT when a good TM match is found. It can also be used in a Computer-Aided Translation (CAT) environment to present almost perfect translations to the human user, with markup highlighting the segments of the translation that need to be checked manually for correctness.