Data Augmentation via Dependency Tree Morphing for Low Resource Languages
Neural NLP systems achieve high scores in the presence of sizable training
datasets. The lack of such datasets leads to poor system performance in the
case of low-resource languages. We present two simple text augmentation
techniques using dependency trees, inspired by image processing. We crop
sentences by removing dependency links, and we rotate sentences by moving
tree fragments around the root. We apply these techniques to augment the
training sets of low-resource languages in the Universal Dependencies
project. We implement a character-level sequence tagging model and evaluate
the augmented datasets on the part-of-speech tagging task. We show that crop
and rotate provide improvements over models trained on non-augmented data for
the majority of the languages, especially those with rich case-marking
systems.
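The crop operation lends itself to a compact illustration. Below is a minimal Python sketch, assuming sentences are stored as (form, head-index) pairs; the representation and names are hypothetical, not the paper's released code.

```python
# Toy reconstruction of the "crop" operation: a sentence is a list of
# (form, head) pairs with 0-based head indices and -1 marking the root.

def descendants(heads, idx):
    """Collect idx and every token transitively headed by it."""
    keep = {idx}
    changed = True
    while changed:
        changed = False
        for i, h in enumerate(heads):
            if h in keep and i not in keep:
                keep.add(i)
                changed = True
    return keep

def crop(tokens, keep_child):
    """Keep the root plus one chosen child subtree, dropping other links."""
    heads = [h for _, h in tokens]
    root = heads.index(-1)
    kept = descendants(heads, keep_child) | {root}
    # Head indices still point at original positions; reindexing is omitted.
    return [tokens[i] for i in sorted(kept)]

sent = [("She", 1), ("wrote", -1), ("a", 3), ("letter", 1), ("yesterday", 1)]
print(crop(sent, 3))  # [('wrote', -1), ('a', 3), ('letter', 1)]
```

The rotate operation would analogously permute the subtrees attached to the root rather than dropping them.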
Benchmarking zero-shot and few-shot approaches for tokenization, tagging, and dependency parsing of Tagalog text
The grammatical analysis of texts in any human language typically involves a
number of basic processing tasks, such as tokenization, morphological tagging,
and dependency parsing. State-of-the-art systems can achieve high accuracy on
these tasks for languages with large datasets, but yield poor results for
languages such as Tagalog which have little to no annotated data. To address
this issue for the Tagalog language, we investigate the use of auxiliary data
sources for creating task-specific models in the absence of annotated Tagalog
data. We also explore the use of word embeddings and data augmentation to
improve performance when only a small amount of annotated Tagalog data is
available. We show that these zero-shot and few-shot approaches yield
substantial improvements on grammatical analysis of both in-domain and
out-of-domain Tagalog text compared to state-of-the-art supervised baselines.
Comment: To appear at PACLIC 2022. 10 pages, 2 figures, 4 tables.
Simple is Better! Lightweight Data Augmentation for Low Resource Slot Filling and Intent Classification
Neural-based models have achieved outstanding performance on slot filling and
intent classification, when fairly large in-domain training data are available.
However, as new domains are frequently added, creating sizeable data is
expensive. We show that lightweight augmentation, a set of augmentation methods
involving word span and sentence level operations, alleviates data scarcity
problems. Our experiments on limited data settings show that lightweight
augmentation yields significant performance improvement on slot filling on the
ATIS and SNIPS datasets, and achieves competitive performance with respect to
more complex, state-of-the-art, augmentation approaches. Furthermore,
lightweight augmentation is also beneficial when combined with pre-trained
LM-based models, as it improves BERT-based joint intent and slot filling
models.
Comment: Accepted at PACLIC 2020 - The 34th Pacific Asia Conference on Language, Information and Computation.
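As a concrete picture of what a word-span operation can look like, here is a hedged Python sketch of slot-value substitution under BIO labels; the value inventory and helper names are illustrative, not taken from the paper.

```python
import random

# Replace a slot's value with another value of the same slot type,
# keeping the BIO label sequence consistent with the new span length.

SLOT_VALUES = {"city": [["new", "york"], ["denver"]]}  # toy inventory

def substitute_slot(tokens, labels, slot="city"):
    """Swap the first B-<slot>/I-<slot> span for a sampled replacement."""
    if f"B-{slot}" not in labels:
        return tokens, labels
    start = labels.index(f"B-{slot}")
    end = start + 1
    while end < len(labels) and labels[end] == f"I-{slot}":
        end += 1
    value = random.choice(SLOT_VALUES[slot])
    span_labels = [f"B-{slot}"] + [f"I-{slot}"] * (len(value) - 1)
    return (tokens[:start] + value + tokens[end:],
            labels[:start] + span_labels + labels[end:])

toks = ["flights", "to", "boston", "tomorrow"]
labs = ["O", "O", "B-city", "O"]
print(substitute_slot(toks, labs))
# e.g. (['flights', 'to', 'new', 'york', 'tomorrow'],
#       ['O', 'O', 'B-city', 'I-city', 'O'])
```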
Data Augmentation for Machine Translation via Dependency Subtree Swapping
We present a generic framework for data augmentation via dependency subtree
swapping that is applicable to machine translation. We extract corresponding
subtrees from the dependency parse trees of the source and target sentences and
swap these across bisentences to create augmented samples. We perform thorough
filtering based on graph-based similarities of the dependency trees and
additional heuristics to ensure that extracted subtrees correspond to the same
meaning. We conduct resource-constrained experiments on 4 language pairs in
both directions using the IWSLT text translation datasets and the Hunglish2
corpus. The results demonstrate consistent improvements in BLEU score over our
baseline models in 3 out of 4 language pairs. Our code is available on GitHub.
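The swapping step itself reduces to splicing aligned spans, as the following hedged Python sketch shows; parsing, word alignment, and the graph-based similarity filtering are assumed done upstream, and all data, spans, and names are illustrative.

```python
# Each bisentence is (src, tgt, src_span, tgt_span) with character spans
# marking the extracted subtree on each side.

def splice(text, span, replacement):
    start, end = span
    return text[:start] + replacement + text[end:]

def swap_subtrees(pair_a, pair_b):
    """Exchange the marked source/target spans between two bisentences."""
    src_a, tgt_a, ss_a, ts_a = pair_a
    src_b, tgt_b, ss_b, ts_b = pair_b
    new_a = (splice(src_a, ss_a, src_b[ss_b[0]:ss_b[1]]),
             splice(tgt_a, ts_a, tgt_b[ts_b[0]:ts_b[1]]))
    new_b = (splice(src_b, ss_b, src_a[ss_a[0]:ss_a[1]]),
             splice(tgt_b, ts_b, tgt_a[ts_a[0]:ts_a[1]]))
    return new_a, new_b

a = ("I saw the old house yesterday", "Tegnap láttam a régi házat",
     (6, 19), (14, 26))
b = ("I saw a small dog today", "Ma láttam egy kis kutyát",
     (6, 17), (10, 24))
print(swap_subtrees(a, b))
# (('I saw a small dog yesterday', 'Tegnap láttam egy kis kutyát'),
#  ('I saw the old house today', 'Ma láttam a régi házat'))
```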
xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
We introduce a new proxy score for evaluating bitext mining based on
similarity in a multilingual embedding space: xSIM++. In comparison to xSIM,
this improved proxy leverages rule-based approaches to extend English sentences
in any evaluation set with synthetic, hard-to-distinguish examples which more
closely mirror the scenarios we encounter during large-scale mining. We
validate this proxy by running a significant number of bitext mining
experiments for a set of low-resource languages, and subsequently train NMT
systems on the mined data. In comparison to xSIM, we show that xSIM++ is better
correlated with the downstream BLEU scores of translation systems trained on
mined bitexts, providing a reliable proxy of bitext mining performance without
needing to run expensive bitext mining pipelines. xSIM++ also reports
performance for different error types, offering more fine-grained feedback for
model development.
Comment: The first two authors contributed equally; ACL 2023 short. Code and data are available at https://github.com/facebookresearch/LASER
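To illustrate how rule-based perturbations can yield hard-to-distinguish negatives, here is a hedged Python sketch with two toy rules (number shifting and polarity toggling); the paper's actual rule set is richer, and everything below is an assumption for illustration only.

```python
import re

# Turn a reference sentence into perturbed variants that stay lexically
# close but change the meaning, mimicking hard negatives for mining.

def perturb_numbers(sent):
    """Shift every digit string by one, a minimal meaning change."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), sent)

def perturb_polarity(sent):
    """Toggle a simple negation as a crude meaning-changing edit."""
    if " not " in sent:
        return sent.replace(" not ", " ", 1)
    return re.sub(r"\b(is|are|was|were)\b", r"\1 not", sent, count=1)

def hard_negatives(sent):
    candidates = {perturb_numbers(sent), perturb_polarity(sent)}
    return [c for c in candidates if c != sent]  # drop no-op perturbations

print(hard_negatives("The meeting was moved to room 204."))
# e.g. ['The meeting was moved to room 205.',
#       'The meeting was not moved to room 204.']
```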