Investigating Language Impact in Bilingual Approaches for Computational Language Documentation
For endangered languages, data collection campaigns must accommodate the
challenge that many of these languages are of oral tradition, and producing
transcriptions is costly. It is therefore fundamental to translate the
recordings into a widely spoken language to ensure their interpretability. In
this paper we investigate how the choice of translation language affects
subsequent documentation work and the automatic approaches that may later
operate on the produced bilingual corpus. To answer this question, we use
the MaSS multilingual speech corpus (Boito et al., 2020) for creating 56
bilingual pairs that we apply to the task of low-resource unsupervised word
segmentation and alignment. Our results highlight that the choice of language
for translation influences the word segmentation performance, and that
different lexicons are learned by using different aligned translations. Lastly,
this paper proposes a hybrid approach for bilingual word segmentation,
combining boundary clues extracted from a non-parametric Bayesian model
(Goldwater et al., 2009a) with the attentional word segmentation neural model
from Godard et al. (2018). Our results suggest that incorporating these clues
into the neural models' input representation increases their translation and
alignment quality, especially for challenging language pairs.
Comment: Accepted to the 1st Joint SLTU and CCURL Workshop
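The hybrid approach above hinges on feeding the Bayesian model's boundary clues into the neural model's input representation. As an illustrative sketch only (the function and feature scheme below are hypothetical, not the paper's actual implementation), one simple way to inject such clues is to append a binary boundary feature to each symbol's embedding:

```python
# Hypothetical sketch: augment per-symbol input vectors with a binary
# boundary clue, one way to combine a Bayesian segmenter's hypotheses
# with a neural segmentation model's input representation.

def augment_with_boundary_clues(embeddings, boundary_positions):
    """Append a 0/1 boundary feature to each symbol embedding.

    embeddings: list of per-symbol feature vectors (lists of floats)
    boundary_positions: set of indices where the Bayesian model
        hypothesises a word boundary *after* the symbol
    """
    augmented = []
    for i, vec in enumerate(embeddings):
        clue = 1.0 if i in boundary_positions else 0.0
        augmented.append(vec + [clue])
    return augmented

symbols = ["t", "u", "k", "a", "i"]
embeddings = [[0.1 * i, 0.2 * i] for i in range(len(symbols))]
out = augment_with_boundary_clues(embeddings, {1, 4})
print(out)
```

The neural model then sees, for every symbol, both its learned embedding and the external segmenter's vote, leaving it free to trust or override the clue during training.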
Improving Tokenisation by Alternative Treatment of Spaces
Tokenisation is the first step in almost all NLP tasks, and state-of-the-art
transformer-based language models all use subword tokenisation algorithms to
process input text. Existing algorithms have problems, often producing
tokenisations of limited linguistic validity, and representing equivalent
strings differently depending on their position within a word. We hypothesise
that these problems hinder the ability of transformer-based models to handle
complex words, and suggest that these problems are a result of allowing tokens
to include spaces. We thus experiment with an alternative tokenisation approach
where spaces are always treated as individual tokens. Specifically, we apply
this modification to the BPE and Unigram algorithms. We find that our modified
algorithms lead to improved performance on downstream NLP tasks that involve
handling complex words, whilst having no detrimental effect on performance in
general natural language understanding tasks. Intrinsically, we find our
modified algorithms give more morphologically correct tokenisations, in
particular when handling prefixes. Given the results of our experiments, we
advocate for always treating spaces as individual tokens as an improved
tokenisation method.
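The modification described above can be illustrated with a minimal pre-tokenisation sketch (a hypothetical helper, not the authors' code): before BPE or Unigram is applied, the text is split so that every space becomes its own token, and word chunks carry no space information at all:

```python
import re

def space_aware_pretokenise(text):
    """Split text so that every space is an individual token; the
    remaining chunks go to the subword learner without space markers."""
    # The capturing group makes re.split keep the space separators.
    return [t for t in re.split(r"( )", text) if t]

print(space_aware_pretokenise("unhappiness is rare"))
# ['unhappiness', ' ', 'is', ' ', 'rare']
```

Because no subword piece carries a leading-space marker under this scheme, a string such as "happiness" receives the same representation whether it begins a word or follows a prefix, which is exactly the positional inconsistency the abstract criticises in existing algorithms.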
How does the brain represent language and answer questions? Using an AI system to understand the underlying neurobiological mechanisms
To understand the computations that underlie high-level cognitive processes, we propose a framework of mechanisms that could in principle implement START, an AI program that answers questions using natural language. START organizes a sentence into a series of triplets, each containing three elements (subject, verb, object). We propose that the brain similarly defines triplets and then chunks the three elements into a spatial pattern. A complete sentence can be represented using up to 7 triplets in a working memory buffer organized by theta and gamma oscillations. This buffer can transfer information into long-term memory networks, where a second chunking operation converts the serial triplets into a single spatial pattern in a network, with each triplet (and its corresponding elements) represented in specialized subregions. The triplets that define a sentence become synaptically linked, thereby encoding the sentence in synaptic weights. When a question is posed, there is a search for the closest stored memory (the one having the greatest number of shared triplets). We have devised a search process that does not require the question and the stored memory to have the same number of triplets or to have triplets in the same order. Once the most similar memory is recalled and undergoes two-level dechunking, the sought-for information can be obtained by element-by-element comparison of the key triplet in the question to the corresponding triplet in the retrieved memory. This search may require a reordering to align corresponding triplets, the use of pointers that link different triplets, or the use of semantic memory. Our framework uses 12 network processes; existing models can implement many of these, but in other cases we can only suggest neural implementations. Overall, our scheme provides the first view of how language-based question answering could be implemented by the brain.
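The shared-triplet search described above can be sketched in a few lines, assuming sentences have already been parsed into (subject, verb, object) tuples; the names and toy sentences below are hypothetical, not part of the paper:

```python
def shared_triplets(a, b):
    """Count triplets common to two sentences, ignoring order and
    allowing the two sentences to have different numbers of triplets."""
    return len(set(a) & set(b))

def retrieve(question_triplets, memory):
    """Return the stored sentence whose triplet set overlaps the
    question's triplet set the most (the 'closest stored memory')."""
    return max(memory, key=lambda m: shared_triplets(question_triplets, m))

memory = [
    [("cat", "chased", "mouse"), ("mouse", "ate", "cheese")],
    [("dog", "chased", "ball")],
]
question = [("mouse", "ate", "what"), ("cat", "chased", "mouse")]
best = retrieve(question, memory)
print(best)
```

Using set intersection makes the search order-independent and length-independent, matching the abstract's requirement; the subsequent element-by-element comparison of the key triplet would then run only on the single retrieved memory.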
Evaluating Open-Domain Dialogues in Latent Space with Next Sentence Prediction and Mutual Information
The long-standing one-to-many issue of open-domain dialogues poses
significant challenges for automatic evaluation methods, i.e., there may be
multiple suitable responses, differing in semantics, for a given
conversational context. To tackle this challenge, we propose a novel
learning-based automatic evaluation metric (CMN), which can robustly evaluate
open-domain dialogues by augmenting Conditional Variational Autoencoders
(CVAEs) with a Next Sentence Prediction (NSP) objective and employing Mutual
Information (MI) to model the semantic similarity of text in the latent space.
Experimental results on two open-domain dialogue datasets demonstrate the
superiority of our method compared with a wide range of baselines, especially
in handling responses which are semantically distant from the golden reference
responses.
Comment: Accepted at ACL202
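For illustration only: the paper's metric relies on mutual information in a CVAE latent space, but the underlying idea of judging a response against a context through latent vectors can be sketched with a plain cosine-similarity stand-in (all names and numbers below are invented, and cosine is explicitly a substitute for the paper's MI-based score):

```python
import math

def cosine(u, v):
    """Cosine similarity between two latent vectors: a simple stand-in
    for a learned similarity score in a shared latent space."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

context_z = [0.2, 0.9, -0.1]   # hypothetical latent code of the context
response_z = [0.25, 0.8, 0.0]  # hypothetical latent code of a response
print(cosine(context_z, response_z))
```

Scoring in latent space rather than surface-token space is what lets such a metric reward responses that are suitable yet lexically far from the single golden reference.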
Modular and Parameter-Efficient Multimodal Fusion with Prompting
Recent research has made impressive progress in large-scale multimodal pre-training. Given the rapid growth of model size, it is necessary to seek efficient and flexible methods other than fine-tuning. In this paper, we propose to use prompt vectors to align the modalities. Our method achieves performance comparable to several other multimodal fusion methods in low-resource settings. We further show that our method is modular and parameter-efficient for processing tasks involving two or more data modalities.
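A minimal sketch of prompt-based fusion, under the assumption of frozen unimodal encoders whose output sequences are concatenated behind a handful of trainable prompt vectors (all names and shapes here are hypothetical, not the paper's architecture):

```python
def fuse_with_prompts(prompt_vectors, text_seq, image_seq):
    """Prepend trainable prompt vectors to the concatenated modality
    sequences; in training, only the prompts would receive gradients,
    which is what makes the scheme parameter-efficient."""
    return prompt_vectors + text_seq + image_seq

prompts = [[0.0, 0.0], [0.1, 0.1]]  # 2 trainable prompt vectors
text = [[1.0, 0.0]]                 # frozen text-encoder output
image = [[0.0, 1.0]]                # frozen image-encoder output
fused = fuse_with_prompts(prompts, text, image)
print(len(fused))
```

Modularity follows from the same signature: adding a third modality is just one more sequence appended to the concatenation, with no change to the frozen encoders.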
Flow-Adapter Architecture for Unsupervised Machine Translation
In this work, we propose a flow-adapter architecture for unsupervised NMT. It leverages normalizing flows to explicitly model the distributions of sentence-level latent representations, which are subsequently used in conjunction with the attention mechanism for the translation task. The primary novelties of our model are: (a) capturing language-specific sentence representations separately for each language using normalizing flows and (b) using a simple transformation of these latent representations for translating from one language to another. This architecture allows for unsupervised training of each language independently. While there is prior work on latent variables for supervised MT, to the best of our knowledge, this is the first work that uses latent variables and normalizing flows for unsupervised MT. We obtain competitive results on several unsupervised MT benchmarks.
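The "simple transformation of latent representations" can be illustrated with the most basic invertible flow step, an element-wise affine map (a toy sketch with invented parameters, not the paper's flow architecture):

```python
def affine_flow(z, scale, shift):
    """One element-wise affine step: the simplest invertible
    transformation used as a building block in normalizing flows."""
    return [s * x + b for x, s, b in zip(z, scale, shift)]

def inverse_affine_flow(y, scale, shift):
    """Exact inverse of affine_flow, recovering the source latent."""
    return [(x - b) / s for x, s, b in zip(y, scale, shift)]

z_src = [0.5, -1.0]            # hypothetical source-language latent
scale, shift = [2.0, 0.5], [0.1, -0.2]
z_tgt = affine_flow(z_src, scale, shift)
z_back = inverse_affine_flow(z_tgt, scale, shift)
print(z_tgt, z_back)
```

Invertibility is the key property: it lets each language's latent distribution be trained independently while still permitting a deterministic mapping between the two latent spaces at translation time.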
CoDA21: Evaluating Language Understanding Capabilities of NLP Models With Context-Definition Alignment
Pretrained language models (PLMs) have achieved superhuman performance on many benchmarks, creating a need for harder tasks. We introduce CoDA21 (Context Definition Alignment), a challenging benchmark that measures natural language understanding (NLU) capabilities of PLMs: given a definition and a context each for k words, but not the words themselves, the task is to align the k definitions with the k contexts. CoDA21 requires a deep understanding of contexts and definitions, including complex inference and world knowledge. We find that there is a large gap between human and PLM performance, suggesting that CoDA21 measures an aspect of NLU that is not sufficiently covered in existing benchmarks.
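The alignment task can be made concrete with a brute-force toy solver (the word-overlap scorer and all example strings below are invented for illustration; the actual benchmark scores alignments with PLMs, not lexical overlap):

```python
from itertools import permutations

def best_alignment(contexts, definitions, score):
    """Exhaustively try every assignment of k definitions to k contexts
    and return the permutation with the highest total score."""
    best, best_total = None, float("-inf")
    for perm in permutations(range(len(definitions))):
        total = sum(score(c, definitions[j]) for c, j in zip(contexts, perm))
        if total > best_total:
            best, best_total = perm, total
    return best

def overlap(context, definition):
    """Toy score: number of words shared by context and definition."""
    return len(set(context.split()) & set(definition.split()))

contexts = [
    "he drew the bow across the strings",   # hides 'violin'
    "they crossed the lake in it",          # hides 'boat'
]
definitions = [
    "a stringed musical instrument played with a bow",
    "a small vessel for travelling on water",
]
print(best_alignment(contexts, definitions, overlap))
```

Note that neither context contains the hidden word itself, which is what forces a solver to rely on inference and world knowledge rather than string matching; the lexical-overlap scorer here succeeds only because the toy examples were built to share a clue word.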