597 research outputs found
Integrating Weakly Supervised Word Sense Disambiguation into Neural Machine Translation
This paper demonstrates that word sense disambiguation (WSD) can improve
neural machine translation (NMT) by widening the source context considered when
modeling the senses of potentially ambiguous words. We first introduce three
adaptive clustering algorithms for WSD, based on k-means, Chinese restaurant
processes, and random walks, which are then applied to large word contexts
represented in a low-rank space and evaluated on SemEval shared-task data. We
then learn word vectors jointly with sense vectors defined by our best WSD
method, within a state-of-the-art NMT system. We show that the concatenation of
these vectors, and the use of a sense selection mechanism based on the weighted
average of sense vectors, outperforms several baselines including sense-aware
ones. This is demonstrated by translation on five language pairs. The
improvements are above one BLEU point over strong NMT baselines, +4% accuracy
over all ambiguous nouns and verbs, or +20% when scored manually over several
challenging words.Comment: To appear in TAC
Word Sense Disambiguation Based on Large Scale Polish CLARIN Heterogeneous Lexical Resources
Word Sense Disambiguation Based on Large Scale Polish CLARIN Heterogeneous Lexical Resources
Lexical resources can be applied in many different Natural Language Engineering tasks, but the most fundamental task is the recognition of word senses used in text contexts. The problem is difficult, not yet fully solved and different lexical resources provided varied support for it. Polish CLARIN lexical semantic resources are based on the plWordNet — a very large wordnet for Polish — as a central structure which is a basis for linking together several resources of different types. In this paper, several Word Sense Disambiguation (henceforth WSD) methods developed for Polish that utilise plWordNet are discussed. Textual sense descriptions in the traditional lexicon can be compared with text contexts using Lesk’s algorithm in order to find best matching senses. In the case of a wordnet, lexico-semantic relations provide the main description of word senses. Thus, first, we adapted and applied to Polish a WSD method based on the Page Rank. According to it, text words are mapped on their senses in the plWordNet graph and Page Rank algorithm is run to find senses with the highest scores. The method presents results lower but comparable to those reported for English. The error analysis showed that the main problems are: fine grained sense distinctions in plWordNet and limited number of connections between words of different parts of speech. In the second approach plWordNet expanded with the mapping onto the SUMO ontology concepts was used. Two scenarios for WSD were investigated: two step disambiguation and disambiguation based on combined networks of plWordNet and SUMO. In the former scenario, words are first assigned SUMO concepts and next plWordNet senses are disambiguated. In latter, plWordNet and SUMO are combined in one large network used next for the disambiguation of senses. The additional knowledge sources used in WSD improved the performance. The obtained results and potential further lines of developments were discussed
A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational
linguistics. In the era of powerful pre-trained language models and large
language models, the advancement of research in this domain appears to be
decelerating. However, the study of semantics is multi-dimensional in
linguistics. The research depth and breadth of computational semantic
processing can be largely improved with new technologies. In this survey, we
analyzed five semantic processing tasks, e.g., word sense disambiguation,
anaphora resolution, named entity recognition, concept extraction, and
subjectivity detection. We study relevant theoretical research in these fields,
advanced methods, and downstream applications. We connect the surveyed tasks
with downstream applications because this may inspire future scholars to fuse
these low-level semantic processing tasks with high-level natural language
processing tasks. The review of theoretical research may also inspire new tasks
and technologies in the semantic processing domain. Finally, we compare the
different semantic processing techniques and summarize their technical trends,
application trends, and future directions.Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN
1566-2535. The equal contribution mark is missed in the published version due
to the publication policies. Please contact Prof. Erik Cambria for detail
Towards Effective Disambiguation for Machine Translation with Large Language Models
Resolving semantic ambiguity has long been recognised as a central challenge
in the field of Machine Translation. Recent work on benchmarking translation
performance on ambiguous sentences has exposed the limitations of conventional
Neural Machine Translation (NMT) systems, which fail to handle many such cases.
Large language models (LLMs) have emerged as a promising alternative,
demonstrating comparable performance to traditional NMT models while
introducing new paradigms for controlling the target outputs. In this paper, we
study the capabilities of LLMs to translate "ambiguous sentences" - i.e. those
containing highly polysemous words and/or rare word senses. We also propose two
ways to improve their disambiguation capabilities, through a) in-context
learning and b) fine-tuning on carefully curated ambiguous datasets.
Experiments show that our methods can match or outperform state-of-the-art
systems such as DeepL and NLLB in four out of five language directions. Our
research provides valuable insights into effectively adapting LLMs to become
better disambiguators during Machine Translation. We release our curated
disambiguation corpora and resources at
https://data.statmt.org/ambiguous-europarl.Comment: WMT 202
Code-Switching with Word Senses for Pretraining in Neural Machine Translation
Lexical ambiguity is a significant and pervasive challenge in Neural Machine
Translation (NMT), with many state-of-the-art (SOTA) NMT systems struggling to
handle polysemous words (Campolungo et al., 2022). The same holds for the NMT
pretraining paradigm of denoising synthetic "code-switched" text (Pan et al.,
2021; Iyer et al., 2023), where word senses are ignored in the noising stage --
leading to harmful sense biases in the pretraining data that are subsequently
inherited by the resulting models. In this work, we introduce Word Sense
Pretraining for Neural Machine Translation (WSP-NMT) - an end-to-end approach
for pretraining multilingual NMT models leveraging word sense-specific
information from Knowledge Bases. Our experiments show significant improvements
in overall translation quality. Then, we show the robustness of our approach to
scale to various challenging data and resource-scarce scenarios and, finally,
report fine-grained accuracy improvements on the DiBiMT disambiguation
benchmark. Our studies yield interesting and novel insights into the merits and
challenges of integrating word sense information and structured knowledge in
multilingual pretraining for NMT.Comment: EMNLP (Findings) 2023 Long Pape
Towards Effective Disambiguation for Machine Translation with Large Language Models
Resolving semantic ambiguity has long been recognised as a central challenge in the field of Machine Translation. Recent work on benchmarking translation performance on ambiguous sentences has exposed the limitations of conventional Neural Machine Translation (NMT) systems, which fail to handle many such cases. Large language models (LLMs) have emerged as a promising alternative, demonstrating comparable performance to traditional NMT models while introducing new paradigms for controlling the target outputs. In this paper, we study the capabilities of LLMs to translate ``ambiguous sentences'' - i.e. those containing highly polysemous words and/or rare word senses. We also propose two ways to improve their disambiguation capabilities, through a) in-context learning and b) fine-tuning on carefully curated ambiguous datasets. Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions. Our research provides valuable insights into effectively adapting LLMs to become better disambiguators during Machine Translation. We release our curated disambiguation corpora and resources at https://data.statmt.org/ambiguous-europarl
Inducing Multilingual Text Analysis Tools Using Bidirectional Recurrent Neural Networks
International audienceThis work focuses on the rapid development of linguistic annotation tools for resource-poor languages. We experiment several cross-lingual annotation projection methods using Recurrent Neural Networks (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between the source and target language. More precisely, our method has the following characteristics: (a) it does not use word alignment information, (b) it does not assume any knowledge about foreign languages, which makes it applicable to a wide range of resource-poor languages, (c) it provides truly multilingual taggers. We investigate both uni-and bi-directional RNN models and propose a method to include external information (for instance low level information from POS) in the RNN to train higher level taggers (for instance, super sense taggers). We demonstrate the validity and genericity of our model by using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual POS and super sense taggers
PersoNER: Persian named-entity recognition
© 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To abridge this gap, in this paper we target the Persian language that is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPerosNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNNL scores while outperforming two alternatives based on a CRF and a recurrent neural network
AMuSE-WSD: an all-in-one multilingual system for easy word sense disambiguation
Over the past few years, Word Sense Disambiguation (WSD) has received renewed interest: recently proposed systems have shown the remarkable effectiveness of deep learning techniques in this task, especially when aided by modern pretrained language models. Unfortunately, such systems are still not available as ready-to-use end-to-end packages, making it difficult for researchers to take advantage of their performance. The only alternative for a user interested in applying WSD to downstream tasks is to rely on currently available end-to-end WSD systems, which, however, still rely on graph-based heuristics or non-neural machine learning algorithms. In this paper, we fill this gap and propose AMuSE-WSD, the first end-to-end system to offer high-quality sense information in 40 languages through a state-of-the-art neural model for WSD. We hope that AMuSE-WSD will provide a stepping stone for the integration of meaning into real-world applications and encourage further studies in lexical semantics. AMuSE-WSD is available online at http://nlp.uniroma1.it/amuse-ws
Knowledge-based approaches to producing large-scale training data from scratch for Word Sense Disambiguation and Sense Distribution Learning
Communicating and understanding each other is one of the most important human abilities.
As humans, in fact, we can easily assign the correct meaning to the ambiguous words in a text, while, at the same time, being able to abstract, summarise and enrich its content with new information that we learned somewhere else.
On the contrary, machines rely on formal languages which do not leave space to ambiguity hence being easy to parse and understand.
Therefore, to fill the gap between humans and machines and enabling the latter to better communicate with and comprehend its sentient counterpart, in the modern era of computer-science's much effort has been put into developing Natural Language Processing (NLP) approaches which aim at understanding and handling the ambiguity of the human language.
At the core of NLP lies the task of correctly interpreting the meaning of each word in a given text, hence disambiguating its content exactly as a human would do.
Researchers in the Word Sense Disambiguation (WSD) field address exactly this issue by leveraging either knowledge bases, i.e. graphs where nodes are concept and edges are semantic relations among them, or manually-annotated datasets for training machine learning algorithms. One common obstacle is the knowledge acquisition bottleneck problem, id est, retrieving or generating semantically-annotated data which are necessary to build both semantic graphs or training sets is a complex task.
This phenomenon is even more serious when considering languages other than English where resources to generate human-annotated data are scarce and ready-made datasets are completely absent. With the advent of deep learning this issue became even more serious as more complex models need larger datasets in order to learn meaningful patterns to solve the task.
Another critical issue in WSD, as well as in other machine-learning-related fields, is the domain adaptation problem, id est, performing the same task in different application domains.
This is particularly hard when dealing with word senses, as, in fact, they are governed by a Zipfian distribution; hence, by slightly changing the application domain, a sense might become very frequent even though it is very rare in the general domain.
For example the geometric sense of plane is very frequent in a corpus made of math books, while it is very rare in a general domain dataset.
In this thesis we address both these problems. Inter alia, we focus on relieving the burden of human annotations in Word Sense Disambiguation thus enabling the automatic construction of high-quality sense-annotated dataset not only for English, but especially for other languages where sense-annotated data are not available at all.
Furthermore, recognising in word-sense distribution one of the main pitfalls for WSD approaches, we also alleviate the dependency on most frequent sense information by automatically inducing the word-sense distribution in a given text of raw sentences.
In the following we propose a language-independent and automatic approach to generating semantic annotations given a collection of sentences, and then introduce two methods for the automatic inference of word-sense distributions.
Finally, we combine the two kind of approaches to build a semantically-annotated dataset that reflect the sense distribution which we automatically infer from the target text
- …