Deep Clustering for Data Cleaning and Integration
Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the potential of DC for data management tasks remains unexplored. In this paper, we address this gap by investigating the suitability of DC for data cleaning and integration tasks, specifically schema inference, entity resolution and domain discovery, from the perspective of tables, rows and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. Experiments also show consistently strong performance compared with state-of-the-art bespoke algorithms for each of the data integration tasks.
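To make the setup concrete, here is a minimal sketch of the pretrain-then-cluster flavor of DC, assuming PyTorch and scikit-learn: an autoencoder learns a latent representation, and k-means then clusters in that space instead of the raw feature space. Many DC methods optimize the representation and the clustering jointly; the architecture and hyperparameters below are illustrative only, not the paper's settings.

```python
# Minimal deep-clustering sketch: learn a representation, then cluster in it.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class AutoEncoder(nn.Module):
    def __init__(self, in_dim: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def deep_cluster(features: torch.Tensor, n_clusters: int, epochs: int = 50):
    model = AutoEncoder(features.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):                    # representation learning
        opt.zero_grad()
        recon, _ = model(features)
        nn.functional.mse_loss(recon, features).backward()
        opt.step()
    with torch.no_grad():                      # cluster in the latent space
        _, z = model(features)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z.numpy())
```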
FIREBALL: A Dataset of Dungeons and Dragons Actual-Play with Structured Game State Information
Dungeons & Dragons (D&D) is a tabletop roleplaying game with complex natural
language interactions between players and hidden state information. Recent work
has shown that large language models (LLMs) that have access to state
information can generate higher quality game turns than LLMs that use dialog
history alone. However, previous work used game state information that was
heuristically created and was not a true gold standard game state. We present
FIREBALL, a large dataset containing nearly 25,000 unique sessions from real
D&D gameplay on Discord with true game state information. We recorded gameplay
sessions of players who used the Avrae bot, which was developed to aid people
in playing D&D online, capturing language, game commands and underlying game
state information. We demonstrate that FIREBALL can improve natural language
generation (NLG) by using Avrae state information, improving both automated
metrics and human judgments of quality. Additionally, we show that LLMs can
generate executable Avrae commands, particularly after fine-tuning.
Comment: 21 pages, 2 figures. Accepted at ACL 2023
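As a hedged sketch of how structured state information can be combined with dialog history in the generation input, the snippet below serializes a game-state dictionary ahead of the dialog; the field names (actor, hp, targets) are illustrative placeholders, not FIREBALL's actual schema or the paper's prompt format.

```python
# Illustrative prompt construction with structured game state (placeholder
# fields; not FIREBALL's schema or the paper's prompt format).
def build_prompt(dialog_history: list[str], game_state: dict) -> str:
    state_lines = [f"{k}: {v}" for k, v in game_state.items()]
    return ("[State]\n" + "\n".join(state_lines) + "\n"
            "[Dialog]\n" + "\n".join(dialog_history) + "\n"
            "[Next turn]\n")

prompt = build_prompt(
    ["DM: The goblin snarls and raises its blade.",
     "Player: I cast Fire Bolt!"],
    {"actor": "Player", "hp": "27/31", "targets": "goblin (AC 15)"},
)
```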
Evaluating automated and hybrid neural disambiguation for African historical named entities
Documents detailing South African history contain ambiguous names. Ambiguous names may be due to people having the same name or the same person being referred to by multiple different names. Thus, when searching for or attempting to extract information about a particular person, the name used may affect the results. This problem may be alleviated by using a Named Entity Disambiguation (NED) system to disambiguate names by linking them to a knowledge base. In recent years, transformer-based language models have led to improvements in NED systems. Furthermore, multilingual language models have shown the ability to learn concepts across languages, reducing the amount of training data required for low-resource languages. Thus, a multilingual language model-based NED system was developed to disambiguate people's names within a historical South African context using documents written in English and isiZulu from the Five Hundred Year Archive (FHYA). The multilingual language model-based system substantially improved on a probability-based baseline and achieved a micro F1-score of 0.726. At the same time, the entity linking component was able to link 81.9% of the mentions to the correct entity. However, the system's performance on documents written in isiZulu was significantly lower than on the documents written in English. Thus, the system was augmented with handcrafted rules to improve its performance. The addition of handcrafted rules resulted in a small but significant improvement in performance when compared to the unaugmented NED system.
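A minimal sketch of the linking step is shown below, assuming a bi-encoder setup with the sentence-transformers library: the mention in context and each knowledge-base entry are embedded with a multilingual encoder and linked by cosine similarity. The model name is an assumption for illustration; the paper's exact architecture and its handcrafted rules are not reproduced here.

```python
# Illustrative embedding-based entity linking (model name is an assumption).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def link(mention_in_context: str, kb_descriptions: list[str]) -> int:
    m = encoder.encode(mention_in_context, convert_to_tensor=True)
    e = encoder.encode(kb_descriptions, convert_to_tensor=True)
    return int(util.cos_sim(m, e).argmax())  # index of best-matching entity
```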
Tailoring Domain Adaptation for Machine Translation Quality Estimation
While quality estimation (QE) can play an important role in the translation process, its effectiveness relies on the availability and quality of training data. For QE in particular, high-quality labeled data is often lacking due to the high cost and effort associated with labeling such data. Aside from the data scarcity challenge, QE models should also be generalizable, i.e., they should be able to handle data from different domains, both generic and specific. To alleviate these two main issues -- data scarcity and domain mismatch -- this paper combines domain adaptation and data augmentation within a robust QE system. Our method first trains a generic QE model and then fine-tunes it on a specific domain while retaining generic knowledge. Our results show a significant improvement for all the language pairs investigated, better cross-lingual inference, and superior performance in zero-shot learning scenarios compared to state-of-the-art baselines.
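The two-stage recipe can be sketched schematically as below, with stage two using a smaller learning rate so that generic knowledge is less likely to be overwritten; the model, data loaders, loss, and learning rates are placeholders for illustration, not the paper's implementation.

```python
# Schematic two-stage QE training: generic pretraining, then domain
# fine-tuning at a lower learning rate (all components are placeholders).
import torch

def two_stage_train(model, generic_loader, domain_loader, loss_fn):
    # Stage 1: train on generic QE data
    opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
    for src, mt, score in generic_loader:
        opt.zero_grad()
        loss_fn(model(src, mt), score).backward()
        opt.step()
    # Stage 2: fine-tune on in-domain data with a smaller learning rate
    opt = torch.optim.AdamW(model.parameters(), lr=5e-6)
    for src, mt, score in domain_loader:
        opt.zero_grad()
        loss_fn(model(src, mt), score).backward()
        opt.step()
    return model
```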
T-Projection: High Quality Annotation Projection for Sequence Labeling Tasks
In the absence of readily available labeled data for a given sequence
labeling task and language, annotation projection has been proposed as one of
the possible strategies to automatically generate annotated data. Annotation
projection has often been formulated as the task of transporting, on parallel
corpora, the labels pertaining to a given span in the source language into its
corresponding span in the target language. In this paper we present
T-Projection, a novel approach for annotation projection that leverages large
pretrained text-to-text language models and state-of-the-art machine
translation technology. T-Projection decomposes the label projection task into
two subtasks: (i) a candidate generation step, in which a set of projection
candidates is generated using a multilingual T5 model, and (ii) a candidate
selection step, in which the generated candidates are ranked based on
translation probabilities. We conducted experiments on intrinsic and extrinsic
tasks in 5 Indo-European and 8 low-resource African languages. We demonstrate
that T-Projection outperforms previous annotation projection methods by a wide
margin. We believe that T-Projection can help to automatically alleviate the
lack of high-quality training data for sequence labeling tasks. Code and data
are publicly available.
Comment: Findings of the EMNLP 2023
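The candidate selection step can be illustrated as below, scoring each target-side candidate by its log-likelihood under an off-the-shelf MT model from Hugging Face transformers; the en-de model choice and the scoring details are assumptions for illustration, not the paper's setup.

```python
# Illustrative candidate ranking by translation probability (not the paper's
# exact models or scoring). Requires a recent transformers version.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
mt = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

def translation_logprob(src: str, tgt: str) -> float:
    enc = tok(src, return_tensors="pt")
    labels = tok(text_target=tgt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = mt(**enc, labels=labels)
    return -out.loss.item() * labels.shape[1]  # approx. summed log-prob

def select_candidate(source_span: str, candidates: list[str]) -> str:
    scores = [translation_logprob(source_span, c) for c in candidates]
    return max(zip(scores, candidates))[1]     # highest-scoring candidate
```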
Paraphrase Types for Generation and Detection
Current approaches in paraphrase generation and detection heavily rely on a
single general similarity score, ignoring the intricate linguistic properties
of language. This paper introduces two new tasks to address this shortcoming by
considering paraphrase types - specific linguistic perturbations at particular
text positions. We name these tasks Paraphrase Type Generation and Paraphrase
Type Detection. Our results suggest that while current techniques perform well
in a binary classification scenario, i.e., paraphrased or not, the inclusion of
fine-grained paraphrase types poses a significant challenge. While most
approaches are good at generating and detecting generally semantically similar
content, they fail to understand the intrinsic linguistic variables they
manipulate. Models trained to generate and identify paraphrase types also
show improvements on tasks without them. In addition, scaling these models
further improves their ability to understand paraphrase types. We believe
paraphrase types can unlock a new paradigm for developing paraphrase models and
solving tasks in the future.
Comment: Published at EMNLP 2023
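Paraphrase Type Detection can be framed as multi-label classification over an inventory of paraphrase types; the sketch below shows that framing with Hugging Face transformers. The three-type label set and base model are illustrative assumptions, and the classification head would need fine-tuning on type-labeled pairs before its predictions mean anything.

```python
# Paraphrase Type Detection as multi-label classification (illustrative
# label set; the head must be fine-tuned on type-labeled pairs first).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

TYPES = ["lexical substitution", "syntactic change", "reordering"]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(TYPES),
    problem_type="multi_label_classification")

def detect_types(sent_a: str, sent_b: str, threshold: float = 0.5):
    enc = tok(sent_a, sent_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**enc).logits)[0]
    return [t for t, p in zip(TYPES, probs) if p > threshold]
```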
The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages
Instruction-tuned large language models (LLMs), such as ChatGPT, demonstrate
remarkable performance in a wide range of tasks. Despite numerous recent
studies that examine the performance of instruction-tuned LLMs on various NLP
benchmarks, there remains a lack of comprehensive investigation into their
ability to understand cross-lingual sociopragmatic meaning (SM), i.e., meaning
embedded within social and interactive contexts. This deficiency arises partly
from SM not being adequately represented in any of the existing benchmarks. To
address this gap, we present SPARROW, an extensive multilingual benchmark
specifically designed for SM understanding. SPARROW comprises 169 datasets
covering 13 task types across six primary categories (e.g., anti-social
language detection, emotion recognition). SPARROW datasets encompass 64
different languages originating from 12 language families representing 16
writing scripts. We evaluate the performance of various multilingual pretrained
language models (e.g., mT5) and instruction-tuned LLMs (e.g., BLOOMZ, ChatGPT)
on SPARROW through fine-tuning, zero-shot, and/or few-shot learning. Our
comprehensive analysis reveals that existing open-source instruction-tuned LLMs
still struggle to understand SM across various languages, performing close to a
random baseline in some cases. We also find that although ChatGPT outperforms
many LLMs, it still falls behind task-specific finetuned models with a gap of
12.19 SPARROW score. Our benchmark is available at:
https://github.com/UBC-NLP/SPARROW
Comment: Accepted by EMNLP 2023 Main conference
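A zero-shot evaluation along these lines can be sketched as follows, with `query_llm` standing in for whatever model API is used; SPARROW's actual prompts and scoring live in the benchmark repository, so everything here is an assumption for illustration.

```python
# Schematic zero-shot evaluation loop; `query_llm` is a placeholder for a
# model API, and the prompt is not SPARROW's actual template.
def zero_shot_eval(query_llm, examples, labels):
    correct = 0
    for text, gold in examples:
        prompt = (f"Classify the following text into one of {labels}.\n"
                  f"Text: {text}\nLabel:")
        pred = query_llm(prompt).strip().lower()
        correct += int(pred == gold.lower())
    return correct / len(examples)
```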
Viewpoint Diversity in Search Results
Adverse phenomena such as the search engine manipulation effect (SEME), where web search users change their attitude on a topic following whatever most highly-ranked search results promote, represent crucial challenges for research and industry. However, the current lack of automatic methods to comprehensively measure or increase viewpoint diversity in search results complicates the understanding and mitigation of such effects. This paper proposes a viewpoint bias metric that evaluates the divergence from a pre-defined scenario of ideal viewpoint diversity considering two essential viewpoint dimensions (i.e., stance and logic of evaluation). In a case study, we apply this metric to actual search results and find considerable viewpoint bias in search results across queries, topics, and search engines that could lead to adverse effects such as SEME. We subsequently demonstrate that viewpoint diversity in search results can be dramatically increased using existing diversification algorithms. The methods proposed in this paper can assist researchers and practitioners in evaluating and improving viewpoint diversity in search results.
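In the spirit of the proposed metric, the sketch below measures how far the stance distribution of the top-k results diverges from an ideal (here uniform) distribution, using the Jensen-Shannon distance from SciPy. The paper's metric also covers a second dimension (logic of evaluation) and has its own definition of the ideal scenario; this only illustrates the general shape.

```python
# Illustrative viewpoint-bias measure: divergence of the observed stance
# distribution in the top-k results from a uniform ideal.
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

STANCES = ["opposing", "neutral", "supporting"]

def viewpoint_bias(result_stances: list[str], k: int = 10) -> float:
    counts = Counter(result_stances[:k])
    observed = np.array([counts[s] for s in STANCES], dtype=float)
    observed /= observed.sum()
    ideal = np.full(len(STANCES), 1 / len(STANCES))  # ideal diversity
    return float(jensenshannon(observed, ideal))     # 0 = perfectly diverse

print(viewpoint_bias(["supporting"] * 8 + ["neutral", "opposing"]))
```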
Gloss Attention for Gloss-free Sign Language Translation
Most sign language translation (SLT) methods to date require gloss
annotations to provide additional supervision; however, glosses are not easy
to acquire. To address this problem, we first analyze existing models to
understand how gloss annotations make SLT easier. We find that they provide
two kinds of information to the model: 1) they help the model implicitly
learn the location of semantic boundaries in continuous sign language videos,
and 2) they help the model understand the sign language video globally. We
then propose gloss attention, which enables
the model to keep its attention within video segments that have the same
semantics locally, just as gloss helps existing models do. Furthermore, we
transfer the knowledge of sentence-to-sentence similarity from the natural
language model to our gloss attention SLT network (GASLT) to help it understand
sign language videos at the sentence level. Experimental results on multiple
large-scale sign language datasets show that our proposed GASLT model
significantly outperforms existing methods. Our code is provided at
https://github.com/YinAoXiong/GASLT.
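The locality intuition behind gloss attention can be sketched with a simple banded attention mask that confines each frame's attention to a fixed-size window, standing in for gloss-delimited segments; the window size and masking scheme here are illustrative assumptions, not the GASLT implementation (see the linked repository for that).

```python
# Illustrative local (banded) attention mask standing in for gloss-delimited
# segments; not the actual GASLT masking scheme.
import torch

def local_attention_mask(seq_len: int, window: int = 8) -> torch.Tensor:
    idx = torch.arange(seq_len)
    # True where |query - key| <= window, i.e. attention is allowed
    return (idx[None, :] - idx[:, None]).abs() <= window

mask = local_attention_mask(seq_len=100)
scores = torch.randn(100, 100).masked_fill(~mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)  # attention confined to local segments
```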