Counterfactually Probing Language Identity in Multilingual Models
Techniques in causal analysis of language models illuminate how linguistic
information is organized in LLMs. We use one such technique, AlterRep, a method
of counterfactual probing, to explore the internal structure of multilingual
models (mBERT and XLM-R). We train a linear classifier on a binary language
identity task to classify tokens as either Language X or Language Y. Applying a
counterfactual probing procedure, we use the classifier weights to project the
embeddings into the null space and push the resulting embeddings either in the
direction of Language X or Language Y. Then we evaluate on a masked language
modeling task. We find that, given a template in Language X, pushing towards
Language Y systematically increases the probability of Language Y words, above
and beyond a third-party control language. But it does not specifically push
the model towards translation-equivalent words in Language Y. Pushing towards
Language X (the same direction as the template) has a minimal effect, but
somewhat degrades these models. Overall, we take these results as further
evidence of the rich structure of massive multilingual language models, which
include both a language-specific and language-general component. And we show
that counterfactual probing can be fruitfully applied to multilingual models.
Comment: 12 pages, 5 figures, MRL Workshop @ EMNLP 202
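The projection-and-push step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `alterrep_push` and the fixed push magnitude `alpha` are hypothetical, `h` stands for a single token embedding, and `w` for the weight vector of the trained binary language-identity probe.

```python
import numpy as np

def alterrep_push(h, w, alpha=1.0, direction=+1):
    """Counterfactual push of a hidden state along a probe direction.

    h: (d,) hidden vector; w: (d,) weights of a linear language-ID probe
    (Language X vs. Language Y). First project h onto the probe's null
    space (removing the language-identity component), then push a fixed
    amount alpha along +w (toward Y) or -w (toward X).
    """
    w_unit = w / np.linalg.norm(w)
    h_null = h - np.dot(h, w_unit) * w_unit   # remove language-ID component
    return h_null + direction * alpha * w_unit
```

After the push, the component of the embedding along the probe direction is exactly `direction * alpha`, regardless of its original value, which is what makes the intervention counterfactual rather than additive.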
Improving Summarization with Human Edits
Recent work has shown the promise of learning-with-human-feedback paradigms
to produce high-quality text as determined by humans. Existing works use human
feedback to train large language models (LLMs) in general domain abstractive
summarization and have obtained summary quality exceeding traditional
likelihood training. In this paper, we focus on a less explored form of human
feedback -- Human Edits. We propose Sequence Alignment (un)Likelihood Training
(SALT), a novel technique to use both the human-edited and model-generated data
together in the training loop. In addition, we demonstrate simulating Human
Edits with ground-truth summaries from existing training data -- Imitation
Edits -- used together with the model-generated summaries obtained after
training, to reduce the need for expensive human-edit data. In our experiments,
we extend human feedback exploration from general domain summarization to
medical domain summarization. Our results demonstrate the effectiveness of SALT
in improving the summary quality with Human and Imitation Edits. Through
additional experiments, we show that SALT outperforms DPO, a conventional
RLHF method designed for human preferences, when applied to human-edit
data. We hope the evidence in our paper prompts researchers to explore,
collect, and make better use of different forms of human feedback at scale.
Comment: To appear in proceedings of the Main Conference on Empirical Methods in Natural Language Processing (EMNLP) 202
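As a rough illustration of combining likelihood and unlikelihood terms over an alignment between a model-generated summary and a human edit: tokens the editor kept receive an ordinary likelihood term, while tokens the editor removed receive an unlikelihood term that pushes their probability down. This is a generic (un)likelihood sketch under assumed inputs, not SALT's exact objective; the name `salt_style_loss` and the `keep_mask` encoding of the alignment are hypothetical.

```python
import numpy as np

def salt_style_loss(log_probs, keep_mask):
    """Sketch of a combined (un)likelihood objective.

    log_probs: (T,) model log-probabilities of the tokens in a
    model-generated summary. keep_mask: (T,) booleans from aligning the
    model output with a human edit -- True where the editor kept the
    token (likelihood term), False where the editor removed it
    (unlikelihood term).
    """
    probs = np.exp(log_probs)
    like = -log_probs[keep_mask]                      # -log p for kept tokens
    unlike = -np.log(1.0 - probs[~keep_mask] + 1e-9)  # -log(1-p) for removed
    return (like.sum() + unlike.sum()) / len(log_probs)
```

The unlikelihood term is minimized when a removed token's probability goes to zero, so gradient descent on this loss simultaneously reinforces edited-in content and suppresses edited-out content.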
Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach
Most deployed data discovery systems, such as Google Datasets and open data
portals, only support keyword search. Keyword search is geared towards general
audiences but limits the types of queries the systems can answer. We propose a
new system that lets users write natural language questions directly. A major
barrier to adopting such a learned data discovery system is that it needs
expensive-to-collect training data, which limits its utility. In this paper,
we introduce a self-supervised approach to assemble training datasets and train
learned discovery systems without human intervention. It requires addressing
several challenges, including the design of self-supervised strategies for data
discovery, table representation strategies to feed to the models, and relevance
models that work well with the synthetically generated questions. We combine
all the above contributions into a system, Solo, that solves the problem end to
end. The evaluation results demonstrate that the new techniques outperform
state-of-the-art approaches on well-known benchmarks. All in all, the technique
is a stepping stone towards building learned discovery systems. The code is
open-sourced at https://github.com/TheDataStation/solo
Comment: To appear at SIGMOD 202
Neural language model based training data augmentation for weakly supervised early rumor detection
The scarcity and class imbalance of training data are known issues in current rumor detection tasks. We propose a straightforward and general-purpose data augmentation technique which is beneficial to early rumor detection relying on event propagation patterns. The key idea is to exploit massive unlabeled event data sets on social media to augment the limited labeled rumor source tweets. This work is based on rumor spreading patterns revealed by recent rumor studies and on the semantic relatedness between labeled and unlabeled data. A state-of-the-art neural language model (NLM) and large credibility-focused Twitter corpora are employed to learn context-sensitive representations of rumor tweets. Six different real-world events based on three publicly available rumor datasets are employed in our experiments to provide a comparative evaluation of the effectiveness of the method. The results show that our method can expand the size of an existing rumor data set by nearly 200% and the corresponding social context (i.e., conversational threads) by 100% with reasonable quality. Preliminary experiments with a state-of-the-art deep learning-based rumor detection model show that the augmented data can alleviate the over-fitting and class imbalance caused by limited training data and can help to train complex neural networks (NNs). With augmented data, the performance of rumor detection can be improved by 12.1% in terms of F-score. Our experiments also indicate that augmented training data can help to generalize rumor detection models to unseen rumors.
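The semantic-relatedness selection step can be sketched as a cosine-similarity filter over NLM embeddings: an unlabeled tweet is added to the training pool when its representation is close enough to some labeled rumor source tweet. The function name, the 0.8 threshold, and the assumption that embeddings arrive pre-computed from an encoder are all illustrative, not the paper's exact procedure.

```python
import numpy as np

def augment_by_similarity(labeled_emb, unlabeled_emb, threshold=0.8):
    """Select unlabeled tweets semantically related to labeled rumor tweets.

    labeled_emb: (n_labeled, d) embeddings of labeled rumor source tweets.
    unlabeled_emb: (n_unlabeled, d) embeddings of candidate unlabeled tweets.
    Returns indices of unlabeled tweets whose cosine similarity to any
    labeled tweet reaches the threshold.
    """
    a = labeled_emb / np.linalg.norm(labeled_emb, axis=1, keepdims=True)
    b = unlabeled_emb / np.linalg.norm(unlabeled_emb, axis=1, keepdims=True)
    sims = b @ a.T                      # (n_unlabeled, n_labeled) cosines
    return np.where(sims.max(axis=1) >= threshold)[0]
```

In practice the threshold trades precision of the augmented labels against the size of the expanded data set, which is why the abstract emphasizes that the near-200% expansion comes "with reasonable quality".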
Neural-based Knowledge Transfer in Natural Language Processing
In Natural Language Processing (NLP), neural-based knowledge transfer, i.e., transferring out-of-domain (OOD) knowledge into task-specific neural networks, has been applied to many NLP tasks. To further explore neural-based knowledge transfer in NLP, in this dissertation we consider both structured OOD knowledge and unstructured OOD knowledge, and deal with several representative NLP tasks.

For structured OOD knowledge, we study neural-based knowledge transfer in Machine Reading Comprehension (MRC). In single-passage MRC tasks, to bridge the gap between MRC models and human beings, which is mainly reflected in the hunger for data and the robustness to noise, we integrate the neural networks of MRC models with the general knowledge of human beings embodied in knowledge bases. On the one hand, we propose a data enrichment method, which uses WordNet to extract inter-word semantic connections as general knowledge from each given passage-question pair. On the other hand, we propose a novel MRC model named Knowledge Aided Reader (KAR), which explicitly uses the above extracted general knowledge to assist its attention mechanisms. According to the experimental results, KAR is comparable in performance with the state-of-the-art MRC models and significantly more robust to noise than they are. On top of that, when only a subset (20%-80%) of the training examples is available, KAR outperforms the state-of-the-art MRC models by a large margin and is still reasonably robust to noise.

In multi-hop MRC tasks, to probe the strength of Graph Neural Networks (GNNs), we propose a novel multi-hop MRC model named Graph Aided Reader (GAR), which uses GNN methods to perform multi-hop reasoning but is free of any pre-trained language model and completely end-to-end. For graph construction, GAR utilizes the topic-referencing relations between passages and the entity-sharing relations between sentences, which is aimed at obtaining the most sensible reasoning clues. For message passing, GAR simulates a top-down reasoning and a bottom-up reasoning, which is aimed at making the best use of the above obtained reasoning clues. According to the experimental results, GAR even outperforms several competitors relying on pre-trained language models and filter-reader pipelines, which implies that GAR benefits greatly from its GNN methods. On this basis, GAR can further benefit from applying pre-trained language models, but pre-trained language models mainly facilitate the within-passage reasoning rather than the cross-passage reasoning of GAR. Moreover, compared with the competitors constructed as filter-reader pipelines, GAR is not only easier to train but also more applicable to low-resource cases.

For unstructured OOD knowledge, we study neural-based knowledge transfer in Natural Language Understanding (NLU), and focus on the neural-based knowledge transfer between languages, also known as Cross-Lingual Transfer Learning (CLTL). To facilitate the CLTL of NLU models, especially the CLTL between distant languages, we propose a novel CLTL model named Translation Aided Language Learner (TALL), where CLTL is integrated with Machine Translation (MT). Specifically, we adopt a pre-trained multilingual language model as our baseline model and construct TALL by appending a decoder to it. On this basis, we directly fine-tune the baseline model as an NLU model to conduct CLTL, but put TALL through an MT-oriented pre-training before its NLU-oriented fine-tuning. To make use of unannotated data, we implement the recently proposed Unsupervised Machine Translation (UMT) technique in the MT-oriented pre-training of TALL. According to the experimental results, the application of UMT enables TALL to consistently achieve better CLTL performance than the baseline model without using more annotated data, and the performance gain is relatively prominent in the case of distant languages.
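The entity-sharing relation used for GAR's graph construction can be sketched as follows: two sentences are connected whenever they mention a common entity. This is an illustrative sketch only; the function name is hypothetical, and entity extraction (e.g. via NER) is assumed to have been done upstream.

```python
def entity_sharing_edges(sentences):
    """Build edges between sentences that share at least one entity.

    sentences: list of (sentence_id, set_of_entities) pairs.
    Returns a list of undirected edges (id_a, id_b) for every pair of
    sentences whose entity sets intersect.
    """
    edges = []
    for i, (sid_a, ents_a) in enumerate(sentences):
        for sid_b, ents_b in sentences[i + 1:]:
            if ents_a & ents_b:        # non-empty intersection => edge
                edges.append((sid_a, sid_b))
    return edges
```

The resulting edges, together with topic-referencing edges between passages, would form the graph over which the GNN's top-down and bottom-up message passing operates.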