17 research outputs found
An Information Minimization Based Contrastive Learning Model for Unsupervised Sentence Embeddings Learning
Unsupervised sentence embedding learning has recently been dominated by
contrastive learning methods (e.g., SimCSE), which keep positive pairs similar
and push negative pairs apart. The contrast operation aims to keep as much
information as possible by maximizing the mutual information between positive
instances, which leads to redundant information in the sentence embedding. To
address this problem, we present an information minimization based contrastive
learning (InforMin-CL) model that retains useful information and discards
redundant information by simultaneously maximizing the mutual information and
minimizing the information entropy between positive instances for unsupervised
sentence representation learning. Specifically, we find that information
minimization can be achieved with simple contrast and reconstruction
objectives: the reconstruction operation reconstructs one positive instance
from the other to minimize the information entropy between the positive
instances. We evaluate our model on fourteen downstream tasks, including both
supervised and unsupervised (semantic textual similarity) tasks. Extensive
experimental results show that InforMin-CL achieves state-of-the-art
performance.
Comment: 11 pages, 3 figures, published at COLING 2022
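A minimal sketch, assuming a PyTorch setup, of the two objectives the abstract describes: an in-batch InfoNCE contrast term plus a reconstruction term in which one positive embedding is reconstructed from the other. The class name, the linear decoder, and the MSE reconstruction loss are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InforMinObjectives(nn.Module):
    """Illustrative combination of a contrast loss and a reconstruction loss."""

    def __init__(self, dim: int, temperature: float = 0.05):
        super().__init__()
        self.temperature = temperature
        # Simple linear decoder that reconstructs one positive view from the other.
        self.decoder = nn.Linear(dim, dim)

    def contrast_loss(self, z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
        # In-batch InfoNCE: maximizes mutual information between the two
        # positive views z1 and z2 (both of shape [batch, dim]).
        sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1)
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(sim / self.temperature, labels)

    def reconstruction_loss(self, z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
        # Reconstructing one positive view from the other penalizes information
        # that is not shared between the views (the information-minimization step).
        return F.mse_loss(self.decoder(z1), z2)

    def forward(self, z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
        return self.contrast_loss(z1, z2) + self.reconstruction_loss(z1, z2)
```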
Generating Efficient Training Data via LLM-based Attribute Manipulation
In this paper, we propose a novel method, Chain-of-Thoughts Attribute
Manipulation (CoTAM), to guide few-shot learning with carefully crafted data from
Large Language Models (LLMs). The main idea is to create data with changes only
in the attribute targeted by the task. Inspired by facial attribute
manipulation, our approach generates label-switched data by leveraging LLMs to
manipulate task-specific attributes and reconstruct new sentences in a
controlled manner. Instead of conventional latent representation control,
we implement chain-of-thoughts decomposition and reconstruction to adapt the
procedure to LLMs. Extensive results on text classification and other tasks
verify the advantage of CoTAM over other LLM-based text generation methods with
the same number of training examples. Our analysis visualizes the
effectiveness of CoTAM's attribute manipulation and demonstrates the potential
of LLM-guided learning with even less supervision.
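A rough sketch of how the chain-of-thoughts decomposition and reconstruction could be phrased as a single prompt; `call_llm` is a hypothetical stand-in for any text-completion API, and the step wording is illustrative rather than the paper's actual prompt.

```python
def cotam_prompt(sentence: str, attribute: str, target_value: str) -> str:
    # Step 1 decomposes the sentence into attributes, step 2 switches only the
    # task-targeted attribute, and step 3 reconstructs the new sentence.
    return (
        f'Sentence: "{sentence}"\n'
        f"Step 1: List the attributes of this sentence other than its {attribute}.\n"
        f"Step 2: Keep those attributes unchanged, but switch the {attribute} "
        f"to '{target_value}'.\n"
        f"Step 3: Write the resulting sentence."
    )

def generate_label_switched_example(call_llm, sentence, attribute, target_value):
    # Returns a training example whose text differs from the original only in
    # the targeted attribute, paired with the switched label.
    new_sentence = call_llm(cotam_prompt(sentence, attribute, target_value))
    return {"text": new_sentence, "label": target_value}
```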
Transformers with Learnable Activation Functions
Activation functions can have a significant impact on reducing the
topological complexity of input data and thereby improve the performance of
the model. Selecting a suitable activation function is an essential step in
neural model design. However, the choice of activation function is seldom
discussed or explored in Transformer-based language models. Their activation
functions are chosen beforehand and then remain fixed from pre-training to
fine-tuning. As a result, the inductive biases they impose on models cannot be
adjusted during this long life cycle. Moreover, subsequently developed models
(e.g., RoBERTa, BART, and GPT-3) often follow prior work (e.g., BERT) in using
the same activation function without justification. In this paper, we
investigate the effectiveness of using the Rational Activation Function (RAF), a
learnable activation function, in the Transformer architecture. In contrast to
conventional, predefined activation functions, RAFs can adaptively learn
optimal activation functions during training according to input data. Our
experiments show the RAF-based Transformer (RAFT) achieves a lower validation
perplexity than a vanilla BERT with the GELU function. We further evaluate RAFT
on downstream tasks in low- and full-data settings. Our results show that RAFT
outperforms the counterpart model across the majority of tasks and settings.
For instance, RAFT outperforms vanilla BERT on the GLUE benchmark by 5.71
points on average in the low-data scenario (where 100 training examples are
available) and by 2.05 points on SQuAD in the full-data setting. Analysis of the
shapes of learned RAFs further unveils that they substantially vary between
different layers of the pre-trained model and mostly look very different from
conventional activation functions. RAFT opens a new research direction for
analyzing and interpreting pre-trained models according to the learned
activation functions.
Comment: Accepted to EACL 2023 Findings
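A minimal sketch of a learnable rational activation R(x) = P(x) / Q(x) with trainable polynomial coefficients and a positivity-constrained denominator to avoid poles; the degrees, initialization, and exact parameterization here are assumptions and may differ from the paper's RAF.

```python
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Learnable rational activation R(x) = P(x) / Q(x) (illustrative sketch)."""

    def __init__(self, num_degree: int = 5, den_degree: int = 4):
        super().__init__()
        # Learnable polynomial coefficients for numerator and denominator.
        self.numerator = nn.Parameter(torch.randn(num_degree + 1) * 0.1)
        self.denominator = nn.Parameter(torch.randn(den_degree) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # P(x) = a0 + a1*x + ... + a_m*x^m
        p = sum(a * x**i for i, a in enumerate(self.numerator))
        # Q(x) = 1 + |b1*x + ... + b_n*x^n|, kept strictly positive to avoid poles.
        q = 1.0 + torch.abs(sum(b * x**(i + 1) for i, b in enumerate(self.denominator)))
        return p / q
```

In a RAFT-style model, a separate instance of such a module would presumably replace the fixed GELU in each feed-forward block, so that every layer can learn its own activation shape.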
Narrowing the Gap between Supervised and Unsupervised Sentence Representation Learning with Large Language Model
Sentence Representation Learning (SRL) is a fundamental task in Natural
Language Processing (NLP), with Contrastive learning of Sentence Embeddings
(CSE) as the mainstream technique due to its superior performance. An
intriguing phenomenon in CSE is the significant performance gap between
supervised and unsupervised methods, even when their sentence encoder and loss
function are the same. Previous works attribute this performance gap to
differences in two representation properties (alignment and uniformity).
However, alignment and uniformity only measure the results, which means they
cannot answer "What happens during the training process that leads to the
performance gap?" and "How can the performance gap be narrowed?". In this
paper, we conduct empirical experiments to answer these "What" and "How"
questions. We first answer the "What" question by thoroughly comparing the
behavior of supervised and unsupervised CSE during their respective training
processes. From the comparison, we observe a significant difference in fitting
difficulty. Thus, we introduce a metric called Fitting Difficulty Increment
(FDI), to measure the fitting difficulty gap between the evaluation dataset and
the held-out training dataset, and use the metric to answer the "What"
question. Then, based on the insights gained from the "What" question, we
tackle the "How" question by increasing the fitting difficulty of the training
dataset. We achieve this by leveraging the In-Context Learning (ICL) capability
of Large Language Models (LLMs) to generate data that simulates complex
patterns. By utilizing the hierarchical patterns in the LLM-generated data, we
effectively narrow the gap between supervised and unsupervised CSE.
Comment: work in progress
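The abstract does not spell out how FDI is computed; the sketch below only illustrates one plausible reading, in which "fitting difficulty" is approximated by a model's average loss and FDI is the difference between the evaluation set and a held-out slice of the training set. The function names and the loss-based proxy are assumptions, not the paper's definition.

```python
import torch

@torch.no_grad()
def average_loss(model, loader, loss_fn, device="cpu"):
    # Proxy for "fitting difficulty": mean loss of a trained model on a dataset.
    model.eval()
    total, count = 0.0, 0
    for inputs, labels in loader:
        preds = model(inputs.to(device))
        total += loss_fn(preds, labels.to(device)).item() * len(labels)
        count += len(labels)
    return total / count

def fitting_difficulty_increment(model, eval_loader, heldout_train_loader, loss_fn):
    # A larger value means the evaluation data is harder to fit than held-out
    # data drawn from the same distribution as the training set.
    return (average_loss(model, eval_loader, loss_fn)
            - average_loss(model, heldout_train_loader, loss_fn))
```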
Mirror: A Universal Framework for Various Information Extraction Tasks
Sharing knowledge between information extraction (IE) tasks has always been a
challenge due to the diverse data formats and task variations. Meanwhile, this
divergence leads to information waste and increases difficulties in building
complex applications in real scenarios. Recent studies often formulate IE tasks
as a triplet extraction problem. However, such a paradigm does not support
multi-span and n-ary extraction, leading to weak versatility. To this end, we
reorganize IE problems into unified multi-slot tuples and propose a universal
framework for various IE tasks, namely Mirror. Specifically, we recast existing
IE tasks as a multi-span cyclic graph extraction problem and devise a
non-autoregressive graph decoding algorithm to extract all spans in a single
step. Notably, this graph structure is highly versatile, and
it supports not only complex IE tasks, but also machine reading comprehension
and classification tasks. We manually construct a corpus containing 57 datasets
for model pretraining, and conduct experiments on 30 datasets across 8
downstream tasks. The experimental results demonstrate that our model has
decent compatibility and outperforms or is competitive with SOTA systems under
few-shot and zero-shot settings. The code, model weights, and pretraining
corpus are available at https://github.com/Spico197/Mirror .
Comment: Accepted to the EMNLP 2023 main conference
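An illustrative sketch of the unified multi-slot tuple view the abstract describes: different IE tasks (and related tasks) are expressed as tuples of spans over a shared schema, which is what lets a single decoder serve single-span, binary, and n-ary extraction alike. The field names and example offsets are hypothetical and do not reflect Mirror's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the input text

@dataclass
class MultiSlotTuple:
    label: str         # entity type, relation, event type, answer slot, ...
    slots: List[Span]  # one span for NER, two for relations, n for events

# Example text: "Alice works at Acme Corp."
ner   = MultiSlotTuple(label="PERSON",     slots=[(0, 5)])                    # single-slot
rel   = MultiSlotTuple(label="works_for",  slots=[(0, 5), (15, 24)])          # binary relation
event = MultiSlotTuple(label="Employment", slots=[(6, 11), (0, 5), (15, 24)]) # n-ary: trigger + arguments
```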