128 research outputs found
Incubating Text Classifiers Following User Instruction with Nothing but LLM
In this paper, we aim to generate text classification data given arbitrary
class definitions (i.e., user instruction), so one can train a small text
classifier without any human annotation or raw corpus. Compared with pioneering
attempts, our proposed Incubator is the first framework that can handle
complicated and even mutually dependent classes (e.g., "TED Talk given by
Educator" and "Other"). Specifically, Incubator is an LLM first tuned on the
instruction-to-data mappings that we obtained from classification datasets and
their descriptions on HuggingFace, together with in-context augmentation by
GPT-4. We then refine Incubator by learning on the cluster centers of semantic
textual embeddings to encourage uniformity and semantic diversity in its
generations.
We compare Incubator on various classification tasks with strong baselines such
as direct LLM-based inference and training data generation by prompt
engineering. Experiments show that Incubator is able to (1) perform well on
traditional benchmarks, (2) take label dependency and user preference into
consideration, and (3) enable logical text mining by incubating multiple
classifiers.
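The instruction-to-data idea can be sketched as a prompt builder plus a parser for the generated pairs. The prompt wording and the tab-separated output convention below are illustrative assumptions; the actual Incubator is a tuned LLM, not a prompt template:

```python
def incubation_prompt(classes, n_per_class=3):
    # Render the user instruction (class name -> definition) as a
    # data-generation request; the exact wording is hypothetical.
    lines = [f"Generate {n_per_class} short example texts for each class below, "
             "one per line, formatted as '<class>\\t<text>'."]
    for name, definition in classes.items():
        lines.append(f"- {name}: {definition}")
    return "\n".join(lines)

def parse_generations(response):
    # Turn '<class>\t<text>' lines back into (text, label) pairs,
    # which can then train a small classifier.
    pairs = []
    for line in response.splitlines():
        if "\t" in line:
            label, text = line.split("\t", 1)
            pairs.append((text.strip(), label.strip()))
    return pairs
```

Any model answer lines that do not follow the requested format are simply skipped by the parser.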
Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path
The rapid growth of web pages and the increasing complexity of their
structure pose a challenge for web mining models. Web mining models are
required to understand semi-structured web pages, particularly when little
is known about the subject or template of a new page. Current methods migrate
language models to web mining by embedding the XML source code into the
transformer or encoding the rendered layout with graph neural networks.
However, these approaches do not take into account the relationships between
text nodes within and across pages. In this paper, we propose a new approach,
ReXMiner, for zero-shot relation extraction in web mining. ReXMiner encodes the
shortest relative paths in the Document Object Model (DOM) tree, which provide
a more accurate and efficient signal for key-value pair extraction within a web
page.
It also incorporates the popularity of each text node by counting the
occurrence of the same text node across different web pages. We use
contrastive learning to address the issue of sparsity in relation extraction.
Extensive experiments on public benchmarks show that our method, ReXMiner,
outperforms the state-of-the-art baselines in the task of zero-shot relation
extraction in web mining.
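The relative-path signal can be illustrated with a minimal DOM walk: climb from one text node up to the lowest common ancestor, then descend to the other node. This sketch uses `xml.etree.ElementTree` on a toy page and is not ReXMiner's actual implementation:

```python
import xml.etree.ElementTree as ET

def shortest_relative_path(root, a, b):
    # Map every element to its parent so we can walk upward in the tree.
    parent = {child: p for p in root.iter() for child in p}

    def chain(node):
        # node, its parent, ..., up to the root
        out = [node]
        while out[-1] in parent:
            out.append(parent[out[-1]])
        return out

    ca, cb = chain(a), chain(b)
    lca = next(n for n in ca if n in cb)        # lowest common ancestor
    up = [n.tag for n in ca[:ca.index(lca)]]    # tags climbed from a to the LCA
    down = [n.tag for n in reversed(cb[:cb.index(lca)])]  # tags descended to b
    return up, down

doc = ET.fromstring("<html><body><tr>"
                    "<td id='k'>Name</td><td id='v'>Alice</td>"
                    "</tr></body></html>")
key, val = doc.find(".//td[@id='k']"), doc.find(".//td[@id='v']")
# shortest_relative_path(doc, key, val) yields (['td'], ['td'])
```

The two sibling cells share the `<tr>` ancestor, so the relative path is one step up through `td` and one step down through `td`.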
DPPred: an effective prediction framework with concise discriminative patterns and its biomedical applications
In the literature, two series of models have been proposed to address prediction problems, including classification and regression. Simple models, such as generalized linear models, have modest performance but strong interpretability on a set of simple features. The other series, including tree-based models, organizes numerical, categorical, and high-dimensional features into a comprehensive structure with rich interpretable information in the data.
In this thesis, we propose a novel discriminative pattern-based prediction framework (DPPred) that accomplishes prediction tasks by combining the advantages of both series: effectiveness and interpretability. Specifically, DPPred adopts the concise discriminative patterns that lie on the prefix paths from the root to leaf nodes in tree-based models. Moreover, DPPred selects a limited number of useful discriminative patterns by searching for the most effective pattern combination to fit generalized linear models.
To validate the effectiveness of DPPred, we conduct experiments on both classification and regression tasks. Experimental results demonstrate that DPPred provides competitive accuracy with the state-of-the-art as well as valuable interpretability for developers and experts. In particular, when studying the health status of cardiopulmonary patients, DPPred shows acceptable predictive accuracy (more than 95%) and reveals the importance of demographic features; when studying amyotrophic lateral sclerosis (ALS), DPPred not only outperforms the baselines while using only 40 concise discriminative patterns out of a potentially exponentially large set of patterns, but also discovers novel markers.
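The prefix-path idea can be sketched in plain Python: every root-to-node prefix in a decision tree is a conjunction of split conditions, and each sample can be re-encoded by which patterns it satisfies before a linear model is fit. The nested-dict tree below is a hand-rolled stand-in, not DPPred's actual representation:

```python
def prefix_patterns(tree, path=()):
    # Enumerate every prefix path (conjunction of split conditions) in a tree.
    if tree is None or "feature" not in tree:
        return []
    f, t = tree["feature"], tree["thresh"]
    left = path + ((f, "<=", t),)
    right = path + ((f, ">", t),)
    pats = [left, right]
    pats += prefix_patterns(tree.get("left"), left)
    pats += prefix_patterns(tree.get("right"), right)
    return pats

def matches(x, pattern):
    # Does sample x satisfy every condition in the pattern?
    return all(x[f] <= t if op == "<=" else x[f] > t for f, op, t in pattern)

def encode(x, patterns):
    # Binary feature vector over discriminative patterns, ready for a
    # generalized linear model.
    return [int(matches(x, p)) for p in patterns]
```

DPPred additionally prunes this set down to a small, most effective pattern combination; that selection step is omitted here.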
Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models
Instruction-tuning language models has become a crucial step in aligning them
for general use. Typically, this process involves extensive training on large
datasets, incurring high training costs. In this paper, we introduce a novel
training data selection method based on the learning percentage of the samples. We
assert that current language models possess the capability to autonomously
select high-quality training data, leading to comparable or improved
performance compared to training on the entire dataset. Our experiments span
different-sized models, revealing that this characteristic holds for models
ranging from 1B (small) to 13B (large) in size. Moreover, we demonstrate an
interesting finding that the data hardness transfers across model sizes, and a
smaller 350M model can effectively curate high-quality training data with hard
samples for a larger 13B model, resulting in an equally good or superior
instruction-tuned model compared to training on the complete dataset. Using
open-source OPT and Llama-2 models up to 13B in size and two publicly available
instruction-tuning training datasets, with evaluation by both automatic metrics
and humans, our paper introduces a novel approach to training data selection,
showcasing a more efficient alternative.
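One plausible reading of a learning-percentage score is the fraction of a sample's total loss drop achieved by an early checkpoint, with the slowest-learned samples kept as hard examples. The metric definition and field names below are assumptions for illustration, not the paper's exact formulation:

```python
def learning_percentage(loss_start, loss_mid, loss_end):
    # Share of the total loss reduction already achieved at the mid checkpoint.
    total = loss_start - loss_end
    return (loss_start - loss_mid) / total if total > 0 else 1.0

def select_hard_samples(samples, k):
    # Keep the k samples with the lowest learning percentage (learned slowest).
    return sorted(samples, key=lambda s: learning_percentage(*s["losses"]))[:k]
```

In the transfer setting described above, the per-sample losses would come from the small 350M curator model, and the selected subset would be used to tune the larger 13B model.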
Label Noise in Adversarial Training: A Novel Perspective to Study Robust Overfitting
We show that label noise exists in adversarial training. Such label noise is
due to the mismatch between the true label distribution of adversarial examples
and the label inherited from clean examples - the true label distribution is
distorted by the adversarial perturbation, but is neglected by the common
practice that inherits labels from clean examples. Recognizing label noise
sheds light on the prevalence of robust overfitting in adversarial training,
and explains its intriguing dependence on perturbation radius and data quality.
Also, our label noise perspective aligns well with our observations of the
epoch-wise double descent in adversarial training. Guided by our analyses, we
propose a method to automatically calibrate the label to address the label
noise and robust overfitting. Our method achieves consistent performance
improvements across various models and datasets without introducing new
hyper-parameters or additional tuning.
Comment: NeurIPS 2022 (Oral); a previous version of this paper (v1) used the
title "Double Descent in Adversarial Training: An Implicit Label Noise
Perspective".
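A generic form of label calibration is to soften the inherited one-hot label toward the model's own predictive distribution on the adversarial example. The interpolation below is a common recipe sketched for illustration only; it is not claimed to be the paper's exact calibration rule:

```python
def calibrate_label(one_hot, model_probs, alpha=0.7):
    # Interpolate the clean-example label with the model's prediction on the
    # adversarial example, yielding a softened target distribution that
    # acknowledges the perturbation-distorted true label distribution.
    return [alpha * y + (1 - alpha) * p for y, p in zip(one_hot, model_probs)]
```

Because both inputs are probability distributions, the calibrated target still sums to one and can be used directly with a cross-entropy loss.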
ClusterLLM: Large Language Models as a Guide for Text Clustering
We introduce ClusterLLM, a novel text clustering framework that leverages
feedback from an instruction-tuned large language model, such as ChatGPT.
Compared with traditional unsupervised methods that build upon "small"
embedders, ClusterLLM exhibits two intriguing advantages: (1) it enjoys the
emergent capabilities of the LLM even if its embeddings are inaccessible; and
(2) it understands the user's preference for clustering through textual
instructions and/or a few annotated data points. First, we prompt ChatGPT for
insights into the clustering perspective by constructing hard triplet questions
<does A better
correspond to B than C>, where A, B and C are similar data points that belong
to different clusters according to a small embedder. We empirically show that
this strategy is both effective for fine-tuning the small embedder and
cost-efficient for querying ChatGPT. Second, we prompt ChatGPT for help with
clustering granularity via carefully designed pairwise questions <do A and B
belong to the same category>, and tune the granularity using the cluster
hierarchy that is most consistent with the ChatGPT answers. Extensive
experiments on 14 datasets show that ClusterLLM consistently improves
clustering quality, at an average cost of ~$0.6 per dataset.
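The triplet query can be sketched as a prompt template plus a converter that turns the LLM's choice into an (anchor, positive, negative) triplet for embedder fine-tuning. The exact prompt wording and single-letter answer format are assumptions:

```python
def triplet_question(a, b, c):
    # Hard triplet question posed to the LLM about the anchor A.
    return (f"Does A better correspond to B than to C?\n"
            f"A: {a}\nB: {b}\nC: {c}\n"
            "Answer with exactly one letter, B or C.")

def to_training_triplet(anchor, b, c, llm_answer):
    # Interpret the answer: the chosen point becomes the positive and the
    # other the negative, for a standard triplet loss on the small embedder.
    pos, neg = (b, c) if llm_answer.strip().upper().startswith("B") else (c, b)
    return anchor, pos, neg
```

Only triplets whose candidates come from different clusters under the current small embedder are worth asking about, which keeps the number of paid LLM queries low.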
Debiasing Made State-of-the-art: Revisiting the Simple Seed-based Weak Supervision for Text Classification
Recent advances in weakly supervised text classification mostly focus on
designing sophisticated methods to turn high-level human heuristics into
quality pseudo-labels. In this paper, we revisit the seed matching-based
method, which is arguably the simplest way to generate pseudo-labels, and show
that its power was greatly underestimated. We show that the limited performance
of seed matching is largely due to the label bias injected by the simple
seed-match rule, which prevents the classifier from learning reliable
confidence for selecting high-quality pseudo-labels. Interestingly, simply
deleting the seed words present in the matched input texts can mitigate the
label bias and help learn better confidence. Subsequently, the performance
achieved by seed matching can be improved significantly, making it on par with
or even better than the state-of-the-art. Furthermore, to handle the case when
the seed words are not known, we propose simply deleting word tokens from the
input text at random with a high deletion ratio. Remarkably, seed matching
equipped with this random deletion method can often achieve even better
performance than seed deletion.
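Both debiasing tricks are simple token filters over the matched text. A minimal sketch (whitespace tokenization is an assumption, and the paper describes the deletion ratio only as "high"):

```python
import random

def delete_seed_words(tokens, seed_words):
    # Remove seed words from a matched text so the classifier cannot rely
    # on the seed-match rule itself when learning confidence.
    seeds = {w.lower() for w in seed_words}
    return [t for t in tokens if t.lower() not in seeds]

def random_token_deletion(tokens, ratio=0.9, rng=None):
    # When seeds are unknown: drop each token independently with high
    # probability, which removes the (unknown) seed words most of the time.
    rng = rng or random.Random(0)
    return [t for t in tokens if rng.random() >= ratio]
```

The random variant trades a little signal in every example for removing the bias-inducing seeds without ever knowing what they are.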