341 research outputs found
CEIL: A General Classification-Enhanced Iterative Learning Framework for Text Clustering
Text clustering, as one of the most fundamental challenges in unsupervised
learning, aims at grouping semantically similar text segments without relying
on human annotations. With the rapid development of deep learning, deep
clustering has achieved significant advantages over traditional clustering
methods. Despite the effectiveness, most existing deep text clustering methods
rely heavily on representations pre-trained in general domains, which may not
be the most suitable solution for clustering in specific target domains. To
address this issue, we propose CEIL, a novel Classification-Enhanced Iterative
Learning framework for short text clustering, which aims at generally promoting
the clustering performance by introducing a classification objective to
iteratively improve feature representations. In each iteration, we first adopt
a language model to retrieve the initial text representations, from which the
clustering results are collected using our proposed Category Disentangled
Contrastive Clustering (CDCC) algorithm. After strict data filtering and
aggregation processes, samples with clean category labels are retrieved, which
serve as supervision information to update the language model with the
classification objective via a prompt learning approach. Finally, the updated
language model with improved representation ability is used to enhance
clustering in the next iteration. Extensive experiments demonstrate that the
CEIL framework significantly improves the clustering performance over
iterations, and is generally effective on various clustering algorithms.
Moreover, by incorporating CEIL on CDCC, we achieve the state-of-the-art
clustering performance on a wide range of short text clustering benchmarks
outperforming other strong baseline methods.Comment: The Web Conference 202
Web knowledge bases
Knowledge is key to natural language understanding. References to specific people, places and things in text are crucial to resolving ambiguity and extracting meaning. Knowledge Bases (KBs) codify this information for automated systems — enabling applications such as entity-based search and question answering. This thesis explores the idea that sites on the web may act as a KB, even if that is not their primary intent. Dedicated kbs like Wikipedia are a rich source of entity information, but are built and maintained at an ongoing cost in human effort. As a result, they are generally limited in terms of the breadth and depth of knowledge they index about entities. Web knowledge bases offer a distributed solution to the problem of aggregating entity knowledge. Social networks aggregate content about people, news sites describe events with tags for organizations and locations, and a diverse assortment of web directories aggregate statistics and summaries for long-tail entities notable within niche movie, musical and sporting domains. We aim to develop the potential of these resources for both web-centric entity Information Extraction (IE) and structured KB population. We first investigate the problem of Named Entity Linking (NEL), where systems must resolve ambiguous mentions of entities in text to their corresponding node in a structured KB. We demonstrate that entity disambiguation models derived from inbound web links to Wikipedia are able to complement and in some cases completely replace the role of resources typically derived from the KB. Building on this work, we observe that any page on the web which reliably disambiguates inbound web links may act as an aggregation point for entity knowledge. To uncover these resources, we formalize the task of Web Knowledge Base Discovery (KBD) and develop a system to automatically infer the existence of KB-like endpoints on the web. While extending our framework to multiple KBs increases the breadth of available entity knowledge, we must still consolidate references to the same entity across different web KBs. We investigate this task of Cross-KB Coreference Resolution (KB-Coref) and develop models for efficiently clustering coreferent endpoints across web-scale document collections. Finally, assessing the gap between unstructured web knowledge resources and those of a typical KB, we develop a neural machine translation approach which transforms entity knowledge between unstructured textual mentions and traditional KB structures. The web has great potential as a source of entity knowledge. In this thesis we aim to first discover, distill and finally transform this knowledge into forms which will ultimately be useful in downstream language understanding tasks
Unifying Large Language Models and Knowledge Graphs: A Roadmap
Large language models (LLMs), such as ChatGPT and GPT4, are making new waves
in the field of natural language processing and artificial intelligence, due to
their emergent ability and generalizability. However, LLMs are black-box
models, which often fall short of capturing and accessing factual knowledge. In
contrast, Knowledge Graphs (KGs), Wikipedia and Huapu for example, are
structured knowledge models that explicitly store rich factual knowledge. KGs
can enhance LLMs by providing external knowledge for inference and
interpretability. Meanwhile, KGs are difficult to construct and evolving by
nature, which challenges the existing methods in KGs to generate new facts and
represent unseen knowledge. Therefore, it is complementary to unify LLMs and
KGs together and simultaneously leverage their advantages. In this article, we
present a forward-looking roadmap for the unification of LLMs and KGs. Our
roadmap consists of three general frameworks, namely, 1) KG-enhanced LLMs,
which incorporate KGs during the pre-training and inference phases of LLMs, or
for the purpose of enhancing understanding of the knowledge learned by LLMs; 2)
LLM-augmented KGs, that leverage LLMs for different KG tasks such as embedding,
completion, construction, graph-to-text generation, and question answering; and
3) Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a
mutually beneficial way to enhance both LLMs and KGs for bidirectional
reasoning driven by both data and knowledge. We review and summarize existing
efforts within these three frameworks in our roadmap and pinpoint their future
research directions.Comment: 29 pages, 25 figure
Knowledge extraction from unstructured data
Data availability is becoming more essential, considering the current growth of web-based data. The data available on the web are represented as unstructured, semi-structured, or structured data. In order to make the web-based data available for several Natural Language Processing or Data Mining tasks, the data needs to be presented as machine-readable data in a structured format. Thus, techniques for addressing the problem of capturing knowledge from unstructured data sources are needed. Knowledge extraction methods are used by the research communities to address this problem; methods that are able to capture knowledge in a natural language text and map the extracted knowledge to existing knowledge presented in knowledge graphs (KGs). These knowledge extraction methods include Named-entity recognition, Named-entity Disambiguation, Relation Recognition, and Relation Linking. This thesis addresses the problem of extracting knowledge over unstructured data and discovering patterns in the extracted knowledge. We devise a rule-based approach for entity and relation recognition and linking. The defined approach effectively maps entities and relations within a text to their resources in a target KG. Additionally, it overcomes the challenges of recognizing and linking entities and relations to a specific KG by employing devised catalogs of linguistic and domain-specific rules that state the criteria to recognize entities in a sentence of a particular language, and a deductive database that encodes knowledge in community-maintained KGs. Moreover, we define a Neuro-symbolic approach for the tasks of knowledge extraction in encyclopedic and domain-specific domains; it combines symbolic and sub-symbolic components to overcome the challenges of entity recognition and linking and the limitation of the availability of training data while maintaining the accuracy of recognizing and linking entities. Additionally, we present a context-aware framework for unveiling semantically related posts in a corpus; it is a knowledge-driven framework that retrieves associated posts effectively. We cast the problem of unveiling semantically related posts in a corpus into the Vertex Coloring Problem. We evaluate the performance of our techniques on several benchmarks related to various domains for knowledge extraction tasks. Furthermore, we apply these methods in real-world scenarios from national and international projects. The outcomes show that our techniques are able to effectively extract knowledge encoded in unstructured data and discover patterns over the extracted knowledge presented as machine-readable data. More importantly, the evaluation results provide evidence to the effectiveness of combining the reasoning capacity of the symbolic frameworks with the power of pattern recognition and classification of sub-symbolic models
ConPET: Continual Parameter-Efficient Tuning for Large Language Models
Continual learning necessitates the continual adaptation of models to newly
emerging tasks while minimizing the catastrophic forgetting of old ones. This
is extremely challenging for large language models (LLMs) with vanilla
full-parameter tuning due to high computation costs, memory consumption, and
forgetting issue. Inspired by the success of parameter-efficient tuning (PET),
we propose Continual Parameter-Efficient Tuning (ConPET), a generalizable
paradigm for continual task adaptation of LLMs with task-number-independent
training complexity. ConPET includes two versions with different application
scenarios. First, Static ConPET can adapt former continual learning methods
originally designed for relatively smaller models to LLMs through PET and a
dynamic replay strategy, which largely reduces the tuning costs and alleviates
the over-fitting and forgetting issue. Furthermore, to maintain scalability,
Dynamic ConPET adopts separate PET modules for different tasks and a PET module
selector for dynamic optimal selection. In our extensive experiments, the
adaptation of Static ConPET helps multiple former methods reduce the scale of
tunable parameters by over 3,000 times and surpass the PET-only baseline by at
least 5 points on five smaller benchmarks, while Dynamic ConPET gains its
advantage on the largest dataset. The codes and datasets are available at
https://github.com/Raincleared-Song/ConPET.Comment: 12 pages, 3 figures. This work has been submitted to the IEEE for
possible publication. Copyright may be transferred without notice, after
which this version may no longer be accessibl
BELB: a Biomedical Entity Linking Benchmark
Biomedical entity linking (BEL) is the task of grounding entity mentions to a
knowledge base. It plays a vital role in information extraction pipelines for
the life sciences literature. We review recent work in the field and find that,
as the task is absent from existing benchmarks for biomedical text mining,
different studies adopt different experimental setups making comparisons based
on published numbers problematic. Furthermore, neural systems are tested
primarily on instances linked to the broad coverage knowledge base UMLS,
leaving their performance to more specialized ones, e.g. genes or variants,
understudied. We therefore developed BELB, a Biomedical Entity Linking
Benchmark, providing access in a unified format to 11 corpora linked to 7
knowledge bases and spanning six entity types: gene, disease, chemical,
species, cell line and variant. BELB greatly reduces preprocessing overhead in
testing BEL systems on multiple corpora offering a standardized testbed for
reproducible experiments. Using BELB we perform an extensive evaluation of six
rule-based entity-specific systems and three recent neural approaches
leveraging pre-trained language models. Our results reveal a mixed picture
showing that neural approaches fail to perform consistently across entity
types, highlighting the need of further studies towards entity-agnostic models
Pushing the Limits of ChatGPT on NLP Tasks
Despite the success of ChatGPT, its performances on most NLP tasks are still
well below the supervised baselines. In this work, we looked into the causes,
and discovered that its subpar performance was caused by the following factors:
(1) token limit in the prompt does not allow for the full utilization of the
supervised datasets; (2) mismatch between the generation nature of ChatGPT and
NLP tasks; (3) intrinsic pitfalls of LLMs models, e.g., hallucination, overly
focus on certain keywords, etc.
In this work, we propose a collection of general modules to address these
issues, in an attempt to push the limits of ChatGPT on NLP tasks. Our proposed
modules include (1) a one-input-multiple-prompts strategy that employs multiple
prompts for one input to accommodate more demonstrations; (2) using fine-tuned
models for better demonstration retrieval; (3) transforming tasks to formats
that are more tailored to the generation nature; (4) employing reasoning
strategies that are tailored to addressing the task-specific complexity; (5)
the self-verification strategy to address the hallucination issue of LLMs; (6)
the paraphrase strategy to improve the robustness of model predictions.
We conduct experiments on 21 datasets of 10 representative NLP tasks,
including question answering, commonsense reasoning, natural language
inference, sentiment analysis, named entity recognition, entity-relation
extraction, event extraction, dependency parsing, semantic role labeling, and
part-of-speech tagging. Using the proposed assemble of techniques, we are able
to significantly boost the performance of ChatGPT on the selected NLP tasks,
achieving performances comparable to or better than supervised baselines, or
even existing SOTA performances
Embracing Ambiguity: Improving Similarity-oriented Tasks with Contextual Synonym Knowledge
Contextual synonym knowledge is crucial for those similarity-oriented tasks
whose core challenge lies in capturing semantic similarity between entities in
their contexts, such as entity linking and entity matching. However, most
Pre-trained Language Models (PLMs) lack synonym knowledge due to inherent
limitations of their pre-training objectives such as masked language modeling
(MLM). Existing works which inject synonym knowledge into PLMs often suffer
from two severe problems: (i) Neglecting the ambiguity of synonyms, and (ii)
Undermining semantic understanding of original PLMs, which is caused by
inconsistency between the exact semantic similarity of the synonyms and the
broad conceptual relevance learned from the original corpus. To address these
issues, we propose PICSO, a flexible framework that supports the injection of
contextual synonym knowledge from multiple domains into PLMs via a novel
entity-aware Adapter which focuses on the semantics of the entities (synonyms)
in the contexts. Meanwhile, PICSO stores the synonym knowledge in additional
parameters of the Adapter structure, which prevents it from corrupting the
semantic understanding of the original PLM. Extensive experiments demonstrate
that PICSO can dramatically outperform the original PLMs and the other
knowledge and synonym injection models on four different similarity-oriented
tasks. In addition, experiments on GLUE prove that PICSO also benefits general
natural language understanding tasks. Codes and data will be public.Comment: This work has been submitted to the Neurocomputing. Copyright may be
transferred without notice, after which this version may no longer be
accessibl
- …