21 research outputs found
Dataset and Baseline System for Multi-lingual Extraction and Normalization of Temporal and Numerical Expressions
Temporal and numerical expression understanding is of great importance in
many downstream Natural Language Processing (NLP) and Information Retrieval
(IR) tasks. However, much previous work covers only a few sub-types and focuses
only on entity extraction, which severely limits the usability of identified
mentions. In order for such entities to be useful in downstream scenarios,
coverage and granularity of sub-types are important; and, even more so,
providing resolution into concrete values that can be manipulated. Furthermore,
most previous work addresses only a handful of languages. Here we describe a
multi-lingual evaluation dataset - NTX - covering diverse temporal and
numerical expressions across 14 languages and covering extraction,
normalization, and resolution. Along with the dataset we provide a robust
rule-based system as a strong baseline for comparisons against other models to
be evaluated in this dataset. Data and code are available at
\url{https://aka.ms/NTX}.Comment: Technical Repor
Open-world Story Generation with Structured Knowledge Enhancement: A Comprehensive Survey
Storytelling and narrative are fundamental to human experience, intertwined
with our social and cultural engagement. As such, researchers have long
attempted to create systems that can generate stories automatically. In recent
years, powered by deep learning and massive data resources, automatic story
generation has shown significant advances. However, considerable challenges,
like the need for global coherence in generated stories, still hamper
generative models from reaching the same storytelling ability as human
narrators. To tackle these challenges, many studies seek to inject structured
knowledge into the generation process, which is referred to as structure
knowledge-enhanced story generation. Incorporating external knowledge can
enhance the logical coherence among story events, achieve better knowledge
grounding, and alleviate over-generalization and repetition problems in
stories. This survey provides the latest and comprehensive review of this
research field: (i) we present a systematical taxonomy regarding how existing
methods integrate structured knowledge into story generation; (ii) we summarize
involved story corpora, structured knowledge datasets, and evaluation metrics;
(iii) we give multidimensional insights into the challenges of
knowledge-enhanced story generation and cast light on promising directions for
future study
CoLaDa: A Collaborative Label Denoising Framework for Cross-lingual Named Entity Recognition
Cross-lingual named entity recognition (NER) aims to train an NER system that
generalizes well to a target language by leveraging labeled data in a given
source language. Previous work alleviates the data scarcity problem by
translating source-language labeled data or performing knowledge distillation
on target-language unlabeled data. However, these methods may suffer from label
noise due to the automatic labeling process. In this paper, we propose CoLaDa,
a Collaborative Label Denoising Framework, to address this problem.
Specifically, we first explore a model-collaboration-based denoising scheme
that enables models trained on different data sources to collaboratively
denoise pseudo labels used by each other. We then present an
instance-collaboration-based strategy that considers the label consistency of
each token's neighborhood in the representation space for denoising.
Experiments on different benchmark datasets show that the proposed CoLaDa
achieves superior results compared to previous methods, especially when
generalizing to distant languages.Comment: ACL 2023. Our code is available at
https://github.com/microsoft/vert-papers/tree/master/papers/CoLaD
Enhanced Meta-Learning for Cross-lingual Named Entity Recognition with Minimal Resources
For languages with no annotated resources, transferring knowledge from
rich-resource languages is an effective solution for named entity recognition
(NER). While all existing methods directly transfer from source-learned model
to a target language, in this paper, we propose to fine-tune the learned model
with a few similar examples given a test case, which could benefit the
prediction by leveraging the structural and semantic information conveyed in
such similar examples. To this end, we present a meta-learning algorithm to
find a good model parameter initialization that could fast adapt to the given
test case and propose to construct multiple pseudo-NER tasks for meta-training
by computing sentence similarities. To further improve the model's
generalization ability across different languages, we introduce a masking
scheme and augment the loss function with an additional maximum term during
meta-training. We conduct extensive experiments on cross-lingual named entity
recognition with minimal resources over five target languages. The results show
that our approach significantly outperforms existing state-of-the-art methods
across the board.Comment: This paper is accepted by AAAI2020. Code is available at
https://github.com/microsoft/vert-papers/tree/master/papers/Meta-Cros
All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction
Extracting key information from scientific papers has the potential to help
researchers work more efficiently and accelerate the pace of scientific
progress. Over the last few years, research on Scientific Information
Extraction (SciIE) witnessed the release of several new systems and benchmarks.
However, existing paper-focused datasets mostly focus only on specific parts of
a manuscript (e.g., abstracts) and are single-modality (i.e., text- or
table-only), due to complex processing and expensive annotations. Moreover,
core information can be present in either text or tables or across both. To
close this gap in data availability and enable cross-modality IE, while
alleviating labeling costs, we propose a semi-supervised pipeline for
annotating entities in text, as well as entities and relations in tables, in an
iterative procedure. Based on this pipeline, we release novel resources for the
scientific community, including a high-quality benchmark, a large-scale corpus,
and a semi-supervised annotation pipeline. We further report the performance of
state-of-the-art IE models on the proposed benchmark dataset, as a baseline.
Lastly, we explore the potential capability of large language models such as
ChatGPT for the current task. Our new dataset, results, and analysis validate
the effectiveness and efficiency of our semi-supervised pipeline, and we
discuss its remaining limitations.Comment: Work in progress; 17 pages, 6 figures, 11 table
AutoAgents: A Framework for Automatic Agent Generation
Large language models (LLMs) have enabled remarkable advances in automated
task-solving with multi-agent systems. However, most existing LLM-based
multi-agent approaches rely on predefined agents to handle simple tasks,
limiting the adaptability of multi-agent collaboration to different scenarios.
Therefore, we introduce AutoAgents, an innovative framework that adaptively
generates and coordinates multiple specialized agents to build an AI team
according to different tasks. Specifically, AutoAgents couples the relationship
between tasks and roles by dynamically generating multiple required agents
based on task content and planning solutions for the current task based on the
generated expert agents. Multiple specialized agents collaborate with each
other to efficiently accomplish tasks. Concurrently, an observer role is
incorporated into the framework to reflect on the designated plans and agents'
responses and improve upon them. Our experiments on various benchmarks
demonstrate that AutoAgents generates more coherent and accurate solutions than
the existing multi-agent methods. This underscores the significance of
assigning different roles to different tasks and of team cooperation, offering
new perspectives for tackling complex tasks. The repository of this project is
available at https://github.com/Link-AGI/AutoAgents
Universal NER:A Gold-Standard Multilingual Named Entity Recognition Benchmark
We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingual consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public