
21 research outputs found

Dataset and Baseline System for Multi-lingual Extraction and Normalization of Temporal and Numerical Expressions

Author: Chen Sanxing
Chen Yongqiang
Karlsson Börje F.
Publication date: 31/03/2023

Temporal and numerical expression understanding is of great importance in many downstream Natural Language Processing (NLP) and Information Retrieval (IR) tasks. However, much previous work covers only a few sub-types and focuses only on entity extraction, which severely limits the usability of identified mentions. For such entities to be useful in downstream scenarios, coverage and granularity of sub-types are important; even more important is resolution into concrete values that can be manipulated. Furthermore, most previous work addresses only a handful of languages. Here we describe a multi-lingual evaluation dataset, NTX, covering diverse temporal and numerical expressions across 14 languages and supporting extraction, normalization, and resolution. Along with the dataset we provide a robust rule-based system as a strong baseline for comparison against other models evaluated on this dataset. Data and code are available at https://aka.ms/NTX.

Comment: Technical Report

arXiv.org e-Print Archive

Open-world Story Generation with Structured Knowledge Enhancement: A Comprehensive Survey

Author: Hu Wei
Karlsson Börje F.
Lin Jieru
Wang Yuxin
Yu Zhiwei
Publication date: 08/12/2022

Storytelling and narrative are fundamental to human experience, intertwined with our social and cultural engagement. As such, researchers have long attempted to create systems that can generate stories automatically. In recent years, powered by deep learning and massive data resources, automatic story generation has shown significant advances. However, considerable challenges, like the need for global coherence in generated stories, still prevent generative models from reaching the same storytelling ability as human narrators. To tackle these challenges, many studies seek to inject structured knowledge into the generation process, which is referred to as structured knowledge-enhanced story generation. Incorporating external knowledge can enhance the logical coherence among story events, achieve better knowledge grounding, and alleviate over-generalization and repetition problems in stories. This survey provides an up-to-date and comprehensive review of this research field: (i) we present a systematic taxonomy of how existing methods integrate structured knowledge into story generation; (ii) we summarize the story corpora, structured knowledge datasets, and evaluation metrics involved; (iii) we give multidimensional insights into the challenges of knowledge-enhanced story generation and shed light on promising directions for future study.

arXiv.org e-Print Archive

CoLaDa: A Collaborative Label Denoising Framework for Cross-lingual Named Entity Recognition

Author: Jiang Huiqiang
Karlsson Börje F.
Lin Chin-Yew
Ma Tingting
Wu Qianhui
Zhao Tiejun
Publication date: 24/05/2023

Cross-lingual named entity recognition (NER) aims to train an NER system that generalizes well to a target language by leveraging labeled data in a given source language. Previous work alleviates the data scarcity problem by translating source-language labeled data or performing knowledge distillation on target-language unlabeled data. However, these methods may suffer from label noise due to the automatic labeling process. In this paper, we propose CoLaDa, a Collaborative Label Denoising Framework, to address this problem. Specifically, we first explore a model-collaboration-based denoising scheme that enables models trained on different data sources to collaboratively denoise pseudo labels used by each other. We then present an instance-collaboration-based strategy that considers the label consistency of each token's neighborhood in the representation space for denoising. Experiments on different benchmark datasets show that the proposed CoLaDa achieves superior results compared to previous methods, especially when generalizing to distant languages.

Comment: ACL 2023. Our code is available at https://github.com/microsoft/vert-papers/tree/master/papers/CoLaD

arXiv.org e-Print Archive
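
The CoLaDa abstract above mentions using the label consistency of each token's neighborhood in representation space as a denoising signal. The following is only a minimal sketch of that general idea, not the authors' implementation: the function names `cosine` and `neighbor_consistency`, the toy 2-D vectors, the choice of cosine similarity, and the value of `k` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbor_consistency(vectors, labels, k=2):
    """For each token, return the fraction of its k nearest neighbours
    (by cosine similarity) that carry the same pseudo label.
    Hypothetical helper: a low score hints that the label may be noisy."""
    scores = []
    for i, vec in enumerate(vectors):
        sims = [(cosine(vec, other), j)
                for j, other in enumerate(vectors) if j != i]
        sims.sort(reverse=True)
        nearest = [j for _, j in sims[:k]]
        agree = sum(1 for j in nearest if labels[j] == labels[i])
        scores.append(agree / len(nearest) if nearest else 0.0)
    return scores


# Toy token representations (assumed); the last token sits among "PER"
# tokens but carries a "LOC" pseudo label.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2),
           (0.0, 1.0), (0.1, 0.9), (0.95, 0.05)]
labels = ["PER", "PER", "PER", "LOC", "LOC", "LOC"]
scores = neighbor_consistency(vectors, labels, k=2)
print(scores)
```

In this toy setup the last token receives the lowest consistency score, so a denoising step could down-weight or relabel it. How the real framework uses such a signal is described in the paper, not here.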

Enhanced Meta-Learning for Cross-lingual Named Entity Recognition with Minimal Resources

Author: Chen Hui
Huang Biqing
Karlsson Börje F.
Lin Chin-Yew
Lin Zijia
Wang Guoxin
Wu Qianhui
Publication date: 03/04/2020

For languages with no annotated resources, transferring knowledge from rich-resource languages is an effective solution for named entity recognition (NER). While all existing methods directly transfer a source-learned model to a target language, in this paper we propose to fine-tune the learned model with a few similar examples given a test case, which could benefit the prediction by leveraging the structural and semantic information conveyed in such similar examples. To this end, we present a meta-learning algorithm to find a good model parameter initialization that can quickly adapt to the given test case, and we propose to construct multiple pseudo-NER tasks for meta-training by computing sentence similarities. To further improve the model's generalization ability across different languages, we introduce a masking scheme and augment the loss function with an additional maximum term during meta-training. We conduct extensive experiments on cross-lingual named entity recognition with minimal resources over five target languages. The results show that our approach significantly outperforms existing state-of-the-art methods across the board.

Comment: This paper is accepted by AAAI 2020. Code is available at https://github.com/microsoft/vert-papers/tree/master/papers/Meta-Cros

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications
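
The abstract above describes constructing pseudo-NER tasks for meta-training by computing sentence similarities. A rough sketch of that retrieval step is given below under stated assumptions: `jaccard` and `build_pseudo_tasks` are hypothetical names, and word-overlap (Jaccard) similarity stands in for whatever sentence similarity the paper actually uses.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (an assumed stand-in
    for a learned sentence-similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def build_pseudo_tasks(sentences, k=2):
    """Pair every sentence (the query) with its k most similar other
    sentences (the support set), forming one pseudo task per sentence."""
    tasks = []
    for i, query in enumerate(sentences):
        ranked = sorted(
            (j for j in range(len(sentences)) if j != i),
            key=lambda j: jaccard(query, sentences[j]),
            reverse=True,
        )
        tasks.append({"query": query,
                      "support": [sentences[j] for j in ranked[:k]]})
    return tasks


# Toy corpus (assumed data, not from the paper).
corpus = [
    "Alice visited Paris in May",
    "Bob visited Paris in June",
    "Alice works at Contoso",
    "The weather is nice today",
]
tasks = build_pseudo_tasks(corpus, k=1)
print(tasks[0])
```

Each resulting task could then serve as one episode in a meta-learning loop, with the support sentences used for a quick adaptation step and the query sentence used for evaluation. The masking scheme and the extra loss term mentioned in the abstract are not modelled in this sketch.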

All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction

Author: Karlsson Börje F.
Li Yuhan
Lin Chin-Yew
Okumura Manabu
Shen Wei
Wu Jian
Yu Zhiwei
Publication date: 17/12/2023

Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) has witnessed the release of several new systems and benchmarks. However, existing paper-focused datasets mostly cover only specific parts of a manuscript (e.g., abstracts) and are single-modality (i.e., text- or table-only), due to complex processing and expensive annotations. Moreover, core information can be present in either text or tables or across both. To close this gap in data availability and enable cross-modality IE, while alleviating labeling costs, we propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. Based on this pipeline, we release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline. We further report the performance of state-of-the-art IE models on the proposed benchmark dataset, as a baseline. Lastly, we explore the potential capability of large language models such as ChatGPT for the current task. Our new dataset, results, and analysis validate the effectiveness and efficiency of our semi-supervised pipeline, and we discuss its remaining limitations.

Comment: Work in progress; 17 pages, 6 figures, 11 tables

arXiv.org e-Print Archive

AutoAgents: A Framework for Automatic Agent Generation

Author: Chen Guangyao
Dong Siwei
Fu Jie
Karlsson Börje F.
Sesay Jaward
Shi Yemin
Shu Yu
Zhang Ge
Publication date: 15/10/2023

Large language models (LLMs) have enabled remarkable advances in automated task-solving with multi-agent systems. However, most existing LLM-based multi-agent approaches rely on predefined agents to handle simple tasks, limiting the adaptability of multi-agent collaboration to different scenarios. Therefore, we introduce AutoAgents, an innovative framework that adaptively generates and coordinates multiple specialized agents to build an AI team according to different tasks. Specifically, AutoAgents couples the relationship between tasks and roles by dynamically generating multiple required agents based on task content and planning solutions for the current task based on the generated expert agents. Multiple specialized agents collaborate with each other to efficiently accomplish tasks. Concurrently, an observer role is incorporated into the framework to reflect on the designated plans and agents' responses and improve upon them. Our experiments on various benchmarks demonstrate that AutoAgents generates more coherent and accurate solutions than the existing multi-agent methods. This underscores the significance of assigning different roles to different tasks and of team cooperation, offering new perspectives for tackling complex tasks. The repository of this project is available at https://github.com/Link-AGI/AutoAgents

arXiv.org e-Print Archive

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

Author: Blevins Terra
Gonen Hila
Imperial Joseph Marvin
Karlsson Börje F.
Lin Peiqin
Liu Shuheng
Ljubešić Nikola
Mayhew Stephen
Miranda LJ
Pinter Yuval
Plank Barbara
Riabi Arij
Šuppa Marek
Publication venue: 'Center for Open Science'
Publication date: 15/11/2023

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines in both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.

OPUS