Document-Level Multi-Event Extraction with Event Proxy Nodes and Hausdorff Distance Minimization

Authors: Gui Lin; He Yulan; Wang Xinyu
Publication date: 30/05/2023

Document-level multi-event extraction aims to extract the structural information from a given document automatically. Most recent approaches usually involve two steps: (1) modeling entity interactions; (2) decoding entity interactions into events. However, such approaches ignore a global view of the inter-dependency of multiple events. Moreover, an event is decoded by iteratively merging its related entities as arguments, which might suffer from error propagation and is computationally inefficient. In this paper, we propose an alternative approach for document-level multi-event extraction with event proxy nodes and Hausdorff distance minimization. The event proxy nodes, representing pseudo-events, are able to build connections with other event proxy nodes, essentially capturing global information. The Hausdorff distance makes it possible to compare the similarity between the set of predicted events and the set of ground-truth events. By minimizing the Hausdorff distance directly, the model is trained towards the global optimum, which improves performance and reduces training time. Experimental results show that our model outperforms the previous state-of-the-art method in F1-score on two datasets with only a fraction of the training time.
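To make the set-to-set comparison concrete, here is a minimal sketch of the classic Hausdorff distance between two finite sets. The abstract does not say how individual events are represented or which variant of the distance the authors minimize, so the 2-D points, the Euclidean metric, and the function names below are illustrative assumptions only, not the paper's implementation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def directed_hausdorff(xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
    """Largest distance from any point in xs to its nearest point in ys."""
    return max(min(math.dist(x, y) for y in ys) for x in xs)


def hausdorff(xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
    """Classic (symmetric) Hausdorff distance between two non-empty sets."""
    if not xs or not ys:
        raise ValueError("both sets must be non-empty")
    return max(directed_hausdorff(xs, ys), directed_hausdorff(ys, xs))


# Assumption: each event is stood in for by a toy 2-D vector.
predicted = [(0.0, 0.0), (1.0, 0.0)]
gold = [(0.0, 0.0), (4.0, 0.0)]
distance = hausdorff(predicted, gold)
print(distance)
```

The distance is zero only when every element of each set has an exact match in the other, which is why it can serve as a single objective over the whole set of predicted events rather than over one event at a time.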

arXiv.org e-Print Archive

Sem-TF-IDF: A Simple Semantic Approach to Generalize TF-IDF by Employing Instruction Tuned Large Language Models

Authors: Chen Wei; Lin Bo
Publication venue: Technical Disclosure Commons
Publication date: 09/02/2024

TF-IDF (Term Frequency - Inverse Document Frequency) based information retrieval approaches measure the importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. However, this measure is based solely on counts of individual words in the document or corpus, without consideration of higher-level semantics. Attribute extraction is another approach for identifying salient information about a document; however, the extracted attributes are often multi-word phrases and do not include an indication of their salience for the document in question. This disclosure describes a simple unsupervised learning approach called Sem-TF-IDF that leverages instruction-tuned large language models (IT-LLMs) to identify salient pieces of information related to a document or an entity in the semantic space. The approach modifies the frequency definitions in classic TF-IDF to be based on topics instead of terms. The types of topics can include terms (used interchangeably with “words” in this document), phrases, and, broadly speaking, any piece of information. Topics within a document can be identified by inputting the document to an IT-LLM along with a suitable prompt and/or by employing existing attribute extraction approaches. Thresholds and peer document or entity groups appropriate for the task can be used to filter the topics and optionally summarize the corresponding information as relevant for the application and user needs. The techniques generalize the classic TF-IDF approach to the higher-level semantic space and are suitable for any information retrieval application in digital maps, search engines, etc.
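The idea of counting topics instead of terms can be sketched as follows. The abstract gives no exact formulas, so this example assumes the textbook definitions (relative frequency for TF, log(N / df) for IDF) and assumes that topic extraction has already happened; the function name `topic_tf_idf`, the document ids, and the topic strings are hypothetical, and the IT-LLM step is not shown.

```python
import math
from collections import Counter


def topic_tf_idf(doc_topics: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Score each topic per document with TF-IDF computed over topics.

    doc_topics maps a document id to the topics extracted from it
    (for example by an LLM prompt); repeated topics count toward TF.
    """
    n_docs = len(doc_topics)

    # Document frequency: in how many documents does each topic occur?
    doc_freq: Counter = Counter()
    for topics in doc_topics.values():
        doc_freq.update(set(topics))

    scores: dict[str, dict[str, float]] = {}
    for doc_id, topics in doc_topics.items():
        counts = Counter(topics)
        total = len(topics)
        scores[doc_id] = {
            topic: (count / total) * math.log(n_docs / doc_freq[topic])
            for topic, count in counts.items()
        }
    return scores


# Hypothetical pre-extracted topics for two map listings.
doc_topics = {
    "cafe_a": ["outdoor seating", "espresso", "espresso"],
    "cafe_b": ["outdoor seating", "live music"],
}
scores = topic_tf_idf(doc_topics)
print(scores["cafe_a"])
```

A topic shared by every document in the peer group (here "outdoor seating") scores zero, while a topic specific to one document scores higher, which mirrors how the thresholds and peer groups mentioned above would filter for salience.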

Technical Disclosure Commons

Named Entity Recognition in Twitter using Images and Text

Authors: Esteves Diego; Lehmann Jens; Napolitano Giulio; Peres Rafael
Publication date: 30/10/2017

Named Entity Recognition (NER) is an important subtask of information extraction that seeks to locate and recognise named entities. Despite recent achievements, we still face limitations in correctly detecting and classifying entities, particularly in short and noisy text, such as Twitter. An important drawback of most NER approaches is their high dependency on hand-crafted features and domain-specific knowledge, which are necessary to achieve state-of-the-art results. Thus, devising models to deal with such linguistically complex contexts is still challenging. In this paper, we propose a novel multi-level architecture that does not rely on any specific linguistic resource or encoded rule. Unlike traditional approaches, we use features extracted from images and text to classify named entities. Experimental tests against state-of-the-art NER for Twitter on the Ritter dataset present competitive results (0.59 F-measure), indicating that this approach may lead towards better NER models.
Comment: The 3rd International Workshop on Natural Language Processing for Informal Text (NLPIT 2017), 8 pages

arXiv.org e-Print Archive

Fraunhofer-ePrints