53 research outputs found
Fine-tuning and aligning question answering models for complex information extraction tasks
The emergence of Large Language Models (LLMs) has boosted performance and opened up new possibilities in various NLP tasks. While the use of generative AI models like ChatGPT enables new opportunities for several business use cases, their current tendency to hallucinate content strongly limits their applicability to document analysis tasks such as information retrieval from documents. In contrast, extractive language models like question answering (QA) or passage retrieval models guarantee that query results are found within the boundaries of a given context document, which makes them candidates for more reliable information extraction in the productive environments of companies.
In this work we propose an approach that integrates extractive QA models into a document analysis solution for improved feature extraction from German business documents such as insurance reports or medical leaflets. We further show that fine-tuning existing German QA models boosts performance on tailored extraction tasks for complex linguistic features, such as damage cause explanations or descriptions of medication appearance, even when using only a small set of annotated data. Finally, we discuss the relevance of scoring metrics for evaluating information extraction tasks and derive a combined metric from Levenshtein distance, F1-Score, Exact Match and ROUGE-L to mimic
the assessment criteria of human experts.
Comment: Accepted at: 15th International Conference on Knowledge Discovery and Information Retrieval (KDIR 2023), part of IC3K
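The abstract names the four constituent scores but not how they are combined. The sketch below assumes an unweighted average of normalised Levenshtein similarity, SQuAD-style token F1, Exact Match and ROUGE-L; the weighting, the normalisation and the helper names are illustrative choices, not the paper's.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Normalised Levenshtein similarity in [0, 1] via the standard DP."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-level F1 between prediction and gold answer."""
    p, g = pred.split(), gold.split()
    gold_counts: dict[str, int] = {}
    for t in g:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in p:
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def rouge_l_f1(pred: str, gold: str) -> float:
    """ROUGE-L F1: longest common subsequence of tokens."""
    p, g = pred.split(), gold.split()
    if not p or not g:
        return 0.0
    dp = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p[i - 1] == g[j - 1] \
                else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    prec, rec = lcs / len(p), lcs / len(g)
    return 2 * prec * rec / (prec + rec) if lcs else 0.0

def combined_score(pred: str, gold: str) -> float:
    # Assumption: simple unweighted mean; Exact Match uses raw string
    # equality after stripping, without further normalisation.
    exact = float(pred.strip() == gold.strip())
    return (levenshtein_similarity(pred, gold) + token_f1(pred, gold)
            + exact + rouge_l_f1(pred, gold)) / 4.0
```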
A mathematical framework for combining decisions of multiple experts toward accurate and remote diagnosis of malaria using tele-microscopy.
We propose a methodology for digitally fusing diagnostic decisions made by multiple medical experts in order to improve the accuracy of diagnosis. Toward this goal, we report an experimental study involving nine experts, where each one was given more than 8,000 digital microscopic images of individual human red blood cells and asked to identify malaria-infected cells. The results of this experiment reveal that even highly trained medical experts are not always self-consistent in their diagnostic decisions and that there exists a fair level of disagreement among experts, even for binary decisions (i.e., infected vs. uninfected). To tackle this general medical diagnosis problem, we propose a probabilistic algorithm that fuses the decisions made by trained medical experts to robustly achieve higher levels of accuracy than individual experts making such decisions. By modelling the decisions of experts as a three-component mixture model and solving for the underlying parameters using the Expectation Maximisation algorithm, we demonstrate the efficacy of our approach, which significantly improves the overall diagnostic accuracy for malaria-infected cells. Additionally, we present a mathematical framework for performing 'slide-level' diagnosis from individual 'cell-level' diagnosis data, shedding more light on the statistical rules that should govern routine practice in the examination of, e.g., thin blood smear samples. This framework could be generalised for various other tele-pathology needs and can be used by trained experts within an efficient tele-medicine platform.
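As a rough illustration of the fusion step, here is a generic EM fit of a three-component Bernoulli mixture over binary expert votes. The authors' exact parameterisation is not given in the abstract, so every modelling choice below (per-expert vote probabilities per component, the smoothing constant, the heuristic mapping from component to diagnosis) is an assumption, not the paper's method.

```python
import numpy as np

def em_mixture(votes: np.ndarray, k: int = 3, n_iter: int = 100, seed: int = 0):
    """EM for a k-component Bernoulli mixture over binary expert votes.

    votes: (n_cells, n_experts) array of 0/1 decisions.
    Returns mixing weights pi (k,), per-component vote probabilities
    theta (k, n_experts), and responsibilities gamma (n_cells, k).
    """
    rng = np.random.default_rng(seed)
    n, m = votes.shape
    pi = np.full(k, 1.0 / k)
    theta = rng.uniform(0.25, 0.75, size=(k, m))
    for _ in range(n_iter):
        # E-step: log-likelihood of each cell's vote vector per component.
        log_p = (votes @ np.log(theta).T
                 + (1 - votes) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update mixing weights and (smoothed) vote probabilities.
        pi = gamma.mean(axis=0)
        theta = (gamma.T @ votes + 1e-3) / (gamma.sum(axis=0)[:, None] + 2e-3)
    return pi, theta, gamma

# Example: 9 experts, 200 cells of synthetic 0/1 votes.
votes = (np.random.default_rng(1).random((200, 9)) < 0.3).astype(float)
pi, theta, gamma = em_mixture(votes)
labels = gamma.argmax(axis=1)                    # most responsible component
infected = theta.mean(axis=1)[labels] > 0.5      # heuristic fused diagnosis
```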
Empirical Methodology for Crowdsourcing Ground Truth
The process of gathering ground truth data through human annotation is a
major bottleneck in the use of information extraction methods for populating
the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the
attempt to solve the issues related to volume of data and lack of annotators.
Typically, these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, there is ambiguity in the data, as well as a multitude of perspectives on the information examples. We present an empirically derived methodology for efficiently gathering ground truth data in a diverse set of use cases covering a variety of domains and annotation tasks. Central to our approach is the use of
CrowdTruth metrics that capture inter-annotator disagreement. We show that
measuring disagreement is essential for acquiring a high quality ground truth.
We achieve this by comparing the quality of the data aggregated with CrowdTruth
metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical
Relation Extraction, Twitter Event Identification, News Event Extraction and
Sound Interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.
Comment: In publication at the Semantic Web Journal
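For intuition, here is a simplified disagreement-aware score in the spirit of CrowdTruth's worker-unit agreement, alongside a majority-vote baseline. The full CrowdTruth metrics couple unit, worker and annotation quality iteratively, so this standalone cosine score is only a sketch of the idea, not the published metric.

```python
import numpy as np

def unit_quality(annotations: np.ndarray) -> float:
    """Mean cosine similarity between each worker's annotation vector and
    the aggregate of the remaining workers on the same media unit.

    annotations: (n_workers, n_labels) binary matrix for one unit; higher
    values indicate agreement, lower values expose ambiguous units.
    """
    scores = []
    for i in range(annotations.shape[0]):
        rest = annotations.sum(axis=0) - annotations[i]
        denom = np.linalg.norm(annotations[i]) * np.linalg.norm(rest)
        if denom > 0:
            scores.append(annotations[i] @ rest / denom)
    return float(np.mean(scores)) if scores else 0.0

def majority_vote(annotations: np.ndarray) -> np.ndarray:
    """Baseline: keep a label iff more than half the workers chose it."""
    return (annotations.sum(axis=0) > annotations.shape[0] / 2).astype(int)

# Example unit: 5 workers, 4 candidate labels.
unit = np.array([[1, 0, 0, 1],
                 [1, 0, 0, 0],
                 [1, 1, 0, 0],
                 [1, 0, 0, 1],
                 [0, 0, 1, 0]])
print(unit_quality(unit), majority_vote(unit))
```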
The Evalita 2014 Dependency Parsing task
SUMMARY.
The Parsing Task is among the “historical” tasks of Evalita, and in all editions its main objective has been to define and improve the state of the art in parsing Italian. The 2014 edition of the shared task features several novelties, mainly concerning the data set and the subtasks. The paper therefore focuses on these two closely interrelated aspects and presents an overview of the participating systems and their results.
RIASSUNTO.
The “Parsing Task”, one of Evalita's historical tasks, has in every edition had the main goal of defining and extending the state of the art in the automatic syntactic analysis of Italian. The 2014 edition of the evaluation campaign is characterized by several significant novelties, related in particular to the data used for training and to its internal organization. The paper therefore focuses on these two closely interrelated aspects and presents an overview of the participating systems and the results achieved.
- …