39 research outputs found
Information Extraction from Documents: Question Answering vs Token Classification in real-world setups
Research in Document Intelligence and especially in Document Key Information
Extraction (DocKIE) has been mainly solved as Token Classification problem.
Recent breakthroughs in both natural language processing (NLP) and computer
vision helped building document-focused pre-training methods, leveraging a
multimodal understanding of the document text, layout and image modalities.
However, these breakthroughs also led to the emergence of a new DocKIE subtask
of extractive document Question Answering (DocQA), as part of the Machine
Reading Comprehension (MRC) research field. In this work, we compare the
Question Answering approach with the classical token classification approach
for document key information extraction. We designed experiments to benchmark
five different experimental setups : raw performances, robustness to noisy
environment, capacity to extract long entities, fine-tuning speed on Few-Shot
Learning and finally Zero-Shot Learning. Our research showed that when dealing
with clean and relatively short entities, it is still best to use token
classification-based approach, while the QA approach could be a good
alternative for noisy environment or long entities use-cases
Robust Domain Adaptation for Pre-trained Multilingual Neural Machine Translation Models
Recent literature has demonstrated the potential of multilingual Neural
Machine Translation (mNMT) models. However, the most efficient models are not
well suited to specialized industries. In these cases, internal data is scarce
and expensive to find in all language pairs. Therefore, fine-tuning a mNMT
model on a specialized domain is hard. In this context, we decided to focus on
a new task: Domain Adaptation of a pre-trained mNMT model on a single pair of
language while trying to maintain model quality on generic domain data for all
language pairs. The risk of loss on generic domain and on other pairs is high.
This task is key for mNMT model adoption in the industry and is at the border
of many others. We propose a fine-tuning procedure for the generic mNMT that
combines embeddings freezing and adversarial loss. Our experiments demonstrated
that the procedure improves performances on specialized data with a minimal
loss in initial performances on generic domain for all languages pairs,
compared to a naive standard approach (+10.0 BLEU score on specialized data,
-0.01 to -0.5 BLEU on WMT and Tatoeba datasets on the other pairs with M2M100).Comment: Accepted by EMNLP 2022 MMNLU Worksho