CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data
In recent years, the field of document understanding has progressed substantially. A significant part of this progress has been enabled by language models pretrained on large collections of documents. However, the pretraining corpora used in the document understanding domain are single-domain, monolingual, or non-public. Our goal in this paper is to propose an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl, as PDF files are among the most canonical document types considered in document understanding. We extensively analysed all steps of the pipeline and proposed a solution that trades off data quality against processing time. We also share the CCpdf corpus in the form of an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining. The dataset and tools published with this paper offer researchers the opportunity to develop even better multilingual language models.
Comment: Accepted at ICDAR 202
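The abstract describes a pipeline that filters Common Crawl data down to PDF files. A minimal sketch of one such filtering step is shown below; the record fields loosely mirror Common Crawl index entries, but the records, the function name, and the heuristics are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: one filtering step of a CCpdf-style pipeline.
# The record dicts imitate Common Crawl index entries; the data
# and heuristics here are made-up examples, not the paper's code.

def is_probable_pdf(record: dict) -> bool:
    """Keep records whose MIME type or URL suggests a PDF file."""
    mime = record.get("mime", "").lower()
    url = record.get("url", "").lower()
    return mime == "application/pdf" or url.endswith(".pdf")

records = [
    {"url": "https://example.org/report.pdf", "mime": "application/pdf"},
    {"url": "https://example.org/index.html", "mime": "text/html"},
    {"url": "https://example.org/paper", "mime": "application/pdf"},
]

# Build the PDF index: the subset of crawl records to download later.
pdf_index = [r for r in records if is_probable_pdf(r)]
print(len(pdf_index))  # 2
```

A real pipeline would stream millions of index records and add language and quality filters; this only illustrates the index-filtering idea.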
QueryForm: A Simple Zero-shot Form Entity Query Framework
Zero-shot transfer learning for document understanding is a crucial yet
under-investigated scenario to help reduce the high cost involved in annotating
document entities. We present a novel query-based framework, QueryForm, that
extracts entity values from form-like documents in a zero-shot fashion.
QueryForm contains a dual prompting mechanism that composes both the document
schema and a specific entity type into a query, which is used to prompt a
Transformer model to perform a single entity extraction task. Furthermore, we
propose to leverage large-scale query-entity pairs generated from form-like
webpages with weak HTML annotations to pre-train QueryForm. By unifying
pre-training and fine-tuning into the same query-based framework, QueryForm
enables models to learn from structured documents containing various entities
and layouts, leading to better generalization to target document types without
the need for target-specific training data. QueryForm sets a new state-of-the-art average F1 score on both the XFUND (+4.6%~10.1%) and the Payment (+3.2%~9.5%) zero-shot benchmarks, with a smaller model size and no additional image input.
Comment: Accepted to Findings of ACL 202
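The dual prompting mechanism described above composes the document schema and a target entity type into a single query. The following toy sketch illustrates that composition; the template string and example inputs are assumptions for illustration, not the paper's actual prompt format.

```python
# Hedged sketch of the dual-prompting idea in QueryForm: a schema
# prompt and an entity-type prompt are composed into one query that
# would condition a Transformer to extract a single entity value.
# The template below is an illustrative assumption.

def compose_query(schema: str, entity_type: str) -> str:
    """Combine document schema and entity type into one query string."""
    return f"schema: {schema} | entity: {entity_type}"

query = compose_query("payment form", "invoice number")
print(query)  # schema: payment form | entity: invoice number
```

The same composition is used at pre-training time (with query-entity pairs mined from annotated webpages) and at fine-tuning time, which is what lets one framework cover both stages.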
Semantic enrichment for enhancing LAM data and supporting digital humanities. Review article
With the rapid development of the digital humanities (DH) field, demand for historical and cultural heritage data has generated deep interest in the data provided by libraries, archives, and museums (LAMs). In order to enhance the quality and discoverability of LAM data while enabling a self-sustaining ecosystem, “semantic enrichment” has become a strategy increasingly used by LAMs in recent years. This article introduces a number of semantic enrichment methods and efforts that can be applied to LAM data at various levels, aiming to support deeper and wider exploration and use of LAM data in DH research. The real-world cases, research projects, experiments, and pilot studies shared in this article demonstrate the endless potential of LAM data, whether structured, semi-structured, or unstructured, regardless of the types of original artifacts that carry the data. Following their roadmaps would encourage more effective initiatives and strengthen efforts to maximize the discoverability, usability, and reusability of LAM data, and its value in the mainstream of DH and the Semantic Web.