Key Information Extraction (KIE) is a challenging multimodal task that aims
to extract structured value semantic entities from visually rich documents.
Although significant progress has been made, there are still two major
challenges that need to be addressed. Firstly, the layout of existing datasets
is relatively fixed and limited in the number of semantic entity categories,
creating a significant gap between these datasets and the complex real-world
scenarios. Secondly, existing methods follow a two-stage pipeline strategy,
which may lead to the error propagation problem. Additionally, they are
difficult to apply in situations where unseen semantic entity categories
emerge. To address the first challenge, we propose a new large-scale
human-annotated dataset named Complex Layout form for key information
EXtraction (CLEX), which consists of 5,860 images with 1,162 semantic entity
categories. To solve the second challenge, we introduce Parallel Pointer-based
Network (PPN), an end-to-end model that can be applied in zero-shot and
few-shot scenarios. PPN leverages the implicit clues between semantic entities
to assist extracting, and its parallel extraction mechanism allows it to
extract multiple results simultaneously and efficiently. Experiments on the
CLEX dataset demonstrate that PPN outperforms existing state-of-the-art methods
while also offering a much faster inference speed