CloudScan - A configuration-free invoice analysis system using recurrent neural networks
We present CloudScan, an invoice analysis system that requires zero
configuration or upfront annotation. In contrast to previous work, CloudScan
does not rely on templates of invoice layout; instead, it learns a single global
model of invoices that naturally generalizes to unseen invoice layouts. The
model is trained using data automatically extracted from end-user provided
feedback. This automatic training data extraction removes the requirement for
users to annotate the data precisely. We describe a recurrent neural network
model that can capture long-range context and compare it to a baseline logistic
regression model corresponding to the current CloudScan production system. We
train and evaluate the system on 8 important fields using a dataset of 326,471
invoices. The recurrent neural network and baseline model achieve 0.891 and
0.887 average F1 scores respectively on seen invoice layouts. For the harder
task of unseen invoice layouts, the recurrent neural network model outperforms
the baseline with 0.840 average F1 compared to 0.788.
Comment: Presented at ICDAR 201
Attend, Copy, Parse -- End-to-end information extraction from documents
Document information extraction tasks performed by humans create data
consisting of a PDF or document image input, and extracted string outputs. This
end-to-end data is naturally consumed and produced when performing the task
because it is valuable in and of itself. It is naturally available, at no
additional cost. Unfortunately, state-of-the-art word classification methods
for information extraction cannot use this data, instead requiring word-level
labels which are expensive to create and consequently not available for many
real life tasks. In this paper we propose the Attend, Copy, Parse architecture,
a deep neural network model that can be trained directly on end-to-end data,
bypassing the need for word-level labels. We evaluate the proposed architecture
on a large diverse set of invoices, and outperform a state-of-the-art
production system based on word classification. We believe our proposed
architecture can be used on many real life information extraction tasks where
word classification cannot be used due to a lack of the required word-level
labels.
End-to-end information extraction without token-level supervision
Most state-of-the-art information extraction approaches rely on token-level labels to find the areas of interest in text. Unfortunately, these labels are time-consuming and costly to create, and consequently, not available for many real-life IE tasks. To make matters worse, token-level labels are usually not the desired output, but just an intermediary step. End-to-end (E2E) models, which take raw text as input and produce the desired output directly, need not depend on token-level labels. We propose an E2E model based on pointer networks, which can be trained directly on pairs of raw input and output text. We evaluate our model on the ATIS data set, MIT restaurant corpus and the MIT movie corpus and compare to neural baselines that do use token-level labels. We achieve competitive results, within a few percentage points of the baselines, showing the feasibility of E2E information extraction without the need for token-level labels. This opens up new possibilities, as for many tasks currently addressed by human extractors, raw input and output data are available, but not token-level labels.
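To illustrate the pointer mechanism mentioned in the abstract above, here is a minimal sketch of one pointing step: a decoder query is scored against an encoding of every input token, the scores are normalized into a distribution over input positions, and the most probable token is copied to the output. This is not the authors' model. The function names, the toy vectors, and the use of dot-product scoring (pointer networks are usually described with a learned additive score) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_distribution(query: Sequence[float],
                         keys: Sequence[Sequence[float]]) -> List[float]:
    """Distribution over input positions for one decoding step.

    Assumption: dot-product scoring between the query and each key.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


def copy_token(tokens: Sequence[str],
               query: Sequence[float],
               keys: Sequence[Sequence[float]]) -> str:
    """Copy the input token that the pointer distribution favors most."""
    probs = pointer_distribution(query, keys)
    best = max(range(len(tokens)), key=probs.__getitem__)
    return tokens[best]


# Hypothetical toy input: one 2-dimensional key vector per token.
tokens = ["flights", "to", "boston"]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
query = [1.0, 1.0]
copied = copy_token(tokens, query, keys)
```

Because the output is selected from the input positions, training such a step only needs the target string and not a label for every input token, which is the property the abstract relies on.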
Significant benefits of AIP testing and clinical screening in familial isolated and young-onset pituitary tumors
Context
Germline mutations in the aryl hydrocarbon receptor-interacting protein (AIP) gene are responsible for a subset of familial isolated pituitary adenoma (FIPA) cases and sporadic pituitary neuroendocrine tumors (PitNETs).
Objective
To compare prospectively diagnosed AIP mutation-positive (AIPmut) PitNET patients with clinically presenting patients and to compare the clinical characteristics of AIPmut and AIPneg PitNET patients.
Design
12-year prospective, observational study.
Participants & Setting
We studied probands and family members of FIPA kindreds and sporadic patients with disease onset ≤18 years or macroadenomas with onset ≤30 years (n = 1477). This was a collaborative study conducted at referral centers for pituitary diseases.
Interventions & Outcome
AIP testing and clinical screening for pituitary disease. Comparison of characteristics of prospectively diagnosed (n = 22) vs clinically presenting AIPmut PitNET patients (n = 145), and AIPmut (n = 167) vs AIPneg PitNET patients (n = 1310).
Results
Prospectively diagnosed AIPmut PitNET patients had smaller lesions with less suprasellar extension or cavernous sinus invasion and required fewer treatments with fewer operations and no radiotherapy compared with clinically presenting cases; there were fewer cases with active disease and hypopituitarism at last follow-up. When comparing AIPmut and AIPneg cases, AIPmut patients were more often male and younger, and more often had GH excess, pituitary apoplexy, and suprasellar extension; more of them required multimodal therapy, including radiotherapy. AIPmut patients (n = 136) with GH excess were taller than AIPneg counterparts (n = 650).
Conclusions
Prospectively diagnosed AIPmut patients show better outcomes than clinically presenting cases, demonstrating the benefits of genetic and clinical screening. AIP-related pituitary disease has a wide spectrum ranging from aggressively growing lesions to stable or indolent disease course.
Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging
We present an HMM part-of-speech tagging method which is particularly suited for POS tagsets with a large number of fine-grained tags. It is based on three ideas: (1) splitting of the POS tags into attribute vectors and decomposition of the contextual POS probabilities of the HMM into a product of attribute probabilities, (2) estimation of the contextual probabilities with decision trees, and (3) use of high-order HMMs. In experiments on German and Czech data, our tagger outperformed state-of-the-art POS taggers.
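Idea (1) in the abstract above can be sketched as a chain-rule product: a fine-grained tag is split into an attribute vector, and its contextual probability is the product of one probability per attribute, each conditioned on the context and on the attributes already chosen. The sketch below is an illustration under stated assumptions, not the authors' implementation. The tag format with "." as separator, the function names, and the lookup table standing in for the decision-tree estimators are all hypothetical.

```python
from typing import Callable, Dict, Tuple

Tag = Tuple[str, ...]
# Signature of an attribute estimator:
# (attribute index, attribute value, earlier attributes, context) -> probability
AttrProb = Callable[[int, str, Tag, Tuple[str, ...]], float]


def split_tag(tag: str, sep: str = ".") -> Tag:
    """Split a fine-grained tag such as 'ART.Def.Nom' into attributes."""
    return tuple(tag.split(sep))


def tag_probability(tag: Tag, context: Tuple[str, ...],
                    attr_prob: AttrProb) -> float:
    """Contextual tag probability as a product of attribute probabilities."""
    prob = 1.0
    for i, value in enumerate(tag):
        # Each factor may depend on the context and on attributes 0..i-1.
        prob *= attr_prob(i, value, tag[:i], context)
    return prob


# Hypothetical stand-in for the decision-tree estimators: a fixed table
# that ignores both the context and the earlier attributes.
TOY_TABLE: Dict[int, Dict[str, float]] = {
    0: {"ART": 0.5, "NN": 0.5},
    1: {"Def": 0.8, "Indef": 0.2},
    2: {"Nom": 0.25, "Acc": 0.75},
}


def toy_attr_prob(i: int, value: str, previous: Tag,
                  context: Tuple[str, ...]) -> float:
    return TOY_TABLE.get(i, {}).get(value, 0.0)


p = tag_probability(split_tag("ART.Def.Nom"), ("NN.Nom",), toy_attr_prob)
```

With this factorization, each estimator only has to model a small set of attribute values instead of the full tagset, which is one way to read why the approach suits tagsets with many fine-grained tags.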
Stopping criteria for active learning of named entity recognition
Active learning is a proven method for reducing the cost of creating the training sets that are necessary for statistical NLP. However, there has been little work on stopping criteria for active learning. An operational stopping criterion is necessary to be able to use active learning in NLP applications. We investigate three different stopping criteria for active learning of named entity recognition (NER) and show that one of them, gradient-based stopping, (i) reliably stops active learning, (ii) achieves near-optimal NER performance, and (iii) needs only about 20% as much training data as exhaustive labeling.
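The abstract does not define gradient-based stopping. One plausible reading, which is an assumption here and may differ from the paper's criterion, is to track a per-iteration performance estimate and stop once the slope of its smoothed curve is close to zero. The function name, the median smoothing, the window size, and the threshold below are all hypothetical choices for illustration.

```python
from statistics import median
from typing import Sequence


def should_stop(scores: Sequence[float], window: int = 5,
                epsilon: float = 0.001) -> bool:
    """Return True once the smoothed score curve has flattened.

    scores: one performance estimate per active learning iteration.
    The slope is approximated from the medians of the two most recent
    windows (an assumed smoothing scheme).
    """
    if len(scores) < 2 * window:
        return False  # not enough history to estimate a slope
    previous = median(scores[-2 * window:-window])
    current = median(scores[-window:])
    gradient = (current - previous) / window
    return abs(gradient) < epsilon


# Hypothetical score curves: one still rising, one that has plateaued.
rising = [0.1 * i for i in range(10)]
plateau = [0.5, 0.6, 0.7, 0.8, 0.85] + [0.9] * 10
```

In an active learning loop, `should_stop` would be called after each round of annotation and retraining, and labeling would end at the first round for which it returns True.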