CloudScan - A configuration-free invoice analysis system using recurrent neural networks
We present CloudScan, an invoice analysis system that requires zero
configuration or upfront annotation. In contrast to previous work, CloudScan
does not rely on templates of invoice layout; instead, it learns a single global
model of invoices that naturally generalizes to unseen invoice layouts. The
model is trained using data automatically extracted from end-user provided
feedback. This automatic training data extraction removes the requirement for
users to annotate the data precisely. We describe a recurrent neural network
model that can capture long range context and compare it to a baseline logistic
regression model corresponding to the current CloudScan production system. We
train and evaluate the system on 8 important fields using a dataset of 326,471
invoices. The recurrent neural network and baseline model achieve 0.891 and
0.887 average F1 scores respectively on seen invoice layouts. For the harder
task of unseen invoice layouts, the recurrent neural network model outperforms
the baseline with 0.840 average F1 compared to 0.788. Comment: Presented at ICDAR 201
A two-stage approach for table extraction in invoices
The automated analysis of administrative documents is an important field in
document recognition that has been studied for decades. Invoices are key documents
among the huge volumes of documents held by companies and public
services. Most of the time, invoices present their data in tables,
which must be clearly identified before suitable information can be extracted. In this
paper, we propose an approach that combines an image-processing-based
estimation of the shape of the tables with a graph-based representation of the
document, which is used to identify complex tables precisely. We present an
experimental evaluation using a real case application
Attend, Copy, Parse -- End-to-end information extraction from documents
Document information extraction tasks performed by humans create data
consisting of a PDF or document image input, and extracted string outputs. This
end-to-end data is naturally consumed and produced when performing the task
because it is valuable in and of itself. It is naturally available, at no
additional cost. Unfortunately, state-of-the-art word classification methods
for information extraction cannot use this data, instead requiring word-level
labels which are expensive to create and consequently not available for many
real life tasks. In this paper we propose the Attend, Copy, Parse architecture,
a deep neural network model that can be trained directly on end-to-end data,
bypassing the need for word-level labels. We evaluate the proposed architecture
on a large diverse set of invoices, and outperform a state-of-the-art
production system based on word classification. We believe our proposed
architecture can be used on many real life information extraction tasks where
word classification cannot be used due to a lack of the required word-level
labels
Ontology learning from Italian legal texts
The paper reports on the methodology and preliminary results of a case study in automatically extracting ontological knowledge from Italian legislative texts. We use a fully-implemented ontology learning system (T2K) that includes a battery of tools for Natural Language Processing (NLP), statistical text analysis and machine language learning. Tools are dynamically integrated to provide an incremental representation of the content of vast repositories of unstructured documents. Evaluated results, however preliminary, show the great potential of NLP-powered incremental systems like T2K for accurate large-scale semi-automatic extraction of legal ontologies
Inverse software configuration management
Software systems are playing an increasingly important role in almost every aspect of today’s society such that they impact on our businesses, industry, leisure, health and safety. Many of these systems are extremely large and complex and depend upon the correct interaction of many hundreds or even thousands of heterogeneous components. Commensurate with this increased reliance on software is the need for high quality products that meet customer expectations, perform reliably and which can be cost-effectively and safely maintained. Techniques such as software configuration management have proved to be invaluable during the development process to ensure that this is the case. However, there are a very large number of legacy systems which were not developed under controlled conditions, but which still need to be maintained due to the heavy investment incorporated within them. Such systems are characterised by extremely high program comprehension overheads and the probability that new errors will be introduced during the maintenance process, often with serious consequences. To address the issues concerning maintenance of legacy systems, this thesis has defined and developed a new process and associated maintenance model, Inverse Software Configuration Management (ISCM). This model centres on a layered approach to the program comprehension process through the definition of a number of software configuration abstractions. This information, together with the set of rules for reclaiming the information, is stored within an Extensible System Information Base (ESIB) via the definition of a Programming-in-the-Environment (PITE) language, the Inverse Configuration Description Language (ICDL). In order to assist the application of the ISCM process across a wide range of software applications and system architectures, the PISCES (Proforma Identification Scheme for Configurations of Existing Systems) method has been developed as a series of defined procedures and guidelines.
To underpin the method and to offer a user-friendly interface to the process, a series of templates, the Proforma Increasing Complexity Series (PICS), has been developed. To enable the useful employment of these techniques on large-scale systems, the subject of automation has been addressed through the development of a flexible meta-CASE environment, the PISCES M4 (MultiMedia Maintenance Manager) system. Of particular interest within this environment is the provision of a multimedia user interface (MUI) to the maintenance process. As a means of evaluating the PISCES method and to provide feedback into the ISCM process, a number of practical applications have been modelled. In summary, this research has considered a number of concepts, some of which are innovative in themselves, others of which are used in an innovative manner. In combination, these concepts may be considered to considerably advance the knowledge and understanding of the comprehension process during the maintenance of legacy software systems. A number of publications have already resulted from the research and several more are in preparation. Additionally, a number of areas for further study have been identified, some of which are already underway as funded research and development projects