Attend, Copy, Parse -- End-to-end information extraction from documents
Document information extraction tasks performed by humans create data
consisting of a PDF or document image input, and extracted string outputs. This
end-to-end data is naturally consumed and produced when performing the task
because it is valuable in and of itself. It is naturally available, at no
additional cost. Unfortunately, state-of-the-art word classification methods
for information extraction cannot use this data, instead requiring word-level
labels which are expensive to create and consequently not available for many
real life tasks. In this paper we propose the Attend, Copy, Parse architecture,
a deep neural network model that can be trained directly on end-to-end data,
bypassing the need for word-level labels. We evaluate the proposed architecture
on a large diverse set of invoices, and outperform a state-of-the-art
production system based on word classification. We believe our proposed
architecture can be used on many real life information extraction tasks where
word classification cannot be used due to a lack of the required word-level
labels.
Automatic office document classification and information extraction
TEXPROS (TEXt PROcessing System) is a document processing system (DPS) that supports office workers in their daily information and document management. This thesis investigates document classification and information extraction, two of the major functional capabilities of TEXPROS.
Based on the nature of its content, a document is divided into structured and unstructured (i.e., free-text) parts. The conceptual and content structures are introduced to capture the semantics of the structured and unstructured parts of the document, respectively. The document is classified and information is extracted based on the analyses of the conceptual and content structures. In our approach, the layout structure of a document is used to assist these analyses. By nested segmentation of a document, its layout structure is represented as an ordered labeled tree, called the Layout Structure Tree (L-S-Tree). A sample-based classification mechanism is adopted for classifying the documents. A set of pre-classified documents is stored in a document sample base in the form of sample trees. In the layout analysis, approximate tree matching is used to match the L-S-Tree of a document to be classified against the sample trees. The layout similarity between the document and each sample document is evaluated from the edit distance between the L-S-Tree of the document and the sample tree. The document samples whose layout structure is similar to that of the document are chosen for the conceptual analysis of the document.
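The layout-matching step described above can be illustrated with a small sketch. It is not the thesis's algorithm: it assumes Selkow's simplified top-down edit distance between ordered labeled trees (relabel a node, or insert or delete a whole subtree), with unit costs per node. The `Node` class, the labels, and the sample trees are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of an ordered labeled tree (hypothetical stand-in for an L-S-Tree)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(tree: Node) -> int:
    """Number of nodes in the tree; used as the cost of inserting or deleting it."""
    return 1 + sum(size(child) for child in tree.children)


def tree_distance(a: Node, b: Node) -> int:
    """Selkow-style top-down edit distance between two ordered labeled trees."""
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # dp[i][j]: cost of turning the first i children of a into the first j of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),  # delete a subtree
                dp[i][j - 1] + size(b.children[j - 1]),  # insert a subtree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),  # match
            )
    return relabel + dp[m][n]


# Hypothetical document tree and two sample trees.
doc = Node("page", [Node("header"), Node("body", [Node("para"), Node("para")])])
memo_sample = Node("page", [Node("header"), Node("body", [Node("para")])])
form_sample = Node("page", [Node("table")])

samples = {"memo": memo_sample, "form": form_sample}
best = min(samples, key=lambda name: tree_distance(doc, samples[name]))
```

Here `best` names the sample whose layout tree is closest to the document, which is the sample that would be passed on to the conceptual analysis.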
In the conceptual analysis of the document, the mapping between the document and the document samples found during the layout analysis is used to evaluate the conceptual similarity between the document and each sample, measured by the degree of conceptual closeness. The document sample whose conceptual structure is similar to that of the document is chosen for extracting information. Information is extracted from the structured part of the document using the layout locations of key terms appearing in the document and string pattern matching. The type of the document is then identified from the information extracted from the structured part. In the content analysis of the document, bottom-up and top-down analyses of the free text are combined to extract information from the unstructured part of the document. In the bottom-up analysis, the sentences of the free text are classified as relevant or irrelevant to the extraction. The sentence classification is based on the semantic relationship between the phrases in the sentences and the attribute names in the corresponding content structure, determined by consulting a thesaurus. The thematic roles of the phrases in each relevant sentence are then identified by syntactic analysis and heuristic thematic analysis. In the top-down analysis, the appropriate content structure is identified from the document type determined in the conceptual analysis. Information is then extracted from the unstructured part of the document by evaluating the restrictions specified in the corresponding content structure against the result of the bottom-up analysis.
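As a rough illustration of extracting the structured part by key terms and string pattern matching, the sketch below looks for a key term at the start of a line and captures the value that follows it. The key terms, the patterns, the memo text, and the `extract_frame` helper are all hypothetical, and the sketch ignores the layout locations that the thesis also uses.

```python
import re
from typing import Dict

# Hypothetical key terms, each paired with a pattern for the value after it.
FIELD_PATTERNS = {
    "to": re.compile(r"^To:\s*(.+?)\s*$", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE),
    "subject": re.compile(r"^Subject:\s*(.+?)\s*$", re.MULTILINE),
}


def extract_frame(text: str) -> Dict[str, str]:
    """Return a frame-like dict holding each field whose pattern matches."""
    frame = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            frame[name] = match.group(1)
    return frame


memo = (
    "To: Office staff\n"
    "Date: 1995-03-14\n"
    "Subject: Budget review\n"
    "\n"
    "Please bring last quarter's figures to the meeting.\n"
)
frame = extract_frame(memo)
```

Fields whose pattern does not match are simply left out of the frame, so a partially filled frame signals which key terms were missing from the document.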
The information extracted from the structured and unstructured parts of the document is stored as a frame-like structure (frame instance) in the database for information retrieval in TEXPROS.
TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents
Recently, automatically extracting information from visually rich documents
(e.g., tickets and resumes) has become a hot and vital research topic due to
its widespread commercial value. Most existing methods divide this task into
two subparts: the text reading part for obtaining the plain text from the
original document images and the information extraction part for extracting key
contents. These methods mainly focus on improving the second, while neglecting
that the two parts are highly correlated. This paper proposes a unified
end-to-end information extraction framework from visually rich documents, where
text reading and information extraction can reinforce each other via a
well-designed multi-modal context block. Specifically, the text reading part
provides multi-modal features like visual, textual and layout features. The
multi-modal context block is developed to fuse the generated multi-modal
features and even the prior knowledge from the pre-trained language model for
better semantic representation. The information extraction part is responsible
for generating key contents with the fused context features. The framework can
be trained end to end, achieving global optimization.
What is more, we define and group visually rich documents into four categories
across two dimensions, the layout and text type. For each document category, we
provide or recommend the corresponding benchmarks, experimental settings and
strong baselines for remedying the problem that this research area lacks the
uniform evaluation standard. Extensive experiments on four kinds of benchmarks
(from fixed layout to variable layout, from full-structured text to
semi-unstructured text) are reported, demonstrating the proposed method's
effectiveness. Data, source code and models are available
Question Answering System for Yioop
Yioop is an open source search engine developed and managed by Dr. Christopher Pollett. Currently, Yioop returns the search results of a query as a list of URLs, just like other search engines (Google, Bing, DuckDuckGo, etc.). This paper presents a new module for Yioop. The module, known as the Question-Answering (QA) System, takes search queries in the form of natural language questions and returns a short answer appropriate to the question asked. This feature is achieved by implementing various Natural Language Processing (NLP) functionalities. Using NLP, the QA System attempts to extract the necessary information from the query provided by the user and provides an appropriate answer from the available data.
Forensic acquisition of file systems with parallel processing of digital artifacts to generate an early case assessment report
The way human beings interact and carry out routine tasks has changed over the last few decades, and a long list of activities is now possible only through the use of information technology; these include the acquisition of goods and services, business management and operations, and communications. These transformations are also visible in other, less legitimate activities, allowing crimes to be committed through digital means.
Broadly speaking, forensic investigators search for evidence of criminal actions carried out by means of digital devices in order to ultimately identify the perpetrators, the extent of the damage caused, and the background story that made the crime possible. In essence, this activity must follow strict rules to ensure that the evidence is admissible in court, but the greater the number of new artifacts and the greater the volume of available storage devices, the longer the time between identifying a suspect's device and the moment the investigator begins navigating the sea of information housed on it.
This research aims to bring forward some stages of the EDRM by using the parallel processing available in current processing units (CPUs) to parse multiple forensic artifacts of the Windows 10 operating system and generate a report with the most crucial information about the acquired device. This allows an early case assessment (ECA) while a full disk acquisition is still in progress, thereby having minimal impact on the overall acquisition time.