Deep Learning for Technical Document Classification
In large technology companies, the requirements for managing and organizing
technical documents created by engineers and managers have increased
dramatically in recent years, which has led to a higher demand for more
scalable, accurate, and automated document classification. Prior studies have
only focused on processing text for classification, whereas technical documents
often contain multimodal information. To leverage multimodal information for
document classification to improve the model performance, this paper presents a
novel multimodal deep learning architecture, TechDoc, which utilizes three
types of information, including natural language texts and descriptive images
within documents and the associations among the documents. The architecture
synthesizes the convolutional neural network, recurrent neural network, and
graph neural network through an integrated training process. We applied the
architecture to a large multimodal technical document database and trained the
model for classifying documents based on the hierarchical International Patent
Classification system. Our results show that TechDoc presents a greater
classification accuracy than the unimodal methods and other state-of-the-art
benchmarks. The trained model can potentially be scaled to millions of
real-world multimodal technical documents, which is useful for data and
knowledge management in large technology companies and organizations. Comment: 16 pages, 8 figures, 9 tables
Integration of document representation, processing and management
This paper describes a method of document representation and proposes an approach towards an integrated document processing and management system. The approach is intended to capture essentially freely structured documents, like those typically used in the office domain. The document analysis system ANASTASIL is capable of revealing the structure of complex paper documents, as well as the logical objects within them, such as receiver, footnote, and date. Moreover, it facilitates the handling of the information they contain. Analyzed documents are stored by the management system KRISYS, which is connected to several subsequent services. The described integrated system can be considered an ideal extension of the human clerk, making his information processing tasks easier. The symbolic representation of the analysis results allows an easy transformation into a given international standard, e.g., ODA/ODIF or SGML, and interchange via global networks.
Modelling, Visualising and Summarising Documents with a Single Convolutional Neural Network
Capturing the compositional process which maps the meaning of words to that
of documents is a central challenge for researchers in Natural Language
Processing and Information Retrieval. We introduce a model that is able to
represent the meaning of documents by embedding them in a low dimensional
vector space, while preserving distinctions of word and sentence order crucial
for capturing nuanced semantics. Our model is based on an extended Dynamic
Convolution Neural Network, which learns convolution filters at both the
sentence and document level, hierarchically learning to capture and compose low
level lexical features into high level semantic concepts. We demonstrate the
effectiveness of this model on a range of document modelling tasks, achieving
strong results with no feature engineering and with a more compact model.
Inspired by recent advances in visualising deep convolution networks for
computer vision, we present a novel visualisation technique for our document
networks which not only provides insight into their learning process, but also
can be interpreted to produce a compelling automatic summarisation system for
texts.
Discovering Knowledge from Relational Data Extracted from Business News
Thousands of business news stories (including press releases, earnings
reports, general business news, etc.) are released each day. Recently, information
technology advances have partially automated the processing of
documents, reducing the amount of text that must be read. Current techniques
(e.g., text classification and information extraction) for full-text analysis for the
most part are limited to discovering information that can be found in single
documents. Often, however, important information does not reside in a single
document, but in the relationships between information distributed over multiple
documents. This paper reports on an investigation into whether knowledge
can be discovered automatically from relational data extracted from large corpora
of business news stories. We use a combination of information extraction,
network analysis, and statistical techniques. We show that relationally interlinked
patterns distributed over multiple documents can indeed be extracted,
and (specifically) that knowledge about companies' interrelationships can be
discovered. We evaluate the extracted relationships in several ways: we give a
broad visualization of related companies, showing intuitive industry clusters;
we use network analysis to ask who are the central players, and finally, we
show that the extracted interrelationships can be used for important tasks, such
as for classifying companies by industry membership. Information Systems Working Papers Series
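The relational idea in this abstract can be illustrated with a minimal sketch: count which companies are mentioned together in the same story, then rank companies by how many distinct partners they have. This is not the paper's pipeline, which the abstract does not describe in detail. The toy stories, the company names, and the functions `co_mention_edges` and `degree_centrality` are all hypothetical, and plain degree is only one simple stand-in for "central players".

```python
from collections import Counter
from itertools import combinations


def co_mention_edges(stories):
    """Count how often each pair of companies appears in the same story."""
    edges = Counter()
    for companies in stories:
        # Sort and deduplicate so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(companies)), 2):
            edges[(a, b)] += 1
    return edges


def degree_centrality(edges):
    """Number of distinct co-mentioned partners per company."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


# Hypothetical toy corpus: each story is reduced to the companies it mentions,
# as if an information extraction step had already run.
stories = [
    ["AlphaCorp", "BetaSoft"],
    ["AlphaCorp", "GammaBank"],
    ["AlphaCorp", "BetaSoft", "DeltaAir"],
]

edges = co_mention_edges(stories)
central = degree_centrality(edges)
```

In this toy data no single story links all four companies, yet the aggregated network shows AlphaCorp connected to every other company, which is the kind of cross-document pattern the abstract refers to.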
Discussion documents – SUSVAR Visions Workshop, Karrebæksminde, Denmark, April 2008
Seven discussion documents were produced during the SUSVAR Visions workshop ‘Sustainable cereal production beyond 2020: Visions from the SUSVAR network’, Karrebæksminde, Denmark, 14-16 April 2008. At the workshop, one discussion document was written for each of the topics mentioned below. In total, 55 persons from 21 European countries participated in the process. The participants came from different disciplines: genetics, plant breeding, genetic resources, agronomy, plant pathology, soil science, biometry and system analysis, all specialised in the area of cereal production.
The approach taken at the workshop was to focus on envisioning the future of sustainable agriculture, especially cereal production. This was done by scientific creative thinking on the basis of possibilities in breeding, management and seed production and not on the basis of traditional problem solving. We followed a strategy commonly used in industrial management based on the premise “imagining the future is shaping the future”. The method “appreciative inquiry” was applied supported by a professional facilitator. Experience shows that this way of working sparks engagement and creativity and that progress and results can be reached within a short time. Focus was on the following topics of relevance to cereal production:
- Competition between food and bioenergy,
- Soil fertility management,
- Economical and legal conditions for variety improvement,
- Participation of stakeholders,
- Plant breeding strategies,
- Food and feed processing improvements,
- Sustainable land use.
The initial process was to visualise the most desirable future scenario for the seven essential topics in food and agriculture systems. This process was unhindered by any requirement for a market-driven goal. Each topic was discussed in relation to a broader socio-ecological system, with a focus on the means to reach the desired and more sustainable outcomes. The next step at the workshop was to produce the discussion documents.
The final stage of the process is to connect the topics in a completed vision of cereal production within a future sustainable socio-ecological system. This work is in progress, carried out by a group of key persons within the network, e.g. the working group leaders (in preparation for publication in a scientific journal).
An Approach of Semantic Similarity Measure between Documents Based on Big Data
Semantic indexing and document similarity are important information retrieval problems in Big Data, with broad applications. In this paper, we investigate the MapReduce programming model as a specific framework for managing distributed processing of a large number of documents. Then we study the state of the art of different approaches for computing the similarity of documents. Finally, we propose our approach to semantic similarity measurement, using WordNet as an external semantic network resource. For evaluation, we compare the proposed approach with previously presented approaches using our new MapReduce algorithm. Experimental results reveal that our proposed approach outperforms the state-of-the-art ones on running time performance and increases the measurement of semantic similarity.
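A minimal sketch of how a map step and a reduce step can feed a concept-based document similarity is shown here. It is not the authors' algorithm, which the abstract does not specify. The `CONCEPTS` dictionary is a tiny hand-made stand-in for a WordNet lookup, Jaccard overlap is an assumed similarity measure, and both phases run in a single process rather than on a cluster; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical stand-in for a WordNet lookup: words that share a meaning
# are mapped to the same concept id. Unknown words map to themselves.
CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "fast": "quick",
    "quick": "quick",
}


def map_phase(doc_id, text):
    """Emit one (concept, doc_id) pair per distinct concept in the document."""
    concepts = {CONCEPTS.get(word, word) for word in text.lower().split()}
    return [(concept, doc_id) for concept in concepts]


def reduce_phase(pairs):
    """Group document ids by concept, as a shuffle-and-reduce step would."""
    index = defaultdict(set)
    for concept, doc_id in pairs:
        index[concept].add(doc_id)
    return index


def jaccard_similarities(docs):
    """Jaccard similarity over concept sets for each pair sharing a concept."""
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    index = reduce_phase(pairs)
    sizes = defaultdict(int)   # number of distinct concepts per document
    shared = defaultdict(int)  # number of concepts shared by each pair
    for doc_ids in index.values():
        for doc_id in doc_ids:
            sizes[doc_id] += 1
        for a, b in combinations(sorted(doc_ids), 2):
            shared[(a, b)] += 1
    return {
        pair: n / (sizes[pair[0]] + sizes[pair[1]] - n)
        for pair, n in shared.items()
    }


docs = {"d1": "fast car", "d2": "quick automobile", "d3": "slow train"}
sims = jaccard_similarities(docs)
```

Here `d1` and `d2` share no surface words but map to the same two concepts, so their similarity is 1.0, while `d3` shares no concept with either and produces no pair at all. Grouping by concept first means only documents that actually share a concept are ever compared.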