
    Deep Learning for Technical Document Classification

    In large technology companies, the need to manage and organize technical documents created by engineers and managers has grown dramatically in recent years, driving demand for more scalable, accurate, and automated document classification. Prior studies have focused only on processing text for classification, whereas technical documents often contain multimodal information. To leverage this multimodal information and improve classification performance, this paper presents a novel multimodal deep learning architecture, TechDoc, which utilizes three types of information: the natural language text and descriptive images within documents, and the associations among documents. The architecture synthesizes convolutional neural networks, recurrent neural networks, and graph neural networks through an integrated training process. We applied the architecture to a large multimodal technical document database and trained the model to classify documents according to the hierarchical International Patent Classification system. Our results show that TechDoc achieves higher classification accuracy than unimodal methods and other state-of-the-art benchmarks. The trained model can potentially be scaled to millions of real-world multimodal technical documents, which is useful for data and knowledge management in large technology companies and organizations. Comment: 16 pages, 8 figures, 9 tables
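    The abstract does not spell out how the three branches are combined, so as an illustrative sketch only: one common way to merge per-modality scores is weighted late fusion, shown below with hypothetical text (RNN), image (CNN), and document-graph (GNN) logits. All names and numbers are invented for illustration, not taken from the TechDoc paper.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_modalities(text_logits, image_logits, graph_logits, weights=(1.0, 1.0, 1.0)):
    """Weighted late fusion: sum per-modality class scores, then normalize.
    A simple stand-in for combining text, image, and graph branches."""
    fused = [
        weights[0] * t + weights[1] * i + weights[2] * g
        for t, i, g in zip(text_logits, image_logits, graph_logits)
    ]
    return softmax(fused)

# Hypothetical branch outputs scoring four top-level patent classes:
probs = fuse_modalities([2.0, 0.5, 0.1, 0.0],
                        [1.5, 1.0, 0.2, 0.1],
                        [1.8, 0.3, 0.0, 0.2])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

    In a trained system the per-modality weights would themselves be learned jointly rather than fixed, which is closer in spirit to the integrated training process the abstract describes.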

    Integration of document representation, processing and management

    This paper describes a method for document representation and proposes an approach towards an integrated document processing and management system. The approach is intended to capture essentially freely structured documents, like those typically used in the office domain. The document analysis system ANASTASIL is capable of revealing the structure of complex paper documents, as well as logical objects within them, such as receiver, footnote, and date. Moreover, it facilitates the handling of the information they contain. Analyzed documents are stored by the management system KRISYS, which is connected to several different subsequent services. The described integrated system can be considered an ideal extension of the human clerk, making information-processing tasks easier. The symbolic representation of the analysis results allows easy transformation into a given international standard, e.g., ODA/ODIF or SGML, and interchange via global networks

    Modelling, Visualising and Summarising Documents with a Single Convolutional Neural Network

    Capturing the compositional process that maps the meaning of words to that of documents is a central challenge for researchers in Natural Language Processing and Information Retrieval. We introduce a model that represents the meaning of documents by embedding them in a low-dimensional vector space while preserving distinctions of word and sentence order crucial for capturing nuanced semantics. Our model is based on an extended Dynamic Convolutional Neural Network, which learns convolution filters at both the sentence and document level, hierarchically learning to capture and compose low-level lexical features into high-level semantic concepts. We demonstrate the effectiveness of this model on a range of document modelling tasks, achieving strong results with no feature engineering and a more compact model. Inspired by recent advances in visualising deep convolutional networks for computer vision, we present a novel visualisation technique for our document networks that not only provides insight into their learning process, but can also be interpreted to produce a compelling automatic summarisation system for texts
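    The hierarchical idea of convolving and pooling at the sentence level, then again at the document level, can be sketched in miniature. This toy uses scalar "embeddings" and a single filter per level, which is far simpler than the paper's actual DCNN, but shows the two-stage compose-then-pool structure.

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution (cross-correlation) over a sequence."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def max_pool(seq):
    """Reduce a feature sequence to its strongest activation."""
    return max(seq)

def document_embedding(sentences, sent_kernel, doc_kernel):
    """Hierarchical scheme: convolve and pool each sentence into one
    feature, then convolve and pool those features into a document feature."""
    sent_feats = [max_pool(conv1d(s, sent_kernel)) for s in sentences]
    return max_pool(conv1d(sent_feats, doc_kernel))

# Two sentences of toy scalar word embeddings:
doc = [[0.1, 0.4, 0.3], [0.5, 0.2, 0.2, 0.6]]
feat = document_embedding(doc, sent_kernel=[1.0, 1.0], doc_kernel=[0.5, 0.5])
```

    A real DCNN would use vector embeddings, many learned filters, and dynamic k-max pooling; the point here is only the sentence-then-document hierarchy.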

    Discovering Knowledge from Relational Data Extracted from Business News

    Thousands of business news stories (including press releases, earnings reports, general business news, etc.) are released each day. Recently, advances in information technology have partially automated the processing of documents, reducing the amount of text that must be read. Current techniques (e.g., text classification and information extraction) for full-text analysis are, for the most part, limited to discovering information that can be found in single documents. Often, however, important information does not reside in a single document, but in the relationships between information distributed over multiple documents. This paper reports on an investigation into whether knowledge can be discovered automatically from relational data extracted from large corpora of business news stories. We use a combination of information extraction, network analysis, and statistical techniques. We show that relationally interlinked patterns distributed over multiple documents can indeed be extracted, and specifically that knowledge about companies' interrelationships can be discovered. We evaluate the extracted relationships in several ways: we give a broad visualization of related companies, showing intuitive industry clusters; we use network analysis to ask who the central players are; and finally, we show that the extracted interrelationships can be used for important tasks, such as classifying companies by industry membership. Information Systems Working Papers Series
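    As a minimal sketch of the network-analysis step, the snippet below links companies that are mentioned in the same story and ranks them by degree, a simple proxy for "who are the central players". The company names and stories are hypothetical; the paper's actual extraction and statistical machinery is far richer.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(stories):
    """Connect two companies whenever they appear in the same story."""
    graph = defaultdict(set)
    for companies in stories:
        for a, b in combinations(sorted(set(companies)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def central_players(graph, top=2):
    """Rank companies by degree centrality (number of distinct neighbors)."""
    return sorted(graph, key=lambda c: len(graph[c]), reverse=True)[:top]

# Hypothetical per-story company mentions from an extraction step:
stories = [
    ["Acme", "Globex"],
    ["Acme", "Initech"],
    ["Acme", "Globex", "Umbrella"],
]
top = central_players(build_graph(stories))
```

    Clusters of densely connected companies in such a graph correspond to the intuitive industry groupings the abstract describes visualizing.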

    Discussion documents – SUSVAR Visions Workshop, Karrebæksminde, Denmark, April 2008

    Seven discussion documents were produced during the SUSVAR Visions workshop ‘Sustainable cereal production beyond 2020: Visions from the SUSVAR network’, Karrebæksminde, Denmark, 14-16 April 2008. At the workshop, one discussion document was written for each of the topics mentioned below. In total, 55 persons from 21 European countries participated in the process. The participants came from different disciplines: genetics, plant breeding, genetic resources, agronomy, plant pathology, soil science, biometry and systems analysis, all specialised in the area of cereal production. The approach taken at the workshop was to focus on envisioning the future of sustainable agriculture, especially cereal production. This was done through scientific creative thinking based on the possibilities in breeding, management and seed production, rather than on traditional problem solving. We followed a strategy commonly used in industrial management based on the premise “imagining the future is shaping the future”. The method of “appreciative inquiry” was applied, supported by a professional facilitator. Experience shows that this way of working sparks engagement and creativity and that progress and results can be reached within a short time. Focus was on the following topics of relevance to cereal production:
    - Competition between food and bioenergy
    - Soil fertility management
    - Economic and legal conditions for variety improvement
    - Participation of stakeholders
    - Plant breeding strategies
    - Food and feed processing improvements
    - Sustainable land use
    The initial process was to visualise the most desirable future scenario for these seven essential topics in food and agriculture systems. The process was not constrained by any requirement for a market-driven goal. Each topic was discussed in relation to a broader socio-ecological system, with a focus on the means to reach the desired, more sustainable outcomes.
    The next step at the workshop was to produce the discussion documents. The final stage of the process is to connect the topics into a completed vision of cereal production within a future sustainable socio-ecological system. This is in progress by a group of key persons within the network, e.g. the working group leaders (in preparation for publication in a scientific journal)

    An Approach of Semantic Similarity Measure between Documents Based on Big Data

    Semantic indexing and document similarity is an important information retrieval problem in Big Data with broad applications. In this paper, we investigate the MapReduce programming model as a framework for managing the distributed processing of large collections of documents. We then survey the state of the art in approaches for computing document similarity. Finally, we propose our approach to semantic similarity measurement, using WordNet as an external semantic network resource. For evaluation, we compare the proposed approach with previously presented approaches using our new MapReduce algorithm. Experimental results reveal that our proposed approach outperforms state-of-the-art approaches in running time while improving the semantic similarity measurement
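    To make the two ingredients concrete, here is a toy sketch: a WordNet-style path similarity computed over a tiny hand-made hypernym tree (standing in for WordNet itself), aggregated over document pairs with Python's `map`/`reduce` to echo the MapReduce flavour. The taxonomy, words, and scoring choices are all illustrative assumptions, not the paper's algorithm.

```python
from functools import reduce

# Tiny hand-made hypernym tree (child -> parent), a stand-in for WordNet.
HYPERNYMS = {"dog": "canine", "canine": "animal", "cat": "feline",
             "feline": "animal", "animal": "entity"}

def path_to_root(word):
    """Walk the hypernym chain from a word up to the taxonomy root."""
    path = [word]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path

def path_similarity(w1, w2):
    """WordNet-style path similarity: 1 / (shortest path length + 1),
    via the nearest common ancestor in the taxonomy."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    common = set(p1) & set(p2)
    dist = min(p1.index(c) + p2.index(c) for c in common)
    return 1.0 / (dist + 1)

def doc_similarity(doc1, doc2):
    """Average best-match word similarity, one simple document measure."""
    scores = [max(path_similarity(a, b) for b in doc2) for a in doc1]
    return sum(scores) / len(scores)

# Map a similarity over document pairs, then reduce to the best score:
pairs = [(["dog"], ["cat"]), (["dog"], ["dog", "canine"])]
best = reduce(max, map(lambda p: doc_similarity(*p), pairs))
```

    In an actual MapReduce deployment the map phase would score document pairs across workers and the reduce phase would aggregate them, with WordNet replacing the toy taxonomy.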