7 research outputs found
Automatic office document classification and information extraction
TEXPR.OS (TEXt PROcessing System) is a document processing system (DPS) to support and assist office workers in their daily work in dealing with information and document management. In this thesis, document classification and information extraction, which are two of the major functional capabilities in TEXPROS, are investigated.
Based on the nature of its content, a document is divided into structured and unstructured (i.e., of free text) parts. The conceptual and content structures are introduced to capture the semantics of the structured and unstructured part of the document respectively. The document is classified and information is extracted based on the analyses of conceptual and content structures. In our approach, the layout structure of a document is used to assist the analyses of the conceptual and content structures of the document. By nested segmentation of a document, the layout structure of the document is represented by an ordered labeled tree structure, called Layout Structure Tree (L-S-Tree). Sample-based classification mechanism is adopted in our approach for classifying the documents. A set of pre-classified documents are stored in a document sample base in the form of sample trees. In the layout analysis, an approximate tree matching is used to match the L-S-Tree of a document to be classified against the sample trees. The layout similarities between the document and the sample documents are evaluated based on the edit distance between the L-S-Tree of the document and the sample trees. The document samples which have the similar layout structure to the document are chosen to be used for the conceptual analysis of the document.
In the conceptual analysis of the document, based on the mapping between the document and document samples, which was found during the layout analysis, the conceptual similarities between the document and the sample documents are evaluated based on the degree of conceptual closeness degree . The document sample which has the similar conceptual structure to the document is chosen to be used for extracting information. Extracting the information of the structured part of the document is based on the layout locations of key terms appearing in the document and string pattern matching. Based on the information extracted from the structured part of the document the type of the document is identified. In the content analysis of the document, the bottom-up and top-down analyses on the free text are combined to extract information from the unstructured part of the document. In the bottom-up analysis, the sentences of the free text are classified into those which are relevant or irrelevant to the extraction. The sentence classification is based on the semantical relationship between the phrases in the sentences and the attribute names in the corresponding content structure by consulting the thesaurus. Then the thematic roles of the phrases in each relevant sentence are identified based on the syntactic analysis and heuristic thematic analysis. In the top-down analysis, the appropriate content structure is identified based on the document type identified in the conceptual analysis. Then the information is extracted from the unstructured part of the document by evaluating the restrictions specified in the corresponding content structure based on the result of bottom-up analysis.
The information extracted from the structured and unstructured parts of the document are stored in the form of a frame like structure (frame instance) in the data base for information retrieval in TEXPROS
Recommended from our members
Secure digital documents using Steganography and QR Code
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University LondonWith the increasing use of the Internet several problems have arisen regarding the processing of electronic documents. These include content filtering, content retrieval/search. Moreover, document security has taken a centre stage including copyright protection, broadcast monitoring etc. There is an acute need of an effective tool which can find the identity, location and the time when the document was created so that it can be determined whether or not the contents of the document were tampered with after creation. Owing the sensitivity of the large amounts of data which is processed on a daily basis, verifying the authenticity and integrity of a document is more important now than it ever was. Unsurprisingly document authenticity verification has become the centre of attention in the world of research. Consequently, this research is concerned with creating a tool which deals with the above problem. This research proposes the use of a Quick Response Code as a message carrier for Text Key-print. The Text Key-print is a novel method which employs the basic element of the language (i.e. Characters of the alphabet) in order to achieve authenticity of electronic documents through the transformation of its physical structure into a logical structured relationship. The resultant dimensional matrix is then converted into a binary stream and encapsulated with a serial number or URL inside a Quick response Code (QR code) to form a digital fingerprint mark. For hiding a QR code, two image steganography techniques were developed based upon the spatial and the transform domains. In the spatial domain, three methods were proposed and implemented based on the least significant bit insertion technique and the use of pseudorandom number generator to scatter the message into a set of arbitrary pixels. These methods utilise the three colour channels in the images based on the RGB model based in order to embed one, two or three bits per the eight bit channel which results in three different hiding capacities. The second technique is an adaptive approach in transforming domain where a threshold value is calculated under a predefined location for embedding in order to identify the embedding strength of the embedding technique. The quality of the generated stego images was evaluated using both objective (PSNR) and Subjective (DSCQS) methods to ensure the reliability of our proposed methods. The experimental results revealed that PSNR is not a strong indicator of the perceived stego image quality, but not a bad interpreter also of the actual quality of stego images. Since the visual difference between the cover and the stego image must be absolutely imperceptible to the human visual system, it was logically convenient to ask human observers with different qualifications and experience in the field of image processing to evaluate the perceived quality of the cover and the stego image. Thus, the subjective responses were analysed using statistical measurements to describe the distribution of the scores given by the assessors. Thus, the proposed scheme presents an alternative approach to protect digital documents rather than the traditional techniques of digital signature and watermarking
Developing complex information systems: The use of a geometric data structure to aid the specification of a multi-media information environment.
The enormous computing power available today has resulted in the acceptance of information technology into a wide range of applications, the identification or creation of numerous problem areas, and the considerable tasks of finding problem solutions. Using computers for handling the current data manipulation tasks which characterise modern information processing requires considerably more sophisticated hardware and software technologies. Yet the development of more 'enhanced' packages frequently requires hundreds of man-years. Similarly, computer hardware design has become so complicated that only by using existing computers is it possible to develop newer machines. The common characteristic of such data manipulation tasks is that much larger amounts of information in evermore complex arrangements are being handled at greater speeds. Instead of being 'concrete' or 'black and white', issues at the higher levels of information processing can appear blurred - there may be much less precision because situations, perspectives and circumstances can vary. Most current packages focus on specific task areas, but the modern information processing environment actually requires a broader range of functions that cooperate in integrating and relating information handling activities in a manner far beyond that normally offered. It would seem that a fresh approach is required to examine all of the constituent problems. This report describes the research work carried out during such a consideration, and details the specification and development of a suggested method for enhancing information systems by specifying a multimedia information environment. This thesis develops a statement of the perceived problems, using extensive references to the current state of information system technologies. Examples are given of how some current systems approach the multiple tasks of processing and sharing data and applications. The discussion then moves to consider further what the underlying objectives of information handling - and a suitable integration architecture - should perhaps be, and shows how some current systems do not really meet these aims, although they incorporate certain of the essential fundamentals that contribute towards more enhanced information handling. The discussion provides the basis for the specification and construction of complete, integrated Information Environment applications. The environments are used to describe not only the jobs which the user wishes to carry out, but also the circumstances under which the job is being performed. The architecture uses a new geometric data structure to facilitate manipulation of the working data, relationships, and the environment description. The manipulation is carried out spatially, and this allows the user to work using a geometric representation of the data components, thus supporting the abstract nature of some information handling tasks