Knowledge discovering for document classification using tree matching in Texpros

Author: Wei Ching-Song
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/1996

This dissertation describes a knowledge-based system for classifying documents based upon the layout structure and conceptual information extracted from the content of the document. The spatial elements in a document are laid out in rectangular blocks which are represented by nodes in an ordered labelled tree, called the layout structure tree (L-S Tree). Each leaf node of an L-S Tree points to its corresponding block content. A Knowledge Acquisition Tool (KAT) is devised to create a Document Sample Tree from an L-S Tree, in which each of its leaves contains a node content conceptually describing its corresponding block content. Then, applying generalization rules, the KAT performs inductive learning from the Document Sample Trees of a type and generates a smaller number of Document Type Trees to represent that type. A testing document is classified if a Document Type Tree is discovered as a substructure of the L-S Tree of the testing document; the exact format of the testing document can then be found by matching the L-S Tree with the Document Sample Trees of the classified document type. The Document Sample Trees and Document Type Trees are called the Structural Knowledge Base (SKB). The tree discovering and matching processes involve computing the edit distance and the degree of conceptual closeness between the SKB trees and the L-S Tree of a testing document by using pattern matching and discovering toolkits. Our experimental results demonstrate that many office documents can be classified correctly using the proposed approach.

Digital Commons @ New Jersey Institute of Technology (NJIT)

Knowledge-based document retrieval with application to TEXPROS

Author: Sheng Fang
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/2001

Document retrieval in an information system is most often accomplished through keyword search. The common technique behind keyword search is indexing. The major drawback of such a search technique is its lack of effectiveness and accuracy. It is very common in a typical keyword search over the Internet to identify hundreds or even thousands of records as the potentially desired records. However, often few of them are relevant to users' interests. This dissertation presents a knowledge-based document retrieval architecture with application to TEXPROS. The architecture is based on a dual document model that consists of a document type hierarchy and a folder organization. Using the knowledge collected during document filing, the search space can be narrowed down significantly. Combining the classical text-based retrieval methods with the knowledge-based retrieval can improve tremendously both search efficiency and effectiveness. With the proposed predicate-based query language, users can more precisely and accurately specify the search criteria and their knowledge about the documents to be retrieved. To help users formulate a query, a guided search is presented as part of an intelligent user interface. Supported by an intelligent question generator, an inference engine, a question base, and a predicate-based query composer, the guided search collects the most important information known to the user to retrieve the documents that satisfy users' particular interests. A knowledge-based query processing and search engine is presented as the core component in this architecture. Algorithms are developed for the search engine to effectively and efficiently retrieve the documents that match the query. Cache is introduced to speed up the process of query refinement. Theoretical proof and performance analysis are performed to prove the efficiency and effectiveness of this knowledge-based document retrieval approach.

Hytexpros : a hypermedia information retrieval system

Author: Shen Hong
Publication venue: Digital Commons @ NJIT
Publication date: 31/01/2000

The hypermedia information retrieval system combines the specific capabilities of hypermedia systems with information retrieval operations and provides a new kind of information management tool. It combines both hypermedia and information retrieval to offer end-users the possibility of navigating, browsing and searching a large collection of documents to satisfy an information need. TEXPROS is an intelligent document processing and retrieval system that supports storing, extracting, classifying, categorizing, retrieving and browsing enterprise information. TEXPROS is a natural application for hypermedia information retrieval techniques. In this dissertation, we extend TEXPROS to a hypermedia information retrieval system called HyTEXPROS with hypertext functionalities, such as nodes, typed and weighted links, anchors, guided tours, network overview, bookmarks, annotations and comments, and an external linkbase. It describes the whole information base, including the metadata and the original documents, as network nodes connected by links. Through hypertext functionalities, a user can dynamically construct an information path by browsing through pieces of the information base. Adding hypertext functionalities changes the working domain of the system from a personal document processing domain to a personal library domain, accompanied by citation techniques to process original documents. A four-level conceptual architecture is presented as the system architecture of HyTEXPROS. This architecture is also referred to as the reference model of HyTEXPROS. A detailed description of HyTEXPROS, using the First Order Logic Calculus, is also proposed. An early version of a prototype is briefly described.

Knowledge management for TEXPROS

Author: Hu Jianshun
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/1999

Most document processing systems today apply AI technologies to support intelligent system behavior. For the application of AI technologies in such systems, the core problem is how to represent and manage different kinds of knowledge to support the functionalities of their inference engine components. In other words, knowledge management has become a critical issue in document processing systems. In this dissertation, within the scope of the TEXt PROcessing System (TEXPROS), we identify knowledge of various kinds that are applicable in the system. We investigate several problems of managing this knowledge and then develop a knowledge base for TEXPROS. In developing this knowledge base, we present approaches to representing and managing different kinds of knowledge to support the functionalities of its inference engine components. In TEXPROS, a dual-model paradigm is used, which contains the folder organization and the document type hierarchy, to represent and manage documents. We introduce a new System Catalog structure to represent and manage the knowledge for TEXPROS. This knowledge includes the system-level information of the folder organization and the document type hierarchy, and the operational-level information of the document base itself. A unified storage approach is employed to store both the operational-level information and the system-level information. This storage houses the frame template base and the frame instance base. An enhanced two-level thesaurus model is presented in this dissertation. For dealing with special kinds of data in processing documents, a new structure, DataDomain, is presented, which supports the extended thesaurus functionalities, pattern recognition and data type operations. Based on the dual-model paradigm of TEXPROS, a concept of “Semantic Range” is presented to solve the sense ambiguity problems.
In this dissertation, we also present approaches to implementing the general KeyTerm transformation and approximate term matching of TEXPROS. Finally, a new component, the “Registration Center”, at the knowledge management level of TEXPROS is presented. The registration center aims to help users handle knowledge packages for a specific working domain and to solve the knowledge porting problem for TEXPROS. The dissertation concludes with future research work.

Automatic document classification and extraction system (ADoCES)

Author: Li Xuhong
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/1999

Document processing is a critical element of office automation. Document image processing begins from the Optical Character Recognition (OCR) phase with complex processing for document classification and extraction. Document classification is a process that classifies an incoming document into a particular predefined document type. Document extraction is a process that extracts information pertinent to the users from the content of a document and assigns the information as the values of the “logical structure” of the document type. Therefore, after document classification and extraction, a paper document will be represented in its digital form instead of its original image file format, which is called a frame instance. A frame instance is an operable and efficient form that can be processed and manipulated during document filing and retrieval. This dissertation describes a system to support a complete procedure, which begins with the scanning of the paper document into the system and ends with the output of an effective digital form of the original document. This is a general-purpose system with “learning” ability and, therefore, it can be adapted easily to many application domains. In this dissertation, the “logical closeness” segmentation method is proposed. A novel representation of document layout structure - Labeled Directed Weighted Graph (LDWG) and a methodology of transforming document segmentation into LDWG representation are described. To find a match between two LDWGs, string representation matching is applied first instead of doing graph comparison directly, which reduces the time necessary to make the comparison. Applying artificial intelligence, the system is able to learn from experiences and build samples of LDWGs to represent each document type. In addition, the concept of frame templates is used for the document logical structure representation. 
The concept of Document Type Hierarchy (DTH) is also enhanced to express the hierarchical relation over the logical structures existing among the documents

Knowledge-based document filing for texpros

Author: Fan Xien
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/1998

This dissertation presents a knowledge-based document filing system for TEXPROS. The requirements of a personal document processing system are investigated. In order for the system to be used in various application domains, a flexible, dynamic modeling approach is employed by getting the user involved in document modeling. The office documents are described using a dual model which consists of a document type hierarchy and a folder organization. The document type hierarchy is used to capture the layout, logical and conceptual structures of documents. The folder organization, which is defined by the user, emulates the real-world structure for organizing and storing documents in an office environment. The document filing and retrieval are predicate-driven. The user can specify filing criteria and queries in terms of predicates. The predicate specification and folder organization specification are described. It is shown that the new specifications can prevent the false drops which occur in the previous approach. The dual models are incorporated by a three-level storage architecture. This storage architecture supports efficient document and information retrieval by limiting the searches to those frame instances of a document type within those folders which appear to be the most similar to the corresponding queries. Specifically, a three-level retrieval strategy is used in document and information retrieval. Firstly, a knowledge-based query preprocess is applied for efficiently reducing the search space to a small set of frame instances, using the information in the query formula. Secondly, the knowledge and content-based retrieval on the small set of frame instances is applied. Finally, the third-level storage provides a platform for adopting potential content-based multimedia document retrieval techniques. A knowledge-based predicate evaluation engine is described for automating document filing. The dissertation presents a knowledge representation model.
The knowledge base is dynamically created by a learning agent, which demonstrates that the notion of flexible and dynamic modeling is applicable. The folder organization is implemented using an agent-based architecture. Each folder is monitored by a filing agent. The basic operations for constructing and reorganizing a folder organization are defined. The dissertation also discusses the cooperation among the filing agents, which is needed for implementing the folder organization.

e-DOCSPROS : exploring TEXPROS into e-business era

Author: Cheng Zhenfu
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/2001

Document processing is a critical element of office automation. TEXPROS (TEXt PROcessing System) is a knowledge-based system designed to manage personal documents. However, as the Internet and e-Business changed the way offices operate, there is a need to re-envision document processing, storage, retrieval, and sharing. In the current environment, people must be able to access documents remotely and to share those documents with others. e-DOCPROS (e-DOCument PROcessing System) is a new document processing system that takes advantage of many of TEXPROS's structures but adapts the system to this new environment. The new system is built to serve e-businesses, to take advantage of Internet protocols, and to provide remote access and document sharing. e-DOCPROS meets the challenge of providing wider usage, and eventually will improve the efficiency and effectiveness of office automation. It allows end users to access their data through any Web browser with Internet access, even over a wireless network, which will evolutionarily change the way we manage information. The application of e-DOCPROS to e-Business is considered. Four types of business models are considered here. The first is the Business-to-Business (B2B) model, which performs business-to-business transactions through an Extranet. The Extranet consists of multiple Intranets connected via the Internet. The second is the Business-to-Consumer (B2C) model, which performs business-to-consumer transactions through the Internet. The third is the Intranet model, which performs transactions within an organization through the organization's network. The fourth is the Consumer-to-Consumer (C2C) model, which performs consumer-to-consumer transactions through the Internet. A triple model is proposed in this dissertation to integrate the organization type hierarchy and the document type hierarchy together into the folder organization.
e-DOCPROS introduces new features into TEXPROS to support those four business models and to accommodate the system requirements. Extensible Markup Language (XML), an industry standard for data exchange, is employed to achieve the goal of information exchange between e-DOCPROS and other systems, and also among the subsystems within e-DOCPROS. The Document Object Model (DOM) specification is followed throughout the implementation of e-DOCPROS to achieve portability. An agent-based Application Service Provider (ASP) implementation is employed in the e-DOCPROS system to achieve cost-effectiveness and accessibility.

On document filing based upon predicates

Author: Zhu Zhijian
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/1997

This dissertation presents a formal approach to modeling documents in a personal office environment, proposes a heterogeneous algebraic query language for manipulating objects (folders) in the document model, and investigates a predicate-driven document filing system for automatically filing documents. The document model was initially proposed in [38], which adopts a very natural view for describing office documents using the relational and object-oriented paradigms. The model employs a dual approach to classifying and categorizing office documents by defining both a document type hierarchy and a folder organization. This dissertation extends and formally specifies the document model. Documents are partitioned into different classes, each document class being represented by a frame template which describes the properties of the documents of the class. A particular office document, summarized from the viewpoint of its frame template, yields a synopsis of the document which is called a frame instance. Frame instances are grouped into a folder on the basis of user-defined criteria, specified as predicates, which determine whether a frame instance belongs to a folder. Folders, each of which is a heterogeneous set of frame instances, can be naturally organized into a folder organization. The folder organization specifying the document filing view is then defined using predicates and a directed graph. However, some operators in the algebraic query language [38] do not support the heterogeneous property. This dissertation proposes an algebra-based query language that gives full support to this heterogeneous property. We investigate the construction problem of a folder organization: does it allow a user to add a new folder with an arbitrary local predicate? Given a folder organization, creating a new folder with an arbitrarily defined predicate may cause two abnormalities: inapplicable edges (filing paths) and redundant folders.
To deal with such abnormalities in the process of constructing a folder organization, the concept of predicate consistency is discussed and an algorithm is proposed for determining whether the predicate of a new folder is consistent with the existing folder organization. The global predicate of a folder governs the content of the folder. However, the predicates of folders (that is, global predicates) do not uniquely specify a folder organization. We therefore investigate the reconstruction problem: under what circumstances can we uniquely recover the folder organization from its global predicates? The problem is solved in terms of graph-theoretic concepts such as associated digraphs, transitive closure, and redundant/non-redundant filing paths. A transitive closure inversion algorithm is then presented which efficiently recovers a folder organization digraph from its associated digraph. After defining a folder organization, we can file a frame instance into the folder organization. A document filing algorithm describes the procedure of filing a frame instance. However, the critical issue of the algorithm is how to evaluate whether a frame instance satisfies the predicate of a folder in a folder organization. In order to solve this issue, a thesaurus, an association dictionary and a knowledge base are then introduced. The thesaurus specifies the association relationship among the key terms that actually reside in the system and the terms that are used by users. An association dictionary gives the association relationship between an attribute of a predicate and a frame template defined in a folder organization. A knowledge base represents background knowledge in a certain application domain.
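
Editorial illustration (not taken from the dissertation): the abstract does not give the transitive closure inversion algorithm itself. The sketch below shows the standard transitive reduction of an acyclic digraph, which is one textbook way to recover a minimal digraph from its transitive closure. It assumes that the folder organization is acyclic and that the associated digraph is its transitive closure; the function name and the folder names are hypothetical.

```python
def transitive_reduction(closure):
    """Recover direct edges from a transitively closed acyclic digraph.

    closure maps each node to the set of ALL nodes reachable from it.
    Assumption: the graph is acyclic and already transitively closed.
    """
    reduced = {}
    for u, descendants in closure.items():
        # Keep edge u -> v only if no other descendant w of u also reaches v,
        # i.e. the edge is not implied by a longer filing path.
        reduced[u] = {
            v for v in descendants
            if not any(v in closure.get(w, set())
                       for w in descendants if w != v)
        }
    return reduced


# Hypothetical folder organization: "root" reaches "c" only through "a" or "b".
closure = {
    "root": {"a", "b", "c"},
    "a": {"c"},
    "b": {"c"},
    "c": set(),
}
folders = transitive_reduction(closure)
```

In this example the edge from "root" to "c" is dropped because it is implied by the paths through "a" and "b", which matches the abstract's notion of a redundant filing path only under the assumptions stated above.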

A more efficient document retrieval method for TEXPROS

Author: Dong Yin
Publication venue: Digital Commons @ NJIT
Publication date: 31/01/2001

Document processing is a critical element of office automation. Through document classification, extraction and filing, documents are automatically placed into a knowledge base according to certain rules. Document retrieval is a process to get a document back according to a user's requirements and to show the results to the user. Hence, a good user interface and an efficient retrieval algorithm become core parts of document retrieval. Unlike previous browsers that have been proposed for this purpose, this dissertation develops a new browser that has a user interface with more tools and a more efficient retrieval algorithm that can deal with a wide variety of retrieval situations. From the interface point of view, the new browser provides more functions, such as zoom in and zoom out (i.e., automatic scaling of the portion of a graph that is of interest to a user) and help. These functions give users an easier way to view a large graph in one window and provide users with help during the retrieval process. The new browser also provides an algorithm that makes retrieval more efficient by using a Reusable Base. The Reusable Base holds the information that is most related to the user's previous requests, and the information stored in the Reusable Base is more easily used to form the OP-Net than that in the System Catalog. Hence, it eliminates the need to go to the System Catalog to find the results. This speeds up retrieval significantly, making it at least twice as fast as retrieval without the Reusable Base. Further, the new browser provides information about the folder organization and the document type hierarchy in addition to the OP-Net. If users know the type of documents they want, or which folder they are interested in, they can go to the particular document type or the particular folder directly.

Pattern discovery in trees : algorithms and applications to document and scientific data management

Author: Chang Chia-Yo
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/1999

Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing. In this dissertation we present algorithms for finding patterns in the ordered labeled trees. Specifically we study the largest approximately common substructure (LACS) problem for such trees. We consider a substructure of a tree T to be a connected subgraph of T. Given two trees T1, T2 and an integer d, the LACS problem is to find a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within distance d of U2 and where there does not exist any other substructure V1 of T1 and V2 of T2 such that V1 and V2 satisfy the distance constraint and the sum of the sizes of V1 and V2 is greater than the sum of the sizes of U1 and U2. The LACS problem is motivated by the studies of document and RNA comparison. We consider two types of distance measures: the general edit distance and a restricted edit distance originated from Selkow. We present dynamic programming algorithms to solve the LACS problem based on the two distance measures. The algorithms run as fast as the best known algorithms for computing the distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees. To demonstrate the utility of our algorithms, we discuss their applications to discovering motifs in multiple RNA secondary structures. Such an application shows an example of scientific data mining. We represent an RNA secondary structure by an ordered labeled tree based on a previously proposed scheme. The patterns in the trees are substructures that can differ in both substitutions and deletions/insertions of nodes of the trees. Our techniques incorporate approximate tree matching algorithms and novel heuristics for discovery and optimization. 
Experimental results obtained by running these algorithms on both generated data and RNA secondary structures show the good performance of the algorithms. It is shown that the optimization heuristics speed up the discovery algorithm by a factor of 10. Moreover, our optimized approach is 100,000 times faster than the brute force method. Finally we implement our techniques into a graphic toolbox that enables users to find repeated substructures in an RNA secondary structure as well as frequently occurring patterns in multiple RNA secondary structures pertaining to rhinovirus obtained from the National Cancer Institute. The system is implemented in C programming language and X windows and is fully operational on SUN workstations
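
Editorial illustration (not taken from the dissertation): the abstract names a restricted edit distance originating from Selkow but does not spell it out. The sketch below computes a Selkow-style top-down distance between two ordered labeled trees, in which roots are aligned and whole subtrees are inserted or deleted. Unit costs, the `(label, children)` tuple representation and the function names are assumptions made here; this is not the LACS algorithm, only the kind of distance measure it builds on.

```python
def size(tree):
    """Number of nodes in a tree given as (label, [children])."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


def selkow_distance(t1, t2):
    """Selkow-style distance with unit costs (an assumption of this sketch).

    Roots are matched (cost 1 if labels differ); the child sequences are
    aligned like strings, where deleting or inserting a child costs the
    size of its whole subtree and matching two children recurses.
    """
    (label1, kids1), (label2, kids2) = t1, t2
    m, n = len(kids1), len(kids2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids1[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids2[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids1[i - 1]),   # delete subtree
                d[i][j - 1] + size(kids2[j - 1]),   # insert subtree
                d[i - 1][j - 1] + selkow_distance(kids1[i - 1], kids2[j - 1]),
            )
    return (0 if label1 == label2 else 1) + d[m][n]


# Two small hypothetical trees: relabel c -> d and insert leaf e under it.
t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", []), ("d", [("e", [])])])
distance = selkow_distance(t1, t2)
```

The recursion is written for clarity rather than speed; a practical implementation would memoize subtree pairs.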
