1,008 research outputs found

    Japanese bibliographic records and CJK cataloging in U.S. university libraries.

    In the last two decades, American university libraries have developed Chinese, Japanese, and Korean (CJK) enhancements to their library automation systems and transitioned from conventional card catalogs to online public access catalogs (OPACs) by using CJK vernacular scripts, although the non-Roman script search options of these systems are still limited.

    Content Recognition and Context Modeling for Document Analysis and Retrieval

    The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge. In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale and rotation invariant local shape feature that is generic enough to be detected repeatably and is segmentation free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine printed text and handwriting. Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (e.g., signatures and logos) provides a practical and reliable supplement to the OCR recognition of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on the structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification. Third, we present a model-based approach for extracting relevant named entities from unstructured documents. In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively based on contextual information from both page layout and text features. Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and other web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance.

    AGROVOC: The linked data concept hub for food and agriculture

    Newly acquired, aggregated and shared data are essential for innovation in food and agriculture to improve the discoverability of research. Since the early 1980s, the Food and Agriculture Organization of the United Nations (FAO) has coordinated AGROVOC, a valuable tool for data to be classified homogeneously, facilitating interoperability and reuse. AGROVOC is a multilingual and controlled vocabulary designed to cover concepts and terminology under FAO's areas of interest. It is the largest Linked Open Data set about agriculture available for public use and its highest impact is through facilitating the access and visibility of data across domains and languages. This chapter aims to describe the current status of one of the most popular thesauri in all of FAO's areas of interest, and how it has become the Linked Data Concept Hub for food and agriculture, through new procedures put in place

    An effective Chinese indexing method based on partitioned signature files.

    Wong Chi Yin. Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 107-114). Abstract also in Chinese.

    Contents:
    Abstract
    Acknowledgements
    Chapter 1 Introduction
        1.1 Introduction to Chinese IR
        1.2 Contributions
        1.3 Organization of this Thesis
    Chapter 2 Background
        2.1 Indexing methods
            2.1.1 Full-text scanning
            2.1.2 Inverted files
            2.1.3 Signature files
            2.1.4 Clustering
        2.2 Information Retrieval Models
            2.2.1 Boolean model
            2.2.2 Vector space model
            2.2.3 Probabilistic model
            2.2.4 Logical model
    Chapter 3 Investigation of Segmentation on the Vector Space Retrieval Model
        3.1 Segmentation of Chinese Texts
            3.1.1 Character-based segmentation
            3.1.2 Word-based segmentation
            3.1.3 N-Gram segmentation
        3.2 Performance Evaluation of Three Segmentation Approaches
            3.2.1 Experimental Setup
            3.2.2 Experimental Results
            3.2.3 Discussion
    Chapter 4 Signature File Background
        4.1 Superimposed coding
        4.2 False drop probability
    Chapter 5 Partitioned Signature File Based On Chinese Word Length
        5.1 Fixed Weight Block (FWB) Signature File
        5.2 Overview of PSFC
        5.3 Design Considerations
    Chapter 6 New Hashing Techniques for Partitioned Signature Files
        6.1 Direct Division Method
        6.2 Random Number Assisted Division Method
        6.3 Frequency-based hashing method
        6.4 Chinese character-based hashing method
    Chapter 7 Experiments and Results
        7.1 Performance evaluation of partitioned signature file based on Chinese word length
            7.1.1 Retrieval Performance
            7.1.2 Signature Reduction Ratio
            7.1.3 Storage Requirement
            7.1.4 Discussion
        7.2 Performance evaluation of different dynamic signature generation methods
            7.2.1 Collision
            7.2.2 Retrieval Performance
            7.2.3 Discussion
    Chapter 8 Conclusions and Future Work
        8.1 Conclusions
        8.2 Future work
    Chapter A Notations of Signature Files
    Chapter B False Drop Probability
    Chapter C Experimental Results
    Bibliography

    Advanced Document Description, a Sequential Approach

    To be able to perform efficient document processing, information systems need to use simple models of documents that can be treated in a smaller number of operations. This problem of document representation is not trivial. For decades, researchers have tried to combine relevant document representations with efficient processing. Documents are commonly represented by vectors in which each dimension corresponds to a word of the document. This approach is termed “bag of words”, as it entirely ignores the relative positions of words. One natural improvement over this representation is the extraction and use of cohesive word sequences. In this dissertation, we consider the problem of the extraction, selection and exploitation of word sequences, with a particular focus on the applicability of our work to domain-independent document collections written in any language
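
The contrast drawn in this abstract between a bag-of-words vector and word sequences can be illustrated with a minimal sketch. It is not the dissertation's extraction method: adjacent word pairs (bigrams) stand in here for "cohesive word sequences", and the function names `bag_of_words` and `word_bigrams` are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences; the relative positions of words are discarded."""
    return Counter(text.lower().split())


def word_bigrams(text):
    """Count adjacent word pairs (a simple stand-in for word sequences)."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


a = "dog bites man"
b = "man bites dog"

# Same words, different order: the bag-of-words representations coincide...
print(bag_of_words(a) == bag_of_words(b))

# ...while the bigram representations differ, because local order is kept.
print(word_bigrams(a) == word_bigrams(b))
```

Whitespace splitting is itself an assumption that fails for languages written without word delimiters, so a language-independent pipeline would need a different tokenization step.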