
    Content Recognition and Context Modeling for Document Analysis and Retrieval

    The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge. In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale and rotation invariant local shape feature that is generic enough to be detected repeatably and is segmentation free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine printed text and handwriting. Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (e.g., signatures and logos) provides a practical and reliable supplement to the OCR recognition of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on the structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification. Third, we present a model-based approach for extracting relevant named entities from unstructured documents. In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively based on contextual information from both page layout and text features. Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and other web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance.
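
    The abstract does not spell out the importance algorithm, so the following is only a rough sketch of the general idea of aggregating individual user sessions into page importance scores. It assumes (this is not stated in the abstract) that importance can be approximated by power iteration over a transition matrix estimated from observed page-to-page moves. The function name `session_importance` and the `damping` parameter are hypothetical and are not taken from the dissertation.

```python
from collections import defaultdict


def session_importance(sessions, damping=0.85, iterations=50):
    """Score pages by power iteration over transitions seen in user sessions.

    Illustrative assumption only: this is a generic random-walk estimate,
    not the algorithm described in the dissertation.
    """
    # Count observed page-to-page transitions within each session.
    counts = defaultdict(lambda: defaultdict(int))
    pages = set()
    for session in sessions:
        pages.update(session)
        for src, dst in zip(session, session[1:]):
            counts[src][dst] += 1

    if not pages:
        return {}

    pages = sorted(pages)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives a small uniform share of the total mass.
        nxt = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            total = sum(counts[src].values())
            if total == 0:
                # No observed outgoing transition: spread mass uniformly.
                for p in pages:
                    nxt[p] += damping * score[src] / n
            else:
                for dst, c in counts[src].items():
                    nxt[dst] += damping * score[src] * c / total
        score = nxt
    return score


if __name__ == "__main__":
    demo = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
    for page, value in sorted(session_importance(demo).items()):
        print(page, round(value, 3))
```

    Scores sum to one across pages, so they can be read as a relative ranking signal; how such a signal would be combined with text relevance is left open here.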

    PRESY: A Context Based Query Reformulation Tool for Information Retrieval on the Web

    Problem Statement: The huge amount of information on the web, together with the growing number of new inexperienced users, creates new challenges for information retrieval. It has become increasingly difficult for these users to find relevant documents that satisfy their individual needs. Current search engines (such as Google, Bing and Yahoo) certainly offer an efficient way to browse web content. However, result quality depends heavily on users' queries, which need to be more precise to find relevant documents. This task is still complicated for the majority of inexperienced users, who cannot express their needs with significant words in the query. For that reason, we believe that a reformulation of the user's initial query can be a good alternative to improve information selectivity. This study proposes a novel approach and presents a prototype system called PRESY (Profile-based REformulation SYstem) for information retrieval on the web. Approach: It uses an incremental approach to categorize users by constructing a contextual base. The latter is composed of two types of context (static and dynamic) obtained using the users' profiles. The proposed architecture was implemented in the .NET environment to perform query reformulation tests. Results: The experiments given at the end of this article show that the precision of the returned content is effectively improved. The tests were performed with the most popular search engines (i.e. Google, Bing and Yahoo), selected in particular for their high selectivity. Among the given results, we found that query reformulation improves the first three results by 10.7% and the next seven returned elements by 11.7%. The reformulation of users' initial queries thus improves the pertinence of returned content. Comment: 8 pages
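
    As a minimal sketch of the two ideas in this abstract, the snippet below appends profile terms to a query and measures precision over a range of result ranks (for example ranks 1 to 3 and ranks 4 to 10). The abstract does not describe PRESY's actual reformulation rules, so `reformulate`, `precision`, the `max_added` parameter, and the sample data are all hypothetical illustrations rather than the system's implementation.

```python
def reformulate(query, profile_terms, max_added=2):
    """Append up to max_added profile terms the query does not already contain.

    Hypothetical rule for illustration; PRESY's real rules are not given.
    """
    words = query.lower().split()
    added = [t for t in profile_terms if t.lower() not in words][:max_added]
    return " ".join(query.split() + added)


def precision(results, relevant, start, stop):
    """Fraction of relevant documents among results[start:stop]."""
    window = results[start:stop]
    if not window:
        return 0.0
    return sum(1 for doc in window if doc in relevant) / len(window)


if __name__ == "__main__":
    # Made-up profile terms and result list, used only to exercise the functions.
    print(reformulate("jaguar speed", ["animal", "speed", "wildlife", "zoo"]))
    ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
    relevant = {"d1", "d3", "d5", "d6"}
    print(precision(ranked, relevant, 0, 3))   # first three results
    print(precision(ranked, relevant, 3, 10))  # next seven results
```

    Comparing these two precision values before and after reformulation is one way to reproduce the kind of rank-range comparison the abstract reports.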

    We Could, but Should We? Ethical Considerations for Providing Access to GeoCities and Other Historical Digital Collections

    We live in an era in which the ways that we can make sense of our past are evolving as more artifacts from that past become digital. At the same time, the responsibilities of traditional gatekeepers who have negotiated the ethics of historical data collection and use, such as librarians and archivists, are increasingly being sidelined by the system builders who decide whether and how to provide access to historical digital collections, often without sufficient reflection on the ethical issues at hand. It is our aim to better prepare system builders to grapple with these issues. This paper focuses discussions around one such digital collection from the dawn of the web, asking what sorts of analyses can and should be conducted on archival copies of the GeoCities web hosting platform that dates to 1994. This research was supported by the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the US National Science Foundation (grants 1618695 and 1704369), the Andrew W. Mellon Foundation, Start Smart Labs, and Compute Canada.

    Multimedia search without visual analysis: the value of linguistic and contextual information

    This paper addresses the focus of this special issue by analyzing the potential contribution of linguistic content and other non-image aspects to the processing of audiovisual data. It summarizes the various ways in which linguistic content analysis contributes to enhancing the semantic annotation of multimedia content and, as a consequence, to improving the effectiveness of conceptual media access tools. A number of techniques are presented, including the time-alignment of textual resources, audio and speech processing, content reduction and reasoning tools, and the exploitation of surface features.

    Closing the loop: assisting archival appraisal and information retrieval in one sweep

    In this article, we examine the similarities between the concept of appraisal, a process that takes place within the archives, and the concept of relevance judgement, a process fundamental to the evaluation of information retrieval systems. More specifically, we revisit selection criteria proposed as a result of archival research and work within the digital curation communities, and compare them to relevance criteria as discussed within information retrieval's literature-based discovery. We illustrate how closely these criteria relate to each other and discuss how understanding the relationships between these disciplines could form a basis for proposing automated selection for archival processes and initiating multi-objective learning with respect to information retrieval.