    Content Recognition and Context Modeling for Document Analysis and Retrieval

    The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge. In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale- and rotation-invariant local shape feature that is generic enough to be detected repeatably and is segmentation-free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine-printed text and handwriting. Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (e.g., signatures and logos) provides a practical and reliable supplement to the OCR of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on the structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification. Third, we present a model-based approach for extracting relevant named entities from unstructured documents.
In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively based on contextual information from both page layout and text features. Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and other aspects of the web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance.

    Unbiased Comparative Evaluation of Ranking Functions

    Eliciting relevance judgments for ranking evaluation is labor-intensive and costly, motivating careful selection of which documents to judge. Unlike traditional approaches that make this selection deterministically, probabilistic sampling has shown intriguing promise since it enables the design of estimators that are provably unbiased even when reusing data with missing judgments. In this paper, we first unify and extend these sampling approaches by viewing the evaluation problem as a Monte Carlo estimation task that applies to a large number of common IR metrics. Drawing on the theoretical clarity that this view offers, we tackle three practical evaluation scenarios: comparing two systems, comparing k systems against a baseline, and ranking k systems. For each scenario, we derive an estimator and a variance-optimizing sampling distribution while retaining the strengths of sampling-based evaluation, including unbiasedness, reusability despite missing data, and ease of use in practice. In addition to the theoretical contribution, we empirically evaluate our methods against previously used sampling heuristics and find that they generally cut the number of required relevance judgments at least in half. Comment: Under review; 10 pages
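
    The Monte Carlo view described in this abstract can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes a DCG-style metric written as a sum of rank weights times relevance, draws documents with replacement from a sampling distribution `q`, and reweights each judged document by `1 / q`. The function names, the toy ranking, and the relevance labels are all hypothetical.

```python
import math
import random

def dcg_weights(ranking):
    """Rank discount 1 / log2(rank + 1) for each document in the ranking."""
    return {doc: 1.0 / math.log2(rank + 1)
            for rank, doc in enumerate(ranking, start=1)}

def true_dcg(ranking, rel):
    """Exact DCG when every document has a relevance judgment."""
    w = dcg_weights(ranking)
    return sum(w[d] * rel[d] for d in ranking)

def sampled_dcg(ranking, judge, q, n, rng):
    """Importance-sampling estimate of DCG from n sampled judgments.

    q maps each document to its sampling probability; it must be positive
    wherever weight * relevance can be nonzero for the estimate to be unbiased.
    judge(doc) stands in for asking an assessor for a relevance label.
    """
    w = dcg_weights(ranking)
    docs = list(q)
    draws = rng.choices(docs, weights=[q[d] for d in docs], k=n)
    return sum(w[d] * judge(d) / q[d] for d in draws) / n

# Toy data (hypothetical): four ranked documents with binary relevance.
ranking = ["a", "b", "c", "d"]
rel = {"a": 1, "b": 0, "c": 1, "d": 1}
exact = true_dcg(ranking, rel)

# Uniform sampling: unbiased, but with nonzero variance.
uniform = {d: 0.25 for d in ranking}
estimate = sampled_dcg(ranking, rel.get, uniform, 20000, random.Random(0))

# Sampling proportionally to weight * relevance: every draw contributes the
# same value, so the variance is zero. This needs the unknown relevances,
# so it is only an idealized reference point, not a usable design.
w = dcg_weights(ranking)
ideal = {d: w[d] * rel[d] / exact for d in ranking}
ideal_estimate = sampled_dcg(ranking, rel.get, ideal, 5, random.Random(1))
```

    The contrast between the two sampling distributions is the standard importance-sampling intuition for why the choice of distribution affects how many judgments are needed.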

    The Partial Evaluation Approach to Information Personalization

    Information personalization refers to the automatic adjustment of information content, structure, and presentation tailored to an individual user. By reducing information overload and customizing information access, personalization systems have emerged as an important segment of the Internet economy. This paper presents a systematic modeling methodology - PIPE ('Personalization is Partial Evaluation') - for personalization. Personalization systems are designed and implemented in PIPE by modeling an information-seeking interaction in a programmatic representation. The representation supports the description of information-seeking activities as partial information and their subsequent realization by partial evaluation, a technique for specializing programs. We describe the modeling methodology at a conceptual level and outline representational choices. We present two application case studies that use PIPE for personalizing web sites and describe how PIPE suggests a novel evaluation criterion for information system designs. Finally, we mention several fundamental implications of adopting the PIPE model for personalization and when it is (and is not) applicable. Comment: Comprehensive overview of the PIPE model for personalization
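
    As a rough illustration of partial evaluation applied to an information-seeking interaction, the sketch below specializes a small decision tree against whatever the user has already specified. It is not taken from the paper: the tree encoding, the `partial_eval` function, and the example site with its page names are all hypothetical.

```python
def partial_eval(node, known):
    """Specialize a decision tree with respect to known variable bindings.

    A node is either a leaf (any non-tuple value, here a page name) or a
    pair (variable, {value: subtree}). Tests on known variables are
    resolved and removed; tests on unknown variables are kept.
    Returns None when no content is consistent with the known bindings.
    """
    if not isinstance(node, tuple):
        return node  # leaf: nothing left to specialize
    var, branches = node
    if var in known:
        # The user already answered this question: follow that branch only.
        chosen = branches.get(known[var])
        return None if chosen is None else partial_eval(chosen, known)
    # Unknown variable: keep the test, specialize each branch, drop dead ones.
    residual = {}
    for value, subtree in branches.items():
        specialized = partial_eval(subtree, known)
        if specialized is not None:
            residual[value] = specialized
    return (var, residual) if residual else None

# Hypothetical site that asks for a state first and a party second.
site = ("state", {
    "VA": ("party", {"D": "va_d.html", "R": "va_r.html"}),
    "NY": ("party", {"D": "ny_d.html"}),
})

# The user supplies the party out of order; the residual site asks only for the state.
by_party = partial_eval(site, {"party": "D"})
# Supplying both variables reduces the site to a single page.
single_page = partial_eval(site, {"state": "VA", "party": "R"})
```

    In this toy encoding, the specialized tree plays the role of the personalized site: questions the user has already answered disappear, and branches that cannot lead to content are pruned.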

    Retrieval experiments using pseudo-desktop collections


    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    After addressing the state of the art during the first year of CHORUS and establishing the existing landscape in multimedia search engines, we identified and analyzed gaps within the European research effort during our second year. In this period we focused on three directions: technological issues, user-centred issues and use-cases, and socio-economic and legal aspects. These were assessed by two central studies: first, a concerted vision of the functional breakdown of a generic multimedia search engine, and second, representative use-case descriptions with a related discussion of the requirements for technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to EU project coordinators as well as national initiative coordinators. Based on the feedback obtained, we identified two types of gaps: core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented, as well as emerging legal challenges.

    Digital Image Users and Reuse: Enhancing practitioner discoverability of digital library reuse based on user file naming behavior

    This dissertation explores devices practitioners use to discover the reuse of digital library materials. The author performs two verification studies investigating two previously employed strategies that practitioners use to identify digital object reuse, specifically Google Images reverse image lookup (RIL) and embedded metadata. It describes the limitations of these strategies and offers a new approach for tracking reuse, based on the author's search approach that draws on user file naming behavior. While exploring the utility and limitations of Google Images and embedded metadata, the author observes and documents a pattern of user file naming behavior that shows promise for improving practitioners' discovery of reuse. The author conducts a file naming assessment investigation to examine this pattern of user file naming behavior and the impact of file naming on search engine optimization. The author derives several significant findings from this study. The author establishes that, because of its algorithm change, Google Images is no longer a viable tool for discovering reuse by the general public or other users, except for industry users. Embedded metadata is not a reliable assessment tool because it is non-persistent. The author finds that many users generate their own file names, almost exclusively human-readable, when saving and sharing digital images. The author argues that when practitioners model search terms after the "aggregated file names", they increase their discovery of reused digital objects.