20,010 research outputs found

    EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval

    Dense embedding-based retrieval is now the industry standard for semantic search and ranking problems, such as obtaining relevant web documents for a given query. Such techniques use a two-stage process: (a) contrastive learning to train a dual encoder to embed both the query and documents and (b) approximate nearest neighbor search (ANNS) for finding similar documents for a given query. These two stages are disjoint; the learned embeddings might be ill-suited for the ANNS method and vice versa, leading to suboptimal performance. In this work, we propose End-to-end Hierarchical Indexing (EHI), which jointly learns both the embeddings and the ANNS structure to optimize retrieval performance. EHI uses a standard dual encoder model for embedding queries and documents while learning an inverted file index (IVF) style tree structure for efficient ANNS. To ensure stable and efficient learning of the discrete tree-based ANNS structure, EHI introduces the notion of a dense path embedding that captures the position of a query/document in the tree. We demonstrate the effectiveness of EHI on several benchmarks, including the de facto industry-standard MS MARCO (Dev set and TREC DL19) datasets. For example, with the same compute budget, EHI outperforms the state of the art (SOTA) by 0.6% (MRR@10) on the MS MARCO dev set and by 4.2% (nDCG@10) on the TREC DL19 benchmark.

    Content-based indexing of low resolution documents

    In multimedia presentations, attendees increasingly photograph slides that interest them using capture devices. To make these images more useful, they could be linked to an image or video database. Such a database can be used for file archiving, teaching and learning, research, and knowledge management, all of which involve image search. However, images from these devices, which include cameras and mobile phones, are of low resolution owing to poor lighting and noise. Content-Based Image Retrieval (CBIR) is considered among the most interesting and promising fields in image search. Image search is concerned with finding images in a given image database that are similar to a known query image. This thesis is concerned with methods for identifying documents captured using image capture devices, and with a technique for retrieving images from an indexed image database. Both apply digital image processing techniques. To build an index structure for fast, high-quality content-based image retrieval, some existing representative signatures and key indexes have been revised. Retrieval performance depends heavily on how the indexing is done. Existing retrieval approaches make use of shape, colour and texture features. Considering these features relative to individual databases, the majority of retrieval approaches give poor results on low-resolution documents, are time-consuming and, in some cases, return irrelevant images for a given query image. The identification and indexing method proposed in the thesis uses a Visual Signature (VS). The VS consists of the graphical information of the captured slide's textual layout, shape moments and the spatial distribution of colour. This signature-based approach is designed for fast and efficient matching to meet the needs of real-time applications. The approach can also overcome the problems of low-resolution documents, such as noisy images, varying lighting conditions and complex backgrounds. We present hierarchical indexing techniques based on trees and clustering. K-means clustering is used for visual features such as colour, since their spatial distribution gives good global information about an image. Tree indexes for the extracted layout and shape features are structured hierarchically, and Euclidean distance is used to find similar images for CBIR. The proposed indexing scheme is assessed using recall and precision, the standard measures of CBIR retrieval performance. We develop a CBIR system and conduct various retrieval experiments with the fundamental aim of comparing image retrieval accuracy. A new algorithm that can be used with integrated visual signatures, especially in late-fusion queries, is introduced. The algorithm can reduce the shortcomings associated with normalisation in the initial fusion technique. Slides from conference, lecture and meeting presentations are used as real data to compare the performance of the proposed technique with that of existing approaches. The findings of the thesis present exciting possibilities, as the CBIR system is able to produce high-quality results even for queries that use low-resolution documents. In the future, multimodal signatures, relevance feedback and artificial intelligence techniques are recommended for use in CBIR systems to further enhance performance.

    Random Indexing K-tree

    Random Indexing (RI) K-tree is the combination of two algorithms for clustering. Many large scale problems exist in document clustering. RI K-tree scales well with large inputs due to its low complexity. It also exhibits features that are useful for managing a changing collection. Furthermore, it solves previous issues with sparse document vectors when using K-tree. The algorithms and data structures are defined, explained and motivated. Specific modifications to K-tree are made for use with RI. Experiments have been executed to measure quality. The results indicate that RI K-tree improves document cluster quality over the original K-tree algorithm. Comment: 8 pages, ADCS 2009; Hyperref and cleveref LaTeX packages conflicted. Removed clevere
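The Random Indexing half of this combination can be illustrated with a minimal sketch. It assumes the common formulation in which every term receives a sparse random vector of +1/-1 entries and a document vector is the sum of the vectors of its terms. The function names, dimensionality and seeding scheme are illustrative assumptions, not taken from the paper, and the K-tree clustering step is not shown.

```python
import random


def index_vector(term, dim=64, nonzeros=4, seed=0):
    """Sparse ternary random vector for a term (hypothetical parameters).

    Seeding the generator with the term makes the vector reproducible
    without having to store it.
    """
    rng = random.Random(f"{seed}:{term}")
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def document_vector(terms, dim=64, nonzeros=4, seed=0):
    """Dense, fixed-size document vector: the sum of its terms' index vectors."""
    doc = [0] * dim
    for term in terms:
        for i, value in enumerate(index_vector(term, dim, nonzeros, seed)):
            doc[i] += value
    return doc


doc_a = document_vector(["random", "indexing", "tree"])
doc_b = document_vector(["tree", "clustering"])
print(len(doc_a), len(doc_b))
```

Because every document maps to a vector of the same small dimension regardless of vocabulary size, the result is a dense representation of the kind a clustering algorithm can consume directly.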

    Location-based indexing for mobile context-aware access to a digital library

    Mobile information systems need to collaborate with each other to provide seamless information access to the user. Information about the user and their context provides the points of contact between the systems. Location is the most basic user context. TIP is a mobile tourist information system that provides location-based access to documents in the digital library Greenstone. This paper identifies the challenges for providing efficient access to location-based information using the various access modes a tourist requires on their travels. We discuss our extended 2DR-tree approach to meet these challenges.

    Investigation into Indexing XML Data Techniques

    The rapid development of XML technology has improved the WWW, since XML data has many advantages and has become a common technology for transferring data across the Internet. The objective of this research is therefore to investigate XML indexing techniques in terms of their structures. The main goal of this investigation is to identify the main limitations of these techniques and any other open issues. Furthermore, this research considers the most common XML indexing techniques and compares them. Subsequently, this work presents an argument to identify these limitations. To conclude, the main problem of all the XML indexing techniques is the trade-off between the size and the efficiency of the indexes. All the indexes become large in order to perform well, and none of them is suitable for all users' requirements. However, each of these techniques has some advantages.

    Non-hierarchical Structures: How to Model and Index Overlaps?

    Overlap is a common phenomenon seen when structural components of a digital object are neither disjoint nor nested inside each other. Overlapping components resist reduction to a structural hierarchy, and tree-based indexing and query processing techniques cannot be used for them. Our solution to this data modeling problem is TGSA (Tree-like Graph for Structural Annotations), a novel extension of the XML data model for non-hierarchical structures. We introduce an algorithm for constructing TGSA from annotated documents; the algorithm can efficiently process non-hierarchical structures and is associated with formal proofs, ensuring that transformation of the document to the data model is valid. To enable high performance query analysis in large data repositories, we further introduce an extension of XML pre-post indexing for non-hierarchical structures, which can process both reachability and overlapping relationships. Comment: The paper has been accepted at the Balisage 2014 conference
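For orientation, the classic pre-post indexing scheme that this abstract extends can be sketched as follows. The sketch covers only the standard hierarchical case, in which a node u is an ancestor of v exactly when pre(u) < pre(v) and post(u) > post(v). It does not attempt the paper's extension to overlapping structures, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET


def pre_post_index(root):
    """Map each element to its (pre-order rank, post-order rank)."""
    index = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]
        counters["pre"] += 1
        for child in node:
            visit(child)
        # The post-order rank is assigned once all descendants are done.
        index[node] = (pre, counters["post"])
        counters["post"] += 1

    visit(root)
    return index


def is_ancestor(index, u, v):
    """Reachability test using only the two ranks, with no tree traversal."""
    return index[u][0] < index[v][0] and index[u][1] > index[v][1]


doc = ET.fromstring("<book><chapter><para/><para/></chapter><appendix/></book>")
index = pre_post_index(doc)
chapter = doc.find("chapter")
first_para = chapter[0]
appendix = doc.find("appendix")

print(is_ancestor(index, chapter, first_para))  # True
print(is_ancestor(index, chapter, appendix))    # False
```

The appeal of the scheme is that ancestor queries reduce to two integer comparisons, which is what makes it attractive as a basis for indexing large repositories.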

    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
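The difference between a differential (gap) list and its run-length compressed form can be shown with a toy sketch. It assumes a postings list of ascending document identifiers in which a term occurring in consecutive versions yields long runs of gap 1. The function names and the example list are illustrative assumptions, not the paper's actual encoders.

```python
from itertools import groupby


def to_gaps(postings):
    """Differential list: the first id, then differences between neighbours."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def run_length(gaps):
    """Collapse each run of equal gaps into a (gap, count) pair."""
    return [(value, len(list(group))) for value, group in groupby(gaps)]


def from_runs(runs):
    """Rebuild the original postings list from (gap, count) pairs."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


postings = [3, 4, 5, 6, 7, 20, 21, 22, 23]
gaps = to_gaps(postings)
runs = run_length(gaps)
print(gaps)  # [3, 1, 1, 1, 1, 13, 1, 1, 1]
print(runs)  # [(3, 1), (1, 4), (13, 1), (1, 3)]
```

In this toy case nine gaps shrink to four pairs, and the saving grows with the length of the runs, which is the regularity that near-copy collections tend to produce.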

    TIP spatial index: efficient access to digital libraries in a context-aware mobile system

    We present a framework for efficient, uniform, location-based access to digital library collections that are external to a context-aware mobile information system. Using a tourist information system, we utilize a spatial index to manage the context of location. We show how access to resources from within and outside of the tourist information system can be carried out in a seamless manner. We show how the spatial index can be navigated to continually provide information to the user. An empirical evaluation of the navigation strategy versus traditional spatial searching shows that navigation is efficient and outperforms traditional spatial search. In conclusion, our work provides a strategy for context-aware mobile systems to co-operate with digital libraries in a seamless and efficient manner.