20,010 research outputs found
EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval
Dense embedding-based retrieval is now the industry standard for semantic
search and ranking problems, like obtaining relevant web documents for a given
query. Such techniques use a two-stage process: (a) contrastive learning to
train a dual encoder to embed both the query and documents and (b) approximate
nearest neighbor search (ANNS) for finding similar documents for a given query.
These two stages are disjoint; the learned embeddings might be ill-suited for
the ANNS method and vice-versa, leading to suboptimal performance. In this
work, we propose End-to-end Hierarchical Indexing -- EHI -- that jointly learns
both the embeddings and the ANNS structure to optimize retrieval performance.
EHI uses a standard dual encoder model for embedding queries and documents
while learning an inverted file index (IVF) style tree structure for efficient
ANNS. To ensure stable and efficient learning of discrete tree-based ANNS
structure, EHI introduces the notion of dense path embedding that captures the
position of a query/document in the tree. We demonstrate the effectiveness of
EHI on several benchmarks, including de-facto industry standard MS MARCO (Dev
set and TREC DL19) datasets. For example, with the same compute budget, EHI
outperforms state-of-the-art (SOTA) in by 0.6% (MRR@10) on MS MARCO dev set and
by 4.2% (nDCG@10) on TREC DL19 benchmarks
Content-based indexing of low resolution documents
In any multimedia presentation, the trend for attendees taking pictures of slides that interest them during the presentation using capturing devices is gaining popularity. To enhance the image usefulness, the images captured could be linked to image or video database. The database can be used for the purpose of file archiving, teaching and learning, research and knowledge management, which concern image search. However, the above-mentioned devices include cameras or mobiles phones have low resolution resulted from poor lighting and noise. Content-Based Image Retrieval (CBIR) is considered among the most interesting and promising fields as far as image search is concerned. Image search is related with finding images that are similar for the known query image found in a given image database. This thesis concerns with the methods used for the purpose of identifying documents that are captured using image capturing devices. In addition, the thesis also concerns with a technique that can be used to retrieve images from an indexed image database. Both concerns above apply digital image processing technique. To build an indexed structure for fast and high quality content-based retrieval of an image, some existing representative signatures and the key indexes used have been revised. The retrieval performance is very much relying on how the indexing is done. The retrieval approaches that are currently in existence including making use of shape, colour and texture features. Putting into consideration these features relative to individual databases, the majority of retrievals approaches have poor results on low resolution documents, consuming a lot of time and in the some cases, for the given query image, irrelevant images are obtained. The proposed identification and indexing method in the thesis uses a Visual Signature (VS). VS consists of the captures slides textual layout’s graphical information, shape’s moment and spatial distribution of colour. This approach, which is signature-based are considered for fast and efficient matching to fulfil the needs of real-time applications. The approach also has the capability to overcome the problem low resolution document such as noisy image, the environment’s varying lighting conditions and complex backgrounds. We present hierarchy indexing techniques, whose foundation are tree and clustering. K-means clustering are used for visual features like colour since their spatial distribution give a good image’s global information. Tree indexing for extracted layout and shape features are structured hierarchically and Euclidean distance is used to get similarity image for CBIR. The assessment of the proposed indexing scheme is conducted based on recall and precision, a standard CBIR retrieval performance evaluation. We develop CBIR system and conduct various retrieval experiments with the fundamental aim of comparing the accuracy during image retrieval. A new algorithm that can be used with integrated visual signatures, especially in late fusion query was introduced. The algorithm has the capability of reducing any shortcoming associated with normalisation in initial fusion technique. Slides from conferences, lectures and meetings presentation are used for comparing the proposed technique’s performances with that of the existing approaches with the help of real data. This finding of the thesis presents exciting possibilities as the CBIR systems is able to produce high quality result even for a query, which uses low resolution documents. In the future, the utilization of multimodal signatures, relevance feedback and artificial intelligence technique are recommended to be used in CBIR system to further enhance the performance
Random Indexing K-tree
Random Indexing (RI) K-tree is the combination of two algorithms for
clustering. Many large scale problems exist in document clustering. RI K-tree
scales well with large inputs due to its low complexity. It also exhibits
features that are useful for managing a changing collection. Furthermore, it
solves previous issues with sparse document vectors when using K-tree. The
algorithms and data structures are defined, explained and motivated. Specific
modifications to K-tree are made for use with RI. Experiments have been
executed to measure quality. The results indicate that RI K-tree improves
document cluster quality over the original K-tree algorithm.Comment: 8 pages, ADCS 2009; Hyperref and cleveref LaTeX packages conflicted.
Removed clevere
Location-based indexing for mobile context-aware access to a digital library
Mobile information systems need to collaborate with each other to provide seamless information access to the user. Information about the user and their context provides the points of contact between the systems. Location is the most basic user context.
TIP is a mobile tourist information system that provides location-based access to documents in the digital library Greenstone. This paper identifies the challenges for providing effcient access to location-based information using the various access modes a tourist requires on their travels. We discuss our extended 2DR-tree approach to meet these challenges
Investigation into Indexing XML Data Techniques
The rapid development of XML technology improves the WWW, since the XML data has many advantages and has become a common technology for transferring data cross the internet. Therefore, the objective of this research is to investigate and study the XML indexing techniques in terms of their structures. The main goal of this investigation is to identify the main limitations of these techniques and any other open issues.
Furthermore, this research considers most common XML indexing techniques and performs a comparison between them. Subsequently, this work makes an argument to find out these limitations. To conclude, the main problem of all the XML indexing techniques is the trade-off between the
size and the efficiency of the indexes. So, all the indexes become large in order to perform well, and none of them is suitable for all users’ requirements. However, each one of these techniques has some advantages in somehow
Non-hierarchical Structures: How to Model and Index Overlaps?
Overlap is a common phenomenon seen when structural components of a digital
object are neither disjoint nor nested inside each other. Overlapping
components resist reduction to a structural hierarchy, and tree-based indexing
and query processing techniques cannot be used for them. Our solution to this
data modeling problem is TGSA (Tree-like Graph for Structural Annotations), a
novel extension of the XML data model for non-hierarchical structures. We
introduce an algorithm for constructing TGSA from annotated documents; the
algorithm can efficiently process non-hierarchical structures and is associated
with formal proofs, ensuring that transformation of the document to the data
model is valid. To enable high performance query analysis in large data
repositories, we further introduce an extension of XML pre-post indexing for
non-hierarchical structures, which can process both reachability and
overlapping relationships.Comment: The paper has been accepted at the Balisage 2014 conferenc
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sk{\l}odowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
TIP spatial index: efficient access to digital libraries in a context-aware mobile system
We present a framework for efficient, uniform, location-based access to digital library collections that are external to a context-aware mobile information system. Using a tourist Information system, we utilize a spatial index to manage the context of location. We show how access to resources from within and outside of the tourist information system can be carried out in a seamless manner. We show how the spatial index can be navigated to continually provide information to the user. An empirical evaluation of the navigation strategy versus traditional spatial searching shows that navigation is efficient and outperforms traditional spatial search. In conclusion, our work provides a strategy for context-aware mobile systems to co-operate with digital libraries in a seamless and efficient manner
- …