
    GVC: efficient random access compression for gene sequence variations

    Background: In recent years, advances in high-throughput sequencing technologies have enabled the use of genomic information in many fields, such as precision medicine, oncology, and food quality control. The amount of genomic data being generated is growing rapidly and is expected to soon surpass the amount of video data. The majority of sequencing experiments, such as genome-wide association studies, have the goal of identifying variations in the gene sequence to better understand phenotypic variations. We present a novel approach for compressing gene sequence variations with random access capability: the Genomic Variant Codec (GVC). We use techniques such as binarization, joint row- and column-wise sorting of blocks of variations, as well as the image compression standard JBIG for efficient entropy coding. Results: Our results show that GVC provides the best trade-off between compression and random access compared to the state of the art: it reduces the genotype information size from 758 GiB down to 890 MiB on the publicly available 1000 Genomes Project (phase 3) data, which is 21% less than the state of the art in random-access capable methods. Conclusions: By providing the best results in terms of combined random access and compression, GVC facilitates the efficient storage of large collections of gene sequence variations. In particular, the random access capability of GVC enables seamless remote data access and application integration. The software is open source and available at https://github.com/sXperfect/gvc/
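    The abstract outlines a three-stage pipeline: binarization of the genotype matrix, joint row- and column-wise sorting of blocks, and bi-level entropy coding with JBIG. The minimal Python sketch below only illustrates that pipeline on a toy block; it is not the GVC implementation, the sorting is a simple greedy lexicographic heuristic, and zlib stands in for the JBIG coder, which has no standard Python binding. A real codec would also have to store the per-block row and column permutations to preserve random access.

        import zlib
        import numpy as np

        def binarize(genotypes):
            """Split an integer genotype matrix (samples x variants) into binary bit planes."""
            n_bits = int(genotypes.max()).bit_length() or 1
            return [((genotypes >> b) & 1).astype(np.uint8) for b in range(n_bits)]

        def sort_block(plane):
            """Greedy joint sorting: order rows, then columns, by their bit patterns
            so that similar lines end up adjacent and compress better."""
            plane = plane[np.lexsort(plane.T[::-1])]   # sort rows (samples)
            return plane[:, np.lexsort(plane[::-1])]   # sort columns (variants)

        def compress_block(genotypes):
            """Binarize, sort, pack bits, and entropy-code a genotype block (zlib as a JBIG stand-in)."""
            payload = b"".join(np.packbits(sort_block(p)).tobytes() for p in binarize(genotypes))
            return zlib.compress(payload, 9)

        block = np.random.default_rng(0).integers(0, 3, size=(64, 128))  # toy 0/1/2 genotypes
        print(len(compress_block(block)), "bytes")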

    Less is More: Restricted Representations for Better Interpretability and Generalizability

    Deep neural networks are prevalent in supervised learning for many tasks such as image classification, machine translation, and even scientific discovery. Their success often comes at the expense of interpretability and generalizability. The increasing complexity of models and the involvement of pre-training make the lack of explainability more pressing. Outstanding performance when labeled data are abundant, combined with a tendency to overfit when labeled data are limited, shows how difficult it is for deep neural networks to generalize across datasets. This thesis aims to improve interpretability and generalizability by restricting representations. We approach interpretability through attribution analysis, to understand which features contribute to BERT's predictions, and generalizability through effective methods for the low-data regime. We consider two strategies for restricting representations: (1) adding a bottleneck, and (2) introducing compression. Given an input x from which we want to learn y via a latent representation z (i.e., x→z→y), adding a bottleneck means adding a function R such that L(R(z)) < L(z), and introducing compression means adding a function R such that L(R(y)) < L(y), where L denotes the number of bits. In other words, the restriction is added either in the middle of the pipeline or at its end. We first introduce how adding an information bottleneck can help attribution analysis and apply it to investigate BERT's behavior on text classification in Chapter 3. We then extend this attribution method to analyze passage reranking in Chapter 4, where we conduct a detailed analysis of cross-layer and cross-passage behavior. Adding a bottleneck not only provides insight into deep neural networks but can also be used to increase generalizability. In Chapter 5, we demonstrate the equivalence between adding a bottleneck and performing neural compression. We then leverage this finding in a framework called Non-Parametric learning by Compression with Latent Variables (NPC-LV), and show how optimizing neural compressors can be used for non-parametric image classification with few labeled data. To further investigate how compression alone helps non-parametric learning without latent variables (NPC), we carry out experiments with the universal compressor gzip on text classification in Chapter 6. In Chapter 7, we describe methods that adopt the perspective of compression without performing actual compression, using T5. Using experimental results in passage reranking, we show that our method is highly effective in a low-data regime where only one thousand query-passage pairs are available. Beyond the weakly supervised scenario, we also extend our method to large language models like GPT under almost no supervision, in one-shot and zero-shot settings. The experiments show that, without extra parameters or in-context learning, GPT can be used for semantic similarity, text classification, and text ranking, and can outperform strong baselines, as presented in Chapter 8. The thesis tackles two big challenges in machine learning, "interpretability" and "generalizability", by restricting representations. We provide both theoretical derivations and empirical results to show the effectiveness of information-theoretic approaches. We not only design new algorithms but also provide numerous insights into why and how "compression" is so important for understanding deep neural networks and improving generalizability.
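    The NPC experiments of Chapter 6 pair a universal compressor with a non-parametric classifier. As a rough illustration of that idea (not the thesis code; the helper names and toy data are made up here), one can classify text with gzip via the Normalized Compression Distance and a k-nearest-neighbour vote:

        import gzip
        from collections import Counter

        def clen(text):
            """Length in bytes of the gzip-compressed text, a rough proxy for its information content."""
            return len(gzip.compress(text.encode("utf-8")))

        def ncd(a, b):
            """Normalized Compression Distance: small when the two texts share regularities."""
            ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
            return (cab - min(ca, cb)) / max(ca, cb)

        def classify(query, train, k=3):
            """Label `query` by majority vote among its k nearest training texts under NCD."""
            neighbours = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
            return Counter(label for _, label in neighbours).most_common(1)[0][0]

        train = [("the match ended two to one", "sports"),
                 ("shares fell after the earnings report", "finance"),
                 ("the striker scored a late goal", "sports"),
                 ("the central bank raised interest rates", "finance")]
        print(classify("the goalkeeper saved a penalty", train))

    The intuition is that compressing two texts together costs little extra when they share regularities, so NCD acts as a similarity measure that requires no training or parameters.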

    Qualitative Evaluation of Data Compression in Real-time Ultrasound Imaging

    The purpose of this project was to qualitatively evaluate real-time ultrasound imaging, using objective and subjective techniques, to determine the minimum bandwidth required for clinical diagnosis of various anatomical and pathological states. In the experimental setup, live ultrasound video samples representing the most common clinical examinations were compressed at 128, 256, 384, 768, 1152 and 1536 kbps using a compressor-decompressor (CODEC) adhering to International Telecommunication Union (ITU-T) recommendation H.261. A protocol for qualitative evaluation was developed, and subjective and objective testing were performed based on this protocol. Subjective methods comprised inter-rater reliability tests using kappa statistics and three-way Analysis of Variance (ANOVA) using General Linear Models (GLM). Objective testing was performed using histogram analysis and estimation of peak signal-to-noise ratios. The kappa scores for all bandwidths greater than 256 kbps indicated good inter-rater reliability and minimal variation in confidence levels. Using the results from the GLM and ANOVA, we could not establish a trend of degrading observer confidence with increasing compression ratios. The histogram analysis showed a linear increase in standard deviation values, indicating a linear scatter in pixel intensity with increasing compression ratios. Although higher compression levels were evaluated, only video clips with bandwidths greater than 256 kbps displayed temporal and spatial resolution satisfactory enough to make a clinical diagnosis of various anatomical and pathological states. The evaluations also indicate that real-time ultrasound imagery compressed with H.261 can be transmitted over T1 or ADSL networks.
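    As a concrete illustration of the objective measures mentioned above, the sketch below computes the peak signal-to-noise ratio between an original and a degraded frame, plus the pixel-intensity standard deviation used in the histogram analysis. The frames here are synthetic stand-ins; in the study they would be frames decoded from the H.261-compressed clips.

        import numpy as np

        def psnr(original, degraded, peak=255.0):
            """Peak signal-to-noise ratio in dB between two 8-bit frames."""
            mse = np.mean((original.astype(np.float64) - degraded.astype(np.float64)) ** 2)
            return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

        rng = np.random.default_rng(0)
        frame = rng.integers(0, 256, size=(240, 320), dtype=np.uint8)              # stand-in "original" frame
        degraded = np.clip(frame + rng.normal(0, 5, frame.shape), 0, 255).astype(np.uint8)
        print(f"PSNR: {psnr(frame, degraded):.1f} dB, intensity std: {degraded.std():.1f}")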

    The 1993 Space and Earth Science Data Compression Workshop

    The Earth Observing System Data and Information System (EOSDIS) is described in terms of its data volume, data rate, and data distribution requirements. Opportunities for data compression in EOSDIS are discussed

    Efficient query processing for scalable web search

    Search engines are exceptionally important tools for accessing information in today’s world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also covering the latest trends in the literature on efficient query processing, including coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists, as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query trade-offs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time and energy-efficient processing, and modern hardware and software architectures.
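    For readers unfamiliar with the query processing strategies the survey reviews, the sketch below shows exhaustive document-at-a-time (DAAT) top-k retrieval in its simplest form: one cursor per query term, documents scored in increasing docid order, and a small heap holding the current top k. The toy postings and additive scores are illustrative assumptions; a real engine would use a ranking model such as BM25 and dynamic pruning (e.g., WAND or BMW) to skip low-scoring documents.

        import heapq

        def daat_topk(postings, query, k=10):
            """Exhaustive DAAT: merge the query terms' postings by docid and keep the top k."""
            cursors = {t: iter(postings.get(t, [])) for t in query}
            heads = {t: next(c, None) for t, c in cursors.items()}   # current (docid, score) per term
            heap = []                                                # min-heap of (score, docid)
            while any(h is not None for h in heads.values()):
                doc = min(h[0] for h in heads.values() if h is not None)
                score = 0.0
                for t, h in heads.items():                           # score `doc`, advance matching cursors
                    if h is not None and h[0] == doc:
                        score += h[1]
                        heads[t] = next(cursors[t], None)
                heapq.heappush(heap, (score, doc))
                if len(heap) > k:
                    heapq.heappop(heap)                              # evict the current worst of the top k
            return sorted(heap, reverse=True)

        postings = {"web": [(1, 0.4), (3, 0.9), (7, 0.2)],
                    "search": [(1, 0.7), (7, 1.1), (9, 0.3)]}
        print(daat_topk(postings, ["web", "search"], k=2))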

    Real-time Text Queries with Tunable Term Pair Indexes

    Term proximity scoring is an established means in information retrieval for improving the result quality of full-text queries. Integrating such proximity scores into efficient query processing, however, has not been equally well studied. Existing methods make use of precomputed lists of documents in which tuples of terms, usually pairs, occur together, typically incurring a huge index size compared to term-only indexes. This paper introduces a joint framework for trading off index size and result quality, and provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance, given an upper bound on the index size. The framework also allows lists for selected pairs to be materialized, guided by a query log, to further reduce index size. Extensive experiments with two large text collections demonstrate runtime improvements of several orders of magnitude over existing text-based processing techniques with reasonable index sizes.
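    As a rough illustration of the kind of quantity such a term-pair index precomputes (not the paper's scoring model; the inverse-square decay below is an assumption), a pair's proximity contribution in a document can be accumulated from the positions of the two terms, decaying with the distance between them:

        def pair_proximity(pos_a, pos_b):
            """Sum of 1/d^2 over all cross pairs of positions; higher means the terms occur closer together."""
            return sum(1.0 / (a - b) ** 2 for a in pos_a for b in pos_b if a != b)

        # Hypothetical positions of two query terms within one document.
        print(round(pair_proximity([3, 17], [4, 40]), 3))

    Storing such per-document contributions only for selected pairs, for example those that a query log shows to co-occur frequently in queries, is what lets an index trade size against query processing speed.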