831 research outputs found

    Efficient Document Re-Ranking for Transformers by Precomputing Term Representations

    Full text link
    Deep pretrained transformer networks are effective at various ranking tasks, such as question answering and ad-hoc document ranking. However, their computational expenses deem them cost-prohibitive in practice. Our proposed approach, called PreTTR (Precomputing Transformer Term Representations), considerably reduces the query-time latency of deep transformer networks (up to a 42x speedup on web document ranking) making these networks more practical to use in a real-time ranking scenario. Specifically, we precompute part of the document term representations at indexing time (without a query), and merge them with the query representation at query time to compute the final ranking score. Due to the large size of the token representations, we also propose an effective approach to reduce the storage requirement by training a compression layer to match attention scores. Our compression technique reduces the storage required up to 95% and it can be applied without a substantial degradation in ranking performance.Comment: Accepted at SIGIR 2020 (long

    Augmenting Latent Dirichlet Allocation and Rank Threshold Detection with Ontologies

    Get PDF
    In an ever-increasing data rich environment, actionable information must be extracted, filtered, and correlated from massive amounts of disparate often free text sources. The usefulness of the retrieved information depends on how we accomplish these steps and present the most relevant information to the analyst. One method for extracting information from free text is Latent Dirichlet Allocation (LDA), a document categorization technique to classify documents into cohesive topics. Although LDA accounts for some implicit relationships such as synonymy (same meaning) it often ignores other semantic relationships such as polysemy (different meanings), hyponym (subordinate), meronym (part of), and troponomys (manner). To compensate for this deficiency, we incorporate explicit word ontologies, such as WordNet, into the LDA algorithm to account for various semantic relationships. Experiments over the 20 Newsgroups, NIPS, OHSUMED, and IED document collections demonstrate that incorporating such knowledge improves perplexity measure over LDA alone for given parameters. In addition, the same ontology augmentation improves recall and precision results for user queries

    Inference and Evaluation of the Multinomial Mixture Model for Text Clustering

    Full text link
    In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. The model considered in this contribution consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We present and contrast various estimation procedures, which apply both in supervised and unsupervised contexts. In supervised learning, this work suggests a criterion for evaluating the posterior odds of new documents which is more statistically sound than the "naive Bayes" approach. In an unsupervised context, we propose measures to set up a systematic evaluation framework and start with examining the Expectation-Maximization (EM) algorithm as the basic tool for inference. We discuss the importance of initialization and the influence of other features such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We also propose a heuristic algorithm based on iterative EM with vocabulary reduction to solve this problem. Using the fact that the latent variables can be analytically integrated out, we finally show that Gibbs sampling algorithm is tractable and compares favorably to the basic expectation maximization approach

    Making a Better Query: Find Good Feedback Documents and Terms via Semantic Associations

    Get PDF
    When people search, they always input several keywords as an input query. While current information retrieval (IR) systems are based on term matching, documents will not be considered as relevant if they do not have the exact terms as in the query. However, it is common that these documents are relevant if they contain terms semantically similar to the query. To retrieve these documents, a classic way is to expand the original query with more related terms. Pseudo relevance feedback (PRF) has proven to be effective to expand origin queries and improve the performance of IR. It assumes the top k ranked documents obtained through the first round retrieval are relevant as feedback documents, and expand the original queries with feedback terms selected from these feedback documents. However, applying PRF for query expansion must be very carefully. Wrongly added terms can bring noisy information and hurt the overall search experiences extensively. The assumption of feedback documents is too strong to be completely true. To avoid noise import and make significant improvements simultaneously, we solve the significant problem through four ways in this dissertation. Firstly, we assume the proximity information among terms as term semantic associations and utilize them to seek new relevant terms. Next, to obtain good and robust performance for PRF via adapting topic information, we propose a new concept named topic space and present three models based on it. Topics obtained through topic modeling do help identify how relevant a feedback document is. Weights of candidate terms in these more relevant feedback documents will be boosted and have higher probabilities to be chosen. Furthermore, we apply machine learning methods to classify which feedback documents are effective for PRF. To solve the problem of lack-of-training-data for the application of machine learning methods in PRF, we improve a traditional co-training method and take the quality of classifiers into account. Finally, we present a new probabilistic framework to integrate existing effective methods like semantic associations as components for further research. All the work has been tested on public datasets and proven to be effective and efficient

    Transfer learning for information retrieval

    Get PDF
    The lack of relevance labels is increasingly challenging and presents a bottleneck in the training of reliable learning-to-rank (L2R) models. Obtaining relevance labels using human judgment is expensive and even impossible in some scenarios. Previous research has studied different approaches to solving the problem including generating relevance labels by crowdsourcing and active learning. Recent studies have started to find ways to reuse knowledge from a related collection to help the ranking in a new collection. However, the effectiveness of a ranking function trained in one collection may be degraded when used in another collection due to the generalization issues of machine learning. Transfer learning involves a set of algorithms that are used to train or adapt a model for a target collection without sucient training labels by transferring knowledge from a related source collection with abundant labels. Transfer learning can also be applied to L2R to help train ranking functions for a new task by reusing data from a related collection while minimizing the generalization gap. Some attempts have been made to apply transfer learning techniques on L2R tasks. This thesis investigates different approaches to transfer learning methods for L2R, which are called transfer ranking. However, most of the existing studies on transfer ranking have been focused on the scenario when there are a small but not sucient number of relevance labels. The field of transfer ranking with no target collection labels is still relatively undeveloped. Moreover, the main reason why a transfer ranking solution is needed is that a ranking function trained in the source collection cannot generalize to the target collection, due to the differences in the data distribution of the two collections. However, the effect of the data distribution differences on ranking model generalization has not been examined in detail. The focus of this study is the scenario when there are no relevance labels from the new collection (the target collection), but where a related collection (the target collection) has an abundant amount of training data and labels. In this thesis, we first demonstrate the generalization gap of different L2R algorithms when the distribution of the source and target collections are different in multiple ways, and we then develop alternative solutions to tackling the problem, which includes instance weighting algorithms and self-labeling methods. Instance weighting algorithms estimate weights for each training query in the source collection according to the target query distribution and use the weighted objective function to optimize a ranking function for the target collection. The results on different test collections suggest that instance weighting methods, including existing approaches, are not reliable. The self-labeling methods use other approaches to generate imputed relevance labels for queries in the target collection, which look to transfer the ranking knowledge to the target collection by transferring the label knowledge. The algorithms were tested on various transferring scenarios and showed significant effectiveness and consistency. We thus demonstrate that the performance of self-labeling methods can be further improved with a minimal number of calibration labels from the target collection. The algorithms and knowledge developed in this thesis can help solve generic ranking knowledge transfer problems under different scenarios

    Semantics-based language models for information retrieval and text mining

    Get PDF
    The language modeling approach centers on the issue of estimating an accurate model by choosing appropriate language models as well as smoothing techniques. In the thesis, we propose a novel context-sensitive semantic smoothing method referred to as a topic signature language model. It extracts explicit topic signatures from a document and then statistically maps them into individual words in the vocabulary. In order to support the new language model, we developed two automated algorithms to extract multiword phrases and ontological concepts, respectively, and an EM-based algorithm to learn semantic mapping knowledge from co-occurrence data. The topic signature language model is applied to three applications: information retrieval, text classification, and text clustering. The evaluations on news collection and biomedical literature prove the effectiveness of the topic signature language model.In the experiment of information retrieval, the topic signature language model consistently outperforms the baseline two-stage language model as well as the context-insensitive semantic smoothing method in all configurations. It also beats the state-of-the-art Okapi models in all configurations. In the experiment of text classification, when the size of training documents is small, the Bayesian classifier with semantic smoothing not only outperforms the classifiers with background smoothing and Laplace smoothing, but it also beats the active learning classifiers and SVM classifiers. On the task of clustering, whether or not the dataset to cluster is small, the model-based k-means with semantic smoothing performs significantly better than both the model-based k-means with background smoothing and Laplace smoothing. It is also superior to the spherical k-means in terms of effectiveness.In addition, we empirically prove that, within the framework of topic signature language models, the semantic knowledge learned from one collection could be effectively applied to other collections. In the thesis, we also compare three types of topic signatures (i.e., words, multiword phrases, and ontological concepts), with respect to their effectiveness and efficiency for semantic smoothing. In general, it is more expensive to extract multiword phrases and ontological concepts than individual words, but semantic mapping based on multiword phrases and ontological concepts are more effective in handling data sparsity than on individual words.Ph.D., Information Science and Technology -- Drexel University, 200

    Variational Deep Semantic Hashing for Text Documents

    Full text link
    As the amount of textual data has been rapidly increasing over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent original data samples by compact binary codes through hashing. A spectrum of machine learning methods have been utilized, but they often lack expressiveness and flexibility in modeling to learn effective representations. The recent advances of deep learning in a wide range of applications has demonstrated its capability to learn robust and powerful feature representations for complex data. Especially, deep generative models naturally combine the expressiveness of probabilistic generative models with the high capacity of deep neural networks, which is very suitable for text modeling. However, little work has leveraged the recent progress in deep learning for text hashing. In this paper, we propose a series of novel deep document generative models for text hashing. The first proposed model is unsupervised while the second one is supervised by utilizing document labels/tags for hashing. The third model further considers document-specific factors that affect the generation of words. The probabilistic generative formulation of the proposed models provides a principled framework for model extension, uncertainty estimation, simulation, and interpretability. Based on variational inference and reparameterization, the proposed models can be interpreted as encoder-decoder deep neural networks and thus they are capable of learning complex nonlinear distributed representations of the original documents. We conduct a comprehensive set of experiments on four public testbeds. The experimental results have demonstrated the effectiveness of the proposed supervised learning models for text hashing.Comment: 11 pages, 4 figure
    corecore