2,554 research outputs found

    Relevance-based Word Embedding

    Learning a high-dimensional dense representation for vocabulary terms, also known as a word embedding, has recently attracted much attention in natural language processing and information retrieval tasks. The embedding vectors are typically learned based on term proximity in a large corpus. This means that the objective in well-known word embedding algorithms, e.g., word2vec, is to accurately predict adjacent word(s) for a given word or context. However, this objective is not necessarily equivalent to the goal of many information retrieval (IR) tasks. The primary objective in various IR tasks is to capture relevance instead of term proximity, syntactic, or even semantic similarity. This is the motivation for developing unsupervised relevance-based word embedding models that learn word representations based on query-document relevance information. In this paper, we propose two learning models with different objective functions; one learns a relevance distribution over the vocabulary set for each query, and the other classifies each term as belonging to the relevant or non-relevant class for each query. To train our models, we used over six million unique queries and the top ranked documents retrieved in response to each query, which are assumed to be relevant to the query. We extrinsically evaluate our learned word representation models using two IR tasks: query expansion and query classification. Both query expansion experiments on four TREC collections and query classification experiments on the KDD Cup 2005 dataset suggest that the relevance-based word embedding models significantly outperform state-of-the-art proximity-based embedding models, such as word2vec and GloVe. Comment: to appear in the proceedings of The 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17)

    Rocchio's Model Based on Vector Space Basis Change for Pseudo Relevance Feedback

    Rocchio's relevance feedback model is a classic query expansion method that has been shown to be effective in boosting information retrieval performance. The main problem with this method is that the relevant and the irrelevant documents overlap in the vector space because they often share the same terms (at least the terms of the query). With respect to the initial vector space basis (index terms), it is difficult to select terms that separate relevant and irrelevant documents. The Vector Space Basis Change is used to separate relevant and irrelevant documents without any modification of the query term weights. In this paper, first, we study how to incorporate Vector Space Basis Change into Rocchio's model. Second, we propose Rocchio's models based on Vector Space Basis Change, called VSBCRoc models. Experimental results on a TREC collection show that our proposed models are effective.

    TopSig: Topology Preserving Document Signatures

    Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but so far these have not been linked to general-purpose, signature-file-based search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature-file-based indexing and retrieval, performance that is comparable to that of state-of-the-art inverted-file-based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings, and from the theoretical perspective they position the file signatures model in the class of Vector Space retrieval models. Comment: 12 pages, 8 figures, CIKM 201

    Integrating Medical Ontology and Pseudo Relevance Feedback For Medical Document Retrieval

    The purpose of this thesis is to improve the accuracy of locating relevant documents in a large amount of Electronic Medical Data (EMD). The goal of this research is to propose a new way of using medical ontology that gives patients an easier and more reliable way to understand their diseases and helps doctors find and further improve possible methods of diagnosis and treatment. The empirical studies were based on the health care dataset provided by CLEF. In this research, I used Information Retrieval to find relevant information within the large datasets provided by CLEF. I then used the ranking functionality of the Terrier platform to score and evaluate the matching documents in the collection. BM25 was used as the base normalization method to retrieve the results, and a Pseudo Relevance Feedback weighting model was used to retrieve information about patients' health history and medical records in order to find more accurate results. I then used the Unified Medical Language System (UMLS) to index the queries while searching the Internet for health-related documents. The UMLS software was used to link the computer system with health and biomedical terms, vocabularies, and classification tools; it works as a dictionary for patients by translating medical terms. In future work, I would like to use medical ontology to relate the documents containing the medical data to my retrieved results.

    Improving understandability in consumer health information search: Uevora @ 2016 FIRE CHIS

    This paper presents our work at 2016 FIRE CHIS. Given a CHIS query and a document associated with that query, the task is to classify the sentences in the document as relevant to the query or not, and further to classify the relevant sentences as supporting, neutral, or opposing with respect to the claim made in the query. In this paper, we present two different approaches to the classification. In the first approach, we implement two models to carry out the task. We first implement an information retrieval model to retrieve the sentences that are relevant to the query, and then we use a supervised learning method to train a classification model that classifies the relevant sentences as support, oppose, or neutral. In the second approach, we use only machine learning techniques to learn a model that classifies the sentences into four classes (relevant & support, relevant & neutral, relevant & oppose, irrelevant & neutral). Our submission for CHIS uses the first approach. Erasmus Mundus LEADER project