
    Part of Speech Based Term Weighting for Information Retrieval

    Automatic language processing tools typically assign terms so-called weights that reflect each term's contribution to information content. Traditionally, term weights are computed from lexical statistics such as term frequencies. We propose a new type of term weight computed from part-of-speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the POS contexts in which it typically occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different baseline retrieval model (a Language Model with Dirichlet prior smoothing) and our best-performing POS-based term weight show consistent retrieval gains across the whole smoothing range of the baseline.
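    How an informativeness weight of this kind might be folded into a standard scoring function can be sketched as follows. This is a minimal illustration only: the `pos_weight` values below are hypothetical placeholders, not the paper's actual POS n-gram statistics, and the multiplicative integration into TF-IDF is one plausible reading of "integrating them into the matching model".

```python
import math

def tfidf_score(query, doc, docs, pos_weight):
    """Score `doc` for `query` with classic TF-IDF, scaled per term by an
    informativeness weight (a hypothetical stand-in for the POS-based weight)."""
    n_docs = len(docs)
    score = 0.0
    for term in query:
        tf = doc.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in docs if term in d)
        idf = math.log(n_docs / df)
        # fold the term weight multiplicatively into the matching model
        score += pos_weight.get(term, 1.0) * tf * idf
    return score

docs = [["neural", "retrieval", "model"],
        ["term", "weighting", "retrieval"],
        ["cooking", "recipes"]]
weights = {"retrieval": 1.5, "term": 1.2}  # hypothetical POS-based weights
score = tfidf_score(["term", "retrieval"], docs[1], docs, weights)
```

    A term judged more informative by its weight contributes proportionally more to the document score; setting all weights to 1.0 recovers plain TF-IDF.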

    Automatic Generation and Evaluation of a Stop Words List for Chinese Patents

    As an important preprocessing step in information retrieval and information processing, the accuracy of stop word elimination directly influences the ultimate result of retrieval and mining. In information retrieval, stop word elimination compresses the index storage space; in text mining, it greatly reduces the dimensionality of the vector space, saving storage and speeding up computation. However, Chinese patents are legal documents containing technical information, and general Chinese stop words lists are not applicable to them. This paper advances two methodologies for building stop words lists for Chinese patents: one based on word frequency and the other on statistics. In experiments on real patent data, the accuracy of the two methodologies is compared on corpora of different scales, as well as against a general stop words list. The results indicate that both methodologies can extract stop words suitable for Chinese patents, and that the statistics-based methodology is slightly more accurate than the frequency-based one.
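    The frequency-based idea can be sketched in a few lines: rank all corpus terms by raw frequency and take the top ones as stop word candidates. The toy corpus and cutoff below are illustrative assumptions, not the paper's data or threshold.

```python
from collections import Counter

def stopwords_by_frequency(corpus, top_n):
    """Return the `top_n` most frequent terms as stop word candidates
    (the frequency-based methodology; the cutoff is illustrative)."""
    counts = Counter(term for doc in corpus for term in doc)
    return [term for term, _ in counts.most_common(top_n)]

corpus = [["the", "claimed", "device", "comprises", "the", "sensor"],
          ["the", "method", "comprises", "a", "step"],
          ["a", "sensor", "and", "the", "device"]]
stops = stopwords_by_frequency(corpus, 2)
```

    For patent text, a statistics-based variant would additionally test whether a term's distribution is uniform across documents rather than merely frequent, which is where the two methodologies diverge.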

    Rewarding the Location of Terms in Sentences to Enhance Probabilistic Information Retrieval

    In most traditional retrieval models, the weight (or probability) of a query term is estimated from its own distribution or statistics. Intuitively, however, nouns are more important in information retrieval, and they occur more often near the beginning and the end of sentences. In this thesis, we investigate the effect on information retrieval of rewarding terms based on their location in sentences. In particular, we propose a kernel-based method to capture this term placement pattern, from which a novel Term Location retrieval model is derived and combined with the BM25 model to enhance probabilistic information retrieval. Experiments on five TREC datasets of varied size and content indicate that the proposed model significantly outperforms the optimized BM25 and DirichletLM in MAP over all datasets with all kernel functions, and outperforms them over most of the datasets in P@5 and P@20 with different kernel functions.
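    The kernel idea can be illustrated with a simple Gaussian bump centred on each sentence boundary, so that terms near the start or end of a sentence receive a larger reward than mid-sentence terms. The kernel shape and bandwidth here are assumptions for illustration, not the thesis's exact formulation.

```python
import math

def location_reward(pos, length, sigma=0.2):
    """Reward a term at 0-based position `pos` in a sentence of `length`
    tokens, using Gaussian kernels centred on the first and last positions
    (an illustrative choice of kernel)."""
    x = pos / max(length - 1, 1)          # normalise position to [0, 1]
    bump = lambda c: math.exp(-((x - c) ** 2) / (2 * sigma ** 2))
    return max(bump(0.0), bump(1.0))      # nearer either end => larger reward

# terms near the sentence boundaries are rewarded over mid-sentence ones
start, middle, end = (location_reward(p, 11) for p in (0, 5, 10))
```

    Such a reward could then be blended into the BM25 term weight, e.g. as a multiplicative factor, which is one natural way to read "derived in combination with the BM25 model".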

    Improving ranking for systematic reviews using query adaptation

    Identifying relevant studies for inclusion in systematic reviews requires significant effort from human experts, who manually screen large numbers of studies. The problem is made more difficult by the growing volume of medical literature, and Information Retrieval techniques have proved useful for reducing this workload. Reviewers are often interested in particular types of evidence, such as Diagnostic Test Accuracy studies. This paper explores the use of query adaptation to identify particular types of evidence and thereby reduce the workload placed on reviewers. A simple retrieval system that ranks studies using TF.IDF weighted cosine similarity was implemented. The Log-Likelihood, Chi-Squared and Odds-Ratio lexical statistics, together with relevance feedback, were used to generate sets of terms that indicate evidence relevant to Diagnostic Test Accuracy reviews. Experiments using a set of 80 systematic reviews from the CLEF 2017 and CLEF 2018 eHealth tasks demonstrate that the approach improves retrieval performance.
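    The baseline ranker described, TF.IDF weighted cosine similarity, can be sketched as below. The toy documents and query are assumptions for illustration; the paper's system additionally expands the query with the lexically derived term sets.

```python
import math
from collections import Counter

def tfidf_vector(doc, idf):
    return {t: tf * idf.get(t, 0.0) for t, tf in Counter(doc).items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Rank documents by TF.IDF weighted cosine similarity to the query."""
    n = len(docs)
    vocab = {t for d in docs for t in d}
    idf = {t: math.log(n / sum(1 for d in docs if t in d)) for t in vocab}
    qv = tfidf_vector(query, idf)
    scored = [(cosine(qv, tfidf_vector(d, idf)), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, reverse=True)]

docs = [["diagnostic", "test", "accuracy", "study"],
        ["randomised", "controlled", "trial"],
        ["diagnostic", "imaging", "methods", "review"]]
order = rank(["diagnostic", "accuracy"], docs)
```

    Query adaptation then amounts to appending the indicator terms (selected by Log-Likelihood, Chi-Squared or Odds-Ratio) to the query before ranking.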

    Bias-variance analysis in estimating true query model for information retrieval

    The estimation of the query model is an important task in language modeling (LM) approaches to information retrieval (IR). The ideal estimation is expected to be not only effective, in terms of high mean retrieval performance over all queries, but also stable, in terms of low variance of retrieval performance across different queries. In practice, however, improving effectiveness can sacrifice stability, and vice versa. In this paper, we propose to study this tradeoff from a new perspective, namely the bias-variance tradeoff, a fundamental concept in statistics. We formulate the notion of bias and variance with respect to retrieval performance and the estimation quality of query models. We then investigate several estimated query models, analyzing when and why the bias-variance tradeoff occurs and how the bias and variance can be reduced simultaneously. A series of experiments on four TREC collections has been conducted to systematically evaluate our bias-variance analysis. Our approach and results can potentially form an analysis framework and a novel evaluation strategy for query language modeling.
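    The effectiveness/stability distinction can be made concrete with a toy decomposition over per-query average precision scores: bias as the shortfall of the mean from an ideal score, variance as the spread across queries. This is a simplified illustration under assumed definitions, not the paper's exact formulation.

```python
def bias_variance(per_query_ap, ideal=1.0):
    """Decompose retrieval performance across queries into a squared-bias
    term (shortfall of mean performance from an assumed ideal) and a
    variance term (instability across queries). Illustrative only."""
    n = len(per_query_ap)
    mean = sum(per_query_ap) / n
    bias_sq = (ideal - mean) ** 2
    variance = sum((x - mean) ** 2 for x in per_query_ap) / n
    return bias_sq, variance

# two query models with equal mean AP: the second is more stable
b1, v1 = bias_variance([0.9, 0.1, 0.5, 0.5])
b2, v2 = bias_variance([0.5, 0.5, 0.5, 0.5])
```

    Both models are equally "effective" by mean AP, yet only the second is stable; the paper's point is that a good query model estimator should reduce both terms at once.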

    Kannada and Telugu Native Languages to English Cross Language Information Retrieval

    One of the crucial challenges in cross-lingual information retrieval is the retrieval of relevant information for a query expressed in a native language. While retrieving relevant documents is slightly easier, analysing the relevance of the retrieved documents and presenting the results to users are non-trivial tasks. To accomplish these tasks, we present our Kannada-English and Telugu-English CLIR systems, developed as part of the Ad-Hoc Bilingual task. We take a query-translation-based approach using bilingual dictionaries. When a query word is not found in the dictionary, it is transliterated using a simple rule-based approach that utilizes the corpus to return the 'k' closest English transliterations of the given Kannada/Telugu word. The resulting multiple translation/transliteration choices for each query word are disambiguated using an iterative PageRank-style algorithm which, based on term-term co-occurrence statistics, produces the final translated query. Finally, we conduct experiments on these translated queries using a Kannada/Telugu document collection and a set of English queries, report the performance achieved for each task, and give a statistical analysis of the results.
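    The iterative disambiguation step can be sketched as follows: each candidate translation is repeatedly re-scored by how strongly it co-occurs with the other query words' current best candidates, PageRank-style, until one candidate per word dominates. The candidate sets and co-occurrence counts below are invented for illustration; the authors' exact update rule may differ.

```python
def disambiguate(candidates, cooc, iters=20):
    """Pick one translation per source word by iteratively reinforcing
    candidates that co-occur with other words' strong candidates
    (a PageRank-style scheme over term-term co-occurrence statistics)."""
    score = {w: {c: 1.0 / len(cs) for c in cs} for w, cs in candidates.items()}
    for _ in range(iters):
        new = {}
        for w, cs in candidates.items():
            new[w] = {}
            for c in cs:
                # support from every candidate of every *other* source word
                new[w][c] = sum(score[w2][c2] * cooc.get((c, c2), 0)
                                for w2, cs2 in candidates.items() if w2 != w
                                for c2 in cs2)
            total = sum(new[w].values()) or 1.0
            new[w] = {c: s / total for c, s in new[w].items()}
        score = new
    return {w: max(cs, key=score[w].get) for w, cs in candidates.items()}

candidates = {"w1": ["river", "bank"], "w2": ["water", "money"]}
cooc = {("river", "water"): 5, ("water", "river"): 5,
        ("bank", "money"): 1, ("money", "bank"): 1}
best = disambiguate(candidates, cooc)
```

    Mutually reinforcing candidate pairs ("river"/"water" here) win out, yielding a coherent translated query rather than per-word independent choices.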

    Zerber+R: Top-k Retrieval from a Confidential Index

    Zerr, S., Olmedilla, D., Nejdl, W., & Siberski, W. (2009). Zerber+R: Top-k Retrieval from a Confidential Index. Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology (pp. 439-449). March 24-26, 2009, Saint Petersburg, Russia (ISBN: 978-1-60558-422-5).
    Privacy-preserving document exchange among collaboration groups within an enterprise, as well as across enterprises, requires techniques for sharing and searching access-controlled information through largely untrusted servers. In these settings, search systems need to provide confidentiality guarantees for shared information while offering IR properties comparable to those of ordinary search engines. Top-k is a standard IR technique which enables fast query execution on very large indexes and makes systems highly scalable. However, indexing access-controlled information for top-k retrieval is a challenging task due to the sensitivity of the term statistics used for ranking. In this paper we present Zerber+R, a ranking model which allows for privacy-preserving top-k retrieval from an outsourced inverted index. We propose a relevance score transformation function which makes the relevance scores of different terms indistinguishable, such that even when stored on an untrusted server they do not reveal information about the indexed data. Experiments on two real-world data sets show that Zerber+R makes economical use of bandwidth and offers retrieval properties comparable with an ordinary inverted index.
    The work on this publication has been sponsored by the TENCompetence Integrated Project, funded by the European Commission's 6th Framework Programme, priority IST/Technology Enhanced Learning, Contract 027087 (http://www.tencompetence.org).
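    For readers unfamiliar with the top-k machinery the paper builds on, it can be sketched as streaming a posting list while keeping only the k best-scoring documents in a bounded min-heap. This shows only the generic top-k step, not Zerber+R's confidentiality-preserving score transformation.

```python
import heapq

def top_k(posting_scores, k):
    """Standard top-k retrieval: keep the k best (score, doc_id) pairs while
    streaming (doc_id, score) entries from an inverted-index posting list."""
    heap = []  # min-heap holding the k best entries seen so far
    for doc_id, score in posting_scores:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

results = top_k([("d1", 0.3), ("d2", 0.9), ("d3", 0.5), ("d4", 0.1)], 2)
```

    The challenge the paper addresses is that the scores driving this pruning leak term statistics; its transformation makes the stored scores indistinguishable across terms while keeping the top-k ordering usable.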

    Using Search Term Positions for Determining Document Relevance

    The technological advancements in computer networks and the substantial reduction of their production costs have caused a massive explosion of digitally stored information. In particular, textual information is becoming increasingly available in electronic form. Finding text documents dealing with a certain topic is not a simple task, and users need tools to sift through non-relevant information and retrieve only the pieces relevant to their needs. The traditional methods of information retrieval (IR) based on search term frequency have somewhat reached their limitations, and novel ranking methods based on hyperlink information are not applicable to unlinked documents. Retrieving documents based on the positions of search terms in a document has the potential to yield improvements, because the other terms in the environment where a search term appears (i.e. its neighborhood) are considered: the grammatical type, position and frequency of these other words help to clarify and specify the meaning of a given search term. However, the required additional analysis makes position-based methods slower than methods based on term frequency, and storing term positions requires additional space. These drawbacks directly affect the most user-critical phase of the retrieval process, namely query evaluation time, which explains the scarce use of positional information in contemporary retrieval systems. This thesis explores the possibility of extending traditional information retrieval systems with positional information in an efficient manner that permits optimizing retrieval performance by handling term positions at query evaluation time. To achieve this, several abstract representations of term positions that efficiently store and operate on positional data are investigated.
In the Gauss model, descriptive-statistics methods are used to estimate term positional information, because they minimize outliers and irregularities in the data. The Fourier model uses Fourier series to represent positional information. In the Hilbert model, functional-analysis methods provide reliable term position estimations and simple mathematical operators for handling positional data. The proposed models are experimentally evaluated using standard resources of the IR research community (Text REtrieval Conference). All experiments demonstrate that the use of positional information can enhance the quality of search results, and the suggested models outperform state-of-the-art retrieval utilities. The term position models open new possibilities for analyzing and handling textual data; for instance, document clustering and compression of positional data based on these models could be interesting topics for future research.
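    The Gauss model's core idea, summarising where a term tends to occur instead of storing every position, can be sketched by fitting a normal distribution to a term's normalised positions. The estimators and the example positions below are illustrative assumptions; the thesis's exact formulation may differ.

```python
import math

def gauss_position_model(positions, doc_length):
    """Summarise a term's occurrences in a document by the mean and standard
    deviation of its normalised positions (sketch of the Gauss model idea)."""
    xs = [p / doc_length for p in positions]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, math.sqrt(var)

def density(x, mean, std):
    """Estimated likelihood of the term appearing near normalised position x."""
    if std == 0:
        return 1.0 if x == mean else 0.0
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# a term clustered near the start of a 1000-token document
mean, std = gauss_position_model([10, 40, 70, 100], 1000)
```

    Two summary numbers per term replace an arbitrarily long position list, which is exactly the storage/query-time trade-off the thesis targets; proximity queries can then be answered from the fitted densities instead of raw positions.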