    Query Expansion with Locally-Trained Word Embeddings

    Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus- and query-specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings.
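
    A minimal sketch of the local-embedding idea described above, assuming gensim's Word2Vec; retrieve_top_docs and tokenize are hypothetical stand-ins supplied by the host retrieval system, not functions from the paper:

```python
# Query expansion with locally-trained embeddings (sketch).
# Assumes gensim is installed; retrieve_top_docs() and tokenize()
# are hypothetical hooks into the surrounding retrieval system.
from gensim.models import Word2Vec

def expand_query(query_terms, retrieve_top_docs, tokenize,
                 k_docs=1000, k_terms=5):
    # 1. Pull a query-specific subcorpus, e.g. the top-ranked documents.
    local_corpus = [tokenize(doc) for doc in retrieve_top_docs(query_terms, k_docs)]
    # 2. Train embeddings locally on that subcorpus only.
    model = Word2Vec(sentences=local_corpus, vector_size=100,
                     window=5, min_count=2, epochs=10)
    # 3. Expand each query term with its nearest neighbours in the local space.
    expansion = []
    for term in query_terms:
        if term in model.wv:
            expansion += [w for w, _ in model.wv.most_similar(term, topn=k_terms)]
    return list(query_terms) + expansion
```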

    A Study Of Data Informatics: Data Analysis And Knowledge Discovery Via A Novel Data Mining Algorithm

    Frequent pattern mining (FPM) has become extremely popular among data mining researchers because it provides interesting and valuable patterns from large datasets. The decreasing cost of storage devices and the increasing availability of processing power make it possible for researchers to build and analyze gigantic datasets in various scientific and business domains. A filtering process is needed, however, to generate patterns that are relevant. This dissertation contributes to addressing this need. An experimental system named FPMIES (Frequent Pattern Mining Information Extraction System) was built to extract information from electronic documents automatically. Collocation analysis was used to analyze the relationships between words, and template mining was used to build the experimental system that forms the foundation of FPMIES. With the rising need for improved environmental performance, a dataset based on the green supply chain practices of three companies was used to test FPMIES. The new system was also tested by users, resulting in a recall of 83.4%. The new algorithm's combination of semantic relationships with template mining significantly improves the recall of FPMIES. The study's results also show that FPMIES is much more efficient than manual information extraction. Finally, the performance of FPMIES was compared with the most popular FPM algorithm, Apriori, yielding significantly improved recall and precision for FPMIES (76.7% and 74.6%, respectively) compared to Apriori (30% recall and 24.6% precision).
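
    The abstract benchmarks against Apriori; for reference, a compact sketch of classic Apriori frequent-itemset mining (FPMIES itself is not specified in the abstract, so it is not reproduced here):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Classic Apriori frequent-itemset mining (the baseline cited above)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Join step: build size-k candidates from frequent (k-1)-itemsets,
        # then prune candidates below the support threshold.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        frequent += current
        k += 1
    return frequent
```

    For example, with min_support=0.6 over three transactions, only itemsets occurring in at least two of them survive each pruning pass.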

    On Document Relevance and Lexical Cohesion between Query Terms

    Lexical cohesion is a property of text, achieved through lexical-semantic relations between words in text. Most information retrieval systems make use of lexical relations in text only to a limited extent. In this paper we empirically investigate whether the degree of lexical cohesion between the contexts of query terms' occurrences in a document is related to its relevance to the query. Lexical cohesion between distinct query terms in a document is estimated on the basis of the lexical-semantic relations (repetition, synonymy, hyponymy, and sibling) that exist between their collocates, i.e., the words that co-occur with them in the same windows of text. Experiments suggest that significant differences exist between lexical cohesion in relevant and non-relevant document sets. A document ranking method based on lexical cohesion shows some performance improvements.
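
    A minimal sketch of one way to estimate this, assuming cohesion is approximated by the overlap of two query terms' window collocates; this covers only the repetition relation (synonymy, hyponymy, and sibling relations would need a thesaurus such as WordNet):

```python
def window_collocates(tokens, term, window=5):
    """Words co-occurring with `term` within a fixed-size text window."""
    collocates = set()
    for i, tok in enumerate(tokens):
        if tok == term:
            collocates.update(tokens[max(0, i - window):i + window + 1])
    collocates.discard(term)
    return collocates

def cohesion(tokens, term_a, term_b, window=5):
    """Jaccard overlap of the two terms' collocate sets (repetition only)."""
    a = window_collocates(tokens, term_a, window)
    b = window_collocates(tokens, term_b, window)
    return len(a & b) / len(a | b) if a | b else 0.0
```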

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system, not by adding more parallel data but by using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed performance improvements in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
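
    A minimal sketch of the expansion step under stated assumptions: morph_variants is a hypothetical morphological lexicon (phrase -> variant forms), and difflib's SequenceMatcher stands in for the paper's morphosyntactic similarity score:

```python
from difflib import SequenceMatcher

def expand_phrase_table(phrase_table, morph_variants, sim_threshold=0.8):
    """Add morphological variants of existing phrase pairs.

    phrase_table: dict mapping (src_phrase, tgt_phrase) -> score
    morph_variants: hypothetical lexicon, phrase -> iterable of variant forms
    """
    new_entries = {}
    for (src, tgt), score in phrase_table.items():
        for variant in morph_variants.get(src, ()):
            # Keep variants whose surface form stays close to the original,
            # a crude stand-in for the paper's morphosyntactic similarity.
            sim = SequenceMatcher(None, src, variant).ratio()
            if sim >= sim_threshold and (variant, tgt) not in phrase_table:
                new_entries[(variant, tgt)] = score * sim  # penalise by similarity
    return new_entries
```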

    Measuring the Stability of Query Term Collocations and Using it in Document Ranking

    Delivering the right information to the user is fundamental in an information retrieval system. Many traditional information retrieval models assume word independence and view a document as a bag of words; however, getting the right information requires a deep understanding of the content of the document and the relationships that exist between words in the text. This study focuses on developing two new document ranking techniques, which are based on the lexical cohesive relationship of collocation. Collocation is a semantic relationship that exists between words that co-occur in the same lexical environment. Two types of collocation relationship have been considered: collocation in the same grammatical structure (such as a sentence), and collocation in the same semantic structure, where query terms occur in different sentences but co-occur with the same words. The first technique considers only the first type of collocation to calculate the document score: the positional frequency of query term co-occurrence is used to identify collocation relationships between query terms and to calculate query term weights. The second technique considers both types of collocation: the co-occurrence frequency distribution within a predefined window is used to determine query term collocations and to compute query term weights. Evaluation of the proposed techniques shows performance gains over the chosen baseline runs in some of the collections.
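
    A minimal sketch of the first technique as described, assuming sentences arrive as token lists and that the document score is a collocation-boosted term frequency (the abstract does not give the exact weighting formula):

```python
def collocation_weights(sentences, query_terms):
    """Weight query terms by how often they co-occur in the same sentence
    (the first, grammatical-structure type of collocation above)."""
    weights = {t: 0 for t in query_terms}
    for sent in sentences:               # each sentence is a list of tokens
        present = [t for t in query_terms if t in sent]
        if len(present) > 1:             # the terms collocate in this sentence
            for t in present:
                weights[t] += len(present) - 1
    return weights

def score_document(sentences, query_terms):
    w = collocation_weights(sentences, query_terms)
    # Illustrative score: term frequency boosted by collocation weight.
    tf = {t: sum(s.count(t) for s in sentences) for t in query_terms}
    return sum(tf[t] * (1 + w[t]) for t in query_terms)
```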

    Enhancing Information Retrieval Through Concept-Based Language Modeling and Semantic Smoothing.

    Traditionally, many information retrieval models assume that terms occur in documents independently. Although these models have already shown good performance, the word independency assumption seems to be unrealistic from a natural language point of view, which considers that terms are related to each other. Therefore, such an assumption leads to two well-known problems in information retrieval (IR), namely polysemy, or term mismatch, and synonymy. In language models, these issues have been addressed by considering dependencies such as bigrams, phrasal concepts, or word relationships, but such models are estimated using simple n-gram or concept counting. In this paper, we address polysemy and synonymy mismatch with a concept-based language modeling approach that combines ontological concepts from external resources with frequently found collocations from the document collection. In addition, the concept-based model is enriched with subconcepts and semantic relationships through a semantic smoothing technique so as to perform semantic matching. Experiments carried out on TREC collections show that our model achieves significant improvements over a single word-based model and the Markov Random Field model (using a Markov classifier).
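
    One plausible reading of the semantic smoothing step, as a sketch only: a Jelinek-Mercer-style mixture that adds a translation component from related concepts; the `related` mapping and the weights lam/mu are illustrative assumptions, not the paper's estimates:

```python
def smoothed_prob(term, doc_tf, doc_len, coll_tf, coll_len, related,
                  lam=0.7, mu=0.2):
    """Concept-smoothed query-likelihood sketch.

    related: hypothetical map, term -> {related concept/term: translation prob}
    Mixes direct document evidence, semantic translation from related
    concepts occurring in the document, and collection-level smoothing.
    """
    p_doc = doc_tf.get(term, 0) / doc_len
    # Semantic translation: probability mass reaching `term` via related
    # concepts (subconcepts, collocations, ...) seen in the document.
    p_sem = sum(p_trans * doc_tf.get(src, 0) / doc_len
                for src, p_trans in related.get(term, {}).items())
    p_coll = coll_tf.get(term, 0) / coll_len
    # Mixture weights sum to 1: lam + mu + (1 - lam - mu).
    return lam * p_doc + mu * p_sem + (1 - lam - mu) * p_coll
```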