481 research outputs found

    Part of Speech Based Term Weighting for Information Retrieval

    Automatic language processing tools typically assign terms so-called weights that reflect each term's contribution to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the POS contexts in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different retrieval model as baseline (a language model with Dirichlet prior smoothing) and our best performing POS-based term weight show retrieval gains consistently across the whole smoothing range of the baseline.
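
    For readers unfamiliar with the language-model baseline named in this abstract, the following is a minimal sketch of query-likelihood scoring with Dirichlet prior smoothing. It illustrates only the standard baseline, not the POS-based weights, whose computation the abstract does not specify. The function name, the toy documents and the default `mu=2000.0` are assumptions made for illustration; `mu` is the smoothing parameter whose range the abstract refers to.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood of `doc` under Dirichlet prior smoothing.

    p(w | d) = (tf(w, d) + mu * p(w | C)) / (|d| + mu)
    """
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # term never seen in the collection: skip it
        p = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p)
    return score


# Toy collection (hypothetical data, for illustration only).
docs = [["pos", "term", "weight"], ["speech", "retrieval", "survey"]]
collection_tf = Counter(t for d in docs for t in d)
collection_len = sum(collection_tf.values())
scores = [
    dirichlet_score(["term", "weight"], d, collection_tf, collection_len)
    for d in docs
]
```

    A larger `mu` pulls every document's estimate toward the collection model, so sweeping `mu` is one way to cover a smoothing range like the one the abstract mentions.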

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.

    Finding Academic Experts on a MultiSensor Approach using Shannon's Entropy

    Expert finding is an information retrieval task concerned with the search for the most knowledgeable people on some topic, based on documents describing people's activities. The task involves taking a user query as input and returning a list of people sorted by their level of expertise regarding the user query. This paper introduces a novel approach for combining multiple estimators of expertise based on a multisensor data fusion framework together with the Dempster-Shafer theory of evidence and Shannon's entropy. More specifically, we defined three sensors which detect heterogeneous information derived from the textual contents, from the graph structure of the citation patterns for the community of experts, and from profile information about the academic experts. Given the evidence collected, each sensor may identify different candidates as experts, so the sensors may not agree on a final ranking decision. To deal with these conflicts, we applied the Dempster-Shafer theory of evidence combined with Shannon's entropy formula to fuse this information and produce a more accurate and reliable final ranking list. Experiments on two datasets of academic publications from the Computer Science domain attest to the adequacy of the proposed approach over traditional state-of-the-art approaches. We also ran experiments against representative supervised state-of-the-art algorithms. Results revealed that the proposed method achieved performance similar to that of these supervised techniques, confirming the capabilities of the proposed framework.
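
    The two building blocks named in this abstract can be sketched as follows: Dempster's rule of combination for fusing two bodies of evidence, and Shannon's entropy. This is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not say how entropy enters the fusion, so the two functions are shown separately; the candidate names and mass values are hypothetical.

```python
import math
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    return {s: w / norm for s, w in combined.items()}


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical masses from two sensors over the candidates {alice, bob}.
A, B = frozenset({"alice"}), frozenset({"bob"})
AB = A | B  # mass on the whole frame expresses a sensor's ignorance
m_text = {A: 0.6, B: 0.1, AB: 0.3}
m_citations = {A: 0.5, B: 0.3, AB: 0.2}
fused = dempster_combine(m_text, m_citations)
```

    Mass assigned to the whole frame `AB` lets a sensor express uncertainty instead of forcing a choice, which is what allows disagreeing sensors to be reconciled after the conflicting mass is normalized away.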

    Relating Dependent Terms in Information Retrieval

    Search engines have become an integral part of our lives. More than one-third of the world's population uses the Internet. Most users turn to a search engine as a quick way to find the information or products they want. Information retrieval (IR) is the foundation of modern search engines. Traditional IR approaches assume that indexing terms are independent. However, terms occurring in the same context are often dependent. Failing to recognize the dependencies between terms leads to noise (irrelevant documents) in the results. Some studies have proposed integrating term dependencies of different types, such as proximity, co-occurrence, adjacency and grammatical dependency. In most cases, dependency models are constructed separately and then combined with the traditional word-based (unigram) model using a fixed importance proportion. Consequently, they cannot properly capture variable term dependency and its strength. For example, the dependency between the adjacent words "Black Friday" is more important to consider than that between "road constructions". In this thesis, we study different approaches to capturing term relationships and their dependency strengths. We propose the following methods for monolingual IR and cross-language IR (CLIR). First, we re-examine the combination approach by using different indexing units for Chinese monolingual IR, then propose a similar method for CLIR between English and Chinese. In addition to the traditional method based on words, we investigate the possibility of using Chinese bigrams and unigrams as translation units. Several translation models from English words to Chinese unigrams, bigrams and words are built from a parallel corpus. An English query is then translated in several ways, each producing a ranking score. The final ranking score combines all these types of translation. Second, we incorporate dependencies between terms in our model using the Dempster-Shafer theory of evidence. Every occurrence of a text fragment (of several words) in a document is represented as a set that includes all its constituent terms. Probability is assigned to such a set of terms rather than to each individual term. During query evaluation, the probability of the set can be transferred to the related query terms, allowing us to integrate language-dependent relations into IR. Third, we propose a discriminative language model that integrates different term dependencies according to their strength and usefulness to IR. We consider the dependency of adjacency and co-occurrence within different distances, i.e., bigrams and pairs of terms within text windows of size 2, 4, 8 and 16. The weight of a bigram or a pair of dependent terms in the final model is learnt from a set of features using SVM regression. All the proposed methods are evaluated on several English and/or Chinese collections, and experimental results show that these methods achieve substantial improvements over state-of-the-art baselines.
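
    The dependency units described in this abstract (adjacent bigrams and term pairs co-occurring within a text window) can be illustrated with a short sketch. The function name and the exact window convention are assumptions: here a window of size `w` means the two terms lie within a span of `w` consecutive tokens, so `w = 2` yields adjacent terms only. The thesis may define windows differently, and the learned SVM weights are not modeled.

```python
def cooccurring_pairs(tokens, window):
    """Return unordered term pairs that fall within a span of `window` tokens."""
    pairs = []
    for i, left in enumerate(tokens):
        # Partner positions i+1 .. i+window-1, clipped to the text length.
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append(tuple(sorted((left, tokens[j]))))
    return pairs


tokens = ["black", "friday", "road", "constructions"]

# Ordered adjacent bigrams.
bigrams = list(zip(tokens, tokens[1:]))

# Unordered co-occurrence pairs for each window size named in the abstract.
pairs_by_window = {w: cooccurring_pairs(tokens, w) for w in (2, 4, 8, 16)}
```

    Each extracted unit would then receive its own weight in the retrieval model, rather than a single fixed weight shared by every dependency of the same type.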

    Modeling variable dependencies between characters in Chinese information retrieval

    Chinese IR can work on words and/or character n-grams. In previous studies, when several types of index are used, independence is usually assumed between them, which is clearly not true in reality. In this paper, we propose a model for Chinese IR that integrates different types of dependency between Chinese characters. The role of a pair of dependent characters in the matching process is variable, depending on the pair's ability to describe the underlying meaning and to retrieve relevant documents. The weight of the pair is learnt using SVM. Our experiments on TREC and NTCIR Chinese collections show that our model can significantly outperform most existing approaches. The results confirm the necessity of integrating dependent pairs of characters in Chinese IR and of using them according to their possible contribution to IR.
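
    As a small illustration of the character n-gram indexing units mentioned in this abstract, the sketch below extracts overlapping character unigrams and bigrams from a Chinese string. The function name and the sample string are assumptions; word segmentation and the learned pair weights are outside its scope.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sample = "信息检索"  # "information retrieval"
unigrams = char_ngrams(sample, 1)
bigrams = char_ngrams(sample, 2)
```

    Because the bigrams overlap, adjacent index units share characters, which is one concrete reason the independence assumption between index types does not hold.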