14 research outputs found

    Contrastive Approach towards Text Source Classification based on Top-Bag-Word Similarity

    Get PDF
    PACLIC / The University of the Philippines Visayas Cebu College, Cebu City, Philippines / November 20-22, 200

    Two Level Disambiguation Model for Query Translation

    Get PDF
    Selecting the most suitable translation among all the translation candidates returned by a bilingual dictionary has always been quite a challenging task in cross-language query translation. Researchers have frequently tried to use word co-occurrence statistics to determine the most probable translation of a user query. Algorithms using such statistics have certain shortcomings, which are the focus of this paper. We propose a novel method for ambiguity resolution, named the 'two-level disambiguation model'. At the first level of disambiguation, the model weighs the importance of the translation alternatives of query terms obtained from the dictionary. The importance factor measures the probability of a translation candidate being selected as the final translation of a query term; this removes the problem of making a binary decision for each translation candidate. At the second level of disambiguation, the model treats the user query as a single concept and deduces the translations of all query terms simultaneously, also taking the weights of the translation alternatives into account. This is contrary to previous research, which selects the translation of each word in the source-language query independently. Experimental results with English-Hindi cross-language information retrieval show that the proposed two-level disambiguation model achieved 79.53% and 83.50% of monolingual translation and 21.11% and 17.36% improvement over greedy disambiguation strategies in terms of MAP for short and long queries respectively.
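    The abstract describes the two levels only in prose; the following is a minimal, hypothetical Python sketch of how such a two-level scheme could look, not the paper's actual algorithm. The helper `cooccurrence(a, b)` (assumed to return a corpus-based association score between two target-language terms) and the additive scoring are illustrative assumptions.

```python
from itertools import product

def first_level_weights(candidates, cooccurrence):
    """Level 1: give every dictionary translation of a query term an
    importance weight instead of a binary keep/drop decision."""
    weights = {}
    for term, translations in candidates.items():
        # score each candidate against the candidates of the other query terms
        others = [u for o, ts in candidates.items() if o != term for u in ts]
        scores = [sum(cooccurrence(t, u) for u in others) for t in translations]
        total = sum(scores) or 1.0
        weights[term] = {t: s / total for t, s in zip(translations, scores)}
    return weights

def second_level_selection(candidates, weights, cooccurrence):
    """Level 2: treat the query as one concept and choose translations for
    all terms jointly rather than term by term."""
    terms = list(candidates)
    best, best_score = None, float("-inf")
    for combo in product(*(candidates[t] for t in terms)):
        score = sum(weights[t][c] for t, c in zip(terms, combo))
        score += sum(cooccurrence(a, b)
                     for i, a in enumerate(combo) for b in combo[i + 1:])
        if score > best_score:
            best, best_score = dict(zip(terms, combo)), score
    return best
```

    For example, `candidates` could map each source query term to its dictionary translations, e.g. `{"bank": ["kinara", "bank"], "river": ["nadi"]}`, and the second level would return one translation per term.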

    Applying digital content management to support localisation

    Get PDF
    The retrieval and presentation of digital content such as that on the World Wide Web (WWW) is a substantial area of research. While recent years have seen a huge expansion in the size of web-based archives that can be searched efficiently by commercial search engines, the presentation of potentially relevant content is still limited to ranked document lists represented by simple text snippets or image keyframe surrogates. There is growing interest in techniques to personalise the presentation of content to improve the richness and effectiveness of the user experience. One of the most significant challenges to achieving this is the increasingly multilingual nature of this data, and the need to provide suitably localised responses to users based on this content. The Digital Content Management (DCM) track of the Centre for Next Generation Localisation (CNGL) seeks to develop technologies to support advanced personalised access and presentation of information by combining elements from the existing research areas of Adaptive Hypermedia and Information Retrieval. The combination of these technologies is intended to produce significant improvements in the way users access information. We review key features of these technologies and introduce early ideas for how they can support localisation and localised content, before concluding with some impressions of future directions in DCM.

    Collocation translation acquisition using monolingual corpora

    Full text link

    Resolving query translation ambiguity using a decaying co-occurrence model and syntactic dependence relations. In: SIGIR

    No full text
    Bilingual dictionaries have been commonly used for query translation in cross-language information retrieval (CLIR). However, we are faced with the problem of translation selection. Several recent studies suggested the use of term co-occurrences for this selection. This paper presents two extensions to improve them. First, we extend the basic co-occurrence model by adding a decaying factor that decreases the mutual information as the distance between the terms increases. Second, we incorporate a triple translation model, in which syntactic dependence relations (represented as triples) are integrated. Our evaluation of translation accuracy shows that translating triples as units is more precise than word-by-word translation. Our CLIR experiments show that the addition of the decaying factor leads to substantial improvements over the basic co-occurrence model, and that the triple translation model brings further improvements.
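    As a rough illustration of the decaying factor described above, the sketch below accumulates co-occurrence evidence for a term pair while discounting each observation by the distance between the two terms. The exponential decay form, the `alpha` constant and the window size are assumptions made for illustration; the paper's exact formulation (which modifies mutual information) is not reproduced here.

```python
import math

def decayed_cooccurrence(sentences, term_a, term_b, window=10, alpha=0.5):
    """Sum co-occurrence evidence for (term_a, term_b), giving less weight
    to occurrences that are far apart in the text."""
    score = 0.0
    for tokens in sentences:                      # each sentence is a token list
        pos_a = [i for i, w in enumerate(tokens) if w == term_a]
        pos_b = [i for i, w in enumerate(tokens) if w == term_b]
        for i in pos_a:
            for j in pos_b:
                d = abs(i - j)
                if 0 < d <= window:
                    # adjacent pairs count fully; distant pairs are discounted
                    score += math.exp(-alpha * (d - 1))
    return score
```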

    Relating Dependent Terms in Information Retrieval

    Get PDF
    Search engines have become an integral part of our lives: more than one-third of the world's population uses the Internet, and most users turn to a search engine as the quickest way to find the information or products they want. Information retrieval (IR) is the foundation of modern search engines. Traditional IR approaches assume that indexing terms are independent. However, terms occurring in the same context are often dependent, and failing to recognise these dependencies introduces noise (irrelevant documents) into the results. Some studies have proposed to integrate term dependencies of different types, such as proximity, co-occurrence, adjacency and grammatical dependency. In most cases, the dependency models are constructed separately and then combined with the traditional word-based (unigram) model with a fixed importance weight. Consequently, they cannot properly capture variable term dependencies and their strengths. For example, the dependency between the adjacent words "Black Friday" is more important to capture than that between "road constructions". In this thesis, we study different approaches to capturing term relationships and their dependency strengths, and propose the following methods for monolingual IR and cross-language IR (CLIR). First, we re-examine the combination approach using different indexing units for Chinese monolingual IR and then propose a similar method for CLIR. In addition to the traditional word-based method, we investigate the use of Chinese bigrams and unigrams as translation units: several translation models from English words to Chinese unigrams, bigrams and words are built from a parallel corpus, an English query is translated in each of these ways, each translation produces a ranking score, and the final ranking score combines all these types of translation. Second, we incorporate dependencies between terms using the Dempster-Shafer theory of evidence. Each occurrence of a text fragment in a document is represented as a set containing all of its constituent terms, and probability mass is assigned to that set rather than to individual terms. During query evaluation, this mass is redistributed to the matching query terms, allowing us to integrate term-dependency relations into IR. Third, we propose a discriminative language model that integrates different term dependencies according to their strength and usefulness to IR. We consider adjacency and co-occurrence within different distances, i.e. bigrams and pairs of terms within text windows of 2, 4, 8 and 16 words; the weight of a bigram or a pair of dependent terms in the final model is learned from a set of features using SVM regression. All the proposed methods are evaluated on several English and/or Chinese collections, and experimental results show that they achieve substantial improvements over state-of-the-art baselines.
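    To make the dependency units mentioned in this abstract concrete, the sketch below extracts adjacency (bigram) units and term pairs co-occurring within windows of 2, 4, 8 and 16 words from a tokenised document. It is only an illustration of the extraction step under assumed data structures; how each unit is then weighted by SVM regression over features, and how the result is combined with the unigram model, is not reproduced here.

```python
from collections import Counter

def extract_dependency_units(tokens, windows=(2, 4, 8, 16)):
    """Count adjacency (bigram) and windowed co-occurrence units for one
    tokenised document; each counter key identifies the unit type."""
    units = Counter()
    for i, w in enumerate(tokens):
        if i + 1 < len(tokens):
            units[("bigram", w, tokens[i + 1])] += 1           # adjacent pair
        for size in windows:
            for j in range(i + 1, min(i + size, len(tokens))):
                units[("window", size, w, tokens[j])] += 1     # pair within window
    return units
```

    Each extracted unit would then receive a learned weight before contributing to the final ranking model alongside the unigram scores.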

    CLIR teknikak baliabide urriko hizkuntzetarako

    Get PDF
    152 p. When developing a cross-language information retrieval system, query translation is the most widely used approach to overcoming the language barrier. The most successful query-translation strategies rely on machine translation systems or parallel corpora, but these resources are scarce for low-resource languages. In such settings, a query-translation strategy based on more readily available resources would be more appropriate. In this thesis we aim to show that these main resources can be a bilingual dictionary, complemented by comparable corpora and query sessions.