114 research outputs found

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Full text link
    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system, where we combine a query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, the technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords based on its special phonogram. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which corresponds words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance

    Knowledge Management and Cultural Heritage Repositories. Cross-Lingual Information Retrieval Strategies

    Get PDF
    In the last years important initiatives, like the development of the European Library and Europeana, aim to increase the availability of cultural content from various types of providers and institutions. The accessibility to these resources requires the development of environments which allow both to manage multilingual complexity and to preserve the semantic interoperability. The creation of Natural Language Processing (NLP) applications is finalized to the achievement of CrossLingual Information Retrieval (CLIR). This paper presents an ongoing research on language processing based on the LexiconGrammar (LG) approach with the goal of improving knowledge management in the Cultural Heritage repositories. The proposed framework aims to guarantee interoperability between multilingual systems in order to overcome crucial issues like cross-language and cross-collection retrieval. Indeed, the LG methodology tries to overcome the shortcomings of statistical approaches as in Google Translate or Bing by Microsoft concerning Multi-Word Unit (MWU) processing in queries, where the lack of linguistic context represents a serious obstacle to disambiguation. In particular, translations concerning specific domains, as it is has been widely recognized, is unambiguous since the meanings of terms are mono-referential and the type of relation that links a given term to its equivalent in a foreign language is biunivocal, i.e. a one-to-one coupling which causes this relation to be exclusive and reversible. Ontologies are used in CLIR and are considered by several scholars a promising research area to improve the effectiveness of Information Extraction (IE) techniques particularly for technical-domain queries. Therefore, we present a methodological framework which allows to map both the data and the metadata among the language-specific ont

    Mixed-Language Arabic- English Information Retrieval

    Get PDF
    Includes abstract.Includes bibliographical references.This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve most relevant documents, regardless of their languages. To achieve this goal, however, it is essential firstly to suppress the impact of most problems that are caused by the mixed-language feature in both queries and documents and which result in biasing the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of languages, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in non-English language) would likely overweight and skew the impact of those technical terms (mostly those in English) due to high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). Such phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries

    Hybrid and Interactive Domain-Specific Translation for Multilingual Access to Digital Libraries

    Get PDF
    Accurate high-coverage translation is a vital component of reliable cross language information retrieval (CLIR) systems. This is particularly true for retrieval from archives such as Digital Libraries which are often specific to certain domains. While general machine translation (MT) has been shown to be effective for CLIR tasks in laboratory information retrieval evaluation tasks, it is generally not well suited to specialized situations where domain-specific translations are required. We demonstrate that effective query translation in the domain of cultural heritage (CH) can be achieved using a hybrid translation method which augments a standard MT system with domain-specific phrase dictionaries automatically mined from Wikipedia. We further describe the use of these components in a domain-specific interactive query translation service. The interactive system selects the hybrid translation by default, with other possible translations being offered to the user interactively to enable them to select alternative or additional translation(s). The objective of this interactive service is to provide user control of translation while maximising translation accuracy and minimizing the translation effort of the user. Experiments using our hybrid translation system with sample query logs from users of CH websites demonstrate a large improvement in the accuracy of domain-specific phrase detection and translation

    Foundation, Implementation and Evaluation of the MorphoSaurus System: Subword Indexing, Lexical Learning and Word Sense Disambiguation for Medical Cross-Language Information Retrieval

    Get PDF
    Im medizinischen Alltag, zu welchem viel Dokumentations- und Recherchearbeit gehört, ist mittlerweile der ĂŒberwiegende Teil textuell kodierter Information elektronisch verfĂŒgbar. Hiermit kommt der Entwicklung leistungsfĂ€higer Methoden zur effizienten Recherche eine vorrangige Bedeutung zu. Bewertet man die NĂŒtzlichkeit gĂ€ngiger Textretrievalsysteme aus dem Blickwinkel der medizinischen Fachsprache, dann mangelt es ihnen an morphologischer FunktionalitĂ€t (Flexion, Derivation und Komposition), lexikalisch-semantischer FunktionalitĂ€t und der FĂ€higkeit zu einer sprachĂŒbergreifenden Analyse großer DokumentenbestĂ€nde. In der vorliegenden Promotionsschrift werden die theoretischen Grundlagen des MorphoSaurus-Systems (ein Akronym fĂŒr Morphem-Thesaurus) behandelt. Dessen methodischer Kern stellt ein um Morpheme der medizinischen Fach- und Laiensprache gruppierter Thesaurus dar, dessen EintrĂ€ge mittels semantischer Relationen sprachĂŒbergreifend verknĂŒpft sind. Darauf aufbauend wird ein Verfahren vorgestellt, welches (komplexe) Wörter in Morpheme segmentiert, die durch sprachunabhĂ€ngige, konzeptklassenartige Symbole ersetzt werden. Die resultierende ReprĂ€sentation ist die Basis fĂŒr das sprachĂŒbergreifende, morphemorientierte Textretrieval. Neben der Kerntechnologie wird eine Methode zur automatischen Akquise von LexikoneintrĂ€gen vorgestellt, wodurch bestehende Morphemlexika um weitere Sprachen ergĂ€nzt werden. Die BerĂŒcksichtigung sprachĂŒbergreifender PhĂ€nomene fĂŒhrt im Anschluss zu einem neuartigen Verfahren zur Auflösung von semantischen AmbiguitĂ€ten. Die LeistungsfĂ€higkeit des morphemorientierten Textretrievals wird im Rahmen umfangreicher, standardisierter Evaluationen empirisch getestet und gĂ€ngigen Herangehensweisen gegenĂŒbergestellt

    Long-Term Preservation of Digital Records, Part I: A Theoretical Basis

    Get PDF
    The Information Revolution is making preservation of digital records an urgent issue. Archivists have grappled with the question of how to achieve this for about 15 years. We focus on limitations to preservation, identifying precisely what can be preserved and what cannot. Our answer comes from the philosophical theory of knowledge, especially its discussion about the limits of what can be communicated. Philosophers have taught that answers to critical questions have been obscured by "failure to understand the logic of our language". We can clarify difficulties by paying extremely close attention to the meaning of words such as 'knowledge', 'information', 'the original', and 'dynamic'. What is valuable in transmitted and stored messages, and what should be preserved, is an abstraction, the pattern inherent in each transmitted and stored digital record. This answer has, in fact, been lurking just below the surface of archival literature. To make progress, archivists must collaborate with software engineers. Understanding perspectives across disciplinary boundaries will be needed.

    Relating Dependent Terms in Information Retrieval

    Get PDF
    Les moteurs de recherche font partie de notre vie quotidienne. Actuellement, plus d’un tiers de la population mondiale utilise l’Internet. Les moteurs de recherche leur permettent de trouver rapidement les informations ou les produits qu'ils veulent. La recherche d'information (IR) est le fondement de moteurs de recherche modernes. Les approches traditionnelles de recherche d'information supposent que les termes d'indexation sont indĂ©pendants. Pourtant, les termes qui apparaissent dans le mĂȘme contexte sont souvent dĂ©pendants. L’absence de la prise en compte de ces dĂ©pendances est une des causes de l’introduction de bruit dans le rĂ©sultat (rĂ©sultat non pertinents). Certaines Ă©tudes ont proposĂ© d’intĂ©grer certains types de dĂ©pendance, tels que la proximitĂ©, la cooccurrence, la contiguĂŻtĂ© et de la dĂ©pendance grammaticale. Dans la plupart des cas, les modĂšles de dĂ©pendance sont construits sĂ©parĂ©ment et ensuite combinĂ©s avec le modĂšle traditionnel de mots avec une importance constante. Par consĂ©quent, ils ne peuvent pas capturer correctement la dĂ©pendance variable et la force de dĂ©pendance. Par exemple, la dĂ©pendance entre les mots adjacents "Black Friday" est plus importante que celle entre les mots "road constructions". Dans cette thĂšse, nous Ă©tudions diffĂ©rentes approches pour capturer les relations des termes et de leurs forces de dĂ©pendance. Nous avons proposĂ© des mĂ©thodes suivantes: ─ Nous rĂ©examinons l'approche de combinaison en utilisant diffĂ©rentes unitĂ©s d'indexation pour la RI monolingue en chinois et la RI translinguistique entre anglais et chinois. En plus d’utiliser des mots, nous Ă©tudions la possibilitĂ© d'utiliser bi-gramme et uni-gramme comme unitĂ© de traduction pour le chinois. Plusieurs modĂšles de traduction sont construits pour traduire des mots anglais en uni-grammes, bi-grammes et mots chinois avec un corpus parallĂšle. Une requĂȘte en anglais est ensuite traduite de plusieurs façons, et un score classement est produit avec chaque traduction. Le score final de classement combine tous ces types de traduction. Nous considĂ©rons la dĂ©pendance entre les termes en utilisant la thĂ©orie d’évidence de Dempster-Shafer. Une occurrence d'un fragment de texte (de plusieurs mots) dans un document est considĂ©rĂ©e comme reprĂ©sentant l'ensemble de tous les termes constituants. La probabilitĂ© est assignĂ©e Ă  un tel ensemble de termes plutĂŽt qu’a chaque terme individuel. Au moment d’évaluation de requĂȘte, cette probabilitĂ© est redistribuĂ©e aux termes de la requĂȘte si ces derniers sont diffĂ©rents. Cette approche nous permet d'intĂ©grer les relations de dĂ©pendance entre les termes. Nous proposons un modĂšle discriminant pour intĂ©grer les diffĂ©rentes types de dĂ©pendance selon leur force et leur utilitĂ© pour la RI. Notamment, nous considĂ©rons la dĂ©pendance de contiguĂŻtĂ© et de cooccurrence Ă  de diffĂ©rentes distances, c’est-Ă -dire les bi-grammes et les paires de termes dans une fenĂȘtre de 2, 4, 8 et 16 mots. Le poids d’un bi-gramme ou d’une paire de termes dĂ©pendants est dĂ©terminĂ© selon un ensemble des caractĂšres, en utilisant la rĂ©gression SVM. Toutes les mĂ©thodes proposĂ©es sont Ă©valuĂ©es sur plusieurs collections en anglais et/ou chinois, et les rĂ©sultats expĂ©rimentaux montrent que ces mĂ©thodes produisent des amĂ©liorations substantielles sur l'Ă©tat de l'art.Search engine has become an integral part of our life. More than one-third of world populations are Internet users. Most users turn to a search engine as the quick way to finding the information or product they want. Information retrieval (IR) is the foundation for modern search engines. Traditional information retrieval approaches assume that indexing terms are independent. However, terms occurring in the same context are often dependent. Failing to recognize the dependencies between terms leads to noise (irrelevant documents) in the result. Some studies have proposed to integrate term dependency of different types, such as proximity, co-occurrence, adjacency and grammatical dependency. In most cases, dependency models are constructed apart and then combined with the traditional word-based (unigram) model on a fixed importance proportion. Consequently, they cannot properly capture variable term dependency and its strength. For example, dependency between adjacent words “black Friday” is more important to consider than those of between “road constructions”. In this thesis, we try to study different approaches to capture term relationships and their dependency strengths. We propose the following methods for monolingual IR and Cross-Language IR (CLIR): We re-examine the combination approach by using different indexing units for Chinese monolingual IR, then propose the similar method for CLIR. In addition to the traditional method based on words, we investigate the possibility of using Chinese bigrams and unigrams as translation units. Several translation models from English words to Chinese unigrams, bigrams and words are created based on a parallel corpus. An English query is then translated in several ways, each producing a ranking score. The final ranking score combines all these types of translations. We incorporate dependencies between terms in our model using Dempster-Shafer theory of evidence. Every occurrence of a text fragment in a document is represented as a set which includes all its implied terms. Probability is assigned to such a set of terms instead of individual terms. During query evaluation phase, the probability of the set can be transferred to those of the related query, allowing us to integrate language-dependent relations to IR. We propose a discriminative language model that integrates different term dependencies according to their strength and usefulness to IR. We consider the dependency of adjacency and co-occurrence within different distances, i.e. bigrams, pairs of terms within text window of size 2, 4, 8 and 16. The weight of bigram or a pair of dependent terms in the final model is learnt according to a set of features. All the proposed methods are evaluated on several English and/or Chinese collections, and experimental results show these methods achieve substantial improvements over state-of-the-art baselines
    • 

    corecore