416 research outputs found

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Full text link
    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system, where we combine a query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, the technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords based on its special phonogram. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which corresponds words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance

    Analyzing user reviews of messaging Apps for competitive analysis

    Get PDF
    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data ScienceThe rise of various messaging apps has resulted in intensively fierce competition, and the era of Web 2.0 enables business managers to gain competitive intelligence from user-generated content (UGC). Text-mining UGC for competitive intelligence has been drawing great interest of researchers. However, relevant studies mostly focus on industries such as hospitality and products, and few studies applied such techniques to effectively perform competitive analysis for messaging apps. Here, we conducted a competitive analysis based on topic modeling and sentiment analysis by text-mining 27,479 user reviews of four iOS messaging apps, namely Messenger, WhatsApp, Signal and Telegram. The results show that the performance of topic modeling and sentiment analysis is encouraging, and that a combination of the extracted app aspect-based topics and the adjusted sentiment scores can effectively reveal meaningful competitive insights into user concerns, competitive strengths and weaknesses as well as changes of user sentiments over time. We anticipate that this study will not only advance the existing literature on competitive analysis using text mining techniques for messaging apps but also help existing players and new entrants in the market to sharpen their competitive edge by better understanding their user needs and the industry trends

    Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR

    Get PDF
    The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create create a simple, languageindependent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conation step and useful in case of few language-specific resources. For English, the corpusbased stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best performing method for all languages. For English, indexing using the Porter stemmer performs best, for Bengali and Marathi, overlapping 3-grams obtain the best result, and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length. The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages

    Relating Dependent Terms in Information Retrieval

    Get PDF
    Les moteurs de recherche font partie de notre vie quotidienne. Actuellement, plus d’un tiers de la population mondiale utilise l’Internet. Les moteurs de recherche leur permettent de trouver rapidement les informations ou les produits qu'ils veulent. La recherche d'information (IR) est le fondement de moteurs de recherche modernes. Les approches traditionnelles de recherche d'information supposent que les termes d'indexation sont indépendants. Pourtant, les termes qui apparaissent dans le même contexte sont souvent dépendants. L’absence de la prise en compte de ces dépendances est une des causes de l’introduction de bruit dans le résultat (résultat non pertinents). Certaines études ont proposé d’intégrer certains types de dépendance, tels que la proximité, la cooccurrence, la contiguïté et de la dépendance grammaticale. Dans la plupart des cas, les modèles de dépendance sont construits séparément et ensuite combinés avec le modèle traditionnel de mots avec une importance constante. Par conséquent, ils ne peuvent pas capturer correctement la dépendance variable et la force de dépendance. Par exemple, la dépendance entre les mots adjacents "Black Friday" est plus importante que celle entre les mots "road constructions". Dans cette thèse, nous étudions différentes approches pour capturer les relations des termes et de leurs forces de dépendance. Nous avons proposé des méthodes suivantes: ─ Nous réexaminons l'approche de combinaison en utilisant différentes unités d'indexation pour la RI monolingue en chinois et la RI translinguistique entre anglais et chinois. En plus d’utiliser des mots, nous étudions la possibilité d'utiliser bi-gramme et uni-gramme comme unité de traduction pour le chinois. Plusieurs modèles de traduction sont construits pour traduire des mots anglais en uni-grammes, bi-grammes et mots chinois avec un corpus parallèle. Une requête en anglais est ensuite traduite de plusieurs façons, et un score classement est produit avec chaque traduction. Le score final de classement combine tous ces types de traduction. Nous considérons la dépendance entre les termes en utilisant la théorie d’évidence de Dempster-Shafer. Une occurrence d'un fragment de texte (de plusieurs mots) dans un document est considérée comme représentant l'ensemble de tous les termes constituants. La probabilité est assignée à un tel ensemble de termes plutôt qu’a chaque terme individuel. Au moment d’évaluation de requête, cette probabilité est redistribuée aux termes de la requête si ces derniers sont différents. Cette approche nous permet d'intégrer les relations de dépendance entre les termes. Nous proposons un modèle discriminant pour intégrer les différentes types de dépendance selon leur force et leur utilité pour la RI. Notamment, nous considérons la dépendance de contiguïté et de cooccurrence à de différentes distances, c’est-à-dire les bi-grammes et les paires de termes dans une fenêtre de 2, 4, 8 et 16 mots. Le poids d’un bi-gramme ou d’une paire de termes dépendants est déterminé selon un ensemble des caractères, en utilisant la régression SVM. Toutes les méthodes proposées sont évaluées sur plusieurs collections en anglais et/ou chinois, et les résultats expérimentaux montrent que ces méthodes produisent des améliorations substantielles sur l'état de l'art.Search engine has become an integral part of our life. More than one-third of world populations are Internet users. Most users turn to a search engine as the quick way to finding the information or product they want. Information retrieval (IR) is the foundation for modern search engines. Traditional information retrieval approaches assume that indexing terms are independent. However, terms occurring in the same context are often dependent. Failing to recognize the dependencies between terms leads to noise (irrelevant documents) in the result. Some studies have proposed to integrate term dependency of different types, such as proximity, co-occurrence, adjacency and grammatical dependency. In most cases, dependency models are constructed apart and then combined with the traditional word-based (unigram) model on a fixed importance proportion. Consequently, they cannot properly capture variable term dependency and its strength. For example, dependency between adjacent words “black Friday” is more important to consider than those of between “road constructions”. In this thesis, we try to study different approaches to capture term relationships and their dependency strengths. We propose the following methods for monolingual IR and Cross-Language IR (CLIR): We re-examine the combination approach by using different indexing units for Chinese monolingual IR, then propose the similar method for CLIR. In addition to the traditional method based on words, we investigate the possibility of using Chinese bigrams and unigrams as translation units. Several translation models from English words to Chinese unigrams, bigrams and words are created based on a parallel corpus. An English query is then translated in several ways, each producing a ranking score. The final ranking score combines all these types of translations. We incorporate dependencies between terms in our model using Dempster-Shafer theory of evidence. Every occurrence of a text fragment in a document is represented as a set which includes all its implied terms. Probability is assigned to such a set of terms instead of individual terms. During query evaluation phase, the probability of the set can be transferred to those of the related query, allowing us to integrate language-dependent relations to IR. We propose a discriminative language model that integrates different term dependencies according to their strength and usefulness to IR. We consider the dependency of adjacency and co-occurrence within different distances, i.e. bigrams, pairs of terms within text window of size 2, 4, 8 and 16. The weight of bigram or a pair of dependent terms in the final model is learnt according to a set of features. All the proposed methods are evaluated on several English and/or Chinese collections, and experimental results show these methods achieve substantial improvements over state-of-the-art baselines

    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Get PDF
    Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application which needs translation functionality of a relatively low level of sophistication since current models for information retrieval (IR) are still based on a bag-of-words. The Web provides a vast resource for the automatic construction of parallel corpora which can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.Comment: 37 page

    A corpus-based induction learning approach to natural language processing.

    Get PDF
    by Leung Chi Hong.Thesis (Ph.D.)--Chinese University of Hong Kong, 1996.Includes bibliographical references (leaves 163-171).Chapter Chapter 1. --- Introduction --- p.1Chapter Chapter 2. --- Background Study of Natural Language Processing --- p.9Chapter 2.1. --- Knowledge-based approach --- p.9Chapter 2.1.1. --- Morphological analysis --- p.10Chapter 2.1.2. --- Syntactic parsing --- p.11Chapter 2.1.3. --- Semantic parsing --- p.16Chapter 2.1.3.1. --- Semantic grammar --- p.19Chapter 2.1.3.2. --- Case grammar --- p.20Chapter 2.1.4. --- Problems of knowledge acquisition in knowledge-based approach --- p.22Chapter 2.2. --- Corpus-based approach --- p.23Chapter 2.2.1. --- Beginning of corpus-based approach --- p.23Chapter 2.2.2. --- An example of corpus-based application: word tagging --- p.25Chapter 2.2.3. --- Annotated corpus --- p.26Chapter 2.2.4. --- State of the art in the corpus-based approach --- p.26Chapter 2.3. --- Knowledge-based approach versus corpus-based approach --- p.28Chapter 2.4. --- Co-operation between two different approaches --- p.32Chapter Chapter 3. --- Induction Learning applied to Corpus-based Approach --- p.35Chapter 3.1. --- General model of traditional corpus-based approach --- p.36Chapter 3.1.1. --- Division of a problem into a number of sub-problems --- p.36Chapter 3.1.2. --- Solution selected from a set of predefined choices --- p.36Chapter 3.1.3. --- Solution selection based on a particular kind of linguistic entity --- p.37Chapter 3.1.4. --- Statistical correlations between solutions and linguistic entities --- p.37Chapter 3.1.5. --- Prediction of the best solution based on statistical correlations --- p.38Chapter 3.2. --- First problem in the corpus-based approach: Irrelevance in the corpus --- p.39Chapter 3.3. --- Induction learning --- p.41Chapter 3.3.1. --- General issues about induction learning --- p.41Chapter 3.3.2. --- Reasons of using induction learning in the corpus-based approach --- p.43Chapter 3.3.3. --- General model of corpus-based induction learning approach --- p.45Chapter 3.3.3.1. --- Preparation of positive corpus and negative corpus --- p.45Chapter 3.3.3.2. --- Statistical correlations between solutions and linguistic entities --- p.46Chapter 3.3.3.3. --- Combination of the statistical correlations obtained from the positive and negative corpora --- p.48Chapter 3.4. --- Second problem in the corpus-based approach: Modification of initial probabilistic approximations --- p.50Chapter 3.5. --- Learning feedback modification --- p.52Chapter 3.5.1. --- Determination of which correlation scores to be modified --- p.52Chapter 3.5.2. --- Determination of the magnitude of modification --- p.53Chapter 3.5.3. --- An general algorithm of learning feedback modification --- p.56Chapter Chapter 4. --- Identification of Phrases and Templates in Domain-specific Chinese Texts --- p.59Chapter 4.1. --- Analysis of the problem solved by the traditional corpus-based approach --- p.61Chapter 4.2. --- Phrase identification based on positive and negative corpora --- p.63Chapter 4.3. --- Phrase identification procedure --- p.64Chapter 4.3.1. --- Step 1: Phrase seed identification --- p.65Chapter 4.3.2. --- Step 2: Phrase construction from phrase seeds --- p.65Chapter 4.4. --- Template identification procedure --- p.67Chapter 4.5. --- Experiment and result --- p.70Chapter 4.5.1. --- Testing data --- p.70Chapter 4.5.2. --- Details of experiments --- p.71Chapter 4.5.3. --- Experimental results --- p.72Chapter 4.5.3.1. --- Phrases and templates identified in financial news articles --- p.72Chapter 4.5.3.2. --- Phrases and templates identified in political news articles --- p.73Chapter 4.6. --- Conclusion --- p.74Chapter Chapter 5. --- A Corpus-based Induction Learning Approach to Improving the Accuracy of Chinese Word Segmentation --- p.76Chapter 5.1. --- Background of Chinese word segmentation --- p.77Chapter 5.2. --- Typical methods of Chinese word segmentation --- p.78Chapter 5.2.1. --- Syntactic and semantic approach --- p.78Chapter 5.2.2. --- Statistical approach --- p.79Chapter 5.2.3. --- Heuristic approach --- p.81Chapter 5.3. --- Problems in word segmentation --- p.82Chapter 5.3.1. --- Chinese word definition --- p.82Chapter 5.3.2. --- Word dictionary --- p.83Chapter 5.3.3. --- Word segmentation ambiguity --- p.84Chapter 5.4. --- Corpus-based induction learning approach to improving word segmentation accuracy --- p.86Chapter 5.4.1. --- Rationale of approach --- p.87Chapter 5.4.2. --- Method of constructing modification rules --- p.89Chapter 5.5. --- Experiment and results --- p.94Chapter 5.6. --- Characteristics of modification rules constructed in experiment --- p.96Chapter 5.7. --- Experiment constructing rules for compound words with suffixes --- p.98Chapter 5.8. --- Relationship between modification frequency and Zipfs first law --- p.99Chapter 5.9. --- Problems in the approach --- p.100Chapter 5.10. --- Conclusion --- p.101Chapter Chapter 6. --- Corpus-based Induction Learning Approach to Automatic Indexing of Controlled Index Terms --- p.103Chapter 6.1. --- Background of automatic indexing --- p.103Chapter 6.1.1. --- Definition of index term and indexing --- p.103Chapter 6.1.2. --- Manual indexing versus automatic indexing --- p.105Chapter 6.1.3. --- Different approaches to automatic indexing --- p.107Chapter 6.2. --- Corpus-based induction learning approach to automatic indexing --- p.109Chapter 6.2.1. --- Fundamental concept about corpus-based automatic indexing --- p.110Chapter 6.2.2. --- Procedure of automatic indexing --- p.111Chapter 6.2.2.1. --- Learning process --- p.112Chapter 6.2.2.2. --- Indexing process --- p.118Chapter 6.3. --- Experiments of corpus-based induction learning approach to automatic indexing --- p.118Chapter 6.3.1. --- An experiment evaluating the complete procedures --- p.119Chapter 6.3.1.1. --- Testing data used in the experiment --- p.119Chapter 6.3.1.2. --- Details of the experiment --- p.119Chapter 6.3.1.3. --- Experimental result --- p.121Chapter 6.3.2. --- An experiment comparing with the traditional approach --- p.122Chapter 6.3.3. --- An experiment determining the optimal indexing score threshold --- p.124Chapter 6.3.4. --- An experiment measuring the precision and recall of indexing performance --- p.127Chapter 6.4. --- Learning feedback modification --- p.128Chapter 6.4.1. --- Positive feedback --- p.129Chapter 6.4.2. --- Negative feedback --- p.131Chapter 6.4.3. --- Change of indexed proportions of positive/negative training corpus in feedback iterations --- p.132Chapter 6.4.4. --- An experiment evaluating the learning feedback modification --- p.134Chapter 6.4.5. --- An experiment testing the significance factor in merging process --- p.136Chapter 6.5. --- Conclusion --- p.138Chapter Chapter 7. --- Conclusion --- p.140Appendix A: Some examples of identified phrases in financial news articles --- p.149Appendix B: Some examples of identified templates in financial news articles --- p.150Appendix C: Some examples of texts containing the templates in financial news articles --- p.151Appendix D: Some examples of identified phrases in political news articles --- p.152Appendix E: Some examples of identified templates in political news articles --- p.153Appendix F: Some examples of texts containing the templates in political news articles --- p.154Appendix G: Syntactic tags used in word segmentation modification rule experiment --- p.155Appendix H: An example of semantic approach to automatic indexing --- p.156Appendix I: An example of syntactic approach to automatic indexing --- p.158Appendix J: Samples of INSPEC and MEDLINE Records --- p.161Appendix K: Examples of Promoting and Demoting Words --- p.162References --- p.16

    Nodalida 2005 - proceedings of the 15th NODALIDA conference

    Get PDF

    Mixed-Language Arabic- English Information Retrieval

    Get PDF
    Includes abstract.Includes bibliographical references.This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve most relevant documents, regardless of their languages. To achieve this goal, however, it is essential firstly to suppress the impact of most problems that are caused by the mixed-language feature in both queries and documents and which result in biasing the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of languages, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in non-English language) would likely overweight and skew the impact of those technical terms (mostly those in English) due to high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). Such phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries

    The Generation of Compound Nominals to Represent the Essence of Text The COMMIX System

    Get PDF
    This thesis concerns the COMMIX system, which automatically extracts information on what a text is about, and generates that information in the highly compacted form of compound nominal expressions. The expressions generated are complex and may include novel terms which do not appear themselves in the input text. From the practical point of view, the work is driven by the need for better representations of content: for representations which are shorter and more concise than would appear in an abstract, yet more informative and representative of the actual aboutness than commonly occurs in indexing expressions and key terms. This additional layer of representation is referred to in this work as pertaining to the essence of a particular text. From a theoretical standpoint, the thesis shows how the compound nominal as a construct can be successfully employed in these highly informative representations. It involves an exploration of the claim that there is sufficient semantic information contained within the standard dictionary glosses for individual words to enable the construction of useful and highly representative novel compound nominal expressions, without recourse to standard syntactic and statistical methods. It shows how a shallow semantic approach to content identification which is based on lexical overlap can produce some very encouraging results. The methodology employed, and described herein, is domain-independent, and does not require the specification of templates with which the input text must comply. In these two respects, the methodology developed in this work avoids two of the most common problems associated with information extraction. As regards the evaluation of this type of work, the thesis introduces and utilises the notion of percentage attainment value, which is used in conjunction with subjects' opinions about the degree to which the aboutness terms succeed in indicating the subject matter of the texts for which they were generated

    Foundation, Implementation and Evaluation of the MorphoSaurus System: Subword Indexing, Lexical Learning and Word Sense Disambiguation for Medical Cross-Language Information Retrieval

    Get PDF
    Im medizinischen Alltag, zu welchem viel Dokumentations- und Recherchearbeit gehört, ist mittlerweile der überwiegende Teil textuell kodierter Information elektronisch verfügbar. Hiermit kommt der Entwicklung leistungsfähiger Methoden zur effizienten Recherche eine vorrangige Bedeutung zu. Bewertet man die Nützlichkeit gängiger Textretrievalsysteme aus dem Blickwinkel der medizinischen Fachsprache, dann mangelt es ihnen an morphologischer Funktionalität (Flexion, Derivation und Komposition), lexikalisch-semantischer Funktionalität und der Fähigkeit zu einer sprachübergreifenden Analyse großer Dokumentenbestände. In der vorliegenden Promotionsschrift werden die theoretischen Grundlagen des MorphoSaurus-Systems (ein Akronym für Morphem-Thesaurus) behandelt. Dessen methodischer Kern stellt ein um Morpheme der medizinischen Fach- und Laiensprache gruppierter Thesaurus dar, dessen Einträge mittels semantischer Relationen sprachübergreifend verknüpft sind. Darauf aufbauend wird ein Verfahren vorgestellt, welches (komplexe) Wörter in Morpheme segmentiert, die durch sprachunabhängige, konzeptklassenartige Symbole ersetzt werden. Die resultierende Repräsentation ist die Basis für das sprachübergreifende, morphemorientierte Textretrieval. Neben der Kerntechnologie wird eine Methode zur automatischen Akquise von Lexikoneinträgen vorgestellt, wodurch bestehende Morphemlexika um weitere Sprachen ergänzt werden. Die Berücksichtigung sprachübergreifender Phänomene führt im Anschluss zu einem neuartigen Verfahren zur Auflösung von semantischen Ambiguitäten. Die Leistungsfähigkeit des morphemorientierten Textretrievals wird im Rahmen umfangreicher, standardisierter Evaluationen empirisch getestet und gängigen Herangehensweisen gegenübergestellt
    corecore