661 research outputs found

    Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR

    The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create a simple, language-independent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best performing method for all languages. For English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length.
The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages.
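The two sub-word units named in the abstract above, overlapping character n-grams and fixed-length word prefixes, can be illustrated with a minimal sketch. The function names, the whitespace tokenisation, and the choice to keep short words whole are assumptions made for illustration only; they are not taken from the paper.

```python
def char_ngrams(word, n=3):
    """Overlapping character n-grams; words no longer than n are kept whole (assumption)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def k_prefix(word, k=4):
    """Truncate a word to its first k characters."""
    return word[:k]


def index_terms(text, unit="3gram"):
    """Turn text into index terms; whitespace tokenisation is a simplification."""
    tokens = text.lower().split()
    if unit == "3gram":
        return [gram for tok in tokens for gram in char_ngrams(tok, 3)]
    if unit == "4prefix":
        return [k_prefix(tok, 4) for tok in tokens]
    return tokens  # plain word indexing as the baseline


if __name__ == "__main__":
    print(index_terms("Information retrieval", unit="3gram"))
    print(index_terms("Information retrieval", unit="4prefix"))
```

With n-grams, one word form produces several shorter index terms, which matches the abstract's remark that sub-word identification increases the number of index terms while decreasing their average length.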

    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application that needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag of words. The Web provides a vast resource for the automatic construction of parallel corpora, which can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost. Comment: 37 pages

    Text Classification for Arabic Words Using Rep-Tree

    The amount of text data in the world and in our lives seems to be ever increasing. Text data mining is defined as the process of deriving high-quality information from text. It has been applied in different fields, including pattern mining, opinion mining, and web mining. Here, text data mining is based on the global stemming of different forms of Arabic words. Stemming is defined as the method of reducing inflected (or derived) words to their word stem, base, or root form, typically a written word form. We use the REP-Tree to improve text representation. In addition, we test new combinations of weighting schemes to be applied to Arabic text data for classification purposes. The WEKA workbench is used for processing. The results on a data set from the BBC Arabic website also show the efficiency and accuracy of REP-Tree in Arabic text classification.

    Evaluation and Improvement of Semantically-Enhanced Tagging System

    The Social Web or ‘Web 2.0’ is focused on the interaction and collaboration between website users. It is credited with the existence of tagging systems, amongst other things such as blogs and wikis. Tagging systems like YouTube and Flickr offer their users simplicity and freedom in creating and sharing their own content, and thus folksonomy is a very active research area where many improvements are presented to overcome existing disadvantages such as the lack of semantic meaning, ambiguity, and inconsistency. TE is a tagging system proposing solutions to the problems of multilingualism, lack of semantic meaning, and shorthand writing (which is very common in the social web) through the aid of semantic and social resources. The current research presents an addition to the TE system in the form of an embedded stemming component to provide a solution to the problem of different lexical forms. Prior to this, the TE system had to be explored thoroughly and its efficiency determined in order to decide on the practicality of embedding any additional components as enhancements to the performance. Deciding on this involved analysing the algorithm's efficiency using an analytical approach to determine its time and space complexity. TE had a time growth rate of O(N²), which is polynomial, thus the algorithm is considered efficient. Nonetheless, recommended modifications like batch SQL execution can improve this. Regarding space complexity, the number of tags per photo represents the problem size; as it grows, the required memory space increases linearly. Based on the findings above, the TE system is re-implemented on Flickr instead of YouTube because of a recent YouTube restriction, which is of greater benefit in a multi-language tagging system since the language barrier is meaningless in this case. The re-implementation is achieved using ‘flickrj’ (Java Interface for Flickr APIs).
Next, the stemming component is added to perform tag normalisation prior to querying the ontologies. The component is embedded using the Java implementation of the Porter 2 stemmer, which supports many languages, including Italian. The impact of the stemming component on the performance of the TE system, in terms of the size of the index table and the number of retrieved results, is investigated using an experiment that showed a reduction of 48% in the size of the index table. This also means that search queries have fewer system tags to compare against the search keywords, which can speed up the search. Furthermore, the experiment runs similar search trials on two versions of the TE system, one without the stemming component and the other with it, and found that the latter produced more results, provided that valid words and valid stems were used. Embedding the stemming component in the new TE system has lessened the storage overhead needed for the generated system tags by reducing the size of the index table, which makes the system suited to many applications such as text classification, summarization, email filtering, machine translation, etc.

    Meaning refinement to improve cross-lingual information retrieval

    University of Magdeburg, Faculty of Computer Science, dissertation, 2012, by Farag Ahme

    Unsupervised learning of Arabic non-concatenative morphology

    Unsupervised approaches to learning the morphology of a language play an important role in computer processing of language from a practical and theoretical perspective, due to their minimal reliance on manually produced linguistic resources and human annotation. Such approaches have been widely researched for the problem of concatenative affixation, but less attention has been paid to the intercalated (non-concatenative) morphology exhibited by Arabic and other Semitic languages. The aim of this research is to learn the root and pattern morphology of Arabic, with accuracy comparable to manually built morphological analysis systems. The approach is kept free from human supervision or manual parameter settings, assuming only that roots and patterns intertwine to form a word. Promising results were obtained by applying a technique adapted from previous work in concatenative morphology learning, which uses machine learning to determine relatedness between words. The output, with probabilistic relatedness values between words, was then used to rank all possible roots and patterns to form a lexicon. Analysis using triliteral roots resulted in correct root identification accuracy of approximately 86% for inflected words. Although the machine learning-based approach is effective, it is conceptually complex. An alternative, simpler, and computationally efficient approach was therefore devised to obtain morpheme scores based on comparative counts of roots and patterns. In this approach, root and pattern scores are defined in terms of each other in a mutually recursive relationship, converging to an optimized morpheme ranking. This technique gives slightly better accuracy while being conceptually simpler and more efficient. The approach, after further enhancements, was evaluated on a version of the Quranic Arabic Corpus, attaining a final accuracy of approximately 93%.
A comparative evaluation shows this to be superior to two existing, well-used, manually built Arabic stemmers, thus demonstrating the practical feasibility of unsupervised learning of non-concatenative morphology.
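The assumption that roots and patterns intertwine to form a word can be made concrete with a small sketch that enumerates every way of splitting a word into a three-letter root and the remaining pattern. This only illustrates candidate generation; it is not the scoring or ranking method of the thesis. The function name, the `_` slot marker, and the Latin transliteration `kataba` are assumptions chosen for readability.

```python
from itertools import combinations


def root_pattern_candidates(word, root_len=3):
    """Yield (root, pattern) pairs; '_' marks a root slot in the pattern (hypothetical notation)."""
    for positions in combinations(range(len(word)), root_len):
        # Letters at the chosen positions, in order, form the candidate root.
        root = "".join(word[i] for i in positions)
        # Everything else stays in place and forms the candidate pattern.
        pattern = "".join("_" if i in positions else ch for i, ch in enumerate(word))
        yield root, pattern


if __name__ == "__main__":
    candidates = list(root_pattern_candidates("kataba"))
    print(len(candidates))  # number of ways to pick 3 of 6 positions
    print(("ktb", "_a_a_a") in candidates)
```

A ranking step, such as the comparative counts of roots and patterns mentioned in the abstract, would then have to pick the most plausible pair out of these candidates.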

    Distributional Measures of Semantic Distance: A Survey

    The ability to mimic human notions of semantic distance has widespread applications. Some measures rely only on raw text (distributional measures) and some rely on knowledge sources such as WordNet. Although extensive studies have been performed to compare WordNet-based measures with human judgment, the use of distributional measures as proxies to estimate semantic distance has received little attention. Even though they have traditionally performed poorly when compared to WordNet-based measures, they lay claim to certain uniquely attractive features, such as their applicability in resource-poor languages and their ability to mimic both semantic similarity and semantic relatedness. Therefore, this paper presents a detailed study of distributional measures. Particular attention is paid to fleshing out the strengths and limitations of both WordNet-based and distributional measures, and how distributional measures of distance can be brought more in line with human notions of semantic distance. We conclude with a brief discussion of recent work on hybrid measures.

    Mixed-Language Arabic-English Information Retrieval

    Includes abstract. Includes bibliographical references. This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, however, it is essential first to suppress the impact of most problems that are caused by the mixed-language feature in both queries and documents and that bias the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, the term frequency, document frequency, and document length components in mixed queries are estimated and adjusted, regardless of language, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in a non-English language) would likely be overweighted and skew the impact of technical terms (mostly those in English), due to the high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). Such a phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes a reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries.
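The link between high document frequency and low term weight mentioned in the abstract above follows from the standard IDF definition, log(N / df). The sketch below uses that textbook formula, not the re-weighted IDF proposed in the thesis, and the collection sizes and document frequencies are invented for illustration.

```python
import math


def idf(doc_freq, num_docs):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)


if __name__ == "__main__":
    # Hypothetical counts: a technical English term that is frequent in a
    # large English collection versus a non-English term that is rare in a
    # smaller collection.
    english_technical = idf(doc_freq=50_000, num_docs=100_000)
    non_english_general = idf(doc_freq=100, num_docs=10_000)
    print(english_technical < non_english_general)
```

Under these invented counts the frequent English technical term receives the lower weight, which is the kind of imbalance a cross-lingual re-weighting would aim to moderate.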