
    Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR

    The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create a simple, language-independent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no single best-performing method for all languages. For English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best results; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length. 
The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages.
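The sub-word indexing units evaluated in the paper (overlapping character 3-grams for Bengali and Marathi, fixed-length prefixes for Hindi) can be sketched as follows. This is a minimal illustration of how such indexing units might be generated; the function names and the `method` labels are invented for this sketch and are not taken from the paper.

```python
def char_ngrams(word, n=3):
    """Overlapping character n-grams of a word (e.g. n=3, as used for
    Bengali and Marathi in the paper). Words shorter than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def prefix(word, k=4):
    """Fixed-length word prefix (e.g. 4-prefixes, best for Hindi)."""
    return word[:k]

def index_terms(tokens, method="3gram"):
    """Map word tokens to sub-word index terms. Note that sub-word
    identification yields one or more index terms per word form,
    increasing the number of index terms but decreasing their length."""
    terms = []
    for tok in tokens:
        if method == "3gram":
            terms.extend(char_ngrams(tok, 3))
        elif method == "4prefix":
            terms.append(prefix(tok, 4))
        else:
            terms.append(tok)  # plain word indexing baseline
    return terms
```

A word such as "retrieval" thus contributes seven 3-gram index terms but only a single 4-prefix term, which is why feedback term selection behaves differently for the two unit types.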

    Arabic Fluency Assessment: Procedures for Assessing Stuttering in Arabic Preschool Children

    The primary aim of this thesis was to screen school-aged (4+) children for two separate types of fluency issues and to distinguish both groups from fluent children. The two fluency issues are Word-Finding Difficulty (WFD) and other speech disfluencies (primarily stuttering). The cohort examined consisted of children who spoke Arabic and English. We first designed a phonological assessment procedure that can equitably test Arabic and English children, called the Arabic English non-word repetition task (AEN_NWR). Riley’s Stuttering Severity Instrument (SSI) is the standard way of assessing fluency for speakers of English, but there is no standardized version of SSI for Arabic speakers. Hence, we designed a scheme to measure disfluency symptoms in Arabic speech (Arabic fluency assessment). The scheme recognizes that Arabic and English differ at all language levels (lexically, phonologically and syntactically). After the children with WFD had been separated from those with stuttering, our second aim was to develop and deliver appropriate interventions for the different cohorts. Specifically, we aimed to develop treatments for the children with WFD using short procedures suitable for conducting in schools; children who stutter are referred to SLTs to receive the appropriate type of intervention. To treat WFD, another set of non-word materials was designed to include phonemic patterns not used in the speaker’s native language but required in a targeted additional language (e.g. phonemic patterns that occur in English, but not Arabic). The goal was to use these materials in an intervention to train those phonemic sequences. The hypothesis is that a native Arabic speaker learning English would be expected to struggle with the phonotactic patterns not used in Arabic that are required for English. 
In addition to the screening and intervention protocols designed, self-report procedures are desirable for assessing speech fluency when time for testing is limited. To that end, the last chapter discussed the importance of designing a fluency questionnaire that can assess fluency in the entire population of speakers. Together with the AEN_NWR, the brief self-report instrument forms a package of assessment procedures that facilitates screening of speech disfluencies in Arabic children (aged 4+) when they first enter school. The seven chapters, described in more detail below, together constitute a package that achieves the aims of identifying speech problems in children using Arabic and/or English and offering intervention to treat WFD.

    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application which needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on bag-of-words representations. The Web provides a vast resource for the automatic construction of parallel corpora, which can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.
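One common way to embed a statistical translation model in a bag-of-words retrieval model is structured query translation: each source-language query term is replaced by its most probable target-language translations, weighted by the translation probabilities. The sketch below illustrates this idea only; the translation table, its probabilities, and the function name are invented for illustration and do not reproduce the paper's actual models.

```python
# Hypothetical translation table learned from Web-mined parallel text:
# source word -> {target word: P(target | source)}
translation_table = {
    "maison": {"house": 0.7, "home": 0.25, "household": 0.05},
    "verte": {"green": 0.9, "unripe": 0.1},
}

def translate_query(query_terms, table, top_k=2):
    """Expand a source-language query into weighted target-language terms,
    keeping the top_k translations per source word. The weights can then
    feed the term-weighting component of a bag-of-words retrieval model."""
    weighted = {}
    for term in query_terms:
        candidates = table.get(term, {})
        ranked = sorted(candidates.items(), key=lambda kv: -kv[1])[:top_k]
        for target, prob in ranked:
            # Accumulate probability mass if two source words share a translation.
            weighted[target] = weighted.get(target, 0.0) + prob
    return weighted
```

Because the retrieval model only needs term weights, even a noisy Web-trained translation table of this shape can be plugged in without any sentence-level translation quality.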

    Sentiment analysis in the Arabic language using machine learning

    Sentiment analysis has recently become one of the growing areas of research related to natural language processing and machine learning. A great deal of opinion and sentiment about specific topics is available online, which allows several parties, such as customers, companies and even governments, to explore these opinions. The first task is to classify the text in terms of whether or not it expresses opinion or factual information. Polarity classification is the second task, which distinguishes between the polarities (positive, negative or neutral) that sentences may carry. The analysis of natural language text for the identification of subjectivity and sentiment has been well studied for the English language. Conversely, the work that has been carried out on Arabic remains in its infancy; thus, more cooperation is required between research communities in order for them to offer a mature sentiment analysis system for Arabic. There are recognized challenges in this field, some of which are inherited from the nature of the Arabic language itself, while others are derived from the scarcity of tools and resources. This dissertation provides the rationale behind the current work and proposes methods to enhance the performance of sentiment analysis in the Arabic language. The first step is to increase the resources that help in the analysis process; the most important part of this task is to have annotated sentiment corpora. Several free corpora are available for the English language, but such resources are still limited in other languages, such as Arabic. This dissertation describes the work undertaken by the author to enrich sentiment analysis in Arabic by building a new Arabic Sentiment Corpus. The data is labeled not only with two polarities (positive and negative); the neutral sentiment is also used during the annotation process. 
The second step includes the proposal of features that may capture sentiment orientation in Arabic, as well as the use of different machine learning classifiers that may work better and capture the non-linearity of a richly morphological and highly inflectional language such as Arabic. Different types of features are proposed, each trying to capture different aspects and characteristics of Arabic; morphological, semantic, and stylistic features are proposed and investigated. With regard to the classifier, the performance of linear and nonlinear machine learning approaches was compared. The results are promising for the continued use of nonlinear ML classifiers for this task. Learning knowledge from a particular dataset domain and applying it to a different domain is one useful method in the case of limited resources, as with the Arabic language. This dissertation shows and discusses the possibility of applying cross-domain learning in the field of Arabic sentiment analysis, and indicates the feasibility of using different mechanisms of the cross-domain method. Other work in this dissertation includes the exploration of the effect of negation in Arabic subjectivity and polarity classification. Negation word lists were devised to help in this and other natural language processing tasks; these lists cover both Modern Standard Arabic and some dialects. Two methods of dealing with negation in Arabic sentiment analysis were proposed. The first method is based on a static approach that assumes that each sentence containing negation words is a negated sentence. When determining the effect of negation, different techniques were proposed, using different word window sizes or base phrase chunks. 
The second approach depends on a dynamic method that needs an annotated negation dataset in order to build a model that can determine whether or not a sentence is negated by the negation words and to establish the effect of the negation on the sentence. The results achieved by adding negation handling to Arabic sentiment analysis were promising and indicate that negation has an effect on this task. Finally, the experiments and evaluations conducted in this dissertation encourage researchers to continue in this direction of research.
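The static word-window technique described above can be sketched as follows: every token within a fixed window after a negation word is flagged, a common convention in sentiment analysis where flagged tokens become distinct features. The negator list below is a small illustrative sample of Modern Standard Arabic negation words, and the `_NEG` suffix convention is an assumption of this sketch, not the dissertation's actual implementation.

```python
# Illustrative sample of Modern Standard Arabic negation words.
NEGATION_WORDS = {"لا", "لم", "لن", "ما", "ليس"}

def mark_negation(tokens, window=3):
    """Static negation scoping: append a _NEG flag to up to `window`
    tokens following a negation word, so that negated occurrences of a
    word become separate features for a sentiment classifier."""
    out = []
    scope = 0
    for tok in tokens:
        if tok in NEGATION_WORDS:
            out.append(tok)
            scope = window  # open a new negation scope
        elif scope > 0:
            out.append(tok + "_NEG")
            scope -= 1
        else:
            out.append(tok)
    return out
```

The base-phrase-chunk variant mentioned in the abstract would replace the fixed `window` with the boundary of the syntactic chunk containing the negator, and the dynamic method would replace this rule entirely with a model trained on the annotated negation dataset.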

    DOES L2 WORD DECODING IMPLY L2 MEANING ACTIVATION? RELATIONSHIPS AMONG DECODING, MEANING IDENTIFICATION, AND L2 ORAL LANGUAGE PROFICIENCY IN READING SPANISH AS A SECOND LANGUAGE

    This study investigated the role of meaning activation and L2 oral language proficiency among Moroccan children learning to read in Spanish for the first time. Recent cross-linguistic research suggests that children learning to read in an L1 or L2 transparent orthography can achieve phonological decoding accuracy faster by relying on grapheme-phoneme strategies. In that case, it becomes extremely important to investigate the role of meaning and its relation to the development of phonological decoding and reading comprehension, especially when children are learning to read in an L2 transparent orthography. The main objective of this study was to discover whether phonological decoding and meaning identification can be considered two independent constructs or only one. The second objective was to expand the scope of L2 Spanish oral language proficiency by examining its influence on each of these constructs and on sentence reading comprehension. A battery of measures for assessing the various domains of phonological awareness, decoding, meaning identification and sentence comprehension was administered to 140 Moroccan children with at least one year of literacy instruction in Spain. Letter knowledge and concept of print were used as control variables. Confirmatory analysis results demonstrated that decoding and meaning identification form different but dependent constructs. Structural equation modeling indicated that the contribution of L2 oral language proficiency depended on the exact nature of the dependent variable: L2 oral language proficiency does not directly predict decoding skills but is directly related to meaning identification skills and sentence comprehension. The findings provided an understanding of the roles of meaning and L2 oral language proficiency in isolated word reading and sentence comprehension, and clearly implied that decoding and comprehension are more independent when learning to read in an L2 transparent orthography. 
L2 decoding in Spanish can take place without comprehension. Possible theoretical, instructional and assessment implications related to L2 Spanish reading development are drawn based on the study's results.

    Creating language resources for under-resourced languages: methodologies, and experiments with Arabic

    Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.