
    Native language identification of fluent and advanced non-native writers

    This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in April 2020, available online: https://doi.org/10.1145/3383202 The accepted version of the publication may differ from the final published version.
    Native Language Identification (NLI) aims at identifying the native languages of authors by analyzing text samples they have written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific or social-network features and may not generalize to other domains and datasets, (ii) are unable to capture the variations of language-usage patterns within a text sample, and (iii) lack any outlier-handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language, namely English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across languages.
    Research funded by Higher Education Commission, and Grants for Development of New Faculty Staff at Chulalongkorn University | Digital Economy Promotion Agency (# MP-62-0003) | Thailand Research Funds (MRG6180266 and MRG6280175). Published version.
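The pipeline summarized in this abstract (text samples as point sets, retrieval of the top-k most similar samples, then a probabilistic vote) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and is not taken from the article: the set-to-set distance (a symmetric average of nearest-point distances), the two-dimensional toy features, the labels `L1-A` and `L1-B`, and the use of plain vote shares as class probabilities.

```python
import math
from collections import Counter


def set_distance(a, b):
    """Distance between two point sets (assumed metric, not the article's):
    the larger of the two directed average nearest-point distances."""
    def directed(x, y):
        return sum(min(math.dist(p, q) for q in y) for p in x) / len(x)
    return max(directed(a, b), directed(b, a))


def predict_native_language(query, corpus, k=3):
    """Return (label, probabilities) from the k most similar samples.

    corpus is a list of (point_set, label) pairs. Probabilities are the
    vote shares among the k nearest samples, a simplification.
    """
    neighbors = sorted(corpus, key=lambda item: set_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    probs = {label: count / len(neighbors) for label, count in votes.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical toy corpus: each sample is a small set of feature points.
toy_corpus = [
    ([(0.10, 0.20), (0.20, 0.10)], "L1-A"),
    ([(0.15, 0.25), (0.10, 0.10)], "L1-A"),
    ([(0.90, 0.80), (0.80, 0.90)], "L1-B"),
    ([(0.85, 0.95), (0.90, 0.90)], "L1-B"),
]

label, probs = predict_native_language([(0.12, 0.18), (0.20, 0.20)], toy_corpus, k=3)
print(label, probs)
```

Averaging nearest-point distances, rather than taking a single maximum, is one way to keep an isolated outlier point from dominating the comparison between two samples.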

    On the Similarities Between Native, Non-native and Translated Texts

    We present a computational analysis of three language varieties: native, advanced non-native, and translation. Our goal is to investigate the similarities and differences between non-native language productions and translations, contrasting both with native language. Using a collection of computational methods we establish three main results: (1) the three types of texts are easily distinguishable; (2) non-native language and translations are closer to each other than each of them is to native language; and (3) some of these characteristics depend on the source or native language, while others do not, reflecting, perhaps, unified principles that similarly affect translations and non-native language.
    Comment: ACL 2016, 12 pages

    Strategies for Representing Tone in African Writing Systems

    Tone languages provide some interesting challenges for the designers of new orthographies. One approach is to omit tone marks, just as stress is not marked in English (zero marking). Another approach is to do phonemic tone analysis and then make heavy use of diacritic symbols to distinguish the 'tonemes' (exhaustive marking). While orthographies based on either system have been successful, this may be thanks to our ability to manage inadequate orthographies rather than to any intrinsic advantage afforded by one or the other approach. In many cases, practical experience with both kinds of orthography in sub-Saharan Africa has shown that people have not been able to attain the level of reading and writing fluency that we know to be possible for the orthographies of non-tonal languages. In some cases this can be attributed to a sociolinguistic setting which does not favour vernacular literacy. In other cases, the orthography itself might be to blame. If the orthography of a tone language is difficult to use or to learn, then a good part of the reason, I believe, is that the designer either has not paid enough attention to the function of tone in the language, or has not ensured that the information encoded in the orthography is accessible to the ordinary (non-linguist) user of the language. If the writing of tone is not going to continue to be a stumbling block to literacy efforts, then a fresh approach to tone orthography is required, one which assigns high priority to these two factors. This article describes the problems with orthographies that use too few or too many tone marks, and critically evaluates a wide range of creative intermediate solutions. I review the contributions made by phonology and reading theory, and provide some broad methodological principles to guide someone who is seeking to represent tone in a writing system. The tone orthographies of several languages from sub-Saharan Africa are presented throughout the article, with particular emphasis on some tone languages of Cameroon.

    Dlùth is Inneach: Linguistic and Institutional Foundations for Gaelic Corpus Planning

    This report presents the results of a one-year research project, commissioned by Bòrd na Gàidhlig (BnG) and carried out by a Soillse research team, whose goal was to answer the following question: What corpus planning principles are appropriate for the strengthening and promotion of Scottish Gaelic, and what effective coordination would result in their implementation? The report contains the following agreed outcomes: (1) a clear and consistent linguistic foundation for Gaelic corpus planning, according with Bòrd na Gàidhlig’s acquisition, usage and status planning initiatives, and most likely to be supported by Gaelic users; (2) a programme of priorities to be addressed by Gaelic corpus planning; and (3) recommendations on a means of coordination that will be effective in terms of cost and management (i.e. an institutional framework).

    Cross-linguistic and cross-disciplinary investigation of lexical bundles in academic writing

    The present paper reviews the use of lexical bundles in academic writing from two viewpoints, linguistic and disciplinary, focusing on how academic writers from different disciplines or linguistic backgrounds construct their discourse through lexical bundles. As cohesive devices, lexical bundles are an indispensable part of the text and play a crucial role in shaping propositions, evolving the text, guiding readers through the flow of information and gaining the writer's proffered meaning. By using lexical bundles, academic writers are able to attain naturalness in their writing and create a more reader-friendly approach to the unfolding text. Bearing the significance of lexical bundles in mind, this review paper aims to examine the effect of disciplinary variation and linguistic differences on the use of lexical bundles in academic writing. Most researchers believe that the frequency as well as the use of lexical bundles differs across disciplines and from one language to another. Therefore, through a review of previous studies, the paper systematically examines the evidence for these claims. Possible limitations of previous studies are discussed and some implications for further research are presented.

    The Effect of Genre-based Instruction on the Teaching of Business Report Writing

    Master's degree in English with an Orientation in Applied Linguistics.
    The present study investigates the application of Genre Theory to the teaching of business report writing to university students. Five instructors, three raters and five groups of learners attending the fourth semester of a six-stage Business English course participated in this study. Three intact classes were randomly selected for the experimental group (EG) and two for the control group (CG). Each group was administered a pre-test requesting learners to write a business assessment report. The EG then received a four-week instruction period following the principles of a genre-based approach. The CG underwent no treatment. After the period of instruction, a post-test requesting the same task as the pre-test was administered to both groups. Three independent raters scored the reports using a five-band scale adapted for this study. The quantitative data collected from the tests were analyzed using the Wilcoxon rank-sum test and Cohen's simple unweighted coefficient. Scripts were also analyzed for code associations using the AQUAD.5 (Analysis of Qualitative Data) software. The information gathered was triangulated with the data provided by questionnaires administered to students and interviews held with instructors. Results support the hypothesis that genre-based instruction enhances the written production of Business English learners with an intermediate English proficiency.
    Affiliation: Trebucq, María Dolores. Universidad Nacional de Córdoba, Facultad de Lenguas, Argentina.

    Formulaic language -its characteristics and how it is used and acquired-
