Native language identification of fluent and advanced non-native writers
This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in April 2020, available online: https://doi.org/10.1145/3383202
The accepted version of the publication may differ from the final published version.
Native Language Identification (NLI) aims to identify the native languages of authors by analyzing text samples they have written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific/social-network features and may not generalize to other domains and datasets, (ii) are unable to capture the variations of language-usage patterns within a text sample, and (iii) lack any outlier-handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on this feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply the probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language, namely English, French, and German. Our experimental studies show that our solution outperforms competitive methods and reports more than 80% accuracy across languages.
Research funded by Higher Education Commission, and Grants for Development of New Faculty Staff at Chulalongkorn University | Digital Economy Promotion Agency (# MP-62-0003) | Thailand Research Funds (MRG6180266 and MRG6280175).
Published versio
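The retrieve-then-classify step described in the abstract above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the article: each text sample is reduced to a single feature vector rather than a point set, similarity is plain Euclidean distance, and the "probabilistic" output is simply the fraction of votes per language among the top-k neighbours. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def top_k_neighbors(query, corpus, k):
    """Return the k (distance, label) pairs closest to `query`.

    Assumption: each sample is one feature vector; the article instead
    represents a sample as a point set with its own similarity measure.
    """
    scored = [(math.dist(query, vec), label) for vec, label in corpus]
    return sorted(scored, key=lambda pair: pair[0])[:k]


def knn_probabilities(query, corpus, k):
    """Estimate class probabilities as vote fractions among the top-k neighbours."""
    neighbors = top_k_neighbors(query, corpus, k)
    counts = Counter(label for _, label in neighbors)
    return {label: n / len(neighbors) for label, n in counts.items()}


# Toy corpus of (stylistic feature vector, native language) pairs.
corpus = [
    ((0.0, 0.0), "fr"),
    ((0.1, 0.0), "fr"),
    ((1.0, 1.0), "de"),
    ((0.9, 1.0), "de"),
    ((0.2, 0.1), "de"),
]

probs = knn_probabilities((0.0, 0.1), corpus, k=3)
predicted = max(probs, key=probs.get)
```

Restricting the vote to the k closest samples is what keeps distant outliers from influencing the prediction in this simplified variant.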
On the Similarities Between Native, Non-native and Translated Texts
We present a computational analysis of three language varieties: native,
advanced non-native, and translation. Our goal is to investigate the
similarities and differences between non-native language productions and
translations, contrasting both with native language. Using a collection of
computational methods we establish three main results: (1) the three types of
texts are easily distinguishable; (2) non-native language and translations are
closer to each other than each of them is to native language; and (3) some of
these characteristics depend on the source or native language, while others do
not, reflecting, perhaps, unified principles that similarly affect translations
and non-native language.
Comment: ACL2016, 12 pages
Strategies for Representing Tone in African Writing Systems
Tone languages provide some interesting challenges for the designers of new orthographies.
One approach is to omit tone marks, just as stress is not marked in English (zero marking).
Another approach is to do phonemic tone analysis and then make heavy use of diacritic
symbols to distinguish the 'tonemes' (exhaustive marking). While orthographies based on
either system have been successful, this may be thanks to our ability to manage inadequate
orthographies rather than to any intrinsic advantage which is afforded by one or the other
approach. In many cases, practical experience with both kinds of orthography in sub-Saharan
Africa has shown that people have not been able to attain the level of reading and writing
fluency that we know to be possible for the orthographies of non-tonal languages. In some
cases this can be attributed to a sociolinguistic setting which does not favour vernacular
literacy. In other cases, the orthography itself might be to blame. If the orthography of a tone
language is difficult to use or to learn, then a good part of the reason, I believe, is that the
designer either has not paid enough attention to the function of tone in the language, or has
not ensured that the information encoded in the orthography is accessible to the ordinary
(non-linguist) user of the language. If the writing of tone is not going to continue to be a
stumbling block to literacy efforts, then a fresh approach to tone orthography is required, one
which assigns high priority to these two factors.
This article describes the problems with orthographies that use too few or too many tone
marks, and critically evaluates a wide range of creative intermediate solutions. I review the
contributions made by phonology and reading theory, and provide some broad methodological
principles to guide someone who is seeking to represent tone in a writing system. The tone
orthographies of several languages from sub-Saharan Africa are presented throughout the
article, with particular emphasis on some tone languages of Cameroon.
Dlùth is Inneach: Linguistic and Institutional Foundations for Gaelic Corpus Planning
This report presents the results of a one-year research project, commissioned by Bòrd na Gàidhlig (BnG) and carried out by a Soillse Research team, whose goal was to answer the following question:
What corpus planning principles are appropriate for the strengthening and promotion of Scottish Gaelic, and what effective coordination would result in their implementation?
This report contains the following agreed outcomes:
a clear and consistent linguistic foundation for Gaelic corpus planning, according with Bòrd na Gàidhlig’s acquisition, usage and status planning initiatives, and most likely to be supported by Gaelic users.
a programme of priorities to be addressed by Gaelic corpus planning.
recommendations on a means of coordination that will be effective in terms of cost and management (i.e. an institutional framework).
Cross-linguistic and cross-disciplinary investigation of lexical bundles in academic writing
The present paper reviews the use of lexical bundles in academic writing from two viewpoints, linguistic and disciplinary, focusing on how academic writers from different disciplines or linguistic backgrounds construct their discourse through lexical bundles. As cohesive devices, lexical bundles are an indispensable part of the text and play a crucial role in shaping propositions, developing the text, guiding readers through the flow of information, and conveying the meaning the writer intends. By using lexical bundles, academic writers can attain naturalness in their writing and make the unfolding text more reader-friendly. Bearing the significance of lexical bundles in mind, this review paper examines the effect of disciplinary variation and linguistic differences on the use of lexical bundles in academic writing. Most researchers believe that both the frequency and the use of lexical bundles differ across disciplines and from one language to another. The paper therefore systematically reviews previous studies for evidence supporting these claims. Possible limitations of previous studies are discussed and some implications for further research are presented.
The Effect of Genre-based Instruction on the Teaching of Business Report Writing
Master's in English with an Orientation in Applied Linguistics
The present study investigates the application of Genre Theory to the teaching of
business report writing to university students. Five instructors, three raters and five
groups of learners attending the fourth semester of a six-stage Business English course
participated in this study. Three intact classes were randomly selected for the
experimental group (EG) and two for the control group (CG).
Each group was administered a pre-test requesting learners to write a business
assessment report. The EG then received a four-week instruction period following the
principles of a genre-based approach. The CG underwent no treatment. After the period
of instruction, a post-test requesting the same task as the pre-test was administered to
both groups.
Three independent raters scored the reports using a five-band scale adapted for
this study. The quantitative data collected from the tests were analyzed using the
Wilcoxon rank-sum test and Cohen's simple unweighted coefficient.
Scripts were also analyzed for code associations using the AQUAD.5 (Analysis of
Qualitative Data) software. The information gathered was triangulated with the data
provided by questionnaires administered to students and interviews held with
instructors.
Results support the hypothesis that genre-based instruction enhances the written
production of Business English learners with an intermediate English proficiency.
Affiliation: Trebucq, María Dolores. Universidad Nacional de Córdoba. Facultad de Lenguas, Argentina.