
    Investigating the native speaker phenomenon – a pilot corpus study of native and non-native writing

    The aim of this report is to provide a preliminary account of an investigation of two general corpora of written English, originally prompted by interest in an analytical tool designed to assess the propositional density of utterances produced by learners of English. Since corpora of written language are easier to obtain than corpora of spoken language, the procedure was honed and fine-tuned on a written corpus, with the eventual aim of investigating spoken utterances in order to validate a scoring procedure. Propositional density was envisaged at the outset of the study as an instrumental factor in determining the relative merit of an assortment of samples. A computer program called CPIDR (Computerized Propositional Idea Density Rater, pronounced “spider”) involves a relatively straightforward procedure and produces results which are easy to interpret for most purposes.
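    The kind of computation a rater like CPIDR performs can be illustrated with a short sketch. The Python fragment below is a minimal approximation, not CPIDR's actual rule set: it treats verbs, adjectives, adverbs, prepositions and conjunctions as proposition-bearing (a standard simplification in the idea-density literature) and assumes part-of-speech tags are supplied by some external tagger.

        # A minimal approximation of propositional idea density: count
        # proposition-bearing parts of speech and divide by the word count.
        # The Penn Treebank tag prefixes below are an assumption, not
        # CPIDR's own rules.
        PROPOSITION_TAGS = ("VB", "JJ", "RB", "IN", "CC")

        def idea_density(tagged_tokens):
            """Propositions per word for a list of (word, POS tag) pairs."""
            words = [w for w, _ in tagged_tokens if w.isalpha()]
            propositions = [t for w, t in tagged_tokens
                            if w.isalpha() and t.startswith(PROPOSITION_TAGS)]
            return len(propositions) / len(words) if words else 0.0

        # A pre-tagged utterance; the tags could come from any POS tagger.
        sample = [("The", "DT"), ("old", "JJ"), ("dog", "NN"),
                  ("slept", "VBD"), ("quietly", "RB")]
        print(f"idea density: {idea_density(sample):.2f}")  # 3/5 = 0.60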

    Linguistic annotation in/for corpus linguistics

    This article surveys linguistic annotation in corpora and corpus linguistics. We first define the concept of 'corpus' as a radial category and then, in Section 2, discuss a variety of kinds of information for which corpora are annotated and that are exploited in contemporary corpus linguistics. Section 3 then exemplifies many current formats of annotation with an eye to highlighting both the diversity of formats currently available and the emergence of XML annotation as, for now, the most widespread form of annotation. Section 4 summarizes and concludes with desiderata for future developments.
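    To make token-level XML annotation of the kind the survey discusses concrete, the sketch below reads a small annotated fragment with Python's standard library. The element and attribute names (s, w, pos, lemma) are illustrative placeholders rather than any particular corpus's schema.

        # Reading a token-level XML annotation fragment; the schema here
        # (<s>, <w>, pos, lemma) is invented for illustration.
        import xml.etree.ElementTree as ET

        xml_fragment = """
        <s id="s1">
          <w pos="DT" lemma="the">The</w>
          <w pos="NN" lemma="corpus">corpus</w>
          <w pos="VBZ" lemma="be">is</w>
          <w pos="VBN" lemma="annotate">annotated</w>
        </s>
        """

        sentence = ET.fromstring(xml_fragment)
        for w in sentence.iter("w"):
            print(w.text, w.get("pos"), w.get("lemma"))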

    Tagging a Norwegian Speech Corpus

    Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007. Editors: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit. University of Tartu, Tartu, 2007. ISBN 978-9985-4-0513-0 (online) ISBN 978-9985-4-0514-7 (CD-ROM) pp. 245-248

    Steps for Creating two Persian Specialized Corpora

    Currently, most linguistic studies benefit from valid linguistic data available in corpora, and compiling corpora is a common practice in linguistic research. The present study introduces two specialized corpora in Persian; a specialized corpus is used to study a particular type of language or language variety. To build such corpora, a set of texts was first compiled based on pre-established sampling criteria (including the mode, type, domain, language or language variety, and date of the texts). The corpora are specialized because they include technical terms in information processing and management, librarianship, linguistics, computational linguistics, thesaurus building, management, policy-making, natural language processing, information technology, information retrieval, ontology and other related interdisciplinary domains. After compiling the data and metadata, the texts were preprocessed (normalized and tokenized) and annotated (automated POS tagging); finally, the tags were checked manually. Each corpus includes more than four million words. Since few specialized corpora have been built for Persian, these corpora can be considered valuable resources for researchers interested in studying linguistic variation in Persian interdisciplinary texts. https://dorl.net/dor/20.1001.1.20088302.2022.20.4.14
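    The preprocessing steps the authors list (normalization, tokenization, automated tagging followed by manual checking) can be sketched roughly as below. The character mappings and the word-and-punctuation tokenizer are deliberate simplifications, and the tagger is a placeholder rather than the Persian POS tagger the project actually used.

        # A rough sketch of the normalize -> tokenize -> tag pipeline;
        # the details are assumptions, not the corpus project's own tools.
        import re
        import unicodedata

        # One common Persian normalization step: unify Arabic-script variants.
        CHAR_MAP = {"\u064A": "\u06CC",   # Arabic yeh -> Persian yeh
                    "\u0643": "\u06A9"}   # Arabic kaf -> Persian kaf

        def normalize(text):
            text = unicodedata.normalize("NFC", text)
            return "".join(CHAR_MAP.get(ch, ch) for ch in text)

        def tokenize(text):
            # Split into words and punctuation (a crude approximation).
            return re.findall(r"\w+|[^\w\s]", normalize(text))

        def tag(tokens):
            # Placeholder for the automated POS tagger, whose output the
            # abstract says was then checked manually.
            return [(tok, "UNK") for tok in tokens]

        print(tag(tokenize("این یک متن آزمایشی است.")))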

    We All Make Mistakes! Analysing an Error-coded Corpus of Spanish University Students' Written English

    The present study analyses the errors identified and coded in the written argumentative texts of 304 Spanish university students of English, extracted from two corpora: one from a technical university context (totalling 950 written compositions) and the other from learners enrolled in the Humanities (totalling 750 written compositions). Considered an important design criterion for computer learner corpora studies, the students' levels were measured using the Oxford Quick Placement Test, and the scores obtained (0 to 60) were then related to the CEFR (Common European Framework of Reference for Languages) levels, ranging from A1 to C2. Learners writing in a foreign language not only make errors related to grammar and vocabulary, but also with regard to their competence in the use of syntax, discourse relations and pragmatics, among others, and the error coding system has been designed to address all the possible levels of error with as many sub-categories as required. Within the field of applied linguistics and language teaching/learning, many studies have been carried out over the years to address the phenomenon of interlanguage errors made by learners of English (Dušková, 1969; Green & Hecht, 1985; Lennon, 1991; Olsen, 1999, among many others). Previously, these studies involved analysing a small number of texts with a limited number of tags, based on either linguistic taxonomies or surface structure categories of errors (Dulay, Burt, & Krashen, 1982). However, in the last three decades, technological advances have facilitated the analysis of much larger amounts of data using computers, both for the development of learner corpora and for programs that allow a more detailed analysis of the learner data. The aim of the present research is two-fold. Firstly, we explore the nature of the errors coded in the corpus, i.e. which errors are most frequent, including not only the main categories but also the most delicate levels of errors. Secondly, we address the question of the relationship, if any, between the learners' competence levels and the type and frequency of the errors they make. The results show that grammar errors are the most frequent, and that the linguistic competence of the learners has a lower than expected influence on the most frequent types of errors coded in the corpus.

    We would like to acknowledge the support given to the TREACLE Project by the Spanish Ministry of Economy and Competitiveness (FFI2009-14436/FILO). The author would also like to express her gratitude to Mick O'Donnell and Susana Murcia for their very useful comments on the first draft of this article.

    Mac Donald, P. (2016). We All Make Mistakes! Analysing an Error-coded Corpus of Spanish University Students' Written English. Complutense Journal of English Studies, 24, 103-129. https://doi.org/10.5209/CJES.53273
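    The two quantitative steps the abstract describes, relating placement-test scores to CEFR levels and tallying error tags by level, can be sketched as follows. The score bands and error codes in this fragment are invented for illustration; they are not the study's published cut-offs or its actual coding scheme.

        # Mapping 0-60 placement scores to CEFR bands and counting error
        # tags per band. Bands and tags below are illustrative assumptions.
        from collections import Counter

        CEFR_BANDS = [(18, "A1"), (30, "A2"), (40, "B1"),
                      (48, "B2"), (55, "C1"), (61, "C2")]

        def cefr_level(score):
            for upper, level in CEFR_BANDS:
                if score < upper:
                    return level
            raise ValueError("score out of range")

        # Each learner: (placement score, error codes found in their texts).
        learners = [(25, ["GRAM-AGR", "LEX-CHOICE"]),
                    (44, ["GRAM-TENSE", "GRAM-AGR", "DISC-COHESION"])]

        errors_by_level = {}
        for score, errors in learners:
            counts = errors_by_level.setdefault(cefr_level(score), Counter())
            counts.update(errors)

        for level, counts in sorted(errors_by_level.items()):
            print(level, counts.most_common())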

    Preliminary Study of Validating Vocabulary Selection and Organization of a Manual Communication Board in Malay

    An integral component of a language-based augmentative and alternative communication (AAC) system is providing vocabulary typical of fluent native language speakers. In the absence of reliable and valid research on Malay vocabulary for AAC, this descriptive study explored the validation process of vocabulary selection and organization for a 144-location manual communication board. An hour of aided language samples (talking while pointing to a prototype display), followed by self-administered surveys, was gathered from four typical native Malay speakers (n=4), aged between 22 and 36 years, at the University of Pittsburgh. Vocabulary frequency, word commonality, and overall perceptions of and feedback on the prototype display were compiled and analyzed. A total of 1112 word tokens and 454 word types were analyzed to support preliminary validation of the selected vocabulary and word organization of the prototype. Approximately 40% of the words on the display were used during the interview, and the top 20 words are reported. Findings also suggest the importance of considering morphology and syntax at early design stages. The usability survey confirmed a positive overall perception of the display, including the vocabulary selection and its cultural and ethnic appropriateness, and yielded suggestions for system improvement. Minimal rearrangement of the icon display is needed to improve the usability of the system. The study findings thus support the early Malay manual communication board for AAC intervention. However, given the limited sample size, additional research is required to support a final display that optimizes the vocabulary and morphosyntactic organization of a manual communication board in Malay.
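    The frequency measures reported here (word tokens, word types, and the share of display vocabulary actually used) are straightforward to compute; a minimal sketch follows. The tiny display vocabulary and the Malay sample tokens are invented for illustration.

        # Token count, type count, and display-vocabulary coverage for an
        # aided language sample; all data below is invented.
        from collections import Counter

        display_vocab = {"saya", "mahu", "makan", "minum", "pergi", "rumah"}
        sample_tokens = ["saya", "mahu", "makan", "saya", "pergi"]

        tokens = len(sample_tokens)                  # word tokens
        types = len(set(sample_tokens))              # word types
        used = display_vocab & set(sample_tokens)    # display words used
        coverage = len(used) / len(display_vocab)

        print(f"{tokens} tokens, {types} types, "
              f"{coverage:.0%} of the display vocabulary used")
        print(Counter(sample_tokens).most_common())  # top words, cf. the top 20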

    Building the Arabic Learner Corpus and a System for Arabic Error Annotation

    Recent developments in learner corpora have highlighted the growing role they play in linguistic and computational research areas such as language teaching and natural language processing. However, there is a lack of a well-designed Arabic learner corpus that can be used for studies in these research areas. This thesis aims to introduce a detailed and original methodology for developing a new learner corpus. This methodology, which represents the major contribution of the thesis, includes a combination of resources, proposed standards and tools developed for the Arabic Learner Corpus project. The resources include the Arabic Learner Corpus (ALC), the largest learner corpus for Arabic based on systematic design criteria, and the Error Tagset of Arabic, designed for annotating errors in Arabic and covering 29 types of errors under five broad categories. The Guide on Design Criteria for Learner Corpus, an example of the proposed standards, was created based on a review of previous work and focuses on 11 aspects of corpus design criteria. The tools include the Computer-aided Error Annotation Tool for Arabic, which provides functions that facilitate error annotation, such as smart selection and auto-tagging, and the ALC Search Tool, developed to enable searching the ALC and downloading the source files based on a number of determinants. The project successfully recruited 992 people, including language learners, data collectors, evaluators, annotators and collaborators, from more than 30 educational institutions in Saudi Arabia and the UK. The data of the Arabic Learner Corpus has been used in a number of projects for different purposes, including error detection and correction, native language identification, evaluation of Arabic analysers, applied linguistics studies and data-driven Arabic learning. The use of the ALC highlights the importance of continuing to develop this project.
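    One way to picture the output of an error-annotation tool of the kind the thesis describes is a stand-off record pointing into the learner text. The field names and the tag in this sketch are illustrative assumptions; they do not reproduce the thesis's actual Error Tagset of Arabic.

        # A stand-off error annotation record; the field names and the tag
        # "ORT-HAMZA" are illustrative, not the actual Arabic tagset.
        from dataclasses import dataclass

        @dataclass
        class ErrorAnnotation:
            start: int        # character offset of the erroneous span
            end: int          # end offset (exclusive)
            tag: str          # an error type from the tagset
            correction: str   # the annotator's suggested correction

        text = "ذهبت الي المدرسة"   # learner sentence with a spelling error
        ann = ErrorAnnotation(start=5, end=8, tag="ORT-HAMZA",
                              correction="إلى")

        print(text[ann.start:ann.end], "->", ann.correction, f"[{ann.tag}]")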