13 research outputs found

    Error Analysis of Czech Written Expression of the Romani Pupils in the 9th Grade of the Secondary Practical Schools Based on the Corpora ROMi

    Dissertation: Error Rate in the Written Expression of Romani Pupils in the 9th Grade of Practical Primary Schools Based on the Electronic Databank ROMi. Zuzanna Bedřichová, ÚČJTK FFUK, Prague 2014.
    The dissertation deals with errors in the written expression of Romani pupils in the 9th grade of practical primary schools. It analyses 130 school assignments by these pupils that are available in the ROMi databank (a databank of the written and spoken Czech of Romani children and youth). The aim was to produce a qualitative-quantitative error analysis of the texts that could serve as a basis for pedagogical purposes (e.g. the development of teaching materials). The author therefore compiled her own scheme of tracked errors and used it to carry out a manual error analysis. The proposed scheme tracks a range of phenomena, above all the influence of spoken language on the texts, which appears to be one of the main features of their errors. Besides the error analysis, the author provides detailed information about the ROMi databank and other annotation schemes, a general linguistic characterization of the texts (i.e. phenomena that are not, or need not be, errors, such as the discourse structure of the texts, emotive expression, the influence of the Romani ethnolect of Czech, the opposite effort to adhere to the norm for written text, fluctuation between observing and violating the norm, etc.) and ...
    English Summary: The study focuses on errors in the written expression of Romani pupils in the 9th grade of secondary practical schools (schools for children with special educational needs). It analyses 130 written school assignments by these pupils, which are available through the ROMi database (a database of written and spoken Czech produced by children and youth of Romani origin). The author offers a new and elaborate error-analysis scheme and qualitative-quantitative analyses of the pupils' written accounts. Besides these analyses, the study outlines the current situation regarding the education of Romani children in the Czech language, the Romani ethnolect of Czech, and spoken language as a source of stigmatisation. It also provides details about the ROMi database, the 130 original written accounts in full, and practical proposals for compensating for the errors.
    Institute of Czech Language and Theory of Communication (Ústav českého jazyka a teorie komunikace), Faculty of Arts (Filozofická fakulta)

    ROMi 1.0

    ROMi is a specific subcorpus of CZESL (Czech as a Second Language). It collects samples of the spoken and written language of Czech Romani children and teenagers. The material exceeds 1.5 million words.

    Language Material
    The material documents the spoken language of a specific group of Romani speakers who use Czech as their first language. This form of the language nevertheless differs from the Czech of the Czech-speaking majority, both in speech and, secondarily, in writing. It is the so-called Romani ethnolect of Czech, i.e. a variety of Czech used by Romani communities, mainly in the Czech Republic, which shows a clear influence of Romani, Slovak and Hungarian. Furthermore, many of the recorded speakers live in social exclusion, so their language production is influenced by both factors: the Romani ethnolect and social exclusion.
    The language material was collected in 2009–2012 under the Education for Competitiveness Operational Programme, within the project Innovations of Czech as a Second Language Education, jointly by the Technical University of Liberec and the Institute of Czech Language and Theory of Communication, Faculty of Arts, Charles University. It was processed with the support of the Institute of Formal and Applied Linguistics (project LINDAT-Clarin).
    The corpus comprises 110 recordings obtained in various environments: material was collected both in schools and in several non-profit organizations offering leisure-time activities to Romani students. Apart from the school setting, the recordings thus come from extracurricular activities, sports matches and households. Both the respondents and the collectors are Romani. Samples were acquired in all regions of the Czech Republic, although most recordings were obtained in the Central Bohemia, South Bohemia, Ústí and Vysočina regions. The respondents are 12 to 28 years old.
    The collected samples are accompanied by metadata relating to the following areas:
    • The place of origin: the place of collection, the size of the residence and dialect area, region, environment (school, extracurricular, private), socially excluded locality.
    • The circumstances of the collection, expressing the extent of control exercised by the collector (topic assigned/non-assigned).
    • The respondent: the age of the student; class/year; sex; type of school; subjective knowledge of Romani; first language (the one the student considers to be their first); communicative environment in the family (which language or languages are used for communication in the family).
    • The place of data collection: for schools, the metadata give the type of school (primary, for students with special needs, remedial, vocational, secondary) and the founder (state, church, private organisation); for individual collection, they give the organisation, interest group, etc.
    • The collector: the abbreviation of the collector's name and work area, in some cases also age.

    Delimiting the group of respondents
    The respondents are students of primary schools, schools for students with special needs and secondary schools, and teenagers who have just completed compulsory education. For the purposes of the collection, students who consider themselves Romani or who are considered Romani by others were included in the sample. A language criterion was added to this definition, so students in whose families Romani is spoken at home were also included. Active knowledge of Romani was not required, since barely a third of Romani children living in the Czech Republic today are competent in the language.

    Ethical aspects of the data collection and processing
    The content of the language material places ethical demands on data processing. The texts and recordings frequently contain highly interesting material; the respondents recount life stories that are remote from, or inconceivable to, the social majority. During transcription, all materials are anonymized and identifying data are removed.

    Field Research
    In an environment threatened by social exclusion, it is especially important to consider the needs and opportunities of the group members, as well as the needs of those who find themselves or work in such an environment. While developing the corpus, we became convinced that it is necessary to relax the demands on the technical quality of texts and recordings and not to overburden the respondents or the collectors with limiting or impossible requirements. The corpus therefore includes several recordings of lower technical quality, acquired in the presence of other people, with the television turned on, etc. Under different circumstances, some of these recordings would not have come into existence at all: it is natural that younger children were interviewed in their own households, in the presence of their parents. Others would have been made, but the unnaturalness of the situation would have affected the language material. Apart from the interviews with younger children, this concerns especially conversations between the collectors and their peers, e.g. in leisure-time clubs.

    Characteristics of the recordings
    The recordings come both from the school environment (especially conversations between teaching assistants and individual students) and from leisure-time facilities (interest groups, after-school tutoring). Most are conversations between the collector and one respondent, or sometimes a pair of respondents. The length of the recordings varies, although most last 20 to 35 minutes. A single recording contains approximately 2,495 words. The quality of the recordings is influenced by the limits of technology usable in the field and by the effort to maximize authenticity.

    Transcription of the recordings
    The transcription rules are based on those designed for the SCHOLA corpus. The recordings are transcribed using folkloristic transcription, i.e. the form closest to the written record, adapted for computational processing and following the practice established in the Czech National Corpus. Transcription is performed with the Transcriber program, which links the sound track and the graphic track.

    AKCES 4

    Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora).

    AKCES 5 (CzeSL-SGT)

    Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. It extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) with texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text.

    AKCES 5 (CzeSL-SGT) Release 2

    Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. It extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) with texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text. In addition to fixing a few minor bugs, Release 2 fixes a critical issue in Release 1: native speakers of Ukrainian (s_L1:"uk") were wrongly labelled as speakers of "other European languages" (s_L1_group="IE") instead of speakers of a Slavic language (s_L1_group="S"). The file is now a regular XML document, with all annotation represented as XML attributes.
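    The labelling issue described for Release 1 is the kind of inconsistency that can be checked mechanically once the annotation is stored as XML attributes. The following is a minimal sketch in Python, assuming the metadata appear as attributes named s_L1 and s_L1_group, as quoted in the description. The element names (corpus, doc), the id attribute and the sample records are hypothetical and are not taken from the actual release.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names, ids and values are illustrative only
# and do not reproduce the structure of the real CzeSL-SGT file.
SAMPLE_XML = """<corpus>
  <doc id="a" s_L1="uk" s_L1_group="S"/>
  <doc id="b" s_L1="uk" s_L1_group="IE"/>
  <doc id="c" s_L1="de" s_L1_group="IE"/>
</corpus>"""


def mislabelled_ukrainian(root):
    """Return elements whose s_L1 is "uk" but whose s_L1_group is not "S"."""
    return [
        el for el in root.iter()
        if el.get("s_L1") == "uk" and el.get("s_L1_group") != "S"
    ]


root = ET.fromstring(SAMPLE_XML)
bad_ids = [el.get("id") for el in mislabelled_ukrainian(root)]
print(bad_ids)
```

    On a file without the Release 1 problem, a check of this kind would be expected to return an empty list.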

    AKCES-GEC Grammatical Error Correction Dataset for Czech

    AKCES-GEC is a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Compared with the CZESL-GEC dataset, this dataset contains separated edits together with their type annotations in M2 format and has twice as many sentences. If you use this dataset, please use the following citation:
    @article{naplava2019wnut, title={Grammatical Error Correction in Low-Resource Scenarios}, author={N{\'a}plava, Jakub and Straka, Milan}, journal={arXiv preprint arXiv:1910.00353}, year={2019}}
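    For readers unfamiliar with M2, the format stores each sentence as an "S" line of space-separated tokens followed by "A" lines, one per edit, whose fields are separated by "|||": token span, error type, correction, two further fields, and the annotator id. The following is a minimal sketch of a reader in Python. The sample sentences and the error type label SPELL are made up for illustration and are not taken from AKCES-GEC, whose actual type inventory is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    start: int        # token offset where the edit begins (inclusive)
    end: int          # token offset where the edit ends (exclusive)
    error_type: str
    correction: str
    annotator: int


@dataclass
class Sentence:
    tokens: list
    edits: list = field(default_factory=list)


def parse_m2(text):
    """Parse M2-formatted text into a list of Sentence objects."""
    sentences = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("S "):
            current = Sentence(tokens=line[2:].split())
            sentences.append(current)
        elif line.startswith("A ") and current is not None:
            fields = line[2:].split("|||")
            start, end = (int(n) for n in fields[0].split())
            current.edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1]))
            )
    return sentences


def apply_edits(sentence, annotator=0):
    """Return the corrected sentence according to one annotator."""
    tokens = list(sentence.tokens)
    # "noop" edits use the span -1 -1 and change nothing, so skip them.
    edits = [e for e in sentence.edits
             if e.annotator == annotator and e.start >= 0]
    # Apply from right to left so that earlier token offsets stay valid.
    for e in sorted(edits, key=lambda e: (e.start, e.end), reverse=True):
        tokens[e.start:e.end] = e.correction.split()
    return " ".join(tokens)


# Hypothetical sample in M2 format (not taken from the dataset).
SAMPLE = """S Dnes je hezky den .
A 2 3|||SPELL|||hezký|||REQUIRED|||-NONE-|||0

S To je dobře .
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
"""

for sent in parse_m2(SAMPLE):
    print(apply_edits(sent))
```

    Keeping the edits as separate records, rather than storing only the corrected sentence, is what allows each correction to carry its own type annotation.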