4 research outputs found

    CoDiAJe - the Annotated Diachronic Corpus of Judeo-Spanish: Description of a Multi-alphabetic Corpus and its Textual and Linguistic Annotations

    Judeo-Spanish differs from late 15th-century Spanish and from modern Spanish in several respects, including phonetics and phonology, morphology, syntax, and semantics, but the most visible difference is the alphabet. From the end of the 19th century, Judeo-Spanish has been written in various alphabets (Greek, Cyrillic, and especially Latin, in several versions). The Hebrew alphabet, however, had been used since ancient times and was definitively abandoned only in the 1940s, so the majority of Judeo-Spanish texts are written in Hebrew characters. CoDiAJe is an annotated diachronic corpus, developed in TEITOK, that includes documents produced from the 16th century up to the present day. Its significance is that the tool processes linguistic data in all the alphabets mentioned above, allowing users to view each text in five orthographic forms: the original version in whatever alphabet it was written, its transcription in Latin characters, an expanded form that completes abbreviations or corrects defective writing, a version in modern Judeo-Spanish, and a version in modern Spanish orthography. CoDiAJe lets the user search not only for a specific word but also for all its linguistic and orthographic variants across the different alphabets. During annotation, tags from the EAGLES tagset for Spanish were modified and new ones were created; these are steps towards an accurate tagset for Judeo-Spanish, which takes shape as the texts are annotated. The digitized texts are also enriched with semantic-conceptual information and with information on the affiliation of all non-Romance elements detected in them.

    Compiling and annotating a learner corpus for a morphologically rich language: CzeSL, a corpus of non-native Czech

    Learner corpora, linguistic collections documenting a language as used by learners, provide an important empirical foundation for language acquisition research and teaching practice. This book presents CzeSL, a corpus of non-native Czech, against the background of theoretical and practical issues in current learner corpus research. Languages with rich morphology and relatively free word order, including Czech, are particularly challenging for the analysis of learner language. The authors address both the complexity of learner error annotation, describing three complementary annotation schemes, and the complexity of describing non-native Czech in terms of standard linguistic categories. The book discusses in detail the practical aspects of corpus creation: the process of collection and annotation itself, the supporting tools, the resulting data, their formats, and search platforms. The chapter on use cases exemplifies the usefulness of learner corpora for teaching, language acquisition research, and computational linguistics. Any researcher developing learner corpora will surely appreciate the concluding chapter, which lists lessons learned and pitfalls to avoid.

    Corpus Linguistics software: Understanding their usages and delivering two new tools

    The increasing availability of computers to ordinary users in the last few decades has led to an exponential increase in the use of Corpus Linguistics (CL) methodologies. The people exploring this data come from a variety of backgrounds and, in many cases, are not proficient corpus linguists. Despite the ongoing development of new tools, there is still an immense gap between what CL can offer and what is currently being done by researchers. This study has two outcomes. It (a) identifies the gap between potential and actual uses of CL methods and tools, and (b) enhances the usability of CL software and complements statistical applications through the use of data visualization and user-friendly interfaces. The first outcome is achieved through (i) an investigation of how CL methods are reported in academic publications; (ii) a systematic observation of users of CL software as they engage in routine tasks; and (iii) a review of four well-established pieces of software used for corpus exploration. Based on the findings, two new statistical tools for CL studies with high usability were developed and implemented in an existing system, CQPweb. The Advanced Dispersion tool allows users to graphically explore how queries are distributed in a corpus, which makes it easier for users to understand the concept of dispersion. The tool also provides accurate dispersion measures. The Parlink Tool was designed primarily for beginners with an interest in translation studies and second language education. Its primary function is to make it easier for users to see possible translations for corpus queries in the parallel concordances, without the need to use external resources such as translation memories.