
    The national corpus of contemporary Welsh, 2016-2020

    The CorCenCC corpus contains over 11 million words (circa 14.4m tokens). CorCenCC is the first corpus of the Welsh language that covers all three aspects of contemporary Welsh: spoken, written and electronically mediated (e-language). It offers a snapshot of the Welsh language across a range of contexts of use, e.g. private conversations, group socialising, business and other work situations, in education, in the various published media, and in public spaces. It includes examples of news headlines, personal and professional emails and correspondence, academic writing, formal and informal speech, blog posts and text messaging. Language data was sampled from a range of different speakers and users of Welsh, from all regions of Wales, of all ages and genders, with a wide range of occupations, and with a variety of linguistic backgrounds (e.g. how they came to speak Welsh), to reflect the diversity of text types and of Welsh speakers found in contemporary Wales. In this way, the CorCenCC corpus provides the means for empowering users of Welsh to better understand and observe the language across diverse settings, and creates a solid evidence base for the teaching of contemporary Welsh to those who aspire to use it. Over time, the corpus has the potential to make a significant contribution to the transformation of Welsh as the language of public, commercial, education and governmental discourse. Beta versions of some bilingual corpus query tools have also been created as part of the CorCenCC project (see Related Resources). These include simple query, full query, frequency list, n-gram, keyword and collocation functionalities. The CorCenCC website also contains Y Tiwtiadur, a collection of data-driven teaching and learning tools designed to help supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based exercises: Gap Filling (Cloze), Vocabulary Profiler, Word Identification and Word-in-Context.
To access this tool, see Related Resources.

CorCenCC is an interdisciplinary and multi-institutional project that has created a large-scale, open-source corpus of contemporary Welsh. A corpus, in this context, is a collection of spoken, written and/or e-language examples from real-life contexts that allows users to identify and explore language as it is actually used, rather than relying on intuition or prescriptive accounts of how it ‘should’ be used. Corpora let us investigate how we use language across different genres and communicative mediums (i.e. spoken, written or digital), and how it varies according to the speaker/writer and the communicative purpose. This evidence-based approach is used by academic researchers, lexicographers, teachers, language learners, assessors, resource developers, policy makers, publishers, translators and others, and is essential to the development of technologies such as predictive text production, word processing tools, machine translation, speech recognition and web search tools. CorCenCC will provide societal, economic and academic benefits by: (1) Facilitating uses of Welsh in public, commercial, educational and governmental settings. (2) Redefining the scope, relevance and design infrastructure of corpus development methodology. CorCenCC is open-source and publicly accessible, with user interfaces for specific groups. It enables, for example, community users to investigate dialect variation or idiosyncrasies of their own language use; professional users to profile texts for readability or develop digital language tools; language learners to learn from real-life models of Welsh; and researchers to investigate patterns of language use and change. The project team comprised experts in corpus linguistics, Welsh, and language pedagogy and assessment, who specialise in the application of linguistic tools to real-world issues.
Working with an advisory body of stakeholder representatives, they were optimally placed to meet the project aims: creating a permanent, sustainable and fit-for-purpose record of the living language, and pioneering an approach to content generation and user-driven applications that will provide a model for future corpus creation.

    Proceedings of the OHBM Brainhack 2021

    The global pandemic presented new challenges and opportunities for organizing conferences, and OHBM 2021 was no exception. The OHBM Brainhack is an event that occurs just prior to the OHBM meeting, typically in-person, where scientists of all levels of expertise and interest gather to work and learn together for a few days in a collaborative hacking-style environment on projects of common interest (1). Building off the success of the OHBM 2020 Hackathon (2), the 2021 Open Science Special Interest Group came together online to organize a large coordinated Brainhack event that would take place over the course of 4 days. The OHBM 2021 Brainhack event was organized along two guiding principles. The first was to provide a highly inclusive collaborative environment for interaction between scientists across disciplines and levels of expertise to push forward important projects that need support, also known as the “Hack-Track” of the Brainhack. The second aim of the OHBM Brainhack is to empower scientists to improve the quality of their scientific endeavors by providing high-quality hands-on training on best practices in open-science approaches. This is best exemplified by the training events provided by the “Train-Track” at the OHBM 2021 Brainhack. Here, we briefly explain both of these elements of the OHBM 2021 Brainhack, before continuing on to the Brainhack proceedings.