23 research outputs found

    Transcribe : a web-based linguistic transcription tool

    Transcribe: a Web-Based Linguistic Transcription Tool

    CLARIN. The infrastructure for language resources

    CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors, representing various fields from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future. The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU).

    CLARIN

    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium.

    Wiktionary: The Metalexicographic and the Natural Language Processing Perspective

    Dictionaries are the main reference works for our understanding of language. They are used by humans and likewise by computational methods. So far, the compilation of dictionaries has almost exclusively been the profession of expert lexicographers. The ease of collaboration on the Web and the rise of initiatives for collecting open-licensed knowledge, such as Wikipedia, have given rise to a new type of dictionary that is voluntarily created by large communities of Web users. This collaborative construction approach presents a new paradigm for lexicography that poses new research questions to dictionary research on the one hand and provides a very valuable knowledge source for natural language processing applications on the other hand. The subject of our research is Wiktionary, which is currently the largest collaboratively constructed dictionary project. In the first part of this thesis, we study Wiktionary from the metalexicographic perspective. Metalexicography is the scientific study of lexicography, including the analysis and criticism of dictionaries and lexicographic processes. To this end, we discuss three contributions related to this area of research: (i) We first provide a detailed analysis of Wiktionary and its various language editions and dictionary structures. (ii) We then analyze the collaborative construction process of Wiktionary. Our results show that the traditional phases of the lexicographic process do not apply well to Wiktionary, which is why we propose a novel process description that is based on the frequent and continual revision and discussion of the dictionary articles and the lexicographic instructions. (iii) We perform a large-scale quantitative comparison of Wiktionary and a number of other dictionaries regarding the covered languages, lexical entries, word senses, pragmatic labels, lexical relations, and translations.
We conclude the metalexicographic perspective by finding that the collaborative Wiktionary is not an appropriate replacement for expert-built dictionaries due to its inconsistencies, quality flaws, one-size-fits-all approach, and strong dependence on expert-built dictionaries. However, Wiktionary's rapid and continual growth, its high coverage of languages, newly coined words, domain-specific vocabulary and non-standard language varieties, as well as the kind of evidence based on the authors' intuition provide promising opportunities for both lexicography and natural language processing. In particular, we find that Wiktionary and expert-built wordnets and thesauri contain largely complementary entries. In the second part of the thesis, we study Wiktionary from the natural language processing perspective with the aim of making its linguistic knowledge available for computational applications. Such applications require vast amounts of high-quality structured data. Expert-built resources have been found to suffer from insufficient coverage and high construction and maintenance costs, whereas fully automatic extraction from corpora or the Web often yields resources of limited quality. Collaboratively built encyclopedias present a viable solution, but do not cover linguistically oriented knowledge well, as it is found in dictionaries. That is why we propose extracting linguistic knowledge from Wiktionary, which we achieve by the following three main contributions: (i) We propose the novel multilingual ontology OntoWiktionary that is created by extracting and harmonizing the weakly structured dictionary articles in Wiktionary. A particular challenge in this process is the ambiguity of semantic relations and translations, which we resolve by automatic word sense disambiguation methods. (ii) We automatically align Wiktionary with WordNet 3.0 at the word sense level.
The largely complementary information from the two dictionaries yields an aligned resource with higher coverage and an enriched representation of word senses. (iii) We represent Wiktionary according to the ISO standard Lexical Markup Framework, which we adapt to the peculiarities of collaborative dictionaries. This standardized representation is of great importance for fostering the interoperability of resources and hence the dissemination of Wiktionary-based research. To this end, our work presents a foundational step towards the large-scale integrated resource UBY, which facilitates unified access to a number of standardized dictionaries by means of a shared web interface for human users and an application programming interface for natural language processing applications. A user can, in particular, switch between and combine information from Wiktionary and other dictionaries without completely changing the software. Our final resource and the accompanying datasets and software are publicly available and can be employed for many different natural language processing applications. The resource particularly fills the gap between the small expert-built wordnets and the large amount of encyclopedic knowledge from Wikipedia. We provide a survey of previous work using Wiktionary, and we exemplify the usefulness of our work in two case studies on measuring verb similarity and detecting cross-lingual marketing blunders, which make use of our Wiktionary-based resource and the results of our metalexicographic study. We conclude the thesis by emphasizing the usefulness of collaborative dictionaries when combined with expert-built resources, a combination that holds much untapped potential.

    Corpus-based Lexicography for Sesotho

    For centuries, dictionaries were compiled based upon the knowledge of the lexicographer and information retrieved from manually consulted sources, mainly through a process of reading and marking. This approach meant that much of the information used in the dictionary relied upon the knowledge of the lexicographer. It is vital to rely on the lexicographer’s knowledge of the language but this has its shortcomings, since there is no single individual who knows all the words or terms, their meanings and usage, the words they combine with, and so on, in a specific language. The utilization of this method left room for errors and omissions because the lexicographer could easily overlook some words due to factors like time, fatigue, limited knowledge of the lexicographer, etc. Important words, for example words likely to be looked for by the target users of the dictionary, could accidentally be omitted. In the 1980s, the corpus era was born and the lexicography field changed forever. Collins COBUILD in Birmingham spearheaded this era with the publication of the first corpus-based dictionary, the Collins COBUILD Dictionary in 1987. Since the corpus era began, lexicographers no longer rely solely on their knowledge of the language, intuition, or the limited information gathered from available written sources, which are very limited for African languages. The corpus allows the lexicographer to have access to huge volumes of authentic data from written texts and transcribed oral data. This research will therefore critically discuss dictionary compilation for Sesotho and spearhead the use of corpora in the compilation of Sesotho dictionaries, so that lexicographers do not compile dictionaries as if they are compiling the first dictionary for the language. In addition, they should take into account tasks like lexicographic planning, amongst other factors required to compile a good user-friendly dictionary. 
Key words: corpora, collocations, concordances, lexicography, lexicographical planning, microstructure, macrostructure, lemmatisation. Dissertation (MA)--University of Pretoria, 2018. African Languages. MA. Unrestricted.

    Modelling a conversational agent (Botocrates) for promoting critical thinking and argumentation skills

    Students in higher education institutions are often advised to think critically, yet without being guided to do so. The study investigated the use of a conversational agent (Botocrates) for supporting critical thinking and academic argumentation skills. The overarching research questions were: can a conversational agent support critical thinking and academic argumentation skills? If so, how? The study was carried out in two stages: modelling and evaluating Botocrates' prototype. The prototype was a Wizard-of-Oz system in which a human plays Botocrates' role by following a set of instructions and a knowledge base that guide the generation of responses. Both stages were conducted at the School of Education at the University of Leeds. In the first stage, the study analysed 13 logs of online seminars in order to define the tasks and dialogue strategies that Botocrates needed to perform. The study identified two main tasks of Botocrates: providing answers to students' enquiries and engaging students in the argumentation process. Botocrates’ dialogue strategies and contents were built to achieve these two tasks. The novel theoretical framework of the ‘challenge to explain’ process and the notion of the ‘constructive expansion of exchange structure’ were produced during this stage and incorporated into Botocrates’ prototype. The aim of the ‘challenge to explain’ process is to engage users in repeated and constant cycles of reflective thinking processes. The ‘constructive expansion of exchange structure’ is the practical application of the ‘challenge to explain’ process. In the second stage, the study used Wizard-of-Oz (WOZ) experiments and interviews to evaluate Botocrates’ prototype. Seven students participated in the evaluation stage, and each participant was interviewed immediately after chatting with Botocrates. The analysis of the data gathered from the WOZ experiments and interviews showed encouraging results in terms of students’ engagement in the process of argumentation.
As a result of the role of ‘critic’ played by Botocrates during the interactions, users actively and positively adopted the roles of explainer, clarifier, and evaluator. However, the results also showed that users had negative experiences during the interaction. Improving Botocrates’ performance and training users could reduce users’ unsuccessful and negative experiences. The study identified the critical success and failure factors related to achieving the tasks of Botocrates.

    Unmet goals of tracking: within-track heterogeneity of students' expectations for

    Educational systems are often characterized by some form(s) of ability grouping, like tracking. Although substantial variation in the implementation of these practices exists, the aim is always to improve teaching efficiency by creating homogeneous groups of students in terms of capabilities and performances as well as expected pathways. If students’ expected pathways (university, graduate school, or working) are in line with the goals of tracking, one might presume that these expectations are rather homogeneous within tracks and heterogeneous between tracks. In Flanders (the northern region of Belgium), the educational system consists of four tracks. Many students start out in the most prestigious, academic track. If they fail to gain the necessary credentials, they move to the less esteemed technical and vocational tracks. Therefore, the educational system has been called a 'cascade system'. We presume that this cascade system creates homogeneous expectations in the academic track, but heterogeneous expectations in the technical and vocational tracks. We use data from the International Study of City Youth (ISCY), gathered during the 2013-2014 school year from 2354 tenth-grade pupils across 30 secondary schools in the city of Ghent, Flanders. Preliminary results suggest that the technical and vocational tracks show more heterogeneity in students’ expectations than the academic track. If tracking does not fulfill the desired goals in some tracks, tracking practices should be questioned, as tracking occurs along social and ethnic lines, causing social inequality.