    Learning languages from parallel corpora

    This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data through their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessment can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source. Several challenges need to be addressed for such an application to work, and we discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we detail what the structure of parallel corpora implies for that selection. Second, we consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. Third, we highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.

    Developing a distributed electronic health-record store for India

    The DIGHT project is addressing the problem of building a scalable and highly available information store for the Electronic Health Records (EHRs) of the more than one billion citizens of India.

    Proceedings of the COLING 2004 Post Conference Workshop on Multilingual Linguistic Ressources MLR2004

    In an ever-expanding information society, most information systems are now facing the "multilingual challenge". Multilingual language resources play an essential role in modern information systems. Such resources need to provide information on many languages in a common framework and should be (re)usable in many applications (for automatic or human use). Many centres have been involved in national and international projects dedicated to building harmonised language resources and creating expertise in the maintenance and further development of standardised linguistic data. These resources include dictionaries, lexicons, thesauri, word-nets, and annotated corpora developed along the lines of best practices and recommendations. However, since the late 1990s, most efforts in scaling up these resources have remained the responsibility of local authorities, usually with very low funding (if any) and few opportunities for academic recognition of this work. Hence, it is not surprising that many of the resource holders and developers have become reluctant to give free access to the latest versions of their resources, and their actual status is therefore currently rather unclear. The goal of this workshop is to study problems involved in the development, management and reuse of lexical resources in a multilingual context. Moreover, this workshop provides a forum for reviewing the present state of language resources. The workshop is meant to bring to the international community qualitative and quantitative information about the most recent developments in the area of linguistic resources and their use in applications. The impressive number of submissions (38) to this workshop, and to other workshops and conferences dedicated to similar topics, shows that dealing with multilingual linguistic resources has become a very active problem in the Natural Language Processing community. 
To cope with the number of submissions, the workshop organising committee decided to accept 16 papers from 10 countries based on the reviewers' recommendations. Six of these papers will be presented in a poster session. The papers constitute a representative selection of current trends in research on Multilingual Language Resources, such as multilingual aligned corpora, bilingual and multilingual lexicons, and multilingual speech resources. The papers also represent a characteristic set of approaches to the development of multilingual language resources, such as automatic extraction of information from corpora, combination and re-use of existing resources, online collaborative development of multilingual lexicons, and use of the Web as a multilingual language resource. The development and management of multilingual language resources is a long-term activity in which collaboration among researchers is essential. We hope that this workshop will gather many researchers involved in such developments and will give them the opportunity to discuss, exchange and compare their approaches and to strengthen their collaborations in the field. The organisation of this workshop would have been impossible without the hard work of the program committee, who managed to provide accurate reviews on time, on a rather tight schedule. We would also like to thank the Coling 2004 organising committee that made this workshop possible. Finally, we hope that this workshop will yield fruitful results for all participants.

    Using Crowdsourced Exercises for Vocabulary Training to Expand ConceptNet

    In this work, we report on a crowdsourcing experiment conducted using the V-TREL vocabulary trainer, which is accessed via a Telegram chatbot interface, to gather knowledge on word relations suitable for expanding ConceptNet. V-TREL is built on top of a generic architecture implementing the implicit crowdsourcing paradigm in order to offer vocabulary training exercises generated from the commonsense knowledge base ConceptNet and - in the background - to collect and evaluate the learners' answers to extend ConceptNet with new words. In the experiment, about 90 university students learning English at C1 level, based on the Common European Framework of Reference for Languages (CEFR), trained their vocabulary with V-TREL over a period of 16 calendar days. The experiment gathered more than 12,000 answers from learners on different question types. In this paper, we present in detail the experimental setup and the outcome of the experiment, which indicates the potential of our approach for both crowdsourcing data and fostering vocabulary skills.

    Delving into the uncharted territories of Word Sense Disambiguation

    The automatic disambiguation of word senses, i.e. Word Sense Disambiguation, is a long-standing task in the field of Natural Language Processing; an AI-complete problem that took its first steps more than half a century ago, and which, to date, has apparently attained human-like performance on standard evaluation benchmarks. Unfortunately, the steady evolution that the task experienced over time in terms of sheer performance has not been matched by adequate theoretical support, nor by careful error analysis. Furthermore, we believe that the lack of an exhaustive bird’s eye view which accounts for the sort of high-end and unrealistic computational architectures that systems will soon need in order to further refine their performance could lead the field to a dead end in a few years. In essence, taking advantage of the current moment of great accomplishments and renewed interest in the task, we argue that Word Sense Disambiguation is mature enough for researchers to really observe the extent of the results hitherto obtained, evaluate what is actually missing, and answer the much sought-after question: “are current state-of-the-art systems really able to effectively solve lexical ambiguity?” Driven by the desire to become both architects and participants in this period of pondering, we have identified a few macro-areas representative of the challenges of automatic disambiguation. From this point of view, in this thesis, we propose experimental solutions and empirical tools so as to bring to the attention of the Word Sense Disambiguation community unusual and unexplored points of view. We hope these will represent a new perspective through which to best observe the current state of disambiguation, as well as to foresee future paths for the task to evolve on. 
Specifically, 1q) prompted by the growing concern about the rise in performance being closely linked to the demand for more and more unrealistic computational architectures in all areas of application of Deep Learning related techniques, we 1a) provide evidence for the undisclosed potential of approaches based on knowledge-bases, via the exploitation of syntagmatic information. Moreover, 2q) driven by the dissatisfaction with the use of cognitively-inaccurate, finite inventories of word senses in Word Sense Disambiguation, we 2a) introduce an approach based on Definition Modeling paradigms to generate contextual definitions for target words and phrases, hence going beyond the limits set by specific lexical-semantic inventories. Finally, 3q) moved by the desire to analyze the real implications beyond the idea of “machines performing disambiguation on par with their human counterparts”, we 3a) put forward a detailed analysis of the shared errors affecting current state-of-the-art systems based on diverse approaches for Word Sense Disambiguation, and highlight, by means of a novel evaluation dataset tailored to represent common and critical issues shared by all systems, performance far lower than that usually reported in the current literature.

    Programming Language Techniques for Natural Language Applications

    It is easy to imagine machines that can communicate in natural language. Constructing such machines is more difficult. The aim of this thesis is to demonstrate how declarative grammar formalisms that distinguish between abstract and concrete syntax make it easier to develop natural language applications. We describe how the type-theoretical grammar formalism Grammatical Framework (GF) can be used as a high-level language for natural language applications. By taking advantage of techniques from the field of programming language implementation, we can use GF grammars to perform portable and efficient parsing and linearization, generate speech recognition language models, implement multimodal fusion and fission, generate support code for abstract syntax transformations, generate dialogue managers, and implement speech translators and web-based syntax-aware editors. By generating application components from a declarative grammar, we can reduce duplicated work, ensure consistency, make it easier to build multilingual systems, improve linguistic quality, enable re-use across system domains, and make systems more portable.

    Final FLaReNet deliverable: Language Resources for the Future - The Future of Language Resources

    Language Technologies (LT), together with their backbone, Language Resources (LR), provide essential support for the challenge of Multilingualism and ICT of the future. The main task of language technologies is to bridge language barriers and to help create a new environment where information flows smoothly across frontiers and languages, no matter the country, and the language, of origin. To achieve this goal, all players involved need to act as a community able to join forces on a set of shared priorities. However, the field of Language Resources and Technology has long suffered from an excess of individuality and fragmentation, with a lack of coherence concerning the priorities for the field and the direction to move in, not to mention a common timeframe. The context encountered by the FLaReNet project was thus that of an active field needing a coherence that can only be achieved by sharing common priorities and endeavours. FLaReNet has contributed to the creation of this coherence by gathering a wide community of experts and making them participate in the definition of an exhaustive set of recommendations.