175 research outputs found

    Enhancing knowledge acquisition systems with user generated and crowdsourced resources

    Get PDF
    This thesis is on leveraging knowledge acquisition systems with collaborative data and crowdsourcing work from internet. We propose two strategies and apply them for building effective entity linking and question answering (QA) systems. The first strategy is on integrating an information extraction system with online collaborative knowledge bases, such as Wikipedia and Freebase. We construct a Cross-Lingual Entity Linking (CLEL) system to connect Chinese entities, such as people and locations, with corresponding English pages in Wikipedia. The main focus is to break the language barrier between Chinese entities and the English KB, and to resolve the synonymy and polysemy of Chinese entities. To address those problems, we create a cross-lingual taxonomy and a Chinese knowledge base (KB). We investigate two methods of connecting the query representation with the KB representation. Based on our CLEL system participating in TAC KBP 2011 evaluation, we finally propose a simple and effective generative model, which achieved much better performance. The second strategy is on creating annotation for QA systems with the help of crowd- sourcing. Crowdsourcing is to distribute a task via internet and recruit a lot of people to complete it simultaneously. Various annotated data are required to train the data-driven statistical machine learning algorithms for underlying components in our QA system. This thesis demonstrates how to convert the annotation task into crowdsourcing micro-tasks, investigate different statistical methods for enhancing the quality of crowdsourced anno- tation, and finally use enhanced annotation to train learning to rank models for passage ranking algorithms for QA.Gegenstand dieser Arbeit ist das Nutzbarmachen sowohl von Systemen zur Wissener- fassung als auch von kollaborativ erstellten Daten und Arbeit aus dem Internet. Es werden zwei Strategien vorgeschlagen, welche für die Erstellung effektiver Entity Linking (Disambiguierung von Entitätennamen) und Frage-Antwort Systeme eingesetzt werden. Die erste Strategie ist, ein Informationsextraktions-System mit kollaborativ erstellten Online- Datenbanken zu integrieren. Wir entwickeln ein Cross-Linguales Entity Linking-System (CLEL), um chinesische Entitäten, wie etwa Personen und Orte, mit den entsprechenden Wikipediaseiten zu verknüpfen. Das Hauptaugenmerk ist es, die Sprachbarriere zwischen chinesischen Entitäten und englischer Datenbank zu durchbrechen, und Synonymie und Polysemie der chinesis- chen Entitäten aufzulösen. Um diese Probleme anzugehen, erstellen wir eine cross linguale Taxonomie und eine chinesische Datenbank. Wir untersuchen zwei Methoden, die Repräsentation der Anfrage und die Repräsentation der Datenbank zu verbinden. Schließlich stellen wir ein einfaches und effektives generatives Modell vor, das auf unserem System für die Teilnahme an der TAC KBP 2011 Evaluation basiert und eine erheblich bessere Performanz erreichte. Die zweite Strategie ist, Annotationen für Frage-Antwort-Systeme mit Hilfe von "Crowd- sourcing" zu erstellen. "Crowdsourcing" bedeutet, eine Aufgabe via Internet an eine große Menge an angeworbene Menschen zu verteilen, die diese simultan erledigen. Verschiedene annotierte Daten sind notwendig, um die datengetriebenen statistischen Lernalgorithmen zu trainieren, die unserem Frage-Antwort System zugrunde liegen. Wir zeigen, wie die Annotationsaufgabe in Mikro-Aufgaben für das Crowdsourcing umgewan- delt werden kann, wir untersuchen verschiedene statistische Methoden, um die Qualität der Annotation aus dem Crowdsourcing zu erweitern, und schließlich nutzen wir die erwei- erte Annotation, um Modelle zum Lernen von Ranglisten von Textabschnitten zu trainieren

    CLARIN. The infrastructure for language resources

    Get PDF
    CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future. The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU)

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation

    Get PDF
    Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT

    CLARIN

    Get PDF
    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after establishing CLARIN as an Europ. Research Infrastructure Consortium

    CLARIN

    Get PDF
    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after establishing CLARIN as an Europ. Research Infrastructure Consortium

    A new approach to CALL content authoring

    Get PDF
    [no abstract

    Cultura-Inspired Intercultural Exchanges: Focus on Asian and Pacific Languages

    Get PDF
    Although many online intercultural exchanges have been conducted based on the groundbreaking Cultura model, most to date have been between and among European languages. This volume presents several chapters with a focus on exchanges involving Asian and Pacific languages. Many of the benefits and challenges of these exchanges are similar to those reported for European languages; however, some of the difficulties reported in the Chinese and Japanese exchanges might be due to the significant linguistic differences between English and East Asian languages. This volume adds to the body of emerging studies of telecollaboration among learners of Asian and Pacific languages

    New Data-Driven Approaches to Text Simplification

    Get PDF
    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of PhilosophyMany texts we encounter in our everyday lives are lexically and syntactically very complex. This makes them difficult to understand for people with intellectual or reading impairments, and difficult for various natural language processing systems to process. This motivated the need for text simplification (TS) which transforms texts into their simpler variants. Given that this is still a relatively new research area, many challenges are still remaining. The focus of this thesis is on better understanding the current problems in automatic text simplification (ATS) and proposing new data-driven approaches to solving them. We propose methods for learning sentence splitting and deletion decisions, built upon parallel corpora of original and manually simplified Spanish texts, which outperform the existing similar systems. Our experiments in adaptation of those methods to different text genres and target populations report promising results, thus offering one possible solution for dealing with the scarcity of parallel corpora for text simplification aimed at specific target populations, which is currently one of the main issues in ATS. The results of our extensive analysis of the phrase-based statistical machine translation (PB-SMT) approach to ATS reject the widespread assumption that the success of that approach largely depends on the size of the training and development datasets. They indicate more influential factors for the success of the PB-SMT approach to ATS, and reveal some important differences between cross-lingual MT and the monolingual v MT used in ATS. Our event-based system for simplifying news stories in English (EventSimplify) overcomes some of the main problems in ATS. It does not require a large number of handcrafted simplification rules nor parallel data, and it performs significant content reduction. The automatic and human evaluations conducted show that it produces grammatical text and increases readability, preserving and simplifying relevant content and reducing irrelevant content. Finally, this thesis addresses another important issue in TS which is how to automatically evaluate the performance of TS systems given that access to the target users might be difficult. Our experiments indicate that existing readability metrics can successfully be used for this task when enriched with human evaluation of grammaticality and preservation of meaning

    Principles and Applications of Data Science

    Get PDF
    Data science is an emerging multidisciplinary field which lies at the intersection of computer science, statistics, and mathematics, with different applications and related to data mining, deep learning, and big data. This Special Issue on “Principles and Applications of Data Science” focuses on the latest developments in the theories, techniques, and applications of data science. The topics include data cleansing, data mining, machine learning, deep learning, and the applications of medical and healthcare, as well as social media
    corecore