
    D7.1. Criteria for evaluation of resources, technology and integration.

    This deliverable defines how evaluation is carried out at each integration cycle in the PANACEA project. As PANACEA aims at producing large-scale resources, evaluation becomes a critical and challenging issue. It is critical because it is important to assess the quality of the results that are to be delivered to users. It is challenging because the project explores relatively new areas, and does so through a technical platform: some new methodologies will have to be explored and existing ones adapted.

    D6.2 Integrated Final Version of the Components for Lexical Acquisition

    The PANACEA project has addressed one of the most critical bottlenecks that threaten the development of technologies to support multilingualism in Europe, and to process the huge quantity of multilingual data produced annually. Any attempt at automated language processing, particularly Machine Translation (MT), depends on the availability of language-specific resources. Such Language Resources (LR) contain information about the language's lexicon, i.e. the words of the language and the characteristics of their use. In Natural Language Processing (NLP), LRs contribute information about the syntactic and semantic behaviour of words - i.e. their grammar and their meaning - which inform downstream applications such as MT. To date, many LRs have been generated by hand, requiring significant manual labour from linguistic experts. However, proceeding manually, it is impossible to supply LRs for every possible pair of European languages, textual domain, and genre, which are needed by MT developers. Moreover, an LR for a given language can never be considered complete nor final because of the characteristics of natural language, which continually undergoes changes, especially spurred on by the emergence of new knowledge domains and new technologies. PANACEA has addressed this challenge by building a factory of LRs that progressively automates the stages involved in the acquisition, production, updating and maintenance of LRs required by MT systems. The existence of such a factory will significantly cut down the cost, time and human effort required to build LRs. WP6 has addressed the lexical acquisition component of the LR factory, that is, the techniques for automated extraction of key lexical information from texts, and the automatic collation of lexical information into LRs in a standardized format.
The goal of WP6 has been to take existing techniques capable of acquiring syntactic and semantic information from corpus data, improve upon them, adapt and apply them to multiple languages, and turn them into powerful and flexible techniques capable of supporting massive applications. One focus for improving the scalability and portability of lexical acquisition techniques has been to extend existing techniques with more powerful, less "supervised" methods. In NLP, the amount of supervision refers to the amount of manual annotation which must be applied to a text corpus before machine learning or other techniques are applied to the data to compile a lexicon. More manual annotation means more accurate training data, and thus a more accurate LR. However, given that it is impractical from a cost and time perspective to manually annotate the vast amounts of data required for multilingual MT across domains, it is important to develop techniques which can learn from corpora with less supervision. Less supervised methods are capable of supporting both large-scale acquisition and efficient domain adaptation, even in domains where data is scarce. Another focus of lexical acquisition in PANACEA has been the need of LR users to tune the accuracy level of LRs. Some applications may require increased precision, or accuracy, where the application requires a high degree of confidence in the lexical information used. At other times a greater level of coverage may be required, with information about more words at the expense of some degree of accuracy. Lexical acquisition in PANACEA has investigated confidence thresholds for lexical acquisition to ensure that the ultimate users of LRs can generate lexical data from the PANACEA factory at the desired level of accuracy.

    Proceedings of the First Workshop on Computing News Storylines (CNewsStory 2015)

    This volume contains the proceedings of the 1st Workshop on Computing News Storylines (CNewsStory 2015) held in conjunction with the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015) at the China National Convention Center in Beijing, on July 31st 2015. Narratives are at the heart of information sharing. Ever since people began to share their experiences, they have connected them to form narratives. The study of storytelling and the field of literary theory called narratology have developed complex frameworks and models related to various aspects of narrative such as plot structures, narrative embeddings, characters’ perspectives, reader response, point of view, narrative voice, narrative goals, and many others. These notions from narratology have been applied mainly in Artificial Intelligence and to model formal semantic approaches to narratives (e.g. Plot Units developed by Lehnert (1981)). In recent years, computational narratology has qualified as an autonomous field of study and research. Narrative has been the focus of a number of workshops and conferences (AAAI Symposia, Interactive Storytelling Conference (ICIDS), Computational Models of Narrative). Furthermore, reference annotation schemes for narratives have been proposed (NarrativeML by Mani (2013)). The workshop aimed at bringing together researchers from different communities working on representing and extracting narrative structures in news, a text genre which is highly used in NLP but which has received little attention with respect to narrative structure, representation and analysis. Currently, advances in NLP technology have made it feasible to look beyond scenario-driven, atomic extraction of events from single documents and work towards extracting story structures from multiple documents, while these documents are published over time as news streams.
Policy makers, NGOs, information specialists (such as journalists and librarians) and others are increasingly in need of tools that support them in finding salient stories in large amounts of information to more effectively implement policies, monitor actions of “big players” in society and check facts. Their tasks often revolve around reconstructing cases either with respect to specific entities (e.g. persons or organizations) or events (e.g. hurricane Katrina). Storylines represent explanatory schemas that enable us to make better selections of relevant information but also projections to the future. They form a valuable potential for exploiting news data in an innovative way.

    Reasoning-Driven Question-Answering For Natural Language Understanding

    Natural language understanding (NLU) of text is a fundamental challenge in AI, and it has received significant attention throughout the history of NLP research. This primary goal has been studied under different tasks, such as Question Answering (QA) and Textual Entailment (TE). In this thesis, we investigate the NLU problem through the QA task and focus on the aspects that make it a challenge for the current state-of-the-art technology. This thesis is organized into three main parts: In the first part, we explore multiple formalisms to improve existing machine comprehension systems. We propose a formulation for abductive reasoning in natural language and show its effectiveness, especially in domains with limited training data. Additionally, to help reasoning systems cope with irrelevant or redundant information, we create a supervised approach to learn and detect the essential terms in questions. In the second part, we propose two new challenge datasets. In particular, we create two datasets of natural language questions where (i) the first one requires reasoning over multiple sentences; (ii) the second one requires temporal common sense reasoning. We hope that the two proposed datasets will motivate the field to address more complex problems. In the final part, we present the first formal framework for multi-step reasoning algorithms, in the presence of a few important properties of language use, such as incompleteness, ambiguity, etc. We apply this framework to prove fundamental limitations for reasoning algorithms. These theoretical results provide extra intuition into the existing empirical evidence in the field

    Unsupervised Induction of Semantic Roles within a Reconstruction-Error Minimization Framework

    We introduce a new approach to unsupervised estimation of feature-rich semantic role labeling models. Our model consists of two components: (1) an encoding component: a semantic role labeling model which predicts roles given a rich set of syntactic and lexical features; (2) a reconstruction component: a tensor factorization model which relies on roles to predict argument fillers. When the components are estimated jointly to minimize errors in argument reconstruction, the induced roles largely correspond to roles defined in annotated resources. Our method performs on par with the most accurate role induction methods on English and German, even though, unlike these previous approaches, we do not incorporate any prior linguistic knowledge about the languages.

    Russian verbal prefixation: A frame semantic analysis

    This book addresses the complexity of the Russian verbal prefixation system, which has been extensively studied but not yet explained. Traditionally, different meanings have been investigated and listed in dictionaries and grammars, and more recently linguists have attempted to unify various prefix usages under more general descriptions. Existing semantic approaches, however, do not aim to use semantic representations in order to account for the problems of prefix stacking and aspect determination. This task has so far been undertaken by syntactic approaches to prefixation, which divide verbal prefixes into classes and limit complex verb formation by restricting the structural positions available for the members of each class. I show that these approaches have two major drawbacks: the implicit prediction of the non-existence of complex biaspectual verbs and the absence of uniformly accepted formal criteria for the underlying prefix classification. In this book the reader can find an implementable formal semantic approach to prefixation that covers five prefixes: za-, na-, po-, pere-, and do-. It is shown how to predict the existence, semantics, and aspect of a given complex verb with the help of the combination of an LTAG and frame semantics. The task of identifying the possible affix combinations is distributed between three modules: syntax, which is kept simple (only basic structural assumptions), frame semantics, which ensures that the constraints are respected, and pragmatics, which rules out some prefixed verbs and restricts the range of available interpretations. For the purpose of the evaluation of the theory, an implementation of the proposed analysis for a grammar fragment using a metagrammar description is provided. It is shown that the proposed analysis delivers more accurate and complete predictions with respect to the existence of complex verbs than the most precise syntactic account.
