
    NATSUM: Narrative abstractive summarization through cross-document timeline generation

    A new approach to narrative abstractive summarization (NATSUM) is presented in this paper. NATSUM centers on generating a chronologically ordered narrative summary about a target entity from several news documents on the same topic. To achieve this, the system first creates a cross-document timeline in which each time point contains all the event mentions that refer to the same event. This timeline is enriched with the arguments of the events extracted from the different documents. Second, using natural language generation techniques, one sentence is produced for each event from the arguments involved in it. Specifically, a hybrid surface realization approach is used, based on over-generation and ranking techniques. The evaluation demonstrates that NATSUM performs better than extractive summarization approaches and competitive abstractive baselines, improving the F1-measure by at least 50% when a real scenario is simulated. This research work has been partially funded by the Ministerio de Economía y Competitividad (Spain) through projects TIN2015-65100-R and TIN2015-65136-C2-2-R, as well as by the project "Análisis de Sentimientos Aplicado a la Prevención del Suicidio en las Redes Sociales (ASAP)", funded by Ayudas Fundación BBVA a equipos de investigación científica. It has also been funded by the Generalitat Valenciana through the project "SIIA: Tecnologías del lenguaje humano para una sociedad inclusiva, igualitaria y accesible" under grant PROMETEU/2018/089.
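
    As a rough illustration of the pipeline described above, the sketch below groups coreferent event mentions into a cross-document timeline and then realizes one sentence per event by over-generation and ranking. The data structures and the scoring hook are assumptions made for illustration, not the paper's actual implementation.

    # Minimal sketch of a NATSUM-style pipeline (illustrative assumptions only).
    from dataclasses import dataclass, field
    from collections import defaultdict
    from itertools import permutations

    @dataclass
    class EventMention:
        event_id: str        # cross-document coreference cluster of the event
        time_point: str      # normalized date, e.g. "2015-03-01"
        predicate: str       # event trigger, e.g. "won"
        arguments: dict = field(default_factory=dict)  # semantic role -> filler

    def build_timeline(mentions):
        """Group mentions of the same event under one time point and merge their
        arguments, enriching each event with arguments found in any document."""
        timeline = defaultdict(dict)
        for m in mentions:
            event = timeline[m.time_point].setdefault(
                m.event_id, {"predicate": m.predicate, "arguments": {}})
            event["arguments"].update(m.arguments)
        return dict(sorted(timeline.items()))  # chronological order

    def realize(event, score):
        """Over-generation and ranking: enumerate simple linearizations of the
        predicate and its arguments, then keep the candidate preferred by the
        supplied scoring function (e.g. a language model)."""
        tokens = [event["predicate"]] + list(event["arguments"].values())
        candidates = [" ".join(p) for p in permutations(tokens)]
        return max(candidates, key=score)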

    Incorporating pronoun function into statistical machine translation

    Pronouns are used frequently in language and perform a range of functions. Some pronouns are used to express coreference, and others are not. Languages and genres differ in how and when they use pronouns, and this poses a problem for Statistical Machine Translation (SMT) systems (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Novák, 2011; Guillou, 2012; Weiner, 2014; Hardmeier, 2014). Attention to date has focussed on coreferential (anaphoric) pronouns with NP antecedents, which, when translated from English into a language with grammatical gender, must agree with the translation of the head of the antecedent. Despite growing attention to this problem, little progress has been made, and little attention has been given to other pronouns. The central claim of this thesis is that pronouns performing different functions in text should be handled differently by SMT systems and when evaluating pronoun translation. This motivates the introduction of a new framework to categorise pronouns according to their function: anaphoric/cataphoric reference, event reference, extra-textual reference, pleonastic, addressee reference, speaker reference, generic reference, or other function. Labelling pronouns according to their function also helps to resolve instances of functional ambiguity that arise when the same pronoun in the source language has multiple functions, each with different translation requirements in the target language. The categorisation framework is used in corpus annotation, corpus analysis, SMT system development and evaluation.

    I have directed the annotation and conducted analyses of ParCor (Guillou et al., 2014), a parallel corpus of English-German texts in which pronouns are manually annotated according to their function. This provides a first step toward understanding the problems that SMT systems face when translating pronouns. In the thesis, I show how analysis of manual translation can prove useful in identifying and understanding systematic differences in pronoun use between two languages, and can help inform the design of SMT systems. In particular, the analysis revealed that the German translations in ParCor contain more anaphoric and pleonastic pronouns than their English originals, reflecting differences in pronoun use. This raises a particular problem for the evaluation of pronoun translation: automatic evaluation methods that rely on reference translations to assess pronoun translation will not be able to provide an adequate evaluation when the reference translation departs from the original source-language text.

    I also show how analysis of the output of state-of-the-art SMT systems can reveal how well current systems perform in translating different types of pronouns, and indicate where future efforts would be best directed. The analysis revealed that biases in the training data, arising for example from the use of "it" and "es" as both anaphoric and pleonastic pronouns in English and German, are a problem that SMT systems must overcome. SMT systems also need to disambiguate the function of pronouns with ambiguous surface forms so that each pronoun may be translated in an appropriate way. To demonstrate the value of this work, I have developed an automated post-editing system in which automated tools are used to construct ParCor-style annotations over the source-language pronouns. The annotations are then used to resolve functional ambiguity for the pronoun "it", with separate rules applied to the output of a baseline SMT system for anaphoric vs. non-anaphoric instances. The system was submitted to the DiscoMT 2015 shared task on pronoun translation for English-French. As with all other participating systems, the automated post-editing system failed to beat a simple phrase-based baseline. A detailed analysis, including an oracle experiment in which manual annotation replaces the automated tools, was conducted to discover the causes of poor system performance. The analysis revealed that the design of the rules and their strict application to the SMT output are the biggest factors in the failure of the system.

    The lack of automatic evaluation metrics for pronoun translation is a limiting factor in SMT system development. To alleviate this problem, Christian Hardmeier and I have developed a testing regimen called PROTEST, comprising (1) a hand-selected set of pronoun tokens categorised according to the different problems that SMT systems face and (2) an automated evaluation script. Pronoun translations can then be automatically compared against a reference translation, with mismatches referred for manual evaluation. The automatic evaluation was applied to the output of systems submitted to the DiscoMT 2015 shared task on pronoun translation. This again highlighted the weakness of the post-editing system, which performs poorly due to its focus on producing gendered pronoun translations and its inability to distinguish between pleonastic and event reference pronouns.
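
    As a toy illustration of the two ideas above, disambiguating the function of "it" and applying function-specific post-editing rules to SMT output, the sketch below uses an invented cue list and invented English-French rules; the thesis system relied on automated annotation tools and more elaborate rules.

    # Toy sketch: label the function of "it", then post-edit the translation.
    import re
    from typing import Optional

    PLEONASTIC_CUES = re.compile(
        r"\bit\s+(seems|appears|rains|snows)\b"    # weather/raising verbs
        r"|\bit\s+is\s+\w+\s+(to|that)\b", re.I)   # extraposition: "it is hard to ..."

    def pronoun_function(source: str) -> str:
        """Rough ParCor-style function label for 'it' in the source sentence."""
        return "pleonastic" if PLEONASTIC_CUES.search(source) else "anaphoric"

    def post_edit(source: str, smt_output: str,
                  antecedent_gender: Optional[str]) -> str:
        """Apply a function-specific rewrite to the translated pronoun."""
        if pronoun_function(source) == "pleonastic":
            # Pleonastic 'it' needs no gendered translation: rewrite a stray
            # referential 'elle' to impersonal 'il'.
            return re.sub(r"\belle\b", "il", smt_output)
        if antecedent_gender == "fem":
            # Anaphoric 'it' must agree with the translated antecedent.
            return re.sub(r"\bil\b", "elle", smt_output)
        return smt_output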

    A Data-Driven Approach to Identify Journalistic 5Ws from Text Documents

    Textual understanding is the process of automatically extracting accurate, high-quality information from text. The amount of textual data available from sources such as news, blogs and social media is growing exponentially. These data encode significant latent information which, if extracted accurately, can be valuable in a variety of applications such as medical report analysis, news understanding and societal studies. Natural language processing techniques are often employed to develop customized algorithms to extract such latent information from text. The journalistic 5Ws refer to the basic information in news articles that describes an event: where, when, who, what and why. Extracting them accurately may facilitate better understanding of many social processes, including social unrest, human rights violations, propaganda spread and population migration. Furthermore, the 5Ws information can be combined with socio-economic and demographic data to analyze the state and trajectory of these processes. In this thesis, a data-driven pipeline has been developed to extract the 5Ws from text using syntactic and semantic cues. First, a classifier is developed to identify articles specifically related to social unrest; it is trained on a dataset of over 80K news articles. NLP algorithms are then used to generate a set of candidates for the 5Ws, and a series of heuristic algorithms is developed to extract the answers. These heuristics leverage specific words and parts of speech, customized for each individual W, to compute candidate scores; they draw on the syntactic structure of the document as well as syntactic and semantic representations of individual words and sentences. The scores are then combined and ranked to obtain the best answers to the journalistic 5Ws. The classification accuracy of the algorithms is validated using a manually annotated dataset of news articles.
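
    A hedged sketch of the candidate scoring and ranking step is given below. The named-entity cue sets, the positional heuristic and the use of spaCy are assumptions made for illustration, not the thesis's actual heuristics (which also cover "what" and "why" using syntactic cues).

    # Illustrative scoring of where/when/who candidates with spaCy
    # (assumes the en_core_web_sm model is installed).
    import spacy

    nlp = spacy.load("en_core_web_sm")

    CUES = {
        "where": {"GPE", "LOC", "FAC"},    # entity labels suggesting a location
        "when": {"DATE", "TIME"},
        "who": {"PERSON", "ORG", "NORP"},
    }

    def score_candidates(text):
        """Score each entity mention for the W whose cue set matches its label;
        earlier mentions get a positional bonus (lead-paragraph heuristic)."""
        doc = nlp(text)
        best = {}
        for ent in doc.ents:
            for w, labels in CUES.items():
                if ent.label_ in labels:
                    score = 1.0 + 1.0 / (1 + ent.start)  # position in tokens
                    if score > best.get(w, ("", 0.0))[1]:
                        best[w] = (ent.text, score)
        return best  # w -> (best candidate, score)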

    Improving Neural Question Answering with Retrieval and Generation

    Text-based Question Answering (QA) is a subject of interest both for its practical applications and as a test-bed to measure the key Artificial Intelligence competencies of Natural Language Processing (NLP) and the representation and application of knowledge. QA has progressed a great deal in recent years by adopting neural networks, the construction of large training datasets, and unsupervised pretraining. Despite these successes, QA models require large amounts of hand-annotated data, struggle to apply supplied knowledge effectively, and can be computationally expensive to operate. In this thesis, we employ natural language generation and information retrieval techniques in order to explore and address these three issues. We first approach the task of Reading Comprehension (RC), with the aim of lifting the requirement for in-domain hand-annotated training data. We describe a method for inducing RC capabilities without requiring hand-annotated RC instances, and demonstrate performance on par with early supervised approaches. We then explore multi-lingual RC, and develop a dataset to evaluate methods which enable training RC models in one language and testing them in another. Second, we explore open-domain QA (ODQA), and consider how to build models which best leverage the knowledge contained in a Wikipedia text corpus. We demonstrate that retrieval-augmentation greatly improves the factual predictions of large pretrained language models in unsupervised settings. We then introduce a class of retrieval-augmented generator model, and demonstrate its strength and flexibility across a range of knowledge-intensive NLP tasks, including ODQA. Lastly, we study the relationship between memorisation and generalisation in ODQA, developing a behavioural framework based on memorisation to contextualise the performance of ODQA models. Based on these insights, we introduce a class of ODQA model based on the concept of representing knowledge as question-answer pairs, and demonstrate how, by using question generation, such models can achieve high accuracy, fast inference, and well-calibrated predictions.
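
    As a minimal sketch of retrieval-augmentation, the snippet below retrieves the top passages for a question with TF-IDF and conditions a generator on them; TF-IDF and the generate() stub are simplifying assumptions, standing in for the dense retrievers and large pretrained generators used in the work described above.

    # Retrieval-augmented QA skeleton (illustrative assumptions only).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def build_retriever(passages):
        vectorizer = TfidfVectorizer().fit(passages)
        index = vectorizer.transform(passages)

        def retrieve(question, k=3):
            scores = cosine_similarity(vectorizer.transform([question]), index)[0]
            return [passages[i] for i in scores.argsort()[::-1][:k]]

        return retrieve

    def answer(question, retrieve, generate):
        """generate() stands in for any seq2seq model conditioned on the prompt."""
        context = " ".join(retrieve(question))
        return generate(f"question: {question} context: {context}")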

    European Language Grid

    This open access book provides an in-depth description of the EU project European Language Grid (ELG). Its motivation lies in the fact that Europe is a multilingual society with 24 official European Union Member State languages and dozens of additional languages including regional and minority languages. The only meaningful way to enable multilingualism and to benefit from this rich linguistic heritage is through Language Technologies (LT) including Natural Language Processing (NLP), Natural Language Understanding (NLU), Speech Technologies and language-centric Artificial Intelligence (AI) applications. The European Language Grid provides a single umbrella platform for the European LT community, including research and industry, effectively functioning as a virtual home, marketplace, showroom, and deployment centre for all services, tools, resources, products and organisations active in the field. Today, the ELG cloud platform already offers access to more than 13,000 language processing tools and language resources. It enables all stakeholders to deposit, upload and deploy their technologies and datasets. The platform also supports the long-term objective of establishing digital language equality in Europe by 2030, that is, to create a situation in which all European languages enjoy equal technological support. This is the very first book dedicated to Language Technology and NLP platforms. Cloud technology has only recently matured enough to make the development of a platform like ELG feasible on a larger scale. The book comprehensively describes the results of the ELG project. Following an introduction, the content is divided into four main parts: (I) ELG Cloud Platform; (II) ELG Inventory of Technologies and Resources; (III) ELG Community and Initiative; and (IV) ELG Open Calls and Pilot Projects.

    Unsupervised German predicate entailment using the distributional inclusion hypothesis

    Recognizing textual entailment is an important prerequisite for many tasks in NLP, e.g. question answering and semantic parsing. Knowing, for example, that buying a thing entails subsequently owning it is a relation that humans learn by interacting with the world, while machines need other ways to acquire this knowledge. Previous approaches to learning predicate entailment relations from text have focused only on English. In this thesis we present the adaptation of the unsupervised entailment graph building algorithm of Hosseini et al. to German, which can be seen as a study of the challenges of language adaptation for this task in general. We create a variety of German tools necessary for this approach and give a detailed account of the challenges faced and the insights gained from them. First, we create a German relation extraction system and compare it against the English system presented by Hosseini et al. Finding that the typing of German entities constitutes a bottleneck, we create a German fine-grained typing system for named and general entities. In doing so we examine the methods of annotation projection and zero-shot cross-lingual transfer, finding that zero-shot cross-lingual transfer performs best for German fine-grained named entity typing. We then move on to creating a German system that types general entities (e.g. "ex-president") as well as named entities (e.g. "Obama"), by augmenting our training data with data automatically generated from a German WordNet. We find that this yields up to 10 percentage points of improvement in general entity typing performance, while reducing named entity typing performance by only 1 percentage point. We use these components in the pipeline to construct German entailment graphs. We also present a method that uses German and English entailment graphs to generate training data for a supervised predicate entailment detection system, and show that this method outperforms current approaches to this task. In this way we create a multilingual predicate entailment detection system that outperforms both the monolingual German system and the zero-shot cross-lingual system on German test data, and also performs better than a monolingual English system on English test data.
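
    The distributional inclusion hypothesis underlying such entailment graphs can be stated concretely: if predicate p entails predicate q, the argument pairs observed with p should tend to be a subset of those observed with q. Below is a minimal sketch of one standard directional score (Weeds precision) over toy feature counts; real systems compute such scores over typed predicate-argument pairs extracted from a large parsed corpus.

    # Weeds precision: how much of p's argument distribution q covers.
    def weeds_precision(features_p, features_q):
        """High coverage is evidence for 'p entails q' under the
        distributional inclusion hypothesis."""
        shared = sum(w for f, w in features_p.items() if f in features_q)
        total = sum(features_p.values())
        return shared / total if total else 0.0

    # Toy counts: 'buy' occurs with a subset of the argument pairs of 'own'.
    buy = {("person", "car"): 4, ("person", "house"): 2}
    own = {("person", "car"): 6, ("person", "house"): 5, ("person", "dog"): 3}
    print(weeds_precision(buy, own))  # 1.0   -> evidence that 'buy' entails 'own'
    print(weeds_precision(own, buy))  # ~0.79 -> weaker evidence for 'own' -> 'buy'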
