95 research outputs found

    Finding answers to questions, in text collections or web, in open domain or specialty domains

    This chapter is dedicated to factual question answering, i.e. extracting precise and exact answers from texts to questions posed in natural language. A question in natural language gives more information than a bag-of-words query (i.e. a query made of a list of words), and provides clues for finding precise answers. We first present the underlying problems, which are mainly due to linguistic variation between questions and the text passages that answer them and which affect both the selection of relevant passages and the extraction of reliable answers. We then present how to answer factual questions in the open domain. We also present question answering in specialty domains, as it requires dealing with semi-structured knowledge and specialized terminologies and can lead to different applications, such as information management in corporations. Searching for answers on the Web constitutes another application setting and introduces specificities linked to Web redundancy or collaborative usage. Moreover, the Web is multilingual, and a challenging problem consists in searching for answers in target-language documents written in a language other than that of the question. For all these topics, we present the main approaches and the remaining problems.

    Abstract syntax as interlingua: Scaling up the grammatical framework from controlled languages to robust pipelines

    Abstract syntax is an interlingual representation used in compilers. Grammatical Framework (GF) applies the abstract syntax idea to natural languages. The development of GF started in 1998, first as a tool for controlled language implementations, where it has gained an established position in both academic and commercial projects. GF provides grammar resources for over 40 languages, enabling accurate generation and translation, as well as grammar engineering tools and components for mobile and Web applications. On the research side, the focus in the last ten years has been on scaling up GF to wide-coverage language processing. The concept of abstract syntax offers a unified view on many other approaches: Universal Dependencies, WordNets, FrameNets, Construction Grammars, and Abstract Meaning Representations. This makes it possible for GF to utilize data from the other approaches and to build robust pipelines. In return, GF can contribute to data-driven approaches with methods to transfer resources from one language to others, to augment data by rule-based generation, to check the consistency of hand-annotated corpora, and to pipe analyses into high-precision semantic back ends. This article gives an overview of the use of abstract syntax as interlingua through both established and emerging NLP applications involving GF.

    Enhancing knowledge acquisition systems with user generated and crowdsourced resources

    This thesis is about enhancing knowledge acquisition systems with collaborative data and crowdsourced work from the internet. We propose two strategies and apply them to build effective entity linking and question answering (QA) systems. The first strategy integrates an information extraction system with online collaborative knowledge bases (KBs), such as Wikipedia and Freebase. We construct a Cross-Lingual Entity Linking (CLEL) system to connect Chinese entities, such as people and locations, with the corresponding English pages in Wikipedia. The main focus is to break the language barrier between Chinese entities and the English KB, and to resolve the synonymy and polysemy of Chinese entities. To address these problems, we create a cross-lingual taxonomy and a Chinese KB. We investigate two methods of connecting the query representation with the KB representation. Based on the CLEL system that participated in the TAC KBP 2011 evaluation, we finally propose a simple and effective generative model, which achieved much better performance. The second strategy creates annotations for QA systems with the help of crowdsourcing, i.e. distributing a task via the internet and recruiting many people to complete it simultaneously. Various annotated data are required to train the data-driven statistical machine learning algorithms for the underlying components of our QA system. This thesis demonstrates how to convert the annotation task into crowdsourcing micro-tasks, investigates different statistical methods for enhancing the quality of crowdsourced annotation, and finally uses the enhanced annotation to train learning-to-rank models for the passage ranking algorithms in QA.

    The Localisation of Video Games

    The present thesis is a study of the translation of video games with a particular emphasis on the Spanish-English language pair, although other languages are brought into play when they offer a clearer illustration of a particular point in the discussion. On the one hand, it offers a descriptive analysis of the video game industry understood as a global phenomenon in entertainment, with the aim of understanding the norms governing present game development and publishing practices. On the other hand, it discusses particular translation issues that seem to be unique to these entertainment products due to their multichannel and polysemiotic nature, in which verbal and nonverbal signs are intimately interconnected in search of maximum game interactivity. Although this research positions itself within the theoretical framework of Descriptive Translation Studies, it goes beyond the mere accounting of current processes to propose changes whenever professional practice seems unable to rid itself of old unsatisfactory habits. Of a multidisciplinary nature, the present thesis is greatly informed by various areas of knowledge such as audiovisual translation, software localisation, computer-assisted translation and translation memory tools, comparative literature, and video game production and marketing, amongst others. The conclusions are an initial breakthrough in terms of research into this new area, challenging some of the basic tenets current in translation studies thanks to its multidisciplinary approach and its solid grounding in current game localisation industry practice. The results can be useful for boosting professional quality and for promoting the training of translators in video game localisation in higher education centres.

    Automated Translation with Interlingual Word Representations

    In this thesis we investigate the use of translation systems that employ a transfer phase with interlingual representations of words. In this way we approach the problem of lexical ambiguity in machine translation systems as two separate tasks: word sense disambiguation and lexical selection. First, the words in the source language are disambiguated according to their meaning, resulting in interlingual representations of words. Next, a lexical selection module selects the most suitable word in the target language. We give a detailed description of the development and evaluation of translation systems for Dutch-English. This provides the background for the experiments in the second and third parts of this thesis. We then describe a method that determines the meaning of words. It is similar to the classic Lesk algorithm in that it uses the idea that words shared between the context of a word and its definition provide information about its meaning. Instead of counting shared words, however, we use word and sense vectors to compute the similarity between the definition of a sense and the context of a word. We also use our method to locate and interpret puns. Finally, we present a model for lexical choice that selects lemmas given the abstract representations of words. We do this by converting the grammatical trees into hidden Markov trees. In this way the optimal combination of lemmas and their context can be computed.
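
    The vector-based variant of Lesk mentioned in this abstract can be illustrated with a toy sketch. It is not the thesis's implementation: the word vectors, the sense inventory, and the choice to build each sense vector by averaging the vectors of its gloss words are all assumptions made for illustration. The sense whose gloss vector has the highest cosine similarity to the averaged context vector is selected.

```python
import math

# Toy 3-dimensional word vectors; hypothetical values for illustration only.
WORD_VECTORS = {
    "money":   [0.9, 0.1, 0.0],
    "deposit": [0.8, 0.2, 0.1],
    "loan":    [0.9, 0.0, 0.1],
    "river":   [0.0, 0.9, 0.2],
    "water":   [0.1, 0.8, 0.3],
    "shore":   [0.0, 0.9, 0.1],
}

# Hypothetical sense inventory: sense id -> words of its definition (gloss).
SENSES = {
    "bank#finance": ["money", "deposit", "loan"],
    "bank#river": ["river", "shore"],
}


def average_vector(words):
    """Mean of the vectors of the known words; zero vector if none are known."""
    known = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(context_words, senses=SENSES):
    """Return the sense whose gloss vector is most similar to the context."""
    context_vec = average_vector(context_words)
    return max(
        senses,
        key=lambda sense: cosine(context_vec, average_vector(senses[sense])),
    )
```

    For example, `disambiguate(["water", "shore"])` selects the river sense, because the averaged context vector points in nearly the same direction as that gloss vector. Unlike classic Lesk, no word needs to occur literally in both the context and the definition.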

    Predicate Matrix: an interoperable lexical knowledge base for predicates

    183 p. The Predicate Matrix is a new lexical-semantic resource resulting from the integration of multiple knowledge sources, including FrameNet, VerbNet, PropBank and WordNet. The Predicate Matrix provides an extensive and robust lexicon that improves interoperability among the aforementioned semantic resources. Its creation is based on the integration of SemLink and new mappings obtained using automatic methods that link semantic knowledge at the lexical and role levels. We have also extended the Predicate Matrix to cover nominal predicates (English, Spanish) and predicates in other languages (Spanish, Catalan and Basque). As a result, the Predicate Matrix provides a multilingual lexicon that enables interoperable semantic analysis across multiple languages.

    Automatic Speech Recognition (ASR) and NMT for Interlingual and Intralingual Communication: Speech to Text Technology for Live Subtitling and Accessibility.

    Given the increasing demand for institutional translation and the multilingualism of international organizations, the application of Artificial Intelligence (AI) technologies in multilingual communications and for the purposes of accessibility has become an important element in the production of translation and interpreting services (Zetzsche, 2019). In particular, the widespread use of Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT) technology represents a recent development in the attempt to satisfy the increasing demand for interinstitutional, multilingual communications at inter-governmental level (Maslias, 2017). Recently, researchers have been calling for a universalistic view of media and conference accessibility (Greco, 2016). The application of ASR, combined with NMT, may allow for the breaking down of communication barriers at European institutional conferences where multilingualism represents a fundamental pillar (Jopek Bosiacka, 2013). In addition to representing a so-called disruptive technology (Accipio Consulting, 2006), ASR technology may facilitate communication with non-hearing users (Lewis, 2015). Thanks to ASR, it is possible to guarantee content accessibility for non-hearing audiences via subtitles at institutionally held conferences or speeches. Hence the need for analysing and evaluating ASR output: a quantitative approach is adopted to evaluate the subtitles, with the objective of assessing their accuracy (Romero-Fresco, 2011). A database of English-language speeches and conferences on climate change from the F.A.O. and other international institutions is taken into consideration. The statistical approach is based on the WER and NER models (Romero-Fresco, 2016) and on an adapted version. The ASR software solutions used in the study are VoxSigma by Vocapia Research and the Google Speech Recognition engine. After a taxonomic scheme has been defined, Native and Non-Native subtitles are compared to gold standard transcriptions. The intralingual and interlingual output generated by NMT is specifically analysed and evaluated via the NTR model to assess accuracy and accessibility.
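
    As an illustration of the WER measure named in this abstract, the sketch below computes the standard word error rate: the number of word substitutions, deletions and insertions needed to turn the reference transcript into the ASR hypothesis, divided by the number of reference words. It covers only plain WER under whitespace tokenization (an assumption); the NER and NTR models and the adapted version used in the study are not reproduced here, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

    Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.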

    Multilingual Knowledge Graphs and Low-Resource Languages: A Review

    There is a lack of multilingual data to support applications in a large number of languages, especially for low-resource languages. Knowledge graphs (KG) could contribute to closing the gap of language support by providing easily accessible, machine-readable, multilingual linked data, which can be reused across applications. In this paper, we provide an overview of work in the domain of multilingual KGs with a focus on low-resource languages. We review the current state of multilingual KGs along with the different aspects that are crucial for creating KGs with language coverage in mind. Special consideration is given to challenges particular to low-resource languages in KGs. We further provide an overview of applications that yield multilingual KG information as well as downstream applications reusing such multilingual data. Finally, we explore open problems regarding multilingual KGs with a focus on low-resource languages