
    Ebaluatoia: crowd evaluation of English-Basque machine translation

    This dissertation reports on Ebaluatoia, a crowd-based, large-scale English-Basque machine translation evaluation campaign. The initiative aimed to compare translation quality across five machine translation systems: two statistical systems, a rule-based system and a hybrid system developed within the IXA group, and an external system, Google Translate. We have established a ranking of the systems under study and performed qualitative analyses to guide further research, in particular an initial analysis of subsets of the evaluation collection, a structural analysis of the source sentences, and an error analysis of the translations, which together indicate where future analysis effort should be placed.
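
    As a rough illustration of how a system ranking can be derived from crowd judgments, the sketch below orders systems by the fraction of pairwise comparisons they won. The system names, the judgment format and the win-ratio statistic are illustrative assumptions, not the campaign's actual data or ranking method.

        from collections import defaultdict

        # Hypothetical pairwise judgments: each tuple records that a crowd
        # worker preferred the first system's translation over the second
        # for one sentence (toy data, not Ebaluatoia's).
        judgments = [
            ("hybrid", "smt-1"), ("smt-1", "rbmt"), ("google", "smt-2"),
            ("hybrid", "google"), ("smt-1", "smt-2"), ("hybrid", "rbmt"),
        ]

        wins, totals = defaultdict(int), defaultdict(int)
        for winner, loser in judgments:
            wins[winner] += 1
            totals[winner] += 1
            totals[loser] += 1

        # Rank systems by the fraction of comparisons they won.
        for system in sorted(totals, key=lambda s: wins[s] / totals[s], reverse=True):
            print(f"{system}: {wins[system]}/{totals[system]} comparisons won")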

    Deep Neural Networks for Text Ranking in Question Answering Systems

    Ph.D. dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, August 2020 (advisor: Kyomin Jung). The question answering (QA) system has attracted huge interest due to its applicability in real-world applications. This dissertation proposes novel ranking algorithms for QA systems based on deep neural networks. We first tackle long-text QA, which requires the model to understand excessively long sequences of text input. To solve this problem, we propose a hierarchical recurrent dual encoder that encodes texts from the word level up to the paragraph level. We further propose a latent topic clustering method that utilizes semantic information in the target corpus and thus increases the performance of the QA system. Secondly, we investigate short-text QA, where the information in a text pair is limited. To overcome this insufficiency, we combine a pretrained language model and an enhanced latent clustering method in the QA model. This novel architecture enables the model to utilize additional information, achieving state-of-the-art performance on the standard answer-selection tasks (i.e., WikiQA, TREC-QA). Finally, we investigate detecting supporting sentences for complex QA systems. As opposed to previous studies, here the model needs to understand the relationships between sentences to answer the question. Inspired by the hierarchical nature of text, we propose a graph neural network-based model that iteratively propagates necessary information between text nodes and achieves the best performance among existing methods.
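
    To make the hierarchical encoding idea concrete, here is a minimal PyTorch sketch of a word-level-to-chunk-level recurrent encoder used in a dual-encoder matching setup. The GRU choice, the dimensions and the bilinear scoring are assumptions for illustration; they need not match the dissertation's exact HRDE+LTC model.

        import torch
        import torch.nn as nn

        class HierarchicalEncoder(nn.Module):
            """Sketch: encode words into chunk vectors, then chunks into one
            document vector (hierarchical recurrent encoding)."""
            def __init__(self, vocab_size, emb_dim=128, hidden=256):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, emb_dim)
                self.word_rnn = nn.GRU(emb_dim, hidden, batch_first=True)
                self.chunk_rnn = nn.GRU(hidden, hidden, batch_first=True)

            def forward(self, chunks):                    # (batch, n_chunks, n_words)
                b, n, w = chunks.shape
                words = self.embed(chunks.view(b * n, w)) # word-level encoding
                _, h = self.word_rnn(words)               # (1, b*n, hidden)
                chunk_vecs = h.squeeze(0).view(b, n, -1)  # one vector per chunk
                _, doc = self.chunk_rnn(chunk_vecs)       # chunk-level encoding
                return doc.squeeze(0)                     # (batch, hidden)

        # Dual-encoder matching: score = sigmoid(q_enc . W a_enc).
        encoder = HierarchicalEncoder(vocab_size=10000)
        W = nn.Linear(256, 256, bias=False)
        q = encoder(torch.randint(0, 10000, (2, 4, 12)))  # toy question batch
        a = encoder(torch.randint(0, 10000, (2, 6, 12)))  # toy answer batch
        score = torch.sigmoid((W(q) * a).sum(-1))
        print(score.shape)                                # torch.Size([2])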

    Integrating deep and shallow natural language processing components : representations and hybrid architectures

    We describe basic concepts and software architectures for the integration of shallow and deep (linguistics-based, semantics-oriented) natural language processing (NLP) components. The main goal of this novel, hybrid integration paradigm is improving the robustness of deep processing. After an introduction to constraint-based natural language parsing, we give an overview of typical shallow processing tasks. We introduce XML standoff markup as an additional abstraction layer that eases the integration of NLP components, and propose the use of XSLT as a standardized and efficient transformation language for online NLP integration. In the main part of the thesis, we describe our contributions to three hybrid architecture frameworks that make use of these fundamentals. SProUT is a shallow system that uses elements of deep constraint-based processing, namely a type hierarchy and typed feature structures. WHITEBOARD is the first hybrid architecture to integrate not only part-of-speech tagging but also named entity recognition and topological parsing with deep parsing. Finally, we present Heart of Gold, a middleware architecture that generalizes WHITEBOARD along various dimensions such as configurability, multilinguality and flexible processing strategies. We describe various applications that have been implemented using the hybrid frameworks, such as structured named entity recognition, information extraction, creative document authoring support and deep question analysis, as well as evaluations. In WHITEBOARD, for example, it could be shown that shallow pre-processing increases both the coverage and the efficiency of deep parsing by a factor of more than two. Heart of Gold not only forms the basis for applications that utilize semantics-oriented natural language analysis, but also constitutes a complex research instrument for experimenting with novel processing strategies combining deep and shallow methods, and it eases the replication and comparability of results.
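
    The following sketch illustrates the standoff-markup idea with Python's lxml: annotations reference character offsets into the untouched source text rather than wrapping it, so several components can annotate the same text without interfering, and a small XSLT stylesheet transforms the annotation layer. The element names and schema are invented for illustration, not the thesis' actual markup.

        from lxml import etree

        text = "Heart of Gold was developed at DFKI."

        # Standoff layer: annotations point into the raw text by offset.
        standoff = etree.XML(
            '<annotations>'
            '<ne type="project" start="0" end="13"/>'
            '<ne type="organisation" start="31" end="35"/>'
            '</annotations>'
        )

        # A tiny XSLT stylesheet rendering the standoff layer as an HTML list,
        # standing in for the online NLP integration transforms.
        xslt = etree.XSLT(etree.XML(
            '<xsl:stylesheet version="1.0" '
            'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
            '<xsl:template match="/annotations">'
            '<ul><xsl:apply-templates/></ul>'
            '</xsl:template>'
            '<xsl:template match="ne">'
            '<li><xsl:value-of select="@type"/>: '
            '<xsl:value-of select="@start"/>-<xsl:value-of select="@end"/></li>'
            '</xsl:template>'
            '</xsl:stylesheet>'
        ))

        print(etree.tostring(xslt(standoff), pretty_print=True).decode())
        for ne in standoff.iter("ne"):            # resolve offsets into the text
            s, e = int(ne.get("start")), int(ne.get("end"))
            print(ne.get("type"), "->", text[s:e])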

    Research in Linguistic Engineering: Resources and Tools

    In this paper we revisit some of the resources and tools developed by the members of the Intelligent Systems Research Group (GSI) at UPM as well as by the Information Retrieval and Natural Language Processing Research Group (IR&NLP) at UNED. Details about the developed resources (corpora, software) and current interests and projects are given for both groups. We also include a brief summary of, and links to, open-source resources and tools developed by other groups of the MAVIR consortium.

    Natural Language Generation from Knowledge-Base Triples

    The main goal of this master thesis is to create a machine-learning-based tool that verbalizes given data, i.e., that from input RDF triples creates a corresponding text in a natural language (English) such that the text is grammatically correct and fluent, contains all information from the input data, and adds no information beyond it. The thesis begins by examining the publicly available datasets; it then focuses on the architectures of statistical machine learning models and their possible usage for natural language generation. The work also covers numerical text representation, text generation by machine learning models, and optimization algorithms for training such models. The next part of the thesis proposes two main solutions to the problem and examines each of them. All systems are evaluated with automatic metrics, and the best-performing models are then passed to a human (manual) evaluation. The last part of the thesis covers implementing the final application and deploying it to production.
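
    As a minimal illustration of the input side of such a system, the sketch below linearises RDF triples into a delimited token sequence, the common preprocessing step before feeding triples to a sequence-to-sequence model. The <S>/<P>/<O> delimiter tokens and the example triples are assumptions, not necessarily the thesis' exact format.

        # Flatten RDF triples into one source string for a seq2seq model.
        triples = [
            ("Alan_Turing", "birthPlace", "Maida_Vale"),
            ("Alan_Turing", "field", "Computer_Science"),
        ]

        def linearize(triples):
            parts = []
            for subj, pred, obj in triples:
                parts += ["<S>", subj.replace("_", " "),
                          "<P>", pred,
                          "<O>", obj.replace("_", " ")]
            return " ".join(parts)

        print(linearize(triples))
        # <S> Alan Turing <P> birthPlace <O> Maida Vale <S> Alan Turing <P> field <O> Computer Science
        # A reference verbalization would be, e.g.:
        # "Alan Turing, who worked in computer science, was born in Maida Vale."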

    Analyzing short-answer questions and their automatic scoring - studies on semantic relations in reading comprehension and the reduction of human annotation effort

    Short-answer questions are a widespread exercise type in many educational areas. Answers given by learners to such questions are scored by teachers based on their content alone, ignoring their linguistic correctness as far as possible. They typically have a length of up to a few sentences. Manual scoring is a time-consuming task, so automatic scoring of short-answer questions using natural language processing techniques has become important. This thesis focuses on two aspects of short-answer questions and their scoring. First, we concentrate on a reading comprehension scenario for learners of German as a foreign language, where students answer questions about a reading text. Within this scenario, we examine the multiple relations between reading texts, learner answers and teacher-specified target answers. Second, we investigate how to reduce the human scoring workload through both fully automatic and computer-assisted scoring. The latter is a scenario where scoring is not done entirely automatically, but where a teacher receives scoring support, for example by means of clustering similar answers together.

    Addressing the first aspect, we conduct a series of corpus annotation studies which highlight the relations between pairs of learner answers and target answers, as well as between both types of answers and the reading text they refer to. We annotate sentences from the reading text that were potentially used by learners or teachers for constructing answers and observe that, unsurprisingly, most correct answers can easily be linked to the text; incorrect answers often link to the text as well, but are frequently backed up by a part of the text that is not relevant to answering the question. Based on these findings, we create a new baseline scoring model which considers for correctness whether learners looked for an answer in the right place or not. After identifying those links into the text, we label the relation between learner answers and target answers, as well as between reading texts and answers, by annotating entailment relations. In contrast to the widespread assumption that scoring can be fully mapped to the task of recognizing textual entailment, we find the two tasks to be only closely related and not completely equivalent. Correct answers often, but not always, entail the target answer as well as part of the related text, and incorrect answers mostly do not stand in an entailment relation to the target answer, but often have some overlap with the text. This close relatedness allows us to use gold-standard entailment information to improve the performance of automatic scoring. We also use the links between learner answers and both reading texts and target answers in a statistical alignment-based scoring approach using methods from machine translation, and we reach a performance comparable to an existing knowledge-based alignment approach.

    Our investigations into how human scoring effort can be reduced when learner answers are manually scored by teachers are based on two methods: active learning and clustering. In the active learning approach, we score particularly informative items first, i.e., items from which a classifier can learn most, identifying them through uncertainty-based sample selection. In this way, we reach a higher performance with a given number of annotation steps compared to randomly selected answers. In the second research strand, we use clustering methods to group similar answers together, such that groups of answers can be scored in one scoring step. In doing so, the number of necessary labeling steps can be substantially reduced. When comparing clustering-based scoring to classical supervised machine learning setups, where the human annotations are used to train a classifier, supervised machine learning is still in the lead in terms of performance, whereas clusters provide the advantage of structured output. However, we are able to close part of the performance gap by means of supervised feature selection and semi-supervised clustering. In an additional study, we investigate the automatic processing of learner language with respect to the performance of part-of-speech (POS) tagging tools. We manually annotate a German reading comprehension corpus with both spelling normalization and POS information and find that the performance of automatic POS tagging can be improved by spell-checking the data, using the reading text as additional evidence for the lexical material intended in a learner answer.
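
    The following toy sketch illustrates the uncertainty-based sample selection loop described above, using scikit-learn's logistic regression on synthetic data. The features, the model choice and the simulated teacher labels are placeholders for the thesis' actual setup.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X_labeled = rng.normal(size=(20, 5))       # seed answers, already scored
        y_labeled = np.array([0, 1] * 10)          # teacher scores (toy labels)
        X_pool = rng.normal(size=(200, 5))         # unscored answers

        for step in range(5):                      # five annotation rounds
            clf = LogisticRegression().fit(X_labeled, y_labeled)
            proba = clf.predict_proba(X_pool)[:, 1]
            idx = int(np.argmin(np.abs(proba - 0.5)))  # most uncertain answer
            new_label = int(rng.integers(0, 2))        # stand-in for a teacher score
            X_labeled = np.vstack([X_labeled, X_pool[idx:idx + 1]])
            y_labeled = np.append(y_labeled, new_label)
            X_pool = np.delete(X_pool, idx, axis=0)
            print(f"round {step}: queried item with p(correct)={proba[idx]:.2f}")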

    Results of the WMT14 Metrics Shared Task

    This paper presents the results of the WMT14 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT14 Shared Translation Task. We collected scores of 23 metrics from 12 research groups. In addition, we computed scores of 6 standard metrics (BLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with the WMT14 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence).
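
    A minimal sketch of the two evaluation views, with toy numbers in place of the WMT14 data: Pearson correlation between per-system metric scores and human scores at the system level, and a Kendall-style agreement count over human-judged translation pairs at the segment level.

        import numpy as np
        from scipy.stats import pearsonr

        # System level: correlate a metric's per-system scores with the
        # official human scores (toy values, not WMT14 data).
        human  = np.array([0.81, 0.74, 0.62, 0.55, 0.43])   # manual evaluation
        metric = np.array([34.1, 31.8, 29.5, 30.2, 25.7])   # e.g. BLEU per system
        r, _ = pearsonr(human, metric)
        print(f"system-level Pearson r = {r:.3f}")

        # Segment level: over pairs of translations of the same sentence,
        # count how often the metric prefers the human-preferred output.
        # Each pair: (metric score of human-preferred output, score of the other).
        pairs = [(0.9, 0.4), (0.3, 0.7), (0.6, 0.5), (0.8, 0.2)]
        concordant = sum(a > b for a, b in pairs)
        discordant = sum(a < b for a, b in pairs)
        tau = (concordant - discordant) / len(pairs)
        print(f"segment-level Kendall-style tau = {tau:.3f}")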

    A Primer on Seq2Seq Models for Generative Chatbots

    The recent spread of Deep Learning-based solutions for Artificial Intelligence and the development of Large Language Models have pushed the Natural Language Processing field forward significantly. The approach has evolved quickly over the last ten years, deeply affecting NLP, from low-level text pre-processing tasks, such as tokenisation or POS tagging, to high-level, complex NLP applications like machine translation and chatbots. This paper examines recent trends in the development of open-domain, data-driven generative chatbots, focusing on Seq2Seq architectures. Such architectures are compatible with multiple learning approaches, ranging from supervised to reinforcement learning, and in recent years have made it possible to build very engaging open-domain chatbots. Not only do these architectures allow a model to directly generate the next turn in a conversation but, to some extent, they also allow control over the style or content of the response. To offer a complete view of the subject, we examine possible architecture implementations as well as training and evaluation approaches. Additionally, we provide information about the openly available corpora for training and evaluating such models, and about current and past chatbot competitions. Finally, we present some insights on possible future directions, given the current state of research.
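
    As a concrete reference point for the architecture family the paper surveys, here is a minimal GRU-based encoder-decoder with greedy decoding in PyTorch. The dimensions and vocabulary are toy values; production chatbots add attention, pretraining and far larger models.

        import torch
        import torch.nn as nn

        class Seq2Seq(nn.Module):
            """Minimal encoder-decoder sketch: encode the user turn, then
            generate the response token by token (greedy decoding)."""
            def __init__(self, vocab=1000, emb=64, hidden=128):
                super().__init__()
                self.embed = nn.Embedding(vocab, emb)
                self.encoder = nn.GRU(emb, hidden, batch_first=True)
                self.decoder = nn.GRU(emb, hidden, batch_first=True)
                self.out = nn.Linear(hidden, vocab)

            def forward(self, src, bos_id=1, max_len=10):
                _, h = self.encoder(self.embed(src))      # encode the user turn
                tok = torch.full((src.size(0), 1), bos_id)
                reply = []
                for _ in range(max_len):                  # greedy decoding loop
                    dec, h = self.decoder(self.embed(tok), h)
                    tok = self.out(dec).argmax(-1)        # next-token prediction
                    reply.append(tok)
                return torch.cat(reply, dim=1)

        model = Seq2Seq()
        user_turn = torch.randint(0, 1000, (1, 7))        # toy token ids
        print(model(user_turn).shape)                     # torch.Size([1, 10])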