242 research outputs found

    Similar Text Fragments Extraction for Identifying Common Wikipedia Communities

    Get PDF
    Similar text fragments extraction from weakly formalized data is the task of natural language processing and intelligent data analysis and is used for solving the problem of automatic identification of connected knowledge fields. In order to search such common communities in Wikipedia, we propose to use as an additional stage a logical-algebraic model for similar collocations extraction. With Stanford Part-Of-Speech tagger and Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words. WithWordNet synsets, we choose their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments inWikipedia articles that form common information spaces. The number of highly frequented synonymous collocations can obtain an indication of key common up-to-date Wikipedia communities

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Full text link
    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012Palanci

    ParaPhraser: Russian paraphrase corpus and shared task

    Get PDF
    The paper describes the results of the First Russian Paraphrase Detection Shared Task held in St.-Petersburg, Russia, in October 2016. Research in the area of paraphrase extraction, detection and generation has been successfully developing for a long time while there has been only a recent surge of interest towards the problem in the Russian community of computational linguistics. We try to overcome this gap by introducing the project ParaPhraser.ru dedicated to the collection of Russian paraphrase corpus and organizing a Paraphrase Detection Shared Task, which uses the corpus as the training data. The participants of the task applied a wide variety of techniques to the problem of paraphrase detection, from rule-based approaches to deep learning, and results of the task reflect the following tendencies: the best scores are obtained by the strategy of using traditional classifiers combined with fine-grained linguistic features, however, complex neural networks, shallow methods and purely technical methods also demonstrate competitive results.Peer reviewe

    A resource-light method for cross-lingual semantic textual similarity

    Full text link
    [EN] Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and a very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via the linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a sufficiently large corpus, required to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to those of complex resource-intensive state-of-the-art models for the respective tasks. (C) 2017 Published by Elsevier B.V.Part of the work presented in this article was performed during second author's research visit to the University of Mannheim, supported by Contact Fellowship awarded by the DAAD scholarship program "STIBET Doktoranden". The research of the last author has been carried out in the framework of the SomEMBED project (TIN2015-71147-C2-1-P). Furthermore, this work was partially funded by the Junior-professor funding programme of the Ministry of Science, Research and the Arts of the state of Baden-Wurttemberg (project "Deep semantic models for high-end NLP application").Glavas, G.; Franco-Salvador, M.; Ponzetto, SP.; Rosso, P. (2018). A resource-light method for cross-lingual semantic textual similarity. Knowledge-Based Systems. 143:1-9. https://doi.org/10.1016/j.knosys.2017.11.041S1914

    Event structures in knowledge, pictures and text

    Get PDF
    This thesis proposes new techniques for mining scripts. Scripts are essential pieces of common sense knowledge that contain information about everyday scenarios (like going to a restaurant), namely the events that usually happen in a scenario (entering, sitting down, reading the menu...), their typical order (ordering happens before eating), and the participants of these events (customer, waiter, food...). Because many conventionalized scenarios are shared common sense knowledge and thus are usually not described in standard texts, we propose to elicit sequential descriptions of typical scenario instances via crowdsourcing over the internet. This approach overcomes the implicitness problem and, at the same time, is scalable to large data collections. To generalize over the input data, we need to mine event and participant paraphrases from the textual sequences. For this task we make use of the structural commonalities in the collected sequential descriptions, which yields much more accurate paraphrases than approaches that do not take structural constraints into account. We further apply the algorithm we developed for event paraphrasing to parallel standard texts for extracting sentential paraphrases and paraphrase fragments. In this case we consider the discourse structure in a text as a sequential event structure. As for event paraphrasing, the structure-aware paraphrasing approach clearly outperforms systems that do not consider discourse structure. As a multimodal application, we develop a new resource in which textual event descriptions are grounded in videos, which enables new investigations on action description semantics and a more accurate modeling of event description similarities. This grounding approach also opens up new possibilities for applying the computed script knowledge for automated event recognition in videos.Die vorliegende Dissertation schlägt neue Techniken zur Berechnung von Skripten vor. Skripte sind essentielle Teile des Allgemeinwissens, die Informationen über alltägliche Szenarien (wie im Restaurant essen) enthalten, nämlich die Ereignisse, die typischerweise in einem Szenario vorkommen (eintreten, sich setzen, die Karte lesen...), deren typische zeitliche Abfolge (man bestellt bevor man isst), und die Teilnehmer der Ereignisse (ein Gast, der Kellner, das Essen,...). Da viele konventionalisierte Szenarien implizit geteiltes Allgemeinwissen sind und üblicherweise nicht detailliert in Texten beschrieben werden, schlagen wir vor, Beschreibungen von typischen Szenario-Instanzen durch sog. “Crowdsourcing” über das Internet zu sammeln. Dieser Ansatz löst das Implizitheits-Problem und lässt sich gleichzeitig zu großen Daten-Sammlungen hochskalieren. Um über die Eingabe-Daten zu generalisieren, müssen wir in den Text-Sequenzen Paraphrasen für Ereignisse und Teilnehmer finden. Hierfür nutzen wir die strukturellen Gemeinsamkeiten dieser Sequenzen, was viel präzisere Paraphrasen-Information ergibt als Standard-Ansätze, die strukturelle Einschränkungen nicht beachten. Die Techniken, die wir für die Ereignis-Paraphrasierung entwickelt haben, wenden wir auch auf parallele Standard-Texte an, um Paraphrasen auf Satz-Ebene sowie Paraphrasen-Fragmente zu extrahieren. Hier betrachten wir die Diskurs-Struktur eines Textes als sequentielle Ereignis-Struktur. Auch hier liefert der strukturell informierte Ansatz klar bessere Ergebnisse als herkömmliche Systeme, die Diskurs-Struktur nicht in die Berechnung mit einbeziehen. Als multimodale Anwendung entwickeln wir eine neue Ressource, in der Text-Beschreibungen von Ereignissen mittels zeitlicher Synchronisierung in Videos verankert sind. Dies ermöglicht neue Ansätze für die Erforschung der Semantik von Ereignisbeschreibungen, und erlaubt außerdem die Modellierung treffenderer Ereignis-Ähnlichkeiten. Dieser Schritt der visuellen Verankerung von Text in Videos eröffnet auch neue Möglichkeiten für die Anwendung des berechneten Skript-Wissen bei der automatischen Ereigniserkennung in Videos

    Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models

    Get PDF
    This dissertation focuses on effective combination of data-driven natural language processing (NLP) approaches with linguistic knowledge sources that are based on manual text annotation or word grouping according to semantic commonalities. I gainfully apply fine-grained linguistic soft constraints -- of syntactic or semantic nature -- on statistical NLP models, evaluated in end-to-end state-of-the-art statistical machine translation (SMT) systems. The introduction of semantic soft constraints involves intrinsic evaluation on word-pair similarity ranking tasks, extension from words to phrases, application in a novel distributional paraphrase generation technique, and an introduction of a generalized framework of which these soft semantic and syntactic constraints can be viewed as instances, and in which they can be potentially combined. Fine granularity is key in the successful combination of these soft constraints, in many cases. I show how to softly constrain SMT models by adding fine-grained weighted features, each preferring translation of only a specific syntactic constituent. Previous attempts using coarse-grained features yielded negative results. I also show how to softly constrain corpus-based semantic models of words (“distributional profiles”) to effectively create word-sense-aware models, by using semantic word grouping information found in a manually compiled thesaurus. Previous attempts, using hard constraints and resulting in aggregated, coarse-grained models, yielded lower gains. A novel paraphrase generation technique incorporating these soft semantic constraints is then also evaluated in a SMT system. This paraphrasing technique is based on the Distributional Hypothesis. The main advantage of this novel technique over current “pivoting” techniques for paraphrasing is the independence from parallel texts, which are a limited resource. The evaluation is done by augmenting translation models with paraphrase-based translation rules, where fine-grained scoring of paraphrase-based rules yields significantly higher gains. The model augmentation includes a novel semantic reinforcement component: In many cases there are alternative paths of generating a paraphrase-based translation rule. Each of these paths reinforces a dedicated score for the “goodness” of the new translation rule. This augmented score is then used as a soft constraint, in a weighted log-linear feature, letting the translation model learn how much to “trust” the paraphrase-based translation rules. The work reported here is the first to use distributional semantic similarity measures to improve performance of an end-to-end phrase-based SMT system. The unified framework for statistical NLP models with soft linguistic constraints enables, in principle, the combination of both semantic and syntactic constraints -- and potentially other constraints, too -- in a single SMT model

    Cross-Lingual Textual Entailment and Applications

    Get PDF
    Textual Entailment (TE) has been proposed as a generic framework for modeling language variability. The great potential of integrating (monolingual) TE recognition components into NLP architectures has been reported in several areas, such as question answering, information retrieval, information extraction and document summarization. Mainly due to the absence of cross-lingual TE (CLTE) recognition components, similar improvements have not yet been achieved in any corresponding cross-lingual application. In this thesis, we propose and investigate Cross-Lingual Textual Entailment (CLTE) as a semantic relation between two text portions in dierent languages. We present dierent practical solutions to approach this problem by i) bringing CLTE back to the monolingual scenario, translating the two texts into the same language; and ii) integrating machine translation and TE algorithms and techniques. We argue that CLTE can be a core tech- nology for several cross-lingual NLP applications and tasks. Experiments on dierent datasets and two interesting cross-lingual NLP applications, namely content synchronization and machine translation evaluation, conrm the eectiveness of our approaches leading to successful results. As a complement to the research in the algorithmic side, we successfully explored the creation of cross-lingual textual entailment corpora by means of crowdsourcing, as a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators

    Analyzing short-answer questions and their automatic scoring - studies on semantic relations in reading comprehension and the reduction of human annotation effort

    Get PDF
    Short-answer questions are a wide-spread exercise type in many educational areas. Answers given by learners to such questions are scored by teachers based on their content alone ignoring their linguistic correctness as far as possible. They typically have a length of up to a few sentences. Manual scoring is a time-consuming task, so that automatic scoring of short-answer questions using natural language processing techniques has become an important task. This thesis focuses on two aspects of short-answer questions and their scoring: First, we concentrate on a reading comprehension scenario for learners of German as a foreign language, where students answer questions about a reading text. Within this scenario, we examine the multiple relations between reading texts, learner answers and teacher-specified target answers. Second, we investigate how to reduce human scoring workload by both fully automatic and computer-assisted scoring. The latter is a scenario where scoring is not done entirely automatically, but where a teacher receives scoring support, for example, by means of clustering similar answers together. Addressing the first aspect, we conduct a series of corpus annotation studies which highlight the relations between pairs of learner answers and target answers, as well as between both types of answers and the reading text they refer to. We annotate sentences from the reading text that were potentially used by learners or teachers for constructing answers and observe that, unsurprisingly, most correct answers can easily be linked to the text; incorrect answers often link to the text as well, but are often backed up by a part of the text not relevant to answer the question. Based on these findings, we create a new baseline scoring model which considers for correctness whether learners looked for an answer in the right place or not. After identifying those links into the text, we label the relation between learner answers and target answers as well as between reading texts and answers by annotating entailment relations. In contrast to the widespread assumption that scoring can be fully mapped to the task of recognizing textual entailment, we find the two tasks to be only closely related and not completely equivalent. Correct answers do often, but not always, entail the target answer, as well as part of the related text, and incorrect answers do most of the time not stand in an entailment relation to the target answer, but often have some overlap with the text. This close relatedness allows us to use gold-standard entailment information to improve the performance of automatic scoring. We also use links between learner answers and both reading texts and target answers in a statistical alignment-based scoring approach using methods from machine translation and reach a performance comparable to an existing knowledge-based alignment approach. Our investigations into how human scoring effort can be reduced when learner answers are manually scored by teachers are based on two methods: active learning and clustering. In the active learning approach, we score particularly informative items first, i.e., items from which a classifier can learn most, identifying them using uncertainty-based sample selection. In this way, we reach a higher performance with a given number of annotation steps compared to randomly selected answers. In the second research strand, we use clustering methods to group similar answers together, such that groups of answers can be scored in one scoring step. In doing so, the number of necessary labeling steps can be substantially reduced. When comparing clustering-based scoring to classical supervised machine learning setups, where the human annotations are used to train a classifier, supervised machine learning is still in the lead in terms of performance, whereas clusters provide the advantage of structured output. However, we are able to close part of the performance gap by means of supervised feature selection and semi-supervised clustering. In an additional study, we investigate the automatic processing of learner language with respect to the performance of part-of-speech (POS) tagging tools. We manually annotate a German reading comprehension corpus both with spelling normalization and POS information and find that the performance of automatic POS tagging can be improved by spell-checking the data using the reading text as additional evidence for lexical material intended in a learner answer.Short-Answer-Fragen sind ein weit verbreiteter Aufgabentyp in vielen Bildungsbereichen. Die Antworten, die Lerner zu solchen Aufgaben geben, werden von Lehrenden allein auf Grundlage ihres Inhalts bewertet; linguistische Korrektheit wird soweit möglich ignoriert. Diese Doktorarbeit legt ihren Schwerpunkt auf zwei Aspekte im Zusammenhang mit Short- Answer-Fragen und ihrer Bewertung: Zum einen betrachten wir ein Leseverständnisszenario, bei dem Studenten Fragen zu Lesetexten beantworten. Dabei untersuchen wir insbesondere die verschiedenen Beziehungen, die es zwischen Lesetexten, Lernerantworten und vom Lehrer erstellten Musterantworten gibt. Zum anderen untersuchen wir, wie der menschliche Bewertungsaufwand durch voll-automatisches und computergestütztes Bewerten reduziert werden kann. Bei letzterem handelt es sich um ein Szenario, in dem Lehrer bei der Bewertung unterstützt werden, z.B. indem ähnliche Antworten automatisch gruppiert werden. Zur Untersuchung des ersten Aspekts unternehmen wir eine Reihe von Korpusannotationsstudien, die sowohl die Beziehungen zwischen Lerner- und Musterantworten beleuchten, als auch die Beziehung zwischen diesen Antworten und dem Lesetext, auf den sie sich beziehen. Wir annotieren Sätze aus dem Lesetext, die vermutlich bei der Formulierung einer Antwort benutzt wurden und machen die zu erwartende Beobachtung, dass die meisten korrekten Antworten problemlos mit bestimmten Textpassagen in Verbindung gebracht werden können. Inkorrekte Antworten haben ebenfalls oft eine Verbindung zu bestimmten Textpassagen, die aber oft für die jeweilige Frage nicht relevant sind. Auf Grundlage dieser Erkenntnisse entwerfen wir ein neues Baseline-Bewertungsmodell, das für die Korrektheit einer Antwort nur in Betracht zieht, ob der Lerner die Antwort an der richtigen Stelle im Lesetext gesucht hat oder nicht. Nachdem wir diese Verbindungen in den Text identifiziert haben, annotieren wir die Relation zwischen Lerner- und Musterantworten und zwischen Texten und Antworten mit Entailment- Relationen. Im Gegensatz zur der weitverbreiteten Annahme, dass das Bewerten von Short- Answer-Fragen und das Erkennen von Textual-Entailment-Relationen zwischen Lerner und Musterantworten sich direkt entsprechen, finden wir heraus, dass die beiden Aufgaben nur nahe verwandt aber nicht vollständig äquivalent sind. Korrekte Antworten entailen meistens, aber nicht immer, die Musterantwort und auch den entsprechenden Satz im Lesetext. Inkorrekte Antworten stehen meist in keiner Entailmentrelation mit der Musterantwort, haben aber oft zumindest teilweisen Overlap mit dem Text. Diese nahe Verwandtschaft erlaubt es uns, Goldstandard-Entailmentinformation zu benutzen, um die Performanz beim automatischen Bewerten zu verbessern. Wir benutzen die annotierten Verbindungen zwischen Lesetexten und Antworten auch in einem Scoringansatz, der auf statistischem Alignment basiert und Methoden aus dem Bereich der maschinellen Übersetzung nutzt. Dabei erreichen wir eine Scoringgenauigkeit, die mit Ansätzen, die ein existierendes wissensbasiertes Alignment nutzen, vergleichbar ist. Unsere Untersuchungen, wie der Bewertungsaufwand beim Menschen verringert werden kann, wenn Antworten vom Lehrer manuell bewertet werden, basieren auf zwei Methoden: Active Learning und Clustering. Beim Active-Learning-Ansatz werden besonders informative Antworten vorrangig zur Bewertung ausgewählt, d.h. solche Antworten, von denen ein Klassifikator besonders viel lernen kann. Wir identifizieren solche Antworten durch Uncertainty-Sampling- Methoden und erreichen dadurch mit einer gegebenen Anzahl von Annotationsschritten eine höhere Klassifikationsgenauigkeit als mit zufällig ausgewählten Antworten. In unserem zweiten Forschungszweig nutzen wir Clusteringmethoden um ähnliche Antworten zu gruppieren, so dass Gruppen von Antworten in einem Annotationsschritt bewertet werden können. Dadurch kann die Anzahl der insgesamt nötigen Bewertungsschritte drastisch reduziert werden. Beim Vergleich zwischen clusteringbasierten Bewertungsverfahren und klassischem überwachten maschinellen Lernen, bei dem menschliche Annotationen dazu genutzt werden, einen Klassifikator zu trainieren, erbringen überwachte maschinelle Lernverfahren immer noch eine höhere Bewertungsgenauigkeit. Demgegenüber bringen Cluster den Vorteil eines strukturierten Outputs mit sich. Wir sind jedoch in der Lage, einen Teil diese Genauigkeitslücke zu schließen, in dem wir überwachte Featureauswahl und halbüberwachtes Clustering anwenden. In einer zusätzlichen Studie untersuchen wir die automatische Verarbeitung von Lernersprache im Hinblick auf die Performanz vonWerkzeugen für dasWortarten-Tagging. Wir annotieren ein deutsches Leseverstehenskorpus manuell sowohl mit Normalisierungsinformation in Bezug auf Rechtschreibung als auch mit Wortartinformation. Als Ergebnis der Studie finden wir, dass die Performanz bei der automatischen Wortartenzuweisung durch Rechtschreibkorrektur verbessert werden kann, insbesondere wenn wir den Lesetext als zusätzliche Evidenz dafür verwenden, welche Wörter der Leser in einer Antwort vermutlich benutzen wollte
    corecore