1,553 research outputs found

    Using Extra-Linguistic Material for Mandarin-French Verbal Constructions Comparison

    Get PDF
    Systematic cross-linguistic studies of verbs' syntactic-semantic behaviour for typologically distant languages such as Mandarin Chinese and French are difficult to conduct. Such studies are nevertheless necessary because of the crucial role that verbal constructions play in the mental lexicon. This paper addresses the problem by combining psycholinguistic and computational methods. Psycholinguistics provides us with a bilingual corpus that features verbal constructions associated with carefully built extra-linguistic material (short video clips). Computational approaches bring us distributional semantic models (DSMs) to measure the distance between linguistic elements in the extra-linguistic space. These models allow for cross-linguistic measures that we evaluate against manually annotated data. In this paper, we discuss the results, potential shortcomings involving cultural variability, and how to measure such bias.
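    As a rough illustration of the kind of measurement described above (not the paper's actual model), the sketch below represents each verbal construction as a count vector over the video clips it was used to describe, and compares a Mandarin and a French construction by cosine similarity in that shared extra-linguistic space. The clip IDs and counts are invented.

```python
# Minimal sketch, assuming constructions are represented by counts of the
# video clips they were elicited with; not the paper's implementation.
from collections import Counter
from math import sqrt


def clip_vector(observations):
    """Map a list of clip IDs (one per elicited description) to a count vector."""
    return Counter(observations)


def _norm(w):
    return sqrt(sum(x * x for x in w.values()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    if not u or not v:
        return 0.0
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    return dot / (_norm(u) * _norm(v))


# Hypothetical usage: clips in which each verbal construction was produced.
mandarin_construction = clip_vector(["clip_03", "clip_03", "clip_17", "clip_21"])
french_construction = clip_vector(["clip_03", "clip_17", "clip_17", "clip_42"])
print(cosine(mandarin_construction, french_construction))
```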

    Using Extra-Linguistic Material for Mandarin-French Verbal Constructions Comparison

    Get PDF
    PACLIC 23 / City University of Hong Kong / 3-5 December 2009

    Unsupervised Paraphrasing of Multiword Expressions

    Full text link
    We propose an unsupervised approach to paraphrasing multiword expressions (MWEs) in context. Our model employs only monolingual corpus data and pre-trained language models (without fine-tuning), and does not make use of any external resources such as dictionaries. We evaluate our method on the SemEval 2022 idiomatic semantic text similarity task, and show that it outperforms all unsupervised systems and rivals supervised systems. Comment: 13 pages; accepted for Findings of ACL 2023.
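    A minimal sketch of the general setup, assuming a masked language model from the `transformers` library stands in for the pre-trained model: the MWE is masked in context and the model proposes substitutes with no fine-tuning. The authors' system is more elaborate (real MWE paraphrases are usually multi-token and need search over several masks); this only illustrates the zero-fine-tuning idea, and the example sentence is invented.

```python
# Sketch only: single-mask substitution for an MWE in context using a
# pre-trained masked LM, without fine-tuning. Requires `transformers`.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

sentence = "After the scandal, the minister decided to throw in the towel."
mwe = "throw in the towel"

# Replace the whole expression with one mask token (a simplification).
masked = sentence.replace(mwe, fill.tokenizer.mask_token)

# The model proposes in-context substitutes ranked by probability.
for cand in fill(masked, top_k=5):
    print(f"{cand['token_str']:>12}  score={cand['score']:.3f}")
```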

    Event structures in knowledge, pictures and text

    Get PDF
    This thesis proposes new techniques for mining scripts. Scripts are essential pieces of common sense knowledge that contain information about everyday scenarios (like going to a restaurant), namely the events that usually happen in a scenario (entering, sitting down, reading the menu...), their typical order (ordering happens before eating), and the participants of these events (customer, waiter, food...). Because many conventionalized scenarios are shared common sense knowledge and thus are usually not described in standard texts, we propose to elicit sequential descriptions of typical scenario instances via crowdsourcing over the internet. This approach overcomes the implicitness problem and, at the same time, is scalable to large data collections. To generalize over the input data, we need to mine event and participant paraphrases from the textual sequences. For this task we make use of the structural commonalities in the collected sequential descriptions, which yields much more accurate paraphrases than approaches that do not take structural constraints into account. We further apply the algorithm we developed for event paraphrasing to parallel standard texts for extracting sentential paraphrases and paraphrase fragments. In this case we consider the discourse structure in a text as a sequential event structure. As for event paraphrasing, the structure-aware paraphrasing approach clearly outperforms systems that do not consider discourse structure. As a multimodal application, we develop a new resource in which textual event descriptions are grounded in videos, which enables new investigations on action description semantics and a more accurate modeling of event description similarities. This grounding approach also opens up new possibilities for applying the computed script knowledge for automated event recognition in videos.
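    The event-paraphrasing step exploits the sequential structure of the crowdsourced descriptions. A schematic sketch, not the thesis implementation: align two scenario sequences with a Needleman-Wunsch-style dynamic program over a placeholder word-overlap similarity, and treat descriptions aligned to the same slot as paraphrase candidates. The example sequences and gap penalty are invented.

```python
# Sketch: pairwise alignment of two event sequences for the same scenario.
def sim(a, b):
    """Word-overlap similarity between two short event descriptions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


def align(seq1, seq2, gap=-0.2):
    """Global alignment; returns aligned description pairs (paraphrase candidates)."""
    n, m = len(seq1), len(seq2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sim(seq1[i - 1], seq2[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Backtrace, collecting descriptions aligned to the same position.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim(seq1[i - 1], seq2[j - 1]):
            pairs.append((seq1[i - 1], seq2[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))


a = ["enter the restaurant", "sit down at a table", "read the menu", "order food"]
b = ["walk into the restaurant", "look at the menu", "order a meal"]
print(align(a, b))
```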

    Doctor of Philosophy

    Get PDF
    Research that has examined how L2 writers write from sources, and the extent to which these source-based texts differ from texts produced by L1 writers, suggests that L2 writers copy more extensively and attribute information to original sources less frequently than L1 writers (e.g., Keck, 2006). This dissertation study set out to add to the existing body of literature on textual borrowing in undergraduate L2 writers, with the additional goal of examining the extent to which these writers' textual borrowing is influenced by instruction on avoiding plagiarism. The study employed qualitative methodology and drew upon multiple data sources. Additionally, unlike much of the existing research on L2 writers' textual borrowing, this study examined three L2 writers' textual borrowing in the context of authentic source-based assignments produced in an ESL writing class and mainstream courses. The findings showed that the three L2 writers in the study were able to avoid blatant plagiarism by implementing basic textual borrowing strategies, such as paraphrasing by substituting original words with synonyms. However, they continued to have difficulties with more nuanced aspects of source use, such as transparency and cohesion in attribution, integration of source-based material with their own voice, source selection and organization, and use of effective reading and writing strategies. With respect to the observed instruction, the study uncovered several central themes: the instructor 1) tended to focus on the punitive consequences of plagiarism (although her perspective shifted toward the end of the course), 2) frequently emphasized concepts of credibility and blame as main reasons for responsible textual borrowing, and 3) simplified instruction on textual borrowing to rephrasing of others' words and changing structure. These findings highlight the mismatch between the complex difficulties that undergraduate L2 writers have with textual borrowing on the one hand and the simplified instruction that ignores these difficulties on the other. I discuss this disparity in the realm of L2 writing teacher preparation and professional training for faculty across the curriculum, arguing for increased institutional support. I also outline a framework for providing such instructional support, which includes linguistic, textual, cognitive, metacognitive, and social support.

    Multiword expressions at length and in depth

    Get PDF
    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expression modelling and processing, and will be a point of reference for future work.

    Un environnement générique et ouvert pour le traitement des expressions polylexicales

    Get PDF
    The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and comes with a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of past, ongoing and future work.
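    As an illustration of one sub-task such a framework must solve (candidate extraction and association scoring), the sketch below ranks bigram candidates from a toy corpus by pointwise mutual information. This is not the mwetoolkit's API or pipeline; the corpus and frequency threshold are placeholders.

```python
# Sketch: score bigram MWE candidates with PMI under simple ML estimates.
from collections import Counter
from math import log2

corpus = [
    "the bus stop near the big deal store".split(),
    "take off from the bus stop".split(),
    "they made a big deal of the take off".split(),
]

unigrams, bigrams, n_tokens = Counter(), Counter(), 0
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))
    n_tokens += len(sent)


def pmi(w1, w2):
    """Pointwise mutual information of a bigram candidate."""
    p_xy = bigrams[(w1, w2)] / max(sum(bigrams.values()), 1)
    p_x, p_y = unigrams[w1] / n_tokens, unigrams[w2] / n_tokens
    return log2(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")


# Keep candidates seen more than once, ranked by association strength.
for (w1, w2), c in bigrams.most_common():
    if c > 1:
        print(f"{w1} {w2}\tcount={c}\tpmi={pmi(w1, w2):.2f}")
```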

    Paraphrasing and Translation

    Get PDF
    Paraphrasing and translation have previously been treated as unconnected natural language processing tasks. Whereas translation represents the preservation of meaning when an idea is rendered in the words of a different language, paraphrasing represents the preservation of meaning when an idea is expressed using different words in the same language. We show that the two are intimately related. The major contributions of this thesis are as follows:
    ‱ We define a novel technique for automatically generating paraphrases using bilingual parallel corpora, which are more commonly used as training data for statistical models of translation.
    ‱ We show that paraphrases can be used to improve the quality of statistical machine translation by addressing the problem of coverage and introducing a degree of generalization into the models.
    ‱ We explore the topic of automatic evaluation of translation quality, and show that the current standard evaluation methodology cannot be guaranteed to correlate with human judgments of translation quality.
    Whereas previous data-driven approaches to paraphrasing were dependent upon either uncommon data sources, such as multiple translations of the same source text, or language-specific resources such as parsers, our approach is able to harness more widely available parallel corpora and can be applied to any language which has a parallel corpus. The technique was evaluated by replacing phrases with their paraphrases and asking judges whether the meaning of the original phrase was retained and whether the resulting sentence remained grammatical. Paraphrases extracted from a parallel corpus with manual alignments are judged to be accurate (both meaningful and grammatical) 75% of the time, retaining the meaning of the original phrase 85% of the time. Using automatic alignments, meaning can be retained at a rate of 70%. Being a language-independent and probabilistic approach allows our method to be easily integrated into statistical machine translation. A paraphrase model derived from parallel corpora other than the one used to train the translation model can be used to increase the coverage of statistical machine translation by adding translations of previously unseen words and phrases. If the translation of a word was not learned, but a translation of a synonymous word has been learned, then the word is paraphrased and its paraphrase is translated. Phrases can be treated similarly. Results show that augmenting a state-of-the-art SMT system with paraphrases in this way leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs, we increase the coverage of unique test-set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches.
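    The core pivoting idea behind the thesis can be sketched as follows: a paraphrase probability p(e2 | e1) is obtained by marginalising over the foreign phrases f aligned to e1, p(e2 | e1) = ÎŁ_f p(e2 | f) p(f | e1). A minimal sketch assuming toy phrase tables (the entries below are invented, not real alignment output):

```python
# Sketch: paraphrase extraction by pivoting through a bilingual phrase table.
from collections import defaultdict

# p(f | e): English phrase -> {foreign phrase: probability}
e2f = {"under control": {"unter kontrolle": 0.8, "im griff": 0.2}}

# p(e | f): foreign phrase -> {English phrase: probability}
f2e = {
    "unter kontrolle": {"under control": 0.7, "in check": 0.3},
    "im griff": {"under control": 0.5, "a handle on": 0.5},
}


def paraphrase_probs(e1):
    """Score candidate paraphrases of e1 by summing over pivot phrases."""
    scores = defaultdict(float)
    for f, p_f_given_e in e2f.get(e1, {}).items():
        for e2, p_e_given_f in f2e.get(f, {}).items():
            if e2 != e1:
                scores[e2] += p_e_given_f * p_f_given_e
    return sorted(scores.items(), key=lambda kv: -kv[1])


print(paraphrase_probs("under control"))
# With the toy tables above: [('in check', 0.24), ('a handle on', 0.1)]
```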