Using Extra-Linguistic Material for Mandarin-French Verbal Constructions Comparison
Systematic cross-linguistic studies of verbs' syntactic-semantic behaviors for typologically distant languages such as Mandarin Chinese and French are difficult to conduct. Such studies are nevertheless necessary due to the crucial role that verbal constructions play in the mental lexicon. This paper addresses the problem by combining psycholinguistic and computational methods. Psycholinguistics provides us with a bilingual corpus that features verbal constructions associated with carefully built extra-linguistic material (short video clips). Computational approaches bring us distributional semantic models (DSMs) to measure the distance between linguistic elements in the extra-linguistic space. These models allow for cross-linguistic measures that we evaluate against manually annotated data. In this paper, we discuss the results, potential shortcomings involving cultural variability, and how to measure such bias.
PACLIC 23 / City University of Hong Kong / 3-5 December 2009
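The cross-linguistic measures the abstract describes rest on comparing vectors in a shared extra-linguistic space. As a minimal illustration only (not the authors' actual model), cosine similarity between two toy vectors standing in for a Mandarin and a French verbal construction could be computed as:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for representations of a Mandarin
# and a French verbal construction in a shared video-derived space.
mandarin_vec = [0.9, 0.1, 0.3]
french_vec = [0.8, 0.2, 0.4]

similarity = cosine(mandarin_vec, french_vec)
```

A high similarity would suggest the two constructions describe the same extra-linguistic situation; the real models are built from the video-clip material rather than hand-set numbers.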
Unsupervised Paraphrasing of Multiword Expressions
We propose an unsupervised approach to paraphrasing multiword expressions
(MWEs) in context. Our model employs only monolingual corpus data and
pre-trained language models (without fine-tuning), and does not make use of any
external resources such as dictionaries. We evaluate our method on the SemEval
2022 idiomatic semantic text similarity task, and show that it outperforms all
unsupervised systems and rivals supervised systems. (13 pages; accepted for Findings of ACL 2023.)
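The substitute-and-rank idea behind such systems can be sketched independently of any particular language model. The `paraphrase_mwe` helper below is hypothetical, as is the toy scorer, which stands in for the pretrained-LM fit score the paper relies on:

```python
def paraphrase_mwe(sentence, mwe, candidates, scorer):
    """Replace the MWE with each candidate and rank candidates by the scorer.

    `scorer` is a pluggable stand-in for a pretrained LM's fit score
    (e.g. a pseudo log-likelihood); any callable sentence -> float works.
    """
    return sorted(
        candidates,
        key=lambda cand: scorer(sentence.replace(mwe, cand)),
        reverse=True,
    )

# Toy scorer: prefers sentences containing words from a small "context
# fit" list -- a crude stand-in for a masked-LM score, for illustration.
def toy_scorer(sentence):
    context_fit = {"died", "passed"}
    return sum(word in context_fit for word in sentence.lower().split())

best = paraphrase_mwe(
    "He kicked the bucket last year",
    "kicked the bucket",
    ["died", "hit the pail", "complained"],
    toy_scorer,
)
# best[0] == "died"
```

With a real masked language model as the scorer, the literal but contextually implausible candidates would similarly be ranked down.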
Adapting Automatic Summarization to New Sources of Information
English-language news articles are no longer necessarily the best source of information. The Web allows information to spread more quickly and travel farther: first-person accounts of breaking news events pop up on social media, and foreign-language news articles are accessible to, if not immediately understandable by, English-speaking users. This thesis focuses on developing automatic summarization techniques for these new sources of information.
We focus on summarizing two specific new sources of information: personal narratives, first-person accounts of exciting or unusual events that are readily found in blog entries and other social media posts, and non-English documents, which must first be translated into English, often introducing translation errors that complicate the summarization process. Personal narratives are a very new area of interest in natural language processing research, and they present two key challenges for summarization. First, unlike many news articles, whose lead sentences serve as summaries of the most important ideas in the articles, personal narratives provide no such shortcuts for determining where important information occurs within them; second, personal narratives are written informally and colloquially, and unlike news articles, they are rarely edited, so they require heavier editing and rewriting during the summarization process. Non-English documents, whether news or narrative, present yet another source of difficulty on top of any challenges inherent to their genre: they must be translated into English, potentially introducing translation errors and disfluencies that must be identified and corrected during summarization.
The bulk of this thesis is dedicated to addressing the challenges of summarizing personal narratives found on the Web. We develop a two-stage summarization system for personal narrative that first extracts sentences containing important content and then rewrites those sentences into summary-appropriate forms. Our content extraction system is inspired by contextualist narrative theory, using changes in writing style throughout a narrative to detect sentences containing important information; it outperforms both graph-based and neural network approaches to sentence extraction for this genre. Our paraphrasing system rewrites the extracted sentences into shorter, standalone summary sentences, learning to mimic the paraphrasing choices of human summarizers more closely than can traditional lexicon- or translation-based paraphrasing approaches.
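The two-stage design described above can be sketched as a generic extract-then-rewrite pipeline. The `is_salient` and `rewrite` functions below are hypothetical placeholders for the style-change detector and the learned paraphraser, not the thesis's actual models:

```python
def summarize(narrative_sentences, is_salient, rewrite):
    # Stage 1: extract sentences flagged as carrying important content.
    extracted = [s for s in narrative_sentences if is_salient(s)]
    # Stage 2: rewrite each extracted sentence into standalone summary form.
    return [rewrite(s) for s in extracted]

# Toy stand-ins: salience via a couple of narrative cue words, and a
# rewriter that strips informal fillers -- real models would plug in here.
def is_salient(sentence):
    return any(cue in sentence.lower() for cue in ("suddenly", "realized"))

def rewrite(sentence):
    for filler in ("like, ", "you know, "):
        sentence = sentence.replace(filler, "")
    return sentence

summary = summarize(
    ["So we were walking around.",
     "Suddenly, like, a bear appeared."],
    is_salient, rewrite,
)
# summary == ["Suddenly, a bear appeared."]
```

The separation keeps the content-selection and rewriting decisions independently replaceable, which is the property the thesis's two-stage system exploits.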
We conclude with a chapter dedicated to summarizing non-English documents written in low-resource languages, documents that would otherwise be unreadable for English-speaking users. We develop a cross-lingual summarization system that performs even heavier editing and rewriting than does our personal narrative paraphrasing system; we create and train on large amounts of synthetic errorful translations of foreign-language documents. Our approach produces fluent English summaries from disfluent translations of non-English documents, and it generalizes across languages.
A corpus-based investigation of collocational errors in EFL Taiwanese high school students' compositions
Many language instructors focus on vocabulary word by word, neglecting common phrases. The result is that English as a Second Language students do not learn to speak idiomatic English (i.e., they make collocation errors). This study of the English compositions of National Tainan Second Senior High School students in Taiwan examined collocation errors, categorizing them according to Benson, Benson and Ilson's Collocation Classification System. An examination was then made of the error types as correlated with general English proficiency.
Event structures in knowledge, pictures and text
This thesis proposes new techniques for mining scripts.
Scripts are essential pieces of common sense knowledge that contain information about everyday scenarios (like going to a restaurant), namely the events that usually happen in a scenario (entering, sitting down, reading the menu...), their typical order (ordering happens before eating), and the participants of these events (customer, waiter, food...).
Because many conventionalized scenarios are shared common sense knowledge and thus are usually not described in standard texts, we propose to elicit sequential descriptions of typical scenario instances via crowdsourcing over the internet. This approach overcomes the implicitness problem and, at the same time, is scalable to large data collections.
To generalize over the input data, we need to mine event and participant paraphrases from the textual sequences. For this task we make use of the structural commonalities in the collected sequential descriptions, which yields much more accurate paraphrases than approaches that do not take structural constraints into account.
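The structural idea can be illustrated with a toy sketch. It assumes the crowdsourced sequences have already been aligned slot-by-slot (the real method performs that alignment itself); descriptions sharing a slot become paraphrase candidates:

```python
from collections import defaultdict

def mine_event_paraphrases(aligned_sequences):
    """Group event descriptions by their aligned slot across sequences.

    Assumes slot-by-slot alignment is given (here: equal-length toy
    sequences). Descriptions that fill the same slot are collected as
    paraphrase candidates for the same underlying event.
    """
    slots = defaultdict(set)
    for seq in aligned_sequences:
        for position, description in enumerate(seq):
            slots[position].add(description)
    return dict(slots)

paraphrases = mine_event_paraphrases([
    ["enter the restaurant", "sit down", "read the menu"],
    ["go inside", "take a seat", "look at the menu"],
])
# paraphrases[1] == {"sit down", "take a seat"}
```

A purely lexical method would struggle to link "sit down" with "take a seat"; the shared sequential position is what licenses the pairing.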
We further apply the algorithm we developed for event paraphrasing to parallel standard texts for extracting sentential paraphrases and paraphrase fragments. In this case we consider the discourse structure in a text as a sequential event structure. As for event paraphrasing, the structure-aware paraphrasing approach clearly outperforms systems that do not consider discourse structure.
As a multimodal application, we develop a new resource in which textual event descriptions are grounded in videos, which enables new investigations of action description semantics and a more accurate modeling of event description similarities. This grounding approach also opens up new possibilities for applying the computed script knowledge to automated event recognition in videos.
Doctor of Philosophy
Research that has examined how L2 writers write from sources, and the extent to which these source-based texts differ from texts produced by L1 writers, suggests that L2 writers copy more extensively and attribute information to original sources less frequently than L1 writers (e.g., Keck, 2006). This dissertation study set out to add to the existing body of literature on textual borrowing in undergraduate L2 writers, with the additional goal of examining the extent to which these writers' textual borrowing is influenced by instruction on avoiding plagiarism. The study employed qualitative methodology and drew upon multiple data sources. Additionally, unlike much of the existing research on L2 writers' textual borrowing, this study examined three L2 writers' textual borrowing in the context of authentic source-based assignments produced in an ESL writing class and mainstream courses. The findings showed that the three L2 writers in the study were able to avoid blatant plagiarism by implementing basic textual borrowing strategies, such as paraphrasing by substituting original words with synonyms. However, they continued to have difficulties with more nuanced aspects of source use, such as transparency and cohesion in attribution, integration of source-based material with their own voice, source selection and organization, and use of effective reading and writing strategies. With respect to the observed instruction, the study uncovered several central themes: the instructor 1) tended to focus on the punitive consequences of plagiarism (although her perspective shifted toward the end of the course), 2) frequently emphasized concepts of credibility and blame as main reasons for responsible textual borrowing, and 3) simplified instruction on textual borrowing to rephrasing of others' words and changing structure.
These findings highlight the mismatch between the complex difficulties that undergraduate L2 writers have with textual borrowing on one hand and the simplified instruction that ignores these difficulties on the other. I discuss this uncovered disparity in the realm of L2 writing teacher preparation and professional training for faculty across the curriculum, arguing for increased institutional support. I also outline a framework for providing such instructional support, which includes linguistic, textual, cognitive, metacognitive, and social support.
Multiword expressions at length and in depth
The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.
Un environnement générique et ouvert pour le traitement des expressions polylexicales (A generic and open environment for the treatment of multiword expressions)
The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language-independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology in two applications: computer-assisted lexicography and statistical machine translation.
For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation about the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work.
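The mwetoolkit itself is considerably richer, but candidate extraction of this kind typically relies on lexical association measures. As a minimal, illustrative sketch (not the toolkit's implementation), pointwise mutual information over adjacent word pairs can flag combinations like "bus stop" that co-occur more often than chance:

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    # Pointwise mutual information for each adjacent word pair:
    #   PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
    # High PMI marks pairs that co-occur more often than chance,
    # making them MWE candidates.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Tiny toy corpus, for illustration only.
tokens = "bus stop is near the bus stop and the bus is slow".split()
scores = bigram_pmi(tokens)
```

On real corpora, measures such as PMI, t-score or Dice are computed over millions of tokens and combined with linguistic filters before candidates are handed to a lexicographer.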
Paraphrasing and Translation
Paraphrasing and translation have previously been treated as unconnected natural language processing tasks. Whereas translation represents the preservation of meaning when an idea is rendered in the words of a different language, paraphrasing represents the preservation of meaning when an idea is expressed using different words in the same language. We show that the two are intimately related. The major contributions of this thesis are as follows:
• We define a novel technique for automatically generating paraphrases using bilingual parallel corpora, which are more commonly used as training data for statistical models of translation.
• We show that paraphrases can be used to improve the quality of statistical machine translation by addressing the problem of coverage and introducing a degree of generalization into the models.
• We explore the topic of automatic evaluation of translation quality, and show that the current standard evaluation methodology cannot be guaranteed to correlate with human judgments of translation quality.
Whereas previous data-driven approaches to paraphrasing depended upon either uncommon data sources, such as multiple translations of the same source text, or language-specific resources, such as parsers, our approach is able to harness more widely available parallel corpora and can be applied to any language which has a parallel corpus. The technique was evaluated by replacing phrases with their paraphrases and asking judges whether the meaning of the original phrase was retained and whether the resulting sentence remained grammatical. Paraphrases extracted from a parallel corpus with manual alignments are judged to be accurate (both meaningful and grammatical) 75% of the time, retaining the meaning of the original phrase 85% of the time. Using automatic alignments, meaning can be retained at a rate of 70%.
Being a language-independent and probabilistic approach allows our method to be easily integrated into statistical machine translation. A paraphrase model derived from parallel corpora other than the one used to train the translation model can be used to increase the coverage of statistical machine translation by adding translations of previously unseen words and phrases. If the translation of a word was not learned, but a translation of a synonymous word has been learned, then the word is paraphrased and its paraphrase is translated. Phrases can be treated similarly. Results show that augmenting a state-of-the-art SMT system with paraphrases in this way leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs, we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches.
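The extraction technique the thesis describes is the bilingual pivoting idea: two English phrases that translate to the same foreign phrase are paraphrase candidates, scored as p(e2 | e1) = Σ_f p(f | e1) · p(e2 | f). A toy sketch with hypothetical hand-set translation tables (real systems estimate these from word-aligned parallel corpora):

```python
def pivot_paraphrase_probs(phrase, p_foreign_given_eng, p_eng_given_foreign):
    """Paraphrase probability via a pivot language:
        p(e2 | e1) = sum over foreign phrases f of p(f | e1) * p(e2 | f)
    Both tables map a phrase to a dict of translations and probabilities.
    """
    probs = {}
    for f, p_f in p_foreign_given_eng.get(phrase, {}).items():
        for e2, p_e2 in p_eng_given_foreign.get(f, {}).items():
            if e2 != phrase:  # exclude the phrase itself
                probs[e2] = probs.get(e2, 0.0) + p_f * p_e2
    return probs

# Hypothetical toy English<->German phrase tables, for illustration only.
p_fe = {"under control": {"unter kontrolle": 1.0}}
p_ef = {"unter kontrolle": {"under control": 0.6, "in check": 0.4}}

paraphrases = pivot_paraphrase_probs("under control", p_fe, p_ef)
# paraphrases == {"in check": 0.4}
```

Because the pivot needs only a parallel corpus, the same code applies unchanged to any language pair, which is what makes the approach language-independent.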