When do Words Matter? Understanding the Impact of Lexical Choice on Audience Perception using Individual Treatment Effect Estimation

Author: Culotta Aron
Wang Zhao
Publication date: 14/11/2018

Studies across many disciplines have shown that lexical choice can affect audience perception. For example, how users describe themselves in a social media profile can affect their perceived socio-economic status. However, we lack general methods for estimating the causal effect of lexical choice on the perception of a specific sentence. While randomized controlled trials may provide good estimates, they do not scale to the potentially millions of comparisons necessary to consider all lexical choices. Instead, in this paper, we first offer two classes of methods to estimate the effect on perception of changing one word to another in a given sentence. The first class of algorithms builds upon quasi-experimental designs to estimate individual treatment effects from observational data. The second class treats treatment effect estimation as a classification problem. We conduct experiments with three data sources (Yelp, Twitter, and Airbnb), finding that the algorithmic estimates align well with those produced by randomized controlled trials. Additionally, we find that it is possible to transfer treatment effect classifiers across domains and still maintain high accuracy. Comment: AAAI_201

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

The Semantic Typology of Visually Grounded Paraphrases

Author: Chu Chenhui
Garcia Noa
Nakashima Yuta
Oliveira Vinicius
Otani Mayu
Virgo Giovanni, Felix
Publication venue: Elsevier BV
Publication date: 01/01/2022

Visually grounded paraphrases (VGPs) are different phrasal expressions describing the same visual concept in an image. Previous studies treat VGP identification as a binary classification task, which ignores various phenomena behind VGPs (i.e., different linguistic interpretations of the same visual concept) such as linguistic paraphrases and VGPs from different aspects. In this paper, we propose a semantic typology for VGPs, aiming to elucidate the VGP phenomena and deepen the understanding of how human beings interpret vision with language. We construct a large VGP dataset that annotates the class to which each VGP pair belongs according to our typology. In addition, we present a classification model that fuses language and visual features for VGP classification on our dataset. Experiments indicate that joint language and vision representation learning is important for VGP classification. We further demonstrate that our VGP typology can boost the performance of visually grounded textual entailment.

Kyoto University Research Information Repository

Classification automatique des procédés de traduction [Automatic classification of translation processes]

Author: Illouz Gabriel
Vilnat Anne
Zhai Yuming
Publication venue: HAL CCSD
Publication date: 01/07/2019

In order to distinguish literal translation from other translation processes, translators and linguists have proposed several typologies to characterize different translation processes, such as idiomatic equivalence, generalization, particularization, semantic modulation, etc. However, the techniques to extract paraphrases from bilingual parallel corpora have not exploited this information. In this work, we propose an automatic classification of translation processes, based on manually annotated examples in an English-French parallel corpus of TED Talks. Even with a small dataset, the experimental results are encouraging and our experiments show the direction to follow in future work.

Transforming Dependency Structures to Logical Forms for Semantic Parsing

Author: Collins Michael
Das Dipanjan
Kwiatkowski Tom
Lapata Mirella
Reddy Siva
Steedman Mark
Täckström Oscar
Publication date: 01/04/2016

Edinburgh Research Explorer

An Empirical Evaluation Of Attention And Pointer Networks For Paraphrase Generation

Author: Gupta Varun
Publication date: 27/06/2019

In computer vision, one common practice for augmenting an image dataset is to create new images using geometric transformations, which preserve similarity. This data augmentation was one of the most significant factors in winning the ImageNet competition in 2012 with vast neural networks. Similarly, in speech recognition, we saw similar results by augmenting the signal with noise, slowing or accelerating the signal, and modifying the spectrogram. Unlike in computer vision and speech data, there have not been many techniques explored to augment data in natural language processing (NLP). The only technique explored for text data is lexical substitution, which only focuses on replacing words by synonyms. In this thesis, we investigate the use of different pointer networks with sequence-to-sequence models, which have shown excellent results in neural machine translation (NMT) and text simplification tasks, for generating similar sentences using a sequence-to-sequence model and the paraphrase dataset (PPDB). The evaluation of these paraphrases is carried out by augmenting the training set of the IMDb movie review dataset and comparing its performance with the baseline model. We show how these paraphrases can affect downstream tasks. Furthermore, we train different classifiers to create a stable baseline for evaluation on the IMDb movie dataset. To the best of our knowledge, this is the first study on generating paraphrases using these models with the help of the PPDB dataset and evaluating these paraphrases in a downstream task.

Concordia University Research Repository

Identifying Semantic Divergences Across Languages

Author: Vyas Yogarshi
Publication date: 01/01/2019

Cross-lingual resources such as parallel corpora and bilingual dictionaries are cornerstones of multilingual natural language processing (NLP). They have been used to study the nature of translation, train automatic machine translation systems, as well as to transfer models across languages for an array of NLP tasks. However, the majority of work in cross-lingual and multilingual NLP assumes that translations recorded in these resources are semantically equivalent. This is often not the case: words and sentences that are considered to be translations of each other frequently diverge in meaning, often in systematic ways. In this thesis, we focus on such mismatches in meaning in text that we expect to be aligned across languages. We term such mismatches cross-lingual semantic divergences. The core claim of this thesis is that translation is not always meaning preserving, which leads to cross-lingual semantic divergences that affect multilingual NLP tasks. Detecting such divergences requires ways of directly characterizing differences in meaning across languages through novel cross-lingual tasks, as well as models that account for translation ambiguity and do not rely on expensive, task-specific supervision. We support this claim through three main contributions. First, we show that a large fraction of data in multilingual resources (such as parallel corpora and bilingual dictionaries) is identified as semantically divergent by human annotators. Second, we introduce cross-lingual tasks that characterize differences in word meaning across languages by identifying the semantic relation between two words. We also develop methods to predict such semantic relations, as well as a model to predict whether sentences in different languages have the same meaning. Finally, we demonstrate the impact of divergences by applying the methods developed in the previous sections to two downstream tasks.
We first show that our model for identifying semantic relations between words helps in separating equivalent word translations from divergent translations in the context of bilingual dictionary induction, even when the two words are close in meaning. We also show that identifying and filtering semantic divergences in parallel data helps in training a neural machine translation system twice as fast without sacrificing quality.

Digital Repository at the University of Maryland

Syntax-mediated semantic parsing

Author: Reddy Goli Venkata Sivakumar
Publication venue: The University of Edinburgh
Publication date: 30/11/2017

Querying a database to retrieve an answer, telling a robot to perform an action, or teaching a computer to play a game are tasks requiring communication with machines in a language interpretable by them. Semantic parsing is the task of converting human language to a machine interpretable language. While human languages are sequential in nature with latent structures, machine interpretable languages are formal with explicit structures. The computational linguistics community has created several treebanks to understand the formal syntactic structures of human languages. In this thesis, we use these to obtain formal meaning representations of languages, and learn computational models to convert these meaning representations to the target machine representation. Our goal is to evaluate whether existing treebank syntactic representations are useful for semantic parsing. Existing semantic parsing methods mainly learn domain-specific grammars which can parse human languages to machine representation directly. We deviate from this trend and make use of general-purpose syntactic grammar to help in semantic parsing. We use two syntactic representations: Combinatory Categorial Grammar (CCG) and dependency syntax. CCG has a well established theory on deriving meaning representations from its syntactic derivations. But there are no CCG treebanks for many languages since these are difficult to annotate. In contrast, dependencies are easy to annotate and have many treebanks. However, dependencies do not have a well established theory for deriving meaning representations. In this thesis, we propose novel theories for deriving meaning representations from dependencies. Our evaluation task is question answering on a knowledge base. Given a question, our goal is to answer it on the knowledge base by converting the question to an executable query. We use Freebase, the knowledge source behind Google’s search engine, as our knowledge base.
Freebase contains millions of real world facts represented in a graphical format. Inspired by the Freebase structure, we formulate semantic parsing as a graph matching problem, i.e., given a natural language sentence, we convert it into a graph structure from the meaning representation obtained from syntax, and find the subgraph of Freebase that best matches the natural language graph. Our experiments on the Free917, WebQuestions and GraphQuestions semantic parsing datasets conclude that general-purpose syntax is more useful for semantic parsing than induced task-specific syntax and syntax-agnostic representations.

Edinburgh Research Archive

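Hizkuntza-ulermenari ekarpenak: N-gramen arteko atentzio eta lerrokatzeak antzekotasun eta inferentzia interpretagarrirako. [Contributions to language understanding: attention and alignment between n-grams for interpretable similarity and inference.]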

Author: López Gazpio Iñigo
Publication date: 01/01/2018

148 p. Natural language processing makes it possible to improve intelligent systems in the field of education, considerably lightening the workload of students and teachers. In this thesis we study sentence-level language understanding and, through new proposals, increase the language understanding of intelligent systems, giving them the ability to interpret users' sentences more precisely, since the ability to interpret sentences in a fine-grained way makes it possible to generate feedback automatically. To develop this thesis we have examined language understanding in depth by analyzing the features and systems related to semantic similarity and logical inference. In particular, we have shown that sentences can be modeled better by structuring the words within a sentence into groups and aligning them. To that end, we implemented a state-of-the-art neural network system that aligns single words and adapted it to align arbitrary n-grams. Although word alignment has long been known, this thesis presents, for the first time, proposals for aligning arbitrary n-grams by means of an attention mechanism. In addition, to identify the similarities and differences between sentences precisely, to increase the interpretability of sentences, and to give students precise feedback, we created a new layer: iSTS. With this layer, which combines semantic similarity and logical inference, we aligned chunks and showed that we are able to give students precise feedback in two evaluation scenarios in the educational context. Along with this thesis, several systems and datasets have been published so that the scientific community can continue this research in the future.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital para la Docencia y la Investigación