15 research outputs found
When do Words Matter? Understanding the Impact of Lexical Choice on Audience Perception using Individual Treatment Effect Estimation
Studies across many disciplines have shown that lexical choice can affect
audience perception. For example, how users describe themselves in a social
media profile can affect their perceived socio-economic status. However, we
lack general methods for estimating the causal effect of lexical choice on the
perception of a specific sentence. While randomized controlled trials may
provide good estimates, they do not scale to the potentially millions of
comparisons necessary to consider all lexical choices. Instead, in this paper,
we first offer two classes of methods to estimate the effect on perception of
changing one word to another in a given sentence. The first class of algorithms
builds upon quasi-experimental designs to estimate individual treatment effects
from observational data. The second class treats treatment effect estimation as
a classification problem. We conduct experiments with three data sources (Yelp,
Twitter, and Airbnb), finding that the algorithmic estimates align well with
those produced by randomized controlled trials. Additionally, we find that it is
possible to transfer treatment effect classifiers across domains and still
maintain high accuracy. Comment: AAAI_201
The Semantic Typology of Visually Grounded Paraphrases
Visually grounded paraphrases (VGPs) are different phrasal expressions describing the same visual concept in an image. Previous studies treat VGP identification as a binary classification task, which ignores various phenomena behind VGPs (i.e., different linguistic interpretations of the same visual concept), such as linguistic paraphrases and VGPs from different aspects. In this paper, we propose a semantic typology for VGPs, aiming to elucidate the VGP phenomena and deepen the understanding of how human beings interpret vision with language. We construct a large VGP dataset that annotates the class to which each VGP pair belongs according to our typology. In addition, we present a classification model that fuses language and visual features for VGP classification on our dataset. Experiments indicate that joint language and vision representation learning is important for VGP classification. We further demonstrate that our VGP typology can boost the performance of visually grounded textual entailment.
Automatic Classification of Translation Processes (Classification automatique des procédés de traduction)
In order to distinguish literal translation from other translation processes, translators and linguists have proposed several typologies to characterize different translation processes, such as idiomatic equivalence, generalization, particularization, and semantic modulation. However, techniques for extracting paraphrases from bilingual parallel corpora have not exploited this information. In this work, we propose an automatic classification of translation processes, based on manually annotated examples in an English-French parallel corpus of TED Talks. Even with a small dataset, the experimental results are encouraging, and our experiments show the direction to follow in future work.
An Empirical Evaluation Of Attention And Pointer Networks For Paraphrase Generation
In computer vision, a common practice for augmenting an image dataset is to
create new images using geometric transformations that preserve similarity.
This data augmentation was one of the most significant factors in winning the ImageNet
competition in 2012 with large neural networks. Similarly, in speech recognition, we
saw similar results from augmenting the signal with noise, slowing it down or
speeding it up, and modifying the spectrogram.
Unlike in computer vision and speech, not many techniques have been
explored for augmenting data in natural language processing (NLP). The only technique
explored for text data is lexical substitution, which focuses solely on replacing
words with synonyms.
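The lexical substitution described above can be illustrated with a minimal sketch. The synonym table, the `augment` function, and the replacement probability `p` are all hypothetical and chosen only for illustration; they are not taken from the thesis, and a real system would draw synonyms from a lexical resource.

```python
import random

# Hypothetical toy synonym table (an assumption for this sketch only).
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor"],
}

def augment(sentence, synonyms, rng, p=0.5):
    """Return a copy of `sentence` in which each word that has an entry in
    `synonyms` is replaced by a randomly chosen synonym with probability `p`."""
    out = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

# A seeded generator keeps the augmented training examples reproducible.
rng = random.Random(0)
print(augment("a good movie with a bad ending", SYNONYMS, rng))
```

Because only individual words are swapped, sentence structure never changes, which is the limitation that motivates generating full paraphrases instead.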
In this thesis, we investigate the use of different pointer networks with
sequence-to-sequence models, which have shown excellent results in neural machine translation
(NMT) and text simplification tasks, to generate similar sentences using a
sequence-to-sequence model and the paraphrase dataset (PPDB). We evaluate
these paraphrases by augmenting the training set of the IMDb movie
review dataset and comparing the resulting performance with that of the baseline model. We show
how these paraphrases can affect downstream tasks. Furthermore, we train different
classifiers to create a stable baseline for evaluation on the IMDb movie dataset. To the
best of our knowledge, this is the first study to generate paraphrases using these models
with the help of the PPDB dataset and to evaluate these paraphrases on a downstream
task.
Identifying Semantic Divergences Across Languages
Cross-lingual resources such as parallel corpora and bilingual dictionaries are cornerstones of multilingual natural language processing (NLP). They have been used to study the nature of translation, to train automatic machine translation systems, and to transfer models across languages for an array of NLP tasks. However, the majority of work in cross-lingual and multilingual NLP assumes that translations recorded in these resources are semantically equivalent. This is often not the case: words and sentences that are considered to be translations of each other frequently diverge in meaning, often in systematic ways.
In this thesis, we focus on such mismatches in meaning in text that we expect to be aligned across languages. We term such mismatches cross-lingual semantic divergences. The core claim of this thesis is that translation is not always meaning-preserving, which leads to cross-lingual semantic divergences that affect multilingual NLP tasks. Detecting such divergences requires ways of directly characterizing differences in meaning across languages through novel cross-lingual tasks, as well as models that account for translation ambiguity and do not rely on expensive, task-specific supervision.
We support this claim through three main contributions. First, we show that a large fraction of data in multilingual resources (such as parallel corpora and bilingual dictionaries) is identified as semantically divergent by human annotators. Second, we introduce cross-lingual tasks that characterize differences in word meaning across languages by identifying the semantic relation between two words. We also develop methods to predict such semantic relations, as well as a model to predict whether sentences in different languages have the same meaning. Finally, we demonstrate the impact of divergences by applying the methods developed in the previous sections to two downstream tasks. We first show that our model for identifying semantic relations between words helps separate equivalent word translations from divergent translations in the context of bilingual dictionary induction, even when the two words are close in meaning. We also show that identifying and filtering semantic divergences in parallel data helps train a neural machine translation system twice as fast without sacrificing quality.
Syntax-mediated semantic parsing
Querying a database to retrieve an answer, telling a robot to perform an action, or
teaching a computer to play a game are tasks requiring communication with machines
in a language interpretable by them. Semantic parsing is the task of converting human
language to a machine interpretable language. While human languages are sequential in
nature with latent structures, machine interpretable languages are formal with explicit
structures. The computational linguistics community has created several treebanks to
understand the formal syntactic structures of human languages. In this thesis, we use
these to obtain formal meaning representations of languages, and learn computational
models to convert these meaning representations to the target machine representation.
Our goal is to evaluate if existing treebank syntactic representations are useful for
semantic parsing.
Existing semantic parsing methods mainly learn domain-specific grammars which
can parse human languages to machine representation directly. We deviate from this
trend and make use of general-purpose syntactic grammar to help in semantic parsing.
We use two syntactic representations: Combinatory Categorial Grammar (CCG) and
dependency syntax. CCG has a well-established theory for deriving meaning representations
from its syntactic derivations. However, there are no CCG treebanks for many languages
because they are difficult to annotate. In contrast, dependencies are easy to annotate and
have many treebanks, but they lack a well-established theory for
deriving meaning representations. In this thesis, we propose novel theories for deriving
meaning representations from dependencies.
Our evaluation task is question answering on a knowledge base. Given a question,
our goal is to answer it on the knowledge base by converting the question to an executable
query. We use Freebase, the knowledge source behind Google's search engine,
as our knowledge base. Freebase contains millions of real-world facts represented in a
graphical format. Inspired by the Freebase structure, we formulate semantic parsing
as a graph matching problem, i.e., given a natural language sentence, we convert it into
a graph structure from the meaning representation obtained from syntax, and find the
subgraph of Freebase that best matches the natural language graph.
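The graph-matching formulation above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the tiny triple store stands in for Freebase, the relation names are hypothetical, and the question graph is assumed to be already expressed with knowledge-base relations. The sketch performs exact subgraph matching only, whereas the thesis searches for the best-matching subgraph.

```python
# Toy knowledge base of (subject, relation, object) triples.
# Hypothetical stand-in for Freebase; relation names are invented.
KB = {
    ("Titanic", "directed_by", "James Cameron"),
    ("Avatar", "directed_by", "James Cameron"),
    ("Titanic", "release_year", "1997"),
    ("Avatar", "release_year", "2009"),
}

def match(question_edges, kb, binding=None):
    """Yield every variable binding under which all question edges are
    triples of `kb`. Variables are strings that start with '?'."""
    binding = binding or {}
    if not question_edges:
        yield dict(binding)
        return
    (s, r, o), rest = question_edges[0], question_edges[1:]
    for ks, kr, ko in kb:
        if kr != r:
            continue
        new = dict(binding)  # fresh copy, so a failed attempt leaves no trace
        if unify(s, ks, new) and unify(o, ko, new):
            yield from match(rest, kb, new)

def unify(term, value, binding):
    """Bind a variable to `value`, or check a constant or bound variable."""
    if term.startswith("?"):
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value

# Graph for "Which film directed by James Cameron was released in 2009?"
question = [("?x", "directed_by", "James Cameron"),
            ("?x", "release_year", "2009")]
print([b["?x"] for b in match(question, KB)])
```

The shared variable `?x` is what ties the two edges into one graph: a candidate entity must satisfy both edges to count as an answer.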
Our experiments on the Free917, WebQuestions, and GraphQuestions semantic parsing
datasets show that general-purpose syntax is more useful for semantic parsing than
induced task-specific syntax and syntax-agnostic representations.
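Contributions to Language Understanding: N-gram Attention and Alignments for Interpretable Similarity and Inference (Hizkuntza-ulermenari ekarpenak: N-gramen arteko atentzio eta lerrokatzeak antzekotasun eta inferentzia interpretagarrirako)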
148 p. Natural Language Processing makes it possible to improve intelligent systems in the field of education, significantly lightening the workload of students and teachers. In this thesis we study sentence-level language understanding and, through new proposals, increase the language understanding of intelligent systems, giving them the ability to interpret the user's sentences more precisely. The ability to interpret sentences in a fine-grained way in turn makes it possible to generate feedback automatically. To develop this thesis, we delved into language understanding by examining the features and systems related to semantic similarity and logical inference. In particular, we show that sentences can be modeled better by structuring the words within a sentence into groups and aligning them. To that end, we implemented a state-of-the-art neural network system that aligns single words and adapted it to align arbitrary n-grams. Although word alignment has long been known, this thesis presents, for the first time, proposals for aligning arbitrary n-grams by means of an attention mechanism. In addition, to identify similarities and differences between sentences precisely, to increase the interpretability of sentences, and to give students precise feedback, we created a new layer: iSTS. With this layer, which brings together semantic similarity and logical inference, we aligned chunks, and we show in two evaluation scenarios in an educational context that we are able to give students precise feedback. Along with this thesis, several systems and datasets have been published so that the scientific community can continue this line of research in the future.