Listening between the Lines: Learning Personal Attributes from Conversations
Open-domain dialogue agents must be able to converse about many topics while
incorporating knowledge about the user into the conversation. In this work we
address the acquisition of such knowledge, for personalization in downstream
Web applications, by extracting personal attributes from conversations. This
problem is more challenging than the established task of information extraction
from scientific publications or Wikipedia articles, because dialogues often
give merely implicit cues about the speaker. We propose methods for inferring
personal attributes, such as profession, age or family status, from
conversations using deep learning. Specifically, we propose several Hidden
Attribute Models, which are neural networks leveraging attention mechanisms and
embeddings. Our methods are trained on a per-predicate basis to output rankings
of object values for a given subject-predicate combination (e.g., ranking the
doctor and nurse professions high when speakers talk about patients, emergency
rooms, etc.). Experiments with various conversational texts including Reddit
discussions, movie scripts and a collection of crowdsourced personal dialogues
demonstrate the viability of our methods and their superior performance
compared to state-of-the-art baselines. Comment: published in WWW'1
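The per-predicate ranking idea above can be illustrated with a minimal sketch: attention weights over a speaker's utterance terms produce a single representation, which is scored against embeddings of candidate object values. All names, dimensions, and the random vectors here are illustrative assumptions, not the paper's actual Hidden Attribute Model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and object values for a "profession" predicate.
vocab = ["patients", "emergency", "rooms", "coffee", "code"]
professions = ["doctor", "nurse", "programmer"]

dim = 8
word_emb = {w: rng.normal(size=dim) for w in vocab}
value_emb = {p: rng.normal(size=dim) for p in professions}

def rank_values(utterance_terms, attn_query):
    """Attention-weighted utterance embedding, scored against each object value."""
    E = np.stack([word_emb[t] for t in utterance_terms])  # (n_terms, dim)
    logits = E @ attn_query                               # attention logits
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                              # softmax
    rep = weights @ E                                     # utterance representation
    return sorted(professions, key=lambda p: -float(rep @ value_emb[p]))

attn_query = rng.normal(size=dim)  # learned in the real model; random here
ranking = rank_values(["patients", "emergency", "rooms"], attn_query)
```

In the actual model the attention query, embeddings, and scoring layers would be trained per predicate so that, e.g., medical vocabulary pushes "doctor" and "nurse" to the top of the ranking.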
Personalized Dialogue Generation with Diversified Traits
Endowing a dialogue system with particular personality traits is essential to
deliver more human-like conversations. However, due to the challenge of
embodying personality via language expression and the lack of large-scale
persona-labeled dialogue data, this research problem is still far from
well-studied. In this paper, we investigate the problem of incorporating
explicit personality traits in dialogue generation to deliver personalized
dialogues.
To this end, firstly, we construct PersonalDialog, a large-scale multi-turn
dialogue dataset containing various traits from a large number of speakers. The
dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers.
Each utterance is associated with a speaker who is marked with traits like Age,
Gender, Location, Interest Tags, etc. Several anonymization schemes are
designed to protect the privacy of each speaker. This large-scale dataset will
facilitate not only the study of personalized dialogue generation, but also
other research in sociolinguistics and social science.
Secondly, to study how personality traits can be captured and addressed in
dialogue generation, we propose persona-aware dialogue generation models within
the sequence to sequence learning framework. Explicit personality traits
(structured by key-value pairs) are embedded using a trait fusion module.
During the decoding process, two techniques, namely persona-aware attention and
persona-aware bias, are devised to capture and address trait-related
information. Experiments demonstrate that our model is able to address proper
traits in different contexts. Case studies also show interesting results for
this challenging research problem. Comment: Please contact [zhengyinhe1 at 163 dot com] for the PersonalDialog dataset.
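The trait fusion and persona-aware bias ideas can be sketched as follows: each key-value trait gets an embedding, a context-conditioned attention merges them into one persona vector, and that vector contributes a bias to the decoder's vocabulary logits. The trait names, dimensions, and random weights below are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative structured traits (key-value pairs) for one speaker.
traits = {"Gender": "female", "Age": "20s", "Location": "Beijing"}
dim, vocab_size = 16, 50

trait_emb = {f"{k}={v}": rng.normal(size=dim) for k, v in traits.items()}

def fuse_traits(context_vec):
    """Persona-aware attention over trait embeddings, conditioned on context."""
    E = np.stack(list(trait_emb.values()))   # (n_traits, dim)
    logits = E @ context_vec
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ E                             # fused persona vector

W_bias = rng.normal(size=(vocab_size, dim)) * 0.1

def decode_step(decoder_logits, context_vec):
    """Add a persona-aware bias to the decoder's vocabulary logits."""
    persona = fuse_traits(context_vec)
    return decoder_logits + W_bias @ persona

logits = rng.normal(size=vocab_size)
biased = decode_step(logits, rng.normal(size=dim))
```

In a trained sequence-to-sequence model the bias would shift probability mass toward words consistent with the speaker's traits at each decoding step.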
Can online self-reports assist in real-time identification of influenza vaccination uptake? A cross-sectional study of influenza vaccine-related tweets in the USA, 2013-2017
The Centers for Disease Control and Prevention (CDC) spend significant time and resources to track influenza vaccination coverage each influenza season using national surveys. Emerging data from social media provide an alternative solution for surveillance of influenza vaccination coverage in near real time, at both national and local levels. This study aimed to characterise and analyse the vaccinated population from temporal, demographical and geographical perspectives using automatic classification of vaccination-related Twitter data. In this cross-sectional study, we continuously collected tweets containing both influenza-related terms and vaccine-related terms covering four consecutive influenza seasons from 2013 to 2017. We created a machine learning classifier to identify relevant tweets, then evaluated the approach by comparing to data from the CDC's FluVaxView. We limited our analysis to tweets geolocated within the USA. We assessed 1 124 839 tweets. We found a strong correlation of 0.799 between monthly Twitter estimates and CDC data, with correlations as high as 0.950 in individual influenza seasons. We also found that our approach obtained geographical correlations of 0.387 at the US state level and 0.467 at the regional level. Finally, we found a higher level of influenza vaccine tweets among female users than male users, also consistent with the results of CDC surveys on vaccine uptake. Significant correlations between Twitter data and CDC data show the potential of using social media for vaccination surveillance. Temporal variability is captured better than geographical and demographical variability. We discuss potential paths forward for leveraging this approach.
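The evaluation step described above amounts to correlating monthly counts of classifier-flagged tweets with the CDC's survey estimates. A minimal sketch, with entirely made-up monthly numbers (the study itself reports r = 0.799 over 2013-2017):

```python
import numpy as np

# Hypothetical monthly counts of classifier-flagged influenza-vaccine tweets
# and corresponding CDC coverage estimates (percent) for the same months.
tweet_counts = np.array([120, 340, 560, 480, 300, 150], dtype=float)
cdc_coverage = np.array([2.1, 5.8, 9.5, 8.2, 5.1, 2.6])

# Pearson correlation between the two monthly series.
r = np.corrcoef(tweet_counts, cdc_coverage)[0, 1]
```

A high r on held-out seasons is what supports using the Twitter series as a near-real-time proxy for the survey signal.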
Inferring attributes with picture metadata embeddings
Users in online social networks are vulnerable to attribute inference attacks based on the data they publish. In particular, the picture owner's gender has a strong influence on individuals' emotional reactions to a photo. In this work, we present a graph-embedding approach for gender inference attacks based on picture metadata such as (i) alt-texts generated by Facebook to describe the content of images, and (ii) emojis/emoticons posted by friends, friends of friends or regular users as a reaction to the picture. Specifically, we apply a semi-supervised technique, node2vec, for learning a mapping of picture metadata to a low-dimensional vector space. Next, we study in this vector space the gender closeness of users who published similar photos and/or received similar reactions. We leverage this image sharing and reaction mode of Facebook users to derive an efficient and accurate technique for user gender inference. Experimental results show that the privacy attack often succeeds even when information other than the pictures published by their owners is hidden or unavailable.
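The "gender closeness in the embedding space" step can be sketched independently of node2vec itself: once users' picture metadata has been embedded, an unlabeled user's gender is inferred from the labels of the nearest embedded neighbors. The embeddings below are synthetic stand-ins for node2vec output, and the k-nearest-neighbor vote is one plausible reading of the closeness study, not the paper's exact classifier.

```python
import numpy as np

# Synthetic stand-ins for node2vec embeddings of labeled users.
labeled = {  # user -> (embedding, gender)
    "u1": (np.array([1.0, 0.1, 0.0, 0.2]), "F"),
    "u2": (np.array([0.9, 0.2, 0.1, 0.1]), "F"),
    "u3": (np.array([0.0, 1.0, 0.9, 0.1]), "M"),
    "u4": (np.array([0.1, 0.9, 1.0, 0.0]), "M"),
}

def infer_gender(query, k=3):
    """Majority vote among the k cosine-nearest labeled users."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    neigh = sorted(labeled.values(), key=lambda ev: -cos(query, ev[0]))[:k]
    votes = [g for _, g in neigh]
    return max(set(votes), key=votes.count)

# A query user whose metadata embedding lands near u1 and u2.
guess = infer_gender(np.array([0.95, 0.15, 0.05, 0.15]))
```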
Online Attacks on Picture Owner Privacy
We present an online attribute inference attack that leverages Facebook picture metadata: (i) alt-text generated by Facebook to describe picture contents, and (ii) comments containing words and emojis posted by other Facebook users. Specifically, we study the correlation of the picture owner's attributes with the Facebook-generated alt-text and the comments used by commenters when reacting to the image. We concentrate on the gender attribute, which is highly relevant for targeted advertising and privacy breaches. We explore how to launch an online gender inference attack on any Facebook user by handling newly discovered vocabulary online, using a retrofitting process to enrich a core vocabulary built during offline training. Our experiments show that even when the user hides most public data (e.g., friend list, attributes, pages, groups), an attacker can detect the user's gender with an AUC (area under the ROC curve) from 87% to 92%, depending on picture metadata availability. Moreover, we can detect with high accuracy sequences of words leading to gender disclosure and, accordingly, enable users to derive countermeasures and configure their privacy settings safely.
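The retrofitting step for newly discovered vocabulary can be sketched in the spirit of the standard retrofitting update: a new word's vector is pulled toward the embeddings of related core-vocabulary words, balancing its initial vector against the neighbor average. The vectors and neighbor graph below are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

# A tiny "core vocabulary" learned offline (vectors are made up).
core = {
    "mom": np.array([1.0, 0.0, 0.2]),
    "mum": np.array([0.9, 0.1, 0.3]),
}
neighbors = {"momma": ["mom", "mum"]}  # new online word -> related core words

def retrofit(word, init_vec, alpha=1.0, iters=10):
    """Pull init_vec toward its neighbors' embeddings (retrofitting update)."""
    q = init_vec.copy()
    related = [core[w] for w in neighbors[word]]
    for _ in range(iters):
        # balance the initial vector against the sum of neighbor vectors
        q = (alpha * init_vec + sum(related)) / (alpha + len(related))
    return q

# A word first seen online, starting from a zero (unknown) vector.
new_vec = retrofit("momma", np.zeros(3))
```

After the update, the new word can be scored by the offline-trained gender classifier as if it had been in the core vocabulary.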
You are what emojis say about your pictures: Language - independent gender inference attack on Facebook
The picture owner's gender has a strong influence on individuals' emotional reactions to the picture. In this study, we investigate gender inference attacks on picture owners using picture metadata composed of: (i) alt-texts generated by Facebook to describe the content of pictures, and (ii) emojis/emoticons posted by friends, friends of friends or regular users as a reaction to the picture. Specifically, we study the correlation of the picture owner's gender with the alt-text and the emojis/emoticons used by commenters when reacting to these pictures. We leverage this image sharing and reaction mode of Facebook users to derive an efficient and accurate technique for user gender inference. We show that such a privacy attack often succeeds even when information other than the pictures published by their owners is hidden or unavailable.
From Discourse Structure To Text Specificity: Studies Of Coherence Preferences
To successfully communicate through text, a writer needs to organize information into an understandable and well-structured discourse for the targeted audience. This involves deciding when to convey general statements, when to elaborate on details, and gauging how much detail to convey, i.e., the level of specificity. This thesis explores the automatic prediction of text specificity, and whether the perception of specificity varies across different audiences.
We characterize text specificity from two aspects: the instantiation discourse relation, and the specificity of sentences and words. We identify characteristics of instantiation that signify a change of specificity between sentences. Features derived from these characteristics substantially improve the detection of the relation. Using instantiation sentences as the basis for training, we propose a semi-supervised system to predict sentence specificity with speed and accuracy. Furthermore, we present insights into the effect of underspecified words and phrases on the comprehension of text, and the prediction of such words.
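The semi-supervised setup can be illustrated with a self-training sketch, an assumed reading of the approach: seed labels come from instantiation pairs (the first sentence general, the second specific), and confident predictions on unlabeled sentences are folded back into the training set. The cue-word scorer and thresholds below are toy stand-ins for the actual classifier.

```python
# Toy scorer: sum weights of cue words present (stand-in for a real classifier).
def specificity_score(sentence, cue_weights):
    return sum(cue_weights.get(w, 0.0) for w in sentence.lower().split())

cue_weights = {"example": 1.0, "instance": 1.0, "generally": -1.0}

# Seed labels derived from instantiation pairs: 1 = specific, 0 = general.
labeled = [("for example the dose was 5 mg", 1),
           ("generally doses vary", 0)]
unlabeled = ["for instance the trial had 40 patients",
             "things generally differ"]

# One self-training round: adopt labels the scorer is confident about.
threshold = 0.5
for s in unlabeled:
    score = specificity_score(s, cue_weights)
    if abs(score) >= threshold:
        labeled.append((s, int(score > 0)))
```

In the real system the scorer would be retrained on the enlarged labeled set and the round repeated, trading a small seed set for broad coverage at speed.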
We show distinct preferences in specificity and discourse structure among different audiences. We investigate these distinctions in both cross-lingual and monolingual contexts. Cross-lingually, we identify discourse factors that significantly impact the quality of text translated from Chinese to English. Notably, a large portion of Chinese sentences are significantly more specific and need to be translated into multiple English sentences. We introduce a system using rich syntactic features to accurately detect such sentences. We also show that simplified text is more general, and that specific sentences are more likely to need simplification. Finally, we present evidence that the perception of sentence specificity differs between male and female readers.
BALANCING THE ASSUMPTIONS OF CAUSAL INFERENCE AND NATURAL LANGUAGE PROCESSING
Drawing conclusions about real-world relationships of cause and effect from data collected without randomization requires making assumptions about the true processes that generate the data we observe. Causal inference typically considers low-dimensional data such as categorical or numerical fields in structured medical records. Yet a restriction to such data excludes natural language texts -- including social media posts or clinical free-text notes -- that can provide a powerful perspective into many aspects of our lives. This thesis explores whether the simplifying assumptions we make in order to model human language and behavior can support the causal conclusions that are necessary to inform decisions in healthcare or public policy. An analysis of millions of documents must rely on automated methods from machine learning and natural language processing, yet trust is essential in many clinical or policy applications. We need to develop causal methods that can reflect the uncertainty of imperfect predictive models to inform robust decision-making.
We explore several areas of research in pursuit of these goals. We propose a measurement error approach for incorporating text classifiers into causal analyses and demonstrate the assumption on which it relies. We introduce a framework for generating synthetic text datasets on which causal inference methods can be evaluated, and use it to demonstrate that many existing approaches make assumptions that are likely violated. We then propose a proxy model methodology that provides explanations for uninterpretable black-box models, and close by incorporating it into our measurement error approach to explore the assumptions necessary for an analysis of gender and toxicity on Twitter.
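One standard way to fold classifier error into a prevalence estimate is the Rogan-Gladen correction, which inverts the misclassification equation using the classifier's sensitivity and specificity. This is a common measurement-error correction, not necessarily the thesis's exact formulation, and the rates below are hypothetical.

```python
def corrected_prevalence(observed, sensitivity, specificity):
    """Rogan-Gladen correction for an imperfect binary classifier.

    Inverts: observed = true * sensitivity + (1 - true) * (1 - specificity)
    """
    return (observed - (1.0 - specificity)) / (sensitivity + specificity - 1.0)

# The classifier flags 30% of documents; it has 90% sensitivity and
# 95% specificity, so the corrected prevalence is lower than 30%.
true_prev = corrected_prevalence(0.30, 0.90, 0.95)
```

Propagating uncertainty in the sensitivity and specificity estimates through this correction is one way the "reflect the uncertainty of imperfect predictive models" goal can be made concrete.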