Occupation prediction with multimodal learning from Tweet messages and Google Street View images
Despite the development of various heuristic and machine learning models, social media user occupation prediction remains challenging due to limited high-quality ground truth data and difficulties in effectively integrating multiple data sources in different modalities, which can be complementary and contribute to informing the profession or job role of an individual. In response, this study introduces a novel semi-supervised multimodal learning method for Twitter user occupation prediction with a limited number of training samples. Specifically, an unsupervised learning model is first designed to extract textual and visual embeddings from individual tweet messages (textual) and Google Street View images (visual), with the latter capturing the geographical and environmental context surrounding individuals’ residential and workplace areas. Next, these high-dimensional multimodal features are fed into a multilayer transfer learning model for individual occupation classification. The proposed occupation prediction method achieves high evaluation scores for identifying Office workers, Students, and Others or Jobless people, with the F1 score for identifying Office workers surpassing the best previously reported scores for occupation classification using social media data.
Linking geosocial sensing with the socio-demographic fabric of smart cities
Technological advances have enabled new sources of geoinformation, such as geosocial media, and have supported the propagation of the concept of smart cities. This paper argues that a city cannot be smart without citizens in the loop, and that a geosocial sensor might be one component to achieve that. First, we need to better understand which facets of urban life could be detected by a geosocial sensor, and how to calibrate it. This requires replicable studies that foster longitudinal and comparative research. Consequently, this paper examines the relationship between geosocial media content and socio-demographic census data for a global city, London, at two administrative levels. It aims for a transparent study design to encourage replication, using Term Frequency-Inverse Document Frequency (TF-IDF) of keywords, rule-based and word-embedding sentiment analysis, and local cluster analysis. The findings of limited links between geosocial media content and socio-demographic characteristics support earlier critiques of the utility of geosocial media for smart city planning purposes. The paper concludes that passive listening to publicly available geosocial media, in contrast to pro-active engagement with citizens, seems of limited use for understanding and improving urban quality of life.
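The keyword weighting named in the abstract above can be illustrated with a small sketch. This is not the paper's pipeline: the function name `tf_idf`, the toy token lists, and the unsmoothed weighting (relative term frequency times log(N / document frequency)) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (hypothetical, unsmoothed):
    weight = (count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: one token list per administrative area.
docs = [
    "park run park cafe".split(),
    "office commute cafe".split(),
    "park office lunch".split(),
]
scores = tf_idf(docs)
```

Under this weighting, a term that appears in every document gets a weight of zero, so only keywords that distinguish one area from another carry signal.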
Mapping urban socioeconomic inequalities in developing countries through Facebook advertising data
Ending poverty in all its forms everywhere is the number one Sustainable Development Goal of the UN 2030 Agenda. To monitor the progress toward such an ambitious target, reliable, up-to-date and fine-grained measurements of socioeconomic indicators are necessary. When it comes to socioeconomic development, novel digital traces can provide a complementary data source to overcome the limits of traditional data collection methods, which are often not regularly updated and lack adequate spatial resolution. In this study, we collect publicly available and anonymous advertising audience estimates from Facebook to predict socioeconomic conditions of urban residents, at a fine spatial granularity, in four large urban areas: Atlanta (USA), Bogotá (Colombia), Santiago (Chile), and Casablanca (Morocco). We find that behavioral attributes inferred from the Facebook marketing platform can accurately map the socioeconomic status of residential areas within cities, and that predictive performance is comparable in both high and low-resource settings. Our work provides additional evidence of the value of social advertising media data to measure human development and it also shows the limitations in generalizing the use of these data to make predictions across countries.
Interpreting wealth distribution via poverty map inference using multimodal data
Poverty maps are essential tools for governments and NGOs to track
socioeconomic changes and adequately allocate infrastructure and services in
places in need. Sensor and online crowd-sourced data combined with machine
learning methods have provided a recent breakthrough in poverty map inference.
However, these methods do not capture local wealth fluctuations, and are not
optimized to produce accountable results that guarantee accurate predictions to
all sub-populations. Here, we propose a pipeline of machine learning models to
infer the mean and standard deviation of wealth across multiple geographically
clustered populated places, and illustrate their performance in Sierra Leone
and Uganda. These models leverage seven independent and freely available
feature sources based on satellite images, and metadata collected via online
crowd-sourcing and social media. Our models show that combined metadata
features are the best predictors of wealth in rural areas, outperforming
image-based models, which are the best for predicting the highest wealth
quintiles. Our results recover the local mean and variation of wealth, and
correctly capture the positive yet non-monotonic correlation between them. We
further demonstrate the capabilities and limitations of model transfer across
countries and the effects of data recency and other biases. Our methodology
provides open tools to build towards more transparent and interpretable models
to help governments and NGOs to make informed decisions based on data
availability, urbanization level, and poverty thresholds. Comment: 12 pages. In
Proceedings of the ACM Web Conference 2023 (WWW'23)
Data science: a game changer for science and innovation
This paper shows data science's potential for disruptive innovation in science, industry, policy, and people's lives. We present how data science will affect science and society at large in the coming years, including ethical problems in managing human behavior data, and consider the quantitative expectations of data science's economic impact. We introduce concepts such as open science and e-infrastructure as useful tools for supporting ethical data science and training new generations of data scientists. Finally, this work outlines the SoBigData Research Infrastructure as an easy-to-access platform for executing complex data science processes. The services proposed by SoBigData are aimed at using data science to understand the complexity of our contemporary, globally interconnected society.
Social demographics imputation based on similarity in multi-dimensional activity-travel pattern: A two-step approach
In response to the absence of demographics in increasingly emerging big data sets, we propose a novel method for inferring the missing demographic information based on similarity in people’s daily multi-dimensional activity-travel patterns as well as the characteristics of the areas they move about in. Instead of using isolated activity-travel attributes to infer social demographic features, our proposed method first calculates the similarity of people’s multidimensional daily activities and travels, as well as the characteristics of the locations they visit, between those for whom the social demographics are to be imputed (target) and those with known demographics (base) using a polynomial function. The weights of the function are determined using the permutation feature importance method, and then dynamic time warping is used to align the multidimensional activity sequences of the base and target samples and measure their similarities. For each person in the target database, a matched list is created consisting of those with the most similar activity-travel sequences in the base sample. A support vector machine is then trained using the base sample as input to impute the demographics of the target sample. The proposed model is trained using a national travel survey and validated by applying it to a GPS dataset. The results show that the proposed method outperforms existing methods in predicting four selected demographics: gender, age, education level, and work status, with an accuracy range between 91% and 94% for the national dataset and 88% to 91% for the GPS data. This study highlights the importance of considering the multidimensional and sequential nature of people’s daily activity-travel patterns in the imputation of demographic features.
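The alignment step described in the abstract above, dynamic time warping over multidimensional activity sequences, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `step_cost` and `dtw_distance`, the (activity, travel mode) encoding, and the dimension weights are hypothetical. In the paper the weights come from permutation feature importance, while here they are fixed by hand.

```python
def step_cost(x, y, weights):
    """Weighted mismatch between two time steps (one tuple per step).

    Assumption: each dimension contributes its weight when the values differ.
    """
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)


def dtw_distance(a, b, weights):
    """Classic dynamic time warping distance between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = cheapest cost of aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = step_cost(a[i - 1], b[j - 1], weights)
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


# Hypothetical daily sequences of (activity, travel mode) per time slot.
weights = (1.0, 0.5)  # assumed importance of activity vs. travel mode
base = [("home", "none"), ("work", "car"), ("home", "car")]
target = [("home", "none"), ("work", "car"), ("work", "car"), ("home", "car")]
other = [("home", "none"), ("shop", "bus"), ("home", "bus")]

same_pattern = dtw_distance(base, target, weights)
different_pattern = dtw_distance(base, other, weights)
```

Because warping lets one time step match several consecutive steps, `base` and `target` align at zero cost even though `target` stays at work for an extra slot, whereas `other` remains at a positive distance. A matched list like the one in the abstract could then be built by ranking base-sample individuals by this distance.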
Identification of Online Users' Social Status via Mining User-Generated Data
With the burst of available online user-generated data, identifying online users’ social status by mining such data can play a significant role in commercial applications, research, and policy-making in many domains. Social status refers to the position of a person in relation to others within a society, which is an abstract concept. Its concrete definition depends on the indicator used to measure it. For example, opinion leadership measures individual social status in terms of influence and expertise in an online society, while socioeconomic status characterizes personal real-life social status based on social and economic factors. Because traditional survey methods are time-consuming, expensive, and sometimes difficult to carry out, some efforts have been made to identify a specific social status of users from specific user-generated data using classic machine learning methods. However, each combination of social status and data source poses its own challenges, which the classic machine learning methods in existing work fail to address, leading to low identification accuracy. Given the importance of improving identification accuracy, this thesis studies three specific cases of identifying online and offline social status. For each case, it proposes a novel and effective identification method that addresses the case-specific challenges to improve accuracy. The first work aims at identifying users’ online social status in terms of topic-sensitive influence and knowledge authority in social community question answering sites, namely identifying topical opinion leaders who are both influential and expert.
A social community question answering (SCQA) site, an innovative community question answering platform, not only offers traditional question answering (QA) services but also integrates an online social network where users can follow each other. Identifying topical opinion leaders in SCQA has become an important research area due to the significant role of topical opinion leaders. However, most previous related work either focuses on using knowledge expertise to find experts for improving the quality of answers, or aims at measuring user influence to identify influential users. In order to identify the true topical opinion leaders, we propose a topical opinion leader identification framework called QALeaderRank, which takes account of both topic-sensitive influence and topical knowledge expertise. In the proposed framework, to measure the topic-sensitive influence of each user, we design a novel influence measure algorithm that exploits both the social and QA features of SCQA, taking into account social network structure, topical similarity and knowledge authority. In addition, we propose three topic-relevant metrics to infer the topical expertise of each user. Extensive experiments along with an online user study show that the proposed QALeaderRank achieves significant improvement compared with state-of-the-art methods. Furthermore, we analyze how users' topic interests change over time and examine the predictability of user topic interest through experiments. The second work focuses on predicting individual socioeconomic status from mobile phone data. Socioeconomic status (SES) is an important social and economic attribute of wide concern. Assessing individual SES can assist related organizations in making a variety of policy decisions. The traditional approach suffers from the extremely high cost of collecting large-scale SES-related survey data.
With the ubiquity of smartphones, mobile phone data has become a novel, low-cost data source for predicting individual SES. However, the task of predicting individual SES from mobile phone data also poses some new challenges, including sparse individual records, scarce explicit relationships and limited labeled samples, which were not addressed in prior work restricted to regional or household-oriented SES prediction. To address these issues, we propose a semi-supervised Hypergraph based Factor Graph Model (HyperFGM) for individual SES prediction. HyperFGM is able to efficiently capture the associations between SES and individual mobile phone records to handle the individual record sparsity. For the scarce explicit relationships, HyperFGM models implicit high-order relationships among users on the hypergraph structure. In addition, HyperFGM exploits the limited labeled data together with unlabeled data in a semi-supervised way. Experimental results on a set of anonymized real mobile phone data show that HyperFGM greatly outperforms the baseline methods on individual SES prediction. The third work is to predict social media users’ socioeconomic status based on their social media content, which is useful for related organizations and companies in a range of applications, such as economic and social policy-making. Previous work leverages manually defined textual features and platform-based user level attributes from social media content and feeds them into a machine learning based classifier for SES prediction. However, it ignores some important information in social media content, including the order and the hierarchical structure of social media text as well as the relationships among user level attributes.
To this end, we propose a novel coupled social media content representation model for individual SES prediction, which not only utilizes a hierarchical neural network to incorporate the order and the hierarchical structure of social media text but also employs a coupled attribute representation method to take into account intra-coupled and inter-coupled interaction relationships among user level attributes. The experimental results show that the proposed model significantly outperforms other state-of-the-art models on a real dataset, which validates the efficiency and robustness of the proposed model.
Sociolinguistically Driven Approaches for Just Natural Language Processing
Natural language processing (NLP) systems are now ubiquitous. Yet the benefits of these language technologies do not accrue evenly to all users, and indeed they can be harmful; NLP systems reproduce stereotypes, prevent speakers of non-standard language varieties from participating fully in public discourse, and re-inscribe historical patterns of linguistic stigmatization and discrimination. How harms arise in NLP systems, and who is harmed by them, can only be understood at the intersection of work on NLP, fairness and justice in machine learning, and the relationships between language and social justice. In this thesis, we propose to address two questions at this intersection: i) How can we conceptualize harms arising from NLP systems?, and ii) How can we quantify such harms?
We propose the following contributions. First, we contribute a model in order to collect the first large dataset of African American Language (AAL)-like social media text. We use the dataset to quantify the performance of two types of NLP systems, identifying disparities in model performance between Mainstream U.S. English (MUSE)- and AAL-like text. Turning to the landscape of bias in NLP more broadly, we then provide a critical survey of the emerging literature on bias in NLP and identify its limitations. Drawing on work across sociology, sociolinguistics, linguistic anthropology, social psychology, and education, we provide an account of the relationships between language and injustice, propose a taxonomy of harms arising from NLP systems grounded in those relationships, and propose a set of guiding research questions for work on bias in NLP. Finally, we adapt the measurement modeling framework from the quantitative social sciences to effectively evaluate approaches for quantifying bias in NLP systems. We conclude with a discussion of recent work on bias through the lens of style in NLP, raising a set of normative questions for future work.