
    Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization

    Geographically annotated social media is extremely valuable for modern information retrieval. However, researchers who can access only publicly visible data quickly find that social media users rarely publish location information. In this work, we provide a method which can geolocate the overwhelming majority of active Twitter users, independent of their location sharing preferences, using only publicly visible Twitter data. Our method infers an unknown user's location by examining their friends' locations. We frame the geotagging problem as an optimization over a social network with a total variation-based objective and provide a scalable and distributed algorithm for its solution. Furthermore, we show how a robust estimate of the geographic dispersion of each user's ego network can be used as a per-user accuracy measure which is effective at removing outlying errors. Leave-many-out evaluation shows that our method is able to infer location for 101,846,236 Twitter users at a median error of 6.38 km, allowing us to geotag over 80% of public tweets.
    Comment: 9 pages, 8 figures, accepted to IEEE BigData 2014, Compton, Ryan, David Jurgens, and David Allen. "Geotagging one hundred million twitter accounts with total variation minimization." Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 201
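The idea of inferring a user's location from their friends' locations can be illustrated with a small sketch. This is not the authors' algorithm: it assumes, purely for illustration, that each unlocated user is repeatedly assigned the geometric median of their located friends, computed with Weiszfeld's iteration on planar coordinates. The function names `geometric_median` and `propagate_locations`, the input dictionaries, and the number of rounds are all hypothetical.

```python
import math


def geometric_median(points, iterations=100, eps=1e-9):
    """Approximate the point minimizing the summed Euclidean distance to `points`."""
    # Start from the centroid of the input points.
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iterations):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y)
            if d < eps:
                # Simplification: if the estimate lands on an input point,
                # return that point rather than divide by zero.
                return (px, py)
            w = 1.0 / d
            num_x += w * px
            num_y += w * py
            denom += w
        new_x, new_y = num_x / denom, num_y / denom
        step = math.hypot(new_x - x, new_y - y)
        x, y = new_x, new_y
        if step < eps:
            break
    return (x, y)


def propagate_locations(friends, known, rounds=5):
    """Spread locations from users in `known` to other users in `friends`.

    friends: dict mapping user -> list of friend ids (hypothetical input format)
    known:   dict mapping user -> (x, y) planar coordinates, held fixed
    """
    estimates = dict(known)
    for _ in range(rounds):
        updates = {}
        for user, neighbours in friends.items():
            if user in known:
                continue  # never overwrite a known location
            located = [estimates[f] for f in neighbours if f in estimates]
            if located:
                updates[user] = geometric_median(located)
        # Apply all updates together so each round uses the previous round's estimates.
        estimates.update(updates)
    return estimates
```

A real geotagging system would use geodesic rather than planar distance and would need a distributed implementation to reach the scale described in the abstract; the sketch only shows why a median-style update is robust to a few distant friends.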

    Smart, Responsible, and Upper Caste Only: Measuring Caste Attitudes through Large-Scale Analysis of Matrimonial Profiles

    Discriminatory caste attitudes currently stigmatize millions of Indians, subjecting individuals to prejudice in all aspects of life. Governmental incentives and societal movements have attempted to counter these attitudes, yet accurate measurements of public opinions on caste are not yet available for understanding whether progress is being made. Here, we introduce a novel approach to measure public attitudes toward caste through an indicator variable: openness to intercaste marriage. Using a massive dataset of over 313K profiles from a major Indian matrimonial site, we precisely quantify public attitudes, along with differences between generations and between Indian residents and diaspora. We show that younger generations are more open to intercaste marriage, yet attitudes are based on a complex function of social status beyond their own caste. In examining the desired qualities in a spouse, we find that individuals open to intercaste marriage are more individualistic in the qualities they desire, rather than favoring family-related qualities, which mirrors larger societal trends away from collectivism. Finally, we show that attitudes in diaspora are significantly less open, suggesting a bi-cultural model of integration. Our research provides the first empirical evidence identifying how various intersections of identity shape attitudes toward intercaste marriage in India and among the Indian diaspora in the US.
    Comment: 12 pages; Accepted to be published at ICWSM'1

    Creating Full Individual-level Location Timelines from Sparse Social Media Data

    In many domain applications, a continuous timeline of human locations is critical, for example, for understanding possible locations where a disease may spread, or the flow of traffic. While data sources such as GPS trackers or Call Data Records are temporally rich, they are expensive, often not publicly available, or garnered only in select locations, restricting their wide use. Conversely, geo-located social media data are publicly and freely available, but present challenges, especially for full timeline inference, due to their sparse nature. We propose a stochastic framework, Intermediate Location Computing (ILC), which uses prior knowledge about human mobility patterns to predict every missing location from an individual's social media timeline. We compare ILC with a state-of-the-art RNN baseline as well as methods that are optimized for next-location prediction only. For three major cities, ILC predicts the top 1 location for all missing locations in a timeline, at 1- and 2-hour resolution, with up to 77.2% accuracy (up to 6% better accuracy than all compared methods). Specifically, ILC also outperforms the RNN in low-data settings: both with a very small number of users (under 50) and with more users but sparser timelines. In general, the RNN model needs a higher number of users to achieve the same performance as ILC. Overall, this work illustrates the tradeoff between prior knowledge of heuristics and more data, for an important societal problem of filling in entire timelines using freely available, but sparse, social media data.
    Comment: 10 pages, 8 figures, 2 tables

    The structure of online social networks modulates the rate of lexical change

    New words are regularly introduced to communities, yet not all of these words persist in a community's lexicon. Among the many factors contributing to lexical change, we focus on the understudied effect of social networks. We conduct a large-scale analysis of over 80k neologisms in 4420 online communities across a decade. Using Poisson regression and survival analysis, our study demonstrates that the community's network structure plays a significant role in lexical change. Apart from overall size, properties including dense connections, the lack of local clusters and more external contacts promote lexical innovation and retention. Unlike offline communities, these topic-based communities do not experience strong lexical levelling despite increased contact but accommodate more niche words. Our work provides support for the sociolinguistic hypothesis that lexical change is partially shaped by the structure of the underlying network but also uncovers findings specific to online communities.
    Comment: NAACL 202

    Social Meme-ing: Measuring Linguistic Variation in Memes

    Much work in the space of NLP has used computational methods to explore sociolinguistic variation in text. In this paper, we argue that memes, as multimodal forms of language comprised of visual templates and text, also exhibit meaningful social variation. We construct a computational pipeline to cluster individual instances of memes into templates and semantic variables, taking advantage of their multimodal structure in doing so. We apply this method to a large collection of meme images from Reddit and make available the resulting SemanticMemes dataset of 3.8M images clustered by their semantic function. We use these clusters to analyze linguistic variation in memes, discovering not only that socially meaningful variation in meme usage exists between subreddits, but that patterns of meme innovation and acculturation within these communities align with previous findings on written language.

    Author Mentions in Science News Reveal Wide-Spread Ethnic Bias

    Media outlets play a key role in spreading scientific knowledge to the general public and raising the profile of researchers among their peers. Yet, given time and space constraints, not all scholars can receive equal media attention, and journalists' choices of whom to mention are poorly understood. In this study, we use a comprehensive dataset of 232,524 news stories from 288 U.S.-based outlets covering 100,208 research papers across all sciences to investigate the rates at which scientists of different ethnicities are mentioned by name. We find strong evidence of ethnic biases in author mentions, even after controlling for a wide range of possible confounds. Specifically, authors with non-British-origin names are significantly less likely to be mentioned or quoted than comparable British-origin named authors, even within the stories of a particular news outlet covering a particular scientific venue on a particular research topic. Instead, minority scholars are more likely to have their names substituted with their role at their institution. This ethnic bias is consistent across all types of media outlets, with even larger disparities in General-Interest outlets that tend to publish longer stories and have dedicated editorial teams for accurately reporting science. Our findings reveal that perceived ethnicity can substantially shape scientists' media attention, and, by our estimation, this bias has affected thousands of scholars unfairly.
    Comment: 68 pages, 8 figures, 11 tables

    Are All Successful Communities Alike? Characterizing and Predicting the Success of Online Communities

    The proliferation of online communities has created exciting opportunities to study the mechanisms that explain group success. While a growing body of research investigates community success through a single measure -- typically, the number of members -- we argue that there are multiple ways of measuring success. Here, we present a systematic study to understand the relations between these success definitions and test how well they can be predicted based on community properties and behaviors from the earliest period of a community's lifetime. We identify four success measures that are desirable for most communities: (i) growth in the number of members; (ii) retention of members; (iii) long-term survival of the community; and (iv) volume of activities within the community. Surprisingly, we find that our measures do not exhibit very high correlations, suggesting that they capture different types of success. Additionally, we find that different success measures are predicted by different attributes of online communities, suggesting that success can be achieved through different behaviors. Our work sheds light on the basic understanding of what success represents in online communities and what predicts it. Our results suggest that success is multi-faceted and can be neither measured nor predicted by a single measurement. This insight has practical implications for the creation of new online communities and the design of platforms that facilitate such communities.
    Comment: To appear at The Web Conference 201

    Analyzing the Engagement of Social Relationships During Life Event Shocks in Social Media

    Individuals experiencing unexpected distressing events (shocks) often rely on their social network for support. While prior work has shown how social networks respond to shocks, these studies usually treat all ties equally, despite differences in the support provided by different social relationships. Here, we conduct a computational analysis on Twitter that examines how responses to online shocks differ by the relationship type of a user dyad. We introduce a new dataset of over 13K instances of individuals self-reporting shock events on Twitter and construct networks of relationship-labeled dyadic interactions around these events. By examining behaviors across 110K replies to shocked users in a pseudo-causal analysis, we demonstrate relationship-specific patterns in response levels and topic shifts. We also show that while well-established social dimensions of closeness such as tie strength and structural embeddedness contribute to shock responsiveness, the degree of impact is highly dependent on relationship and shock types. Our findings indicate that social relationships contain highly distinctive characteristics in network interactions and that relationship-specific behaviors in online shock responses are distinct from those of offline settings.
    Comment: Accepted to ICWSM 2023. 12 pages, 5 figures, 5 tables