7 research outputs found

    Language in Our Time: An Empirical Analysis of Hashtags

    Hashtags in online social networks have gained tremendous popularity during the past five years. The resulting large quantity of data has provided a new lens into modern society. Previously, researchers mainly relied on data collected from Twitter to study either a certain type of hashtag or a certain property of hashtags. In this paper, we perform the first large-scale empirical analysis of hashtags shared on Instagram, the major platform for hashtag-sharing. We study hashtags along three dimensions: the temporal-spatial dimension, the semantic dimension, and the social dimension. Extensive experiments performed on three large-scale datasets with more than 7 million hashtags in total provide a series of interesting observations. First, we show that the temporal patterns of hashtags can be categorized into four different clusters, and that people tend to share fewer hashtags at certain places and more hashtags at others. Second, we observe that a non-negligible proportion of hashtags exhibit large semantic displacement. We demonstrate that hashtags that are more uniformly shared among users, as quantified by the proposed hashtag entropy, are less prone to semantic displacement. Finally, we propose a bipartite graph embedding model to summarize users' hashtag profiles, and rely on these profiles to perform friendship prediction. Evaluation results show that our approach achieves effective prediction, with an AUC (area under the ROC curve) above 0.8, which demonstrates the strong social signals carried by hashtags. Comment: WWW 201
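
    The abstract does not give the formula for hashtag entropy. As a minimal sketch, assuming it behaves like the Shannon entropy of how one hashtag's shares are distributed over users, the idea of "more uniformly shared" can be illustrated as follows. The function name `hashtag_entropy` and the sample share logs are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def hashtag_entropy(user_ids):
    """Shannon entropy (in bits) of one hashtag's shares across users.

    Assumption: `user_ids` holds one entry per share, naming the user who
    made that share. This is an illustrative definition, not necessarily
    the one proposed in the paper.
    """
    counts = Counter(user_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical share logs for two hashtags.
uniform_shares = ["u1", "u2", "u3", "u4"]  # every user shares once
skewed_shares = ["u1", "u1", "u1", "u2"]   # one user dominates

print(hashtag_entropy(uniform_shares))
print(hashtag_entropy(skewed_shares))
```

    Under this assumed definition, a hashtag spread evenly over many users scores higher than one dominated by a single user, which is the property the abstract links to lower semantic displacement.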

    Tagvisor: A Privacy Advisor for Sharing Hashtags

    The hashtag has emerged as a widely used concept in popular culture and campaigns, but its implications for people's privacy have not been investigated so far. In this paper, we present the first systematic analysis of privacy issues induced by hashtags. We concentrate in particular on location, which is recognized as one of the key privacy concerns in the Internet era. By relying on a random forest model, we show that we can infer a user's precise location from hashtags with an accuracy of 70% to 76%, depending on the city. To remedy this situation, we introduce a system called Tagvisor that systematically suggests alternative hashtags if the user-selected ones constitute a threat to location privacy. Tagvisor realizes this by means of three conceptually different obfuscation techniques and a semantics-based metric for measuring the consequent utility loss. Our findings show that obfuscating as few as two hashtags already provides a near-optimal trade-off between privacy and utility in our dataset. This in particular renders Tagvisor highly time-efficient and thus practical in real-world settings.

    Privacy risk assessment of emerging machine learning paradigms

    Machine learning (ML) has progressed tremendously, and data is the key factor driving this development. However, there are two main challenges in collecting data and handling it with ML models. First, acquiring high-quality labeled data can be difficult and expensive because it requires extensive human annotation. Second, graphs have been leveraged to model complex relationships between entities, e.g., in social networks or molecular structures, but conventional ML models may not handle graph data effectively due to the non-linear and complex nature of the relationships between nodes. To address these challenges, recent developments in semi-supervised learning and self-supervised learning have been introduced to leverage unlabeled data for ML tasks. In addition, a new family of ML models known as graph neural networks (GNNs) has been proposed to tackle the challenges associated with graph data. Although these paradigms are powerful, the potential privacy risks stemming from them should also be taken into account. In this dissertation, we perform a privacy risk assessment of these emerging machine learning paradigms. First, we investigate the membership privacy leakage stemming from semi-supervised learning. Concretely, we propose the first data augmentation-based membership inference attack that is tailored to the training paradigm of semi-supervised learning methods. Second, we quantify the privacy leakage of self-supervised learning through the lens of membership inference attacks and attribute inference attacks. Third, we study the privacy implications of training GNNs on graphs. In particular, we propose the first attack to steal a graph from the outputs of a GNN model that is trained on the graph. Finally, we also explore potential defense mechanisms to mitigate these attacks.

    Health privacy : methods for privacy-preserving data sharing of methylation, microbiome and eye tracking data

    This thesis studies the privacy risks of biomedical data and develops mechanisms for privacy-preserving data sharing. The contribution of this work is two-fold. First, we demonstrate the privacy risks of a variety of biomedical data types, such as DNA methylation data, microbiome data and eye tracking data. Although these data are less stable than well-studied genome data and more prone to environmental changes, well-known privacy attacks can be adapted to them and threaten the privacy of data donors. Nevertheless, data sharing is crucial to advancing biomedical research, given that collecting data from a sufficiently large population is complex and costly. Therefore, as a second step, we develop privacy-preserving tools that enable researchers to share such biomedical data. These tools are mostly based on differential privacy, machine learning techniques and adversarial examples, and are carefully tuned to the concrete use case to maintain data utility while preserving privacy.

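    Primena Big Data analitike za istraživanje prostorno-vremenske dinamike ljudske populacije (Application of Big Data Analytics to Investigating the Spatio-Temporal Dynamics of the Human Population)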

    With the rapid growth in the volume of available data related to human dynamics, it has become more challenging to conduct research that reveals novel knowledge in the area. Today people tend to live mostly in large cities, where knowledge about human dynamics, habits and behaviour could lead to better city organisation, energy efficiency, transport organisation and an overall better-quality and more sustainable way of living. Human dynamics can be approached from many different aspects, but all of them have three elements in common: time, space and data volume. Human activity and interaction cannot be inspected without the space and time components, because everything happens somewhere at some time. Also, with the wide adoption of smartphones, terabytes of data related to human dynamics are now available. Although the data contain sensitive personal information, the true owners of the data are telecom operators, social media companies, or other companies that provide the applications used on mobile phones. If such data are to be opened to the public or the scientific community for research, they need to be anonymized first. Another challenge of user-generated data is data set volume. The data are usually very large in size (Volume), come from different sources and in different formats (Variety), and are generated in real time and evolve very fast (Velocity). These are the three V's of Big Data, and such data sets need to be approached with specially designed Big Data technologies. In the research presented in this thesis, we brought together Big Data technologies, graph theory and space-time dependent human dynamics data.

    Quantifying Location Sociality

    The emergence of location-based social networks provides an unprecedented chance to study the interaction between human mobility and social relations. This work is a step towards quantifying whether a location is suitable for conducting social activities, a notion we name location sociality. Being able to quantify location sociality creates practical opportunities in areas such as urban planning and location recommendation. To quantify a location's sociality, we propose a mixture model of HITS and PageRank on a heterogeneous network linking users and locations. By exploiting millions of check-ins generated by Instagram users in New York and Los Angeles, we investigate the relation between location sociality and several location properties, including location category, rating and popularity. We further perform two case studies, namely friendship prediction and location recommendation; the experimental results demonstrate the usefulness of our quantification.
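
    The abstract does not specify how HITS and PageRank are combined, so the mixture model itself cannot be reproduced here. As a rough sketch of one ingredient only, the code below runs plain PageRank by power iteration on an undirected user-location graph built from check-ins. The function name `pagerank`, the damping factor, the iteration count and the sample check-ins are all assumptions made for illustration, and this is not the authors' model.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain PageRank by power iteration on an undirected graph.

    Assumption: `edges` is a list of (user, location) check-in pairs and
    each pair is treated as an undirected link. This is a simplification
    for illustration, not the HITS/PageRank mixture from the paper.
    """
    nodes = sorted({node for edge in edges for node in edge})
    neighbors = {node: [] for node in nodes}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node keeps a small teleport share, then receives rank
        # passed along its links.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            share = damping * rank[node] / len(neighbors[node])
            for other in neighbors[node]:
                new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical check-ins: three users visit the cafe, one visits the office.
checkins = [("u1", "cafe"), ("u2", "cafe"), ("u3", "cafe"), ("u3", "office")]
scores = pagerank(checkins)
print(scores["cafe"], scores["office"])
```

    In this toy graph the cafe, which links to more users, receives a higher score than the office. Plain PageRank on such a graph largely tracks how many users a location connects to; the paper's mixture with HITS is designed to capture sociality rather than that raw connectivity.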