5,588 research outputs found

    Privacy Intelligence: A Survey on Image Sharing on Online Social Networks

    Full text link
    Image sharing on online social networks (OSNs) has become an indispensable part of daily social activities, but it has also led to an increased risk of privacy invasion. The recent image leaks from popular OSN services and the abuse of personal photos using advanced algorithms (e.g. DeepFake) have prompted the public to rethink individual privacy needs when sharing images on OSNs. However, OSN image sharing itself is relatively complicated, and systems currently in place to manage privacy in practice are labor-intensive yet fail to provide personalized, accurate and flexible privacy protection. As a result, an more intelligent environment for privacy-friendly OSN image sharing is in demand. To fill the gap, we contribute a systematic survey of 'privacy intelligence' solutions that target modern privacy issues related to OSN image sharing. Specifically, we present a high-level analysis framework based on the entire lifecycle of OSN image sharing to address the various privacy issues and solutions facing this interdisciplinary field. The framework is divided into three main stages: local management, online management and social experience. At each stage, we identify typical sharing-related user behaviors, the privacy issues generated by those behaviors, and review representative intelligent solutions. The resulting analysis describes an intelligent privacy-enhancing chain for closed-loop privacy management. We also discuss the challenges and future directions existing at each stage, as well as in publicly available datasets.Comment: 32 pages, 9 figures. Under revie

    The Dark Side(-Channel) of Mobile Devices: A Survey on Network Traffic Analysis

    Full text link
    In recent years, mobile devices (e.g., smartphones and tablets) have met an increasing commercial success and have become a fundamental element of the everyday life for billions of people all around the world. Mobile devices are used not only for traditional communication activities (e.g., voice calls and messages) but also for more advanced tasks made possible by an enormous amount of multi-purpose applications (e.g., finance, gaming, and shopping). As a result, those devices generate a significant network traffic (a consistent part of the overall Internet traffic). For this reason, the research community has been investigating security and privacy issues that are related to the network traffic generated by mobile devices, which could be analyzed to obtain information useful for a variety of goals (ranging from device security and network optimization, to fine-grained user profiling). In this paper, we review the works that contributed to the state of the art of network traffic analysis targeting mobile devices. In particular, we present a systematic classification of the works in the literature according to three criteria: (i) the goal of the analysis; (ii) the point where the network traffic is captured; and (iii) the targeted mobile platforms. In this survey, we consider points of capturing such as Wi-Fi Access Points, software simulation, and inside real mobile devices or emulators. For the surveyed works, we review and compare analysis techniques, validation methods, and achieved results. We also discuss possible countermeasures, challenges and possible directions for future research on mobile traffic analysis and other emerging domains (e.g., Internet of Things). We believe our survey will be a reference work for researchers and practitioners in this research field.Comment: 55 page

    User identification and community exploration via mining big personal data in online platforms

    Get PDF
    User-generated big data mining is vital important for large online platforms in terms of security, profits improvement, products recommendation and system management. Personal attributes recognition, user behavior prediction, user identification, and community detection are the most critical and interesting issues that remain as challenges in many real applications in terms of accuracy, efficiency and data security. For an online platform with tens of thousands of users, it is always vulnerable to malicious users who pose a threat to other innocent users and consume unnecessary resources, where accurate user identification is urgently required to prevent corresponding malicious attempts. Meanwhile, accurate prediction of user behavior will help large platforms provide satisfactory recommendations to users and efficiently allocate different amounts of resources to different users. In addition to individual identification, community exploration of large social networks that formed by online databases could also help managers gain knowledge of how a community evolves. And such large scale and diverse social networks can be used to validate network theories, which are previously developed from synthetic networks or small real networks. In this thesis, we study several specific cases to address some key challenges that remain in different types of large online platforms, such as user behavior prediction for cold-start users, privacy protection for user-generated data, and large scale and diverse social community analysis. In the first case, as an emerging business, online education has attracted tens of thousands users as it can provide diverse courses that can exactly satisfy whatever demands of the students. Due to the limitation of public school systems, many students pursue private supplementary tutoring for improving their academic performance. Similar to online shopping platform, online education system is also a user-product based service, where users usually have to select and purchase the courses that meet their demands. It is important to construct a course recommendation and user behavior prediction system based on user attributes or user-generated data. Item recommendation in current online shopping systems is usually based on the interactions between users and products, since most of the personal attributes are unnecessary for online shopping services, and users often provide false information during registration. Therefore, it is not possible to recommend items based on personal attributes by exploiting the similarity of attributes among users, such as education level, age, school, gender, etc. Different from most online shopping platforms, online education platforms have access to a large number of credible personal attributes since accurate personal information is important in education service, and user behaviors could be predicted with just user attribute. Moreover, previous works on learning individual attributes are based primarily on panel survey data, which ensures its credibility but lacks efficiency. Therefore, most works simply include hundreds or thousands of users in the study. With more than 200,000 anonymous K-12 students' 3-year learning data from one of the world's largest online extra-curricular education platforms, we uncover students' online learning behaviors and infer the impact of students' home location, family socioeconomic situation and attended school's reputation/rank on the students' private tutoring course participation and learning outcomes. Further analysis suggests that such impact may be largely attributed to the inequality of access to educational resources in different cities and the inequality in family socioeconomic status. Finally, we study the predictability of students' performance and behaviors using machine learning algorithms with different groups of features, showing students' online learning performance can be predicted based on personal attributes and user-generated data with MAE<10%<10\%. As mentioned above, user attributes are usually fake information in most online platforms, and online platforms are usually vulnerable of malicious users. It is very important to identify the users or verify their attributes. Many researches have used user-generated mobile phone data (which includes sensitive information) to identify diverse user attributes, such as social economic status, ages, education level, professions, etc. Most of these approaches leverage original sensitive user data to build feature-rich models that take private information as input, such as exact locations, App usages and call detailed records. However, accessing users' mobile phone raw data may violate the more and more strict private data protection policies and regulations (e.g. GDPR). We observe that appropriate statistical methods can offer an effective means to eliminate private information and preserve personal characteristics, thus enabling the identification of the user attributes without privacy concern. Typically, identifying an unfamiliar caller's profession is important to protect citizens' personal safety and property. Due to limited data protection of various popular online services in some countries such as taxi hailing or takeouts ordering, many users nowadays encounter an increasing number of phone calls from strangers. The situation may be aggravated when criminals pretend to be such service delivery staff, bringing threats to the user individuals as well as the society. Additionally, more and more people suffer from excessive digital marketing and fraud phone calls because of personal information leakage. Therefore, a real time identification of unfamiliar caller is urgently needed. We explore the feasibility of user identification with privacy-preserved user-generated mobile, and we develop CPFinder, a system which implements automatic user identification callers on end devices. The system could mainly identify four categories of users: taxi drivers, delivery and takeouts staffs, telemarketers and fraudsters, and normal users (other professions). Our evaluation over an anonymized dataset of 1,282 users with a period of 3 months in Shanghai City shows that the CPFinder can achieve an accuracy of 75+\% for multi-class classification and 92.35+\% for binary classification. In addition to the mining of personal attributes and behaviors, the community mining of a large group of people based on online big data also attracts lots of attention due to the accessibility of large scale social network in online platforms. As one of the very important branch of social network, scientific collaboration network has been studied for decades as online big publication databases are easy to access and many user attribute are available. Academic collaborations become regular and the connections among researchers become closer due to the prosperity of globalized academic communications. It has been found that many computer science conferences are closed communities in terms of the acceptance of newcomers' papers, especially are the well-regarded conferences~\cite{cabot2018cs}. However, an in-depth study on the difference in the closeness and structural features of different conferences and what caused these differences is still missing. %Also, reviewing the strong and weak tie theories, there are multifaceted influences exerted by the combination of this two types of ties in different context. More analysis is needed to determine whether the network is closed or has other properties. We envision that social connections play an increasing role in the academic society and influence the paper selection process. The influences are not only restricted within visible links, but also extended to weak ties that connect two distanced node. Previous studies of coauthor networks did not adequately consider the central role of some authors in the publication venues, such as \ac{PC} chairs of the conferences. Such people could influence the evolutionary patterns of coauthor networks due to their authorities and trust for members to select accepted papers and their core positions in the community. Thus, in addition to the ratio of newcomers' papers it would be interesting if the PC chairs' relevant metrics could be quantified to measure the closure of a conference from the perspective of old authors' papers. Additionally, the analysis of the differences among different conferences in terms of the evolution of coauthor networks and degree of closeness may disclose the formation of closed communities. Therefore, we will introduce several different outcomes due to the various structural characteristics of several typical conferences. In this paper, using the DBLP dataset of computer science publications and a PC chair dataset, we show the evidence of the existence of strong and weak ties in coauthor networks and the PC chairs' influences are also confirmed to be related with the tie strength and network structural properties. Several PC chair relevant metrics based on coauthor networks are introduced to measure the closure and efficiency of a conference.2021-10-2

    Applications of Federated Learning in Smart Cities: Recent Advances, Taxonomy, and Open Challenges

    Full text link
    Federated learning plays an important role in the process of smart cities. With the development of big data and artificial intelligence, there is a problem of data privacy protection in this process. Federated learning is capable of solving this problem. This paper starts with the current developments of federated learning and its applications in various fields. We conduct a comprehensive investigation. This paper summarize the latest research on the application of federated learning in various fields of smart cities. In-depth understanding of the current development of federated learning from the Internet of Things, transportation, communications, finance, medical and other fields. Before that, we introduce the background, definition and key technologies of federated learning. Further more, we review the key technologies and the latest results. Finally, we discuss the future applications and research directions of federated learning in smart cities

    CYCLOSA: Decentralizing Private Web Search Through SGX-Based Browser Extensions

    Get PDF
    By regularly querying Web search engines, users (unconsciously) disclose large amounts of their personal data as part of their search queries, among which some might reveal sensitive information (e.g. health issues, sexual, political or religious preferences). Several solutions exist to allow users querying search engines while improving privacy protection. However, these solutions suffer from a number of limitations: some are subject to user re-identification attacks, while others lack scalability or are unable to provide accurate results. This paper presents CYCLOSA, a secure, scalable and accurate private Web search solution. CYCLOSA improves security by relying on trusted execution environments (TEEs) as provided by Intel SGX. Further, CYCLOSA proposes a novel adaptive privacy protection solution that reduces the risk of user re- identification. CYCLOSA sends fake queries to the search engine and dynamically adapts their count according to the sensitivity of the user query. In addition, CYCLOSA meets scalability as it is fully decentralized, spreading the load for distributing fake queries among other nodes. Finally, CYCLOSA achieves accuracy of Web search as it handles the real query and the fake queries separately, in contrast to other existing solutions that mix fake and real query results

    Profiling user activities with minimal traffic traces

    Full text link
    Understanding user behavior is essential to personalize and enrich a user's online experience. While there are significant benefits to be accrued from the pursuit of personalized services based on a fine-grained behavioral analysis, care must be taken to address user privacy concerns. In this paper, we consider the use of web traces with truncated URLs - each URL is trimmed to only contain the web domain - for this purpose. While such truncation removes the fine-grained sensitive information, it also strips the data of many features that are crucial to the profiling of user activity. We show how to overcome the severe handicap of lack of crucial features for the purpose of filtering out the URLs representing a user activity from the noisy network traffic trace (including advertisement, spam, analytics, webscripts) with high accuracy. This activity profiling with truncated URLs enables the network operators to provide personalized services while mitigating privacy concerns by storing and sharing only truncated traffic traces. In order to offset the accuracy loss due to truncation, our statistical methodology leverages specialized features extracted from a group of consecutive URLs that represent a micro user action like web click, chat reply, etc., which we call bursts. These bursts, in turn, are detected by a novel algorithm which is based on our observed characteristics of the inter-arrival time of HTTP records. We present an extensive experimental evaluation on a real dataset of mobile web traces, consisting of more than 130 million records, representing the browsing activities of 10,000 users over a period of 30 days. Our results show that the proposed methodology achieves around 90% accuracy in segregating URLs representing user activities from non-representative URLs

    Understanding and controlling leakage in machine learning

    Get PDF
    Machine learning models are being increasingly adopted in a variety of real-world scenarios. However, the privacy and confidentiality implications introduced in these scenarios are not well understood. Towards better understanding such implications, we focus on scenarios involving interactions between numerous parties prior to, during, and after training relevant models. Central to these interactions is sharing information for a purpose e.g., contributing data samples towards a dataset, returning predictions via an API. This thesis takes a step toward understanding and controlling leakage of private information during such interactions. In the first part of the thesis we investigate leakage of private information in visual data and specifically, photos representative of content shared on social networks. There is a long line of work to tackle leakage of personally identifiable information in social photos, especially using face- and body-level visual cues. However, we argue this presents only a narrow perspective as images reveal a wide spectrum of multimodal private information (e.g., disabilities, name-tags). Consequently, we work towards a Visual Privacy Advisor that aims to holistically identify and mitigate private risks when sharing social photos. In the second part, we address leakage during training of ML models. We observe learning algorithms are being increasingly used to train models on rich decentralized datasets e.g., personal data on numerous mobile devices. In such cases, information in the form of high-dimensional model parameter updates are anonymously aggregated from participating individuals. However, we find that the updates encode sufficient identifiable information and allows them to be linked back to participating individuals. We additionally propose methods to mitigate this leakage while maintaining high utility of the updates. In the third part, we discuss leakage of confidential information during inference time of black-box models. In particular, we find models lend themselves to model functionality stealing attacks: an adversary can interact with the black-box model towards creating a replica `knock-off' model that exhibits similar test-set performances. As such attacks pose a severe threat to the intellectual property of the model owner, we also work towards effective defenses. Our defense strategy by introducing bounded and controlled perturbations to predictions can significantly amplify model stealing attackers' error rates. In summary, this thesis advances understanding of privacy leakage when information is shared in raw visual forms, during training of models, and at inference time when deployed as black-boxes. In each of the cases, we further propose techniques to mitigate leakage of information to enable wide-spread adoption of techniques in real-world scenarios.Modelle für maschinelles Lernen werden zunehmend in einer Vielzahl realer Szenarien eingesetzt. Die in diesen Szenarien vorgestellten Auswirkungen auf Datenschutz und Vertraulichkeit wurden jedoch nicht vollständig untersucht. Um solche Implikationen besser zu verstehen, konzentrieren wir uns auf Szenarien, die Interaktionen zwischen mehreren Parteien vor, während und nach dem Training relevanter Modelle beinhalten. Das Teilen von Informationen für einen Zweck, z. B. das Einbringen von Datenproben in einen Datensatz oder die Rückgabe von Vorhersagen über eine API, ist zentral für diese Interaktionen. Diese Arbeit verhilft zu einem besseren Verständnis und zur Kontrolle des Verlusts privater Informationen während solcher Interaktionen. Im ersten Teil dieser Arbeit untersuchen wir den Verlust privater Informationen bei visuellen Daten und insbesondere bei Fotos, die für Inhalte repräsentativ sind, die in sozialen Netzwerken geteilt werden. Es gibt eine lange Reihe von Arbeiten, die das Problem des Verlustes persönlich identifizierbarer Informationen in sozialen Fotos angehen, insbesondere mithilfe visueller Hinweise auf Gesichts- und Körperebene. Wir argumentieren jedoch, dass dies nur eine enge Perspektive darstellt, da Bilder ein breites Spektrum multimodaler privater Informationen (z. B. Behinderungen, Namensschilder) offenbaren. Aus diesem Grund arbeiten wir auf einen Visual Privacy Advisor hin, der darauf abzielt, private Risiken beim Teilen sozialer Fotos ganzheitlich zu identifizieren und zu minimieren. Im zweiten Teil befassen wir uns mit Datenverlusten während des Trainings von ML-Modellen. Wir beobachten, dass zunehmend Lernalgorithmen verwendet werden, um Modelle auf umfangreichen dezentralen Datensätzen zu trainieren, z. B. persönlichen Daten auf zahlreichen Mobilgeräten. In solchen Fällen werden Informationen von teilnehmenden Personen in Form von hochdimensionalen Modellparameteraktualisierungen anonym verbunden. Wir stellen jedoch fest, dass die Aktualisierungen ausreichend identifizierbare Informationen codieren und es ermöglichen, sie mit teilnehmenden Personen zu verknüpfen. Wir schlagen zudem Methoden vor, um diesen Datenverlust zu verringern und gleichzeitig die hohe Nützlichkeit der Aktualisierungen zu erhalten. Im dritten Teil diskutieren wir den Verlust vertraulicher Informationen während der Inferenzzeit von Black-Box-Modellen. Insbesondere finden wir, dass sich Modelle für die Entwicklung von Angriffen, die auf Funktionalitätsdiebstahl abzielen, eignen: Ein Gegner kann mit dem Black-Box-Modell interagieren, um ein Replikat-Knock-Off-Modell zu erstellen, das ähnliche Test-Set-Leistungen aufweist. Da solche Angriffe eine ernsthafte Bedrohung für das geistige Eigentum des Modellbesitzers darstellen, arbeiten wir auch an einer wirksamen Verteidigung. Unsere Verteidigungsstrategie durch die Einführung begrenzter und kontrollierter Störungen in Vorhersagen kann die Fehlerraten von Modelldiebstahlangriffen erheblich verbessern. Zusammenfassend lässt sich sagen, dass diese Arbeit das Verständnis von Datenschutzverlusten beim Informationsaustausch verbessert, sei es bei rohen visuellen Formen, während des Trainings von Modellen oder während der Inferenzzeit von Black-Box-Modellen. In jedem Fall schlagen wir ferner Techniken zur Verringerung des Informationsverlusts vor, um eine weit verbreitete Anwendung von Techniken in realen Szenarien zu ermöglichen.Max Planck Institute for Informatic
    • …
    corecore