Differential Privacy, Property Testing, and Perturbations
Controlling the dissemination of information about ourselves has become a minefield in
the modern age. We release data about ourselves every day and don’t always fully understand
what information is contained in this data. Seemingly innocuous pieces of data can
often be combined to reveal more sensitive information about ourselves than we
intended. Differential privacy has developed as a technique
to prevent this type of privacy leakage. It borrows ideas from information theory to inject
enough uncertainty into the data so that sensitive information is provably absent from
the privatised data. Current research in differential privacy walks the fine line between
removing sensitive information and allowing non-sensitive information to be released.
At its heart, this thesis is about the study of information. Many of the results can be
formulated as asking one or both of the following questions: does the data you have
contain enough information to learn what you would like to learn? And how can I alter
the data to ensure you cannot discern sensitive information? We will often approach the former question from
both directions: information theoretic lower bounds on recovery and algorithmic upper
bounds.
We begin with an information theoretic lower bound for graphon estimation. This explores
the fundamental limits of how much information about the underlying population is
contained in a finite sample of data. We then move on to exploring the connection between
information theoretic results and privacy in the context of linear inverse problems. We find
that there is a discrepancy between how the inverse problems community and the privacy
community view good recovery of information. Next, we explore black-box testing for
privacy. We argue that the amount of information required to verify the privacy guarantee
of an algorithm, without access to the internals of the algorithm, is lower bounded by the
amount of information required to break the privacy guarantee. Finally, we explore a setting
where imposing privacy is a help rather than a hindrance: online linear optimisation.
We argue that private algorithms have the right kind of stability guarantee to ensure low
regret for online linear optimisation.
PhD, Mathematics. University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/143940/1/amcm_1.pd
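The noise-injection idea at the heart of differential privacy can be sketched with the classical Laplace mechanism (a standard textbook construction, not one of this thesis's specific results): a counting query has sensitivity 1, so Laplace noise of scale 1/ε makes the released count ε-differentially private.

```python
import math
import random

def laplace_noise(scale, rng):
    # Sample Laplace(0, scale) via the inverse CDF of a uniform draw.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(values, predicate, epsilon, rng=None):
    """epsilon-DP count release: a count query has sensitivity 1, so
    adding Laplace noise of scale 1/epsilon provably masks the
    presence or absence of any single individual."""
    rng = rng or random.Random(0)
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

The released value is close to the true count, yet any one individual's contribution is statistically hidden by the injected uncertainty.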
Privacy-Preserving Distributed Optimization via Subspace Perturbation: A General Framework
As the modern world becomes increasingly digitized and interconnected,
distributed signal processing has proven to be effective in processing its
large volume of data. However, a main challenge limiting the broad use of
distributed signal processing techniques is the issue of privacy in handling
sensitive data. To address this privacy issue, we propose a novel yet general
subspace perturbation method for privacy-preserving distributed optimization,
which allows each node to obtain the desired solution while protecting its
private data. In particular, we show that the dual variables introduced in each
distributed optimizer will not converge in a certain subspace determined by the
graph topology. Additionally, the optimization variable is ensured to converge
to the desired solution, because it is orthogonal to this non-convergent
subspace. We therefore propose to insert noise in the non-convergent subspace
through the dual variable such that the private data are protected, and the
accuracy of the desired solution is completely unaffected. Moreover, the
proposed method is shown to be secure under two widely-used adversary models:
passive and eavesdropping. Furthermore, we consider several distributed
optimizers such as ADMM and PDMM to demonstrate the general applicability of
the proposed method. Finally, we test the performance through a set of
applications. Numerical tests indicate that the proposed method is superior to
existing methods in terms of several metrics, including estimation accuracy,
privacy level, communication cost, and convergence rate.
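The key geometric fact — noise confined to a subspace orthogonal to the one carrying the solution leaves the solution untouched — can be shown with a toy two-dimensional example (illustrative only; the paper works with the dual variables of distributed optimizers such as ADMM and PDMM, not these hand-picked vectors).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(v, basis):
    # Orthogonal projection of v onto the span of an orthonormal basis.
    out = [0.0] * len(v)
    for b in basis:
        c = dot(v, b)
        out = [o + c * x for o, x in zip(out, b)]
    return out

s = 1 / 2 ** 0.5
u = [s, s]    # 'convergent' subspace: the component that determines the solution
w = [s, -s]   # 'non-convergent' subspace: orthogonal complement of u

dual = [3.0, 1.0]
# Insert large noise only in the non-convergent subspace:
perturbed = [d + 10.0 * x for d, x in zip(dual, w)]
```

However large the noise in span(w), the projection onto span(u) — and hence the recovered solution — is unchanged, which is why the accuracy of the desired solution is unaffected.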
Data Analytics and Performance Enhancement in Edge-Cloud Collaborative Internet of Things Systems
Building on evolving communications, computing, and embedded systems technologies, Internet of Things (IoT) systems can interconnect not only physical users and devices but also virtual services and objects, and have already been applied to many different scenarios, such as smart home, smart healthcare, and intelligent transportation. With this rapid development, the number of involved devices has increased tremendously. The huge number of devices and the data they generate bring critical challenges to IoT systems. To enhance overall performance, this thesis aims to address technical issues in IoT data processing and in the physical topology discovery of the subnets self-organized by IoT devices.
First, the issues of outlier detection and data aggregation are addressed through the development of a recursive principal component analysis (R-PCA) based data analysis framework. The framework is developed in a cluster-based structure to fully exploit the spatial correlation of IoT data. Specifically, the sensing devices are gathered into clusters based on spatial data correlation. Edge devices are assigned to the clusters for R-PCA based outlier detection and data aggregation. The outlier-free, aggregated data are forwarded to the remote cloud server for data reconstruction and storage. Moreover, a data reduction scheme is further proposed to relieve the burden on the trunk link for data uploading by exploiting temporal data correlation. Kalman filters (KFs) with identical parameters are maintained at the edge and in the cloud for data prediction. The amount of uploaded data is reduced by using the data predicted by the KF in the cloud instead of uploading all of the measured data.
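The KF-based data reduction scheme can be sketched as follows: identical filters run at the edge and in the cloud, and a raw sample is uploaded only when it deviates too much from the shared prediction; otherwise both sides update with the prediction so the filters stay synchronized. This is a minimal 1-D sketch with illustrative parameter values, not the thesis's implementation.

```python
class Kalman1D:
    """Minimal 1-D random-walk Kalman filter; identical copies run at
    the edge and in the cloud (illustrative noise parameters q, r)."""
    def __init__(self, q=0.01, r=0.25, x0=0.0, p0=1.0):
        self.q, self.r, self.x, self.p = q, r, x0, p0
    def predict(self):
        self.p += self.q          # process noise inflates uncertainty
        return self.x             # predicted measurement
    def update(self, z):
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= (1 - k)

def transmit(measurements, threshold=0.5):
    edge, cloud = Kalman1D(), Kalman1D()
    uploads, reconstructed = 0, []
    for z in measurements:
        pred = edge.predict()
        cloud_pred = cloud.predict()       # equals pred: filters are identical
        if abs(z - pred) > threshold:      # prediction too far off: upload raw sample
            uploads += 1
            edge.update(z); cloud.update(z)
            reconstructed.append(z)
        else:                              # cloud substitutes its own prediction
            edge.update(pred); cloud.update(cloud_pred)
            reconstructed.append(cloud_pred)
    return reconstructed, uploads
```

Slowly varying data generates no uploads at all; only abrupt changes cost trunk-link bandwidth.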
Furthermore, an unmanned aerial vehicle (UAV) assisted IoT system is designed specifically for large-scale monitoring. Wireless sensor nodes are flexibly deployed for environmental sensing and self-organized into wireless sensor networks (WSNs). A physical topology discovery scheme is proposed to construct the physical topology of the WSNs in the cloud server to facilitate performance optimization, where the physical topology comprises both the logical connectivity statuses of the WSNs and the physical locations of the WSN nodes. The scheme is implemented through newly developed parallel Metropolis-Hastings random walk based information sampling and network-wide 3D localization algorithms, where UAVs serve as mobile edge devices and anchor nodes. Based on the physical topology constructed in the cloud, a UAV-enabled spatial data sampling scheme is further proposed to efficiently sample data from the monitoring area using a denoising autoencoder (DAE). By deploying the encoder of the DAE at the UAV and the decoder in the cloud, data can be partially sampled from the sensing field and accurately reconstructed in the cloud.
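The Metropolis-Hastings random walk underlying the information sampling step can be sketched on an abstract graph: accepting a move from node u to neighbour v with probability min(1, deg(u)/deg(v)) removes the degree bias of a plain random walk, making the stationary distribution uniform over nodes. This is a generic single-walker sketch, not the parallel algorithm developed in the thesis.

```python
import random

def mh_random_walk(adj, start, steps, rng=None):
    """Metropolis-Hastings random walk on an undirected graph given as
    an adjacency dict {node: [neighbours]}. Returns visit counts; the
    stationary distribution is uniform over nodes."""
    rng = rng or random.Random(0)
    node, visits = start, {}
    for _ in range(steps):
        v = rng.choice(adj[node])
        # Accept with probability min(1, deg(node)/deg(v)).
        if rng.random() < min(1.0, len(adj[node]) / len(adj[v])):
            node = v
        visits[node] = visits.get(node, 0) + 1
    return visits
```

On a path graph 0-1-2, a plain random walk would oversample the middle node; the MH correction visits all three nodes roughly equally.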
In the final part of the thesis, a novel autoencoder (AE) neural network based data outlier detection algorithm is proposed, in which both the encoder and decoder of the AE are deployed at the edge devices. Data outliers can be accurately detected from large fluctuations in the squared reconstruction error produced when data pass through the encoder and decoder of the AE.
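The detection rule reduces to thresholding the squared reconstruction error. The sketch below stands in a fixed linear encoder/decoder (projection onto the line y = x) for the trained AE — illustrative only: inliers near the learned manifold reconstruct well, while outliers produce a large error.

```python
def encode(p):
    # Stand-in 'encoder': project the 2-D point onto the direction (1,1)/sqrt(2).
    return (p[0] + p[1]) / 2 ** 0.5

def decode(c):
    # Stand-in 'decoder': map the 1-D code back into the plane.
    return (c / 2 ** 0.5, c / 2 ** 0.5)

def squared_error(p):
    # Squared reconstruction error after the encode-decode round trip.
    r = decode(encode(p))
    return (p[0] - r[0]) ** 2 + (p[1] - r[1]) ** 2

def detect_outliers(points, threshold):
    return [p for p in points if squared_error(p) > threshold]
```

Points near the line y = x reconstruct almost perfectly; a point far off the line is flagged.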
Learning from aggregated data
Data aggregation is ubiquitous in modern life. For reasons such as privacy, scalability, and robustness, ground truth data is often subjected to aggregation before being released to the public or utilised by researchers and analysts. Learning from aggregated data is a challenging problem that requires significant algorithmic innovation, since naive application of standard techniques to aggregated data is vulnerable to the ecological fallacy. In this work, we explore three different versions of this setting.
First, we tackle the problem of using generalised linear models when features/covariates are fully observed but the targets are only available as histograms, a common scenario in the healthcare domain where many datasets contain both non-sensitive attributes like age, sex, and zip code, as well as privacy-sensitive attributes like healthcare records. We introduce an efficient algorithm that uses alternating data imputation and GLM estimation steps to learn predictive models in this setting.
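The alternating imputation/estimation idea can be sketched for the Gaussian case (ordinary least squares as the GLM), with an imputation step that matches the sorted histogram values to the sorted model predictions. This is an illustrative simplification, not the algorithm of the work itself.

```python
def fit_ls(xs, ys):
    """Ordinary least squares for y ~ a*x + b (the Gaussian GLM)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def fit_from_histogram(xs, target_values, iters=10):
    """Targets are known only as a bag of values (a histogram), not
    matched to individuals. Alternate: (1) impute the matching by
    aligning sorted targets with sorted model predictions, then
    (2) refit the model on the imputed pairs."""
    a, b = 1.0, 0.0
    ys_sorted = sorted(target_values)
    for _ in range(iters):
        order = sorted(range(len(xs)), key=lambda i: a * xs[i] + b)
        imputed = [0.0] * len(xs)
        for rank, i in enumerate(order):
            imputed[i] = ys_sorted[rank]
        a, b = fit_ls(xs, imputed)
    return a, b
```

When the target bag comes from a monotone relationship, the imputation step recovers the correct matching and the fit converges in a single pass.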
Next, we look at the problem of learning sparse linear models when both features and targets are in aggregated form, specified as empirical estimates of group-wise means computed over different sub-groups of the population. We show that if the true sub-populations are heterogeneous enough, the optimal sparse parameter can be recovered within an arbitrarily small tolerance even in the presence of noise, provided the empirical estimates are obtained from a sufficiently large number of observations.
Third, we tackle the scenario of predictive modelling with data that is subjected to spatio-temporal aggregation. We show that by formulating the problem in the frequency domain, we can bypass the mathematical and representational challenges that arise due to non-uniform aggregation, misaligned sampling periods and aliasing. We introduce a novel algorithm that uses restricted Fourier transforms to estimate a linear model which, when applied to spatio-temporally aggregated data, has a generalisation error that is provably close to the optimal performance by the best possible linear model that can be learned from the non-aggregated data set.
We then focus our attention on the complementary problem that involves designing aggregation strategies that can allow learning, as well as developing algorithmic techniques that can use only the aggregates to train a model that works on individual samples. We motivate our methods using the example of Gaussian regression, and subsequently extend our techniques to subsume binary classifiers and generalised linear models. We demonstrate the effectiveness of our techniques with empirical evaluation on data from healthcare and telecommunications.
Finally, we present a concrete example of our methods applied to a real-life practical problem. Specifically, we consider an application in the domain of online advertising where the complexity of bidding strategies requires accurate estimates of the most probable cost-per-click (CPC) incurred by advertisers, but the data used for training these CPC prediction models are only available as aggregated invoices supplied by an ad publisher on a daily or hourly basis. We introduce a novel learning framework that can use aggregates computed at varying levels of granularity for building individual-level predictive models. We generalise our modelling and algorithmic framework to handle data from diverse domains, and extend our techniques to cover arbitrary aggregation paradigms like sliding windows and overlapping/non-uniform aggregation. We show empirical evidence for the efficacy of our techniques with experiments on both synthetic data and real data from the online advertising domain as well as healthcare to demonstrate the wider applicability of our framework.
Electrical and Computer Engineerin
Device Free Localisation Techniques in Indoor Environments
For a long time, target location estimation was performed only by device-based localisation techniques, which are difficult to apply when the target, especially a human, is non-cooperative. A target was detected by equipping it with a device using global positioning systems, radio frequency systems, ultrasonic frequency systems, etc. Device free localisation (DFL) is an emerging technology for automated localisation in which the target need not carry any device to be located. Wireless sensor networks, given their growing popularity, are a natural choice for achieving this objective. This paper describes a categorisation of recently developed DFL techniques using wireless sensor networks. The scope of each category of techniques is analysed by comparing their potential benefits and drawbacks. Finally, future scope and research directions in this field are summarised.
Towards System Implementation and Data Analysis for Crowdsensing Based Outdoor RSS Maps
© 2013 IEEE. With the explosive usage of smart mobile devices, sustainable access to wireless networks (e.g., Wi-Fi) has become a pervasive demand. Most mobile users expect seamless network connection at low cost. This can be achieved by using an accurate received signal strength (RSS) map of wireless access points. While existing methods are either costly or unscalable, the recently emerged mobile crowdsensing (MCS) paradigm is a promising technique for building RSS maps. MCS applications leverage pervasive mobile devices to collaboratively collect data. However, the heterogeneity of devices and the mobility of users cause inherent noise and blank spots in the collected data set. In this paper, we study how to: 1) tame the sensing noise from heterogeneous mobile devices and 2) construct accurate and complete RSS maps under the random mobility of crowdsensing participants. First, we build a mobile crowdsensing system called iMap to collect RSS measurements with heterogeneous mobile devices. Second, by observing experimental results, we build statistical models of sensing noise and derive parameters for each kind of mobile device. Third, we present a signal transmission model with a measurement error model, and propose a novel signal recovery scheme to construct accurate and complete RSS maps. The evaluation results show that the proposed method can achieve 90% and 95% recovery rates in the geographic and polar coordinate systems, respectively.
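A signal transmission model of the kind used for RSS recovery is the classical log-distance path-loss model, RSS(d) = p0 − 10·n·log10(d/d0); fitting its parameters by least squares and evaluating the fit at unmeasured distances is a minimal sketch of filling blank spots on the map (illustrative only, not the paper's full recovery scheme or its error model).

```python
import math

def fit_path_loss(samples, d0=1.0):
    """Least-squares fit of RSS(d) = p0 - 10*n*log10(d/d0) to a list of
    (distance, rss) samples; returns (p0, n). The model is linear in
    x = -10*log10(d/d0), so a 1-D regression suffices."""
    xs = [-10.0 * math.log10(d / d0) for d, _ in samples]
    ys = [rss for _, rss in samples]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    n = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - n * mx, n

def predict_rss(p0, n, d, d0=1.0):
    # Fill a blank spot on the map by evaluating the fitted model.
    return p0 - 10.0 * n * math.log10(d / d0)
```

With clean synthetic data (p0 = −40 dBm, path-loss exponent n = 2) the fit recovers the parameters exactly; real crowdsensed data would first pass through the per-device noise models described above.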