Data Summarizations for Scalable, Robust and Privacy-Aware Learning in High Dimensions
The advent of large-scale datasets has offered unprecedented amounts of information for building statistically powerful machines, but, at the same time, also introduced a remarkable computational challenge: how can we efficiently process massive data? This thesis presents a suite of data reduction methods that make learning algorithms scale on large datasets, via extracting a succinct model-specific representation that summarizes the full data collection: a coreset. Our frameworks support by design datasets of arbitrary dimensionality, and can be used for general purpose Bayesian inference under real-world constraints, including privacy preservation and robustness to outliers, encompassing diverse uncertainty-aware data analysis tasks, such as density estimation, classification and regression.
We motivate the necessity for novel data reduction techniques in the first place by developing a reidentification attack on coarsened representations of private behavioural data. Analysing longitudinal records of human mobility, we detect privacy-revealing structural patterns that remain preserved in reduced graph representations of individuals' information with manageable size. These unique patterns enable mounting linkage attacks via structural similarity computations on longitudinal mobility traces, revealing an overlooked, yet existing, privacy threat.
We then propose a scalable variational inference scheme for approximating posteriors on large datasets via learnable weighted pseudodata, termed pseudocoresets. We show that the use of pseudodata overcomes the constraints on minimum summary size for a given approximation quality that data dimensionality imposes on all existing Bayesian coreset constructions. Moreover, it allows us to develop a scheme for pseudocoreset-based summarization that satisfies the standard framework of differential privacy by construction; in this way, we can release reduced-size privacy-preserving representations for sensitive datasets that are amenable to arbitrary post-processing.
Subsequently, we consider summarizations for large-scale Bayesian inference in scenarios where observed datapoints depart from the statistical assumptions of our model. Using robust divergences, we develop a method for constructing coresets resilient to model misspecification. Crucially, this method is able to automatically discard outliers from the generated data summaries. Thus we deliver robustified scalable representations for inference that are suitable for applications involving contaminated and unreliable data sources.
We demonstrate the performance of proposed summarization techniques on multiple parametric statistical models, and diverse simulated and real-world datasets, from music genre features to hospital readmission records, considering a wide range of data dimensionalities.
Nokia Bell Labs, Lundgren Fund, Darwin College, University of Cambridge, Department of Computer Science & Technology, University of Cambridge
Quantifying Privacy Loss of Human Mobility Graph Topology
Human mobility is often represented as a mobility network, or graph, with nodes representing places of significance which an individual visits, such as their home, work, places of social amenity, etc., and edge weights corresponding to probability estimates of movements between these places. Previous research has shown that individuals can be identified by a small number of geolocated nodes in their mobility network, rendering mobility trace anonymization a hard task. In this paper we build on prior work and demonstrate that even when all location and timestamp information is removed from nodes, the graph topology of an individual mobility network itself is often uniquely identifying. Further, we observe that a mobility network is often unique, even when only a small number of the most popular nodes and edges are considered. We evaluate our approach using a large dataset of cell-tower location traces from 1 500 smartphone handsets with a mean duration of 430 days. We process the data to derive the top-N places visited by the device in the trace, and find that 93% of traces have a unique top-10 mobility network, and all traces are unique when considering top-15 mobility networks. Since mobility patterns, and therefore mobility networks for an individual, vary over time, we use graph kernel distance functions to determine whether two mobility networks, taken at different points in time, represent the same individual. We then show that our distance metrics, while imperfect predictors, perform significantly better than a random strategy and therefore our approach represents a significant loss in privacy.
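The construction of a top-N mobility network described above can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the function names `top_n_mobility_network` and `degree_signature`, the toy trace, and the choice to estimate edge weights from consecutive visits are all assumptions made here. The degree signature is only a simple label-free invariant (two isomorphic graphs share it, but the converse does not hold); the paper itself relies on graph kernel distance functions.

```python
from collections import Counter


def top_n_mobility_network(trace, n):
    """Build a top-n mobility network from an ordered list of place labels.

    Returns (nodes, edges): nodes are the n most visited places, and edges
    map (origin, destination) to an estimated transition probability.
    """
    visits = Counter(trace)
    nodes = [place for place, _ in visits.most_common(n)]
    kept = set(nodes)
    # Drop visits to places outside the top n, then count consecutive moves.
    filtered = [p for p in trace if p in kept]
    moves = Counter((a, b) for a, b in zip(filtered, filtered[1:]) if a != b)
    out_totals = Counter()
    for (a, _), count in moves.items():
        out_totals[a] += count
    # Normalise outgoing counts so each origin's edge weights sum to 1.
    edges = {(a, b): c / out_totals[a] for (a, b), c in moves.items()}
    return nodes, edges


def degree_signature(nodes, edges):
    """Label-free summary: sorted (in-degree, out-degree) pairs of the nodes."""
    indeg = Counter(b for _, b in edges)
    outdeg = Counter(a for a, _ in edges)
    return sorted((indeg[p], outdeg[p]) for p in nodes)


# Hypothetical trace of visited places, in time order.
trace = ["home", "work", "home", "gym", "home", "work", "cafe", "work", "home"]
nodes, edges = top_n_mobility_network(trace, 3)
signature = degree_signature(nodes, edges)
```

Because the signature discards place labels, it stays the same when locations are anonymized, which is the property the topology-based attack exploits.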
Countering Acoustic Adversarial Attacks in Microphone-equipped Smart Home Devices
Deep neural networks (DNNs) continue to demonstrate superior generalization performance in an increasing range of applications, including speech recognition and image understanding. Recent innovations in compression algorithms, design of efficient architectures and hardware accelerators have prompted a rapid growth in deploying DNNs on mobile and IoT devices to redefine user experiences. Relying on the superior inference quality of DNNs, various voice-enabled devices have started to pervade our everyday lives and are increasingly used for, e.g., opening and closing doors, starting or stopping washing machines, ordering products online, and authenticating monetary transactions. As the popularity of these voice-enabled services increases, so does their risk of being attacked. Recently, DNNs have been shown to be extremely brittle under adversarial attacks, and people with malicious intentions can potentially exploit this vulnerability to compromise DNN-based voice-enabled systems. Although some existing work already highlights the vulnerability of audio models, very little is known of the behaviour of compressed on-device audio models under adversarial attacks. This paper bridges this gap by thoroughly investigating the vulnerabilities of compressed audio DNNs and makes a stride towards making compressed models robust. In particular, we propose a stochastic compression technique that generates compressed models with greater robustness to adversarial attacks. We present an extensive set of evaluations on adversarial vulnerability and robustness of DNNs in two diverse audio recognition tasks, while considering two popular attack algorithms: FGSM and PGD. We found that error rates of conventionally trained audio DNNs under attack can be as high as 100%. Under both white- and black-box attacks, our proposed approach is found to decrease the error rate of DNNs under attack by a large margin.
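For readers unfamiliar with FGSM (the fast gradient sign method), the attack perturbs an input by a small step in the direction of the sign of the loss gradient with respect to that input. The sketch below applies it to a toy logistic-regression classifier rather than an audio DNN; the weights, input, and `epsilon` value are made up for illustration and do not come from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of the positive class under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(weights, bias, x, y, epsilon):
    """One FGSM step: x_adv = x + epsilon * sign(d loss / d x).

    For binary cross-entropy on a logistic model, the gradient of the loss
    with respect to the input is (p - y) * weights.
    """
    p = predict(weights, bias, x)
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and input, chosen only to show the effect.
weights, bias = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_adv = fgsm(weights, bias, x, y, epsilon=0.5)
clean_score = predict(weights, bias, x)    # above 0.5: classified as positive
adv_score = predict(weights, bias, x_adv)  # below 0.5: the prediction flips
```

PGD, the second attack named above, can be viewed as repeating such a step several times with a smaller step size while projecting back into an epsilon-ball around the original input.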
A new method to retrieve the real part of the equivalent refractive index of atmospheric aerosols
This document is the Accepted Manuscript version of the following article: S. Vratolis et al., "A new method to retrieve the real part of the equivalent refractive index of atmospheric aerosols", Journal of Aerosol Science, Vol. 117: 54-62, March 2018. Under embargo until 29 December 2019. The final, published version is available online at DOI: https://doi.org/10.1016/j.jaerosci.2017.12.013.
In the context of the international experimental campaign Hygroscopic Aerosols to Cloud Droplets (HygrA-CD, 15 May to 22 June 2014), dry aerosol size distributions were measured at Demokritos station (DEM) using a Scanning Mobility Particle Sizer (SMPS) in the size range from 10 to 550 nm (electrical mobility diameter), and an Optical Particle Counter (OPC model Grimm 107 operating at the laser wavelength of 660 nm) to acquire the particle size distribution in the size range of 250 nm to 2.5 μm optical diameter. This work describes a method that was developed to align size distributions in the overlapping range of the SMPS and the OPC, thus allowing us to retrieve the real part of the aerosol equivalent refractive index (ERI). The objective is to show that size distribution data acquired at in situ measurement stations can provide an insight into the physical and chemical properties of aerosol particles, leading to better understanding of aerosol impact on human health and earth radiative balance. The resulting ERI could be used in radiative transfer models to assess the direct effect of aerosol forcing, as well as an index of aerosol chemical composition. To validate the method, a series of calibration experiments were performed using compounds with known refractive index (RI). This led to a corrected version of the ERI values (ERICOR). The ERICOR values were subsequently compared to model estimates of RI values, based on measured PM2.5 chemical composition, and to aerosol RI values retrieved from inverted lidar measurements on selected days.
Peer reviewed
Quantitative assessment of the variability in chemical profiles from source apportionment analysis of PM10 and PM2.5 at different sites within a large metropolitan area
The study aims to assess the differences between the chemical profiles of the major anthropogenic and natural PM sources in two areas with different levels of urbanization and traffic density within the same urban agglomeration. A traffic site and an urban background site in the Athens Metropolitan Area have been selected for this comparison. For both sites, eight sources were identified, with seven of them being common to the two sites (Mineral Dust, non-Exhaust Emissions, Exhaust Emissions, Heavy Oil Combustion, Sulfates & Organics, Sea Salt and Biomass Burning) and one site-specific (Nitrates for the traffic site and Aged Sea Salt for the urban background site). The similarity between the source profiles was quantified using two statistical analysis tools, Pearson correlation (PC) and Standardized Identity Distance (SID). According to Pearson coefficients, five out of the eight source profiles present high (PC > 0.8) correlation (Mineral Dust, Biomass Burning, Sea Salt, Sulfates and Heavy Oil Combustion), one presents moderate (0.8 > PC > 0.6) correlation (Exhaust) and two low/no (PC < 0.6) correlation (non-Exhaust, Nitrates/Aged Sea Salt). The source profiles that appear to be more correlated are those of sources that are not expected to have high spatial variability, because they are either natural or secondary and thus have a regional character, or are emitted outside the urban agglomeration and are transported to both sites. According to SID, four out of the eight sources have high statistical correlation (SID < 1) in the two sites (Mineral Dust, Sea Salt, Sulfates, Heavy Oil Combustion). Biomass Burning was found to be the source that yielded different results from the two methodologies. The careful examination of the source profile of that source revealed the reason for this discrepancy. SID takes all the species of the profile equally into account, while PC might be disproportionately affected by a small number of species with very high concentrations. It is suggested, based on the findings of this work, that the combined use of both tools can lead the users to a thorough evaluation of the similarity of source profiles. This work is, to the best of our knowledge, the first time a study is focused on the quantitative comparison of the source profiles for sites inside the same urban agglomeration using statistical indicators.
The study was supported by "CALIBRA/EYIE" (MIS 5002799) and "PANhellenic infrastructure for Atmospheric Composition and climatE change" (MIS 5021516) implemented under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund). Collection and chemical analysis of samples were supported by LIFE + AIRUSE EU project (ENV/ES/584). Partial support was also received by H2020 ERAPLANET/SMURBS ERANET GA No 689443.
Peer reviewed
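The difference between the two similarity indicators in the source apportionment abstract above can be illustrated with a small sketch. The abstract does not give the SID formula; the code assumes the form SID = (sqrt(2)/m) * sum_j |x_j - y_j| / (x_j + y_j) over the m species of two profiles x and y, which should be checked against the paper before any real use. The two profiles are invented numbers: one dominant species that agrees between sites and six minor species that differ by a factor of 20.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two source profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def sid(x, y):
    """Standardized Identity Distance (assumed definition, see lead-in).

    Each species contributes a relative difference between 0 and 1, so all
    species weigh equally regardless of their absolute concentration.
    """
    m = len(x)
    return math.sqrt(2) / m * sum(abs(a - b) / (a + b) for a, b in zip(x, y))


# Hypothetical mass-fraction profiles for the same source at two sites.
site_a = [0.50, 0.020, 0.020, 0.020, 0.001, 0.001, 0.001]
site_b = [0.48, 0.001, 0.001, 0.001, 0.020, 0.020, 0.020]

pc_value = pearson(site_a, site_b)  # close to 1: driven by the dominant species
sid_value = sid(site_a, site_b)     # above 1: minor species disagree strongly
```

With these made-up numbers the Pearson coefficient passes the PC > 0.8 threshold while SID fails the SID < 1 threshold, which is the kind of disagreement the abstract reports for Biomass Burning.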
Three-year long source apportionment study of airborne particles in Ulaanbaatar using X-ray fluorescence and positive matrix factorization
The capital city of Mongolia, Ulaanbaatar, suffers from high levels of pollution due to excessive airborne particulate matter (APM). A lack of systematic data for the region has inspired investigation into the type, origin and seasonal variations of this pollution, the effects of meteorological conditions and even the time-dependence of anthropogenic sources. This work reports source apportionment results from a large data set of 184 samples each of fine (PM2.5) and coarse (PM2.5-10) fraction atmospheric PM collected over a three-year period (2014-2016) in Ulaanbaatar, Mongolia. Positive Matrix Factorization (PMF) was applied using the concentrations of 16 elements measured by an energy dispersive X-ray fluorescence spectrometer along with the black carbon content measured by a reflectometer as input data. The PMF results revealed that whereas mixed sources dominate the coarse fraction, soil and traffic sources are the principal contributors to the fine fraction. The source profiles and the seasonal variations of their contributions indicate that fly ash emanating from coal combustion mixes with traffic emissions and resuspended soil, resulting in variable chemical source profiles. Four sources were identified for both fractions, namely, soil, coal combustion, traffic and oil combustion, which respectively contributed 35%, 16%, 41% and 8% to the coarse fraction and 31%, 27%, 31% and 11% to the fine fraction. Additionally, the probable source contributions from long-range transport events were assessed via concentration-weighted trajectory analysis.