37 research outputs found

    Local Differentially Private Matrix Factorization with MoG for Recommendations

    Unethical data aggregation practices of many recommendation systems have raised privacy concerns among users. Local differential privacy (LDP) based recommendation systems address this problem by perturbing a user’s original data locally on their device before sending it to the data aggregator (DA). The DA performs recommendations over the perturbed data, which causes substantial prediction error. To tackle privacy and utility issues with an untrustworthy DA in recommendation systems, we propose a novel LDP matrix factorization (MF) with a mixture of Gaussians (MoG). We use a Bounded Laplace mechanism (BLP) to perturb a user’s original ratings locally. BLP restricts the perturbed ratings to a predefined output domain, thus reducing the level of noise aggregated at the DA. The MoG method estimates the noise added to the original ratings, which further improves the prediction accuracy without violating the principles of differential privacy (DP). With the Movielens and Jester datasets, we demonstrate that our method offers higher prediction accuracy under strong privacy protection than existing LDP recommendation methods.

    Private and Utility Enhanced Recommendations with Local Differential Privacy and Gaussian Mixture Model

    Recommendation systems rely heavily on behavioural and preferential data (e.g. ratings and likes) of a user to produce accurate recommendations. However, the unethical data aggregation and analytical practices of Service Providers (SP) cause privacy concerns among users. Local differential privacy (LDP) based perturbation mechanisms address this concern by adding noise to users' data at the user side before sending it to the SP. The SP then uses the perturbed data to perform recommendations. Although LDP protects the privacy of users from the SP, it causes a substantial decline in recommendation accuracy. We propose an LDP-based Matrix Factorization (MF) with a Gaussian Mixture Model (MoG) to address this problem. The LDP perturbation mechanism, i.e., Bounded Laplace (BLP), regulates the effect of noise by confining the perturbed ratings to a predetermined domain. We derive a sufficient condition on the scale parameter for BLP to satisfy ε-LDP. We use the MoG model at the SP to estimate the noise added locally to the ratings and the MF algorithm to predict missing ratings. Our LDP-based recommendation system improves the predictive accuracy without violating LDP principles. Through empirical evaluations on three real-world datasets, i.e., Movielens, Libimseti and Jester, we demonstrate that our method offers a substantial increase in recommendation accuracy under a strong privacy guarantee.

    Preserving individual privacy in ubiquitous e-commerce environments: a utility preserving approach for user-based privacy control

    Applications such as e-commerce, smart home appliances, and healthcare systems have become part and parcel of our daily lives. The data aggregated through these applications, combined with state-of-the-art machine learning approaches, have further increased their uptake. However, such data aggregation and analytical practices have raised privacy concerns among users. Privacy-preserving machine learning models mitigate these concerns through private data aggregation and analytical techniques. The first objective of this thesis is to design a privacy-preserving data aggregation and analytical approach for recommendation systems. Recommendation systems rely heavily on behavioural and preferential data of a user to produce accurate recommendations. Aggregation of such data can reveal sensitive information about users to the Third-Party Service Providers (TPSPs). We start by designing a recommendation system that uses a Local Differential Privacy (LDP) based input data perturbation mechanism to perturb users’ ratings locally before sending them to the TPSP. Hence, the TPSP aggregates only the perturbed ratings and has no access to the original ratings. This approach protects a user’s privacy from TPSPs who aggregate ratings to infer sensitive information. Next, we propose an LDP-based hybrid recommendation framework to protect users’ privacy from TPSPs who aggregate both ratings and reviews. We propose to perturb user ratings and pre-process user reviews at the user side before sending them to the TPSP. Such an approach ensures that the TPSP cannot aggregate the original ratings or reviews from the users. However, these approaches still do not protect a user’s privacy from TPSPs who collect implicit feedback to predict a user’s preferences. Hence, we design an LDP-based federated matrix factorization for implicit feedback. We motivate the idea of stochastic gradient perturbation using the Bounded Laplace (BLP) mechanism to ensure strong privacy protection for users. The second objective of this thesis is to design a privacy-preserving, untraceable TPSP-based payment protocol. A TPSP-based payment system does not protect a customer’s privacy in the face of an untrustworthy TPSP. Customers cannot make transactions anonymously, as the TPSP collects detailed transaction-related information. The TPSP uses this information to create a comprehensive behaviour profile of each customer, from which it can deduce sensitive information about a customer’s lifestyle. Hence, we propose an untraceable payment system in this thesis to tackle this problem.

    Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference


    Machine learning model selection with multi-objective Bayesian optimization and reinforcement learning

    A machine learning system, including when used in reinforcement learning, is usually fed only limited data, while aiming to train a model with good predictive performance that generalizes to the underlying data distribution. Within certain hypothesis classes, model selection chooses a model based on selection criteria calculated from the available data, which usually serve as estimators of the model's generalization performance. One major challenge for model selection that has drawn increasing attention is the discrepancy between the distribution from which training data are sampled and the data distribution at deployment. The model can over-fit to the training distribution and fail to extrapolate to unseen deployment distributions, which can greatly harm the reliability of a machine learning system. This distribution shift challenge can become even more pronounced for high-dimensional data types such as gene expression data, functional data and image data, especially in a decentralized learning scenario. Another challenge for model selection is efficient search in the hypothesis space. Since training a machine learning model usually takes a fair amount of resources, searching for an appropriate model with favorable configurations is inherently an expensive process, which calls for efficient optimization algorithms. To tackle the challenge of distribution shift, novel resampling methods for evaluating the robustness of neural networks were proposed, as well as a domain generalization method using multi-objective Bayesian optimization in a decentralized learning scenario and variational inference in a domain-unsupervised manner. To tackle the expensive model search problem, combining Bayesian optimization and reinforcement learning in an interleaved manner was proposed for efficient search in a hierarchical conditional configuration space. Additionally, the use of multi-objective Bayesian optimization for model search in decentralized learning scenarios was proposed and its effectiveness verified. A model selection perspective on reinforcement learning was proposed, with associated contributions to tackling the problem of exploration in high-dimensional state-action spaces with sparse rewards. Connections between statistical inference and control were summarized. Contributions were also made to open source software development in related machine learning sub-topics such as feature selection and functional data analysis, with advanced tuning methods and extensive benchmarking.

    Personalized program guides for digital television.

    The development of digital television has significantly increased the amount of TV content available to users, but has made it harder to choose the content of interest. Until the appearance of personalized program guides capable of learning user preferences and recommending suitable content, no solution addressed this problem adequately. Earlier solutions, such as printed or electronic program guides, mostly just converted the problem of information overload into another form. Advances in both technology and society place ever greater demands on personalized program guides for digital TV, which calls for careful planning and design. Guides must be able to model the different decision-making approaches of individual users, work in real time on mobile devices with limited hardware resources, take into account the characteristics of the collected data, consider the context in which TV content is accessed, and protect the privacy of all users, since some of them are not aware of the possible risks. With a careful choice of architecture and learning algorithm, a locally implemented guide based on neural networks can fulfil all of these requirements. Because users provide information about content they like much more often than about content they dislike, in extreme cases only positive interactions are collected. To overcome this problem, a system with two operating modes is proposed. In the first mode the system learns and gives recommendations based only on the TV content the user likes, while in the second it equalizes the influence of liked and disliked content on the recommendation process. An increased influence of positive interactions degrades the prediction of content the viewer does not want to watch, so that, owing to classification errors, unwanted content frequently appears in the recommendation list and reduces user satisfaction. Using a series of simulations, we showed that the achieved neural network training time is short, even on devices with limited hardware resources. We conclude that the proposed guide is well suited for implementation on mobile devices, which are expected to become the dominant way of accessing TV content in the future.

    Identification of biomarkers for the management of human prostate cancer

    A critical problem in the clinical management of prostate cancer is that it shows high intra- and inter-tumoural heterogeneity. As a result, accurate prediction of individual cancer behaviour is not achievable at the time of diagnosis, leading to substantial overtreatment. It remains an enigma that, in contrast to other cancers, no molecular biomarkers which define robust subtypes of prostate cancer with distinct clinical outcomes have been discovered. In the first part of this study, using data from exon microarrays, we developed a novel method that can identify transcriptional alterations within genes. The alterations might be the result of chromosomal rearrangements, such as translocations, and deletions, or of other abnormalities, such as read-through transcription and alternative transcriptional initiation sites. Using data from two independent datasets we identify several candidate alterations that are constantly correlated with the biochemical failure or that are linked to the development of metastasis. In the second part of the study we illustrate the application of an unsupervised Bayesian procedure, which identifies a subtype of the disease in five prostate cancer transcriptome datasets. Cancers assigned to this subtype (designated DESNT cancers) are characterized by low expression of a core set of 45 genes. For the four datasets with linked PSA failure data following prostatectomy, patients with DESNT cancer exhibited poor outcome relative to other patients (p = 2.65 × 10⁻⁵, p = 4.28 × 10⁻⁵, p = 2.98 × 10⁻⁸ and p = 1.22 × 10⁻³). The DESNT cancers are not linked with the presence of any particular class of genetic mutation, including ETS gene status. However, the methylation analysis reveals a possible role of epigenetic changes in the generation of the DESNT subtype. Our results demonstrate the existence of a novel poor prognosis category of human prostate cancer and will assist in the targeting of therapy, helping avoid treatment-associated morbidity in men with indolent disease.

    Democratizing machine learning

    Machine learning artifacts are increasingly embedded in society, often in the form of automated decision-making processes. One major reason for this, along with methodological improvements, is the increasing accessibility of data and of machine learning toolkits that give non-experts access to machine learning methodology. The core focus of this thesis is exactly this: democratizing access to machine learning in order to enable a wider audience to benefit from its potential. The contributions in this manuscript stem from several areas within this broad field. A major section is dedicated to automated machine learning (AutoML) and hyperparameter optimization, with the goal of abstracting away the often tedious task of obtaining an optimal predictive model for a given dataset. This process mostly consists of finding said optimal model, often through hyperparameter optimization, while the user only selects the appropriate performance metric(s) and validates the resulting models. The process can often be improved or sped up by learning from previous experiments. Three such methods are presented in this thesis: one aims to obtain a fixed set of possible hyperparameter configurations that is likely to contain good solutions for any new dataset, and two use dataset characteristics to propose new configurations. The thesis furthermore presents a collection of the required experiment metadata and shows how such metadata can be used for the development of new hyperparameter optimization methods and as a test bed for them. The pervasion of ML models in many aspects of society simultaneously calls for increased scrutiny of how automated decisions derived from such models shape society and whether they may disadvantage individuals or particular population groups. Therefore, this thesis presents an AutoML tool that allows fairness considerations to be incorporated into the search for an optimal model. This requirement for fairness in turn raises the question of whether a model’s fairness can be estimated reliably, which is studied in a further contribution in this thesis. Since access to machine learning methods also depends heavily on access to software and toolboxes, several contributions in the form of software are part of this thesis. The mlr3pipelines R package allows models to be embedded in so-called machine learning pipelines that include the pre- and postprocessing steps often required in machine learning and AutoML. The mlr3fairness R package, on the other hand, enables users to audit models for potential biases and to reduce those biases through different debiasing techniques. One such technique, multi-calibration, is published as a separate software package, mcboost.