15,370 research outputs found

    Confidential Boosting with Random Linear Classifiers for Outsourced User-generated Data

    User-generated data is crucial to predictive modeling in many applications. With a web/mobile/wearable interface, a data owner can continuously record data generated by distributed users and build various predictive models from the data to improve their operations, services, and revenue. Due to the large size and evolving nature of users' data, data owners may rely on public cloud service providers (Cloud) for storage and computation scalability. Exposing sensitive user-generated data and advanced analytic models to Cloud raises privacy concerns. We present a confidential learning framework, SecureBoost, for data owners who want to learn predictive models from aggregated user-generated data but offload the storage and computational burden to Cloud without having to worry about protecting the sensitive data. SecureBoost allows users to submit encrypted or randomly masked data to the designated Cloud directly. Our framework utilizes random linear classifiers (RLCs) as the base classifiers in the boosting framework to dramatically simplify the design of the proposed confidential boosting protocols while still preserving the model quality. A Cryptographic Service Provider (CSP) assists the Cloud's processing, reducing the complexity of the protocol constructions. We present two constructions of SecureBoost: HE+GC and SecSh+GC, using combinations of homomorphic encryption, garbled circuits, and random masking to achieve both security and efficiency. For a boosted model, the Cloud learns only the RLCs and the CSP learns only the weights of the RLCs. Finally, the data owner collects the two parts to get the complete model. We conduct extensive experiments to understand the quality of RLC-based boosting and the cost distribution of the constructions. Our results show that SecureBoost can efficiently learn high-quality boosting models from protected user-generated data.
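
    The following is a minimal plaintext sketch of the core modeling idea, boosting over randomly drawn linear classifiers, assuming standard AdaBoost weighting; the function and parameter names (rlc_boost, n_rounds, n_candidates) are illustrative, and the paper's secure HE+GC and SecSh+GC protocols, under which these steps would run on protected data, are not shown.

```python
# Plaintext sketch: AdaBoost over random linear classifiers (RLCs).
# Labels y must be in {-1, +1}. Not the paper's secure protocol.
import numpy as np

def rlc_boost(X, y, n_rounds=50, n_candidates=20, seed=None):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(n, 1.0 / n)                     # sample weights
    hyperplanes, alphas = [], []
    for _ in range(n_rounds):
        best = None
        for _ in range(n_candidates):           # draw a few random hyperplanes and
            a = rng.standard_normal(d)          # keep the one with the lowest
            b = rng.standard_normal()           # weighted error this round
            pred = np.sign(X @ a + b)
            err = w @ (pred != y)
            if best is None or err < best[0]:
                best = (err, a, b, pred)
        err, a, b, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # standard AdaBoost weight
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        hyperplanes.append((a, b))
        alphas.append(alpha)
    return hyperplanes, np.array(alphas)

def rlc_predict(X, hyperplanes, alphas):
    scores = sum(alpha * np.sign(X @ a + b)
                 for (a, b), alpha in zip(hyperplanes, alphas))
    return np.sign(scores)
```

    In the SecureBoost setting described above, the Cloud would end up holding only the list of hyperplanes and the CSP only the alpha weights, with the data owner combining the two parts.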

    Privacy-preserving scoring of tree ensembles: a novel framework for AI in healthcare

    Machine Learning (ML) techniques now impact a wide variety of domains. Highly regulated industries such as healthcare and finance have stringent compliance and data governance policies around data sharing. Advances in secure multiparty computation (SMC) for privacy-preserving machine learning (PPML) can help transform these regulated industries by allowing ML computations over encrypted data with personally identifiable information (PII). Yet very little SMC-based PPML has been put into practice so far. In this paper we present the first framework for privacy-preserving classification of tree ensembles with application in healthcare. We first describe the underlying cryptographic protocols that enable a healthcare organization to send encrypted data securely to an ML scoring service and obtain encrypted class labels without the scoring service ever seeing that input in the clear. We then describe the deployment challenges we solved to integrate these protocols into a cloud-based, scalable risk-prediction platform with multiple ML models for healthcare AI. Included are system internals and evaluations of our deployment, which supports physicians in driving better clinical outcomes in an accurate, scalable, and provably secure manner. To the best of our knowledge, this is the first such applied framework with SMC-based privacy-preserving machine learning for healthcare.
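
    As a rough illustration of what the scoring service computes, here is a plaintext sketch of tree-ensemble scoring as a cascade of feature-threshold comparisons followed by majority voting; the Node structure and function names are assumptions for illustration, and the cryptographic protocols that keep the input and class labels encrypted are not reproduced here.

```python
# Plaintext sketch of tree-ensemble scoring. In the paper's framework the
# comparisons and the vote aggregation run inside SMC protocols, so the
# scoring service never sees the features or the labels in the clear.
from dataclasses import dataclass
from typing import Optional
from collections import Counter

@dataclass
class Node:
    feature: int = 0                 # index of the feature compared at this node
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None
    label: Optional[int] = None      # set only on leaf nodes

def score_tree(node: Node, x) -> int:
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

def score_ensemble(trees, x) -> int:
    votes = Counter(score_tree(t, x) for t in trees)
    return votes.most_common(1)[0][0]
```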

    Wireless Data Acquisition for Edge Learning: Data-Importance Aware Retransmission

    By deploying machine-learning algorithms at the network edge, edge learning can leverage the enormous real-time data generated by billions of mobile devices to train AI models, which enable intelligent mobile applications. In this emerging research area, one key direction is to efficiently utilize radio resources for wireless data acquisition so as to minimize the latency of executing a learning task at an edge server. Along this direction, we consider the specific problem of retransmission decisions in each communication round to ensure both the reliability and the quantity of the training data for accelerating model convergence. To solve the problem, a new retransmission protocol called data-importance aware automatic-repeat-request (importance ARQ) is proposed. Unlike classic ARQ, which focuses merely on reliability, importance ARQ selectively retransmits a data sample based on its uncertainty, which reflects how much the sample helps learning and can be measured using the model under training. Underpinning the proposed protocol is a derived communication-learning relation between two corresponding metrics: signal-to-noise ratio (SNR) and data uncertainty. This relation facilitates the design of a simple threshold-based policy for importance ARQ. The policy is first derived for the classic classifier model of the support vector machine (SVM), where the uncertainty of a data sample is measured by its distance to the decision boundary. The policy is then extended to the more complex model of convolutional neural networks (CNNs), where data uncertainty is measured by entropy. Extensive experiments have been conducted for both the SVM and CNN using real datasets with balanced and imbalanced distributions. Experimental results demonstrate that importance ARQ effectively copes with channel fading and noise in wireless data acquisition to achieve faster model convergence than conventional channel-aware ARQ. Comment: This is an updated version: 1) extension to general classifiers; 2) consideration of imbalanced classification in the experiments. Submitted to an IEEE journal for possible publication.
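
    A schematic of the importance-ARQ decision might look like the sketch below, assuming the two uncertainty measures named in the abstract (distance to the SVM decision boundary and entropy of the CNN output); the threshold function combining SNR and uncertainty is a placeholder for the communication-learning relation derived in the paper, not its actual form.

```python
# Schematic, not the paper's derived policy: retransmit a sample until its
# received SNR clears a target that grows with the sample's uncertainty.
import numpy as np

def svm_uncertainty(w, b, x):
    """Smaller distance to the decision boundary => more uncertain sample."""
    margin = abs(np.dot(w, x) + b) / np.linalg.norm(w)
    return 1.0 / (margin + 1e-12)

def cnn_uncertainty(probs):
    """Entropy of the predicted class distribution."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def should_retransmit(received_snr, uncertainty, base_snr=5.0, gain=2.0):
    # Importance ARQ: demand a higher effective SNR for samples the current
    # model is uncertain about; stop retransmitting once it is met.
    return received_snr < base_snr + gain * uncertainty
```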

    EC-CENTRIC: An Energy- and Context-Centric Perspective on IoT Systems and Protocol Design

    The radio transceiver of an IoT device is often where most of the energy is consumed. For this reason, most research so far has focused on low-power circuits and energy-efficient physical layer designs, with the goal of reducing the average energy per information bit required for communication. While these efforts are valuable per se, their actual effectiveness can be partially neutralized by ill-designed network, processing, and resource management solutions, which can become a primary factor of performance degradation in terms of throughput, responsiveness, and energy efficiency. The objective of this paper is to describe an energy-centric and context-aware optimization framework that accounts for the energy impact of the fundamental functionalities of an IoT system and that proceeds along three main technical thrusts: 1) balancing signal-dependent processing techniques (compression and feature extraction) and communication tasks; 2) jointly designing channel access and routing protocols to maximize the network lifetime; 3) providing self-adaptability to different operating conditions through the adoption of suitable learning architectures and of flexible/reconfigurable algorithms and protocols. After discussing this framework, we present some preliminary results that validate the effectiveness of our proposed line of action and show how the use of adaptive signal processing and channel access techniques allows an IoT network to dynamically tune the trade-off between lifetime and signal distortion, according to the requirements dictated by the application.
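
    A toy numerical sketch of the first thrust, trading processing energy against transmission energy, could look like this; all constants and the linear cost model are invented for illustration only and do not come from the paper.

```python
# Toy energy accounting: heavier compression costs more CPU energy per sample
# but reduces the bits that must be transmitted. Constants are made up.
def total_energy(n_samples, bits_per_sample, compression_ratio,
                 e_proc_per_sample=2e-6, e_tx_per_bit=1e-7):
    processing = n_samples * e_proc_per_sample * compression_ratio
    transmission = n_samples * bits_per_sample * e_tx_per_bit / compression_ratio
    return processing + transmission

# Pick the compression ratio that minimizes total energy for this node.
best_energy, best_ratio = min((total_energy(1000, 16, r), r)
                              for r in (1, 2, 4, 8, 16))
print(best_ratio, best_energy)
```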

    Rate-Distortion Classification for Self-Tuning IoT Networks

    Many future wireless sensor networks and the Internet of Things are expected to follow a software-defined paradigm, where protocol parameters and behaviors will be dynamically tuned as a function of the signal statistics. New protocols will then be injected as software as certain events occur. For instance, new data compressors could be (re)programmed on-the-fly as the monitored signal type or its statistical properties change. We consider a lossy compression scenario, where the application tolerates some distortion of the gathered signal in return for improved energy efficiency. To reap the full benefits of this paradigm, we discuss an automatic sensor profiling approach where the signal class, and in particular the corresponding rate-distortion curve, is automatically assessed using machine learning tools (namely, support vector machines and neural networks). We show that this curve can be reliably estimated on-the-fly through the computation of a small number (from ten to twenty) of statistical features on time windows of a few hundred samples.
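
    A minimal sketch of this profiling step is given below, assuming a handful of common statistical features computed over windows of a few hundred samples and an off-the-shelf SVM classifier; the actual feature set (ten to twenty features) and models used in the paper may differ.

```python
# Sketch: map short time windows to a signal class (hence to a rate-distortion
# curve) via simple statistical features and an SVM. Feature choice is illustrative.
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.svm import SVC

def window_features(x):
    x = np.asarray(x, dtype=float)
    return [x.mean(), x.std(), skew(x), kurtosis(x),
            np.abs(np.diff(x)).mean(),           # mean absolute first difference
            np.corrcoef(x[:-1], x[1:])[0, 1]]    # lag-1 autocorrelation

def train_signal_classifier(windows, labels):
    """windows: list of ~few-hundred-sample arrays; labels: signal class per window."""
    X = np.array([window_features(w) for w in windows])
    return SVC(kernel="rbf").fit(X, labels)
```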

    VirtualIdentity: privacy-preserving user profiling

    User profiling from user-generated content (UGC) is a common practice that supports the business models of many social media companies. Existing systems require that the UGC is fully exposed to the module that constructs the user profiles. In this paper we show that it is possible to build user profiles without ever accessing the user's original data, and without exposing the trained machine learning models for user profiling, which are the intellectual property of the company, to the users of the social media site. We present VirtualIdentity, an application that uses secure multi-party cryptographic protocols to detect the age, gender, and personality traits of users by classifying their user-generated text and personal pictures with trained support vector machine models in a privacy-preserving manner.
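
    In the clear, the classification step amounts to scoring the user's text against one trained SVM per attribute, as in the sketch below; the attribute names and pipeline are illustrative, and VirtualIdentity's contribution, running these predictions under secure multi-party protocols so that neither the UGC nor the model weights are revealed, is not shown.

```python
# Plaintext sketch of per-attribute SVM profiling over user-generated text.
# The secure multi-party machinery of VirtualIdentity is omitted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def train_profilers(texts, labels_by_attribute):
    """labels_by_attribute: e.g. {'gender': [...], 'age_group': [...]} (illustrative)."""
    return {attr: make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, y)
            for attr, y in labels_by_attribute.items()}

def profile_user(profilers, user_text):
    return {attr: model.predict([user_text])[0]
            for attr, model in profilers.items()}
```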

    The malaria system microApp: A new, mobile device-based tool for malaria diagnosis

    Background: Malaria is a public health problem that affects remote areas worldwide. Climate change has contributed to the problem by allowing for the survival of Anopheles in previously uninhabited areas. As such, several groups have made developing new systems for the automated diagnosis of malaria a priority. Objective: The objective of this study was to develop a new, automated, mobile device-based diagnostic system for malaria. The system uses Giemsa-stained peripheral blood samples combined with light microscopy to identify the Plasmodium falciparum species in the ring stage of development. Methods: The system uses image processing and artificial intelligence techniques, as well as a known face detection algorithm, to identify Plasmodium parasites. The algorithm is based on the integral image and Haar-like feature concepts, and makes use of weak classifiers with adaptive boosting learning. The search scope of the learning algorithm is reduced in the preprocessing step by removing the background around blood cells. Results: As a proof-of-concept experiment, the tool was used on 555 malaria-positive and 777 malaria-negative previously made slides. The accuracy of the system was, on average, 91%, meaning that for every 100 parasite-infected samples, 91 were identified correctly. Conclusions: Accessibility barriers of low-resource countries can be addressed with low-cost diagnostic tools. Our system, developed for mobile devices (mobile phones and tablets), addresses this by enabling access to health centers in remote communities and, importantly, not depending on extensive malaria expertise or expensive diagnostic detection equipment.
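
    The detection building blocks named in the Methods (integral image plus Haar-like features) can be sketched as follows; the two-rectangle feature shown is just one common variant, and the AdaBoost-trained cascade that combines many such features into weak classifiers is omitted.

```python
# Integral image and one Haar-like feature evaluated from it in constant time.
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r, :c]; padded with a leading row/column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of the h-by-w rectangle whose top-left pixel is (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def haar_two_rect_vertical(ii, r, c, h, w):
    """Left half minus right half of the window: responds to vertical edges."""
    half = w // 2
    return rect_sum(ii, r, c, h, half) - rect_sum(ii, r, c + half, h, half)
```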

    Doctor of Philosophy

    Machine learning is the science of building predictive models from data that automatically improve based on past experience. To learn these models, traditional learning algorithms require labeled data. They also require that the entire dataset fits in the memory of a single machine. Labeled data are available or can be acquired for small and moderately sized datasets, but curating large datasets can be prohibitively expensive. Similarly, massive datasets are usually too large to fit into the memory of a single machine. An alternative is to distribute the dataset over multiple machines. Distributed learning, however, poses new challenges, as most existing machine learning techniques are inherently sequential. Additionally, these distributed approaches have to be designed keeping in mind various resource limitations of real-world settings, prime among them being inter-machine communication. With the advent of big datasets, machine learning algorithms face new challenges. Their design is no longer limited to minimizing some loss function but, additionally, needs to consider other resources that are critical when learning at scale. In this thesis, we explore different models and measures for learning with limited resources under a budget. What budgetary constraints are posed by modern datasets? Can we reuse or combine existing machine learning paradigms to address these challenges at scale? How do the cost metrics change when we shift to distributed models for learning? These are some of the questions investigated in this thesis. The answers to these questions hold the key to addressing some of the challenges faced when learning on massive datasets. In the first part of this thesis, we present three different budgeted scenarios that deal with scarcity of labeled data and limited computational resources. The goal is to leverage the transfer of information from related domains to learn under budgetary constraints. Our proposed techniques comprise semisupervised transfer, online transfer, and active transfer. In the second part of this thesis, we study distributed learning with limited communication. We present initial sampling-based results, as well as propose communication protocols for learning distributed linear classifiers.
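
    As a generic baseline for the second part (distributed linear classifiers with limited communication), one-shot parameter averaging sends a single weight vector per machine; the sketch below is an illustration of that setting, not the dissertation's proposed protocols, and the helper names are assumptions.

```python
# One-shot parameter averaging: each machine fits a linear classifier on its
# shard and communicates only d+1 floats; the coordinator averages them.
# Assumes binary labels in {0, 1}.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_local(X_shard, y_shard):
    clf = LogisticRegression(max_iter=1000).fit(X_shard, y_shard)
    return np.append(clf.coef_.ravel(), clf.intercept_)   # weights + bias

def aggregate(local_params):
    avg = np.mean(local_params, axis=0)
    return avg[:-1], avg[-1]                               # averaged weights, bias

def predict(w, b, X):
    return (X @ w + b > 0).astype(int)
```

    Communication here is a single round of d+1 floats per machine, which makes the protocol cheap but can lose accuracy relative to iterative schemes when local shards are not identically distributed.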