
    Securing cloud-based data analytics: A practical approach

    The ubiquitous nature of computers is driving a massive increase in the amount of data generated by humans and machines. The shift to cloud technologies is a paradigm change that offers considerable financial and administrative gains in the effort to analyze these data. However, governmental and business institutions wanting to tap into these gains are concerned with security issues. The cloud presents new vulnerabilities and is dominated by new kinds of applications, which calls for new security solutions. For analyzing massive amounts of data, tools like MapReduce, Apache Storm, Dryad and higher-level scripting languages like Pig Latin and DryadLINQ have significantly simplified the corresponding tasks for software developers. The equally important aspects of securing the computations performed by these tools and ensuring the confidentiality of data have seen very little support emerge for programmers. In this dissertation, we present solutions to (a) secure computations being run in the cloud by leveraging BFT replication coupled with fault isolation and (b) secure data from being leaked by computing directly on encrypted data. For securing computations (a), we leverage a combination of variable-degree clustering, approximated and offline output comparison, smart deployment, and separation of duty to achieve a parameterized tradeoff between fault tolerance and overhead in practice. We demonstrate the low overhead achieved with our solution when securing data-flow computations expressed in Apache Pig and Hadoop; our evaluation shows assured computation with less than 10 percent latency overhead. For securing data (b), we present novel data-flow analyses and program transformations for Pig Latin and Apache Storm that automatically enable the execution of corresponding scripts on encrypted data. We avoid fully homomorphic encryption because of its prohibitively high cost; instead, in some cases, we rely on a minimal set of operations performed by the client. We present the algorithms used for this translation, and empirically demonstrate the practical performance of our approach as well as the reduction in programmer effort required to preserve data confidentiality.
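    A minimal sketch of the second idea, under assumptions not drawn from the dissertation (the HMAC-based deterministic encryption, the toy numeric masking, and all record names are illustrative): equality-based operations such as grouping run directly on ciphertexts at the server, while the aggregation the encryption cannot support falls back to the client, which holds the key.

        import hashlib
        import hmac

        KEY = b"client-secret-key"  # known only to the client

        def det_enc(value: str) -> str:
            # Deterministic encryption stand-in (HMAC): equal plaintexts yield equal
            # ciphertexts, so the server can group/join without seeing the plaintext.
            return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

        def num_enc(n: int) -> int:   # toy reversible masking of numeric payloads
            return n ^ 0x5A5A5A5A

        def num_dec(c: int) -> int:
            return c ^ 0x5A5A5A5A

        # The client uploads only ciphertexts.
        rows = [(det_enc(u), num_enc(b))
                for u, b in [("alice", 120), ("bob", 80), ("alice", 40)]]

        # Server side: GROUP BY works directly on the deterministic ciphertexts.
        groups = {}
        for user_ct, bytes_ct in rows:
            groups.setdefault(user_ct, []).append(bytes_ct)

        # Client-side fallback: decrypt and aggregate what the server cannot compute.
        totals = {u: sum(num_dec(c) for c in cts) for u, cts in groups.items()}
        print(totals)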

    CoVault: A Secure Analytics Platform

    Analytics on personal data, such as individuals' mobility, financial, and health data, can be of significant benefit to society. Such data is already collected by smartphones, apps and services today, but liberal societies have so far refrained from making it available for large-scale analytics. Arguably, this is due at least in part to the lack of an analytics platform that can secure data through transparent, technical means (ideally with decentralized trust), enforce source policies, handle millions of distinct data sources, and run queries on billions of records with acceptable query latencies. To bridge this gap, we present an analytics platform called CoVault, which combines secure multi-party computation (MPC) with trusted execution environment (TEE)-based delegation of trust to execute approved queries on encrypted data contributed by individuals within a datacenter, thereby achieving the above properties. We show that CoVault scales well despite the high cost of MPC. For example, CoVault can process data relevant to epidemic analytics for a country of 80M people (about 11.85B data records/day) on a continuous basis using a core pair for every 20,000 people. Compared to a state-of-the-art MPC-based platform, CoVault can process queries between 7 and over 100 times faster, while also scaling to many sources and to big data.
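    A minimal sketch of the MPC ingredient alone (additive secret sharing of contributions between two non-colluding servers for a sum query); this is not CoVault's actual protocol and omits its TEE-based delegation of trust, and the modulus and values are illustrative.

        import random

        Q = 2**61 - 1  # illustrative modulus for additive secret sharing

        def share(x: int):
            # Split a private value into two additive shares; each share alone is random.
            r = random.randrange(Q)
            return r, (x - r) % Q

        # Each individual secret-shares a record between two servers.
        contributions = [7, 3, 12, 5]
        shares_a, shares_b = zip(*(share(x) for x in contributions))

        # Each server sums its shares locally; neither learns any individual value.
        partial_a = sum(shares_a) % Q
        partial_b = sum(shares_b) % Q

        # Combining the two partial results reveals only the approved aggregate answer.
        print((partial_a + partial_b) % Q)  # -> 27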

    A Domain Specific Language for Digital Forensics and Incident Response Analysis

    One of the longstanding conceptual problems in digital forensics is the dichotomy between the need for verifiable and reproducible forensic investigations, and the lack of practical mechanisms to accomplish them. After nearly four decades of professional digital forensic practice, investigator notes are still the primary source of reproducibility information, and much of it is tied to the functions of specific, often proprietary, tools. The lack of a formal means of specification for digital forensic operations results in three major problems. Specifically, there is a critical lack of: a) standardized and automated means to scientifically verify the accuracy of digital forensic tools; b) methods to reliably reproduce forensic computations (and their results); and c) a framework for interoperability among forensic tools. Additionally, there is no standardized means for communicating software requirements between users, researchers and developers, resulting in a mismatch in expectations. Combined with the exponential growth in data volume and the complexity of the applications and systems to be investigated, all of these concerns result in major case backlogs and inherently reduce the reliability of digital forensic analyses. This work proposes a new approach to the specification of forensic computations, such that the above concerns can be addressed on a scientific basis with a new domain specific language (DSL) called nugget. DSLs are specialized languages that aim to address the concerns of particular domains by providing practical abstractions. Successful DSLs, such as SQL, can transform an application domain by providing a standardized way for users to communicate what they need without specifying how the computation should be performed. This is the first effort to build a DSL for (digital) forensic computations, with the following research goals: 1) provide an intuitive formal specification language that covers core types of forensic computations and common data types; 2) provide a mechanism to extend the language so that it can incorporate arbitrary computations; 3) provide a prototype execution environment that allows fully automatic execution of the computation; 4) provide a complete, formal, and auditable log of computations that can be used to reproduce an investigation; and 5) demonstrate cloud-ready processing that can match the growth in data volumes and complexity.
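    A minimal sketch of the declarative style such a DSL enables, written here in Python rather than nugget's own syntax (the spec format, operation names, and audit-log fields are illustrative assumptions, not the actual language): the specification states what to compute, and the runtime records an auditable log of every step so the computation can be reproduced.

        import datetime
        import hashlib
        import json

        # Hypothetical declarative spec: *what* to compute, not *how*.
        spec = [
            {"op": "filter", "field": "ext", "equals": ".docx"},
            {"op": "hash", "algorithm": "sha256"},
        ]

        def run(spec, files):
            audit, data = [], list(files)
            for step in spec:
                if step["op"] == "filter":
                    data = [f for f in data if f[step["field"]] == step["equals"]]
                elif step["op"] == "hash":
                    for f in data:
                        f[step["algorithm"]] = hashlib.sha256(f["bytes"]).hexdigest()
                # Record every step so the investigation can be reproduced and audited.
                audit.append({"when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
                              "step": step, "items": len(data)})
            return data, audit

        files = [{"name": "a.docx", "ext": ".docx", "bytes": b"hello"},
                 {"name": "b.png", "ext": ".png", "bytes": b"img"}]
        results, log = run(spec, files)
        print(json.dumps(log, indent=2))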

    Distributed Differential Privacy and Applications

    Recent growth in the size and scope of databases has resulted in more research into making productive use of this data. Unfortunately, a significant stumbling block that remains is protecting the privacy of the individuals who populate these datasets. As people spend more time connected to the Internet, and conduct more of their daily lives online, privacy becomes a more important consideration, just as the data becomes more useful for researchers, companies, and individuals. As a result, a great deal of important information remains locked down and unavailable to honest researchers today, due to fears that data leakages will harm individuals. Recent research in differential privacy opens a promising pathway to guarantee individual privacy while simultaneously making use of the data to answer useful queries. Differential privacy is a theory that provides provable, information-theoretic guarantees on what any answer may reveal about any single individual in the database. This approach has resulted in a flurry of recent research presenting novel algorithms that can perform a rich class of computations in this setting. In this dissertation, we focus on challenges that arise when trying to provide differential privacy guarantees in real-world settings. We design and build runtimes that achieve the mathematical differential privacy guarantee in the face of three practical challenges: securing the runtimes against adversaries, enabling readers to verify that the answers are accurate, and dealing with data distributed across multiple domains.
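    A minimal sketch of the core primitive such runtimes enforce, the Laplace mechanism for a count query (the dataset, predicate, and epsilon are illustrative; the dissertation's actual runtimes address the additional systems challenges listed above).

        import random

        def laplace(scale: float) -> float:
            # The difference of two exponentials with mean `scale` is Laplace(0, scale).
            return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

        def private_count(records, predicate, epsilon: float) -> float:
            # A count has sensitivity 1 (one person changes it by at most 1),
            # so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
            true_count = sum(1 for r in records if predicate(r))
            return true_count + laplace(1.0 / epsilon)

        ages = [23, 37, 41, 19, 52, 33]
        print(private_count(ages, lambda a: a >= 30, epsilon=0.5))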

    CamFlow: Managed Data-sharing for Cloud Services

    A model of cloud services is emerging whereby a few trusted providers manage the underlying hardware and communications, whereas many companies build on this infrastructure to offer higher-level, cloud-hosted PaaS services and/or SaaS applications. From the start, strong isolation between cloud tenants was seen to be of paramount importance, provided first by virtual machines (VMs) and later by containers, which share the operating system (OS) kernel. Increasingly, applications also require facilities to effect isolation and protection of the data they manage. They also require flexible data sharing with other applications, often across the traditional cloud-isolation boundaries; for example, when government provides many related services for its citizens on a common platform. Similar considerations apply to the end-users of applications. In particular, the incorporation of cloud services within `Internet of Things' architectures is driving the requirements for both protection and cross-application data sharing. These concerns relate to the management of data. Traditional access control is application and principal/role specific, applied at policy enforcement points, after which there is no subsequent control over where data flows; this is a crucial issue once data has left its owner's control, e.g., when it is handled by cloud-hosted applications and moves within cloud services. Information Flow Control (IFC), in addition, offers system-wide, end-to-end flow control based on the properties of the data. We discuss the potential of cloud-deployed IFC for enforcing owners' dataflow policy with regard to protection and sharing, as well as safeguarding against malicious or buggy software. In addition, the audit log associated with IFC provides transparency, giving configurable system-wide visibility over data flows. [...]
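    A minimal sketch of the IFC idea only (the labels, entity names, and subset rule are illustrative; this is not CamFlow's kernel-level implementation): data tagged with secrecy labels may flow only to destinations whose labels include all of the source's, and every attempted flow is logged for audit.

        # Toy Information Flow Control check over secrecy labels.
        class Entity:
            def __init__(self, name: str, secrecy: set):
                self.name, self.secrecy = name, frozenset(secrecy)

        def flow_allowed(src: Entity, dst: Entity) -> bool:
            # A flow may never drop a secrecy tag: the destination's labels
            # must include all of the source's labels.
            return src.secrecy <= dst.secrecy

        audit_log = []

        def send(src: Entity, dst: Entity) -> bool:
            ok = flow_allowed(src, dst)
            audit_log.append((src.name, dst.name, "allowed" if ok else "denied"))
            return ok

        medical_db = Entity("medical-db", {"medical", "personal"})
        health_app = Entity("health-app", {"medical", "personal"})
        ad_service = Entity("ad-service", {"personal"})

        print(send(medical_db, health_app))  # True: all labels preserved end to end
        print(send(medical_db, ad_service))  # False: would drop the 'medical' tag
        print(audit_log)                     # system-wide record of attempted flows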

    Cluster Computing in Zero Knowledge

    Large computations, when amenable to distributed parallel execution, are often executed on computer clusters, for scalability and cost reasons. Such computations are used in many applications, including, to name but a few, machine learning, webgraph mining, and statistical machine translation. Oftentimes, though, the input data is private and only the result of the computation can be published. Zero-knowledge proofs would allow, in such settings, verifying the correctness of the output without leaking (additional) information about the input. In this work, we investigate theoretical and practical aspects of *zero-knowledge proofs for cluster computations*. We design, build, and evaluate zero-knowledge proof systems for which: (i) a proof attests to the correct execution of a cluster computation; and (ii) generating the proof is itself a cluster computation that is similar in structure and complexity to the original one. Concretely, we focus on MapReduce, an elegant and popular form of cluster computing. Previous zero-knowledge proof systems can in principle prove a MapReduce computation's correctness, via a monolithic NP statement that reasons about all mappers, all reducers, and shuffling. However, it is not clear how to generate the proof for such monolithic statements via parallel execution by a distributed system. Our work demonstrates, by theory and implementation, that proof generation can be similar in structure and complexity to the original cluster computation. Our main technique is a bootstrapping theorem for succinct non-interactive arguments of knowledge (SNARKs) that shows how, via recursive proof composition and Proof-Carrying Data, it is possible to transform any SNARK into a *distributed SNARK for MapReduce*, which proves, piecewise and in a distributed way, the correctness of every step in the original MapReduce computation as well as their global consistency.
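    A minimal sketch of the structure being proved, not of the SNARK machinery itself (the hash-based "certificates" below are illustrative stand-ins for real proofs): each map and reduce step can be checked piecewise, and a global consistency check confirms that the reducers consumed exactly what the mappers produced.

        import hashlib
        import json
        from collections import defaultdict

        def commit(obj) -> str:
            # Stand-in for a cryptographic commitment to a step's input/output.
            return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

        def run_mapper(chunk: str):
            out = [(w, 1) for w in chunk.split()]
            return out, {"step": "map", "in": commit(chunk), "out": commit(out)}

        def run_reducer(key: str, values: list):
            out = (key, sum(values))
            return out, {"step": "reduce", "in": commit([key, values]), "out": commit(list(out))}

        chunks = ["to be or", "not to be"]
        map_outs, certs = [], []
        for c in chunks:
            o, cert = run_mapper(c)
            map_outs.append(o)
            certs.append(cert)

        # Shuffle: group mapper outputs by key.
        grouped = defaultdict(list)
        for o in map_outs:
            for k, v in o:
                grouped[k].append(v)

        results = []
        for k, vs in grouped.items():
            r, cert = run_reducer(k, vs)
            results.append(r)
            certs.append(cert)

        # Global consistency: the reducers consumed exactly what the mappers produced,
        # i.e., the shuffle was honest (compared here as committed, sorted multisets).
        assert commit(sorted(kv for o in map_outs for kv in o)) == \
               commit(sorted((k, v) for k, vs in grouped.items() for v in vs))
        print(dict(results), len(certs), "step certificates")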