643 research outputs found
Securing cloud-based data analytics: A practical approach
The ubiquitous nature of computers is driving a massive increase in the amount of data generated by humans and machines. The shift to cloud technologies is a paradigm change that offers considerable financial and administrative gains in the effort to analyze these data. However, governmental and business institutions wanting to tap into these gains are concerned with security issues. The cloud presents new vulnerabilities and is dominated by new kinds of applications, which calls for new security solutions. In the direction of analyzing massive amounts of data, tools like MapReduce, Apache Storm, Dryad and higher-level scripting languages like Pig Latin and DryadLINQ have significantly improved corresponding tasks for software developers. The equally important aspect of securing computations performed by these tools and ensuring confidentiality of data has seen very little support emerge for programmers.
In this dissertation, we present solutions to a. secure computations being run in the cloud by leveraging BFT replication coupled with fault isolation and b. secure data from being leaked by computing directly on encrypted data. For securing computations (a.), we leverage a combination of variable-degree clustering, approximated and offline output comparison, smart deployment, and separation of duty to achieve a parameterized tradeoff between fault tolerance and overhead in practice. We demonstrate the low overhead achieved with our solution when securing data-flow computations expressed in Apache Pig, and Hadoop. Our solution allows assured computation with less than 10 percent latency overhead as shown by our evaluation. For securing data (b.), we present novel data flow analyses and program transformations for Pig Latin and Apache Storm, that automatically enable the execution of corresponding scripts on encrypted data. We avoid fully homomorphic encryption because of its prohibitively high cost; instead, in some cases, we rely on a minimal set of operations performed by the client. We present the algorithms used for this translation, and empirically demonstrate the practical performance of our approach as well as improvements for programmers in terms of the effort required to preserve data confidentiality
CoVault: A Secure Analytics Platform
Analytics on personal data, such as individuals' mobility, financial, and
health data can be of significant benefit to society. Such data is already
collected by smartphones, apps and services today, but liberal societies have
so far refrained from making it available for large-scale analytics. Arguably,
this is due at least in part to the lack of an analytics platform that can
secure data through transparent, technical means (ideally with decentralized
trust), enforce source policies, handle millions of distinct data sources, and
run queries on billions of records with acceptable query latencies. To bridge
this gap, we present an analytics platform called CoVault which combines secure
multi-party computation (MPC) with trusted execution environment (TEE)-based
delegation of trust to be able execute approved queries on encrypted data
contributed by individuals within a datacenter to achieve the above properties.
We show that CoVault scales well despite the high cost of MPC. For example,
CoVault can process data relevant to epidemic analytics for a country of 80M
people (about 11.85B data records/day) on a continuous basis using a core pair
for every 20,000 people. Compared to a state-of-the-art MPC-based platform,
CoVault can process queries between 7 to over 100 times faster, as well as
scale to many sources and big data.Comment: 13 pages, 6 figure
Recommended from our members
Toward practical argument systems for verifiable computation
textHow can a client extract useful work from a server without trusting it to compute correctly? A modern motivation for this classic question is third party computing models in which customers outsource their computations to service providers (as in cloud computing). In principle, deep results in complexity theory and cryptography imply that it is possible to verify that an untrusted entity executed a computation correctly. For instance, the server can employ probabilistically checkable proofs (PCPs) in conjunction with cryptographic commitments to generate a succinct proof of correct execution, which the client can efficiently check. However, these theoretical solutions are impractical: they require thousands of CPU years to verifiably execute even simple computations. This dissertation describes the design, implementation, and experimental evaluation viiiof a system, called Pepper, that brings this theory into the realm of plausibility. Pepper incorporates a series of algorithmic improvements and systems engineering techniques to improve performance by over 20 orders of magnitude, relative to an implementation of the theory without our refinements. These include a new probabilistically checkable proof encoding with nearly optimal asymptotics, a concise representation for computations, a more efficient cryptographic commitment primitive, and a distributed implementation of the server with GPU acceleration to reduce latency. Additionally, Pepper extends the verification machinery to handle realistic applications of third party computing: those that interact with remote storage or state (e.g., MapReduce jobs, database queries). To do so, Pepper composes techniques from untrusted storage with the aforementioned technical machinery to verifiably offload both computations and state. Furthermore, to make it easy to use this technology, Pepper includes a compiler to automatically transform programs in a subset of C into executables that run verifiably. One of the chief limitations of Pepper is that verifiable execution is still orders of magnitude slower than an unverifiable native execution. Nonetheless, Pepper takes powerful results from complexity theory and verifiable computation a few steps closer to practicalityComputer Science
A Domain Specific Language for Digital Forensics and Incident Response Analysis
One of the longstanding conceptual problems in digital forensics is the dichotomy between the need for verifiable and reproducible forensic investigations, and the lack of practical mechanisms to accomplish them. With nearly four decades of professional digital forensic practice, investigator notes are still the primary source of reproducibility information, and much of it is tied to the functions of specific, often proprietary, tools.
The lack of a formal means of specification for digital forensic operations results in three major problems. Specifically, there is a critical lack of:
a) standardized and automated means to scientifically verify accuracy of digital forensic tools;
b) methods to reliably reproduce forensic computations (their results); and
c) framework for inter-operability among forensic tools.
Additionally, there is no standardized means for communicating software requirements between users, researchers and developers, resulting in a mismatch in expectations. Combined with the exponential growth in data volume and complexity of applications and systems to be investigated, all of these concerns result in major case backlogs and inherently reduce the reliability of the digital forensic analyses.
This work proposes a new approach to the specification of forensic computations, such that the above concerns can be addressed on a scientific basis with a new domain specific language (DSL) called nugget. DSLs are specialized languages that aim to address the concerns of particular domains by providing practical abstractions. Successful DSLs, such as SQL, can transform an application domain by providing a standardized way for users to communicate what they need without specifying how the computation should be performed.
This is the first effort to build a DSL for (digital) forensic computations with the following research goals:
1) provide an intuitive formal specification language that covers core types of forensic computations and common data types;
2) provide a mechanism to extend the language that can incorporate arbitrary computations;
3) provide a prototype execution environment that allows the fully automatic execution of the computation;
4) provide a complete, formal, and auditable log of computations that can be used to reproduce an investigation;
5) demonstrate cloud-ready processing that can match the growth in data volumes and complexity
Distributed Differential Privacy and Applications
Recent growth in the size and scope of databases has resulted in more
research into making productive use of this data. Unfortunately, a
significant stumbling block which remains is protecting the privacy of
the individuals that populate these datasets. As people spend more
time connected to the Internet, and conduct more of their daily lives
online, privacy becomes a more important consideration, just as the
data becomes more useful for researchers, companies, and
individuals. As a result, plenty of important information remains
locked down and unavailable to honest researchers today, due to fears
that data leakages will harm individuals.
Recent research in differential privacy opens a promising pathway to
guarantee individual privacy while simultaneously making use of the
data to answer useful queries. Differential privacy is a theory that
provides provable information theoretic guarantees on what any answer
may reveal about any single individual in the database. This approach
has resulted in a flurry of recent research, presenting novel
algorithms that can compute a rich class of computations in this
setting.
In this dissertation, we focus on some real world challenges that
arise when trying to provide differential privacy guarantees in the
real world. We design and build runtimes that achieve the mathematical
differential privacy guarantee in the face of three real world
challenges: securing the runtimes against adversaries, enabling
readers to verify that the answers are accurate, and dealing with data
distributed across multiple domains
CamFlow: Managed Data-sharing for Cloud Services
A model of cloud services is emerging whereby a few trusted providers manage
the underlying hardware and communications whereas many companies build on this
infrastructure to offer higher level, cloud-hosted PaaS services and/or SaaS
applications. From the start, strong isolation between cloud tenants was seen
to be of paramount importance, provided first by virtual machines (VM) and
later by containers, which share the operating system (OS) kernel. Increasingly
it is the case that applications also require facilities to effect isolation
and protection of data managed by those applications. They also require
flexible data sharing with other applications, often across the traditional
cloud-isolation boundaries; for example, when government provides many related
services for its citizens on a common platform. Similar considerations apply to
the end-users of applications. But in particular, the incorporation of cloud
services within `Internet of Things' architectures is driving the requirements
for both protection and cross-application data sharing.
These concerns relate to the management of data. Traditional access control
is application and principal/role specific, applied at policy enforcement
points, after which there is no subsequent control over where data flows; a
crucial issue once data has left its owner's control by cloud-hosted
applications and within cloud-services. Information Flow Control (IFC), in
addition, offers system-wide, end-to-end, flow control based on the properties
of the data. We discuss the potential of cloud-deployed IFC for enforcing
owners' dataflow policy with regard to protection and sharing, as well as
safeguarding against malicious or buggy software. In addition, the audit log
associated with IFC provides transparency, giving configurable system-wide
visibility over data flows. [...]Comment: 14 pages, 8 figure
Cluster Computing in Zero Knowledge
Large computations, when amenable to distributed parallel execution, are often executed on computer clusters, for scalability and cost reasons. Such computations are used in many applications, including, to name but a few, machine learning, webgraph mining, and statistical machine translation. Oftentimes, though, the input data is private and only the result of the computation can be published. Zero-knowledge proofs would allow, in such settings, to verify correctness of the output without leaking (additional) information about the input.
In this work, we investigate theoretical and practical aspects of *zero-knowledge proofs for cluster computations*. We design, build, and evaluate zero-knowledge proof systems for which:
(i) a proof attests to the correct execution of a cluster computation; and
(ii) generating the proof is itself a cluster computation that is similar in structure and complexity to the original one.
Concretely, we focus on MapReduce, an elegant and popular form of cluster computing.
Previous zero-knowledge proof systems can in principle prove a MapReduce computation\u27s correctness, via a monolithic NP statement that reasons about all mappers, all reducers, and shuffling. However, it is not clear how to generate the proof for such monolithic statements via parallel execution by a distributed system. Our work demonstrates, by theory and implementation, that proof generation can be similar in structure and complexity to the original cluster computation.
Our main technique is a bootstrapping theorem for succinct non-interactive arguments of knowledge (SNARKs) that shows how, via recursive proof composition and Proof-Carrying Data, it is possible to transform any SNARK into a *distributed SNARK for MapReduce* which proves, piecewise and in a distributed way, the correctness of every step in the original MapReduce computation as well as their global consistency
- …