    DataHub: Collaborative Data Science & Dataset Version Management at Scale

    Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.
    Comment: 7 pages
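
    The abstract does not spell out an interface, but the operations it lists (create, branch, difference) map naturally onto a version graph over immutable dataset snapshots. The following is a minimal, hypothetical Python sketch of such a graph; the class and method names are ours for illustration and are not DataHub's actual API.

        # Hypothetical sketch of a dataset version graph: branches are cheap
        # pointers into a DAG of immutable snapshots, and "difference" compares
        # two branch heads. Illustrative only; not DataHub's implementation.

        class DatasetVersion:
            def __init__(self, records, parents=()):
                self.records = frozenset(records)   # immutable snapshot of the dataset
                self.parents = tuple(parents)       # zero or more parent versions

        class VersionGraph:
            def __init__(self, initial_records):
                self.branches = {"main": DatasetVersion(initial_records)}

            def branch(self, src, dst):
                # New branch points at the same version: no data is copied.
                self.branches[dst] = self.branches[src]

            def commit(self, branch, records):
                # Record a new version whose parent is the current branch head.
                head = self.branches[branch]
                self.branches[branch] = DatasetVersion(records, parents=(head,))

            def diff(self, a, b):
                # Difference two branch heads as (added, removed) record sets.
                ra, rb = self.branches[a].records, self.branches[b].records
                return rb - ra, ra - rb

        # Usage: create, branch, modify, and difference two divergent versions.
        g = VersionGraph({("alice", 30), ("bob", 25)})
        g.branch("main", "cleanup")
        g.commit("cleanup", {("alice", 30), ("bob", 26)})
        added, removed = g.diff("main", "cleanup")
        print(added, removed)   # {('bob', 26)} {('bob', 25)}

    Merging divergent branches is the harder problem the abstract alludes to, and a real system would need delta storage rather than full snapshots to work at scale.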

    Scaling Causality Analysis for Production Systems

    Causality analysis reveals how program values influence each other. It is important for debugging, optimizing, and understanding the execution of programs. This thesis scales causality analysis to production systems consisting of desktop and server applications as well as large-scale Internet services. This enables developers to employ causality analysis to debug and optimize complex, modern software systems. This thesis shows that it is possible to scale causality analysis to both fine-grained instruction-level analysis and analysis of Internet-scale distributed systems with thousands of discrete software components by developing and employing automated methods to observe and reason about causality.
    First, we observe causality at a fine-grained instruction level by developing the first taint tracking framework to support tracking millions of input sources. We also introduce flexible taint tracking to allow for scoping different queries and dynamic filtering of inputs, outputs, and relationships.
    Next, we introduce the Mystery Machine, which uses a "big data" approach to discover causal relationships between software components in a large-scale Internet service. We leverage the fact that large-scale Internet services receive a large number of requests in order to observe counterexamples to hypothesized causal relationships. Using discovered causal relationships, we identify the critical path for request execution and use the critical path analysis to explore potential scheduling optimizations.
    Finally, we explore using causality to make data-quality tradeoffs in Internet services. A data-quality tradeoff is an explicit decision by a software component to return lower-fidelity data in order to improve response time or minimize resource usage. We perform a study of data-quality tradeoffs in a large-scale Internet service to show the pervasiveness of these tradeoffs. We develop DQBarge, a system that enables better data-quality tradeoffs by propagating critical information along the causal path of request processing. Our evaluation shows that DQBarge helps Internet services mitigate load spikes, improve utilization of spare resources, and implement dynamic capacity planning.
    Ph.D. thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/135888/1/mcchow_1.pd
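
    The core idea behind the Mystery Machine, as described above, is to hypothesize happens-before relationships between components and discard any hypothesis for which the large volume of production requests supplies a counterexample. The sketch below illustrates that refinement loop in Python; the event names and trace format are invented for illustration and do not reflect the thesis' actual data model.

        from itertools import permutations

        # Invented traces: each request is a list of (event, timestamp) pairs.
        traces = [
            [("recv", 0), ("cache_lookup", 2), ("render", 5), ("send", 7)],
            [("recv", 0), ("render", 3), ("cache_lookup", 4), ("send", 6)],
        ]

        def refine_hypotheses(traces):
            # Start by hypothesizing "a happens before b" for every ordered pair
            # of observed events, then drop any pair contradicted by some trace.
            events = {e for trace in traces for e, _ in trace}
            hypotheses = set(permutations(events, 2))
            for trace in traces:
                times = {e: t for e, t in trace}
                for a, b in list(hypotheses):
                    if a in times and b in times and times[a] >= times[b]:
                        hypotheses.discard((a, b))   # counterexample observed
            return hypotheses

        # Surviving pairs are candidate causal edges, e.g. ('recv', 'send'); here
        # the order of cache_lookup and render varies across the two traces, so
        # no edge between them survives.
        print(sorted(refine_hypotheses(traces)))

    More traces eliminate more spurious hypotheses, which is why this approach benefits from the request volume of a production service.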

    Enforcement and Spectrum Sharing: Case Studies of Federal-Commercial Sharing

    To promote economic growth and unleash the potential of wireless broadband, there is a need to introduce more spectrally efficient technologies and spectrum management regimes. This has led to an environment in which commercial wireless broadband needs to share spectrum with federal and non-federal operations. Implementing sharing regimes on a non-opportunistic basis means that sharing agreements must be put in place. To have meaning, those agreements must be enforceable.
    With the significant exception of license-free wireless systems, commercial wireless services are based on exclusive use. With the policy change facilitating spectrum sharing, it becomes necessary to consider how sharing might take place in practice. Beyond the technical aspects of sharing that must be resolved lie questions about how usage rights are appropriately determined and enforced. This paper reasons about enforcement in particular spectrum bands (1695-1710 MHz and 3.5 GHz) that are currently being proposed for sharing between commercial services and incumbent spectrum users in the US. We examine three enforcement approaches (exclusion zones, protection zones, and pure ex post enforcement) and consider their implications in terms of cost elements, opportunity cost, and adaptability.

    CamFlow: Managed Data-sharing for Cloud Services

    A model of cloud services is emerging whereby a few trusted providers manage the underlying hardware and communications, whereas many companies build on this infrastructure to offer higher-level, cloud-hosted PaaS services and/or SaaS applications. From the start, strong isolation between cloud tenants was seen to be of paramount importance, provided first by virtual machines (VMs) and later by containers, which share the operating system (OS) kernel. Increasingly, applications also require facilities to effect isolation and protection of data managed by those applications. They also require flexible data sharing with other applications, often across the traditional cloud-isolation boundaries; for example, when government provides many related services for its citizens on a common platform. Similar considerations apply to the end-users of applications. In particular, the incorporation of cloud services within 'Internet of Things' architectures is driving the requirements for both protection and cross-application data sharing.
    These concerns relate to the management of data. Traditional access control is application and principal/role specific, applied at policy enforcement points, after which there is no subsequent control over where data flows; a crucial issue once data has left its owner's control and is handled by cloud-hosted applications and within cloud services. Information Flow Control (IFC), in addition, offers system-wide, end-to-end flow control based on the properties of the data. We discuss the potential of cloud-deployed IFC for enforcing owners' dataflow policy with regard to protection and sharing, as well as safeguarding against malicious or buggy software. In addition, the audit log associated with IFC provides transparency, giving configurable system-wide visibility over data flows. [...]
    Comment: 14 pages, 8 figures
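
    To make the flow-control idea concrete, the sketch below shows the classic secrecy/integrity label check that IFC systems apply to every flow: data may move from A to B only if B is cleared for all of A's secrecy tags and A satisfies all of B's integrity requirements. The entity structure and tag names are invented for illustration; this is not CamFlow's actual API.

        # Hedged sketch of an Information Flow Control check. Tag names and the
        # Entity type are illustrative only.
        from dataclasses import dataclass

        @dataclass
        class Entity:
            name: str
            secrecy: frozenset = frozenset()    # e.g. {"medical"}
            integrity: frozenset = frozenset()  # e.g. {"validated"}

        def flow_allowed(src: Entity, dst: Entity) -> bool:
            # Applied system-wide to every flow, not only at policy enforcement
            # points, so the check still holds after data leaves the application.
            return src.secrecy <= dst.secrecy and dst.integrity <= src.integrity

        patient_db = Entity("patient_db", secrecy=frozenset({"medical"}),
                            integrity=frozenset({"validated"}))
        analytics = Entity("analytics", secrecy=frozenset({"medical", "research"}))
        public_api = Entity("public_api")   # no secrecy clearance

        print(flow_allowed(patient_db, analytics))   # True: clearance covers the tags
        print(flow_allowed(patient_db, public_api))  # False: the flow would leak data

    Recording every such decision also yields the kind of audit log the abstract mentions, since each allowed or denied flow can be appended to a system-wide trace.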

    Towards automated provenance collection for runtime models to record system history

    In highly dynamic environments, systems are expected to make decisions on the fly based on observations that are bound to be partial. As such, the reasons for their runtime behaviour may be difficult to understand. In these cases, accountability is crucial, and decisions by the system need to be traceable. Logging is essential to support explanations of behaviour, but it poses challenges. Concerns about analysing massive logs have motivated the introduction of structured logging; however, knowing what to log and which details to include is still a challenge. Structured logs still do not necessarily relate events to each other, or indicate time intervals. We argue that logging changes to a runtime model in a provenance graph can mitigate some of these problems. The runtime model keeps only relevant details, thereby reducing the volume of the logs, while the provenance graph records causal connections between the changes and the activities performed by the agents in the system that have introduced them. In this paper, we demonstrate a first version of a reusable infrastructure for the automated construction of such a provenance graph. We apply it to a multithreaded traffic simulation case study, with multiple concurrent agents managing different parts of the simulation. We show how the provenance graphs can support validating the system behaviour, and how a seeded fault is reflected in the provenance graphs.
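
    As a concrete illustration of the approach, the sketch below records each change to a runtime model as a small set of PROV-style relations (used, wasGeneratedBy, wasAssociatedWith) linking the new model state, the activity that changed it, and the agent responsible, so a later query can trace why a state came about. The class, relation handling, and example names are ours; the paper's actual infrastructure is not reproduced here.

        import time

        # Hypothetical sketch: runtime-model changes logged as a provenance graph.
        class ProvenanceGraph:
            def __init__(self):
                self.edges = []   # (relation, source, target, timestamp)

            def record_change(self, agent, activity, old_state, new_state):
                t = time.time()
                self.edges.append(("used", activity, old_state, t))
                self.edges.append(("wasGeneratedBy", new_state, activity, t))
                self.edges.append(("wasAssociatedWith", activity, agent, t))

            def why(self, state):
                # Trace back the activity and agent responsible for a model state.
                activity = next(tgt for rel, src, tgt, _ in self.edges
                                if rel == "wasGeneratedBy" and src == state)
                agent = next(tgt for rel, src, tgt, _ in self.edges
                             if rel == "wasAssociatedWith" and src == activity)
                return activity, agent

        g = ProvenanceGraph()
        g.record_change(agent="traffic_light_agent",
                        activity="set_phase_green",
                        old_state="junction3@t0",
                        new_state="junction3@t1")
        print(g.why("junction3@t1"))   # ('set_phase_green', 'traffic_light_agent')

    Because only changes to the runtime model are recorded, the graph stays far smaller than a raw event log while still capturing the causal connections between concurrent agents' activities.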