40 research outputs found
DataHub: Collaborative Data Science & Dataset Version Management at Scale
Relational databases have limited support for data collaboration, where teams
collaboratively curate and analyze large datasets. Inspired by software version
control systems like git, we propose (a) a dataset version control system,
giving users the ability to create, branch, merge, difference and search large,
divergent collections of datasets, and (b) a platform, DataHub, that gives
users the ability to perform collaborative data analysis building on this
version control system. We outline the challenges in providing dataset version
control at scale. Comment: 7 pages
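As a rough illustration of the create, branch, merge, and difference operations named in the abstract above, the following is a toy in-memory sketch. The class `VersionedDataset` and its method names are hypothetical and are not DataHub's API; rows are modeled as hashable tuples, and the merge is a naive union that ignores deletions and conflicts.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedDataset:
    """Toy store (hypothetical): each branch name maps to an immutable set of rows."""
    branches: dict = field(default_factory=lambda: {"master": frozenset()})

    def commit(self, branch, rows):
        # Replace the branch head with a new immutable snapshot.
        self.branches[branch] = frozenset(rows)

    def branch(self, source, new):
        # A new branch starts as a cheap reference to the source snapshot.
        self.branches[new] = self.branches[source]

    def diff(self, a, b):
        # Return (rows only in a, rows only in b).
        ra, rb = self.branches[a], self.branches[b]
        return ra - rb, rb - ra

    def merge(self, source, target):
        # Naive union merge: no delete tracking, no conflict detection.
        self.branches[target] = self.branches[target] | self.branches[source]


ds = VersionedDataset()
ds.commit("master", {(1, "a"), (2, "b")})
ds.branch("master", "cleaning")
ds.commit("cleaning", {(1, "a"), (2, "b"), (3, "c")})
only_master, only_cleaning = ds.diff("master", "cleaning")
ds.merge("cleaning", "master")
```

A system operating at scale would need to avoid storing full snapshots per branch; this sketch only shows the shape of the operations.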
Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints
Today, data analysts largely rely on intuition to determine whether missing
or withheld rows of a dataset significantly affect their analyses. We propose a
framework that can produce automatic contingency analysis, i.e., the range of
values an aggregate SQL query could take, under formal constraints describing
the variation and frequency of missing data tuples. We describe how to process
SUM, COUNT, AVG, MIN, and MAX queries in these conditions resulting in hard
error bounds with testable constraints. We propose an optimization algorithm
based on an integer program that reconciles a set of such constraints, even if
they are overlapping, conflicting, or unsatisfiable, into such bounds. Our
experiments on real-world datasets against several statistical imputation and
inference baselines show that statistical techniques can have a deceptively
high error rate that is often unpredictable. In contrast, our framework offers
hard bounds that are guaranteed to hold if the constraints are not violated. In
spite of these hard bounds, we show accuracy competitive with statistical
baselines.
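To make the idea of a contingency range concrete, here is a minimal sketch for a single, simplified constraint: at most `max_missing` rows are missing, and every missing value lies in `[lo, hi]`. The function names and this constraint form are assumptions made for illustration; the abstract describes an integer program that reconciles many overlapping constraints, which this sketch does not attempt.

```python
def count_bounds(observed, max_missing):
    # Between zero and max_missing additional rows may exist.
    n = len(observed)
    return n, n + max_missing


def sum_bounds(observed, max_missing, lo, hi):
    # Missing rows lower the sum only if lo < 0 and raise it only if hi > 0.
    s = sum(observed)
    return s + max_missing * min(lo, 0), s + max_missing * max(hi, 0)


def avg_bounds(observed, max_missing, lo, hi):
    # (s + j*v) / (n + j) is monotone in j for fixed v, so the extremes occur
    # at j = 0 or j = max_missing, with v at lo or hi. Assumes observed is non-empty.
    s, n = sum(observed), len(observed)
    current = s / n
    low = min(current, (s + max_missing * lo) / (n + max_missing))
    high = max(current, (s + max_missing * hi) / (n + max_missing))
    return low, high


observed = [10, 20, 30]
c = count_bounds(observed, max_missing=2)
s = sum_bounds(observed, max_missing=2, lo=0, hi=50)
a = avg_bounds(observed, max_missing=2, lo=0, hi=50)
```

With three observed values summing to 60 and up to two missing values in [0, 50], the sum can range from 60 to 160 and the average from 12 to 32. These bounds hold whenever the stated constraint holds, which is the sense of "hard bounds" in the abstract.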
Serializability, not Serial: Concurrency Control and Availability in Multi-Datacenter Datastores
We present a framework for concurrency control and availability in
multi-datacenter datastores. While we consider Google's Megastore as our
motivating example, we define general abstractions for key components, making
our solution extensible to any system that satisfies the abstraction
properties. We first develop and analyze a transaction management and
replication protocol based on a straightforward implementation of the Paxos
algorithm. Our investigation reveals that this protocol acts as a concurrency
prevention mechanism rather than a concurrency control mechanism. We then
propose an enhanced protocol called Paxos with Combination and Promotion
(Paxos-CP) that provides true transaction concurrency while requiring the same
per instance message complexity as the basic Paxos protocol. Finally, we
compare the performance of Paxos and Paxos-CP in a multi-datacenter
experimental study, and we demonstrate that Paxos-CP results in significantly
fewer aborted transactions than basic Paxos. Comment: VLDB201
Decibel: the relational dataset branching system
As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This not only wastes storage but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs. Funding: National Science Foundation (U.S.) (1513972, 1513407, 1513443); Intel Science and Technology Center for Big Data.
Herbicide-Resistance in Turf Systems: Insights and Options for Managing Complexity
Due to complex interactions between social and ecological systems, herbicide resistance has classic features of a “wicked problem.” Herbicide-resistant (HR) Poa annua poses a risk to sustainably managing U.S. turfgrass systems, but there is scant knowledge to guide its management. Six focus groups were conducted throughout the United States to gain understanding of socio-economic barriers to adopting herbicide-resistance management practices. Professionals from major turfgrass sectors (golf courses, sports fields, lawn care, and seed/sod production) were recruited as focus-group participants. Discussions emphasized the challenges of managing weeds in turfgrass systems as compared with agronomic crops, including greater time constraints and more limited chemical control options. Lack of understanding about the proper use of compounds with different modes of action was identified as a threat to sustainable weed management. There were significant regional differences in perceptions of the existence, geographic scope, and social and ecological causes of HR in managing Poa annua. Effective resistance management will require tailoring chemical and non-chemical practices to the specific conditions of different turfgrass sectors and regions. Some participants thought it would be helpful to have multi-year resistance management programs that are both sector- and species-specific.
Mortality and pulmonary complications in patients undergoing surgery with perioperative SARS-CoV-2 infection: an international cohort study
Background: The impact of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) on postoperative recovery needs to be understood to inform clinical decision making during and after the COVID-19 pandemic. This study reports 30-day mortality and pulmonary complication rates in patients with perioperative SARS-CoV-2 infection. Methods: This international, multicentre, cohort study at 235 hospitals in 24 countries included all patients undergoing surgery who had SARS-CoV-2 infection confirmed within 7 days before or 30 days after surgery. The primary outcome measure was 30-day postoperative mortality and was assessed in all enrolled patients. The main secondary outcome measure was pulmonary complications, defined as pneumonia, acute respiratory distress syndrome, or unexpected postoperative ventilation. Findings: This analysis includes 1128 patients who had surgery between Jan 1 and March 31, 2020, of whom 835 (74·0%) had emergency surgery and 280 (24·8%) had elective surgery. SARS-CoV-2 infection was confirmed preoperatively in 294 (26·1%) patients. 30-day mortality was 23·8% (268 of 1128). Pulmonary complications occurred in 577 (51·2%) of 1128 patients; 30-day mortality in these patients was 38·0% (219 of 577), accounting for 81·7% (219 of 268) of all deaths. In adjusted analyses, 30-day mortality was associated with male sex (odds ratio 1·75 [95% CI 1·28–2·40], p<0·0001), age 70 years or older versus younger than 70 years (2·30 [1·65–3·22], p<0·0001), American Society of Anesthesiologists grades 3–5 versus grades 1–2 (2·35 [1·57–3·53], p<0·0001), malignant versus benign or obstetric diagnosis (1·55 [1·01–2·39], p=0·046), emergency versus elective surgery (1·67 [1·06–2·63], p=0·026), and major versus minor surgery (1·52 [1·01–2·31], p=0·047). Interpretation: Postoperative pulmonary complications occur in half of patients with perioperative SARS-CoV-2 infection and are associated with high mortality.
Thresholds for surgery during the COVID-19 pandemic should be higher than during normal practice, particularly in men aged 70 years and older. Consideration should be given for postponing non-urgent procedures and promoting non-operative treatment to delay or avoid the need for surgery. Funding: National Institute for Health Research (NIHR), Association of Coloproctology of Great Britain and Ireland, Bowel and Cancer Research, Bowel Disease Research Foundation, Association of Upper Gastrointestinal Surgeons, British Association of Surgical Oncology, British Gynaecological Cancer Society, European Society of Coloproctology, NIHR Academy, Sarcoma UK, Vascular Society for Great Britain and Ireland, and Yorkshire Cancer Research