Scheduling of data-intensive workloads in a brokered virtualized environment
Providing performance predictability guarantees is increasingly important in cloud platforms, especially for data-intensive applications, whose performance depends greatly on the available rates of data transfer between the various computing/storage hosts underlying the virtualized resources assigned to the application. With the increased prevalence of brokerage services in cloud platforms, there is a need for resource management solutions that consider the brokered nature of these workloads, as well as the special demands of their intra-dependent components. In this paper, we present an offline mechanism for scheduling batches of brokered data-intensive workloads, which can be extended to an online setting. The objective of the mechanism is to decide on a packing of the workloads in a batch that minimizes the broker's incurred costs. Moreover, considering the brokered nature of such workloads, we define a payment model that provides incentives for these workloads to be scheduled as part of a batch, and we analyze this model theoretically. Finally, we evaluate the proposed scheduling algorithm and illustrate the fairness of the payment model in practical settings via trace-based experiments.
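The batch-packing objective lends itself to a small illustration. Below is a minimal, hypothetical Python sketch (not the paper's mechanism) of first-fit-decreasing packing under a toy cost model in which the broker pays a fixed cost for each host it opens; all names, fields, and parameters are invented for illustration.

    # Illustrative sketch only: greedy first-fit-decreasing packing of a
    # batch of data-intensive workloads onto hosts, under a toy cost model.
    # Workload, Host, and the cost model are hypothetical stand-ins.
    from dataclasses import dataclass

    @dataclass
    class Workload:
        name: str
        demand: int          # e.g., required data-transfer rate

    @dataclass
    class Host:
        capacity: int        # available transfer rate on this host
        fixed_cost: float    # cost the broker incurs to lease this host
        used: int = 0

    def pack_batch(workloads, host_capacity, host_cost):
        """Open a new host only when a workload fits on no existing one,
        keeping the broker's total fixed cost low."""
        hosts = []
        for w in sorted(workloads, key=lambda w: w.demand, reverse=True):
            for h in hosts:
                if h.used + w.demand <= h.capacity:
                    h.used += w.demand
                    break
            else:
                hosts.append(Host(host_capacity, host_cost, used=w.demand))
        return hosts, sum(h.fixed_cost for h in hosts)

    if __name__ == "__main__":
        batch = [Workload("w1", 40), Workload("w2", 70), Workload("w3", 30)]
        hosts, cost = pack_batch(batch, host_capacity=100, host_cost=1.0)
        print(f"hosts used: {len(hosts)}, broker cost: {cost}")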
Prochlo: Strong Privacy for Analytics in the Crowd
The large-scale monitoring of computer users' software activities has become
commonplace, e.g., for application telemetry, error reporting, or demographic
profiling. This paper describes a principled systems architecture---Encode,
Shuffle, Analyze (ESA)---for performing such monitoring with high utility while
also protecting user privacy. The ESA design, and its Prochlo implementation,
are informed by our practical experiences with an existing, large deployment of
privacy-preserving software monitoring.
(Abstract truncated; see the full paper.)
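As a rough illustration of the Encode-Shuffle-Analyze pattern named above, here is a minimal, hypothetical Python sketch with made-up record fields and an arbitrarily chosen crowd threshold; the real Prochlo pipeline adds cryptographic and systems protections this sketch omits.

    # Minimal ESA-style sketch: encode strips each event to the bare datum,
    # shuffle breaks linkability, analyze aggregates with a crowd threshold.
    # All record fields and thresholds here are hypothetical.
    import random
    from collections import Counter

    def encode(event):
        """Encode: reduce a raw event to the minimal datum needed."""
        return event["app_name"]      # drop user id, timestamps, etc.

    def shuffle(records):
        """Shuffle: batch and randomly permute records so that order
        and origin carry no signal."""
        batch = list(records)
        random.shuffle(batch)
        return batch

    def analyze(batch, min_count=3):
        """Analyze: aggregate, reporting only values seen in a crowd
        (a simple thresholding rule, chosen here arbitrarily)."""
        counts = Counter(batch)
        return {k: v for k, v in counts.items() if v >= min_count}

    if __name__ == "__main__":
        raw = [{"user": i,
                "app_name": "rare_tool" if i == 0
                else ("browser" if i % 3 else "editor")}
               for i in range(30)]
        # "rare_tool" appears once and is suppressed by the threshold.
        print(analyze(shuffle(encode(e) for e in raw)))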
Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools
This paper introduces a high-throughput software tool framework called {\it
sam2bam} that enables users to significantly speedup pre-processing for
next-generation sequencing data. The sam2bam is especially efficient on
single-node multi-core large-memory systems. It can reduce the runtime of data
pre-processing in marking duplicate reads on a single node system by 156-186x
compared with de facto standard tools. The sam2bam consists of parallel
software components that can fully utilize the multiple processors, available
memory, high-bandwidth of storage, and hardware compression accelerators if
available.
The sam2bam provides file format conversion between well-known genome file
formats, from SAM to BAM, as a basic feature. Additional features such as
analyzing, filtering, and converting the input data are provided by {\it
plug-in} tools, e.g., duplicate marking, which can be attached to sam2bam at
runtime.
We demonstrated that sam2bam could significantly reduce the runtime of NGS
data pre-processing from about two hours to about one minute for a whole-exome
data set on a 16-core single-node system using up to 130 GB of memory. The
sam2bam could reduce the runtime for whole-genome sequencing data from about 20
hours to about nine minutes on the same system using up to 711 GB of memory
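The runtime-attachable plug-in design suggests a familiar pipeline pattern. The sketch below is a generic, hypothetical rendering of that pattern, not sam2bam's actual API: the record fields, the duplicate-marking rule, and the Pipeline class are stand-ins, and the real tool's parallel, accelerator-aware stages are not modeled.

    # Illustrative only: conversion pipeline with plug-ins attached at
    # runtime. Not sam2bam's API; records and rules are simplified.

    def mark_duplicates(records):
        """Plug-in: flag reads mapping to an already-seen position."""
        seen = set()
        for rec in records:
            key = (rec["rname"], rec["pos"])
            rec["duplicate"] = key in seen
            seen.add(key)
            yield rec

    class Pipeline:
        """SAM -> BAM style conversion with runtime-attached plug-ins."""
        def __init__(self):
            self.plugins = []

        def attach(self, plugin):
            self.plugins.append(plugin)
            return self

        def run(self, records):
            for plugin in self.plugins:
                records = plugin(records)
            return list(records)   # stand-in for compressed BAM output

    if __name__ == "__main__":
        sam = [{"rname": "chr1", "pos": 100}, {"rname": "chr1", "pos": 100},
               {"rname": "chr2", "pos": 5}]
        out = Pipeline().attach(mark_duplicates).run(sam)
        print([r["duplicate"] for r in out])   # [False, True, False]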
How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility
Recommendation systems are ubiquitous and impact many domains; they have the
potential to influence product consumption, individuals' perceptions of the
world, and life-altering decisions. These systems are often evaluated or
trained with data from users already exposed to algorithmic recommendations;
this creates a pernicious feedback loop. Using simulations, we demonstrate how
using data confounded in this way homogenizes user behavior without increasing
utility.
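The feedback loop is easy to reproduce in a toy simulation. The sketch below, with entirely invented parameters and far simpler than the paper's simulations, trains a popularity-based recommender on its own logged clicks; the number of distinct items consumed per round shrinks, illustrating the homogenization the abstract describes.

    # Toy simulation of algorithmic confounding: a recommender trained on
    # its own logged clicks narrows what everyone consumes over time.
    # All parameters are made up for illustration.
    import random
    from collections import Counter

    N_USERS, N_ITEMS, ROUNDS = 200, 20, 30

    prefs = [[random.random() for _ in range(N_ITEMS)]
             for _ in range(N_USERS)]
    clicks = Counter()

    def recommend(user):
        # Scores items by logged clicks (the recommender's own training
        # data), ignoring the user's true preferences entirely.
        return max(range(N_ITEMS), key=lambda i: clicks[i] + random.random())

    for t in range(ROUNDS):
        consumed = []
        for u in range(N_USERS):
            item = recommend(u)
            # The user accepts with probability tied to true preference.
            if random.random() < prefs[u][item]:
                clicks[item] += 1
                consumed.append(item)
        if t % 10 == 0:
            print(f"round {t}: {len(set(consumed))} distinct items consumed")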