Sparse Allreduce: Efficient Scalable Communication for Power-Law Data
Many large datasets exhibit power-law statistics: the web graph, social
networks, text data, click-through data, etc. Their adjacency graphs are termed
natural graphs, and are known to be difficult to partition. As a consequence
most distributed algorithms on these graphs are communication intensive. Many
algorithms on natural graphs involve an Allreduce: a sum or average of
partitioned data which is then shared back to the cluster nodes. Examples
include PageRank, spectral partitioning, and many machine learning algorithms
including regression, factor (topic) models, and clustering. In this paper we
describe an efficient and scalable Allreduce primitive for power-law data. We
point out scaling problems with existing butterfly and round-robin networks for
Sparse Allreduce, and show that a hybrid approach improves on both.
Furthermore, we show that Sparse Allreduce stages should be nested instead of
cascaded (as in the dense case), and that the optimum-throughput Allreduce
network should be a butterfly of heterogeneous degree where degree decreases
with depth into the network. Finally, a simple replication scheme is introduced
to deal with node failures. We present experiments showing significant
improvements over existing systems such as PowerGraph and Hadoop.
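To make the Allreduce operation in the abstract above concrete, here is a minimal single-process sketch of its semantics for sparse data: each node contributes a sparse vector, the contributions are summed, and each node receives back only the entries it asks for. The function name `sparse_allreduce`, the dict-based representation, and the per-node request lists are assumptions made for illustration. The sketch shows only what is computed; it does not model the butterfly, round-robin, or hybrid networks the paper discusses, nor its replication scheme.

```python
from collections import defaultdict


def sparse_allreduce(contributions, requests):
    """Sum sparse vectors from all nodes and return the requested totals.

    contributions: one dict per node mapping index -> value (hypothetical layout).
    requests: one list per node with the indices that node wants back.
    """
    totals = defaultdict(float)
    # Reduce step: accumulate every node's nonzero entries.
    for vec in contributions:
        for idx, val in vec.items():
            totals[idx] += val
    # Share-back step: each node gets only the indices it requested;
    # indices nobody contributed to are reported as 0.0.
    return [{idx: totals.get(idx, 0.0) for idx in req} for req in requests]


# Three simulated nodes with overlapping sparse indices.
contributions = [
    {0: 1.0, 5: 2.0},
    {5: 3.0, 9: 1.5},
    {0: 0.5, 9: 0.5},
]
requests = [[0, 5], [5, 9], [0, 9, 7]]
results = sparse_allreduce(contributions, requests)
```

In a real cluster the reduce and share-back steps would be spread over network stages rather than done in one loop; the point here is only that each node sends and receives a small subset of indices rather than a full dense vector.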
AIOps for a Cloud Object Storage Service
With the growing reliance on the ubiquitous availability of IT systems and
services, these systems become more global, scaled, and complex to operate. To
maintain business viability, IT service providers must put in place reliable
and cost-efficient operations support. Artificial Intelligence for IT
Operations (AIOps) is a promising technology for alleviating operational
complexity of IT systems and services. AIOps platforms utilize big data,
machine learning and other advanced analytics technologies to enhance IT
operations with proactive, actionable, dynamic insight.
In this paper we share our experience applying the AIOps approach to a
production cloud object storage service to get actionable insights into
the system's behavior and health. We describe a real-life production cloud-scale
service and its operational data, present the AIOps platform we have created,
and show how it has helped us resolve operational pain points.
Comment: 5 page
Scheduling Storms and Streams in the Cloud
Motivated by emerging big streaming data processing paradigms (e.g., Twitter
Storm, Streaming MapReduce), we investigate the problem of scheduling graphs
over a large cluster of servers. Each graph is a job, where nodes represent
compute tasks and edges indicate data-flows between these compute tasks. Jobs
(graphs) arrive randomly over time, and upon completion, leave the system. When
a job arrives, the scheduler needs to partition the graph and distribute it
over the servers to satisfy load balancing and cost considerations.
Specifically, neighboring compute tasks in the graph that are mapped to
different servers incur load on the network; thus a mapping of the jobs among
the servers incurs a cost that is proportional to the number of "broken edges".
We propose a low complexity randomized scheduling algorithm that, without
service preemptions, stabilizes the system with graph arrivals/departures; more
importantly, it allows a smooth trade-off between minimizing average
partitioning cost and average queue lengths. Interestingly, to avoid service
preemptions, our approach does not rely on a Gibbs sampler; instead, we show
that the corresponding limiting invariant measure has an interpretation
stemming from a loss system.
Comment: 14 page
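The partitioning cost described in the abstract above, proportional to the number of "broken edges", can be illustrated with a short sketch. The function name `broken_edges`, the edge-list representation of a job, and the task-to-server `placement` mapping are assumptions made for illustration. This computes only the cost of a given mapping; it is not the randomized scheduling algorithm the paper proposes.

```python
def broken_edges(edges, placement):
    """Count edges whose two endpoint tasks are mapped to different servers.

    edges: list of (task_u, task_v) data-flow pairs in one job graph.
    placement: dict mapping each task to a server id (hypothetical layout).
    """
    return sum(1 for u, v in edges if placement[u] != placement[v])


# A four-task job whose data-flows form a cycle a-b-c-d-a.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]

# Splitting the job across two servers breaks the b-c and a-d edges.
split_placement = {"a": 0, "b": 0, "c": 1, "d": 1}
split_cost = broken_edges(edges, split_placement)

# Placing the whole job on one server breaks no edges.
packed_placement = {"a": 0, "b": 0, "c": 0, "d": 0}
packed_cost = broken_edges(edges, packed_placement)
```

Packing a whole job on one server minimizes this cost but may conflict with load balancing, which is the trade-off between partitioning cost and queue lengths that the abstract refers to.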