30 research outputs found
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
From social networks to language modeling, the growing scale and importance
of graph data has driven the development of numerous new graph-parallel systems
(e.g., Pregel, GraphLab). By restricting the computation that can be expressed
and introducing new techniques to partition and distribute the graph, these
systems can efficiently execute iterative graph algorithms orders of magnitude
faster than more general data-parallel systems. However, the same restrictions
that enable the performance gains also make it difficult to express many of the
important stages in a typical graph-analytics pipeline: constructing the graph,
modifying its structure, or expressing computation that spans multiple graphs.
As a consequence, existing graph analytics pipelines compose graph-parallel and
data-parallel systems using external storage systems, leading to extensive data
movement and a complicated programming model.
To address these challenges we introduce GraphX, a distributed graph
computation framework that unifies graph-parallel and data-parallel
computation. GraphX provides a small, core set of graph-parallel operators
expressive enough to implement the Pregel and PowerGraph abstractions, yet
simple enough to be cast in relational algebra. GraphX uses a collection of
query optimization techniques such as automatic join rewrites to efficiently
implement these graph-parallel operators. We evaluate GraphX on real-world
graphs and workloads and demonstrate that GraphX achieves performance
comparable to specialized graph computation systems while outperforming them
in end-to-end graph pipelines. Moreover, GraphX strikes a balance between
expressiveness, performance, and ease of use.
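The idea of casting graph-parallel operators in relational terms can be illustrated with a toy sketch (plain Python, not GraphX code; all names here are illustrative): one PageRank iteration is a join of the rank table with the edge table (each edge "scatters" a message) followed by a group-by aggregation of messages per destination vertex (the "gather" step). GraphX exposes this same pattern in Scala through operators such as aggregateMessages.

```python
from collections import defaultdict

def pagerank(edges, num_iters=10, reset=0.15):
    """Toy PageRank expressed as the join/aggregate pattern described above:
    each iteration joins ranks with the edge table (scatter) and groups
    messages by destination vertex (gather)."""
    vertices = {v for e in edges for v in e}
    out_deg = defaultdict(int)
    for src, _ in edges:
        out_deg[src] += 1
    ranks = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        # "join" the rank table with the edge table: each edge emits a message
        msgs = defaultdict(float)
        for src, dst in edges:
            msgs[dst] += ranks[src] / out_deg[src]
        # "group by" destination vertex and apply the rank update
        ranks = {v: reset + (1 - reset) * msgs[v] for v in vertices}
    return ranks
```

On a symmetric cycle such as a -> b -> c -> a, every vertex keeps rank 1.0, which is a quick sanity check that the scatter/gather bookkeeping is right.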
Black or White? How to Develop an AutoTuner for Memory-based Analytics [Extended Version]
There is a lot of interest today in building autonomous (or, self-driving)
data processing systems. An emerging school of thought is to leverage AI-driven
"black box" algorithms for this purpose. In this paper, we present a contrarian
view. We study the problem of autotuning the memory allocation for applications
running on modern distributed data processing systems. For this problem, we
show that an empirically-driven "white-box" algorithm, called RelM, that we
have developed provides a close-to-optimal tuning at a fraction of the
overheads compared to state-of-the-art AI-driven "black box" algorithms,
namely, Bayesian Optimization (BO) and Deep Distributed Policy Gradient (DDPG).
The main reason for RelM's superior performance is that the memory management
in modern memory-based data analytics systems is an interplay of algorithms at
multiple levels: (i) at the resource-management level across various containers
allocated by resource managers like Kubernetes and YARN, (ii) at the container
level among the OS, pods, and processes such as the Java Virtual Machine (JVM),
(iii) at the application level for caching, aggregation, data shuffles, and
application data structures, and (iv) at the JVM level across various pools
such as the Young and Old Generation. RelM understands these interactions and
uses them in building an analytical solution to autotune the memory management
knobs. In another contribution, called GBO, we use the RelM's analytical models
to speed up Bayesian Optimization. Through an evaluation based on Apache Spark,
we showcase that RelM's recommendations are significantly better than what
commonly-used Spark deployments provide, and are close to the ones obtained by
brute-force exploration; GBO, meanwhile, provides optimality guarantees at a
cost overhead that is higher than RelM's but still significantly lower than
that of the state-of-the-art AI-driven policies.
Comment: Main version in ACM SIGMOD 202
Novel drugs approved by the EMA, the FDA, and the MHRA in 2023: A year in review
In 2023, seventy novel drugs received market authorization for the first time in either Europe (by the EMA and the MHRA) or in the United States (by the FDA). Confirming a steady recent trend, more than half of these drugs target rare diseases or intractable forms of cancer. Thirty drugs are categorized as "first-in-class" (FIC), illustrating the quality of research and innovation that drives new chemical entity discovery and development. We succinctly describe the mechanism of action of most of these FIC drugs and discuss the therapeutic areas covered, as well as the chemical category to which these drugs belong. The 2023 novel drug list also demonstrates an unabated emphasis on polypeptides (recombinant proteins and antibodies), Advanced Therapy Medicinal Products (gene and cell therapies) and RNA therapeutics, including the first-ever approval of a CRISPR-Cas9-based gene-editing cell therapy.
Go with the Flow: Graphs, Streaming and Relational Computations over Distributed Dataflow
Modern data analysis is undergoing a "Big Data" transformation: organizations are generating and gathering more data than ever before, in a variety of formats covering both structured and unstructured data, and employing increasingly sophisticated techniques such as machine learning and graph computation beyond the traditional roll-up and drill-down capabilities provided by SQL. To cope with the big data challenges, we believe that data processing systems will need to provide fine-grained fault recovery across larger clusters of machines, support both SQL and complex analytics efficiently, and enable real-time computation.

This dissertation builds on Apache Spark, a distributed dataflow engine, and creates three related systems: Spark SQL, Structured Streaming, and GraphX. Spark SQL combines relational and procedural processing through a new API called DataFrame. It also includes an extensible query optimizer to support a wide variety of data sources and analytic workloads. Structured Streaming extends Spark SQL's DataFrame API and query optimizer to automatically incrementalize queries, so users can reason about real-time stream data as batch datasets and have the same application operate over both stream and batch data. GraphX recasts graph-specific system optimizations as dataflow optimizations and provides an efficient framework for graph computation on top of Spark.

The three systems have enjoyed wide adoption in industry and academia, and together they laid the foundation for Spark's 2.0 release. They demonstrate the feasibility and advantages of unifying disparate, specialized data systems on top of distributed dataflow systems.
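The incrementalization idea behind Structured Streaming can be sketched with a toy example (the function and names below are illustrative, not the Spark API): the same grouped-count logic is applied to each micro-batch, and running state is folded in so the result after any batch matches what a batch query over all data seen so far would return.

```python
def incremental_count(state, batch):
    """Toy illustration of query incrementalization: fold a micro-batch of
    keys into running per-key counts, so the streaming result stays equal
    to a batch grouped-count over all input seen so far."""
    for key in batch:
        state[key] = state.get(key, 0) + 1
    return state

# Process two micro-batches; the final state equals a batch count
# over the concatenated input ["a", "b", "a"].
state = {}
for micro_batch in [["a", "b"], ["a"]]:
    state = incremental_count(state, micro_batch)
```

The user-visible "query" (count per key) never changes between the streaming and batch cases; only the engine-managed state makes the computation incremental, which is the property the dissertation's DataFrame-based design exploits.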