30 research outputs found
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
From social networks to language modeling, the growing scale and importance
of graph data has driven the development of numerous new graph-parallel systems
(e.g., Pregel, GraphLab). By restricting the computation that can be expressed
and introducing new techniques to partition and distribute the graph, these
systems can efficiently execute iterative graph algorithms orders of magnitude
faster than more general data-parallel systems. However, the same restrictions
that enable the performance gains also make it difficult to express many of the
important stages in a typical graph-analytics pipeline: constructing the graph,
modifying its structure, or expressing computation that spans multiple graphs.
As a consequence, existing graph analytics pipelines compose graph-parallel and
data-parallel systems using external storage systems, leading to extensive data
movement and a complicated programming model.
To address these challenges we introduce GraphX, a distributed graph
computation framework that unifies graph-parallel and data-parallel
computation. GraphX provides a small, core set of graph-parallel operators
expressive enough to implement the Pregel and PowerGraph abstractions, yet
simple enough to be cast in relational algebra. GraphX uses a collection of
query optimization techniques such as automatic join rewrites to efficiently
implement these graph-parallel operators. We evaluate GraphX on real-world
graphs and workloads and demonstrate that GraphX achieves performance
comparable to specialized graph computation systems while outperforming them
in end-to-end graph pipelines. Moreover, GraphX strikes a balance between
expressiveness, performance, and ease of use.
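The idea of casting graph-parallel operators in relational terms can be illustrated with a toy sketch (plain Python, not GraphX code; all names here are illustrative): one PageRank iteration is a join of the rank table with the edge table (each edge "scatters" a message) followed by a group-by aggregation of messages per destination vertex (the "gather" step). GraphX exposes this same pattern in Scala through operators such as aggregateMessages.

```python
from collections import defaultdict

def pagerank(edges, num_iters=10, reset=0.15):
    """Toy PageRank expressed as the join/aggregate pattern described above:
    each iteration joins ranks with the edge table (scatter) and groups
    messages by destination vertex (gather)."""
    vertices = {v for e in edges for v in e}
    out_deg = defaultdict(int)
    for src, _ in edges:
        out_deg[src] += 1
    ranks = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        # "join" the rank table with the edge table: each edge emits a message
        msgs = defaultdict(float)
        for src, dst in edges:
            msgs[dst] += ranks[src] / out_deg[src]
        # "group by" destination vertex and apply the rank update
        ranks = {v: reset + (1 - reset) * msgs[v] for v in vertices}
    return ranks
```

On a symmetric cycle such as a -> b -> c -> a, every vertex keeps rank 1.0, which is a quick sanity check that the scatter/gather bookkeeping is right.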
Black or White? How to Develop an AutoTuner for Memory-based Analytics [Extended Version]
There is a lot of interest today in building autonomous (or, self-driving)
data processing systems. An emerging school of thought is to leverage AI-driven
"black box" algorithms for this purpose. In this paper, we present a contrarian
view. We study the problem of autotuning the memory allocation for applications
running on modern distributed data processing systems. For this problem, we
show that an empirically-driven "white-box" algorithm, called RelM, that we
have developed provides a close-to-optimal tuning at a fraction of the
overheads compared to state-of-the-art AI-driven "black box" algorithms,
namely, Bayesian Optimization (BO) and Deep Distributed Policy Gradient (DDPG).
The main reason for RelM's superior performance is that the memory management
in modern memory-based data analytics systems is an interplay of algorithms at
multiple levels: (i) at the resource-management level across various containers
allocated by resource managers like Kubernetes and YARN, (ii) at the container
level among the OS, pods, and processes such as the Java Virtual Machine (JVM),
(iii) at the application level for caching, aggregation, data shuffles, and
application data structures, and (iv) at the JVM level across various pools
such as the Young and Old Generation. RelM understands these interactions and
uses them in building an analytical solution to autotune the memory management
knobs. In another contribution, called GBO, we use the RelM's analytical models
to speed up Bayesian Optimization. Through an evaluation based on Apache Spark,
we showcase that RelM's recommendations are significantly better than what
commonly-used Spark deployments provide, and are close to the ones obtained by
brute-force exploration; GBO, meanwhile, provides optimality guarantees at a
cost overhead that is higher than RelM's but still significantly lower than
that of the state-of-the-art AI-driven policies.
Comment: Main version in ACM SIGMOD 202
Novel drugs approved by the EMA, the FDA, and the MHRA in 2023: A year in review
In 2023, seventy novel drugs received market authorization for the first time in either Europe (by the EMA and the MHRA) or in the United States (by the FDA). Confirming a steady recent trend, more than half of these drugs target rare diseases or intractable forms of cancer. Thirty drugs are categorized as "first-in-class" (FIC), illustrating the quality of research and innovation that drives new chemical entity discovery and development. We succinctly describe the mechanism of action of most of these FIC drugs and discuss the therapeutic areas covered, as well as the chemical category to which these drugs belong. The 2023 novel drug list also demonstrates an unabated emphasis on polypeptides (recombinant proteins and antibodies), Advanced Therapy Medicinal Products (gene and cell therapies) and RNA therapeutics, including the first-ever approval of a CRISPR-Cas9-based gene-editing cell therapy.
Go with the Flow: Graphs, Streaming and Relational Computations over Distributed Dataflow
Modern data analysis is undergoing a "Big Data" transformation: organizations are generating and gathering more data than ever before, in a variety of formats covering both structured and unstructured data, and employing increasingly sophisticated techniques such as machine learning and graph computation beyond the traditional roll-up and drill-down capabilities provided by SQL. To cope with the big data challenges, we believe that data processing systems will need to provide fine-grained fault recovery across larger clusters of machines, support both SQL and complex analytics efficiently, and enable real-time computation.

This dissertation builds on Apache Spark, a distributed dataflow engine, and creates three related systems: Spark SQL, Structured Streaming, and GraphX. Spark SQL combines relational and procedural processing through a new API called DataFrame. It also includes an extensible query optimizer to support a wide variety of data sources and analytic workloads. Structured Streaming extends Spark SQL's DataFrame API and query optimizer to automatically incrementalize queries, so users can reason about real-time stream data as batch datasets and have the same application operate over both stream and batch data. GraphX recasts graph-specific system optimizations as dataflow optimizations and provides an efficient framework for graph computation on top of Spark.

The three systems have enjoyed wide adoption in industry and academia, and together they laid the foundation for Spark's 2.0 release. They demonstrate the feasibility and advantages of unifying disparate, specialized data systems on top of distributed dataflow systems.
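The incrementalization idea behind Structured Streaming can be sketched with a toy example (the function and names below are illustrative, not the Spark API): the same grouped-count logic is applied to each micro-batch, and running state is folded in so the result after any batch matches what a batch query over all data seen so far would return.

```python
def incremental_count(state, batch):
    """Toy illustration of query incrementalization: fold a micro-batch of
    keys into running per-key counts, so the streaming result stays equal
    to a batch grouped-count over all input seen so far."""
    for key in batch:
        state[key] = state.get(key, 0) + 1
    return state

# Process two micro-batches; the final state equals a batch count
# over the concatenated input ["a", "b", "a"].
state = {}
for micro_batch in [["a", "b"], ["a"]]:
    state = incremental_count(state, micro_batch)
```

The user-visible "query" (count per key) never changes between the streaming and batch cases; only the engine-managed state makes the computation incremental, which is the property the dissertation's DataFrame-based design exploits.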