
260 research outputs found

Automatic Task Parallelization of Dataflow Graphs in ML/DL models

Author: Das Srinjoy
Rauchwerger Lawrence
Publication date: 22/08/2023

Several methods exist today to accelerate Machine Learning (ML) or Deep Learning (DL) model performance for training and inference. However, modern techniques built on various graph and operator parallelism methodologies rely on search-space optimizations that are costly in terms of power and hardware usage. Especially in the case of inference, when the batch size is 1 and execution is on CPUs or on power-constrained edge devices, current techniques can become costly, complicated or inapplicable. To ameliorate this, we present a Critical-Path-based Linear Clustering approach to exploit inherent parallel paths in ML dataflow graphs. Our task parallelization approach further optimizes the structure of graphs via cloning and prunes them via constant propagation and dead-code elimination. Contrary to other work, we generate readable and executable parallel PyTorch+Python code from input ML models in ONNX format via a new tool that we have built called Ramiel. This allows us to benefit from other downstream acceleration techniques like intra-op parallelism and potentially pipeline parallelism. Our preliminary results on several ML graphs demonstrate up to 1.9× speedup over serial execution and outperform some of the current mechanisms in both compile time and run time. Lastly, our methods are lightweight and fast enough that they can be used effectively on power- and resource-constrained devices, while still enabling downstream optimizations.

arXiv.org e-Print Archive
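
The critical-path-based linear clustering named in the abstract above can be illustrated with a minimal sketch. This is not Ramiel's implementation: the function name `linear_clusters`, the per-node `cost` estimates and the requirement that `nodes` arrive in topological order are all assumptions made for illustration. The sketch repeatedly peels the costliest remaining path off a dataflow DAG and treats each path as one sequential cluster that could run alongside the others.

```python
from collections import defaultdict


def linear_clusters(nodes, edges, cost):
    """Split a DAG into linear clusters by repeatedly removing its critical path.

    nodes: node names, assumed to be listed in topological order (hypothetical input)
    edges: list of (src, dst) dependency pairs
    cost:  dict mapping node name -> estimated execution cost (hypothetical)
    """
    order = list(nodes)
    succs = defaultdict(list)
    for src, dst in edges:
        succs[src].append(dst)

    remaining = set(order)
    clusters = []
    while remaining:
        best = {}  # cost of the costliest path starting at each remaining node
        nxt = {}   # next node on that path, or None at the end of the path
        # Walk in reverse topological order so successors are resolved first.
        for n in reversed(order):
            if n not in remaining:
                continue
            best[n], nxt[n] = cost[n], None
            for s in succs[n]:
                if s in remaining and cost[n] + best[s] > best[n]:
                    best[n], nxt[n] = cost[n] + best[s], s
        # The critical path starts at the node with the largest path cost;
        # ties go to the node that appears first in topological order.
        node = max((n for n in order if n in remaining), key=lambda n: best[n])
        path = []
        while node is not None:
            path.append(node)
            node = nxt[node]
        clusters.append(path)
        remaining.difference_update(path)
    return clusters


# A small diamond-shaped graph: a feeds b and c, both feed d, d feeds e.
example = linear_clusters(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")],
    {"a": 1, "b": 3, "c": 2, "d": 1, "e": 1},
)
print(example)
```

In the diamond example the first cluster is the path through `b`, and `c` ends up alone in a second cluster. Cross-cluster edges such as `a` to `c` and `c` to `d` would still need synchronization in a real runtime, which this sketch does not model.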

The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization

Author: Baghdadi Riyadh
Bastoul Cedric
Cohen Albert
Pouchet Louis-Noel
Rauchwerger Lawrence
Publication date: 01/01/2010

Research in automatic parallelization of loop-centric programs started with static analysis, then broadened its arsenal to include dynamic inspection-execution and speculative execution, the best results involving hybrid static-dynamic schemes. Beyond the detection of parallelism in a sequential program, scalable parallelization on many-core processors involves hard and interesting parallelism adaptation and mapping challenges. These challenges include tailoring data locality to the memory hierarchy, structuring independent tasks hierarchically to exploit multiple levels of parallelism, tuning the synchronization grain, balancing the execution load, decoupling the execution into thread-level pipelines, and leveraging heterogeneous hardware with specialized accelerators. The polyhedral framework makes it possible to model, construct and apply very complex loop nest transformations addressing most of the parallelism adaptation and mapping challenges. But apart from hardware-specific, back-end oriented transformations (if-conversion, trace scheduling, value prediction), loop nest optimization has essentially ignored dynamic and speculative techniques. Research in polyhedral compilation recently reached a significant milestone towards the support of dynamic, data-dependent control flow. This opens a large avenue for blending dynamic analyses and speculative techniques with advanced loop nest optimizations. Selecting real-world examples from SPEC benchmarks and numerical kernels, we make a case for the design of synergistic static, dynamic and speculative loop transformation techniques. We also sketch the embedding of dynamic information, including speculative assumptions, in the heart of affine transformation search spaces.

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

INRIA a CCSD electronic archive server

HAL-Rennes 1

A Comparative Analysis of STM Approaches to Reduction Operations in Irregular Applications

Author: Bienia
Blume
Dai Wang
DeVuyst
Feautrier
Felber
Foster
Gonzalez-Mesa
Gonzalez-Mesa
González
Gottschlich
Gutiérrez
Hall
Han
Han
Harris
Herlihy
Johnson
Mehrara
Morales
Mukherjee
Perez
Porter
Prabhu
Quislant
Rauchwerger
Rauchwerger
Ruan
Udupa
von Praun
Yu
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

As a recently consolidated paradigm for optimistic concurrency in modern multicore architectures, Transactional Memory (TM) can help exploit parallelism in irregular applications when data dependence information is not available until run time. This paper presents and discusses how to leverage TM to exploit parallelism in an important class of irregular applications, the class that exhibits irregular reduction patterns. In order to test and compare our techniques with other solutions, they were implemented in a software TM system called ReduxSTM, which acts as a proof of concept. ReduxSTM combines two major ideas: a sequential-equivalent ordering of transaction commits that ensures the correct result, and an extension of the underlying TM privatization mechanism to reduce unnecessary overhead due to reduction memory updates as well as unnecessary aborts and rollbacks. A comparative study of STM solutions, including ReduxSTM, and other more classical approaches to the parallelization of reduction operations is presented in terms of time, memory and overhead.

Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.

Crossref

Repositorio Institucional Universidad de Málaga
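
For readers unfamiliar with irregular reductions, the sketch below shows one of the classical approaches the abstract above compares against: array privatization. It is not ReduxSTM and uses no transactional memory. The function name `privatized_reduction`, the chunking scheme and the default `workers` value are illustrative assumptions. Each worker accumulates into its own private copy of the reduction array, so updates to indices that are only known at run time never conflict, and the private copies are merged in a fixed order afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def privatized_reduction(indices, values, size, workers=4):
    """Compute out[indices[i]] += values[i] using one private array per chunk."""

    def partial(chunk):
        private = [0] * size  # private copy: no other worker touches it
        for i in chunk:
            private[indices[i]] += values[i]
        return private

    n = len(indices)
    step = max(1, -(-n // workers))  # ceiling division, at least 1
    chunks = [range(lo, min(lo + step, n)) for lo in range(0, n, step)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the input chunks.
        partials = list(pool.map(partial, chunks))

    out = [0] * size
    for private in partials:  # merge in chunk order, so the result is deterministic
        for k, v in enumerate(private):
            out[k] += v
    return out


# The subscript array is only known at run time, which is what makes the loop irregular.
result = privatized_reduction([0, 2, 2, 1, 0, 2], [1, 2, 3, 4, 5, 6], size=3)
print(result)
```

The cost of this approach is memory: every chunk holds a full copy of the reduction array, which is one of the overheads the abstract says the comparative study measures.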

A Hybrid Approach to Proving Memory Reference Monotonicity

Author: C.E. Oancea
H. Yu
J. Hoeflinger
L. Rauchwerger
L. Rauchwerger
M. Berry
M.W. Hall
P. Feautrier
P. Feautrier
S. Rus
T. Fahringer
W. Blume
W. Blume
W. Pugh
Y. Lin
Y. Lin
Y. Paek
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

Crossref

Value Prediction and Speculative Execution on GPU

Author: Christine Eisenbeis
Jean-Luc Gaudiot
L. Hammond
L. Rauchwerger
M. Franklin
S. Liu
Shaoshan Liu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors

Author: Akkary H.
Cintra M.
Figueiredo R.
Garzarán M. J.
Gopal S.
Gupta M.
Hammond L.
Josep Torrellas
José María Llabería
Knight T.
Lawrence Rauchwerger
Marcuello P.
María Jesús Garzarán
Milos Prvulovic
Prvulovic M.
Rauchwerger L.
Rundberg P.
Sohi G. S.
Steffan J.
Tremblay M.
Víctor Viñals
Zhang Y.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

A phase I study of combination chemotherapy with gemcitabine and oral UFT for advanced non-small cell lung cancer

Author: A Sandler
C Martin
DH Ho
DR Rauchwerger
E Shimizu
FA Shepherd
H Cortes-Funes
H Wada
J Feliu
K Ota
LW Hertel
M Fukuoka
N Keicho
NG Hagag
PA Bunn Jr
PA Philip
RP Abratt
S Cascinu
S Madajewicz
U Gatzemeier
Y Ichinose
Y Ichinose
Y Maehara
Y Yamada
Publication venue: Nature Publishing Group

A phase I study was carried out to determine the optimal dose and administration schedule for combined UFT plus gemcitabine therapy in patients with non-small cell lung cancer. Twenty-four patients (including 11 patients previously treated with cisplatin as the key drug) received oral UFT 400 mg/m² on days 1 to 14 with intravenous infusions of gemcitabine (800 mg/m² on days 8 and 15, or 900 mg/m² on days 8 and 15, or 900 mg/m² on days 1, 8 and 15). The most appropriate dosing option appeared to be 400 mg/m² per day of oral UFT for 14 consecutive days with 900 mg/m² gemcitabine on days 8 and 15. Eight of the 24 patients achieved partial response. The combination chemotherapy UFT and gemcitabine was well tolerated and may benefit patients with advanced non-small cell lung cancer. A multicentre phase II study using a 3-weekly regimen is in progress.

Crossref

PubMed Central

Automatically Harnessing Sparse Acceleration

Sparse linear algebra is central to many scientific programs, yet compilers fail to optimize it well. High-performance libraries are available, but adoption costs are significant. Moreover, libraries tie programs into vendor-specific software and hardware ecosystems, creating non-portable code. In this paper, we develop a new approach based on our specification Language for implementers of Linear Algebra Computations (LiLAC). Rather than requiring the application developer to (re)write every program for a given library, the burden is shifted to a one-off description by the library implementer. The LiLAC-enabled compiler uses this to insert appropriate library routines without source code changes. LiLAC provides automatic data marshaling, maintaining state between calls and minimizing data transfers. Appropriate places for library insertion are detected in compiler intermediate representation, independent of source languages. We evaluated on large-scale scientific applications written in FORTRAN; standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across heterogeneous platforms, applications and data sets we show speedups of 1.1× to over 10× without user intervention.

Comment: Accepted to CC 202

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

Phase I trial of oral S-1 combined with gemcitabine in metastatic pancreatic cancer

Author: A Kobayashi
A Ohtsu
B Glimelius
CJ van Groeningen
D Evans
DR Rauchwerger
E Matano
GJ Peters
H Bruckner
H Kato
H Saisho
H Tadenuma
HA Burris III
J Van den Brande
JD Berlin
JD Berlin
JD Berlin
K Nakamura
K Sudo
K Tatsumi
M Hidalgo
ML Rothenberg
P Chollet
PM Hoff
S Cascinu
S Matsuno
S Okada
T Ishihara
T Saeki
T Shirasaka
T Shirasaka
T Taguchi
T Takechi
T Takechi
T Yamaguchi
W Koizumi
Y Inuyama
Y Sakata
Publication venue: Nature Publishing Group

The objective of this study was to determine the maximum tolerated dose (MTD) and dose-limiting toxicities (DLTs) of S-1, an oral fluorouracil derivative, combined with gemcitabine, the current standard treatment for advanced pancreatic cancer (APC). The subjects were histopathologically proven APC patients with distant metastasis. S-1 was administered orally twice daily for 14 days and gemcitabine on days 8 and 15 of each cycle, and this was repeated every 21 days. Doses of each drug were planned as follows: level 1: 800/60, level 2a: 800/80, level 2b: 1000/60, level 3: 1000/80 (gemcitabine (mg/m²)/S-1 (mg/m²/day)). In all, 21 patients with APC were enrolled. The main grade 3–4 toxicities observed during the first cycle were neutropenia (33%), anaemia (10%), thrombocytopenia (14%) and anorexia (10%). No DLT was observed in level 1. Three of six patients in level 2a had DLT and this level was considered the MTD. In all, 12 patients in level 2b had no DLT and this level was selected as the recommended dose. Applicable responses were one complete response and nine partial responses (48%). As toxicities were well tolerated and antitumour activities seem to be promising, this combination can be recommended for further phase II studies with APC.

Crossref

PubMed Central