Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications
Many scientific problems require multiple distinct computational tasks to be
executed in order to achieve a desired solution. We introduce the Ensemble
Toolkit (EnTK) to address the challenges of scale, diversity, and reliability that such applications pose. We describe the design and implementation of EnTK, characterize its
performance and integrate it with two distinct exemplar use cases: seismic
inversion and adaptive analog ensembles. We perform nine experiments,
characterizing EnTK overheads, strong and weak scalability, and the performance
of two use case implementations, at scale and on production infrastructures. We
show how EnTK meets the following general requirements: (i) implementing
dedicated abstractions to support the description and execution of ensemble
applications; (ii) support for execution on heterogeneous computing
infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv)
fault tolerance. We discuss novel computational capabilities that EnTK enables
and the scientific advantages arising thereof. We propose EnTK as an important addition to the suite of tools in support of production scientific computing.
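The "dedicated abstractions" the abstract refers to describe an ensemble application as ordered stages of independent tasks. The toy sketch below illustrates that style of pipeline-stage-task model in plain Python; the class names and `run` method are our own illustrative assumptions, not EnTK's actual API.

```python
from dataclasses import dataclass, field

# Illustrative pipeline-stage-task sketch (names are assumptions, not EnTK's API).
# Tasks within a stage are independent; stages execute in order.

@dataclass
class Task:
    name: str
    executable: callable  # no-argument callable standing in for a real executable

@dataclass
class Stage:
    tasks: list = field(default_factory=list)

@dataclass
class Pipeline:
    stages: list = field(default_factory=list)

    def run(self):
        results = []
        for stage in self.stages:
            # Tasks in a stage could run concurrently on an HPC resource;
            # here we just run them in sequence. Each stage is a barrier:
            # it must finish before the next stage starts.
            results.append([t.executable() for t in stage.tasks])
        return results

# A two-stage ensemble: four "simulations", then one "analysis" over them.
sims = Stage(tasks=[Task(f"sim{i}", lambda i=i: i * i) for i in range(4)])
analysis = Stage(tasks=[Task("analyse", lambda: "analysed")])
pipe = Pipeline(stages=[sims, analysis])
print(pipe.run())  # -> [[0, 1, 4, 9], ['analysed']]
```

A real ensemble toolkit would additionally handle resource acquisition, task placement, and fault recovery behind these abstractions, which is where the scale and reliability requirements above come in.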
A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
Scientific problems that depend on processing large amounts of data require
overcoming challenges in multiple areas: managing large-scale data
distribution, co-placement and scheduling of data with compute resources, and
storing and transferring large volumes of data. We analyze the ecosystems of
the two prominent paradigms for data-intensive applications, hereafter referred
to as the high-performance computing and the Apache-Hadoop paradigm. We propose
a basis, common terminology and functional factors upon which to analyze the
two approaches of both paradigms. We discuss the concept of "Big Data Ogres"
and their facets as means of understanding and characterizing the most common
application workloads found across the two paradigms. We then discuss the
salient features of the two paradigms, and compare and contrast the two
approaches. Specifically, we examine common implementation/approaches of these
paradigms, shed light upon the reasons for their current "architecture" and
discuss some typical workloads that utilize them. In spite of the significant
software distinctions, we believe there is architectural similarity. We discuss
the potential integration of different implementations, across the different
levels and components. Our comparison progresses from a fully qualitative
examination of the two paradigms, to a semi-quantitative methodology. We use a
simple and broadly used Ogre (K-means clustering), characterize its performance
on a range of representative platforms, covering several implementations from
both paradigms. Our experiments provide an insight into the relative strengths
of the two paradigms. We propose that the set of Ogres will serve as a
benchmark to evaluate the two paradigms along different dimensions.
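The K-means Ogre used in the semi-quantitative comparison is, at its core, Lloyd's algorithm: alternate assignment of points to their nearest centroid with recomputation of centroids as cluster means. The toy version below (our own sketch, not the paper's benchmark code, which ran on HPC and Hadoop implementations) shows those two steps.

```python
# Minimal Lloyd's-algorithm K-means on 2-D points (illustrative sketch only).

def kmeans(points, centroids, iters=10):
    """points, centroids: lists of (x, y) tuples; returns final centroids."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance, so no sqrt needed).
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if its cluster is empty).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(pts, centroids=[(0, 0), (10, 10)]))
```

The same two steps map naturally onto both paradigms: the assignment step is embarrassingly parallel (a map over points), while the update step is a reduction over clusters, which is why K-means is a convenient probe of their relative strengths.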
Combining Process Guidance and Industrial Feedback for Successfully Deploying Big Data Projects
Companies face the challenge of handling increasing amounts of digital data to run or improve their business. Although a large set of technical solutions is available to manage such Big Data, many companies lack the maturity to manage such projects, which results in a high failure rate. This paper aims to provide better process guidance for the successful deployment of Big Data projects. Our approach combines a set of methodological bricks documented in the literature, from early data mining projects to the present day. It is complemented by lessons learned from pilots conducted in different areas (IT, health, space, food industry), with a focus on two pilots that give a concrete vision of how to drive the implementation, with emphasis on the identification of values, the definition of a relevant strategy, the use of an Agile follow-up, and a progressive rise in maturity.
Biomarker Development for Advanced Prostate Cancer
Patients diagnosed with advanced prostate cancer starting long-term androgen deprivation therapy (ADT) follow a highly variable clinical course. Treatment intensification improves outcomes overall, but without biomarkers we overtreat some subgroups and are unable to direct the most effective treatment strategy to others.
I developed a protocol for biomarker discovery and evaluation, leveraging the STAMPEDE trial, in which donated clinical samples are associated with prospective clinical data. Genomic copy number alterations occur commonly in prostate cancer; however, the clinical implications of copy number change in advanced hormone-sensitive prostate cancer (HSPC) are unknown. I generated low-coverage whole-genome sequencing (WGS) data from FFPE tissue from participants in the control group of STAMPEDE and copy-number profiled 688 tumour regions from 300 participants to describe the association between the burden of copy number alteration and outcome.
The burden of copy number alteration was positively associated with radiologically evident distant metastases at diagnosis (P = 0.00006) and showed a non-linear relationship with clinical outcome on univariable and multivariable analysis, characterised by a sharp increase in the relative risk of progression (P = 0.003) and death (P = 0.045) for each unit increase, stabilising into more modest increases at higher burdens. This association between copy number burden and outcome was similar in each of the metastatic states. Copy number loss occurred significantly more frequently than gain in the lowest copy-number-burden quartile (q = 4.1×10^-6). Loss of segments in chromosome 5q21-22 and gains at 8q21-24, including CHD1 and cMYC respectively, occurred more frequently in cases with a higher burden of copy number alteration. Intra-patient variance in the burden of copy number alteration was associated with increased risk of distant metastases (Kruskal-Wallis test, P = 0.037).
In conclusion, copy number alteration at diagnosis in advanced prostate cancer is associated with increased risk of metastases, and accumulation of a limited number of copy number alterations is associated with most of the increased risk of disease progression and death.