Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications
Many scientific problems require multiple distinct computational tasks to be
executed in order to achieve a desired solution. We introduce the Ensemble
Toolkit (EnTK) to address the challenges of scale, diversity, and reliability that such applications pose. We describe the design and implementation of EnTK, characterize its
performance and integrate it with two distinct exemplar use cases: seismic
inversion and adaptive analog ensembles. We perform nine experiments,
characterizing EnTK overheads, strong and weak scalability, and the performance
of two use case implementations, at scale and on production infrastructures. We
show how EnTK meets the following general requirements: (i) implementing
dedicated abstractions to support the description and execution of ensemble
applications; (ii) support for execution on heterogeneous computing
infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv)
fault tolerance. We discuss novel computational capabilities that EnTK enables
and the scientific advantages arising thereof. We propose EnTK as an important addition to the suite of tools in support of production scientific computing.
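The "dedicated abstractions" the abstract refers to describe an ensemble application as ordered stages of independent tasks. The toy sketch below illustrates that style of pipeline-stage-task model in plain Python; the class names and `run` method are our own illustrative assumptions, not EnTK's actual API.

```python
from dataclasses import dataclass, field

# Illustrative pipeline-stage-task sketch (names are assumptions, not EnTK's API).
# Tasks within a stage are independent; stages execute in order.

@dataclass
class Task:
    name: str
    executable: callable  # no-argument callable standing in for a real executable

@dataclass
class Stage:
    tasks: list = field(default_factory=list)

@dataclass
class Pipeline:
    stages: list = field(default_factory=list)

    def run(self):
        results = []
        for stage in self.stages:
            # Tasks in a stage could run concurrently on an HPC resource;
            # here we just run them in sequence. Each stage is a barrier:
            # it must finish before the next stage starts.
            results.append([t.executable() for t in stage.tasks])
        return results

# A two-stage ensemble: four "simulations", then one "analysis" over them.
sims = Stage(tasks=[Task(f"sim{i}", lambda i=i: i * i) for i in range(4)])
analysis = Stage(tasks=[Task("analyse", lambda: "analysed")])
pipe = Pipeline(stages=[sims, analysis])
print(pipe.run())  # -> [[0, 1, 4, 9], ['analysed']]
```

A real ensemble toolkit would additionally handle resource acquisition, task placement, and fault recovery behind these abstractions, which is where the scale and reliability requirements above come in.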
A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
Scientific problems that depend on processing large amounts of data require
overcoming challenges in multiple areas: managing large-scale data
distribution, co-placement and scheduling of data with compute resources, and
storing and transferring large volumes of data. We analyze the ecosystems of
the two prominent paradigms for data-intensive applications, hereafter referred
to as the high-performance computing and the Apache-Hadoop paradigm. We propose
a basis, common terminology and functional factors upon which to analyze the
two approaches of both paradigms. We discuss the concept of "Big Data Ogres"
and their facets as means of understanding and characterizing the most common
application workloads found across the two paradigms. We then discuss the
salient features of the two paradigms, and compare and contrast the two
approaches. Specifically, we examine common implementation/approaches of these
paradigms, shed light upon the reasons for their current "architecture" and
discuss some typical workloads that utilize them. In spite of the significant
software distinctions, we believe there is architectural similarity. We discuss
the potential integration of different implementations, across the different
levels and components. Our comparison progresses from a fully qualitative
examination of the two paradigms, to a semi-quantitative methodology. We use a
simple and broadly used Ogre (K-means clustering), characterize its performance
on a range of representative platforms, covering several implementations from
both paradigms. Our experiments provide an insight into the relative strengths
of the two paradigms. We propose that the set of Ogres will serve as a
benchmark to evaluate the two paradigms along different dimensions.
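The K-means Ogre used in the semi-quantitative comparison is, at its core, Lloyd's algorithm: alternate assignment of points to their nearest centroid with recomputation of centroids as cluster means. The toy version below (our own sketch, not the paper's benchmark code, which ran on HPC and Hadoop implementations) shows those two steps.

```python
# Minimal Lloyd's-algorithm K-means on 2-D points (illustrative sketch only).

def kmeans(points, centroids, iters=10):
    """points, centroids: lists of (x, y) tuples; returns final centroids."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance, so no sqrt needed).
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if its cluster is empty).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(pts, centroids=[(0, 0), (10, 10)]))
```

The same two steps map naturally onto both paradigms: the assignment step is embarrassingly parallel (a map over points), while the update step is a reduction over clusters, which is why K-means is a convenient probe of their relative strengths.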
Combining Process Guidance and Industrial Feedback for Successfully Deploying Big Data Projects
Companies face the challenge of handling increasing amounts of digital data to run or improve their business. Although a large set of technical solutions is available to manage such Big Data, many companies lack the maturity to manage such projects, which results in a high failure rate. This paper aims to provide better process guidance for the successful deployment of Big Data projects. Our approach combines a set of methodological bricks documented in the literature, from early data mining projects to the present day. It is complemented by lessons learned from pilots conducted in different areas (IT, health, space, food industry), with a focus on two pilots that give a concrete vision of how to drive the implementation, with emphasis on the identification of values, the definition of a relevant strategy, the use of an Agile follow-up, and a progressive rise in maturity.
Biomarker Development for Advanced Prostate Cancer
Patients diagnosed with advanced prostate cancer starting long-term androgen deprivation therapy (ADT) follow a highly variable clinical course. Treatment intensification improves outcomes overall, but without biomarkers we overtreat some subgroups and are unable to direct the most effective treatment strategy to others.
I developed a protocol for biomarker discovery and evaluation, leveraging the STAMPEDE trial, in which donated clinical samples are associated with prospective clinical data. Genomic copy number alterations occur commonly in prostate cancer; however, the clinical implications of copy number change in advanced hormone-sensitive prostate cancer (HSPC) are unknown. I generated low-coverage whole-genome sequencing (WGS) data from FFPE tissue from participants in the control group of STAMPEDE and copy-number profiled 688 tumour regions from 300 participants to describe the association between the burden of copy number alteration and outcome.
The burden of copy number alteration was positively associated with radiologically evident distant metastases at diagnosis (P = 0.00006) and showed a non-linear relationship with clinical outcome on univariable and multivariable analysis, characterised by a sharp increase in the relative risk of progression (P = 0.003) and death (P = 0.045) for each unit increase, stabilising into more modest increases at higher burdens. This association between copy number burden and outcome was similar in each of the metastatic states. Copy number loss occurred significantly more frequently than gain in the lowest copy-number-burden quartile (q = 4.1×10^-6). Loss of segments in chromosome 5q21-22 and gains at 8q21-24, including CHD1 and cMYC respectively, occurred more frequently in cases with a higher burden of copy number alteration. Intra-patient variance in the burden of copy number alteration was associated with increased risk of distant metastases (Kruskal-Wallis test, P = 0.037).
In conclusion, copy number alteration at diagnosis in advanced prostate cancer is associated with increased risk of metastases, and accumulation of a limited number of copy number alterations is associated with most of the increased risk of disease progression and death.