203 research outputs found

    Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications

    Many scientific problems require multiple distinct computational tasks to be executed in order to achieve a desired solution. We introduce the Ensemble Toolkit (EnTK) to address the challenges of scale, diversity and reliability they pose. We describe the design and implementation of EnTK, characterize its performance and integrate it with two distinct exemplar use cases: seismic inversion and adaptive analog ensembles. We perform nine experiments, characterizing EnTK overheads, strong and weak scalability, and the performance of two use case implementations, at scale and on production infrastructures. We show how EnTK meets the following general requirements: (i) implementing dedicated abstractions to support the description and execution of ensemble applications; (ii) support for execution on heterogeneous computing infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv) fault tolerance. We discuss novel computational capabilities that EnTK enables and the scientific advantages arising thereof. We propose EnTK as an important addition to the suite of tools in support of production scientific computing.

    PGen: large-scale genomic variations analysis workflow and browser in SoyKB

    A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures

    Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigms. We propose a basis, common terminology and functional factors upon which to analyze the two paradigms. We discuss the concept of "Big Data Ogres" and their facets as a means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast them. Specifically, we examine common implementations of these paradigms, shed light upon the reasons for their current "architecture" and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering) and characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide an insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions. Comment: 8 pages, 2 figures

    Combining Process Guidance and Industrial Feedback for Successfully Deploying Big Data Projects

    Companies are faced with the challenge of handling increasing amounts of digital data to run or improve their business. Although a large set of technical solutions is available to manage such Big Data, many companies lack the maturity to manage such projects, which results in a high failure rate. This paper aims at providing better process guidance for a successful deployment of Big Data projects. Our approach is based on the combination of a set of methodological bricks documented in the literature, from early data mining projects to the present day. It is complemented by lessons learned from pilots conducted in different areas (IT, health, space, food industry), with a focus on two pilots giving a concrete vision of how to drive the implementation, with emphasis on the identification of values, the definition of a relevant strategy, the use of an Agile follow-up and a progressive rise in maturity.

    Biomarker Development for Advanced Prostate Cancer

    Patients diagnosed with advanced prostate cancer starting long term ADT follow a highly variable clinical course. Treatment intensification improves outcome overall, but without biomarkers we overtreat some subgroups and we are unable to direct the most effective treatment strategy to others. I developed a protocol for biomarker discovery and evaluation, leveraging the STAMPEDE trial, in which donated clinical samples are associated with prospective clinical data. Genomic copy number alterations commonly occur in prostate cancer; however, the clinical implication of copy number change in advanced HSPC is unknown. I generated low coverage WGS data from FFPE tissue from participants in the control group of STAMPEDE and copy number profiled 688 tumour regions from 300 participants to describe the association between the burden of copy number alteration and outcome. The burden of copy number alteration was positively associated with radiologically-evident distant metastases at diagnosis (P value = 0.00006) and showed a non-linear relationship with clinical outcome on univariable and multivariable analysis, characterised by a sharp increase in the relative risk of progression (P value = 0.003) and death (P value = 0.045) for each unit increase, stabilising into more modest increases with higher burdens. This association between copy number burden and outcome was similar in each of the metastatic states. Copy number loss occurred significantly more frequently than gain at the lowest copy number burden quartile (q = 4.1 × 10^-6). Loss of segments in chromosome 5q21-22 and gains at 8q21-24, respectively including CHD1 and cMYC, occurred more frequently in cases with higher copy number alteration. Intra-patient variance in the burden of copy number alteration was associated with increased risk of distant metastases (Kruskal-Wallis test P value = 0.037).
    In conclusion, copy number alteration at diagnosis in advanced prostate cancer is associated with increased risk of metastases, and accumulation of a limited number of copy number alterations accounts for most of the increased risk of disease progression and death.