16 research outputs found

    Low exposure long-baseline neutrino oscillation sensitivity of the DUNE experiment

    The Deep Underground Neutrino Experiment (DUNE) will produce world-leading neutrino oscillation measurements over the lifetime of the experiment. In this work, we explore DUNE's sensitivity to observe charge-parity violation (CPV) in the neutrino sector, and to resolve the mass ordering, for exposures of up to 100 kiloton-megawatt-calendar years (kt-MW-CY), where calendar years include an assumption of 57% accelerator uptime based on past accelerator performance at Fermilab. The analysis includes detailed uncertainties on the flux prediction, the neutrino interaction model, and detector effects. We demonstrate that DUNE will be able to unambiguously resolve the neutrino mass ordering at a 4σ (5σ) level with a 66 (100) kt-MW-CY far detector exposure, and has the ability to make strong statements at significantly shorter exposures depending on the true value of other oscillation parameters, with a median sensitivity of 3σ for almost all true δCP values after only 24 kt-MW-CY. We also show that DUNE has the potential to make a robust measurement of CPV at a 3σ level with a 100 kt-MW-CY exposure for the maximally CP-violating values δCP=±π/2. Additionally, the dependence of DUNE's sensitivity on the exposure taken in neutrino-enhanced and antineutrino-enhanced running is discussed. An equal fraction of exposure taken in each beam mode is found to be close to optimal when considered over the entire space of interest.

    Erratum to: 36th International Symposium on Intensive Care and Emergency Medicine

    [This corrects the article DOI: 10.1186/s13054-016-1208-6.]

    High-Performance Cluster-Scalable Computational Methods for Genomics Applications

    The ever-increasing pace of advancements in sequencing technologies has made rapid DNA/genome sequencing much more accessible. In particular, next (second) and third generation sequencing technologies offer high-throughput, massively parallel and cost-effective sequencing solutions. Individual sample sequencing data volumes as well as the number of assembled genomes are also growing quickly. These advances in high-throughput sequencing technologies, together with the demand for fast computational processing and downstream analysis of sequencing data in clinical settings, are widening the gap between the time spent in sample collection and sequencing versus computational analysis. To improve the scalability and performance of genome variant calling analysis workflows on modern computing systems, four research directions have been selected for further exploration in this dissertation. First, to exploit modern processor hardware features like multi-core and vector units in the GATK best practices variant calling pipelines, we introduce ArrowSAM, a columnar in-memory data format to place and process genomics data in memory, thus removing the need for repeated file storage accesses in intermediate variant calling pipeline applications. Our second contribution focuses on integrating the Apache Arrow based columnar in-memory data format in the PySpark API, to exploit the benefits of vectorized operations in the Python language using user-defined functions on Spark dataframes. For our third research contribution, we tested and benchmarked both the scalability and performance of Arrow Flight for client-server as well as cluster-scaled communication. For our final research contribution reported in this dissertation, we implemented an orthogonal approach that is even more scalable than Apache Spark and Arrow Flight based solutions and offers flexibility to use many different variant callers.

    Benchmarking Apache Arrow Flight - A wire-speed protocol for data transfer, querying and microservices

    Moving structured data between different big data frameworks and/or data warehouses/storage systems often causes significant overhead. In most cases, more than 80% of the total time spent accessing data goes to the serialization/de-serialization step. Columnar data formats are gaining popularity in both analytics and transactional databases. Apache Arrow, a unified columnar in-memory data format, promises to provide efficient data storage, access, manipulation and transport. In addition, with the introduction of the Arrow Flight communication capabilities, which are built on top of gRPC, Arrow enables high-performance data transfer over TCP networks. Arrow Flight allows parallel Arrow RecordBatch transfer over networks in a platform- and language-independent way, and offers high performance, parallelism and security based on open-source standards. In this paper, we bring together some recently implemented use cases of Arrow Flight with their benchmarking results. These use cases include bulk Arrow data transfer, querying subsystems and Flight as a microservice integration into different frameworks to show the throughput and scalability results of this protocol. We show that Flight is able to achieve up to 6000 MB/s and 4800 MB/s throughput for DoGet() and DoPut() operations respectively. On Mellanox ConnectX-3 or Connect-IB interconnect nodes Flight can utilize up to 95% of the total available bandwidth. Flight is scalable and can use up to half of the available system cores efficiently for a bidirectional communication. For query systems like Dremio, Flight is an order of magnitude faster than ODBC and turbodbc protocols. An Arrow Flight based implementation on Dremio performs 20x and 30x better as compared to turbodbc and ODBC connections respectively. We briefly outline some recent Flight based use cases both in big data frameworks like Apache Spark and Dask and remote Arrow data processing tools.
We also discuss some limitations and future outlook of Apache Arrow and Arrow Flight as a whole.

    VC@Scale: Scalable and high-performance variant calling on cluster environments

    Background: Recently, many new deep learning–based variant-calling methods like DeepVariant have emerged as more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational costs. Therefore, there is a need for more scalable and higher performance workflows of these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark just for distributing/scheduling data among loosely coupled applications or using I/O-based storage for storing the output of intermediate applications does not exploit the full benefit of Apache Spark in-memory processing. To achieve this, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between different workflow stages. This benefits from the ease of programmability of Python and the high efficiency of Arrow’s columnar in-memory data transformations. Results: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by >2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant for both CPU-only and CPU + GPU clusters. Conclusions: We show the feasibility and easy scalability of our approach to achieve high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations.
All code, scripts, and configurations used to run our implementations are publicly available and open sourced; see https://github.com/abs-tudelft/variant-calling-at-scale.
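
The two pre-processing steps named in the abstract above, sorting reads by coordinate and marking duplicates, can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation. The `Read` record, the function names, and the duplicate rule (same chromosome, start position and strand, keeping the read with the highest summed base quality) are simplifying assumptions made for illustration only.

```python
from collections import namedtuple

# Hypothetical minimal read record; real SAM records carry many more fields.
Read = namedtuple("Read", ["name", "chrom", "pos", "reverse", "base_quals"])


def sort_by_coordinate(reads):
    """Return reads ordered by (chromosome, position)."""
    return sorted(reads, key=lambda r: (r.chrom, r.pos))


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    Simplifying assumption: reads are duplicates when they share chromosome,
    start position and strand. The read with the highest summed base quality
    is kept and the others are flagged.
    """
    best = {}           # (chrom, pos, strand) -> best read seen so far
    duplicates = set()
    for r in reads:
        key = (r.chrom, r.pos, r.reverse)
        current = best.get(key)
        if current is None:
            best[key] = r
        elif sum(r.base_quals) > sum(current.base_quals):
            duplicates.add(current.name)
            best[key] = r
        else:
            duplicates.add(r.name)
    return duplicates


reads = [
    Read("r1", "chr1", 100, False, [30, 30, 30]),
    Read("r2", "chr1", 100, False, [35, 35, 35]),
    Read("r3", "chr1", 50, False, [20, 20, 20]),
    Read("r4", "chr2", 10, True, [40, 40, 40]),
]
sorted_reads = sort_by_coordinate(reads)
dups = mark_duplicates(sorted_reads)
```

In a cluster setting the same grouping key would typically drive a distributed sort or group-by rather than an in-process dictionary.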

    Communication-Efficient Cluster Scalable Genomics Data Processing Using Apache Arrow Flight

    Current cluster-scaled genomics data processing solutions rely on big data frameworks like Apache Spark, Hadoop and HDFS for data scheduling, processing and storage. These frameworks come with additional computation and memory overheads by default. It has been observed that scaling genomics dataset processing beyond 32 nodes is not efficient on such frameworks. To overcome the inefficiencies of big data frameworks for processing genomics data on clusters, we introduce a low-overhead and highly scalable solution on a SLURM based HPC batch system. This solution uses Apache Arrow as an in-memory columnar data format to store genomics data efficiently and Arrow Flight as a network protocol to move and schedule this data across the HPC nodes with low communication overhead. As a use case, we use NGS short-read DNA sequencing data for pre-processing and variant calling applications. This solution outperforms existing Apache Spark based big data solutions in terms of both computation time (2x) and communication overhead (lower by more than 20-60%, depending on cluster size). Our solution has similar performance to MPI-based HPC solutions, with the added advantage of easy programmability and transparent big data scalability. The whole solution is Python and shell script based, which makes it flexible to update and integrate alternative variant callers. Our solution is publicly available on GitHub at https://github.com/abs-tudelft/time-to-fly-high/tree/main/genomics

    ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow

    The rapidly growing size of genomics databases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data closer to the processor for fast processing. Traditionally, due to cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, newly emerging storage class memories allow storing and processing big data closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be presented in the Apache Arrow in-memory data representation to benefit from in-memory processing and to ensure better scalability through shared memory objects, by avoiding large (de)-serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we propose ArrowSAM, an in-memory SAM format that uses the Apache Arrow framework, and integrate it into genome pre-processing pipelines including BWA-MEM, Picard and Sambamba. Results show 15x and 2.4x speedups as compared to Picard and Sambamba, respectively. The code and scripts for running all workflows are freely available at https://github.com/abs-tudelft/ArrowSAM.
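
The idea of presenting SAM records in a columnar layout can be sketched without any library. The example below is an illustration only, not the ArrowSAM code: it uses plain Python lists in place of Arrow arrays, shows just three of the SAM fields (QNAME, RNAME, POS), and the helper name `to_columns` is hypothetical.

```python
# Row-oriented layout: one tuple per SAM record (only three fields shown).
rows = [
    ("read1", "chr1", 100),
    ("read2", "chr1", 250),
    ("read3", "chr2", 40),
]


def to_columns(rows, names):
    """Transpose row tuples into a dict of per-field lists (columnar layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


columns = to_columns(rows, ["qname", "rname", "pos"])

# A filter on two fields reads only those two columns and never touches
# the read names, which is the access pattern columnar formats favour.
chr1_positions = [
    pos for rname, pos in zip(columns["rname"], columns["pos"]) if rname == "chr1"
]
```

A real columnar format additionally stores each column in a contiguous typed buffer, which is what allows processes to share it without re-serializing.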

    Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework

    Background: Immense improvements in sequencing technologies enable producing large amounts of high-throughput and cost-effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for further downstream analyses. Computing systems need these large amounts of data closer to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access, so processing this data incurs huge disk I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, recent developments in storage-class memory and non-volatile memory technologies have enabled computing systems to place huge data sets in memory and process them directly from memory to avoid disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, properly formatted data placement in memory and high-throughput access to it are necessary, avoiding (de)-serialization and copy overheads between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in memory without having to access disk storage, avoiding (de)-serialization and copy overheads. Implementation: We integrate an Apache Arrow based in-memory Sequence Alignment/Map (SAM) format and its shared memory objects store library into widely used genomics high-throughput data processing applications like BWA-MEM, Picard and GATK to allow in-memory communication between these applications. In addition, this also allows us to exploit the cache locality of tabular data and parallel processing capabilities through shared memory objects.
Results: Our implementation shows that adopting an in-memory SAM representation in genomics high-throughput data processing applications results in better system resource utilization, fewer memory accesses due to high cache locality exploitation, and parallel scalability due to shared memory objects. Our implementation focuses on the GATK best practices recommended workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare a number of existing in-memory data placing and sharing techniques like ramDisk and Unix pipes to show how columnar in-memory data representation outperforms both. We achieve a speedup of 4.85x and 4.76x for WGS and WES data, respectively, in overall execution time of variant calling workflows. Similarly, a speedup of 1.45x and 1.27x for these data sets, respectively, is achieved, as compared to the second fastest workflow. In some individual tools, particularly in sorting, duplicates removal and base quality score recalibration, the speedup is even more promising. Availability: The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM.

    SALoBa: Maximizing Data Locality and Workload Balance for Fast Sequence Alignment on GPUs

    Sequence alignment forms an important backbone in many sequencing applications. A commonly used strategy for sequence alignment is approximate string matching with a two-dimensional dynamic programming approach. Although some prior work has been conducted on GPU acceleration of sequence alignment, we identify several shortcomings that limit exploiting the full computational capability of modern GPUs. This paper presents SALoBa, a GPU-accelerated sequence alignment library focused on seed extension. Based on the analysis of previous work with real-world sequencing data, we propose techniques to exploit the data locality and improve workload balancing. The experimental results reveal that SALoBa significantly improves the seed extension kernel compared to state-of-the-art GPU-based methods.
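
The two-dimensional dynamic programming approach mentioned in the abstract above can be sketched with a textbook Smith-Waterman local alignment score. This is a CPU-only illustration, not SALoBa's GPU kernel. The function name, the linear gap penalty, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are assumptions chosen for the example.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    # Only the previous row of the DP matrix is needed to fill the next one.
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            same = query[i - 1] == target[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # A cell extends a diagonal, takes a gap in either sequence,
            # or restarts at 0 (the restart makes the alignment local).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = local_alignment_score("ACGT", "ACGT")
single_match = local_alignment_score("ACGT", "TTTT")
no_match = local_alignment_score("AAAA", "CCCC")
```

Each cell depends only on its left, upper and upper-left neighbours, so cells along an anti-diagonal are independent of one another, which is the property parallel implementations generally rely on.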