Covering Vehicle Routing Problem: Application for Mobile Child Friendly Spaces for Refugees
Funded by TUBITAK under Grant Number 216M380
Scaling Generalized N-Body Problems, A Case Study from Genomics
This work examines a data-intensive irregular application from genomics that represents a type of Generalized N-Body problem, one of the "seven giants" of the NRC Big Data motifs. In this problem, computations (genome alignments) are performed on sparse, data-dependent pairs of inputs, with variable computation cost and variable datum sizes. Unlike simulation-based N-Body problems, there is no inherent locality in the pairwise interactions, and the interaction sparsity depends on particular parameters of the input, which can also affect the quality of the output. We build on a pre-existing bulk-synchronous implementation that uses collective communication in MPI, and implement a new asynchronous one using cross-node RPCs in UPC++. We establish the intranode comparability and efficiency of both, scaling from one to all cores on a node. Then we evaluate multinode scalability from 1 node to 512 nodes (32,768 cores) of NERSC's Cray XC40 with Intel Xeon Phi "Knights Landing" nodes. With real workloads, we examine the load balance of the irregular computation and communication, and the costs of many small asynchronous messages versus few large aggregated messages, in both latency and overall application memory footprint. While both implementations demonstrate good scaling, the study reveals some of the programming and architectural challenges of scaling this type of data-intensive irregular application, and contributes code that can be used in genomics pipelines or in benchmarking for data analytics more broadly.
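The data-dependent sparsity the abstract describes can be illustrated with a toy sketch (ours, not the paper's code): pairs of reads interact only when they share a k-mer, so which pairwise "alignments" exist is determined by the input itself rather than by any spatial locality. The function names here are hypothetical, for illustration only.

```python
from collections import defaultdict
from itertools import combinations

def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_pairs(reads, k=3):
    """Index reads by their k-mers; only read pairs sharing at least one
    k-mer become interaction (alignment) tasks, so the pairwise
    'interaction matrix' is sparse and data-dependent."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for km in kmers(seq, k):
            index[km].add(rid)
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs

reads = ["ACGTAC", "GTACGG", "TTTTTT"]
pairs = candidate_pairs(reads)  # only reads 0 and 1 share k-mers
```

Note how changing k (an input parameter) changes the sparsity, and through it both the cost and the quality of the downstream alignments, exactly the sensitivity the abstract points out.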
Distributed-memory k-mer counting on GPUs
A fundamental step in many bioinformatics computations is counting the frequency of fixed-length sequences, called k-mers, a problem that has received considerable attention as an important target for shared-memory parallelization. With datasets growing at an exponential rate, distributed-memory parallelization is becoming increasingly critical. Existing distributed-memory k-mer counters do not take advantage of GPUs for accelerating computation, nor do they employ domain-specific optimizations to reduce communication volume in a distributed environment. In this paper, we present the first GPU-accelerated distributed-memory parallel k-mer counter. We identify communication volume as the major bottleneck in scaling k-mer counting to multiple GPU-equipped compute nodes and implement a supermer-based optimization to reduce communication volume and enhance scalability. Our empirical analysis examines the balance of communication to computation on a state-of-the-art system, the Summit supercomputer at Oak Ridge National Lab. Results show overall speedups of up to two orders of magnitude with GPU optimization over CPU-based k-mer counters. Furthermore, we show an additional 1.5× speedup using the supermer-based communication optimization.
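The supermer idea behind the communication optimization can be sketched in a few lines (a simplified serial illustration, not the paper's GPU code; the function names are ours): consecutive k-mers that share the same minimizer are packed into one longer string, so the same k-mer content travels in fewer bytes and fewer messages.

```python
def minimizer(kmer, m):
    """Lexicographically smallest length-m substring of a k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def supermers(seq, k, m):
    """Pack maximal runs of consecutive k-mers sharing a minimizer into
    one supermer string, cutting per-k-mer communication overhead."""
    sms = []
    start, prev = 0, minimizer(seq[0:k], m)
    for i in range(1, len(seq) - k + 1):
        cur = minimizer(seq[i:i + k], m)
        if cur != prev:
            sms.append(seq[start:i - 1 + k])  # close the current run
            start, prev = i, cur
    sms.append(seq[start:])  # final run
    return sms
```

Every k-mer of the original sequence is recoverable from exactly one supermer, so the receiving rank can unpack and count as before, while the bytes on the wire shrink because overlapping k-mers are no longer sent individually.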
Parallel string graph construction and transitive reduction for de novo genome assembly
One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates a consensus. Despite the many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for the overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed-memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near-linear scaling with over 80% parallel efficiency for the human genome, reducing the runtime of overlap detection by 1.2-1.3× for the human genome and 1.5-1.9× for C. elegans compared to the state of the art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5-13.3× for the human genome and 18-29× for C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.
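The transitive-reduction step can be illustrated with a small serial sketch (ours, not diBELLA 2D's 2D sparse-matrix code): over the Boolean semiring, squaring the adjacency relation yields two-hop reachability, and a direct overlap edge that is also reachable in two hops is redundant for the string graph. This toy version does a single two-hop round with adjacency sets standing in for a sparse matrix; a full transitive reduction would iterate over longer paths.

```python
def transitive_reduction(adj):
    """Drop edge (u, w) when a two-hop path u -> v -> w exists: the
    Boolean-semiring analogue of removing A-squared's pattern from A.
    adj maps each vertex to its set of successors."""
    two_hop = {u: set() for u in adj}
    for u, outs in adj.items():
        for v in outs:
            two_hop[u] |= adj.get(v, set())  # successors of successors
    return {u: outs - two_hop[u] for u, outs in adj.items()}

# Overlap graph: read 0 overlaps reads 1 and 2, read 1 overlaps read 2.
# The direct edge 0 -> 2 is implied by 0 -> 1 -> 2 and gets removed.
string_graph = transitive_reduction({0: {1, 2}, 1: {2}, 2: set()})
```

Expressing this as sparse matrix products is what lets the pipeline reuse its 2D-distributed linear-algebra machinery for the layout step.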
Potential application of them: anode materials for Li-ion batteries
Nowadays, doped graphenes are attracting much interest in the field of Li-ion batteries, since they show higher specific capacity than the widely used graphite. However, synthesis methods for doped graphenes involve secondary processes that require much energy. In this study, in situ synthesis of N-doped graphene powders by a cyclic voltammetric method, starting from a graphite rod in nitric acid solution, is discussed for the first time in the literature. N-containing functional groups such as nitro groups, pyrrolic N, and pyridinic N were selectively prepared by changing the scanned potential ranges in cyclic voltammetry. The electrochemical performance as an anode material in Li-ion batteries is also covered in this study. The N-doped graphene powders were characterized by electrochemical, spectroscopic, and microscopic methods. According to the X-ray photoelectron spectroscopy and Raman results, the N-doped graphene powders have approximately 16 to 18 graphene rings in their main structure. Electrochemical analysis of the graphene powders synthesized at different potential ranges showed that the highest capacity, 438 mAh/g after 10 cycles, was obtained at a current density of 50 mA/g for N-GP4. Furthermore, the sample with the larger defect size shows better specific capacity. However, a more stable structure, owing to oxygen content and smaller defect size, improves the rate capability; thus, the results obtained at high current density indicated that the remaining capacity of N-GP1 was higher than that of the others.
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime of key bioinformatics applications. It is particularly expensive for third-generation sequences due to the high computational cost of analyzing sequences between 1 Kb and 1 Mb in length. Given the quadratic cost of exact pairwise algorithms for long alignments, the community primarily relies on approximate algorithms that search only for high-quality alignments and stop early when one is not found. In this work, we present the first GPU optimization of the popular X-drop alignment algorithm, which we named LOGAN. Results show that our high-performance multi-GPU implementation achieves up to 181.6 GCUPS and speedups of up to 6.6× and 30.7× using 1 and 6 NVIDIA Tesla V100 GPUs, respectively, over the state-of-the-art software running on two IBM Power9 processors with 168 CPU threads, with equivalent accuracy. We also demonstrate a 2.3× LOGAN speedup versus ksw2, a state-of-the-art vectorized algorithm for sequence alignment implemented in minimap2, a long-read mapping software. To highlight the impact of our work on a real-world application, we couple LOGAN with a many-to-many long-read alignment software called BELLA, and demonstrate that our implementation improves the overall BELLA runtime by up to 10.6×. Finally, we adapt the Roofline model for LOGAN and demonstrate that our implementation is near optimal on the NVIDIA Tesla V100.
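The X-drop principle itself is easy to sketch. The toy below (ours; LOGAN implements the gapped, GPU-parallel version) does ungapped seed extension and terminates once the running score falls more than x below the best score seen so far, which is what lets the method stop early on alignments that have already failed.

```python
def xdrop_extend(a, b, x=3, match=1, mismatch=-1):
    """Ungapped extension with X-drop termination: walk both sequences
    in lockstep, keep the best score seen, and stop once the running
    score drops more than x below it."""
    score = best = best_len = 0
    for i in range(min(len(a), len(b))):
        score += match if a[i] == b[i] else mismatch
        if score > best:
            best, best_len = score, i + 1  # extend the reported prefix
        if best - score > x:
            break  # X-drop: this extension is not recoverable
    return best, best_len

s, l = xdrop_extend("ACGTACGT", "ACGTTTTT", x=2)  # best score 4 over 4 bases
```

A smaller x prunes more aggressively (cheaper, possibly missing alignments); a larger x explores further before giving up, the same sensitivity/cost trade-off tuned in the full gapped algorithm.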