Search CORE

73 research outputs found

GPU-Accelerated BWT Construction for Large Collection of Short Reads

Author: Lam Tak-Wah
Liu Chi-Man
Luo Ruibang
Publication venue
Publication date: 29/01/2014
Field of study

Advances in DNA sequencing technology have stimulated the development of algorithms and tools for processing very large collections of short strings (reads). Short-read alignment and assembly are among the most well-studied problems. Many state-of-the-art aligners, at their core, have used the Burrows-Wheeler transform (BWT) as a main-memory index of a reference genome (typical example, NCBI human genome). Recently, BWT has also found its use in string-graph assembly, for indexing the reads (i.e., raw data from DNA sequencers). In a typical data set, the volume of reads is tens of times of the sequenced genome and can be up to 100 Gigabases. Note that a reference genome is relatively stable and computing the index is not a frequent task. For reads, the index has to computed from scratch for each given input. The ability of efficient BWT construction becomes a much bigger concern than before. In this paper, we present a practical method called CX1 for constructing the BWT of very large string collections. CX1 is the first tool that can take advantage of the parallelism given by a graphics processing unit (GPU, a relative cheap device providing a thousand or more primitive cores), as well as simultaneously the parallelism from a multi-core CPU and more interestingly, from a cluster of GPU-enabled nodes. Using CX1, the BWT of a short-read collection of up to 100 Gigabases can be constructed in less than 2 hours using a machine equipped with a quad-core CPU and a GPU, or in about 43 minutes using a cluster with 4 such machines (the speedup is almost linear after excluding the first 16 minutes for loading the reads from the hard disk). The previously fastest tool BRC is measured to take 12 hours to process 100 Gigabases on one machine; it is non-trivial how BRC can be parallelized to take advantage a cluster of machines, let alone GPUs.Comment: 11 page

arXiv.org e-Print Archive

CiteSeerX

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Author: Lam Tak-Wah
Li Dinghua
Liu Chi-Man
Luo Ruibang
Sadakane Kunihiko
Publication venue
Publication date: 23/12/2014
Field of study

MEGAHIT is a NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252Gbps in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., it avoids pre-processing like partitioning and normalization, which might compromise on result integrity. MEGAHIT generates 3 times larger assembly, with longer contig N50 and average contig length than the previous assembly. 55.8% of the reads were aligned to the assembly, which is 4 times higher than the previous. The source code of MEGAHIT is freely available at https://github.com/voutcn/megahit under GPLv3 license.Comment: 2 pages, 2 tables, 1 figure, submitted to Oxford Bioinformatics as an Application Not

arXiv.org e-Print Archive

HKU Scholars Hub

16GT: A fast and sensitive variant caller using a 16-genotype probabilistic model

Author: Luo Ruibang
Salzberg Steven L.
Schatz Michael C.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 24/02/2017
Field of study

Â© The Author 2017. Published by Oxford University Press. 16GT is a variant caller for Illumina whole-genome and whole-exome sequencing data. It uses a new 16-genotype probabilistic model to unify single nucleotide polymorphism and insertion and deletion calling in a single variant calling algorithm. In benchmark comparisons with 5 other widely used variant callers on a modern 36-core server, 16GT demonstrated improved sensitivity in calling single nucleotide polymorphisms, and it provided comparable sensitivity and accuracy for calling insertions and deletions as compared to the GATK HaplotypeCaller. 16GT is available at https://github.com/aquaskyline/16GT.Link_to_subscribed_fulltex

HKU Scholars Hub

BASE: a practical de novo assembler for large genomes using long NGS reads

Author: Binghang Liu
Chi-Man Liu
Dinghua Li
Hing-Fung Ting
Ruibang Luo
Siu-Ming Yiu
Tak-Wah Lam
Yingrui Li
Publication venue: Springer Nature
Publication date: 01/01/2016
Field of study

Â© 2016 The Author(s). Background: De novo genome assembly using NGS data remains a computation-intensive task especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads. Methods: This paper shows a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have high probability to appear uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove the branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. Results: Experiments on two bacteria and four human datasets shows the advantage of BASE in both contig quality and speed in dealing with longer reads. In the experiment on bacteria, two datasets with read length of 100 bp and 250 bp were used. Especially for the 250 bp dataset, BASE gives much better quality than SOAPdenovo2 and SGA and is simlilar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and Soapdenov2 are further compared using human datasets with read length 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, while the improvement becomes more significant when read length reaches 250 bp. Besides, BASE is more-meory efficent than SOAPdenovo2 when sequencing data with error rate. Conclusions: BASE is a practically efficient tool for constructing contig, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.published_or_final_versio

Springer - Publisher Connector

PubMed Central

HKU Scholars Hub

SOAP3-dp: Fast, Accurate and Sensitive GPU-based Short Read Aligner

Author: Chang Yu
Chi-Man Liu
David W Cheung
Edward Wu
Haoxiang Lin
Hing-Fung Ting
Jianqiao Zhu
Lap-Kei Lee
Ruibang Luo
Ruiqiang Li
Shaoliang Peng
Siu-Ming Yiu
Tak-Wah Lam
Thomas Wong
Wenjuan Zhu
Xiaoqian Zhu
Yingrui Li
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2013
Field of study

To tackle the exponentially increasing throughput of Next-Generation Sequencing (NGS), most of the existing short-read aligners can be configured to favor speed in trade of accuracy and sensitivity. SOAP3-dp, through leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, GEM and GPU-based aligners including BarraCUDA and CUSHAW, SOAP3-dp is two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads with different lengths. Transcending its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60 percent. Real data evaluation using human genome demonstrates SOAP3-dp's power to enable more authentic variants and longer Indels to be discovered. Fosmid sequencing shows a 9.1 percent FDR on newly discovered deletions. SOAP3-dp natively supports BAM file format and provides a scoring scheme same as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and Tianhe-1A.Comment: 21 pages, 6 figures, submitted to PLoS ONE, additional files available at "https://www.dropbox.com/sh/bhclhxpoiubh371/O5CO_CkXQE". Comments most welcom

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

HKU Scholars Hub

FigShare

MegaGTA: A sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs

Author: Huang Yukun
Lam Tak Wah
Leung Chi Ming
Li Dinghua
Luo Ruibang
Ting Hing Fung
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017
Field of study

Â© 2017 The Author(s). Background: The recent release of the gene-targeted metagenomics assembler Xander has demonstrated that using the trained Hidden Markov Model (HMM) to guide the traversal of de Bruijn graph gives obvious advantage over other assembly methods. Xander, as a pilot study, indeed has a lot of room for improvement. Apart from its slow speed, Xander uses only 1 k-mer size for graph construction and whatever choice of k will compromise either sensitivity or accuracy. Xander uses a Bloom-filter representation of de Bruijn graph to achieve a lower memory footprint. Bloom filters bring in false positives, and it is not clear how this would impact the quality of assembly. Xander does not keep track of the multiplicity of k-mers, which would have been an effective way to differentiate between erroneous k-mers and correct k-mers. Results: In this paper, we present a new gene-targeted assembler MegaGTA, which attempts to improve Xander in different aspects. Quality-wise, it utilizes iterative de Bruijn graphs to take full advantage of multiple k-mer sizes to make the best of both sensitivity and accuracy. Computation-wise, it employs succinct de Bruijn graphs (SdBG) to achieve low memory footprint and high speed (the latter is benefited from a highly efficient parallel algorithm for constructing SdBG). Unlike Bloom filters, an SdBG is an exact representation of a de Bruijn graph. It enables MegaGTA to avoid false-positive contigs and to easily incorporate the multiplicity of k-mers for building better HMM model. We have compared MegaGTA and Xander on an HMP-defined mock metagenomic dataset, and showed that MegaGTA excelled in both sensitivity and accuracy. On a large rhizosphere soil metagenomic sample (327Gbp), MegaGTA produced 9.7-19.3% more contigs than Xander, and these contigs were assigned to 10-25% more gene references. In our experiments, MegaGTA, depending on the number of k-mers used, is two to ten times faster than Xander. Conclusion: MegaGTA improves on the algorithm of Xander and achieves higher sensitivity, accuracy and speed. Moreover, it is capable of assembling gene sequences from ultra-large metagenomic datasets. Its source code is freely available at https://github.com/HKU-BAL/megagta.Link_to_subscribed_fulltex

Directory of Open Access Journals

HKU Scholars Hub

A Gramaticalização do Verbo Ir e a Variação de Formas para Expressar o Futuro do Presente: uma Fotografia Capixaba

Author: Chi-Man Liu
Jeanno Cheung
Lap-Kei Lee
Ruibang Luo
Tak-Wah Lam
Wai-Chun Law
Yiu-Lun Wong
Publication venue: Mestrado em Estudos Linguísticos
Publication date: 29/04/2008
Field of study

Esta pesquisa verifica o estágio do processo de gramaticalização do verbo IR, que tem assumido a função de auxiliar em construções perifrásticas para expressar tempo. Para isso, investiga-se a variação entre as formas sintética e perifrástica com IR para expressão do futuro do presente. Temos por hipótese que a forma perifrástica já atinge todos os gêneros das duas modalidades da língua, uma vez que já se especializou para codificar tempo. São examinados dois gêneros, tomando-os como prototípicos do continuun oral/escrito: entrevistas com informantes universitários e editoriais de jornal. Partindo de uma orientação teórica Funcionalista, num quadro mais geral, concebe-se a língua como flexível ao uso, passível de influências cognitivas, sociais e também individuais, embora haja nela forças que atuam no sentido de regularizar a estrutura. Seguindo algumas pesquisas que têm se mostrado frutíferas, o modelo funcionalista estará em diálogo com outro modelo que procura dar conta da heterogeneidade estruturada da língua e de seus processos de mudança: a Teoria Variacionista. Num quadro mais específico, os fundamentos que orientam a pesquisa são os da Gramaticalização. Os dados extraídos dos gêneros selecionados serão submetidos ao programa computacional GOLDVARB 2001 e, em seguida, interpretados à luz das teorias lingüísticas que fundamentam esta pesquisa

Directory of Open Access Journals

PubMed Central

Repositório Institucional da Universidade Federal do Espirito Santo

HKU Scholars Hub

SOAP3-dp: Fast, Accurate and Sensitive GPU-Based Short Read Aligner

Author: Cheung David W
Lee Lap-Kei
Lin Haoxiang
Liu Chi-Man
Luo Ruibang
Ting Hing-Fung
Wong King Fung (Thomas)
Wu Edward
Yiu Siu-Ming
Zhu Jianqiao
Zhu Wenjuan
Zhu Xiaoqian
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 29/11/2018
Field of study

To tackle the exponentially increasing throughput of Next-Generation Sequencing (NGS), most of the existing short-read aligners can be configured to favor speed in trade of accuracy and sensitivity. SOAP3-dp, through leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, CUSHAW2, GEM and GPU-based aligners BarraCUDA and CUSHAW, SOAP3-dp was found to be two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads with different lengths. Transcending its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60%. Real data evaluation using human genome demonstrates SOAP3-dp's power to enable more authentic variants and longer Indels to be discovered. Fosmid sequencing shows a 9.1% FDR on newly discovered deletions. SOAP3-dp natively supports BAM file format and provides the same scoring scheme as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and Tianhe-1A

The Australian National University

MICA: A fast short-read aligner that takes full advantage of Many Integrated Core Architecture (MIC)

Author: Chan Sze-Hang
Cheung Jeanno
He Guangzhu
Lam Tak-Wah
Law Wai-Chun
Li Ruiqiang
Li Yingrui
Liu Chi-Man
Luo Ruibang
Peng Shaoliang
Wang Heng
Wang Jun
Wu Edward
Yu Chang
Zhou Dazong
Zhu Xiaoqian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015
Field of study

Background: Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPU. An uprising alterative to GPU is Intel MIC; supercomputers like Tianhe-2, currently top of TOP500, is built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores). Results: To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner MICA that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM). Summary: MICA can be readily used by MIC-enabled supercomputers for production purpose. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development. Availability and implementation: MICA's source code is freely available at http://sourceforge.net/projects/mica-aligner under GPL v3. Supplementary information: Supplementary information is available as "Additional File 1". Datasets are available at www.bio8.cs.hku.hk/dataset/mica.published_or_final_versio

Crossref

PubMed Central

HKU Scholars Hub

University of Queensland eSpace