23 research outputs found

    Artificial and natural duplicates in pyrosequencing reads of metagenomic data

    Background: Artificial duplicates among pyrosequencing reads may lead to incorrect interpretation of the abundance of species and genes in metagenomic studies. Duplicated reads have been filtered out in many metagenomic projects. However, since the duplicated reads observed in a pyrosequencing run also include natural (non-artificial) duplicates, simply removing all duplicates may cause underestimation of the abundance associated with natural duplicates.
    Results: We implemented a method for identifying exact and nearly identical duplicates among pyrosequencing reads. This method performs an all-against-all sequence comparison and clusters the duplicates into groups using an algorithm modified from our previous sequence clustering method, cd-hit. The method can process a typical dataset in ~10 minutes; it also provides a consensus sequence for each group of duplicates. We applied it to the underlying raw reads of 39 genomic projects and 10 metagenomic projects that used pyrosequencing. We compared the occurrences of the duplicates identified by our method with the natural duplicates produced by independent simulations. We observed that duplicates, both artificial and natural, make up 4-44% of reads. The number of natural duplicates correlates strongly with the sample's read density (number of reads divided by genome size). For high-complexity metagenomic samples lacking dominant species, natural duplicates make up less than 1% of all duplicates. For some other samples, such as transcriptomic samples, the majority of the observed duplicates might be natural duplicates.
    Conclusions: Our method is available from http://cd-hit.org as a downloadable program and a web server. It is important not only to identify the duplicates in metagenomic datasets but also to distinguish whether they are artificial or natural. We provide a tool to estimate the number of natural duplicates according to user-defined sample types, so users can decide whether to retain or remove duplicates in their projects.
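The relationship between read density and natural duplicates described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation: it assumes a single genome, reads sampled uniformly at random, and a natural duplicate defined as a read that begins at the same position and strand as an earlier read. The function names are hypothetical.

```python
import random


def read_density(num_reads, genome_size):
    """Read density as defined in the abstract: number of reads / genome size."""
    return num_reads / genome_size


def simulate_natural_duplicate_fraction(num_reads, genome_size, seed=0):
    """Estimate the fraction of reads that are natural duplicates.

    Assumption (not from the paper): a read is a natural duplicate when it
    starts at the same (position, strand) as a read drawn earlier.
    """
    if num_reads == 0:
        return 0.0
    rng = random.Random(seed)
    seen_starts = set()
    duplicates = 0
    for _ in range(num_reads):
        start = (rng.randrange(genome_size), rng.choice("+-"))
        if start in seen_starts:
            duplicates += 1
        else:
            seen_starts.add(start)
    return duplicates / num_reads


if __name__ == "__main__":
    genome_size = 1_000_000  # illustrative value only
    for num_reads in (10_000, 100_000, 1_000_000):
        fraction = simulate_natural_duplicate_fraction(num_reads, genome_size)
        density = read_density(num_reads, genome_size)
        print(f"density={density:.2f} natural duplicate fraction={fraction:.4f}")
```

Under these assumptions the duplicate fraction rises with read density, which is consistent with the correlation the abstract reports; real samples with uneven species abundance would behave differently.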

    CD-HIT Suite: a web server for clustering and comparing biological sequences

    Summary: CD-HIT is a widely used program for clustering and comparing large biological sequence datasets. In order to further assist the CD-HIT users, we significantly improved this program with more functions and better accuracy, scalability and flexibility. Most importantly, we developed a new web server, CD-HIT Suite, for clustering a user-uploaded sequence dataset or comparing it to another dataset at different identity levels. Users can now interactively explore the clusters within web browsers. We also provide downloadable clusters for several public databases (NCBI NR, Swissprot and PDB) at different identity levels

    SIMULATION OF URBAN RAIL VEHICLE CRASH AND FACTORS INFLUENCING ANTI-CLIMBING ABILITY OF ITS ANTI-CLIMBER

    Using the finite element software Hypermesh and LS-DYNA, head-on impacts of an urban rail vehicle head car against a fixed rigid wall were simulated at 12.25 km/h and 18 km/h, with and without an anti-climbing energy absorption device. Based on the resulting data, the crashworthiness of the head-car body and the performance of its energy absorption device were evaluated. Using the response surface methodology, the factors influencing the anti-climbing ability of the anti-climber were also studied. The results show that at crash speeds of 12.25 km/h and 18 km/h, the energy absorption device absorbs impact energy through plastic deformation before the car body structure does, so that the car body undergoes no plastic deformation and only slight plastic deformation, respectively. In addition, when the total height and tooth thickness of the anti-climber are fixed, its anti-climbing ability decreases as the tooth height and tooth angle increase, and the tooth height has more influence than the angle.

    WebMGA: a customizable web server for fast metagenomic sequence analysis

    Background: The new field of metagenomics studies microorganism communities by culture-independent sequencing. With the advances in next-generation sequencing techniques, researchers are facing tremendous challenges in metagenomic data analysis due to the huge quantity and high complexity of sequence data. Analyzing large datasets is extremely time-consuming; metagenomic annotation also involves a wide range of computational tools, which are difficult for typical users to install and maintain. The tools provided by the few available web servers are also limited and have various constraints, such as login requirements, long waiting times, and the inability to configure pipelines.
    Results: We developed WebMGA, a customizable web server for fast metagenomic analysis. WebMGA includes over 20 commonly used tools for tasks such as ORF calling, sequence clustering, quality control of raw reads, removal of sequencing artifacts and contaminations, taxonomic analysis, and functional annotation. WebMGA provides users with rapid metagenomic data analysis using fast and effective tools, which have been implemented to run in parallel on our local computer cluster. Users can access WebMGA through web browsers or programming scripts to perform individual analyses or to configure and run customized pipelines. WebMGA is freely available at http://weizhongli-lab.org/metagenomic-analysis.
    Conclusions: WebMGA offers researchers many fast and unique tools and great flexibility for complex metagenomic data analysis.

    Facilitating software refactoring with appropriate resolution order of bad smells

    Bad smells are a key concept in software refactoring. We have many bad smells, refactoring rules, and refactoring tools, but we do not know which kinds of bad smells should be resolved first. The resolution of one kind of bad smell may affect the resolution of others. Consequently, different resolution orders for the same set of bad smells may require different effort and/or lead to different quality improvements. To ease the work and maximize the effect of refactoring, we try to analyze the relationships among different kinds of bad smells and their impact on the resolution order of these bad smells. Based on this analysis, we recommend a resolution order for common bad smells. The main contribution of this paper is to motivate the need to arrange the resolution order of bad smells and to recommend a resolution order for common bad smells. Copyright 2009 ACM.
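One way to picture the idea of deriving a resolution order from pairwise relationships between smells is a topological sort. The sketch below is only an illustration: the "resolve A before B" pairs are hypothetical and are not the order recommended in the paper, and `resolution_order` is an assumed helper name. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical precedence pairs (first, later): resolve `first` before `later`.
# These pairs are placeholders for illustration, not the paper's findings.
RESOLVE_BEFORE = [
    ("Duplicated Code", "Long Method"),
    ("Long Method", "Feature Envy"),
    ("Duplicated Code", "Large Class"),
]


def resolution_order(pairs):
    """Return one ordering of smells that respects every (first, later) pair."""
    sorter = TopologicalSorter()
    for first, later in pairs:
        # `later` depends on `first`, so `first` is emitted earlier.
        sorter.add(later, first)
    return list(sorter.static_order())


if __name__ == "__main__":
    for position, smell in enumerate(resolution_order(RESOLVE_BEFORE), start=1):
        print(position, smell)
```

If the pairs contradict each other (a cycle), `static_order()` raises `graphlib.CycleError`, which signals that no single order satisfies all the stated relationships.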

    Gclust: A Parallel Clustering Tool for Microbial Genomic Data

    The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses such resources. Building databases of non-redundant reference sequences from massive microbial genomic data through clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, in which clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust. Keywords: Microbial genome clustering, Parallelization, Sparse suffix array, Maximal exact match, Segment extensio