    MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

    A major challenge in next-generation genome sequencing (NGS) is to assemble massive numbers of overlapping short reads that are randomly sampled from DNA fragments. To complete the assembly, one needs to finish a fundamental task in many leading assembly algorithms: counting the number of occurrences of k-mers (length-k substrings in sequences). The counting results are critical for many components of assembly (e.g. variant detection and read error correction). For large genomes, the k-mer counting task can easily consume a huge amount of memory, making large-scale parallel assembly on commodity servers impossible. In this paper, we develop MSPKmerCounter, a disk-based approach that efficiently performs k-mer counting for large genomes using a small amount of memory. Our approach is based on a novel technique called Minimum Substring Partitioning (MSP). MSP breaks short reads into multiple disjoint partitions such that each partition can be loaded into memory and processed individually. By leveraging the overlaps among the k-mers derived from the same short read, MSP achieves a high compression ratio, so the I/O cost is significantly reduced. For the task of k-mer counting, MSPKmerCounter offers a very fast and memory-efficient solution. Experimental results on large real-life short-read data sets demonstrate that MSPKmerCounter achieves better overall performance than state-of-the-art k-mer counting approaches. MSPKmerCounter is available at http://www.cs.ucsb.edu/~yangli/MSPKmerCounte
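The abstract does not spell out how the partitions are built, so the following is only a minimal sketch of one way to read it, not the MSPKmerCounter implementation. It assumes that each k-mer is keyed by its lexicographically smallest length-p substring, that adjacent k-mers sharing that key are stored once as a merged "super k-mer", and that the key's hash selects the partition. The names `min_substring`, `super_kmers`, `count_kmers` and the parameters `p` and `num_partitions` are hypothetical, partitions are kept in memory rather than on disk, and reverse complements are ignored.

```python
import zlib
from collections import Counter, defaultdict


def min_substring(kmer, p):
    """Return the lexicographically smallest length-p substring of kmer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def super_kmers(read, k, p):
    """Yield (key, super_kmer) pairs for one read.

    Runs of adjacent k-mers that share the same minimum p-substring are
    merged into a single substring of the read (assumed MSP behaviour).
    """
    start, current = 0, None
    for i in range(len(read) - k + 1):
        key = min_substring(read[i:i + k], p)
        if current is None:
            current = key
        elif key != current:
            # k-mers start..i-1 share `current`; together they span read[start:i-1+k]
            yield current, read[start:i - 1 + k]
            start, current = i, key
    if current is not None:
        yield current, read[start:]


def count_kmers(reads, k, p=4, num_partitions=8):
    """Count k-mers by processing one partition at a time."""
    partitions = defaultdict(list)  # stand-in for on-disk partition files
    for read in reads:
        for key, sk in super_kmers(read, k, p):
            partitions[zlib.crc32(key.encode()) % num_partitions].append(sk)

    total = Counter()
    for chunk in partitions.values():
        local = Counter()  # only this partition's k-mers are held at once
        for sk in chunk:
            for i in range(len(sk) - k + 1):
                local[sk[i:i + k]] += 1
        total.update(local)
    return total
```

Because a k-mer's key depends only on the k-mer itself, every occurrence of a given k-mer lands in the same partition, so the per-partition counts can be combined without reconciling duplicates across partitions.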

    Deep proteogenomics; high throughput gene validation by multidimensional liquid chromatography and mass spectrometry of proteins from the fungal wheat pathogen Stagonospora nodorum

    BACKGROUND: Stagonospora nodorum, a fungal ascomycete in the class Dothideomycetes, is a damaging pathogen of wheat. It is a model for necrotrophic fungi that cause necrotic symptoms via the interaction of multiple effector proteins with cultivar-specific receptors. A draft genome sequence and annotation were published in 2007. A second-pass gene prediction using a training set of 795 fully EST-supported genes predicted a total of 10762 version 2 nuclear-encoded genes, with an additional 5354 less reliable version 1 genes also retained. RESULTS: In this study, we subjected soluble mycelial proteins to proteolysis followed by 2D LC MALDI-MS/MS. Comparison of the detected peptides with the gene models validated 2134 genes. 62% of these genes (1324) were not supported by prior EST evidence. Of the 2134 validated genes, all but 188 were version 2 annotations. Statistical analysis of the validated gene models revealed a preponderance of cytoplasmic and nuclear localised proteins, and proteins with intracellular-associated GO terms. These statistical associations are consistent with the source of the peptides used in the study. Comparison with a 6-frame translation of the S. nodorum genome assembly confirmed 905 existing gene annotations (including 119 not previously confirmed) and provided evidence supporting 144 genes with coding exon frameshift modifications, 604 genes with extensions of coding exons into annotated introns or untranslated regions (UTRs), 3 new gene annotations which were supported by tblastn to NR, and 44 potential new genes residing within un-assembled regions of the genome. CONCLUSION: We conclude that 2D LC MALDI-MS/MS is a powerful, rapid and economical tool to aid in the annotation of fungal genomic assemblies.
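For readers unfamiliar with the 6-frame translation mentioned above, here is a minimal sketch of that step alone, assuming the standard genetic code. The function names are hypothetical, and this is not the study's pipeline, which also involves matching detected peptides against the translated frames.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Return translations for frames +1..+3 and -1..-3, keyed by frame name."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

Searching peptides against all six frames, rather than only against predicted gene models, is what allows evidence to surface for frameshifts, exon extensions and unannotated genes.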