MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting
A major challenge in next-generation genome sequencing (NGS) is to assemble
massive overlapping short reads that are randomly sampled from DNA fragments.
Assembly depends on a fundamental task shared by many leading
assembly algorithms: counting the number of occurrences of k-mers (length-k
substrings of sequences). The counts are critical for many components
of assembly (e.g. variant detection and read error correction). For large
genomes, the k-mer counting task can easily consume a huge amount of memory,
making large-scale parallel assembly on commodity servers impossible.
In this paper, we develop MSPKmerCounter, a disk-based approach, to
efficiently perform k-mer counting for large genomes using a small amount of
memory. Our approach is based on a novel technique called Minimum Substring
Partitioning (MSP). MSP breaks short reads into multiple disjoint partitions
such that each partition can be loaded into memory and processed individually.
By leveraging the overlaps among the k-mers derived from the same short read,
MSP can achieve an astonishing compression ratio, so that the I/O cost is
significantly reduced. For the task of k-mer counting, MSPKmerCounter offers a
very fast and memory-efficient solution. Experimental results on large real-life
short-read data sets demonstrate that MSPKmerCounter can achieve better
overall performance than state-of-the-art k-mer counting approaches.
MSPKmerCounter is available at http://www.cs.ucsb.edu/~yangli/MSPKmerCounte
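
The partitioning idea described in this abstract can be illustrated with a minimal in-memory sketch. It is not the MSPKmerCounter implementation: it assumes that consecutive k-mers of a read sharing the same lexicographically smallest length-p substring are merged into one longer segment, that segments are grouped by that substring, and that only the forward strand matters. The function names and the parameter p are hypothetical, and the dictionary stands in for the on-disk partition files.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def min_substring(kmer, p):
    """Lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def msp_partition(reads, k, p):
    """Group runs of adjacent k-mers by their shared minimum p-substring.

    Assumption: forward strand only, partitions kept in memory.
    """
    partitions = defaultdict(list)
    for read in reads:
        if len(read) < k:
            continue  # read too short to contain any k-mer
        start = 0
        current = min_substring(read[:k], p)
        for i in range(1, len(read) - k + 1):
            m = min_substring(read[i:i + k], p)
            if m != current:
                # k-mers start..i-1 share `current`; store them as one segment
                partitions[current].append(read[start:i - 1 + k])
                start, current = i, m
        partitions[current].append(read[start:])
    return partitions


def count_by_partition(partitions, k):
    """Count k-mers one partition at a time, then merge the results."""
    total = Counter()
    for segments in partitions.values():
        local = Counter()  # only this partition is "in memory"
        for seg in segments:
            local.update(kmers(seg, k))
        total.update(local)
    return total


def count_direct(reads, k):
    """Reference: count every k-mer of every read in a single table."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts
```

Because the partition of a k-mer depends only on the k-mer itself, every occurrence of a given k-mer lands in the same partition, so the per-partition counts agree with `count_direct`. Storing one merged segment instead of each k-mer separately is what reduces the volume of data written out in this sketch.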
Deep proteogenomics; high throughput gene validation by multidimensional liquid chromatography and mass spectrometry of proteins from the fungal wheat pathogen Stagonospora nodorum
BACKGROUND: Stagonospora nodorum, a fungal ascomycete in the class Dothideomycetes, is a
damaging pathogen of wheat. It is a model for necrotrophic fungi that cause necrotic symptoms via
the interaction of multiple effector proteins with cultivar-specific receptors. A draft genome
sequence and annotation were published in 2007. A second-pass gene prediction using a training set
of 795 fully EST-supported genes predicted a total of 10762 version 2 nuclear-encoded genes, with
an additional 5354 less reliable version 1 genes also retained.
RESULTS: In this study, we subjected soluble mycelial proteins to proteolysis followed by 2D LC
MALDI-MS/MS. Comparison of the detected peptides with the gene models validated 2134 genes.
62% of these genes (1324) were not supported by prior EST evidence. Of the 2134 validated genes,
all but 188 were version 2 annotations. Statistical analysis of the validated gene models revealed a
preponderance of cytoplasmic and nuclear localised proteins, and proteins with intracellular-associated
GO terms. These statistical associations are consistent with the source of the peptides
used in the study. Comparison with a 6-frame translation of the S. nodorum genome assembly
confirmed 905 existing gene annotations (including 119 not previously confirmed) and provided
evidence supporting 144 genes with coding exon frameshift modifications, 604 genes with
extensions of coding exons into annotated introns or untranslated regions (UTRs), 3 new gene
annotations which were supported by tblastn to NR, and 44 potential new genes residing within
un-assembled regions of the genome.
CONCLUSION: We conclude that 2D LC MALDI-MS/MS is a powerful, rapid and economical tool to
aid in the annotation of fungal genomic assemblies.
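
The 6-frame translation step mentioned in the results can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: it assumes the standard genetic code, treats peptide matching as plain substring search rather than spectrum-based identification, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def frames_containing(peptide, seq):
    """Names of the frames whose translation contains the peptide."""
    return [
        name
        for name, protein in six_frame_translation(seq).items()
        if peptide in protein
    ]
```

A peptide that matches a frame outside any annotated gene model, or in a different frame from the annotated one, is the kind of evidence that would point to a new gene, an exon extension, or a frameshift correction.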