145 research outputs found

    CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers.

    Background: The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to the size of the datasets that modern sequencing technologies can produce. Results: We introduce CLARK, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of CLARK is better than or comparable to the best state-of-the-art tools and that it is significantly faster than any of its competitors. In its fastest single-threaded mode, CLARK classifies, with high accuracy, about 32 million metagenomic short reads per minute. CLARK can also classify BAC clones or transcripts to chromosome arms and centromeric regions. Conclusions: CLARK is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at http://clark.cs.ucr.edu/

    Scrible: Ultra-Accurate Error-Correction of Pooled Sequenced Reads

    Abstract. We recently proposed a novel clone-by-clone protocol for de novo genome sequencing that leverages combinatorial pooling design to overcome the limitations of DNA barcoding when multiplexing a large number of samples on second-generation sequencing instruments. Here we address the problem of correcting the short reads obtained from our sequencing protocol. We introduce a novel algorithm called Scrible that exploits properties of the pooling design to accurately identify/correct sequencing errors and minimize the chance of “over-correcting”. Experimental results on synthetic data on the rice genome demonstrate that our method has much higher accuracy in correcting short reads compared to state-of-the-art error-correcting methods. On real data on the barley genome we show that Scrible significantly improves the decoding accuracy of short reads to individual BACs.

    Construction of a map-based reference genome sequence for barley, Hordeum vulgare L.

    Barley (Hordeum vulgare L.) is a cereal grass mainly used as animal fodder and raw material for the malting industry. The map-based reference genome sequence of barley cv. 'Morex' was constructed by the International Barley Genome Sequencing Consortium (IBSC) using hierarchical shotgun sequencing. Here, we report the experimental and computational procedures to (i) sequence and assemble more than 80,000 bacterial artificial chromosome (BAC) clones along the minimum tiling path of a genome-wide physical map, (ii) find and validate overlaps between adjacent BACs, (iii) construct 4,265 non-redundant sequence scaffolds representing clusters of overlapping BACs, and (iv) order and orient these BAC clusters along the seven barley chromosomes using positional information provided by dense genetic maps, an optical map and chromosome conformation capture sequencing (Hi-C). Integrative access to these sequence and mapping resources is provided by the barley genome explorer (BARLEX). Peer reviewed.

    A physical, genetic and functional sequence assembly of the barley genome

    Barley (Hordeum vulgare L.) is among the world's earliest domesticated and most important crop plants. It is diploid with a large haploid genome of 5.1 gigabases (Gb). Here we present an integrated and ordered physical, genetic and functional sequence resource that describes the barley gene-space in a structured whole-genome context. We developed a physical map of 4.98 Gb, with more than 3.90 Gb anchored to a high-resolution genetic map. Projecting a deep whole-genome shotgun assembly, complementary DNA and deep RNA sequence data onto this framework supports 79,379 transcript clusters, including 26,159 'high-confidence' genes with homology support from other plant genomes. Abundant alternative splicing, premature termination codons and novel transcriptionally active regions suggest that post-transcriptional processing forms an important regulatory layer. Survey sequences from diverse accessions reveal a landscape of extensive single-nucleotide variation. Our data provide a platform for both genome-assisted research and enabling contemporary crop improvement

    Identification of candidate genes and molecular markers for heat-induced brown discoloration of seed coats in cowpea [Vigna unguiculata (L.) Walp].

    Background: Heat-induced browning (Hbs) of seed coats is caused by high temperatures, which discolor the seed coats of many legumes, affecting the visual appearance and quality of seeds. The genetic determinants underlying Hbs in cowpea are unknown. Results: We identified three QTL associated with the heat-induced browning of seed coats trait, Hbs-1, Hbs-2 and Hbs-3, using cowpea RIL populations IT93K-503-1 (Hbs positive) x CB46 (hbs negative) and IT84S-2246 (Hbs positive) x TVu14676 (hbs negative). Hbs-1 was identified in both populations, accounting for 28.3% to 77.3% of the phenotypic variation. SNP markers 1_0032 and 1_1128 co-segregated with the trait. Within the syntenic regions of Hbs-1 in soybean, Medicago and common bean, several ethylene forming enzymes, ethylene responsive element binding factors and an ACC oxidase 2 were observed. Hbs-1 was identified in a BAC clone in contig 217 of the cowpea physical map, where ethylene forming enzymes were present. Hbs-2 was identified in the IT93K-503-1 x CB46 population and accounted for 9.5 to 12.3% of the phenotypic variance. Hbs-3 was identified in the IT84S-2246 x TVu14676 population and accounted for 6.2 to 6.8% of the phenotypic variance. SNP marker 1_0640 co-segregated with the heat-induced browning phenotype. Hbs-3 was positioned on BAC clones in contig 512 of the cowpea physical map, where several ACC synthase 1 genes were present. Conclusion: The identification of loci determining heat-induced browning of seed coats and co-segregating molecular markers will enable transfer of hbs alleles into cowpea varieties, contributing to higher quality seeds.

    Computational Methods for Sequencing and Analysis of Heterogeneous RNA Populations

    Next-generation sequencing (NGS) and mass spectrometry technologies bring unprecedented throughput, scalability and speed, facilitating the study of biological systems. These technologies make it possible to sequence and analyze heterogeneous RNA populations rather than single sequences. In particular, they provide the opportunity to implement massive viral surveillance and transcriptome quantification. However, in order to fully exploit the capabilities of NGS technology, we need to develop computational methods able to analyze billions of reads for assembly and characterization of sampled RNA populations. In this work we present novel computational methods for cost- and time-effective analysis of sequencing data from viral and RNA samples. In particular, we describe: i) computational methods for transcriptome reconstruction and quantification; ii) a method for mass spectrometry data analysis; iii) a combinatorial pooling method; iv) computational methods for analysis of intra-host viral populations.

    New computational methods and plant models for evolutionary genomics

    This thesis is in the service of a greater understanding of the genetic basis of adaptive traits. Chapter 1 introduces background literature relevant to this thesis. Chapters 2, 3, and 4 develop novel methods and software for the analysis of genetic sequencing data. Chapter 5 details a large collaborative project to establish genetic resources in the model cereal Brachypodium, and perform a genome-wide association study for several agriculturally-relevant traits under two climate change scenarios. Chapter 6 investigates the spatial genetic patterns in two species of woodland eucalypt, and determines the landscape process that could be driving these patterns. Finally, Chapter 7 summarises these works, and proposes some areas of further study. In Chapters 2 and 3, I develop methods that enable analysis of genotyping-by-sequencing (GBS) data. Axe, a short read sequence demultiplexer, demultiplexes samples from multiplexed GBS sequencing datasets. I show Axe has high accuracy, and outperforms previously published software. Axe also tolerates complex indexing schemes such as the variable-length combinatorial indexes used in GBS data. Trimit and libqcpp (Chapter 3) implement several low-level sequence read quality assessment and control methods as a C++ library, and as a command line tool. Both these works have been published in peer-reviewed journals, and are used by numerous groups internationally. In Chapter 4, I develop kWIP, a de novo estimator of genetic distance. kWIP enables rapid estimation of genetic distances directly from sequence reads. We first show kWIP outperforms a competing method at low coverage using simulations that mimic a population resequencing experiment. We propose and demonstrate several use cases for kWIP, including population resequencing, initial assessment of sample identity, and estimating metagenomic similarity. kWIP was published in PLoS Computational Biology.
    In Chapter 5, I present the results of a large, collaborative project which surveys the global genetic diversity of the model cereal Brachypodium. We amass a collection of over 2000 accessions from the Brachypodium species complex. Using GBS and whole genome sequencing we identify around 800 accessions of the diploid Brachypodium distachyon, within which we find extensive population structure and clonal families. Through population restructuring we create a core collection of 74 accessions containing the majority of genetic diversity in the "A genome" sub-population. Using this core collection, we assay several phenotypes of agricultural interest including early vigour, harvest index and energy use efficiency under two climates, and dissect the genetic basis of these traits using a genome-wide association study (GWAS). This work has been accepted for publication at Genetics; I am co-first author with Pip Wilson and Jared Streich, having led many genomic analyses. In Chapter 6, I perform a study of landscape genomic variation in two woodland eucalypt species. Using whole genome sequencing of around 200 individuals from around 20 localities of both E. albens and E. sideroxylon, I find incredible genetic diversity and low genome-wide inter-species differentiation. I find no support for strong discrete population structure, but strong support for isolation by (geographic) distance (IBD). Using generalised dissimilarity modelling, I further examine the pattern of IBD, and establish additional isolation by environment (IBE). E. albens shows moderately strong IBD, with 26% of deviance in genetic distance explained by geographic distance, and an additional 6% of deviance explained by incorporating environmental predictors (IBE). E. sideroxylon shows much stronger IBD, with 78% of deviance explained by geography, and stronger IBE (12% additional deviance explained). This work will soon be submitted for publication.

    Biotechnologies for Plant Mutation Breeding: Protocols

    Plant Breeding/Biotechnology; Agriculture; Genetic Engineering; Plant Genetics & Genomic