396 research outputs found

    Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics

    Next Generation Sequencing (NGS) technologies generate large amounts of short read data for many different organisms. Because NGS reads are generally short, assembling them and reconstructing the original genome sequence is challenging. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but it requires the underlying expected word counts. A plausible model for the distribution of word counts is obtained by modelling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of the MC and its transition probability matrix. As NGS data do not provide a single long sequence, these inference methods cannot be applied directly to NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate them using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that clustering with an MC of the estimated order gives a plausible clustering of the species. Comment: accepted by RECOMB-SEQ 201
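
To make the word-count idea in this abstract concrete, here is a minimal sketch that pools word counts over a set of short reads and turns them into plug-in transition probability estimates for a Markov chain of a given order. It is an illustration only, not the statistics derived in the paper: the function names `count_words` and `transition_probabilities` and the toy reads are assumptions made for this example.

```python
from collections import Counter


def count_words(reads, k):
    """Count every length-k word within each read (words never span two reads)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def transition_probabilities(reads, order):
    """Plug-in estimate of P(next base | preceding `order` bases) from pooled counts."""
    word_counts = count_words(reads, order + 1)
    # Total occurrences of each context (the first `order` bases of a word).
    context_totals = Counter()
    for word, n in word_counts.items():
        context_totals[word[:-1]] += n
    return {word: n / context_totals[word[:-1]] for word, n in word_counts.items()}


# Toy reads for illustration only (hypothetical, not real sequencing data).
reads = ["ACGTAC", "GTACGT", "TACGTA"]
probs = transition_probabilities(reads, order=1)
```

Each key of `probs` is a word of length `order + 1`, and its value estimates the probability of the last base given the preceding ones. Counting within reads only, as done here, is the simplest way to pool short reads without assuming how they overlap.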

    Identification of lignin genes and regulatory sequences involved in secondary cell wall formation in Acacia auriculiformis and Acacia mangium via de novo transcriptome sequencing

    Background: Acacia auriculiformis × Acacia mangium hybrids are commercially important trees for the timber and pulp industry in Southeast Asia. Increasing pulp yield and reducing pulping costs are major objectives of tree breeding programs. The general monolignol biosynthesis and secondary cell wall formation pathways are well characterized, but the genes in these pathways are poorly characterized in Acacia hybrids. RNA-seq on short-read platforms is a rapid approach for obtaining comprehensive transcriptomic data and discovering informative sequence variants. Results: We sequenced transcriptomes of A. auriculiformis and A. mangium from non-normalized cDNA libraries synthesized from pooled young stem and inner bark tissues using paired-end libraries and a single lane of an Illumina GAII machine. De novo assembly produced a total of 42,217 and 35,759 contigs with an average length of 496 bp and 498 bp for A. auriculiformis and A. mangium, respectively. The assemblies of A. auriculiformis and A. mangium had a total length of 21,022,649 bp and 17,838,260 bp, respectively, with the largest contig 15,262 bp long. We detected all ten monolignol biosynthetic genes using Blastx, and further analysis revealed 18 lignin isoforms for each species. We also identified five contigs homologous to R2R3-MYB proteins in other plant species that are involved in transcriptional regulation of secondary cell wall formation and lignin deposition. We searched the contigs against a public microRNA database and predicted the stem-loop structures of six highly conserved microRNA families (miR319, miR396, miR160, miR172, miR162 and miR168) and one legume-specific family (miR2086). Three microRNA target genes were predicted to be involved in wood formation and flavonoid biosynthesis. By using the assemblies as a reference, we discovered 16,648 and 9,335 high quality putative Single Nucleotide Polymorphisms (SNPs) in the transcriptomes of A. auriculiformis and A. mangium, respectively, thus yielding useful markers for population genetics studies and marker-assisted selection. Conclusion: We have produced the first comprehensive transcriptome-wide analysis in A. auriculiformis and A. mangium using de novo assembly techniques. Our high quality and comprehensive assemblies allowed the identification of many genes in the lignin biosynthesis and secondary cell wall formation pathways in Acacia hybrids. Our results demonstrate that Next Generation Sequencing is a cost-effective method for gene discovery and for the identification of regulatory sequences and informative markers in a non-model plant.

    Exploiting sparseness in de novo genome assembly

    Background: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments. Methods: In this paper, we demonstrate that constructing a sparse assembly graph, which stores only a small fraction of the observed k-mers as nodes and the links between these nodes, allows the de novo assembly of even moderately-sized genomes (~500 M) on a typical laptop computer. Results: We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.
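
The memory argument in this abstract can be illustrated with a deliberately simplified sketch: instead of storing every k-mer of every read, store only the k-mers that start at every g-th position. This is an assumption-laden toy, not SparseAssembler's actual node-selection or link-building procedure; the function names, the parameter `g`, and the example read are hypothetical.

```python
def all_kmers(reads, k):
    """Every distinct k-mer observed in the reads (dense, de Bruijn-style node set)."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def sparse_kmers(reads, k, g):
    """Distinct k-mers starting at every g-th position of each read (sparse node set)."""
    return {read[i:i + k] for read in reads for i in range(0, len(read) - k + 1, g)}


# Toy read for illustration only (hypothetical, not real sequencing data).
reads = ["ACGTACGTAC"]
dense = all_kmers(reads, k=4)
sparse = sparse_kmers(reads, k=4, g=3)
```

The sparse set is always a subset of the dense one, so the number of stored nodes can only shrink as `g` grows. A real assembler would also need to record the links between the retained k-mers, which this sketch omits.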

    Where Have the Beans Been? Student-Driven Laboratory Learning Activities with Legumes for Conceptual Change

    Accessible, familiar, relevant, effective and expansive teaching and learning resources are the dream of every teacher and educator in every type of educational system. Furthermore, engaging students in meaningful scientific investigations using familiar objects inspires them to make the needed connection with the science concept being introduced. Actively engaging in solving problems and arriving at empirically based conclusions has a lasting effect on students' learning; what is more, it fosters a deep appreciation of science and a real understanding of the scientific process. In this paper, we provide a set of laboratory-based activities using a variety of edible legumes (beans, peas, lentils, etc.) to introduce students to various STEM concepts in integrated, empirical investigations. Legumes are grown throughout the world and have been cultivated for more than 11,000 years. The seeds of legumes come in a wide variety of shapes, sizes and colors, and are known for their differing nutritional values based on their content. But most of all, they are accessible, familiar, real and relevant, and can be found almost anywhere. It is precisely these reasons that make them an effective teaching and learning resource in laboratory classroom settings. Throughout these laboratory learning activities, students conduct hands-on experiments and research, engage in productive discussion, write scientific papers, and present their findings within a scientific framework. Through this set of inquiry activities, teachers and students will never look at beans in the same way again. In fact, teachers may even come to consider them one of their best teaching and learning resources. Finally, the appendix offers more ideas to support teachers who are introducing these scientific concepts with the use of legumes. We include additional ideas, information, activities, and questions (complete with answers) that we feel students may ask during the learning process. In doing so, we aim to save time and energy for those teachers who wish to use and/or adapt the suggested laboratory learning activities as a means of introducing conceptual changes. Keywords: Legumes, Science Inquiry, Laboratory experiments, Learning science, Effective learning resources.

    Warm Dust and Spatially Variable PAH Emission in the Dwarf Starburst Galaxy NGC 1705

    We present Spitzer observations of the dwarf starburst galaxy NGC 1705 obtained as part of SINGS. The galaxy morphology is very different shortward and longward of ~5 microns: short-wavelength imaging shows an underlying red stellar population, with the central super star cluster (SSC) dominating the luminosity; longer-wavelength data reveal warm dust emission arising from two off-nuclear regions offset by ~250 pc from the SSC. These regions show little extinction at optical wavelengths. The galaxy has a relatively low global dust mass (~2E5 solar masses, implying a global dust-to-gas mass ratio ~2--4 times lower than the Milky Way average). The off-nuclear dust emission appears to be powered by photons from the same stellar population responsible for the excitation of the observed H alpha emission; these photons are unassociated with the SSC (though a contribution from embedded sources to the IR luminosity of the off-nuclear regions cannot be ruled out). Low-resolution IRS spectroscopy shows moderate-strength PAH emission in the 11.3 micron band in the eastern peak; no PAH emission is detected in the SSC or the western dust emission complex. There is significant diffuse 8 micron emission after scaling and subtracting shorter wavelength data; the spatially variable PAH emission strengths revealed by the IRS data suggest caution in the interpretation of diffuse 8 micron emission as arising from PAH carriers alone. The metallicity of NGC 1705 falls at the transition level of 35% solar found by Engelbracht and collaborators; the fact that a system at this metallicity shows spatially variable PAH emission demonstrates the complexity of interpreting diffuse 8 micron emission. A radio continuum non-detection, NGC 1705 deviates significantly from the canonical far-IR vs. radio correlation. (Abridged) Comment: ApJ, in press; please retrieve full-resolution version from http://www.astro.wesleyan.edu/~cannon/pubs.htm

    The Nature of Infrared Emission in the Local Group Dwarf Galaxy NGC 6822 As Revealed by Spitzer

    We present Spitzer imaging of the metal-deficient (Z ~30% Z_sun) Local Group dwarf galaxy NGC 6822. On spatial scales of ~130 pc, we study the nature of IR, H alpha, HI, and radio continuum emission. Nebular emission strength correlates with IR surface brightness; however, roughly half of the IR emission is associated with diffuse regions not luminous at H alpha (as found in previous studies). The global ratio of dust to HI gas in the ISM, while uncertain at the factor of ~2 level, is ~25 times lower than the global values derived for spiral galaxies using similar modeling techniques; localized ratios of dust to HI gas are about a factor of five higher than the global value in NGC 6822. There are strong variations (factors of ~10) in the relative ratios of H alpha and IR flux throughout the central disk; the low dust content of NGC 6822 is likely responsible for the different H alpha/IR ratios compared to those found in more metal-rich environments. The H alpha and IR emission is associated with high-column density (> ~1E21 cm^-2) neutral gas. Increases in IR surface brightness appear to be affected by both increased radiation field strength and increased local gas density. Individual regions and the galaxy as a whole fall within the observed scatter of recent high-resolution studies of the radio-far IR correlation in nearby spiral galaxies; this is likely the result of depleted radio and far-IR emission strengths in the ISM of this dwarf galaxy. Comment: ApJ, in press; please retrieve full-resolution version from http://www.astro.wesleyan.edu/~cannon/pubs.htm

    VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data

    Background: The procedural aspects of genome sequencing and assembly have become relatively inexpensive, yet the full, accurate structural annotation of these genomes remains a challenge. Next-generation sequencing transcriptomics (RNA-Seq), global microarrays, and tandem mass spectrometry (MS/MS)-based proteomics have demonstrated immense value to genome curators as individual sources of information; however, integrating these data types to validate and improve structural annotation remains a major challenge. Current visual and statistical analytic tools are focused on a single data type, or existing software tools are retrofitted to analyze new data forms. We present Visual Exploration and Statistics to Promote Annotation (VESPA), a new interactive visual analysis software tool focused on assisting scientists with the annotation of prokaryotic genomes through the integration of proteomics and transcriptomics data with current genome location coordinates. Results: VESPA is a desktop Java™ application that integrates high-throughput proteomics data (peptide-centric) and transcriptomics (probe or RNA-Seq) data into a genomic context, all of which can be visualized at three levels of genomic resolution. Data is interrogated via searches linked to the genome visualizations to find regions with a high likelihood of mis-annotation. Search results are linked to exports for further validation outside of VESPA, or potential coding regions can be analyzed concurrently with the software through interaction with BLAST. VESPA is demonstrated on two use cases (Yersinia pestis Pestoides F and Synechococcus sp. PCC 7002) to show the rapid manner in which mis-annotations can be found and explored in VESPA using either proteomics data alone or in combination with transcriptomic data. Conclusions: VESPA is an interactive visual analytics tool that integrates high-throughput data into a genomic context to facilitate the discovery of structural mis-annotations in prokaryotic genomes. Data is evaluated via visual analysis across multiple levels of genomic resolution, linked searches and interaction with existing bioinformatics tools. We highlight the novel functionality of VESPA and core programming requirements for visualization of these large heterogeneous datasets for a client-side application. The software is freely available at https://www.biopilot.org/docs/Software/Vespa.php.