Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics
Next Generation Sequencing (NGS) technologies generate large amounts of short
read data for many different organisms. The fact that NGS reads are generally
short makes it challenging to assemble the reads and reconstruct the original
genome sequence. For clustering genomes using such NGS data, word-count based
alignment-free sequence comparison is a promising approach, but for this
approach, the underlying expected word counts are essential.
A plausible model for this underlying distribution of word counts is given
through modelling the DNA sequence as a Markov chain (MC). For single long
sequences, efficient statistics are available to estimate the order of MCs and
the transition probability matrix for the sequences. As NGS data do not provide
a single long sequence, inference methods on Markovian properties of sequences
based on single long sequences cannot be directly used for NGS short read data.
Here we derive a normal approximation for such word counts. We also show that
the traditional Chi-square statistic has an approximate gamma distribution,
using the Lander-Waterman model for physical mapping. We propose several
methods to estimate the order of the MC based on NGS reads and evaluate them
using simulations. We illustrate the applications of our results by clustering
genomic sequences of several vertebrate and tree species based on NGS reads
using alignment-free sequence dissimilarity measures. We find that the
estimated order of the MC has a considerable effect on the clustering results,
and that the clustering results that use a MC of the estimated order give a
plausible clustering of the species. Comment: accepted by RECOMB-SEQ 201
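The idea of inferring Markov chain parameters from reads rather than from one long sequence can be illustrated with a minimal sketch. It pools word counts over all reads (no word spans two reads) and forms the plain maximum-likelihood transition estimates for a chain of a given order. This is an illustration under simplifying assumptions, not the estimators or test statistics derived in the paper; the function names and the toy reads are made up for the example.

```python
from collections import Counter


def pooled_word_counts(reads, k):
    """Count every length-k word in every read; words never span two reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def transition_probabilities(reads, order):
    """Maximum-likelihood estimate of P(next base | previous `order` bases).

    Keys of the returned dict are words of length order + 1: the context
    followed by the next base.
    """
    word_counts = pooled_word_counts(reads, order + 1)
    context_totals = Counter()
    for word, n in word_counts.items():
        context_totals[word[:-1]] += n
    return {word: n / context_totals[word[:-1]]
            for word, n in word_counts.items()}


# Toy reads, invented for illustration only.
reads = ["ACGTAC", "GTACGT", "TACGTA"]
probs = transition_probabilities(reads, order=1)
```

Comparing the fit of such estimates across several values of `order` is one simple way to think about order selection, although the paper's own procedures account for the read-sampling process, which this sketch ignores.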
Identification of lignin genes and regulatory sequences involved in secondary cell wall formation in Acacia auriculiformis and Acacia mangium via de novo transcriptome sequencing
Background: Acacia auriculiformis × Acacia mangium hybrids are commercially important trees for the timber and pulp industry in Southeast Asia. Increasing pulp yield while reducing pulping costs are major objectives of tree breeding programs. The general monolignol biosynthesis and secondary cell wall formation pathways are well-characterized, but genes in these pathways are poorly characterized in Acacia hybrids. RNA-seq on short-read platforms is a rapid approach for obtaining comprehensive transcriptomic data and discovering informative sequence variants. Results: We sequenced transcriptomes of A. auriculiformis and A. mangium from non-normalized cDNA libraries synthesized from pooled young stem and inner bark tissues using paired-end libraries and a single lane of an Illumina GAII machine. De novo assembly produced a total of 42,217 and 35,759 contigs with an average length of 496 bp and 498 bp for A. auriculiformis and A. mangium, respectively. The assemblies of A. auriculiformis and A. mangium had a total length of 21,022,649 bp and 17,838,260 bp, respectively, with the largest contig 15,262 bp long. We detected all ten monolignol biosynthetic genes using Blastx, and further analysis revealed 18 lignin isoforms for each species. We also identified five contigs homologous to R2R3-MYB proteins in other plant species that are involved in transcriptional regulation of secondary cell wall formation and lignin deposition. We searched the contigs against a public microRNA database and predicted the stem-loop structures of six highly conserved microRNA families (miR319, miR396, miR160, miR172, miR162 and miR168) and one legume-specific family (miR2086). Three microRNA target genes were predicted to be involved in wood formation and flavonoid biosynthesis. By using the assemblies as a reference, we discovered 16,648 and 9,335 high quality putative Single Nucleotide Polymorphisms (SNPs) in the transcriptomes of A. auriculiformis and A. mangium, respectively, thus yielding useful markers for population genetics studies and marker-assisted selection. Conclusion: We have produced the first comprehensive transcriptome-wide analysis in A. auriculiformis and A. mangium using de novo assembly techniques. Our high quality and comprehensive assemblies allowed the identification of many genes in the lignin biosynthesis and secondary cell wall formation in Acacia hybrids. Our results demonstrated that Next Generation Sequencing is a cost-effective method for gene discovery, identification of regulatory sequences, and informative markers in a non-model plant.
Evidence for a Trade-Off Strategy in Stone Oak (Lithocarpus) Seeds between Physical and Chemical Defense Highlights Fiber as an Important Antifeedant
Trees in the beech or oak family (Fagaceae) have a mutualistic relationship with scatter-hoarding rodents. Rodents obtain nutrients and energy by consuming seeds, while providing seed dispersal for the tree by allowing some cached seeds to germinate. Seed predation and caching behavior of rodents is primarily affected by seed size, mechanical protection, macronutrient content, and chemical antifeedants. To enhance seed dispersal, trees must optimize trade-offs in investment between macronutrients and antifeedants. Here, we examine this important chemical balance in the seeds of tropical stone oak species with two substantially different fruit morphologies. These two distinct fruit morphologies in Lithocarpus differ in the degree of mechanical protection of the seed. For “acorn” fruit, a thin exocarp forms a shell around the seed, while for “enclosed receptacle” (ER) fruit, the seed is embedded in a woody receptacle. We compared the chemical composition of numerous macronutrients and antifeedants in seeds from several Lithocarpus species, focusing on two pairs of sympatric species with different fruit morphologies. We found that macronutrients, particularly total non-structural carbohydrate, were more concentrated in seeds of ER fruits, while antifeedants, primarily fibers, were more concentrated in seeds of acorn fruits. The trade-off in these two major chemical components was more evident between the two sympatric lowland species than between the two highland species. Surprisingly, no significant difference in overall tannin concentrations in the seeds was observed between the two fruit morphologies. Instead, the major trade-off between macronutrients and antifeedants involved indigestible fibers. Future studies of this complex mutualism should carefully consider the role of indigestible fibers in the foraging behavior of scatter-hoarding rodents. Human Evolutionary Biolog
Exploiting sparseness in de novo genome assembly
Background: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments. Methods: In this paper, we demonstrate that constructing a sparse assembly graph which stores only a small fraction of the observed k-mers as nodes and the links between these nodes allows the de novo assembly of even moderately-sized genomes (~500 M) on a typical laptop computer. Results: We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.
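The memory argument can be made concrete with a toy sketch: instead of storing every k-mer of a sequence, store only those starting at every g-th position, together with the short string of bases linking each stored k-mer to the next. This is a simplified illustration on a single error-free sequence, not SparseAssembler's actual node-selection or graph-construction procedure; the function names, the spacing parameter `g`, and the example sequence are assumptions made for the example.

```python
def all_kmers(sequence, k):
    """Dense representation: every distinct k-mer in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def sparse_kmers(sequence, k, g):
    """Sparse representation: k-mers starting at positions 0, g, 2g, ...

    Each stored k-mer maps to the set of links leading to the next stored
    k-mer, where a link is the g bases that extend it to that neighbour.
    """
    nodes = {}
    last_start = len(sequence) - k
    for i in range(0, last_start + 1, g):
        kmer = sequence[i:i + k]
        links = nodes.setdefault(kmer, set())
        next_start = i + g
        if next_start <= last_start:
            links.add(sequence[i + k:next_start + k])
    return nodes


# Toy sequence, invented for illustration only.
sequence = "ACGTACGGTCA"
dense = all_kmers(sequence, k=4)
sparse = sparse_kmers(sequence, k=4, g=3)
```

With spacing g, roughly one k-mer in g is kept as a node, which is the source of the memory saving; a real assembler must additionally make overlapping reads agree on which k-mers are kept, which this sketch does not attempt.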
Where Have the Beans Been? Student-Driven Laboratory Learning Activities with Legumes for Conceptual Change
Accessible, familiar, relevant, effective and expansive teaching and learning resources are the dream of every teacher and educator in every type of educational system. Furthermore, engaging students in meaningful scientific investigations using familiar objects inspires them to make the needed connection with the science concept being introduced. Actively solving problems and arriving at empirically based conclusions has a lasting effect on students’ learning; what is more, it fosters a deep appreciation of science and a real understanding of the scientific process. In this paper, we provide a set of laboratory-based activities using a variety of edible legumes (beans, peas, lentils, etc.) to introduce students to various STEM concepts in integrated, empirical investigations. Legumes are grown throughout the world and have been cultivated for more than 11,000 years. The seeds of legumes come in a wide variety of shapes, sizes, and colors, and are known for their differing nutritional values based on their content. But most of all, they are accessible, familiar, real and relevant, and can be found almost anywhere. It is precisely these reasons that make them an effective teaching and learning resource in laboratory classroom settings. Throughout these laboratory learning activities, students engage in hands-on experiments, conduct research, take part in productive discussion, write scientific papers, and present their findings within a scientific framework. Through this set of inquiry activities, teachers and students will never look at beans in the same way again. Perhaps, in fact, teachers may even consider them one of their best teaching and learning resources. Finally, the appendix offers more ideas to support teachers who are introducing these scientific concepts with the use of legumes.
We include additional ideas, information, activities, and questions (complete with answers) that we feel students may ask during the learning process. In doing so, we aim to save time and energy for those teachers who wish to use and/or adapt the suggested laboratory learning activities as a means of introducing conceptual changes. Keywords: Legumes, Science Inquiry, Laboratory experiments, Learning science, Effective learning resources.
Warm Dust and Spatially Variable PAH Emission in the Dwarf Starburst Galaxy NGC 1705
We present Spitzer observations of the dwarf starburst galaxy NGC 1705
obtained as part of SINGS. The galaxy morphology is very different shortward
and longward of ~5 microns: short-wavelength imaging shows an underlying red
stellar population, with the central super star cluster (SSC) dominating the
luminosity; longer-wavelength data reveals warm dust emission arising from two
off-nuclear regions offset by ~250 pc from the SSC. These regions show little
extinction at optical wavelengths. The galaxy has a relatively low global dust
mass (~2E5 solar masses, implying a global dust-to-gas mass ratio ~2--4 times
lower than the Milky Way average). The off-nuclear dust emission appears to be
powered by photons from the same stellar population responsible for the
excitation of the observed H Alpha emission; these photons are unassociated
with the SSC (though a contribution from embedded sources to the IR luminosity
of the off-nuclear regions cannot be ruled out). Low-resolution IRS
spectroscopy shows moderate-strength PAH emission in the 11.3 micron band in
the eastern peak; no PAH emission is detected in the SSC or the western dust
emission complex. There is significant diffuse 8 micron emission after scaling
and subtracting shorter wavelength data; the spatially variable PAH emission
strengths revealed by the IRS data suggest caution in the interpretation of
diffuse 8 micron emission as arising from PAH carriers alone. The metallicity
of NGC 1705 falls at the transition level of 35% solar found by Engelbracht and
collaborators; the fact that a system at this metallicity shows spatially
variable PAH emission demonstrates the complexity of interpreting diffuse 8
micron emission. A radio continuum non-detection, NGC 1705 deviates
significantly from the canonical far-IR vs. radio correlation. (Abridged) Comment: ApJ, in press; please retrieve full-resolution version from
http://www.astro.wesleyan.edu/~cannon/pubs.htm
The Nature of Infrared Emission in the Local Group Dwarf Galaxy NGC 6822 As Revealed by Spitzer
We present Spitzer imaging of the metal-deficient (Z ~30% Z_sun) Local Group
dwarf galaxy NGC 6822. On spatial scales of ~130 pc, we study the nature of IR,
H alpha, HI, and radio continuum emission. Nebular emission strength correlates
with IR surface brightness; however, roughly half of the IR emission is
associated with diffuse regions not luminous at H alpha (as found in previous
studies). The global ratio of dust to HI gas in the ISM, while uncertain at the
factor of ~2 level, is ~25 times lower than the global values derived for
spiral galaxies using similar modeling techniques; localized ratios of dust to
HI gas are about a factor of five higher than the global value in NGC 6822.
There are strong variations (factors of ~10) in the relative ratios of H alpha
and IR flux throughout the central disk; the low dust content of NGC 6822 is
likely responsible for the different H alpha/IR ratios compared to those found
in more metal-rich environments. The H alpha and IR emission is associated with
high-column density (> ~1E21 cm^-2) neutral gas. Increases in IR surface
brightness appear to be affected by both increased radiation field strength and
increased local gas density. Individual regions and the galaxy as a whole fall
within the observed scatter of recent high-resolution studies of the radio-far
IR correlation in nearby spiral galaxies; this is likely the result of depleted
radio and far-IR emission strengths in the ISM of this dwarf galaxy. Comment: ApJ, in press; please retrieve full-resolution version from
http://www.astro.wesleyan.edu/~cannon/pubs.htm
VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data
Background: The procedural aspects of genome sequencing and assembly have become relatively inexpensive, yet the full, accurate structural annotation of these genomes remains a challenge. Next-generation sequencing transcriptomics (RNA-Seq), global microarrays, and tandem mass spectrometry (MS/MS)-based proteomics have demonstrated immense value to genome curators as individual sources of information; however, integrating these data types to validate and improve structural annotation remains a major challenge. Current visual and statistical analytic tools are focused on a single data type, or existing software tools are retrofitted to analyze new data forms. We present Visual Exploration and Statistics to Promote Annotation (VESPA), a new interactive visual analysis software tool focused on assisting scientists with the annotation of prokaryotic genomes through the integration of proteomics and transcriptomics data with current genome location coordinates. Results: VESPA is a desktop Java™ application that integrates high-throughput proteomics data (peptide-centric) and transcriptomics (probe or RNA-Seq) data into a genomic context, all of which can be visualized at three levels of genomic resolution. Data is interrogated via searches linked to the genome visualizations to find regions with high likelihood of mis-annotation. Search results are linked to exports for further validation outside of VESPA, or potential coding regions can be analyzed concurrently with the software through interaction with BLAST. VESPA is applied to two use cases (Yersinia pestis Pestoides F and Synechococcus sp. PCC 7002) to demonstrate the rapid manner in which mis-annotations can be found and explored in VESPA using either proteomics data alone, or in combination with transcriptomic data. Conclusions: VESPA is an interactive visual analytics tool that integrates high-throughput data into a genomic context to facilitate the discovery of structural mis-annotations in prokaryotic genomes. Data is evaluated via visual analysis across multiple levels of genomic resolution, linked searches and interaction with existing bioinformatics tools. We highlight the novel functionality of VESPA and core programming requirements for visualization of these large heterogeneous datasets for a client-side application. The software is freely available at https://www.biopilot.org/docs/Software/Vespa.php.