
    SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding

    Scaffolding is an important subproblem in "de novo" genome assembly in which mate pair data are used to construct a linear sequence of contigs separated by gaps. Here we present SLIQ, a set of simple linear inequalities derived from the geometry of contigs on the line that can be used to predict the relative positions and orientations of contigs from individual mate pair reads and thus produce a contig digraph. The SLIQ inequalities can also filter out unreliable mate pairs and can be used as a preprocessing step for any scaffolding algorithm. We tested the SLIQ inequalities on five real data sets ranging in complexity from simple bacterial genomes to complex mammalian genomes and compared the results to the majority voting procedure used by many other scaffolding algorithms. SLIQ predicted the relative positions and orientations of the contigs with high accuracy in all cases and gave more accurate position predictions than majority voting for complex genomes, in particular the human genome. Finally, we present a simple scaffolding algorithm that produces linear scaffolds given a contig digraph. We show that our algorithm is very efficient compared to other scaffolding algorithms while maintaining high accuracy in predicting both contig positions and orientations for real data sets. Comment: 16 pages, 6 figures, 7 tables

    pGQL: A probabilistic graphical query language for gene expression time courses

    Background: Timeboxes are graphical user interface widgets that were proposed to specify queries on time course data. As queries can be very easily defined, an exploratory analysis of time course data is greatly facilitated. While timeboxes are effective, they have no provisions for dealing with noisy data or data with fluctuations along the time axis, which are very common in many applications. In particular, this is true for the analysis of gene expression time courses, which are mostly derived from noisy microarray measurements at few unevenly sampled time points. From a data mining point of view the robust handling of data through a sound statistical model is of great importance. Results: We propose probabilistic timeboxes, which correspond to a specific class of Hidden Markov Models (HMMs), an established method in data mining. Since HMMs are a particular class of probabilistic graphical models, we call our method Probabilistic Graphical Query Language. It is implemented in the free software package pGQL. We evaluate its effectiveness in exploratory analysis on a yeast sporulation data set. Conclusions: We introduce a new approach to define dynamic, statistical queries on time course data. It supports an interactive exploration of reasonably large amounts of data and enables users without expert knowledge to specify fairly complex statistical models with ease. By its statistical nature, our approach is more expressive and more robust with respect to amplitude and frequency fluctuation than the prior, deterministic timeboxes.

    Evaluation of reference genes for RT-qPCR studies in the seagrass Zostera muelleri exposed to light limitation

    Seagrass meadows are threatened by coastal development and global change. In the face of these pressures, molecular techniques such as reverse transcription quantitative real-time PCR (RT-qPCR) have great potential to improve management of these ecosystems by allowing early detection of chronic stress. In RT-qPCR, the expression levels of target genes are estimated on the basis of reference genes, in order to control for RNA variations. Although determination of suitable reference genes is critical for RT-qPCR studies, reports on the evaluation of reference genes are still absent for the major Australian species Zostera muelleri subsp. capricorni (Z. muelleri). Here, we used three software packages (geNorm, NormFinder and Bestkeeper) to evaluate ten widely used reference genes according to their expression stability in Z. muelleri exposed to light limitation. We then combined the results from the three packages and used a consensus rank of the four best reference genes to validate regulation in Photosystem I reaction center subunit IV B and Heat Stress Transcription factor A- gene expression in Z. muelleri under light limitation. This study provides the first comprehensive list of reference genes in Z. muelleri and demonstrates RT-qPCR as an effective tool to identify early responses to light limitation in seagrass.
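
The abstract above says that target gene expression is estimated relative to reference genes. As a rough illustration only, the sketch below uses the common 2^(-ΔΔCt) calculation with several reference genes. It is not taken from the study: the function names and Ct values are hypothetical, and it assumes 100% amplification efficiency for every gene.

```python
from statistics import mean


def delta_ct(ct_target, ct_refs):
    """Ct of the target gene minus the mean Ct of the reference genes.

    Averaging Ct values corresponds to taking the geometric mean of the
    reference quantities, because Ct is on a log2 scale.
    """
    return ct_target - mean(ct_refs)


def fold_change(ct_target_treated, ct_refs_treated,
                ct_target_control, ct_refs_control):
    """Relative expression (treated vs. control) by the 2^(-ddCt) method.

    Assumption: every primer pair amplifies with 100% efficiency, so one
    cycle corresponds to a two-fold difference in template.
    """
    ddct = (delta_ct(ct_target_treated, ct_refs_treated)
            - delta_ct(ct_target_control, ct_refs_control))
    return 2.0 ** (-ddct)


# Hypothetical Ct values: the target crosses the threshold one cycle
# earlier under treatment while the reference genes stay unchanged.
print(fold_change(24, [20, 22], 25, [20, 22]))
```

With unstable reference genes the mean reference Ct would itself shift between conditions and distort the result, which is why reference stability is evaluated before this step.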

    Conformational rearrangements upon start codon recognition in human 48S translation initiation complex

    Selection of the translation start codon is a key step during protein synthesis in human cells. We obtained cryo-EM structures of human 48S initiation complexes and characterized the intermediates of codon recognition by kinetic methods using eIF1A as a reporter. Both approaches capture two distinct ribosome populations formed on an mRNA with a cognate AUG codon in the presence of eIF1, eIF1A, eIF2–GTP–Met-tRNAiMet and eIF3. The ‘open’ 40S subunit conformation differs from the human 48S scanning complex and represents an intermediate preceding the codon recognition step. The ‘closed’ form is similar to reported structures of complexes from yeast and mammals formed upon codon recognition, except for the orientation of eIF1A, which is unique in our structure. Kinetic experiments show how various initiation factors mediate the population distribution of open and closed conformations until 60S subunit docking. Our results provide insights into the timing and structure of human translation initiation intermediates and suggest differences in the mechanisms of start codon selection between mammals and yeast.

    Photosynthetic acclimation of Nannochloropsis oculata investigated by multi-wavelength chlorophyll fluorescence analysis

    Multi-wavelength chlorophyll fluorescence analysis was utilised to examine the photosynthetic efficiency of the biofuel-producing alga Nannochloropsis oculata, grown under two light regimes: low (LL) and high (HL) irradiance levels. Wavelength dependency was evident in the functional absorption cross-section of Photosystem II (σII(λ)), absolute electron transfer rates (ETR(II)), and non-photochemical quenching (NPQ) of chlorophyll fluorescence in both HL and LL cells. While σII(λ) was not significantly different between the two growth conditions, HL cells upregulated ETR(II) 1.6-1.8-fold compared to LL cells, most significantly in the wavelength range of 440-540 nm. This indicates preferential utilisation of blue-green light, a highly relevant spectral region for visible light in algal pond conditions. Under these conditions, the HL cells accumulated saturated fatty acids, whereas polyunsaturated fatty acids were more abundant in LL cells. This knowledge is of importance for the use of N. oculata for fatty acid production in the biofuel industry. © 2014 Elsevier Ltd

    Accelerating Bayesian hierarchical clustering of time series data with a randomised algorithm

    We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make practical the analysis of larger genomic data sets, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is available as part of the R package BHC, which is available for download from Bioconductor (version 2.10 and above) via http://bioconductor.org/packages/2.10/bioc/html/BHC.html. We have also made available a set of R scripts which can be used to reproduce the analyses carried out in this paper. These are available from https://sites.google.com/site/randomisedbhc/

    Fast MCMC sampling for hidden Markov models to determine copy number variations

    Background: Hidden Markov Models (HMM) are often used for analyzing Comparative Genomic Hybridization (CGH) data to identify chromosomal aberrations or copy number variations by segmenting observation sequences. For efficiency reasons the parameters of an HMM are often estimated with maximum likelihood and a segmentation is obtained with the Viterbi algorithm. This introduces considerable uncertainty in the segmentation, which can be avoided with Bayesian approaches integrating out parameters using Markov Chain Monte Carlo (MCMC) sampling. While the advantages of Bayesian approaches have been clearly demonstrated, the likelihood-based approaches are still preferred in practice for their lower running times; datasets coming from high-density arrays and next generation sequencing amplify these problems. Results: We propose an approximate sampling technique, inspired by compression of discrete sequences in HMM computations and by kd-trees to leverage spatial relations between data points in typical data sets, to speed up the MCMC sampling. Conclusions: We test our approximate sampling method on simulated and biological ArrayCGH datasets and high-density SNP arrays, and demonstrate speed-ups of 10 to 60 and of 90, respectively, while achieving competitive results with the state-of-the-art Bayesian approaches. Availability: An implementation of our method will be made available as part of the open source GHMM library from http://ghmm.org.
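
The abstract above refers to the conventional approach of segmenting an observation sequence with the Viterbi algorithm. The sketch below illustrates that baseline only, not the approximate MCMC sampler the paper proposes and not the GHMM library API. The Gaussian emissions with a shared standard deviation, the symmetric transition matrix, the uniform initial distribution and all names are assumptions made for illustration.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi_segment(obs, means, sd, stay_prob):
    """Most likely state index per observation (needs at least 2 states).

    Assumed model: one Gaussian emission per state with a shared sd,
    probability stay_prob of remaining in a state, the remaining mass
    split evenly over the other states, and a uniform initial state.
    """
    k = len(means)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (k - 1))

    def log_trans(i, j):
        return log_stay if i == j else log_switch

    # score[j]: best log probability of any path ending in state j.
    score = [-math.log(k) + gaussian_logpdf(obs[0], m, sd) for m in means]
    back = []  # back[t][j]: best predecessor of state j at step t + 1
    for x in obs[1:]:
        new_score, pointers = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + log_trans(i, j))
            new_score.append(score[best_i] + log_trans(best_i, j)
                             + gaussian_logpdf(x, means[j], sd))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(k), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical log-ratio signal with a level shift halfway through.
print(viterbi_segment([0.0, 0.1, -0.1, 1.0, 0.9, 1.1], [0.0, 1.0], 0.2, 0.9))
```

Because the parameters are fixed point estimates here, the returned path carries no measure of uncertainty, which is the limitation the Bayesian sampling approaches address.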