73 research outputs found
TRAPID : an efficient online tool for the functional and comparative analysis of de novo RNA-Seq transcriptomes
Transcriptome analysis through next-generation sequencing technologies allows the generation of detailed gene catalogs for non-model species, at the cost of new challenges with regard to computational requirements and bioinformatics expertise. Here, we present TRAPID, an online tool for the fast and efficient processing of assembled RNA-Seq transcriptome data, developed to mitigate these challenges. TRAPID offers high-throughput open reading frame detection and frameshift correction, and includes a functional, comparative and phylogenetic toolbox that makes use of 175 reference proteomes. Benchmarking and comparison against state-of-the-art transcript analysis tools reveal the efficiency and unique features of the TRAPID system.
Predicting protein-protein interactions in Arabidopsis thaliana through integration of orthology, gene ontology and co-expression
Background: Large-scale identification of the interrelationships between different components of the cell, such as the interactions between proteins, has recently gained great interest. However, unraveling large-scale protein-protein interaction maps is laborious and expensive. Moreover, assessing the reliability of the interactions can be cumbersome. Results: In this study, we have developed a computational method that exploits the existing knowledge on protein-protein interactions in diverse species through orthologous relations on the one hand, and functional association data on the other hand to predict and filter protein-protein interactions in Arabidopsis thaliana. A highly reliable set of protein-protein interactions is predicted through this integrative approach making use of existing protein-protein interaction data from yeast, human, C. elegans and D. melanogaster. Localization, biological process, and co-expression data are used as powerful indicators for protein-protein interactions. The functional repertoire of the identified interactome reveals interactions between proteins functioning in well-conserved as well as plant-specific biological processes. We observe that although common mechanisms (e.g. actin polymerization) and components (e.g. ARPs, actin-related proteins) exist between different lineages, they are active in specific processes such as growth, cancer metastasis and trichome development in yeast, human and Arabidopsis, respectively. Conclusion: We conclude that the integration of orthology with functional association data is adequate to predict protein-protein interactions. Through this approach, a high number of novel protein-protein interactions with diverse biological roles is discovered. Overall, we have predicted a reliable set of protein-protein interactions suitable for further computational as well as experimental analyses.
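The approach described in this abstract (transferring known interactions through orthology, then filtering with functional association data) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy identifiers, the co-expression scores and the 0.5 cutoff are all hypothetical, and only co-expression is used as a filter here, whereas the abstract also mentions localization and biological process.

```python
from itertools import product


def predict_interologs(known_interactions, orthologs):
    """Transfer reference-species interactions to a target species via orthology.

    known_interactions: iterable of (protein_a, protein_b) pairs in the reference.
    orthologs: dict mapping a reference protein to its target-species orthologs.
    """
    predicted = set()
    for a, b in known_interactions:
        # Every combination of orthologs of a and b is a candidate interaction.
        for ta, tb in product(orthologs.get(a, ()), orthologs.get(b, ())):
            if ta != tb:  # simplification: self-interactions are ignored
                predicted.add(tuple(sorted((ta, tb))))
    return predicted


def filter_by_coexpression(pairs, coexpression, min_score=0.5):
    """Keep only candidate pairs supported by a co-expression score (assumed cutoff)."""
    return {p for p in pairs if coexpression.get(p, 0.0) >= min_score}


# Toy data; every identifier and score below is hypothetical.
yeast_ppi = [("Y1", "Y2"), ("Y2", "Y3")]
orthologs = {"Y1": ["AT1"], "Y2": ["AT2", "AT3"], "Y3": []}
coexpr = {("AT1", "AT2"): 0.8, ("AT1", "AT3"): 0.2}

candidates = predict_interologs(yeast_ppi, orthologs)
reliable = filter_by_coexpression(candidates, coexpr)
print(sorted(candidates))
print(sorted(reliable))
```

Storing each pair as a sorted tuple keeps the predicted set free of duplicates that differ only in the order of the two proteins.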
i-ADHoRe 3.0—fast and sensitive detection of genomic homology in extremely large data sets
Comparative genomics is a powerful means to gain insight into the evolutionary processes that shape the genomes of related species. As the number of sequenced genomes increases, the development of software to perform accurate cross-species analyses becomes indispensable. However, many implementations that have the ability to compare multiple genomes exhibit unfavorable computational and memory requirements, limiting the number of genomes that can be analyzed in one run. Here, we present a software package to unveil genomic homology based on the identification of conservation of gene content and gene order (collinearity), i-ADHoRe 3.0, and its application to eukaryotic genomes. The use of efficient algorithms and support for parallel computing enable the analysis of large-scale data sets. Unlike other tools, i-ADHoRe can process the Ensembl data set, containing 49 species, in 1 h. Furthermore, the profile search is more sensitive in detecting degenerate genomic homology than chaining pairwise collinearity information based on transitive homology. From ultra-conserved collinear regions between mammals and birds, by integrating coexpression information and protein–protein interactions, we identified more than 400 regions in the human genome showing significant functional coherence. The different algorithmic improvements ensure that i-ADHoRe 3.0 will remain a powerful tool to study genome evolution.
The genome sequence of the orchid Phalaenopsis equestris
Orchidaceae, renowned for its spectacular flowers and other reproductive and ecological adaptations, is one of the most diverse plant families. Here we present the genome sequence of the tropical epiphytic orchid Phalaenopsis equestris, a frequently used parent species for orchid breeding. P. equestris is the first plant with crassulacean acid metabolism (CAM) for which the genome has been sequenced. Our assembled genome contains 29,431 predicted protein-coding genes. We find that contigs likely to be underassembled, owing to heterozygosity, are enriched for genes that might be involved in self-incompatibility pathways. We find evidence for an orchid-specific paleopolyploidy event that preceded the radiation of most orchid clades, and our results suggest that gene duplication might have contributed to the evolution of CAM photosynthesis in P. equestris. Finally, we find expanded and diversified families of MADS-box C/D-class, B-class AP3 and AGL6-class genes, which might contribute to the highly specialized morphology of orchid flowers.
DNA Damage in Plant Herbarium Tissue
Dried plant herbarium specimens are potentially a valuable source of DNA. Efforts to obtain genetic information from this source are often hindered by an inability to obtain amplifiable DNA, as herbarium DNA is typically highly degraded. DNA post-mortem damage may not only reduce the number of amplifiable template molecules, but may also lead to the generation of erroneous sequence information. A qualitative and quantitative assessment of DNA post-mortem damage is essential to determine the accuracy of molecular data from herbarium specimens. In this study we present an assessment of DNA damage as miscoding lesions in herbarium specimens using 454-sequencing of amplicons derived from plastid, mitochondrial, and nuclear DNA. In addition, we assess DNA degradation as a result of strand breaks and other types of polymerase non-bypassable damage by quantitative real-time PCR. Comparing four pairs of fresh and herbarium specimens of the same individuals, we quantitatively assess post-mortem DNA damage directly after specimen preparation, as well as after long-term herbarium storage. After specimen preparation we estimate the proportion of gene copy numbers of plastid, mitochondrial, and nuclear DNA to be 2.4–3.8% of fresh control DNA and 1.0–1.3% after long-term herbarium storage, indicating that nearly all DNA damage occurs during specimen preparation. In addition, there is no evidence of preferential degradation of organelle versus nuclear genomes. Increased levels of C→T/G→A transitions were observed in old herbarium plastid DNA, representing 21.8% of observed miscoding lesions. We interpret this type of post-mortem DNA damage-derived modification to have arisen from the hydrolytic deamination of cytosine during long-term herbarium storage. Our results suggest that reliable sequence data can be obtained from herbarium specimens.
LSTrAP: efficiently combining RNA sequencing data into co-expression networks
Background: Since experimental elucidation of gene function is often laborious, various in silico methods have been developed to predict gene function of uncharacterized genes. Since functionally related genes are often expressed in the same tissues, conditions and developmental stages (co-expressed), functional annotation of characterized genes can be transferred to co-expressed genes lacking annotation. With genome-wide expression data available, the construction of co-expression networks, where genes are nodes and edges connect significantly co-expressed genes, provides unprecedented opportunities to predict gene function. However, the construction of such networks requires large volumes of high-quality data, multiple processing steps and a considerable amount of computation power. While efficient tools exist to process RNA-Seq data, pipelines which combine them to construct co-expression networks efficiently are currently lacking. Results: LSTrAP (Large-Scale Transcriptome Analysis Pipeline), presented here, combines all essential tools to construct co-expression networks based on RNA-Seq data into a single, efficient workflow. By supporting parallel computing on computer cluster infrastructure, processing hundreds of samples becomes feasible as shown here for Arabidopsis thaliana and Sorghum bicolor, which comprised 876 and 215 samples respectively. The former was used here to show how the quality control, included in LSTrAP, can detect spurious or low-quality samples. The latter was used to show how co-expression networks are able to group known photosynthesis genes and imply a role in this process of several, currently uncharacterized, genes. Conclusions: LSTrAP combines the most popular and performant methods to construct co-expression networks from RNA-Seq data into a single workflow. This allows large amounts of expression data, required to construct co-expression networks, to be processed efficiently and consistently across hundreds of samples.
LSTrAP is implemented in Python 3.4 (or higher) and available under MIT license from https://github.molgen.mpg.de/proost/LSTrA
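The network definition in this abstract (genes as nodes, edges between significantly co-expressed genes) can be sketched in a few lines. This is an illustration only and not LSTrAP code: the use of Pearson correlation with a fixed 0.9 cutoff as a stand-in for "significantly co-expressed", the function names, and the toy expression values are all assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with r >= threshold (assumed cutoff)."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression values across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = coexpression_edges(expression)
for g1, g2, r in edges:
    print(g1, g2, round(r, 3))
```

In this toy data only geneA and geneB rise together across samples, so they form the single edge; geneC is anti-correlated with both and stays unconnected.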
raeslab/lorepy: lorepy version 0.1.1
lorepy: a Python package to create logistic regression plots which visualize the relation between a continuous variable and a categorical one.
PhytoNet: comparative co-expression network analyses across phytoplankton and land plants
Phytoplankton consists of autotrophic, photosynthesizing microorganisms that are a crucial component of freshwater and ocean ecosystems. However, despite being the major primary producers of organic compounds, accounting for half of the photosynthetic activity worldwide and serving as the entry point to the food chain, functions of most of the genes of the model phytoplankton organisms remain unknown. To remedy this, we have gathered publicly available expression data for one chlorophyte, one rhodophyte, one haptophyte, two heterokonts and four cyanobacteria and integrated it into our PlaNet (Plant Networks) database, which now allows mining gene expression profiles and identification of co-expressed genes of 19 species. We exemplify how the co-expressed gene networks can be used to reveal functionally related genes and how the comparative features of PhytoNet allow detection of conserved transcriptional programs between cyanobacteria, green algae, and land plants. Additionally, we illustrate how the database allows detection of duplicated transcriptional programs within an organism, as exemplified by two putative DNA repair programs within Chlamydomonas reinhardtii. PhytoNet is available from www.gene2function.de.
Additional file 1: Figure S1. of LSTrAP: efficiently combining RNA sequencing data into co-expression networks
Quality statistics for Sorghum bicolor samples. Gray dots indicate quality statistics of the samples based on HTSeq-Count and TopHat. Samples below our suggested quality control (contained within red areas in plot) were excluded from the final network. Figure S2. Dendrogram and heatmap of Sorghum bicolor sample distances. The helper script matrix_heatmap.py calculates the Euclidean distance between samples and plots a hierarchically clustered heatmap of those sample distances. This can be used to detect outliers. Here the most divergent samples (in the top left) are valid pollen and seed samples which are known to have a unique transcriptional profile. Figure S3. Node degree distribution of the Arabidopsis thaliana samples co-expression network. Co-expression networks are known to have few nodes with many connections to other genes and many genes with few connections. For the co-expression network of Arabidopsis thaliana based on the positive samples, this behavior can clearly be observed. Table S1. Negative Arabidopsis thaliana dataset. The columns correspond to SRA run IDs for the samples, short description (description and type) and mapping percentages for TopHat and HTSeq-count. Table S2. Sorghum bicolor samples with organ annotation. Overview of all Sorghum bicolor samples used, organized by organ the samples were derived from. Methods S1. Data source and curation. Methods S2. PCA analysis of expression data. Methods S3. Power law. (DOCX 411 kb)
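The distance step described for Figure S2 can be sketched as follows. This is not the matrix_heatmap.py helper itself: the function names and sample values are hypothetical, and ranking samples by their summed distance to all others is a simple heuristic assumed here in place of the hierarchical clustering and heatmap the script produces.

```python
from math import dist


def sample_distance_matrix(samples):
    """Pairwise Euclidean distances between sample expression vectors."""
    names = sorted(samples)
    matrix = [[dist(samples[a], samples[b]) for b in names] for a in names]
    return names, matrix


def most_divergent(samples):
    """Name of the sample with the largest summed distance to all others (assumed heuristic)."""
    names, matrix = sample_distance_matrix(samples)
    totals = [sum(row) for row in matrix]
    return names[totals.index(max(totals))]


# Hypothetical expression vectors (three genes per sample).
samples = {
    "leaf_1": [10.0, 5.0, 0.0],
    "leaf_2": [11.0, 5.0, 0.0],
    "pollen": [0.0, 1.0, 20.0],
}
names, matrix = sample_distance_matrix(samples)
print(names)
print(most_divergent(samples))
```

A sample flagged this way is not necessarily low quality: as the caption notes, the most divergent Sorghum bicolor samples were valid pollen and seed samples with a distinct transcriptional profile.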