Bioinformatic pipelines in Python with Leaf

Publication venue: BioMed Central
Publication date: 21/06/2013

Springer - Publisher Connector

Bioinformatic pipelines in Python with Leaf

Author: Francesco Napolitano
Renato Mariani-Costantini
Roberto Tagliaferri
Publication venue: Springer Science and Business Media LLC

Crossref

Bioinformatic Workflows for Generating Complete Plastid Genome Sequences - An Example from Cabomba (Cabombaceae) in the Context of the Phylogenomic Analysis of the Water-Lily Clade

Author: Borsch Thomas
Gerschler Nico
Gruenstaeudl Michael
Publication date: 01/01/2018

The sequencing and comparison of plastid genomes are becoming a standard method in plant genomics, and many researchers are using this approach to infer plant phylogenetic relationships. Due to the widespread availability of next-generation sequencing, plastid genome sequences are being generated at breakneck pace. This trend towards massive sequencing of plastid genomes highlights the need for standardized bioinformatic workflows. In particular, documentation and dissemination of the details of genome assembly, annotation, alignment and phylogenetic tree inference are needed, as these processes are highly sensitive to the choice of software and the precise settings used. Here, we present the procedure and results of sequencing, assembling, annotating and quality-checking of three complete plastid genomes of the aquatic plant genus Cabomba as well as subsequent gene alignment and phylogenetic tree inference. We accompany our findings by a detailed description of the bioinformatic workflow employed. Importantly, we share a total of eleven software scripts for each of these bioinformatic processes, enabling other researchers to evaluate and replicate our analyses step by step. The results of our analyses illustrate that the plastid genomes of Cabomba are highly conserved in both structure and gene content

Institutional Repository of the Freie Universität Berlin

Directory of Open Access Journals

ETE: a python Environment for Tree Exploration

Author: Dopazo Joaquín
Gabaldón Toni
Huerta-Cepas Jaime
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Many bioinformatics analyses, ranging from gene clustering to phylogenetics, produce hierarchical trees as their main result. These are used to represent the relationships among different biological entities, thus facilitating their analysis and interpretation. A number of standalone programs are available that focus on tree visualization or that perform specific analyses on them. However, such applications are rarely suitable for large-scale surveys, in which a higher level of automation is required. Currently, many genome-wide analyses rely on tree-like data representation and hence there is a growing need for scalable tools to handle tree structures at large scale. Results: Here we present the Environment for Tree Exploration (ETE), a Python programming toolkit that assists in the automated manipulation, analysis and visualization of hierarchical trees. ETE libraries provide a broad set of tree handling options as well as specific methods to analyze phylogenetic and clustering trees. Among other features, ETE allows for the independent analysis of tree partitions, supports the extended Newick format, provides an integrated node annotation system and allows trees to be linked to external data such as multiple sequence alignments or numerical arrays. In addition, ETE implements a number of built-in analytical tools, including phylogeny-based orthology prediction and cluster validation techniques. Finally, ETE's programmable tree drawing engine can be used to automate the graphical rendering of trees with customized node-specific visualizations. Conclusions: ETE provides a complete set of methods to manipulate tree data structures that extends the current functionality of other, more general-purpose bioinformatic toolkits. ETE is free software and can be downloaded from http://ete.cgenomics.org.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

BUSCA: An integrative web server to predict subcellular localization of proteins

Author: Casadio Rita
Fariselli Piero
Martelli Pier Luigi
Profiti Giuseppe
Savojardo Castrense
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2018

Here, we present BUSCA (http://busca.biocomp.unibo.it), a novel web server that integrates different computational tools for predicting protein subcellular localization. BUSCA combines methods for identifying signal and transit peptides (DeepSig and TPpred3), GPI-anchors (PredGPI) and transmembrane domains (ENSEMBLE3.0 and BetAware) with tools for discriminating subcellular localization of both globular and membrane proteins (BaCelLo, MemLoci and SChloro). Outcomes from the different tools are processed and integrated for annotating subcellular localization of both eukaryotic and bacterial protein sequences. We benchmark BUSCA against protein targets derived from recent CAFA experiments and other specific data sets, reporting state-of-the-art performance. BUSCA scores better than all other evaluated methods on 2732 targets from CAFA2, with an F1 value equal to 0.49, and is among the best methods when predicting targets from CAFA3. We propose BUSCA as an integrated and accurate resource for the annotation of protein subcellular localization

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Padova

Institutional Research Information System University of Turin

From Genes to Ecosystems: Resource Availability and DNA Methylation Drive the Diversity and Abundance of Restriction Modification Systems in Prokaryotes

Author: Papoulis Spiridon E
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/06/2020

Together, prokaryotic hosts and their viruses numerically dominate the planet and are engaged in an eternal struggle in which hosts evade viral predation and viruses overcome the defensive mechanisms employed by their hosts. In recent years, prokaryotic hosts have been found to carry several viral defense systems, of which Restriction Modification systems (RMs) were the first, discovered in the 1950s. While we have biochemically elucidated many of these systems in the last 70 years, we still struggle to understand what drives their gain and loss in prokaryotic genomes. In this work, we take a computational approach to understand the underlying evolutionary drivers of RMs by assessing ‘big data’ signals of RMs in prokaryotic genomes and incorporating molecular data in trait-based mathematical models. Focusing on the Cyanobacteria, we found a large discrepancy in the frequency of RMs per genome in different environmental contexts, where Cyanobacteria that live in oligotrophic nutrient conditions have few to no RMs and those in nutrient-rich conditions consistently have many RMs. While our models agree with the observation that increased nutrient inputs make the selective pressure of RMs more intense, they were unable to reconcile the high numbers of RMs per genome with their potent defensive properties, a situation of apparent overkill. By incorporating viral methylation, an unavoidable effect of RMs, we were able to explain how organisms could carry over 15 RMs. With this discovery, we then tried to reassess the distribution of methyltransferases, an essential component of RMs that can also have alternate physiological roles in the cell. We expand on the conventional wisdom that methyltransferases that are widely phylogenetically conserved are associated with global cellular regulation. However, we also find that organisms with high numbers of RMs have a surprising amount of conservation in the methyltransferases that they carry. These data suggest that caution should be used in associating phylogenetic signals with functional roles in methyltransferases, as different functional roles seem to overlap in their phylogenetic signal. Indeed, we suggest trait-based modeling may be the best tool for elucidating why organisms with a high selective pressure to maintain RMs appear to have conserved methyltransferase

University of Tennessee, Knoxville: Trace

EMBL2checklists: A Python package to facilitate the user-friendly submission of plant and fungal DNA barcoding sequences to ENA

Author: Gruenstaeudl Michael
Hartmaring Yannick
Publication date: 01/01/2019

The submission of DNA sequences to public sequence databases is an essential, but insufficiently automated step in the process of generating and disseminating novel DNA sequence data. Despite the centrality of database submissions to biological research, few software tools are available that facilitate the preparation of sequence data for database submissions, especially for sequences generated via plant and fungal DNA barcoding. Current submission procedures can be complex and prohibitively time-consuming for any but a small number of input sequences. A user-friendly software tool is needed that streamlines the file preparation for database submissions of DNA sequences that are commonly generated in plant and fungal DNA barcoding

Institutional Repository of the Freie Universität Berlin

Directory of Open Access Journals

FigShare

qTeller: a tool for comparative multi-genomic gene expression analysis

Author: Andorf Carson M.
Freeling Michael
Portwood John L., II
Schnable James
Schott David
Sen Shatabdi
Walley Justin W.
Woodhouse Margaret R.
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/01/2022

Motivation: Over the last decade, RNA-Seq whole-genome sequencing has become a widely used method for measuring and understanding transcriptome-level changes in gene expression. Since RNA-Seq is relatively inexpensive, it can be used on multiple genomes to evaluate gene expression across many different conditions, tissues and cell types. Although many tools exist to map and compare RNA-Seq at the genomics level, few web-based tools are dedicated to making data generated for individual genomic analysis accessible and reusable at a gene-level scale for comparative analysis between genes, across different genomes and meta-analyses. Results: To address this challenge, we revamped the comparative gene expression tool qTeller to take advantage of the growing number of public RNA-Seq datasets. qTeller allows users to evaluate gene expression data in a defined genomic interval and also perform two-gene comparisons across multiple user-chosen tissues. Though previously unpublished, qTeller has been cited extensively in the scientific literature, demonstrating its importance to researchers. Our new version of qTeller now supports multiple genomes for intergenomic comparisons, and includes capabilities for both mRNA and protein abundance datasets. Other new features include support for additional data formats, a modernized interface and back-end database, and an optimized framework for adoption by other organisms’ databases. Availability and implementation: The source code for qTeller is open-source and available through GitHub (https://github.com/Maize-Genetics-and-Genomics-Database/qTeller). A maize instance of qTeller is available at the Maize Genetics and Genomics database (MaizeGDB) (https://qteller.maizegdb.org/), where we have mapped over 200 unique datasets from GenBank across 27 maize genomes

DigitalCommons@University of Nebraska

A comprehensive compendium of Arabidopsis RNA-seq data

Author: Halladay Gareth A.
Publication venue: Colorado State University. Libraries
Publication date: 01/01/2020

2020 Spring. Includes bibliographical references. In the last fifteen years, the amount of publicly available genomic sequencing data has doubled every few months. Analyzing large collections of RNA-seq datasets can provide insights that are not available when analyzing data from single experiments. There are barriers to such analyses: combining processed data is challenging because varying methods for processing data make it difficult to compare data across studies; combining data in raw form is challenging because of the resources needed to process the data. Multiple RNA-seq compendiums, which are curated sets of RNA-seq data that have been pre-processed in a uniform fashion, exist; however, there is no such resource in plants. We created a comprehensive compendium for Arabidopsis thaliana using a pipeline based on Snakemake. We downloaded over 80 Arabidopsis studies from the Sequence Read Archive. Through a strict set of criteria, we chose 35 studies containing a total of 700 biological replicates, with a focus on the response of different Arabidopsis tissues to a variety of stresses. In order to make the studies comparable, we hand-curated the metadata, and pre-processed and analyzed each sample using our pipeline. We performed exploratory analysis on the samples in our compendium, using PCA and t-SNE, for quality control and to identify biologically distinct subgroups. We discuss the differences between these two methods and show that the data separate primarily by tissue type and, to a lesser extent, by the type of stress. We identified treatment conditions for each study and generated three lists: differentially expressed genes, differentially expressed introns, and genes that were differentially expressed under multiple conditions. We then visually analyzed these groups, looking for overarching patterns within the data, finding around a thousand genes that participate in stress response across tissues and stresses

Mountain Scholar (Digital Collections of Colorado and Wyoming)

Design considerations for workflow management systems use in production genomics research and the clinic

Author: Ahmed Azza E.
Allen Joshua M.
Bhat Tajesvi
Burra Prakruthi
Fadlelmola Faisal M.
Fliege Christina E.
Hart Steven N.
Heldenbrand Jacob R.
Hudson Matthew E.
Istanto Dave Deandre
Kalmbach Michael T.
Kapraun Gregory D.
Kendig Katherine I.
Kendzior Matthew Charles
Klee Eric W.
Mainzer Liudmila S.
Mattson Nate
Ross Christian A.
Sharif Sami M.
Venkatakrishnan Ramshankar
Publication venue: Springer Science and Business Media LLC
Publication date: 01/11/2021

The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, where both were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, ease of development, along with adoption and usage in research labs and healthcare settings. This article tries to answer the question: which WfMS should be chosen for a given bioinformatics application, regardless of analysis type? The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Directory of Open Access Journals

Dissertations of the University of Groningen