60 research outputs found
PRI-CAT: a web-tool for the analysis, storage and visualization of plant ChIP-seq experiments
Although several tools for the analysis of ChIP-seq data have been published recently, there is a growing demand, in particular in the plant research community, for computational resources with which such data can be processed, analyzed, stored, visualized and integrated within a single, user-friendly environment. To accommodate this demand, we have developed PRI-CAT (Plant Research International ChIP-seq analysis tool), a web-based workflow tool for the management and analysis of ChIP-seq experiments. PRI-CAT is currently focused on Arabidopsis, but will be extended to other plant species in the near future. Users can directly submit their sequencing data to PRI-CAT for automated analysis. A QuickLoad server compatible with genome browsers is implemented for the storage and visualization of DNA-binding maps. Submitted datasets and results can be made publicly available through PRI-CAT, a feature that will enable community-based integrative analysis and visualization of ChIP-seq experiments. Secondary analysis of data can be performed with the aid of GALAXY, an external framework for tool and data integration. PRI-CAT is freely available at http://www.ab.wur.nl/pricat. No login is required.
Sequencing the Potato Genome: Outline and First Results to Come from the Elucidation of the Sequence of the World’s Third Most Important Food Crop
Potato is a member of the Solanaceae, a plant family that includes several other economically important species, such as tomato, eggplant, petunia, tobacco and pepper. The Potato Genome Sequencing Consortium (PGSC) aims to elucidate the complete genome sequence of potato, the third most important food crop in the world. The PGSC is a collaboration between 13 research groups from China, India, Poland, Russia, the Netherlands, Ireland, Argentina, Brazil, Chile, Peru, USA, New Zealand and the UK. The potato genome consists of 12 chromosomes and has a (haploid) length of approximately 840 million base pairs, making it a medium-sized plant genome. The sequencing project builds on a diploid potato genomic bacterial artificial chromosome (BAC) clone library of 78000 clones, which has been fingerprinted and aligned into ~7000 physical map contigs. In addition, the BAC-ends have been sequenced and are publicly available. Approximately 30000 BACs are anchored to the Ultra High Density genetic map of potato, composed of 10000 unique AFLP™ markers. From this integrated genetic-physical map, between 50 and 150 seed BACs have currently been identified for every chromosome. Fluorescent in situ hybridization experiments on selected BAC clones confirm these anchor points. The seed clones provide the starting point for a BAC-by-BAC sequencing strategy. This strategy is being complemented by whole genome shotgun sequencing approaches using both 454 GS FLX and Illumina GA2 instruments. Assembly and annotation of the sequence data will be performed using publicly available and tailor-made tools. The availability of the annotated data will help to characterize germplasm collections based on allelic variance and to assist potato breeders to more fully exploit the genetic potential of potato.
Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function
This work was supported by Keygene N.V., a crop innovation company in the Netherlands, and by the Spanish MINECO/FEDER Project TEC201680141-P with the associated FPI grant BES-2017-079792. The authors thank Dr. Elvin Isufi and Chirag Raman for their valuable comments and feedback.
Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require large amounts of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels is available.
Results: We applied an existing deep sequence model, pretrained in an unsupervised setting, to the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining.
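To make the hand-crafted baselines named in this abstract concrete, the sketch below shows one plausible way to compute a one-hot encoding and overlapping k-mer counts for a protein sequence. The function names, the 20-letter alphabet ordering and the default `k=2` are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from itertools import product

# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Return one row per residue, with a 1 at that residue's alphabet index."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:  # non-standard residues are left as all-zero rows
            row[index[residue]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=2):
    """Count overlapping k-mers as a fixed-length vector over all 20**k k-mers."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts.get(kmer, 0) for kmer in vocabulary]
```

The one-hot matrix grows with sequence length, whereas the k-mer vector has a fixed length (400 entries for `k=2`) regardless of the protein, which is one reason both kinds of feature are commonly used as simple baselines.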
Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data
Inference of protein functions is one of the most important aims of modern
biology. To fully exploit the large volumes of genomic data typically produced
in modern-day genomic experiments, automated computational methods for protein
function prediction are urgently needed. Established methods use sequence or
structure similarity to infer functions but those types of data do not suffice
to determine the biological context in which proteins act. Current
high-throughput biological experiments produce large amounts of data on the
interactions between proteins. Such data can be used to infer interaction
networks and to predict the biological process that the protein is involved in.
Here, we develop a probabilistic approach for protein function prediction using
network data, such as protein-protein interaction measurements. We take a
Bayesian approach to an existing Markov Random Field method by performing
simultaneous estimation of the model parameters and prediction of protein
functions. We use an adaptive Markov Chain Monte Carlo algorithm that leads to
more accurate parameter estimates and consequently to improved prediction
performance compared to the standard Markov Random Fields method. We tested our
method using a high-quality S. cerevisiae validation network
with 1622 proteins against 90 Gene Ontology terms of different levels of
abstraction. Compared to three other protein function prediction methods, our
approach shows very good prediction performance. Our method can be directly
applied to protein-protein interaction or coexpression networks, but also can be
extended to use multiple data sources. We apply our method to physical protein
interaction data from S. cerevisiae and provide novel
predictions, using 340 Gene Ontology terms, for 1170 unannotated proteins and we
evaluate the predictions using the available literature.
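As a rough illustration of network-based function prediction with a Markov Random Field, the sketch below uses a generic logistic (autologistic) conditional: the probability that a protein has a function depends on how many of its interaction partners do or do not have it. The parameter names `alpha`, `beta_pos` and `beta_neg`, the function names and the toy network are assumptions for illustration; they are not the parameterization used in the paper. The parameters are held fixed here, whereas the approach described in the abstract estimates them jointly with the predictions.

```python
import math
import random


def conditional_prob(node, labels, neighbors, alpha, beta_pos, beta_neg):
    """P(node has the function | current neighbor labels), logistic form."""
    n_pos = sum(1 for n in neighbors[node] if labels[n] == 1)
    n_neg = sum(1 for n in neighbors[node] if labels[n] == 0)
    log_odds = alpha + beta_pos * n_pos + beta_neg * n_neg
    return 1.0 / (1.0 + math.exp(-log_odds))


def gibbs_predict(neighbors, known, alpha, beta_pos, beta_neg,
                  sweeps=500, burn_in=100, seed=0):
    """Estimate P(function) for unannotated nodes by Gibbs sampling.

    `neighbors` maps each protein to its interaction partners and
    `known` maps annotated proteins to 0 or 1. Parameters stay fixed.
    """
    rng = random.Random(seed)
    unknown = [n for n in neighbors if n not in known]
    labels = dict(known)
    labels.update({n: rng.randint(0, 1) for n in unknown})  # random start
    totals = {n: 0 for n in unknown}
    for sweep in range(sweeps):
        for n in unknown:
            p = conditional_prob(n, labels, neighbors, alpha, beta_pos, beta_neg)
            labels[n] = 1 if rng.random() < p else 0
            if sweep >= burn_in:  # only average samples after burn-in
                totals[n] += labels[n]
    return {n: totals[n] / (sweeps - burn_in) for n in unknown}


# Toy network: protein "a" is unannotated and interacts with two annotated proteins.
toy_neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
toy_known = {"b": 1, "c": 1}
toy_posterior = gibbs_predict(toy_neighbors, toy_known,
                              alpha=0.0, beta_pos=1.0, beta_neg=-1.0)
```

In a fully Bayesian variant, each sweep would also propose new values for the parameters (for example with a Metropolis step) so that uncertainty in them carries over into the predicted probabilities.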
The Genomes of the Fungal Plant Pathogens Cladosporium fulvum and Dothistroma septosporum Reveal Adaptation to Different Hosts and Lifestyles But Also Signatures of Common Ancestry.
We sequenced and compared the genomes of the Dothideomycete fungal plant pathogens Cladosporium fulvum (Cfu) (syn. Passalora fulva) and Dothistroma septosporum (Dse) that are closely related phylogenetically, but have different lifestyles and hosts. Although both fungi grow extracellularly in close contact with host mesophyll cells, Cfu is a biotroph infecting tomato, while Dse is a hemibiotroph infecting pine. The genomes of these fungi have a similar set of genes (70% of gene content in both genomes are homologs), but differ significantly in size (Cfu >61.1-Mb; Dse 31.2-Mb), which is mainly due to the difference in repeat content (47.2% in Cfu versus 3.2% in Dse). Recent adaptation to different lifestyles and hosts is suggested by diverged sets of genes. Cfu contains an α-tomatinase gene that we predict might be required for detoxification of tomatine, while this gene is absent in Dse. Many genes encoding secreted proteins are unique to each species and the repeat-rich areas in Cfu are enriched for these species-specific genes. In contrast, conserved genes suggest common host ancestry. Homologs of Cfu effector genes, including Ecp2 and Avr4, are present in Dse and induce a Cf-Ecp2- and Cf-4-mediated hypersensitive response, respectively. Strikingly, genes involved in production of the toxin dothistromin, a likely virulence factor for Dse, are conserved in Cfu, but their expression differs markedly with essentially no expression by Cfu in planta. Likewise, Cfu has a carbohydrate-degrading enzyme catalog that is more similar to that of necrotrophs or hemibiotrophs and a larger pectinolytic gene arsenal than Dse, but many of these genes are not expressed in planta or are pseudogenized. Overall, comparison of their genomes suggests that these closely related plant pathogens had a common ancestral host but have since adapted to different hosts and lifestyles by a combination of differentiated gene content, pseudogenization, and gene regulation.
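The claim that the size difference is mainly due to repeat content can be checked with simple arithmetic on the figures quoted in the abstract. The sketch below treats the Cfu size as 61.1 Mb, although the abstract gives it as a lower bound; the function and variable names are illustrative.

```python
def split_genome(size_mb, repeat_fraction):
    """Split a genome size in Mb into (repeat, non-repeat) portions."""
    repeat_mb = size_mb * repeat_fraction
    return repeat_mb, size_mb - repeat_mb


# Figures as quoted in the abstract (Cfu size is reported as a lower bound).
cfu_repeat, cfu_unique = split_genome(61.1, 0.472)
dse_repeat, dse_unique = split_genome(31.2, 0.032)

# Share of the overall size difference that the repeat difference accounts for.
repeat_share = (cfu_repeat - dse_repeat) / (61.1 - 31.2)
```

Under these numbers, Cfu carries roughly 28.8 Mb of repeats against about 1.0 Mb in Dse, while the non-repeat portions are close (about 32.3 Mb versus 30.2 Mb). Repeats therefore account for roughly 93% of the 29.9 Mb size difference.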
Sequence Motifs in MADS Transcription Factors Responsible for Specificity and Diversification of Protein-Protein Interaction
Protein sequences encompass tertiary structures and contain information about specific molecular interactions, which in turn determine biological functions of proteins. Knowledge about how protein sequences define interaction specificity is largely missing, in particular for paralogous protein families with high sequence similarity, such as the plant MADS domain transcription factor family. In comparison to the situation in mammalian species, this important family of transcription regulators has expanded enormously in plant species and contains over 100 members in the model plant species Arabidopsis thaliana. Here, we provide insight into the mechanisms that determine protein-protein interaction specificity for the Arabidopsis MADS domain transcription factor family, using an integrated computational and experimental approach. Plant MADS proteins have highly similar amino acid sequences, but their dimerization patterns vary substantially. Our computational analysis uncovered small sequence regions that explain observed differences in dimerization patterns with reasonable accuracy. Furthermore, we show the usefulness of the method for prediction of MADS domain transcription factor interaction networks in other plant species. Introduction of mutations in the predicted interaction motifs demonstrated that single amino acid mutations can have a large effect and lead to loss or gain of specific interactions. In addition, the various bioinformatics analyses performed shed light on the way evolution has shaped MADS domain transcription factor interaction specificity. Identified protein-protein interaction motifs appeared to be strongly conserved among orthologs, indicating their evolutionary importance. We also provide evidence that mutations in these motifs can be a source for sub- or neo-functionalization. The analyses presented here take us a step forward in understanding protein-protein interactions and the interplay between protein sequences and network evolution.
A genome-wide genetic map of NB-LRR disease resistance loci in potato
Like all plants, potato has evolved a surveillance system consisting of a large array of genes encoding immune receptors that confer resistance to pathogens and pests. The majority of these so-called resistance or R proteins belong to the superfamily of proteins that harbour a nucleotide-binding and a leucine-rich-repeat domain (NB-LRR). Here, sequence information of the conserved NB domain was used to investigate the genome-wide genetic distribution of the NB-LRR resistance gene loci in potato. We analysed the sequences of 288 unique BAC clones selected using filter hybridisation screening of a BAC library of the diploid potato clone RH89-039-16 (S. tuberosum ssp. tuberosum) and a physical map of this BAC library. This resulted in the identification of 738 partial and full-length NB-LRR sequences. Based on homology of these sequences with known resistance genes, 280 and 448 sequences were classified as TIR-NB-LRR (TNL) and CC-NB-LRR (CNL) sequences, respectively. Genetic mapping revealed the presence of 15 TNL and 32 CNL loci. Thirty-six are novel, while three TNL loci and eight CNL loci are syntenic with previously identified functional resistance genes. The genetic map was complemented with 68 universal CAPS markers and 82 disease resistance trait loci described in literature, providing an excellent template for genetic studies and applied research in potato.
Conserved and variable correlated mutations in the plant MADS protein network
Background: Plant MADS domain proteins are involved in a variety of developmental processes for which their ability to form various interactions is a key requisite. However, not much is known about the structure of these proteins or their complexes, whereas such knowledge would be valuable for a better understanding of their function. Here, we analyze those proteins and the complexes they form using a correlated mutation approach in combination with available structural, bioinformatics and experimental data.
Results: Correlated mutations are affected by several types of noise, which is difficult to disentangle from the real signal. In our analysis of the MADS domain proteins, we apply for the first time a correlated mutation analysis to a family of interacting proteins. This provides a unique way to investigate the amount of signal that is present in correlated mutations because it allows direct comparison of mutations in various family members and assessing their conservation. We show that correlated mutations in general are conserved within the various family members, and if not, the variability at the respective positions is less in the proteins in which the correlated mutation does not occur. Also, intermolecular correlated mutation signals for interacting pairs of proteins display clear overlap with other bioinformatics data, which is not the case for non-interacting protein pairs, an observation which validates the intermolecular correlated mutations. Having validated the correlated mutation results, we apply them to infer the structural organization of the MADS domain proteins.
Conclusion: Our analysis enables understanding of the structural organization of the MADS domain proteins, including support for predicted helices based on correlated mutation patterns, and evidence for a specific interaction site in those proteins.