67 research outputs found
ChIP-seq Analysis in R (CSAR): An R package for the statistical detection of protein-bound genomic regions
<p>Abstract</p> <p>Background</p> <p><it>In vivo </it>detection of protein-bound genomic regions can be achieved by combining chromatin-immunoprecipitation with next-generation sequencing technology (ChIP-seq). The large amount of sequence data produced by this method needs to be analyzed in a statistically proper and computationally efficient manner. The generation of high copy numbers of DNA fragments as an artifact of the PCR step in ChIP-seq is an important source of bias of this methodology.</p> <p>Results</p> <p>We present here an R package for the statistical analysis of ChIP-seq experiments. Taking the average size of DNA fragments subjected to sequencing into account, the software calculates single-nucleotide read-enrichment values. After normalization, sample and control are compared using a test based on the ratio test or the Poisson distribution. Test statistic thresholds to control the false discovery rate are obtained through random permutations. Computational efficiency is achieved by implementing the most time-consuming functions in C++ and integrating these in the R package. An analysis of simulated and experimental ChIP-seq data is presented to demonstrate the robustness of our method against PCR-artefacts and its adequate control of the error rate.</p> <p>Conclusions</p> <p>The software <it>ChIP-seq Analysis in R </it>(CSAR) enables fast and accurate detection of protein-bound genomic regions through the analysis of ChIP-seq experiments. Compared to existing methods, we found that our package shows greater robustness against PCR-artefacts and better control of the error rate.</p
PRI-CAT: a web-tool for the analysis, storage and visualization of plant ChIP-seq experiments
Although several tools for the analysis of ChIP-seq data have been published recently, there is a growing demand, in particular in the plant research community, for computational resources with which such data can be processed, analyzed, stored, visualized and integrated within a single, user-friendly environment. To accommodate this demand, we have developed PRI-CAT (Plant Research International ChIP-seq analysis tool), a web-based workflow tool for the management and analysis of ChIP-seq experiments. PRI-CAT is currently focused on Arabidopsis, but will be extended with other plant species in the near future. Users can directly submit their sequencing data to PRI-CAT for automated analysis. A QuickLoad server compatible with genome browsers is implemented for the storage and visualization of DNA-binding maps. Submitted datasets and results can be made publicly available through PRI-CAT, a feature that will enable community-based integrative analysis and visualization of ChIP-seq experiments. Secondary analysis of data can be performed with the aid of GALAXY, an external framework for tool and data integration. PRI-CAT is freely available at http://www.ab.wur.nl/pricat. No login is required
De novo sequencing, assembly and analysis of the genome of the laboratory strain Saccharomyces cerevisiae CEN.PK113-7D, a model for modern industrial biotechnology
Saccharomyces cerevisiae CEN.PK 113-7D is widely used for metabolic engineering and systems biology research in industry and academia. We sequenced, assembled, annotated and analyzed its genome. Single-nucleotide variations (SNV), insertions/deletions (indels) and differences in genome organization compared to the reference strain S. cerevisiae S288C were analyzed. In addition to a few large deletions and duplications, nearly 3000 indels were identified in the CEN.PK113-7D genome relative to S288C. These differences were overrepresented in genes whose functions are related to transcriptional regulation and chromatin remodelling. Some of these variations were caused by unstable tandem repeats, suggesting an innate evolvability of the corresponding genes. Besides a previously characterized mutation in adenylate cyclase, the CEN.PK113-7D genome sequence revealed a significant enrichment of non-synonymous mutations in genes encoding for components of the cAMP signalling pathway. Some phenotypic characteristics of the CEN.PK113-7D strains were explained by the presence of additional specific metabolic genes relative to S288C. In particular, the presence of the BIO1 and BIO6 genes correlated with a biotin prototrophy of CEN.PK113-7D. Furthermore, the copy number, chromosomal location and sequences of the MAL loci were resolved. The assembled sequence reveals that CEN.PK113-7D has a mosaic genome that combines characteristics of laboratory strains and wild-industrial strains
Sequencing the Potato Genome: Outline and First Results to Come from the Elucidation of the Sequence of the World’s Third Most Important Food Crop
Potato is a member of the Solanaceae, a plant family that includes several other economically important species, such as tomato, eggplant, petunia, tobacco and pepper. The Potato Genome Sequencing Consortium (PGSC) aims to elucidate the complete genome sequence of potato, the third most important food crop in the world. The PGSC is a collaboration between 13 research groups from China, India, Poland, Russia, the Netherlands, Ireland, Argentina, Brazil, Chile, Peru, USA, New Zealand and the UK. The potato genome consists of 12 chromosomes and has a (haploid) length of approximately 840 million base pairs, making it a medium-sized plant genome. The sequencing project builds on a diploid potato genomic bacterial artificial chromosome (BAC) clone library of 78000 clones, which has been fingerprinted and aligned into ~7000 physical map contigs. In addition, the BAC-ends have been sequenced and are publicly available. Approximately 30000 BACs are anchored to the Ultra High Density genetic map of potato, composed of 10000 unique AFLPTM markers. From this integrated genetic-physical map, between 50 to 150 seed BACs have currently been identified for every chromosome. Fluorescent in situ hybridization experiments on selected BAC clones confirm these anchor points. The seed clones provide the starting point for a BAC-by-BAC sequencing strategy. This strategy is being complemented by whole genome shotgun sequencing approaches using both 454 GS FLX and Illumina GA2 instruments. Assembly and annotation of the sequence data will be performed using publicly available and tailor-made tools. The availability of the annotated data will help to characterize germplasm collections based on allelic variance and to assist potato breeders to more fully exploit the genetic potential of potat
Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function
This work was supported by Keygene N.V., a crop innovation company in the Netherlands and by the Spanish MINECO/FEDER Project TEC201680141-P with the associated FPI grant BES-2017-079792.The authors thank Dr. Elvin Isufi and Chirag Raman for their valuable comments
and feedback.Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available.
Results: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining.Keygene N.V., a crop innovation company in the NetherlandsSpanish MINECO/FEDER
TEC201680141-PFPI grant
BES-2017-07979
The Genomes of the Fungal Plant Pathogens Cladosporium fulvum and Dothistroma septosporum Reveal Adaptation to Different Hosts and Lifestyles But Also Signatures of Common Ancestry.
We sequenced and compared the genomes of the Dothideomycete fungal plant pathogensCladosporium fulvum (Cfu) (syn. Passalora fulva) and Dothistroma septosporum (Dse) that are closely related phylogenetically, but have different lifestyles and hosts. Although both fungi grow extracellularly in close contact with host mesophyll cells, Cfu is a biotroph infecting tomato, while Dse is a hemibiotroph infecting pine. The genomes of these fungi have a similar set of genes (70% of gene content in both genomes are homologs), but differ significantly in size (Cfu \u3e61.1-Mb; Dse 31.2-Mb), which is mainly due to the difference in repeat content (47.2% in Cfu versus 3.2% in Dse). Recent adaptation to different lifestyles and hosts is suggested by diverged sets of genes. Cfu contains an α-tomatinase gene that we predict might be required for detoxification of tomatine, while this gene is absent in Dse. Many genes encoding secreted proteins are unique to each species and the repeat-rich areas in Cfu are enriched for these species-specific genes. In contrast, conserved genes suggest common host ancestry. Homologs of Cfu effector genes, including Ecp2 and Avr4, are present in Dse and induce a Cf-Ecp2- and Cf-4-mediated hypersensitive response, respectively. Strikingly, genes involved in production of the toxin dothistromin, a likely virulence factor for Dse, are conserved in Cfu, but their expression differs markedly with essentially no expression by Cfu in planta. Likewise, Cfu has a carbohydrate-degrading enzyme catalog that is more similar to that of necrotrophs or hemibiotrophs and a larger pectinolytic gene arsenal than Dse, but many of these genes are not expressed in planta or are pseudogenized. Overall, comparison of their genomes suggests that these closely related plant pathogens had a common ancestral host but since adapted to different hosts and lifestyles by a combination of differentiated gene content, pseudogenization, and gene regulation
Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data
Inference of protein functions is one of the most important aims of modern
biology. To fully exploit the large volumes of genomic data typically produced
in modern-day genomic experiments, automated computational methods for protein
function prediction are urgently needed. Established methods use sequence or
structure similarity to infer functions but those types of data do not suffice
to determine the biological context in which proteins act. Current
high-throughput biological experiments produce large amounts of data on the
interactions between proteins. Such data can be used to infer interaction
networks and to predict the biological process that the protein is involved in.
Here, we develop a probabilistic approach for protein function prediction using
network data, such as protein-protein interaction measurements. We take a
Bayesian approach to an existing Markov Random Field method by performing
simultaneous estimation of the model parameters and prediction of protein
functions. We use an adaptive Markov Chain Monte Carlo algorithm that leads to
more accurate parameter estimates and consequently to improved prediction
performance compared to the standard Markov Random Fields method. We tested our
method using a high quality S.cereviciae validation network
with 1622 proteins against 90 Gene Ontology terms of different levels of
abstraction. Compared to three other protein function prediction methods, our
approach shows very good prediction performance. Our method can be directly
applied to protein-protein interaction or coexpression networks, but also can be
extended to use multiple data sources. We apply our method to physical protein
interaction data from S. cerevisiae and provide novel
predictions, using 340 Gene Ontology terms, for 1170 unannotated proteins and we
evaluate the predictions using the available literature
Continuous-time modeling of cell fate determination in Arabidopsis flowers
<p>Abstract</p> <p>Background</p> <p>The genetic control of floral organ specification is currently being investigated by various approaches, both experimentally and through modeling. Models and simulations have mostly involved boolean or related methods, and so far a quantitative, continuous-time approach has not been explored.</p> <p>Results</p> <p>We propose an ordinary differential equation (ODE) model that describes the gene expression dynamics of a gene regulatory network that controls floral organ formation in the model plant <it>Arabidopsis thaliana</it>. In this model, the dimerization of MADS-box transcription factors is incorporated explicitly. The unknown parameters are estimated from (known) experimental expression data. The model is validated by simulation studies of known mutant plants.</p> <p>Conclusions</p> <p>The proposed model gives realistic predictions with respect to independent mutation data. A simulation study is carried out to predict the effects of a new type of mutation that has so far not been made in <it>Arabidopsis</it>, but that could be used as a severe test of the validity of the model. According to our predictions, the role of dimers is surprisingly important. Moreover, the functional loss of any dimer leads to one or more phenotypic alterations.</p
Finished Genome of the Fungal Wheat Pathogen Mycosphaerella graminicola Reveals Dispensome Structure, Chromosome Plasticity, and Stealth Pathogenesis.
The plant-pathogenic fungus Mycosphaerella graminicola (asexual stage: Septoria tritici) causes septoria tritici blotch, a disease that greatly reduces the yield and quality of wheat. This disease is economically important in most wheat-growing areas worldwide and threatens global food production. Control of the disease has been hampered by a limited understanding of the genetic and biochemical bases of pathogenicity, including mechanisms of infection and of resistance in the host. Unlike most other plant pathogens, M. graminicola has a long latent period during which it evades host defenses. Although this type of stealth pathogenicity occurs commonly in Mycosphaerella and other Dothideomycetes, the largest class of plant-pathogenic fungi, its genetic basis is not known. To address this problem, the genome of M. graminicolawas sequenced completely. The finished genome contains 21 chromosomes, eight of which could be lost with no visible effect on the fungus and thus are dispensable. This eight-chromosome dispensome is dynamic in field and progeny isolates, is different from the core genome in gene and repeat content, and appears to have originated by ancient horizontal transfer from an unknown donor. Synteny plots of the M. graminicola chromosomes versus those of the only other sequenced Dothideomycete, Stagonospora nodorum, revealed conservation of gene content but not order or orientation, suggesting a high rate of intra-chromosomal rearrangement in one or both species. This observed “mesosynteny” is very different from synteny seen between other organisms. A surprising feature of the M. graminicolagenome compared to other sequenced plant pathogens was that it contained very few genes for enzymes that break down plant cell walls, which was more similar to endophytes than to pathogens. The stealth pathogenesis of M. graminicola probably involves degradation of proteins rather than carbohydrates to evade host defenses during the biotrophic stage of infection and may have evolved from endophytic ancestors
- …