54 research outputs found

    SLIDER: Mining correlated motifs in protein-protein interaction networks

    Correlated motif mining (CMM) is the problem of finding overrepresented pairs of patterns, called motif pairs, in interacting protein sequences. Algorithmic solutions for CMM thereby provide a computational method for predicting binding sites for protein interaction. In this paper, we adopt a motif-driven approach where the support of candidate motif pairs is evaluated in the network. We experimentally establish the superiority of the Chi-square-based support measure over other support measures. Furthermore, we show that CMM is an NP-hard problem for a large class of support measures (including Chi-square) and reformulate the search for correlated motifs as a combinatorial optimization problem. We then present the method SLIDER, which uses local search with a neighborhood function based on sliding motifs and employs the Chi-square-based support measure. We show that SLIDER outperforms existing motif-driven CMM methods and scales to large protein-protein interaction networks.
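
    As a rough illustration of what a Chi-square-based support measure for a motif pair could look like, the sketch below builds a 2x2 contingency table over all protein pairs (covered by the motif pair or not, interacting or not) and computes the Pearson Chi-square statistic. This is one plausible reading, not the definition used by SLIDER: the function name `chi_square_support`, the regular-expression motif syntax, and the toy sequences and interactions are all assumptions made for this example.

```python
import re
from itertools import combinations


def chi_square_support(motif_a, motif_b, sequences, interactions):
    """Pearson chi-square statistic for a motif pair (hypothetical helper).

    sequences: dict mapping protein id -> amino acid sequence
    interactions: iterable of (protein id, protein id) pairs
    Motifs are treated as regular expressions, so "." acts as a wildcard.
    """
    has_a = {p for p, s in sequences.items() if re.search(motif_a, s)}
    has_b = {p for p, s in sequences.items() if re.search(motif_b, s)}
    edges = {frozenset(e) for e in interactions}

    # table[covered][interacting] counts unordered protein pairs
    table = [[0, 0], [0, 0]]
    for p, q in combinations(sorted(sequences), 2):
        covered = (p in has_a and q in has_b) or (p in has_b and q in has_a)
        interacting = frozenset((p, q)) in edges
        table[int(covered)][int(interacting)] += 1

    n = sum(sum(row) for row in table)
    if n == 0:
        return 0.0

    chi2 = 0.0
    for i in (0, 1):
        for j in (0, 1):
            row_total = sum(table[i])
            col_total = table[0][j] + table[1][j]
            expected = row_total * col_total / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Toy data (invented for illustration only)
toy_sequences = {"P1": "AAKLC", "P2": "GGWYD", "P3": "AAKLT", "P4": "GGWYE"}
toy_interactions = [("P1", "P2"), ("P3", "P4"), ("P1", "P4"), ("P3", "P2")]
score = chi_square_support("KL", "WY", toy_sequences, toy_interactions)
print(score)
```

    In this toy network every interacting pair is covered by the motif pair and no non-interacting pair is, so the statistic reaches its maximum for six protein pairs. A local search in the spirit of the abstract would score neighboring motif pairs with such a function and keep the best one.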

    Predicting the Impact of Alternative Splicing on Plant MADS Domain Protein Function

    Several genome-wide studies demonstrated that alternative splicing (AS) significantly increases the transcriptome complexity in plants. However, the impact of AS on the functional diversity of proteins is difficult to assess using genome-wide approaches. The availability of detailed sequence annotations for specific genes and gene families allows for a more detailed assessment of the potential effect of AS on their function. One example is the plant MADS-domain transcription factor family, members of which interact to form protein complexes that function in transcription regulation. Here, we perform an in silico analysis of the potential impact of AS on the protein-protein interaction capabilities of MIKC-type MADS-domain proteins. We first confirmed the expression of transcript isoforms resulting from predicted AS events. Expressed transcript isoforms were considered functional if they were likely to be translated and if their corresponding AS events either had an effect on predicted dimerisation motifs or occurred in regions known to be involved in multimeric complex formation, or otherwise, if their effect was conserved in different species. Nine out of twelve MIKC MADS-box genes predicted to produce multiple protein isoforms harbored putative functional AS events according to those criteria. AS events with conserved effects were only found at the borders of or within the K-box domain. We illustrate how AS can contribute to the evolution of interaction networks through an example of selective inclusion of a recently evolved interaction motif in the MADS AFFECTING FLOWERING1-3 (MAF1–3) subclade. Furthermore, we demonstrate the potential effect of an AS event in SHORT VEGETATIVE PHASE (SVP), resulting in the deletion of a short sequence stretch including a predicted interaction motif, by overexpression of the fully spliced and the alternatively spliced SVP transcripts. For most of the AS events we were able to formulate hypotheses about the potential impact on the interaction capabilities of the encoded MIKC protein.

    Molecular evolution of aphids and their primary (Buchnera sp.) and secondary endosymbionts: implications for the role of symbiosis in insect evolution

    Aphids maintain an obligate, endosymbiotic association with Buchnera sp., a bacterium closely related to Escherichia coli. Bacteria are housed in specialized cells of organ-like structures called bacteriomes in the hemocoel of the aphid and are maternally transmitted. Phylogenetic studies have shown that the association had a single origin, dated about 200-250 million years ago, and that host and endosymbiont lineages have evolved in parallel since then. However, the pattern of deepest branching within the aphid family remains unsolved, which thereby hampers an appraisal of, for example, the role played by horizontal gene transfer in the early evolution of Buchnera. The main role of Buchnera in this association is the biosynthesis and provisioning of essential amino acids to its aphid host. Physiological and metabolic studies have recently substantiated this nutritional role. In addition, genetic studies of Buchnera from several aphids have shown additional modifications, such as strong genome reduction, high A+T content compared to free-living bacteria, differential evolutionary rates, a relative increase in the number of non-synonymous substitutions, and gene amplification mediated by plasmids. Symbiosis is an active process in insect evolution, as revealed by the intermediate values of the previous characteristics shown by secondary symbionts compared to free-living bacteria and Buchnera.

    A comparison of global sensitivity analysis methods for explainable AI with an application in genomic prediction

    Explainable Artificial Intelligence (XAI) is an increasingly important field of research required to bring AI to the next level in real-world applications. Global sensitivity analysis (GSA) methods play an important role in XAI, as they can provide an understanding of which (groups of) parameters have a high influence on the predictions of machine learning models and the output of simulators and real-world processes. In this paper, we conduct a survey into global sensitivity methods in an XAI context and present both a qualitative and a quantitative analysis of these methods under different conditions. In addition to the overview and comparison, we propose an open source application, GSAreport, that allows you to easily generate extensive reports using a carefully selected set of global sensitivity analysis methods depending on the number of dimensions and samples, to gain a deep understanding of the role of each feature for a given model or data set. We finally present the methods discussed in a complex real-world application of genomic prediction and draw conclusions about when to use which GSA methods.

    The use of multiple hierarchically independent gene ontology terms in gene function prediction and genome annotation

    The Gene Ontology (GO) is a widely used controlled vocabulary for the description of gene function. In this study we quantify the usage of multiple and hierarchically independent GO terms in the curated genome annotations of seven well-studied species. In most genomes, significant proportions (6 - 60%) of genes have been annotated with multiple and hierarchically independent terms. This may be necessary to attain adequate specificity of description. One noticeable exception is Arabidopsis thaliana, in which genes are much less frequently annotated with multiple terms (6 - 14%). In contrast, an analysis of the occurrence of InterPro hits in the proteomes of the seven species, followed by a mapping of the hits to GO terms, did not reveal an aberrant pattern for the A. thaliana genome. This study shows the widespread usage of multiple hierarchically independent GO terms in the functional annotation of genes. Consequently, probabilistic methods that aim to predict gene function automatically through integration of diverse genomic datasets, and that employ the GO, must be able to predict such multiple terms. We attribute the low frequency with which multiple GO terms are used in Arabidopsis to deviating practices in the genome annotation and curation process between communities of annotators. This may bias genome-scale comparisons of gene function between different species. GO term assignment should therefore be performed according to strictly similar rules and standards.
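
    To make the notion of hierarchically independent terms concrete, the sketch below treats two terms as independent when neither is an ancestor of the other in the ontology graph, and counts a gene as multiply annotated when more than one term survives after dropping annotated ancestors. The function names, the `parents` mapping, and the toy term labels are assumptions for this example and are not taken from the study or from real GO identifiers.

```python
def ancestors(term, parents):
    """Return all ancestors of a term in a DAG given as term -> list of parents."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def independent_terms(terms, parents):
    """Drop every annotated term that is an ancestor of another annotated term."""
    terms = set(terms)
    redundant = set()
    for term in terms:
        redundant |= ancestors(term, parents) & terms
    return terms - redundant


# Toy ontology (hypothetical labels, not real GO identifiers)
toy_parents = {
    "kinase_activity": ["catalytic_activity"],
    "catalytic_activity": ["molecular_function"],
    "dna_binding": ["binding"],
    "binding": ["molecular_function"],
}

nested = independent_terms({"kinase_activity", "catalytic_activity"}, toy_parents)
separate = independent_terms({"kinase_activity", "dna_binding"}, toy_parents)
print(sorted(nested), sorted(separate))
```

    Under this reading, a gene annotated with a term and one of its ancestors carries only one independent term, while a gene annotated with terms from separate branches carries two.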

    Genome sequence and analysis of the tuber crop potato

    Potato (Solanum tuberosum L.) is the world’s most important non-grain food crop and is central to global food security. It is clonally propagated, highly heterozygous, autotetraploid, and suffers acute inbreeding depression. Here we use a homozygous doubled-monoploid potato clone to sequence and assemble 86% of the 844-megabase genome. We predict 39,031 protein-coding genes and present evidence for at least two genome duplication events indicative of a palaeopolyploid origin. As the first genome sequence of an asterid, the potato genome reveals 2,642 genes specific to this large angiosperm clade. We also sequenced a heterozygous diploid clone and show that gene presence/absence variants and other potentially deleterious mutations occur frequently and are a likely cause of inbreeding depression. Gene family expansion, tissue-specific expression and recruitment of genes to new pathways contributed to the evolution of tuber development. The potato genome sequence provides a platform for genetic improvement of this vital crop.

    Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

    Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. Results: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining.

    Sequencing the potato genome: outline and first results to come from the elucidation of the sequence of the world's third most important food crop

    Potato is a member of the Solanaceae, a plant family that includes several other economically important species, such as tomato, eggplant, petunia, tobacco and pepper. The Potato Genome Sequencing Consortium (PGSC) aims to elucidate the complete genome sequence of potato, the third most important food crop in the world. The PGSC is a collaboration between 13 research groups from China, India, Poland, Russia, the Netherlands, Ireland, Argentina, Brazil, Chile, Peru, USA, New Zealand and the UK. The potato genome consists of 12 chromosomes and has a (haploid) length of approximately 840 million base pairs, making it a medium-sized plant genome. The sequencing project builds on a diploid potato genomic bacterial artificial chromosome (BAC) clone library of 78000 clones, which has been fingerprinted and aligned into ~7000 physical map contigs. In addition, the BAC-ends have been sequenced and are publicly available. Approximately 30000 BACs are anchored to the Ultra High Density genetic map of potato, composed of 10000 unique AFLP™ markers. From this integrated genetic-physical map, between 50 and 150 seed BACs have currently been identified for every chromosome. Fluorescent in situ hybridization experiments on selected BAC clones confirm these anchor points. The seed clones provide the starting point for a BAC-by-BAC sequencing strategy. This strategy is being complemented by whole genome shotgun sequencing approaches using both 454 GS FLX and Illumina GA2 instruments. Assembly and annotation of the sequence data will be performed using publicly available and tailor-made tools. The availability of the annotated data will help to characterize germplasm collections based on allelic variance and to assist potato breeders to more fully exploit the genetic potential of potato.

    A Quantitative and Dynamic Model of the Arabidopsis Flowering Time Gene Regulatory Network

    Various environmental signals integrate into a network of floral regulatory genes leading to the final decision on when to flower. Although a wealth of qualitative knowledge is available on how flowering time genes regulate each other, only a few studies incorporated this knowledge into predictive models. Such models are invaluable as they make it possible to investigate how various types of inputs are combined to give a quantitative readout. To investigate the effect of gene expression disturbances on flowering time, we developed a dynamic model for the regulation of flowering time in Arabidopsis thaliana. Model parameters were estimated based on expression time-courses for relevant genes, and a consistent set of flowering times for plants of various genetic backgrounds. Validation was performed by predicting changes in expression level in mutant backgrounds and comparing these predictions with independent expression data, and by comparison of predicted and experimental flowering times for several double mutants. Remarkably, the model predicts that a disturbance in a particular gene does not necessarily have the largest impact on directly connected genes. For example, the model predicts that SUPPRESSOR OF OVEREXPRESSION OF CONSTANS (SOC1) mutation has a larger impact on APETALA1 (AP1), which is not directly regulated by SOC1, compared to its effect on LEAFY (LFY), which is under direct control of SOC1. This was confirmed by expression data. Another model prediction involves the importance of cooperativity in the regulation of APETALA1 (AP1) by LFY, a prediction supported by experimental evidence. In conclusion, our model for flowering time gene regulation makes it possible to address how different quantitative inputs are combined into one quantitative output, flowering time.