    TreeDomViewer: a tool for the visualization of phylogeny and protein domain structure

    Phylogenetic analysis and examination of protein domains allow accurate genome annotation and are invaluable to the study of protein and protein complex evolution. However, two sequences can be homologous without sharing statistically significant amino acid or nucleotide identity, presenting a challenging bioinformatics problem. We present TreeDomViewer, a visualization tool available as a web-based interface that combines a phylogenetic tree description, a multiple sequence alignment and InterProScan data of sequences, and generates a phylogenetic tree projecting the corresponding protein domain information onto the multiple sequence alignment. It thereby makes use of existing domain prediction tools such as InterProScan. TreeDomViewer adopts an evolutionary perspective on how the domain structure of two or more sequences can be aligned and compared, to subsequently infer the function of an unknown homolog. This provides insight into the function assignment of family members that are very divergent in terms of amino acid substitution, yet closely related. Our tool produces an interactive scalable vector graphics image that shows the orthologous relationships and domain content of proteins of interest at a glance. In addition, output in PDF, JPEG or PNG format is provided. These features make TreeDomViewer a valuable addition to the annotation pipeline of unknown genes or gene products. TreeDomViewer is available at

    ProGMap: an integrated annotation resource for protein orthology

    Current protein sequence databases employ different classification schemes that often provide conflicting annotations, especially for poorly characterized proteins. ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap) is a web-tool designed to help researchers and database annotators to assess the coherence of protein groups defined in various databases and thereby facilitate the annotation of newly sequenced proteins. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences which is mapped to 240 000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF. ProGMap combines the underlying classification schemes via a network of links constructed by a fast and fully automated mapping approach originally developed for document classification. The web interface enables queries to be made using sequence identifiers, gene symbols, protein functions or amino acid and nucleotide sequences. For the latter query type BLAST similarity search and QuickMatch identity search services have been incorporated, for finding sequences similar (or identical) to a query sequence. ProGMap is meant to help users of high throughput methodologies who deal with partially annotated genomic data

    Genomic prediction in plants: opportunities for ensemble machine learning based approaches [version 2; peer review: 1 approved, 2 approved with reservations]

    Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture. Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs), heritability (h2 and h2e), population structure and linkage disequilibrium between causal nucleotides and other SNPs. Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods. Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners

    Constraint-based probabilistic learning of metabolic pathways from tomato volatiles

    Clustering and correlation analysis techniques have become popular tools for the analysis of data produced by metabolomics experiments. The results obtained from these approaches provide an overview of the interactions between objects of interest. Often in these experiments, one is more interested in information about the nature of these relationships, e.g., cause-effect relationships, than in the actual strength of the interactions. Finding such relationships is of crucial importance as most biological processes can only be understood in this way. Bayesian networks allow representation of these cause-effect relationships among variables of interest in terms of whether and how they influence each other given that a third, possibly empty, group of variables is known. This technique also allows the incorporation of prior knowledge as established from the literature or from biologists. The representation of these relationships as a directed graph is highly intuitive and helps to understand these processes. This paper describes how constraint-based Bayesian networks can be applied to metabolomics data and can be used to uncover the important pathways which play a significant role in the ripening of fresh tomatoes. We also show here how this method of reconstructing pathways is intuitive and performs better than classical techniques. Methods for learning Bayesian network models are powerful tools for the analysis of data of the magnitude generated by metabolomics experiments. They allow one to model cause-effect relationships and help in understanding the underlying processes

    HSPVdb—the Human Short Peptide Variation Database for improved mass spectrometry-based detection of polymorphic HLA-ligands

    T cell epitopes derived from polymorphic proteins or from proteins encoded by alternative reading frames (ARFs) play an important role in (tumor) immunology. Identification of these peptides is successfully performed with mass spectrometry. In a mass spectrometry-based approach, the recorded tandem mass spectra are matched against hypothetical spectra generated from known protein sequence databases. Commonly used protein databases contain a minimal level of redundancy, and thus are not suitable data sources for searching polymorphic T cell epitopes, in either normal or alternative reading frames. At the same time, however, these databases contain much non-polymorphic sequence information, thereby complicating the matching of recorded and theoretical spectra, and increasing the potential for finding false positives. Therefore, we created a database with peptides from ARFs and peptide variation arising from single nucleotide polymorphisms (SNPs). It is based on the human mRNA sequences from the well-annotated reference sequence (RefSeq) database and associated variation information derived from the Single Nucleotide Polymorphism Database (dbSNP). In this process, we removed all non-polymorphic information. Investigation of the frequency of SNPs in dbSNP revealed that many SNPs are non-polymorphic “SNPs”. Therefore, we removed those from our dedicated database, and this resulted in a comprehensive high quality database, which we named the Human Short Peptide Variation Database (HSPVdb). The value of our HSPVdb is shown by identification of the majority of published polymorphic SNP- and/or ARF-derived epitopes from a mass spectrometry-based proteomics workflow, and by a large variety of polymorphic peptides identified as potential T cell epitopes in the HLA-ligandome presented by the Epstein–Barr virus cells
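
    The idea of deriving peptides from alternative reading frames and from SNP variants can be illustrated with a minimal sketch. It is not the HSPVdb pipeline: the toy sequence, the SNP position and the helper names `translate` and `apply_snp` are all hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate a nucleotide sequence from frame offset 0, 1 or 2.

    Incomplete trailing codons are ignored; '*' marks a stop codon.
    """
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(frame, len(seq) - 2, 3)
    )


def apply_snp(seq, position, alt):
    """Return a copy of seq with the base at a 0-based position replaced."""
    return seq[:position] + alt + seq[position + 1:]


# Hypothetical toy sequence and SNP, made up for illustration only.
reference = "ATGGCCAAGTGA"
variant = apply_snp(reference, 4, "A")

reference_frames = [translate(reference, f) for f in range(3)]
variant_frames = [translate(variant, f) for f in range(3)]
```

    Here frame 0 is the annotated reading frame and frames 1 and 2 stand in for ARFs; comparing `reference_frames` with `variant_frames` shows which peptides a single substitution changes.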

    Microarray analysis of Bay-0 x Sha recombinant inbred lines at four seed germination stages

    Seed germination is characterized by a constant change of gene expression across different time points. These changes are related to specific processes, which eventually determine the onset of seed germination. To get a better understanding of the regulation of gene expression during seed germination, we measured gene expression levels of Arabidopsis thaliana Bay x Sha recombinant inbred lines (RILs) at four important seed germination stages (primary dormant, after-ripened, six hours after imbibition, and radicle protrusion) using microarrays. We mapped the eQTL for gene expression, and the results showed a distinct eQTL landscape for each stage. We found several eQTL hotspots across stages associated with the regulation of expression of a large number of genes. Together, these results reveal that the genetic regulation of gene expression is dynamic over the course of seed germination

    Cognition and personality in male veiled chameleons, Chamaeleo calyptratus

    Historically, the study of cognition has focused on species-specific learning, memory, problem-solving and decision-making capabilities, and emphasis was placed on the few high-performing individuals who successfully completed cognitive tasks. Studies often deemed the success of a small fraction of individuals as suggestive of the cognitive capacity of the entire species. Recently though, interest in individual variation in cognitive ability within species has increased. This interest has emerged concomitantly with studies of variation in animal personalities (i.e. behavioral syndromes). Cognitive ability may be closely tied to personality because the mechanisms by which an individual perceives and uses environmental input (cognition) should influence how that individual consistently responds to various ecological demands (personality). However, empirical support for links between animal cognition and behavioral syndromes is currently lacking. I examined individual variation in cognition and personality in male veiled chameleons, Chamaeleo calyptratus. I considered three axes of personality (aggression, activity, and exploratory behavior) and cognition in a foraging context using visual cues, specifically the ability to associate a color with a food reward. I found that aggression was positively correlated with the proportion of correct choices and the number of consecutive correct choices. Also, one measure of exploration (the number of vines touched in a novel environment) was correlated negatively with the proportion of correct choices and positively with the number of consecutive incorrect decisions. My investigation suggests that more aggressive, less exploratory chameleons were more successful learners, and that there exists a shared pathway between these personality traits and cognitive ability

    Propagation of errors in citation networks: a study involving the entire citation network of a widely cited paper published in, and later retracted from, the journal Nature

    Background: In about one in 10,000 cases, a published article is retracted. This very often means that the results it reports are flawed. Several authors have voiced concerns about the presence of retracted research in the memory of science. In particular, a retracted result is propagated by citing it. In the published literature, many instances are given of retracted articles that are cited both before and after their retraction. Even worse is the possibility that these articles in turn are cited in such a way that the retracted result is propagated further. Methods: We have conducted a case study to find out how a retracted article is cited and whether retracted results are propagated through indirect citations. We have constructed the entire citation network for this case. Results: We show that directly citing articles is an important source of propagation of retracted research results. In contrast, in our case study, indirect citations do not contribute to the propagation of the retracted result. Conclusions: While admitting the limitations of a study involving a single case, we think there are reasons for the non-contribution of indirect citations that hold beyond our case study
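
    The distinction between direct and indirect citations can be sketched as a breadth-first search over a citation graph. This is only an illustration under stated assumptions, not the authors' method: the function name `citation_distances`, the `cited_by` mapping and the toy network are hypothetical.

```python
from collections import deque


def citation_distances(cited_by, retracted):
    """Breadth-first search outward from the retracted paper.

    cited_by maps a paper id to the list of papers that cite it.
    Returns {paper: distance}; distance 1 is a direct citation and
    distance 2 or more is an indirect citation.
    """
    distances = {retracted: 0}
    queue = deque([retracted])
    while queue:
        paper = queue.popleft()
        for citer in cited_by.get(paper, []):
            if citer not in distances:
                distances[citer] = distances[paper] + 1
                queue.append(citer)
    del distances[retracted]  # the retracted paper does not cite itself
    return distances


# Hypothetical toy network; the identifiers are made up for illustration.
toy_cited_by = {
    "retracted": ["A", "B"],
    "A": ["C"],
    "B": ["C", "D"],
    "C": ["E"],
}

distances = citation_distances(toy_cited_by, "retracted")
direct = sorted(p for p, d in distances.items() if d == 1)
indirect = sorted(p for p, d in distances.items() if d >= 2)
```

    Deciding whether a citing paper actually propagates the retracted result still requires reading how it uses the citation; the search only identifies which papers to inspect and at what distance.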