11 research outputs found

    A new protein linear motif benchmark for multiple sequence alignment software

    BACKGROUND: Linear motifs (LMs) are abundant short regulatory sites used for modulating the functions of many eukaryotic proteins. They play important roles in post-translational modification, cell compartment targeting, docking sites for regulatory complex assembly and protein processing and cleavage. Methods for LM detection are now being developed that are strongly dependent on scores for motif conservation in homologous proteins. However, most LMs are found in natively disordered polypeptide segments that evolve rapidly, unhindered by structural constraints on the sequence. These regions of modular proteins are difficult to align using classical multiple sequence alignment programs that are specifically optimised to align the globular domains. As a consequence, poor motif alignment quality is hindering efforts to detect new LMs.
RESULTS: We have developed a new benchmark, as part of the BAliBASE suite, designed to assess the ability of standard multiple alignment methods to detect and align LMs. The reference alignments are organised into different test sets representing real alignment problems and contain examples of experimentally verified functional motifs, extracted from the Eukaryotic Linear Motif (ELM) database. The benchmark has been used to evaluate and compare a number of multiple alignment programs. With distantly related proteins, the worst alignment program correctly aligns 48% of LMs compared to 73% for the best program. However, the performance of all the programs is adversely affected by the introduction of other sequences containing false positive motifs. The ranking of the alignment programs based on LM alignment quality is similar to that observed when considering full-length protein alignments; however, little correlation was observed between LM and overall alignment quality for individual alignment test cases.
CONCLUSION: We have shown that none of the programs currently available is capable of reliably aligning LMs in distantly related sequences and we have highlighted a number of specific problems. The results of the tests suggest possible ways to improve program accuracy for difficult, divergent sequences.

    Evidence for the Concerted Evolution between Short Linear Protein Motifs and Their Flanking Regions

    BACKGROUND: Linear motifs are short modules of protein sequences that play a crucial role in mediating and regulating many protein-protein interactions. The function of linear motifs strongly depends on the context, e.g. functional instances mainly occur inside flexible regions that are accessible for interaction. Sometimes linear motifs appear as isolated islands of conservation in multiple sequence alignments. However, they also occur in larger blocks of sequence conservation, suggesting an active role for the neighbouring amino acids. RESULTS: The evolution of regions flanking 116 functional linear motif instances was studied. The conservation of the amino acid sequence and order/disorder tendency of those regions was related to presence/absence of the instance. For the majority of the analysed instances, the pairs of sequences conserving the linear motif were also observed to maintain a similar local structural tendency and/or to have higher local sequence conservation when compared to pairs of sequences where one is missing the linear motif. Furthermore, those instances have a higher chance to co-evolve with the neighbouring residues in comparison to the distant ones. Those findings are supported by examples where the regulation of the linear motif-mediated interaction has been shown to depend on the modifications (e.g. phosphorylation) at neighbouring positions or is thought to benefit from the binding versatility of disordered regions. CONCLUSION: The results suggest that flanking regions are relevant for linear motif-mediated interactions, both at the structural and sequence level. More interestingly, they indicate that the prediction of linear motif instances can be enriched with contextual information by performing a sequence analysis similar to the one presented here. This can facilitate the understanding of the role of these predicted instances in determining the protein function inside the broader context of the cellular network where they arise

    The identification of short linear motif-mediated interfaces within the human interactome

    Motivation: Eukaryotic proteins are highly modular, containing multiple interaction interfaces that mediate binding to a network of regulators and effectors. Recent advances in high-throughput proteomics have rapidly expanded the number of known protein–protein interactions (PPIs); however, the molecular basis for the majority of these interactions remains to be elucidated. There has been a growing appreciation of the importance of a subset of these PPIs, namely those mediated by short linear motifs (SLiMs), particularly the canonical and ubiquitous SH2, SH3 and PDZ domain-binding motifs. However, these motif classes represent only a small fraction of known SLiMs and outside these examples little effort has been made, either bioinformatically or experimentally, to discover the full complement of motif instances

    ELM: the status of the 2010 eukaryotic linear motif resource

    Linear motifs are short segments of multidomain proteins that provide regulatory functions independently of protein tertiary structure. Much of intracellular signalling passes through protein modifications at linear motifs. Many thousands of linear motif instances, most notably phosphorylation sites, have now been reported. Although clearly very abundant, linear motifs are difficult to predict de novo in protein sequences due to the difficulty of obtaining robust statistical assessments. The ELM resource at http://elm.eu.org/ provides an expanding knowledge base, currently covering 146 known motifs, with annotation that includes >1300 experimentally reported instances. ELM is also an exploratory tool for suggesting new candidates of known linear motifs in proteins of interest. Information about protein domains, protein structure and native disorder, cellular and taxonomic contexts is used to reduce or deprecate false positive matches. Results are graphically displayed in a ‘Bar Code’ format, which also displays known instances from homologous proteins through a novel ‘Instance Mapper’ protocol based on PHI-BLAST. ELM server output provides links to the ELM annotation as well as to a number of remote resources. Using the links, researchers can explore the motifs, proteins, complex structures and associated literature to evaluate whether candidate motifs might be worth experimental investigation
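    As a rough illustration of how candidate instances of a known motif might be suggested in a protein of interest, the sketch below scans a sequence with a regular expression. It assumes motifs can be written as regular expressions; the pattern `P..P` and the example sequence are invented for illustration and are not ELM entries, and none of the context filters described above (domains, disorder, taxonomy) are applied.

```python
import re


def find_motif_matches(sequence, pattern):
    """Return (start, end, matched_text) for every match of `pattern`,
    including overlapping ones, using 0-based half-open coordinates."""
    # A lookahead with a capture group lets finditer report overlapping hits.
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start(1), m.end(1), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical sequence and simplified proline-rich pattern.
    for start, end, text in find_motif_matches("MAPPLPAKPTRP", "P..P"):
        print(start, end, text)
```

    Plain pattern matching of this kind returns many chance hits in real proteins, which is why the abstract stresses contextual filtering of false positive matches.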

    Automatic and manual functional annotation in a distributed web service environment

    While the number of genomic sequences becoming available is increasing exponentially, most genes are not functionally well characterized. Finding out more about the function of a gene and about functional relationships between genes will be the next big bottleneck in the post-genomic era. On the one hand improved pipelines and tools are needed in this context, because running experiments for all predicted genes is not feasible. On the other hand manual curation of the automatic predictions is necessary to judge the reliability of the automatic annotation and to get a more comprehensive view on the function of each individual gene. For the automatic functional annotation often a homology based function transfer from functionally characterized genes is applied using methods like Blast. However, this approach has many drawbacks and makes systematic errors by not taking care of speciation and duplication events. Phylogenomics has shown to improve the functional prediction accuracy by taking the evolutionary history of genes in a phylogenetic tree context into account. In this thesis the manual process from the assembly of the DNA sequence to the functional characterization of genes and the identification and comparison of shared syntenic regions, including the identification of candidate genes for pathogen resistance in potato chromosome V, is explained and problems discussed. To improve the automatic functional annotation in genome projects, a phylogenomic pipeline, which includes SIFTER one of the best phylogenomic tools in this area, is introduced, improved and tested in the Medicago truncatula, Sorghum bicolor and Solanum lycopersicum genome projects. To obtain new candidate genes for the development of new drugs and crop protection products, non-plant specific genes, like the transferrin family which is not known in plants yet, are extracted from the M. truncatula and S. bicolor genomes and further investigated. 
For further improvement of the annotation, a new phylogenomic approach is developed. This approach makes use of annotated functional attributes to calculate the functional mutation rate between genes and groups of genes in a phylogenetic tree and to find out if the function of a gene can be transferred or not. The new approach is integrated into the SIFTER tool and tested on the blue-light photoreceptor/photolyase family and on a test set of manually curated Arabidopsis thaliana genes. Using both test sets the prediction accuracy could be significantly improved and a more comprehensive view on the gene function could be obtained. But because still no tool is able to annotate all functions of a gene with 100% accuracy, I introduce a system for manual functional annotation, called AFAWE. AFAWE runs different web services for the functional annotation and displays the results and intermediate results in a comprehensive web interface that facilitates comparison. It can be used for any organism and any kind of gene. The inputs are the amino acid sequence and the corresponding organism. Because of its flexible structure, new web services and workflows can be easily integrated. Besides Blast searches against different databases and protein domain prediction tools, AFAWE also includes the phylogenomic pipeline. Different filters help to identify trustworthy results from each analysis. Furthermore a detailed manual annotation can be assigned to each protein, which will be used to update the functional annotation in public databases like MIPSPlantsDB

    Comparative Evaluation of Methods for Sequence Alignment and Annotation

    The speed of DNA and RNA sequencing has long ago surpassed the capacity of laboratories to assign function to these sequences by direct experiment. Fortunately, function and other information can be effectively transferred to novel data from previously accumulated knowledge by sequence homology. This has resulted in the development of hundreds of novel homology-based methods. However, the tendency of method developers to be overoptimistic about their own results, biases in the evaluation metrics used to rank methods, inconsistency between different rankings and evaluation metrics, misplaced popularity of methods relative to their performance all indicate that, in many cases, clear knowledge of the comparative performance of different methods is lacking. This has two main consequences. First, researchers use suboptimal tools. Second, method development may go astray because the merits used for guiding method optimization are biased or unclear. To avoid these difficulties, further research is needed into methodology of evaluation and comparative studies. One core approach for transferring function by sequence homology is to create a multiple sequence alignment (MSA) that represents a given group of similar sequences. The resulting alignment can be applied to annotate novel sequences using profile hidden Markov models (HMMs), to create phylogenetic trees or to compare structural features. The application of MSAs and profile HMMs for genome annotation was explored in publication (I). Creating MSA has been addressed by a vast field of research, however there is a lack of independent comparative studies and no comparative studies for alignment strategies. In publication (II) a novel modular MSA aligner was implemented to aid in comparative evaluation of different MSA strategies. Different MSA strategies were then compared to each other and to the state-of-the-art MSA software on three benchmark databases. 
Another core approach has been to combine homology searches with assignment of annotation terms from a controlled vocabulary such as the Gene Ontology (GO). Hundreds of methods that assign GO terms to novel sequences have been introduced. The research community has also invested into the objective evaluation of these methods via third party competitions. However, the evaluation metrics and merits used in these competitions are still under active debate and need further research and development. In publication (III) a novel framework was introduced for the development of unbiased high-quality evaluation metrics. By testing 37 variations of popular metrics, our approach revealed strong differences between metrics, a list of clearly biased metrics, and a list of high-quality metrics that are well suited for the evaluation of GO annotations. In summary, this thesis presents novel frameworks and implementation platforms for comparative evaluation of two important classes of homology-based methods: MSA aligners and GO sequence classifiers. These results will be instrumental for developing more accurate MSA aligners, for eliminating many forms of bias inherent in contemporary evaluation protocols, for producing informative method rankings for non-specialist users and for guiding method development towards merits that truly reflect the utility of the designed tools.
Owing to the rapid development of DNA and RNA sequencing technology, most biological annotations of sequences are produced by automatic methods based on sequence homology. Hundreds of homology-based methods have been developed, which underlines the importance of objective and independent method comparison. Many sources of error distort and complicate method comparison: overestimation of one's own method, overfitting, selective reporting, and biased and mutually inconsistent evaluation metrics.
Biased method comparison has two significant consequences: (1) suboptimal methods end up in use by the research community, and (2) method development goes astray because the evaluation criteria guiding it are biased or unclear. These difficulties can be avoided by directing research at comparative method evaluation itself. Multiple sequence alignment (MSA) is a sequence-homology-based method with a very broad range of applications in molecular biology research. Publication (I) studied the application of MSAs and hidden Markov models to the annotation of bacterial genomes. The MSA field still lacks independent method evaluation, and in particular studies that compare different MSA algorithmic solutions. Publication (II) presented a new modular MSA program, with which several MSA algorithmic solutions were compared with each other and with state-of-the-art MSA software on three benchmark databases. Based on the comparison, clear recommendations were given on optimal MSA algorithmic solutions and the best MSA programs. Automatic methods that produce sequence annotations most often use Gene Ontology (GO) terminology. Because hundreds of GO methods are published every year, the research community has invested in the comparative evaluation of these methods. However, in GO method comparison the evaluation criteria are not yet established, and many of the evaluation metrics in use are either biased or mutually inconsistent. Publication (III) proposes as a solution a new method that makes it possible to test and develop high-quality, unbiased evaluation metrics. Publication (III) tested several evaluation metrics and showed that many GO evaluation metrics currently in use are strongly biased. Based on the testing, clear recommendations were also given on evaluation metrics that ensure unbiased method comparison.

    A new protein linear motif benchmark for multiple sequence alignment software-4

    subset 1, showing the extreme observations (stars or circles), lower quartile, median, upper quartile, and largest observation in each similarity category. b) Execution times in seconds required to construct all the multiple alignments in Subset 1. Programs are displayed in the order of the Friedman test using the SPS scores for group V11 (additional file ), with the highest scoring program on the left. Copyright information: Taken from "A new protein linear motif benchmark for multiple sequence alignment software", http://www.biomedcentral.com/1471-2105/9/213, BMC Bioinformatics 2008;9:213. Published online 25 Apr 2008. PMCID: PMC2374782.

    A new protein linear motif benchmark for multiple sequence alignment software-2

    different conditions, showing the extreme observations (stars or circles), lower quartile, median, upper quartile, and largest observation. Significant differences, according to a Wilcoxon signed ranks test (p < 0.05), are indicated by an asterisk on the x-axis. P-values for the Wilcoxon tests are available in additional file , table 3. a) SPS scores for alignments of sequences with validated motifs only compared to alignments including sequences with errors. b) SPS scores for alignments of sequences with validated motifs only compared to alignments including sequences containing false positive (FP) motifs. c) SPS scores for alignments of sequences with validated motifs only compared to alignments including sequences that do not contain any examples of the motif. Copyright information: Taken from "A new protein linear motif benchmark for multiple sequence alignment software", http://www.biomedcentral.com/1471-2105/9/213, BMC Bioinformatics 2008;9:213. Published online 25 Apr 2008. PMCID: PMC2374782.
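    The SPS (sum-of-pairs score) values referred to in these captions can be illustrated with a minimal sketch: the fraction of residue pairs aligned in the reference alignment that are also aligned in the test alignment. The function names and the toy alignments are assumptions made for illustration; the benchmark's own scoring program may restrict scoring to particular regions (such as the motifs), which this sketch does not do.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    Each residue is identified as (sequence_index, residue_index)."""
    counters = [0] * len(alignment)  # residues consumed so far per sequence
    pairs = set()
    for col in range(len(alignment[0])):
        residues = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                residues.append((s, counters[s]))
                counters[s] += 1
        # Every pair of residues in the same column counts as aligned.
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference-aligned residue pairs recovered by `test`."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test)) / len(ref)


if __name__ == "__main__":
    reference = ["ABC", "ABC"]
    test = ["ABC-", "A-BC"]  # only the first residue pair is still aligned
    print(sum_of_pairs_score(test, reference))
```

    Scores computed this way for two conditions over the same test cases are paired observations, which is the setting in which a Wilcoxon signed ranks test, as used in the caption, applies.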