12 research outputs found

    MSDmotif: exploring protein sites and motifs

    BACKGROUND: Protein structures have conserved features, called motifs, which have a significant influence on protein function. These motifs can be found in sequence as well as in 3D space. Understanding these fragments is essential for 3D structure prediction, modelling and drug design. The Protein Data Bank (PDB) is the source of this information; however, present search tools have limited 3D options for integrating a protein sequence with its 3D structure. RESULTS: We describe here a web application, and its underlying database, for querying the PDB for ligands, binding sites, and small 3D structural and sequence motifs. Novel algorithms are incorporated for searching chemical fragments, 3D motifs, ϕ/ψ sequences, super-secondary structure motifs and small 3D structural motif associations. The interface provides functionality for visualization, search criteria creation, and sequence and 3D multiple alignment. MSDmotif is an integrated system in which a results page is also a search form. A set of motif statistics is available for analysis. This set includes molecule and motif binding statistics, the distribution of motif sequences, the occurrence of an amino acid within a motif, the correlation of amino-acid side-chain charges within a motif, and Ramachandran plots for each residue. The binding statistics are presented in association with properties that include a ligand fragment library. Access is also provided through the Distributed Annotation System (DAS) protocol. An additional entry point accepts XML requests and returns XML responses. CONCLUSION: MSDmotif is unique in combining chemical, sequence and 3D data in a single search engine with a range of search and visualisation options. It provides multiple views of data found in the PDB archive for exploring protein structures.

    DoBo: Protein domain boundary prediction by integrating evolutionary signals and machine learning

    BACKGROUND: Accurate identification of protein domain boundaries is useful for protein structure determination and prediction. However, predicting protein domain boundaries from a sequence is still very challenging and largely unsolved. RESULTS: We developed a new method to integrate the classification power of machine learning with evolutionary signals embedded in protein families in order to improve protein domain boundary prediction. The method first extracts putative domain boundary signals from a multiple sequence alignment between a query sequence and its homologs. The putative sites are then classified and scored by support vector machines in conjunction with input features such as sequence profiles, secondary structures, solvent accessibilities around the sites and their positions. The method was evaluated on a domain benchmark by 10-fold cross-validation and 60% of true domain boundaries can be recalled at a precision of 60%. The trade-off between the precision and recall can be adjusted according to specific needs by using different decision thresholds on the domain boundary scores assigned by the support vector machines. CONCLUSIONS: The good prediction accuracy and the flexibility of selecting domain boundary sites at different precision and recall values make our method a useful tool for protein structure determination and modelling. The method is available at http://sysbio.rnet.missouri.edu/dobo/.
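The precision/recall trade-off mentioned in this abstract can be illustrated with a minimal sketch. The scores and labels below are made-up toy values, and `precision_recall` is a hypothetical helper, not part of DoBo; the sketch only shows how moving a decision threshold over classifier scores shifts the two metrics.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when sites scoring >= threshold are called boundaries.

    scores: classifier scores for candidate sites (toy values here).
    labels: True where the site is a real domain boundary.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    # Convention assumed here: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (assumed for illustration only).
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.2]
toy_labels = [True, True, False, True, False]

strict = precision_recall(toy_scores, toy_labels, threshold=0.7)
lenient = precision_recall(toy_scores, toy_labels, threshold=0.3)
```

With these toy values, the stricter threshold yields higher precision and lower recall than the lenient one, which is the kind of adjustment the abstract describes.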

    A Re-Annotation of the Saccharomyces Cerevisiae Genome

    Discrepancies in gene and orphan number indicated by previous analyses suggest that S. cerevisiae would benefit from a consistent re-annotation. In this analysis three new genes are identified and 46 alterations to gene coordinates are described. 370 ORFs are defined as totally spurious ORFs which should be disregarded. At least a further 193 genes could be described as very hypothetical, based on a number of criteria. It was found that disparate genes with sequence overlaps over ten amino acids (especially at the N-terminus) are rare in both S. cerevisiae and Sz. pombe. A new S. cerevisiae gene number estimate with an upper limit of 5804 is proposed, but after the removal of very hypothetical genes and pseudogenes this is reduced to 5570. Although this is likely to be closer to the true upper limit, it is still predicted to be an overestimate of gene number. A complete list of revised gene coordinates is available from the Sanger Centre (S. cerevisiae reannotation: ftp://ftp/pub/yeast/SCreannotation)

    The linear chromosome of the plant-pathogenic mycoplasma 'Candidatus Phytoplasma mali'

    BACKGROUND: Phytoplasmas are insect-transmitted, uncultivable bacterial plant pathogens that cause diseases in hundreds of economically important plants. They represent a monophyletic group within the class Mollicutes (trivial name mycoplasmas) and are characterized by a small genome with a low GC content, and the lack of a firm cell wall. All mycoplasmas, including strains of 'Candidatus (Ca.) Phytoplasma asteris' and 'Ca. P. australiense', examined so far have circular chromosomes, as is the case for almost all walled bacteria. RESULTS: Our work has shown that 'Ca. Phytoplasma mali', the causative agent of apple proliferation disease, has a linear chromosome. Linear chromosomes were also identified in the closely related provisional species 'Ca. P. pyri' and 'Ca. P. prunorum'. The chromosome of 'Ca. P. mali' strain AT is 601,943 bp in size and has a GC content of 21.4%. The chromosome is further characterized by large terminal inverted repeats and covalently closed hairpin ends. Analysis of the protein-coding genes revealed that glycolysis, the major energy-yielding pathway supposed for 'Ca. P. asteris', is incomplete in 'Ca. P. mali'. Due to the apparent lack of other metabolic pathways present in mycoplasmas, it is proposed that maltose and malate are utilized as carbon and energy sources. However, complete ATP-yielding pathways were not identified. 'Ca. P. mali' also differs from 'Ca. P. asteris' by a smaller genome, a lower GC content, a lower number of paralogous genes, fewer insertions of potential mobile DNA elements, and a strongly reduced number of ABC transporters for amino acids. In contrast, 'Ca. P. mali' has a more extended set of genes for homologous recombination, excision repair and SOS response than 'Ca. P. asteris'. CONCLUSION: The small linear chromosome with large terminal inverted repeats and covalently closed hairpin ends, the extremely low GC content and the limited metabolic capabilities reflect unique features of 'Ca. P. mali', not only within phytoplasmas, but all mycoplasmas. It is expected that the genome information obtained here will contribute to a better understanding of the reduced metabolism of phytoplasmas, their fastidious nutrition requirements that prevented axenic cultivation, and the mechanisms involved in pathogenicity

    Comparative analysis of pseudogenes across three phyla

    Pseudogenes are degraded fossil copies of genes. Here, we report a comparison of pseudogenes spanning three phyla, leveraging the completed annotations of the human, worm, and fly genomes, which we make available as an online resource. We find that pseudogenes are lineage specific, much more so than protein-coding genes, reflecting the different remodeling processes marking each organism’s genome evolution. The majority of human pseudogenes are processed, resulting from a retrotranspositional burst at the dawn of the primate lineage. This burst can be seen in the largely uniform distribution of pseudogenes across the genome, their preservation in areas with low recombination rates, and their preponderance in highly expressed gene families. In contrast, worm and fly pseudogenes tell a story of numerous duplication events. In worm, these duplications have been preserved through selective sweeps, so we see a large number of pseudogenes associated with highly duplicated families such as chemoreceptors. However, in fly, the large effective population size and high deletion rate resulted in a depletion of the pseudogene complement. Despite large variations between these species, we also find notable similarities. Overall, we identify a broad spectrum of biochemical activity for pseudogenes, with the majority in each organism exhibiting varying degrees of partial activity. In particular, we identify a consistent amount of transcription (∼15%) across all species, suggesting a uniform degradation process. Also, we see a uniform decay of pseudogene promoter activity relative to their coding counterparts and identify a number of pseudogenes with conserved upstream sequences and activity, hinting at potential regulatory roles

    Parallel "Smith-Waterman" on a many-core architecture for sequence database searches

    87 p. Master's thesis supervised by Sergio Gálvez Rojas, with co-tutors Oswaldo Trelles Salazar and Gabriel Dorado Pérez. In this work, an algorithm named MC64-S3W (MultiCore 64 – Sequence Search Smith-Waterman) was developed to perform the local alignment of a query sequence against a database of large nucleic acid sequences (between 80 and 260 kilobases) on a many-core hardware architecture. The ability to perform large alignments (obtaining the optimal local alignment) on a many-core architecture is therefore one of the distinguishing features of this work. The work justifies the time savings achieved by parallelizing several simultaneous alignments and presents a comparative study with other widely cited parallel implementations, such as the CUDASW++ algorithm. A comparison with BLAST is also included. The work concludes with a review of the state of the art in the comparison of nucleic acid and peptide sequences, aimed at determining their degree of similarity, from both an algorithmic point of view and that of biological studies that report alignments of large sequences
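For readers unfamiliar with the local alignment this thesis parallelizes, the following is a minimal sequential sketch of the textbook Smith-Waterman recurrence with traceback. It is not MC64-S3W: the scoring values (`match=2`, `mismatch=-1`, `gap=-2`), the linear gap penalty and the function name are assumptions chosen for illustration, and no parallelism is shown.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_query, aligned_target) for one optimal local alignment.

    Uses a linear gap penalty and illustrative scoring values (assumptions).
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores are clamped at 0 so an alignment can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in target
                score[i][j - 1] + gap,      # gap in query
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    aligned_q, aligned_t = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_q.append(query[i - 1])
            aligned_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_t.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_t))


result = smith_waterman("TTACGTTT", "GGACGTGG")
```

A database search in this style would call the function once per database sequence and rank the hits by score. Those calls are independent of one another, which is what makes running several alignments at once possible.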

    Workflow-based systematic design of high throughput genome annotation

    The genus Eimeria belongs to the phylum Apicomplexa, which includes many obligate intracellular protozoan parasites of man and livestock. E. tenella is one of seven species that infect the domestic chicken and cause the intestinal disease coccidiosis, which is economically important for the poultry industry. E. tenella is highly pathogenic and is often used as a model species for studies of Eimeria biology. In this PhD thesis, a comprehensive annotation system named "WAGA" (Workflow-based Automatically Genome Annotation) was built and applied to the E. tenella genome. InforSense KDE and its BioSense plug-in (products of the InforSense Company) were the core software used to build the workflows. Workflows were made by integrating individual bioinformatics tools into a single platform. Each workflow was designed to provide a standalone service for a particular task. Three major workflows were developed based on the genomic resources currently available for E. tenella: EST-based gene construction, HMM-based gene prediction and protein-based annotation. Finally, a combining workflow was built to sit above the individual ones and generate a set of automatic annotations using all of the available information. The overall system and its three major components were deployed as web servers that are fully tuneable and reusable for end users. WAGA does not require users to have programming skills or knowledge of the underlying algorithms or mechanisms of its low-level components. E. tenella was the target genome here and all the results obtained were displayed by GBrowse. A sample of the results was selected for experimental validation. For evaluation purposes, WAGA was also applied to another apicomplexan parasite, Plasmodium falciparum, the causative agent of human malaria, which has been extensively annotated. The results obtained were compared with gene predictions of PHAT, a gene finder designed for and used in the P. falciparum genome project

    Internal Stipe Necrosis of Agaricus bisporus - Etiology and Molecular Genetic Studies

    The button mushroom, Agaricus bisporus, is the most popular mushroom in cultivation worldwide, and is the most valuable protected crop in the UK, with an estimated wholesale value exceeding £250 million. In 1991 a new disease emerged in mushroom crops in the UK, called Internal Stipe Necrosis (ISN). Crop losses due to this disease may reach 10%, since affected mushrooms must be downgraded or discarded. Symptoms take the form of a variable browning reaction in the central region of the mushroom stipe, which may also demonstrate varying degrees of internal collapse. During an exhaustive study of ISN over the past 3 years, it was found that an unusual enteric bacterium was consistently associated with the disease, along with diverse members of the Pseudomonas fluorescens complex, which probably represent secondary colonisers. Several strains of the enteric bacterium reproduced ISN symptoms in trials in which mushrooms were injected with bacteria and in trials where bacteria were sprayed onto otherwise normal mushroom beds. Isolates collected from deliberate infection experiments were shown to be identical to the applied strains by the use of restriction fragment length polymorphism (RFLP) studies, using a cloned 16S rRNA gene isolated from a representative strain of the enteric bacteria. These bacteria therefore appear to satisfy Koch's Postulates as the causative agent of ISN. Conventional biochemical profiles identified the ISN causative agent as Ewingella americana, an unusual species previously unknown in mushrooms or their growing environment. This identification was confirmed by genomic DNA hybridisation using a range of reference strains taxonomically related to and including E. americana. Evidence presented suggests that E. americana produces a single endo-acting chitinase. The significance of this enzyme in ISN pathogenesis is discussed. This 33 kDa enzyme has been purified by hydrophobic interaction chromatography and the encoding gene cloned and expressed in E. coli. Sequence analysis of this gene (designated chiA) revealed an open reading frame of 921 bp, with a deduced peptide size corresponding closely to the size of the purified enzyme. The deduced amino acid sequence was most similar to the chitinase II of Aeromonas sp. No. 10S-24 and, to a lesser extent, the chitinase of Saccharopolyspora erythraeus. Alignment with other chitinases, however, revealed very low homology with the exception of two conserved motifs in the catalytic domain of these enzymes. The E. americana sequence also lacks the chitin binding and Type III fibronectin homology units common to many bacterial chitinases. Deletion of a conserved motif, which has previously been implicated as forming the active site of chitinases, produced a product retaining significant chitinolytic activity. Such evidence may lead to a reappraisal of the significance of this motif in catalysis