12 research outputs found
MSDmotif: exploring protein sites and motifs
Background: Protein structures have conserved features, called motifs, which have a significant influence on protein function. These motifs can be found in sequence as well as in 3D space. Understanding these fragments is essential for 3D structure prediction, modelling and drug design. The Protein Data Bank (PDB) is the source of this information; however, present search tools offer limited 3D options for integrating a protein sequence with its 3D structure. Results: We describe here a web application for querying the PDB for ligands, binding sites, small 3D structural and sequence motifs, together with the underlying database. Novel algorithms for searching chemical fragments, 3D motifs, ϕ/ψ sequences, super-secondary structure motifs and small 3D structural motif associations are incorporated. The interface provides functionality for visualization, search criteria creation, and sequence and 3D multiple alignment options. MSDmotif is an integrated system where a results page is also a search form. A set of motif statistics is available for analysis. This set includes molecule and motif binding statistics, distribution of motif sequences, occurrence of an amino acid within a motif, correlation of amino-acid side-chain charges within a motif and Ramachandran plots for each residue. The binding statistics are presented in association with properties that include a ligand fragment library. Access is also provided through the Distributed Annotation System (DAS) protocol. An additional entry point facilitates XML requests with XML responses. Conclusion: MSDmotif is unique in combining chemical, sequence and 3D data in a single search engine with a range of search and visualisation options. It provides multiple views of data found in the PDB archive for exploring protein structures.
DoBo: Protein domain boundary prediction by integrating evolutionary signals and machine learning
Background: Accurate identification of protein domain boundaries is useful for protein structure determination and prediction. However, predicting protein domain boundaries from a sequence is still very challenging and largely unsolved. Results: We developed a new method to integrate the classification power of machine learning with evolutionary signals embedded in protein families in order to improve protein domain boundary prediction. The method first extracts putative domain boundary signals from a multiple sequence alignment between a query sequence and its homologs. The putative sites are then classified and scored by support vector machines in conjunction with input features such as sequence profiles, secondary structures, solvent accessibilities around the sites and their positions. The method was evaluated on a domain benchmark by 10-fold cross-validation and 60% of true domain boundaries can be recalled at a precision of 60%. The trade-off between the precision and recall can be adjusted according to specific needs by using different decision thresholds on the domain boundary scores assigned by the support vector machines. Conclusions: The good prediction accuracy and the flexibility of selecting domain boundary sites at different precision and recall values make our method a useful tool for protein structure determination and modelling. The method is available at http://sysbio.rnet.missouri.edu/dobo/.
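The precision/recall trade-off mentioned in this abstract can be illustrated with a minimal sketch. It is not the DoBo implementation: the function name `precision_recall`, the scores and the labels below are hypothetical toy values, standing in for SVM scores assigned to candidate boundary sites.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when sites scoring >= threshold are called boundaries.

    scores: classifier scores for candidate sites (hypothetical values here).
    labels: True where the site is a real domain boundary.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    # Convention (an assumption): precision is 1.0 when nothing is predicted.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data, invented for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [True, False, True, True, False]

for t in (0.5, 0.2):
    p, r = precision_recall(toy_scores, toy_labels, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold lets more candidate sites through and raises recall. Whether precision rises or falls depends on how well the scores separate true from false sites.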
A Re-Annotation of the Saccharomyces Cerevisiae Genome
Discrepancies in gene and orphan number indicated by previous analyses suggest that
S. cerevisiae would benefit from a consistent re-annotation. In this analysis three new genes
are identified and 46 alterations to gene coordinates are described. 370 ORFs are defined
as totally spurious ORFs which should be disregarded. At least a further 193 genes could
be described as very hypothetical, based on a number of criteria.
It was found that disparate genes with sequence overlaps over ten amino acids (especially
at the N-terminus) are rare in both S. cerevisiae and Sz. pombe. A new S. cerevisiae gene
number estimate with an upper limit of 5804 is proposed, but after the removal of very
hypothetical genes and pseudogenes this is reduced to 5570. Although this is likely to be
closer to the true upper limit, it is still predicted to be an overestimate of gene number. A
complete list of revised gene coordinates is available from the Sanger Centre (S. cerevisiae
reannotation: ftp://ftp/pub/yeast/SCreannotation)
The linear chromosome of the plant-pathogenic mycoplasma 'Candidatus Phytoplasma mali'
BACKGROUND: Phytoplasmas are insect-transmitted, uncultivable bacterial plant pathogens that cause diseases in hundreds of economically important plants. They represent a monophyletic group within the class Mollicutes (trivial name mycoplasmas) and are characterized by a small genome with a low GC content, and the lack of a firm cell wall. All mycoplasmas, including strains of 'Candidatus (Ca.) Phytoplasma asteris' and 'Ca. P. australiense', examined so far have circular chromosomes, as is the case for almost all walled bacteria. RESULTS: Our work has shown that 'Ca. Phytoplasma mali', the causative agent of apple proliferation disease, has a linear chromosome. Linear chromosomes were also identified in the closely related provisional species 'Ca. P. pyri' and 'Ca. P. prunorum'. The chromosome of 'Ca. P. mali' strain AT is 601,943 bp in size and has a GC content of 21.4%. The chromosome is further characterized by large terminal inverted repeats and covalently closed hairpin ends. Analysis of the protein-coding genes revealed that glycolysis, the major energy-yielding pathway supposed for 'Ca. P. asteris', is incomplete in 'Ca. P. mali'. Due to the apparent lack of other metabolic pathways present in mycoplasmas, it is proposed that maltose and malate are utilized as carbon and energy sources. However, complete ATP-yielding pathways were not identified. 'Ca. P. mali' also differs from 'Ca. P. asteris' by a smaller genome, a lower GC content, a lower number of paralogous genes, fewer insertions of potential mobile DNA elements, and a strongly reduced number of ABC transporters for amino acids. In contrast, 'Ca. P. mali' has a more extensive set of genes for homologous recombination, excision repair and SOS response than 'Ca. P. asteris'. CONCLUSION: The small linear chromosome with large terminal inverted repeats and covalently closed hairpin ends, the extremely low GC content and the limited metabolic capabilities reflect unique features of 'Ca. P. mali', not only within phytoplasmas, but among all mycoplasmas. It is expected that the genome information obtained here will contribute to a better understanding of the reduced metabolism of phytoplasmas, their fastidious nutritional requirements that have prevented axenic cultivation, and the mechanisms involved in pathogenicity.
Comparative analysis of pseudogenes across three phyla
Pseudogenes are degraded fossil copies of genes. Here, we report
a comparison of pseudogenes spanning three phyla, leveraging
the completed annotations of the human, worm, and fly genomes,
which we make available as an online resource. We find that
pseudogenes are lineage specific, much more so than protein-coding
genes, reflecting the different remodeling processes marking
each organism’s genome evolution. The majority of human
pseudogenes are processed, resulting from a retrotranspositional
burst at the dawn of the primate lineage. This burst can be seen in
the largely uniform distribution of pseudogenes across the genome,
their preservation in areas with low recombination rates,
and their preponderance in highly expressed gene families. In contrast,
worm and fly pseudogenes tell a story of numerous duplication
events. In worm, these duplications have been preserved
through selective sweeps, so we see a large number of pseudogenes
associated with highly duplicated families such as chemoreceptors.
However, in fly, the large effective population size and
high deletion rate resulted in a depletion of the pseudogene complement.
Despite large variations between these species, we also
find notable similarities. Overall, we identify a broad spectrum of
biochemical activity for pseudogenes, with the majority in each organism
exhibiting varying degrees of partial activity. In particular,
we identify a consistent amount of transcription (∼15%) across all
species, suggesting a uniform degradation process. Also, we see
a uniform decay of pseudogene promoter activity relative to their
coding counterparts and identify a number of pseudogenes with
conserved upstream sequences and activity, hinting at potential
regulatory roles
Parallel "Smith-Waterman" on a many-core architecture for searches in sequence databases
87 p. Master's thesis supervised by Sergio Gálvez Rojas, with co-supervisors Oswaldo Trelles Salazar and Gabriel Dorado Pérez. In this work, an algorithm called MC64-S3W (MultiCore 64 – Sequence Search Smith-Waterman) was developed to perform the local alignment of a query sequence against a database of large nucleic acid sequences (between 80 and 260 kilobases) on a many-core hardware architecture. The ability to perform large alignments (obtaining the optimal local alignment) on a many-core architecture is therefore one of the distinguishing features of this work. The work justifies the time saved by parallelizing several simultaneous alignments and presents a comparative study against other widely cited parallel implementations, such as the CUDASW++ algorithm. A comparison with BLAST is also included. The work is completed by a review of the state of the art in the comparison of nucleic acid and peptide sequences to determine their degree of similarity, both from an algorithmic point of view and from the point of view of biological studies that report alignments of large sequences.
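For readers unfamiliar with the underlying technique, the following is a minimal serial sketch of textbook Smith-Waterman local alignment with a linear gap penalty. It is not MC64-S3W and includes none of its many-core parallelization. The function name `smith_waterman` and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 is what makes the alignment local rather than global.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTACGTTT", "GGACGTGG")
print(score, top, bottom)
```

The full matrix needs time and memory proportional to the product of the two sequence lengths. This cost is why long sequences in the range the abstract mentions motivate parallel implementations.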
Computational analysis of the Caenorhabditis elegans genome sequence.
The genomic sequencing of the model genetic organism, the nematode Caenorhabditis elegans, is now essentially complete, representing the first genome sequence to be derived for a multicellular organism. This thesis describes the strategies and software tools that have been utilized in the analysis of the genomic sequence. Preliminary analysis of genomic organisation is also presented.
C. elegans chromosomes do not store genetic information in a uniform manner. Gene density varies between different chromosomal regions and between chromosomes. The highly recombinogenic autosomal arms possess more repetitive elements and generally have a lower gene density than the recombinationally suppressed central regions, although the gene density within autosomal arms is higher than had been previously expected. A positive correlation is observed between the number of genetically defined loci from a chromosomal region and the expression rate of a region as estimated by the abundance of Expressed Sequence Tags (ESTs). A similar positive correlation is observed with the proportion of genes possessing similarity to non-nematode proteins. Chromosomal regions with a high density of gene clusters have fewer genetically derived loci, demonstrating that redundancy reduces the genetic accessibility of a region to classical genetic approaches. Introns are larger on the autosomal arms than in the central clusters. Exon length shows no correlation with chromosomal position but increases with expression rate. Stop codon preference is also influenced by expression rate. Clusters of similar genes are also found on the C. elegans chromosomes, although their distribution is not random. The majority of gene clusters have been determined to lie on chromosome V and the left arm of II. The orientation of the genes within gene clusters suggests that inversion events are common and provide a selective advantage. Alternative splicing has also been studied and the results suggest that many alternative transcripts can be attributed to errors in splice acceptor processing.
Workflow-based systematic design of high throughput genome annotation
The genus Eimeria belongs to the phylum Apicomplexa, which includes many obligate intracellular protozoan parasites of man and livestock. E. tenella is one of seven species that infect
the domestic chicken and cause the intestinal disease coccidiosis, which is economically important
for the poultry industry. E. tenella is highly pathogenic and is often used as a model species for
studies of Eimeria biology. In this PhD thesis, a comprehensive annotation system named
"WAGA" (Workflow-based Automatically Genome Annotation) was built and applied to
the E. tenella genome. InforSense KDE and its BioSense plug-in (products of the InforSense
Company) were the core software used to build the workflows.
Workflows were made by integrating individual bioinformatics tools into a single platform.
Each workflow was designed to provide a standalone service for a particular task. Three major
workflows were developed based on the genomic resources currently available for E. tenella.
These were EST-based gene construction, HMM-based gene prediction and protein-based
annotation. Finally, a combining workflow was built to sit above the individual ones to generate
a set of automatic annotations using all of the available information. The overall system and
its three major components were deployed as web servers that are fully tuneable and reusable
for end users. WAGA does not require users to have programming skills or knowledge of the
underlying algorithms or mechanisms of its low level components.
E. tenella was the target genome here and all the results obtained were displayed by GBrowse.
A sample of the results was selected for experimental validation. For evaluation purposes, WAGA
was also applied to another Apicomplexa parasite, Plasmodium falciparum, the causative agent
of human malaria, which has been extensively annotated. The results obtained were compared
with gene predictions of PHAT, a gene finder designed for and used in the P. falciparum genome
project
The regulatory roles of PyrR and Crc in pyrimidine metabolism in Pseudomonas aeruginosa
The regulatory gene for pyrimidine biosynthesis has been identified and designated pyrR. The pyrR gene product was purified to homogeneity and found to have a monomeric molecular mass of 19 kDa. The pyrR gene is located directly upstream of the pyrBC' genes in the pyrRBC' operon. Insertional mutagenesis of pyrR led to a 50-70% decrease in the expression of pyrBC', pyrD, pyrE and pyrF, while pyrC was unchanged. This suggests that PyrR is a positive activator. The upstream regions of the pyrD, pyrE and pyrF genes contain a common conserved 9 bp sequence to which the purified PyrR protein is proposed to bind. This consensus sequence is absent in pyrC but is present, as an imperfect inverted repeat separated by 11 bp, within the promoter region of pyrR. Gel retardation assays using upstream DNA fragments showed that PyrR binds to the DNA of pyrD, pyrE and pyrF as well as pyrR. This suggests that expression of pyrR is autoregulated; moreover, a stable stem-loop structure was identified in the pyrR promoter region such that the SD sequence and the translation start codon for pyrR are sequestered. β-galactosidase activity from transcriptional pyrR::lacZ fusion assays showed a two-fold increase when expressed in a pyrR- strain compared to the isogenic pyrR+ strain. Thus, pyrR is negatively regulated while the other pyr genes (except pyrC) are positively activated by PyrR. That no regulation was seen for pyrC is in keeping with the recent discovery of a second functional pyrC that is not regulated in P. aeruginosa. Gel filtration chromatography shows that the PyrR protein exists in a dynamic equilibrium, and it is proposed that PyrR functions as a monomer in activating pyrD, pyrE and pyrF and as a dimeric repressor for pyrR by binding to the inverted repeat. A related study discovered that the catabolite repression control (Crc) protein was indirectly involved in pyr gene regulation, and was shown to negatively regulate expression of PyrR at the posttranscriptional level.
Internal Stipe Necrosis of Agaricus bisporus - Etiology and Molecular Genetic Studies
The button mushroom, Agaricus bisporus is the most popular mushroom in cultivation worldwide, and is the most valuable protected crop in the UK, with an estimated wholesale value exceeding £250 million. In 1991 a new disease emerged in mushroom crops in the UK, called Internal Stipe Necrosis (ISN). Crop losses due to this disease may reach 10 %, since affected mushrooms must be downgraded or discarded. Symptoms take the form of a variable browning reaction in the central region of the mushroom stipe, which may also demonstrate varying degrees of internal collapse.
During an exhaustive study of ISN over the past 3 years, it was found that an unusual enteric bacterium was consistently associated with the disease, along with diverse members of the Pseudomonas fluorescens complex, which probably represent secondary colonisers. Several strains of the enteric bacterium reproduced ISN symptoms in trials in which mushrooms were injected with bacteria and in trials where bacteria were sprayed onto otherwise normal mushroom beds. Isolates collected from deliberate infection experiments were shown to be identical to the applied strains by restriction fragment length polymorphism (RFLP) studies, using a cloned 16S rRNA gene isolated from a representative strain of the enteric bacteria. These bacteria therefore appear to satisfy Koch's Postulates as the causative agent of ISN.
Conventional biochemical profiles identified the ISN causative agent as Ewingella americana, an unusual species previously unknown in mushrooms or their growing environment. This identification was confirmed by genomic DNA hybridisation using a range of reference strains taxonomically related to and including E. americana.
Evidence presented suggests that E. americana produces a single endo-acting chitinase. The significance of this enzyme in ISN pathogenesis is discussed. This 33 kDa enzyme has been purified by hydrophobic interaction chromatography and the encoding gene cloned and expressed in E. coli. Sequence analysis of this gene (designated chiA) revealed an open reading frame of 921 bp, with a deduced peptide size corresponding closely to the size of the purified enzyme. The deduced amino acid sequence was most similar to the chitinase II of Aeromonas sp. No. 10S-24 and, to a lesser extent, the chitinase of Saccharopolyspora erythraeus. Alignment with other chitinases, however, revealed very low homology with the exception of two conserved motifs in the catalytic domain of these enzymes. The E. americana sequence also lacks the chitin binding and Type III fibronectin homology units common to many bacterial chitinases. Deletion of a conserved motif, which has previously been implicated as forming the active site of chitinases, produced a product retaining significant chitinolytic activity. Such evidence may lead to a reappraisal of the significance of this motif in catalysis
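The statement that the 921 bp open reading frame gives a deduced peptide size close to the 33 kDa purified enzyme can be checked with back-of-the-envelope arithmetic. The sketch below rests on two assumptions not stated in the abstract: that the 921 bp include the stop codon, and that an average residue mass of about 110 Da (a common rule of thumb) applies. The function name `estimate_mass_kda` is hypothetical.

```python
# Rule-of-thumb average mass of one amino-acid residue, in daltons (assumption).
AVG_RESIDUE_MASS_DA = 110.0


def estimate_mass_kda(orf_bp, includes_stop=True):
    """Rough protein mass in kDa from an ORF length in base pairs."""
    if orf_bp % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    codons = orf_bp // 3
    # The stop codon, if counted in the ORF length, encodes no residue.
    residues = codons - 1 if includes_stop else codons
    return residues * AVG_RESIDUE_MASS_DA / 1000.0


print(f"{estimate_mass_kda(921):.1f} kDa")
```

Under these assumptions, 921 bp correspond to 307 codons and 306 residues, or roughly 33.7 kDa, which is in line with the 33 kDa reported for the purified enzyme. The estimate ignores the actual amino acid composition and any processing of the mature protein.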