10 research outputs found

    Identification of a novel protein interacting with RPGR

    New approaches to facilitate genome analysis

    In this era of concerted genome sequencing efforts, biological sequence information is abundant. With many prokaryotic and simple eukaryotic genomes completed, and with the genomes of more complex organisms nearing completion, the bioinformatics community, those charged with the interpretation of these data, are becoming concerned with the efficacy of current analysis tools. One step towards a more complete understanding of biology at the molecular level is the unambiguous functional assignment of every newly sequenced protein. The sheer scale of this problem precludes the conventional process of biochemically determining function for every example. Rather, we must rely on demonstrating similarity to previously characterised proteins via computational methods, which can then be used to infer homology and hence structural and functional relationships. Our ability to do this with any measure of reliability unfortunately diminishes as the pools of experimentally determined sequence data become muddied with sequences that are themselves characterised by "in silico" annotation.
Part of the problem stems from the complexity of modelling biology in general, and of evolution in particular. For example, once similarity has been identified between sequences, in order to assign a common function it is important to identify whether the inferred homologous relationship has an orthologous or paralogous origin, which currently cannot be done computationally. The modularity of proteins also poses problems for automatic annotation, as similar domains may occur in proteins with very different functions. Once accepted into the sequence databases, incorrect functional assignments become available for mass propagation, and the consequences of incorporating those errors in further "in silico" experiments are potentially catastrophic.
One solution to this problem is to collate families of proteins with demonstrable homologous relationships, derive a pattern that represents the essence of those relationships, and use this as a signature to trawl for similarity in the sequence databases. This approach not only provides a more sensitive model of evolution, but also allows annotation from all members of the family to contribute to any assignments made. This thesis describes the development of a new search method (FingerPRINTScan) that exploits the familial models in the PRINTS database to provide more powerful diagnosis of evolutionary relationships. FingerPRINTScan is both selective and sensitive, allowing both precise identification of super-family, family and sub-family relationships, and the detection of more distant ones. Illustrations of the diagnostic performance of the method are given with respect to the haemoglobin and transfer RNA synthetase families, and whole genome data.
FingerPRINTScan has become widely used in the biological community, e.g. as the primary search interface to PRINTS via a dedicated web site at the University of Manchester, and as one of the search components of InterPro at the European Bioinformatics Institute (EBI). Furthermore, it is currently responsible for facilitating the use of PRINTS in a number of significant annotation roles, such as the automatic annotation of TrEMBL at the EBI, and as part of the computational suite used to annotate the Drosophila melanogaster genome at Celera Genomics.

    Protein-protein interactions and metabolic pathways reconstruction of Caenorhabditis elegans

    Metabolic networks are the collections of all cellular activities taking place in a living cell and all the relationships among biological elements of the cell, including genes, proteins, enzymes, metabolites, and reactions. They provide a better understanding of cellular mechanisms and phenotypic characteristics of the studied organism. In order to reconstruct a metabolic network, interactions among genes and their molecular attributes along with their functions must be known. Using this information, proteins are distributed among pathways as sub-networks of a greater metabolic network. Proteins that carry out various steps of a biological process operate in the same pathway.
The metabolic network of Caenorhabditis elegans was reconstructed based on current genomic information obtained from the KEGG database, and commonly found in SWISS-PROT and WormBase. Assuming that proteins operating in a pathway are interacting proteins, the currently available protein-protein interaction map of the studied organism was assembled. This map contains all known protein-protein interactions collected from various sources up to the time of the study. The topology of the reconstructed network was briefly studied and the role of key enzymes in the interconnectivity of the network was analysed. The analysis showed that the shortest metabolic paths represent the most probable routes taken by the organism where endogenous sources of nutrient are available to the organism. Nonetheless, there are alternate paths that allow the organism to survive under extraneous variations. Signature content information of proteins was utilized to reveal protein interactions, on the notion that when two proteins share signature(s) in their primary structures, the two proteins are more likely to interact. The signature content of proteins was used to measure the extent of similarity between pairs of proteins based on a binary similarity score.
Pairs of proteins with a binary similarity score greater than a threshold corresponding to a 95% confidence level were predicted as interacting proteins. The reliability of predicted pairs was statistically analyzed. The sensitivity and specificity analysis showed that the proposed approach outperformed the maximum likelihood estimation (MLE) approach, with a 22% increase in the area under the receiver operating characteristic (ROC) curve when both were applied to the same datasets. When proteins containing one and two known signatures were removed from the protein dataset, the area under the curve (AUC) increased from 0.549 to 0.584 and 0.655, respectively. The increase in the AUC indicates that proteins with one or two known signatures do not provide sufficient information to predict robust protein-protein interactions. Moreover, it demonstrates that when proteins with more known signatures are used in signature profiling methods, the overlap with experimental findings increases, resulting in a higher true positive rate and eventually a greater AUC. Despite the accuracy of the protein-protein interaction methods proposed here and elsewhere, they often predict true positive interactions along with numerous false positive interactions. A global algorithm was also proposed to reduce the number of false positive predicted protein interacting pairs. This algorithm relies on gene ontology (GO) annotations of proteins involved in predicted interactions. A dataset of experimentally confirmed protein pair interactions and their GO annotations was used as a training set to derive keywords that were able to recover both their source interactions (training set) and predicted interactions in other datasets (test sets). These keywords, along with the cellular component annotation of proteins, were employed to set a pair of rules that were to be satisfied by any predicted pair of interacting proteins.
When this algorithm was applied to four predicted datasets obtained using phylogenetic profiles, gene expression patterns, chance co-occurrence distribution coefficient, and maximum likelihood estimation for S. cerevisiae and C. elegans, the true positive fractions of the datasets improved 2-fold to 10-fold, depending on the computational method used to create the dataset and the available information on the organism of interest. The predicted protein-protein interactions were incorporated into the previously reconstructed metabolic network of C. elegans, resulting in 1024 new interactions among 94 metabolic pathways. In each of the 1024 new interactions, one unknown protein interacted with a known partner found in the reconstructed metabolic network. Unknown proteins were characterized based on the involvement of their known partners. Based on the binary similarity scores, the function of an uncharacterized protein in an interacting pair was defined according to its known counterpart whose function was already specified. Incorporating the new predicted interactions into the metabolic network produced an expanded network with a 27% increase in the number of known proteins involved in metabolism. The connectivity of proteins in the protein-protein interaction map changed from 42 to 34 due to the increase in the number of characterized proteins in the network.
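The signature-sharing idea in the abstract above can be illustrated with a minimal sketch. The abstract does not define its binary similarity score, so the Jaccard coefficient is used here purely as an assumption, and the protein names, signature identifiers, and threshold are hypothetical.

```python
from itertools import combinations


def binary_similarity(sig_a, sig_b):
    """Jaccard coefficient between two signature sets.

    Assumption: the abstract does not specify its score; Jaccard is a
    common stand-in for binary (presence/absence) similarity.
    """
    a, b = set(sig_a), set(sig_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def predict_interactions(signatures, threshold):
    """Return (protein, protein, score) for pairs scoring above threshold."""
    pairs = []
    for p, q in combinations(sorted(signatures), 2):
        score = binary_similarity(signatures[p], signatures[q])
        if score > threshold:
            pairs.append((p, q, score))
    return pairs


# Hypothetical proteins and signature identifiers, for illustration only.
example = {
    "protA": {"SIG1", "SIG2", "SIG3"},
    "protB": {"SIG2", "SIG3"},
    "protC": {"SIG9"},
}
predicted = predict_interactions(example, threshold=0.5)
```

In the thesis the threshold corresponds to a 95% confidence level; the fixed value 0.5 here is only a placeholder for that cut-off.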

    Identification and functional analysis of thylakoid membrane proteome

    Membrane proteins play crucial roles in many biological functions. The identities and functions of most membrane proteins remain to be revealed. New technological breakthroughs in proteomics, together with the availability of genomic sequence information, make it possible to study the functions of membrane proteins on a genome-wide scale. We used a multidisciplinary approach combining biochemistry, genetics, proteomics and bioinformatics to study the functions of the thylakoid proteome of Synechocystis sp. PCC6803. The thylakoid membrane proteins were separated into peripheral and integral fractions and resolved on 2-D gels with different pH ranges. The protein spots in the 2-D gels were subjected to peptide mass fingerprinting analysis, and in total 390 out of 558 analyzed spots were identified as protein products of 128 individual genes, of which 38 genes encode hypothetical proteins with unknown function. To study the function of some hypothetical proteins, we inactivated a set of genes, and 10 knockout mutants were obtained. Growth analysis of the mutant cells revealed that only one mutant (H1), which has a deletion in the ORF slr0110, showed a conditional growth phenotype. Detailed analysis indicated that the H1 mutant is sensitive to both glucose and light, which is caused by the over-reduction of the PQ pool in the thylakoid membrane. The ID and the structural and functional information of the identified proteins, as well as the 2-D reference maps, were included in a web-based relational database for thylakoid membrane proteins. The database was constructed with MySQL, and the application programs were developed with SQL, Perl, JavaScript and HTML. Users can search the information on identified proteins and compare their own identified proteins with the identified proteins in the database. A manager interface is also provided for the routine maintenance of the database.

    The SBASE protein domain library, release 7.0: a collection of annotated protein sequence segments

    Molecular genetic and functional analyses of X-linked congenital cataract.

    Nance-Horan Syndrome (NHS) is an X-linked developmental syndrome characterised by congenital cataract, dental anomalies, and dysmorphological features often associated with mental retardation. The NHS locus on Xp22.13 is encompassed by the disease locus for X-linked congenital cataract (CXN). Analysis of microsatellites within the CXN family resulted in refinement of the CXN disease interval, reducing the region of overlap between the CXN and NHS disease loci. Candidate genes in the overlapping intervals were identified bioinformatically and their genomic structures evaluated. Patient DNA was screened by direct sequencing, resulting in the identification of mutations within a novel gene in four British families with NHS, but not the CXN family. This novel gene, named NHS, is encoded by at least 10 exons transcribed into at least five mRNA isoforms A, B, C, D, and E (encoding a putative 1,630 a.a., 1,335 a.a., 1,474 a.a., 1,453 a.a., and 1,473 a.a. protein, respectively). All mutations identified are truncating, and three mutations have been identified in exon 1, which is only expressed in isoform A. This implies that mutations in isoform A are sufficient to cause disease in families with NHS. Functional clues for the NHS protein were investigated, resulting in the identification of three new genes with significant homology to NHS (NHS-Like 1 (NHSL1), NHSL2 and NHSL3). All four genes share a conserved genomic structure. Fetal expression analysis of NHS, NHSL1 and NHSL2 suggests that NHSL1 and NHSL2 are more ubiquitous than NHS. Analysis of the NHS family of proteins revealed significant homology to members of the WASP family, which consists of WASP, N-WASP and WAVE 1-3. The WASP protein family plays a crucial role in regulating actin dynamics, directly linking small GTPase signalling to actin polymerisation through activation of the Arp2/3 complex.
An anti-peptide antibody to the C-terminus of NHS, completely conserved across species, was raised and characterised. A major NHS isoform (approximately 170 kDa) was detected in several cell lines. Subcellular localisation studies in MTLn3 cells showed localisation of endogenous NHS to the leading edge of lamellipodia, a localisation pattern reminiscent of the Arp2/3 complex. Endogenous NHS also localised to some actin stress fibres. Homology to the WASP protein family and localisation of endogenous NHS to the leading edge of lamellipodia strongly support a role for NHS in actin cytoskeletal dynamics during development.

    Clustering of scientific fields by integrating text mining and bibliometrics.

    The increasing dissemination of scientific and technological publications via the internet, and their availability in large-scale bibliographic databases, create enormous opportunities for mapping science and technology. The continuing growth in available computing power and the development of new algorithms also contribute to this. Important challenges remain, however. This dissertation confirms the hypothesis that the accuracy of both clustering scientific fields and classifying publications can be further improved by integrating text mining and bibliometrics. Both the textual and the bibliometric approach have advantages and disadvantages, and each offers a different view of a corpus of scientific publications or patents. On the one hand, such documents contain a wealth of textual information; on the other hand, the citations among them form large networks that provide additional information. We integrate both viewpoints and show how existing textual and bibliometric methods can be improved. The dissertation consists of three parts. First, we discuss the use of text mining techniques for information retrieval and for mapping the knowledge contained in texts. We introduce and demonstrate the text mining framework, as well as the use of agglomerative hierarchical clustering. We also investigate the relationship between clustering performance on the one hand and the desired number of clusters and the number of factors in latent semantic indexing on the other. In addition, we describe a combined, semi-automatic strategy for determining the number of clusters in a collection of documents. Second, we address networks consisting of citations between scientific documents and networks arising from collaborations between authors.
Such networks can be analysed with techniques from bibliometrics and graph theory, with the aim of ranking relevant entities, clustering, and discovering communities. Third, we demonstrate the complementarity of text mining and bibliometrics and propose ways to integrate both worlds correctly. The performance of unsupervised clustering and of classification improves significantly when the textual content of scientific publications is combined with the structure of citation networks. A method based on statistical meta-analysis achieves the best results and outperforms methods based on text or citations alone. Our integrated or hybrid strategies for information retrieval and clustering are demonstrated in two domain studies. The aim of the first study is to unravel and visualise the concept structure of information science and to assess the added value of the hybrid method. The second study covers the cognitive structure, bibliometric properties and dynamics of bioinformatics. We develop a method for dynamic and integrated clustering of evolving bibliographic corpora. This method compares and tracks clusters over time. In summary, for the complementary text and network worlds we design a hybrid clustering method that takes both paradigms into account simultaneously. We also show that the integrated view yields a better understanding of the structure and evolution of scientific fields.
SISTA
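As a rough illustration of combining textual and citation information into one document similarity, the sketch below uses a weighted linear combination of cosine similarity over term weights and a Jaccard-style bibliographic coupling score. This is an assumption chosen for simplicity; it is not the statistical meta-analysis method that the abstract reports as performing best, and all names and the `weight` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def coupling(refs_a, refs_b):
    """Bibliographic coupling as the Jaccard overlap of cited references."""
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0


def hybrid_similarity(doc_a, doc_b, weight=0.5):
    """Weighted mix of text and citation similarity (weight is illustrative)."""
    text_sim = cosine(doc_a["terms"], doc_b["terms"])
    cite_sim = coupling(doc_a["refs"], doc_b["refs"])
    return weight * text_sim + (1.0 - weight) * cite_sim


# Hypothetical documents: identical vocabulary, partly shared references.
d1 = {"terms": {"gene": 1.0, "cluster": 1.0}, "refs": {"r1", "r2"}}
d2 = {"terms": {"gene": 1.0, "cluster": 1.0}, "refs": {"r2", "r3"}}
score = hybrid_similarity(d1, d2)
```

The resulting pairwise scores could feed any standard clustering routine, such as the agglomerative hierarchical clustering mentioned in the abstract.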

    Genomics and genetic engineering of Helicoverpa armigera nucleopolyhedrovirus

    The single nucleocapsid nucleopolyhedrovirus (SNPV) of the bollworm Helicoverpa armigera has been extensively used to control this insect around the world, especially in China. However, in order to compete with chemical insecticides, mainly on speed of action, novel approaches are sought to improve the efficacy of HaSNPV, either by selection of superior natural variants or by genetic engineering. Prior to the development of improved HaSNPV by genetic engineering, understanding of the structure and expression strategy of the HaSNPV genome is required. This thesis describes studies aimed at unraveling the genetic properties of the HaSNPV genome. Furthermore, this research can provide molecular information on the taxonomic status of baculovirus morphotypes, i.e. single nucleocapsid NPVs (SNPV) versus multiple nucleocapsid NPVs (MNPV), and ultimately on the phylogenetic relationship among baculoviruses in general.
The polyhedrin gene, a highly conserved gene among baculoviruses and encoding the major structural protein of viral polyhedra, was localized on the HaSNPV genome and then characterized (Chapter 2). This indicated that the HaSNPV polyhedrin has a high degree of sequence similarity to that of H. zea SNPV. From this preliminary analysis it appeared that SNPVs are not a separate group from the MNPVs. The position of the HaSNPV polyhedrin gene was chosen as the zero point of the circular physical map of the viral genome (Chapters 5 and 6). The polyhedrin promoter, with a typical baculovirus late transcription initiation motif, was used to drive the expression of a green fluorescent protein (GFP) and a toxin in recombinant HaSNPV (Chapter 8).
In the larval stages the enzyme ecdysteroid UDP-glucosyltransferase (EGT) catalyzes the conjugation of ecdysteroid with sugars and is involved in the prevention of molting and pupation.
Baculoviruses generally encode such an enzyme, resulting in the prevention of molting of infected larvae and enhanced polyhedra production. The HaSNPV egt gene was located on the Hin d-D fragment and characterized (Chapter 3). Phylogenetic analysis of this gene confirmed that HaSNPV belongs to the Group II NPVs. To further analyze the relationship between HaSNPV and other baculoviruses, a late expression factor 2 gene (lef-2) was identified and characterized (Chapter 4). This gene is essential for viral DNA replication and most likely functions as a DNA primase processivity factor. This is the first lef-2 gene characterized in any SNPV to date. Even though lef-2, an essential gene, and egt, an auxiliary gene, most likely have been under different pressures in their evolutionary past, the phylogenetic tree of baculovirus LEF-2 appeared to be comparable in form to that of EGT. The positive correlation of the genomic location of the lef-2 genes relative to polyhedrin/granulin genes and the clade structure of the gene trees (lef-2, egt) suggest that genome organization and gene phylogeny represent independent parameters to study the evolutionary history of baculoviruses.
In order to study the genome organization and phylogenetic status of HaSNPV, a plasmid library of its 130.1 kb-long DNA genome was made and a detailed physical map of the viral DNA was constructed (Chapter 5). From about 45 kb of dispersed sequence data generated from the plasmid library, fifty-three putative open reading frames (ORFs) with homology to ORFs of other baculoviruses were identified and their locations on the genome of HaSNPV were determined. The basic gene content of HaSNPV appeared to be quite similar to that of AcMNPV, BmNPV, and OpMNPV (Group I NPVs). However, the arrangement of the ORFs along the HaSNPV genome differed significantly from that of the Group I NPVs, which all have a highly collinear genome, or that of the granulovirus XcGV.
In contrast, the genomes of HaSNPV and SeMNPV (Group II NPVs) are highly collinear, both in gene content and organization. This close relatedness between an MNPV and an SNPV is supported by the phylogeny of selected genes (Chapters 2 and 3) of these two viruses and suggests that the NPV morphotype (S or M) has only a taxonomic but not a phylogenetic meaning. Homologous regions (hrs), a common feature of baculovirus MNPV genomes, were also found dispersed across the HaSNPV genome, suggesting that their presence is common to all NPVs.
So far, only MNPV and GV genomes have been sequenced to completion; no SNPV genome has been. Therefore the entire HaSNPV genome sequence was determined (Chapter 7). The circular, double-stranded DNA genome contains 131,403 bp and has a G+C content of 39.1%, the lowest value among baculoviruses to date. Of 135 potential ORFs predicted from the sequence, 115 have a homologue in other baculoviruses; twenty are unique to HaSNPV and are subject to further investigation. Upon comparison with the available genomic sequences, sixty-five ORFs were found to be present in all baculoviruses, and hence they are considered 'core' baculovirus genes. The HaSNPV genome lacks a homologue of the major budded virus (BV) glycoprotein gene gp64 of Group I NPVs. Instead, a functional homologue (Ha133) of gp64 was identified after comparison with SeMNPV. The mean overall amino acid identity of the HaSNPV ORFs was highest with SeMNPV and LdMNPV homologues. This is in accordance with their common genome organization and confirms that HaSNPV, together with SeMNPV and LdMNPV, clusters into Group II NPVs, while AcMNPV, BmNPV and OpMNPV belong to the Group I NPVs. In this analysis GV behaved like a separate group.
The clade structure based on selected genes (lef-2 and egt) is further strongly supported by genome trees based on all conserved ORFs together and based on gene content as well as gene order on the genomes compared.
HaSNPV and HzSNPV share many common biological features, such as the same heliothine host range (Chapter 1). Sequence analysis of the complete HzSNPV genome revealed that HaSNPV and HzSNPV have a high degree of ORF identity, which is in line with the view that they are two different isolates of the same virus species (Chapters 6 and 7). The HzSNPV genome encodes 139 potential ORFs, of which 135 have homologues in HaSNPV. Four ORFs are unique to HzSNPV. However, these unique ORFs are small, are always found adjacent to hr regions, and their functionality remains to be determined. Alignment of the genome sequences indicated that, overall, ORFs of HzSNPV have a high degree of identity with the homologues in the HaSNPV genome at the nucleotide (99%) and amino acid (98%) level. The 65 baculovirus core genes among these two viruses have the lowest nucleotide substitution rate, but the hrs showed the highest variation. Two 'baculovirus repeat orfs' (bro) genes in these two viruses have the highest sequence divergence and might have a different evolutionary history.
Deletion of egt from the baculoviral genome has been shown to increase the speed of kill of the virus and hence to reduce the crop damage by infected insects. This approach, along with the insertion of a scorpion neurotoxin gene, was used to generate recombinant HaSNPV with potentially improved insecticidal activity. The egt gene was deleted from the genome and replaced by the GFP gene and/or by an insect-specific toxin gene, AaIT (Chapter 8). Bioassay data indicated a significant reduction in the time (LT50) required for each of the HaSNPV recombinants to kill second instar H. armigera larvae. The LT50 of the egt deletion recombinants was about 27% shorter than that of wild type HaSNPV.
The largest reduction in LT50 (32%) was observed when the egt gene was replaced by the scorpion neurotoxin AaIT gene.
The genetic and genomic analysis presented in this thesis shows that HaSNPV and HzSNPV are different variants of the same virus species. Alignment of the known baculovirus genome sequences did not clearly show the molecular basis for the baculovirus S and M NPV morphotype. Phylogenetic analysis of genes and of genome organization, such as gene content and gene order, confirmed that baculoviruses can be separated into Group I and II NPV and into a GV group. Based on the investigation of the HaSNPV genome, HaSNPV recombinants with enhanced insecticidal properties were successfully constructed, providing alternative agents for bollworm control in China and elsewhere in the world.
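The G+C content quoted in the abstract above is the percentage of G and C bases among all unambiguous bases in the sequence. A minimal sketch of that calculation follows; the function name and the toy sequence are hypothetical and are not taken from the HaSNPV genome.

```python
def gc_content(sequence):
    """Percentage of G and C among the A, C, G, T bases of a DNA sequence.

    Ambiguous symbols such as N are ignored, which is one common
    convention and an assumption here.
    """
    seq = sequence.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


# Toy sequence for illustration: 4 of 10 bases are G or C.
toy_gc = gc_content("ATGCATGCAA")
```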