368 research outputs found

    An intelligent data-centric approach toward identification of conserved motifs in protein sequences

    The continued integration of the computational and biological sciences has revolutionized genomic and proteomic studies. However, efficient collaboration between these fields requires the creation of shared standards. A common problem arises when biological input does not fit the expectations of the algorithm, which can result in misinterpretation of the output. This potential confounding of input and output is a particular drawback of motif-finding software. Here we propose a method for improving output by selecting input based on evolutionary distance, domain architecture, and known function. This method improved detection of both known and unknown motifs in two separate case studies. By standardizing input considerations, both biologists and bioinformaticians can better interpret and design increasingly sophisticated bioinformatic software.

    Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?

    The organization and mining of malaria genomic and post-genomic data is strongly motivated by the need to predict and characterize new biological targets and new drugs. Biological targets are sought in a biological space built from the genomic data of Plasmodium falciparum, but also using the millions of genomic data from other species. Drug candidates are sought in a chemical space containing the millions of small molecules stored in public and private chemolibraries. Data management should therefore be as reliable and versatile as possible. In this context, we examined five aspects of the organization and mining of malaria genomic and post-genomic data: 1) the comparison of protein sequences, including compositionally atypical malaria sequences; 2) the high-throughput reconstruction of molecular phylogenies; 3) the representation of biological processes, particularly metabolic pathways; 4) versatile methods to integrate genomic data, biological representations and functional profiling obtained from X-omic experiments after drug treatments; and 5) the determination and prediction of protein structures and their molecular docking with drug candidate structures. Progress toward a grid-enabled chemogenomic knowledge space is discussed. Comment: 43 pages, 4 figures, to appear in Malaria Journa

    Innovative Algorithms and Evaluation Methods for Biological Motif Finding

    Biological motifs are defined as over-represented recurring sub-patterns in biological systems. Sequence motifs and network motifs are examples of biological motifs. Because of their wide range of applications, many algorithms and computational tools have been developed to search efficiently for biological motifs. As a result, there are more computationally derived motifs than experimentally validated ones, and how to validate the biological significance of these ‘candidate motifs’ has become an important question. Some sequence motifs are verified by their structural similarities or their functional roles in DNA or protein sequences, and are stored in databases. However, the biological role of network motifs has not yet been validated, and currently no databases exist for this purpose. In this thesis, we focus not only on computational efficiency but also on the biological meaning of the motifs. We provide an efficient way to incorporate biological information into clustering analysis methods: for example, a sparse nonnegative matrix factorization (SNMF) method is used with Chou-Fasman parameters for protein motif finding. Biological network motifs are searched for by various clustering algorithms using Gene Ontology (GO) information. Experimental results show that the algorithms perform better than existing algorithms by producing a larger number of high-quality biological motifs. In addition, we apply biological network motifs to the discovery of essential proteins. Essential proteins are defined as a minimum set of proteins that are vital for development to a fertile adult and for cellular life in an organism. We design a new centrality algorithm based on biological network motifs, named MCGO, and score proteins in a protein-protein interaction (PPI) network to find essential proteins. MCGO is also combined with other centrality measures to predict essential proteins using machine learning techniques. 
This thesis makes three contributions to the study of biological motifs: 1) clustering analysis is used efficiently, and biological information is easily integrated with the analysis; 2) we focus more on the biological meaning of motifs by adding biological knowledge to the algorithms and by suggesting biologically relevant evaluation methods; 3) biological network motifs are successfully applied to a practical problem, the prediction of essential proteins.
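
The abstract above does not describe how MCGO scores proteins, so the following is only a minimal sketch of the general idea of a motif-based centrality. It assumes the simplest possible network motif (a triangle of mutually interacting proteins) and ignores GO information entirely. The function name `triangle_centrality` and the protein labels are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def triangle_centrality(edges):
    """Score each node by the number of triangles (3-node motifs) it belongs to.

    Illustrative stand-in for a motif-based centrality; this is NOT the MCGO
    algorithm, whose details are not given in the abstract.
    """
    # Build an undirected adjacency map, skipping self-interactions.
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    scores = {node: 0 for node in adjacency}
    for node, neighbours in adjacency.items():
        # A triangle through `node` exists whenever two of its neighbours
        # also interact with each other.
        for a, b in combinations(sorted(neighbours), 2):
            if b in adjacency[a]:
                scores[node] += 1
    return scores


# Hypothetical toy PPI network: two triangles sharing protein P3.
toy_edges = [
    ("P1", "P2"), ("P2", "P3"), ("P1", "P3"),
    ("P3", "P4"), ("P4", "P5"), ("P3", "P5"),
]
toy_scores = triangle_centrality(toy_edges)
```

In this toy network, P3 sits in both triangles and therefore receives the highest score. A real analysis would likely use richer motifs and weight them by annotation similarity.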

    Integrative Computational Genomics Based Approaches to Uncover the Tissue-Specific Regulatory Networks in Development and Disease

    Indiana University-Purdue University Indianapolis (IUPUI). Regulatory protein families such as transcription factors (TFs) and RNA-binding proteins (RBPs) are increasingly being appreciated for their role in regulating their respective genomic and transcriptomic target elements, resulting in dynamic transcriptional regulatory networks (TRNs) and post-transcriptional regulatory networks (PTRNs) in higher eukaryotes. A mechanistic understanding of these two regulatory network types requires a high-resolution, tissue-specific functional annotation of both the proteins and their target sites. This dissertation addresses the need to uncover the tissue-specific regulatory networks in development and disease. This work establishes multiple computational genomics based approaches to further enhance our understanding of regulatory circuits and decipher the associated mechanisms at several layers of biological processes. This study potentially contributes to the research community by providing valuable resources, including novel methods, web interfaces and software, which transform our ability to build high-quality regulatory binding maps of RBPs and TFs in a tissue-specific manner using multi-omics datasets. The study deciphered the broad spectrum of temporal and evolutionary dynamics of the transcriptome and its regulation at the transcriptional and post-transcriptional levels. It also advances our ability to functionally annotate hundreds of RBPs and their RNA binding sites across tissues in the human genome, which helps in decoding the role of RBPs in the context of disease phenotypes, networks, and pathways. The approaches developed in this dissertation are scalable and adaptable for further investigation of tissue-specific regulators in any biological system. Overall, this study contributes towards accelerating progress in molecular diagnostics and drug target identification using regulatory network analysis methods in disease and pathophysiology.

    Post-translational processing targets functionally diverse proteins in Mycoplasma hyopneumoniae

    © 2016 The Authors. Mycoplasma hyopneumoniae is a genome-reduced, cell wall-less, bacterial pathogen with a predicted coding capacity of less than 700 proteins and is one of the smallest self-replicating pathogens. The cell surface of M. hyopneumoniae is extensively modified by processing events that target the P97 and P102 adhesin families. Here, we present analyses of the proteome of M. hyopneumoniae-type strain J using protein-centric approaches (one- and two-dimensional GeLC-MS/MS) that enabled us to focus on global processing events in this species. While these approaches only identified 52% of the predicted proteome (347 proteins), our analyses identified 35 surface-associated proteins with widely divergent functions that were targets of unusual endoproteolytic processing events, including cell adhesins, lipoproteins and proteins with canonical functions in the cytosol that moonlight on the cell surface. Affinity chromatography assays that separately used heparin, fibronectin, actin and host epithelial cell surface proteins as bait recovered cleavage products derived from these processed proteins, suggesting these fragments interact directly with the bait proteins and display previously unrecognized adhesive functions. We hypothesize that protein processing is underestimated as a post-translational modification in genome-reduced bacteria and prokaryotes more broadly, and represents an important mechanism for creating cell surface protein diversity.

    Comparative analysis of plant genomes through data integration

    When we started our research in 2008, several online resources for genomics existed, each with a different focus. TAIR (The Arabidopsis Information Resource) focuses on the plant model species Arabidopsis thaliana, with (at that time) little or no support for evolutionary or comparative genomics. Ensembl provided some basic tools and functioned as a data warehouse, but it would only start incorporating plant genomes in 2010. At that time, however, there was no online resource that provided the data content and tools for plant comparative and evolutionary genomics that we required. As such, the plant community was missing an essential component needed to bring its research to the same level as that of the biomedicine-oriented research communities. We started to work on PLAZA in order to provide such a data resource, one that could be accessed by the plant community and that also contained the data content needed to support our research group's focus on evolutionary genomics. The platform for comparative and evolutionary genomics, which we named PLAZA, was developed from scratch (i.e. not based on an existing database scheme, such as that of Ensembl). Gathering the data for all species, parsing this data into a common format and then uploading it into the database was the next step. We developed a processing pipeline, based on sequence similarity measurements, to group genes into gene families and subfamilies. Functional annotation was gathered both from the original data providers and through InterPro scans, combined with InterPro2GO. This primary data was then ready to be used in every subsequent analysis. Building such a database was good enough for research within our bioinformatics group, but the goal was to provide a comprehensive resource for all plant biologists with an interest in comparative and evolutionary genomics. Designing and creating a user-friendly, visually appealing web interface, connected to our database, was the next step. 
While the most detailed information is commonly presented in data tables, aesthetically pleasing graphics, images and charts are often used to visualize trends and general statistics, and are also used in specific tools. Design and development of these tools and visualizations is thus one of the core elements of my PhD. The PLAZA platform was designed as a gene-centric data resource, which is easily navigated when a biologist wants to study a relatively small number of genes. However, using the default PLAZA website to retrieve information for dozens of genes quickly becomes very tedious. Therefore a 'gene set'-centric extra layer was developed in which user-defined gene sets could be quickly analyzed. This extra layer, called the PLAZA workbench, functions on top of the normal PLAZA website, meaning that only gene sets from species present within the PLAZA database can be directly analyzed. The PLAZA resource for comparative and evolutionary genomics was a major success, but it still had several issues. We tried to solve at least two of these problems at the same time by creating a new platform. The first issue was the building procedure of PLAZA: adding a single species, or updating the structural annotation of an existing one, requires the total re-computation of the database content. The second issue was the restrictiveness of the PLAZA workbench: through a mapping procedure, gene sets could be entered for species not present in the PLAZA database, but for species without a phylogenetically close relative this approach did not always yield satisfactory results. Furthermore, the research in question might focus precisely on the difference between a species present in PLAZA and a close relative not present in PLAZA (e.g. to study adaptation to a different ecological niche). In such a case, the mapping procedure is in itself useless. With the advent of NGS transcriptome data sets for a growing number of species, it was clear that a new challenge had presented itself. 
We designed and developed a new platform, named TRAPID, which could automatically process entire transcriptome data sets, using a reference database. The target goal was to have the processing done quickly with the results containing both gene family oriented data (such as multiple sequence alignments and phylogenetic trees) and functional characterization of the transcripts. Major efforts went into designing the processing pipeline so it could be reliable, fast and accurate

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Genome-wide Transcriptional Characterization of the ETV6-RUNX1-positive Childhood Leukemia

    Acute lymphoblastic leukemia (ALL) is the most common childhood cancer. It typically arises in early B-lineage cells and is characterized by a few specific initiating genomic alterations. One of the most common alterations is the translocation resulting in the ETV6-RUNX1 (E/R) fusion gene. Progression to overt ALL requires additional genetic abnormalities, which are recurrently found at genes essential for determining B-cell lineage identity. Besides DNA damage, alterations in various RNA species and proteins can also have marked unwanted effects on cell behavior. E/R functions as an aberrant transcription factor, but its direct target genes have thus far remained uncertain. We set out to study genome-wide gene regulation in childhood precursor B-ALL (preB-ALL) by studying nascent RNA transcription in cell lines and patient samples. To examine the target sites, we generated a cell line model with an inducible E/R. We detected enhancer regions by the expression of enhancer RNA (eRNA) transcripts and inferred their possible target genes by correlating changes in expression level. Two thirds of the E/R-regulated genes were repressed by direct regulation via RUNX1 DNA binding. We further showed E/R-mediated downregulation of B-cell specific super-enhancers. Some of the regulated genes were observed to be differentially expressed among E/R patients when compared to other preB-ALL patients. RAG and AID are enzymes that have been linked to the genesis of secondary genetic alterations in B-cell leukemia. 
We explored the nascent RNA transcription across B-lymphoid cells at the genomic sites that are often deleted in childhood precursor B-ALL and noticed significant association with specific transcriptional features, namely RNA polymerase II stalling and convergent transcription. These features seem to expose the DNA to double strand breaks especially by revealing RAG recombination signal sequences. We noticed high RAG1 expression in the E/R subtype, and abnormal expression of AICDA among the non-classified precursor B-ALL cases. This thesis identifies genome-wide targets of the E/R fusion and specific transcriptional features that are associated with recurrent DNA breakpoint sites in childhood precursor B-ALL

    Candidate-Based Approaches to Identify Genetic Variation Influencing Type 2 Diabetes and Quantitative Traits

    Type 2 diabetes (T2D) is a metabolic disorder characterized by insulin resistance and impaired insulin secretion that affects more than 20 million Americans, although the genetic component of the disorder is largely unknown. Individual genetic susceptibility to type 2 diabetes and other complex traits results both from variation that is common in human populations and from rare mutations, whether de novo or inherited. We adopted a diverse set of genetics, genomics and informatics approaches to prioritize candidate genomic regions and variants and perform in-depth, targeted analysis of their contributions to type 2 diabetes susceptibility and related trait variability. Our initial efforts focused on the selection of candidate genes relevant to a complex trait by developing a metric to weight the relevance of functional gene annotations to the known biology of a trait. We used this method to select candidate genes for type 2 diabetes and performed a T2D case-control and quantitative trait association study in 2,335 Finnish individuals from the FUSION study. After follow-up in additional samples, we identified several variants that might contribute to T2D susceptibility. Genomic regions associated with plasma levels of HDL cholesterol and triglycerides were re-sequenced in individuals with trait-extreme values. Our analysis revealed a denser set of common and rare functional target variants including several non-synonymous, 3' UTR, and non-coding SNPs and indels. Finally, we utilized two approaches to identify candidate functional non-coding variants that may directly contribute to trait susceptibility. First, we used Formaldehyde-assisted isolation of regulatory elements (FAIRE) coupled with high-throughput sequencing to identify nucleosome-depleted regions in pancreatic islets. We used islet FAIRE-seq data to identify SNPs associated with T2D that potentially alter islet transcriptional regulation. 
A SNP in TCF7L2, rs7903146, was located in a FAIRE-seq site and demonstrated allelic differences in islet chromatin openness and enhancer activity, suggesting that it may contribute functionally to T2D susceptibility. Second, we used transcription factor binding site motifs to computationally predict variants that have allelic differences in regulatory activity. Taken together, these results suggest that identifying candidate genomic regions can successfully enrich for variation important for type 2 diabetes and other complex traits
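
The abstract above mentions using transcription factor binding site motifs to predict variants with allelic differences in regulatory activity, but it does not give the scoring scheme. The sketch below assumes the common approach of scoring both alleles against a position weight matrix with log-odds scores and taking the difference. The matrix values, the uniform background, the example sequences and the function names are all hypothetical illustrations, not data or methods from the study.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif (illustrative values only).
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base composition


def log_odds(site):
    """Log2-odds score of one motif-length site against the background."""
    return sum(math.log2(PFM[i][base] / BACKGROUND) for i, base in enumerate(site))


def best_window_score(seq):
    """Best score over all motif-length windows in seq."""
    width = len(PFM)
    return max(log_odds(seq[i:i + width]) for i in range(len(seq) - width + 1))


def allelic_difference(flank5, ref, alt, flank3):
    """Score(alt) minus score(ref), using only windows that overlap the SNP."""
    width = len(PFM)
    left = flank5[-(width - 1):]   # keep at most width-1 bases on each side
    right = flank3[:width - 1]
    return best_window_score(left + alt + right) - best_window_score(left + ref + right)


# Hypothetical SNP: C>T at the third position of a perfect motif match (AGCT).
delta = allelic_difference("AG", "C", "T", "TA")
```

A negative `delta` indicates that the alternate allele weakens the best predicted binding site. In practice the matrix would come from a curated motif collection, and the difference would be compared against a significance threshold.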

    INVESTIGATING THE ROLE OF CELLULAR AUTOPHAGY IN HUMAN MONOCYTIC CELL DEATH BY KINOME ANALYSIS

    Cells of the monocyte/macrophage (M/M) lineage are key players in innate and adaptive immunity. They eliminate pathogens through their phagocytic and antimicrobial properties, secretion of inflammatory and immunoregulatory cytokines, as well as their capacity to present foreign antigens to T lymphocytes in lymphoid tissues. The importance of M/Ms in the immune response requires them to undergo strict regulation, which occurs, at least in part, through the control of monocytic cell survival. Autophagy is a ubiquitous cellular process by which cells degrade intracellular, cytoplasmic components via a network of interconnected vacuoles to carry out a variety of functions. Autophagy typically functions in maintaining cellular homeostasis and mitigating stresses. However, more recent studies have shown that autophagy may play a role in cell death. Our laboratory has previously found that the cytokine IFNγ can induce cell death in human monocytes in an autophagy-dependent manner. Conversely, IL-10 inhibits both spontaneous and IFNγ-induced cell death, a capacity that, ironically, is also associated with the induction of autophagy. We are thus interested in understanding how the autophagy pathway can play a dual role in human monocyte survival. Interestingly, a novel autophagy-inducing antimicrobial peptide (Atg peptide) was recently shown to be capable of inducing an autophagy-dependent form of cell death. In this study, I established an in vitro model for Atg peptide-induced cell death in human monocytic cells. Subsequently, I designed and tested an autophagy- and cell death pathway-centric kinome microarray in order to begin elucidating the molecular mechanisms responsible for monocytic cell death. 
Kinome analysis identified several interesting phosphorylation events in response to Atg peptide stimulation, including the tumour suppressor protein p53, which was phosphorylated at S9, an Na+,K+-ATPase ATP1A1 which was phosphorylated at Y10, and the transcription factor 4E-BP1 which was phosphorylated at S65. I confirmed these phosphorylation results in part using Western blotting. Finally, I present several hypotheses for the potential molecular mechanisms involved in Atg peptide-induced autophagic cell death in monocytes revealed by kinome analysis, which provide a basis for further exploration into this extremely interesting area