
    Discovery of Flexible Gap Patterns from Sequences

    The human genome contains abundant motifs bound by particular biomolecules. These motifs are involved in the complex regulatory mechanisms of gene expression. The dominant mechanism behind the intriguing gene expression patterns is known as combinatorial regulation, achieved by multiple cooperating biomolecules binding in a nearby genomic region to provide a specific regulatory behavior. To decipher the complicated combinatorial regulation mechanism at work in cellular processes, there is a pressing need to identify co-binding motifs for these cooperating biomolecules in genomic sequences. The great flexibility of the interaction distance between nearby cooperating biomolecules leads to the presence of flexible gaps between the component motifs of a co-binding motif. Many existing motif discovery methods cannot handle co-binding motifs with flexible gaps. Existing co-binding motif discovery methods are ineffective in dealing with the following problems: (1) co-binding motifs may not appear in a large fraction of the input sequences, (2) the lengths of component motifs are unknown and (3) the maximum range of the flexible gap can be large. As a result, the probabilistic approach is easily trapped in a locally optimal solution. Though the deterministic approach may resolve these problems by allowing a relaxed motif template, it faces the challenges of exploring an enormous pattern space and handling a huge output. This thesis presents an effective and scalable method called DFGP, which stands for “Discovery of Flexible Gap Patterns”, for identifying co-binding motifs in massive datasets. DFGP follows the deterministic approach and uses flexible gap patterns to model co-binding motifs. A flexible gap pattern is composed of a number of boxes with a flexible gap between consecutive boxes, where each box is a consensus pattern representing a component motif. 
To address the computational challenge and the need to effectively process the large output under a relaxed motif template, DFGP incorporates two redundancy reduction methods as well as an effective statistical significance measure for ranking patterns. The first reduction method relies on the proposed concept of representative patterns, which aims at reducing the large set of consensus patterns used as boxes in existing deterministic methods to a much smaller yet informative set. The second method relies on the proposed concept of delegate occurrences, which aims at reducing the redundancy among occurrences of a flexible gap pattern. Extensive experimental results showed that (1) DFGP significantly outperforms existing co-binding motif discovery methods in terms of both the capability of identifying co-binding motifs and the runtime, (2) co-binding motifs found by DFGP in datasets reveal previously unknown biological insights, (3) the two redundancy reduction methods, based on the proposed concepts of representative patterns and delegate occurrences, are indeed effective in significantly reducing the computational burden without sacrificing output quality, (4) the proposed statistical significance measures are robust and useful in ranking patterns and (5) DFGP allows a large maximum distance for the flexible gap between component motifs and is scalable to massive datasets.
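The notion of a flexible gap pattern described in the abstract above (consensus boxes separated by gaps of variable length) can be illustrated with a minimal Python sketch. This is not the DFGP algorithm: it only matches one given pattern against a sequence with a regular expression and does no pattern discovery, redundancy reduction or significance ranking. The function names, the use of IUPAC codes for the boxes, and the example boxes and gap range are all assumptions made for illustration.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def box_to_regex(box):
    """Translate one consensus box (an IUPAC string) into a regex fragment."""
    return "".join(IUPAC[ch] for ch in box.upper())


def compile_flexible_gap_pattern(boxes, gaps):
    """Build a regex for boxes separated by flexible gaps (hypothetical helper).

    boxes: list of consensus strings, e.g. ["TGAC", "CAYG"]
    gaps:  list of (min_len, max_len) tuples, one per pair of consecutive boxes
    """
    if len(gaps) != len(boxes) - 1:
        raise ValueError("need exactly one gap range between consecutive boxes")
    parts = [box_to_regex(boxes[0])]
    for (lo, hi), box in zip(gaps, boxes[1:]):
        parts.append("[ACGT]{%d,%d}" % (lo, hi))  # the flexible gap
        parts.append(box_to_regex(box))
    # The lookahead lets matches that start at different positions overlap.
    return re.compile("(?=(" + "".join(parts) + "))")


def find_occurrences(pattern, sequence):
    """Return (start, matched_text) for each start position that matches.

    Only one match is reported per start position (the one the regex engine
    finds first, preferring longer gaps).
    """
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Arbitrary illustrative boxes, not motifs taken from the thesis.
pattern = compile_flexible_gap_pattern(["TGAC", "CAYG"], [(2, 5)])
print(find_occurrences(pattern, "AATGACGGTCACGTT"))
```

Even in this toy form, widening the gap range or relaxing the boxes quickly increases the number of matching occurrences, which hints at why the abstract stresses the size of the pattern space and of the output.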

    Expanding the repertoire of bacterial (non-)coding RNAs

    The detection of non-protein-coding RNA (ncRNA) genes in bacteria and their diverse regulatory mode of action moved the experimental and bio-computational analysis of ncRNAs into the focus of attention. Regulatory ncRNA transcripts are not translated to proteins but function directly on the RNA level. These typically small RNAs have been found to be involved in diverse processes such as (post-)transcriptional regulation and modification, translation, protein translocation, protein degradation and sequestration. Bacterial ncRNAs either arise from independent primary transcripts or their mature sequence is generated via processing from a precursor. Besides these autonomous transcripts, RNA regulators (e.g. riboswitches and RNA thermometers) also form chimera with protein-coding sequences. These structured regulatory elements are encoded within the messenger RNA and directly regulate the expression of their “host” gene. The quality and completeness of genome annotation is essential for all subsequent analyses. In contrast to protein-coding genes ncRNAs lack clear statistical signals on the sequence level. Thus, sophisticated tools have been developed to automatically identify ncRNA genes. Unfortunately, these tools are not part of generic genome annotation pipelines and therefore computational searches for known ncRNA genes are the starting point of each study. Moreover, prokaryotic genome annotation lacks essential features of protein-coding genes. Many known ncRNAs regulate translation via base-pairing to the 5’ UTR (untranslated region) of mRNA transcripts. Eukaryotic 5’ UTRs have been routinely annotated by sequencing of ESTs (expressed sequence tags) for more than a decade. Only recently, experimental setups have been developed to systematically identify these elements on a genome-wide scale in prokaryotes. 
The first part of this thesis describes three exploratory experimental surveys of transcript organization in pathogenic bacteria. To identify ncRNAs in Pseudomonas aeruginosa we used a combination of an experimental RNomics approach and ncRNA prediction. Besides already known ncRNAs we identified and validated the expression of six novel RNA genes. Global detection of transcripts by next-generation RNA sequencing techniques revealed an unexpectedly complex transcript organization in many bacteria. These ultra-high-throughput methods give us the appealing opportunity to analyze the complete RNA output of any species at once. The development of the differential RNA sequencing (dRNA-seq) approach enabled us to analyze the primary transcriptome of Helicobacter pylori and Xanthomonas campestris. For the first time we generated a comprehensive and precise transcription start site (TSS) map for both species and provide a general framework for the analysis of dRNA-seq data. Focusing on computer-aided analysis, we developed new tools to annotate TSS, detect small protein-coding genes and infer homology of newly detected transcripts. We discovered hundreds of TSS in intergenic regions, upstream of protein-coding genes, within operons and antisense to annotated genes. Analysis of 5’ UTRs (spanning from the TSS to the start codon of the adjacent protein-coding gene) revealed an unexpected size diversity ranging from zero to several hundred nucleotides. We identified and validated the expression of about 60 and about 20 ncRNA candidates in Helicobacter and Xanthomonas, respectively. Among these ncRNA candidates we found several small protein-coding genes that have previously evaded annotation in both species. We showed that the combination of dRNA-seq and computational analysis is a powerful method to examine prokaryotic transcriptomes. Experimental setups are time-consuming and often entail high costs. 
Another limitation of experimental approaches is that genes which are expressed only in specific developmental stages or stress conditions are likely to be missed. Bioinformatic tools offer an alternative that overcomes such constraints. General approaches usually depend on comparative genomic data, using evolutionary signatures to analyze the (non-)coding potential of multiple sequence alignments. In the second part of my thesis we present our major update of the widely used ncRNA gene finder RNAz and introduce RNAcode, an efficient tool to assess the local protein-coding potential of genomic regions. RNAz has been successfully used to identify structured RNA elements in all domains of life. However, our own experience and user feedback not only demonstrated the applicability of the RNAz approach, but also helped us to identify limitations of the current implementation. Using a much larger training set and a new classification model we significantly improved the prediction accuracy of RNAz. During transcriptome analysis we repeatedly identified small protein-coding genes that have not been annotated so far. Only a few of those genes are known to date and standard protein-coding gene finding tools suffer from the lack of training data. To avoid an excess of false positive predictions, gene finding software is usually run with an arbitrary cutoff of 40-50 amino acids and therefore misses small protein-coding genes. We have implemented RNAcode, which is optimized for emerging applications not covered by standard protein-coding gene annotation software. In addition to complementing classical protein gene annotation, a major field of application of RNAcode is the functional classification of transcribed regions. RNA sequencing analyses are likely to falsely report transcript fragments (e.g. mRNA degradation products) as non-coding. Hence, an evaluation of the protein-coding potential of these fragments is an essential task. 
RNAcode reports local regions of high coding potential instead of complete protein-coding genes. Training on known protein-coding sequences is not necessary and RNAcode can therefore be applied to any species. We showed this with our analysis of the Escherichia coli genome, where the current annotation could be accurately reproduced. We furthermore identified novel small protein-coding genes with RNAcode in this extensively studied genome. Using transcriptome and proteome data we found compelling evidence that several of the identified candidates are bona fide proteins. In summary, this thesis clearly demonstrates that bioinformatic methods are mandatory to analyze the huge amount of transcriptome data and to identify novel (non-)coding RNA genes. With the major update of RNAz and the implementation of RNAcode we contributed to completing the repertoire of gene finding software, which will help to unearth hidden treasures of the RNA World.
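The length-cutoff problem mentioned in the abstract above can be made concrete with a naive open reading frame (ORF) scan in Python. This is only a sketch under stated assumptions and is not how RNAcode works: RNAcode evaluates evolutionary signatures in sequence alignments, whereas the hypothetical `find_orfs` below simply looks for ATG-to-stop stretches on the forward strand of a single sequence. The function name, its parameters and the toy sequence are all illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_aa=0):
    """Return (start, end, n_aa) for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon; `n_aa` counts the
    codons before the stop. ORFs shorter than `min_aa` are discarded.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3
                if n_aa >= min_aa:
                    orfs.append((start, i + 3, n_aa))
                start = None
    return sorted(orfs)


# Toy sequence with one 11-codon ORF (made up for illustration).
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
print(find_orfs(toy))             # the short ORF is reported
print(find_orfs(toy, min_aa=40))  # a 40-amino-acid cutoff hides it
```

Without a cutoff such a scan reports many spurious short ORFs in real genomes, which is why a length threshold is commonly applied and why small genes are then missed.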

    Investigating Hfq-mRNA Interactions in Bacteria

    Regulatory RNAs (sRNAs) are essential for bacteria to thrive in diverse environments and they also play a key role in virulence [11]. Trans-sRNAs affect the stability and/or translation of their target mRNAs through complementary base-pairing. The base-pairing interaction is not perfect and requires the action of an RNA-binding protein, Hfq. Hfq facilitates these RNA-RNA interactions by stabilizing duplex formation, aiding in structural rearrangements, increasing the rate of structural opening, and/or increasing the rate of annealing [18-21]. Hfq has two well-characterized binding surfaces: the proximal surface, which binds AU-rich stretches typical of sRNAs, and the distal surface, which binds (ARN)x motifs typically found in target mRNAs [30, 33, 36]. Studies on Hfq-RNA interactions focused largely on sRNAs until the more recent discovery of an (ARN)x motif within the 5' UTR of target mRNAs [36, 37]. The importance of this motif in facilitating Hfq-mRNA binding and its requirement for regulation of a couple of well-known target mRNAs led us to further characterize the motif in the work described in this thesis. We performed bioinformatic and in vitro analyses to investigate the prevalence, location, structural contexts, and Hfq binding of (ARN)x motifs in known target mRNAs. We found that the known targets contain single-stranded (ARN)x sequences in their 5' UTRs that bind to Hfq. Two predominant structural contexts of the single-stranded (ARN)x motifs became clear: they were either flanked by stem-loop structures or located within a loop of an internal bulge, multi-branch junction or hairpin. The key features of the motifs were then used as a bioinformatic tool on a genome-wide scale to identify mRNAs that might bind to Hfq. We found that 21% of mRNAs have a suitable (ARN)x motif and therefore likely bind to Hfq. 
Messages that bind to Hfq may be novel sRNA targets, so we investigated this possibility using an in vivo reporter assay and found that 63% of the mRNAs tested are regulated by a specific sRNA. The novel targets are involved in pathways including iron salvage, biofilm formation, and amino acid metabolism. Overall, we defined key features of (ARN)x motifs and were able to use them to predict novel target mRNAs in E. coli. This approach is efficient, effective and adaptable to other bacterial species.
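A sequence-level scan for (ARN)x motifs, as described in the abstract above, might look like the following Python sketch. It covers only the sequence part of the problem: A, then a purine (R = A or G), then any nucleotide (N), repeated x times. It does not check the single-stranded structural contexts that the thesis identifies as key features, and the minimum repeat count of 3 is an assumption chosen for illustration, as are the function name and the example sequence.

```python
import re


def find_arn_motifs(utr, min_repeats=3):
    """Return (start, end, repeats) for each (ARN)x run in a 5' UTR sequence.

    DNA input is accepted and converted to RNA (T -> U). `min_repeats`
    is a hypothetical threshold, not a value taken from the thesis.
    """
    rna = utr.upper().replace("T", "U")
    pattern = re.compile(r"(?:A[AG][ACGU]){%d,}" % min_repeats)
    return [
        (m.start(), m.end(), (m.end() - m.start()) // 3)
        for m in pattern.finditer(rna)
    ]


# Made-up sequence containing four consecutive ARN triplets.
print(find_arn_motifs("GGCAAUAGCAACAGGCC"))
```

A fuller version would combine such hits with a secondary structure prediction so that only motifs in unpaired regions are kept.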

    Regulatory modules discovery and mesenchymal stem cells characterization from high-throughput cancer genomics data

    2013/2014. Cancer is a disease characterized by extreme molecular complexity. Omics approaches, which collect genome-wide, transcript and protein data in public databases, attempt to overcome this complexity and find the functional modules that perform the functions involved in tumour-related processes. For instance, gene expression profiles of cancer tissues are widely used to define gene signatures and test their clinical relevance. I used this kind of information to characterise genes of interest in breast cancer models. On the other hand, datasets from cellular models can provide data that make it possible to focus on specific molecular mechanisms and probe the effects of molecules in a specific cancer model. One of the most recent omics projects is FANTOM5, which has generated a unique resource: the first single-molecule sequencing-based expression atlas in mammalian systems. Cap analysis of gene expression (CAGE) was used to measure transcription start sites (TSS) and promoter usage across a wide collection of human samples, thereby identifying and measuring levels of the majority of coding and non-coding transcripts in the human genome. I used this information to characterize mesenchymal/stromal stem cell (MSC) lines derived from high-grade serous ovarian cancer (HG-SOC-MSCs) or from normal tissue (N-MSCs) included in the FANTOM5 human dataset. I highlighted shared functional programs between HG-SOC-MSCs and N-MSCs, suggesting that the global differences between the two cell lines are based on quantitative levels of transcriptional output rather than on qualitative differences. 
The results suggested that HG-SOC-MSCs are close relatives of mesothelial cells and smooth muscle cells. Furthermore, I analysed the entire dataset using ScanAll, a newly developed software tool, to predict ab initio the presence of enriched elements in the genomic regions surrounding FANTOM5 promoters. I pinpointed regulatory modules, i.e. groups of enriched motifs co-occurring in co-expressed regions within a fixed distance. These modules are enriched in the co-expressed sequences in each sample with respect to randomly generated sequences. Finally, I created a compendium of putative expressed and directly interacting transcription factors. XXVII Ciclo.

    Investigation into the effector repertoire of the H2 type VI secretion system of Pseudomonas aeruginosa

    Pseudomonas aeruginosa is an opportunistic pathogen, causing both acute and chronic infections. This bacterium displays remarkable adaptability and potential for virulence, partly due to its arsenal of protein secretion systems. The type VI secretion system (T6SS) is a contractile injection apparatus, firing a spear-like structure into target cells to deliver its cargo of effector proteins. P. aeruginosa encodes three such systems, denoted H1-, H2- and H3-T6SS. This dissertation discloses work focused on progressing our understanding of the H2-T6SS in this pathogen. We reveal that the H2-T6SS is controlled by the Gac/Rsm pathway, a major regulatory network in this pathogen responsible for the lifestyle switch between motile and sessile bacteria. Quorum sensing, the sophisticated signalling network governing social behaviour, is responsible for the expression of this secretion system in a growth-phase dependent manner, while temperature also has an input in a strain-dependent fashion. We advance our understanding of the composition of the H2-T6SS nanomachine, identifying multiple components of the spear-like delivery device, comprising an Hcp tube capped with a spike structure composed of three VgrGs and one PAAR protein. Importantly, we begin to decipher the payload of this secretion system, describing several phospholipase family effectors which confer a significant advantage to P. aeruginosa during bacterial competition. Building upon this, we propose a hierarchy of effector delivery determined by the VgrG/PAAR composition of the spike. Finally, we characterise a specific H2-T6SS effector: the C-terminal extension of the VgrG2b spike protein. Although we initially investigate its reported role within eukaryotic cells, we determine that this metallopeptidase-like effector is part of a wider antibacterial T6SS toxin family. We describe its cognate periplasmic immunity determinant and progress the elucidation of the target of the effector. 
Overall, we advance our understanding of the H2-T6SS of P. aeruginosa in terms of its regulation, organisation and cargo. Open Access

    Distributed pattern mining and data publication in life sciences using big data technologies


    Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering

    The living cell exhibits emergence of complex behavior and its modeling requires a systemic, integrative approach if we are to thoroughly understand and harness it. The work in this thesis has had the narrower aim of quantitatively characterizing and mapping the proteome using data-driven methods, as proteins perform most functional and structural roles within the cell. Covered are the different parts of the cycle from improving quantification methods, to deriving protein features relying on their primary structure, predicting the protein content solely from sequence data, and, finally, to developing theoretical protein engineering tools, leading back to experiment. High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined towards understanding how the phenotype arises from genotype and the interplay between the various properties of the constituent proteins. However, these large and dense data present an increased analysis challenge and current methods capture only a small fraction of the signal. The first part of my work has involved tackling these issues with the implementation of a GPU-accelerated and distributed signal decomposition pipeline, making factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of acquired signal, enabling high-precision quantification and further analytical tasks. Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network model to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure. 
Two studies taking inspiration from natural language processing have sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, furthermore, in combination yielded an improved predictor of protein amount. Finally, a mathematical framework relying on the embedded space of a deep model has been constructed to assist guided mutation of proteins towards optimizing their abundance.