Data integration strategies for informing computational design in synthetic biology
PhD Thesis
The potential design space for biological systems is complex, vast and multidimensional. Therefore, effective large-scale synthetic biology requires computational design and simulation. By constraining this design space, the time- and cost-efficient design of biological systems can be facilitated. One way in which a tractable design space can be achieved is to use the extensive and growing amount of biological data available to inform the design process. By using existing knowledge, design efforts can be focused on biologically plausible areas of design space. However, biological data is large, incomplete, heterogeneous, and noisy. Data must be integrated in a systematic fashion in order to maximise its benefit. To date, data integration has not been widely applied to design in synthetic biology. The aim of this project is to apply data integration techniques to facilitate the efficient design of novel biological systems. The specific focus is on the development and application of integration techniques for the design of genetic regulatory networks in the model bacterium Bacillus subtilis.
A dataset was constructed by integrating data from a range of sources in order to capture existing knowledge about B. subtilis 168. The dataset is represented as a computationally-accessible, semantically-rich network which includes information concerning biological entities and their relationships. Also included are sequence-based features mined from the B. subtilis genome, which are a useful source of parts for synthetic biology. In addition, information about the interactions of these parts has been captured, in order to facilitate the construction of circuits with desired behaviours.
This dataset was also modelled in the form of an ontology, providing a formal specification of parts and their interactions. The ontology is a major step towards the unification of the data required for modelling with a range of part catalogues specifically designed for synthetic biology. The data from the ontology is available to existing reasoners for implicit knowledge extraction. The ontology was applied to the automated identification of promoters, operators and coding sequences. Information from the ontology was also used to generate dynamic models of parts.
The work described here contributed to the development of a formalism called Standard Virtual Parts (SVPs), which aims to represent models of biological parts in a standardised manner. SVPs comprise a mapping between biological parts and modular computational models. A genetic circuit designed at a part-level abstraction can be investigated in detail by analysing a circuit model composed of SVPs. The ontology was used to construct SVPs in the form of standard Systems Biology Markup Language models. These models are publicly available from a computationally-accessible repository, and include metadata which facilitates the computational composition of SVPs in order to create models of larger biological systems.
To test a genetic circuit in vitro or in vivo, the genetic elements necessary to encode the entities in the in silico model, and their associated behaviour, must be derived. Ultimately, this process results in the specification for a synthesisable DNA sequence. For large models, particularly those that are produced computationally, the transformation process is challenging. To automate this process, a model-to-sequence conversion algorithm was developed. The algorithm was implemented as a Java application called MoSeC. Using MoSeC, both CellML and SBML models built with SVPs can be converted into DNA sequences ready to synthesise.
Selection of the host bacterial cell for a synthetic genetic circuit is very important. In order not to interfere with the existing cellular machinery, orthogonal parts from other species are used since these parts are less likely to have undesired interactions with the host. In order to find orthogonal transcription factors (OTFs), and their target binding sequences, a subset of the data from the integrated B. subtilis dataset was used. B. subtilis gene regulatory networks were used to re-construct regulatory networks in closely related Bacillus species. The system, called BacillusRegNet, stores both experimental data for B. subtilis and homology predictions in other species. BacillusRegNet was mined to extract OTFs and their binding sequences, in order to facilitate the engineering of novel regulatory networks in other Bacillus species. Although the techniques presented here were demonstrated using B. subtilis, they can be applied to any other organism. The approaches and tools developed as part of this project demonstrate the utility of this novel integrated approach to synthetic biology.
EPSRC:
NSF:
The Newcastle University School of Computing Science
Predikin and PredikinDB: a computational framework for the prediction of protein kinase peptide specificity and an associated database of phosphorylation sites
Background: We have previously described an approach to predicting the substrate specificity of serine-threonine protein kinases. The method, named Predikin, identifies key conserved substrate-determining residues in the kinase catalytic domain that contact the substrate in the region of the phosphorylation site and so determine the sequence surrounding the phosphorylation site. Predikin was implemented originally as a web application written in JavaScript
Functional Exploration of Antisense Long Non-Coding RNAs Containing Transposable Elements: A Bioinformatics Approach
Long non-coding RNAs (lncRNAs) show a wide range of regulatory functions at the transcriptional and post-transcriptional levels, both in the nucleus and in the cytoplasm. Recently, antisense lncRNAs (ASlncRNAs) were reported to up-regulate protein synthesis post-transcriptionally through a mechanism depending on an embedded inverted SINE B2 and a 5' overlap with the target mRNAs. Such ASlncRNAs are also referred to as SINEUPs. Synthetic SINEUPs with identical modular organization were also demonstrated to exert the same activity, suggesting a functional relationship between SINE repetitive elements and ASlncRNAs. In order to gain a broader insight into the contribution of transposable elements (TEs) to the sequence composition of ASlncRNAs, I have developed a bioinformatic pipeline that can identify and characterize transcripts containing TEs and analyze TE coverage for different classes of coding/non-coding sense/antisense (S/AS) pairs. I aimed to determine whether the functional activity of SINEUPs could be a widespread phenomenon across multiple similar natural ASlncRNAs in the transcriptomes of extensively studied model organisms that have a well-annotated catalog of lncRNAs. From my initial analysis, I identified human and mouse as the two species that showed a significant coverage enrichment of SINE repeats among ASlncRNAs. I further performed several functional enrichment analyses for the sense coding genes overlapping ASlncRNAs, taking into consideration different characteristics of the 5' binding domain and the 3' embedded SINE repetitive elements. This permitted me to identify the effect of these modular features on the functional associations of sense coding genes. The results of the analysis showed that the products of coding genes associated with ASlncRNAs containing SINEs are significantly enriched for mitochondrial localization.
Further, to determine if these ASlncRNAs could exert SINEUP-like activity during stress, I analyzed data from a published custom microarray study of the polysome fractions of MRC5 cell lysates in control and oxidative stress conditions. The results revealed that the ASlncRNAs carrying inverted or direct SINE repeats and their corresponding sense coding genes do not show any significant differential polysome loading in stress with respect to normal conditions, which is not a desired characteristic of a potential SINEUP. However, ASlncRNAs with inverted and direct SINE repeats corresponding to high translating polysome fractions showed a significantly higher ratio of means for RNA levels in stress over control, in contrast to noASlncRNA. This suggests that the ASlncRNAs containing SINE elements are the key RNA molecules that are active during stress, although further experimentation is needed to determine whether they are also involved in the increased polysome loading of their respective sense coding mRNAs. Altogether, the work presented in this thesis provides a novel bioinformatics approach to study transcriptome-wide ASlncRNAs containing TEs and their functional association with the sense coding genes, and to discover new significant functional features of ASlncRNAs to be biologically validated.
A Novel Method to Detect Functional Subgraphs in Biomolecular Networks
Several biomolecular pathways governing the control of cellular processes have been discovered over the last several years. Additionally, advances resulting from combining these pathways into networks have produced new insights into the complex behaviors observed in cell function assays. Unfortunately, identification of important subnetworks, or “motifs”, in these networks has been slower to develop. This study focused on identifying important network motifs and their rate of occurrence in two different biomolecular networks. The two networks evaluated for this study represented both ends of the spectrum of interaction knowledge by comparing a well-defined network (apoptosis) with a poorly studied network that was early in development (autism). This study identified several motifs that could be important in governing and controlling cellular processes in healthy and diseased cells. Additionally, this study revealed an inverse relationship when comparing the occurrence rate of these motifs in apoptosis and autism.
Linking genes to literature: text mining, information extraction, and retrieval applications for biology
Efficient access to information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of the results. The biological literature also constitutes the main information source for manual literature curation used by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit existing regularities in natural language to extract biologically relevant information from electronic texts automatically. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for life sciences in terms of the following: the type of biological information demands being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward an increasing diversification in terms of application types and techniques, together with integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available on the internet
Analysis, Visualization, and Machine Learning of Epigenomic Data
The goal of the Encyclopedia of DNA Elements (ENCODE) project has been to characterize all the functional elements of the human genome. These elements include expressed transcripts and genomic regions bound by transcription factors (TFs), occupied by nucleosomes, occupied by nucleosomes with modified histones, or hypersensitive to DNase I cleavage. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is an experimental technique for detecting TF binding in living cells, and the genomic regions bound by TFs are called ChIP-seq peaks. ENCODE has performed and compiled results from tens of thousands of experiments, including ChIP-seq, DNase, RNA-seq and Hi-C.
These efforts have culminated in two web-based resources from our lab—Factorbook and SCREEN—for the exploration of epigenomic data for both human and mouse. Factorbook is a peak-centric resource presenting data such as motif enrichment and histone modification profiles for transcription factor binding sites computed from ENCODE ChIP-seq data. SCREEN provides an encyclopedia of ~2 million regulatory elements, including promoters and enhancers, identified using ENCODE ChIP-seq and DNase data, with an extensive UI for searching and visualization.
While we have successfully utilized the thousands of available ENCODE ChIP-seq experiments to build the Encyclopedia and visualizers, we have also struggled with the practical and theoretical inability to assay every possible experiment on every possible biosample under every conceivable biological scenario. We have used machine learning techniques to predict TF binding sites and enhancer locations, and demonstrate that machine learning is critical to help decipher functional regions of the genome.