4,351 research outputs found

    Fixed-Parameter Algorithms For Protein Similarity Search Under mRNA Structure Constraints

    In the context of protein engineering, we consider the problem of computing an mRNA sequence of maximal codon-wise similarity to a given mRNA (and consequently, to a given protein) that additionally satisfies some secondary structure constraints, the so-called mRNA Structure Optimization (MRSO) problem. Since MRSO is known to be APX-hard, Bongartz [10] suggested attacking the problem using the approach of parameterized complexity. In this paper we propose three fixed-parameter algorithms that apply to several interesting parameters of MRSO. We believe these algorithms to be relevant for practical applications today, as well as for possible future applications. Furthermore, our results extend the known tractability borderline of MRSO, and provide new research horizons for further improvements of this sort.

    Learning condition-specific networks

    Condition-specific cellular networks are networks of genes and proteins that describe functional interactions among genes occurring under different environmental conditions. These networks provide a systems-level view of how the parts-list (genes and proteins) interact within the cell as it functions under changing environmental conditions and can provide insight into mechanisms of stress response, cellular differentiation and disease susceptibility. The principal challenge, however, is that cellular networks remain unknown for most conditions and must be inferred from activity levels of genes (mRNA levels) under different conditions. This dissertation aims to develop computational approaches for inferring, analyzing and validating cellular networks of genes from expression data. This dissertation first describes an unsupervised machine learning framework for inferring cellular networks using expression data from a single condition. Here cellular networks are represented as undirected probabilistic graphical models and are learned using a novel, data-driven algorithm. Then several approaches are described that can learn networks using data from multiple conditions. These approaches apply to cases where the condition may or may not be known and, therefore, must be inferred as part of the learning problem. For the latter, the condition variable is allowed to influence expression of genes at different levels of granularity: from a condition variable per gene to a single condition variable for all genes. Results on simulated data suggest that the algorithm performance depends greatly on the size and number of connected components of the union network of all conditions. These algorithms are also applied to microarray data from two yeast populations, quiescent and non-quiescent, isolated from glucose starved cultures. 
Our results suggest that by sharing information across multiple conditions, better networks can be learned for both conditions, with many more biologically meaningful dependencies, than if networks were learned for these conditions independently. In particular, processes that were shared among both cell populations were involved in response to glucose starvation, whereas the processes specific to individual populations captured characteristics unique to each population. These algorithms were also applied for learning networks across multiple species: yeast (S. cerevisiae) and fly (D. melanogaster). Preliminary analysis suggests that sharing patterns across species is much more complex than across different populations of the same species and basic metabolic processes are shared across the two species. Finally, this dissertation focuses on validation of cellular networks. This validation framework describes scores for measuring how well network learning algorithms capture higher-order dependencies. This framework also introduces a measure for evaluating the entire inferred network structure based on the extent to which similarly functioning genes are close together on the network.

    SAMNet: a network-based approach to integrate multi-dimensional high throughput datasets

    The rapid development of high throughput biotechnologies has led to an onslaught of data describing genetic perturbations and changes in mRNA and protein levels in the cell. Because each assay provides a one-dimensional snapshot of active signaling pathways, it has become desirable to perform multiple assays (e.g. mRNA expression and phospho-proteomics) to measure a single condition. However, as experiments expand to accommodate various cellular conditions, proper analysis and interpretation of these data have become more challenging. Here we introduce a novel approach called SAMNet, for Simultaneous Analysis of Multiple Networks, that is able to interpret diverse assays over multiple perturbations. The algorithm uses a constrained optimization approach to integrate mRNA expression data with upstream genes, selecting edges in the protein–protein interaction network that best explain the changes across all perturbations. The result is a putative set of protein interactions that succinctly summarizes the results from all experiments, highlighting the network elements unique to each perturbation. We evaluated SAMNet in both yeast and human datasets. The yeast dataset measured the cellular response to seven different transition metals, and the human dataset measured cellular changes in four different lung cancer models of Epithelial-Mesenchymal Transition (EMT), a crucial process in tumor metastasis. SAMNet was able to identify canonical yeast metal-processing genes unique to each commodity in the yeast dataset, as well as human genes such as β-catenin and TCF7L2/TCF4 that are required for EMT signaling but escaped detection in the mRNA and phospho-proteomic data. Moreover, SAMNet also highlighted drugs likely to modulate EMT, identifying a series of less canonical genes known to be affected by the BCR-ABL inhibitor imatinib (Gleevec), suggesting a possible influence of this drug on EMT.
    Funding: National Institutes of Health (U.S.) (Grant U54CA112967); National Institutes of Health (U.S.) (Grant R01GN089903); National Science Foundation (U.S.) (Award DB1-0821391); Massachusetts Institute of Technology. Undergraduate Research Opportunities Program

    From Structure Prediction to Genomic Screens for Novel Non-Coding RNAs

    Non-coding RNAs (ncRNAs) are receiving more and more attention not only as an abundant class of genes, but also as regulatory structural elements (some located in mRNAs). A key feature of RNA function is its structure. Computational methods were developed early for folding and prediction of RNA structure with the aim of assisting in functional analysis. With the discovery of more and more ncRNAs, it has become clear that a large fraction of these are highly structured. Interestingly, a large part of the structure is composed of regular Watson-Crick and GU wobble base pairs. This and the increased number of available genomes have made it possible to employ structure-based methods for genomic screens. The field has moved from folding prediction of single sequences to computational screens for ncRNAs in genomic sequence using the RNA structure as the main characteristic feature. Whereas early methods focused on energy-directed folding of single sequences, comparative analysis based on structure-preserving changes of base pairs has been efficient in improving accuracy, and today this constitutes a key component in genomic screens. Here, we cover the basic principles of RNA folding and touch upon some of the concepts in current methods that have been applied in genomic screens for de novo RNA structures in searches for novel ncRNA genes and regulatory RNA structure on mRNAs. We discuss the strengths and weaknesses of the different strategies and how they can complement each other.

    Multiple Sequence Alignment Using External Sources Of Information

    Multiple sequence alignment is an alignment of three or more protein or nucleic acid sequences. Alignment has long been of interest to researchers because many scientific workflows depend on sequence alignments. A high-quality alignment is therefore important. Much work has been done, and is still being carried out, to improve the quality of alignments. Many approaches have been developed for performing pairwise and multiple sequence alignments, yet most of them rely on the sequences to be aligned as their only input. Recently, some approaches have begun to incorporate additional sources of information in the alignment process; this external data can come from user knowledge or online databases. When integrated into the workflow of alignment programs, this data may add new constraints to the produced alignment and improve its quality by making it more biologically meaningful. In this thesis, I introduce new approaches for multiple sequence alignment that use the alignment software DIALIGN along with external information from databases, where useful information is extracted and then integrated in the alignment process. By testing those approaches on benchmark databases, I show that using additional data during alignment produces better results than using DIALIGN alone with no input other than the sequences to be aligned.