
    Motif Discovery through Predictive Modeling of Gene Regulation

    We present MEDUSA, an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting. At each iteration of the algorithm, MEDUSA builds a motif model whose presence in the promoter region of a gene, coupled with activity of a regulator in an experiment, is predictive of differential expression. In this way, we learn motifs that are functional and predictive of regulatory response rather than motifs that are simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model of the transcriptional control logic that can predict the expression of any gene in the organism, given the sequence of the promoter region of the target gene and the expression state of a set of known or putative transcription factors and signaling molecules. Each motif model is either a k-length sequence, a dimer, or a PSSM that is built by agglomerative probabilistic clustering of sequences with similar boosting loss. By applying MEDUSA to a set of environmental stress response expression data in yeast, we learn motifs whose ability to predict differential expression of target genes outperforms motifs from the TRANSFAC dataset and from a previously published candidate set of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed binding sites associated with environmental stress response from the literature. Comment: RECOMB 200

    Removal of AU Bias from Microarray mRNA Expression Data Enhances Computational Identification of Active MicroRNAs

    Elucidation of regulatory roles played by microRNAs (miRs) in various biological networks is one of the greatest challenges of present molecular and computational biology. The integrated analysis of gene expression data and 3′-UTR sequences holds great promise for being an effective means to systematically delineate active miRs in different biological processes. Applying such an integrated analysis, we uncovered a striking relationship between 3′-UTR AU content and gene response in numerous microarray datasets. We show that this relationship is secondary to a general bias that links gene response and probe AU content and reflects the fact that in the majority of current arrays probes are selected from target transcript 3′-UTRs. Therefore, removal of this bias, which is in order in any analysis of microarray datasets, is of crucial importance when integrating expression data and 3′-UTR sequences to identify regulatory elements embedded in this region. We developed visualization and normalization schemes for the detection and removal of such AU biases and demonstrate that their application to microarray data significantly enhances the computational identification of active miRs. Our results substantiate that, after removal of AU biases, mRNA expression profiles contain ample information which allows in silico detection of miRs that are active in physiological conditions

    Genome-Wide Survey for Biologically Functional Pseudogenes

    According to current estimates there exist about 20,000 pseudogenes in a mammalian genome. The vast majority of these are disabled and nonfunctional copies of protein-coding genes which, therefore, evolve neutrally. Recent findings that a Makorin1 pseudogene, residing on mouse Chromosome 5, is, indeed, in vivo vital and also evolutionarily preserved, encouraged us to conduct a genome-wide survey for other functional pseudogenes in human, mouse, and chimpanzee. We identify to our knowledge the first examples of conserved pseudogenes common to human and mouse, originating from one duplication predating the human–mouse species split and having evolved as pseudogenes since the species split. Functionality is one possible way to explain the apparently contradictory properties of such pseudogene pairs, i.e., high conservation and ancient origin. The hypothesis of functionality is tested by comparing expression evidence and synteny of the candidates with proper test sets. The tests suggest potential biological function. Our candidate set includes a small set of long-lived pseudogenes whose unknown potential function is retained since before the human–mouse species split, and also a larger group of primate-specific ones found from human–chimpanzee searches. Two processed sequences are notable, their conservation since the human–mouse split being as high as most protein-coding genes; one is derived from the protein Ataxin 7-like 3 (ATX7NL3), and one from the Spinocerebellar ataxia type 1 protein (ATX1). Our approach is comparative and can be applied to any pair of species. It is implemented by a semi-automated pipeline based on cross-species BLAST comparisons and maximum-likelihood phylogeny estimations. To separate pseudogenes from protein-coding genes, we use standard methods, utilizing in-frame disablements, as well as a probabilistic filter based on Ka/Ks ratios

    TREMOR—a tool for retrieving transcriptional modules by incorporating motif covariance

    A transcriptional module (TM) is a collection of transcription factors (TF) that as a group, co-regulate multiple, functionally related genes. The task of identifying TMs poses an important biological challenge. Since TFs belong to evolutionarily and structurally related families, TF family members often bind to similar DNA motifs and can confound sequence-based approaches to TM identification. A previous approach to TM detection addresses this issue by pre-selecting a single representative from each TF family. One problem with this approach is that closely related transcription factors can still target sufficiently distinct genes in a biologically meaningful way, and thus, pre-selecting a single family representative may in principle miss certain TMs. Here we report a method—TREMOR (Transcriptional Regulatory Module Retriever). This method uses the Mahalanobis distance to assess the validity of a TM and automatically incorporates the inter-TF binding similarity without resorting to pre-selecting family representatives. The application of TREMOR on human muscle-specific, liver-specific and cell-cycle-related genes reveals TFs and TMs that were validated from literature and also reveals additional related genes
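
The abstract above names the Mahalanobis distance as the way TREMOR accounts for similarity between TF binding motifs. The following is only a minimal sketch of that distance itself, not of TREMOR's scoring procedure, which the abstract does not describe. The two-TF score vectors, the covariance values, and the helper names `solve_linear` and `mahalanobis` are hypothetical and chosen for illustration.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's inputs are left untouched.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def mahalanobis(x, mean, cov):
    """Distance of x from mean, scaled by the covariance structure cov."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Solve cov @ y = diff rather than forming the inverse explicitly.
    y = solve_linear(cov, diff)
    return math.sqrt(sum(d * yi for d, yi in zip(diff, y)))


# Hypothetical case: two TFs whose motif scores are strongly correlated (0.9).
mean = [0.0, 0.0]
cov = [[1.0, 0.9],
       [0.9, 1.0]]

along = mahalanobis([2.0, 2.0], mean, cov)     # scores rise together
against = mahalanobis([2.0, -2.0], mean, cov)  # scores disagree
```

Both test points lie at the same Euclidean distance from the mean, but the point that contradicts the correlation (`against`) is much farther in Mahalanobis terms. This is the sense in which the distance can absorb covariance between related motifs instead of requiring one representative per TF family.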

    Prioritization of gene regulatory interactions from large-scale modules in yeast

    Background: The identification of groups of co-regulated genes and their transcription factors, called transcriptional modules, has been a focus of many studies about biological systems. While methods have been developed to derive numerous modules from genome-wide data, individual links between regulatory proteins and target genes still need experimental verification. In this work, we aim to prioritize regulator-target links within transcriptional modules based on three types of large-scale data sources. Results: Starting with putative transcriptional modules from ChIP-chip data, we first derive modules in which target genes show both expression and function coherence. The most reliable regulatory links between transcription factors and target genes are established by identifying intersection of target genes in coherent modules for each enriched functional category. Using a combination of genome-wide yeast data in normal growth conditions and two different reference datasets, we show that our method predicts regulatory interactions with significantly higher predictive power than ChIP-chip binding data alone. A comparison with results from other studies highlights that our approach provides a reliable and complementary set of regulatory interactions. Based on our results, we can also identify functionally interacting target genes, for instance, a group of co-regulated proteins related to cell wall synthesis. Furthermore, we report novel conserved binding sites of a glycoprotein-encoding gene, CIS3, regulated by Swi6-Swi4 and Ndd1-Fkh2-Mcm1 complexes. Conclusion: We provide a simple method to prioritize individual TF-gene interactions from large-scale transcriptional modules. In comparison with other published works, we predict a complementary set of regulatory interactions which yields a similar or higher prediction accuracy at the expense of sensitivity. Therefore, our method can serve as an alternative approach to prioritization for further experimental studies.

    Incorporating Existing Network Information into Gene Network Inference

    One methodology that has been successful at inferring gene networks from gene expression data is based on ordinary differential equations (ODEs). However, new types of data continue to be produced, so it is worthwhile to investigate how to integrate these new data types into the inference procedure. One such data type is physical interactions between transcription factors and the genes they regulate, as measured by ChIP-chip or ChIP-seq experiments. These interactions can be incorporated into the gene network inference procedure as a priori network information. In this article, we extend the ODE methodology into a general optimization framework that incorporates existing network information in combination with regularization parameters that encourage network sparsity. We provide theoretical results proving convergence of the estimator for our method and show that the corresponding probabilistic interpretation also converges. We demonstrate our method on simulated network data and show that existing network information improves performance, overcomes the lack of observations, and performs well even when some of the existing network information is incorrect. We further apply our method to the core regulatory network of embryonic stem cells, utilizing predicted interactions from two studies as existing network information. We show that including the prior network information yields a more closely representative regulatory network than when no information is provided.

    Validation of computerized diagnostic information in a clinical database from a national equine clinic network

    BACKGROUND: Computerized diagnostic information offers potential for epidemiological research; however, data accuracy must be addressed. The principal aim of this study was to evaluate the completeness and correctness of diagnostic information in a computerized equine clinical database compared to corresponding handwritten veterinary clinical records, used as gold standard, and to assess factors related to correctness. Further, the aim was to investigate completeness (epidemiologic sensitivity), correctness (positive predictive value), specificity and prevalence for diagnoses for four body systems and correctness for affected limb information for four joint diseases. METHODS: A random sample of 450 visits over the year 2002 (n_visits = 49,591) was taken from 18 nationwide clinics operated under one company. Computerized information for the visits selected and copies of the corresponding veterinary clinical records were retrieved. Completeness and correctness were determined using semi-subjective criteria. Logistic regression was used to examine factors associated with correctness for diagnosis. RESULTS: Three hundred and ninety-six visits had veterinary clinical notes that were retrievable. The overall completeness and correctness were 91% and 92%, respectively; both values were considered high. Descriptive analyses showed a significantly higher degree of correctness for first visits compared to follow-up visits and for cases with a diagnostic code recorded in the veterinary records compared to those with no code noted. The correctness was similar regardless of usage category (leisure/sport horse, racing trotter and racing thoroughbred) or gender. For the four body systems selected (joints, skin and hooves, respiratory, skeletal) the completeness varied between 71% (respiration) and 91% (joints) and the correctness ranged from 87% (skin and hooves) to 96% (respiration), whereas the specificity was >95% for all systems. Logistic regression showed that correctness was associated with type of visit, whether an explicit diagnostic code was present in the veterinary clinical record, and body system. Correctness for information on affected limb was 95% and varied with joint. CONCLUSION: Based on the overall high level of correctness and completeness, the database was considered useful for research purposes. For the body systems investigated, the highest level of completeness and correctness was seen for joints and respiration, respectively.
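
The abstract above equates completeness with epidemiologic sensitivity and correctness with positive predictive value, both measured against the paper records as gold standard. A minimal sketch of how these quantities follow from a 2x2 comparison is given here. The function name `validation_metrics` and the counts are hypothetical and are not taken from the study.

```python
def validation_metrics(tp, fp, fn, tn):
    """Agreement measures for a database checked against a gold standard.

    tp: diagnosis present in both the database and the paper record
    fp: diagnosis in the database but not in the paper record
    fn: diagnosis in the paper record but missing from the database
    tn: diagnosis absent from both
    """
    total = tp + fp + fn + tn
    return {
        "completeness": tp / (tp + fn),   # sensitivity
        "correctness": tp / (tp + fp),    # positive predictive value
        "specificity": tn / (tn + fp),
        "prevalence": (tp + fn) / total,  # according to the gold standard
    }


# Hypothetical counts for one body system (illustration only).
metrics = validation_metrics(tp=90, fp=10, fn=30, tn=270)
```

With these made-up counts, completeness is 0.75 and correctness is 0.90. A database can therefore record mostly correct diagnoses while still missing a quarter of those in the paper records, which is why the study reports the two measures separately.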

    Analysis of Combinatorial Regulation: Scaling of Partnerships between Regulators with the Number of Governed Targets

    Through combinatorial regulation, regulators partner with each other to control common targets, and this allows a small number of regulators to govern many targets. One interesting question is how, given this combinatorial regulation, the number of regulators scales with the number of targets. Here, we address this question by building and analyzing co-regulation (co-transcription and co-phosphorylation) networks that describe partnerships between regulators controlling common genes. We carry out analyses across five diverse species, from Escherichia coli to human. These reveal many properties of partnership networks, such as the absence of a classical power-law degree distribution despite the existence of nodes with many partners. We also find that the number of co-regulatory partnerships follows an exponential saturation curve in relation to the number of targets. (For E. coli and Bacillus subtilis, only the beginning linear part of this curve is evident due to arrangement of genes into operons.) To gain intuition into the saturation process, we relate the biological regulation to more commonplace social contexts where a small number of individuals can form an intricate web of connections on the internet. Indeed, we find that the size of partnership networks saturates even as the complexity of their output increases. We also present a variety of models to account for the saturation phenomenon. In particular, we develop a simple analytical model to show how new partnerships are acquired with an increasing number of target genes; with certain assumptions, it reproduces the observed saturation. Then, we build a more general simulation of network growth and find agreement with a wide range of real networks. Finally, we perform various down-sampling calculations on the observed data to illustrate the robustness of our conclusions.
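
The abstract above reports an exponential saturation curve whose early part looks linear. The abstract does not give the fitted equation, so the sketch here assumes the common form P(t) = P_max (1 - exp(-t / s)). The function name, the ceiling `p_max`, and the scale `scale` are hypothetical values chosen for illustration.

```python
import math


def saturating_partnerships(targets, p_max, scale):
    """Assumed saturation form: p_max * (1 - exp(-targets / scale))."""
    return p_max * (1.0 - math.exp(-targets / scale))


# Hypothetical parameters, not fitted to any dataset.
P_MAX = 1000.0   # ceiling on the number of partnerships
SCALE = 5000.0   # number of targets over which the curve bends

few_targets = saturating_partnerships(50, P_MAX, SCALE)
linear_guess = P_MAX * 50 / SCALE          # first-order approximation
many_targets = saturating_partnerships(50000, P_MAX, SCALE)
```

When the number of targets is small relative to the scale, the curve is close to the straight line p_max * t / scale, which matches the remark that only the initial linear part is visible for some species. For large target counts the value approaches the ceiling without exceeding it.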

    Can Molecular Motors Drive Distance Measurements in Injured Neurons?

    Injury to nerve axons induces diverse responses in neuronal cell bodies, some of which are influenced by the distance from the site of injury. This suggests that neurons have the capacity to estimate the distance of the injury site from their cell body. Recent work has shown that the molecular motor dynein transports importin-mediated retrograde signaling complexes from axonal lesion sites to cell bodies, raising the question of whether dynein-based mechanisms enable axonal distance estimation in injured neurons. We used computer simulations to examine mechanisms that may provide nerve cells with dynein-dependent distance assessment capabilities. A multiple-signals model was postulated based on the time delay between the arrival of two or more signals produced at the site of injury: a rapid signal carried by action potentials or similar mechanisms and slower signals carried by dynein. The time delay between the arrivals of these two types of signals should reflect the distance traversed, and simulations of this model show that it can indeed provide a basis for distance measurements in the context of nerve injuries. The analyses indicate that the suggested mechanism can allow nerve cells to discriminate between distances differing by 10% or more of their total axon length, and suggest that dynein-based retrograde signaling in neurons can be utilized for this purpose over different scales of nerves and organisms. Moreover, such a mechanism might also function in synapse-to-nucleus signaling in uninjured neurons. This could potentially allow a neuron to dynamically sense the relative lengths of its processes on an ongoing basis, enabling appropriate metabolic output from cell body to processes.
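
The multiple-signals idea in the abstract above rests on a simple relation: if a fast and a slow signal leave the injury site together, the gap between their arrivals grows in proportion to the distance travelled. The sketch below shows only that deterministic, noise-free relation and is not the authors' simulation. The function names and the two velocities are hypothetical placeholders.

```python
def arrival_delay(distance, v_fast, v_slow):
    """Time between the arrival of the fast signal and the slow signal."""
    return distance / v_slow - distance / v_fast


def distance_from_delay(delay, v_fast, v_slow):
    """Invert the delay relation: d = delay * v_fast * v_slow / (v_fast - v_slow)."""
    if not v_fast > v_slow > 0:
        raise ValueError("expected v_fast > v_slow > 0")
    return delay * v_fast * v_slow / (v_fast - v_slow)


# Hypothetical velocities in mm/s, for illustration only.
V_FAST = 1000.0   # rapid, action-potential-like signal
V_SLOW = 0.001    # slow, motor-driven signal

delay = arrival_delay(10.0, V_FAST, V_SLOW)       # injury 10 mm from the soma
estimate = distance_from_delay(delay, V_FAST, V_SLOW)
```

Because the delay is linear in distance, a 10% change in distance produces a 10% change in delay. Whether a cell can resolve such a difference depends on the variability of the slow transport, which is what the simulations in the paper address.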

    Embedding mRNA Stability in Correlation Analysis of Time-Series Gene Expression Data

    Current methods for the identification of putatively co-regulated genes directly from gene expression time profiles are based on the similarity of the time profile. Such association metrics, despite their central role in gene network inference and machine learning, have largely ignored the impact of dynamics or variation in mRNA stability. Here we introduce a simple, but powerful, new similarity metric called lead-lag R² that successfully accounts for the properties of gene dynamics, including varying mRNA degradation and delays. Using yeast cell-cycle time-series gene expression data, we demonstrate that the predictive power of lead-lag R² for the identification of co-regulated genes is significantly higher than that of standard similarity measures, thus allowing the selection of a large number of entirely new putatively co-regulated genes. Furthermore, the lead-lag metric can also be used to uncover the relationship between gene expression time-series and the dynamics of formation of multiple protein complexes. Remarkably, we found a high lead-lag R² value among genes coding for a transient complex.