Using temporal correlation in factor analysis for reconstructing transcription factor activities
Two-level gene regulatory networks consist of the transcription factors (TFs) in the top level and their regulated genes in the second level. The expression profiles of the regulated genes are the observed high-throughput data given by experiments such as microarrays. The activity profiles of the TFs are treated as hidden variables, as is the connectivity matrix that indicates the regulatory relationships of TFs with their regulated genes. Factor analysis (FA), as well as other methods such as the network component analysis algorithm, has been suggested for reconstructing gene regulatory networks and for predicting TF activities. These methods have been applied to E. coli and yeast data under the assumption that the datasets consist of independently and identically distributed samples. The main drawback of these algorithms is therefore that they ignore any time correlation present within the TF profiles. In this paper, we extend previously studied FA algorithms to include time correlation within the transcription factors. At the same time, we consider connectivity matrices that are sparse, in order to capture the sparsity present in gene regulatory networks. The TF activity profiles obtained by this approach are significantly smoother than profiles from previous FA algorithms. The periodicities in profiles from yeast expression data become prominent in our reconstruction. Moreover, the strength of the correlation between time points is estimated and can be used to assess the suitability of the experimental time interval.
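As a rough illustration of the model structure this abstract describes, the sketch below simulates the assumed generative process: a sparse connectivity matrix mixing hidden TF activities that follow a first-order autoregressive (AR(1)) process. All names and dimensions are hypothetical, and the estimation of the hidden factors, the sparse loadings and the correlation strength, which is what the paper actually performs, is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_tfs, n_times = 50, 3, 20   # illustrative sizes, not from the paper
rho = 0.8                             # assumed strength of time correlation

# Sparse connectivity matrix: each gene is regulated by only a few TFs
mask = rng.random((n_genes, n_tfs)) < 0.2
C = rng.normal(size=(n_genes, n_tfs)) * mask

# Hidden TF activity profiles with AR(1) temporal correlation
F = np.zeros((n_tfs, n_times))
F[:, 0] = rng.normal(size=n_tfs)
for t in range(1, n_times):
    F[:, t] = rho * F[:, t - 1] + np.sqrt(1 - rho**2) * rng.normal(size=n_tfs)

# Observed expression profiles: linear mixing of TF activities plus noise
X = C @ F + 0.1 * rng.normal(size=(n_genes, n_times))
```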
Unexpected correlations between gene expression and codon usage bias from microarray data for the whole Escherichia coli K-12 genome
Escherichia coli has long been regarded as a model organism in the study of codon usage bias (CUB). However, most studies of this topic in this organism have been computational or, when experimental, restricted to small datasets; particularly little attention has been paid to genes with low CUB. In this work, correspondence analysis on codon usage is used to classify E. coli genes into three groups, and the relationship between these groups and expression levels from microarray experiments is studied. The groups are: group 1, highly biased genes; group 2, moderately biased genes; and group 3, AT-rich genes with low CUB. It is shown that, surprisingly, there is a negative correlation between codon bias and expression levels for group 3 genes, i.e. genes with extremely low codon adaptation index (CAI) values are highly expressed, while group 2 genes show the lowest average expression levels and group 1 genes show the usual expected positive correlation between CAI and expression. This trend is maintained over all functional gene groups, seeming to contradict the E. coli–yeast paradigm on CUB. It is argued that these findings are still compatible with the mutation–selection balance hypothesis of codon usage and that E. coli genes form a dynamic system shaped by these factors.
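A minimal sketch of the group-wise analysis described here, assuming the genes have already been assigned to the three CUB groups (the correspondence analysis step is not shown); the function name and inputs are hypothetical.

```python
from scipy.stats import spearmanr

def cai_expression_correlation(cai, expression, groups):
    """Spearman correlation between CAI and expression level, per CUB group.

    cai, expression: per-gene values; groups: per-gene group labels (1, 2, 3).
    """
    results = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rho, p = spearmanr([cai[i] for i in idx], [expression[i] for i in idx])
        results[g] = (rho, p)   # the paper reports a negative rho for group 3
    return results
```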
Investigating inter-chromosomal regulatory relationships through a comprehensive meta-analysis of matched copy number and transcriptomics data sets.
BACKGROUND: Gene regulatory relationships can be inferred using matched array comparative genomics and transcriptomics data sets from cancer samples. Because the copy numbers of genes in cancer samples are often greatly disrupted, such data act as a natural gene amplification/deletion experiment. A large number of these data sets are now publicly available, making a meta-analysis of the data possible. RESULTS: We infer inter-chromosomally acting gene regulatory relationships from a meta-analysis of 31 publicly available matched array comparative genomics and transcriptomics data sets in humans. We obtained statistically significant predictions of target genes for 1430 potential regulatory genes. The regulatory relationships inferred are either direct relationships, of a transcription factor on its target, or indirect ones, through pathways containing intermediate steps. We analyse the predictions in terms of co-citations: both publications that cite a regulator together with any of its inferred targets, and co-citations among any of the genes in a target list. CONCLUSIONS: The most striking observation from the results is the greater number of inter-chromosomal regulatory relationships involving repression compared with those involving activation. The complete results of the meta-analysis are presented in the database METAMATCHED. We anticipate that the predictions contained in the database will be useful in informing experiments and in helping to construct networks of regulatory relationships.
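A sketch of the core inference idea for a single data set, under assumed simplifications: a regulator's copy-number profile across matched samples is correlated with the expression of genes on other chromosomes. The actual meta-analysis combines evidence across 31 data sets with its own significance procedure, which is not reproduced here, and all names are hypothetical.

```python
from scipy import stats

def candidate_targets(copy_number, expression, chromosome, regulator, alpha=1e-6):
    """Genes on other chromosomes whose expression tracks the regulator's
    copy number across matched samples. Negative r suggests repression."""
    cn = copy_number[regulator]            # copy-number profile across samples
    hits = []
    for gene, expr in expression.items():
        if chromosome[gene] == chromosome[regulator]:
            continue                       # inter-chromosomal relationships only
        r, p = stats.pearsonr(cn, expr)
        if p < alpha:
            hits.append((gene, r, p))
    return hits
```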
Solving the riddle of codon usage preferences: a test for translational selection
Translational selection is responsible for the unequal usage of synonymous codons in protein-coding genes in a wide variety of organisms. It is one of the most subtle and pervasive forces of molecular evolution, yet establishing the underlying causes of its idiosyncratic behaviour across living kingdoms has eluded researchers over the past 20 years. In this study, a statistical model for measuring translational selection in any given genome is developed, and the test is applied to 126 fully sequenced genomes, ranging from archaea to eukaryotes. It is shown that tRNA gene redundancy and genome size are interacting forces that ultimately determine the action of translational selection, and that an optimal genome size exists for which this kind of selection is maximal. Accordingly, genome size also presents upper and lower boundaries beyond which selection on codon usage is not possible. We propose a model in which the coevolution of genome size and tRNA genes explains the observed patterns of translational selection in all living organisms. This model finally unifies our understanding of codon usage across prokaryotes and eukaryotes. Helicobacter pylori, Saccharomyces cerevisiae and Homo sapiens are codon usage paradigms that can be better understood under the proposed model.
A comparison of machine learning and Bayesian modelling for molecular serotyping.
BACKGROUND: Streptococcus pneumoniae is a human pathogen that is a major cause of infant mortality. Identifying the pneumococcal serotype is an important step in monitoring the impact of vaccines used to protect against disease. Genomic microarrays provide an effective method for molecular serotyping. Previously we developed an empirical Bayesian model for the classification of serotypes from a molecular serotyping array. With only a few samples available, a model-driven approach was the only option. Since then, several thousand samples have been made available to us, providing an opportunity to investigate serotype classification by machine learning methods, which could complement the Bayesian model. RESULTS: We compare the performance of the original Bayesian model with two machine learning algorithms: Gradient Boosting Machines and Random Forests. We present our results as an example of a generic strategy whereby a preliminary probabilistic model is complemented or replaced by a machine learning classifier once enough data are available. Despite the availability of thousands of serotyping arrays, a problem encountered when applying machine learning methods is the lack of training data containing mixtures of serotypes, owing to the large number of possible combinations; most of the available training data comprise samples with only a single serotype. To overcome this, we implemented an iterative analysis, creating artificial training data of serotype mixtures by combining raw data from single-serotype arrays. CONCLUSIONS: With the enhanced training set, the machine learning algorithms outperform the original Bayesian model. However, for serotypes currently lacking sufficient training data, the best-performing implementation was a combination of the results of the Bayesian model and the Gradient Boosting Machine. As well as being an effective method for classifying biological data, machine learning can serve as an efficient means of revealing subtle biological insights, which we illustrate with an example.
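The construction of artificial mixture training data could look roughly like the sketch below, which combines raw intensity vectors from pairs of single-serotype arrays at random relative abundances. This is an assumed reading of the abstract, not the authors' code, and all names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)

def make_artificial_mixtures(arrays, serotypes, n_mixtures=1000):
    """Combine raw intensity vectors from single-serotype arrays into
    artificial two-serotype training samples."""
    X, y = [], []
    for _ in range(n_mixtures):
        i, j = rng.choice(len(arrays), size=2, replace=False)
        w = rng.uniform(0.1, 0.9)          # relative abundance of serotype i
        X.append(w * arrays[i] + (1 - w) * arrays[j])
        y.append({serotypes[i], serotypes[j]})
    return np.array(X), y

# One simple way to use such data: a binary presence/absence classifier
# per serotype, e.g. for a hypothetical serotype "19F":
# clf = GradientBoostingClassifier().fit(X, ["19F" in labels for labels in y])
```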
Tests for attraction to prey and predator avoidance by chemical cues in spiders of the beech forest floor
Spiders leave draglines, faeces and other secretions behind when travelling through their microhabitat. The presence of these secretions may unintentionally inform other animals, prey as well as predators, about a recent and possibly current predation risk or food availability. For a wolf spider, other spiders, including smaller conspecifics, form a substantial part of the prey, and larger wolf spiders, again including conspecifics, are potential predators. We tested two hypotheses: that large wolf spiders may locate patches of potential spider prey through the presence of silk threads and/or other secretions; and that prey spiders may use secretions from large wolf spiders to avoid patches with high predation risk. We used large (subadult or adult) Pardosa saltans to provide predator cues, and mixed dwarf spiders or small (juvenile) P. saltans to provide prey cues. Subadult wolf spiders were significantly attracted to litter contaminated by dwarf spiders or small conspecifics after 6 hours, but no longer after 24 hours. In contrast, neither dwarf spiders nor small P. saltans showed significant avoidance of substrate contaminated by adult P. saltans. However, small P. saltans showed different activity patterns on the two substrates. The results indicate that wolf spiders are able to increase their foraging efficiency by searching preferentially in patches where intraguild prey are present. The lack of a clear patch-selection response by the prey, despite a modified activity pattern, may be associated with the vertical stratification of the beech litter habitat: the reduced volume of spaces in the deeper layers could make downward, rather than horizontal, movement a fast and safe tactic against a large predator that cannot enter these spaces.
Representing and analysing molecular and cellular function in the computer
Determining the biological function of a myriad of genes, and understanding how they interact to yield a living cell, is the major challenge of the post-genome-sequencing era. The complexity of biological systems is such that this cannot be envisaged without the help of powerful computer systems capable of representing and analysing the intricate networks of physical and functional interactions between the different cellular components. In this review we try to provide the reader with an appreciation of where we stand in this regard. We discuss some of the inherent problems in describing the different facets of biological function, give an overview of how information on function is currently represented in the major biological databases, and describe different systems for organising and categorising the functions of gene products. In the second part, we present a new general data model, currently under development, which describes information on molecular function and cellular processes in a rigorous manner. The model is capable of representing a large variety of biochemical processes, including metabolic pathways, regulation of gene expression and signal transduction. It also incorporates taxonomies for categorising molecular entities, interactions and processes, and it offers means of viewing the information at different levels of resolution and of dealing with incomplete knowledge. The data model has been implemented in 'aMAZE' (http://www.ebi.ac.uk/research/pfbp/), a database on protein function and cellular processes which presently covers metabolic pathways and their regulation. Several tools for querying, displaying, and performing analyses on such pathways are briefly described in order to illustrate the practical applications enabled by the model.
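To make the entity/interaction/process layering concrete, here is a minimal sketch of such a data model using Python dataclasses. It illustrates the general idea only; it is not the aMAZE schema, and it omits the taxonomies and most of the resolution machinery the review describes.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Entity:
    """A molecular entity: gene, protein, metabolite, complex, ..."""
    name: str
    kind: str

@dataclass
class Interaction:
    """One biochemical step: a reaction, binding event or regulatory effect."""
    kind: str
    inputs: List[Entity]
    outputs: List[Entity]

@dataclass
class Process:
    """A pathway: a collection of interactions or nested sub-processes,
    so the same information can be viewed at different resolutions."""
    name: str
    steps: List[Union[Interaction, "Process"]] = field(default_factory=list)
```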
Empirical Bayesian models for analysing molecular serotyping microarrays.
BACKGROUND: Microarrays offer great potential as a platform for molecular diagnostics, testing clinical samples for the presence of numerous biomarkers in highly multiplexed assays. In this study, applied to infectious diseases, data from a microarray designed for the molecular serotyping of Streptococcus pneumoniae were used, identifying the presence of any one of 91 known pneumococcal serotypes from DNA extracts. This microarray incorporated oligonucleotide probes for all known capsular polysaccharide synthesis genes and required a statistical analysis of the microarray intensity data to determine which serotype, or combination of serotypes, was present within a sample, based on the combination of genes detected. RESULTS: We propose an empirical Bayesian model for calculating the probabilities of combinations of serotypes from the microarray data. The model takes into consideration the dependencies between serotypes induced by genes they have in common and by homologous genes which, although not identical, are similar to each other in sequence. For serotypes which are very similar in capsular gene composition, extra probes are included on the microarray, providing additional information which is integrated into the Bayesian model. For each serotype combination with high probability, a second model, a Bayesian random-effects model, is applied to determine the relative abundance of each serotype. CONCLUSIONS: To assess the accuracy of the proposed analysis, we applied our methods to experimental data from samples containing individual serotypes and samples containing combinations of serotypes with known levels of abundance. All but two of the known serotypes of S. pneumoniae that were tested as individual samples could be uniquely determined by the Bayesian model. The model also enabled the presence of combinations of serotypes within samples to be determined. Serotypes with very low abundance within a combination of serotypes can be detected (down to 2% abundance in this study). As well as detecting the presence of serotype combinations, an approximate measure of the percentage abundance of each serotype within the combination can be obtained.
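A toy version of the probability calculation, assuming a simple per-probe detection model, a flat prior and single serotypes only. The published empirical Bayesian model additionally handles homologous genes, probe intensities, serotype mixtures and relative abundance, none of which is shown; all parameters here are assumptions.

```python
import numpy as np

def serotype_posterior(detected, probe_sets, p_hit=0.95, p_false=0.01):
    """Posterior over single serotypes given the set of detected probes.

    probe_sets: serotype -> set of cps-gene probes expected to light up.
    """
    all_probes = set().union(*probe_sets.values())
    serotypes = list(probe_sets)
    logp = np.zeros(len(serotypes))
    for k, s in enumerate(serotypes):
        for probe in all_probes:
            expected = probe in probe_sets[s]
            seen = probe in detected
            p = (p_hit if seen else 1 - p_hit) if expected \
                else (p_false if seen else 1 - p_false)
            logp[k] += np.log(p)
    post = np.exp(logp - logp.max())           # flat prior over serotypes
    return dict(zip(serotypes, post / post.sum()))
```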
Displaying R spatial statistics on Google dynamic maps with web applications created by Rwui.
BACKGROUND: The R project includes a large variety of packages designed for spatial statistics. Google dynamic maps provide web-based access to global maps and satellite imagery. We describe a method for displaying the spatial output from an R script directly on a Google dynamic map. METHODS: This is achieved by creating a Java-based web application which runs the R script and then displays the results on the dynamic map. In order to make this method easy to implement for those unfamiliar with programming Java-based web applications, we have added it to the options available in the R Web User Interface (Rwui) application. Rwui is an established web application for creating web applications that run R scripts. A feature of Rwui is that all the code for the web application being created is generated automatically, so that someone with no knowledge of web programming can make a fully functional web application for running an R script in a matter of minutes. RESULTS: Rwui can now be used to create web applications that will display the results from an R script on a Google dynamic map. Results may be displayed as discrete markers and/or as continuous overlays. In addition, users of the web application may select regions of interest on the dynamic map with mouse clicks, and the coordinates of the region of interest are automatically made available for use by the R script. CONCLUSIONS: This method of displaying R output on dynamic maps is designed to be of use in a number of areas. Firstly, it allows statisticians working in R and developing methods in spatial statistics to easily visualise the results of applying their methods to real-world data. Secondly, it allows researchers who are using R to study health geographics data to display their results directly on dynamic maps. Thirdly, by creating a web application for running an R script, a statistician can enable users entirely unfamiliar with R to run R-coded statistical analyses of health geographics data. Fourthly, we envisage an educational role for such applications.