Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata
Many Microbe Microarrays Database (M3D) is designed to facilitate the analysis and visualization of expression data in compendia compiled from multiple laboratories. M3D contains over a thousand Affymetrix microarrays for Escherichia coli, Saccharomyces cerevisiae and Shewanella oneidensis. The expression data is uniformly normalized to make the data generated by different laboratories and researchers more comparable. To facilitate computational analyses, M3D provides raw data (CEL file) and normalized data downloads of each compendium. In addition, web-based construction, visualization and download of custom datasets are provided to facilitate efficient interrogation of the compendium for more focused analyses. The experimental condition metadata in M3D is human curated with each chemical and growth attribute stored as a structured and computable set of experimental features with consistent naming conventions and units. All versions of the normalized compendia constructed for each species are maintained and accessible in perpetuity to facilitate the future interpretation and comparison of results published on M3D data. M3D is accessible at http://m3d.bu.edu/
Predicting gene function using hierarchical multi-label decision tree ensembles
<p>Abstract</p> <p>Background</p> <p><it>S. cerevisiae</it>, <it>A. thaliana</it> and <it>M. musculus</it> are well-studied organisms in biology and the sequencing of their genomes was completed many years ago. It is still a challenge, however, to develop methods that assign biological functions to the ORFs in these genomes automatically. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability.</p> <p>Results</p> <p>We study the use of decision tree based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These can simultaneously predict all the functions of an ORF, while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees found by it exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use.</p> <p>Conclusions</p> <p>Our results suggest that decision tree based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.</p>
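The idea of an ensemble that predicts several functions at once while respecting a class hierarchy can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes each tree already outputs one score per class, that the hierarchy is a tree with a single parent per class (as in FunCat, unlike the GO graph), and that a class is predicted only if it and all of its ancestors reach the threshold. The class codes and the names `average_predictions` and `predict_labels` are hypothetical.

```python
from typing import Mapping, Optional, Sequence


def average_predictions(tree_outputs: Sequence[Mapping[str, float]]) -> dict[str, float]:
    """Average the per-class scores produced by the individual trees."""
    n = len(tree_outputs)
    classes = tree_outputs[0].keys()
    return {c: sum(t[c] for t in tree_outputs) / n for c in classes}


def predict_labels(
    scores: Mapping[str, float],
    parent: Mapping[str, Optional[str]],
    threshold: float,
) -> set[str]:
    """Predict a class only if it and every ancestor score at least `threshold`."""

    def passes(cls: Optional[str]) -> bool:
        # Walk from the class up to the root of the hierarchy.
        while cls is not None:
            if scores[cls] < threshold:
                return False
            cls = parent.get(cls)
        return True

    return {c for c in scores if passes(c)}


# Hypothetical three-class hierarchy: "01.01" is a child of "01"; "02" is a root.
parent = {"01": None, "01.01": "01", "02": None}

# Hypothetical scores from two trees for one ORF.
tree_outputs = [
    {"01": 0.9, "01.01": 0.7, "02": 0.2},
    {"01": 0.7, "01.01": 0.3, "02": 0.4},
]

scores = average_predictions(tree_outputs)
labels = predict_labels(scores, parent, threshold=0.4)
```

Because a child class can never be predicted without its parent, the resulting label set is always consistent with the hierarchy, whatever threshold is chosen.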
Inferring Gene Networks for Strains of <i>Dehalococcoides</i> Highlights Conserved Relationships between Genes Encoding Core Catabolic and Cell-Wall Structural Proteins
<div><p>The interpretation of high-throughput gene expression data for non-model microorganisms remains obscured because of the high fraction of hypothetical genes and the limited number of methods for the robust inference of gene networks. Therefore, to elucidate gene-gene and gene-condition linkages in the bioremediation-important genus <i>Dehalococcoides</i>, we applied a Bayesian inference strategy called Reverse Engineering/Forward Simulation (REFS<sup>™</sup>) on transcriptomic data collected from two organohalide-respiring communities containing different <i>Dehalococcoides mccartyi</i> strains: the Cornell University mixed community D2 and the commercially available KB-1<sup>®</sup> bioaugmentation culture. In total, 49 and 24 microarray datasets were included in the REFS<sup>™</sup> analysis to generate an ensemble of 1,000 networks for the <i>Dehalococcoides</i> population in the Cornell D2 and KB-1<sup>®</sup> culture, respectively. Considering only linkages that appeared in the consensus network for each culture (exceeding the determined frequency cutoff of ≥ 60%), the resulting Cornell D2 and KB-1<sup>®</sup> consensus networks maintained 1,105 nodes (genes or conditions) with 974 edges and 1,714 nodes with 1,455 edges, respectively. These consensus networks captured multiple strong and biologically informative relationships. One of the main highlighted relationships shared between these two cultures was a direct edge between the transcript encoding for the major reductive dehalogenase (<i>tceA</i> (D2) or <i>vcrA</i> (KB-1<sup>®</sup>)) and the transcript for the putative S-layer cell wall protein (DET1407 (D2) or KB1_1396 (KB-1<sup>®</sup>)). 
Additionally, transcripts for two key oxidoreductases (a [Ni Fe] hydrogenase, Hup, and a protein with similarity to a formate dehydrogenase, “Fdh”) were strongly linked, generalizing a strong relationship noted previously for <i>Dehalococcoides mccartyi</i> strain 195 to multiple strains of <i>Dehalococcoides</i>. Notably, the pangenome array utilized when monitoring the KB-1<sup>®</sup> culture was capable of resolving signals from multiple strains, and the network inference engine was able to reconstruct gene networks in the distinct strain populations.</p></div>
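The consensus step described in this abstract, keeping only linkages that appear in at least 60% of an ensemble of networks, can be sketched as follows. This is an illustrative sketch, not the REFS implementation: it assumes each network is available as a collection of directed (source, target) edges, and the function name `consensus_edges` and the node names in the toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable

Edge = tuple[str, str]  # assumed representation: (source node, target node)


def consensus_edges(networks: Iterable[Iterable[Edge]], cutoff: float = 0.6) -> dict[Edge, float]:
    """Return each edge whose frequency across the ensemble is >= cutoff."""
    ensemble = [set(edges) for edges in networks]  # count an edge once per network
    if not ensemble:
        return {}
    counts = Counter(edge for edges in ensemble for edge in edges)
    total = len(ensemble)
    return {edge: n / total for edge, n in counts.items() if n / total >= cutoff}


# Toy ensemble of five networks with hypothetical node names.
toy_ensemble = [
    [("geneA", "geneB"), ("geneC", "geneD")],
    [("geneA", "geneB"), ("geneC", "geneD")],
    [("geneA", "geneB"), ("geneC", "geneD")],
    [("geneA", "geneB"), ("geneE", "geneF")],
    [("geneX", "geneY")],
]

consensus = consensus_edges(toy_ensemble, cutoff=0.6)
```

In the toy data, the edge from geneA to geneB appears in four of five networks and the edge from geneC to geneD in three of five, so both are retained; the remaining edges each appear once and are dropped.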
REFS<sup>™</sup> consensus network summary for the <i>hup</i> and <i>fdh</i> transcripts.
<p>(a) D2 and (b) KB-1<sup>®</sup>. The connecting lines indicate edge strength scores that exceeded 0.6. Gray text in the KB-1<sup>®</sup> culture indicates the minor strain of the Cornell/Victoria type. All relationships identified in the model between these transcripts were positive.</p>
Ordering <i>Dhc</i> pangenome array probes based on sequence similarity and captured expression profiles for the KB-1<sup>®</sup> culture.
<p>The array contains multiple probes for <i>Dhc</i> orthologs. The white-to-blue shaded columns (left) display the genomic % identity of the probe sequence to gene sequences for representative members of the Cornell, Victoria, and Pinellas groups of <i>Dhc</i>. The yellow-to-purple columns (right) represent the correlation relationship scores of the probe intensity across all cDNA pools from all samples. Bolded (*) probes indicate those that were retained for the REFS<sup>™</sup> analysis of the KB-1<sup>®</sup> data.</p>
Transcripts in the consensus networks that are a maximum of two edges away from connecting with reductive dehalogenases in D2 (left) and KB-1<sup>®</sup> (right).
<p>The four most highly transcribed RDases in the D2 culture and the five most highly transcribed RDases in the KB-1<sup>®</sup> culture are displayed. Other RDases are present in the final consensus network as well. Dashed lines indicate negative relationships, and solid lines indicate positive relationships. For the D2 consensus network, the transcript ID and a brief description are provided. For the KB-1<sup>®</sup> consensus network, the probe ID, orthologous transcript ID in <i>Dhc</i> strain 195 (where applicable), and a brief annotation are provided. The grayed text in the KB-1<sup>®</sup> culture represents transcripts from a minor Cornell-type strain.</p>
Gene-gene edges modified by a discrete variable in the D2 consensus network with high frequencies (f > 0.85).