    Model-based clustering with data correction for removing artifacts in gene expression data

    The NIH Library of Integrated Network-based Cellular Signatures (LINCS) contains gene expression data from over a million experiments, using Luminex Bead technology. Only 500 colors are used to measure the expression levels of the 1,000 landmark genes, and the data for the resulting pairs of genes are deconvolved. The raw data are sometimes inadequate for reliable deconvolution, leading to artifacts in the final processed data. These include the expression levels of paired genes being flipped or given the same value, and clusters of values that are not at the true expression level. We propose a new method called model-based clustering with data correction (MCDC) that is able to identify and correct these three kinds of artifacts simultaneously. We show that MCDC improves the resulting gene expression data in terms of agreement with external baselines, as well as improving results from subsequent analysis. Comment: 28 pages

    AutoSOME: a clustering method for identifying gene expression modules without prior knowledge of cluster number

    Background: Clustering the information content of large high-dimensional gene expression datasets has widespread application in "omics" biology. Unfortunately, the underlying structure of these natural datasets is often fuzzy, and the computational identification of data clusters generally requires knowledge about cluster number and geometry. Results: We integrated strategies from machine learning, cartography, and graph theory into a new informatics method for automatically clustering self-organizing map ensembles of high-dimensional data. Our new method, called AutoSOME, readily identifies discrete and fuzzy data clusters without prior knowledge of cluster number or structure in diverse datasets including whole genome microarray data. Visualization of AutoSOME output using network diagrams and differential heat maps reveals unexpected variation among well-characterized cancer cell lines. Co-expression analysis of data from human embryonic and induced pluripotent stem cells using AutoSOME identifies >3400 up-regulated genes associated with pluripotency, and indicates that a recently identified protein-protein interaction network characterizing pluripotency was underestimated by a factor of four. Conclusions: By effectively extracting important information from high-dimensional microarray data without prior knowledge or the need for data filtration, AutoSOME can yield systems-level insights from whole genome microarray expression studies. Due to its generality, this new method should also have practical utility for a variety of data-intensive applications, including the results of deep sequencing experiments. AutoSOME is available for download at http://jimcooperlab.mcdb.ucsb.edu/autosome.

    Automated data integration for developmental biological research

    In an era exploding with genome-scale data, a major challenge for developmental biologists is how to extract significant clues from these publicly available data to benefit our studies of individual genes, and how to use them to improve our understanding of development at a systems level. Several studies have successfully demonstrated new approaches to classic developmental questions by computationally integrating various genome-wide data sets. Such computational approaches have shown great potential for facilitating research: instead of testing 20,000 genes, researchers might test 200 to the same effect. We discuss the nature and state of this art as it applies to developmental research

    An application in bioinformatics : a comparison of affymetrix and compugen human genome microarrays

    The human genome microarrays from Compugen® and Affymetrix® were compared in the context of the emerging field of computational biology. The two premier database servers for genomic sequence data, the National Center for Biotechnology Information and the European Bioinformatics Institute, were described in detail. The various databases and data mining tools available through these data servers were also discussed. Microarrays were examined from a historical perspective, and their main current applications (expression analysis, mutation analysis, and comparative genomic hybridization) were discussed. The two main types of microarrays, cDNA spotted microarrays and high-density spotted microarrays, were analyzed by exploring the human genome microarray from Compugen® and the HG-U133 Set from Affymetrix®, respectively. Array design issues, sequence collection and analysis, and probe selection processes for the two representative types of arrays were described. The respective chip designs of the two types of microarrays were also analyzed. It was found that the human genome microarray from Compugen® contains probes that interrogate 1,119,840 bases corresponding to 18,664 genes, while the HG-U133 Set from Affymetrix® contains probes that interrogate only 825,000 bases corresponding to 33,000 genes. Based on this, the efficiency of the 25-mer probes of the HG-U133 Set from Affymetrix® compared to the 60-mer probes of the microarray from Compugen® was questioned.
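
As a rough check on the figures quoted in this abstract, the sketch below divides the number of interrogated bases by the number of genes for each array. The function name `bases_per_gene` and the dictionary layout are illustrative assumptions, not part of the cited work, and the input numbers are taken from the abstract at face value.

```python
# Figures as quoted in the abstract above (not independently verified).
arrays = {
    "Compugen human genome microarray": {"bases": 1_119_840, "genes": 18_664},
    "Affymetrix HG-U133 Set": {"bases": 825_000, "genes": 33_000},
}


def bases_per_gene(bases: int, genes: int) -> float:
    """Average number of interrogated bases per gene (hypothetical helper)."""
    return bases / genes


for name, figures in arrays.items():
    print(f"{name}: {bases_per_gene(**figures):.1f} bases per gene")
```

Under these quoted totals the ratios come out to 60 and 25 bases per gene, which match the 60-mer and 25-mer probe lengths named in the abstract. This appears to be the per-gene coverage gap behind the abstract's question about probe efficiency.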

    Bioinformatic studies on structural elements for the regulation of alternative oxidase (AOX) gene activities

    Master's project in Informatics Engineering, presented to the University of Lisbon, through the Faculty of Sciences, 2007. Alternative oxidase (AOX) genes encode a small family of isoenzymes (enzymes that differ somewhat but act in the same chemical reaction). AOX is present in plants, fungi, algae, and some yeasts, and has also been found in some classes of the animal kingdom. The enzymes are responsible for an alternative respiration pathway that responds to stress conditions and pathogen attack, as well as to growth and developmental stage. Scaffold Matrix Attachment Regions (S/MARs) are DNA sequences of 300 to 3,000 nucleotides that bind nuclear proteins and serve as anchors for DNA, thereby influencing DNA organization inside the cell. Several studies have failed to reveal a pattern of organization in these sequences; however, some rules have been found that help computer-based analysis. Because experimental identification of these sequences is hard and time-consuming, computational methods could provide a first selection step and cover larger sequences. To highlight possible links between S/MARs and differential regulation of AOX genes, the first part of this project identifies structurally relevant S/MAR regions in the neighborhood of AOX genes in Arabidopsis thaliana and in rice using a selected computer program. Single Nucleotide Polymorphisms (SNPs) are single-nucleotide variations at the same location in DNA sequences from different individuals of the same species. These differences can serve as markers to classify a specific set of individuals. The second part of this project develops a bioinformatics application to help identify specific polymorphisms (SNPs) in sequences obtained experimentally at the EU Marie Curie Chair, ICAM, University of Évora, where this project was carried out.