Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments
BACKGROUND: Microarray technologies produce large amounts of data. In a previous study, we showed the value of the k-Nearest Neighbour approach for restoring missing gene expression values, and its positive impact on gene clustering by hierarchical algorithms. Since then, numerous replacement methods have been proposed to impute missing values (MVs) for microarray data. In this study, we evaluated twelve different usable methods and their influence on the quality of gene clustering. We used several datasets, from both kinetic and non-kinetic experiments in yeast and human. RESULTS: We underline the excellent efficiency of the approaches proposed and implemented by Bo and co-workers, especially the one based on expectation maximization (EM_array). These improvements were also observed for the imputation of extreme values, the most difficult values to predict. We showed that the imputed MVs still have important effects on the stability of the gene clusters. The improvement in the clustering obtained by hierarchical clustering remains limited and is not sufficient to completely restore the correct gene associations. However, a common tendency can be found between the quality of the imputation method and gene cluster stability. Even though comparing clustering algorithms is a complex task, we observed that the k-means approach is more efficient at conserving gene associations. CONCLUSIONS: More than 6,000,000 independent simulations assessed the quality of 12 imputation methods on five very different biological datasets. Important improvements have thus been made since our last study. The EM_array approach constitutes an efficient method for restoring missing gene expression values, with a lower estimation error level. Nonetheless, the presence of MVs, even at a low rate, is a major factor of gene cluster instability. Our study highlights the need for a systematic assessment of imputation methods and thus for dedicated benchmarks. A noticeable point is the specific influence of some biological datasets.
Genomic Exploration of the Hemiascomycetous Yeasts: 1. A set of yeast species for molecular evolution studies
Sequences and annotations are accessible at: Génoscope (http://www.genoscope.cns.fr), FEBS Letters Website (http://www.elsevier.nl/febs/show/), Bordeaux (http://cbi.genopole-bordeaux.fr/Genolevures) and were deposited into the EMBL database (accession numbers from AL392203 to AL441602).
The identification of molecular evolutionary mechanisms in eukaryotes is approached by a comparative genomics study of a homogeneous group of species classified as Hemiascomycetes. This group includes Saccharomyces cerevisiae, the first eukaryotic genome to be entirely sequenced, back in 1996. A random sequencing analysis has been performed on 13 different species sharing a small genome size and a low frequency of introns. Detailed information is provided in the 20 following papers. Additional tables available on websites describe the ca. 20,000 newly identified genes. This wealth of data, so far unique among eukaryotes, allowed us to examine the conservation of chromosome maps, to identify the 'yeast-specific' genes, and to review the distribution of gene families into functional classes. This project, conducted by a network of seven French laboratories, has been designated 'Génolevures'.
Genomic Exploration of the Hemiascomycetous Yeasts: 19. Ascomycetes-specific genes
Comparisons of the 6213 predicted Saccharomyces cerevisiae open reading frame (ORF) products with sequences from organisms of other biological phyla differentiate genes commonly conserved in evolution from 'maverick' genes, which have no homologue in phyla other than the Ascomycetes. We show that a majority of the 'maverick' genes have homologues among other yeast species and thus define a set of 1892 genes that, from sequence comparisons, appear 'Ascomycetes-specific'. We estimate, retrospectively, that the S. cerevisiae genome contains 5651 actual protein-coding genes, 50 of which were identified for the first time in this work, and that the present public databases contain 612 predicted ORFs that are not real genes. Interestingly, the sequences of the 'Ascomycetes-specific' genes tend to diverge more rapidly in evolution than those of other genes. Half of the 'Ascomycetes-specific' genes are functionally characterized in S. cerevisiae, and a few functional categories are over-represented among them.
Migraine et contraception orale (Migraine and oral contraception)
AIX-MARSEILLE2-BU Pharmacie (130552105) / Sudoc, France
Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering.
12 pages + sup. data. BACKGROUND: Microarray technologies produce large amounts of data. Hierarchical clustering is commonly used to identify clusters of co-expressed genes. However, microarray datasets often contain missing values (MVs), which are a major drawback for the use of clustering methods. Usually the MVs are left untreated, replaced by zero, or estimated by the k-Nearest Neighbor (kNN) approach. This paper studies the stability of gene clusters, defined by various hierarchical clustering algorithms, in microarray experiments with and without MVs. RESULTS: In this study, we show that MVs have important effects on the stability of the gene clusters. Moreover, the magnitude of the gene misallocations depends on the aggregation algorithm. The most appropriate aggregation methods (e.g. complete-linkage and Ward) are highly sensitive to MVs, surprisingly even for a very small proportion of MVs (e.g. 1%). In most cases, the MVs must be replaced by expected values. Replacing MVs with the kNN approach clearly improves the identification of co-expressed gene clusters. Nevertheless, we observe that the kNN approach is less suitable for extreme values of gene expression. CONCLUSION: The presence of MVs (even at a low rate) is a major factor of gene cluster instability. In addition, the impact depends on the hierarchical clustering algorithm used. Some methods should be used carefully. Nevertheless, the kNN approach constitutes an efficient method for restoring missing gene expression values, with a low error level. Our study highlights the need for statistical treatment of microarray data to avoid misinterpretation.
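As an illustration of the kNN replacement idea mentioned in this abstract, here is a minimal sketch in Python. It is not the authors' implementation: the function name `knn_impute`, the use of `None` for a missing value, the Euclidean distance over shared columns, and the unweighted mean of the neighbours are all assumptions made for the example.

```python
import math


def knn_impute(matrix, k=2):
    """Return a copy of `matrix` (rows = genes, columns = conditions) in which
    each None entry is replaced by the mean of that column over the k nearest
    rows. Hypothetical helper, for illustration only."""

    def distance(a, b):
        # Root mean squared difference over the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows with an observed value in column j.
            candidates = [
                (distance(row, other), other[j])
                for h, other in enumerate(matrix)
                if h != i and other[j] is not None
            ]
            candidates = [c for c in candidates if c[0] != math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:  # leave the gap untouched when no neighbour is usable
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


expression = [
    [1.0, 2.0, None],   # gene with one missing value
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 40.0],   # distant profile, should not be used with k=2
]
filled = knn_impute(expression, k=2)
```

In this toy matrix the two closest profiles supply the values 3.0 and 5.0, so the gap is filled with their mean. The distant fourth profile, with its extreme value, is ignored.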
Global analysis of VHHs framework regions with a structural alphabet: VHH FRs structures
The VHHs are the antigen-binding domains of camelid heavy-chain antibodies (HCAbs). They have many interesting biotechnological and biomedical properties due to their small size, high solubility and stability, and high affinity and specificity for their antigens. HCAbs and classical IgGs are evolutionarily related and share a common fold. VHHs are composed of regions considered constant, called the frameworks (FRs), connected by Complementarity Determining Regions (CDRs), highly variable regions that provide the interaction with the epitope. To date, no systematic structural analysis had been performed on VHH structures despite the significant number of structures available. This work is the first study to analyse the structural diversity of the FRs of VHHs. Using a structural alphabet that approximates the local conformation, we show that none of the four FRs has a unique structure; each exhibits many structural variant patterns. Moreover, no simple direct link between the local conformational change and amino acid composition can be detected. These results indicate that long-range interactions affect the local conformation of FRs and impact the building of structural models.
Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies
Sequencing the human genome began in 1994, and 10 years of work were necessary in order to provide a nearly complete sequence. Nowadays, NGS technologies allow sequencing of a whole human genome in a few days. This deluge of data challenges scientists in many ways, as they are faced with data management issues and analysis and visualization drawbacks due to the limitations of current bioinformatics tools. In this paper, we describe how the NGS Big Data revolution changes the way data are managed and analysed. We present how biologists are confronted with an abundance of methods, tools, and data formats. To overcome these problems, we focus on Big Data Information Technology innovations from web and business intelligence. We underline the interest of NoSQL databases, which are much more efficient than relational databases. Since Big Data leads to a loss of interactivity with data during analysis due to high processing time, we describe solutions from Business Intelligence that allow one to regain interactivity regardless of the volume of data. We illustrate this point with a focus on the Amadea platform. Finally, we discuss visualization challenges posed by Big Data and present the latest innovations with JavaScript graphic libraries.
Genomic Exploration of the Hemiascomycetous Yeasts: 12. Kluyveromyces marxianus var. marxianus
As part of the comparative genomics project 'GENOLEVURES', we studied the Kluyveromyces marxianus var. marxianus strain CBS712 using a partial random sequencing strategy. With a 0.2× genome equivalent coverage, we identified ca. 1300 novel genes encoding proteins, some containing spliceosomal introns with consensus splice sites identical to those of Saccharomyces cerevisiae, 28 tRNA genes, the whole rDNA repeat, and retrotransposons of the Ty1/2 family of S. cerevisiae with diverged Long Terminal Repeats. Functional classification of the K. marxianus genes, as well as the analysis of the paralogous gene families, revealed few differences with respect to S. cerevisiae. Only 42 identified K. marxianus genes are without detectable homolog in the baker's yeast. However, we identified several genetic rearrangements between these two yeast species.
Random exploration of the Kluyveromyces lactis genome and comparison with that of Saccharomyces cerevisiae.
The genome of the yeast Kluyveromyces lactis was explored by sequencing 588 short tags from two random genomic libraries (random sequenced tags, or RSTs), representing altogether 1.3% of the K. lactis genome. After systematic translation of the RSTs in all six possible frames and comparison with the complete set of proteins predicted from the Saccharomyces cerevisiae genomic sequence using an internally standardized threshold, 296 K. lactis genes were identified, of which 292 are new. This corresponds to approximately 5% of the estimated genes of this organism and triples the total number of identified genes in this species. Of the novel K. lactis genes, 169 (58%) are homologous to S. cerevisiae genes of known or assigned functions, allowing tentative functional assignment, but 59 others (20%) correspond to S. cerevisiae genes of unknown function and previously without homolog among all completely sequenced genomes. Interestingly, a lower degree of sequence conservation is observed in this latter class. In nearly all instances in which the novel K. lactis genes have homologs in different species, sequence conservation is higher with their S. cerevisiae counterparts than with any of the other organisms examined. Conserved gene order relationships (synteny) between the two yeast species are also observed for half of the cases studied.
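The six-frame translation step mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard genetic code, the helper names (`translate`, `reverse_complement`, `six_frame_translations`) are hypothetical, and the subsequent comparison against S. cerevisiae proteins and the threshold used by the authors are not reproduced.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels +1..+3 (forward) and -1..-3 (reverse) to peptides."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Each of the six peptides would then be compared with a protein set. Stop codons appear as `*` and are left in place so that downstream code can decide how to handle them.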