3 research outputs found

    Controversies in modern evolutionary biology: the imperative for error detection and quality control

    Abstract

    Background: The data from high throughput genomics technologies provide unique opportunities for studies of complex biological systems, but also pose many new challenges. The shift to the genome scale in evolutionary biology, for example, has led to many interesting, but often controversial studies. It has been suggested that part of the conflict may be due to errors in the initial sequences. Most gene sequences are predicted by bioinformatics programs and a number of quality issues have been raised, concerning DNA sequencing errors or badly predicted coding regions, particularly in eukaryotes.

    Results: We investigated the impact of these errors on evolutionary studies and specifically on the identification of important genetic events. We focused on the detection of asymmetric evolution after duplication, which has been the subject of controversy recently. Using the human genome as a reference, we established a reliable set of 688 duplicated genes in 13 complete vertebrate genomes, where significantly different evolutionary rates are observed. We estimated the rates at which protein sequence errors occur and are accumulated in the higher-level analyses. We showed that the majority of the detected events (57%) are in fact artifacts due to the putative erroneous sequences and that these artifacts are sufficient to mask the true functional significance of the events.

    Conclusions: Initial errors are accumulated throughout the evolutionary analysis, generating artificially high rates of event predictions and leading to substantial uncertainty in the conclusions. This study emphasizes the urgent need for error detection and quality control strategies in order to efficiently extract knowledge from the new genome data.

    Identification of metabolic network models from incomplete high-throughput datasets

    Motivation: High-throughput measurement techniques for metabolism and gene expression provide a wealth of information for the identification of metabolic network models. Yet, missing observations scattered over the dataset restrict the number of effectively available datapoints and make classical regression techniques inaccurate or inapplicable. Thorough exploitation of the data by identification techniques that explicitly cope with missing observations is therefore of major importance.

    Identification of metabolic network models from incomplete high-throughput datasets

    High-throughput measurement techniques for metabolism and gene expression provide a wealth of information for the identification of metabolic network models. Yet, missing observations scattered over the dataset restrict the number of effectively available datapoints and make classical regression techniques inaccurate or inapplicable. Thorough exploitation of the data by identification techniques that explicitly cope with missing observations is therefore of major importance. We develop a maximum-likelihood approach for the estimation of unknown parameters of metabolic network models that relies on the integration of statistical priors to compensate for the missing data. In the context of the linlog metabolic modeling framework, we implement the identification method by an Expectation Maximization (EM) algorithm and by a simpler direct numerical optimization method. We evaluate performance of our methods by comparison to existing approaches, and show that our EM method provides the best results over a variety of simulated scenarios. We then apply the EM algorithm to a real problem, the identification of a model for the Escherichia coli central carbon metabolism, based on challenging experimental data from the literature. This leads to promising results and allows us to highlight critical identification issues.