
    Statistical methods for analysis and correction of high-throughput screening data

    During high-throughput screening (HTS), the first step in drug discovery, the activity levels of thousands of chemical compounds are measured in order to identify potential candidates for future drugs (i.e., hits) among them. A large number of environmental and procedural factors can negatively affect the screening process by introducing systematic errors into the obtained measurements. Systematic errors can significantly alter the results of hit selection, producing large numbers of false positives and false negatives. HTS data-correction methods have been developed to adjust the data obtained from screening and compensate for the negative effect that systematic errors have on these data (Heyse 2002, Brideau et al. 2003, Heuer et al. 2005, Kevorkov and Makarenkov 2005, Makarenkov et al. 2006, Malo et al. 2006, Makarenkov et al. 2007). In this thesis, we first assess the applicability of several statistical methods for detecting the presence of systematic error in experimental HTS data, including the χ² goodness-of-fit test, the t-test, and the Kolmogorov-Smirnov test preceded by the Fourier transform method. We show, first, that detecting systematic error in raw HTS data is feasible, and that it is also possible to determine the exact location (rows, columns, and plates) of the systematic errors in the assay. We recommend using a specialized version of the t-test to detect systematic error before hit selection, in order to determine whether error correction is needed. Typically, systematic errors affect only a few rows or columns, on some but not all plates of the assay. All existing error-correction methods were designed to modify all the data of the plate to which they are applied and, in some cases, even all the data of the assay. Thus, when applied, existing methods modify not only the experimental measurements biased by systematic error but also many correct measurements. In this context, we propose two new, effective systematic error-correction methods that are designed to modify only selected rows and columns of a given plate, i.e., those where the presence of systematic error has been confirmed. After correction, the corrected measurements remain comparable with the unmodified values of the given plate and with those of the whole assay. Both new methods rely on the results of an error-detection test to determine which rows and columns of each plate of the assay should be corrected. A general procedure for the correction of high-throughput screening data is also suggested. Current hit-selection methods in high-throughput screening generally do not allow the reliability of the obtained results to be assessed. In this thesis, we describe a methodology for estimating the probability that each chemical compound is a hit when the assay contains more than one replicate. Using this new methodology, we define a new probability-based hit-selection procedure that estimates a confidence level characterizing each hit.
    In addition, new measures for estimating the rates of change of false positives and false negatives, as a function of the number of assay replicates, have been proposed. Furthermore, we study the possibility of defining accurate statistical models for the computational prediction of HTS measurements. Note that the experimental screening process is very expensive; virtual, in silico, screening could lead to a substantial reduction in costs. We focused on searching for relationships between experimental HTS measurements and a set of chemical descriptors characterizing the chemical compounds under consideration. We carried out Polynomial Redundancy Analysis to establish the existence of these relationships. At the same time, we applied two machine learning methods, neural networks and decision trees, to test their ability to predict experimental screening results.
    AUTHOR KEYWORDS: high-throughput screening (HTS), statistical modeling, predictive modeling, systematic error, error-correction methods, machine learning methods
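
    As an illustration of the kind of row- and column-level error check described above, the following sketch applies a plain Welch t-test to every row and column of a simulated 96-well plate and flags those that deviate from the rest of the plate. It is a minimal sketch assuming NumPy and SciPy are available; it is not the specialized t-test developed in the thesis, and the injected column bias is purely illustrative.

    # Minimal sketch: flag plate rows/columns whose measurements differ
    # systematically from the rest of the plate (simulated 8x12 plate).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    plate = rng.normal(loc=100.0, scale=10.0, size=(8, 12))  # 96-well plate
    plate[:, 11] += 25.0  # inject a systematic column effect (e.g., edge effect)

    def flag_lines(plate, axis, alpha=0.01):
        """Return indices of rows (axis=0) or columns (axis=1) whose values
        differ significantly from the remaining wells of the plate."""
        flagged = []
        for i in range(plate.shape[axis]):
            line = np.take(plate, i, axis=axis).ravel()
            rest = np.delete(plate, i, axis=axis).ravel()
            _, p = stats.ttest_ind(line, rest, equal_var=False)
            if p < alpha:
                flagged.append(i)
        return flagged

    print("suspect rows:", flag_lines(plate, axis=0))
    print("suspect columns:", flag_lines(plate, axis=1))  # should report column 11

    Only the rows or columns flagged by such a test would then be passed to a row/column-specific correction step, leaving the unaffected wells untouched, which is the selective-correction strategy argued for above.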

    Bioinformatics tools for analysing viral genomic data

    The field of viral genomics and bioinformatics is experiencing a strong resurgence due to high-throughput sequencing (HTS) technology, which enables the rapid and cost-effective sequencing and subsequent assembly of large numbers of viral genomes. In addition, the unprecedented power of HTS technologies has enabled the analysis of intra-host viral diversity and quasispecies dynamics in relation to important biological questions on viral transmission, vaccine resistance and host jumping. HTS also enables the rapid identification of both known and potentially new viruses from field and clinical samples, thus adding new tools to the fields of viral discovery and metagenomics. Bioinformatics has been central to the rise of HTS applications because new algorithms and software tools are continually needed to process and analyse the large, complex datasets generated in this rapidly evolving area. In this paper, the authors give a brief overview of the main bioinformatics tools available for viral genomic research, with a particular emphasis on HTS technologies and their main applications. They summarise the major steps in various HTS analyses, starting with quality control of raw reads and encompassing activities ranging from consensus and de novo genome assembly to variant calling and metagenomics, as well as RNA sequencing.
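
    The quality-control step listed above can be illustrated with a short, self-contained sketch that filters reads by mean Phred quality and length. The in-memory record format, thresholds, and example reads are assumptions made for illustration only; they do not refer to any specific tool discussed in the paper.

    # Minimal sketch: filter sequencing reads by mean Phred quality before assembly.
    # Records are (read_id, sequence, quality_string) with Phred+33 encoding.

    def mean_phred(quality_string):
        """Mean Phred score of a read, assuming Phred+33 (Sanger/Illumina 1.8+) encoding."""
        return sum(ord(c) - 33 for c in quality_string) / len(quality_string)

    def quality_filter(records, min_mean_q=20, min_length=50):
        """Keep reads that are long enough and of sufficient average quality."""
        return [r for r in records
                if len(r[1]) >= min_length and mean_phred(r[2]) >= min_mean_q]

    reads = [
        ("read1", "ACGT" * 20, "I" * 80),   # high quality (Phred 40)
        ("read2", "ACGT" * 20, "#" * 80),   # low quality (Phred 2)
    ]
    print([r[0] for r in quality_filter(reads)])  # ['read1']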

    Data-analysis strategies for image-based cell profiling

    Image-based cell profiling is a high-throughput strategy for the quantification of phenotypic differences among a variety of cell populations. It paves the way to studying biological systems on a large scale by using chemical and genetic perturbations. The general workflow for this technology involves image acquisition with high-throughput microscopy systems and subsequent image processing and analysis. Here, we introduce the steps required to create high-quality image-based (i.e., morphological) profiles from a collection of microscopy images. We recommend techniques that have proven useful in each stage of the data analysis process, on the basis of the experience of 20 laboratories worldwide that are refining their image-based cell-profiling methodologies in pursuit of biological discovery. The recommended techniques cover alternatives that may suit various biological goals, experimental designs, and laboratories' preferences.
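
    As a sketch of the profile-building step described above, the snippet below aggregates per-cell morphological measurements into per-well profiles (median aggregation) and normalizes each feature against control wells with a robust z-score. The column names, feature values, and normalization constants are hypothetical choices for illustration, not the paper's prescribed pipeline.

    # Minimal sketch: build per-well morphological profiles from per-cell features
    # and normalize them against control wells (hypothetical column names).
    import pandas as pd

    # Per-cell feature table: one row per segmented cell.
    cells = pd.DataFrame({
        "well":        ["A01", "A01", "A02", "A02", "B01", "B01"],
        "treatment":   ["ctrl", "ctrl", "ctrl", "ctrl", "drugX", "drugX"],
        "cell_area":   [310.0, 295.0, 320.0, 300.0, 410.0, 395.0],
        "nucleus_int": [1.10, 1.05, 1.00, 0.98, 1.60, 1.55],
    })

    # 1) Aggregate cells to one profile per well (median is robust to outlier cells).
    profiles = cells.groupby(["well", "treatment"], as_index=False).median(numeric_only=True)

    # 2) Normalize each feature against the control wells (robust z-score via MAD).
    feature_cols = ["cell_area", "nucleus_int"]
    ctrl = profiles.loc[profiles["treatment"] == "ctrl", feature_cols]
    center = ctrl.median()
    scale = (ctrl - ctrl.median()).abs().median() * 1.4826  # MAD -> ~std under normality
    profiles[feature_cols] = (profiles[feature_cols] - center) / scale

    print(profiles)

    Median aggregation paired with control-based robust scaling is one common choice because it limits the influence of outlier cells and of plate-wide intensity shifts.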

    Detecting and Correcting Contamination in Genetic Data.

    While technological innovation has dramatically increased the amount and variety of genomic data available to geneticists, no assay is perfect, and both human error and technical artifacts can lead to erroneous data. A proper analysis pipeline must both detect errors and, if possible, correct them. One common source of errors in genetic data is sample-to-sample contamination. This dissertation identifies methods to address contamination in the most common types of genetic studies. Chapter 2 focuses on methods for detecting and quantifying contamination in both array-based and next-generation sequencing (NGS) genotype data. For the array-based data, we use the observed intensities from the genotyping instruments to quantify contamination with two distinct methods: 1) a regression-based model using intensities and population allele frequencies and 2) a multivariate normal mixture model that looks at the clustering of intensities. For NGS data, we model the reads using a mixture model to determine the proportion of reads from the true sample and the contaminating sample. Chapter 3 outlines a method to make accurate genotype calls with contaminated NGS data. Given an estimated level of contamination, we propose a likelihood that can be maximized to call genotypes and estimate allele frequencies for samples with no previous genotype data. We investigate the method using data from two common sequencing strategies: 1) low-pass (2-4x depth) genome-wide sequencing and 2) high-depth (50-100x depth) exome sequencing. Chapter 4 looks at contamination in the context of RNA sequencing (RNA-Seq) data. While the technology to generate RNA-Seq data is similar to exome sequencing, the difference in expression between the contaminating and true sample makes it more difficult to accurately estimate the contamination proportion. We propose methods to improve the quality of these estimates. PhD dissertation, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120783/1/mflick_1.pd
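
    As a rough illustration of the read-level mixture idea summarized above (and not the dissertation's actual model), the sketch below estimates a contamination fraction alpha by maximizing a binomial likelihood for alternate-allele read counts at sites assumed to be homozygous reference in the true sample, given population allele frequencies and an assumed sequencing error rate. All parameters and simulated data are illustrative assumptions.

    # Minimal sketch: estimate a sample-contamination fraction (alpha) from
    # alt-allele read counts at sites assumed homozygous-reference in the true
    # sample, given population alt-allele frequencies. Illustrative model only.
    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import binom

    ERR = 0.005  # assumed per-base sequencing error rate

    def neg_log_lik(alpha, alt_reads, depths, alt_freqs, err=ERR):
        """Negative binomial log-likelihood: alt reads arise from sequencing error
        on the true sample plus alt alleles carried by the contaminating sample."""
        p = (1.0 - alpha) * err + alpha * alt_freqs
        return -np.sum(binom.logpmf(alt_reads, depths, p))

    def estimate_contamination(alt_reads, depths, alt_freqs):
        res = minimize_scalar(neg_log_lik, bounds=(0.0, 0.5), method="bounded",
                              args=(alt_reads, depths, alt_freqs))
        return res.x

    # Simulated data: 2,000 hom-ref sites, 30x depth, true alpha = 0.05.
    rng = np.random.default_rng(1)
    freqs = rng.uniform(0.05, 0.95, size=2000)
    depths = np.full(2000, 30)
    true_alpha = 0.05
    alt = rng.binomial(depths, (1 - true_alpha) * ERR + true_alpha * freqs)
    print(round(estimate_contamination(alt, depths, freqs), 3))  # ~0.05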

    A critical examination of compound stability predictions from machine-learned formation energies

    Machine learning has emerged as a novel tool for the efficient prediction of material properties, and claims have been made that machine-learned models for the formation energy of compounds can approach the accuracy of Density Functional Theory (DFT). The models tested in this work include five recently published compositional models, a baseline model using stoichiometry alone, and a structural model. By testing seven machine learning models for formation energy on stability predictions using the Materials Project database of DFT calculations for 85,014 unique chemical compositions, we show that while formation energies can indeed be predicted well, all compositional models perform poorly on predicting the stability of compounds, making them considerably less useful than DFT for the discovery and design of new solids. Most critically, in sparse chemical spaces where few stoichiometries have stable compounds, only the structural model is capable of efficiently detecting which materials are stable. The nonincremental improvement of structural models compared with compositional models is noteworthy and encourages the use of structural models for materials discovery, with the constraint that for any new composition, the ground-state structure is not known a priori. This work demonstrates that accurate predictions of formation energy do not imply accurate predictions of stability, emphasizing the importance of assessing model performance on stability predictions, for which we provide a set of publicly available tests.
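
    The gap described above between formation-energy accuracy and stability accuracy can be made concrete with a small synthetic sketch: stability hinges on the distance to the convex hull, and many compounds sit closer to the hull than a typical model error, so even a modest energy error flips many stability calls. All numbers below (hull-distance distribution, noise level, tolerance) are arbitrary choices for illustration and are not taken from the paper.

    # Minimal sketch: a small formation-energy error can still flip many
    # stability calls when the hull distances are themselves small.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 10_000

    # Reference ("DFT") energies above the convex hull, in eV/atom; many
    # compounds sit close to the hull.
    e_hull_dft = np.abs(rng.normal(0.0, 0.08, size=n))
    stable_dft = e_hull_dft < 0.001  # on the hull within a numerical tolerance

    # Emulate a model whose energy error (~60 meV/atom MAE) is, as a
    # simplification, applied directly to the hull distance.
    e_hull_ml = e_hull_dft + rng.normal(0.0, 0.075, size=n)
    stable_ml = e_hull_ml < 0.001

    mae = np.mean(np.abs(e_hull_ml - e_hull_dft))
    tp = np.sum(stable_ml & stable_dft)
    precision = tp / max(np.sum(stable_ml), 1)
    recall = tp / max(np.sum(stable_dft), 1)
    print(f"energy MAE: {mae*1000:.0f} meV/atom, "
          f"stability precision: {precision:.2f}, recall: {recall:.2f}")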

    The Importance of Contrast Sensitivity, Color Vision, and Electrophysiological Testing In Clinical and Occupational Settings

    Visual acuity (VA) is universally accepted as the gold standard metric for ocular vision and function. Contrast sensitivity (CS), color vision, and electrophysiological testing are nonetheless warranted in clinical and occupational settings, despite being deemed ancillary and being minimally utilized by clinicians. These assessments provide essential information for subjectively and objectively quantifying visual function and for obtaining optimal functional vision. They are useful for baseline data and for monitoring hereditary and progressive ocular conditions and cognitive function. The studies in this dissertation highlight the value of contrast sensitivity, color vision, and cone specific electrophysiological testing, as well as novel metrics with potential practical clinical applications for evaluating visual function and perception in patients in various settings. The first study aimed to design a clinically expedient method to combine color CS and color naming (CN) into a single, multi-metric test of color vision, the Color Contrast Naming Test (CCNT). This was accomplished by comparing and validating it against the standardized computerized Cone Contrast Test (CCT; Innova Systems, Inc.). Findings in color vision deficient (CVD) and color vision normal (CVN) participants showed a strong correlation between CCNT CS and the standard CCT. Furthermore, CCT CS showed distinct scores in 50% of CVDs, while the CCNT composite score (mean of CS and CN) showed distinct scores in 70% of CVDs, indicating better potential discrimination of CVD color abilities. This novel metric has potential applications for identifying hereditary or progressive CVD severity and capabilities. The second study focused on electrophysiological diagnostics, specifically cone specific visual evoked potentials (VEPs), to objectively measure long-term neural adaptive responses to color-correcting lenses (CCLs). Dr. Werner and colleagues determined that extended wear (for 12 days) of color-correcting lenses improved red-green color perception in hereditary CVD even when the CCLs were not being worn. Furthermore, Dr. Rabin and colleagues were able to objectively measure both immediate short-term (baseline, 4, 8, 12 days) and long-term (3, 6, 12 months) improvements in color perception after CCL removal with cone specific VEPs, something that had not been done before. The novel findings from both studies support the notion that neural adaptive changes can occur over short- and longer-term periods despite minimal daily wear time. More importantly, this further supports the value of suprathreshold cone VEPs for objectively assessing color vision function in both clinical and occupational settings. Most dry eye studies use measures of tear quality and volume coupled with standard clinical tests such as high contrast visual acuity (VA), while fewer studies have investigated the effects of dry eyes on low contrast vision. The final study was designed to determine the impact of Meibomian Gland Dysfunction (MGD) dry eye on high- and low-contrast vision, including both black/white (luminance) and cone specific color vision. A primary intent was to determine whether these novel metrics improved following minimal meibomian gland (MG) expression. The computerized CCNT and CCT (cone and black/white) tests used in this study confirmed that minimal MG expression improved low contrast performance for long (L cone) and short (S cone) wavelength-sensitive cones.
    These improvements were most significant using throughput (CS/response time) and CCNT composite scores, both novel metrics with potential use in dry eye diagnosis, treatment, and management. Physical optics offers an explanation: decreased destructive interference in the stroma, most detrimental for red light, and increased scattering by subtle epithelial, endothelial, and/or tear film defects, most detrimental for blue light, could each reduce retinal image contrast in ways most evident in L and S cone CS. Contrast sensitivity, color vision, and cone specific electrophysiological testing remain underutilized in basic, clinical, applied, and translational research and in occupational settings. These studies showed provocative results within their respective categories and confirmed their validity and importance for identifying and monitoring ocular conditions and neural adaptive or cognitive functions. Furthermore, novel metrics such as throughput and CCNT composite scores serve as potential tangible and practical standards for assessing visual function and perception.
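
    The two derived scores highlighted above are simple arithmetic; the sketch below spells them out, with throughput defined as contrast sensitivity divided by response time and the CCNT composite score as the mean of the color contrast sensitivity and color naming scores. The function names, units, and example values are hypothetical.

    # Minimal sketch of the two derived metrics described above
    # (hypothetical scores and units, for illustration only).

    def throughput(contrast_sensitivity, response_time_s):
        """Throughput = contrast sensitivity per second of response time."""
        return contrast_sensitivity / response_time_s

    def ccnt_composite(color_cs_score, color_naming_score):
        """CCNT composite = mean of color contrast sensitivity and color naming scores."""
        return (color_cs_score + color_naming_score) / 2.0

    print(throughput(contrast_sensitivity=75.0, response_time_s=1.5))    # 50.0
    print(ccnt_composite(color_cs_score=80.0, color_naming_score=60.0))  # 70.0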