132 research outputs found

    Comparing high dimensional partitions, with the Coclustering Adjusted Rand Index

    We consider the simultaneous clustering of rows and columns of a matrix and, more particularly, the ability to measure the agreement between two co-clustering partitions. The new criterion we developed is based on the Adjusted Rand Index and is called the Co-clustering Adjusted Rand Index (CARI). We also suggest improvements to existing criteria such as the Classification Error, which counts the proportion of misclassified cells, and the Extended Normalized Mutual Information criterion, which generalizes the mutual-information-based criterion used for classic classifications. We study these criteria with regard to some desired properties deriving from the co-clustering context. Experiments on simulated and real observed data are proposed to compare the behavior of these criteria. Comment: 52 pages
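
    As a point of reference for the abstract above, here is a minimal sketch of the classic Adjusted Rand Index between two ordinary partitions, the index on which CARI is said to be based. It is not the CARI criterion itself; the function name and the handling of the degenerate case are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Classic Adjusted Rand Index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    n = len(labels_a)
    # Contingency counts: items falling in cluster i of A and cluster j of B.
    joint = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Expected number of agreeing pairs under random labelling with fixed cluster sizes.
    expected = sum_rows * sum_cols / total_pairs
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        # Assumption: both partitions are trivial and identical
        # (one single cluster, or all singletons), so report perfect agreement.
        return 1.0
    return (sum_joint - expected) / (maximum - expected)
```

    The index is invariant to relabelling, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` reports perfect agreement, while unrelated partitions score near or below zero.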

    Molecular Basis for Nucleotide Conservation at the Ends of the Dengue Virus Genome

    The dengue virus (DV) is an important human pathogen from the Flavivirus genus, whose genome and antigenome RNAs start with the strictly conserved sequence pppAG. The RNA-dependent RNA polymerase (RdRp), a product of the NS5 gene, initiates RNA synthesis de novo, i.e., without the use of a pre-existing primer. Very little is known about the mechanism of this de novo initiation and how conservation of the starting adenosine is achieved. The polymerase domain NS5PolDV of NS5, upon initiation on viral RNA templates, synthesizes mainly dinucleotide primers that are then elongated in a processive manner. We show here that NS5PolDV contains a specific priming site for adenosine 5′-triphosphate as the first transcribed nucleotide. Remarkably, in the absence of any RNA template, the enzyme is able to selectively synthesize the dinucleotide pppAG when Mn2+ is present as the catalytic ion. The T794 to A799 priming loop is essential for initiation and provides at least part of the ATP-specific priming site. The H798 loop residue is of central importance for the ATP-specific initiation step. In addition to ATP selection, NS5PolDV ensures the conservation of the 5′-adenosine by strongly discriminating against viral templates containing an erroneous 3′-end nucleotide in the presence of Mg2+. In the presence of Mn2+, NS5PolDV is remarkably able to generate and elongate the correct pppAG primer on these erroneous templates. This can be regarded as a genomic/antigenomic RNA end repair mechanism. These conservation mechanisms, mediated by the polymerase alone, may extend to other RNA virus families having RdRps that initiate RNA synthesis de novo.

    Mouse large-scale phenotyping initiatives: overview of the European Mouse Disease Clinic (EUMODIC) and of the Wellcome Trust Sanger Institute Mouse Genetics Project.

    Two large-scale phenotyping efforts, the European Mouse Disease Clinic (EUMODIC) and the Wellcome Trust Sanger Institute Mouse Genetics Project (SANGER-MGP), started during the late 2000s with the aim of delivering a comprehensive assessment of phenotypes or screening for robust indicators of diseases in mouse mutants. Both took advantage of available mouse mutant lines, but predominantly of the embryonic stem (ES) cell resources derived from the European Conditional Mouse Mutagenesis programme (EUCOMM) and the Knockout Mouse Project (KOMP), to produce and study 799 mouse models that were systematically analysed with a comprehensive set of physiological and behavioural paradigms. They captured more than 400 variables and an additional panel of metadata describing the conditions of the tests. All the data are now available through the EuroPhenome database (www.europhenome.org) and the WTSI mouse portal (http://www.sanger.ac.uk/mouseportal/), and the corresponding mouse lines are available through the European Mouse Mutant Archive (EMMA), the International Knockout Mouse Consortium (IKMC), or the Knockout Mouse Project (KOMP) Repository. Overall conclusions from both studies converged, with at least one phenotype scored in at least 80% of the mutant lines. In addition, 57% of the lines were viable, 13% subviable, 30% embryonic lethal, and 7% displayed fertility impairments. These efforts provide an important underpinning for a future global programme that will undertake the complete functional annotation of the mammalian genome in the mouse model.

    A large scale hearing loss screen reveals an extensive unexplored genetic landscape for auditory dysfunction

    The developmental and physiological complexity of the auditory system is likely reflected in the underlying set of genes involved in auditory function. In humans, over 150 non-syndromic loci have been identified, and there are more than 400 human genetic syndromes with a hearing loss component. Over 100 non-syndromic hearing loss genes have been identified in mouse and human, but we remain ignorant of the full extent of the genetic landscape involved in auditory dysfunction. As part of the International Mouse Phenotyping Consortium, we undertook a hearing loss screen in a cohort of 3006 mouse knockout strains. In total, we identified 67 candidate hearing loss genes. We detected known hearing loss genes, but the vast majority of the candidate genes, 52, were novel. Our analysis reveals a large and unexplored genetic landscape involved in auditory function.

    Soft windowing application to improve analysis of high-throughput phenotyping data.

    MOTIVATION: High-throughput phenomic projects generate complex data from small treatment groups and large control groups; the large control groups increase the power of the analyses but introduce variation over time. A method is needed to utilize a set of temporally local controls that maximizes analytic power while minimizing noise from unspecified environmental factors. RESULTS: Here we introduce 'soft windowing', a methodological approach that selects a window of time that includes the most appropriate controls for analysis. Using phenotype data from the International Mouse Phenotyping Consortium (IMPC), adaptive windows were applied such that control data collected proximally to mutants were assigned the maximal weight, while data collected earlier or later had less weight. We applied this method to IMPC data and compared the results with those obtained from a standard non-windowed approach. Validation was performed using a resampling approach in which we demonstrate a 10% reduction of false positives from 2.5 million analyses. We applied the method to our production analysis pipeline that establishes genotype-phenotype associations by comparing mutant versus control data. We report an increase of 30% in significant P-values, as well as linkage to 106 versus 99 disease models via phenotype overlap with the soft-windowed and non-windowed approaches, respectively, from a set of 2082 mutant mouse lines. Our method is generalizable and can benefit large-scale human phenomic projects such as the UK Biobank and the All of Us resources. AVAILABILITY AND IMPLEMENTATION: The method is freely available in the R package SmoothWin, available on CRAN http://CRAN.R-project.org/package=SmoothWin. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
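
    The weighting idea described above (maximal weight for controls collected close to the mutants, smaller weight for earlier or later controls) can be sketched as follows. This is only an illustration under stated assumptions: the product-of-logistics shape, the function names, and the parameters `half_width` and `sharpness` are hypothetical and are not taken from the SmoothWin package.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_window_weight(t, center, half_width, sharpness=1.0):
    """Weight in [0, 1] for a control measured at time t.

    Hypothetical window shape: a rising logistic edge at center - half_width
    multiplied by a falling logistic edge at center + half_width, rescaled so
    that a control measured exactly at `center` gets weight 1.
    """
    def raw(u):
        rise = _sigmoid(sharpness * (u - (center - half_width)))
        fall = _sigmoid(sharpness * ((center + half_width) - u))
        return rise * fall

    return raw(t) / raw(center)
```

    With `center` set to the mutant test date (in days, say), controls inside the window keep a weight close to 1 and those far outside contribute almost nothing; such weights could then be passed to any weighted regression.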

    Nat Genet

    The function of the majority of genes in the mouse and human genomes remains unknown. The mouse embryonic stem cell knockout resource provides a basis for the characterization of relationships between genes and phenotypes. The EUMODIC consortium developed and validated robust methodologies for the broad-based phenotyping of knockouts through a pipeline comprising 20 disease-oriented platforms. We developed new statistical methods for pipeline design and data analysis aimed at detecting reproducible phenotypes with high power. We acquired phenotype data from 449 mutant alleles, representing 320 unique genes, of which half had no previous functional annotation. We captured data from over 27,000 mice, finding that 83% of the mutant lines are phenodeviant, with 65% demonstrating pleiotropy. Surprisingly, we found significant differences in phenotype annotation according to zygosity. New phenotypes were uncovered for many genes with previously unknown function, providing a powerful basis for hypothesis generation and further investigation in diverse systems. Comment in: Genetic differential calculus. [Nat Genet. 2015] Comment in: Scaling up phenotyping studies. [Nat Biotechnol. 2015]

    Inférence de réseaux de régulation orientés pour les facteurs de transcription d'Arabidopsis thaliana et création de groupes de co-régulation

    This thesis deals with the characterisation of transcription factors, key genes in the regulation of gene expression, in the plant Arabidopsis thaliana. Using expression data, our biological goal is to cluster transcription factors into groups of co-regulator transcription factors and into groups of co-regulated transcription factors. To do so, we propose a two-step procedure. First, we infer the regulatory network between transcription factors. Second, we cluster transcription factors based on their connection patterns to other transcription factors. From a statistical point of view, the transcription factors are the variables and the samples are the observations. The regulatory network between the transcription factors is modelled as a directed graph whose nodes are the variables. The estimation of its edges can be interpreted as a variable selection problem in high dimension with a small number of statistical units. To infer the network, we perform LASSO-type penalised linear regression. A preliminary approach selects a set of variables along the regularisation path using penalised likelihood criteria. However, this approach is unstable and selects too many variables. To overcome this difficulty, we propose and compare two selection procedures, designed to deal with high-dimensional data and combining penalised linear regression with subsampling. The parameters of the two procedures are estimated so as to yield stable sets of variables. The stability of the results is evaluated on data simulated under our graphical model. Subsequently, we apply an unsupervised clustering method to each inferred directed graph to detect groups of co-regulators and groups of co-regulated nodes. To evaluate the proximity between the double classifications of the nodes obtained on different graphs, we have developed an index for comparing pairs of partitions, whose relevance we test and promote. From a practical point of view, we propose a cascade simulation method, required by the complexity of our model and inspired by the parametric bootstrap, to simulate data under our model. We have validated our model by evaluating the proximity between the classifications obtained on real data and on these simulated data.
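
    The combination of LASSO-type regression and subsampling mentioned in this abstract can be illustrated with a generic selection-frequency sketch. It is not the thesis's procedure: the coordinate-descent solver, the fixed penalty `lam`, the half-size subsamples, and all function names are assumptions chosen for illustration, and the data are assumed centred (no intercept is fitted).

```python
import random


def soft_threshold(value, threshold):
    # Shrink value towards zero by threshold; exactly zero inside the band.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, with beta = 0 at the start
    col_norm = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_norm[j] == 0.0:
                continue
            # Covariance of column j with the residual once column j is added back.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_norm[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


def selection_frequencies(X, y, lam, n_subsamples=20, fraction=0.5, seed=0):
    """Share of random subsamples in which each variable has a nonzero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    size = max(2, int(fraction * n))
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), size)
        beta = lasso_coordinate_descent([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_subsamples for c in counts]
```

    In a network setting, one could run this once per target transcription factor, with the other factors as columns of `X`, and draw a directed edge from factor j to the target when its selection frequency exceeds a chosen threshold; that threshold is again an assumption, not a value from the thesis.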

    Inference of directed regulatory networks on the transcription factors of Arabidopsis thaliana and setting up of co-regulation groups

    No full text
    This thesis deals with the characterisation of transcription factors, key genes in the regulation of gene expression, in the plant Arabidopsis thaliana. Using expression data, our biological goal is to cluster transcription factors into groups of co-regulator transcription factors and into groups of co-regulated transcription factors. To do so, we propose a two-step procedure. First, we infer the regulatory network between transcription factors. Second, we cluster transcription factors based on their connection patterns to other transcription factors. From a statistical point of view, the transcription factors are the variables and the samples are the observations. The regulatory network between the transcription factors is modelled as a directed graph whose nodes are the variables. The estimation of its edges can be interpreted as a variable selection problem in high dimension with a small number of statistical units. To infer the network, we perform LASSO-type penalised linear regression. A preliminary approach selects a set of variables along the regularisation path using penalised likelihood criteria. However, this approach is unstable and selects too many variables. To overcome this difficulty, we propose and compare two selection procedures, designed to deal with high-dimensional data and combining penalised linear regression with subsampling. The parameters of the two procedures are estimated so as to yield stable sets of variables. The stability of the results is evaluated on data simulated under our graphical model. Subsequently, we apply an unsupervised clustering method to each inferred directed graph to detect groups of co-regulators and groups of co-regulated nodes. To evaluate the proximity between the double classifications of the nodes obtained on different graphs, we have developed an index for comparing pairs of partitions, whose relevance we test and promote. From a practical point of view, we propose a cascade simulation method, required by the complexity of our model and inspired by the parametric bootstrap, to simulate data under our model. We have validated our model by evaluating the proximity between the classifications obtained on real data and on these simulated data.

    DNA glycoclusters and DNA-based carbohydrate microarrays: From design to applications

    No full text
    Our goal was to design carbohydrate mimetics capable of preventing cell adhesion of pathogens by targeting lectins. The original synthesis of these mimetics combines the automated chemistry of DNA and "click" chemistry. From simple building blocks (i.e. phosphoramidites, solid supports and carbohydrates such as alkyne or azide derivatives), a "Lego" approach provides complex and varied decoys exhibiting various topologies and numbers of carbohydrate residues. The resulting glycomimetics, tagged with a specific DNA sequence, are efficiently immobilized on a DNA microarray by double strand formation. This glycoarray allows the study of the interactions between carbohydrate mimics and lectins using a minute amount of material. This micro system using DNA arrays is much more effective than conventional carbohydrate microarrays. The studies allowed the identification of the important structural parameters for a customized construction of high-affinity carbohydrate mimics.