79 research outputs found

    Exploring and controlling for underlying structure in genome and microbiome case-control association studies

    Case-control association studies in human genetics and the microbiome pave the way to personalized medicine by enabling personalized risk assessment, improved prognosis, or early diagnosis. However, confounding due to population structure or other unobserved factors can produce spurious findings or mask true associations if it is not detected and corrected for. Consequently, underlying structure that is improperly accounted for could explain the lack of power or some of the unsuccessful replications observed in case-control association studies. In addition, points considered outliers are commonly removed in such studies, although they do not always correspond to technical errors. A wealth of methods exists to determine structure in genetic and microbiome association studies. However, there are few systematic comparisons between these methods in the context of genetic or microbiome association studies, and even fewer attempts to apply robust methods, which produce stable estimates of confounding underlying structure and can incorporate information from outliers without degrading the quality of the estimates. The aim of this thesis was therefore to detect and robustly control for underlying confounding structure in genetic and microbiome data, by systematically comparing the most relevant standard and robust forms of methods based on principal components analysis (PCA) or multidimensional scaling (MDS), and by contributing new robust methods. The author's contributions include robustification of existing methods, adaptation to the genetic or microbiome framework, and a dimensionality exploration and reduction method, nSimplices. 
Analysed datasets include a first synthetic example with a low-variance two-group confounding structure, a second synthetic example with a simple linear underlying structure, genome-wide single nucleotide polymorphism (SNP) data from 860 case and control individuals enrolled in the European Prospective Investigation into Cancer and Nutrition (EPIC prostate), and finally, 2,255 microbiome samples from the Human Microbiome Project (HMP). Synthetic or real outliers were added to the second example and to the EPIC and HMP datasets. All meaningful existing and contributed methods were applied to the EPIC and HMP datasets, while a restricted set was applied to the synthetic, illustrative examples. The 10 principal components or top axes resulting from each method were kept for further analysis. The quality of a method was assessed by how well these axes summarized the underlying structure (using Akaike's information criterion, AIC, from the regression of the 10 axes on known underlying structure in the data), and by how robust the estimates remained in the presence of outliers (adjusted R² from the regression of each outlier-disturbed axis on the original axis). In synthetic example 1, only independent component analysis (ICA) was able to uncover the low-variance confounding structure, whereas PCA and MDS failed to do so, in agreement with the fact that these methods detect large rather than small variance or distance components. In synthetic example 2, non-metric MDS remained the most representative and robust method when distance outliers were included, while nSimplices combined with classical MDS was the only method to stay representative and robust when contextual outliers were present. In the EPIC dataset, Eigenstrat was the most representative method (AIC of 782.8), whereas sample ancestry was best captured by the new method gMCD (unbiased genetic relatedness estimates used in a Minimum Covariance Determinant procedure). 
The methods gMCD, spherical PCA, IBS (MDS on Identity-by-State estimates) and nSimplices were more robust than Eigenstrat, with a small to moderate loss in representativity (AIC between 789.6 and 864.9). Association testing yielded p-values comparable with published values on candidate SNPs. Further SNPs with the lowest p-values (rs8071475, rs3799631, rs2589118) were identified, whose known role in other disorders could point to an indirect link with prostate cancer. In the HMP dataset, the new method nSimplices combined with the data-driven normalization method qMDS best mirrored the underlying structure. The most robust method was qMDS (with nSimplices or alone), followed by CSS and MDS. Lastly, the original method nSimplices performed at least comparably in all settings (except ancestry in EPIC), and in some cases considerably better than other methods, while remaining tractable and fast in high-dimensional datasets. The improved performance of gMCD and qMDS agrees with the fact that these methods use adapted measures (genetic relatedness and a selected model distribution, respectively) and recognized robust approaches (minimum covariance determinant and quantiles). Conversely, wMDS is likely to have failed because variance is not an adequate parameter for microbiome data. More generally, different methods report the underlying structure differently and are advantageous in different settings; for example, PCA and non-metric MDS were best in some settings but failed in others. Finally, the original method nSimplices proved useful or markedly better in a variety of settings, with the exception of highly noisy datasets, and provided that distance outliers are corrected. Current genetic case-control association studies tend to integrate several types of data, for example clinical and SNP data, or several omics datasets. These approaches are promising but could be subject to increased inaccuracies or replication issues, merely because several sources of data are combined. 
This motivates greater use of robust methods that can mirror genetic information accurately and stably, such as gMCD, nSimplices or spherical PCA. Nevertheless, the results on Eigenstrat show that it remains a reasonable method. Results in the microbiome confirmed that MDS based on proportions is a suboptimal method, and suggested that the exponential distribution should be considered instead of multinomial-based distributions, certainly because the exponential better represents the inherent competitiveness between phylogenies in the microbiome. Moreover, illustrative and real-world examples showed that methods could capture relevant but different information, which encourages applying several complementary methods when starting to explore a dataset. In particular, a low-variance confounder could stay undetected by some methods. Additionally, methods based on least absolute residuals revealed several shortcomings in spite of their utility in a univariate setting, but their expected benefit in a multivariate setting should motivate the development of more tractable implementations. Finally, SPH, IBS and gMCD are the recommended methods for a genetic SNP dataset, while Eigenstrat should perform best if no more than 2% outliers are present. To mirror structure in a microbiome dataset, nSimplices (combined with qMDS or with CSS) can be expected to perform best, whereas MDS on proportions is likely to underperform. The method nSimplices proved beneficial or largely better in various situations and should therefore be considered for analysing datasets including, but not limited to, genetic SNP data and microbiome abundances.
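
The robustness criterion described in this abstract (adjusted R² from the regression of each outlier-disturbed axis on the original axis) can be illustrated with a minimal sketch. The function name `adjusted_r2` and the toy axes are hypothetical and are not taken from the thesis; the sketch assumes a simple linear regression with one predictor.

```python
def adjusted_r2(original, disturbed):
    """Adjusted R^2 of the simple linear regression disturbed ~ original.

    Hypothetical helper: `original` holds the coordinates of the samples on
    one axis computed without outliers, `disturbed` the coordinates of the
    same samples on that axis after outliers were added.
    """
    n = len(original)
    if n != len(disturbed) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(original) / n
    mean_y = sum(disturbed) / n
    sxx = sum((x - mean_x) ** 2 for x in original)
    syy = sum((y - mean_y) ** 2 for y in disturbed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(original, disturbed))
    if sxx == 0 or syy == 0:
        raise ValueError("an axis with zero variance cannot be compared")
    r2 = sxy ** 2 / (sxx * syy)          # R^2 for a single predictor
    # Adjustment for one predictor: (n - 1) / (n - p - 1) with p = 1.
    return 1 - (1 - r2) * (n - 1) / (n - 2)


# An axis that is only rescaled and shifted is perfectly preserved.
stable = adjusted_r2([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
# One displaced coordinate lowers the score.
shaken = adjusted_r2([0, 1, 2, 3, 4], [0, 1, 2, 3, 10])
print(stable, shaken)
```

Because R² is unchanged when an axis is flipped in sign, this measure is insensitive to the arbitrary orientation of PCA or MDS axes.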

    Practical investigation of the performance of robust logistic regression to predict the genetic risk of hypertension

    Logistic regression is usually applied to investigate the association between inherited genetic variants and a binary disease phenotype. A limitation of the standard methods used to estimate the parameters of logistic regression models is their strong dependence on a few observations that deviate from the majority of the data. We used data from the Genetic Analysis Workshop 18 to explore the possible benefit of robust logistic regression for estimating the genetic risk of hypertension. The comparison between standard and robust methods relied on the influence of departing hypertension profiles (outliers) on the estimated odds ratios, areas under the receiver operating characteristic curves, and clinical net benefit. Our results confirmed that single outliers may substantially affect the estimated genotype relative risks. The ranking of variants by probability values differed between standard and robust logistic regression. For cutoff probabilities between 0.2 and 0.6, the clinical net benefit estimated by leave-one-out cross-validation in the investigated sample was slightly larger under robust regression, but the overall area under the receiver operating characteristic curve was larger for standard logistic regression. The potential advantage of robust statistics in the context of genetic association studies should be investigated in future analyses based on real and simulated data.
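
The abstract does not state how clinical net benefit was computed. As a minimal sketch, assuming the usual decision-curve definition (true positives per person minus false positives per person weighted by the odds of the cutoff probability), it could be evaluated as follows. The function name and the toy data are hypothetical.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit at a cutoff probability (decision-curve definition, assumed).

    y_true    : 0/1 disease status per individual
    risk      : predicted probability of disease per individual
    threshold : cutoff probability, strictly between 0 and 1
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    if n == 0 or n != len(risk):
        raise ValueError("need two non-empty sequences of equal length")
    # Individuals with predicted risk at or above the cutoff are called positive.
    tp = sum(1 for y, p in zip(y_true, risk) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, risk) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


status = [1, 0, 1, 0]
predicted = [0.9, 0.8, 0.3, 0.1]
for cutoff in (0.2, 0.5):
    print(cutoff, net_benefit(status, predicted, cutoff))
```

In a leave-one-out setting such as the one described above, `risk` would hold the predictions obtained for each individual from a model fitted without that individual.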

    Translational adaptation to heat stress is mediated by RNA 5-methylcytosine in Caenorhabditis elegans.

    Methylation of carbon-5 of cytosines (m5C) is a post-transcriptional nucleotide modification of RNA found in all kingdoms of life. While individual m5C-methyltransferases have been studied, the impact of the global cytosine-5 methylome on development, homeostasis and stress remains unknown. Here, using Caenorhabditis elegans, we generated the first organism devoid of m5C in RNA, demonstrating that this modification is non-essential. Using this genetic tool, we determine the localisation and enzymatic specificity of m5C sites in the RNome in vivo. We find that NSUN-4 acts as a dual rRNA and tRNA methyltransferase in C. elegans mitochondria. In agreement with leucine and proline being the most frequently methylated tRNA isoacceptors, loss of m5C impacts the decoding of some triplets of these two amino acids, leading to reduced translation efficiency. Upon heat stress, m5C loss leads to ribosome stalling at UUG triplets, the only codon translated by an m5C34-modified tRNA. This leads to reduced translation efficiency of UUG-rich transcripts and impaired fertility, suggesting a role of m5C tRNA wobble methylation in the adaptation to higher temperatures.

    Using next-generation DNA sequence data for genetic association tests based on allele counts with and without consideration of zero inflation

    The relationship between genetic variability and individual phenotypes is usually investigated by testing for association using called genotypes. Allele counts obtained from next-generation sequence data could be used for this purpose too. Genetic association can be examined by treating alternative allele counts (AACs) as the response variable in negative binomial regression. AACs from sequence data often contain an excess of zeros, which motivates the use of hurdle and zero-inflated models. Here we examine rough type I error rates and the ability to pick out variants with small probability values for 7 different testing approaches that incorporate AACs as an explanatory or as a response variable. Model comparisons relied on chromosome 3 DNA sequence data from 407 Hispanic participants in the Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples (T2D-GENES) project with complete information on diastolic blood pressure and related medication. Our results suggest that, when investigating the relationship between AAC as the response variable and individual phenotypes as explanatory variables, hurdle-negative binomial regression has some advantages. This model showed a good ability to discriminate strongly associated variants and controlled overall type I error rates. However, probability values from hurdle-negative binomial regression were not obtained for approximately 25% of the investigated variants because of convergence problems, and the mass of the probability value distribution was concentrated around 1.
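
To make the distinction between the two families of models concrete, the sketch below writes out the probability mass functions of a zero-inflated and a hurdle negative binomial model for a single count. It assumes the mean/size parameterisation of the negative binomial; the function and parameter names are hypothetical, and the sketch is not the regression procedure used in the study.

```python
import math


def nb_pmf(y, mu, size):
    """Negative binomial pmf with mean `mu` > 0 and dispersion `size` > 0."""
    log_p = (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
             + size * math.log(size / (size + mu))
             + y * math.log(mu / (size + mu)))
    return math.exp(log_p)


def zinb_pmf(y, mu, size, pi_zero):
    """Zero-inflated model: extra zeros are mixed with the count model."""
    base = (1 - pi_zero) * nb_pmf(y, mu, size)
    return pi_zero + base if y == 0 else base


def hurdle_nb_pmf(y, mu, size, pi_zero):
    """Hurdle model: all zeros come from one process, positive counts
    from a negative binomial truncated at zero."""
    if y == 0:
        return pi_zero
    return (1 - pi_zero) * nb_pmf(y, mu, size) / (1 - nb_pmf(0, mu, size))


# With the same parameters, the two models assign different mass to zero.
print(zinb_pmf(0, 2.0, 1.5, 0.3), hurdle_nb_pmf(0, 2.0, 1.5, 0.3))
```

In the zero-inflated model a zero can arise from either component, whereas in the hurdle model the probability of a zero is exactly `pi_zero`.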

    Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets

    Conventional dimensionality reduction methods like Multidimensional Scaling (MDS) are sensitive to the presence of orthogonal outliers, which lead to significant defects in the embedding. We introduce a robust MDS method, called DeCOr-MDS (Detection and Correction of Orthogonal outliers using MDS), based on the geometry and statistics of simplices formed by data points, which detects orthogonal outliers and subsequently reduces dimensionality. We validate our method using synthetic datasets, and further show how it can be applied to a variety of large real biological datasets, including cancer cell image data, Human Microbiome Project data and single-cell RNA sequencing data, to address the task of data cleaning and visualization.
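
The abstract does not spell out which simplex statistics are used. One geometric quantity that can be computed from pairwise distances alone, and that grows when a point lies far outside the subspace spanned by the others, is the height of a point over a simplex. The sketch below computes it with the Cayley-Menger determinant; it is an illustration under that assumption, not the published DeCOr-MDS procedure, and all function names are hypothetical.

```python
import math


def _det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def simplex_volume(dist, idx):
    """Volume of the simplex spanned by the points `idx`, from distances only."""
    k = len(idx) - 1                      # dimension of the simplex
    if k == 0:
        return 1.0                        # convention for a single point
    size = k + 2
    cm = [[0.0] * size for _ in range(size)]
    for i in range(1, size):
        cm[0][i] = cm[i][0] = 1.0
        for j in range(1, size):
            cm[i][j] = dist[idx[i - 1]][idx[j - 1]] ** 2
    squared = (-1) ** (k + 1) * _det(cm) / (2 ** k * math.factorial(k) ** 2)
    return math.sqrt(max(squared, 0.0))   # guard against rounding below zero


def height(dist, base, apex):
    """Height of point `apex` over the simplex spanned by the points `base`."""
    k = len(base)
    base_volume = simplex_volume(dist, list(base))
    if base_volume == 0:
        raise ValueError("base simplex is degenerate")
    return k * simplex_volume(dist, list(base) + [apex]) / base_volume


# Toy check: the last point sits 2 units above the plane of the first three.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 2)]
dist = [[math.dist(p, q) for q in points] for p in points]
print(height(dist, [0, 1, 2], 3))
```

A point whose heights over many simplices are large compared with the typical height in the dataset would be a candidate orthogonal outlier under this assumption.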

    Peri-operative red blood cell transfusion in neonates and infants: NEonate and Children audiT of Anaesthesia pRactice IN Europe: A prospective European multicentre observational study

    BACKGROUND: Little is known about current clinical practice concerning peri-operative red blood cell transfusion in neonates and small infants. Guidelines suggest transfusions based on haemoglobin thresholds ranging from 8.5 to 12 g dl⁻¹, distinguishing between children from birth to day 7 (week 1), from day 8 to day 14 (week 2) or from day 15 (≥week 3) onwards. OBJECTIVE: To observe peri-operative red blood cell transfusion practice according to guidelines in relation to patient outcome. DESIGN: A multicentre observational study. SETTING: The NEonate-Children sTudy of Anaesthesia pRactice IN Europe (NECTARINE) trial recruited patients up to 60 weeks' postmenstrual age undergoing anaesthesia for surgical or diagnostic procedures from 165 centres in 31 European countries between March 2016 and January 2017. PATIENTS: The data included 5609 patients undergoing 6542 procedures. The inclusion criterion was a peri-operative red blood cell transfusion. MAIN OUTCOME MEASURES: The primary endpoint was the haemoglobin level triggering a transfusion for neonates in week 1, week 2 and week 3. Secondary endpoints were transfusion volumes, 'delta haemoglobin' (preprocedure minus transfusion-triggering) and 30-day and 90-day morbidity and mortality. RESULTS: Peri-operative red blood cell transfusions were recorded during 447 procedures (6.9%). The median haemoglobin levels triggering a transfusion were 9.6 [IQR 8.7 to 10.9] g dl⁻¹ for neonates in week 1, 9.6 [7.7 to 10.4] g dl⁻¹ in week 2 and 8.0 [7.3 to 9.0] g dl⁻¹ in week 3. The median transfusion volume was 17.1 [11.1 to 26.4] ml kg⁻¹ with a median delta haemoglobin of 1.8 [0.0 to 3.6] g dl⁻¹. Thirty-day morbidity was 47.8% with an overall mortality of 11.3%. CONCLUSIONS: Results indicate lower transfusion-triggering haemoglobin thresholds in clinical practice than suggested by current guidelines. 
The high morbidity and mortality of this NECTARINE sub-cohort call for investigative action and evidence-based guidelines addressing peri-operative red blood cell transfusion strategies. TRIAL REGISTRATION: ClinicalTrials.gov, identifier: NCT02350348.

    Imported schistosomiasis in a group of tourists returning from Mali (original title: La bilharziose d'importation chez un groupe de touristes au retour du Mali)


    Division of labour: tRNA methylation by the NSun2 tRNA methyltransferases Trm4a and Trm4b in fission yeast

    Enzymes of the cytosine-5 RNA methyltransferase Trm4/NSun2 family methylate C48 and C49 in multiple tRNAs, as well as C34 and C40 in selected tRNAs. In contrast to most other organisms, the fission yeast Schizosaccharomyces pombe carries two Trm4/NSun2 homologs, Trm4a (SPAC17D4.04) and Trm4b (SPAC23C4.17). Here, we have employed tRNA methylome analysis to determine the dependence of tRNA cytosine-5 methylation (m5C) in vivo on the two enzymes. Remarkably, Trm4a is responsible for all C48 methylation, which lies in the tRNA variable loop, as well as for C34 in tRNALeuCAA and tRNAProCGG, which are at the anticodon wobble position. Conversely, Trm4b methylates C49 and C50, which both lie in the TΨC-stem. Thus, S. pombe shows an unusual separation of activities of the NSun2/Trm4 enzymes that are united in a single enzyme in other eukaryotes such as humans, mice and Saccharomyces cerevisiae. Furthermore, in vitro activity assays showed that Trm4a displays intron-dependent methylation of C34, whereas Trm4b activity is independent of the intron. The absence of Trm4a, but not Trm4b, causes a mild resistance of S. pombe to calcium chloride.