npInv: accurate detection and genotyping of inversions using long read sub-alignment
BACKGROUND: Detection of genomic inversions remains challenging. Many existing methods primarily target inversions with a non-repetitive breakpoint, leaving inverted repeat (IR) mediated non-allelic homologous recombination (NAHR) inversions largely unexplored. RESULT: We present npInv, a novel tool specifically for detecting and genotyping NAHR inversions using long read sub-alignment of long read sequencing data. We benchmark npInv with other tools in both simulation and real data. We use npInv to generate a whole-genome inversion map for NA12878 consisting of 30 NAHR inversions (of which 15 are novel), including all previously known NAHR mediated inversions in NA12878 with flanking IR less than 7kb. Our genotyping accuracy on this dataset was 94%. We used PCR to confirm the presence of two of these novel inversions. We show that there is a near-linear relationship between the length of flanking IR and the minimum inversion size, without inverted repeats. CONCLUSION: The application of npInv shows high accuracy in both simulation and real data. The results give deeper insight into understanding inversion
Fregene: Simulation of realistic sequence-level data in populations and ascertained samples
Background: FREGENE simulates sequence-level data over large genomic regions in large populations. Because, unlike coalescent simulators, it works forwards through time, it allows complex scenarios of selection, demography, and recombination to be modelled simultaneously. Detailed tracking of sites under selection is implemented in FREGENE and provides the opportunity to test theoretical predictions and gain new insights into mechanisms of selection. We describe here the main functionalities of both FREGENE and SAMPLE, a companion program that can replicate association study datasets. Results: We report detailed analyses of six large simulated datasets that we have made publicly available. Three demographic scenarios are modelled: one panmictic, one substructured with migration, and one complex scenario that mimics the principal features of genetic variation in major worldwide human populations. For each scenario there is one neutral simulation, and one with a complex pattern of selection. Conclusion: FREGENE and the simulated datasets will be valuable for assessing the validity of models for selection, demography and population genetic parameters, as well as the efficacy of association studies. Its principal advantages are modelling flexibility and computational efficiency. It is open source and object-oriented. As such, it can be customised and the range of models extended
Pathway Analysis of GWAS Provides New Insights into Genetic Susceptibility to 3 Inflammatory Diseases
Although the introduction of genome-wide association studies (GWAS) has greatly increased the number of genes associated with common diseases, only a small proportion of the predicted genetic contribution has so far been elucidated. Studying the cumulative variation of polymorphisms in multiple genes acting in functional pathways may provide a complementary approach to the more common single SNP association approach in understanding genetic determinants of common disease. We developed a novel pathway-based method to assess the combined contribution of multiple genetic variants acting within canonical biological pathways and applied it to data from 14,000 UK individuals with 7 common diseases. We tested inflammatory pathways for association with Crohn's disease (CD), rheumatoid arthritis (RA) and type 1 diabetes (T1D) with 4 non-inflammatory diseases as controls. Using a variable selection algorithm, we identified variants responsible for the pathway association and evaluated their use for disease prediction using a 10-fold cross-validation framework in order to calculate the out-of-sample area under the receiver operating characteristic curve (AUC). The generalisability of these predictive models was tested on an independent birth cohort from Northern Finland. Multiple canonical inflammatory pathways showed highly significant associations (p 10^−3 to 10^−20) with CD, T1D and RA. Variable selection identified on average a set of 205 SNPs (149 genes) for T1D, 350 SNPs (189 genes) for RA and 493 SNPs (277 genes) for CD. The pattern of polymorphisms at these SNPs was found to be highly predictive of T1D (91% AUC) and RA (85% AUC), and weakly predictive of CD (60% AUC). The T1D model (without any parameter refitting) had good predictive ability (79% AUC) in the Finnish cohort. Our analysis suggests that genetic contribution to common inflammatory diseases operates through multiple genes interacting in functional pathways
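As an illustration of the AUC figures quoted in the abstract above, the sketch below computes the area under the ROC curve as the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half. It is a generic textbook calculation, not the authors' pipeline; the function name `auc` and the example scores are hypothetical.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    Hypothetical helper for illustration only: counts the fraction of
    (case, control) pairs in which the case has the higher predicted score,
    counting ties as half a win.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Made-up out-of-sample scores for three cases and two controls.
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3])
print(round(example_auc, 3))
```

In a 10-fold cross-validation setting, scores for each held-out fold would be produced by a model fitted on the other nine folds before being passed to a function like this one.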
Evaluation of Host Serum Protein Biomarkers of Tuberculosis in sub-Saharan Africa.
Accurate and affordable point-of-care diagnostics for tuberculosis (TB) are needed. Host serum protein signatures have been derived for use in primary care settings; however, validation of these in secondary care settings is lacking. We evaluated serum protein biomarkers discovered in primary care cohorts from Africa reapplied to patients from secondary care. In this nested case-control study, concentrations of 22 proteins were quantified in sera from 292 patients from Malawi and South Africa who presented predominantly to secondary care. Recruitment was based upon intention of local clinicians to test for TB. The case definition for TB was culture positivity for Mycobacterium tuberculosis; and for other diseases (OD) a confirmed alternative diagnosis. Equal numbers of TB and OD patients were selected. Within each group, there were equal numbers with and without HIV and from each site. Patients were split into training and test sets for biosignature discovery. A nine-protein signature to distinguish TB from OD was discovered comprising fibrinogen, alpha-2-macroglobulin, CRP, MMP-9, transthyretin, complement factor H, IFN-gamma, IP-10, and TNF-alpha. This signature had an area under the receiver operating characteristic curve in the training set of 90% (95% CI 86-95%), and, after adjusting the cut-off for increased sensitivity, a sensitivity and specificity in the test set of 92% (95% CI 80-98%) and 71% (95% CI 56-84%), respectively. The best single biomarker was complement factor H [area under the receiver operating characteristic curve 70% (95% CI 64-76%)]. Biosignatures consisting of host serum proteins may function as point-of-care screening tests for TB in African hospitals. Complement factor H is identified as a new biomarker for such signatures
Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies
Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation
Natural and Orthogonal Interaction framework for modeling gene-environment interactions with application to lung cancer
Objectives: We aimed at extending the Natural and Orthogonal Interaction (NOIA) framework, developed for modeling gene-gene interactions in the analysis of quantitative traits, to allow for reduced genetic models, dichotomous traits, and gene-environment interactions. We evaluate the performance of the NOIA statistical models using simulated data and lung cancer data. Methods: The NOIA statistical models are developed for additive, dominant, and recessive genetic models as well as for a binary environmental exposure. Using the Kronecker product rule, a NOIA statistical model is built to model gene-environment interactions. By treating the genotypic values as the logarithm of odds, the NOIA statistical models are extended to the analysis of case-control data. Results: Our simulations showed that power for testing associations while allowing for interaction using the NOIA statistical model is much higher than using functional models for most of the scenarios we simulated. When applied to lung cancer data, much smaller p values were obtained using the NOIA statistical model for either the main effects or the SNP-smoking interactions for some of the SNPs tested. Conclusion: The NOIA statistical models are usually more powerful than the functional models in detecting main effects and interaction effects for both quantitative traits and binary traits. Copyright (C) 2012 S. Karger AG, Base
Genome-wide association study of primary tooth eruption identifies pleiotropic loci associated with height and craniofacial distances
Twin and family studies indicate that the timing of primary tooth eruption is highly heritable, with estimates typically exceeding 80%. To identify variants involved in primary tooth eruption we performed a population-based genome-wide association study of ‘age at first tooth’ and ‘number of teeth’ using 5998 and 6609 individuals respectively from the Avon Longitudinal Study of Parents and Children (ALSPAC) and 5403 individuals from the 1966 Northern Finland Birth Cohort (NFBC1966). We tested 2,446,724 SNPs imputed in both studies. Analyses were controlled for the effect of gestational age, sex and age of measurement. Results from the two studies were combined using fixed effects inverse variance meta-analysis. We identified a total of fifteen independent loci, with ten loci reaching genome-wide significance (p < 5 × 10^−8) for ‘age at first tooth’ and eleven loci for ‘number of teeth’. Together these associations explain 6.06% of the variation in ‘age at first tooth’ and 4.76% of the variation in ‘number of teeth’. The identified loci included eight previously unidentified loci, some containing genes known to play a role in tooth and other developmental pathways, including a SNP in the protein-coding region of BMP4 (rs17563, P = 9.080 × 10^−17). Three of these loci, containing the genes HMGA2, AJUBA and ADK, also showed evidence of association with craniofacial distances, particularly those indexing facial width. Our results suggest that the genome-wide association approach is a powerful strategy for detecting variants involved in tooth eruption, and potentially craniofacial growth and more generally organ development
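The fixed effects inverse variance meta-analysis mentioned in the abstract above can be sketched as follows. This is the standard textbook estimator, in which each study's effect estimate is weighted by the reciprocal of its squared standard error; it is not taken from the authors' code, and the function name and the per-study numbers are hypothetical.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Combine per-study effect estimates with inverse-variance weights.

    Hypothetical helper for illustration. Returns the pooled effect,
    its standard error, and the corresponding z statistic.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


# Made-up effect sizes for one SNP in two cohorts (not real study values).
beta, se, z = fixed_effects_meta([0.10, 0.20], [0.05, 0.05])
print(beta, se, z)
```

With equal standard errors the pooled effect is simply the average of the two estimates; a more precise study would pull the pooled estimate towards its own value.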
Diagnostic Test Accuracy of a 2-Transcript Host RNA Signature for Discriminating Bacterial vs Viral Infection in Febrile Children.
IMPORTANCE: Because clinical features do not reliably distinguish bacterial from viral infection, many children worldwide receive unnecessary antibiotic treatment, while bacterial infection is missed in others. OBJECTIVE: To identify a blood RNA expression signature that distinguishes bacterial from viral infection in febrile children. DESIGN, SETTING, AND PARTICIPANTS: Febrile children presenting to participating hospitals in the United Kingdom, Spain, the Netherlands, and the United States between 2009 and 2013 were prospectively recruited, comprising a discovery group and validation group. Each group was classified after microbiological investigation as having definite bacterial infection, definite viral infection, or indeterminate infection. RNA expression signatures distinguishing definite bacterial from viral infection were identified in the discovery group and diagnostic performance assessed in the validation group. Additional validation was undertaken in separate studies of children with meningococcal disease (n = 24) and inflammatory diseases (n = 48) and on published gene expression datasets. EXPOSURES: A 2-transcript RNA expression signature distinguishing bacterial infection from viral infection was evaluated against clinical and microbiological diagnosis. MAIN OUTCOMES AND MEASURES: Definite bacterial and viral infection was confirmed by culture or molecular detection of the pathogens. Performance of the RNA signature was evaluated in the definite bacterial and viral group and in the indeterminate infection group. RESULTS: The discovery group of 240 children (median age, 19 months; 62% male) included 52 with definite bacterial infection, of whom 36 (69%) required intensive care, and 92 with definite viral infection, of whom 32 (35%) required intensive care. Ninety-six children had indeterminate infection. Analysis of RNA expression data identified a 38-transcript signature distinguishing bacterial from viral infection.
A smaller (2-transcript) signature (FAM89A and IFI44L) was identified by removing highly correlated transcripts. When this 2-transcript signature was implemented as a disease risk score in the validation group (130 children, with 23 definite bacterial, 28 definite viral, and 79 indeterminate infections; median age, 17 months; 57% male), all 23 patients with microbiologically confirmed definite bacterial infection were classified as bacterial (sensitivity, 100% [95% CI, 100%-100%]) and 27 of 28 patients with definite viral infection were classified as viral (specificity, 96.4% [95% CI, 89.3%-100%]). When applied to additional validation datasets from patients with meningococcal and inflammatory diseases, bacterial infection was identified with a sensitivity of 91.7% (95% CI, 79.2%-100%) and 90.0% (95% CI, 70.0%-100%), respectively, and with specificity of 96.0% (95% CI, 88.0%-100%) and 95.8% (95% CI, 89.6%-100%). Of the children in the indeterminate groups, 46.3% (63/136) were classified as having bacterial infection, although 94.9% (129/136) received antibiotic treatment. CONCLUSIONS AND RELEVANCE: This study provides preliminary data regarding test accuracy of a 2-transcript host RNA signature discriminating bacterial from viral infection in febrile children. Further studies are needed in diverse groups of patients to assess accuracy and clinical utility of this test in different clinical settings
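The sensitivity and specificity reported for the validation group follow directly from the counts given in the abstract (23 of 23 definite bacterial cases classified as bacterial, 27 of 28 definite viral cases classified as viral). The sketch below reproduces that arithmetic; the function name is hypothetical, and the confidence intervals are not recomputed here.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    Hypothetical helper for illustration; 'positive' means classified
    as bacterial infection.
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Counts quoted in the abstract for the validation group.
sens, spec = sensitivity_specificity(true_pos=23, false_neg=0,
                                     true_neg=27, false_pos=1)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```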
MultiPhen: Joint Model of Multiple Phenotypes Can Increase Discovery in GWAS
The genome-wide association study (GWAS) approach has discovered hundreds of genetic variants associated with diseases and quantitative traits. However, despite clinical overlap and statistical correlation between many phenotypes, GWAS are generally performed one-phenotype-at-a-time. Here we compare the performance of modelling multiple phenotypes jointly with that of the standard univariate approach. We introduce a new method and software, MultiPhen, that models multiple phenotypes simultaneously in a fast and interpretable way. By performing ordinal regression, MultiPhen tests the linear combination of phenotypes most associated with the genotypes at each SNP, and thus potentially captures effects hidden to single phenotype GWAS. We demonstrate via simulation that this approach provides a dramatic increase in power in many scenarios. There is a boost in power for variants that affect multiple phenotypes and for those that affect only one phenotype. While other multivariate methods have similar power gains, we describe several benefits of MultiPhen over these. In particular, we demonstrate that other multivariate methods that assume the genotypes are normally distributed, such as canonical correlation analysis (CCA) and MANOVA, can have highly inflated type-1 error rates when testing case-control or non-normal continuous phenotypes, while MultiPhen produces no such inflation. To test the performance of MultiPhen on real data we applied it to lipid traits in the Northern Finland Birth Cohort 1966 (NFBC1966). In these data MultiPhen discovers 21% more independent SNPs with known associations than the standard univariate GWAS approach, while applying MultiPhen in addition to the standard approach provides 37% increased discovery. 
The most associated linear combinations of the lipids estimated by MultiPhen at the leading SNPs accurately reflect the Friedewald Formula, suggesting that MultiPhen could be used to refine the definition of existing phenotypes or uncover novel heritable phenotypes
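For readers unfamiliar with the Friedewald Formula referred to above, it estimates LDL cholesterol as a linear combination of other lipid measurements, which is why a fitted linear combination of lipid traits can resemble it. The sketch below states the commonly used form for concentrations in mg/dL; the function name is hypothetical, and the formula is generally considered unreliable at high triglyceride levels (commonly cited as 400 mg/dL and above).

```python
def friedewald_ldl(total_cholesterol, hdl, triglycerides):
    """Estimate LDL cholesterol (mg/dL) with the Friedewald formula.

    LDL = total cholesterol - HDL - triglycerides / 5, all in mg/dL.
    Hypothetical helper for illustration; it does not check the
    triglyceride range in which the approximation is considered valid.
    """
    return total_cholesterol - hdl - triglycerides / 5.0


# Made-up lipid panel values in mg/dL.
print(friedewald_ldl(total_cholesterol=200.0, hdl=50.0, triglycerides=150.0))
```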
Identification of novel locus associated with coronary artery aneurysms and validation of loci for susceptibility to Kawasaki disease
Kawasaki disease (KD) is a paediatric vasculitis associated with coronary artery aneurysms (CAA). Genetic variants influencing susceptibility to KD have been previously identified, but no risk alleles have been validated that influence CAA formation. We conducted a genome-wide association study (GWAS) for CAA in KD patients of European descent with 200 cases and 276 controls. A second GWAS for susceptibility pooled KD cases with healthy paediatric controls from vaccine trials in the UK (n = 1609). Logistic regression mixed models were used for both GWASs. The susceptibility GWAS was meta-analysed with 400 KD cases and 6101 controls from a previous European GWAS; these results were further meta-analysed with Japanese GWASs at two putative loci. The CAA GWAS identified an intergenic region of chromosome 20q13 with multiple SNVs showing genome-wide significance. The risk allele of the most associated SNV (rs6017006) was present in 13% of cases and 4% of controls; in East Asian 1000 Genomes data, the allele was absent or rare. Meta-analysis of the susceptibility GWAS with previously published European data identified two previously associated loci (ITPKC and FCGR2A). Further meta-analysis with Japanese GWAS summary data from the CASP3 and FAM167A genomic regions validated these loci in Europeans, showing consistent effects of the top SNVs in both populations. We identified a novel locus for CAA in KD patients of European descent. The results suggest that different genes determine susceptibility to KD and development of CAA, and future work should focus on the function of the intergenic region on chromosome 20q13