
    Performance of random forest when SNPs are in linkage disequilibrium

    Background: Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build a RF. Results: We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetics models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype. Conclusion: Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome wide association studies.
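
The abstract above turns on how strongly two SNPs are correlated. As a minimal sketch, not taken from the paper, the standard pairwise LD statistics D and r² can be computed from haplotype and allele frequencies. The function name and the example frequencies below are illustrative assumptions.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  marginal frequency of allele A at the first locus
    p_b:  marginal frequency of allele B at the second locus
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    # D measures how far the haplotype frequency departs from independence.
    d = p_ab - p_a * p_b
    # r^2 normalises D^2 by the allele-frequency variances at both loci.
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared


# Hypothetical frequencies: the haplotype frequency equals the product of the
# allele frequencies, so the two loci are in linkage equilibrium.
d_le, r2_le = linkage_disequilibrium(0.12, 0.3, 0.4)

# Hypothetical frequencies: allele A and allele B always occur together,
# so the two loci are in complete LD.
d_ld, r2_ld = linkage_disequilibrium(0.3, 0.3, 0.3)
```

A SNP pair with r² near 1 carries almost the same information, which is the situation in which the abstract reports diminished importance for the true risk SNP.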

    Improving the yield of circulating tumour cells facilitates molecular characterisation and recognition of discordant HER2 amplification in breast cancer

    BACKGROUND: Circulating tumour cells (CTCs) offer a non-invasive approach to obtain and characterise metastatic tumour cells, but their usefulness has been limited by low CTC yields from conventional isolation methods. METHODS: To improve CTC yields and facilitate their molecular characterisation we compared the Food and Drug Administration-approved CellSearch Epithelial Kit (CEK) to a simplified CTC capture method, CellSearch Profile Kit (CPK), on paired blood samples from patients with metastatic breast (n=75) and lung (n=71) cancer. Molecular markers including Human Epidermal growth factor Receptor 2 (HER2) were evaluated on CTCs by fluorescence in situ hybridisation (FISH) and compared to patients' primary and metastatic cancer. RESULTS: The median cell count from patients with breast cancer using the CPK was 117 vs 4 for CEK (P<0.0001). Lung cancer samples were similar; CPK: 145 cells vs CEK: 4 cells (P<0.0001). Recovered CTCs were relatively pure (60-70%) and were evaluable by FISH and immunofluorescence. A total of 10 of 30 (33%) breast cancer patients with HER2-negative primary and metastatic tissue had HER2-amplified CTCs. CONCLUSION: The CPK method provides a high yield of relatively pure CTCs, facilitating their molecular characterisation. Circulating tumour cells obtained using CPK technology demonstrate that significant discordance exists between HER2 amplification of a patient's CTCs and that of the primary and metastatic tumour.

    A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage

    Record linkage (RL) is the process of identifying and linking data that relate to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indices for pairwise comparisons. A key component of record linkage is accuracy assessment: the process of manually verifying and validating matched pairs to further refine linkage parameters and increase its overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer's intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from huge Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from a 10-fold cross-validation method. Results show that logistic regression outperforms other classifiers and enables the creation of a generalized, very accurate model to validate linkage results.
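
The evaluation metrics named in the abstract above (sensitivity, specificity and positive predictive value) follow directly from confusion-matrix counts. The sketch below is an illustration only, not the authors' pipeline. The function name and the toy labels are hypothetical, with 1 meaning "true match" and 0 meaning "non-match".

```python
def linkage_metrics(true_labels, predicted_labels):
    """Compute sensitivity, specificity and PPV for binary match decisions."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == 1 and predicted == 1:
            tp += 1  # correctly accepted match
        elif truth == 0 and predicted == 1:
            fp += 1  # non-match wrongly accepted
        elif truth == 0 and predicted == 0:
            tn += 1  # correctly rejected non-match
        else:
            fn += 1  # true match wrongly rejected

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a class is absent.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Toy example: ten candidate record pairs reviewed against a gold standard.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = linkage_metrics(truth, predicted)
```

In a 10-fold cross-validation such as the one the abstract describes, these values would be computed on each held-out fold and then summarised across folds.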

    Evaluating Nuclei Concentration in Amyloid Fibrillation Reactions Using Back-Calculation Approach

    Background: In spite of our extensive knowledge of the more than 20 proteins associated with different amyloid diseases, we do not know how amyloid toxicity occurs or how to block its action. Recent contradictory reports suggest that the fibrils and/or the oligomer precursors cause toxicity. An estimate of their temporal concentration may broaden understanding of the amyloid aggregation process. Methodology/Principal Findings: Assuming that conversion of folded protein to fibril is initiated by a nucleation event, we back-calculate the distribution of nuclei concentration. The temporal in vitro concentration of nuclei for the model hormone, recombinant human insulin, is estimated to be in the picomolar range. This is a conservative estimate since the back-calculation method is likely to overestimate the nuclei concentration because it does not take into consideration fibril fragmentation, which would lower the amount of nuclei. Conclusions: Because of their propensity to form aggregates (non-ordered) and fibrils (ordered), this very low concentration could explain the difficulty in isolating and blocking oligomers or nuclei toxicity and the long onset time for amyloid diseases.

    Transfection of IL-10 expression vectors into endothelial cultures attenuates α4β7-dependent lymphocyte adhesion mediated by MAdCAM-1

    BACKGROUND: Enhanced expression of MAdCAM-1 (mucosal addressin cell adhesion molecule-1) is associated with the onset and progression of inflammatory bowel disease. The clinical significance of elevated MAdCAM-1 expression is supported by studies showing that immunoneutralization of MAdCAM-1 or its ligands reduces inflammation and mucosal damage in models of colitis. Interleukin-10 (IL-10) is an endogenous anti-inflammatory and immunomodulatory cytokine that has been shown to prevent inflammation and injury in several animal studies; however, clinical IL-10 treatment remains insufficient because of difficulties in the route of IL-10 administration and its biological half-life. Here, we examined the ability of introducing an IL-10 expression vector into endothelial cultures to reduce responses to a proinflammatory cytokine, TNF-α. METHODS: A human IL-10 expression vector was transfected into high endothelial venular ('HEV') cells (SVEC4-10); we then examined TNF-α induced lymphocyte adhesion to lymphatic endothelial cells and TNF-α induced expression of MAdCAM-1 and compared these responses to control monolayers. RESULTS: Transfection of the IL-10 vector into endothelial cultures significantly reduced TNF-α induced, MAdCAM-1 dependent lymphocyte adhesion (compared to non-transfected cells). IL-10 transfected endothelial cells expressed less than half (46 ± 6.6%) of the MAdCAM-1 induced by TNF-α (set as 100%) in non-transfected (control) cells. CONCLUSION: Our results suggest that gene therapy of the gut microvasculature with IL-10 vectors may be useful in the clinical treatment of IBD.

    Dissection of genetic associations with language-related traits in population-based cohorts

    Recent advances in the field of language-related disorders have led to the identification of candidate genes for specific language impairment (SLI) and dyslexia. Replication studies have been conducted in independent samples including population-based cohorts, which can be characterised for a large number of relevant cognitive measures. The availability of a wide range of phenotypes allows us not only to identify the most suitable traits for replication of genetic association but also to refine the associated cognitive trait. In addition, it is possible to test for pleiotropic effects across multiple phenotypes, which could explain the extensive comorbidity observed across SLI, dyslexia and other neurodevelopmental disorders. The availability of genome-wide genotype data for such cohorts will facilitate this kind of analysis, but important issues, such as multiple test corrections, have to be taken into account, considering that small effect sizes are expected to underlie such associations.

    Lattice Boltzmann simulations of soft matter systems

    This article concerns numerical simulations of the dynamics of particles immersed in a continuum solvent. As prototypical systems, we consider colloidal dispersions of spherical particles and solutions of uncharged polymers. After a brief explanation of the concept of hydrodynamic interactions, we give a general overview of the various simulation methods that have been developed to cope with the resulting computational problems. We then focus on the approach we have developed, which couples a system of particles to a lattice Boltzmann model representing the solvent degrees of freedom. The standard D3Q19 lattice Boltzmann model is derived and explained in depth, followed by a detailed discussion of complementary methods for the coupling of solvent and solute. Colloidal dispersions are best described in terms of extended particles with appropriate boundary conditions at the surfaces, while particles with internal degrees of freedom are easier to simulate as an arrangement of mass points with frictional coupling to the solvent. In both cases, particular care has been taken to simulate thermal fluctuations in a consistent way. The usefulness of this methodology is illustrated by studies from our own research, where the dynamics of colloidal and polymeric systems has been investigated in both equilibrium and nonequilibrium situations. Comment: Review article, submitted to Advances in Polymer Science. 16 figures, 76 pages.
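
The D3Q19 model mentioned in the abstract above uses 19 discrete velocities on a cubic lattice. As a minimal sketch, not code from the review, the velocity set and the commonly quoted weights (1/3 for the rest velocity, 1/18 for the six axis neighbours, 1/36 for the twelve face-diagonal neighbours) can be built and checked against the isotropy conditions they must satisfy. The helper names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def d3q19():
    """Build the D3Q19 velocity set and weights in lattice units."""
    velocities, weights = [], []
    for c in product((-1, 0, 1), repeat=3):
        speed_sq = sum(x * x for x in c)
        if speed_sq == 0:
            w = Fraction(1, 3)    # rest velocity
        elif speed_sq == 1:
            w = Fraction(1, 18)   # six axis-aligned neighbours
        elif speed_sq == 2:
            w = Fraction(1, 36)   # twelve face-diagonal neighbours
        else:
            continue              # the eight cube corners are not part of D3Q19
        velocities.append(c)
        weights.append(w)
    return velocities, weights


def moment(velocities, weights, *axes):
    """Weighted sum of products of velocity components along the given axes."""
    total = Fraction(0)
    for c, w in zip(velocities, weights):
        term = w
        for axis in axes:
            term *= c[axis]
        total += term
    return total


velocities, weights = d3q19()
zeroth = moment(velocities, weights)            # weights sum to one
first_x = moment(velocities, weights, 0)        # odd moments vanish by symmetry
second_xx = moment(velocities, weights, 0, 0)   # squared speed of sound, c_s^2
second_xy = moment(velocities, weights, 0, 1)   # off-diagonal term vanishes
```

Exact fractions are used so the moment conditions can be compared without floating-point tolerance; a production solver would store the weights as floats.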

    A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci.

    We conducted a multi-stage, genome-wide association study of bladder cancer with a primary scan of 591,637 SNPs in 3,532 affected individuals (cases) and 5,120 controls of European descent from five studies followed by a replication strategy, which included 8,382 cases and 48,275 controls from 16 studies. In a combined analysis, we identified three new regions associated with bladder cancer on chromosomes 22q13.1, 19q12 and 2q37.1: rs1014971 (P = 8 × 10⁻¹²) maps to a non-genic region of chromosome 22q13.1, rs8102137 (P = 2 × 10⁻¹¹) on 19q12 maps to CCNE1 and rs11892031 (P = 1 × 10⁻⁷) maps to the UGT1A cluster on 2q37.1. We confirmed four previously identified genome-wide associations on chromosomes 3q28, 4p16.3, 8q24.21 and 8q24.3, validated previous candidate associations for the GSTM1 deletion (P = 4 × 10⁻¹¹) and a tag SNP for NAT2 acetylation status (P = 4 × 10⁻¹¹), and found interactions with smoking in both regions. Our findings on common variants associated with bladder cancer risk should provide new insights into the mechanisms of carcinogenesis.

    A Two-Stage Random Forest-Based Pathway Analysis Method

    Pathway analysis provides a powerful approach for identifying the joint effect of genes grouped into biologically-based pathways on disease. Pathway analysis is also an attractive approach for a secondary analysis of genome-wide association study (GWAS) data that may still yield new results from these valuable datasets. Most current pathway analysis methods focus on testing the cumulative main effects of genes in a pathway. However, for complex diseases, gene-gene interactions are expected to play a critical role in disease etiology. We extended a random forest-based method for pathway analysis by incorporating a two-stage design. We used simulations to verify that the proposed method has the correct type I error rates. We also used simulations to show that the method is more powerful than the original random forest-based pathway approach and the set-based test implemented in PLINK in the presence of gene-gene interactions. Finally, we applied the method to a breast cancer GWAS dataset and a lung cancer GWAS dataset, and identified interesting pathways that have implications for breast and lung cancers.