23 research outputs found

    Assessing batch effects of genotype calling algorithm BRLMM for the Affymetrix GeneChip Human Mapping 500 K array set using 270 HapMap samples

    Background: Genome-wide association studies (GWAS) aim to identify genetic variants (usually single nucleotide polymorphisms [SNPs]) across the entire human genome that are associated with phenotypic traits such as disease status and drug response. Highly accurate and reproducible genotype calling is paramount, since errors introduced by calling algorithms can inflate false associations between genotype and phenotype. Most genotype calling algorithms currently used for GWAS are based on multiple arrays. Because hundreds of gigabytes (GB) of raw data are generated from a GWAS, the samples are typically partitioned into batches containing subsets of the entire dataset for genotype calling. High call rates and accuracies have been achieved. However, the effects of batch size (i.e., the number of chips analyzed together) and of batch composition (i.e., the choice of chips in a batch) on call rate and accuracy, as well as the propagation of these effects into the lists of significantly associated SNPs, have not been investigated. In this paper, we analyzed the effects of both batch size and batch composition on the genotype calling algorithm BRLMM using raw data of 270 HapMap samples analyzed with the Affymetrix Human Mapping 500 K array set. Results: Using data from 270 HapMap samples interrogated with the Affymetrix Human Mapping 500 K array set, three different batch sizes and three different batch compositions were used for genotyping with the BRLMM algorithm. Comparative analysis of the calling results and the corresponding lists of significant SNPs identified through association analysis revealed that both batch size and composition affected the genotype calling results and the significantly associated SNPs. Batch size and batch composition effects were more severe on samples and SNPs with lower call rates than on those with higher call rates, and on heterozygous genotype calls than on homozygous genotype calls. Conclusion: Batch size and composition affect the genotype calling results in GWAS using BRLMM. The larger the differences in batch sizes, the larger the effect. The more homogeneous the samples in the batches, the more consistent the genotype calls. The inconsistency propagates to the lists of significantly associated SNPs identified in downstream association analysis. Thus, uniform and large batch sizes should be used to make genotype calls for GWAS. In addition, samples of high homogeneity should be placed into the same batch.

    ASSOCIATION TESTS THAT ACCOMMODATE GENOTYPING ERRORS

    High-throughput SNP arrays provide estimates of genotypes for up to one million loci and are often used in genome-wide association studies. While these estimates are typically very accurate, genotyping errors do occur, and they can particularly influence the most extreme test statistics and p-values. Estimates of the genotype uncertainties are also available, although they are typically ignored. In this manuscript, we develop a framework to incorporate these genotype uncertainties in case-control studies for any genetic model. We verify that the assumption of a “local alternative” in the score test is very reasonable for effect sizes typically seen in SNP association studies, and show that the power of the score test is simply a function of the correlation of the genotype probabilities with the true genotypes. We demonstrate that the power to detect a true association can be substantially increased for difficult-to-call genotypes, resulting in improved inference in association studies.

    SNP HiTLink: a high-throughput linkage analysis system employing dense SNP data

    Background: Over the past decade, microarray-based single nucleotide polymorphism (SNP) data have become more widely used as markers for linkage analysis in the identification of loci for disease-associated genes. Although microarray-based SNP analyses have markedly reduced genotyping time and cost compared with microsatellite-based analyses, applying these enormous datasets to linkage analysis programs is a time-consuming step, thus necessitating a high-throughput platform. Results: We have developed SNP HiTLink (SNP High Throughput Linkage analysis system). In this system, SNP chip data of the Affymetrix Mapping 100 k/500 k array set and Genome-Wide Human SNP array 5.0/6.0 can be directly imported and passed to parametric or model-free linkage analysis programs: MLINK, Superlink, Merlin and Allegro. Various marker-selecting functions are implemented to avoid the effect of typing-error data, markers in linkage equilibrium or to select informative data. Conclusion: The results using the 100 k SNP dataset were comparable or even superior to those obtained from analyses using microsatellite markers in terms of the LOD scores obtained. General personal computers are sufficient to execute the process, as the runtime for whole-genome analysis was less than a few hours. This system can be widely applied to linkage analysis using microarray-based SNP data and can be expected to deliver high-throughput, reliable linkage analysis.

    Identification of novel schizophrenia loci by homozygosity mapping using DNA microarray analysis.

    The recent development of high-resolution DNA microarrays, in which hundreds of thousands of single nucleotide polymorphisms (SNPs) are genotyped, enables the rapid identification of susceptibility genes for complex diseases. Clusters of these SNPs may show runs of homozygosity (ROHs) that can be analyzed for association with disease. An analysis of patients whose parents were first cousins enables the search for autozygous segments in their offspring. Here, using the Affymetrix® Genome-Wide Human SNP Array 5.0 to determine ROHs, we genotyped 9 individuals with schizophrenia (SCZ) whose parents were first cousins. We identified overlapping ROHs on chromosomes 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 17, 19, 20, and 21 in at least 3 individuals. Only the locus on chromosome 5 has been reported previously. The ROHs on chromosome 5q23.3-q31.1 include the candidate genes histidine triad nucleotide binding protein 1 (HINT1) and acyl-CoA synthetase long-chain family member 6 (ACSL6). Other overlapping ROHs may contain novel rare recessive variants that affect SCZ specifically in our samples, given the highly heterogeneous nature of SCZ. Analysis of patients whose parents are first cousins may provide new insights for the genetic analysis of psychiatric diseases.

    Population Genetic Analysis Of Entire Genomes, From SNP Discovery To Genome-Wide Scans For Selection

    The analysis of molecular genetic data has driven the fields of molecular biology, genetics, population genetics, and quantitative genetics for over half a century. Only recently, though, has technology advanced to the point where molecular genetic data can be acquired cheaply and efficiently for the entire genome of several individuals, enabling scientists to conduct genome-wide comparisons between several individuals or several population samples and to ask comprehensive questions regarding the nature of genetic variation in extant populations and the evolutionary forces in the population's history that generated and influenced this variation. Several challenges exist to using these new technologies successfully, however, and in most cases both experimental optimization of laboratory protocols and the customization or de novo implementation of computational and statistical analysis methods are required to obtain adequate results. Even when the raw physical data acquired by these technologies have been successfully rendered into biologically meaningful molecular genetic data, the analysis of these large, genome-wide datasets is formidable and again requires advanced and customized methods to ask biologically motivated questions and produce conclusive results that may not have been obtainable without complete genome information. Here, I discuss two main technologies for the acquisition of genome-wide molecular data, next-generation sequencing technologies and fixed-array highly multiplexed SNP genotyping, and discuss the challenges in applying them in plant systems. Additionally, I demonstrate a population genetic analysis for the detection of recent selective sweeps in four subpopulations of Oryza sativa (cultivated Asian rice) and one Oryza rufipogon population (wild Asian rice) using the genome-wide molecular data acquired by next-generation sequencing. The development of an improved and accurate statistical method to detect selection in population genomic analysis, combined with genome-wide data in each of these subpopulations, allowed the extent and location of selective sweeps in Oryza sativa subpopulations and its wild progenitor Oryza rufipogon to be quantified and compared for the first time. The comparison reveals that each cultivated subpopulation appears to have a largely unique and independent selective and domestication history, but that several alleles advantageous for rice cultivation that originated and were selected for in one subpopulation have been introduced into other subpopulations by introgression.

    Somatic Copy Number Mosaicism Contributes to Genomic Diversity in Mus musculus

    Copy number variants (CNVs) are a source of genomic variation associated with altered phenotypes. Somatic copy number mosaicism results when different populations of cells in an individual differ due to de novo copy number changes (CNCs). Tissue-specific patterns of CNCs resulting in mosaicism have yet to be characterized in the mouse, an organism frequently used to model human diseases. Here, DNA was sampled from the spleen, liver, and cerebellum of eight highly related mice selected from a familial unit. CNVs and CNCs were detected using the Mouse Diversity Genotyping Array with three computational methods (ConsecN, Partek, and PennCNV). Tissue-specific patterns of CNCs were revealed, including genomic regions of putative recurring CNCs. Genetic distance estimated using CNVs and CNCs recapitulated genealogical relationships. The novel framework can thus be used to identify and analyze tissue-specific CNCs, and the results establish the need to account for CNCs in model organisms.

    Linkage disequilibrium based genotype calling from low-coverage shotgun sequencing reads

    Background: Recent technology advances have enabled sequencing of individual genomes, promising to revolutionize biomedical research. However, deep sequencing remains more expensive than microarrays for performing whole-genome SNP genotyping. Results: In this paper we introduce a new multi-locus statistical model and computationally efficient genotype calling algorithms that integrate shotgun sequencing data with linkage disequilibrium (LD) information extracted from reference population panels such as HapMap or the 1000 Genomes Project. Experiments on publicly available 454, Illumina, and ABI SOLiD sequencing datasets suggest that integrating LD information yields genotype calling accuracy from low-coverage sequencing data comparable to that of microarray platforms. A software package implementing our algorithm, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/GeneSeq/. Conclusions: Integration of LD information leads to significant improvements in genotype calling accuracy compared to prior LD-oblivious methods, making low-coverage sequencing a viable alternative to microarrays for conducting large-scale genome-wide association studies.

    Genetic and Metabolic Characterization of Insomnia

    Insomnia is reported to chronically affect 10 to 15% of the adult population. However, very little is known about the genetics and metabolism of insomnia. Here we surveyed 10,038 Korean subjects whose genotypes had previously been profiled on a genome-wide scale. About 16.5% reported insomnia and displayed distinct metabolic changes reflecting an increase in insulin secretion, a higher risk of diabetes, and disrupted calcium signaling. Insomnia-associated genotypic differences were highly concentrated within genes involved in neural function. The most significant SNPs resided in ROR1 and PLCB1, genes known to be involved in bipolar disorder and schizophrenia, respectively. Putative enhancers, as indicated by the histone mark H3K4me1, were discovered within both genes near the significant SNPs. In neuronal cells, the enhancers were bound by PAX6, a neural transcription factor that is essential for central nervous system development. Open chromatin signatures were found on the enhancers in human pancreas, a tissue where PAX6 is known to play a role in insulin secretion. In PLCB1, CTCF was found to bind downstream of the enhancer and interact with PAX6, suggesting that it may inhibit gene activation by PAX6. PLCB4, a circadian gene located close downstream of PLCB1, was identified as a candidate target gene. Hence, dysregulation of ROR1, PLCB1, or PLCB4 by PAX6 and CTCF may be one mechanism that links neural and pancreatic dysfunction, not only in insomnia but also in the related psychiatric disorders that are accompanied by circadian rhythm disruption and metabolic syndrome.