17 research outputs found

    Fast k-NNG construction with GPU-based quick multi-select

    In this paper we describe a new brute force algorithm for building the k-Nearest Neighbor Graph (k-NNG). The k-NNG algorithm has many applications in areas such as machine learning, bioinformatics, and clustering analysis. While there are very efficient algorithms for low-dimensional data, for high-dimensional data brute force search is the best algorithm. The algorithm has two main parts. The first is computing the distances between the input vectors, which may be formulated as a matrix multiplication problem. The second is selecting the k-NNs for each of the query vectors. For the second part, we describe a novel graphics processing unit (GPU)-based multi-select algorithm based on quick sort. Our optimization makes clever use of warp voting functions available on the latest GPUs along with user-controlled cache. Benchmarks show significant improvement over state-of-the-art implementations of the k-NN search on GPUs.
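The two-part structure described in this abstract can be illustrated with a minimal CPU sketch in plain Python. It is not the paper's GPU implementation: there are no warp voting functions or cache management here, and the function names (`squared_distances`, `quickselect_k_smallest`, `knn_graph`) are hypothetical. The distance step uses the expansion ||q - p||^2 = ||q||^2 + ||p||^2 - 2 q·p, whose cross term is what a matrix multiplication would compute, and the selection step uses quicksort-style partitioning to pick the k smallest distances without fully sorting each row.

```python
import random


def squared_distances(queries, points):
    """Pairwise squared Euclidean distances via ||q||^2 + ||p||^2 - 2 q.p."""
    q_norms = [sum(x * x for x in q) for q in queries]
    p_norms = [sum(x * x for x in p) for p in points]
    return [
        [qn + pn - 2.0 * sum(a * b for a, b in zip(q, p))
         for p, pn in zip(points, p_norms)]
        for q, qn in zip(queries, q_norms)
    ]


def quickselect_k_smallest(items, k):
    """Return the k smallest items (in no particular order) by partitioning."""
    items = list(items)
    if k >= len(items):
        return items
    result = []
    while k > 0:
        pivot = random.choice(items)
        lower = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        upper = [x for x in items if x > pivot]
        if k <= len(lower):
            # All wanted items are below the pivot; recurse into that part.
            items = lower
        elif k <= len(lower) + len(equal):
            # The pivot group completes the selection.
            result.extend(lower)
            result.extend(equal[:k - len(lower)])
            k = 0
        else:
            # Keep everything up to the pivot and continue above it.
            result.extend(lower)
            result.extend(equal)
            k -= len(lower) + len(equal)
            items = upper
    return result


def knn_graph(points, k):
    """For each point, list the indices of its k nearest other points."""
    dist = squared_distances(points, points)
    graph = []
    for i, row in enumerate(dist):
        candidates = [(d, j) for j, d in enumerate(row) if j != i]
        nearest = sorted(quickselect_k_smallest(candidates, k))
        graph.append([j for _, j in nearest])
    return graph
```

Calling `knn_graph(points, k)` on a list of coordinate tuples returns one neighbor list per point, ordered by increasing distance. Only the k selected candidates are sorted at the end, which is the point of using selection rather than a full sort.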

    Utilizing Genotype Imputation for the Augmentation of Sequence Data

    In recent years, capabilities for genotyping large sets of single nucleotide polymorphisms (SNPs) have increased considerably, with the ability to genotype over 1 million SNP markers across the genome. This advancement in technology has led to an increase in the number of genome-wide association studies (GWAS) for various complex traits. These GWAS have implicated over 1500 SNPs associated with disease traits. However, the SNPs identified from these GWAS are not necessarily the functional variants. Therefore, the next phase in GWAS will involve the refining of these putative loci. A next step for GWAS would be to catalog all variants, especially rarer variants, within the detected loci, followed by the association analysis of the detected variants with the disease trait. However, sequencing a locus in a large number of subjects is still relatively expensive. A more cost-effective approach would be to sequence a portion of the individuals, followed by the application of genotype imputation methods for imputing markers in the remaining individuals. A potentially attractive alternative would be to impute based on the 1000 Genomes Project; however, this has the drawback of using a reference population that does not necessarily match the disease status and LD pattern of the study population. We explored a variety of approaches for carrying out the imputation using a reference panel consisting of sequence data for a fraction of the study participants, using data from both a candidate gene sequencing study and the 1000 Genomes Project. Imputation of genetic variation based on a proportion of sequenced samples is feasible. Our results indicate the following sequencing study design guidelines, which take advantage of the recent advances in genotype imputation methodology: select the largest and most diverse reference panel for sequencing, and genotype as many "anchor" markers as possible.

    Genome scan for drought tolerance in sorghum.

    Sorghum is adapted to extreme environments where abiotic stresses such as drought limit grain and biomass production, as in the vast regions of the Brazilian Cerrado. Because it is a staple food in regions of the world where food production is still a challenge, increasing drought tolerance in sorghum is important for global food security, particularly under a climate change scenario. In addition, because the sorghum genome is smaller and less duplicated than those of grasses such as maize and sugarcane, sorghum can be used to elucidate the genetic determinants of drought tolerance in other species. In this work, genome-wide association mapping was used to identify genomic regions associated with traits related to drought tolerance in two environments, Janaúba (MG) and Teresina (PI). A total of 265,587 SNP markers were tested for associations with different traits in a sorghum panel of 243 accessions. Heritability estimates were moderate to high, and the maximum reduction in grain yield caused by drought stress was 57% in Teresina. Association tests with a model simultaneously incorporating population structure and the relationship matrix revealed several SNPs associated with different traits, some of which were stable across environments. bitstream/item/160901/1/bol-152.pd

    FastMap: Fast eQTL mapping in homozygous populations

    Motivation: Gene expression Quantitative Trait Locus (eQTL) mapping measures the association between transcript expression and genotype in order to find genomic locations likely to regulate transcript expression. The availability of both gene expression and high-density genotype data has improved our ability to perform eQTL mapping in inbred mouse and other homozygous populations. However, existing eQTL mapping software does not scale well when the numbers of transcripts and markers are on the order of 10^5 and 10^5 to 10^6, respectively.

    TEAM: efficient two-locus epistasis tests in human genome-wide association study

    As a promising tool for identifying genetic markers underlying phenotypic differences, the genome-wide association study (GWAS) has been extensively investigated in recent years. In GWAS, detecting epistasis (or gene–gene interaction) is preferable to single-locus study, since many diseases are known to be complex traits. A brute force search is infeasible for epistasis detection at the genome-wide scale because of the intensive computational burden. Existing epistasis detection algorithms are designed for datasets consisting of homozygous markers and small sample sizes. In human studies, however, the genotype may be heterozygous, and the number of individuals can be up to thousands. Thus, existing methods are not readily applicable to human datasets. In this article, we propose an efficient algorithm, TEAM, which significantly speeds up epistasis detection for human GWAS. Our algorithm is exhaustive, i.e. it does not ignore any epistatic interaction. Utilizing the minimum spanning tree structure, the algorithm incrementally updates the contingency tables for epistatic tests without scanning all individuals. Our algorithm has broader applicability and is more efficient than existing methods for large-sample studies. It supports any statistical test that is based on contingency tables, and enables both family-wise error rate and false discovery rate control. Extensive experiments show that our algorithm only needs to examine a small portion of the individuals to update the contingency tables, and it achieves at least an order of magnitude speedup over the brute force approach.
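The incremental-update idea mentioned in this abstract can be illustrated with a simplified sketch. It is not TEAM's implementation: it handles a single SNP against a phenotype rather than a two-locus table, it omits the minimum spanning tree that decides which neighboring table to update from, and the function names are hypothetical. It shows only the core observation that, once the list of individuals whose genotypes differ between two SNPs is known, the table for the second SNP can be derived from the table for the first by touching those individuals alone.

```python
from collections import Counter


def build_table(genotypes, phenotypes):
    """Contingency table: count of individuals per (genotype, phenotype) cell."""
    return Counter(zip(genotypes, phenotypes))


def differing_individuals(genotypes_a, genotypes_b):
    """Indices where two SNPs differ (computed once and stored for reuse)."""
    return [i for i, (a, b) in enumerate(zip(genotypes_a, genotypes_b)) if a != b]


def update_table(table_a, genotypes_a, genotypes_b, phenotypes, diff_indices):
    """Derive SNP b's table from SNP a's table by visiting only diff_indices."""
    updated = Counter(table_a)
    for i in diff_indices:
        updated[(genotypes_a[i], phenotypes[i])] -= 1  # leave the old cell
        updated[(genotypes_b[i], phenotypes[i])] += 1  # enter the new cell
    return +updated  # unary plus drops cells whose count fell to zero
```

When two SNPs agree on most individuals, `diff_indices` is short, so the update costs far less than rebuilding the table with `build_table` from scratch.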

    Using Population Mixtures to Optimize the Utility of Genomic Databases: Linkage Disequilibrium and Association Study Design in India

    When performing association studies in populations that have not been the focus of large-scale investigations of haplotype variation, it is often helpful to rely on genomic databases in other populations for study design and analysis, such as in the selection of tag SNPs and in the imputation of missing genotypes. One way of improving the use of these databases is to rely on a mixture of database samples that is similar to the population of interest, rather than using the single most similar database sample. We demonstrate the effectiveness of the mixture approach in the application of African, European, and East Asian HapMap samples for tag SNP selection in populations from India, a genetically intermediate region underrepresented in genomic studies of haplotype variation. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/65949/1/j.1469-1809.2008.00457.x.pd