
    Second-generation PLINK: rising to the challenge of larger and richer datasets

    PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information. The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
    Comment: 2 figures, 1 additional fil
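
For readers unfamiliar with the Hardy-Weinberg exact test mentioned in this abstract, the following is a minimal sketch of the textbook version, which enumerates every possible heterozygote count. It is not PLINK 1.9's O(sqrt(n))-time implementation. The function name `hwe_exact_p` and its argument names are hypothetical, chosen only for illustration.

```python
from math import exp, lgamma, log


def hwe_exact_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Two-sided exact Hardy-Weinberg p-value for biallelic genotype counts.

    Illustrative O(n) enumeration; NOT the O(sqrt(n)) algorithm from PLINK 1.9.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    n_a = 2 * n_aa + n_ab  # copies of allele A in the sample
    n_b = 2 * n_bb + n_ab  # copies of allele B in the sample

    def log_prob(het: int) -> float:
        # log P(het heterozygotes | n individuals, n_a copies of A) under HWE
        aa = (n_a - het) // 2
        bb = (n_b - het) // 2
        return (het * log(2.0)
                + lgamma(n + 1) - lgamma(aa + 1) - lgamma(het + 1) - lgamma(bb + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    observed = log_prob(n_ab)
    total = 0.0
    # Heterozygote counts must share the parity of the allele count.
    for het in range(n_a % 2, min(n_a, n_b) + 1, 2):
        lp = log_prob(het)
        if lp <= observed + 1e-9:  # outcomes no more likely than the observed one
            total += exp(lp)
    return min(1.0, total)
```

A sample with counts (25, 50, 25) sits at the mode of the distribution, so the p-value is 1. A sample with no heterozygotes at all, such as (50, 0, 50), gives a vanishingly small p-value.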

    The South Asian genome

    Keywords: Genetics of disease; Microarrays; Variant genotypes; Population genetics; Sequence alignment; Alleles
    The genetic sequence variation of people from the Indian subcontinent, who comprise one-quarter of the world's population, is not well described. We carried out whole genome sequencing of 168 South Asians, along with whole-exome sequencing of 147 South Asians to provide deeper characterisation of coding regions. We identify 12,962,155 autosomal sequence variants, including 2,946,861 new SNPs and 312,738 novel indels. This catalogue of SNPs and indels amongst South Asians provides the first comprehensive map of genetic variation in this major human population, and reveals evidence for selective pressures on genes involved in skin biology, metabolism, infection and immunity. Our results will accelerate the search for the genetic variants underlying susceptibility to disorders such as type-2 diabetes and cardiovascular disease, which are highly prevalent amongst South Asians.
    Funding: Whole genome sequencing to discover genetic variants underlying type-2 diabetes, coronary heart disease and related phenotypes amongst Indian Asians. Imperial College Healthcare NHS Trust cBRC 2011-13 (JS Kooner [PI], JC Chambers)

    Fast and scalable inference of multi-sample cancer lineages.

    Somatic variants can be used as lineage markers for the phylogenetic reconstruction of cancer evolution. Since somatic phylogenetics is complicated by sample heterogeneity, novel specialized tree-building methods are required for cancer phylogeny reconstruction. We present LICHeE (Lineage Inference for Cancer Heterogeneity and Evolution), a novel method that automates the phylogenetic inference of cancer progression from multiple somatic samples. LICHeE uses variant allele frequencies of somatic single nucleotide variants obtained by deep sequencing to reconstruct multi-sample cell lineage trees and infer the subclonal composition of the samples. LICHeE is open source and available at http://viq854.github.io/lichee
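
The variant allele frequency (VAF) used as input here is the fraction of reads at a site that carry the variant allele. The sketch below computes per-sample VAFs and groups variants by which samples they appear in. The grouping step, the 0.05 presence threshold, and all function names are illustrative assumptions, not a description of LICHeE's actual algorithm.

```python
def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads supporting the variant allele (0.0 if no coverage)."""
    if total_reads == 0:
        return 0.0
    return alt_reads / total_reads


def presence_profile(vafs, threshold=0.05):
    """Binary presence/absence pattern across samples (threshold is hypothetical)."""
    return tuple(int(v >= threshold) for v in vafs)


def group_by_profile(snvs, threshold=0.05):
    """Group SNV names by their presence pattern.

    `snvs` maps an SNV name to a list of (alt_reads, total_reads) pairs,
    one pair per sample, all in the same sample order.
    """
    groups = {}
    for name, counts in snvs.items():
        vafs = [variant_allele_frequency(alt, total) for alt, total in counts]
        groups.setdefault(presence_profile(vafs, threshold), []).append(name)
    return groups
```

Variants that share a presence pattern are natural candidates for the same branch of a lineage tree, which is why a grouping of this kind is a common first step before tree building.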

    Structural Prediction of Protein–Protein Interactions by Docking: Application to Biomedical Problems

    A huge amount of genetic information is available thanks to the recent advances in sequencing technologies and the larger computational capabilities, but the interpretation of such genetic data at phenotypic level remains elusive. One of the reasons is that proteins are not acting alone, but are specifically interacting with other proteins and biomolecules, forming intricate interaction networks that are essential for the majority of cell processes and pathological conditions. Thus, characterizing such interaction networks is an important step in understanding how information flows from gene to phenotype. Indeed, structural characterization of protein–protein interactions at atomic resolution has many applications in biomedicine, from diagnosis and vaccine design, to drug discovery. However, despite the advances of experimental structural determination, the number of interactions for which there is available structural data is still very small. In this context, a complementary approach is computational modeling of protein interactions by docking, which is usually composed of two major phases: (i) sampling of the possible binding modes between the interacting molecules and (ii) scoring for the identification of the correct orientations. In addition, prediction of interface and hot-spot residues is very useful in order to guide and interpret mutagenesis experiments, as well as to understand functional and mechanistic aspects of the interaction. Computational docking is already being applied to specific biomedical problems within the context of personalized medicine, for instance, helping to interpret pathological mutations involved in protein–protein interactions, or providing modeled structural data for drug discovery targeting protein–protein interactions.
    Funding: Spanish Ministry of Economy grant number BIO2016-79960-R; D.B.B. is supported by a predoctoral fellowship from CONACyT; M.R. is supported by an FPI fellowship from the Severo Ochoa program. We are grateful to the Joint BSC-CRG-IRB Programme in Computational Biology.
    Peer Reviewed. Postprint (author's final draft
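
The two docking phases named in this abstract, sampling and scoring, can be illustrated with a toy rigid-body search on a 2D grid. This is a teaching sketch only: real docking programs work on 3D atomic structures with far richer scoring functions. The contact-minus-clash score, the clash penalty of 10, and all names are assumptions made for the example.

```python
from itertools import product


def rotate(cells, quarter_turns):
    """Rotate a set of (x, y) grid cells by multiples of 90 degrees about the origin."""
    out = set(cells)
    for _ in range(quarter_turns % 4):
        out = {(-y, x) for x, y in out}
    return out


def score(receptor, ligand, clash_penalty=10):
    """Scoring phase: reward surface contacts, penalise overlapping cells."""
    clashes = len(receptor & ligand)
    contacts = sum(
        1
        for (x, y) in ligand - receptor
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
        if (x + dx, y + dy) in receptor
    )
    return contacts - clash_penalty * clashes


def dock(receptor, ligand, search=range(-5, 6)):
    """Sampling phase: try every rotation and translation, keep the best-scoring pose.

    Returns (score, quarter_turns, dx, dy); ties keep the first pose found.
    """
    best = None
    for turns, dx, dy in product(range(4), search, search):
        pose = {(x + dx, y + dy) for x, y in rotate(ligand, turns)}
        s = score(receptor, pose)
        if best is None or s > best[0]:
            best = (s, turns, dx, dy)
    return best
```

With a U-shaped receptor and a one-cell ligand, the search places the ligand in the pocket, where it touches three receptor cells.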

    Integrated genome and transcriptome analysis of rare neuromuscular disorders

    Thesis (Master's), Seoul National University Graduate School, College of Medicine, Department of Biomedical Sciences, August 2019. Choi Murim (최무림).
    Introduction. Whole exome sequencing has become a robust and standard tool for rare disease diagnosis thanks to advantages in cost and data handling. However, whole exome sequencing-based diagnosis rates typically do not exceed 50%, which can be attributed to the difficulty of interpreting variants of uncertain significance, as well as to the disregard of non-coding variants, including variants in intronic and regulatory regions in the genome. Therefore, I explored the utility of transcriptome sequencing as a complementary approach in the diagnosis of rare neuromuscular disorders.
    Methods. Whole exome sequencing data from 94 patients with undiagnosed neuromuscular disorders were collected from Seoul National University Children's Hospital and analyzed for variants in known neuromuscular disease genes. Additional transcriptome sequencing was performed for 63 of the whole exome sequenced patients and for ten patients without genome data. Transcriptome data were used for cryptic damaging variant, differential expression, aberrant splicing, and allele-specific expression analyses. Furthermore, non-negative matrix factorization was applied to identify expression-based clustering, and cluster-specific gene ontology was derived.
    Results. Whole exome sequencing analysis identified candidate variants in 49% of patients, with 83% of them located within known disease genes. Structural variants with questionable pathogenicity were discovered in twelve cases. RNA-sequencing based variant calling led to further discovery of heterozygous candidate variants in nine samples, five of which did not undergo whole exome sequencing. Allele-specific expression identified two likely candidate genes, and differential gene expression analysis led to the prioritization of sets of genes in an additional four samples. Lastly, aberrant splicing of DMD, TTN and MICU1 was detected in each of four samples. Non-negative matrix factorization-based clustering resulted in the identification of six clusters with distinct gene expression profiles.
    Discussion. Firstly, I aimed to evaluate whether transcriptome sequencing can provide additional evidence for the interpretation of whole exome sequencing variants. Overall, transcriptome sequencing was able to detect abnormalities associated with the previously identified mutation in less than 30% of positive whole exome sequencing cases. For samples without a whole exome sequencing result, I successfully used transcriptome sequencing to identify potential pathogenic causes in 18 cases. In conclusion, transcriptome sequencing proved to be a useful tool for the diagnosis of whole exome sequencing negative samples, but did not prove to have great utility for the interpretation of pathogenic whole exome sequencing variants.
    Contents:
    1. Introduction: 1.1. Advancement through next generation sequencing; 1.2. Genetics of neuromuscular disorders (NMD); 1.3. Transcriptome sequencing-based NMD diagnosis
    2. Methods: 2.1. Data collection; 2.2. Whole exome sequencing data analysis; 2.3. Transcriptome sequencing analysis; 2.4. Non-negative matrix factorization based clustering
    3. Results: 3.1. Data collection; 3.2. Phenotype information; 3.3. Whole exome sequencing results; 3.4. Transcriptome sequencing quality control; 3.5. Transcriptome-based clustering; 3.6. Exome variants in transcriptome sequencing; 3.7. Transcriptome-sequencing based diagnosis
    4. Discussion
    5. References
    6. Appendix: 6.1. Supplementary Figures; 6.2. Supplementary Tables
    7. Abstract in Korean
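
The abstract does not say how allele-specific expression was tested. One common approach, shown here purely as an assumption, is an exact two-sided binomial test of reference versus alternate read counts at a heterozygous site against the balanced expectation of 0.5. The function name and parameters are hypothetical, and the sketch is only suitable for modest read depths.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p_ref: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance at one site.

    Sums the probabilities of all outcomes that are no more likely than the
    observed split of reads between the two alleles.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0

    def pmf(k: int) -> float:
        return comb(n, k) * p_ref ** k * (1.0 - p_ref) ** (n - k)

    observed = pmf(ref_reads)
    # The small relative tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1.0 + 1e-9))
    return min(1.0, total)
```

A balanced site such as 5 reference and 5 alternate reads gives a p-value of 1, while a fully one-sided site with 10 reads on a single allele gives 2/1024.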

    Genetic ancestry of participants in the National Children's Study.

    Background: The National Children's Study (NCS) is a prospective epidemiological study in the USA tasked with identifying a nationally representative sample of 100,000 children, and following them from their gestation until they are 21 years of age. The objective of the study is to measure environmental and genetic influences on growth, development, and health. Determination of the ancestry of these NCS participants is important for assessing the diversity of study participants and for examining the effect of ancestry on various health outcomes.
    Results: We estimated the genetic ancestry of a convenience sample of 641 parents enrolled at the 7 original NCS Vanguard sites, by analyzing 30,000 markers on exome arrays, using the 1000 Genomes Project superpopulations as reference populations, and compared this with the measures of self-reported ethnicity and race. For 99% of the individuals, self-reported ethnicity and race agreed with the predicted superpopulation. NCS individuals self-reporting as Asian had genetic ancestry of either South Asian or East Asian groups, while those reporting as either Hispanic White or Hispanic Other had similar genetic ancestry. Of the 33 individuals who self-reported as Multiracial or Non-Hispanic Other, 33% matched the South Asian or East Asian groups, while these groups represented only 4.4% of the other reported categories.
    Conclusions: Our data suggest that self-reported ethnicity and race have some limitations in accurately capturing Hispanic and South Asian populations. Overall, however, our data indicate that despite the complexity of the US population, individuals know their ancestral origins, and that self-reported ethnicity and race is a reliable indicator of genetic ancestry

    Using GWAS Data to Identify Copy Number Variants Contributing to Common Complex Diseases

    Copy number variants (CNVs) account for more polymorphic base pairs in the human genome than do single nucleotide polymorphisms (SNPs). CNVs encompass genes as well as noncoding DNA, making these polymorphisms good candidates for functional variation. Consequently, most modern genome-wide association studies test CNVs along with SNPs, after inferring copy number status from the data generated by high-throughput genotyping platforms. Here we give an overview of CNV genomics in humans, highlighting patterns that inform methods for identifying CNVs. We describe how genotyping signals are used to identify CNVs and provide an overview of existing statistical models and methods used to infer location and carrier status from such data, especially the most commonly used methods exploring hybridization intensity. We compare the power of such methods with the alternative method of using tag SNPs to identify CNV carriers. As such methods are only powerful when applied to common CNVs, we describe two alternative approaches that can be informative for identifying rare CNVs contributing to disease risk. We focus particularly on methods identifying de novo CNVs and show that such methods can be more powerful than case-control designs. Finally, we present some recommendations for identifying CNVs contributing to common complex disorders.
    Comment: Published at http://dx.doi.org/10.1214/09-STS304 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org
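
To make the idea of inferring copy number from hybridization intensity concrete, here is a deliberately simple sketch that labels runs of consecutive probes whose log intensity ratio stays beyond a threshold. The statistical models this review covers are far more sophisticated. The thresholds of -0.3 and 0.3, the minimum run of 3 probes, and the function name are hypothetical values chosen for illustration.

```python
def call_cnv_segments(log_ratios, del_threshold=-0.3, dup_threshold=0.3, min_probes=3):
    """Return (start_index, end_index, label) for runs of aberrant probes.

    `log_ratios` is a list of per-probe log intensity ratios in genomic order.
    Runs shorter than `min_probes` are ignored as likely noise.
    """
    def state(value):
        if value <= del_threshold:
            return "deletion"
        if value >= dup_threshold:
            return "duplication"
        return "normal"

    segments = []
    start = 0
    for i in range(1, len(log_ratios) + 1):
        # Close the current run at the end of the data or when the state changes.
        if i == len(log_ratios) or state(log_ratios[i]) != state(log_ratios[start]):
            label = state(log_ratios[start])
            if label != "normal" and i - start >= min_probes:
                segments.append((start, i - 1, label))
            start = i
    return segments
```

The minimum run length reflects why intensity-based calling favours larger events: a single noisy probe is indistinguishable from a very small CNV.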

    A reference haplotype panel for genome-wide imputation of short tandem repeats.

    Short tandem repeats (STRs) are involved in dozens of Mendelian disorders and have been implicated in complex traits. However, genotyping arrays used in genome-wide association studies focus on single nucleotide polymorphisms (SNPs) and do not readily allow identification of STR associations. We leverage next-generation sequencing (NGS) from 479 families to create a SNP + STR reference haplotype panel. Our panel enables imputing STR genotypes into SNP array data when NGS is not available for directly genotyping STRs. Imputed genotypes achieve mean concordance of 97% with observed genotypes in an external dataset compared to 71% expected under a naive model. Performance varies widely across STRs, with near perfect concordance at bi-allelic STRs vs. 70% at highly polymorphic repeats. Imputation increases power over individual SNPs to detect STR associations with gene expression. Imputing STRs into existing SNP datasets will enable the first large-scale STR association studies across a range of complex traits
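
The concordance figures quoted in this abstract compare imputed genotypes with directly observed ones. A minimal sketch of such a comparison follows, assuming each genotype is an unordered pair of repeat lengths and that concordance means an exact match of both alleles. The paper's exact definition is not given in the abstract, so this scoring rule and the function name are assumptions.

```python
def genotype_concordance(imputed, observed):
    """Fraction of samples whose imputed genotype equals the observed genotype.

    Each genotype is a pair of allele lengths, e.g. (10, 12); allele order is
    ignored. Samples with a missing call (None) on either side are skipped.
    """
    pairs = [(i, o) for i, o in zip(imputed, observed) if i is not None and o is not None]
    if not pairs:
        return float("nan")
    matches = sum(1 for i, o in pairs if sorted(i) == sorted(o))
    return matches / len(pairs)
```

Under this rule a site with many alleles offers more ways to be wrong, which is consistent with the lower concordance reported at highly polymorphic repeats.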