71 research outputs found

Statistical Approaches for Next-Generation Sequencing Data

Author: Qiao Dandi
Publication venue: 'Harvard University Botany Libraries'
Publication date: 06/02/2015

During the last two decades, genotyping technology has advanced rapidly, enabling the tremendous success of genome-wide association studies (GWAS) in the search for disease susceptibility loci (DSLs). However, only a small fraction of the overall predicted heritability can be explained by the DSLs discovered. One possible explanation for this “missing heritability” phenomenon is that many causal variants are rare. The recent development of high-throughput next-generation sequencing (NGS) technology provides the instrument to look closely at these rare variants with precision and efficiency. However, new approaches for both the storage and analysis of sequencing data are urgently needed. In this thesis, we introduce three methods that can be used in the management and analysis of sequencing data. In Chapter 1, we propose a novel and simple algorithm for compressing sequencing data that leverages the scarcity of rare variant data, enabling efficient storage and analysis of sequencing data in current hardware environments. We also provide a C++ implementation that supports direct and parallel loading of the compressed format without requiring extra time for decompression. Chapters 2 and 3 focus on the association analysis of sequencing data in population-based designs. In Chapter 2, we present a statistical methodology that identifies genetic outliers to obtain a genetically homogeneous subpopulation, which reduces the false positives due to population substructure. Our approach is computationally efficient, can be applied to all the genetic loci in the data, and does not require pruning of variants in linkage disequilibrium (LD). In Chapter 3, we propose a general analysis framework in which thousands of genetic loci can be tested simultaneously for association with complex phenotypes. The approach is built on spatial-clustering methodology, assuming that genetic loci that are associated with the target phenotype cluster in certain genomic regions. In contrast to standard methodology for multi-loci analysis, which has focused on the dimension reduction of data, the proposed approach profits from the availability of large numbers of genetic loci. Thus it will be especially relevant for whole-genome sequencing studies, which commonly record several thousand loci per gene.

Harvard University - DASH

Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data

Author: Lange Christoph
Qiao Dandi
Yip Wai-Ki
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/05/2012

Background: As next-generation sequencing data becomes available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data because of its enormous size. This is, and will remain, a problem that researchers working on genetic data encounter every day. Some options are available for compressing and storing such data, such as general-purpose compression software and the PBAT/PLINK binary format. However, these currently available methods either do not offer sufficient compression rates or require a great amount of CPU time for decompression and loading every time the data is accessed. Results: Here, we propose a novel and simple algorithm for storing such sequencing data. We show that the compression factor of the algorithm ranges from 16 to several hundred, which potentially allows SNP data of hundreds of gigabytes to be stored in hundreds of megabytes. We provide a C++ implementation of the algorithm, which supports direct loading and parallel loading of the compressed format without requiring extra time for decompression. By applying the algorithm to simulated and real datasets, we show that the algorithm gives a greater compression rate than the commonly used compression methods, and that the data-loading process takes less time. The C++ library also provides direct data-retrieval functions, which allow the compressed information to be easily accessed by other C++ programs. Conclusions: The SpeedGene algorithm enables the storage and analysis of next-generation sequencing data in current hardware environments, making system upgrades unnecessary.

Harvard University - DASH

Springer - Publisher Connector

Directory of Open Access Journals

A comparative analysis of family-based and population-based association tests using whole genome sequence data

Author: Cho Michael H
Laird Nan M
McDonald Merry-Lynn N
Qiao Dandi
Yip Wai-Ki
Zhou Jin J
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

The revolution in next-generation sequencing has made obtaining both common and rare high-quality sequence variants across the entire genome feasible. Because researchers are now faced with the analytical challenges of handling a massive amount of genetic variant information from sequencing studies, numerous methods have been developed to assess the impact of both common and rare variants on disease traits. In this report, whole genome sequencing data from Genetic Analysis Workshop 18 was used to compare the power of several methods, considering both family-based and population-based designs, to detect association with variants in the MAP4 gene region and on chromosome 3 with blood pressure. To prioritize variants across the genome for testing, variants were first functionally assessed using prediction algorithms and expression quantitative trait loci (eQTLs) data. Four set-based tests in the family-based association tests (FBAT) framework--FBAT-v, FBAT-lmm, FBAT-m, and FBAT-l--were used to analyze 20 pedigrees, and 2 variance component tests, sequence kernel association test (SKAT) and genome-wide complex trait analysis (GCTA), were used with 142 unrelated individuals in the sample. Both set-based and variance-component-based tests had high power and an adequate type I error rate. Of the various FBATs, FBAT-l demonstrated superior performance, indicating the potential for it to be used in rare-variant analysis. The updated FBAT package is available at: http://www.hsph.harvard.edu/fbat/

Crossref

Harvard University - DASH

Springer - Publisher Connector

PubMed Central

The University of Arizona

WISARD: workbench for integrated superfast association studies for related datasets

Author: Cho Michael
Choi Sungkyoung
Lee Sungyoung
Park Taesung
Qiao Dandi
Silverman Edwin K.
Won Sungho
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2018

Background: Mendelian transmission produces phenotypic and genetic relatedness between family members, giving family-based analytical methods an important role in genetic epidemiological studies, from heritability estimation to genetic association analysis. With the advance in genotyping technologies, whole-genome sequence data can be utilized for genetic epidemiological studies, and family-based samples may become more useful for detecting de novo mutations. However, genetic analyses employing family-based samples usually suffer from the complexity of the computational and statistical algorithms, and certain types of family designs, such as those incorporating data from extended families, have rarely been used. Results: We present a Workbench for Integrated Superfast Association studies for Related Data (WISARD) programmed in C/C++. WISARD enables fast and comprehensive analysis of SNP-chip and next-generation sequencing data on extended families, with applications from designing genetic studies to summarizing analysis results. In addition, WISARD can automatically be run in a fully multithreaded manner, and the integration of R software for visualization makes it more accessible to non-experts. Conclusions: Comparison with existing toolsets showed that WISARD is computationally suitable for integrated analysis of related subjects and that it outperforms existing toolsets. WISARD has also been successfully used to analyze a large-scale sequencing dataset for chronic obstructive pulmonary disease (COPD), in which we identified multiple genes associated with COPD, demonstrating its practical value.

SNU Open Repository and Archive

Harvard University - DASH

Directory of Open Access Journals

A Genome-Wide Linkage Study for Chronic Obstructive Pulmonary Disease in a Dutch Genetic Isolate Identifies Novel Rare Candidate Variants

Author: Amin Najaf
Boezen H. M.
Brusselle Guy G.
Cho Michael H.
Hobbs Brian D.
Lahousse Lies
Nedeljkovic Ivana
Postma Dirkje S.
Qiao Dandi
Terzikhan Natalie
van der Plaat Diana A.
van Diemen Cleo C.
van Duijn Cornelia M.
Vonk Judith M.
Publication date: 01/01/2018

Chronic obstructive pulmonary disease (COPD) is a complex and heritable disease, associated with multiple genetic variants. Specific familial types of COPD may be explained by rare variants, which have not been widely studied. We aimed to discover rare genetic variants underlying COPD through a genome-wide linkage scan. Affected-only analysis was performed using the 6K Illumina Linkage IV Panel in 142 cases clustered in 27 families from a genetic isolate, the Erasmus Rucphen Family (ERF) study. Potential causal variants were identified by searching for shared rare variants in the exome-sequence data of the affected members of the families contributing most to the linkage peak. The identified rare variants were then tested for association with COPD in a large meta-analysis of several cohorts. Significant evidence for linkage was observed on chromosomes 15q14-15q25 [logarithm of the odds (LOD) score = 5.52], 11p15.4-11q14.1 (LOD = 3.71) and 5q14.3-5q33.2 (LOD = 3.49). In the chromosome 15 peak, which harbors the known COPD locus for nicotinic receptors, and in the chromosome 5 peak, we could not identify shared variants. In the chromosome 11 locus, we identified four rare (minor allele frequency (MAF) <0.02), predicted pathogenic, missense variants that were shared among the affected family members. The identified variants localize to genes including neuroblast differentiation-associated protein (AHNAK), previously associated with blood biomarkers in COPD; phospholipase C beta 3 (PLCB3), shown to increase airway hyper-responsiveness; solute carrier family 22-A11 (SLC22A11), involved in amino acid metabolism and ion transport; and metallothionein-like protein 5 (MTL5), involved in nicotinate and nicotinamide metabolism. Associations of the SLC22A11 and MTL5 variants were confirmed in the meta-analysis of 9,888 cases and 27,060 controls. In conclusion, we have identified novel rare variants in plausible genes related to COPD. Further studies using large-sample whole-genome sequencing should confirm the associations at chromosome 11 and investigate the linked regions on chromosomes 15 and 5.

University of Groningen

Spiral - Imperial College Digital Repository

Erasmus University Digital Repository

Proceedings - University of Groningen

Crossref

ARTS repository - University of Groningen

Ghent University Academic Bibliography

Frontiers - Publisher Connector

EUR Research Repository

Archivsystem Ask23

Dissertations of the University of Groningen

The genetic determinants of recurrent somatic mutations in 43,693 blood genomes

Author: Abecasis Goncalo R
Albert Christine
Arnett Donna K
Barnes Kathleen C
Becker Lewis C
Bick Alexander G
Bis Joshua C
Blackwell Thomas W
Blangero John
Boerwinkle Eric
Bowden Donald W
Brody Jennifer A
Broome Jai G
Cade Brian E
Chami Nathalie
Chen Yii-Der Ida
Chen Zhanghua
Cho Michael H
Correa Adolfo
Curran Joanne E
Custer Brian S
Darbar Dawood
de Andrade Mariza
DeMeo Dawn L
Desai Pinkal
Duggirala Ravindranath
Fornage Myriam
Freedman Barry I
Gao Yan
Gauderman W. James
Gilliland Frank D
Grammer Leslie
Gu C. Charles
Gui Hongsheng
Guo Xiuqing
He Jiang
Heckbert Susan R
Irvin Marguerite R
Jaiswal Siddhartha
Johnsen Jill M
Johnson Andrew D
Kang Hyun M
Kaplan Robert
Kardia Sharon
Kenny Eimear E
Konkle Barbara A
Kooperberg Charles
Kumar Rajesh
Lasky-Su Jessica
Laurie Cecelia A
Lee Wen-Jane
Lewis Joshua P
Li Xingnan
Loos Ruth J. F
Manichaikul Ani W
Mathias Rasika
McGarvey Stephen
Meyers Deborah A
Natarajan Pradeep
O'Connell Jeffrey R
Ortega Victor
Palmer Nicholette D
Psaty Bruce M
Qiao Dandi
Raffield Laura M
Redline Susan
Reiner Alexander P
Rich Stephen S
Roden Dan
Rotter Jerome I
Schleimer Robert P
Shoemaker M. Benjamin
Shuldiner Alan R
Silverman Edwin K
Smith Albert V
Smith Nicholas L
Taylor Kent D
Tiwari Hemant
Vasan Ramachandran S
Weinstock Joshua S
Weiss Scott T
Wheeler Marsha M
Wiggins Kerri L
Williams Keoki L
Xiao Shujie
Yanek Lisa R
Yun Jeong H
Publication venue: Henry Ford Health Scholarly Commons
Publication date: 28/04/2023

Nononcogenic somatic mutations are thought to be uncommon and inconsequential. To test this, we analyzed 43,693 National Heart, Lung and Blood Institute Trans-Omics for Precision Medicine blood whole genomes from 37 cohorts and identified 7131 non-missense somatic mutations that are recurrently mutated in at least 50 individuals. These recurrent non-missense somatic mutations (RNMSMs) are not clearly explained by other clonal phenomena such as clonal hematopoiesis. RNMSM prevalence increased with age, with an average 50-year-old having 27 RNMSMs. Inherited germline variation was associated with RNMSM acquisition. These variants were found in genes involved in adaptive immune function, proinflammatory cytokine production, and lymphoid lineage commitment. In addition, the presence of eight specific RNMSMs was associated with blood cell traits at effect sizes comparable to Mendelian genetic mutations. Overall, we found that somatic mutations in blood are an unexpectedly common phenomenon with ancestry-specific determinants and human health consequences.

Henry Ford Health System Scholarly Commons

Common Genetic Polymorphisms Influence Blood Biomarker Measurements in COPD

Author: Adami Alessandra
Adams Sandra
Al Qaisi Mustafa
Alapat Philip
Alexis Neil E.
Allen Tadashi
Anderson Wayne
Anzueto Antonio
Atik Mustafa
Austin John
Bailey William
Bandi Venkata
Barr R. Graham
Basta Patricia V.
Beaty Terri
Begum Ferdouse
Bell Brian
Berkowitz Eugene
Bhatt Surya
Billings Joanne
Bleecker Eugene R.
Bon Jessica
Boriek Aladin
Boucher Richard C.
Bowler Russell
Brown Robert
Budoff Matthew
Busch Robert
Carretta Elizabeth E.
Casaburi Richard
Castaldi Peter
Chandra Divay
Chen Ting-Huei
Cho Michael
Cho Michael H.
Christenson Stephanie A.
Ciccolella David
Comellas Alejandro
Comellas Alejandro P.
Cooper Christopher B.
Cordova Francis
Cornellas Alejandro
Couper David J.
Coxson Harvey O.
Crapo James
Crapo James D.
Criner Gerard
Criner Gerard J.
Crystal Ronald G.
Curtis Jeffrey L.
D'Alonzo Gilbert
D'Souza Belinda
Dass Chandra
De Dawn
Demeo Dawn
Desai Parag
Doerschuk Claire M.
Dransfield Mark
Dransfield Mark T.
Drummond M. Bradley
Duca Lindsey
El-Bouiez Adel
Everett Douglas
Faino Anna
Fischer Hans
Foreman Marilyn
Freeman Christine M.
Friedman Paul
Fuhrman Carl
Gouskova Natalia A.
Gray Teresa
Guntupalli Kalpatha
Guy Elizabeth
Halper-Stromberg Eitan
Han MeiLan
Han MeiLan K.
Hanania Nicola A.
Hansel Nadia
Hansel Nadia N.
Hardin Megan
Hastie Annette T.
Hawkins Gregory A.
Hersh Craig
Hersh Craig P.
Hetmanski Jacqueline
Hobbs Brian
Hoffman Eric A.
Hokanson John
Hokanson John E.
Horton Karen
Humphries Stephen
Jacobs Michael
Jacobson Francine
Jacobson Francine L
Jacobson Sean
James Mamary
Jensen Robert
Judy Philip F
Kaner Robert J.
Kanner Richard E.
Kazerooni Ella
Kazerooni Ella A
Kechris Katerina
Kelsen Steven
Kim Victor
Kinney Gregory
Kleerup Eric C.
Kluiber Alex
Krishnan Jerry A.
Laird Nan
Lan Charlie
Lange Christoph
LaVange Lisa M.
Lazarus Stephen C.
Lutz Sharon
Lynch David
Lynch David A.
MacIntyre Neil
Make Barry
Mann Tanya
Marchetti Nathaniel
Martinez Carlos
Martinez Carlos H.
Martinez Fernando J.
Maselli-Caceres Diego
McAdams H. Page
McDonald Merry-Lynn
McEvoy Charlene
Meyers Deborah A.
Michael Wells
Nachiappan Arun
Nath Hrudaya
Newell John
Newell John D.
Oelsner Elizabeth C.
O’Neal Wanda K.
Pace David
Paine Robert
Parker Margaret
Parulekar Amit
Pearson Gregory D.N.
Pernicano Perry G.
Peters Stephen P.
Porszasz Janos
Pratte Katherine
Putcha Nirupama
Qiao Dandi
Quibrera Pedro Miguel
Ramsdell Joe
Regan Elizabeth
Regan Elizabeth A.
Rennard Stephen I.
Rosiello Richard
Ross James C
Rossiter Harry
Rozenshtein Anna
Ruiz Mario E.
San Jose Estepar Raul
Santorico Stephanie
Satti Aditi
Scholand Mary Beth
Schroeder Joyce
Sciurba Frank
Sharafkhaneh Amir
Shenoy Kartik
Sieren Jered
Silverman Edwin
Silverman Edwin K.
Soler Xavier
Steiner Robert M.
Stinson Douglas
Stoel Berend C
Strand Matt
Sun Wei
Swift Alex
Swift Irene
Tashjian Joseph
Tashkin Donald P.
Thomashow Byron
Thompson Brad
Tschirren Juerg
Van Beek Edwin
van Ginneken Bram
van Rikxoort Eva
Vega-Sanchez Maria Elena
Wan Emily
Washington Lacey
Washko George
Weissfeld Joel
Wells J. Michael
Wendt Christine
Westney Gloria
Wilson Carla
Wilson Carla G
Wise Robert
Wise Robert A.
Won Sungho
Woodruff Prescott G.
Yang Jenny
Yen Andrew
Young Kendra
Publication date: 01/01/2016

Implementing precision medicine for complex diseases such as chronic obstructive lung disease (COPD) will require extensive use of biomarkers and an in-depth understanding of how genetic, epigenetic, and environmental variations contribute to phenotypic diversity and disease progression. A meta-analysis from two large cohorts of current and former smokers with and without COPD [SPIROMICS (N = 750); COPDGene (N = 590)] was used to identify single nucleotide polymorphisms (SNPs) associated with measurement of 88 blood proteins (protein quantitative trait loci; pQTLs). pQTLs consistently replicated between the two cohorts. Features of pQTLs were compared to previously reported expression QTLs (eQTLs). Inference of causal relations of pQTL genotypes, biomarker measurements, and four clinical COPD phenotypes (airflow obstruction, emphysema, exacerbation history, and chronic bronchitis) was explored using conditional independence tests. We identified 527 highly significant (p 10% of measured variation in 13 protein biomarkers, with a single SNP (rs7041; p = 10^−392) explaining 71%-75% of the measured variation in vitamin D binding protein (gene = GC). Some of these pQTLs [e.g., pQTLs for VDBP, sRAGE (gene = AGER), surfactant protein D (gene = SFTPD), and TNFRSF10C] have been previously associated with COPD phenotypes. Most pQTLs were local (cis), but distant (trans) pQTL SNPs in the ABO blood group locus were the top pQTL SNPs for five proteins. The inclusion of pQTL SNPs improved the clinical predictive value for the established association of sRAGE and emphysema, and the explanation of variance (R²) for emphysema improved from 0.3 to 0.4 when the pQTL SNP was included in the model along with clinical covariates. Causal modeling provided insight into specific pQTL-disease relationships for airflow obstruction and emphysema. In conclusion, given the frequency of highly significant local pQTLs, the large amount of variance potentially explained by pQTLs, and the differences observed between pQTL and eQTL SNPs, we recommend that protein biomarker-disease association studies take into account the potential effect of common local SNPs and that pQTLs be integrated along with eQTLs to uncover disease mechanisms. Large-scale blood biomarker studies would also benefit from close attention to the ABO blood group.

Carolina Digital Repository

Asthma Is a Risk Factor for Respiratory Exacerbations Without Increased Rate of Lung Function Decline: Five-Year Follow-up in Adult Smokers From the COPDGene Study

Author: Adel R. Boueiz
Alex Kluiber
Barry J. Make
Berend C. Stoel
Bhatt
Bram van Ginneken
Brian D. Hobbs
Brooks
Bui
Camille Moore
Carla G. Wilson
Christoph Lange
Craig P. Hersh
Dandi Qiao
David A. Lynch
Dawn L. DeMeo
de Marco
Diaz
Douglas Everett
Douglas Stinson
Dransfield
Edwin J. van Beek
Edwin K. Silverman
Edwin Van Beek
Eitan Halper-Stromberg
Elizabeth A. Regan
Ella A. Kazerooni
Emily S. Wan
Eric A. Hoffman
Eva van Rikxoort
Ferdouse Begum
Francine L. Jacobson
Fu
George Washko
Gregory Kinney
Hardin
Hardin
Harvey O. Coxson
Hayden
Hayden
Jacqueline Hetmanski
James C. Ross
James D. Crapo
Jered Sieren
Jim Crooks
John D. Newell
John E. Hokanson
John Hughes
Jones
Jones
Joyce Schroeder
Juerg Tschirren
Katherine Pratte
Kendra A. Young
Lange
Lystra P. Hayden
Margaret M. Parker
Marilyn G. Foreman
Martinez
Matt Strand
Matthew J. Strand
McGeachie
McGeachie
Megan E. Hardin
MeiLan K. Han
Merry-Lynn McDonald
Michael Cho
Mustafa Al Qaisi
Nadia N. Hansel
Nakano
Nan Laird
Peter J. Castaldi
Philip F. Judy
Postma
Rabe
Raul San Jose Estepar
Regan
Robert Busch
Robert Jensen
Sharon M. Lutz
Sin
Stephanie Santorico
Stephen Humphries
Svanes
Tagiyeva
Tai
Teresa Gray
Terri Beaty
Weiliang Qiu
Publication venue: 'Elsevier BV'
Publication date: 01/02/2018

Crossref

Edinburgh Research Explorer

Genome-wide assessment of gene-by-smoking interactions in COPD

Author: An Jaehoon
Cho Michael H.
Kang Hae Yeon
Koo So-My
Lee MoonGyu
Park Boram
Qiao Dandi
Silverman Edwin K.
Sung Joohon
Won Sungho
Yang Hyeon-Jong
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 25/07/2018

Cigarette smoke exposure is a major risk factor in chronic obstructive pulmonary disease (COPD), and its interactions with genetic variants could affect lung function. However, few gene-smoking interactions have been reported. In this report, we evaluated the effects of gene-smoking interactions on lung function using Korea Associated Resource (KARE) data with the spirometric variable forced expiratory volume in 1 s (FEV1). We found that the variance of FEV1 differed by smoking status. Thus, we considered a linear mixed model for association analysis under heteroscedasticity according to smoking status. We found a previously identified locus near SOX9 on chromosome 17 to be the most significant based on a joint test of the main and interaction effects of smoking. Smoking interactions were replicated with the Gene-Environment of Interaction and phenotype (GENIE), Multi-Ethnic Study of Atherosclerosis-Lung (MESA-Lung), and COPDGene studies. We found that individuals with minor alleles at rs17765644, rs17178251, rs11870732, and rs4793541 tended to have lower FEV1 values, and that lung function decreased much faster with age for smokers. There have been very few reports replicating a common-variant gene-smoking interaction, and our results revealed that statistical models for gene-smoking interaction analyses should be carefully selected.

Harvard University - DASH

Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data

Author: C Lange
Christoph Lange
Dandi Qiao
ML Metzker
P Danecek
S Christley
S Purcell
S Wright
The 1000 Genome Project Consortium
V Bansal
Wai-Ki Yip
Publication venue: 'Springer Science and Business Media LLC'

Crossref