    A robust clustering algorithm for identifying problematic samples in genome-wide association studies

    Summary: High-throughput genotyping arrays provide an efficient way to survey single nucleotide polymorphisms (SNPs) across the genome in large numbers of individuals. Downstream analysis of the data, for example in genome-wide association studies (GWAS), often involves statistical models of genotype frequencies across individuals. The complexities of the sample collection process and the potential for errors in the experimental assay can lead to biases and artefacts in an individual's inferred genotypes. Rather than attempting to model these complications, it has become standard practice to remove individuals whose genome-wide data differ from the sample at large. Here we describe a simple but robust statistical algorithm to identify samples with atypical summaries of genome-wide variation. Its use as a semi-automated quality-control tool is demonstrated using several summary statistics, selected to identify different potential problems, and it is applied to two different genotyping platforms and sample collections.
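
    The abstract leaves the algorithm's specifics to the paper, but the general pattern it describes (flagging samples whose genome-wide summary statistics deviate from the sample at large) can be sketched with robust median/MAD z-scores. This is a minimal sketch assuming per-sample statistics such as heterozygosity and missingness, an illustrative simplification rather than the paper's actual clustering algorithm:

```python
import numpy as np

def flag_atypical_samples(summaries, threshold=4.0):
    """Flag samples whose summary statistics deviate from the bulk.

    summaries: (n_samples, n_stats) array, e.g. per-sample heterozygosity
    and missingness. Median/MAD robust z-scores are used so that a few
    extreme samples cannot distort the location and scale estimates.
    """
    med = np.median(summaries, axis=0)
    mad = np.median(np.abs(summaries - med), axis=0)
    mad = np.where(mad == 0, 1e-12, mad)          # guard against zero spread
    z = 0.6745 * (summaries - med) / mad          # scaled to match a normal sd
    return np.any(np.abs(z) > threshold, axis=1)  # True = atypical sample

# Example: 1,000 samples, 2 summary statistics, 5 contaminated samples
rng = np.random.default_rng(0)
stats = rng.normal(size=(1000, 2))
stats[:5] += 8.0
print(flag_atypical_samples(stats).nonzero()[0])  # typically flags samples 0-4
```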

    Genome-wide association study identifies a variant in HDAC9 associated with large vessel ischemic stroke

    Genetic factors have been implicated in stroke risk, but few replicated associations have been reported. We conducted a genome-wide association study (GWAS) of ischemic stroke and its subtypes in 3,548 cases and 5,972 controls, all of European ancestry. Replication of potential signals was performed in 5,859 cases and 6,281 controls. We replicated reported associations between variants close to PITX2 and ZFHX3 and cardioembolic stroke, and between a 9p21 locus and large vessel stroke. We identified a novel association for a SNP within the histone deacetylase 9 (HDAC9) gene on chromosome 7p21.1 that was associated with large vessel stroke, with additional replication in a further 735 cases and 28,583 controls (rs11984041, combined P = 1.87×10⁻¹¹, OR = 1.42, 95% CI 1.28-1.57). All four loci exhibit evidence of heterogeneity of effect across the stroke subtypes, with some, and possibly all, affecting risk for only one subtype. This suggests differing genetic architectures for different stroke subtypes.
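
    The quoted statistics are internally consistent, which can be checked from the abstract's own numbers: a Wald 95% confidence interval encodes the standard error of log(OR), from which the z-statistic and two-sided P value follow. A back-of-envelope verification (not code from the paper, just arithmetic on the reported values):

```python
from math import log
from scipy.stats import norm

# Reported for rs11984041: OR = 1.42, 95% CI 1.28-1.57, P = 1.87e-11.
# A Wald 95% CI spans +/- 1.96 standard errors on the log-odds scale,
# so the CI implicitly encodes the standard error of log(OR).
or_hat, ci_lo, ci_hi = 1.42, 1.28, 1.57
se = (log(ci_hi) - log(ci_lo)) / (2 * 1.96)  # ~0.052 on the log scale
z = log(or_hat) / se                         # Wald z-statistic, ~6.7
p = 2 * norm.sf(abs(z))                      # two-sided P value
print(f"z = {z:.2f}, P = {p:.2e}")           # ~1.7e-11, near the reported value
```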

    Statistical Methods For Detecting Genetic Risk Factors of a Disease with Applications to Genome-Wide Association Studies

    This thesis aims to develop statistical methods for analysing data derived from genome-wide association studies (GWAS). GWAS typically involve genotyping human genetic variation in thousands of individuals, using high-throughput genome-wide single nucleotide polymorphism (SNP) arrays, and testing for association between those variants and a given disease under the common disease/common variant assumption. Although GWAS have identified many potential genetic factors in the genome that affect the risk of complex diseases, much of the genetic heritability remains unexplained. The power to detect new genetic risk variants can be improved by considering multiple genetic variants simultaneously with novel statistical methods. Improving the analysis of GWAS data has received much attention from statisticians and other scientific researchers over the past decade. Several challenges arise in analysing GWAS data. First, determining the risk SNPs can be difficult because non-random correlation between SNPs can inflate type I and II errors in statistical inference. When a group of SNPs is considered together in the context of haplotypes/genotypes, the distribution of the haplotypes/genotypes is sparse, which makes it difficult to detect risk haplotypes/genotypes in terms of disease penetrance. In this work, we proposed four new methods to identify risk haplotypes/genotypes based on their frequency differences between cases and controls. To evaluate the performance of our methods, we simulated datasets under a wide range of scenarios, according to both retrospective and prospective designs. In the first method, we reconstruct haplotypes from unphased genotypes, then cluster and threshold the inferred haplotypes into risk and non-risk groups with a two-component binomial-mixture model. Here the parameters were estimated using a modified Expectation-Maximisation (EM) algorithm, in which the maximisation step was replaced by posterior sampling of the component parameters. We also elucidated the relationships between risk and non-risk haplotypes under different modes of inheritance and genotypic relative risks. In the second method, we fitted a three-component mixture model to genotype data directly, followed by odds-ratio thresholding. In the third method, we combined the existing haplotype-reconstruction software PHASE with a permutation method to infer risk haplotypes. In the fourth method, we proposed a new way to score genotypes by clustering and combined it with a logistic regression approach to infer risk haplotypes. Simulation studies showed that the first three methods outperformed the multiple-testing method of Zhu (2010) in terms of average specificity and sensitivity (AVSS) in all scenarios considered, and the fourth method outperformed the standard logistic regression approach. We applied our methods to two GWAS datasets, on coronary artery disease (CAD) and hypertension (HT), detecting several new risk haplotypes and recovering a number of disease-associated genetic variants already reported in the literature.
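
    As a rough illustration of the first method's core idea (a two-component binomial mixture over haplotype counts in cases versus controls, separating risk from non-risk haplotypes), the sketch below uses the plain EM updates. The thesis replaces the maximisation step with posterior sampling, which is not reproduced here, and the data layout is an assumption for illustration:

```python
import numpy as np

def em_binomial_mixture(k, n, iters=200):
    """EM for a two-component binomial mixture over haplotype counts.

    k[h]: copies of haplotype h observed in cases
    n[h]: copies of haplotype h observed in total (cases + controls)
    Haplotypes are softly assigned to a 'risk' component (higher case
    fraction) or a 'non-risk' component; returns posterior risk
    probabilities and the fitted component case fractions.
    """
    p = np.array([0.4, 0.6])   # case fraction per component (asymmetric start)
    w = np.array([0.5, 0.5])   # mixture weights
    for _ in range(iters):
        # E-step: responsibility of each component for each haplotype
        ll = (k[:, None] * np.log(p) + (n - k)[:, None] * np.log1p(-p)
              + np.log(w))
        r = np.exp(ll - ll.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and component case fractions
        w = r.mean(axis=0)
        p = (r * k[:, None]).sum(axis=0) / (r * n[:, None]).sum(axis=0)
        p = np.clip(p, 1e-6, 1 - 1e-6)
    risk = int(np.argmax(p))   # component with the higher case fraction
    return r[:, risk], p

k = np.array([48, 52, 50, 80, 85])        # case copies per haplotype
n = np.array([100, 100, 100, 100, 100])   # total copies per haplotype
post, p = em_binomial_mixture(k, n)
print(np.round(post, 2), np.round(p, 2))  # last two haplotypes flagged as risk
```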

    Ancestral Informative Marker Selection and Population Structure Visualization Using Sparse Laplacian Eigenfunctions

    Identification of a small panel of population-structure-informative markers can reduce genotyping cost and is useful in various applications, such as ancestry inference in association mapping, forensics, and evolutionary theory in population genetics. Traditional methods to ascertain ancestry-informative markers usually require prior knowledge of individual ancestry and struggle with admixed populations. Recently, Principal Components Analysis (PCA) has been employed with success to select SNPs that are highly correlated with the top significant principal components (PCs), without using individual ancestry information; the approach is also applicable to admixed populations. Here we propose a novel approach based on our recent result on summarizing population structure by graph Laplacian eigenfunctions, which differs from PCA in that it is geometric and robust to outliers. Our approach also takes advantage of the a priori sparseness of informative markers in the genome. Through simulation of a ring population and analysis of the real global HGDP sample (650K SNPs genotyped in 940 unrelated individuals), we validate the proposed algorithm's ability to select the most informative markers; a small fraction of them suffices to recover a similar underlying population structure efficiently. Employing a standard Support Vector Machine (SVM) to predict individuals' continental memberships on the HGDP dataset, which spans seven continental groups, we demonstrate that the SNPs selected by our method are more informative yet less redundant than those selected by PCA. Our algorithm is a promising tool for genome-wide association studies and population genetics, facilitating the selection of structure-informative markers, efficient detection of population substructure, and ancestry inference.
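
    The pipeline the abstract describes (summarize structure with graph Laplacian eigenvectors, then rank SNPs by how strongly they correlate with the leading eigenvectors) can be sketched as below. This omits the method's sparsity machinery, and the Gaussian kernel, bandwidth, and panel size are illustrative assumptions rather than the paper's choices:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh

def laplacian_informative_snps(G, n_eig=2, n_snps=50):
    """Rank SNPs by correlation with leading graph Laplacian eigenvectors.

    G: (n_individuals, n_snps) genotype matrix coded 0/1/2. Builds a
    Gaussian-kernel similarity graph over individuals, takes the smallest
    non-trivial eigenvectors of the normalized Laplacian (these summarize
    population structure), and scores each SNP by its squared correlation
    with those eigenvectors.
    """
    D2 = squareform(pdist(G, "sqeuclidean"))
    W = np.exp(-D2 / np.median(D2))                   # illustrative bandwidth
    d = W.sum(axis=1)
    L = np.eye(len(W)) - W / np.sqrt(np.outer(d, d))  # normalized Laplacian
    _, vecs = eigh(L)
    U = vecs[:, 1:1 + n_eig]                   # skip the trivial eigenvector
    Gc = (G - G.mean(0)) / (G.std(0) + 1e-12)  # standardize SNPs
    Uc = (U - U.mean(0)) / (U.std(0) + 1e-12)
    score = ((Gc.T @ Uc) ** 2).sum(axis=1)     # squared correlation per SNP
    return np.argsort(score)[::-1][:n_snps]

# Two synthetic subpopulations differing only at the first 20 of 500 SNPs
rng = np.random.default_rng(1)
freq = np.full((2, 500), 0.5)
freq[0, :20], freq[1, :20] = 0.1, 0.9
pop = rng.integers(0, 2, 200)
G = rng.binomial(2, freq[pop])
print(np.sort(laplacian_informative_snps(G, n_snps=20)))  # mostly indices < 20
```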

    Bayesian correlated clustering to integrate multiple datasets

    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets. Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI's performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods.
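
    MDI's information sharing can be illustrated by the shape of its conditional allocation probabilities: an agreement parameter upweights assigning a gene to the cluster it currently occupies in a companion dataset. A minimal sketch; the simple (1 + phi * indicator) coupling and the toy numbers are assumptions based on the description above, not code from the MDI software:

```python
import numpy as np

def mdi_allocation_probs(loglik, weights, other_alloc, phi):
    """Conditional cluster-allocation probabilities for one gene, MDI-style.

    loglik[k]:   log density of the gene's data under cluster k (one dataset)
    weights[k]:  mixture weights in that dataset
    other_alloc: the gene's current cluster label in a companion dataset
    phi:         agreement parameter; phi > 0 upweights matching allocations,
                 which is how information is shared across datasets
    """
    logp = np.log(weights) + loglik
    logp += np.log1p(phi * (np.arange(len(weights)) == other_alloc))
    p = np.exp(logp - logp.max())
    return p / p.sum()

# A gene weakly favouring cluster 0 on its own data, but allocated to
# cluster 2 in the companion dataset; no agreement (phi = 0) vs. strong (5):
ll = np.array([-1.0, -1.4, -1.3])
w = np.array([1 / 3, 1 / 3, 1 / 3])
print(mdi_allocation_probs(ll, w, other_alloc=2, phi=0.0).round(2))
print(mdi_allocation_probs(ll, w, other_alloc=2, phi=5.0).round(2))  # shifts to 2
```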

    Tiny microbes, enormous impacts: what matters in gut microbiome studies?

    Many factors affect the microbiomes of humans, mice, and other mammals, but substantial challenges remain in determining which of these factors are of practical importance. Considering the relative effect sizes of both biological and technical covariates can help improve study design and the quality of biological conclusions. Care must be taken to avoid technical biases that can lead to incorrect biological conclusions. Presenting quantitative effect sizes in addition to P values will improve our ability to perform meta-analyses and to evaluate potentially relevant biological effects. Better consideration of effect size and statistical power will lead to more robust biological conclusions in microbiome studies.
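
    The recommendation to report effect sizes alongside P values is easy to demonstrate: with large samples, a negligible effect can still produce a very small P value. A minimal illustration with simulated data (the group sizes and shift are arbitrary choices, not from the paper):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
a = rng.normal(0.00, 1, 5000)   # e.g. a diversity metric in group A
b = rng.normal(0.08, 1, 5000)   # a tiny true shift in group B

t, p = ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd      # Cohen's d, a simple effect size
print(f"P = {p:.1e}, Cohen's d = {d:.2f}")
# Large n makes P look impressive while d stays negligible, which is
# why effect sizes belong next to P values in reported results.
```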

    Edge-weighting of gene expression graphs

    In recent years, considerable research effort has been directed at microarray technologies and their role in providing simultaneous information on expression profiles for thousands of genes. These data, when subjected to clustering and classification procedures, can assist in identifying patterns and provide insight into biological processes. Graphical representations can be used to understand the properties of complex gene expression datasets. Intuitively, the data can be represented as a bipartite graph, with weighted edges corresponding to gene-sample node pairs in the dataset. Biologically meaningful subgraphs can then be sought, but performance is influenced both by the search algorithm and by the graph-weighting scheme, and both merit rigorous investigation. In this paper, we focus on edge-weighting schemes for bipartite graphical representations of gene expression. Two novel methods are presented: the first is based on empirical evidence; the second on a geometric distribution. The schemes are compared on several real datasets, assessing performance against four essential properties: robustness to noise and missing values, discrimination, parameter influence on scheme efficiency, and reusability. Recommendations and limitations are briefly discussed.
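
    The abstract does not specify its two weighting schemes beyond their basis (empirical evidence and a geometric distribution), but the general shape of an edge-weighting scheme, mapping each gene-sample expression value to an edge weight that is comparable across genes, can be sketched. The 'quantile' and 'zscore' schemes below are illustrative stand-ins, not the paper's methods:

```python
import numpy as np

def bipartite_edge_weights(X, scheme="quantile"):
    """Assign weights to gene-sample edges of a bipartite expression graph.

    X: (n_genes, n_samples) expression matrix; entry (g, s) becomes the
    weight of the edge between gene node g and sample node s.
    'quantile' replaces each value by its empirical quantile within its
    gene, which is robust to scale differences between genes;
    'zscore' standardizes each gene's profile instead.
    """
    if scheme == "quantile":
        ranks = X.argsort(axis=1).argsort(axis=1)   # within-gene ranks 0..n-1
        return (ranks + 1) / (X.shape[1] + 1)
    if scheme == "zscore":
        return (X - X.mean(1, keepdims=True)) / (X.std(1, keepdims=True) + 1e-12)
    raise ValueError(f"unknown scheme: {scheme}")

X = np.array([[2.0, 8.0, 4.0],
              [100.0, 50.0, 75.0]])   # two genes on very different scales
print(bipartite_edge_weights(X))      # weights comparable across both genes
```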

    Probabilistic analysis of the human transcriptome with side information

    Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and the efficient sharing of research material through community databases have opened new views on the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights into cell-biological networks, cancer mechanisms, and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources with the wealth of background information in genomic data repositories, it has been possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected from individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modelling tasks. Open-source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community. Comment: Doctoral thesis, 103 pages, 11 figures.