
28 research outputs found

Joint Analysis of Multiple Metagenomic Samples

Author: A Kislyuk
B Yang
C Chan
Chris P. Ponting
D Cohn
D Huson
D Lee
D Richter
D Rusch
Eran Halperin
H Leung
H Teeling
I Jolliffe
J Hartigan
J Qin
J Sivic
M Arumugam
M Chiang
M Hamady
M Takahashi
M Wendl
P Meinicke
P Turnbaugh
S Chatterji
S Karlin
T Brants
T Hofmann
W Kent
X Jiang
Y Wu
Yael Baran
Publication venue: Public Library of Science
Publication date: 16/02/2012
Field of study

The availability of metagenomic sequencing data, generated by sequencing DNA pooled from multiple microbes living jointly, has increased sharply in the last few years with developments in sequencing technology. Characterizing the contents of metagenomic samples is a challenging task, which has been extensively attempted by both supervised and unsupervised techniques, each with its own limitations. Common to practically all the methods is the processing of single samples only; when multiple samples are sequenced, each is analyzed separately and the results are combined. In this paper we propose to perform a combined analysis of a set of samples in order to obtain a better characterization of each of the samples, and provide two applications of this principle. First, we use an unsupervised probabilistic mixture model to infer hidden components shared across metagenomic samples. We incorporate the model in a novel framework for studying association of microbial sequence elements with phenotypes, analogous to the genome-wide association studies performed on human genomes: we demonstrate that stratification may result in false discoveries of such associations, and that the components inferred by the model can be used to correct for this stratification. Second, we propose a novel read clustering (also termed “binning”) algorithm which operates on multiple samples simultaneously, leveraging the assumption that the different samples contain the same microbial species, possibly in different proportions. We show that integrating information across multiple samples yields more precise binning on each of the samples. Moreover, for both applications we demonstrate that given a fixed depth of coverage, the average per-sample performance generally increases with the number of sequenced samples as long as the per-sample coverage is high enough.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

No Evidence from Genome-Wide Data of a Khazar Origin for the Ashkenazi Jews

Author: Alena Kushniarevich
Anastasia Kouvatsi
Antonio Torroni
Ardeshir Bahmanimehr
Ariella Gladstein
Baron S. W.
Bayazit Yunusbayev
Ben-Sasson H. H
Costas Triantaphyllidis
Damir Marjanovic
De Lange N.
Doron M. Behar
Elena Balanovsky
Elza K. Khusnutdinova
Ene Metspalu
Eran Halperin
Evelin Mihailov
Haber M.
Henn B. M.
Hovhannes Sahakyan
Karl Skorecki
Kristiina Tambets
Lejla Kovacevic
Levon Yepiskoposyan
Mait Metspalu
Michael F. Hammer
Naama M. Kopelman
Noah A. Rosenberg
Oleg Balanovsky
Ornella Semino
R Development Core Team.
Richard Villems
Ritte U.
Roy J. King
Saharon Rosset
Shay Tzur
Yael Baran
Publication venue: Human Biology (The International Journal of Population Biology and Genetics)
Publication date
Field of study

Crossref

Simultaneous binning over multiple samples achieves higher precision compared with the equivalent single-sample approach.

Author: Eran Halperin (22303)
Yael Baran (332633)
Publication venue
Publication date
Field of study

MultiBin and AbundanceBin were both run on datasets of increasing complexity. Each dataset is composed of 5 mixtures of the specified number of species. The specified precision is the proportion of reads correctly assigned to a bin, averaged over all species. For MultiBin (red) the curves show average precision over 10 random starts of the clustering algorithm, and the error bars give the standard error of the mean. For AbundanceBin (blue) the curves show the average precision over the 5 samples in the dataset, and the dashed lines give the highest and lowest result of the 5. MultiBin achieves consistently better precision over both read lengths and over all sample complexities. AbundanceBin's performance exhibits high between-sample variability and also deteriorates more rapidly as the number of species increases.
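The precision measure and the error bars described in this caption can be sketched as follows. This is only an illustration, not the authors' evaluation code: the caption does not say how a bin is matched to a species, so the sketch assumes that the "correct" bin for a species is the bin holding the largest share of its reads. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def mean_per_species_precision(true_species, assigned_bins):
    """Fraction of each species' reads in its best bin, averaged over species.

    Assumption (not stated in the caption): the bin that receives most of a
    species' reads counts as the correct bin for that species.
    """
    bins_by_species = defaultdict(Counter)
    for species, bin_id in zip(true_species, assigned_bins):
        bins_by_species[species][bin_id] += 1
    fractions = [
        max(bin_counts.values()) / sum(bin_counts.values())
        for bin_counts in bins_by_species.values()
    ]
    return sum(fractions) / len(fractions)


def standard_error_of_mean(values):
    """Sample standard deviation divided by sqrt(n), as used for error bars."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (variance / n) ** 0.5


# Toy data: six reads from two species, clustered into two bins.
true_species = ["A", "A", "A", "A", "B", "B"]
assigned_bins = [0, 0, 0, 1, 1, 1]
precision = mean_per_species_precision(true_species, assigned_bins)

# Hypothetical precisions from three random starts of a clustering run.
sem = standard_error_of_mean([0.8, 0.9, 1.0])
```

Under this matching rule two species may claim the same bin, so a stricter evaluation would enforce a one-to-one assignment between species and bins.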

FigShare

Crohn's disease status separates with components proportions.

Author: Eran Halperin (22303)
Yael Baran (332633)
Publication venue
Publication date
Field of study

Each marker corresponds to an individual, red for Crohn-free and filled black for Crohn cases. The markers are positioned on the two-dimensional plane defined by the components proportions (there are three components but only two dimensions because the proportions sum to 1). The Crohn-free individual at the bottom part of the figure is a colitis case.

FigShare

PLSA approximates mixture coefficient better than PCA.

Author: Eran Halperin (22303)
Yael Baran (332633)
Publication venue
Publication date
Field of study

PCA and PLSA were performed on a simulated counts matrix with and different number of per-sample counts. The plot shows the average squared correlation coefficient between the true vectors and the three strongest principal components (in the case of PCA) or PLSA estimates . For each per-sample counts value 20 experiments were performed, and the plot gives the mean result and the standard error of the mean. The estimates obtained by PLSA show higher correlation with the true mixture proportions.
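For readers unfamiliar with PLSA, the following is a minimal sketch of the standard EM updates on a samples-by-features counts matrix. It is written in plain Python for illustration and is not the authors' implementation. The function names, the random initialization, the iteration count, and the toy counts matrix are all assumptions. `mix[s][z]` stands for the proportion of component `z` in sample `s`, and `comp[z][w]` for the probability of feature `w` under component `z`.

```python
import math
import random


def _normalize(row):
    """Scale a non-negative vector to sum to 1 (uniform if it is all zeros)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]


def log_likelihood(counts, mix, comp):
    """Multinomial log-likelihood of the counts, up to an additive constant."""
    ll = 0.0
    for s, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                p = sum(m * c[w] for m, c in zip(mix[s], comp))
                ll += n * math.log(p)
    return ll


def plsa(counts, n_components, n_iter=50, seed=0):
    """Fit PLSA by EM; returns (mix, comp, log-likelihood trace)."""
    rng = random.Random(seed)
    n_samples, n_features = len(counts), len(counts[0])
    # Random positive starting point (an arbitrary choice for this sketch).
    mix = [_normalize([rng.random() + 0.1 for _ in range(n_components)])
           for _ in range(n_samples)]
    comp = [_normalize([rng.random() + 0.1 for _ in range(n_features)])
            for _ in range(n_components)]
    trace = [log_likelihood(counts, mix, comp)]
    for _ in range(n_iter):
        new_mix = [[0.0] * n_components for _ in range(n_samples)]
        new_comp = [[0.0] * n_features for _ in range(n_components)]
        for s in range(n_samples):
            for w in range(n_features):
                n = counts[s][w]
                if n == 0:
                    continue
                # E-step: posterior over components for this (sample, feature) cell.
                post = _normalize([mix[s][z] * comp[z][w]
                                   for z in range(n_components)])
                # Accumulate expected counts for the M-step.
                for z in range(n_components):
                    new_mix[s][z] += n * post[z]
                    new_comp[z][w] += n * post[z]
        # M-step: renormalize the expected counts into probabilities.
        mix = [_normalize(r) for r in new_mix]
        comp = [_normalize(r) for r in new_comp]
        trace.append(log_likelihood(counts, mix, comp))
    return mix, comp, trace


# Toy counts matrix: 4 samples by 4 features (hypothetical data).
counts = [[90, 10, 0, 0], [80, 20, 0, 0], [0, 0, 30, 70], [0, 0, 40, 60]]
mix, comp, trace = plsa(counts, n_components=2)
```

Each row of `mix` is a point on the probability simplex, which is the quantity the caption compares against the true mixture proportions. EM only guarantees a local optimum, so in practice several random starts would be compared.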

FigShare

Increasing the number of samples for a fixed depth of coverage improves both components characterization and binning precision.

Author: Eran Halperin (22303)
Yael Baran (332633)
Publication venue
Publication date
Field of study

Left: A fixed number of counts were generated from a model defined by uniformly drawn and matrices using . The value of , the number of samples, varied from 1 to 1000, and 100 trials were performed for each value. The highest average precision of estimation is obtained for . Right: A fixed number of 8,192,000 reads of length 400 bp were sampled from different numbers of samples, each consisting of 15 species in uniformly drawn proportions. The smallest average error over all samples was obtained when 32 samples are sequenced. In both plots the error bars give the standard error of the mean.

FigShare

Significant correlations between components proportions and phenotypes.

Author: Eran Halperin (22303)
Yael Baran (332633)
Publication venue
Publication date
Field of study

The predicted variables were regressed on the proportions of each component separately. The table gives the regression p-values, and in parentheses the empirical p-values obtained by permuting the components proportions times while keeping the phenotypes constant.
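The empirical p-values mentioned in this caption can be illustrated with a small permutation test. This is a sketch under stated assumptions rather than the authors' procedure: the number of permutations is missing from the caption, so `n_permutations=1000` is a placeholder, and the absolute Pearson correlation is used as the test statistic because, for a regression on a single predictor, it orders the permutations in the same way as the slope's t-statistic. All names and the toy data are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def empirical_p_value(proportions, phenotype, n_permutations=1000, seed=0):
    """Permute the proportions, keep the phenotype fixed, and count how often
    the permuted statistic is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson_r(proportions, phenotype))
    shuffled = list(proportions)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(shuffled, phenotype)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data: a phenotype that tracks one component's proportion closely.
proportions = [0.05 * i for i in range(1, 13)]
noise = [0.2, -0.1, 0.1, -0.2, 0.0, 0.1, -0.1, 0.2, -0.2, 0.0, 0.1, -0.1]
phenotype = [20 + 10 * p + e for p, e in zip(proportions, noise)]
p_value = empirical_p_value(proportions, phenotype)
```

The add-one correction is a common convention for permutation tests; whether the table in the source uses it is not stated.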

FigShare

Association between k-mers relative abundances and BMI can be corrected using the components proportions.

Author: Eran Halperin (22303)
Yael Baran (332633)
Publication venue
Publication date
Field of study

Quantile-quantile curves comparing the uniform distribution to the distribution of the p-values for association between all k-mers and BMI within the Danish samples. The uncorrected p-values are highly deflated (black), indicating that the abundance of many k-mers is correlated with BMI. However, when the components proportions are added to the regression equation (red), the correlation disappears for most k-mers.
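The effect described in this caption, an association that vanishes once the component proportions enter the regression, can be reproduced on toy data. The sketch below is a simplification and not the authors' pipeline: it uses a single component proportion as the covariate, and it applies the correction by regressing the covariate out of both variables and correlating the residuals, which is equivalent to testing the extra term in a regression that includes the covariate. All names and numbers are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residualize(values, covariate):
    """Residuals of the least-squares fit values ~ intercept + covariate."""
    n = len(values)
    mv, mc = sum(values) / n, sum(covariate) / n
    slope = (sum((c - mc) * (v - mv) for c, v in zip(covariate, values))
             / sum((c - mc) ** 2 for c in covariate))
    return [(v - mv) - slope * (c - mc) for v, c in zip(values, covariate)]


def corrected_correlation(abundance, phenotype, covariate):
    """Correlation between abundance and phenotype after removing the covariate."""
    return pearson_r(residualize(abundance, covariate),
                     residualize(phenotype, covariate))


# Toy data: both variables depend on the component proportion but not on
# each other (the two noise vectors are orthogonal by construction).
component = [0.1 * i for i in range(1, 9)]
noise_a = [1, -1, -1, 1, 1, -1, -1, 1]
noise_b = [1, 1, -1, -1, -1, -1, 1, 1]
abundance = [c + 0.05 * e for c, e in zip(component, noise_a)]
bmi = [20 + 20 * c + e for c, e in zip(component, noise_b)]

raw = pearson_r(abundance, bmi)
corrected = corrected_correlation(abundance, bmi, component)
```

Here `raw` is large because both variables follow the shared component, while `corrected` is essentially zero. With three components whose proportions sum to 1, two proportions would be needed as covariates, which requires a multiple regression rather than this single-covariate shortcut.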

FigShare

PIGS: improved estimates of identity-by-descent probabilities by probabilistic IBD graph sampling.

Author: Baran Yael
Burchard Esteban G
Eng Celeste
Hormozdiari Farhad
Park Danny S
Torgerson Dara G
Zaitlen Noah
Publication venue: eScholarship, University of California
Publication date: 01/01/2015
Field of study

Identifying segments in the genome of different individuals that are identical-by-descent (IBD) is a fundamental element of genetics. IBD data is used for numerous applications including demographic inference, heritability estimation, and mapping disease loci. Simultaneous detection of IBD over multiple haplotypes has proven to be computationally difficult. To overcome this, many state-of-the-art methods estimate the probability of IBD between each pair of haplotypes separately. While computationally efficient, these methods fail to leverage the clique structure of IBD, resulting in less powerful IBD identification, especially for small IBD segments.
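The clique structure mentioned in the abstract follows from the fact that IBD at a given locus is transitive: if haplotype a is IBD with b and b with c, then a is IBD with c. Pairwise callers can violate this. The sketch below only illustrates that consistency check and is not the PIGS algorithm; the function name and the haplotype labels are hypothetical.

```python
from itertools import combinations


def transitivity_violations(haplotypes, ibd_pairs):
    """Return triples in which exactly two of the three pairs are called IBD.

    Such an open triangle cannot occur under true IBD at a single locus,
    where IBD haplotypes must form cliques.
    """
    ibd = {frozenset(pair) for pair in ibd_pairs}

    def linked(a, b):
        return frozenset((a, b)) in ibd

    violations = []
    for a, b, c in combinations(haplotypes, 3):
        n_links = linked(a, b) + linked(b, c) + linked(a, c)
        if n_links == 2:
            violations.append((a, b, c))
    return violations


# Hypothetical pairwise calls at one locus.
haplotypes = ["h1", "h2", "h3"]
open_triangle = transitivity_violations(haplotypes, [("h1", "h2"), ("h2", "h3")])
closed_triangle = transitivity_violations(
    haplotypes, [("h1", "h2"), ("h2", "h3"), ("h1", "h3")]
)
```

A method that models all haplotypes jointly can use such triples as evidence that either the missing pair is in fact IBD or one of the two called pairs is a false positive.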

CiteSeerX

Crossref

Springer - Publisher Connector

PubMed Central

eScholarship - University of California
