De novo human genome assemblies reveal spectrum of alternative haplotypes in diverse populations.
The human reference genome is used extensively in modern biological research. However, a single consensus representation is inadequate to provide a universal reference structure because it is a haplotype among many in the human population. Using 10× Genomics (10×G) "Linked-Read" technology, we perform whole genome sequencing (WGS) and de novo assembly on 17 individuals across five populations. We identify 1842 breakpoint-resolved non-reference unique insertions (NUIs) that, in aggregate, add up to 2.1 Mb of so far undescribed genomic content. Among these, 64% are considered ancestral to humans since they are found in non-human primate genomes. Furthermore, 37% of the NUIs can be found in the human transcriptome and 14% likely arose from Alu-recombination-mediated deletion. Our results underline the need for a set of human reference genomes that includes a comprehensive list of alternative haplotypes to depict the complete spectrum of genetic diversity across populations.
Limit theorems for functions of marginal quantiles
Multivariate distributions are explored using the joint distributions of
marginal sample quantiles. Limit theory for the mean of a function of order
statistics is presented. The results include a multivariate central limit
theorem and a strong law of large numbers. A result similar to Bahadur's
representation of quantiles is established for the mean of a function of the
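marginal quantiles. In particular, it is shown that
$\sqrt{n}\Bigl(\frac{1}{n}\sum_{i=1}^{n}\phi\bigl(X_{n:i}^{(1)},\ldots,X_{n:i}^{(d)}\bigr)-\bar{\gamma}\Bigr)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Z_{n,i}+o_P(1)$
as $n\rightarrow\infty$, where $\bar{\gamma}$ is a constant and $Z_{n,i}$ are
i.i.d. random variables for each $n$. This leads to the central limit theorem.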
Weak convergence to a Gaussian process using equicontinuity of functions is
indicated. The results are established under very general conditions. These
conditions are shown to be satisfied in many commonly occurring situations.
Comment: Published at http://dx.doi.org/10.3150/10-BEJ287 in Bernoulli
(http://isi.cbs.nl/bernoulli/) by the International Statistical
Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
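The statistic studied above, the mean of a function of marginal order statistics, can be illustrated with a small simulation. This is only a sketch: the function name `mean_of_marginal_order_stats`, the choice phi(x, y) = x * y, and the independent uniform sample are assumptions made for illustration and do not come from the paper.

```python
import random


def mean_of_marginal_order_stats(sample, phi):
    """Average phi over the i-th order statistics of each coordinate.

    sample: list of d-dimensional tuples, one tuple per observation.
    phi: function taking d numbers and returning a number.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    n = len(sample)
    # Sort every coordinate on its own. This discards the original pairing
    # of coordinates and yields the marginal order statistics.
    sorted_columns = [sorted(column) for column in zip(*sample)]
    # Row i of the re-zipped columns holds the i-th smallest value of
    # each margin; phi is applied to that row.
    return sum(phi(*row) for row in zip(*sorted_columns)) / n


def demo(n=20000, seed=0):
    """Simulate two independent Uniform(0, 1) margins (an assumed setup)."""
    rng = random.Random(seed)
    sample = [(rng.random(), rng.random()) for _ in range(n)]
    return mean_of_marginal_order_stats(sample, lambda x, y: x * y)
```

With two independent uniform margins, the i-th order statistics of both coordinates are close to i/(n+1), so `demo()` should land near the integral of u^2 over [0, 1], that is 1/3, with fluctuations that shrink roughly like 1/sqrt(n).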
AT excursion: a new approach to predict replication origins in viral genomes by locating AT-rich regions
Background: Replication origins are considered important sites for understanding the molecular mechanisms involved in DNA replication. Many computational methods have been developed for predicting their locations in archaeal, bacterial and eukaryotic genomes. However, a prediction method designed for a particular kind of genome might not work well for another. In this paper, we propose the AT excursion method, a score-based approach that quantifies local AT abundance in genomic sequences and uses the identified high-scoring segments to predict replication origins. This method has the advantages of requiring no preset window size and having rigorous criteria to evaluate the statistical significance of high-scoring segments.
Results: We have evaluated the AT excursion method by checking its predictions against known replication origins in herpesviruses and comparing its performance with an existing base weighted score method (BWS1). Out of 43 known origins, 39 are predicted by either one or the other method and 26 origins are predicted by both. The excursion method identifies six origins not predicted by BWS1, showing that the AT excursion method is a valuable complement to BWS1. We have also applied the AT excursion method to two other families of double-stranded DNA viruses, the poxviruses and iridoviruses, for which very few replication origins are documented in the public domain. The prediction results are made available as supplementary materials at [1]. Preliminary investigation shows that the proposed method works well on some larger genomes too.
Conclusion: The AT excursion method will be a useful computational tool for identifying replication origins in a variety of genomic sequences.
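A score-based excursion of the kind described above can be sketched as follows. The abstract does not give the scoring scheme or the significance criteria, so the scores (+1 for A/T, -2 for any other base) and the threshold of 5 are hypothetical placeholders, and `at_excursion_segments` is an illustrative name rather than the authors' code.

```python
def at_excursion_segments(seq, at_score=1, gc_score=-2, threshold=5):
    """Return (start, end, peak) for AT-rich high-scoring segments.

    The excursion is a running sum of per-base scores that is reset to
    zero whenever it would become negative. A segment runs from the point
    where the excursion leaves zero to the point where it reaches its peak.
    Indices are 0-based and end is exclusive. All default values are
    assumptions for illustration, not values from the paper.
    """
    segments = []
    excursion = 0
    start = None      # index where the current excursion left zero
    peak = 0          # highest excursion value seen since start
    peak_end = None   # index just past the base where the peak occurred
    for i, base in enumerate(seq.upper()):
        # Any base other than A or T (including N) gets the negative score.
        score = at_score if base in "AT" else gc_score
        if excursion == 0 and score > 0:
            start, peak, peak_end = i, 0, i
        excursion = max(0, excursion + score)
        if excursion > peak:
            peak, peak_end = excursion, i + 1
        if excursion == 0 and start is not None:
            # The excursion has returned to zero: close the segment.
            if peak >= threshold:
                segments.append((start, peak_end, peak))
            start = None
    # Handle an excursion that is still open at the end of the sequence.
    if start is not None and peak >= threshold:
        segments.append((start, peak_end, peak))
    return segments
```

For example, `at_excursion_segments("GCGCAAATTTATGCGC")` reports the AT run at positions 4 to 12. No window size is supplied anywhere, which is the property the abstract highlights; a real application would replace the fixed threshold with a statistically derived one.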
A post-processing method for optimizing synthesis strategy for oligonucleotide microarrays
The broad applicability of gene expression profiling to genomic analyses has generated huge demand for mass production of microarrays and hence for improving the cost effectiveness of microarray fabrication. We developed a post-processing method for deriving a good synthesis strategy. In this paper, we assessed all the known efficient methods and our post-processing method for reducing the number of synthesis cycles needed to manufacture a DNA chip for a given set of oligos. Our experimental results on both simulated and 52 real datasets show that no single method consistently gives the best synthesis strategy, and post-processing an existing strategy is necessary as it often reduces the number of synthesis cycles further.
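To make the idea of post-processing a synthesis strategy concrete, here is a deliberately simple sketch. It assumes a strategy is a deposition sequence in which each cycle adds one nucleotide and every oligo must be a subsequence of it; the step shown (dropping cycles that no oligo uses under a greedy leftmost embedding) is an illustrative simplification, not the method developed in the paper, and both function names are hypothetical.

```python
def embed_leftmost(oligo, deposition):
    """Return the cycle indices used to build oligo, or None if impossible.

    Each base of the oligo is assigned to the earliest remaining cycle
    that deposits that base.
    """
    used = []
    pos = 0
    for base in oligo:
        while pos < len(deposition) and deposition[pos] != base:
            pos += 1
        if pos == len(deposition):
            return None
        used.append(pos)
        pos += 1
    return used


def drop_unused_cycles(oligos, deposition):
    """Shorten a deposition sequence by removing cycles no oligo uses."""
    used = set()
    for oligo in oligos:
        indices = embed_leftmost(oligo, deposition)
        if indices is None:
            raise ValueError(f"{oligo!r} cannot be synthesized by this strategy")
        used.update(indices)
    # Every oligo keeps its own cycles, in order, so the shortened
    # sequence can still synthesize the whole set.
    return "".join(deposition[i] for i in sorted(used))
```

For instance, `drop_unused_cycles(["ACG", "AGT"], "ACGTACGT")` shortens a periodic eight-cycle strategy to `"ACGT"`.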
ConReg-R: Extrapolative recalibration of the empirical distribution of p-values to improve false discovery rate estimates
Background: False discovery rate (FDR) control is commonly accepted as the most appropriate error control in multiple hypothesis testing problems. The accuracy of FDR estimation depends on the accuracy of the p-values estimated from each test and on the validity of the underlying distributional assumptions. However, in many practical testing problems, such as in genomics, the p-values could be under-estimated or over-estimated for many known or unknown reasons. Consequently, the FDR estimates are affected and become unreliable.
Results: We propose a new extrapolative method called Constrained Regression Recalibration (ConReg-R) to recalibrate the empirical p-values by modeling their distribution to improve the FDR estimates. Our ConReg-R method is based on the observation that accurately estimated p-values from true null hypotheses follow a uniform distribution and that the observed distribution of p-values is a mixture of the distributions of p-values from true null hypotheses and true alternative hypotheses. Hence, ConReg-R recalibrates the observed p-values so that they exhibit the properties of an ideal empirical p-value distribution. The proportion of true null hypotheses (π0) and the FDR are estimated after the recalibration.
Conclusions: ConReg-R provides an efficient way to improve the FDR estimates. It only requires the p-values from the tests and avoids permutation of the original test data. We demonstrate that the proposed method significantly improves FDR estimation on several gene expression datasets obtained from microarray and RNA-seq experiments.
Reviewers: The manuscript was reviewed by Prof. Vladimir Kuznetsov, Prof. Philippe Broet, and Prof. Hongfang Liu (nominated by Prof. Yuriy Gusev).
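The mixture observation above (uniform p-values under the null, small p-values under the alternative) is also what standard FDR estimators rely on. The sketch below is not ConReg-R, whose recalibration step is not specified in the abstract. It shows two well-known building blocks instead: a Storey-type estimate of the proportion of true nulls and Benjamini-Hochberg adjusted p-values. The function names and the cutoff `lam=0.5` are assumptions chosen for illustration.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true null hypotheses.

    Null p-values are uniform, so about pi0 * (1 - lam) of all p-values
    are expected above lam; the estimate inverts that relation.
    """
    m = len(pvalues)
    if m == 0:
        raise ValueError("pvalues must not be empty")
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / ((1.0 - lam) * m))


def bh_adjust(pvalues, pi0=1.0):
    """Benjamini-Hochberg adjusted p-values, optionally scaled by pi0."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone in the original p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

If the input p-values are miscalibrated, both functions inherit the error, which is the problem a recalibration step applied beforehand is meant to address.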
Application of next generation sequencing to CEPH cell lines to discover variants associated with FDA approved chemotherapeutics
After publication of this work [1], it has come to our attention
that there is an error in the author list of the initial
version of this manuscript; rather than Ernest J Lam,
the second author of the manuscript should be listed as
Ernest T Lam.
Comprehensive Analysis of Human Subtelomeres by Whole Genome Mapping
Detailed, comprehensive knowledge of the structures of individual long-range telomere-terminal haplotypes is needed to understand their impact on telomere function, and to delineate the population structure and evolution of subtelomere regions. However, the abundance of large, evolutionarily recent segmental duplications and high levels of large structural variation have complicated both the mapping and sequence characterization of human subtelomere regions. Here, we use high-throughput optical mapping of large single DNA molecules in nanochannel arrays for 154 human genomes from 26 populations to present a comprehensive look at human subtelomere structure and variation. The results catalog many novel long-range subtelomere haplotypes and determine the frequencies and contexts of specific subtelomeric duplicons on each chromosome arm, helping to clarify the currently ambiguous nature of many specific subtelomere structures as represented in the current reference sequence (HG38). The organization and content of some duplicons in subtelomeres appear to show both chromosome arm-specific and population-specific trends. Based upon these trends, we estimate a timeline for the spread of these duplication blocks.