110 research outputs found

    Modeling ChIP Sequencing In Silico with Applications

    ChIP sequencing (ChIP-seq) is a new method for genomewide mapping of protein binding sites on DNA. It has generated much excitement in functional genomics. To score data and determine adequate sequencing depth, both the genomic background and the binding sites must be properly modeled. To develop a computational foundation to tackle these issues, we first performed a study to characterize the observed statistical nature of this new type of high-throughput data. By linking sequence tags into clusters, we show that there are two components to the distribution of tag counts observed in a number of recent experiments: an initial power-law distribution and a subsequent long right tail. Then we develop in silico ChIP-seq, a computational method to simulate the experimental outcome by placing tags onto the genome according to particular assumed distributions for the actual binding sites and for the background genomic sequence. In contrast to current assumptions, our results show that both the background and the binding sites need to have a markedly nonuniform distribution in order to correctly model the observed ChIP-seq data, with, for instance, the background tag counts modeled by a gamma distribution. On the basis of these results, we extend an existing scoring approach by using a more realistic genomic-background model. This enables us to identify transcription-factor binding sites in ChIP-seq data in a statistically rigorous fashion

    Identification and analysis of unitary pseudogenes: historic and contemporary gene losses in humans and other primates

    Novel human pseudogenes that were previously functional are identified, and their ages are estimated. The rate of loss-of-function occurred uniformly

    Substance abuse and the risk of severe COVID-19: Mendelian randomization confirms the causal role of opioids but hints a negative causal effect for cannabinoids

    Since the start of the COVID-19 global pandemic, our understanding of the underlying disease mechanism and factors associated with the disease severity has dramatically increased. A recent study investigated the relationship between substance use disorders (SUD) and the risk of severe COVID-19 in the United States and concluded that the risk of hospitalization and death due to COVID-19 is directly correlated with substance abuse, including opioid use disorder (OUD) and cannabis use disorder (CUD). While we found this analysis fascinating, we believe this observation may be biased due to comorbidities (such as hypertension, diabetes, and cardiovascular disease) confounding the direct effect of SUD on severe COVID-19 illness. To answer this question, we sought to investigate the causal relationship between substance abuse and medication-taking history (as a proxy trait for comorbidities) with the risk of COVID-19 adverse outcomes. Our Mendelian randomization analysis confirms the causal relationship between OUD and severe COVID-19 illness but suggests an inverse causal effect for cannabinoids. Considering that COVID-19 mortality is largely attributed to disturbed immune regulation, the possible modulatory impact of cannabinoids in alleviating cytokine storms merits further investigation

    Tilescope: online analysis pipeline for high-density tiling microarray data

    Tilescope is a fully integrated and automated new data-processing pipeline for analyzing high-density tiling-array data

    Identification of genomic indels and structural variations using split reads

    Background: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection. Results: We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs. Conclusions: Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful

    Mechanosignaling activation of TGFβ maintains intervertebral disc homeostasis

    Intervertebral disc (IVD) degeneration is the leading cause of disability and has no disease-modifying treatment. IVD degeneration is associated with unstable mechanical loading in the spine, but little is known about how mechanical stress regulates nucleus notochordal (NC) cells to maintain IVD homeostasis. Here we report that mechanical stress can result in excessive integrin αvβ6-mediated activation of transforming growth factor beta (TGFβ), decreased NC cell vacuoles, and increased matrix proteoglycan production, and results in degenerative disc disease (DDD). Knockout of TGFβ type II receptor (TβRII) or integrin αv in the NC cells inhibited functional activity of postnatal NC cells and also resulted in DDD under mechanical loading. Administration of RGD peptide, TGFβ, and αvβ6-neutralizing antibodies attenuated IVD degeneration. Thus, integrin-mediated activation of TGFβ plays a critical role in mechanical signaling transduction to regulate IVD cell function and homeostasis. Manipulation of this signaling pathway may be a potential therapeutic target to modify DDD

    The DNA Repair Gene APE1 T1349G Polymorphism and Risk of Gastric Cancer in a Chinese Population

    Background: Apurinic/apyrimidinic endonuclease 1 (APE1) has a central role in the repair of apurinic/apyrimidinic sites through both its endonuclease and its phosphodiesterase activities. A common APE1 polymorphism, T1349G (rs3136820), was previously shown to be associated with the risk of cancers. Objective: We hypothesized that the APE1 T1349G polymorphism is also associated with risk of gastric cancer. Methods: In a hospital-based case-control study of 338 case patients with newly diagnosed gastric cancer and 362 cancer-free controls frequency-matched by age and sex, we genotyped the T1349G polymorphism and assessed its associations with risk of gastric cancer. Results: Compared with the APE1 TT genotype, individuals with the variant TG/GG genotypes had a significantly increased risk of gastric cancer (odds ratio = 1.69, 95% confidence interval = 1.19–2.40), which was more pronounced among the subgroups of those aged ≤60 years, males, ever smokers, and ever drinkers. Further analyses revealed that the variant genotypes were associated with an increased risk for diffuse-type, low depth of tumor infiltration (T1 and T2), and lymph node metastasis gastric cancer. Conclusions: The APE1 T1349G polymorphism may be a marker for the development of gastric cancer in the Chinese population. Larger studies are required to validate these findings in diverse populations

    Integrating Sequencing Technologies in Personal Genomics: Optimal Low Cost Reconstruction of Structural Variants

    The goal of human genome re-sequencing is obtaining an accurate assembly of an individual's genome. Recently, there has been great excitement in the development of many technologies for this (e.g. medium and short read sequencing from companies such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbleGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider optimally integrating technologies. Here, we build a simulation toolbox that will help us optimally combine different technologies for genome re-sequencing, especially in reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing. (It is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome.) To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how we can combine different technologies to optimally solve the assembly at low cost. With mapability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using one length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding in arrays is more efficient than just sequencing for disentangling some complex SVs. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost

    Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model

    Background: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and newly developed read-depth approach through ultrahigh throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale. Results: We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and more recently developed high-throughput DNA sequencing. Treating parameters--e.g., the number of CNVs, the position of each CNV, and the data noise level--that define the underlying data generating process as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets in comparison to other segmentation algorithms. Conclusions: In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors

    Associations of IL-4, IL-4R, and IL-13 Gene Polymorphisms in Coal Workers' Pneumoconiosis in China: A Case-Control Study

    Background: The IL-4, IL-4 receptor (IL-4R), and IL-13 genes are crucial immune factors and may influence the course of various diseases. In the present study, we investigated the association between the potential functional polymorphisms in IL-4, IL-4R, and IL-13 and coal workers' pneumoconiosis (CWP) risk in a Chinese population. Methods: Six polymorphisms (C-590T in IL-4, Ile50Val, Ser478Pro, and Gln551Arg in IL-4R, C-1055T and Arg130Gln in IL-13) were genotyped and analyzed in a case-control study of 556 CWP and 541 control subjects. Results: Our results revealed that the IL-4 CT/CC genotypes were associated with a significantly decreased risk of CWP (odd