Integrating Prior Knowledge in Multiple Testing under Dependence with Applications to Detecting Differential DNA Methylation
DNA methylation has emerged as an important hallmark of epigenetics. Numerous platforms, including tiling arrays and next-generation sequencing, and experimental protocols are available for profiling DNA methylation. Like other tiling array data, DNA methylation data exhibit an inherent correlation structure among nearby probes. However, unlike gene expression or protein-DNA binding data, the varying CpG density, which gives rise to the CpG island, shore and shelf definitions, provides exogenous information for detecting differential methylation. This paper introduces a robust testing and probe ranking procedure based on a non-homogeneous hidden Markov model that incorporates the above-mentioned features for detecting differential methylation. We revisit the seminal work of Sun and Cai (2009, J. R. Stat. Soc. B. 71, 393-424) and propose modeling the non-null using a non-parametric symmetric distribution in two-sided hypothesis testing. Based on extensive simulation studies, we show that this model improves probe ranking and is robust to model misspecification. We further illustrate that our proposed framework achieves good operating characteristics compared with commonly used methods on real DNA methylation data when detecting differentially methylated sites
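To make the probe-ranking idea concrete, here is a minimal sketch of a step-up rule of the kind associated with Sun and Cai's HMM-based testing: probes are ranked by their posterior probability of being null, and the largest set whose average posterior null probability stays below the target level is rejected. It assumes those posterior probabilities have already been computed by some fitted hidden Markov model; the function name `lis_step_up` and the toy values are illustrative assumptions, not taken from the paper.

```python
def lis_step_up(posterior_null, alpha=0.05):
    """Return sorted indices of rejected hypotheses.

    posterior_null[i] is assumed to be P(probe i is null | all data),
    e.g. obtained from a forward-backward pass of a fitted HMM
    (that fitting step is not shown here).
    """
    # Rank probes from most to least likely to be non-null.
    order = sorted(range(len(posterior_null)), key=lambda i: posterior_null[i])
    running_sum = 0.0
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        running_sum += posterior_null[idx]
        # Keep the largest rank whose running mean is within the target level.
        if running_sum / rank <= alpha:
            n_reject = rank
    return sorted(order[:n_reject])


# Hypothetical posterior null probabilities for six probes.
toy_values = [0.01, 0.90, 0.03, 0.20, 0.02, 0.75]
rejected = lis_step_up(toy_values, alpha=0.05)
```

With these toy values the three probes with the smallest posterior null probabilities are rejected; adding the fourth would push the running mean above 0.05.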
Determining Physical Constraints in Transcriptional Initiation Complexes Using DNA Sequence Analysis
Eukaryotic gene expression is often under the control of cooperatively acting transcription factors whose binding is limited by structural constraints. By determining these structural constraints, we can understand the “rules” that define functional cooperativity. Conversely, by understanding the rules of binding, we can infer structural characteristics. We have developed an information theory based method for approximating the physical limitations of cooperative interactions by comparing sequence analysis to microarray expression data. When applied to the coordinated binding of the sulfur amino acid regulatory protein Met4 by Cbf1 and Met31, we were able to create a combinatorial model that can correctly identify Met4 regulated genes. Interestingly, we found that the major determinant of Met4 regulation was the sum of the strength of the Cbf1 and Met31 binding sites and that the energetic costs associated with spacing appeared to be minimal
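The finding that regulation tracks the summed strength of two binding sites can be illustrated with a small sketch. It scores a promoter by the best log-odds match to each of two position weight matrices and adds the two scores. The motifs below are toy matrices built from made-up consensus strings, not the real Cbf1 or Met31 specificities, and all function names are hypothetical.

```python
import math

BACKGROUND = 0.25  # assumed uniform background base frequency


def toy_motif(consensus, p=0.7):
    """Build a toy frequency matrix: the consensus base gets probability p."""
    rest = (1 - p) / 3
    return [{b: (p if b == c else rest) for b in "ACGT"} for c in consensus]


def log_odds(freqs):
    """Convert per-position base frequencies to log2 odds against background."""
    return [{b: math.log2(q / BACKGROUND) for b, q in col.items()} for col in freqs]


def best_site_score(seq, matrix):
    """Score of the best-matching window; assumes len(seq) >= len(matrix)."""
    width = len(matrix)
    return max(
        sum(matrix[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    )


def combined_strength(seq, matrix_a, matrix_b):
    """Sum of the best site scores for two factors, ignoring site spacing."""
    return best_site_score(seq, matrix_a) + best_site_score(seq, matrix_b)


factor_a = log_odds(toy_motif("CAC"))  # hypothetical motif
factor_b = log_odds(toy_motif("TGT"))  # hypothetical motif

strong = combined_strength("GGCACTTTGTGG", factor_a, factor_b)
weak = combined_strength("GGGGGGGGGGGG", factor_a, factor_b)
```

Spacing between the two sites is deliberately left out of the score, mirroring the abstract's observation that the energetic cost of spacing appeared minimal; a fuller model could add a distance penalty term.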
Position specific variation in the rate of evolution in transcription factor binding sites
BACKGROUND: The binding sites of sequence specific transcription factors are an important and relatively well-understood class of functional non-coding DNAs. Although a wide variety of experimental and computational methods have been developed to characterize transcription factor binding sites, they remain difficult to identify. Comparison of non-coding DNA from related species has shown considerable promise in identifying these functional non-coding sequences, even though relatively little is known about their evolution. RESULTS: Here we analyse the genome sequences of the budding yeasts Saccharomyces cerevisiae, S. bayanus, S. paradoxus and S. mikatae to study the evolution of transcription factor binding sites. As expected, we find that both experimentally characterized and computationally predicted binding sites evolve slower than surrounding sequence, consistent with the hypothesis that they are under purifying selection. We also observe position-specific variation in the rate of evolution within binding sites. We find that the position-specific rate of evolution is positively correlated with degeneracy among binding sites within S. cerevisiae. We test theoretical predictions for the rate of evolution at positions where the base frequencies deviate from background due to purifying selection and find reasonable agreement with the observed rates of evolution. Finally, we show how the evolutionary characteristics of real binding motifs can be used to distinguish them from artefacts of computational motif finding algorithms. CONCLUSION: As has been observed for protein sequences, the rate of evolution in transcription factor binding sites varies with position, suggesting that some regions are under stronger functional constraint than others. This variation likely reflects the varying importance of different positions in the formation of the protein-DNA complex. 
The characterization of the pattern of evolution in known binding sites will likely contribute to the effective use of comparative sequence data in the identification of transcription factor binding sites and is an important step toward understanding the evolution of functional non-coding DNA
MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model
We introduce a method (MONKEY) to identify conserved transcription-factor binding sites in multispecies alignments. MONKEY employs probabilistic models of factor specificity and binding-site evolution, on which basis we compute the likelihood that putative sites are conserved and assign statistical significance to each hit. Using genomes from the genus Saccharomyces, we illustrate how the significance of real sites increases with evolutionary distance and explore the relationship between conservation and function
Flexible Promoter Architecture Requirements for Coactivator Recruitment
Background: The spatial organization of transcription factor binding sites in regulatory DNA, and the composition of intersite sequences, influence the assembly of the multiprotein complexes that regulate RNA polymerase recruitment and thereby affect transcription. We have developed a genetic approach to investigate how reporter gene transcription is affected by varying the spacing between transcription factor binding sites. We characterized the components of promoter architecture that govern the yeast transcription factors Cbf1 and Met31/32, which bind independently, but collaboratively recruit the coactivator Met4. Results: A Cbf1 binding site was required upstream of a Met31/32 binding site for full reporter gene expression. Distance constraints on coactivator recruitment were more flexible than those for cooperatively binding transcription factors. Distances from 18 to 50 bp between binding sites support efficient recruitment of Met4, with only slight modulation by helical phasing. Intriguingly, we found that certain sequences located between the binding sites abolished gene expression. Conclusion: These results yield insight into the influence of both binding site architecture and local DNA flexibility on gene expression, and can be used to refine computational predictions of gene expression from promoter sequences. In addition, our approach can be applied to survey promoter architecture requirements for arbitrary combinations of transcription factor binding sites
Conservation and Evolution of Cis-Regulatory Systems in Ascomycete Fungi
Relatively little is known about the mechanisms through which gene expression regulation evolves. To investigate this, we systematically explored the conservation of regulatory networks in fungi by examining the cis-regulatory elements that govern the expression of coregulated genes. We first identified groups of coregulated Saccharomyces cerevisiae genes enriched for genes with known upstream or downstream cis-regulatory sequences. Reasoning that many of these gene groups are coregulated in related species as well, we performed similar analyses on orthologs of coregulated S. cerevisiae genes in 13 other ascomycete species. We find that many species-specific gene groups are enriched for the same flanking regulatory sequences as those found in the orthologous gene groups from S. cerevisiae, indicating that those regulatory systems have been conserved in multiple ascomycete species. In addition to these clear cases of regulatory conservation, we find examples of cis-element evolution that suggest multiple modes of regulatory diversification, including alterations in transcription factor-binding specificity, incorporation of new gene targets into an existing regulatory system, and cooption of regulatory systems to control a different set of genes. We investigated one example in greater detail by measuring the in vitro activity of the S. cerevisiae transcription factor Rpn4p and its orthologs from Candida albicans and Neurospora crassa. Our results suggest that the DNA binding specificity of these proteins has coevolved with the sequences found upstream of the Rpn4p target genes and suggest that Rpn4p has a different function in N. crassa
Cancer gene discovery in hepatocellular carcinoma
Hepatocellular carcinoma (HCC) is a deadly cancer whose incidence is increasing worldwide. Although the main risk factors for HCC development, such as hepatitis B and C virus infection and alcohol abuse, have been clearly identified, understanding of the key drivers of this malignancy is still preliminary. Recent data suggest that genomic analysis of cirrhotic tissue - the pre-neoplastic carcinogenic field - may provide a read-out to identify at-risk populations for cancer development. Given this contextual complexity, it is of utmost importance to characterize the molecular pathogenesis of this disease, and pinpoint the dominant pathways/drivers by integrative oncogenomic approaches and/or sophisticated experimental models. Identification of the dominant proliferative signals and key aberrations will allow for a more personalized therapy
Lack of benefits for prevention of cardiovascular disease with aspirin therapy in type 2 diabetic patients - a longitudinal observational study
Background: The risk-benefit ratio of aspirin therapy in prevention of cardiovascular disease (CVD) remains contentious, especially in type 2 diabetes. This study examined the benefit and harm of low-dose aspirin (daily dose < 300 mg) in patients with type 2 diabetes. Methods: This is a longitudinal observational study with primary and secondary prevention cohorts based on history of CVD at enrolment. We compared the occurrence of primary composite (non-fatal myocardial infarction or stroke and vascular death) and secondary endpoints (upper GI bleeding and haemorrhagic stroke) between aspirin users and non-users between January 1995 and July 2005. Results: Of the 6,454 patients (follow-up, median [IQR]: 4.7 [4.4] years), usage of aspirin was 18% (n = 1,034) in the primary prevention cohort (n = 5,731) and 81% (n = 585) in the secondary prevention cohort (n = 723). After adjustment for covariates, in the primary prevention cohort, aspirin use was associated with a hazard ratio of 2.07 (95% CI: 1.66, 2.59, p < 0.001) for the primary endpoint. There was no difference in CVD event rate in the secondary prevention cohort. Overall, aspirin use was associated with a hazard ratio of 2.2 (1.53, 3.15, p < 0.001) for GI bleeding and 1.71 (1.00, 2.95, p = 0.051) for haemorrhagic stroke. The absolute risk of aspirin-related GI bleeding was 10.7 events per 1,000 person-years of treatment. Conclusion: In Chinese type 2 diabetic patients, low-dose aspirin was associated with a paradoxical increase in CVD risk in primary prevention and did not confer benefits in secondary prevention. In addition, the risk of GI bleeding in aspirin users was rather high.
A Robust Method for Transcript Quantification with RNA-Seq Data
The advent of high-throughput RNA-seq technology allows deep sampling of the transcriptome, making it possible to characterize both the diversity and the abundance of transcript isoforms. Accurate abundance estimation or transcript quantification of isoforms is critical for downstream differential analysis (e.g., healthy vs. diseased cells) but remains a challenging problem for several reasons. First, while various types of algorithms have been developed for abundance estimation, short reads often do not uniquely identify the transcript isoforms from which they were sampled. As a result, the quantification problem may not be identifiable, i.e., it may lack a unique transcript solution even if the reads map uniquely to the reference genome. In this article, we develop a general linear model for transcript quantification that leverages reads spanning multiple splice junctions to ameliorate identifiability. Second, RNA-seq reads sampled from the transcriptome exhibit unknown position-specific and sequence-specific biases. We extend our method to simultaneously learn bias parameters during transcript quantification to improve accuracy. Third, transcript quantification is often provided with a candidate set of isoforms, not all of which are likely to be significantly expressed in a given tissue type or condition. By resolving the linear system with LASSO, our approach can infer an accurate set of dominantly expressed transcripts while existing methods tend to assign positive expression to every candidate isoform. On simulated RNA-seq datasets, our method achieved better quantification accuracy and better inference of the dominant set of transcripts than existing methods. Applying our method to real data demonstrated experimentally that transcript quantification is effective for differential analysis of transcriptomes
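As a rough illustration of how an L1 penalty can switch off unexpressed candidate isoforms in a linear quantification model, the sketch below solves a small non-negative LASSO problem by coordinate descent. It is not the paper's model: bias learning is omitted, and the design matrix, read counts, penalty value, and function name `nonneg_lasso` are all hypothetical.

```python
def nonneg_lasso(design, counts, lam, n_sweeps=200):
    """Minimise 0.5 * ||counts - design @ x||^2 + lam * sum(x) subject to x >= 0.

    design[r][t] is the assumed contribution of isoform t to feature r
    (e.g. an exon or junction); counts[r] is the observed signal there.
    """
    n_rows, n_cols = len(design), len(design[0])
    x = [0.0] * n_cols
    residual = list(counts)  # equals counts - design @ x while x is all zeros
    col_sq = [sum(design[r][t] ** 2 for r in range(n_rows)) for t in range(n_cols)]
    for _ in range(n_sweeps):
        for t in range(n_cols):
            if col_sq[t] == 0:
                continue  # isoform touches no feature; leave it at zero
            # Correlation of column t with the residual, with x[t] added back.
            rho = sum(design[r][t] * residual[r] for r in range(n_rows)) + col_sq[t] * x[t]
            new_value = max(0.0, (rho - lam) / col_sq[t])  # soft-threshold at zero
            delta = new_value - x[t]
            if delta != 0.0:
                for r in range(n_rows):
                    residual[r] -= design[r][t] * delta
                x[t] = new_value
    return x


# Hypothetical gene: three exons (rows) and three candidate isoforms (columns).
# Isoform 0 uses exons 1-3, isoform 1 skips exon 2, isoform 2 skips exon 3.
design = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
counts = [7.0, 5.0, 7.0]  # generated from abundances (5, 2, 0) without noise
abundances = nonneg_lasso(design, counts, lam=0.1)
```

In this toy case the third isoform is driven exactly to zero while the first two stay close to their generating abundances, with the second shrunk slightly by the penalty. An unpenalised least-squares fit would give no such guarantee once noise is present.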
MapSplice: Accurate Mapping of RNA-Seq Reads for Splice Junction Discovery
The accurate mapping of reads that span splice junctions is a critical component of all analytic techniques that work with RNA-seq data. We introduce a second-generation splice detection algorithm, MapSplice, whose focus is high sensitivity and specificity in the detection of splices as well as CPU and memory efficiency. MapSplice can be applied to both short (<75 bp) and long reads (≥75 bp). MapSplice is not dependent on splice site features or intron length; consequently, it can detect novel canonical as well as non-canonical splices. MapSplice leverages the quality and diversity of read alignments of a given splice to increase accuracy. We demonstrate that MapSplice achieves higher sensitivity and specificity than TopHat and SpliceMap on a set of simulated RNA-seq data. Experimental studies also support the accuracy of the algorithm. Splice junctions derived from eight breast cancer RNA-seq datasets recapitulated the extensiveness of alternative splicing on a global level as well as the differences between molecular subtypes of breast cancer. These combined results indicate that MapSplice is a highly accurate algorithm for the alignment of RNA-seq reads to splice junctions. Software download URL: http://www.netlab.uky.edu/p/bioinfo/MapSplice