Methods for Joint Normalization and Comparison of Hi-C data

Author: Stansfield John C
Publication venue: VCU Scholars Compass
Publication date: 01/01/2019

The development of chromatin conformation capture technology has opened new avenues of study into the 3D structure and function of the genome. Chromatin structure is known to influence gene regulation, and differences in structure are now emerging as a mechanism of regulation in contexts such as cell differentiation and disease vs. normal states. Hi-C sequencing technology now provides a way to study the 3D interactions of chromatin over the whole genome. However, like all sequencing technologies, Hi-C suffers from several forms of bias stemming from both the technology and the DNA sequence itself. Several methods have been developed for normalizing individual Hi-C datasets, but little work has been done on joint normalization methods for comparing two or more Hi-C datasets. To make full use of Hi-C data, joint normalization and statistical comparison techniques are needed to identify regions where chromatin structure differs between conditions. We develop methods for the joint normalization and comparison of two Hi-C datasets, which we then extend to more complex experimental designs. Our normalization method is novel in that it makes use of the distance-dependent nature of chromatin interactions. Our modification of the Minus vs. Average (MA) plot to the Minus vs. Distance (MD) plot allows for a nonparametric, data-driven normalization technique using loess smoothing. Additionally, we present a simple statistical method using Z-scores for detecting differentially interacting regions between two datasets. Our initial method was published as the Bioconductor R package HiCcompare (http://bioconductor.org/packages/HiCcompare/). We then further extended our normalization and comparison method for use in complex Hi-C experiments with more than two datasets and optional covariates.
We extended the normalization method to jointly normalize any number of Hi-C datasets by using a cyclic loess procedure on the MD plot. The cyclic loess normalization technique can remove between-dataset biases efficiently and effectively even when several datasets are analyzed at one time. Our comparison method implements a generalized linear model-based approach for comparing complex Hi-C experiments, which may have more than two groups and additional covariates. The extended methods are also available as a Bioconductor R package, multiHiCcompare (http://bioconductor.org/packages/multiHiCcompare/). Finally, we demonstrate the use of HiCcompare and multiHiCcompare in several test cases on real data, in addition to comparing them to other similar methods (https://doi.org/10.1002/cpbi.76).
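
The MD-plot idea in this abstract can be illustrated with a small sketch. It is not the HiCcompare implementation: the function names (`loess_fit`, `md_normalize`, `difference_z_scores`) are hypothetical, the smoother is a bare-bones local linear fit with tricube weights standing in for a full loess routine, and splitting the fitted trend equally between the two datasets is an assumption made for the example. Here M is the log2 ratio of interaction frequencies for one pair of regions and D is the distance between those regions.

```python
import numpy as np

def loess_fit(x, y, span=0.5):
    """Local linear regression with tricube weights, evaluated at every x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(3, int(np.ceil(span * n))))  # points per local window
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        h = np.sort(d)[k - 1]          # bandwidth: distance to the k-th neighbour
        if h == 0:
            h = 1.0                    # all neighbours share x[i]; avoid 0/0
        w = np.clip(1.0 - (d / h) ** 3, 0.0, None) ** 3
        X = np.column_stack([np.ones(n), x - x[i]])
        XtW = (X * w[:, None]).T
        beta = np.linalg.lstsq(XtW @ X, XtW @ y, rcond=None)[0]
        fitted[i] = beta[0]            # intercept = fitted value at x[i]
    return fitted

def md_normalize(if1, if2, dist, span=0.5):
    """Remove the distance-dependent trend in M = log2(IF2 / IF1)."""
    if1 = np.asarray(if1, dtype=float)
    if2 = np.asarray(if2, dtype=float)
    m = np.log2(if2 / if1)
    f = loess_fit(dist, m, span)
    # Assumption: half of the fitted trend is applied to each dataset,
    # so the product IF1 * IF2 is left unchanged.
    return if1 * 2.0 ** (f / 2.0), if2 * 2.0 ** (-f / 2.0)

def difference_z_scores(if1, if2):
    """Z-scores of the log2 ratios; large |Z| flags candidate differences."""
    m = np.log2(np.asarray(if2, dtype=float) / np.asarray(if1, dtype=float))
    return (m - m.mean()) / m.std(ddof=1)
```

On real data the inputs would be the interaction frequencies of matching region pairs from two Hi-C matrices, and the Z-scores would be converted to p-values and corrected for multiple testing; both steps are omitted here.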

VCU Scholars Compass

Computational representation and discovery of transcription factor binding sites

Author: Maynou Fernàndez Joan
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2016

Thesis by compendium of publications. How, when, and where proteins are produced has been one of the major challenges in molecular biology. Studying the control of gene expression is essential for a better understanding of protein synthesis. Gene regulation is a highly controlled process that starts with DNA transcription. This process operates at the level of genes, the basic units of heredity, which are copied into primary ribonucleic acid (RNA). This first step is controlled by the binding of specific proteins, called transcription factors (TFs), to a DNA (deoxyribonucleic acid) sequence in the regulatory region of the gene. These DNA sequences are known as binding sites (BS) and are specific to each protein; the binding of a transcription factor to its corresponding binding site initiates transcription. Binding-site motifs are usually very short (5 to 20 bp long) and highly degenerate, and such sequences are expected to occur at random every few hundred base pairs. Moreover, a TF can bind to several different sites. Because of this high variability, it is difficult to establish a consensus sequence. Identifying binding sites is therefore important for understanding the control of gene expression. Projects such as ENCODE (Encyclopedia of DNA Elements) have devoted considerable effort to mapping binding sites for a large set of transcription factors in order to identify regulatory regions. Access to genomic sequences and advances in gene-expression analysis technologies have also enabled the development of computational methods for motif search. Thanks to these advances, a large number of algorithms have been applied in recent years to motif search in prokaryotes and simple eukaryotes. Despite the simplicity of these organisms, the rate of false positives is high relative to true positives, so studying more complex organisms requires more sensitive methods. In this thesis, we have approached the problem of binding-site detection from another angle. We have developed a toolkit for binding-motif detection based on linear and non-linear models.
First, we characterized binding sites using two approaches. The first is based on the information contained in each binding-site position. The second is based on a covariance model of an aligned set of binding-site sequences. From these motif characterizations, we proposed a new set of computational methods to detect binding sites. The first method is based on a parametric uncertainty measure (Rényi entropy). This detection algorithm evaluates the variation in the total Rényi entropy of a set of sequences when a candidate sequence is assumed to be a true binding site belonging to the set. This method performed especially well for transcription factors whose binding sites show little or no correlation among positions. Correlation among binding-site positions was considered through a linear model, Q-residuals, and two non-linear models, alpha-Divergence and SIGMA. Q-residuals is a novel motif-finding method that constructs a subspace based on the covariance of numerically encoded DNA sequences. When the number of available sequences was small, Q-residuals performed significantly better and faster than all the other methodologies. Alpha-Divergence is based on the variation of the total parametric divergence in a set of aligned sequences with binding evidence when a candidate sequence is added. Given an optimal q-value, alpha-Divergence performed better than the other methodologies for most of the transcription factor binding sites studied. Finally, a new computational tool, SIGMA, was developed as a trade-off between the good generalisation properties of pure entropy methods and the ability of position-dependency metrics to improve detection power. In approximately 70% of the cases considered, SIGMA performed better, at comparable levels of computational resources, than the methods with which it was compared. This toolkit, together with the models for detecting a set of transcription factor binding sites (TFBS), has been included in an R package called MEET.

Postprint (published version)
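
The Rényi-entropy detector described in this abstract scores a candidate by how much the total entropy of the known binding sites changes when the candidate is added. A minimal sketch of that idea follows. It is not the MEET implementation: the function names, the pseudocount, and the default order q = 2 are assumptions chosen for illustration. The Rényi entropy of order q (q ≠ 1) of a distribution p is log2(sum of p_i^q) / (1 - q), and the total entropy is taken here as the sum over alignment positions.

```python
import numpy as np

ALPHABET = "ACGT"

def position_frequencies(seqs, pseudocount=0.5):
    """Per-position base frequencies of equal-length aligned sequences."""
    length = len(seqs[0])
    counts = np.full((length, 4), pseudocount)
    for seq in seqs:
        for j, base in enumerate(seq):
            counts[j, ALPHABET.index(base)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def total_renyi_entropy(seqs, q=2.0, pseudocount=0.5):
    """Sum over positions of the order-q Renyi entropy (q must not equal 1)."""
    p = position_frequencies(seqs, pseudocount)
    return float(np.sum(np.log2(np.sum(p ** q, axis=1)) / (1.0 - q)))

def entropy_shift(known_sites, candidate, q=2.0):
    """Change in total entropy when the candidate joins the known sites.

    A candidate that resembles the known sites sharpens the position-wise
    distributions, so its shift is lower than that of an unrelated sequence.
    """
    before = total_renyi_entropy(known_sites, q)
    after = total_renyi_entropy(list(known_sites) + [candidate], q)
    return after - before
```

Scanning a longer sequence would apply `entropy_shift` to every window of the motif length and keep the windows with the lowest shift; how the threshold and the order q are chosen is not stated in the abstract and is left open here.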

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

The EM Algorithm and the Rise of Computational Biology

Author: Jun S. Liu
Xiaodan Fan
Yuan Yuan
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2010

In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis.

Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)
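
Sequence motif discovery is one of the EM applications this abstract lists. The sketch below shows the general shape of such an algorithm and is not taken from the article: it assumes exactly one motif occurrence per sequence, a uniform background, and a randomly initialized position weight matrix, and the name `em_motif` and its parameters are hypothetical. The E-step computes a posterior over motif start positions in each sequence; the M-step re-estimates the matrix from the posterior-weighted base counts.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) indicator matrix."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    return np.eye(4)[idx]

def em_motif(seqs, width, n_iter=50, seed=0, pseudocount=0.1):
    """EM for a single motif, assuming one occurrence per sequence."""
    rng = np.random.default_rng(seed)
    encoded = [one_hot(s) for s in seqs]
    background = np.full(4, 0.25)                    # assumed uniform background
    pwm = rng.dirichlet(np.ones(4), size=width)      # random start, shape (width, 4)
    for _ in range(n_iter):
        counts = np.full((width, 4), pseudocount)
        log_odds = np.log(pwm / background)
        for x in encoded:
            n_pos = len(x) - width + 1
            windows = np.stack([x[s:s + width] for s in range(n_pos)])
            # E-step: posterior over start positions (uniform prior on starts)
            score = np.einsum("swb,wb->s", windows, log_odds)
            post = np.exp(score - score.max())
            post /= post.sum()
            # Accumulate expected base counts for the M-step
            counts += np.einsum("s,swb->wb", post, windows)
        pwm = counts / counts.sum(axis=1, keepdims=True)   # M-step
    return pwm
```

Because EM only finds a local optimum, a practical run would try several seeds and keep the highest-likelihood result; that restart loop is omitted to keep the example short.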

arXiv.org e-Print Archive

CiteSeerX

Crossref

chroGPS, a global chromatin positioning system for the functional analysis and visualization of the epigenome

Author: David Rossell
Fernando Azorín
Joan Font-Burgada
Oscar Reina
Publication venue: Oxford University Press (OUP)
Publication date: 23/11/2013

Development of tools to jointly visualize the genome and the epigenome remains a challenge. chroGPS is a computational approach that addresses this challenge. chroGPS uses multidimensional scaling techniques to represent similarity between epigenetic factors, or between genetic elements on the basis of their epigenetic state, in 2D/3D reference maps. We emphasize biological interpretability, statistical robustness, integration of genetic and epigenetic data from heterogeneous sources, and computational feasibility. Although chroGPS is a general methodology to create reference maps and study the epigenetic state of any class of genetic element or genomic region, we focus on two specific kinds of maps: chroGPSfactors, which visualizes functional similarities between epigenetic factors, and chroGPSgenes, which describes the epigenetic state of genes and integrates gene expression and other functional data. We use data from the modENCODE project on the genomic distribution of a large collection of epigenetic factors in Drosophila, a model system extensively used to study genome organization and function. Our results show that the maps allow straightforward visualization of relationships between factors and elements, capturing relevant information about their functional properties that helps to interpret epigenetic information in a functional context and derive testable hypotheses.
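
To make the multidimensional scaling step concrete, here is a small sketch of classical (metric) MDS applied to a distance matrix between factors. It is not the chroGPS code: the abstract does not say which distance or which MDS variant is used, so the Jaccard distance on binary binding profiles and the classical eigendecomposition approach are assumptions, and the function names are hypothetical.

```python
import numpy as np

def jaccard_distance(binary_profiles):
    """Pairwise 1 - Jaccard similarity between rows of a 0/1 matrix.

    Each row is one factor; each column marks whether the factor is
    present at a given genomic element (an assumed encoding).
    """
    b = np.asarray(binary_profiles, dtype=bool)
    inter = (b[:, None, :] & b[None, :, :]).sum(axis=2)
    union = (b[:, None, :] | b[None, :, :]).sum(axis=2)
    similarity = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return 1.0 - similarity

def classical_mds(dist, n_dims=2):
    """Embed points so that Euclidean distances approximate `dist`."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering   # double centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]       # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale
```

Factors with similar binding profiles land close together in the resulting 2D map, which is the kind of reference map the abstract describes; non-metric variants of MDS would be a reasonable alternative when only the ranking of distances is trusted.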

Crossref

PubMed Central

Warwick Research Archives Portal Repository

Digital.CSIC

UPF Digital Repository

Targeted re-sequencing of linkage region on 2q21 identifies a novel functional variant for hip and knee osteoarthritis

Author: Ala-Kokko L.
Barral S.
Gao P.
Jakkula E.
Kamarainen O.-P.
Kiviranta I.
Kroger H.
Mannikko M.
Ott J.
Skarp S.
Taipale M.
Wei G.-H.
Publication date: 01/04/2016

Objective: The aim of the study was to identify genetic variants predisposing to primary hip and knee osteoarthritis (OA) in a sample of Finnish families. Methods: Genome-wide analysis was performed using 15 independent families (279 individuals) originating from Central Finland and identified as having multiple individuals with primary hip and/or knee OA. Targeted re-sequencing was performed for three samples from one 33-member, four-generation family contributing most significantly to the LOD score. In addition, exome sequencing was performed in three members of the same family. Results: Genome-wide linkage analysis identified a susceptibility locus on chromosome 2q21 with a multipoint LOD score of 3.91. Targeted re-sequencing and subsequent linkage analysis revealed a susceptibility insertion variant, rs11446594, located in a predicted strong enhancer element region, with a maximum LOD score of 3.42 under a dominant model of inheritance. The insertion creates a recognition sequence for the ELF3 and HMGA1 transcription factors, whose DNA-binding affinity is highly increased in the presence of the A allele compared with the wild-type null allele. Conclusion: A potentially novel functional OA susceptibility variant was identified by targeted re-sequencing. This variant is located in a predicted regulatory site and creates a recognition sequence for the ELF3 and HMGA1 transcription factors, which are predicted to play a significant role in articular cartilage homeostasis. © 2015 The Authors. Published by Elsevier Ltd and Osteoarthritis Research Society International. Peer reviewed.
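
For readers unfamiliar with LOD scores, the simplest case can be written down directly. The sketch below computes a two-point LOD score for phase-known meioses, LOD(θ) = log10[θ^k (1 − θ)^(n−k) / 0.5^n], where k of n meioses are recombinant and θ is the recombination fraction. This is a textbook simplification, not the multipoint, pedigree-based analysis reported in the abstract, and the function names are hypothetical. By convention, a LOD score of 3 or more is taken as evidence for linkage.

```python
import math

def two_point_lod(recombinants, meioses, theta):
    """Two-point LOD score for phase-known meioses at recombination fraction theta."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    k, n = recombinants, meioses
    log_l_theta = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_l_null = n * math.log10(0.5)      # no linkage: theta = 0.5
    return log_l_theta - log_l_null

def max_lod(recombinants, meioses, steps=100):
    """Grid search over theta in (0, 0.5]; returns (best_theta, best_lod)."""
    grid = [i / (2.0 * steps) for i in range(1, steps + 1)]
    best = max(grid, key=lambda t: two_point_lod(recombinants, meioses, t))
    return best, two_point_lod(recombinants, meioses, best)
```

For example, `max_lod(2, 10)` searches for the recombination fraction that best explains 2 recombinants in 10 informative meioses; scores from independent families are summed, which is how a set of modest family-level scores can reach the linkage threshold.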

Elsevier - Publisher Connector

Crossref

Institutional Repository of Institute of Psychology, Chinese Academy of Sciences

Lehigh Valley Health Network: LVHN Scholarly Works

Helsingin yliopiston digitaalinen arkisto

Transcriptional programs: modelling higher order structure in transcriptional control.

Author: Ott Sascha
Reid John E
Wernisch Lorenz
Publication venue: BMC Bioinformatics
Publication date: 01/07/2009

BACKGROUND: Transcriptional regulation is an important part of regulatory control in eukaryotes. Even if binding motifs for transcription factors are known, the task of finding binding sites by scanning sequences is plagued by false positives. One way to improve the detection of binding sites from motifs is by taking cooperativity of transcription factor binding into account. We propose a non-parametric probabilistic model, similar to a document topic model, for detecting transcriptional programs, groups of cooperative transcription factors and co-regulated genes. The analysis results in transcriptional programs which generalise both transcriptional modules and TF-target gene incidence matrices and provide a higher-level summary of these structures. The method is independent of prior specification of training sets of genes, for example, via gene expression data. The analysis is based on known binding motifs. RESULTS: We applied our method to putative regulatory regions of 18,445 Mus musculus genes. We discovered just 68 transcriptional programs that effectively summarised the action of 149 transcription factors on these genes. Several of these programs were significantly enriched for known biological processes and signalling pathways. One transcriptional program has a significant overlap with a reference set of cell cycle specific transcription factors. CONCLUSION: Our method is able to pick out higher order structure from noisy sequence analyses. The transcriptional programs it identifies potentially represent common mechanisms of regulatory control across the genome. It simultaneously predicts which genes are co-regulated and which sets of transcription factors cooperate to achieve this co-regulation. 
The programs we discovered enable biologists to choose new genes and transcription factors to study in specific transcriptional regulatory systems.

Rights: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.

Springer - Publisher Connector

PubMed Central

Warwick Research Archives Portal Repository

Birkbeck Institutional Research Online

Apollo (Cambridge)

Evidence-ranked motif identification

Author: Boyle Alan P
Ding Xuan
Georgiev Stoyan
Jayasurya Karthik
Mukherjee Sayan
Ohler Uwe
Publication venue: BioMed Central
Publication date: 01/01/2010

A new computational method for the identification of regulatory motifs from large genomic datasets is presented here.

Crossref

Springer - Publisher Connector

PubMed Central

DukeSpace

MDC Repository

A Multi-Omics Analysis of Transcription Control by BRD4

Author: Bressin Annkatrin Sarah
Publication date: 01/01/2023

RNA polymerase II (Pol II) regulation during early elongation has emerged as a regulatory hub in the gene expression of multicellular organisms. Prior research links the BRD4 protein to this control point, regulating the release of paused Pol II into productive elongation. However, the exact roles and mechanisms by which BRD4 influences this and potentially other post-initiation regulatory processes remain unknown. This study combines rapid BRD4 protein degradation and multi-omics approaches, including nascent elongating transcript sequencing (NET-seq), to uncover BRD4’s direct protein functions. Applying NET-seq in comparative studies required experimental adaptations. First, analyses with spiked-in mouse cells proved essential for reliable normalization. Second, the study identified a disproportional enrichment of a chromatin-associated RNA class as NET-seq’s major limitation. Incorporating an additional enrichment step solved this problem and significantly increased Pol II coverage. The resulting high-sensitivity NET-seq method confirmed BRD4’s proposed role in early elongation by revealing a global defect in Pol II pause release upon BRD4 degradation. Observations from proteomics and chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments suggest that the failed recruitment of Pol II-associated factors (PAF) causes an assembly defect of a competent elongation complex. Interestingly, the elongation defect also affected transcribed enhancers. Pol II occupancy increased in a region proximal to the enhancer center, strikingly similar to the impaired Pol II pause release at genes. An integrated multi-omics analysis that included genome-wide 3D genome information revealed reduced interactions between these enhancers and other regulatory regions. Another unexpected result was the widespread Pol II readthrough transcription quantified by the developed readthrough index, revealing an apparent transcriptional termination defect. 
The implementation of long-read nascent RNA-sequencing (nascONT-seq) combined with a 3’-RNA cleavage efficiency test detected impaired 3’-RNA processing. Notably, those 3’-RNA cleavage defects correlated with the observed termination defects. A potential explanation is the BRD4-dependent recruitment of general 3’-RNA processing factors to the 5’-control region. These observations start to establish regulatory links between 5’ and 3’ control that require further validation. Overall, the results indicate a general BRD4-dependent 5’ elongation control point required for 3’-RNA processing and termination.

Institutional Repository of the Freie Universität Berlin

Screening for Potential Therapeutic Targets of Yb-1 Protein Using a Bioinformatics Approach

Author: Karkoutly Omar Muneer
Publication venue: ScholarWorks @ UTRGV
Publication date: 01/07/2022

Treatment options for cancer are becoming much more limited because the robust characteristics of cancers allow them to rapidly develop drug resistance. This may be a result of cancer cells’ ability to switch between differentiated and undifferentiated states (plasticity). Diagnostic measurement and detection of cancer and its progression are essential for developing successful treatments. Specific cancer targets whose expression is highly associated with increased incidence, risk, and spread of cancer therefore become ideal targets for therapeutic intervention. One such novel target that is still being studied is the Y-box binding protein 1 (YB-1), which is a prominent transcription factor in many cancer types, including breast, liver, and colorectal cancer. Attacking such targets with specific drugs that act against them may therefore prove extremely helpful in fighting the associated cancers. Taking advantage of the latest bioinformatics tools may aid in streamlining this process.

Scholarworks@UTRGV Univ. of Texas RioGrande Valley