24 research outputs found
Chromatin Accessibility-Based Characterization of the Gene Regulatory Network Underlying Plasmodium falciparum Blood-Stage Development.
Underlying the development of malaria parasites within erythrocytes and the resulting pathogenicity is a hardwired program that secures proper timing of gene transcription and production of functionally relevant proteins. How stage-specific gene expression is orchestrated in vivo remains unclear. Here, using the assay for transposase accessible chromatin sequencing (ATAC-seq), we identified ∼4,000 regulatory regions in P. falciparum intraerythrocytic stages. The vast majority of these sites are located within 2 kb upstream of transcribed genes and their chromatin accessibility pattern correlates positively with abundance of the respective mRNA transcript. Importantly, these regions are sufficient to drive stage-specific reporter gene expression and DNA motifs enriched in stage-specific sets of regulatory regions interact with members of the P. falciparum AP2 transcription factor family. Collectively, this study provides initial insights into the in vivo gene regulatory network of P. falciparum intraerythrocytic stages and should serve as a valuable resource for future studies
GaKCo: a Fast GApped k-mer string Kernel using COunting
String Kernel (SK) techniques, especially those using gapped -mers as
features (gk), have obtained great success in classifying sequences like DNA,
protein, and text. However, the state-of-the-art gk-SK runs extremely slow when
we increase the dictionary size () or allow more mismatches (). This
is because current gk-SK uses a trie-based algorithm to calculate co-occurrence
of mismatched substrings resulting in a time cost proportional to
. We propose a \textbf{fast} algorithm for calculating
\underline{Ga}pped -mer \underline{K}ernel using \underline{Co}unting
(GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of
substrings using cumulative counting. This algorithm is fast, scalable to
larger and , and naturally parallelizable. We provide a rigorous
asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK.
Theoretically, the time cost of GaKCo is independent of the term
that slows down the trie-based approach. Experimentally, we observe that GaKCo
achieves the same accuracy as the state-of-the-art and outperforms its speed by
factors of 2, 100, and 4, on classifying sequences of DNA (5 datasets), protein
(12 datasets), and character-based English text (2 datasets), respectively.
GaKCo is shared as an open source tool at
\url{https://github.com/QData/GaKCo-SVM}Comment: @ECML 201
A massively parallel reporter assay reveals context-dependent activity of homeodomain binding sites in vivo
BROCKMAN: deciphering variance in epigenomic regulators by k-mer factorization
Background: Variation in chromatin organization across single cells can help shed important light on the mechanisms controlling gene expression, but scale, noise, and sparsity pose significant challenges for interpretation of single cell chromatin data. Here, we develop BROCKMAN (Brockman Representation Of Chromatin by K-mers in Mark-Associated Nucleotides), an approach to infer variation in transcription factor (TF) activity across samples through unsupervised analysis of the variation in DNA sequences associated with an epigenomic mark. Results: BROCKMAN represents each sample as a vector of epigenomic-mark-associated DNA word frequencies, and decomposes the resulting matrix to find hidden structure in the data, followed by unsupervised grouping of samples and identification of the TFs that distinguish groups. Applied to single cell ATAC-seq, BROCKMAN readily distinguished cell types, treatments, batch effects, experimental artifacts, and cycling cells. We show that each variable component in the k-mer landscape reflects a set of co-varying TFs, which are often known to physically interact. For example, in K562 cells, AP-1 TFs were central determinant of variability in chromatin accessibility through their variable expression levels and diverse interactions with other TFs. We provide a theoretical basis for why cooperative TF binding – and any associated epigenomic mark – is inherently more variable than non-cooperative binding. Conclusions: BROCKMAN and related approaches will help gain a mechanistic understanding of the trans determinants of chromatin variability between cells, treatments, and individuals. Keywords: Single-cell, Epigenome, Chromatin, scATAC-seq, K-mer, N-gram, Factorization, Decomposition, Clustering,
Transcription factorNational Human Genome Research Institute (U.S.) (Centers of Excellence in Genomic Science Grant)Howard Hughes Medical Institute (Centers of Excellence in Genomic Science Grant
Discovery and validation of information theory-based transcription factor and cofactor binding site motifs.
Data from ChIP-seq experiments can derive the genome-wide binding specificities of transcription factors (TFs) and other regulatory proteins. We analyzed 765 ENCODE ChIP-seq peak datasets of 207 human TFs with a novel motif discovery pipeline based on recursive, thresholded entropy minimization. This approach, while obviating the need to compensate for skewed nucleotide composition, distinguishes true binding motifs from noise, quantifies the strengths of individual binding sites based on computed affinity and detects adjacent cofactor binding sites that coordinate with the targets of primary, immunoprecipitated TFs. We obtained contiguous and bipartite information theory-based position weight matrices (iPWMs) for 93 sequence-specific TFs, discovered 23 cofactor motifs for 127 TFs and revealed six high-confidence novel motifs. The reliability and accuracy of these iPWMs were determined via four independent validation methods, including the detection of experimentally proven binding sites, explanation of effects of characterized SNPs, comparison with previously published motifs and statistical analyses. We also predict previously unreported TF coregulatory interactions (e.g. TF complexes). These iPWMs constitute a powerful tool for predicting the effects of sequence variants in known binding sites, performing mutation analysis on regulatory SNPs and predicting previously unrecognized binding sites and target genes
Prediction of regulatory motifs from human Chip-sequencing data using a deep learning framework
The identification of transcription factor binding sites and cis-regulatory motifs is a frontier whereupon the rules governing protein-DNA binding are being revealed. Here, we developed a new method (DEep Sequence and Shape mOtif or DESSO) for cis-regulatory motif prediction using deep neural networks and the binomial distribution model. DESSO outperformed existing tools, including DeepBind, in predicting motifs in 690 human ENCODE ChIP-sequencing datasets. Furthermore, the deep-learning framework of DESSO expanded motif discovery beyond the state-of-the-art by allowing the identification of known and new protein-protein-DNA tethering interactions in human transcription factors (TFs). Specifically, 61 putative tethering interactions were identified among the 100 TFs expressed in the K562 cell line. In this work, the power of DESSO was further expanded by integrating the detection of DNA shape features. We found that shape information has strong predictive power for TF-DNA binding and provides new putative shape motif information for human TFs. Thus, DESSO improves in the identification and structural analysis of TF binding sites, by integrating the complexities of DNA binding into a deep-learning framework
An NF-κB Transcription-Factor-Dependent Lineage-Specific Transcriptional Program Promotes Regulatory T Cell Identity and Function
Both conventional T (Tconv) cells and regulatory T (Treg) cells are activated through ligation of the T cell receptor (TCR) complex, leading to the induction of the transcription factor NF-κB. In Tconv cells, NF-κB regulates expression of genes essential for T cell activation, proliferation, and function. However the role of NF-κB in Treg function remains unclear. We conditionally deleted canonical NF-κB members p65 and c-Rel in developing and mature Treg cells and found they have unique but partially redundant roles. c-Rel was critical for thymic Treg development while p65 was essential for mature Treg identity and maintenance of immune tolerance. Transcriptome and NF-κB p65 binding analyses demonstrated a lineage specific, NF-κB-dependent transcriptional program, enabled by enhanced chromatin accessibility. These dual roles of canonical NF-κB in Tconv and Treg cells highlight the functional plasticity of the NF-κB signaling pathway and underscores the need for more selective strategies to therapeutically target NF-κB
