Search CORE

9 research outputs found

GENCODE: reference annotation for the human and mouse genomes in 2023.

Author: Arnan Carme
Banerjee Abhimanyu
Barnes If
Bennett Ruth
Berry Andrew
Bignell Alexandra
Boix Carles
Calvet Ferriol
Carbonell-Sala Sílvia
Cerdán-Vélez Daniel
Choudhary Jyoti S
Cunningham Fiona
Davidson Claire
Diekhans Mark
Donaldson Sarah
Dursun Cagatay
Fatima Reham
Flicek Paul
Frankish Adam
Gerstein Mark
Giorgetti Stefano
Giron Carlos García
Gonzalez Jose Manuel
Guigo Roderic
Gómez Laura Martínez
Hardy Matthew
Harrison Peter W
Hollis Zoe
Hourlier Thibaut
Hubbard Tim J P
Hunt Toby
James Benjamin
Jiang Yunzhe
Johnson Rory
Jungreis Irwin
Kay Mike
Kellis Manolis
Kundaje Anshul
Lagarde Julien
Loveland Jane E
Martin Fergal J
Mudge Jonathan M
Nair Surag
Ni Pengyu
Paten Benedict
Pozo Fernando
Ramalingam Vivek
Ruffier Magali
Schmitt Bianca M
Schreiber Jacob M
Sisu Cristina
Steed Emily
Sumathipala Dulika
Suner Marie-Marthe
Sycheva Irina
Tress Michael L
Uszczynska-Ratajczak Barbara
Wass Elizabeth
Wright James C
Yang Yucheng T
Yates Andrew
Zafrulla Zahoor
Publication venue: 'Oxford University Press (OUP)'
Publication date: 24/11/2022

GENCODE produces high quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org

Bern Open Repository and Information System (BORIS)

The dynseq browser track shows context-specific features at nucleotide resolution.

Author: Nair Surag
Publication date: 06/07/2023

Ezid

Analysis Products: Transcription factor stoichiometry, motif affinity and syntax regulate single cell chromatin dynamics during fibroblast reprogramming to pluripotency

Author: Ameen Mohamed
Kundaje Anshul
Nair Surag
Wang Kevin
Publication venue: Zenodo
Publication date: 01/09/2023

This record contains analysis products for the paper "Transcription factor stoichiometry, motif affinity and syntax regulate single cell chromatin dynamics during fibroblast reprogramming to pluripotency" by Nair, Ameen et al. Please refer to the READMEs in the directories, which are summarized below.

The record contains the following files:

`clusters.tsv`: contains the cluster id, name and colour of clusters in the paper

scATAC.zip

Analysis products for the single-cell ATAC-seq data. Contains:
- `cells.tsv`: list of barcodes that pass QC. Columns include:
    - `barcode`
    - `sample`: (time point)
    - `umap1`
    - `umap2`
    - `cluster`
    - `dpt_pseudotime_fibr_root`: pseudotime values treating a fibroblast cell as root
    - `dpt_pseudotime_xOSK_root`: pseudotime values treating an xOSK cell as root
- `peaks.bed`: list of 500 bp peaks across all cell states. The 4th column contains the peak set label. Note that ~5000 peaks are not assigned to any peak set and are marked as NA.
- `features.tsv`: 50-dimensional representation of each cell
- `cell_x_peak.mtx.gz`: sparse matrix of fragment counts within peaks. Load using scipy.io.mmread in Python or readMM in R. Columns correspond to cells from `cells.tsv` (combine sample + barcode). Rows correspond to peaks in `peaks.bed`.

scATAC_clusters.zip

Analysis products corresponding to cluster pseudo-bulks of the single-cell ATAC-seq data.
- `clusters.tsv`: contains the cluster id, name and colour used in the paper
- `peaks`: contains `overlap_reproducibilty/overlap.optimal_peak` peaks called using the ENCODE bulk ATAC-seq pipeline, in narrowPeak format.
- `fragments`: contains per-cluster fragment files

scATAC_scRNA_integration.zip

Analysis products from the integration of scATAC with scRNA. Contains:
- `peak_gene_links_fdr1e-4.tsv`: file with peak-gene links passing FDR 1e-4. For analyses in the paper, we filter to peaks with absolute correlation >0.45.
- `harmony.cca.30.feat.tsv`: 30-dimensional co-embedding for scATAC and scRNA cells obtained by CCA followed by applying Harmony over assay type.
- `harmony.cca.metadata.tsv`: UMAP coordinates for scATAC and scRNA cells derived from the Harmony CCA embedding. The first column contains the barcode.

scRNA.zip

Analysis products for the single-cell RNA-seq data. Contains:
- `seurat.rds`: Seurat object that contains expression data (raw counts, normalized, and scaled), reductions (umap, pca), knn graphs and all associated metadata. Note that the barcode suffix (1-9) corresponds to samples D0, D2, ..., D14, iPSC.
- `genes.txt`: list of all genes
- `cells.tsv`: list of barcodes that pass QC across samples. Contains:
    - `barcode_sample`: barcode with index of sample (1-9 corresponding to D0, D2, ..., D14, iPSC)
    - `sample`: sample name (D0, D2, ..., D14, iPSC)
    - `umap1`
    - `umap2`
    - `nCount_RNA`
    - `nFeature_RNA`
    - `cluster`
    - `percent.mt`: percent of mitochondrial transcripts in cell
    - `percent.oskm`: percent of OSKM transcripts in cell
- `gene_x_cell.mtx.gz`: sparse matrix of gene counts. Load using scipy.io.mmread in Python or readMM in R. Columns correspond to cells from `cells.tsv` (barcode suffix contains sample information). Rows correspond to genes in `genes.txt`.
- `pca.tsv`: first 50 PCs of each cell
- `oskm_endo_sendai.tsv`: estimated raw counts (cts, may not be integers) and log(1 + tp10k) normalized expression (norm) for endogenous and exogenous (Sendai-derived) counts of the POU5F1 (OCT4), SOX2, KLF4 and MYC genes. Rows are consistent with `seurat.rds` and `cells.tsv`.

multiome.zip

multiome/snATAC: These files are derived from the integration of nuclei from multiome (D1M and D2M) with cells from day 2 of scATAC-seq (labeled D2).
- `cells.tsv`: list of nuclei barcodes that pass QC from multiome, together with cell barcodes from D2 of scATAC-seq. Includes:
    - `barcode`
    - `umap1`: coordinates used for the figures involving multiome in the paper
    - `umap2`: as for `umap1`
    - `sample`: D1M and D2M correspond to multiome, D2 corresponds to day 2 of scATAC-seq
    - `cluster`: for multiome barcodes, these are labels transferred from scATAC-seq. For D2 scATAC-seq, they are the original cluster labels.
- `peaks.bed`: the same file as scATAC/peaks.bed. List of 500 bp peaks. The 4th column contains the peak set label. Note that ~5000 peaks are not assigned to any peak set and are marked as NA.
- `cell_x_peak.mtx.gz`: sparse matrix of fragment counts within peaks. Load using scipy.io.mmread in Python or readMM in R. Columns correspond to cells from `cells.tsv` (combine sample + barcode). Rows correspond to peaks in `peaks.bed`.
- `features.no.harmony.50d.tsv`: 50-dimensional representation of each cell prior to running Harmony (to correct for batch effect between D2 scATAC and D1M, D2M snMultiome). Rows correspond to cells from `cells.tsv`.
- `features.harmony.10d.tsv`: 10-dimensional representation of each cell after running Harmony. Rows correspond to cells from `cells.tsv`.

multiome/snRNA:
- `seurat.rds`: Seurat object that contains expression data (raw counts, normalized, and scaled), reductions (umap, pca) and associated metadata. Note that the barcode suffix (1, 2) corresponds to samples D1M, D2M. Please use the UMAP/features from snATAC/ for consistency.
- `genes.txt`: list of all genes (this is different from the list in the scRNA analysis)
- `cells.tsv`: list of barcodes that pass QC across samples. Contains:
    - `barcode_sample`: barcode with index of sample (1, 2 corresponding to D1M, D2M respectively)
    - `sample`: sample name (D1M, D2M)
    - `nCount_RNA`
    - `nFeature_RNA`
    - `percent.oskm`: percent of OSKM genes in cell
- `gene_x_cell.mtx.gz`: sparse matrix of gene counts. Load using scipy.io.mmread in Python or readMM in R. Columns correspond to cells from `cells.tsv` (barcode suffix contains sample information). Rows correspond to genes in `genes.txt`.

scRNA-seq can be visualized interactively at https://kundajelab.github.io/reprogramming-browser/
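As a minimal sketch of the loading step described in this record, the following reads a gzipped Matrix Market file with `scipy.io.mmread` and pairs its columns with the rows of `cells.tsv`. The helper names `load_counts` and `write_toy_dataset`, the toy values, and the `sample_barcode` join used for cell identifiers are assumptions made for illustration; the record says only to "combine sample + barcode" and does not state the separator.

```python
import csv
import gzip
import os
import tempfile

import scipy.io
import scipy.sparse


def write_toy_dataset(directory):
    """Write a tiny 3-peak x 2-cell dataset (hypothetical stand-in data)."""
    matrix_path = os.path.join(directory, "cell_x_peak.mtx.gz")
    cells_path = os.path.join(directory, "cells.tsv")
    with gzip.open(matrix_path, "wt") as handle:
        # Matrix Market coordinate format: header, dimensions, then
        # one "row column value" triple per nonzero entry (1-based).
        handle.write("%%MatrixMarket matrix coordinate integer general\n")
        handle.write("3 2 3\n")
        handle.write("1 1 2\n2 2 1\n3 1 4\n")
    with open(cells_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["barcode", "sample"])
        writer.writerow(["AAAC", "D0"])
        writer.writerow(["TTTG", "D2"])
    return matrix_path, cells_path


def load_counts(matrix_path, cells_path):
    """Load a peak-by-cell count matrix and one identifier per column."""
    # Convert to CSC so that slicing columns (cells) is cheap.
    counts = scipy.sparse.csc_matrix(scipy.io.mmread(matrix_path))
    with open(cells_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    # Assumed convention for combining sample and barcode.
    cell_ids = [f"{row['sample']}_{row['barcode']}" for row in rows]
    if counts.shape[1] != len(cell_ids):
        raise ValueError("matrix columns do not match rows of cells.tsv")
    return counts, cell_ids


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        counts, cell_ids = load_counts(*write_toy_dataset(tmp))
        print(counts.shape, cell_ids)
```

The shape check guards against pairing a matrix with the wrong `cells.tsv`, which matters here because the record ships several files with that name in different directories.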

ZENODO

ChromBPNet models and data: Transcription factor stoichiometry, motif affinity and syntax regulate single cell chromatin dynamics during fibroblast reprogramming to pluripotency

Author: Ameen Mohamed
Kundaje Anshul
Nair Surag
Wang Kevin
Publication venue: Zenodo
Publication date: 01/09/2023

This record contains ChromBPNet models and data used to train the models for the paper "Transcription factor stoichiometry, motif affinity and syntax regulate single cell chromatin dynamics during fibroblast reprogramming to pluripotency" by Nair, Ameen et al.

`data` contains bigwigs and regions (peaks + non-peaks) used for training each of the models. See `data/README.txt` for more details.

Models:

Loading the model: The models were trained using tf1.14. The models are provided in h5 format for tf1.14 (py3.7) and SavedModel format for tf2.X. tf2.X tested only for py3.8-11, tf2.8-13. To load the models in tf1.14:

<pre><code class="language-python">import tensorflow as tf
model = tf.keras.models.load_model("path/to/model.h5")</code></pre>

In tf2:

<pre><code class="language-python">import tensorflow as tf
model = tf.keras.models.load_model("path/to/model_dir")</code></pre>

If all else fails, you can load the architecture as provided in `model_arch.py` with default parameters (`bpnet_seq` for the bias model and `chrombpnet` for the chrombpnet model), and then load the weights using `model.load_weights` from the weights provided in the `weights` directory.

Usage: Each bias model takes as input a one-hot sequence of length 2000. It has two outputs: a vector of logits of length 2000 and one logcounts scalar:

<pre><code class="language-python"># seq_one_hot of length B x 2000 x 4
out_bias_logits, out_bias_logcounts = bias_model.predict(seq_one_hot)
# out_bias_logits: B x 2000
# out_bias_logcounts: B x 1</code></pre>

The ChromBPNet model takes as input a one-hot sequence of length 2000, bias logits of length 2000 and a bias log-counts scalar. It has the same output types as the bias model. To run the chrombpnet model to obtain predictions:

<pre><code class="language-python">pred_profile, pred_logcounts = chrombpnet_model.predict([seq_one_hot, out_bias_logits, out_bias_logcounts])
# pred_profile: B x 2000
# pred_logcounts: B x 1</code></pre>

If you wish to obtain the "de-biased" predictions (see Methods), simply pass in zeros instead of the bias model predictions:

<pre><code class="language-python">import numpy as np
pred_profile_debiased, pred_logcounts_debiased = chrombpnet_model.predict([seq_one_hot, np.zeros((seq_one_hot.shape[0], 2000)), np.zeros((seq_one_hot.shape[0], 1))])</code></pre>

To obtain per-base predicted counts (with or without bias):

<pre><code class="language-python">import scipy.special
pred_per_base_counts = scipy.special.softmax(pred_profile, axis=-1) * (np.exp(pred_logcounts)-1)
# pred_per_base_counts: B x 2000</code></pre>

Note that in general predicted counts can't be compared across models as they are not corrected for sequencing depth.

Note: All bias models used across folds are identical, except for the final intercept term in the counts output (see Methods), which is specific to each combination of cell state and fold.

Folds: The splits used for training the different folds are as below:

| Fold | Test Chromosomes | Validation Chromosomes |
|------|------------------|------------------------|
| 0 | chr1 | chr8, chr10 |
| 1 | chr2, chr19 | chr1 |
| 2 | chr3, chr20 | chr2, chr19 |
| 3 | chr6, chr13, chr22 | chr3, chr20 |
| 4 | chr5, chr16, chrY | chr6, chr13, chr22 |
| 5 | chr4, chr15, chr21 | chr5, chr16, chrY |
| 6 | chr7, chr18, chr14 | chr4, chr15, chr21 |
| 7 | chr11, chr17, chrX | chr7, chr18, chr14 |
| 8 | chr9, chr12 | chr11, chr17, chrX |
| 9 | chr8, chr10 | chr9, chr12 |

The remaining chromosomes were used as training chromosomes for each fold.
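The usage notes in this record assume a `B x 2000 x 4` one-hot input and convert profile logits plus log-counts into per-base counts. The sketch below illustrates both steps with NumPy only, so it runs without a trained model. The function names `one_hot_encode` and `per_base_counts` are hypothetical, and the A, C, G, T channel order and the all-zero encoding of `N` are assumptions rather than details stated in the record; check them against the released code before feeding real models.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order; not stated in the record


def one_hot_encode(sequences):
    """Encode equal-length DNA strings as a B x L x 4 float32 array."""
    length = len(sequences[0])
    encoded = np.zeros((len(sequences), length, 4), dtype=np.float32)
    for i, seq in enumerate(sequences):
        if len(seq) != length:
            raise ValueError("all sequences must have the same length")
        for j, base in enumerate(seq.upper()):
            k = BASES.find(base)
            if k >= 0:  # N and other ambiguity codes stay all-zero (assumption)
                encoded[i, j, k] = 1.0
    return encoded


def per_base_counts(profile_logits, logcounts):
    """Compute softmax(logits) * (exp(logcounts) - 1) for each sequence.

    profile_logits: B x L array, logcounts: B x 1 array.
    """
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = profile_logits - profile_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs * (np.exp(logcounts) - 1.0)


if __name__ == "__main__":
    seqs = ["ACGT" * 500, "N" * 2000]
    x = one_hot_encode(seqs)
    # Stand-in outputs in place of real model predictions.
    logits = np.zeros((2, 2000))
    logcounts = np.log(np.array([[101.0], [11.0]]))
    counts = per_base_counts(logits, logcounts)
    print(x.shape, counts.shape)
```

Because the softmax rows sum to one, each row of the result sums to `exp(logcounts) - 1`, which gives a quick sanity check on real predictions.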

ZENODO

Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays.

Author: Anshul Kundaje
Avanti Shrikumar
Georgi K Marinov
Peyton Greenside
Rajiv Movva
Surag Nair
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2019

The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced

Directory of Open Access Journals

Bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants.

Author: Kundaje Anshul
Kundu Soumya
Nair Surag
Pampari Anusri
Patel Aman
Schreiber Jacob
Shcherbina Anna
Shrikumar Avanti
Wang Austin
Publication venue: Zenodo
Publication date: 11/02/2023

- (MAJOR) Bug in the chrombpnet modisco_motifs command: the number of seqlets was limited to 50000, and the limit did not change when users tried to raise it (for example to 1 million).
- Filter peaks at edges for the pred_bw command and the bias pipeline, so bias evaluation is now done on these filtered peaks.
- Preprocessing defaults to unix sort. Provided an option to switch to bedtools sort.
- Provided an option to filter chromosomes in preprocessing.

Full Changelog: https://github.com/kundajelab/chrombpnet/compare/v0.1.3...v0.1.4

If you use this software, please cite it as below

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Recommended from our members

The dynseq browser track shows context-specific features at nucleotide resolution

Author: Barrett Arjun
Haeussler Maximilian
Kerpedjiev Peter
Kundaje Anshul
Lee Brian T
Lekschas Fritz
Li Daofeng
Nair Surag
Pampari Anusri
Ramalingam Vivekanandan
Raney Brian J
Wang Ting
Publication venue: eScholarship, University of California
Publication date: 01/11/2022

High-throughput experimental platforms have revolutionized the ability to profile biochemical and functional properties of biological sequences such as DNA, RNA and proteins. By collating several data modalities with customizable tracks rendered using intuitive visualizations, genome browsers enable an interactive and interpretable exploration of diverse types of genome profiling experiments and derived annotations. However, existing genome browser tracks are not well suited for intuitive visualization of high-resolution DNA sequence features such as transcription factor motifs. Typically, motif instances in regulatory DNA sequences are visualized as BED-based annotation tracks, which highlight the genomic coordinates of the motif instances but do not expose their specific sequences. Instead, a genome sequence track needs to be cross-referenced with the BED track to identify sequences of motif hits. Even so, quantitative information about the motif instances such as affinity or conservation as well as differences in base resolution from the consensus motif are not immediately apparent. This makes interpretation slow and challenging. This problem is compounded when analyzing several cellular states and/or molecular readouts (such as ATAC-seq and ChIP–seq) simultaneously, as coordinates of enriched regions (peaks) and the set of active transcription factor motifs vary across cell states

eScholarship - University of California