
    GENCODE: reference annotation for the human and mouse genomes in 2023.

    GENCODE produces high quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org

    Analysis Products: Transcription factor stoichiometry, motif affinity and syntax regulate single cell chromatin dynamics during fibroblast reprogramming to pluripotency

    <p>This record contains analysis products for the paper "Transcription factor stoichiometry, motif affinity and syntax regulate single cell chromatin dynamics during fibroblast reprogramming to pluripotency" by Nair, Ameen <em>et al</em>. Please refer to the READMEs in the directories, which are summarized below.</p> <p>The record contains the following files:<br> <br> `clusters.tsv`: <strong> </strong>contains the cluster id, name and colour of clusters in the paper</p> <p><strong>scATAC.zip</strong></p> <p>Analysis products for the single-cell ATAC-seq data. Contains:</p> <p>- `cells.tsv`: list of barcodes that pass QC. Columns include:<br>     - `barcode`<br>     - `sample`: (time point)<br>     - `umap1`<br>     - `umap2`<br>     - `cluster`<br>     - `dpt_pseudotime_fibr_root`: pseudotime values treating a fibroblast cell as root<br>     - `dpt_pseudotime_xOSK_root`: pseudotime values treating xOSK cell as root<br> - `peaks.bed`: list of peaks of 500bp across all cell states. 4th column contains the peak set label. Note that ~5000 peaks are not assigned to any peak set and are marked as NA.<br> - `features.tsv`: 50 dimensional representation of each cell <br> - `cell_x_peak.mtx.gz`: sparse matrix of fragment counts within peaks. Load using scipy.io.mmread in python or readMM in R. Columns correspond to cells from `cells.tsv` (combine sample + barcode). Rows correspond to peaks in `peaks.bed` </p> <p><strong>scATAC_clusters.zip</strong></p> <p>Analysis products corresponding to cluster pseudo-bulks of the single-cell ATAC-seq data. </p> <p>- `clusters.tsv`: contains the cluster id, name and colour used in the paper<br> - `peaks`: contains `overlap_reproducibilty/overlap.optimal_peak` peaks called using ENCODE bulk ATAC-seq pipeline in the narrowPeak format.<br> - `fragments`: contains per cluster fragment files </p> <p><strong>scATAC_scRNA_integration.zip</strong></p> <p>Analysis products from the integration of scATAC with scRNA. 
Contains:</p> <p>- `peak_gene_links_fdr1e-4.tsv`: file with peak gene links passing FDR 1e-4. For analyses in the paper, we filter to peaks with absolute correlation >0.45.<br> - `harmony.cca.30.feat.tsv`: 30 dimensional co-embedding for scATAC and scRNA cells obtained by CCA followed by applying Harmony over assay type.<br> - `harmony.cca.metadata.tsv`: UMAP coordinates for scATAC and scRNA cells derived from the Harmony CCA embedding. First column contains barcode.</p> <p><strong>scRNA.zip</strong></p> <p>Analysis products for the single-cell RNA-seq data. Contains:</p> <p>- `seurat.rds`: seurat object that contains expression data (raw counts, normalized, and scaled), reductions (umap, pca), knn graphs, all associated metadata. Note that barcode suffix (1-9 corresponds to samples D0, D2, ..., D14, iPSC)<br> - `genes.txt`: list of all genes<br> - `cells.tsv`: list of barcodes that pass QC across samples. Contains:<br>     - `barcode_sample`: barcode with index of sample (1-9 corresponding to D0, D2, ..., D14, iPSC) <br>     - `sample`: sample name (D0, D2, .., D14, iPSC)<br>     - `umap1`<br>     - `umap2`<br>     - `nCount_RNA`<br>     - `nFeature_RNA`<br>     - `cluster`<br>     - `percent.mt`: percent of mitochondrial transcripts in cell<br>     - `percent.oskm`: percent of OSKM transcripts in cell<br> - `gene_x_cell.mtx.gz`: sparse matrix of gene counts. Load using scipy.io.mmread in python or readMM in R. Columns correspond to cells from `cells.tsv` (barcode suffix contains sample information). Rows correspond to genes in `genes.txt` <br> - `pca.tsv`: first 50 PC of each cell<br> - `oskm_endo_sendai.tsv`: estimated raw counts (cts, may not be integers) and log(1+ tp10k) normalized expression (norm) for endogenous and exogenous (Sendai derived) counts of POU5F1 (OCT4), SOX2, KLF4 and MYC genes. 
Rows are consistent with `seurat.rds` and `cells.tsv`</p> <p><strong>multiome.zip</strong></p> <p><em>multiome/snATAC:</em></p> <p>These files are derived from the integration of nuclei from multiome (D1M and D2M) with cells from day 2 of scATAC-seq (labeled D2). </p> <p>- `cells.tsv`: list of nuclei barcodes that pass QC from multiome, together with cell barcodes from D2 of scATAC-seq. Includes:<br>     - `barcode`<br>     - `umap1`: These are the coordinates used for the figures involving multiome in the paper.<br>     - `umap2`: see `umap1` <br>     - `sample`: D1M and D2M correspond to multiome, D2 corresponds to day 2 of scATAC-seq<br>     - `cluster`: For multiome barcodes, these are labels transferred from scATAC-seq. For D2 scATAC-seq, these are the original cluster labels. <br> - `peaks.bed`: This is the same file as scATAC/peaks.bed. List of peaks of 500bp. 4th column contains the peak set label. Note that ~5000 peaks are not assigned to any peak set and are marked as NA.<br> - `cell_x_peak.mtx.gz`: sparse matrix of fragment counts within peaks. Load using scipy.io.mmread in python or readMM in R. Columns correspond to cells from `cells.tsv` (combine sample + barcode). Rows correspond to peaks in `peaks.bed`.<br> - `features.no.harmony.50d.tsv`: 50 dimensional representation of each cell prior to running Harmony (to correct for batch effect between D2 scATAC and D1M, D2M snMultiome). Rows correspond to cells from `cells.tsv`.<br> - `features.harmony.10d.tsv`: 10 dimensional representation of each cell after running Harmony. Rows correspond to cells from `cells.tsv`.</p> <p><em>multiome/snRNA:</em></p> <p>- `seurat.rds`: seurat object that contains expression data (raw counts, normalized, and scaled), reductions (umap, pca), associated metadata. Note that barcode suffix (1,2 corresponds to samples D1M, D2M). 
Please use the UMAP/features from snATAC/ for consistency.<br> - `genes.txt`: list of all genes (this is different from the list in scRNA analysis)<br> - `cells.tsv`: list of barcodes that pass QC across samples. Contains:<br>     - `barcode_sample`: barcode with index of sample (1,2 corresponding to D1M, D2M respectively) <br>     - `sample`: sample name (D1M, D2M)<br>     - `nCount_RNA`<br>     - `nFeature_RNA`<br>     - `percent.oskm`: percent of OSKM genes in cell<br> - `gene_x_cell.mtx.gz`: sparse matrix of gene counts. Load using scipy.io.mmread in python or readMM in R. Columns correspond to cells from `cells.tsv` (barcode suffix contains sample information). Rows correspond to genes in `genes.txt` </p>scRNA-seq can be visualized interactively at https://kundajelab.github.io/reprogramming-browser/
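
The count matrices described above can be paired with their row and column tables roughly as follows. This is a minimal sketch, not code from the record: the function name `load_cell_x_peak` is hypothetical, and it assumes that `cells.tsv` has a header row with `sample` and `barcode` columns, that `peaks.bed` has exactly four header-less columns, and that sample and barcode are joined with an underscore (the README only says "combine sample + barcode").

```python
import pandas as pd
from scipy import io, sparse


def load_cell_x_peak(mtx_path, cells_path, peaks_path, sep="_"):
    """Load the peak-by-cell fragment counts with their row and column labels.

    Assumptions (not confirmed by the README): cells.tsv has a header with
    `sample` and `barcode` columns, peaks.bed has four header-less columns,
    and `sep` is the separator used to join sample and barcode.
    """
    # Rows are peaks and columns are cells, as stated in the README.
    counts = sparse.csr_matrix(io.mmread(mtx_path))
    cells = pd.read_csv(cells_path, sep="\t")
    # pandas reads the literal "NA" peak set label as a missing value.
    peaks = pd.read_csv(
        peaks_path, sep="\t", header=None,
        names=["chrom", "start", "end", "peak_set"],
    )
    if counts.shape != (len(peaks), len(cells)):
        raise ValueError(
            f"matrix shape {counts.shape} does not match "
            f"{len(peaks)} peaks x {len(cells)} cells"
        )
    cell_ids = (cells["sample"].astype(str) + sep
                + cells["barcode"].astype(str)).to_numpy()
    return counts, cell_ids, peaks
```

The shape check is there because a transposed or mismatched matrix would otherwise go unnoticed; the same pattern should apply to `gene_x_cell.mtx.gz` with `genes.txt` in place of `peaks.bed`.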

    ChromBPNet models and data: Transcription factor stoichiometry, motif affinity and syntax regulate single cell chromatin dynamics during fibroblast reprogramming to pluripotency

    <p>This record contains ChromBPNet models and data used to train the models for the paper "Transcription factor stoichiometry, motif affinity and syntax regulate single cell chromatin dynamics during fibroblast reprogramming to pluripotency" by Nair, Ameen <em>et al</em>.</p> <p>`data` contains bigwigs and regions (peaks + non-peaks) used for training each of the models. See `data/README.txt` for more details.</p> <p><strong>Models:</strong></p> <p><em>Loading the model:</em></p> <p>The models were trained using tf1.14. The models are provided in h5 format for tf1.14 (py3.7) and SavedModel format for tf2.X. tf2.X tested only for py3.8-11, tf2.8-13.</p> <p>To load the models in tf1.14:</p> <pre><code class="language-python">import tensorflow as tf
model = tf.keras.models.load_model("path/to/model.h5")</code></pre> <p>In tf2:</p> <pre><code class="language-python">import tensorflow as tf
model = tf.keras.models.load_model("path/to/model_dir")</code></pre> <p>If all else fails, you can load the architecture as provided in `model_arch.py` with default parameters (`bpnet_seq` for bias model and `chrombpnet` for chrombpnet model), and then load the weights using `model.load_weights` from the weights provided in the `weights` directory.</p> <p> </p> <p><em>Usage:</em></p> <p>Each bias model takes as input a one-hot encoded sequence of length 2000. It has two outputs: a vector of logits of length 2000 and a log-counts scalar:</p> <pre><code class="language-python"># seq_one_hot of length B x 2000 x 4
out_bias_logits, out_bias_logcounts = bias_model.predict(seq_one_hot)
# out_bias_logits: B x 2000
# out_bias_logcounts: B x 1</code></pre> <p>The ChromBPNet model takes as input a one-hot sequence of length 2000, bias logits of length 2000 and a bias log-counts scalar. It has the same output types as the bias model. 
To run the chrombpnet model to obtain predictions:</p> <pre><code class="language-python">pred_profile, pred_logcounts = chrombpnet_model.predict([seq_one_hot, out_bias_logits, out_bias_logcounts])
# pred_profile: B x 2000
# pred_logcounts: B x 1</code></pre> <p>If you wish to obtain the "de-biased" predictions (see Methods), simply pass in zeros instead of the bias model predictions:</p> <pre><code class="language-python">import numpy as np
pred_profile_debiased, pred_logcounts_debiased = chrombpnet_model.predict([seq_one_hot, np.zeros((seq_one_hot.shape[0], 2000)), np.zeros((seq_one_hot.shape[0], 1))])</code></pre> <p>To obtain per-base predicted counts (with or without bias):</p> <pre><code class="language-python">import numpy as np
import scipy.special
pred_per_base_counts = scipy.special.softmax(pred_profile, axis=-1) * (np.exp(pred_logcounts)-1)
# pred_per_base_counts: B x 2000</code></pre> <p>Note that in general predicted counts can't be compared across models as they are not corrected for sequencing depth.</p> <p> </p> <p><em>Note:</em></p> <p>All bias models used across folds are identical, except for the final intercept term in the counts output (see Methods), which is specific to each combination of cell state and fold.</p> <p> </p> <p><em>Folds:</em></p> <p>The splits used for training the different folds are as below:</p>
<table>
<tr><th>Fold</th><th>Test Chromosomes</th><th>Validation Chromosomes</th></tr>
<tr><td>0</td><td>chr1</td><td>chr8, chr10</td></tr>
<tr><td>1</td><td>chr2, chr19</td><td>chr1</td></tr>
<tr><td>2</td><td>chr3, chr20</td><td>chr2, chr19</td></tr>
<tr><td>3</td><td>chr6, chr13, chr22</td><td>chr3, chr20</td></tr>
<tr><td>4</td><td>chr5, chr16, chrY</td><td>chr6, chr13, chr22</td></tr>
<tr><td>5</td><td>chr4, chr15, chr21</td><td>chr5, chr16, chrY</td></tr>
<tr><td>6</td><td>chr7, chr18, chr14</td><td>chr4, chr15, chr21</td></tr>
<tr><td>7</td><td>chr11, chr17, chrX</td><td>chr7, chr18, chr14</td></tr>
<tr><td>8</td><td>chr9, chr12</td><td>chr11, chr17, chrX</td></tr>
<tr><td>9</td><td>chr8, chr10</td><td>chr9, chr12</td></tr>
</table>
<p>The remaining chromosomes were used for training in each fold.</p>
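
The usage notes above can be exercised without TensorFlow by preparing the input and post-processing the outputs in NumPy. The following is a minimal sketch under stated assumptions: the helper names `one_hot_encode` and `per_base_counts` are hypothetical, and the A, C, G, T channel order is an assumption that should be checked against the training code before use. The count conversion follows the formula given in the record.

```python
import numpy as np
from scipy.special import softmax


def one_hot_encode(seqs):
    """One-hot encode equal-length DNA strings into a B x L x 4 array.

    The A, C, G, T channel order is an assumption, not taken from the record.
    """
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seqs), len(seqs[0]), 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for j, base in enumerate(seq.upper()):
            if base in lookup:  # N and other symbols stay all-zero
                out[i, j, lookup[base]] = 1.0
    return out


def per_base_counts(pred_profile, pred_logcounts):
    """Turn profile logits (B x L) and log-counts (B x 1) into counts (B x L)."""
    total = np.exp(pred_logcounts) - 1  # same expression as in the record
    return softmax(pred_profile, axis=-1) * total
```

Because the softmax sums to one along the last axis, each row of the result sums to `exp(pred_logcounts) - 1`, which gives a quick sanity check on real model outputs.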

    Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays.

    The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced

    Bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants.

    <ul> <li>(MAJOR) Bug in the chrombpnet modisco_motifs command: the number of seqlets was limited to 50000, and a user request to change it (for example, to 1 million) had no effect.</li> <li>Filter peaks at edges for the pred_bw command and the bias pipeline, so bias evaluation is now done on these filtered peaks.</li> <li>Preprocessing defaults to unix sort. Provided an option to switch to bedtools sort.</li> <li>Provided an option to filter chromosomes in preprocessing.</li> </ul> <p><strong>Full Changelog</strong>: https://github.com/kundajelab/chrombpnet/compare/v0.1.3...v0.1.4</p>