EpiAlign: an alignment-based bioinformatic tool for comparing chromatin state sequences.
The availability of genome-wide epigenomic datasets enables in-depth studies of epigenetic modifications and their relationships with chromatin structure and gene expression. Various alignment tools have been developed to align nucleotide or protein sequences in order to identify structurally similar regions. However, no alignment methods are specifically designed to compare multi-track epigenomic signals and to detect common patterns that may explain functional or evolutionary similarities. We propose a new local alignment algorithm, EpiAlign, designed to compare chromatin state sequences learned from multi-track epigenomic signals and to identify locally aligned chromatin regions. EpiAlign is a dynamic programming algorithm that, unlike standard sequence aligners, accounts for the varying lengths and frequencies of chromatin states. We demonstrate the efficacy of EpiAlign through extensive simulations and studies on real data from the NIH Roadmap Epigenomics project. EpiAlign is able to extract recurrent chromatin state patterns along a single epigenome, and many of these patterns carry cell-type-specific characteristics. EpiAlign can also detect common chromatin state patterns across multiple epigenomes, and it will serve as a useful tool to group and distinguish epigenomic samples based on genome-wide or local chromatin state patterns.
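The dynamic programming at the core of EpiAlign can be pictured with a minimal Smith-Waterman-style local alignment over state labels. This is a hedged sketch, not EpiAlign itself: the real algorithm additionally weights the lengths and genome-wide frequencies of chromatin states, and the state names and scores below are illustrative assumptions.

```python
# Illustrative local alignment over chromatin state sequences.
# NOT EpiAlign's actual scoring: EpiAlign also incorporates state
# lengths and frequencies; match/mismatch/gap values here are arbitrary.

def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman-style local alignment score over state labels."""
    m, n = len(a), len(b)
    # Score matrix with a zero floor, the hallmark of local alignment.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_end = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # extend a (mis)match
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_end = H[i][j], (i, j)
    return best, best_end

# Hypothetical chromatin state sequences, one label per genomic bin.
s1 = ["TssA", "Enh", "Enh", "Tx", "Quies"]
s2 = ["Quies", "TssA", "Enh", "Enh", "Tx"]
score, end = local_align(s1, s2)  # the shared TssA-Enh-Enh-Tx run scores 8
```

The zero floor lets the alignment restart anywhere, so only the best locally matching chromatin region contributes to the score.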
Response to "Neglecting normalization impact in semi-synthetic RNA-seq data simulation generates artificial false positives" and "Winsorization greatly reduces false positives by popular differential expression methods when analyzing human population samples"
Two correspondences raised concerns or comments about our analyses regarding exaggerated false positives found by differential expression (DE) methods. Here, we discuss the points they raise and explain why we agree or disagree with each. We add a new analysis to confirm that the Wilcoxon rank-sum test remains the most robust method, compared with the other five DE methods (DESeq2, edgeR, limma-voom, dearseq, and NOISeq), in two-condition DE analyses after accounting for normalization and winsorization, the data preprocessing steps discussed in the two correspondences.
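The preprocessing-plus-test pipeline discussed above can be sketched as follows. This is a minimal illustration, not the authors' code: the data are simulated, and the 5% winsorization limits are an assumed choice.

```python
# Minimal sketch of winsorization followed by the Wilcoxon rank-sum test
# for one gene in a two-condition comparison. Simulated data; the 5%
# winsorization limits are an illustrative assumption.
import numpy as np
from scipy.stats import ranksums
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(0)
# Hypothetical expression values for one gene in two conditions,
# with one extreme outlier in condition A.
cond_a = np.append(rng.normal(10, 1, 50), 1000.0)
cond_b = rng.normal(12, 1, 50)

# Winsorize: clamp the most extreme values in each 5% tail.
a_w = np.asarray(winsorize(cond_a, limits=(0.05, 0.05)))
b_w = np.asarray(winsorize(cond_b, limits=(0.05, 0.05)))

# The rank-sum test depends only on ranks, so it stays robust to the
# (now clamped) outlier.
stat, pval = ranksums(a_w, b_w)
```

Because the test uses only ranks, a single outlier shifts the statistic by at most one rank position, which is why the Wilcoxon test is robust even before winsorization.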
Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation
The emergence of self-supervised representations (e.g., wav2vec 2.0) allows speaker-recognition approaches to process spoken signals through foundation models built on speech data. Nevertheless, effective fusion of the representation requires further investigation, because existing approaches rely on fixed or sub-optimal temporal pooling strategies. Even improved strategies that consider graph learning and graph attention factors still use non-injective aggregation, which may limit speaker-recognition performance. We therefore propose a speaker-recognition approach using an Isomorphic Graph ATtention network (IsoGAT) on self-supervised representations. The proposed approach contains three modules (representation learning, graph attention, and aggregation) that jointly consider learning on the self-supervised representation and the IsoGAT. We perform speaker-recognition experiments on the VoxCeleb1&2 datasets, and the results demonstrate the recognition performance of the proposed approach compared with existing pooling approaches on self-supervised representations.
Comment: 9 pages, 4 figures
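The non-injectivity concern above can be made concrete with a toy example. This sketch is not from the paper; it only illustrates the well-known point (from Graph Isomorphism Networks) that mean pooling maps different multisets of node features to the same output, while sum aggregation can tell them apart.

```python
# Toy illustration (not the paper's code) of injective vs non-injective
# aggregation over multisets of neighbor features.
import numpy as np

# Two different multisets of 1-D node features with the same mean.
set_a = np.array([[1.0], [3.0]])         # mean 2, sum 4
set_b = np.array([[2.0], [2.0], [2.0]])  # mean 2, sum 6

mean_a, mean_b = set_a.mean(axis=0), set_b.mean(axis=0)
sum_a, sum_b = set_a.sum(axis=0), set_b.sum(axis=0)

assert np.allclose(mean_a, mean_b)    # mean pooling cannot distinguish them
assert not np.allclose(sum_a, sum_b)  # sum aggregation can
```

An aggregator that collapses distinct neighborhoods this way can discard speaker-discriminative temporal structure, which is the motivation for injective aggregation in IsoGAT-style pooling.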
Developmental isoform diversity in the human neocortex informs neuropsychiatric risk mechanisms
RNA splicing is highly prevalent in the brain and has strong links to neuropsychiatric disorders; yet, the role of cell-type-specific splicing and transcript-isoform diversity during human brain development has not been systematically investigated. In this work, we leveraged single-molecule long-read sequencing to deeply profile the full-length transcriptome of the germinal zone and cortical plate regions of the developing human neocortex at tissue and single-cell resolution. We identified 214,516 distinct isoforms, of which 72.6% were novel (not previously annotated in Gencode version 33), and uncovered a substantial contribution of transcript-isoform diversity, regulated by RNA-binding proteins, to defining cellular identity in the developing neocortex. We leveraged this comprehensive isoform-centric gene annotation to reprioritize thousands of rare de novo risk variants and elucidate genetic risk mechanisms for neuropsychiatric disorders.
Statistical and Computational Methods for Comparing High-Throughput Data from Two Conditions
The development of high-throughput biological technologies has enabled researchers to simultaneously analyze thousands of features (e.g., genes, genomic regions, and proteins). The most common goal of analyzing high-throughput data is to contrast two conditions and identify "interesting" features whose values differ between the conditions. How to contrast features from two conditions to extract useful information, and how to ensure the reliability of the identified features, are two increasingly pressing challenges for statistical and computational science. This dissertation aims to address these two problems in analyzing high-throughput data from two conditions.

My first project focuses on false discovery rate (FDR) control in high-throughput data analysis from two conditions. The FDR is defined as the expected proportion of uninteresting features among the identified ones. It is the most widely used criterion for ensuring the reliability of the identified interesting features. Existing bioinformatics tools primarily control the FDR based on p-values. However, obtaining valid p-values relies on either reasonable assumptions about the data distribution or large numbers of replicates under both conditions, two requirements that are often unmet in biological studies. We propose Clipper, a general statistical framework for FDR control that relies on neither p-values nor specific data distributions. Clipper is applicable to identifying both enriched and differential features from high-throughput biological data of diverse types. In comprehensive simulation and real-data benchmarking, Clipper outperforms existing generic FDR control methods and specific bioinformatics tools designed for various tasks, including peak calling from ChIP-seq data and differentially expressed gene identification from bulk or single-cell RNA-seq data.
Our results demonstrate Clipper's flexibility and reliability for FDR control, as well as its broad applications in high-throughput data analysis. My second project focuses on the alignment of multi-track epigenomic signals from different samples or conditions. The availability of genome-wide epigenomic datasets enables in-depth studies of epigenetic modifications and their relationships with chromatin structure and gene expression. Various alignment tools have been developed to align nucleotide or protein sequences in order to identify structurally similar regions. However, no alignment methods are specifically designed to compare multi-track epigenomic signals and to detect common patterns that may explain functional or evolutionary similarities. We propose a new local alignment algorithm, EpiAlign, designed to compare chromatin state sequences learned from multi-track epigenomic signals and to identify locally aligned chromatin regions. EpiAlign is a dynamic programming algorithm that, unlike standard sequence aligners, accounts for the varying lengths and frequencies of chromatin states. We demonstrate the efficacy of EpiAlign through extensive simulations and studies on real data from the NIH Roadmap Epigenomics project. EpiAlign can also detect common chromatin state patterns across multiple epigenomes from different conditions, and it will serve as a useful tool to group and distinguish epigenomic samples based on genome-wide or local chromatin state patterns.
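For contrast with Clipper's p-value-free approach, the standard p-value-based route to FDR control is the Benjamini-Hochberg procedure, sketched below. The p-values are made-up illustrative numbers; this is the generic baseline the abstract alludes to, not Clipper's algorithm.

```python
# Benjamini-Hochberg: the standard p-value-based FDR control procedure
# (shown for contrast; Clipper itself avoids p-values entirely).

def benjamini_hochberg(pvals, alpha=0.05):
    """Return indices of discoveries at target FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= alpha * k / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    # Reject the k smallest p-values.
    return sorted(order[:k])

# Illustrative p-values for eight hypothetical features.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.9]
discoveries = benjamini_hochberg(pvals, alpha=0.05)  # -> [0, 1]
```

The procedure's validity hinges on the p-values being valid to begin with, which is exactly the requirement, distributional assumptions or many replicates, that the abstract identifies as often unmet.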
Enhancing rigor in computational methods for biological data analysis
When you use the most popular computational methods for biological data analysis, have you checked whether their models are reasonable in your settings?
