51,243 research outputs found
Large scale hierarchical clustering of protein sequences
BACKGROUND: Searching a biological sequence database with a query sequence looking for homologues has become a routine operation in computational biology. In spite of the high degree of sophistication of currently available search routines it is still virtually impossible to identify quickly and clearly a group of sequences that a given query sequence belongs to. RESULTS: We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences resulting in a hierarchical clustering which is being made available for querying and browsing at . CONCLUSIONS: Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences
Establishment of a integrative multi-omics expression database CKDdb in the context of chronic kidney disease (CKD)
Complex human traits such as chronic kidney disease (CKD) are a major health and financial burden in modern societies. Currently, the description of the CKD onset and progression at the molecular level is still not fully understood. Meanwhile, the prolific use of high-throughput omic technologies in disease biomarker discovery studies yielded a vast amount of disjointed data that cannot be easily collated. Therefore, we aimed to develop a molecule-centric database featuring CKD-related experiments from available literature publications. We established the Chronic Kidney Disease database CKDdb, an integrated and clustered information resource that covers multi-omic studies (microRNAs, genomics, peptidomics, proteomics and metabolomics) of CKD and related disorders by performing literature data mining and manual curation. The CKDdb database contains differential expression data from 49395 molecule entries (redundant), of which 16885 are unique molecules (non-redundant) from 377 manually curated studies of 230 publications. This database was intentionally built to allow disease pathway analysis through a systems approach in order to yield biological meaning by integrating all existing information and therefore has the potential to unravel and gain an in-depth understanding of the key molecular events that modulate CKD pathogenesis
Redundancy-Free Self-Supervised Relational Learning for Graph Clustering
Graph clustering, which learns the node representations for effective cluster
assignments, is a fundamental yet challenging task in data analysis and has
received considerable attention accompanied by graph neural networks in recent
years. However, most existing methods overlook the inherent relational
information among the non-independent and non-identically distributed nodes in
a graph. Due to the lack of exploration of relational attributes, the semantic
information of the graph-structured data fails to be fully exploited which
leads to poor clustering performance. In this paper, we propose a novel
self-supervised deep graph clustering method named Relational Redundancy-Free
Graph Clustering (RFGC) to tackle the problem. It extracts the attribute-
and structure-level relational information from both global and local views
based on an autoencoder and a graph autoencoder. To obtain effective
representations of the semantic information, we preserve the consistent
relation among augmented nodes, whereas the redundant relation is further
reduced for learning discriminative embeddings. In addition, a simple yet valid
strategy is utilized to alleviate the over-smoothing issue. Extensive
experiments are performed on widely used benchmark datasets to validate the
superiority of our RFGC over state-of-the-art baselines. Our codes are
available at https://github.com/yisiyu95/R2FGC.Comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems
(TNNLS 2024
- …