117 research outputs found
Multilevel Hierarchical Kernel Spectral Clustering for Real-Life Large Scale Complex Networks
Kernel spectral clustering corresponds to a weighted kernel principal
component analysis problem in a constrained optimization framework. The primal
formulation leads to an eigen-decomposition of a centered Laplacian matrix at
the dual level. The dual formulation allows a model to be built on a
representative subgraph of the large scale network in the training phase and
the model parameters are estimated in the validation stage. The KSC model has a
powerful out-of-sample extension property which allows cluster affiliation for
the unseen nodes of the big data network. In this paper we exploit the
structure of the projections in the eigenspace during the validation stage to
automatically determine a set of increasing distance thresholds. We use these
distance thresholds in the test phase to obtain multiple levels of hierarchy
for the large scale network. The hierarchical structure in the network is
determined in a bottom-up fashion. We empirically showcase that real-world
networks have multilevel hierarchical organization which cannot be detected
efficiently by several state-of-the-art large scale hierarchical community
detection techniques like the Louvain, OSLOM and Infomap methods. We show a
major advantage of our proposed approach, i.e. the ability to locate good quality
clusters at both the coarser and finer levels of hierarchy using internal
cluster quality metrics on 7 real-life networks. Comment: PLOS ONE, Vol 9, Issue 6, June 201
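The bottom-up use of increasing distance thresholds described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the cosine distance, the single-linkage merging rule, the example projections and the threshold values are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def clusters_at_threshold(points, threshold):
    """Group points linked by pairwise distances <= threshold (single linkage)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if cosine_distance(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def build_hierarchy(points, thresholds):
    """Return one clustering per threshold, finest level first."""
    return [clusters_at_threshold(points, t) for t in sorted(thresholds)]


# Hypothetical 2-D eigenspace projections of six nodes (assumed values).
projections = [(1.0, 0.0), (1.0, 0.05), (1.0, 0.5), (1.0, 0.6), (0.0, 1.0), (0.05, 1.0)]
thresholds = [0.01, 0.2, 1.0]  # assumed; the paper derives these automatically
levels = build_hierarchy(projections, thresholds)
for threshold, level in zip(thresholds, levels):
    print(threshold, level)
```

Each larger threshold merges the groups found at the previous level, so the list of clusterings reads from the finest level to the coarsest.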
Detection of statistically significant network changes in complex biological networks
Table S1. Description of data: GHD and MRA Results for all the 457 considered transcription factors on the TCGA and Rembrandt datasets. (XLSX 62.7 kb)
An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity
Disease processes are usually driven by several genes interacting in molecular modules or pathways leading to the disease. The identification of such modules in gene or protein networks is the core of computational methods in biomedical research. With this motivation, the Disease Module Identification (DMI) DREAM Challenge was initiated as an effort to systematically assess module identification methods on a panel of 6 diverse genomic networks. In this paper, we propose a generic refinement method based on ideas of merging and splitting the hierarchical tree obtained from any community detection technique for constrained DMI in biological networks. The only constraint was that the size of each community is in the range [3, 100]. We propose a novel model evaluation metric, called F-score, computed from several unsupervised quality metrics like modularity, conductance and connectivity to determine the quality of a graph partition at a given level of hierarchy. We also propose a quality measure, namely Inverse Confidence, which ranks and prunes insignificant modules to obtain a curated list of candidate disease modules (DM) for a biological network. The predicted modules are evaluated on the basis of the total number of unique candidate modules that are associated with complex traits and diseases from over 200 genome-wide association study (GWAS) datasets. During the competition, we identified 42 modules, ranking 15th at the official false detection rate (FDR) cut-off of 0.05 for identifying statistically significant DM in the 6 benchmark networks. However, for the more stringent FDR cut-offs of 0.025 and 0.01, the proposed method identified 31 (rank 9) and 16 DM (rank 10) respectively. From additional analysis, our proposed approach detected a total of 44 DM in the networks in comparison to 60 for the winner of the DREAM Challenge. Interestingly, for several individual benchmark networks, our performance was better than or competitive with the winner.
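Two of the unsupervised quality metrics named in this abstract, modularity and conductance, have standard definitions that can be sketched in a few lines. The abstract does not say how they are combined into the proposed F-score, so that combination is deliberately left out here. The toy graph, the function names and the size check are assumptions for illustration only.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    membership = {node: idx for idx, nodes in enumerate(communities) for node in nodes}
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


def conductance(edges, community):
    """Cut size divided by the smaller side volume (undefined if one side is empty)."""
    inside = set(community)
    cut = volume_in = volume_out = 0
    for u, v in edges:
        ends_inside = (u in inside) + (v in inside)
        volume_in += ends_inside
        volume_out += 2 - ends_inside
        if ends_inside == 1:
            cut += 1
    return cut / min(volume_in, volume_out)


def within_size_range(community, low=3, high=100):
    """Size constraint quoted in the abstract: community size in [3, 100]."""
    return low <= len(community) <= high


# Toy graph (assumed): two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = [[0, 1, 2], [3, 4, 5]]
print(modularity(edges, partition))
print(conductance(edges, partition[0]))
print(all(within_size_range(c) for c in partition))
```

Higher modularity and lower conductance both indicate a cleaner partition, which is why a combined score needs to reconcile the two directions.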
A new efficient and unbiased approach for clustering quality evaluation
Traditional quality indexes (Inertia, DB, ...) are known to be method-dependent indexes that do not allow the quality of a clustering to be properly estimated in several cases, such as that of complex data like textual data. We thus propose an alternative approach for clustering quality evaluation based on unsupervised measures of Recall, Precision and F-measure exploiting the descriptors of the data associated with the obtained clusters. Two categories of index are proposed, namely Macro and Micro indexes. This paper also focuses on the construction of a new cumulative Micro precision index that makes it possible to evaluate the overall quality of a clustering result while clearly distinguishing between homogeneous and heterogeneous, or degenerated, results. The experimental comparison of the behavior of the classical indexes with our new approach is performed on a polythematic dataset of bibliographical references issued from the PASCAL database.
Characteristic MicroRNAs Linked to Dysregulated Metabolic Pathways in Qatari Adult Subjects With Obesity and Metabolic Syndrome
Background: Obesity-associated dysglycemia is associated with metabolic disorders. MicroRNAs (miRNAs) are known regulators of metabolic homeostasis. We aimed to assess the relationship of circulating miRNAs with clinical features in obese Qatari individuals. Methods: We analyzed a dataset of 39 age-matched patients that includes 18 subjects with obesity only (OBO) and 21 subjects with obesity and metabolic syndrome (OBM). We measured 754 well-characterized human microRNAs (miRNAs) and identified differentially expressed miRNAs along with their significant associations with clinical markers in these patients. Results: A total of 64 miRNAs were differentially expressed between metabolically healthy obese (OBO) versus metabolically unhealthy obese (OBM) patients. Thirteen out of 64 miRNAs significantly correlated with at least one clinical trait of the metabolic syndrome. Six out of the thirteen demonstrated significant association with HbA1c levels; miR-331-3p, miR-452-3p, and miR-485-5p were over-expressed, whereas miR-153-3p, miR-182-5p, and miR-433-3p were under-expressed in the OBM patients with elevated HbA1c levels. We also identified miR-106b-3p, miR-652-3p, and miR-93-5p that showed a significant association with creatinine; miR-130b-5p, miR-363-3p, and miR-636 were significantly associated with cholesterol, whereas miR-130a-3p was significantly associated with LDL. Additionally, miR-652-3p’s differential expression correlated significantly with HDL and creatinine. Conclusions: MicroRNAs associated with metabolic syndrome in obese subjects may have a pathophysiologic role and can serve as markers for obese individuals predisposed to various metabolic diseases like diabetes.
Kernel Spectral Clustering and applications
In this chapter we review the main literature related to kernel spectral
clustering (KSC), an approach to clustering cast within a kernel-based
optimization setting. KSC represents a least-squares support vector machine
based formulation of spectral clustering described by a weighted kernel PCA
objective. Just as in the classifier case, the binary clustering model is
expressed by a hyperplane in a high dimensional space induced by a kernel. In
addition, the multi-way clustering can be obtained by combining a set of binary
decision functions via an Error Correcting Output Codes (ECOC) encoding scheme.
Because of its model-based nature, the KSC method encompasses three main steps:
training, validation, testing. In the validation stage model selection is
performed to obtain tuning parameters, like the number of clusters present in
the data. This is a major advantage compared to classical spectral clustering
where the determination of the clustering parameters is unclear and relies on
heuristics. Once a KSC model is trained on a small subset of the entire data,
it is able to generalize well to unseen test points. Beyond the basic
formulation, sparse KSC algorithms based on the Incomplete Cholesky
Decomposition (ICD) and , , Group Lasso regularization are
reviewed. In that respect, we show how it is possible to handle large scale
data. Also, two possible ways to perform hierarchical clustering and a soft
clustering method are presented. Finally, real-world applications such as image
segmentation, power load time-series clustering, document clustering and big
data learning are considered. Comment: chapter contribution to the book "Unsupervised Learning Algorithms"
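The ECOC step mentioned in this abstract, where several binary decision functions are combined into a multi-way clustering, can be sketched as follows. This is an illustration under stated assumptions rather than the chapter's implementation: it assumes sign-binarized score variables, a codebook made of the k most frequent sign patterns in the training set, and Hamming-distance decoding. All function names and score values are hypothetical.

```python
from collections import Counter


def sign_codeword(scores):
    """Binarize one row of score variables into a tuple of +1/-1 values."""
    return tuple(1 if s >= 0 else -1 for s in scores)


def build_codebook(train_scores, k):
    """Keep the k most frequent sign patterns seen in training as cluster prototypes."""
    counts = Counter(sign_codeword(row) for row in train_scores)
    return [code for code, _ in counts.most_common(k)]


def hamming(a, b):
    """Number of positions where two codewords differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(scores, codebook):
    """Return the index of the codeword closest in Hamming distance."""
    code = sign_codeword(scores)
    return min(range(len(codebook)), key=lambda i: hamming(code, codebook[i]))


# Assumed training scores from two binary decision functions (k = 3 clusters).
train_scores = [
    (0.9, 0.8), (0.7, 0.6), (0.8, 0.9), (0.5, 0.4),   # pattern (+1, +1)
    (-0.5, 0.7), (-0.6, 0.4), (-0.3, 0.9),            # pattern (-1, +1)
    (-0.4, -0.9), (-0.7, -0.2),                       # pattern (-1, -1)
    (0.1, -0.3),                                      # rare pattern, treated as noise
]
codebook = build_codebook(train_scores, k=3)
print(codebook)
print(assign((0.2, 0.5), codebook), assign((-0.3, -0.1), codebook))
```

Because assignment only needs the score variables of a new point, the same decoding applies to unseen test points, which matches the out-of-sample use described in the abstract.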
The future of sleep health: a data-driven revolution in sleep science and medicine.
In recent years, there has been a significant expansion in the development and use of multi-modal sensors and technologies to monitor physical activity, sleep and circadian rhythms. These developments make accurate sleep monitoring at scale a possibility for the first time. Vast amounts of multi-sensor data are being generated with potential applications ranging from large-scale epidemiological research linking sleep patterns to disease, to wellness applications, including the sleep coaching of individuals with chronic conditions. However, in order to realise the full potential of these technologies for individuals, medicine and research, several significant challenges must be overcome. There are important outstanding questions regarding performance evaluation, as well as data storage, curation, processing, integration, modelling and interpretation. Here, we leverage expertise across neuroscience, clinical medicine, bioengineering, electrical engineering, epidemiology, computer science, mHealth and human-computer interaction to discuss the digitisation of sleep from an interdisciplinary perspective. We introduce the state-of-the-art in sleep-monitoring technologies, and discuss the opportunities and challenges from data acquisition to the eventual application of insights in clinical and consumer settings. Further, we explore the strengths and limitations of current and emerging sensing methods with a particular focus on novel data-driven technologies, such as Artificial Intelligence.
Molecular mechanism of RIPK1 and caspase-8 in homeostatic type I interferon production and regulation
Type I interferons (IFNs) are essential innate immune proteins that maintain tissue homeostasis through tonic expression and can be upregulated to drive antiviral resistance and inflammation upon stimulation. However, the mechanisms that inhibit aberrant IFN upregulation in homeostasis and the impacts of tonic IFN production on health and disease remain enigmatic. Here, we report that caspase-8 negatively regulates type I IFN production by inhibiting the RIPK1-TBK1 axis during homeostasis across multiple cell types and tissues. When caspase-8 is deleted or inhibited, RIPK1 interacts with TBK1 to drive elevated IFN production, leading to heightened resistance to norovirus infection in macrophages but also early onset lymphadenopathy in mice. Combined deletion of caspase-8 and RIPK1 reduces the type I IFN signaling and lymphadenopathy, highlighting the critical role of RIPK1 in this process. Overall, our study identifies a mechanism to constrain tonic type I IFN during homeostasis which could be targeted for infectious and inflammatory diseases
An integrated multi-omic approach demonstrates distinct molecular signatures between human obesity with and without metabolic complications: a case–control study
Objectives: To examine the hypothesis that obesity complicated by the metabolic syndrome, compared to uncomplicated obesity, has distinct molecular signatures and metabolic pathways. Methods: We analyzed a cohort of 39 participants with obesity that included 21 with metabolic syndrome, age-matched to 18 without metabolic complications. We measured in whole blood samples 754 human microRNAs (miRNAs), 704 metabolites using unbiased mass spectrometry metabolomics, and 25,682 transcripts, which include both protein coding genes (PCGs) as well as non-coding transcripts. We then identified differentially expressed miRNAs, PCGs, and metabolites and integrated them using databases such as mirDIP (mapping between miRNA-PCG network), Human Metabolome Database (mapping between metabolite-PCG network) and tools like MetaboAnalyst (mapping between metabolite-metabolic pathway network) to determine dysregulated metabolic pathways in obesity with metabolic complications. Results: We identified 8 significantly enriched metabolic pathways comprising 8 metabolites, 25 protein coding genes and 9 microRNAs which are each differentially expressed between the subjects with obesity and those with obesity and metabolic syndrome. By performing unsupervised hierarchical clustering on the enrichment matrix of the 8 metabolic pathways, we could approximately segregate the uncomplicated obesity strata from that of obesity with metabolic syndrome. Conclusions: The data suggest that at least 8 metabolic pathways, along with their various dysregulated elements, identified via our integrative bioinformatics pipeline, can potentially differentiate those with obesity from those with obesity and metabolic complications
Sparsity in Large Scale Kernel Models
In the modern era, with the advent of technology and its widespread usage, there is a huge proliferation of data. Gigabytes of data from mobile devices, market basket, geo-spatial images, search engines, online social networks etc. can be easily obtained, accumulated and stored. This immense wealth of data has resulted in massive datasets and has led to the emergence of the concept of Big Data. Mining useful information from this big data is a challenging task. With the availability of more data, the choice of predictive models narrows, because very few tools are feasible for processing large scale datasets. A successful learning framework to perform various learning tasks like classification, regression, clustering, dimensionality reduction, feature selection etc. is offered by Least Squares Support Vector Machines (LSSVM), which is designed in a primal-dual optimization setting. It provides the flexibility to extend core models by adding additional constraints to the primal problem, by changing the objective function or introducing new model selection criteria.
The goal of this thesis is to explore the role of sparsity in large scale kernel models using core models adopted from the LSSVM framework. Real-world data is often noisy and only a small fraction of it contains the most relevant information. Sparsity plays a big role in selection of this representative subset of data. We first explored sparsity in the case of large scale LSSVM using fixed-size methods with a re-weighted L1 penalty on top resulting in very sparse LSSVM (VS-LSSVM).
An important aspect of kernel based methods is the selection of a subset on which the model is built and validated. We proposed a novel fast and unique representative subset (FURS) selection technique to select a subset from complex networks which retains the inherent community structure in the network. We extend this method for Big Data learning by constructing k-NN graphs out of dense data using a distributed computing platform i.e. Hadoop and then apply the FURS selection technique to obtain representative subsets on top of which models are built by kernel based methods.
We then focused on scaling the kernel spectral clustering (KSC) technique for big data networks. We devised two model selection techniques, namely balanced angular fitting (BAF) and self-tuned KSC (ST-KSC), by exploiting the structure of the projections in the eigenspace to obtain the optimal number of communities k in the large graph. A multilevel hierarchical kernel spectral clustering (MH-KSC) technique was then proposed which performs agglomerative hierarchical clustering using similarity information between the out-of-sample eigen-projections.
Furthermore, we developed an algorithm to identify intervals for hierarchical clustering using the Gershgorin Circle theorem. These intervals were used to identify the optimal number of clusters at a given level of hierarchy in combination with the KSC model. The MH-KSC technique was extended from networks to images and datasets using the BAF model selection criterion. We also proposed optimal sparse reductions to the KSC model by reconstructing the model using a reduced set. We exploited the Group Lasso and convex re-weighted L1 penalty to sparsify the KSC model.
Finally, we explored the role of re-weighted L1 penalty in case of feature selection in combination with LSSVM. We proposed a visualization (Netgram) toolkit to track the evolution of communities/clusters over time in case of dynamic time-evolving communities and datasets. Real world applications considered in this thesis include classification and regression of large scale datasets, image segmentation, flat and hierarchical community detection in large scale graphs and visualization of evolving communities. nrpages: 238; status: published
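The thesis abstract above mentions identifying intervals with the Gershgorin Circle theorem. The theorem itself states that every eigenvalue of a square matrix lies in at least one disc centred at a diagonal entry, with radius equal to the sum of the absolute off-diagonal entries in that row. The sketch below only computes and merges these intervals for a real symmetric matrix. How the thesis turns them into cluster counts is not stated in the abstract, so that step is not shown, and the example matrix and function names are assumptions.

```python
def gershgorin_intervals(matrix):
    """Return the (lower, upper) real-axis bounds of each row's Gershgorin disc."""
    intervals = []
    for i, row in enumerate(matrix):
        radius = sum(abs(x) for j, x in enumerate(row) if j != i)
        intervals.append((row[i] - radius, row[i] + radius))
    return intervals


def merge_intervals(intervals):
    """Merge overlapping or touching intervals into disjoint groups."""
    merged = []
    for low, high in sorted(intervals):
        if merged and low <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


# Assumed symmetric example matrix; its eigenvalues are real.
matrix = [
    [4.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 0.0],
]
discs = gershgorin_intervals(matrix)
print(discs)
print(merge_intervals(discs))
```

By the theorem, a merged group formed from n discs that is disjoint from the remaining discs contains exactly n eigenvalues, so the merged intervals give a cheap way to bracket groups of eigenvalues without a full eigendecomposition.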
- …