71,133 research outputs found
Dimensionality reduction and hierarchical clustering in framework for hyperspectral image segmentation
The hyperspectral data contains hundreds of narrows bands representing the same scene on earth, with each pixel has a continuous reflectance spectrum. The first attempts to analysehyperspectral images were based on techniques that were developed for multispectral images by randomly selecting few spectral channels, usually less than seven. This random selection of bands degrades the performance of segmentation algorithm on hyperspectraldatain terms of accuracies. In this paper, a new framework is designed for the analysis of hyperspectral image by taking the information from all the data channels with dimensionality reduction method using subset selection and hierarchical clustering. A methodology based on subset construction is used for selecting k informative bands from d bands dataset. In this selection, similarity metrics such as Average Pixel Intensity [API], Histogram Similarity [HS], Mutual Information [MI] and Correlation Similarity [CS] are used to create k distinct subsets and from each subset, a single band is selected. The informative bands which are selected are merged into a single image using hierarchical fusion technique. After getting fused image, Hierarchical clustering algorithm is used for segmentation of image. The qualitative and quantitative analysis shows that CS similarity metric in dimensionality reduction algorithm gets high quality segmented image
Recommended from our members
Integrative analysis of the inter-tumoral heterogeneity of triple-negative breast cancer.
Triple-negative breast cancers (TNBC) lack estrogen and progesterone receptors and HER2 amplification, and are resistant to therapies that target these receptors. Tumors from TNBC patients are heterogeneous based on genetic variations, tumor histology, and clinical outcomes. We used high throughput genomic data for TNBC patients (nβ=β137) from TCGA to characterize inter-tumor heterogeneity. Similarity network fusion (SNF)-based integrative clustering combining gene expression, miRNA expression, and copy number variation, revealed three distinct patient clusters. Integrating multiple types of data resulted in more distinct clusters than analyses with a single datatype. Whereas most TNBCs are classified by PAM50 as basal subtype, one of the clusters was enriched in the non-basal PAM50 subtypes, exhibited more aggressive clinical features and had a distinctive signature of oncogenic mutations, miRNAs and expressed genes. Our analyses provide a new classification scheme for TNBC based on multiple omics datasets and provide insight into molecular features that underlie TNBC heterogeneity
Similarity-based virtual screening using 2D fingerprints
This paper summarises recent work at the University of Sheffield on virtual screening methods that use 2D fingerprint measures of structural similarity. A detailed comparison of a large number of similarity coefficients demonstrates that the well-known Tanimoto coefficient remains the method of choice for the computation of fingerprint-based similarity, despite possessing some inherent biases related to the sizes of the molecules that are being sought. Group fusion involves combining the results of similarity searches based on multiple reference structures and a single similarity measure. We demonstrate the effectiveness of this approach to screening, and also describe an approximate form of group fusion, turbo similarity searching, that can be used when just a single reference structure is available
An integrated clustering analysis framework for heterogeneous data
Big data is a growing area of research with some important research challenges that motivate
our work. We focus on one such challenge, the variety aspect. First, we introduce
our problem by defining heterogeneous data as data about objects that are described by
different data types, e.g., structured data, text, time-series, images, etc. Through our work
we use five datasets for experimentation: a real dataset of prostate cancer data and four
synthetic dataset that we have created and made them publicly available. Each dataset
covers different combinations of data types that are used to describe objects. Our strategy
for clustering is based on fusion approaches. We compare intermediate and late fusion
schemes. We propose an intermediary fusion approach, Similarity Matrix Fusion (SMF),
where the integration process takes place at the level of calculating similarities. SMF produces
a single distance fusion matrix and two uncertainty expression matrices. We then
propose a clustering algorithm, Hk-medoids, a modified version of the standard k-medoids
algorithm that utilises uncertainty calculations to improve on the clustering performance.
We evaluate our results by comparing them to clustering produced using individual elements
and show that the fusion approach produces equal or significantly better results.
Also, we show that there are advantages in utilising the uncertainty information as Hkmedoids
does. In addition, from a theoretical point of view, our proposed Hk-medoids
algorithm has less computation complexity than the popular PAM implementation of the
k-medoids algorithm. Then, we employed late fusion that aggregates the results of clustering
by individual elements by combining cluster labels using an object co-occurrence
matrix technique. The final cluster is then derived by a hierarchical clustering algorithm.
We show that intermediate fusion for clustering of heterogeneous data is a feasible and
efficient approach using our proposed Hk-medoids algorithm
Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bit-strings
This paper compares 22 different similarity coefficients when they are used for searching databases of 2D fragment bit-strings. Experiments with the National Cancer Institute's AIDS and IDAlert databases show that the coefficients fall into several well-marked clusters, in which the members of a cluster will produce comparable rankings of a set of molecules. These clusters provide a basis for selecting combinations of coefficients for use in data fusion experiments. The results of these experiments provide a simple way of increasing the effectiveness of fragment-based similarity searching systems
Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bit-strings
This paper compares 22 different similarity coefficients when they are used for searching databases of 2D fragment bit-strings. Experiments with the National Cancer Institute's AIDS and IDAlert databases show that the coefficients fall into several well-marked clusters, in which the members of a cluster will produce comparable rankings of a set of molecules. These clusters provide a basis for selecting combinations of coefficients for use in data fusion experiments. The results of these experiments provide a simple way of increasing the effectiveness of fragment-based similarity searching systems
Patient-specific data fusion for cancer stratification and personalised treatment
According to Cancer Research UK, cancer is a leading cause of death accounting for more than one in four of all deaths in 2011. The recent advances in experimental technologies in cancer research have resulted in the accumulation of large amounts of patient-specific datasets, which provide complementary information on the same cancer type. We introduce a versatile data fusion (integration) framework that can effectively integrate somatic mutation data, molecular interactions and drug chemical data to address three key challenges in cancer research: stratification of patients into groups having different clinical outcomes, prediction of driver genes whose mutations trigger the onset and development of cancers, and repurposing of drugs treating particular cancer patient groups. Our new framework is based on graph-regularised non-negative matrix tri-factorization, a machine learning technique for co-clustering heterogeneous datasets. We apply our framework on ovarian cancer data to simultaneously cluster patients, genes and drugs by utilising all datasets.We demonstrate superior performance of our method over the state-of-the-art method, Network-based Stratification, in identifying three patient subgroups that have significant differences in survival outcomes and that are in good agreement with other clinical data. Also, we identify potential new driver genes that we obtain by analysing the gene clusters enriched in known drivers of ovarian cancer progression. We validated the top scoring genes identified as new drivers through database search and biomedical literature curation. Finally, we identify potential candidate drugs for repurposing that could be used in treatment of the identified patient subgroups by targeting their mutated gene products. We validated a large percentage of our drug-target predictions by using other databases and through literature curation
Bayesian correlated clustering to integrate multiple datasets
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct β but often complementary β information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets.
Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDIβs performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques β as well as to non-integrative approaches β demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods
- β¦