
    HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis

Comprehensive benchmarking of clustering algorithms is rendered difficult by two key factors: (i) the elusiveness of a unique mathematical definition of this unsupervised learning approach and (ii) dependencies between the generating models or clustering criteria adopted by some clustering algorithms and indices for internal cluster validation. Consequently, there is no consensus regarding the best practice for rigorous benchmarking, and whether this is possible at all outside the context of a given application. Here, we argue that synthetic datasets must continue to play an important role in the evaluation of clustering algorithms, but that this necessitates constructing benchmarks that appropriately cover the diverse set of properties that impact clustering algorithm performance. Through our framework, HAWKS, we demonstrate the important role evolutionary algorithms play in supporting the flexible generation of such benchmarks, allowing simple modification and extension. We illustrate two possible uses of our framework: (i) the evolution of benchmark data consistent with a set of hand-derived properties and (ii) the generation of datasets that tease out performance differences between a given pair of algorithms. Our work has implications for the design of clustering benchmarks that sufficiently challenge a broad range of algorithms, and for furthering insight into the strengths and weaknesses of specific approaches.
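The second use case admits a compact illustration. Below is a minimal sketch of the idea, assuming a toy setup rather than the actual HAWKS framework: a simple evolutionary loop mutates Gaussian cluster centres to maximise the gap in adjusted Rand index between k-means and single-linkage clustering. The fitness function, mutation scheme, and population sizes are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
N_CLUSTERS, POINTS_PER_CLUSTER = 3, 50

def sample_dataset(means):
    # unit-variance Gaussian clusters around the candidate means
    X = np.vstack([rng.normal(m, 1.0, size=(POINTS_PER_CLUSTER, 2)) for m in means])
    y = np.repeat(np.arange(len(means)), POINTS_PER_CLUSTER)
    return X, y

def fitness(means):
    # how much better k-means recovers the planted labels than single linkage
    X, y = sample_dataset(means)
    km = KMeans(n_clusters=N_CLUSTERS, n_init=10).fit_predict(X)
    sl = AgglomerativeClustering(n_clusters=N_CLUSTERS, linkage="single").fit_predict(X)
    return adjusted_rand_score(y, km) - adjusted_rand_score(y, sl)

# simple (mu + lambda)-style evolution over the cluster means
population = [rng.normal(0.0, 5.0, size=(N_CLUSTERS, 2)) for _ in range(20)]
for _ in range(30):
    parents = sorted(population, key=fitness, reverse=True)[:5]
    children = [p + rng.normal(0.0, 0.5, p.shape) for p in parents for _ in range(3)]
    population = parents + children

print(max(population, key=fitness))
```

In HAWKS itself the genotype, variation operators, and objectives are considerably richer; the sketch only shows why an evolutionary search is a natural fit for generating datasets that discriminate between algorithms.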

    Generating Multidimensional Clusters With Support Lines

Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for a more complete coverage of a given problem's space. In turn, synthetic data generators have the potential of creating vast amounts of data -- a crucial activity when real-world data is at a premium -- while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, 100% unit tested and fully documented, and is available for the Python, R, Julia and MATLAB/Octave ecosystems. We demonstrate that our proposal is able to produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.
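The core generation step is easy to picture. Here is a minimal Python sketch of the idea, under the assumption that each cluster is built by placing points along a supporting line segment and then scattering them laterally; the function and parameter names are illustrative, not Clugen's actual API.

```python
import numpy as np

rng = np.random.default_rng(42)

def line_cluster(center, direction, length, n_points, lateral_sd):
    """Scatter points around a line segment through `center` along `direction`."""
    direction = direction / np.linalg.norm(direction)
    # positions along the segment, uniform in [-length/2, length/2]
    t = rng.uniform(-length / 2, length / 2, size=n_points)
    points = center + np.outer(t, direction)
    # isotropic lateral scatter around the supporting line
    points += rng.normal(0.0, lateral_sd, size=points.shape)
    return points

# four 2-D clusters with random centers and directions
clusters = [
    line_cluster(center=rng.uniform(-10, 10, 2),
                 direction=rng.normal(size=2),
                 length=8.0, n_points=200, lateral_sd=0.5)
    for _ in range(4)
]
X = np.vstack(clusters)
labels = np.repeat(np.arange(4), 200)
```

The actual packages expose the distributions along and around the line as pluggable parameters, which is what makes the procedure modular.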

    Unsupervised learning methods for identifying and evaluating disease clusters in electronic health records

Introduction: Clustering algorithms are a class of algorithms that can discover groups of observations in complex data and are often used to identify subtypes of heterogeneous diseases in electronic health records (EHR). Evaluating clustering experiments for biological and clinical significance is a vital but challenging task due to the lack of consensus on best practices. As a result, the translation of findings from clustering experiments to clinical practice is limited.

Aim: The aim of this thesis was to investigate and evaluate approaches that enable the evaluation of clustering experiments using EHR.

Methods: We conducted a scoping review of clustering studies in EHR to identify common evaluation approaches. We systematically investigated the performance of the identified approaches using a cohort of Alzheimer's Disease (AD) patients as an exemplar, comparing four different clustering methods (K-means, Kernel K-means, Affinity Propagation and Latent Class Analysis). Using the same population, we developed and evaluated a method (MCHAMMER) that tested whether clusterable structures exist in EHR. To develop this method we tested several cluster validation indices and methods of generating null data to see which are best at discovering clusters. To enable the robust benchmarking of evaluation approaches, we created a tool that generates synthetic EHR data containing known cluster labels across a range of clustering scenarios.

Results: Across 67 EHR clustering studies, the most popular internal evaluation metric was comparing cluster results across multiple algorithms (30% of studies). We examined this approach by conducting a clustering experiment on a population of 10,065 AD patients with 21 demographic, symptom and comorbidity features. K-means found 5 clusters, Kernel K-means found 2, Affinity Propagation found 5 and Latent Class Analysis found 6. K-means was found to have the best clustering solution, with the highest silhouette score (0.19), and was more predictive of outcomes. The five clusters found were: typical AD (n=2026), non-typical AD (n=1640), a cardiovascular disease cluster (n=686), a cancer cluster (n=1710) and a cluster of mental health issues, smoking and early disease onset (n=1528), which has been found in previous research as well as in the results of other clustering methods. We created a synthetic data generation tool which allows for the generation of realistic EHR clusters that can vary in separation and number of noise variables to alter the difficulty of the clustering problem. We found that decreasing cluster separation increased cluster difficulty significantly, whereas adding noise variables increased cluster difficulty but not significantly. To develop the tool to assess cluster existence, we tested different methods of null dataset generation and cluster validation indices; the best performing null dataset method was the min-max method, and the best performing indices were the Calinski-Harabasz index, which had an accuracy of 94%, the Davies-Bouldin index (97%), the silhouette score (93%) and the BWC index (90%). We further found that when clusters were identified using the Calinski-Harabasz index they were more likely to have significantly different outcomes between clusters. Lastly, we repeated the initial clustering experiment, comparing 10 different pre-processing methods. The three best performing methods were RBF kernel (2 clusters), MCA (4 clusters) and MCA and PCA (6 clusters). The MCA approach gave the best results, with the highest silhouette score (0.23) and meaningful clusters, producing 4 clusters: heart and circulatory (n=1379), early onset mental health (n=1761), a male cluster with memory loss (n=1823) and a female cluster with more problems (n=2244).

Conclusion: We have developed and tested a series of methods and tools to enable the evaluation of EHR clustering experiments. We developed and proposed a novel cluster evaluation metric and provided a tool for benchmarking evaluation approaches in synthetic but realistic EHR data.
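As a concrete illustration of the null-model test described above, here is a minimal Python sketch, assuming one common reading of a min-max null model: uniform data drawn inside each feature's observed min-max range, scored against the real data with the Calinski-Harabasz index. This is an illustrative reconstruction, not the thesis's MCHAMMER implementation; the use of k-means and the parameter defaults are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def ch_score(X, k):
    # score a k-means partition with the Calinski-Harabasz index
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return calinski_harabasz_score(X, labels)

def minmax_null_test(X, k=5, n_null=100, seed=0):
    """Test for clusterable structure against a min-max uniform null model."""
    rng = np.random.default_rng(seed)
    observed = ch_score(X, k)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # null model: uniform data inside each feature's observed min-max range
    null_scores = [ch_score(rng.uniform(lo, hi, size=X.shape), k)
                   for _ in range(n_null)]
    # empirical p-value: how often null data scores at least as high
    p = (1 + sum(s >= observed for s in null_scores)) / (1 + n_null)
    return observed, p
```

A small p-value indicates the real data is scored far better than unstructured data of the same range, i.e. that clusterable structure is plausibly present.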

    Privacy Intelligence: A Survey on Image Sharing on Online Social Networks

Image sharing on online social networks (OSNs) has become an indispensable part of daily social activities, but it has also led to an increased risk of privacy invasion. The recent image leaks from popular OSN services and the abuse of personal photos using advanced algorithms (e.g. DeepFake) have prompted the public to rethink individual privacy needs when sharing images on OSNs. However, OSN image sharing itself is relatively complicated, and systems currently in place to manage privacy in practice are labor-intensive yet fail to provide personalized, accurate and flexible privacy protection. As a result, a more intelligent environment for privacy-friendly OSN image sharing is in demand. To fill the gap, we contribute a systematic survey of 'privacy intelligence' solutions that target modern privacy issues related to OSN image sharing. Specifically, we present a high-level analysis framework based on the entire lifecycle of OSN image sharing to address the various privacy issues and solutions facing this interdisciplinary field. The framework is divided into three main stages: local management, online management and social experience. At each stage, we identify typical sharing-related user behaviors, the privacy issues generated by those behaviors, and review representative intelligent solutions. The resulting analysis describes an intelligent privacy-enhancing chain for closed-loop privacy management. We also discuss the challenges and future directions existing at each stage, as well as in publicly available datasets.

    Artificial Intelligence in Materials Science: Applications of Machine Learning to Extraction of Physically Meaningful Information from Atomic Resolution Microscopy Imaging

Materials science is the cornerstone of technological development in the modern world, which has been largely shaped by advances in the fabrication of semiconductor materials and devices. However, Moore's Law is expected to end by 2025 as traditional transistor scaling reaches its limits, and the classical approach has proven unable to keep up with the needs of materials manufacturing, requiring more than 20 years to move a material from discovery to market. To adapt materials fabrication to the needs of the 21st century, it is necessary to develop methods for much faster processing of experimental data and for connecting the results to theory, with feedback flowing in both directions. However, state-of-the-art analysis remains selective and manual, prone to human error and unable to handle the large quantities of data generated by modern equipment. Recent advances in scanning transmission electron and scanning tunneling microscopies have allowed imaging and manipulation of materials on the atomic level, and these capabilities require the development of automated, robust, reproducible methods. Artificial intelligence and machine learning have dealt with similar issues in applications to image and speech recognition, autonomous vehicles, and other projects that are beginning to change the world around us. However, materials science faces significant challenges preventing direct application of such models without taking physical constraints and domain expertise into account. Atomic resolution imaging can generate data that can lead to a better understanding of materials and their properties through the use of artificial intelligence methods. Machine learning, in particular combinations of deep learning and probabilistic modeling, can learn to recognize physical features in imaging, making this process automated and speeding up characterization. By incorporating knowledge from theory and simulations into such frameworks, it is possible to create the foundation for automated atomic-scale manufacturing.
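To make the "recognize physical features in imaging" step concrete, here is a minimal sketch of the kind of model such work typically builds on: a small fully convolutional network in PyTorch that maps an atomic-resolution image to a per-pixel atom-probability map. The architecture, layer sizes, and input are illustrative assumptions, not the networks used in this thesis.

```python
import torch
import torch.nn as nn

class AtomFinder(nn.Module):
    """Toy fully convolutional network: image in, atom-probability map out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),  # per-pixel atom logit
        )

    def forward(self, image):
        return torch.sigmoid(self.net(image))  # probabilities in [0, 1]

model = AtomFinder()
stem_frame = torch.randn(1, 1, 256, 256)  # stand-in for a STEM image
prob_map = model(stem_frame)              # same spatial size as the input
```

Local maxima of the probability map give candidate atomic positions, which downstream probabilistic models can then refine against physical constraints.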

    Applications of advanced spectroscopic imaging to biological tissues

The objectives of this research were to develop experimental approaches that can be applied to classify different stages of malignancy in routine formalin-fixed and paraffin-embedded tissues, and to optimise the imaging approaches using novel implementations. It is hoped that the approach developed in this research may be applied to early cancer diagnostics in clinical settings in the future in order to increase cancer survival rates. Infrared spectroscopic imaging has recently been shown to have great potential as a powerful method for the spatial visualization of biological tissues. This spectroscopic technique does not require sample labelling because its chemical specificity allows the differentiation of biocomponents to be achieved based on their chemical structures. Experiments were performed on 3-µm thick prostate and colon tissues that were deposited on 2 mm-thick calcium fluoride (CaF2) substrates and subsequently deparaffinised. The samples were measured under IR microscopes, in both transmission and attenuated total reflection (ATR) modes. In transmission, thermo-spectroscopic imaging of the prostate samples was first carried out to investigate the potential of thermography to complement the information obtained from IR spectra. Spectroscopic imaging has made the acquisition of a chemical map of a sample possible within a short time span, since this approach facilitates the simultaneous acquisition of thousands of spatially resolved infrared spectra. Spectral differences in the lipid region (3000-2800 cm-1) were identified between cancerous and benign regions within prostate tissues. The governing spectral band for classification, identified by PCA analysis, was the anti-symmetric stretching of CH2 (2921 cm-1). Nonetheless, the difference in tissue emissivity at room temperature was minimal, so the contrast in the thermal image was too low for intra-tissue classification. Moreover, the thermal camera could only capture IR light between 3333-2000 cm-1. To record spectral data between 3900-900 cm-1 (mid-IR), Fourier transform infrared (FTIR) spectroscopic imaging was used to classify the different stages of colon disease. An automated processing framework was developed that achieved an overall classification accuracy of 92.7%. The processing steps included unsupervised k-means clustering of lipid bands, followed by Random Forest (RF) classification using the 'fingerprint' region of the data. The implementation of a correcting lens and the effect of the RMieS-EMSC correction on the tissue spectra were also investigated, which showed that computational RMieS-EMSC correction was more effective at removing spectral artefacts than the correcting lens. Furthermore, the effect of fluctuations in the surrounding humidity where the experiments were carried out was studied using various supersaturated salt solutions. Significant changes in the phosphate bands were observed, most notably a shift of the anti-symmetric phosphate stretching band from 1230 cm-1 to 1238 cm-1. By regulating humidity at its lowest level, the classification accuracy of the colon specimens was improved without having to alter the RF machine learning algorithm. In the ATR mode, additional apertures were introduced to the FTIR microscope as a novel means of depth profiling the prostate tissue samples by changing the angle of incidence of the IR light beam. Despite successful attempts at capturing qualitative information on the change of tissue morphology with depth of penetration (dp), the spectral data were not suitable for further processing with machine learning, as dp changes with wavelength. Apart from the apertures, a 'large-area' germanium (Ge) crystal was introduced to enable simultaneous mapping and imaging of the colon tissue samples. Many advantages of this new implementation were observed, including an improvement in signal-to-noise ratio, uniform distribution, and no impression left on the sample. The research done in this thesis lays the groundwork for clinical diagnosis, and the novel implementations are transferable to studies of other samples.
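The two-stage processing framework lends itself to a compact sketch. Below is a minimal Python illustration of that pipeline, assuming a hyperspectral cube stored as a NumPy array: k-means segmentation on the lipid region (3000-2800 cm-1) followed by Random Forest classification on the fingerprint region (taken here as roughly 1800-900 cm-1). The file names, array shapes, tissue-selection heuristic, and exact band limits are hypothetical, not the thesis's actual code or data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# hypothetical inputs: a (rows, cols, bands) hyperspectral cube, its
# wavenumber axis, and per-pixel ground-truth labels for training
cube = np.load("colon_tissue_cube.npy")
wavenumbers = np.load("wavenumbers.npy")
labels = np.load("pixel_labels.npy").ravel()
spectra = cube.reshape(-1, cube.shape[-1])

# stage 1: unsupervised k-means on the lipid region (3000-2800 cm-1)
lipid = spectra[:, (wavenumbers >= 2800) & (wavenumbers <= 3000)]
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lipid)
# heuristic: call the segment containing the strongest absorber "tissue"
tissue = segments == segments[spectra.sum(axis=1).argmax()]

# stage 2: Random Forest on the fingerprint region (~1800-900 cm-1)
fingerprint = spectra[:, (wavenumbers >= 900) & (wavenumbers <= 1800)]
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(fingerprint[tissue], labels[tissue])
```

The segmentation step keeps the supervised classifier from wasting capacity on substrate and paraffin-free background pixels, which is the usual motivation for this kind of two-stage design.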

    Characterization, modeling, and simulation of multiscale directed-assembly systems

Nanoscience is a rapidly developing field at the nexus of all physical sciences which holds the potential for mankind to gain a new level of control over matter and energy. Directed-assembly is an emerging field within nanoscience in which non-equilibrium system dynamics are controlled to produce scalable, arbitrarily complex and interconnected multi-layered structures with custom chemical, biologically or environmentally-responsive, electronic, or optical properties. We construct mathematical models and interpret data from directed-assembly experiments via application and augmentation of classical and contemporary physics, biology, and chemistry methods. Crystal growth, protein pathway mapping, LASER tweezers optical trapping, and colloid processing are areas of directed-assembly with established experimental techniques. We apply a custom set of characterization, modeling, and simulation techniques to experiments in each of these four areas. Many of these techniques can be applied across several experimental areas within directed-assembly, and to systems featuring multiscale system dynamics in general. We pay special attention to mathematical methods for bridging models of system dynamics across scale regimes, as they are particularly applicable and relevant to directed-assembly. We employ massively parallel simulations, enabled by custom software, to establish underlying system dynamics and develop new device production methods.