698 research outputs found

    Clustering Single-cell RNA-sequencing Data based on Matching Clusters Structures

    Get PDF
    Single-cell sequencing technology can generate RNA-sequencing data at the single cell level, and one important single-cell RNA-sequencing data analysis method is to identify their cell types without supervised information. Clustering is an unsupervised approach that can help find new insights into biology especially for exploring the biological functions of specific cell type. However, it is challenging for traditional clustering methods to obtain high-quality cell type recognition results. In this research, we propose a novel Clustering method based on Matching Clusters Structures (MCSC) for identifying cell types among single-cell RNA-sequencing data. Firstly, MCSC obtains two different groups of clustering results from the same K-means algorithm because its initial centroids are randomly selected. Then, for one group, MCSC uses shared nearest neighbour information to calculate a label transition matrix, which denotes label transition probability between any two initial clusters. Each initial cluster may be reassigned if merging results after label transition satisfy a consensus function that maximizes structural matching degree of two different groups of clustering results. In essence, the MCSC may be interpreted as a label training process. We evaluate the proposed MCSC with five commonly used datasets and compare MCSC with several classical and state-of-the-art algorithms. The experimental results show that MCSC outperform other algorithms

    Multi-dimensional clustering in user profiling

    Get PDF
    User profiling has attracted an enormous number of technological methods and applications. With the increasing amount of products and services, user profiling has created opportunities to catch the attention of the user as well as achieving high user satisfaction. To provide the user what she/he wants, when and how, depends largely on understanding them. The user profile is the representation of the user and holds the information about the user. These profiles are the outcome of the user profiling. Personalization is the adaptation of the services to meet the user’s needs and expectations. Therefore, the knowledge about the user leads to a personalized user experience. In user profiling applications the major challenge is to build and handle user profiles. In the literature there are two main user profiling methods, collaborative and the content-based. Apart from these traditional profiling methods, a number of classification and clustering algorithms have been used to classify user related information to create user profiles. However, the profiling, achieved through these works, is lacking in terms of accuracy. This is because, all information within the profile has the same influence during the profiling even though some are irrelevant user information. In this thesis, a primary aim is to provide an insight into the concept of user profiling. For this purpose a comprehensive background study of the literature was conducted and summarized in this thesis. Furthermore, existing user profiling methods as well as the classification and clustering algorithms were investigated. Being one of the objectives of this study, the use of these algorithms for user profiling was examined. A number of classification and clustering algorithms, such as Bayesian Networks (BN) and Decision Trees (DTs) have been simulated using user profiles and their classification accuracy performances were evaluated. Additionally, a novel clustering algorithm for the user profiling, namely Multi-Dimensional Clustering (MDC), has been proposed. The MDC is a modified version of the Instance Based Learner (IBL) algorithm. In IBL every feature has an equal effect on the classification regardless of their relevance. MDC differs from the IBL by assigning weights to feature values to distinguish the effect of the features on clustering. Existing feature weighing methods, for instance Cross Category Feature (CCF), has also been investigated. In this thesis, three feature value weighting methods have been proposed for the MDC. These methods are; MDC weight method by Cross Clustering (MDC-CC), MDC weight method by Balanced Clustering (MDC-BC) and MDC weight method by changing the Lower-limit to Zero (MDC-LZ). All of these weighted MDC algorithms have been tested and evaluated. Additional simulations were carried out with existing weighted and non-weighted IBL algorithms (i.e. K-Star and Locally Weighted Learning (LWL)) in order to demonstrate the performance of the proposed methods. Furthermore, a real life scenario is implemented to show how the MDC can be used for the user profiling to improve personalized service provisioning in mobile environments. The experiments presented in this thesis were conducted by using user profile datasets that reflect the user’s personal information, preferences and interests. The simulations with existing classification and clustering algorithms (e.g. Bayesian Networks (BN), Naïve Bayesian (NB), Lazy learning of Bayesian Rules (LBR), Iterative Dichotomister 3 (Id3)) were performed on the WEKA (version 3.5.7) machine learning platform. WEKA serves as a workbench to work with a collection of popular learning schemes implemented in JAVA. In addition, the MDC-CC, MDC-BC and MDC-LZ have been implemented on NetBeans IDE 6.1 Beta as a JAVA application and MATLAB. Finally, the real life scenario is implemented as a Java Mobile Application (Java ME) on NetBeans IDE 7.1. All simulation results were evaluated based on the error rate and accuracy

    Pattern Recognition

    Get PDF
    A wealth of advanced pattern recognition algorithms are emerging from the interdiscipline between technologies of effective visual features and the human-brain cognition process. Effective visual features are made possible through the rapid developments in appropriate sensor equipments, novel filter designs, and viable information processing architectures. While the understanding of human-brain cognition process broadens the way in which the computer can perform pattern recognition tasks. The present book is intended to collect representative researches around the globe focusing on low-level vision, filter design, features and image descriptors, data mining and analysis, and biologically inspired algorithms. The 27 chapters coved in this book disclose recent advances and new ideas in promoting the techniques, technology and applications of pattern recognition

    Features Extraction from Time Series

    Get PDF
    Time series can be found in various domains like medicine, engineering, and finance. Generally speaking, a time series is a sequence of data that represents recorded values of a phenomenon over time. This thesis studies time series mining, including transformation and distance measure, anomaly or anomalies detection, clustering and remaining useful life estimation. In the course of the first mining task (transformation and distance measure), in order to increase the accuracy of distance measure between transformed series (symbolic series), we introduce a novel calculation of distance between symbols. By integrating this newly defined method to symbolic aggregate approximation and its extensions, the experimental results show this proposed method is promising. During the process of the second mining task (anomaly or anomalies detection), for the purpose of improving the accuracy of anomaly or anomalies detection, we propose a distance measure method and an anomalies detection calculation. These proposed methods, together with previous published anomaly detection methods, are applied to real ECG data selected from MIT-BIH database. The experimental results show that our proposed outperforms other methods. During the course of the third mining task (clustering), we present an automatic clustering method, called AT-means, which can automatically carry out clustering for a given time series dataset: from the calculation of global average time series to the setting of initial centres and the determination of the number of clusters. The performance of the proposed method was tested on 10 benchmark time series datasets obtained from UCR database. For comparison, the K-means method with three different conditions are also applied to the same datasets. The experimental results show the proposed method outperforms the compared K-means approaches. During the process of the fourth mining task (remaining useful life estimation), all the original data are transformed into low-dimensional space through principal components analysis. We then proposed a novel multidimensional time series distance measure method, called as multivariate time series warping distance (MTWD), for remaining useful life estimation. This whole process is tested on the CMAPSS (Commercial Modular Aero Propulsion System Simulation) datasets and the performance is compared with two existing methods. The experimental results show that the estimated remaining useful life (RUL) values are closer to real RUL values when compared with the comparison methods. Our work contributes to the time series mining by introducing novel approaches to distance measure, anomalies detection, clustering and RUL estimation. We furthermore apply our proposed methods and related methods to benchmark datasets. The experimental results show that our methods are better than previously published methods in terms of accuracy and efficiency

    Data mining for heart failure : an investigation into the challenges in real life clinical datasets

    Get PDF
    Clinical data presents a number of challenges including missing data, class imbalance, high dimensionality and non-normal distribution. A motivation for this research is to investigate and analyse the manner in which the challenges affect the performance of algorithms. The challenges were explored with the help of a real life heart failure clinical dataset known as Hull LifeLab, obtained from a live cardiology clinic at the Hull Royal Infirmary Hospital. A Clinical Data Mining Workflow (CDMW) was designed with three intuitive stages, namely, descriptive, predictive and prescriptive. The naming of these stages reflects the nature of the analysis that is possible within each stage; therefore a number of different algorithms are employed. Most algorithms require the data to be distributed in a normal manner. However, the distribution is not explicitly used within the algorithms. Approaches based on Bayes use the properties of the distributions very explicitly, and thus provides valuable insight into the nature of the data.The first stage of the analysis is to investigate if the assumptions made for Bayes hold, e.g. the strong independence assumption and the assumption of a Gaussian distribution. The next stage is to investigate the role of missing values. Results found that imputation does not affect the performance as much as those records which are initially complete. These records are often not outliers, but contain problem variables. A method was developed to identify these. The effect of skews in the data was also investigated within the CDMW. However, it was found that methods based on Bayes were able to handle these, albeit with a small variability in performance. The thesis provides an insight into the reasons why clinical data often causes problems. Even the issue of imbalanced classes is not an issue, for Bayes is independent of this

    Swarm-Organized Topographic Mapping

    Get PDF
    Topographieerhaltende Abbildungen versuchen, hochdimensionale oder komplexe Datenbestände auf einen niederdimensionalen Ausgaberaum abzubilden, wobei die Topographie der Daten hinreichend gut wiedergegeben werden soll. Die Qualität solcher Abbildung hängt gewöhnlich vom eingesetzten Nachbarschaftskonzept des konstruierenden Algorithmus ab. Die Schwarm-Organisierte Projektion ermöglicht eine Lösung dieses Parametrisierungsproblems durch die Verwendung von Techniken der Schwarmintelligenz. Die praktische Verwendbarkeit dieser Methodik wurde durch zwei Anwendungen auf dem Feld der Molekularbiologie sowie der Finanzanalytik demonstriert
    corecore