2,687 research outputs found

    Robust techniques and applications in fuzzy clustering

    This dissertation addresses issues central to fuzzy classification. The issue of sensitivity to noise and outliers of least squares minimization based clustering techniques, such as Fuzzy c-Means (FCM) and its variants, is addressed. In this work, two novel and robust clustering schemes are presented and analyzed in detail. They approach the problem of robustness from different perspectives. The first scheme scales down the FCM memberships of data points based on the distance of the points from the cluster centers. Scaling done on outliers reduces their membership in true clusters. This scheme, known as Mega-clustering, defines a conceptual mega-cluster which is a collective cluster of all data points but views outliers and good points differently (as opposed to the concept of Dave's Noise cluster). The scheme is presented and validated with experiments, and similarities with Noise Clustering (NC) are also presented. The other scheme is based on the feasible solution algorithm that implements the Least Trimmed Squares (LTS) estimator. The LTS estimator is known to be resistant to noise and has a high breakdown point. The feasible solution approach also guarantees convergence of the solution set to a global optimum. Experiments show the practicability of the proposed schemes in terms of computational requirements and in the attractiveness of their simple frameworks. The issue of validation of clustering results has often received less attention than clustering itself. Fuzzy and non-fuzzy cluster validation schemes are reviewed and a novel methodology for cluster validity using a test for the random position hypothesis is developed. The random position hypothesis is tested against an alternative clustered hypothesis on every cluster produced by the partitioning algorithm. The Hopkins statistic is used as a basis to accept or reject the random position hypothesis, which is also the null hypothesis in this case. 
The Hopkins statistic is known to be a fair estimator of randomness in a data set. The concept is borrowed from the clustering tendency domain and its applicability to validating clusters is shown here. A unique feature selection procedure for use with large molecular conformational datasets with high dimensionality is also developed. The intelligent feature extraction scheme not only helps in reducing the dimensionality of the feature space but also helps in eliminating contentious issues such as the ones associated with labeling of symmetric atoms in the molecule. The feature vector is converted to a proximity matrix and is used as an input to the relational fuzzy clustering (FRC) algorithm with very promising results. Results are also validated using several cluster validity measures from the literature. Another application of fuzzy clustering considered here is image segmentation. Image analysis on extremely noisy images is carried out as a precursor to the development of an automated real-time condition state monitoring system for underground pipelines. A two-stage FCM with intelligent feature selection is implemented as the segmentation procedure and results on a test image are presented. A conceptual framework for automated condition state assessment is also developed.
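
The abstract above uses the Hopkins statistic to test a random position (null) hypothesis on each cluster. The following is only a minimal sketch of one common formulation of that statistic, not the dissertation's implementation. The function name `hopkins_statistic`, the parameters `m` and `seed`, the bounding-box sampling window, and the use of distances raised to the data dimension are all assumptions made for illustration.

```python
import numpy as np

def hopkins_statistic(X, m=50, seed=0):
    """Hopkins statistic of the rows of X (hypothetical helper, illustration only).

    Values near 0.5 are consistent with randomly positioned points;
    values near 1 suggest clustered points.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = min(m, n - 1)

    # w: distance from each of m sampled data points to its nearest other data point
    idx = rng.choice(n, size=m, replace=False)
    dist_real = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    dist_real[np.arange(m), idx] = np.inf  # ignore each point's distance to itself
    w = dist_real.min(axis=1)

    # u: distance from each of m uniform points (assumed window: the data's
    # bounding box) to its nearest data point
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, d))
    u = np.linalg.norm(U[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    # Assumed formulation: distances raised to the power d before summing
    u_d, w_d = (u ** d).sum(), (w ** d).sum()
    return u_d / (u_d + w_d)
```

To follow the per-cluster use described in the abstract, one would presumably call such a function on the points assigned to each cluster in turn and compare the value with a chosen threshold; the threshold and the exact test procedure are not given in the abstract.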

    Mining Aircraft Telemetry Data With Evolutionary Algorithms

    The Ganged Phased Array Radar - Risk Mitigation System (GPAR-RMS) was a mobile ground-based sense-and-avoid system for Unmanned Aircraft System (UAS) operations developed by the University of North Dakota. GPAR-RMS detected proximate aircraft with various sensor systems, including a 2D radar and an Automatic Dependent Surveillance - Broadcast (ADS-B) receiver. Information about those aircraft was then displayed to UAS operators via visualization software developed by the University of North Dakota. The Risk Mitigation (RM) subsystem for GPAR-RMS was designed to estimate the current risk of midair collision between the Unmanned Aircraft (UA) and a General Aviation (GA) aircraft flying under Visual Flight Rules (VFR) in the surrounding airspace, for UAS operations in Class E airspace (i.e., below 18,000 feet MSL). However, accurate probabilistic models for the behavior of pilots of GA aircraft flying under VFR in Class E airspace were needed before the RM subsystem could be implemented. In this dissertation the author presents the results of data mining an aircraft telemetry data set from a consecutive nine-month period in 2011. This aircraft telemetry data set consisted of Flight Data Monitoring (FDM) data obtained from Garmin G1000 devices onboard every Cessna 172 in the University of North Dakota's training fleet. Data from aircraft which were potentially within the controlled airspace surrounding controlled airports were excluded. Also, GA aircraft in the FDM data flying in Class E airspace were assumed to be flying under VFR, which is usually a valid assumption. Complex subpaths were discovered from the aircraft telemetry data set using a novel application of an ant colony algorithm. Then, probabilistic models were data mined from those subpaths using extensions of the Genetic K-Means (GKA) and Expectation-Maximization (EM) algorithms. 
The results obtained from the subpath discovery and data mining suggest that a pilot flying a GA aircraft near an uncontrolled airport will perform different maneuvers than a pilot flying a GA aircraft far from an uncontrolled airport, irrespective of the altitude of the GA aircraft. However, since only aircraft telemetry data from the University of North Dakota's training fleet were data mined, these results are not likely to be applicable to GA aircraft operating in a non-training environment.

    Unsupervised multiple kernel learning approaches for integrating molecular cancer patient data

    Cancer is the second leading cause of death worldwide. A characteristic of this disease is its complexity, which leads to a wide variety of genetic and molecular aberrations in the tumors. This heterogeneity necessitates personalized therapies for the patients. However, currently defined cancer subtypes used in clinical practice for treatment decision-making are based on relatively few selected markers and thus provide only a coarse classification of tumors. The increased availability of multi-omics data measured for cancer patients now offers the possibility of defining more informed cancer subtypes. Such a more fine-grained characterization of cancer subtypes harbors the potential of substantially expanding treatment options in personalized cancer therapy. In this thesis, we identify comprehensive cancer subtypes using multidimensional data. For this purpose, we apply and extend unsupervised multiple kernel learning methods. Three challenges of unsupervised multiple kernel learning are addressed: robustness, applicability, and interpretability. First, we show that regularization of the multiple kernel graph embedding framework, which enables the implementation of dimensionality reduction techniques, can increase the stability of the resulting patient subgroups. This improvement is especially beneficial for data sets with a small number of samples. Second, we adapt the objective function of kernel principal component analysis to enable the application of multiple kernel learning in combination with this widely used dimensionality reduction technique. Third, we improve the interpretability of kernel learning procedures by performing feature clustering prior to integrating the data via multiple kernel learning. On the basis of these clusters, we derive a score indicating the impact of a feature cluster on a patient cluster, thereby facilitating further analysis of the cluster-specific biological properties. 
All three procedures are successfully tested on real-world cancer data. Comparing our newly derived methodologies to established methods provides evidence that our work offers novel and beneficial ways of identifying patient subgroups and gaining insights into medically relevant characteristics of cancer subtypes.

    Ligand-based design of dopamine reuptake inhibitors: fuzzy relational clustering and 2-D and 3-D QSAR modeling

    As the three-dimensional structure of the dopamine transporter (DAT) remains undiscovered, any attempt to model the binding of drug-like ligands to this protein must necessarily include strategies that use ligand information. For flexible ligands that bind to the DAT, the identification of the binding conformation becomes an important but challenging task. In the first part of this work, the selection of a few representative structures as putative binding conformations from a large collection of conformations of a flexible GBR 12909 analogue was demonstrated by cluster analysis. Novel structure-based features that can be easily generalized to other molecules were developed and used for clustering. Since the feature space may or may not be Euclidean, a recently developed fuzzy relational clustering algorithm capable of handling such data was used. Both superposition-dependent and superposition-independent features were used along with region-specific clustering that focused on separate pharmacophore elements in the molecule. Separate sets of representative structures were identified for the superposition-dependent and superposition-independent analyses. In the second part of this work, several QSAR models were developed for a series of analogues of methylphenidate (MP), another potent dopamine reuptake inhibitor. In a novel method, the Electrotopological-state (E-state) indices for atoms of the scaffold common to all 80 compounds were used to develop an effective test set spanning both the structure space and the activity space. The utility of E-state indices in modeling a series of analogues with a common scaffold was demonstrated. Several models were developed using various combinations of 2-D and 3-D descriptors in the Molconn-Z and MOE descriptor sets. The models derived from CoMFA descriptors were found to be the most predictive and explanatory. Progressive scrambling of all models indicated several stable models. 
The best models were used to predict the activity of the test set analogues and were found to produce reasonable residuals. Substitutions in the phenyl ring of MP, especially at the 3- and 4-positions, were found to be the most important for DAT binding. It was predicted that for better DAT binding the substituents at these positions should be relatively bulky, electron-rich atoms or groups.

    Normality-based validation for crisp clustering

    This is the author’s version of a work that was accepted for publication in Pattern Recognition. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Pattern Recognition, 43, 36, (2010) DOI 10.1016/j.patcog.2009.09.018
    We introduce a new validity index for crisp clustering that is based on the average normality of the clusters. Unlike methods based on inter-cluster and intra-cluster distances, this index emphasizes the cluster shape by using a high order characterization of its probability distribution. The normality of a cluster is characterized by its negentropy, a standard measure of the distance to normality which evaluates the difference between the cluster's entropy and the entropy of a normal distribution with the same covariance matrix. The definition of the negentropy involves the distribution's differential entropy. However, we show that it is possible to avoid its explicit computation by considering only negentropy increments with respect to the initial data distribution, where all the points are assumed to belong to the same cluster. The resulting negentropy increment validity index only requires the computation of covariance matrices. We have applied the new index to an extensive set of artificial and real problems where it provides, in general, better results than other indices, both with respect to the prediction of the correct number of clusters and to the similarity among the real clusters and those inferred.
    This work has been partially supported with funds from MEC BFU2006-07902/BFI, CAM S-SEM-0255-2006 and CAM/UAM CCG08-UAM/TIC-442.
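
The abstract states that the negentropy increment needs only covariance matrices. As a rough illustration, the sketch below assumes the increment takes the form (1/2) Σ_i p_i log|Σ_i| − (1/2) log|Σ_0| − Σ_i p_i log p_i, where p_i is the fraction of points in cluster i, Σ_i is that cluster's covariance matrix and Σ_0 is the covariance of the whole data set. This form follows from writing each cluster's negentropy as the entropy of a Gaussian with the same covariance minus the cluster's own entropy, then subtracting the one-cluster case. It is an assumption reconstructed from the abstract's description and may differ from the published definition. The function name `negentropy_increment` is hypothetical.

```python
import numpy as np

def negentropy_increment(X, labels):
    """Assumed negentropy-increment index for a crisp partition (lower is better).

    X      : (n, d) data matrix
    labels : length-n array of cluster labels
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]

    # log-determinant of the covariance of the full data set (single-cluster case)
    _, logdet_all = np.linalg.slogdet(np.atleast_2d(np.cov(X, rowvar=False)))
    value = -0.5 * logdet_all

    for k in np.unique(labels):
        Xk = X[labels == k]
        p = Xk.shape[0] / n  # prior of cluster k, estimated as its share of the points
        _, logdet_k = np.linalg.slogdet(np.atleast_2d(np.cov(Xk, rowvar=False)))
        value += 0.5 * p * logdet_k - p * np.log(p)
    return value
```

Under this assumed form, the trivial one-cluster partition scores zero, and a candidate number of clusters could be chosen by comparing the index across partitions. Clusters with too few points for a non-singular covariance estimate would need separate handling, which this sketch does not attempt.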

    Clustering and Visualizing the Status of Child Health in Kenya: A Data Mining Approach.

    The inauguration of the new constitution in Kenya has led to the devolution of health care to the counties. This backdrop has created the need for a model that groups these regions into natural groups with similar characteristics that can influence child health, for the purpose of health care planning and regulation. Little research has explored the methodology that can be used to create such groupings in Kenya. The purpose of this research was to develop and explore a methodology for clustering and visualizing the status of child health in Kenya. In this research we propose a new model that clusters the counties based on the UNICEF indicators of child health. The cluster analysis used the k-means clustering algorithm. Both hierarchical and non-hierarchical clustering algorithms were used to build a consensus with the clusters obtained by k-means. The number of clusters was selected with a heuristic that integrates a statistical measure of cluster fit. Using data from the literature, the clustering methodology grouped the 47 counties into three distinct clusters of 12, 8 and 27 observations respectively. The study classified the clusters as well-off, most marginalized and moderately marginalized counties. The methodology was objective, replicable and sustainable. It rests on theoretically sound principles and can generalize across applications requiring clustering. An examination of several clustering algorithms revealed similar results.
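
The study above relies on k-means to group the counties. As a hedged illustration of the basic algorithm only, the sketch below implements plain Lloyd iterations with a farthest-first initialization. The function name `kmeans`, its parameters, and the initialization scheme are assumptions made here; the study's actual software, indicators and cluster-count heuristic are not specified in the abstract.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means (Lloyd's algorithm), illustration only.

    Returns (labels, centers) for the rows of X.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Assumed initialization: one random point, then repeatedly the point
    # farthest from all centers chosen so far (farthest-first traversal)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None] - np.array(centers)[None], axis=2).min(axis=1)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic stand-in data, NOT the study's county indicators:
# three well-separated groups of 2-D points.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal((0, 0), 0.5, (12, 2)),
    rng.normal((10, 0), 0.5, (8, 2)),
    rng.normal((0, 10), 0.5, (27, 2)),
])
labels, centers = kmeans(points, k=3)
print(np.bincount(labels, minlength=3))  # cluster sizes
```

In practice the indicator columns would usually be standardized before clustering, and the result compared with other algorithms, as the abstract describes doing with hierarchical and non-hierarchical methods.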

    New Fundamental Technologies in Data Mining

    The progress of data mining technology and its wide public popularity establish a need for a comprehensive text on the subject. The book series entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to treating each section in depth, the two books present useful hints and strategies for solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence will lead to significant development in the field of data mining.