
    Application of Machine Learning Techniques to Parameter Selection for Flight Risk Identification

    In recent years, the use of data mining and machine learning techniques for safety analysis, incident and accident investigation, and fault detection has gained traction in the aviation community. Flight data collected from recording devices contain a large number of heterogeneous parameters, sometimes reaching into the thousands on modern commercial aircraft. More data is collected continuously, adding to the ever-increasing pool available for safety analysis. However, not all recorded parameters are important from a risk and safety analysis perspective, and processing thousands of parameters sampled at high frequency may not be computationally tractable for modern analysis techniques such as machine learning. An intelligent and repeatable methodology for selecting a reduced set of significant parameters is therefore required, so that safety analysts can focus on the right parameters for risk identification. In this paper, a step-by-step methodology is proposed to down-select a reduced set of parameters for safety analysis. First, correlation analysis is conducted to remove highly correlated, duplicate, or redundant parameters from the data set. Second, a pre-processing step removes metadata and empty parameters; this step also incorporates requirements imposed by regulatory bodies such as the Federal Aviation Administration, along with input from subject matter experts, to further trim the list of parameters. Third, a clustering algorithm groups similar flights and identifies abnormal operations and anomalies, and a retrospective analysis of the clusters characterizes them and their impact on flight safety. Finally, analysis-of-variance techniques identify which parameters were significant in the formation of the clusters. Visualization dashboards were created to analyze cluster characteristics and parameter significance. The methodology is demonstrated on data from the approach phase of a representative single-aisle aircraft to show its application and robustness across heterogeneous data sets. It is envisioned that this methodology can be further extended to other phases of flight and aircraft
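
    As a rough illustration of the four steps, a minimal Python sketch follows. The DataFrame layout, the thresholds, and the use of k-means and one-way ANOVA are assumptions of this sketch (the abstract does not name its specific clustering algorithm), not the paper's exact implementation.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans

def down_select(flights: pd.DataFrame, corr_thresh=0.95, n_clusters=5, alpha=0.01):
    # `flights` is a hypothetical table: one row per flight, one numeric
    # column per recorded parameter. All names/thresholds are placeholders.

    # Step 1: drop one parameter from each highly correlated pair.
    corr = flights.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > corr_thresh).any()]
    reduced = flights.drop(columns=drop)

    # Step 2: pre-processing — drop empty and constant (metadata-like) columns.
    reduced = reduced.dropna(axis=1, how="all")
    reduced = reduced.loc[:, reduced.nunique() > 1]

    # Step 3: cluster flights to expose groups of similar/abnormal operations.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        reduced.fillna(reduced.mean()))

    # Step 4: one-way ANOVA per parameter across clusters; keep significant ones.
    significant = [
        col for col in reduced.columns
        if stats.f_oneway(*(reduced.loc[labels == k, col].dropna()
                            for k in range(n_clusters))).pvalue < alpha
    ]
    return significant, labels
```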

    Clustering Data of Mixed Categorical and Numerical Type with Unsupervised Feature Learning

    Mixed-type categorical and numerical data are a challenge in many applications. Mixed-type data is among the frontier areas where computational intelligence approaches remain brittle compared with the capabilities of living creatures. In this paper, unsupervised feature learning (UFL) is applied to mixed-type data to achieve a sparse representation, which makes it easier for clustering algorithms to separate the data. Unlike other UFL methods that work with homogeneous data, such as image and video data, the presented UFL works with mixed-type data using fuzzy adaptive resonance theory (ART). UFL with fuzzy ART (UFLA) obtains a better clustering result by removing the differences in treating categorical and numeric features. The advantages of doing so are demonstrated on several real-world data sets with ground truth, including heart disease, teaching assistant evaluation, and credit approval, as well as on noisy, mixed-type petroleum industry data. UFLA is compared with several alternative methods. To the best of our knowledge, this is the first time UFL has been extended to accomplish the fusion of mixed data types.
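
    The UFLA sparse-coding details are not reproduced here, but the fuzzy ART module at its core follows the standard Carpenter-Grossberg-Rosen formulation. A minimal sketch, assuming inputs are pre-scaled to [0, 1] (with categorical features one-hot encoded — an assumption of this sketch, not necessarily the paper's encoding):

```python
import numpy as np

def fuzzy_art(X, rho=0.75, alpha=0.001, beta=1.0):
    # Minimal fuzzy ART clusterer. X: rows scaled to [0, 1].
    # Returns per-sample category labels and the learned weight vectors.
    I = np.hstack([X, 1.0 - X])              # complement coding
    W, labels = [], []
    for x in I:
        # Choice function T_j = |x ^ w_j| / (alpha + |w_j|), ^ = elementwise min.
        Ts = [np.minimum(x, w).sum() / (alpha + w.sum()) for w in W]
        for j in np.argsort(Ts)[::-1]:       # try categories by choice value
            m = np.minimum(x, W[j])
            if m.sum() / x.sum() >= rho:     # vigilance test
                W[j] = beta * m + (1 - beta) * W[j]   # learning rule
                labels.append(j)
                break
        else:                                # no category passed: create one
            W.append(x.copy())
            labels.append(len(W) - 1)
    return np.array(labels), W
```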

    Clustering and Visualization of Fuzzy Communities In Social Networks

    We discuss a new formulation of a fuzzy validity index that generalizes the Newman-Girvan (NG) modularity function. The NG function serves as a cluster validity functional in community detection studies. The input data is an undirected graph G = (V, E) that represents a social network. Clusters in V correspond to socially similar substructures in the network. We compare our fuzzy modularity to an existing modularity function using the well-studied Karate Club data set.
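
    For context, the crisp NG modularity is Q = (1/2m) Σ_ij (A_ij − k_i k_j / 2m) δ(c_i, c_j). One common way to fuzzify it replaces the crisp delta with the inner product of membership rows; the sketch below implements that common generalization, which is not necessarily the exact formulation proposed in this paper:

```python
import numpy as np

def fuzzy_modularity(A, U):
    # A: symmetric adjacency matrix (n x n).
    # U: fuzzy membership matrix (n x c), rows summing to 1.
    # With crisp 0/1 memberships this reduces to Newman-Girvan modularity.
    k = A.sum(axis=1)                   # node degrees
    two_m = k.sum()                     # 2m = total degree
    B = A - np.outer(k, k) / two_m      # modularity matrix
    # Crisp delta(c_i, c_j) becomes sum_k u_ik * u_jk = (U U^T)_ij.
    return np.trace(U.T @ B @ U) / two_m
```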

    Robust techniques and applications in fuzzy clustering

    This dissertation addresses issues central to fuzzy classification. The sensitivity to noise and outliers of least-squares-minimization-based clustering techniques, such as Fuzzy c-Means (FCM) and its variants, is addressed. In this work, two novel and robust clustering schemes are presented and analyzed in detail. They approach the problem of robustness from different perspectives. The first scheme scales down the FCM memberships of data points based on the distance of the points from the cluster centers. Scaling done on outliers reduces their membership in true clusters. This scheme, known as mega-clustering, defines a conceptual mega-cluster, which is a collective cluster of all data points but views outliers and good points differently (as opposed to the concept of Dave's noise cluster). The scheme is presented and validated with experiments, and similarities with Noise Clustering (NC) are also presented. The other scheme is based on the feasible solution algorithm that implements the Least Trimmed Squares (LTS) estimator. The LTS estimator is known to be resistant to noise and has a high breakdown point. The feasible solution approach also guarantees convergence of the solution set to a global optimum. Experiments show the practicality of the proposed schemes in terms of computational requirements and the attractiveness of their simple frameworks. The issue of validating clustering results has often received less attention than clustering itself. Fuzzy and non-fuzzy cluster validation schemes are reviewed, and a novel methodology for cluster validity using a test of the random-position hypothesis is developed. The random-position hypothesis is tested against an alternative clustered hypothesis on every cluster produced by the partitioning algorithm. The Hopkins statistic, known to be a fair estimator of randomness in a data set, is used as the basis to accept or reject the random-position hypothesis, which is also the null hypothesis in this case. The concept is borrowed from the clustering-tendency domain, and its applicability to validating clusters is shown here. A unique feature selection procedure for large molecular conformational data sets with high dimensionality is also developed. The intelligent feature extraction scheme not only reduces the dimensionality of the feature space but also eliminates contentious issues such as those associated with labeling symmetric atoms in the molecule. The feature vector is converted to a proximity matrix and used as input to the fuzzy relational clustering (FRC) algorithm with very promising results. Results are also validated using several cluster validity measures from the literature. Another application of fuzzy clustering considered here is image segmentation. Image analysis on extremely noisy images is carried out as a precursor to the development of an automated real-time condition-state monitoring system for underground pipelines. A two-stage FCM with intelligent feature selection is implemented as the segmentation procedure, and results on a test image are presented. A conceptual framework for automated condition-state assessment is also developed.
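
    Among the tools above, the Hopkins statistic is standard and easy to state. A minimal sketch follows; the sampling fraction and the u/(u+w) convention are choices of this sketch, not necessarily the dissertation's:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, seed=0):
    # Hopkins statistic: under the u/(u+w) convention used here, H near 0.5
    # suggests spatial randomness and H near 1 suggests clustering structure.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)            # sample ~10% of the data
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distance from sampled real points to their nearest *other* data point.
    sample = X[rng.choice(n, m, replace=False)]
    w = nn.kneighbors(sample)[0][:, 1]  # column 0 is the self-distance (0)

    # u: distance from uniform points in the bounding box to the data.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())
```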

    Relational data clustering algorithms with biomedical applications


    Archetypes for histogram-valued data

    The main innovative development of this work is to propose an extension of archetypal analysis to histogram-valued data. As regards the methodological framework for analysing histogram data, which are complex in nature, the present work draws on the insights of Symbolic Data Analysis (SDA) and on the intrinsic relationships between interval-valued and histogram-valued data. After discussing the technique, developed in the Matlab environment, together with its behaviour and properties on a convenient worked example, the technique is proposed in the application section as a tool for quantitative benchmarking analysis. Specifically, the main results are presented from an application of archetypes for histogram data to a case of internal benchmarking of the school system, using data from the INVALSI test for the 2015/2016 school year. In this context the unit of analysis is the individual school, operationally defined through the distributions of its pupils' scores, considered jointly in the form of histogram-valued symbolic objects.
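
    The thesis's construction for histogram-valued data is not reproduced here, but classical archetypal analysis on numeric vectors, which it extends, alternates two simplex-constrained least-squares problems. A minimal sketch of that classical setting follows; the penalty-row NNLS solver and the initialization are choices of this sketch, and applying it to histograms would first require embedding each histogram as a vector (e.g. of quantiles), which is an assumption here rather than the thesis's method:

```python
import numpy as np
from scipy.optimize import nnls

def simplex_ls(C, d, M=200.0):
    # min ||C x - d|| s.t. x >= 0 and sum(x) = 1; the sum constraint is
    # enforced softly by appending a heavily weighted row (standard NNLS trick).
    C_aug = np.vstack([C, M * np.ones((1, C.shape[1]))])
    d_aug = np.append(d, M)
    x, _ = nnls(C_aug, d_aug)
    return x / max(x.sum(), 1e-12)

def archetypal_analysis(X, k, iters=50, seed=0):
    # Fit X ~ A @ Z with Z = B @ X; rows of A and B lie on the simplex,
    # so archetypes Z are convex combinations of the observations.
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), k, replace=False)]            # init from data
    for _ in range(iters):
        A = np.array([simplex_ls(Z.T, x) for x in X])      # mixture weights
        Z_ls = np.linalg.pinv(A) @ X                       # unconstrained archetypes
        B = np.array([simplex_ls(X.T, z) for z in Z_ls])   # project into hull
        Z = B @ X
    return A, Z
```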

    Neuroengineering of Clustering Algorithms

    Cluster analysis can be broadly divided into multivariate data visualization, clustering algorithms, and cluster validation. This dissertation contributes neural network-based techniques to perform all three unsupervised learning tasks. The first paper provides a comprehensive review of adaptive resonance theory (ART) models for engineering applications and provides context for the four subsequent papers. These papers are devoted to enhancements of ART-based clustering algorithms from (a) a practical perspective, by exploiting the visual assessment of cluster tendency (VAT) sorting algorithm as a preprocessor for ART offline training, thus mitigating ordering effects; and (b) an engineering perspective, by designing a family of multi-criteria ART models: dual vigilance fuzzy ART and distributed dual vigilance fuzzy ART (both capable of detecting complex cluster structures), merge ART (which aggregates partitions and lessens ordering effects in online learning), and cluster validity index vigilance in fuzzy ART (which features robust vigilance parameter selection and alleviates ordering effects in offline learning). The sixth paper enhances data visualization using self-organizing maps (SOMs) by depicting information-theoretic similarity measures between neighboring neurons on the dimensionality-reduced, topology-preserving SOM grid. This visualization's parameters are estimated using samples selected via a single-linkage procedure, generating heatmaps that portray more homogeneous within-cluster similarities and crisper between-cluster boundaries. The seventh paper presents incremental cluster validity indices (iCVIs) realized by (a) incorporating existing formulations of online computations for clusters' descriptors, or (b) modifying an existing ART-based model and incrementally updating local density counts between prototypes. This last paper also provides the first comprehensive comparison of iCVIs in the computational intelligence literature.
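
    The VAT preprocessor mentioned in (a) reorders a pairwise dissimilarity matrix with a Prim-like single-linkage pass, so that the presentation order given to ART is no longer arbitrary. A minimal sketch of the reordering alone (the ART training loop is omitted):

```python
import numpy as np

def vat_order(D):
    # Visual Assessment of cluster Tendency (VAT) reordering of a symmetric
    # pairwise dissimilarity matrix D. Returns a permutation of 0..n-1;
    # D[order][:, order] tends to show dark diagonal blocks for clustered data.
    n = D.shape[0]
    i, _ = np.unravel_index(np.argmax(D), D.shape)  # start at an extreme point
    order, remaining = [i], set(range(n)) - {i}
    while remaining:
        rem = np.fromiter(remaining, dtype=int)
        # Prim-like step: pick the unselected point closest to the selected set.
        sub = D[np.ix_(order, rem)]
        _, c = np.unravel_index(np.argmin(sub), sub.shape)
        order.append(rem[c])
        remaining.remove(rem[c])
    return np.array(order)
```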

    Combinatorial data analysis for data ordering

    Seriation is a combinatorial optimisation problem that aims to sequence a set of objects so that a natural ordering is created. A large variety of applications exist, ranging from archaeology to bioinformatics and text mining. Initially, a thorough quantitative analysis compares different seriation algorithms using the positional proximity coefficient (PPC). This analysis helps the practitioner understand how similar two algorithms are for a given set of data sets. The first contribution is consensus seriation. This method uses the principles of other consensus-based methods to combine different seriation solutions according to the PPC. As it creates a solution that no individual algorithm can create, its usefulness comes from combining structural elements of each original algorithm. In particular, it is possible to create a solution that combines the local characteristics of one algorithm with the global characteristics of another. Experimental results show that, compared with consensus-ranking methods using the Hamming, Spearman, and Kendall coefficients, consensus seriation using the PPC gives generally superior results according to the independent accumulated relative generalised anti-Robinson events measure. The second contribution is a metaheuristic for creating good approximate solutions to very large seriation problems. This adapted harmony search algorithm uses modified crossover operators taken from the genetic algorithm literature to optimise the least-squares criterion commonly used in seriation. As for all combinatorial optimisation problems, there is a need for metaheuristics that can produce better solutions more quickly. Results show that the algorithm consistently outperforms existing metaheuristics such as the genetic algorithm, particle swarm optimisation, simulated annealing, and tabu search, as well as the genetic algorithm using the modified crossover operators, with the main advantage of producing a much superior result within a very short number of iterations. These two major contributions offer practitioners and academics new tools to tackle seriation-related problems, and directions for future work are suggested in conclusion.
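
    The relative generalised anti-Robinson measure used in the thesis is not reproduced here, but the plain anti-Robinson event count it builds on is a standard way to score a seriation: in a well-ordered dissimilarity matrix, values should grow monotonically away from the diagonal. A minimal sketch (cubic-time, for illustration only):

```python
import numpy as np
from itertools import combinations

def anti_robinson_events(D, order):
    # Count anti-Robinson events in D reordered by `order`: for every triple
    # i < j < k in the ordering, D[i,j] <= D[i,k] and D[j,k] <= D[i,k]
    # should hold; each violation counts as one event (lower is better).
    R = D[np.ix_(order, order)]
    events = 0
    for i, j, k in combinations(range(len(order)), 3):
        events += int(R[i, j] > R[i, k]) + int(R[j, k] > R[i, k])
    return events
```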