2,877 research outputs found

    GBMST: An Efficient Minimum Spanning Tree Clustering Based on Granular-Ball Computing

    Most existing clustering methods are based on a single granularity of information, such as the distance and density of each data point. This finest-grained approach is usually inefficient and susceptible to noise. Therefore, we propose a clustering algorithm that combines multi-granularity granular-balls and the minimum spanning tree (MST). We construct coarse-grained granular-balls, and then use the granular-balls and the MST to implement a clustering method based on "large-scale priority", which can greatly reduce the influence of outliers and accelerate the construction of the MST. Experimental results on several data sets demonstrate the power of the algorithm. All codes have been released at https://github.com/xjnine/GBMST
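
To make the MST side of this idea concrete, here is a minimal sketch of plain MST clustering on raw points. It is not the GBMST algorithm: the granular-ball construction is omitted entirely, and the function names `mst_edges` and `mst_clusters`, as well as the choice to cut the k-1 longest edges, are assumptions made for illustration.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns the MST as a list of (weight, i, j) tuples."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known cost to attach each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the cheapest vertex that is not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges

def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and label the connected components."""
    edges = sorted(mst_edges(points))
    kept = edges[: max(len(edges) - (k - 1), 0)]
    root = list(range(len(points)))   # union-find forest

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]   # path halving
            x = root[x]
        return x

    for _, i, j in kept:
        root[find(i)] = find(j)
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]
```

Running the same procedure on a small number of ball centres instead of on every point is, roughly, where a coarse-grained approach would save work, since Prim's algorithm on a complete graph is quadratic in the number of vertices.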

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. Outliers may be instances of error or may indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and to discover interesting and useful knowledge about unusual events within numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on the data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework

    Business Analytics Using Predictive Algorithms

    In today's data-driven business landscape, organizations strive to extract actionable insights from their vast data and to make informed decisions. Business analytics, which combines data analysis, statistical modeling, and predictive algorithms, is crucial for transforming raw data into meaningful information. However, there are gaps in the field, such as limited industry focus, limited algorithm comparison, and data quality challenges. This work aims to address these gaps by demonstrating how predictive algorithms can be applied across business domains for pattern identification, trend forecasting, and accurate predictions. The report focuses on sales forecasting and topic modeling, comparing the performance of various algorithms including Linear Regression, Random Forest Regression, XGBoost, LSTMs, and ARIMA. It emphasizes the importance of data preprocessing, feature selection, and model evaluation for reliable sales forecasts, while using the unsupervised S-BERT, UMAP, and HDBSCAN algorithms to extract valuable insights from unstructured textual data

    Unsupervised Anomaly Detection of High Dimensional Data with Low Dimensional Embedded Manifold

    Anomaly detection techniques are meant to identify anomalies in large volumes of seemingly homogeneous data; doing so can lead to timely, pivotal and actionable decisions and prevent potential human, financial and informational loss. In anomaly detection, an often encountered situation is the absence of prior knowledge about the nature of anomalies. Such circumstances call for ‘unsupervised’ learning-based anomaly detection techniques. Compared to its ‘supervised’ counterpart, which has the luxury of a labeled training dataset containing both normal and anomalous samples, the unsupervised problem is far more difficult. Moreover, high dimensional streaming data from the many interconnected sensors present in modern industries make the task more challenging. Investigating these challenges is the overarching theme of this dissertation. In this dissertation, the fundamental issue of the similarity measure among observations, which is central to any anomaly detection technique, is reassessed. The manifold hypothesis suggests the possibility of a low dimensional manifold structure embedded in high dimensional data. In the presence of such a structured space, traditional similarity measures fail to capture the true intrinsic similarity. In light of this, reevaluating the notion of similarity measure seems more pressing than providing incremental improvements over existing techniques. A graph theoretic similarity measure is proposed to differentiate, and thus identify, the anomalies from normal observations. Specifically, the minimum spanning tree (MST), a graph-based approach, is proposed to approximate the similarities among data points in the presence of a high dimensional structured space. It can track the structure of the embedded manifold better than the existing measures and helps to distinguish the anomalies from normal observations.
This dissertation further investigates three different aspects of the anomaly detection problem and develops three sets of solution approaches, all of them revolving around the newly proposed MST-based similarity measure. In the first part of the dissertation, a local MST (LoMST) based anomaly detection approach is proposed to detect anomalies using the data in the original space. A two-step procedure is developed to detect both cluster and point anomalies. The next two sets of methods, proposed in the subsequent two parts of the dissertation, address anomaly detection in a reduced data space. In the second part of the dissertation, a neighborhood structure assisted version of the nonnegative matrix factorization approach (NS-NMF) is proposed. To detect anomalies, it uses the neighborhood information captured by a sparse MST similarity matrix along with the original attribute information. To meet industry demands, online versions of both LoMST and NS-NMF are also developed for real-time anomaly detection. In the last part of the dissertation, a graph regularized autoencoder is proposed which uses an MST regularizer in addition to the original loss function and is thus capable of maintaining the local invariance property. All of the approaches proposed in the dissertation are tested on 20 benchmark datasets and one real-life hydropower dataset. When compared with state-of-the-art approaches, all three approaches produce statistically significantly better outcomes. “Industry 4.0” is a reality now, and it calls for anomaly detection techniques capable of processing large amounts of high dimensional data generated in real time. The proposed MST-based similarity measure and the individual techniques developed in this dissertation are equipped to tackle each of these issues and provide an effective and reliable real-time anomaly identification platform
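
As a rough illustration of how a local MST can serve as an anomaly score, the sketch below builds an MST over each point and its k nearest neighbours and uses the total edge weight as the score. This is a simplified reading of the idea, not the two-step LoMST procedure from the dissertation; the function names `mst_weight` and `local_mst_scores` and the scoring rule are assumptions.

```python
import math

def mst_weight(points):
    """Total edge weight of the Euclidean MST over `points` (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return 0.0
    # best[i] is the cheapest known cost of attaching vertex i to the tree
    best = [math.dist(points[0], p) for p in points]
    in_tree = [False] * n
    in_tree[0] = True
    total = 0.0
    for _ in range(n - 1):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return total

def local_mst_scores(points, k):
    """Score each point by the MST weight of its local neighbourhood.
    Assumption: a larger local MST weight means a more isolated point."""
    scores = []
    for p in points:
        # the k nearest neighbours plus the point itself (at distance 0)
        neighbourhood = sorted(points, key=lambda q: math.dist(p, q))[: k + 1]
        scores.append(mst_weight(neighbourhood))
    return scores
```

For four points on a unit square plus one distant point, the distant point receives the largest score because every edge that connects it to its neighbours is long.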

    Wind Turbine Fault Detection: an Unsupervised vs Semi-Supervised Approach

    The need for renewable energy has been growing in recent years for well-known reasons, and wind power is no exception. Wind turbines are complex and expensive structures that require maintenance. Condition Monitoring Systems that make use of supervised machine learning techniques have been studied recently, and the results are quite promising. Such systems still require the physical presence of professionals, but they offer insight into the operating state of the machine in use, so that maintenance interventions can be decided beforehand. Wind turbine failure is not an abrupt process but a gradual one. The main goals of this dissertation are: to compare semi-supervised methods to attack the problem of automatic recognition of anomalies in wind turbines; to develop an approach combining the Mahalanobis Taguchi System (MTS) with two popular fuzzy partitional clustering algorithms, fuzzy c-means and archetypal analysis, for the purpose of anomaly detection; and finally to develop an experimental protocol to comparatively study the two types of algorithms. In this work, the algorithms Local Outlier Factor (LOF), Connectivity-based Outlier Factor (COF), Cluster-based Local Outlier Factor (CBLOF), Histogram-based Outlier Score (HBOS), k-nearest-neighbours (k-NN), Subspace Outlier Detection (SOD), Fuzzy c-means (FCM), Archetypal Analysis (AA) and Local Minimum Spanning Tree (LoMST) were explored. The data consisted of 8 SCADA data sets of turbine sensor data from a wind farm in the North of Portugal. Each data set comprises between 1070 and 1096 data cases characterized by 5 features, for the years 2011, 2012 and 2013. The analysis of the results using 7 different validity measures shows that the CBLOF algorithm achieved the best results in the semi-supervised approach, while LoMST won in the unsupervised scenario. The extensions of both FCM and AA produced promising results

    Signature-Based Community Detection for Time Series

    Community detection for time series without prior knowledge poses an open challenge within complex networks theory. Traditional approaches begin by assessing time series correlations and maximizing modularity under diverse null models. These methods suffer from assuming temporal stationarity and are influenced by the granularity of observation intervals. In this study, we propose an approach based on the signature matrix, a concept from path theory for studying stochastic processes. By employing a signature-derived similarity measure, our method overcomes drawbacks of traditional correlation-based techniques. Through a series of numerical experiments, we demonstrate that our method consistently yields higher modularity compared to baseline models, when tested on the Standard and Poor's 500 dataset. Moreover, our approach showcases enhanced stability in modularity when the length of the underlying time series is manipulated. This research contributes to the field of community detection by introducing a signature-based similarity measure, offering an alternative to conventional correlation matrices

    NLP Analysis of Email Interactions to find automation opportunities

    Finding automation opportunities for email interactions can have positive effects for several industries, especially in tasks such as reading, receiving, writing and responding to emails, categorizing emails, preventing loss of productivity and financial losses by dealing with spam, or improving users' satisfaction; even improving automatic categorization systems can mitigate negative impacts on personal and organizational performance. Furthermore, people who work in companies spend around 28% of their time reading and answering emails. In this project we propose a methodology based on NLP and unsupervised machine learning to look for automation opportunities arising from recurrent patterns found in email texts. We intend to facilitate the linguistic analysis in order to retrieve interaction patterns that can trigger automation actions. We propose a CRISP-DM methodology that lays the groundwork for detecting automation opportunities in related tasks. We compared the unsupervised machine learning methods K-Means, DBSCAN, and HDBSCAN with four clustering metrics applied to the Enron e-mails dataset transformed into paragraph vectors, and performed several experiments with Word Mover's Distance, Euclidean Distance, L2-Norm and Cosine Similarity. Although our process yielded limited results in the detection of email interactions, we found that DBSCAN combined with Euclidean Distance was the best method across all scores. This project also contributes to the literature on parameterizing these clustering algorithms and shows which methods, distances and score settings are relevant for unsupervised email mining

    Organising a photograph collection based on human appearance

    This thesis describes a complete framework for organising digital photographs in an unsupervised manner, based on the appearance of people captured in the photographs. Organising a collection of photographs manually, especially providing the identities of people captured in photographs, is a time-consuming task. Unsupervised grouping of images containing similar persons makes annotating names easier (as a group of images can be named at once) and enables quick search based on query by example. The full process of unsupervised clustering is discussed in this thesis. Methods for locating facial components are discussed, and a technique based on colour image segmentation is proposed and tested. A method based on a Principal Component Analysis template is also tested. These provide the eye locations required for acquiring a normalised facial image. This image is then preprocessed by histogram equalisation and feathering, and the features of the MPEG-7 face recognition descriptor are extracted. A distance measure proposed in the MPEG-7 standard is used as a similarity measure. Three approaches to grouping that use only face recognition features for clustering are analysed. These are modified k-means, single-link and a method based on a nearest neighbour classifier. The nearest neighbour-based technique is chosen for further experiments with fusing information from several sources. These sources are context-based, such as events (party, trip, holidays) and the ownership of photographs, and content-based, such as information about the colour and texture of the bodies of humans appearing in photographs. Two techniques are proposed for fusing event and ownership (user) information with the face recognition features: a Transferable Belief Model (TBM) and three-level clustering. The three-level clustering is carried out at “event” level, “user” level and “collection” level. The latter technique proves to be the most efficient.
For combining body information with the face recognition features, three probabilistic fusion methods are tested. These are the average sum, the generalised product and the maximum rule. Combinations are tested within events and within user collections. This work concludes with a brief discussion on extraction of key images for a representation of each cluster
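
The three fusion rules named above can be sketched as follows. This is a generic illustration of score-level fusion, not the thesis's implementation: the `fuse` function, the score layout and the example values are assumptions, and the plain product rule stands in for the "generalised product", whose exact form the abstract does not give.

```python
from math import prod

def fuse(scores_per_source, rule):
    """Combine per-source scores into one fused score per candidate.

    scores_per_source: one list per source (e.g. face, body), each holding
    a score for every candidate identity, in the same candidate order.
    rule: "sum" (average sum), "product" (plain product) or "max".
    """
    per_candidate = list(zip(*scores_per_source))  # regroup scores by candidate
    if rule == "sum":
        return [sum(c) / len(c) for c in per_candidate]
    if rule == "product":
        return [prod(c) for c in per_candidate]
    if rule == "max":
        return [max(c) for c in per_candidate]
    raise ValueError(f"unknown fusion rule: {rule}")

# Hypothetical scores for two candidate identities from two sources.
face = [0.6, 0.4]
body = [0.2, 0.8]
fused = fuse([face, body], "sum")
best_candidate = fused.index(max(fused))
```

The product rule penalises a candidate that any single source rates poorly, while the maximum rule trusts whichever source is most confident, so the three rules can rank candidates differently on the same inputs.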