GBMST: An Efficient Minimum Spanning Tree Clustering Based on Granular-Ball Computing
Most existing clustering methods are based on a single granularity of
information, such as the distance and density of each data point. This
finest-grained approach is usually inefficient and susceptible to noise.
Therefore, we propose a clustering algorithm that combines multi-granularity
Granular-Ball and minimum spanning tree (MST). We construct coarse-grained
granular-balls, and then use granular-balls and MST to implement the clustering
method based on "large-scale priority", which can largely avoid the influence
of outliers and accelerate the construction process of MST. Experimental
results on several data sets demonstrate the power of the algorithm. All code
has been released at https://github.com/xjnine/GBMST
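As a rough illustration of the MST half of this idea only, the following sketch clusters raw points by building a Euclidean MST and cutting its longest edges. It is not the GBMST algorithm: the granular-ball construction is omitted, and the function names `mst_edges` and `mst_clusters` and the fixed cluster count `k` are assumptions made for the example. In a granular-ball variant, the MST nodes would presumably be ball centres rather than individual points, which is where the claimed speed-up would come from.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and label the resulting connected components."""
    kept = sorted(mst_edges(points))[: len(points) - k]
    root = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]

# Two well-separated groups of three points each (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(mst_clusters(points, 2))  # [0, 0, 0, 1, 1, 1]
```

The quadratic Prim loop keeps the sketch dependency-free; it is only practical for small inputs.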
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
Business Analytics Using Predictive Algorithms
In today's data-driven business landscape, organizations strive to extract actionable insights and make informed decisions from their vast data. Business analytics, combining data analysis, statistical modeling, and predictive algorithms, is crucial for transforming raw data into meaningful information. However, there are gaps in the field, such as limited industry focus, limited algorithm comparison, and data quality challenges. This work aims to address these gaps by demonstrating how predictive algorithms can be applied across business domains for pattern identification, trend forecasting, and accurate predictions. The report focuses on sales forecasting and topic modeling, comparing the performance of various algorithms including Linear Regression, Random Forest Regression, XGBoost, LSTMs, and ARIMA. It emphasizes the importance of data preprocessing, feature selection, and model evaluation for reliable sales forecasts, while utilizing the unsupervised algorithms S-BERT, UMAP, and HDBSCAN to extract valuable insights from unstructured textual data.
Unsupervised Anomaly Detection of High Dimensional Data with Low Dimensional Embedded Manifold
Anomaly detection techniques are supposed to identify anomalies in large volumes of seemingly homogeneous data; being able to do so can lead to timely, pivotal and actionable decisions, saving us from potential human, financial and informational loss. In anomaly detection, an often encountered situation is the absence of prior knowledge about the nature of anomalies. Such circumstances call for ‘unsupervised’ learning-based anomaly detection techniques. Compared to its ‘supervised’ counterpart, which has the luxury of a labeled training dataset containing both normal and anomalous samples, the unsupervised problem is far more difficult. Moreover, high dimensional streaming data from the many interconnected sensors present in modern industries makes the task more challenging. Investigating how to address these challenges is the overarching theme of this dissertation.
In this dissertation, the fundamental issue of the similarity measure among observations, which is a central piece of any anomaly detection technique, is reassessed. The manifold hypothesis suggests the possibility of a low dimensional manifold structure embedded in high dimensional data. In the presence of such a structured space, traditional similarity measures fail to measure the true intrinsic similarity. In light of this, reevaluating the notion of similarity measure seems more pressing than providing incremental improvements over any of the existing techniques. A graph theoretic similarity measure is proposed to differentiate and thus identify the anomalies from normal observations. Specifically, the minimum spanning tree (MST), a graph-based approach, is proposed to approximate the similarities among data points in the presence of a high dimensional structured space. It can track the structure of the embedded manifold better than the existing measures and help to distinguish the anomalies from normal observations. This dissertation further investigates three different aspects of the anomaly detection problem and develops three sets of solution approaches, all of them revolving around the newly proposed MST based similarity measure.
In the first part of the dissertation, a local MST (LoMST) based anomaly detection approach is proposed to detect anomalies using the data in the original space. A two-step procedure is developed to detect both cluster and point anomalies. The next two sets of methods are proposed in the subsequent two parts of the dissertation, for anomaly detection in reduced data space. In the second part of the dissertation, a neighborhood structure assisted version of the nonnegative matrix factorization approach (NS-NMF) is proposed. To detect anomalies, it uses the neighborhood information captured by a sparse MST similarity matrix along with the original attribute information. To meet industry demands, online versions of both LoMST and NS-NMF are also developed for real-time anomaly detection. In the last part of the dissertation, a graph regularized autoencoder is proposed which uses an MST regularizer in addition to the original loss function and is thus capable of maintaining the local invariance property. All of the approaches proposed in the dissertation are tested on 20 benchmark datasets and one real-life hydropower dataset. When compared with state-of-the-art approaches, all three approaches produce statistically significantly better outcomes.
“Industry 4.0” is a reality now and it calls for anomaly detection techniques capable of processing a large amount of high dimensional data generated in real-time. The proposed MST based similarity measure followed by the individual techniques developed in this dissertation are equipped to tackle each of these issues and provide an effective and reliable real-time anomaly identification platform.
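To make the idea of an MST-based local score concrete, here is a minimal sketch that scores each observation by the total edge length of the MST built over the point and its k nearest neighbours. This is an assumption-laden simplification, not the dissertation's two-step LoMST procedure: the function names `mst_weight` and `local_mst_scores`, the choice of Euclidean distance, and the use of raw MST weight as the score are all illustrative choices.

```python
import math

def mst_weight(points):
    """Total edge length of the Euclidean MST over `points` (Prim's algorithm)."""
    n = len(points)
    best = [math.inf] * n   # cheapest known distance from the tree to each point
    best[0] = 0.0
    in_tree = [False] * n
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return total

def local_mst_scores(points, k):
    """Score each point by the MST weight of its k-nearest-neighbour neighbourhood."""
    scores = []
    for p in points:
        # The k + 1 closest points to p; p is among them because its distance to itself is 0.
        neighbourhood = sorted(points, key=lambda q: math.dist(p, q))[: k + 1]
        scores.append(mst_weight(neighbourhood))
    return scores

# Five tightly grouped points and one distant point (illustrative data).
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = local_mst_scores(data, k=2)
print(scores.index(max(scores)))  # 5, the distant point
```

A point whose neighbourhood needs long edges to be spanned gets a high score, which is the intuition behind treating the local MST as a similarity structure.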
Wind Turbine Fault Detection: an Unsupervised vs Semi-Supervised Approach
The need for renewable energy has been growing in recent years for well-known reasons,
and wind power is no exception. Wind turbines are complex and expensive structures
that require maintenance. Condition Monitoring Systems that make use of
supervised machine learning techniques have been studied recently and the results are
quite promising. Such systems still require the physical presence of professionals,
but they offer insight into the operating state of the machine in use, so that
maintenance interventions can be decided upon beforehand. Wind turbine failure is not an
abrupt process but a gradual one.
The main goal of this dissertation is: to compare semi-supervised methods to attack the problem of automatic recognition of anomalies in wind turbines; to develop an
approach combining the Mahalanobis Taguchi System (MTS) with two popular fuzzy
partitional clustering algorithms, fuzzy c-means and archetypal analysis, for the
purpose of anomaly detection; and finally to develop an experimental protocol to comparatively study the two types of algorithms.
In this work, the algorithms Local Outlier Factor (LOF), Connectivity-based Outlier
Factor (COF), Cluster-based Local Outlier Factor (CBLOF), Histogram-based Outlier Score
(HBOS), k-nearest-neighbours (k-NN), Subspace Outlier Detection (SOD), Fuzzy c-means
(FCM), Archetypal Analysis (AA) and Local Minimum Spanning Tree (LoMST) were
explored.
The data used consisted of SCADA data sets of turbine sensor data, 8 in total, from a wind farm in the north of Portugal. Each data set comprises between 1070
and 1096 data cases characterized by 5 features, for the years 2011, 2012 and 2013.
The analysis of the results using 7 different validity measures shows that the CBLOF algorithm got the best results in the semi-supervised approach while LoMST won in the
unsupervised scenario. The extensions of both FCM and AA gave promising results.
Signature-Based Community Detection for Time Series
Community detection for time series without prior knowledge poses an open
challenge within complex networks theory. Traditional approaches begin by
assessing time series correlations and maximizing modularity under diverse null
models. These methods suffer from assuming temporal stationarity and are
influenced by the granularity of observation intervals. In this study, we
propose an approach based on the signature matrix, a concept from path theory
for studying stochastic processes. By employing a signature-derived similarity
measure, our method overcomes drawbacks of traditional correlation-based
techniques. Through a series of numerical experiments, we demonstrate that our
method consistently yields higher modularity compared to baseline models, when
tested on the Standard and Poor's 500 dataset. Moreover, our approach showcases
enhanced stability in modularity when the length of the underlying time series
is manipulated. This research contributes to the field of community detection
by introducing a signature-based similarity measure, offering an alternative to
conventional correlation matrices.
NLP Analysis of Email Interactions to find automation opportunities
Finding automation opportunities for email interactions can have positive effects for several industries, especially in tasks such as reading, receiving, writing and responding to emails, categorizing emails, preventing the loss of productivity and financial losses caused by spam, or improving users' satisfaction; even improving automatic categorization systems can mitigate negative impacts on personal and organizational performance. Furthermore, people who work in companies spend around 28% of their time reading and answering emails. In this project we proposed a methodology based on NLP and unsupervised machine learning to look for automation opportunities arising from recurrent patterns found in email texts. We intend to facilitate the linguistic analysis in order to retrieve interaction patterns that can trigger automation actions. We proposed a CRISP-DM methodology that lays the groundwork for detecting automation opportunities in related tasks. We compared the unsupervised machine learning methods K-Means, DBSCAN, and HDBSCAN with four clustering metrics applied to the Enron e-mails dataset transformed into paragraph vectors and performed several experiments with Word Mover's Distance, Euclidean Distance, L2-Norm and Cosine Similarity. Although our process yielded limited results in the detection of email interactions, we found that DBSCAN combined with Euclidean Distance was the best method across all scores. This project also contributes to the literature on the parameterization of these clustering algorithms and shows which methods, distances and score settings are relevant for unsupervised email mining.
Organising a photograph collection based on human appearance
This thesis describes a complete framework for organising digital photographs in an unsupervised manner, based on the appearance of people captured in the photographs. Organising a collection of photographs manually, especially providing the identities of the people captured in them, is a time-consuming task. Unsupervised grouping of images containing similar persons makes annotating names easier (as a group of images can be named at once) and enables quick search based on query by example.
The full process of unsupervised clustering is discussed in this thesis. Methods for locating facial components are discussed and a technique based on colour image segmentation is proposed and tested. A method based on the Principal Component Analysis template is tested as well. These provide the eye locations required for acquiring a normalised facial image. This image is then preprocessed by histogram equalisation and feathering, and the features of the MPEG-7 face recognition descriptor are extracted. A distance measure proposed in the MPEG-7 standard is used as a similarity measure.
Three approaches to grouping that use only face recognition features for clustering are analysed. These are modified k-means, single-link and a method based on a nearest neighbour classifier. The nearest neighbour-based technique is chosen for further experiments with fusing information from several sources. These sources are context-based such as events (party, trip, holidays), the ownership of photographs, and content-based such as information about the colour and texture of the bodies of humans appearing in photographs. Two techniques are proposed for fusing event and ownership (user) information with the face recognition features: a Transferable Belief Model (TBM) and three level clustering. The three level clustering is carried out at “event” level, “user” level and “collection” level. The latter technique proves to be most efficient.
For combining body information with the face recognition features, three probabilistic fusion methods are tested. These are the average sum, the generalised product and the maximum rule. Combinations are tested within events and within user collections. This work concludes with a brief discussion on extraction of key images for a representation of each cluster.
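The three fusion rules named above can be sketched as follows for two sources that each assign a probability to every candidate identity. This is a hedged illustration, not the thesis's implementation: the function name `fuse`, the example probabilities, and the final renormalisation are assumptions, and a plain element-wise product stands in for the "generalised product", whose exact definition the abstract does not give.

```python
def fuse(face_probs, body_probs, rule):
    """Combine two per-identity probability lists with a fixed rule, then renormalise."""
    if len(face_probs) != len(body_probs):
        raise ValueError("both sources must score the same identities")
    if rule == "sum":        # average sum rule
        combined = [(f + b) / 2 for f, b in zip(face_probs, body_probs)]
    elif rule == "product":  # plain product, an assumed stand-in for the generalised product
        combined = [f * b for f, b in zip(face_probs, body_probs)]
    elif rule == "max":      # maximum rule
        combined = [max(f, b) for f, b in zip(face_probs, body_probs)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    total = sum(combined)
    return [c / total for c in combined]

# Hypothetical scores for three candidate identities.
face = [0.7, 0.2, 0.1]
body = [0.3, 0.6, 0.1]
for rule in ("sum", "product", "max"):
    fused = fuse(face, body, rule)
    print(rule, [round(p, 3) for p in fused])
```

The product rule penalises any identity that either source considers unlikely, while the maximum rule lets a single confident source dominate; which behaviour is preferable depends on how reliable the body-appearance cue is within an event.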