Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect
Data Science and Machine Learning have become fundamental assets for
companies and research institutions alike. As one of their fields, supervised
classification allows for class prediction of new samples, learning from given
training data. However, some properties of a dataset can make it problematic to
classify.
In order to evaluate a dataset a priori, data complexity metrics have been
used extensively. They provide information about different intrinsic
characteristics of the data, which serves to evaluate classifier compatibility
and to choose a course of action that improves performance. However, most
complexity metrics focus on just one characteristic of the data, which can be
insufficient to properly evaluate a dataset with respect to classifier
performance. In fact, class overlap, a feature highly detrimental to the
classification process (especially when imbalance among class labels is also
present), is hard to assess.
This research work focuses on revisiting complexity metrics based on data
morphology. Given their nature, the premise is that they provide both good
estimates of class overlap and strong correlations with classification
performance. For that purpose, a novel family of metrics has been developed.
Being based on ball coverage by classes, they are named Overlap Number of
Balls. Finally, some prospects for adapting this family of metrics to singular
(more complex) problems are discussed.
Comment: 23 pages, 9 figures, preprint
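The abstract does not reproduce the formal definition, but the ball-coverage idea can be sketched roughly as follows (a minimal illustration under our own simplifying assumptions, not the published Overlap Number of Balls metric; all names are ours): cover each class with balls that contain only same-class points, growing each ball until just before it reaches another class, and report how many balls are needed relative to the dataset size. Heavily overlapped classes need many small balls.

```python
import numpy as np

def onb_sketch(X, y):
    """Rough ball-coverage overlap estimate (illustrative, not the
    published Overlap Number of Balls definition).

    Greedily covers each class with balls whose radius stays just under
    the distance to the nearest point of another class; the more balls
    a class needs, the more fragmented or overlapped it likely is.
    Returns balls-per-sample, a value in (0, 1]."""
    X, y = np.asarray(X, float), np.asarray(y)
    n_balls = 0
    for c in np.unique(y):
        same, enemy = X[y == c], X[y != c]
        uncovered = np.ones(len(same), bool)
        while uncovered.any():
            i = np.argmax(uncovered)  # seed a new ball at the first uncovered point
            center = same[i]
            # radius: just under the distance to the closest enemy point
            radius = np.linalg.norm(enemy - center, axis=1).min() * 0.999
            dist = np.linalg.norm(same - center, axis=1)
            uncovered &= dist > radius  # points inside the ball are now covered
            n_balls += 1
    return n_balls / len(X)

# Well-separated classes need one ball each; interleaved ones need many.
X_sep = np.array([[0.0], [0.1], [5.0], [5.1]])
y_sep = np.array([0, 0, 1, 1])
print(onb_sketch(X_sep, y_sep))  # 0.5 (2 balls / 4 points)
```

On fully interleaved one-dimensional data (e.g. alternating labels at 0, 1, 2, 3) the same function returns 1.0, one ball per point, which is the intuition the abstract appeals to: more balls, more overlap.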
Investigating Randomised Sphere Covers in Supervised Learning
In this thesis, we thoroughly investigate a simple Instance Based Learning (IBL) classifier known as Sphere Cover. We propose a simple Randomized Sphere Cover Classifier (αRSC) and use several datasets to evaluate its classification performance. In addition, we analyse the generalization error of the proposed classifier using bias/variance decomposition. A sphere cover classifier may be described from the compression-scheme perspective, which stipulates data compression as the reason for high generalization performance. We investigate the compression capacity of αRSC using a sample compression bound. The compression scheme prompted us to search for new compressibility methods for αRSC; to this end, we used a Gaussian kernel to investigate further data compression.
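A minimal reading of the sphere-cover idea can be sketched as follows (an illustrative sketch under our own simplifying assumptions, not the thesis's exact αRSC algorithm): repeatedly pick a random uncovered training point, grow a sphere around it that contains only same-class points, and classify a query by the class of the sphere whose boundary it is closest to.

```python
import math
import random

def fit_sphere_cover(X, y, seed=0):
    """Greedy randomized sphere cover (illustrative sketch of the idea,
    not the thesis's exact alpha-RSC algorithm)."""
    rng = random.Random(seed)
    spheres = []                      # (center, radius, label)
    remaining = list(range(len(X)))
    while remaining:
        i = rng.choice(remaining)     # random seed point
        c, lab = X[i], y[i]
        # largest radius that keeps the sphere pure: just under the
        # distance to the nearest point of another class
        enemies = [math.dist(c, X[j]) for j in range(len(X)) if y[j] != lab]
        r = min(enemies) * 0.999 if enemies else float("inf")
        spheres.append((c, r, lab))
        remaining = [j for j in remaining if math.dist(c, X[j]) > r]
    return spheres

def predict(spheres, x):
    """Class of the sphere whose boundary is closest to x (a point inside
    a sphere gives a negative signed distance, so it wins immediately)."""
    return min(spheres, key=lambda s: math.dist(s[0], x) - s[1])[2]

X = [(0, 0), (1, 0), (5, 5), (6, 5)]
y = ["a", "a", "b", "b"]
model = fit_sphere_cover(X, y)
print(predict(model, (0.5, 0.2)))  # 'a'
```

The randomization (the order in which seed points are drawn) changes which cover is produced, which is precisely the degree of freedom the thesis exploits when studying the classifier's bias/variance behaviour.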
Fault diagnosis for IP-based network with real-time conditions
BACKGROUND:
Fault diagnosis techniques have been based on many paradigms, which derive from diverse areas
and have different purposes: obtaining a representation model of the network for fault localization,
selecting optimal probe sets for monitoring network devices, reducing fault detection time, and
detecting faulty components in the network. Although there are several solutions for diagnosing
network faults, there are still challenges to be faced: a fault diagnosis solution needs to be always
available and capable of processing data in a timely manner, because stale results degrade the quality
and speed of informed decision-making. Also, there is no non-invasive technique to continuously diagnose
network symptoms without leaving the system vulnerable to failures, nor one resilient
to the network's dynamic changes, which can cause new failures with different symptoms.
AIMS:
This thesis aims to propose a model for the continuous and timely diagnosis of IP-based network
faults, independent of the network structure and based on data analytics techniques.
METHOD(S):
This research's point of departure was the hypothesis of a fault propagation phenomenon that
allows the observation of failure symptoms at a higher network level than the fault origin. Thus, for
the model's construction, monitoring data was collected from an extensive campus network in
which impact link failures were induced at different instants of time and with different durations.
These data correspond to widely used parameters in the actual management of a network. The
collected data allowed us to understand the faults' behavior and how they are manifested at a
peripheral level.
Based on this understanding and a data analytics process, the first three modules of our model,
named PALADIN, were proposed (Identify, Collection and Structuring); they define the peripheral
data collection and the pre-processing necessary to obtain a description of the
network's state at a given moment. These modules give the model the ability to structure the data,
accounting for the delays of the multiple responses that the network delivers to a single monitoring
probe and for the multiple network interfaces that a peripheral device may have.
Thus, a structured data stream is obtained, ready to be analyzed. For this analysis, it was
necessary to implement an incremental learning framework that respects the dynamic nature of
networks. It comprises three elements: an incremental learning algorithm, a data rebalancing strategy,
and a concept drift detector. This framework is the fourth module of the PALADIN model, named
Diagnosis.
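As a rough illustration of the drift-detector element (a simplified fixed-window scheme of our own; ADWIN itself maintains an adaptive window with a statistically grounded cut test), one can compare the error rate of the older and newer halves of a recent window of predictions:

```python
from collections import deque

class WindowDriftDetector:
    """Simplified drift detector (illustrative only; ADWIN proper uses an
    adaptive window and a rigorous statistical cut test). Signals drift
    when the error rates of the older and newer halves of a fixed-size
    window diverge by more than a threshold."""
    def __init__(self, window=100, threshold=0.3):
        self.buf = deque(maxlen=window)
        self.threshold = threshold

    def update(self, error):
        """error: 1 if the model misclassified the sample, else 0.
        Returns True when a drift is signalled."""
        self.buf.append(error)
        n = len(self.buf)
        if n < self.buf.maxlen:
            return False              # wait until the window is full
        half = n // 2
        old = sum(list(self.buf)[:half]) / half
        new = sum(list(self.buf)[half:]) / (n - half)
        return abs(new - old) > self.threshold

det = WindowDriftDetector(window=40, threshold=0.3)
signals = []
# 40 accurate predictions, then the concept changes and errors spike
for err in [0] * 40 + [1] * 20:
    signals.append(det.update(err))
print(any(signals[:40]), any(signals[40:]))  # False True
```

When a detector of this kind fires, the natural reaction (and the one the framework's design implies) is to reset or re-weight the incremental learner so it tracks the new concept.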
In order to evaluate the PALADIN model, the Diagnosis module was implemented with 25 different
incremental algorithms, ADWIN as the concept-drift detector, and SMOTE (adapted to the streaming
scenario) as the rebalancing strategy. In addition, a dataset (the SOFI dataset) was built through the
first modules of the PALADIN model; these data form the incoming data
stream used to evaluate the Diagnosis module's performance.
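SMOTE's core step is interpolation between minority-class samples; a streaming adaptation would draw neighbours from a recent window rather than the full dataset. A hedged sketch of the basic oversampling step, with illustrative names and parameters:

```python
import numpy as np

def smote_samples(minority, n_new, k=3, seed=0):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours
    (the classic SMOTE step; a streaming variant would keep `minority`
    as a sliding window of recent minority examples)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # k nearest minority neighbours of sample i (excluding itself)
        d = np.linalg.norm(minority - minority[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()            # interpolation factor in [0, 1)
        out.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(out)

minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
synth = smote_samples(minority, n_new=5)
print(synth.shape)  # (5, 2)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority data already occupies.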
The PALADIN Diagnosis module performs online classification of network failures, so it is a
learning model that must be evaluated in a stream context. Prequential evaluation is the most widely
used method for this task, so we adopt it to assess the model's performance over
time through several stream evaluation metrics.
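Prequential (interleaved test-then-train) evaluation scores each arriving example before the model learns from it, so no separate test set is needed. A minimal harness, with a trivial majority-class learner standing in for the 25 incremental algorithms (all names here are illustrative):

```python
from collections import Counter

def prequential_accuracy(stream, model):
    """Interleaved test-then-train: predict first, then update, so every
    example is used exactly once for testing and once for training."""
    correct = total = 0
    for x, y in stream:
        if total > 0:                 # cannot score before any training
            correct += model.predict(x) == y
        model.learn(x, y)
        total += 1
    return correct / max(total - 1, 1)

class MajorityClass:
    """Trivial incremental baseline: always predicts the most frequent
    label seen so far (stands in for a real incremental learner)."""
    def __init__(self):
        self.counts = Counter()
    def learn(self, x, y):
        self.counts[y] += 1
    def predict(self, x):
        return self.counts.most_common(1)[0][0]

# Imbalanced toy stream: one "fault" for every four "ok" observations.
stream = [((i,), "ok" if i % 5 else "fault") for i in range(100)]
print(prequential_accuracy(stream, MajorityClass()))
```

Note how the imbalance inflates this baseline's accuracy even though it never detects a fault, which is exactly why the thesis pairs prequential evaluation with rebalancing and with several stream metrics rather than accuracy alone.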
RESULTS:
This research first evidences the phenomenon of impact fault propagation, making it possible to
detect fault symptoms at a monitored network's peripheral level; this translates into non-invasive
monitoring of the network. Second, the PALADIN model is the major contribution in the fault
detection context because it covers two aspects: an online learning model that continuously processes
the network symptoms and detects internal failures, and the concept-drift detection and
data-stream rebalancing components that make resilience to dynamic network changes possible.
Third, it is well known that the number of real-world datasets available for imbalanced stream
classification is still small, and it is smaller still for the networking context.
The SOFI dataset obtained with the first modules of the PALADIN model adds to that number
and encourages work on imbalanced data streams and on network fault
diagnosis.
CONCLUSIONS:
The proposed model contains the necessary elements for the continuous and timely diagnosis of
IP-based network faults; it introduces the idea of periodically monitoring peripheral network
elements and uses data analytics techniques to process the collected data. Based on the analysis,
processing, and classification of peripherally collected data, it can be concluded that PALADIN
achieves its objective. The results indicate that peripheral monitoring allows diagnosing faults in
the internal network; in addition, the diagnosis process needs an incremental learning process,
concept-drift detection elements, and a rebalancing strategy.
The results of the experiments showed that PALADIN makes it possible to learn from the network's
manifestations and diagnose internal network failures. The latter was verified with 25 different
incremental algorithms, ADWIN as the concept-drift detector, and SMOTE (adapted to the streaming
scenario) as the rebalancing strategy.
This research clearly illustrates that it is unnecessary to monitor all the internal network elements
to detect a network's failures; instead, it is enough to choose the peripheral elements to be
monitored. Furthermore, with proper processing of the collected status and traffic descriptors, it is
possible to learn from the arriving data using incremental learning in cooperation with data
rebalancing and concept drift approaches. This proposal continuously diagnoses the network
symptoms without leaving the system vulnerable to failures, while remaining resilient to the
network's dynamic changes.
Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid. Committee - President: José Manuel Molina López; Secretary: Juan Carlos Dueñas López; Member: Juan Manuel Corchado Rodríguez.
Evaluation of whole graph embedding techniques for a clustering task in the manufacturing domain
Production systems in manufacturing consume and generate data. Representing the relationships between subsystems and their associated data is complex, but well suited to Knowledge Graphs (KGs), which allow us to visualize the relationships between subsystems and store their measurement data. In this work, KGs act as a feature engineering technique for a clustering task: the KGs are converted into Euclidean space with so-called graph embeddings, which serve as input to a clustering algorithm. The Python library Karate Club provides 10 different techniques for embedding whole graphs, i.e., only one vector is generated per graph. These were successfully tested on benchmark datasets that include social media platforms and chemical or biochemical structures. This work presents the potential of graph embeddings for a clustering task in the manufacturing domain by modifying and evaluating Karate Club's techniques on a manufacturing dataset. First, an introduction to graph theory is given and the state of the art in whole-graph embedding techniques is explained. Second, the Bosch production line dataset is examined with an Exploratory Data Analysis (EDA), and a graph data model for directed and undirected graphs is defined based on the results. Third, a data processing pipeline is developed to generate graph embeddings from the raw data. Finally, the graph embeddings are used as input to a clustering algorithm, and a quantitative comparison of the techniques' performance is conducted.
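A hedged sketch of that pipeline shape (hand-crafted structural descriptors stand in for Karate Club's learned embeddings such as Graph2Vec, and a mini k-means stands in for the clustering step; none of this is the paper's actual setup):

```python
import numpy as np

def graph_embedding(edges):
    """Toy whole-graph descriptor standing in for a learned embedding:
    node count, edge count, mean degree and density give one
    fixed-length vector per graph."""
    nodes = {u for edge in edges for u in edge}
    n, m = len(nodes), len(edges)
    return np.array([n, m, 2 * m / n, 2 * m / (n * (n - 1))])

def kmeans(X, k=2, iters=20):
    """Minimal k-means over the embedding vectors; initialised from the
    first k distinct rows, which is enough for this sketch."""
    centers = np.unique(X, axis=0)[:k].astype(float)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Two triangles vs. two 6-cycles: the descriptors separate them cleanly.
graphs = [
    [(0, 1), (1, 2), (2, 0)],
    [(0, 1), (1, 2), (2, 0)],
    [(i, (i + 1) % 6) for i in range(6)],
    [(i, (i + 1) % 6) for i in range(6)],
]
X = np.array([graph_embedding(g) for g in graphs])
print(kmeans(X))  # triangles land in one cluster, 6-cycles in the other
```

The point of the sketch is the data flow the abstract describes: graphs in, one fixed-length vector per graph out, then an off-the-shelf clustering algorithm on those vectors.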