461 research outputs found
Framework for data quality in knowledge discovery tasks
The explosion of data is currently a trend in the digital universe, driven by advances in information technologies. In this context, knowledge discovery and data mining have gained importance because of the large amount of available data. A successful knowledge discovery process requires preparing the data. Experts affirm that the data preprocessing phase takes between 50% and 70% of the total time of a knowledge discovery process.
Software tools based on popular knowledge discovery methodologies offer algorithms for data preprocessing. According to the 2018 Gartner Magic Quadrant for Data Science and Machine Learning Platforms, KNIME, RapidMiner, SAS, Alteryx, and H2O.ai are the leading tools for knowledge discovery. These tools provide diverse techniques that facilitate the evaluation of a dataset; however, they lack a user-oriented process for addressing data quality problems. In addition, selecting suitable data cleaning techniques is a problem for inexperienced users, since it is not clear to them which methods are the most reliable.
Accordingly, this doctoral thesis addresses the aforementioned problems through: (i) a conceptual framework that offers a guided process for addressing data quality problems in knowledge discovery tasks, (ii) a case-based reasoning system that recommends suitable algorithms for data cleaning, and (iii) an ontology that represents knowledge about data quality problems and data cleaning algorithms. Additionally, this ontology contributes to the formal representation of cases and to the adaptation phase of the case-based reasoning system.
The creation and consumption of data continue to grow by leaps and bounds. Due
to advances in Information and Communication Technologies (ICT), the explosion of
data in the digital universe is a new trend. Knowledge Discovery in Databases
(KDD) has gained importance due to this abundance of data. A successful knowledge
discovery process requires careful data preparation. Experts affirm that the
preprocessing phase takes 50% to 70% of the total time of a knowledge discovery
process.
Software tools based on knowledge discovery methodologies offer algorithms
for data preprocessing. According to the Gartner 2018 Magic Quadrant for
Data Science and Machine Learning Platforms, KNIME, RapidMiner, SAS, Alteryx,
and H2O.ai are the leading tools for knowledge discovery. These software
tools provide different techniques and facilitate the evaluation of data analysis;
however, they lack any kind of guidance as to which techniques can or should
be used in which contexts. Consequently, choosing suitable data cleaning
techniques is a headache for inexperienced users: they have no idea which
methods can be confidently used and often resort to trial and error.
This thesis presents three contributions to address the aforementioned problems:
(i) a conceptual framework that guides the user in addressing data quality
issues in knowledge discovery tasks, (ii) a case-based reasoning system that
recommends suitable algorithms for data cleaning, and (iii) an ontology that
represents knowledge about data quality issues and data cleaning methods. This
ontology also supports the case-based reasoning system in the case representation
and reuse phases.
Programa Oficial de Doctorado en Ciencia y Tecnología Informática. Thesis committee: Fernando Fernández Rebollo (President), Gustavo Adolfo Ramírez (Secretary), Juan Pedro Caraça-Valente Hernández (Vocal).
Toward Building an Intelligent and Secure Network: An Internet Traffic Forecasting Perspective
Internet traffic forecasting is a crucial component of the proactive management of self-organizing networks (SON), ensuring better Quality of Service (QoS) and Quality of Experience (QoE). Given the volatile and random nature of traffic data, such forecasting influences strategic development and investment decisions in the Internet Service Provider (ISP) industry. Modern machine learning algorithms have shown potential in dealing with complex Internet traffic prediction tasks, yet challenges persist. This thesis systematically explores these issues over five empirical studies conducted in the past three years, focusing on four key research questions: How do outlier data samples impact prediction accuracy for both short-term and long-term forecasting? How can a denoising mechanism enhance prediction accuracy? How can robust machine learning models be built with limited data? How can out-of-distribution traffic data be used to improve the generalizability of prediction models? Based on extensive experiments, we propose a novel traffic prediction framework and associated models that integrate outlier management and noise reduction strategies, outperforming traditional machine learning models. Additionally, we suggest a transfer learning-based framework combined with a data augmentation technique to provide robust solutions with smaller datasets. Lastly, we propose a hybrid model with signal decomposition techniques to enhance model generalization for out-of-distribution data samples. We also bring cyber threats into our forecasting research, acknowledging their substantial influence on traffic unpredictability and forecasting challenges. Our thesis presents a detailed exploration of cyber-attack detection, employing methods that have been validated on multiple benchmark datasets.
Initially, we incorporated ensemble feature selection with ensemble classification to improve DDoS (Distributed Denial-of-Service) attack detection accuracy with minimal false alarms. Our research further introduces a stacking ensemble framework for classifying diverse forms of cyber-attacks. We then propose a weighted voting mechanism for Android malware detection to secure Mobile Cyber-Physical Systems, which integrate the mobility of various smart devices to exchange information between physical and cyber systems. Lastly, we employ Generative Adversarial Networks to generate flow-based DDoS attacks in Internet of Things environments. By considering the impact of cyber-attacks on traffic volume and the challenges they pose to traffic prediction, our research attempts to bridge the gap between traffic forecasting and cyber security, enhancing proactive network management and contributing to a resilient and secure Internet infrastructure.
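The outlier-management and noise-reduction ideas can be illustrated with a minimal preprocessing sketch, not the thesis's actual framework: clip samples beyond an interquartile-range fence, then smooth with a moving average before the series is handed to any forecaster. The fence multiplier and window size are arbitrary assumptions.

```python
# Minimal illustration of outlier clipping plus moving-average denoising
# for a traffic volume series; thresholds and window size are assumptions.

def iqr_clip(series, k=1.5):
    """Clip values outside the Tukey fence [Q1 - k*IQR, Q3 + k*IQR]."""
    s = sorted(series)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(x, lo), hi) for x in series]

def moving_average(series, window=3):
    """Denoise with a trailing moving average over `window` samples."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

traffic = [10, 12, 11, 300, 13, 12, 14, 11]  # 300 is a burst/outlier
cleaned = moving_average(iqr_clip(traffic))
```

A real pipeline would distinguish genuine traffic bursts (including attack traffic) from measurement noise before clipping, which is exactly the tension between forecasting and cyber-attack detection that the thesis explores.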
Efficient Decision Support Systems
This series is directed to diverse managerial professionals who are leading the transformation of individual domains by using expert information and domain knowledge to drive decision support systems (DSSs). The series offers a broad range of subjects addressed in specific areas such as health care, business management, banking, agriculture, environmental improvement, natural resource and spatial management, aviation administration, and hybrid applications of information technology aimed at interdisciplinary issues. This book series is composed of three volumes: Volume 1 covers general concepts and methodology of DSSs; Volume 2 covers applications of DSSs in the biomedical domain; Volume 3 covers hybrid applications of DSSs in multidisciplinary domains. The series is organized around decision support strategies in the new infrastructure, helping readers make full use of creative technology to manipulate input data and transform information into useful decisions for decision makers.
From Theory to Practice: A Data Quality Framework for Classification Tasks
Data preprocessing is an essential step in knowledge discovery projects. Experts affirm that preprocessing tasks take between 50% and 70% of the total time of the knowledge discovery process. In this sense, several authors consider data cleaning one of the most cumbersome and critical tasks. Failure to provide high data quality in the preprocessing stage will significantly reduce the accuracy of any data analytic project. In this paper, we propose DQF4CT, a framework to address data quality issues in classification tasks. Our approach is composed of: (i) a conceptual framework to provide the user guidance on how to deal with data problems in classification tasks; and (ii) an ontology that represents the knowledge in data cleaning and suggests the proper data cleaning approaches. We present two case studies through real datasets: physical activity monitoring (PAM) and occupancy detection of an office room (OD). To evaluate our proposal, the datasets cleaned by DQF4CT were used to train the same algorithms used in the classification tasks by the authors of PAM and OD. Additionally, we evaluated DQF4CT through datasets of the Repository of Machine Learning Databases of the University of California, Irvine (UCI). In 84% of the results, the models trained on the datasets cleaned by DQF4CT outperform the models of the dataset authors.
This work has also been supported by:
Project: “Red de formación de talento humano para la innovación social y productiva en el Departamento del Cauca InnovAcción Cauca”. Convocatoria 03-2018 Publicación de artículos en revistas de alto impacto.
Project: “Alternativas Innovadoras de Agricultura Inteligente para sistemas productivos agrícolas del departamento del Cauca soportado en entornos de IoT - ID 4633” financed by Convocatoria 04C–2018 “Banco de Proyectos Conjuntos UEES-Sostenibilidad” of Project “Red de formación de talento humano para la innovación social y productiva en el Departamento del Cauca InnovAcción Cauca”.
Spanish Ministry of Economy, Industry and Competitiveness (Projects TRA2015-63708-R and TRA2016-78886-C3-1-R)
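The ontology's role in a framework like DQF4CT, mapping a detected data quality issue to candidate cleaning approaches, can be caricatured as a simple lookup. The issue names and method lists below are assumptions for illustration, not DQF4CT's actual ontology.

```python
# Hypothetical issue-to-method mapping standing in for an ontology of
# data quality issues and cleaning approaches; contents are invented.
CLEANING_KNOWLEDGE = {
    "missing_values": ["mean imputation", "median imputation", "kNN imputation"],
    "outliers": ["IQR filtering", "z-score filtering", "winsorization"],
    "duplicates": ["exact deduplication", "fuzzy record matching"],
}

def suggest(issues):
    """Return candidate cleaning approaches for each detected issue."""
    return {issue: CLEANING_KNOWLEDGE.get(issue, ["no known method"])
            for issue in issues}

suggestions = suggest(["missing_values", "outliers"])
```

An actual ontology adds what a flat dictionary cannot: formally defined relations between issues, methods, and dataset characteristics that a reasoner can exploit to rank or rule out approaches.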
An Intelligent Framework for Estimating Software Development Projects using Machine Learning
The IT industry has faced many challenges related to software effort and cost estimation. A cost assessment is conducted after software effort estimation, which benefits customers as well as developers. The purpose of this paper is to discuss various methods for the estimation of software effort and cost in the context of software engineering, such as algorithmic methods, expert judgment methods, analogy-based estimation methods, and machine learning methods, as well as their different aspects. Despite this, estimation of the effort involved in software development is subject to uncertainty. Several methods have been developed in the literature to improve estimation accuracy, many of which involve the use of machine learning techniques. A machine learning framework is proposed in this paper to address this challenging problem. In addition to being completely independent of algorithmic models and estimation problems, this framework features a modular architecture. It has high interpretability, learning capability, and robustness to imprecise and uncertain inputs.
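As a concrete, much-simplified instance of machine-learning-based effort estimation, one can fit a least-squares line from project size to observed effort and use it to estimate a new project. The size and effort figures below are invented purely for illustration, not data from the paper.

```python
# Least-squares fit of effort (person-months) against size (KLOC);
# the data points are invented for illustration only.

def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

sizes = [10, 20, 30, 40]    # project size in KLOC (hypothetical)
efforts = [24, 45, 66, 87]  # observed effort in person-months (hypothetical)

a, b = fit_line(sizes, efforts)
predicted = a * 25 + b      # estimated effort for a 25 KLOC project
```

Real effort estimation models use many more features (team experience, requirements volatility, reuse) and nonlinear learners; the point here is only the supervised-learning shape of the problem: historical projects in, an estimate for a new project out.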
Experience: Quality benchmarking of datasets used in software effort estimation
Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous
process and project management activities, including the estimation of development effort and the prediction
of the likely location and severity of defects in code. Serious questions have been raised, however, over the
quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have
been noted as being especially prevalent. Other quality issues, although also potentially important, have
received less attention. In this study, we assess the quality of 13 datasets that have been used extensively
in research on software effort estimation. The quality issues considered in this article draw on a taxonomy
that we published previously based on a systematic mapping of data quality issues in ESE. Our contributions
are as follows: (1) an evaluation of the “fitness for purpose” of these commonly used datasets and (2) an
assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template
that could be used to both improve the ESE data collection/submission process and to evaluate other such
datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the
availability and use of higher-quality datasets.
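A tiny audit in the spirit of such benchmarking might count missing, duplicate, and out-of-range records in a tabular dataset. The field name, plausibility bounds, and sample rows below are hypothetical, and a real assessment would cover the full taxonomy of quality issues rather than three checks.

```python
def audit(rows, field, lo, hi):
    """Count basic quality issues: missing values, duplicate records,
    and values outside a plausibility range [lo, hi]."""
    report = {"missing": 0, "duplicates": 0, "out_of_range": 0}
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        value = row.get(field)
        if value is None:
            report["missing"] += 1
        elif not (lo <= value <= hi):
            report["out_of_range"] += 1
    return report

projects = [
    {"id": 1, "effort": 12},
    {"id": 2, "effort": None},  # missing effort value
    {"id": 3, "effort": 9000},  # implausible effort value
    {"id": 1, "effort": 12},    # exact duplicate record
]
report = audit(projects, "effort", lo=1, hi=1000)
```

Counts like these feed directly into a "fitness for purpose" judgment: a dataset can be widely used and still be a poor benchmark if such issues are prevalent.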
Design and evaluation of a case-based system for modelling exploratory learning behaviour of math generalisation
Exploratory learning environments (ELEs) promote a view of learning that encourages students to construct and/or explore
models and observe the effects of modifying their parameters. The freedom given to learners in this exploration context leads to a
variety of learner approaches for constructing models and makes modelling of learner behaviour a challenging task. To address this
issue, we propose a learner modelling mechanism for monitoring learners’ actions when constructing/exploring models by modelling
sequences of actions reflecting different strategies in solving a task. This is based on a modified version of case-based reasoning for
problems with multiple solutions. In our formulation, approaches to explore the task are represented as sequences of simple cases
linked by temporal and dependency relations, which are mapped to the learners’ behaviour in the system by means of appropriate
similarity metrics. This paper presents the development and validation of the modelling mechanism. The model was validated in the
context of an ELE for mathematical generalisation using data from classroom sessions and pedagogically-driven learning scenarios.
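One common way to compare a learner's action sequence against a stored strategy case is an edit-distance-style similarity. The sketch below uses plain Levenshtein distance over invented action labels; it ignores the temporal and dependency relations between simple cases that the paper's metrics actually model.

```python
def edit_distance(a, b):
    """Levenshtein distance between two action sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def sequence_similarity(a, b):
    """Normalised similarity in [0, 1]; 1 means identical sequences."""
    longest = max(len(a), len(b)) or 1
    return 1 - edit_distance(a, b) / longest

case = ["place_tile", "resize", "check_rule"]  # stored strategy (invented labels)
learner = ["place_tile", "resize", "resize", "check_rule"]
score = sequence_similarity(case, learner)
```

Mapping observed behaviour to the nearest stored strategy in this way is what lets a case-based learner model tolerate the variety of approaches an exploratory environment invites.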
- …