3,595 research outputs found
Representation Learning: A Review and New Perspectives
The success of machine learning algorithms generally depends on data
representation, and we hypothesize that this is because different
representations can entangle and hide more or less the different explanatory
factors of variation behind the data. Although specific domain knowledge can be
used to help design representations, learning with generic priors can also be
used, and the quest for AI is motivating the design of more powerful
representation-learning algorithms implementing such priors. This paper reviews
recent work in the area of unsupervised feature learning and deep learning,
covering advances in probabilistic models, auto-encoders, manifold learning,
and deep networks. This motivates longer-term unanswered questions about the
appropriate objectives for learning good representations, for computing
representations (i.e., inference), and the geometrical connections between
representation learning, density estimation and manifold learning.
On the role of pre and post-processing in environmental data mining
The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies, or even irrelevant information. The more complex the reality being analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions based on KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details about how this can be achieved and discusses the role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems.
FeatureExplorer: Interactive Feature Selection and Exploration of Regression Models for Hyperspectral Images
Feature selection is used in machine learning to improve predictions,
decrease computation time, reduce noise, and tune models based on limited
sample data. In this article, we present FeatureExplorer, a visual analytics
system that supports the dynamic evaluation of regression models and importance
of feature subsets through the interactive selection of features in
high-dimensional feature spaces typical of hyperspectral images. The
interactive system allows users to iteratively refine and diagnose the model by
selecting features based on their domain knowledge, interchangeable
(correlated) features, feature importance, and the resulting model performance.
Comment: To appear in IEEE VIS 2019 Short Paper
Performance evaluation and hyperparameter tuning of statistical and machine-learning models using spatial data
Machine-learning algorithms have gained popularity in recent years in the
field of ecological modeling due to their promising results in predictive
performance of classification problems. While the application of such
algorithms has been greatly simplified over the last few years thanks to their
well-documented integration in commonly used statistical programming languages
such as R, there are several practical challenges in the field of ecological
modeling related to unbiased performance estimation, optimization of algorithms
using hyperparameter tuning and spatial autocorrelation. We address these
issues in the comparison of several widely used machine-learning algorithms
such as Boosted Regression Trees (BRT), k-Nearest Neighbor (WKNN), Random
Forest (RF) and Support Vector Machine (SVM) to traditional parametric
algorithms such as logistic regression (GLM) and semi-parametric ones like
generalized additive models (GAM). Different nested cross-validation methods
including hyperparameter tuning methods are used to evaluate model performances
with the aim of obtaining bias-reduced performance estimates. As a case study, the
spatial distribution of the forest disease Diplodia sapinea in the Basque Country
in Spain is investigated using common environmental variables such as
temperature, precipitation, soil or lithology as predictors. Results show that
GAM and RF (mean AUROC estimates 0.708 and 0.699) outperform all other methods
in predictive accuracy. The effect of hyperparameter tuning saturates at around
50 iterations for this data set. The AUROC differences between the bias-reduced
(spatial cross-validation) and overoptimistic (non-spatial cross-validation)
performance estimates of the GAM and RF are 0.167 (24%) and 0.213 (30%),
respectively. It is recommended to also use spatial partitioning for
cross-validation hyperparameter tuning of spatial data.
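The nested cross-validation with spatial partitioning described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the spatial partitions are built, so the square grid blocks used here are an assumption, and the function names (`spatial_block_folds`, `train_test_splits`, `nested_spatial_splits`) and the `block_size` parameter are hypothetical. The idea shown is only that samples from the same spatial block never appear in both the training and the test set, in the outer loop (performance estimation) as well as in the inner loop (hyperparameter tuning).

```python
from collections import defaultdict


def spatial_block_folds(coords, n_folds, block_size):
    """Group samples into folds so that each grid block lies in exactly one fold.

    Assumption: square grid blocks of side `block_size` stand in for whatever
    spatial partitioning a real study would use.
    """
    blocks = defaultdict(list)
    for idx, (x, y) in enumerate(coords):
        blocks[(int(x // block_size), int(y // block_size))].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Place the largest blocks first, always into the currently smallest fold.
    for key in sorted(blocks, key=lambda k: (-len(blocks[k]), k)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(blocks[key])
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def nested_spatial_splits(coords, n_outer, n_inner, block_size):
    """Yield (outer_train, outer_test, inner_splits) for nested spatial CV.

    The inner splits partition only the outer training samples and would be
    used for hyperparameter tuning; the outer test fold stays untouched.
    """
    outer_folds = spatial_block_folds(coords, n_outer, block_size)
    for outer_train, outer_test in train_test_splits(outer_folds):
        sub_coords = [coords[i] for i in outer_train]
        local_folds = spatial_block_folds(sub_coords, n_inner, block_size)
        # Map positions within the outer training set back to original indices.
        inner_folds = [[outer_train[i] for i in fold] for fold in local_folds]
        yield outer_train, outer_test, list(train_test_splits(inner_folds))
```

A non-spatial cross-validation would instead shuffle the sample indices into folds at random, which lets nearby (spatially autocorrelated) samples land on both sides of a split. Note that `n_folds` should not exceed the number of occupied blocks, otherwise some folds stay empty.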
Enabling Auditing and Intrusion Detection of Proprietary Controller Area Networks
The goal of this dissertation is to provide automated methods for security researchers to overcome ‘security through obscurity’ used by manufacturers of proprietary Industrial Control Systems (ICS). ‘White hat’ security analysts waste significant time reverse engineering these systems’ opaque network configurations instead of performing meaningful security auditing tasks. Automating the process of documenting proprietary protocol configurations is intended to improve independent security auditing of ICS networks. The major contributions of this dissertation are a novel approach for unsupervised lexical analysis of binary network data flows and analysis of the time series data extracted as a result. We demonstrate the utility of these methods using Controller Area Network (CAN) data sampled from passenger vehicles.
Data Mining Application for Healthcare Sector: Predictive Analysis of Heart Attacks
Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence
Cardiovascular diseases are the leading cause of death worldwide, with heart disease being the
deadliest; in more than 75% of cases, those affected live in low- and middle-income countries.
Considering all the consequences, first for the individual’s health but also for the health
system and the cost of healthcare (for instance, treatments and medication), specifically for
the treatment of cardiovascular diseases, it has become extremely important to provide quality
services by making use of preventive medicine, whose focus is identifying the disease risk and
then taking the right action at early signs. Data Mining (DM) and its techniques make it possible
to uncover patterns and relationships among the objects in healthcare data, giving the potential
to use the data more efficiently, to produce business intelligence, and to extract knowledge
that will be crucial for future answers about possible diseases and treatments for patients.
Nowadays, DM is already applied in medical information systems for clinical purposes such as
diagnosis and treatment, where predictive models can diagnose certain groups of diseases, in
this case heart attacks.
The focus of this project is on applying machine learning techniques to develop a predictive
model based on a real dataset, in order to detect, through the analysis of patient data, whether a
person is likely to have a heart attack. In the end, the best model is found by comparing the
different algorithms used, assessing their results, and selecting the one with the best measures.
The correct identification of early signs of cardiovascular problems through the analysis of patient
data can lead to the possible prevention of heart attacks, to the consequent reduction of complications
and secondary effects that the disease may bring, and, most importantly, to a decrease in the number
of deaths in the future. Making use of Data Mining and analytics in healthcare will allow the analysis
of high volumes of data, the development of new predictive models, and the understanding of the
factors and variables that have the most influence on and contribute most to this disease, to which
people should pay attention. Hence, this practical approach is an example of how predictive analytics
can have an important impact in the healthcare sector: through the collection of patient data, models
learn from it so that in the future they can predict new, unknown cases of heart attacks with better
accuracy. In this way, it contributes to the creation of new models, to the tracking of patients’ health
data, to the improvement of medical decisions, to faster and more efficient responses, and to the
wellbeing of the population, which can be improved if diseases like this can be predicted and avoided.
To conclude, this project aims to show how Data Mining techniques are applied in
healthcare and medicine, and how they contribute to better knowledge of cardiovascular diseases
and to the support of important decisions that will influence the patient’s quality of life.
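The model-selection step mentioned in this abstract (comparing algorithms and keeping the one with the best measures) could look roughly like the sketch below. The abstract does not name the measures or the algorithms, so the choice of accuracy, precision, recall, and F1, the default of selecting by F1, and the function names `classification_metrics` and `select_best_model` are all assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute basic binary-classification measures (1 = heart attack, 0 = none)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def select_best_model(y_true, predictions, measure="f1"):
    """Return (best_name, all_scores) for a dict of model name -> predictions.

    Assumption: a single measure decides the winner; a real study might weigh
    several measures, for example favouring recall when missed cases are costly.
    """
    scores = {name: classification_metrics(y_true, y_pred)
              for name, y_pred in predictions.items()}
    best = max(scores, key=lambda name: scores[name][measure])
    return best, scores
```

With held-out labels `y_true` and a dictionary mapping each candidate model's name to its predictions on the same held-out set, `select_best_model(y_true, predictions, measure="recall")` would pick the model that misses the fewest true cases.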