3,595 research outputs found

    Representation Learning: A Review and New Perspectives

    Full text link
    The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
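
    As a concrete illustration of one of the representation-learning families reviewed above, here is a minimal auto-encoder sketch in PyTorch; the architecture, dimensions, and random training data are assumptions chosen for brevity, not anything from the paper itself.

```python
# Hedged sketch: a minimal under-complete auto-encoder, one of the
# representation-learning families the review covers. All settings below
# (layer sizes, optimizer, random data) are illustrative assumptions.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_input=784, n_hidden=32):
        super().__init__()
        # The narrow hidden layer forces the network to learn a compressed
        # representation (code) of the input.
        self.encoder = nn.Sequential(nn.Linear(n_input, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_input)

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(256, 784)          # stand-in for real data (e.g. flattened images)
for _ in range(100):              # unsupervised training: reconstruct the input
    recon, _ = model(x)
    loss = loss_fn(recon, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```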

    On the role of pre and post-processing in environmental data mining

    Get PDF
    The quality of discovered knowledge depends highly on data quality. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework to prepare data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided. The role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed.
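
    For illustration, the sketch below shows the kind of generic pre-processing the paper argues for (duplicate removal, crude outlier screening, imputation of missing values); the column names and thresholds are invented, and a real environmental study would substitute domain-driven choices.

```python
# Hedged sketch of common pre-processing steps of the kind discussed in the
# paper; the 3-sigma rule, median imputation, and toy columns are assumptions.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates().copy()               # remove redundant records
    df = df.replace([np.inf, -np.inf], np.nan)     # treat infinities as missing
    num = df.select_dtypes(include="number")
    # Simple noise filter: drop rows more than 3 standard deviations from the
    # column mean; missing values are kept here and imputed below.
    within = ((num - num.mean()).abs() <= 3 * num.std()) | num.isna()
    df = df[within.all(axis=1)].copy()
    # Fill remaining gaps with per-column medians.
    df[num.columns] = df[num.columns].fillna(df[num.columns].median())
    return df

# Toy "environmental" table to show the function running end to end.
raw = pd.DataFrame({"temperature": [12.1, 12.3, np.nan, 400.0],
                    "ph": [7.0, 7.0, 6.8, 7.1]})
print(preprocess(raw))
```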

    FeatureExplorer: Interactive Feature Selection and Exploration of Regression Models for Hyperspectral Images

    Full text link
    Feature selection is used in machine learning to improve predictions, decrease computation time, reduce noise, and tune models based on limited sample data. In this article, we present FeatureExplorer, a visual analytics system that supports the dynamic evaluation of regression models and importance of feature subsets through the interactive selection of features in high-dimensional feature spaces typical of hyperspectral images. The interactive system allows users to iteratively refine and diagnose the model by selecting features based on their domain knowledge, interchangeable (correlated) features, feature importance, and the resulting model performance. Comment: To appear in IEEE VIS 2019 Short Paper.
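
    FeatureExplorer itself is an interactive visual system; as a rough non-interactive analogue, the hedged sketch below ranks features by importance, keeps a subset, and re-evaluates the regression model. The synthetic data, the random-forest model, and the top-20 cut-off are assumptions for illustration only, not the paper's implementation.

```python
# Hedged sketch of an importance-driven feature-selection loop: rank features,
# keep a subset, re-evaluate the regression model. Data and settings are
# illustrative assumptions standing in for hyperspectral bands.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Stand-in for hyperspectral bands: many predictors, one continuous response.
X, y = make_regression(n_samples=200, n_features=100, n_informative=15,
                       random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
baseline = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# Rank features by impurity-based importance and keep the top 20.
importances = model.fit(X, y).feature_importances_
top = np.argsort(importances)[::-1][:20]
reduced = cross_val_score(model, X[:, top], y, cv=5, scoring="r2").mean()

print(f"R2 all 100 bands: {baseline:.3f} | R2 top 20 bands: {reduced:.3f}")
```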

    Performance evaluation and hyperparameter tuning of statistical and machine-learning models using spatial data

    Get PDF
    Machine-learning algorithms have gained popularity in recent years in the field of ecological modeling due to their promising predictive performance on classification problems. While the application of such algorithms has been greatly simplified in recent years by their well-documented integration into commonly used statistical programming languages such as R, there are several practical challenges in ecological modeling related to unbiased performance estimation, optimization of algorithms using hyperparameter tuning, and spatial autocorrelation. We address these issues by comparing several widely used machine-learning algorithms, such as Boosted Regression Trees (BRT), weighted k-Nearest Neighbors (WKNN), Random Forest (RF) and Support Vector Machines (SVM), to traditional parametric algorithms such as logistic regression (GLM) and semi-parametric ones like generalized additive models (GAM). Different nested cross-validation methods, including hyperparameter tuning, are used to evaluate model performance with the aim of obtaining bias-reduced performance estimates. As a case study, the spatial distribution of the forest disease Diplodia sapinea in the Basque Country (Spain) is investigated using common environmental variables such as temperature, precipitation, soil or lithology as predictors. Results show that GAM and RF (mean AUROC estimates of 0.708 and 0.699) outperform all other methods in predictive accuracy. The effect of hyperparameter tuning saturates at around 50 iterations for this data set. The AUROC differences between the bias-reduced (spatial cross-validation) and overoptimistic (non-spatial cross-validation) performance estimates of the GAM and RF are 0.167 (24%) and 0.213 (30%), respectively. It is recommended to also use spatial partitioning for cross-validation hyperparameter tuning of spatial data.
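
    The core methodological point, that spatially blocked partitioning yields less optimistic performance estimates than random partitioning, can be sketched outside the paper's R setup. The Python sketch below builds folds by k-means clustering of sample coordinates and compares the two estimates on synthetic, spatially structured data; all settings are illustrative assumptions, not the study's pipeline.

```python
# Hedged sketch: spatial cross-validation (coordinate-based blocking) versus
# ordinary random cross-validation on synthetic, spatially structured data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
n = 500
coords = rng.uniform(0, 100, size=(n, 2))                # x/y sample locations
X = np.column_stack([coords, rng.normal(size=(n, 3))])   # environmental predictors
y = (coords[:, 0] + rng.normal(scale=10, size=n) > 50).astype(int)  # spatial response

# Spatial partitioning: k-means on the coordinates defines the folds, so
# training and test points are geographically separated.
blocks = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
auc_spatial = cross_val_score(rf, X, y, cv=GroupKFold(n_splits=5),
                              groups=blocks, scoring="roc_auc").mean()
auc_random = cross_val_score(rf, X, y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=0),
                             scoring="roc_auc").mean()
print(f"AUROC spatial CV: {auc_spatial:.3f} | non-spatial CV: {auc_random:.3f}")
```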

    Enabling Auditing and Intrusion Detection of Proprietary Controller Area Networks

    Get PDF
    The goal of this dissertation is to provide automated methods for security researchers to overcome 'security through obscurity' used by manufacturers of proprietary Industrial Control Systems (ICS). 'White hat' security analysts waste significant time reverse engineering these systems' opaque network configurations instead of performing meaningful security auditing tasks. Automating the process of documenting proprietary protocol configurations is intended to improve independent security auditing of ICS networks. The major contributions of this dissertation are a novel approach for unsupervised lexical analysis of binary network data flows and analysis of the time series data extracted as a result. We demonstrate the utility of these methods using Controller Area Network (CAN) data sampled from passenger vehicles.
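
    The dissertation's specific algorithm is not spelled out in the abstract; as a generic illustration of unsupervised field inference on CAN payloads, the sketch below guesses signal boundaries from per-bit flip rates across frames. The heuristic, threshold, and toy frames are assumptions, not the author's method.

```python
# Hedged sketch: segment an unknown CAN payload into candidate signal fields by
# looking at how often each bit flips across frames. This illustrates generic
# unsupervised field inference, not the dissertation's specific algorithm.
import numpy as np

def bit_matrix(payloads: list[bytes]) -> np.ndarray:
    """Unpack each 8-byte payload into a row of 64 bits."""
    arr = np.frombuffer(b"".join(payloads), dtype=np.uint8)
    return np.unpackbits(arr.reshape(len(payloads), -1), axis=1)

def candidate_boundaries(payloads: list[bytes], jump: float = 0.2) -> list[int]:
    """Guess field boundaries where the per-bit flip rate jumps sharply,
    which often marks the start of a new physical signal."""
    bits = bit_matrix(payloads)
    flips = np.abs(np.diff(bits.astype(int), axis=0)).mean(axis=0)
    return [i for i in range(1, bits.shape[1]) if flips[i] - flips[i - 1] > jump]

# Toy usage: 200 frames for one CAN arbitration ID, 8-byte payloads.
frames = [bytes([i % 256, (i * 7) % 256, 0, 0, 0, 0, 0, 1]) for i in range(200)]
print(candidate_boundaries(frames))
```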

    Data Mining Application for Healthcare Sector: Predictive Analysis of Heart Attacks

    Get PDF
    Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. Cardiovascular diseases are the leading cause of death in the world, with heart disease being the deadliest, affecting more than 75% of individuals living in low- and middle-income countries. Considering all the consequences, first for the individual's health but also for the health system and the cost of healthcare (for instance, treatments and medication), specifically for the treatment of cardiovascular diseases, the provision of quality services through preventive medicine has become extremely important; its focus is identifying disease risk and then applying the right action at the first early signs. Therefore, by resorting to Data Mining (DM) and its techniques, it is possible to uncover patterns and relationships among the objects in healthcare data, giving the potential to use that data more efficiently, to produce business intelligence, and to extract knowledge that will be crucial for future answers about possible diseases and treatments for patients. Nowadays, the concept of DM is already applied in medical information systems for clinical purposes such as diagnosis and treatment, where predictive models can diagnose certain groups of diseases, in this case heart attacks. The focus of this project is applying machine learning techniques to develop a predictive model based on a real dataset, in order to detect through the analysis of a patient's data whether that person is likely to have a heart attack. In the end, the best model is found by comparing the different algorithms used, assessing their results, and selecting the one with the best measures. The correct identification of early signs of cardiovascular problems through the analysis of patient data can lead to the prevention of heart attacks, to a reduction of the complications and secondary effects that the disease may bring, and most importantly, to a decrease in the number of deaths in the future. Making use of Data Mining and analytics in healthcare allows the analysis of high volumes of data, the development of new predictive models, and the understanding of the factors and variables that have the most influence on and contribution to this disease, and to which people should pay attention. Hence, this practical approach is an example of how predictive analytics can have an important impact on the healthcare sector: through the collection of patient data, models learn from it so that in the future they can predict new, unknown cases of heart attacks with better accuracy. In this way, it contributes to the creation of new models, to the tracking of patients' health data, to the improvement of medical decisions, to efficient and faster responses, and to the wellbeing of the population, which can be improved if diseases like this can be predicted and avoided. To conclude, this project aims to present and show how Data Mining techniques are applied in healthcare and medicine, and how they contribute to better knowledge of cardiovascular diseases and to the support of important decisions that will influence the patient's quality of life.
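
    A hedged sketch of the model-comparison step described above: train a few candidate classifiers on tabular patient data and keep the one with the best cross-validated score. The file name, columns, and candidate models are assumptions, not the project's actual dataset or pipeline.

```python
# Hedged sketch: compare candidate classifiers on tabular patient data by
# cross-validated AUROC. "heart.csv", its columns, and the model list are
# hypothetical placeholders, not the project's real data or final models.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart.csv")            # hypothetical file: one row per patient
X = df.drop(columns=["target"])          # clinical features (age, blood pressure, ...)
y = df["target"]                         # 1 = heart attack, 0 = no heart attack

models = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}

# 5-fold cross-validated AUROC; the model with the best score would be kept.
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUROC = {score:.3f}")
```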