50 research outputs found

    Instance selection of linear complexity for big data

    Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets. In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n^2), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances). Supported by the Research Projects TIN 2011-24046 and TIN 2015-67534-P from the Spanish Ministry of Economy and Competitiveness.
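
The abstract above does not spell out the two algorithms, so the following is only a minimal sketch of the general idea it describes: hash every instance once with a locality-sensitive hash and use bucket membership as a cheap similarity test. The hash family (quantised random projections), the "keep one instance per bucket and class" rule, and the names `lsh_instance_selection`, `n_projections` and `bucket_width` are all assumptions made for illustration, not the authors' method.

```python
import random


def lsh_instance_selection(X, y, n_projections=4, bucket_width=1.0, seed=0):
    """Return indices of retained instances after one pass over the data.

    Assumed rule (illustrative only): keep the first instance seen for each
    (hash bucket, class label) pair and discard later ones as redundant.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    # Hypothetical hash family: random Gaussian projections quantised into bins.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(n_projections)]
    offsets = [rng.uniform(0.0, bucket_width) for _ in range(n_projections)]

    seen = set()
    selected = []
    for idx, (x, label) in enumerate(zip(X, y)):
        # The bucket key is the tuple of quantised projections of x.
        key = tuple(
            int((sum(a * v for a, v in zip(plane, x)) + b) // bucket_width)
            for plane, b in zip(planes, offsets)
        )
        if (key, label) not in seen:
            seen.add((key, label))
            selected.append(idx)
    return selected
```

Each instance is hashed once and looked up in a set, so the cost grows linearly with the number of instances for a fixed number of projections, which is the property the abstract emphasises.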

    An Optimal k Nearest Neighbours Ensemble for Classification Based on Extended Neighbourhood Rule with Features subspace

    To minimize the effect of outliers, kNN ensembles identify a set of closest observations to a new sample point to estimate its unknown class by using majority voting in the labels of the training instances in the neighbourhood. Ordinary kNN based procedures determine k closest training observations in the neighbourhood region (enclosed by a sphere) by using a distance formula. The k nearest neighbours procedure may not work in a situation where sample points in the test data follow the pattern of the nearest observations that lie on a certain path not contained in the given sphere of nearest neighbours. Furthermore, these methods combine hundreds of base kNN learners and many of them might have high classification errors thereby resulting in poor ensembles. To overcome these problems, an optimal extended neighbourhood rule based ensemble is proposed where the neighbours are determined in k steps. It starts from the first nearest sample point to the unseen observation. The second nearest data point is identified that is closest to the previously selected data point. This process is continued until the required number of the k observations are obtained. Each base model in the ensemble is constructed on a bootstrap sample in conjunction with a random subset of features. After building a sufficiently large number of base models, the optimal models are then selected based on their performance on out-of-bag (OOB) data. Comment: 12 pages
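
The k-step neighbour search described above can be sketched as follows. This covers only a single base learner (no bootstrap sampling, feature subsets or out-of-bag model selection), uses Euclidean distance as an assumption, and the function names `extended_neighbours` and `predict` are hypothetical.

```python
import math
from collections import Counter


def extended_neighbours(query, X, k):
    """Pick k training indices as a chain: each step selects the unused
    point closest to the previously selected point, starting at the query."""
    remaining = set(range(len(X)))
    chain = []
    current = query
    for _ in range(min(k, len(X))):
        # Ties are broken by the smaller index to keep the result deterministic.
        nxt = min(remaining, key=lambda i: (math.dist(current, X[i]), i))
        chain.append(nxt)
        remaining.remove(nxt)
        current = X[nxt]
    return chain


def predict(query, X, y, k):
    """Majority vote over the labels of the chained neighbours."""
    votes = Counter(y[i] for i in extended_neighbours(query, X, k))
    return votes.most_common(1)[0][0]
```

With training points at 1, 2, 3 and -1.5 on a line and a query at 0, the chain for k = 3 follows the path 1, 2, 3, whereas an ordinary 3-nearest-neighbour search around the query would pick 1, -1.5 and 2.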

    Estudio de métodos de selección de instancias (A study of instance selection methods)

    This thesis studies instance selection techniques: it analyses the state of the art and develops new methods to cover some areas that had not received due attention until now. The first two chapters present new instance selection methods for regression, a topic little studied in the literature to date. The third chapter studies how combining instance selection algorithms for regression can yield better results than the individual methods on their own. The last chapter presents a novel idea: using locality-sensitive hash functions to design two new instance selection algorithms for classification. The advantage of this solution is that both algorithms have linear complexity. The results of this thesis have been published in four articles in first-quartile JCR journals. Ministerio de Economía, Industria y Competitividad, the Junta de Castilla y León and the European Regional Development Fund, projects TIN 2011-24046, TIN 2015-67534-P (MINECO/FEDER) and BU085P17 (JCyL/FEDER)

    Improved automated CASH optimization with tree parzen estimators for class imbalance problems

    The imbalanced classification problem is very relevant in both academic and industrial applications. The task of finding the best machine learning model to use for a specific imbalanced dataset is complicated due to a large number of existing algorithms, each with its own hyperparameters. The Combined Algorithm Selection and Hyperparameter optimization (CASH) has been introduced to tackle both aspects at the same time. However, CASH has not been studied in detail in the class imbalance domain, where the best combination of resampling technique and classification algorithm is searched for, together with their optimized hyperparameters. Thus, we target the CASH problem for imbalanced classification. We experiment with a search space of 5 classification algorithms, 21 resampling approaches and 64 relevant hyperparameters in total. Moreover, we investigate the performance of 2 well-known optimization approaches: Random search and the Tree Parzen Estimators approach, which is a kind of Bayesian optimization. For comparison, we also perform grid search on all combinations of resampling techniques and classification algorithms with their default hyperparameters. Our experimental results show that a Bayesian optimization approach outperforms the other approaches for CASH in this application domain. Horizon 2020 (H2020) 766186. Algorithms and the Foundations of Software technology
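
To make the shape of a CASH search space concrete, here is a minimal sketch of the random-search baseline over a joint space of resampler, classifier and their hyperparameters. The entries of `SEARCH_SPACE`, the value ranges and the `objective` callback are hypothetical placeholders, far smaller than the 5 classifiers, 21 resamplers and 64 hyperparameters mentioned above, and the Tree Parzen Estimators optimizer itself is not reproduced here.

```python
import random

# Hypothetical search space: names and ranges are illustrative, not the paper's.
SEARCH_SPACE = {
    "resampler": {
        "none": {},
        "random_oversampling": {"ratio": (0.5, 1.0)},
        "smote": {"ratio": (0.5, 1.0), "k_neighbors": (1, 10)},
    },
    "classifier": {
        "knn": {"n_neighbors": (1, 30)},
        "decision_tree": {"max_depth": (1, 20)},
    },
}


def sample_params(ranges, rng):
    """Draw one value per hyperparameter: integers for integer bounds,
    uniform floats otherwise."""
    params = {}
    for name, (low, high) in ranges.items():
        if isinstance(low, int) and isinstance(high, int):
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return params


def sample_configuration(rng):
    """Choose an algorithm for each stage, then its conditional hyperparameters."""
    config = {}
    for stage, options in SEARCH_SPACE.items():
        choice = rng.choice(sorted(options))
        config[stage] = (choice, sample_params(options[choice], rng))
    return config


def random_search(objective, n_trials=50, seed=0):
    """Evaluate n_trials random configurations and return the best one.

    `objective` is a user-supplied function mapping a configuration to a
    score to maximise (for example a cross-validated imbalance metric).
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_configuration(rng)
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

The key point is that hyperparameters are conditional on the algorithm chosen at each stage; a Bayesian optimizer would replace the uniform sampling with proposals informed by earlier trials.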

    Artificial neural network models: data selection and online adaptation

    Energy consumption has been increasing steadily due to globalization and industrialization. Studies have shown that buildings have the biggest proportion in energy consumption; for example in European Union countries, energy consumption in buildings represents around 40% of the total energy consumption. Hence this PhD was intended towards managing the energy consumed by Heating, Ventilating and Air Conditioning (HVAC) systems in buildings benefiting from Model Predictive Control (MPC) technique. To achieve this goal, artificial intelligence models such as neural networks and Support Vector Machines (SVM) have been proposed because of their high potential capabilities of performing accurate nonlinear mappings between inputs and outputs in real environments which are not noise-free. In this PhD, Radial Basis Function Neural Networks (RBFNN) as a promising class of Artificial Neural Networks (ANN) were considered to model a sequence of time series processes where the RBFNN models were built using Multi Objective Genetic Algorithm (MOGA) as a design platform. Regarding the design of such models, two main challenges were tackled; data selection and model adaptation. Since RBFNNs are data driven models, the performance of such models relies, to a good extent, on selecting proper data throughout the design phase, covering the whole input-output range in which they will be employed. The convex hull algorithms can be applied as methods for data selection; however the use of conventional implementations of these methods in high dimensions, due to their high complexity, is not feasible. As the first phase of this PhD, a new randomized approximation convex hull algorithm called ApproxHull was proposed for high dimensions so that it can be used in an acceptable execution time, and with low memory requirements. 
Simulation results showed that applying ApproxHull as a filter data selection method (i.e., unsupervised data selection method) could improve the performance of the classification and regression models, in comparison with random data selection method. In addition, ApproxHull was employed in real applications in terms of three case studies. The first two were in association with applying predictive models for energy saving. The last case study was related to segmentation of lesion areas in brain Computed Tomography (CT) images. The evaluation results showed that applying ApproxHull in MOGA could result in models with an acceptable level of accuracy. Specifically, the results obtained from the third case study demonstrated that ApproxHull is capable of being applied on large size data sets in high dimensions. Besides the random selection method, it was also compared with an entropy based unsupervised data selection method and a hybrid method involving ApproxHull and the entropy based method. Based on the simulation results, for most cases, ApproxHull and the hybrid method achieved a better performance than the others. In the second phase of this PhD, a new convex-hull-based sliding window online adaptation method was proposed. The goal was to update the offline predictive RBFNN models used in HVAC MPC technique, where these models are applied to processes in which the data input-output range changes over time. The idea behind the proposed method is capturing a new arriving point at each time instant which reflects a new range of data by comparing the point with current convex hull presented via ApproxHull. In this situation the underlying model’s parameters are updated based on the new point and a sliding window of some past points. 
The simulation results showed that the proposed method not only updated the model efficiently while maintaining a good level of accuracy, but was also comparable with other methods. Energy consumption has risen continuously because of industrialization and globalization. Research on consumption shows that buildings consume the largest share of energy; in European Union countries, for example, that share is about 40% of all energy consumed. This doctoral thesis therefore has the practical aim of helping to improve the management of the energy consumed by Heating, Ventilating and Air Conditioning (HVAC) systems in buildings, within a model-based predictive control strategy. Models based on artificial neural networks and support vector machines, to mention only a few, have already been proposed in this context. These techniques are highly capable of modelling nonlinear relationships between system inputs and outputs, and they are applicable in operating environments, which are subject to various forms of noise. This thesis considered radial basis function neural networks, an established technique for time series modelling. A tool based on a multi-objective genetic algorithm was used to design these networks. Regarding the design process, the thesis addresses two less-studied aspects: data selection and online model adaptation. Since artificial neural networks are data-driven models, their performance depends to a good extent on having appropriate data that are representative of the system or process and that cover the whole range of values generated by its input/output representation. Algorithms that determine the geometric figure enclosing all the data, known as convex hull algorithms, can be applied to the data selection task.
However, using conventional implementations of these algorithms on high-dimensional problems is not feasible in practice. In the first phase of this work, a new randomized convex hull approximation method, named ApproxHull, was proposed; it is suited to high-dimensional data sets and is therefore viable for practical applications. The experimental results showed that applying ApproxHull as a filter-type (that is, unsupervised) data selection method can improve model performance in classification and regression problems compared with random data selection. ApproxHull was also applied in three case studies on real applications. The first two concerned the development of predictive models for systems in the energy efficiency area. The third consisted of developing classification models for segmenting lesions in computed tomography images. The results revealed that applying the proposed method produced models with acceptable accuracy. In terms of applicability, the results showed that ApproxHull can be used on large data sets with high-dimensional data. Besides random data selection, the method was also compared with an entropy-based data selection method and with a hybrid method that combines ApproxHull with the entropy-based method. The experimental results showed that in most of the cases studied the hybrid method performed better than the others. In the second phase of the work, a new online adaptation method was proposed, based on the ApproxHull algorithm and a sliding time window.
Since the processes and systems surrounding the HVAC system are time-varying and dynamic, the aim was to apply the proposed method to adapt online the models that were first obtained offline. The basic idea of the proposed method is to compare each new input/output pair with the known convex hull and determine whether the new pair contains data lying outside the known range. In that case, the model parameters are updated using the new point and a set of points within a given sliding time window. The experimental results demonstrated not only that the new method is efficient at updating the models and keeping them at a good level of accuracy, but also that it was comparable to other existing methods.
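
The hull-triggered update idea in this abstract can be illustrated with a small sketch. ApproxHull itself is not reproduced: an exact two-dimensional convex hull (Andrew's monotone chain) stands in for it purely as an assumption, and the class `HullTriggeredUpdater`, its `observe` method and the `window_size` parameter are hypothetical names. Retraining the model is left to the caller.

```python
from collections import deque


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Exact 2-D hull (monotone chain), vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(point, hull):
    """True if point lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:
        return False  # degenerate hull: treat every point as new information
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


class HullTriggeredUpdater:
    """Signals a model update only when a new point extends the known range."""

    def __init__(self, initial_points, window_size=5):
        self.hull = convex_hull(initial_points)
        self.window = deque(maxlen=window_size)  # most recent points

    def observe(self, point):
        """Return the window of points to retrain on if `point` falls outside
        the current hull; otherwise return None."""
        self.window.append(point)
        if inside_hull(point, self.hull):
            return None
        self.hull = convex_hull(self.hull + [point])
        return list(self.window)
```

Points must be tuples, for example `(input, output)` pairs. A point inside the unit square formed by the initial data triggers nothing, while a point outside it enlarges the hull and returns the recent window for retraining.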

    Graph-based Estimation of Information Divergence Functions

    Information divergence functions, such as the Kullback-Leibler divergence or the Hellinger distance, play a critical role in statistical signal processing and information theory; however, estimating them can be challenging. Most often, parametric assumptions are made about the two distributions to estimate the divergence of interest. In cases where no parametric model fits the data, non-parametric density estimation is used. In statistical signal processing applications, Gaussianity is usually assumed since closed-form expressions for common divergence measures have been derived for this family of distributions. Parametric assumptions are preferred when it is known that the data follows the model; however, this is rarely the case in real-world scenarios. Non-parametric density estimators are characterized by a very large number of parameters that have to be tuned with costly cross-validation. In this dissertation we focus on a specific family of non-parametric estimators, called direct estimators, that bypass density estimation completely and directly estimate the quantity of interest from the data. We introduce a new divergence measure, the D_p-divergence, that can be estimated directly from samples without parametric assumptions on the distribution. We show that the D_p-divergence bounds the binary, cross-domain, and multi-class Bayes error rates and, in certain cases, provides provably tighter bounds than the Hellinger divergence. In addition, we also propose a new methodology that allows the experimenter to construct direct estimators for existing divergence measures or to construct new divergence measures with custom properties that are tailored to the application. To examine the practical efficacy of these new methods, we evaluate them in a statistical learning framework on a series of real-world data science problems involving speech-based monitoring of neuro-motor disorders. Dissertation/Thesis. Doctoral Dissertation, Electrical Engineering, 201
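
As an illustration of the closed-form expressions that make the Gaussian assumption convenient, the sketch below evaluates the standard Kullback-Leibler divergence and Hellinger distance between two univariate normal distributions. It covers only this parametric baseline; the D_p-divergence and its direct estimator are not reproduced, and the function names are hypothetical.

```python
import math


def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)


def hellinger_gaussian(mu1, sigma1, mu2, sigma2):
    """Hellinger distance between two univariate Gaussians, in [0, 1]."""
    s = sigma1 ** 2 + sigma2 ** 2
    # Bhattacharyya coefficient: overlap between the two densities.
    bc = math.sqrt(2.0 * sigma1 * sigma2 / s) * math.exp(-(mu1 - mu2) ** 2 / (4.0 * s))
    return math.sqrt(max(0.0, 1.0 - bc))
```

Both functions need only the means and standard deviations, which is why they are attractive when the data really are Gaussian and misleading when they are not.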

    Explainable AI for Machine Fault Diagnosis: Understanding Features' Contribution in Machine Learning Models for Industrial Condition Monitoring

    Although the effectiveness of machine learning (ML) for machine diagnosis has been widely established, the interpretation of the diagnosis outcomes is still an open issue. Machine learning models behave as black boxes; therefore, the contribution given by each of the selected features to the diagnosis is not transparent to the user. This work is aimed at investigating the capabilities of the SHapley Additive exPlanation (SHAP) to identify the most important features for fault detection and classification in condition monitoring programs for rotating machinery. The authors analyse the case of medium-sized bearings of industrial interest. Namely, vibration data were collected for different health states from the test rig for industrial bearings available at the Mechanical Engineering Laboratory of Politecnico di Torino. The Support Vector Machine (SVM) and k-Nearest Neighbour (kNN) diagnosis models are explained by means of the SHAP. Accuracies higher than 98.5% are achieved for both models using the SHAP as a criterion for feature selection. It is found that the skewness and the shape factor of the vibration signal have the greatest impact on the models’ outcomes.
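
The two features singled out above can be computed from a raw vibration signal as in the sketch below. It assumes the commonly used definitions (skewness as the third standardized moment with population normalisation, shape factor as RMS divided by the mean absolute value); the abstract does not state which exact definitions the authors used, and the SHAP analysis itself is not reproduced.

```python
import math


def skewness(signal):
    """Third standardized moment of the signal (population form)."""
    n = len(signal)
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n
    m3 = sum((x - mean) ** 3 for x in signal) / n
    return m3 / m2 ** 1.5


def shape_factor(signal):
    """Root-mean-square value divided by the mean absolute value."""
    n = len(signal)
    rms = math.sqrt(sum(x * x for x in signal) / n)
    mean_abs = sum(abs(x) for x in signal) / n
    return rms / mean_abs
```

A symmetric signal such as `[1, -1, 1, -1]` has zero skewness and a shape factor of 1, while an impulsive signal such as `[0, 0, 0, 4]` pushes both values up, which is consistent with these features being sensitive to fault-induced impacts.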