    Improving the family orientation process in Cuban Special Schools through Nearest Prototype classification

    Cuban Schools for children with Affective-Behavioral Maladies (SABM) aim to achieve a major change in children's behavior so that the children can be integrated effectively into society. A key element of this objective is giving adequate orientation to the children's families, because the family is one of the most important educational contexts in which children develop their personality. The family orientation process in SABM involves clustering and classification of mixed-type data with non-symmetric similarity functions. To improve this process, this paper introduces some novel characteristics in clustering and prototype selection. The proposed approach uses hierarchical clustering based on compact sets, which makes it suitable for non-symmetric similarity functions as well as for mixed and incomplete data. The proposal obtains very good results on the SABM data and on repository databases.
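
    As a minimal sketch of the classification step described above, the following Python code assigns a record to the label of its most similar prototype. The similarity function, the feature layout, the weights and the group labels are all assumptions made for illustration; the abstract does not specify them, and the compact-set clustering that would select the prototypes is not shown.

```python
from typing import List, Optional, Sequence, Tuple

Value = Optional[object]   # None marks a missing value
Record = Sequence[Value]


def similarity(query: Record, prototype: Record, weights: Sequence[float]) -> float:
    """Hypothetical non-symmetric similarity for mixed, incomplete records.

    Numeric features score 1 / (1 + |difference|); categorical features score
    1 on an exact match. The asymmetry comes from missing values: a feature
    missing in the query is skipped, while a feature missing only in the
    prototype counts as a mismatch.
    """
    total = 0.0
    used = 0.0
    for q, p, w in zip(query, prototype, weights):
        if q is None:
            continue            # the query says nothing about this feature
        used += w
        if p is None:
            continue            # prototype cannot confirm the value: mismatch
        if isinstance(q, (int, float)) and isinstance(p, (int, float)):
            total += w / (1.0 + abs(q - p))
        elif q == p:
            total += w
    return total / used if used else 0.0


def classify(query: Record,
             prototypes: List[Tuple[Record, str]],
             weights: Sequence[float]) -> Optional[str]:
    """Return the label of the prototype most similar to the query."""
    best_label, best_score = None, -1.0
    for record, label in prototypes:
        score = similarity(query, record, weights)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy records: (age, family type, attends meetings)
prototypes = [
    ((9, "single-parent", "no"), "group A"),
    ((13, "two-parent", None), "group B"),
]
weights = (1.0, 1.0, 1.0)
print(classify((12, "two-parent", None), prototypes, weights))
```

    Because the similarity is not symmetric, `similarity(a, b, weights)` and `similarity(b, a, weights)` can differ, so the query is always passed as the first argument.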

    A survey on pre-processing techniques: relevant issues in the context of environmental data mining

    One of the important issues related to all types of data analysis, whether statistical data analysis, machine learning, data mining, data science or any other form of data-driven modeling, is data quality. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. Models built on incorrect or incomplete data will be useless. As a consequence, the quality of decisions based on these models also depends on data quality. This is why pre-processing is one of the most critical steps of data analysis in any of its forms. However, pre-processing has not yet been properly systematized, and little research focuses on it. This paper presents a survey of the most popular pre-processing steps required in environmental data analysis, together with a proposal to systematize them. Rather than providing technical details on specific pre-processing techniques, the paper focuses on providing general ideas to a non-expert user, who, after reading them, can decide which technique is most suitable for his or her problem.

    LACE: Supporting Privacy-Preserving Data Sharing in Transfer Defect Learning

    Cross Project Defect Prediction (CPDP) is a field of study where an organization lacking enough local data can use data from other organizations or projects to build defect predictors. Research in CPDP has shown challenges in using "other" data, so transfer defect learning has emerged to improve the quality of CPDP results. With this newfound success in CPDP, it is now increasingly important to focus on the privacy concerns of data owners. To support CPDP, data must be shared. There are many privacy threats that inhibit data sharing. We focus on sensitive attribute disclosure threats or attacks, where an attacker seeks to associate one or more records in a data set with their sensitive information. Solutions to this sharing problem come from the field of Privacy Preserving Data Publishing (PPDP), which has emerged as a means to confuse the efforts of sensitive attribute disclosure attacks and therefore reduce privacy concerns. PPDP covers methods and tools used to disguise raw data for publishing. However, prior work warned that increasing data privacy decreases the efficacy of data mining on privatized data. The goal of this research is to encourage organizations and individuals to share their data publicly and/or with each other for research purposes and/or to improve the quality of their software products through defect prediction. The contributions of this work offer three benefits to data owners willing to share privatized data: 1) they are fully aware of the sensitive attribute disclosure risks involved, so they can make an informed decision about what to share, 2) they are able to privatize their data and have it remain useful, and 3) they are able to work with others to share their data based on what they learn from each other's data. We call this private multiparty data sharing. To achieve these benefits, this dissertation presents LACE (Large-scale Assurance of Confidentiality Environment). 
LACE incorporates a privacy metric called IPR (Increased Privacy Ratio), which calculates the risk of sensitive attribute disclosure by comparing the results of queries (attacks) on the original data and on a privatized version of that data. LACE also includes a privacy algorithm that uses intelligent instance selection to prune the data to as low as 10% of the original data (thus offering complete privacy to the other 90%). It then mutates the remaining data so that over 70% of sensitive attribute disclosure attacks can be unsuccessful. Finally, LACE can facilitate private multiparty data sharing via a unique leader-follower algorithm (developed for this dissertation). The algorithm allows data owners to build a privatized data set serially, by allowing them to contribute only data that are not already in the private cache. In this scenario, each data owner shares even less of their data, some as low as 2%. The experiments of this thesis lead to the following conclusion: at least for the defect data studied here, data can be minimized, privatized and shared without a significant degradation in utility. Specifically, in comparative studies with standard privacy models (k-anonymity and data swapping), applied to 10 open-source data sets and 3 proprietary data sets, LACE produces privatized data sets that are significantly smaller than the original data (as low as 2%). As a result, LACE offers better protection against sensitive attribute disclosure attacks than other methods.
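
    The idea behind a query-based privacy metric such as IPR can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's definition: it assumes that a query is a set of attribute-value conditions, that the attacker guesses the most common sensitive value among the matching rows, and that an attack succeeds when the guess from the privatized data equals the guess from the original data. The function names, attribute names and toy rows are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

Row = Dict[str, object]
Query = Dict[str, object]   # attribute -> required value


def guess_sensitive(table: List[Row], query: Query, sensitive: str) -> Optional[object]:
    """Most common sensitive value among rows matching the query, or None."""
    matches = [row[sensitive] for row in table
               if all(row.get(attr) == value for attr, value in query.items())]
    if not matches:
        return None
    return Counter(matches).most_common(1)[0][0]


def increased_privacy_ratio(original: List[Row], privatized: List[Row],
                            queries: List[Query], sensitive: str) -> float:
    """Fraction of queries for which the privatized data does NOT reveal
    the same sensitive value as the original data (assumed definition)."""
    if not queries:
        raise ValueError("at least one query is required")
    breaches = 0
    for query in queries:
        truth = guess_sensitive(original, query, sensitive)
        guess = guess_sensitive(privatized, query, sensitive)
        if truth is not None and guess == truth:
            breaches += 1
    return 1.0 - breaches / len(queries)


# Made-up toy data: 'defects' plays the role of the sensitive attribute.
original = [
    {"size": "large", "complexity": "high", "defects": "many"},
    {"size": "large", "complexity": "low", "defects": "few"},
    {"size": "small", "complexity": "low", "defects": "none"},
]
privatized = [
    {"size": "large", "complexity": "high", "defects": "many"},
    {"size": "small", "complexity": "low", "defects": "few"},
]
queries = [
    {"size": "large", "complexity": "high"},
    {"size": "small"},
    {"size": "large", "complexity": "low"},
    {"size": "small", "complexity": "low"},
]
print(increased_privacy_ratio(original, privatized, queries, "defects"))
```

    In this toy case only the first query returns the same sensitive value on both tables, so three of the four simulated attacks fail.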

    Advances in Data Mining Knowledge Discovery and Applications

    Advances in Data Mining Knowledge Discovery and Applications aims to help data miners, researchers, scholars, and PhD students who wish to apply data mining techniques. The primary contribution of this book is to highlight frontier fields and implementations of knowledge discovery and data mining. Some material may appear to repeat earlier work, but in general the same approaches and techniques can help in different fields and areas of expertise. This book presents knowledge discovery and data mining applications in two sections. Data mining is known to cover statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other areas, and most of these areas are covered in this book through different data mining applications. The eighteen chapters are classified into two parts: Knowledge Discovery and Data Mining Applications.

    Artificial neural network models: data selection and online adaptation

    Energy consumption has been increasing steadily due to globalization and industrialization. Studies have shown that buildings account for the largest share of energy consumption; for example, in European Union countries, energy consumption in buildings represents around 40% of total energy consumption. This PhD therefore aimed to manage the energy consumed by Heating, Ventilating and Air Conditioning (HVAC) systems in buildings using the Model Predictive Control (MPC) technique. To achieve this goal, artificial intelligence models such as neural networks and Support Vector Machines (SVM) have been proposed because of their high potential for performing accurate nonlinear mappings between inputs and outputs in real environments, which are not noise-free. In this PhD, Radial Basis Function Neural Networks (RBFNN), a promising class of Artificial Neural Networks (ANN), were considered for modelling a sequence of time series processes, and the RBFNN models were built using a Multi Objective Genetic Algorithm (MOGA) as a design platform. Regarding the design of such models, two main challenges were tackled: data selection and model adaptation. Since RBFNNs are data-driven models, their performance relies, to a good extent, on selecting proper data throughout the design phase, covering the whole input-output range in which they will be employed. Convex hull algorithms can be applied as data selection methods; however, conventional implementations of these methods are not feasible in high dimensions because of their high complexity. In the first phase of this PhD, a new randomized approximation convex hull algorithm called ApproxHull was proposed for high dimensions, so that it can be used with an acceptable execution time and low memory requirements. 
Simulation results showed that applying ApproxHull as a filter data selection method (i.e., an unsupervised data selection method) could improve the performance of classification and regression models in comparison with random data selection. In addition, ApproxHull was employed in real applications in three case studies. The first two concerned predictive models for energy saving. The last case study concerned the segmentation of lesion areas in brain Computed Tomography (CT) images. The evaluation results showed that applying ApproxHull in MOGA could result in models with an acceptable level of accuracy. Specifically, the results obtained from the third case study demonstrated that ApproxHull can be applied to large data sets in high dimensions. Besides random selection, ApproxHull was also compared with an entropy-based unsupervised data selection method and a hybrid method combining ApproxHull and the entropy-based method. Based on the simulation results, in most cases ApproxHull and the hybrid method achieved better performance than the others. In the second phase of this PhD, a new convex-hull-based sliding window online adaptation method was proposed. The goal was to update the offline predictive RBFNN models used in the HVAC MPC technique, where these models are applied to processes in which the data input-output range changes over time. The idea behind the proposed method is to detect, at each time instant, a newly arriving point that reflects a new range of data by comparing the point with the current convex hull computed by ApproxHull. In this situation, the underlying model's parameters are updated based on the new point and a sliding window of some past points. 
The simulation results showed that the proposed method not only could efficiently update the model while keeping a good level of accuracy, but was also comparable with other methods. Due to the processes of industrialization and globalization, energy consumption has increased continuously. Research on consumption shows that buildings consume the largest share of energy. For example, in European Union countries this share corresponds to about 40% of all energy consumed. This doctoral thesis therefore has the practical goal of contributing to better management of the energy consumed by Heating, Ventilating and Air Conditioning (HVAC) systems in buildings, within a model-based predictive control strategy. In this context, models based on artificial neural networks and support vector machines, to mention just a few, have already been proposed. These techniques have a great capacity for modelling nonlinear relationships between system inputs and outputs, and are applicable in operating environments, which are known to be subject to various forms of noise. This thesis considered radial basis function neural networks, an established technique in time series modelling. A tool based on a multi-objective genetic algorithm was used to design these networks. Regarding the design process of these models, this thesis addresses two less-studied aspects: data selection and online model adaptation. Since artificial neural networks are data-driven models, their performance depends to a good extent on the availability of appropriate data that are representative of the system or process and that cover the whole range of values generated by its input/output representation. Algorithms that determine the geometric figure enclosing all the data, called convex hull algorithms, can be applied to the data selection task. 
However, using conventional implementations of these algorithms on high-dimensional problems is not feasible in practice. In the first phase of this work, a new randomized convex hull approximation method, named ApproxHull, was proposed; it is suited to high-dimensional data sets and is therefore viable for practical applications. The experimental results showed that applying ApproxHull as a filter-type (that is, unsupervised) data selection method can improve model performance in classification and regression problems when compared with random data selection. ApproxHull was also applied in three case studies on real applications. The first two concerned the development of predictive models for systems in the area of energy efficiency. The third case study consisted of developing classification models for lesion segmentation in computed tomography images. The results revealed that applying the proposed method produced models with acceptable accuracy. Regarding the applicability of the method, the results showed that ApproxHull can be used on large data sets with high-dimensional data. In addition to the comparison with random data selection, the method was also compared with a data selection method based on the concept of entropy and with a hybrid method that combines ApproxHull with the entropy-based method. The experimental results showed that, in most of the cases studied, the hybrid method achieved better performance than the others. In the second phase of the work, a new online adaptation method was proposed, based on the ApproxHull algorithm and on a sliding time window. 
Since the processes and systems surrounding the HVAC system are time-varying and dynamic, the goal was to apply the proposed method to adapt online the models that were first obtained offline. The basic idea of the proposed method is to compare each new input/output pair with the known convex hull and to determine whether the new pair contains data lying outside the known range. In that situation, the model parameters are updated based on this new point and on a set of points in a given sliding time window. The experimental results demonstrated not only that the new method is efficient in updating the models and in keeping them at a good level of accuracy, but also that it is comparable with other existing methods.
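
    The convex-hull-based sliding window idea can be illustrated with a small sketch. It is a toy under stated assumptions: it uses an exact two-dimensional hull (Andrew's monotone chain) in place of ApproxHull, whose details are not given in the abstract, and the class `SlidingWindowAdapter` with its placeholder `update_model` method is hypothetical. A new point triggers an update only when it falls outside the hull of the data seen so far.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Iterable[Point]) -> List[Point]:
    """Exact 2D hull (monotone chain), vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    upper: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull: List[Point], p: Point) -> bool:
    """True if p is inside or on the boundary of a counter-clockwise hull.

    Degenerate hulls (fewer than 3 vertices) only match their own vertices.
    """
    if len(hull) < 3:
        return p in hull
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


class SlidingWindowAdapter:
    """Hypothetical wrapper: update a model only when the data range grows."""

    def __init__(self, training_points: Iterable[Point], window_size: int) -> None:
        self.hull = convex_hull(training_points)
        self.window: Deque[Point] = deque(maxlen=window_size)
        self.last_update: List[Point] = []

    def observe(self, point: Point) -> bool:
        """Return True when the point lies outside the hull and triggers an update."""
        self.window.append(point)
        if inside_hull(self.hull, point):
            return False
        self.hull = convex_hull(self.hull + [point])
        self.update_model(list(self.window))
        return True

    def update_model(self, recent_points: List[Point]) -> None:
        """Placeholder: a real system would re-estimate model parameters here."""
        self.last_update = recent_points


adapter = SlidingWindowAdapter([(0, 0), (1, 0), (1, 1), (0, 1)], window_size=3)
for point in [(0.5, 0.5), (2.0, 2.0), (1.5, 1.5)]:
    print(point, adapter.observe(point))
```

    Only the second point lies outside the initial unit square, so it is the only one that enlarges the hull and calls `update_model`; the third point already falls inside the enlarged hull.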