79 research outputs found

    Multi-Objective Genetic Algorithm for Multi-View Feature Selection

    Full text link
    Multi-view datasets offer diverse forms of data that can enhance prediction models by providing complementary information. However, the use of multi-view data leads to an increase in high-dimensional data, which poses significant challenges for the prediction models that can lead to poor generalization. Therefore, relevant feature selection from multi-view datasets is important as it not only addresses the poor generalization but also enhances the interpretability of the models. Despite the success of traditional feature selection methods, they have limitations in leveraging intrinsic information across modalities, lacking generalizability, and being tailored to specific classification tasks. We propose a novel genetic algorithm strategy to overcome these limitations of traditional feature selection methods for multi-view data. Our proposed approach, called the multi-view multi-objective feature selection genetic algorithm (MMFS-GA), simultaneously selects the optimal subset of features within a view and between views under a unified framework. The MMFS-GA framework demonstrates superior performance and interpretability for feature selection on multi-view datasets in both binary and multiclass classification tasks. The results of our evaluations on three benchmark datasets, including synthetic and real data, show improvement over the best baseline methods. This work provides a promising solution for multi-view feature selection and opens up new possibilities for further research in multi-view datasets

    Evolutionary approaches for feature selection in biological data

    Get PDF
    Data mining techniques have been used widely in many areas such as business, science, engineering and medicine. The techniques allow a vast amount of data to be explored in order to extract useful information from the data. One of the foci in the health area is finding interesting biomarkers from biomedical data. Mass throughput data generated from microarrays and mass spectrometry from biological samples are high dimensional and is small in sample size. Examples include DNA microarray datasets with up to 500,000 genes and mass spectrometry data with 300,000 m/z values. While the availability of such datasets can aid in the development of techniques/drugs to improve diagnosis and treatment of diseases, a major challenge involves its analysis to extract useful and meaningful information. The aims of this project are: 1) to investigate and develop feature selection algorithms that incorporate various evolutionary strategies, 2) using the developed algorithms to find the “most relevant” biomarkers contained in biological datasets and 3) and evaluate the goodness of extracted feature subsets for relevance (examined in terms of existing biomedical domain knowledge and from classification accuracy obtained using different classifiers). The project aims to generate good predictive models for classifying diseased samples from control

    Classication and Clustering Using Intelligent Techniques: Application to Microarray Cancer Data

    Get PDF
    Analysis and interpretation of DNA Microarray data is a fundamental task in bioinformatics. Feature Extraction plays a critical role in better performance of the classifier. We address the dimension reduction of DNA features in which relevant features are extracted among thousands of irrelevant ones through dimensionality reduction. This enhances the speed and accuracy of the classifiers. Principal Component Analysis is a technique used for feature extraction which helps to retrieve intrinsic information from high dimensional data in eigen spaces to solve the curse of dimensionality problem. Neural Networks and Support Vector Machine are implemented on reduced data set and their performances are measured in terms of predictive accuracy, specificity, and sensitivity. Next, we propose a Multiobjective Genetic Algorithm-based fuzzy clustering technique using real coded encoding of cluster centers for clustering and classification. This technique is implemented on microarray cancer data to select training data using multiobjective genetic algorithm with non-dominated sorting. The two objective functions for this multiobjective techniques are optimization of cluster compactness as well as separation simultaneously. This approach identifies the solution. Support Vector Machine classifier is further trained by the selected training points which have high confidence value. Then remaining points are classified by trained SVM classifier. Finally, the four clustering label vectors through majority voting ensemble are combined. The performance of the proposed MOGA-SVM, classification and clustering method has been compared to MOGA-BP, SVM, BP. The performance are measured in terms of Silhoutte Index, ARI Index respectively. The experiment were carried on three public domain cancer data sets, viz., Ovarian, Colon and Leukemia cancer

    Efficient Learning Machines

    Get PDF
    Computer scienc

    Otimização multi-objetivo em aprendizado de máquina

    Get PDF
    Orientador: Fernando José Von ZubenTese (doutorado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de ComputaçãoResumo: Regressão logística multinomial regularizada, classificação multi-rótulo e aprendizado multi-tarefa são exemplos de problemas de aprendizado de máquina em que objetivos conflitantes, como funções de perda e penalidades que promovem regularização, devem ser simultaneamente minimizadas. Portanto, a perspectiva simplista de procurar o modelo de aprendizado com o melhor desempenho deve ser substituída pela proposição e subsequente exploração de múltiplos modelos de aprendizado eficientes, cada um caracterizado por um compromisso (trade-off) distinto entre os objetivos conflitantes. Comitês de máquinas e preferências a posteriori do tomador de decisão podem ser implementadas visando explorar adequadamente este conjunto diverso de modelos de aprendizado eficientes, em busca de melhoria de desempenho. A estrutura conceitual multi-objetivo para aprendizado de máquina é suportada por três etapas: (1) Modelagem multi-objetivo de cada problema de aprendizado, destacando explicitamente os objetivos conflitantes envolvidos; (2) Dada a formulação multi-objetivo do problema de aprendizado, por exemplo, considerando funções de perda e termos de penalização como objetivos conflitantes, soluções eficientes e bem distribuídas ao longo da fronteira de Pareto são obtidas por um solver determinístico e exato denominado NISE (do inglês Non-Inferior Set Estimation); (3) Esses modelos de aprendizado eficientes são então submetidos a um processo de seleção de modelos que opera com preferências a posteriori, ou a filtragem e agregação para a síntese de ensembles. Como o NISE é restrito a problemas de dois objetivos, uma extensão do NISE capaz de lidar com mais de dois objetivos, denominada MONISE (do inglês Many-Objective NISE), também é proposta aqui, sendo uma contribuição adicional que expande a aplicabilidade da estrutura conceitual proposta. Para atestar adequadamente o mérito da nossa abordagem multi-objetivo, foram realizadas investigações mais específicas, restritas à aprendizagem de modelos lineares regularizados: (1) Qual é o mérito relativo da seleção a posteriori de um único modelo de aprendizado, entre os produzidos pela nossa proposta, quando comparado com outras abordagens de modelo único na literatura? (2) O nível de diversidade dos modelos de aprendizado produzidos pela nossa proposta é superior àquele alcançado por abordagens alternativas dedicadas à geração de múltiplos modelos de aprendizado? (3) E quanto à qualidade de predição da filtragem e agregação dos modelos de aprendizado produzidos pela nossa proposta quando aplicados a: (i) classificação multi-classe, (ii) classificação desbalanceada, (iii) classificação multi-rótulo, (iv) aprendizado multi-tarefa, (v) aprendizado com multiplos conjuntos de atributos? A natureza determinística de NISE e MONISE, sua capacidade de lidar adequadamente com a forma da fronteira de Pareto em cada problema de aprendizado, e a garantia de sempre obter modelos de aprendizado eficientes são aqui pleiteados como responsáveis pelos resultados promissores alcançados em todas essas três frentes de investigação específicasAbstract: Regularized multinomial logistic regression, multi-label classification, and multi-task learning are examples of machine learning problems in which conflicting objectives, such as losses and regularization penalties, should be simultaneously minimized. Therefore, the narrow perspective of looking for the learning model with the best performance should be replaced by the proposition and further exploration of multiple efficient learning models, each one characterized by a distinct trade-off among the conflicting objectives. Committee machines and a posteriori preferences of the decision-maker may be implemented to properly explore this diverse set of efficient learning models toward performance improvement. The whole multi-objective framework for machine learning is supported by three stages: (1) The multi-objective modelling of each learning problem, explicitly highlighting the conflicting objectives involved; (2) Given the multi-objective formulation of the learning problem, for instance, considering loss functions and penalty terms as conflicting objective functions, efficient solutions well-distributed along the Pareto front are obtained by a deterministic and exact solver named NISE (Non-Inferior Set Estimation); (3) Those efficient learning models are then subject to a posteriori model selection, or to ensemble filtering and aggregation. Given that NISE is restricted to two objective functions, an extension for many objectives, named MONISE (Many Objective NISE), is also proposed here, being an additional contribution and expanding the applicability of the proposed framework. To properly access the merit of our multi-objective approach, more specific investigations were conducted, restricted to regularized linear learning models: (1) What is the relative merit of the a posteriori selection of a single learning model, among the ones produced by our proposal, when compared with other single-model approaches in the literature? (2) Is the diversity level of the learning models produced by our proposal higher than the diversity level achieved by alternative approaches devoted to generating multiple learning models? (3) What about the prediction quality of ensemble filtering and aggregation of the learning models produced by our proposal on: (i) multi-class classification, (ii) unbalanced classification, (iii) multi-label classification, (iv) multi-task learning, (v) multi-view learning? The deterministic nature of NISE and MONISE, their ability to properly deal with the shape of the Pareto front in each learning problem, and the guarantee of always obtaining efficient learning models are advocated here as being responsible for the promising results achieved in all those three specific investigationsDoutoradoEngenharia de ComputaçãoDoutor em Engenharia Elétrica2014/13533-0FAPES

    Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect

    Full text link
    Data Science and Machine Learning have become fundamental assets for companies and research institutions alike. As one of its fields, supervised classification allows for class prediction of new samples, learning from given training data. However, some properties can cause datasets to be problematic to classify. In order to evaluate a dataset a priori, data complexity metrics have been used extensively. They provide information regarding different intrinsic characteristics of the data, which serve to evaluate classifier compatibility and a course of action that improves performance. However, most complexity metrics focus on just one characteristic of the data, which can be insufficient to properly evaluate the dataset towards the classifiers' performance. In fact, class overlap, a very detrimental feature for the classification process (especially when imbalance among class labels is also present) is hard to assess. This research work focuses on revisiting complexity metrics based on data morphology. In accordance to their nature, the premise is that they provide both good estimates for class overlap, and great correlations with the classification performance. For that purpose, a novel family of metrics have been developed. Being based on ball coverage by classes, they are named after Overlap Number of Balls. Finally, some prospects for the adaptation of the former family of metrics to singular (more complex) problems are discussed.Comment: 23 pages, 9 figures, preprin

    Genetic Programming is Naturally Suited to Evolve Bagging Ensembles

    Get PDF
    Learning ensembles by bagging can substantially improve the generalization performance of low-bias, high-variance estimators, including those evolved by Genetic Programming (GP). To be efficient, modern GP algorithms for evolving (bagging) ensembles typically rely on several (often inter-connected) mechanisms and respective hyper-parameters, ultimately compromising ease of use. In this paper, we provide experimental evidence that such complexity might not be warranted. We show that minor changes to fitness evaluation and selection are sufficient to make a simple and otherwise-traditional GP algorithm evolve ensembles efficiently. The key to our proposal is to exploit the way bagging works to compute, for each individual in the population, multiple fitness values (instead of one) at a cost that is only marginally higher than the one of a normal fitness evaluation. Experimental comparisons on classification and regression tasks taken and reproduced from prior studies show that our algorithm fares very well against state-of-the-art ensemble and non-ensemble GP algorithms. We further provide insights into the proposed approach by (i) scaling the ensemble size, (ii) ablating the changes to selection, (iii) observing the evolvability induced by traditional subtree variation. Code: https://github.com/marcovirgolin/2SEGP.Comment: Added interquartile range in tables 1, 2, and 3; improved Fig. 3 and its analysis, improved experiment design of section 7.

    New Approaches to Mapping Forest Conditions and Landscape Change from Moderate Resolution Remote Sensing Data across the Species-Rich and Structurally Diverse Atlantic Northern Forest of Northeastern North America

    Get PDF
    The sustainable management of forest landscapes requires an understanding of the functional relationships between management practices, changes in landscape conditions, and ecological response. This presents a substantial need of spatial information in support of both applied research and adaptive management. Satellite remote sensing has the potential to address much of this need, but forest conditions and patterns of change remain difficult to synthesize over large areas and long time periods. Compounding this problem is error in forest attribute maps and consequent uncertainty in subsequent analyses. The research described in this document is directed at these long-standing problems. Chapter 1 demonstrates a generalizable approach to the characterization of predominant patterns of forest landscape change. Within a ~1.5 Mha northwest Maine study area, a time series of satellite-derived forest harvest maps (1973-2010) served as the basis grouping landscape units according to time series of cumulative harvest area. Different groups reflected different harvest histories, which were linked to changes in landscape composition and configuration through time series of selected landscape metrics. Time series data resolved differences in landscape change attributable to passage of the Maine Forest Practices Act, a major change in forest policy. Our approach should be of value in supporting empirical landscape research. Perhaps the single most important source of uncertainty in the characterization of landscape conditions is over- or under-representation of class prevalence caused by prediction bias. Systematic error is similarly impactful in maps of continuous forest attributes, where regression dilution or attenuation bias causes the overestimation of low values and underestimation of high values. In both cases, patterns of error tend to produce more homogeneous characterizations of landscape conditions. Chapters 2 and 3 present a machine learning method designed to simultaneously reduce systematic and total error in continuous and categorical maps, respectively. By training support vector machines with a multi-objective genetic algorithm, attenuation bias was substantially reduced in regression models of tree species relative abundance (chapter 2), and prediction bias was effectively removed from classification models predicting tree species occurrence and forest disturbance (chapter 3). This approach is generalizable to other prediction problems, other regions, or other geospatial disciplines

    A Multi Agent System for Flow-Based Intrusion Detection

    Get PDF
    The detection and elimination of threats to cyber security is essential for system functionality, protection of valuable information, and preventing costly destruction of assets. This thesis presents a Mobile Multi-Agent Flow-Based IDS called MFIREv3 that provides network anomaly detection of intrusions and automated defense. This version of the MFIRE system includes the development and testing of a Multi-Objective Evolutionary Algorithm (MOEA) for feature selection that provides agents with the optimal set of features for classifying the state of the network. Feature selection provides separable data points for the selected attacks: Worm, Distributed Denial of Service, Man-in-the-Middle, Scan, and Trojan. This investigation develops three techniques of self-organization for multiple distributed agents in an intrusion detection system: Reputation, Stochastic, and Maximum Cover. These three movement models are tested for effectiveness in locating good agent vantage points within the network to classify the state of the network. MFIREv3 also introduces the design of defensive measures to limit the effects of network attacks. Defensive measures included in this research are rate-limiting and elimination of infected nodes. The results of this research provide an optimistic outlook for flow-based multi-agent systems for cyber security. The impact of this research illustrates how feature selection in cooperation with movement models for multi agent systems provides excellent attack detection and classification
    corecore