11 research outputs found

    A survey of feature selection in Internet traffic characterization

    In the last decade, the research community has focused on new classification methods that rely on statistical characteristics of Internet traffic, instead of the previously popular port-number-based or payload-based methods, which face ever tighter constraints. Some research works based on statistical characteristics generated large feature sets of Internet traffic; however, it is nowadays impossible to handle hundreds of features in big data scenarios, as this only leads to unacceptable processing times and misleading classification results due to redundant and correlated data. As a consequence, a feature selection procedure is essential in the process of Internet traffic characterization. In this paper a survey of feature selection methods is presented: feature selection frameworks are introduced, and different categories of methods are briefly explained and compared; several proposals on feature selection in Internet traffic characterization are shown; finally, the future application of feature selection to a concrete project is proposed.
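
    As a rough illustration of one of the method categories such a survey covers, the sketch below ranks features with a filter-style criterion (mutual information) and keeps the top k. The synthetic data and the choice of k are assumptions for illustration only, not taken from the paper.

        # Minimal filter-style feature selection sketch (illustrative only).
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import mutual_info_classif

        X, y = make_classification(n_samples=500, n_features=40, n_informative=5,
                                   random_state=0)

        # Score every feature independently of any classifier (filter approach) ...
        scores = mutual_info_classif(X, y, random_state=0)

        # ... and keep the k highest-scoring features.
        k = 5
        selected = np.argsort(scores)[::-1][:k]
        print("selected feature indices:", selected)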

    Quantum-inspired feature and parameter optimization of evolving spiking neural networks with a case study from ecological modelling

    The paper introduces a framework and implementation of an integrated connectionist system, where the features and the parameters of an evolving spiking neural network are optimised together using a quantum representation of the features and a quantum-inspired evolutionary algorithm for optimisation. The proposed model is applied to an ecological data modelling problem, demonstrating significantly better classification accuracy than traditional neural network approaches and a more appropriate feature subset selected from a larger initial number of features. Results are compared to a naive Bayesian classifier.
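
    To make the quantum representation concrete, the sketch below treats each feature as a "qubit" whose squared amplitude gives its probability of being selected, collapses the qubits to candidate subsets, and rotates the angles toward the best subset found so far. The toy fitness (matching a hidden target mask) and all parameter values are assumptions standing in for the eSNN accuracy used in the paper.

        # Quantum-inspired feature selection sketch (illustrative only).
        import numpy as np

        rng = np.random.default_rng(0)
        n_features, pop_size, generations = 20, 10, 50
        target = rng.integers(0, 2, n_features)        # hypothetical "ideal" subset

        theta = np.full(n_features, np.pi / 4)          # P(select) = sin^2(theta) = 0.5
        best_bits, best_fit = None, -1

        for _ in range(generations):
            probs = np.sin(theta) ** 2
            pop = (rng.random((pop_size, n_features)) < probs).astype(int)  # collapse
            fits = (pop == target).sum(axis=1)                              # toy fitness
            g = fits.argmax()
            if fits[g] > best_fit:
                best_fit, best_bits = fits[g], pop[g].copy()
            # rotate each qubit a small step toward the best bit string so far
            delta = 0.05 * np.where(best_bits == 1, 1.0, -1.0)
            theta = np.clip(theta + delta, 0.01, np.pi / 2 - 0.01)

        print("best subset:", best_bits, "fitness:", best_fit)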

    PBIL AutoEns: uma Ferramenta de Aprendizado de Máquina Automatizado Integrada à Plataforma Weka / PBIL AutoEns: an Automated Machine Learning Tool integrated to the Weka ML Platform

    Machine Learning (ML) has become popular in recent years as an efficient approach to problem solving. There are currently hundreds of classification methods, for example, which makes it practically impossible to analyse every possible result, since in addition to the many methods there are many configurations for each of them. This problem gave rise to the concept of Automated Machine Learning (AutoML), a technique that searches among many candidate solutions for the best possible one for a given problem, without the need for human intervention. This work presents PBIL AutoEns, an AutoML tool that uses the API of the WEKA platform to search for solutions (models) within a large set of possibilities. PBIL AutoEns was compared with Random Forest and XGBoost (ensemble methods) and MLP (a base classifier). In this comparison, we used a robust measure of predictive accuracy (the F-measure) to analyse the classification performance of all four methods on 21 datasets.
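
    The core of the search is Population-Based Incremental Learning (PBIL). The sketch below shows the generic PBIL loop: a probability vector over binary decisions is sampled and nudged toward the best individual each generation. The OneMax-style fitness is a placeholder assumption; in the tool itself the decisions would encode WEKA algorithm and configuration choices evaluated against the data.

        # Generic PBIL loop sketch (placeholder fitness, illustrative only).
        import numpy as np

        rng = np.random.default_rng(1)
        n_bits, pop_size, generations, lr = 30, 20, 60, 0.1

        def fitness(bits):
            return bits.sum()      # stands in for cross-validated model quality

        p = np.full(n_bits, 0.5)   # PBIL probability vector
        for _ in range(generations):
            pop = (rng.random((pop_size, n_bits)) < p).astype(int)
            best = pop[np.argmax([fitness(ind) for ind in pop])]
            p = (1 - lr) * p + lr * best     # shift probabilities toward the best
            p = np.clip(p, 0.02, 0.98)       # keep a little exploration

        print("learned probabilities:", np.round(p, 2))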

    The problem of variable selection for financial distress: applying GRASP metaheuristics

    We use the GRASP procedure to select a subset of financial ratios that are then used to estimate a logistic regression model to anticipate financial distress in a sample of Spanish firms. The algorithm we suggest is designed ad hoc for this type of variable. Reducing dimensionality has several advantages, such as reducing the cost of data acquisition, better understanding of the final classification model, and increasing efficiency and efficacy. The application of the GRASP procedure to preselect a reduced subset of financial ratios generated better results than those obtained by applying a logistic regression model directly to the set of 141 original financial ratios. Keywords: Genetic algorithms, Financial distress, Failure, Financial ratios, Variable selection, GRASP, Metaheuristic.
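
    In the same spirit, the sketch below shows a GRASP-style variable selection loop: a greedy randomized construction over a restricted candidate list followed by a simple drop-one local search, scored with a cross-validated logistic regression. The synthetic data and all parameter values are illustrative assumptions, not the paper's design.

        # GRASP-style variable selection sketch (illustrative only).
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                                   random_state=0)

        def score(subset):
            if not subset:
                return 0.0
            model = LogisticRegression(max_iter=1000)
            return cross_val_score(model, X[:, sorted(subset)], y, cv=3).mean()

        def construct(alpha=0.3, target_size=4):
            subset = set()
            while len(subset) < target_size:
                cand = [(f, score(subset | {f}))
                        for f in range(X.shape[1]) if f not in subset]
                vals = np.array([s for _, s in cand])
                threshold = vals.max() - alpha * (vals.max() - vals.min())
                rcl = [f for f, s in cand if s >= threshold]   # restricted candidate list
                subset.add(int(rng.choice(rcl)))
            return subset

        def local_search(subset):
            best, best_s = set(subset), score(subset)
            for f in list(subset):                  # try dropping each variable
                trial = best - {f}
                s = score(trial)
                if s >= best_s:
                    best, best_s = trial, s
            return best, best_s

        best_subset, best_score = max((local_search(construct()) for _ in range(3)),
                                      key=lambda t: t[1])
        print(sorted(best_subset), round(best_score, 3))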

    Robot Trajectories Comparison: A Statistical Approach

    The task of planning a collision-free trajectory from a start to a goal position is fundamental for an autonomous mobile robot. Although path planning has been extensively investigated since the beginning of robotics, there is no agreement on how to measure the performance of a motion algorithm. This paper presents a new approach to robot trajectory comparison that can be applied to any kind of trajectory, in both simulated and real environments. Given an initial set of features, it automatically selects the most significant ones and performs a statistical comparison using them. Additionally, a graphical data visualization named polygraph is provided to help interpret the obtained results. The proposed method has been applied, as an example, to compare two different motion planners, FM2 and WaveFront, using different environments, robots, and local planners.
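
    As a rough picture of this kind of feature-based statistical comparison, the sketch below reduces two sets of (x, y) trajectories to per-trajectory features and compares each feature with a non-parametric test. The randomly generated trajectories, the feature choices and the use of the Mann-Whitney U test are assumptions for illustration, not the paper's exact procedure.

        # Feature-based trajectory comparison sketch (illustrative only).
        import numpy as np
        from scipy.stats import mannwhitneyu

        rng = np.random.default_rng(0)

        def random_trajectory(noise):
            t = np.linspace(0, 1, 100)
            base = np.c_[t, t]                    # straight line from (0, 0) to (1, 1)
            return base + rng.normal(0, noise, base.shape)

        def features(traj):
            steps = np.diff(traj, axis=0)
            length = np.linalg.norm(steps, axis=1).sum()
            headings = np.arctan2(steps[:, 1], steps[:, 0])
            smoothness = np.abs(np.diff(headings)).mean()
            return {"length": length, "smoothness": smoothness}

        planner_a = [features(random_trajectory(0.005)) for _ in range(30)]
        planner_b = [features(random_trajectory(0.02)) for _ in range(30)]

        for name in ("length", "smoothness"):
            stat, p = mannwhitneyu([f[name] for f in planner_a],
                                   [f[name] for f in planner_b])
            print(f"{name}: U={stat:.1f}, p={p:.4f}")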

    TEDA: A Targeted Estimation of Distribution Algorithm

    This thesis discusses the development and performance of a novel evolutionary algorithm, the Targeted Estimation of Distribution Algorithm (TEDA). TEDA takes the concept of targeting, an idea that has previously been shown to be effective as part of a Genetic Algorithm (GA) called Fitness Directed Crossover (FDC), and introduces it into a novel hybrid algorithm that transitions from a GA to an Estimation of Distribution Algorithm (EDA). Targeting is a process for solving optimisation problems where there is a concept of control points, genes that can be said to be active, and where the total number of control points found within a solution is as important as where they are located. When generating a new solution, an algorithm that uses targeting must first choose the number of control points to set in the new solution before choosing which to set. The hybrid approach is designed to take advantage of the ability of EDAs to exploit patterns within the population to effectively locate the global optimum while avoiding the tendency of EDAs to prematurely converge. This is achieved by initially using a GA to effectively explore the search space before transitioning into an EDA as the population converges on the region of the global optimum. As targeting places an extra restriction on the solutions produced by specifying their size, combining it with the hybrid approach allows TEDA to produce solutions that are of an optimal size and of a higher quality than would be found using a GA alone, without risking a loss of diversity. TEDA is tested on three different problem domains: optimal control of cancer chemotherapy, network routing, and Feature Subset Selection (FSS). Of these problems, TEDA showed a consistent advantage over standard EAs in the routing problem and demonstrated that it is able to find good solutions faster than untargeted EAs and non-evolutionary approaches on the FSS problem. It did not demonstrate any advantage over other approaches when applied to chemotherapy. The FSS domain demonstrated that in large and noisy problems TEDA's targeting-derived ability to reduce the size of the search space significantly increased the speed with which good solutions could be found. The routing domain demonstrated that, where the ideal number of control points is deceptive, both targeting and the exploitative capabilities of an EDA are needed, making TEDA a more effective approach than both untargeted approaches and FDC. Additionally, in none of the problems was TEDA seen to perform significantly worse than any alternative approach.
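
    A loose sketch of the targeting step described above, not TEDA itself: the offspring's number of active control points is chosen first (here, drawn between the two parents' counts), and only then are the specific points chosen, preferring points that are active in the fitter parent. Everything beyond that two-step structure is a simplifying assumption.

        # Targeted offspring generation sketch (illustrative only).
        import numpy as np

        rng = np.random.default_rng(0)

        def targeted_offspring(parent_a, parent_b, fitness_a, fitness_b):
            lo, hi = sorted((parent_a.sum(), parent_b.sum()))
            n_active = int(rng.integers(lo, hi + 1))     # step 1: how many points to set
            fitter = parent_a if fitness_a >= fitness_b else parent_b
            weights = np.where(fitter == 1, 3.0, 1.0)    # step 2: which points to set
            weights = weights / weights.sum()
            idx = rng.choice(len(fitter), size=n_active, replace=False, p=weights)
            child = np.zeros_like(fitter)
            child[idx] = 1
            return child

        a = rng.integers(0, 2, 20)
        b = rng.integers(0, 2, 20)
        print(targeted_offspring(a, b, fitness_a=0.8, fitness_b=0.5))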

    Machine learning for corporate failure prediction : an empirical study of South African companies

    The research objective of this study was to construct an empirical model for the prediction of corporate failure in South Africa through the application of machine learning techniques using information generally available to investors. The study began with a thorough review of the corporate failure literature, breaking the process of prediction model construction into the following steps: defining corporate failure, sample selection, feature selection, data pre-processing, feature subset selection, classifier construction, and model evaluation. These steps were applied to the construction of a model, using a sample of failed companies that were listed on the JSE Securities Exchange between 1 January 1996 and 30 June 2003. A paired sample of non-failed companies was selected. Pairing was performed on the basis of year of failure, industry and asset size (total assets per the company financial statements, excluding intangible assets). A minimum of two years and a maximum of three years of financial data were collated for each company. Such data was mainly sourced from BFA McGregor RAID Station, although the BFA McGregor Handbook and JSE Handbook were also consulted for certain data items. A total of 75 financial and non-financial ratios were calculated for each year of data collected for every company in the final sample. Two databases of ratios were created: one for all companies with at least two years of data and another for those companies with three years of data. Missing and undefined data items were rectified before all the ratios were normalised. The set of normalised values was then imported into MatLab Version 6 and input into a Population-Based Incremental Learning (PBIL) algorithm. PBIL was then used to identify those subsets of features that best separated the failed and non-failed data clusters for one-, two- and three-year forward forecast periods. Thornton's Separability Index (SI) was used to evaluate the degree of separation achieved by each feature subset.
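
    The separation criterion mentioned at the end, Thornton's Separability Index, is commonly defined as the fraction of points whose nearest neighbour carries the same class label. A minimal sketch of that definition is below; the synthetic data is an assumption standing in for the normalised financial ratios used in the study.

        # Thornton's Separability Index sketch (illustrative data).
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.neighbors import NearestNeighbors

        def separability_index(X, y):
            # two neighbours per point: the point itself plus its nearest other point
            nn = NearestNeighbors(n_neighbors=2).fit(X)
            _, idx = nn.kneighbors(X)
            nearest_other = idx[:, 1]
            return float(np.mean(y[nearest_other] == y))

        X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                                   random_state=0)
        print("SI:", separability_index(X, y))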

    Uma metodologia para classificação de dados nominais baseada no processo KDD

    Pattern classification is a supervised learning problem from the field known as Pattern Recognition (PR), in which the goal is to discriminate data instances into different classes. The solution to this problem is obtained by algorithms (classifiers) that search for patterns of relationships between classes in known cases (training) and use those relationships to classify unknown cases (testing). The predictive accuracy of the algorithms that perform this task depends heavily on the quality and on the types of data contained in the datasets. Aiming to improve data quality and to treat the data types appropriately, this work uses the Knowledge Discovery in Databases (KDD) process, in which classification is one of the tasks of the stage known as Data Mining (DM). The stages applied here before classification are wrapper feature selection and an attribute transformation process based on Geometric Data Analysis (GDA). For feature selection, a new technique is proposed, based on an Estimation of Distribution Algorithm (EDA) and on Cultural Algorithms (CA), named Belief-Based Incremental Learning (BBIL). For attribute transformation, an alternative to classical Principal Component Analysis (PCA) is proposed to deal specifically with nominal data: Multiple Correspondence Analysis (MCA). In the DM stage itself, two traditional PR classifiers are applied: Naïve Bayes and Fisher's Linear Discriminant (Linear Discriminant Analysis; LDA). Supported by theoretical arguments and by empirical tests on nine different nominal datasets, this work evaluates the ability of MCA and BBIL to improve classifier performance in terms of mean predictive accuracy. In order to benefit simultaneously from the advantages of both data treatments, two combinations of these techniques are evaluated: the first applies the GDA transformation to the previously selected attributes, and the second selects MCA factor scores using BBIL (the proposed methodology). The experimental results confirm the improvement in classification performance provided by these treatments and attest to the superiority of the proposed methodology in most of the situations analysed.
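
    To illustrate the MCA step, the sketch below takes the usual indicator-matrix route: one-hot encode the nominal attributes and run a correspondence analysis (an SVD of the standardised residuals) to obtain the row factor scores that the proposed methodology would then filter with BBIL. The tiny nominal dataset is an illustrative assumption.

        # Multiple Correspondence Analysis via the indicator matrix (sketch).
        import numpy as np
        import pandas as pd

        data = pd.DataFrame({
            "colour": ["red", "blue", "red", "green", "blue", "green"],
            "shape":  ["circle", "square", "square", "circle", "circle", "square"],
        })

        Z = pd.get_dummies(data).to_numpy(dtype=float)   # indicator (one-hot) matrix
        P = Z / Z.sum()                                  # correspondence matrix
        r = P.sum(axis=1)                                # row masses
        c = P.sum(axis=0)                                # column masses
        S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
        U, sing, Vt = np.linalg.svd(S, full_matrices=False)

        # Row (individual) factor scores on the first two dimensions.
        F = np.diag(r ** -0.5) @ U * sing
        print(np.round(F[:, :2], 3))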

    Synergy between artificial immune systems and probabilistic graphical models

    Advisor: Fernando Jose Von Zuben. Doctoral thesis, Universidade Estadual de Campinas, Faculdade de Engenharia Eletrica e de Computação. Artificial immune systems (AISs) and probabilistic graphical models are two important techniques for the design of intelligent systems, and they have been widely explored by researchers from diverse areas, in both theoretical and practical aspects. However, the potential of each technique is usually explored in isolation, without considering the possible cooperation between them. As a first contribution of this work, an approach is proposed that explores the main advantages of AISs as optimization tools applied to the learning of Bayesian networks from data sets. On the other hand, the AISs already proposed to perform optimization in discrete and continuous spaces correspond to population-based meta-heuristics without mechanisms to deal effectively with building blocks, and with few resources to benefit from the knowledge already acquired about the search space. The second contribution of this thesis is the proposition of four algorithms devoted to overcoming these limitations, in both single-objective and multi-objective contexts. The cloning and mutation operators are replaced by a probabilistic model representing the probability distribution of the best solutions; this model is then employed to generate new solutions. The probabilistic models adopted are the Bayesian network, for discrete spaces, and the Gaussian network, for continuous spaces. These choices are supported by their ability to properly capture the most relevant interactions among the variables of the problem. Promising results were obtained in the optimization experiments carried out, which treated, in discrete spaces, feature selection and ensembles for pattern classification and, in continuous spaces, high-dimensional multimodal functions. Keywords: artificial immune systems, Bayesian networks, Gaussian networks, optimization in discrete and continuous domains, single-objective and multi-objective optimization.
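
    The continuous-space side of this idea can be pictured with a minimal sketch: instead of cloning and mutating, a multivariate Gaussian (a fully connected Gaussian network) is estimated from the best solutions and sampled to produce the next population. The sphere objective and all parameter values are assumptions standing in for the thesis's high-dimensional multimodal benchmarks.

        # Gaussian-model EDA sketch (illustrative objective and parameters).
        import numpy as np

        rng = np.random.default_rng(0)
        dim, pop_size, n_elite, generations = 10, 100, 30, 60

        def fitness(x):
            return -np.sum(x ** 2, axis=1)       # toy objective: maximise -||x||^2

        pop = rng.uniform(-5, 5, (pop_size, dim))
        for _ in range(generations):
            elite = pop[np.argsort(fitness(pop))[-n_elite:]]
            mu = elite.mean(axis=0)
            cov = np.cov(elite, rowvar=False) + 1e-6 * np.eye(dim)  # variable interactions
            pop = rng.multivariate_normal(mu, cov, size=pop_size)

        best = pop[np.argmax(fitness(pop))]
        print("best solution norm:", np.linalg.norm(best))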

    Feature Subset Selection by Estimation of Distribution Algorithms

    This paper describes the application of four evolutionary algorithms to the selection of feature subsets for classification problems. Besides of
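
    In the spirit of the (truncated) abstract above, the sketch below applies a simple EDA with UMDA-style marginal re-estimation to feature subset selection, scoring candidate subsets with a cross-validated classifier. The synthetic data, the choice of classifier and all parameter values are assumptions for illustration only.

        # EDA-based feature subset selection sketch (illustrative only).
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.naive_bayes import GaussianNB
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   random_state=0)

        def fitness(mask):
            if mask.sum() == 0:
                return 0.0
            return cross_val_score(GaussianNB(), X[:, mask.astype(bool)], y, cv=3).mean()

        n_feat, pop_size, generations = X.shape[1], 30, 15
        p = np.full(n_feat, 0.5)
        for _ in range(generations):
            pop = (rng.random((pop_size, n_feat)) < p).astype(int)
            fits = np.array([fitness(ind) for ind in pop])
            selected = pop[np.argsort(fits)[-pop_size // 2:]]   # truncation selection
            p = np.clip(selected.mean(axis=0), 0.05, 0.95)      # re-estimate marginals

        print("selection probabilities:", np.round(p, 2))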