6 research outputs found

    A study of the use of complexity measures in the similarity search process adopted by kNN algorithm for time series prediction

    In the last two decades, with the rise of Data Mining, there has been increasing interest in adapting Machine Learning methods to support non-parametric Time Series modeling and prediction. Non-parametric temporal data modeling can follow local or global approaches. Most local prediction strategies are based on the k-Nearest Neighbor (kNN) learning method. In this paper we propose a modification of the kNN algorithm for Time Series prediction. Our proposal differs from the literature by incorporating three techniques: amplitude and offset invariance, complexity invariance, and treatment of trivial matches. We evaluate the proposed method with six complexity measures in order to verify the impact of these measures on the projection of future values. In addition, we compare our method with two Machine Learning regression algorithms. The experimental comparisons were performed using 55 data sets, which are available at the ICMC-USP Time Series Prediction Repository. Our results indicate that the developed method is competitive and that the use of a complexity-invariant distance measure generally improves predictive performance.
    Funding: FAPESP (grant 2013/10978-8); CNPq (grants 303083/2013-1 and 446330/2014-0).
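The three techniques named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `window`, `k` and `exclusion` parameters, the use of z-normalization for amplitude and offset invariance, the single complexity estimate (root of the summed squared successive differences), and the overlap-based handling of trivial matches are all assumptions made for illustration. The paper evaluates six complexity measures; only one well-known estimate is shown here.

```python
import math

def znorm_params(x):
    """Mean and (population) standard deviation; std falls back to 1.0 for flat windows."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return mean, (std if std > 0 else 1.0)

def complexity(x):
    """Assumed complexity estimate: root of summed squared successive differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x[1:])))

def cid(q, c):
    """Euclidean distance scaled by a complexity correction factor (>= 1)."""
    ed = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))
    ce_q, ce_c = complexity(q), complexity(c)
    hi, lo = max(ce_q, ce_c), min(ce_q, ce_c)
    factor = 1.0 if hi == 0 else hi / max(lo, 1e-12)
    return ed * factor

def knn_predict(series, window, k, exclusion=None):
    """Predict the next value from the k most similar past windows."""
    if exclusion is None:
        exclusion = window  # assumption: chosen neighbours must not overlap each other
    n = len(series)
    query = series[n - window:]
    q_mean, q_std = znorm_params(query)
    q_norm = [(v - q_mean) / q_std for v in query]

    scored = []
    # Every candidate window must be followed by one known value (its successor).
    for start in range(n - window):
        cand = series[start:start + window]
        c_mean, c_std = znorm_params(cand)
        c_norm = [(v - c_mean) / c_std for v in cand]
        successor_norm = (series[start + window] - c_mean) / c_std
        scored.append((cid(q_norm, c_norm), start, successor_norm))
    scored.sort()

    chosen = []
    for dist, start, successor_norm in scored:
        # Trivial-match handling (assumed form): skip windows too close to a chosen one.
        if all(abs(start - s) >= exclusion for _, s, _ in chosen):
            chosen.append((dist, start, successor_norm))
        if len(chosen) == k:
            break
    if not chosen:
        raise ValueError("series too short for the given window")

    pred_norm = sum(s for _, _, s in chosen) / len(chosen)
    # Map the averaged successor back to the query's own scale and offset.
    return pred_norm * q_std + q_mean
```

Because both the candidate windows and their successors are normalized, a past pattern at a different scale and offset can still inform the prediction: with `series = [0, 1, 2, 3] * 4 + [10, 12, 14, 16]`, `window=4` and `k=2`, the sketch matches the earlier ramps and projects their successor onto the scale of the last window.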

    Fine-tuning pre-trained neural networks for medical image classification in small clinical datasets

    Convolutional neural networks have been effective in several applications, arising as a promising supporting tool in a relevant Dermatology problem: skin cancer diagnosis. However, generalizing well can be difficult when little training data is available. The fine-tuning transfer learning strategy has been employed to properly differentiate malignant from non-malignant lesions in dermoscopic images. Fine-tuning a pre-trained network allows one to classify data in the target domain, occasionally with few images, using knowledge acquired in another domain. This work proposes eight fine-tuning settings, based on convolutional networks previously trained on ImageNet, that can be employed mainly on limited data samples to reduce the risk of overfitting. The settings differ in the architecture, the learning rate, and the number of unfrozen layer blocks. We evaluated the settings on two public datasets with 104 and 200 dermoscopic images. By finding competitive configurations in small datasets, this paper illustrates that deep learning can be effective even if one has only a few dozen malignant and non-malignant lesion images to study and differentiate in Dermatology. The proposal is also flexible and potentially useful for other domains. In fact, it performed satisfactorily in an assessment conducted on a larger dataset with 746 computerized tomographic images associated with the coronavirus disease.
    Funding: We would like to acknowledge eurekaSD: Enhancing University Research and Education in Areas Useful for Sustainable Development (grants EK14AC0037 and EK15AC0264). We thank the Araucária Foundation for the Support of the Scientific and Technological Development of Paraná for a Research and Technological Productivity Scholarship for H. D. Lee (grant 028/2019). We also thank the Brazilian National Council for Scientific and Technological Development (CNPq) for grant number 142050/2019-9 for A. R. S. Parmezan. The Portuguese team was partially supported by Fundação para a Ciência e a Tecnologia (FCT). R. Fonseca-Pinto was financed by the projects UIDB/50008/2020, UIDP/50008/2020, UIDB/05704/2020 and UIDP/05704/2020, and C. V. Nogueira was financed by the projects UIDB/00013/2020 and UIDP/00013/2020. The funding agencies did not have any further involvement in this paper.

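    Hierarchical classification of batch and streaming data with applications to entomology (original title: Classificação hierárquica de dados em lote e em fluxo contínuo com aplicações para entomologia)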

    Traditional supervised machine learning algorithms conduct data classification in a flat way, i.e., they seek to associate each example with a class belonging to a finite, usually small set of classes devoid of structural dependencies. However, there are more challenging problems in which classes can be divided into subclasses or grouped into superclasses. This structural dependency between classes demands methods prepared to deal with hierarchical classification. An algorithm for hierarchical classification considers the structural information embedded in the class hierarchy and uses it to decompose the original problem's feature space into subproblems with fewer classes. Such decomposition reduces the complexity of the classification function as well as the prediction error. This thesis advances the state of the art by proposing novel algorithms for hierarchical classification considering two learning paradigms: (i) batch, where learning takes place offline from a fixed-size sample of examples (ideally) coming from a stationary probability distribution, with each observation in the sample independently and identically distributed; and (ii) streaming, in which learning is performed online from an ordered and usually unbounded sequence of examples made available by systems or devices at various update rates and without human intervention. The features that describe the streaming examples may drift over time due to the non-stationary nature of the environment in which they are embedded. In this context, the main contributions of this thesis include: (i) the most extensive and comprehensive study ever done to understand the impact of climatic-environmental conditions on bee and wasp wing-beat frequencies. From the practical standpoint, the work builds base components for (online) (hierarchical) classification of flying insects; (ii) a method that combines local approaches to quickly and efficiently obtain a hierarchical decision model that faithfully represents the music genre identification scenario. We also validated the approach on hymenopteran data; (iii) a reference process that uses optical sensors and hierarchical classifiers to identify pollinating flying insects under natural field conditions. The results obtained provided answers to ten research questions; (iv) the first algorithm for hierarchical classification of data streams. It is based on nearest neighbors and works incrementally; (v) a framework and (vi) a collection of methods for hierarchical labeling of streaming data.
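The decomposition described in this abstract, in which the class hierarchy splits one problem into subproblems with fewer classes, can be sketched with a top-down scheme that trains one small classifier per parent node. This is only an illustration under stated assumptions, not one of the thesis's algorithms: the nearest-centroid classifier, the function names, the generic labels (`a`, `a1`, ...) and the toy 2-D data are all hypothetical.

```python
import math

def leaves_under(node, hierarchy):
    """Return every leaf label reachable from `node`."""
    children = hierarchy.get(node)
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves_under(child, hierarchy))
    return found

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def train(hierarchy, examples):
    """One nearest-centroid model per parent node, covering only that node's children."""
    models = {}
    for parent, children in hierarchy.items():
        models[parent] = {}
        for child in children:
            leaves = set(leaves_under(child, hierarchy))
            points = [x for x, label in examples if label in leaves]
            if points:
                models[parent][child] = centroid(points)
    return models

def predict(models, x, root="root"):
    """Walk the hierarchy top-down, choosing the closest child at each level."""
    node, path = root, []
    while node in models and models[node]:
        node = min(models[node], key=lambda c: math.dist(x, models[node][c]))
        path.append(node)
    return path

# Hypothetical two-level hierarchy and made-up 2-D examples.
hierarchy = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
examples = [
    ((0.0, 0.0), "a1"), ((0.0, 1.0), "a1"),
    ((2.0, 0.0), "a2"), ((2.0, 1.0), "a2"),
    ((10.0, 0.0), "b1"), ((10.0, 1.0), "b1"),
    ((12.0, 0.0), "b2"), ((12.0, 1.0), "b2"),
]
models = train(hierarchy, examples)
print(predict(models, (11.8, 0.5)))  # ['b', 'b2']
```

Each node-level model only has to separate two classes instead of four, which is the sense in which the hierarchy reduces the complexity of each classification step.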

    Similarity-based time series prediction

    One of the major challenges in Data Mining is integrating temporal information into its process. This difficulty has challenged professionals in several application fields and has been the object of considerable investment from the scientific and business communities. In the context of Time Series prediction, these investments consist mainly of grants for research aimed at adapting conventional Machine Learning methods to data analysis problems in which time is an important factor. We propose a novel modification of the k-Nearest Neighbors (kNN) learning algorithm for Time Series prediction, namely the kNN - Time Series Prediction with Invariances (kNN-TSPI). Our proposal differs from the conventional version by incorporating three techniques: amplitude and offset invariance, complexity invariance, and treatment of trivial matches. Used together, these three modifications allow more meaningful matching between the reference query and the Time Series subsequences, as we discuss in more detail throughout this master's thesis. We have performed one of the most extensive, impartial, and comprehensible empirical evaluations of Time Series prediction, in which we compared the proposed algorithm with ten methods commonly found in the literature. The results show that the kNN-TSPI is appropriate for automated short-term projection and is competitive with the state-of-the-art statistical methods ARIMA and SARIMA. Although in our experiments the SARIMA model reached a slightly higher precision than the similarity-based method, the kNN-TSPI is considerably simpler to adjust. The objective and subjective comparison of statistical and Machine Learning algorithms for temporal data projection fills a major gap in the literature, which was identified through a systematic review followed by a meta-analysis of the selected publications. The 95 data sets used in our computational experiments, as well as all the projections evaluated in terms of Mean Squared Error, Theil's U coefficient, and Prediction Of Change In Direction hit rate, are available online at the ICMC-USP Time Series Prediction Repository. This work also includes contributions and significant results regarding the properties inherent to similarity-based prediction, especially from the practical point of view. The outlined experimental protocols and the conclusions obtained can be used as a guideline for model selection, parameter setting, and application of Artificial Intelligence algorithms to Time Series prediction.
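The three evaluation measures named in this abstract can be written down compactly. The sketch below uses one common formulation of each and is an assumption, not a transcription of the thesis: Theil's U is taken as the ratio between the model's squared error and the squared error of a naive "repeat the previous value" forecast, and Prediction Of Change In Direction (POCID) as the percentage of steps where the predicted and actual series move in the same direction. Function names are hypothetical.

```python
def mse(actual, predicted):
    """Mean Squared Error over aligned actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def theil_u(actual, predicted):
    """Model squared error relative to a naive forecast that repeats the last value.

    Values below 1 mean the model beats the naive forecast (under this formulation).
    """
    steps = range(1, len(actual))
    model_error = sum((actual[t] - predicted[t]) ** 2 for t in steps)
    naive_error = sum((actual[t] - actual[t - 1]) ** 2 for t in steps)
    return model_error / naive_error

def pocid(actual, predicted):
    """Percentage of steps where prediction and reality change in the same direction."""
    steps = range(1, len(actual))
    hits = sum(
        1
        for t in steps
        if (predicted[t] - predicted[t - 1]) * (actual[t] - actual[t - 1]) > 0
    )
    return 100.0 * hits / (len(actual) - 1)

actual = [1.0, 2.0, 4.0, 3.0]
predicted = [1.0, 3.0, 4.0, 5.0]
print(mse(actual, predicted))      # 1.25
print(theil_u(actual, predicted))  # 5/6, about 0.833
print(pocid(actual, predicted))    # 2 of 3 directions correct, about 66.67
```

MSE rewards closeness, Theil's U puts that closeness in context against a trivial baseline, and POCID ignores magnitude entirely, so the three measures can rank the same set of forecasts differently.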

    Metalearning For Choosing Feature Selection Algorithms In Data Mining: Proposal Of A New Framework

    In Data Mining, during the preprocessing step, there is a considerable diversity of candidate algorithms for selecting important features according to some criteria. This broad availability of algorithms that perform the Feature Selection task makes it difficult to choose, a priori, the most promising one for a particular problem. In this paper, we present the proposal and evaluation of a new architecture for the recommendation of Feature Selection algorithms based on Metalearning. Our framework is very flexible, since users can adapt it to their own needs. This flexibility is one of the main advantages of our proposal over other approaches in the literature, which involve steps that cannot be adapted to the user's local requirements. Furthermore, it combines several concepts of intelligent systems, including Machine Learning and Data Mining, with topics derived from expert systems, such as user- and data-driven knowledge, and with meta-knowledge. This set of solutions, coupled with leading-edge technologies, allows our architecture to be integrated into any information system, which contributes to the automation of services and reduces human effort during the process. Regarding the Metalearning process, our framework considers several types of properties inherent to the data sets, as well as Feature Selection algorithms based on several information, distance, dependence, and consistency measures. The quality of the Feature Selection methods was estimated according to a multicriteria performance measure, which guided the ranking of these algorithms for the construction of data metabases. Proposed by the authors of this work, this multicriteria performance measure combines any three measurements into a single one, creating an interesting and powerful tool not only for evaluating Feature Selection algorithms but also for assessing any context that requires combining measures to be maximized or minimized. The recommendation models, represented by decision trees and induced from the training metabases, allowed us to see in which circumstances a Feature Selection algorithm outperforms the others and which aspects of the data have the greatest influence on the performance of these algorithms. Nevertheless, if the user wishes, any other learning algorithm may be used to induce the recommendation model. This versatility is another strong point of this proposal. Results show that by characterizing the data through statistical, information, and complexity measures, it is possible to reach an accuracy higher than 90%. Besides yielding recommendation models that are interpretable and robust to overfitting, the developed architecture is less computationally expensive than approaches recently proposed in the literature. (C) 2017 Elsevier Ltd. All rights reserved.
    Vol. 75, pp. 1-24. Funding: Brazilian National Council for Scientific and Technological Development (CNPq); Araucária Foundation for the Support of the Scientific and Technological Development of Paraná.

    Metalearning for choosing feature selection algorithms in data mining: Proposal of a new framework

    In Data Mining, during the preprocessing step, there is a considerable diversity of candidate algorithms for selecting important features according to some criteria. This broad availability of algorithms that perform the Feature Selection task makes it difficult to choose, a priori, the most promising one for a particular problem. In this paper, we present the proposal and evaluation of a new architecture for the recommendation of Feature Selection algorithms based on Metalearning. Our framework is very flexible, since users can adapt it to their own needs. This flexibility is one of the main advantages of our proposal over other approaches in the literature, which involve steps that cannot be adapted to the user's local requirements. Furthermore, it combines several concepts of intelligent systems, including Machine Learning and Data Mining, with topics derived from expert systems, such as user- and data-driven knowledge, and with meta-knowledge. This set of solutions, coupled with leading-edge technologies, allows our architecture to be integrated into any information system, which contributes to the automation of services and reduces human effort during the process. Regarding the Metalearning process, our framework considers several types of properties inherent to the data sets, as well as Feature Selection algorithms based on several information, distance, dependence, and consistency measures. The quality of the Feature Selection methods was estimated according to a multicriteria performance measure, which guided the ranking of these algorithms for the construction of data metabases. Proposed by the authors of this work, this multicriteria performance measure combines any three measurements into a single one, creating an interesting and powerful tool not only for evaluating Feature Selection algorithms but also for assessing any context that requires combining measures to be maximized or minimized. The recommendation models, represented by decision trees and induced from the training metabases, allowed us to see in which circumstances a Feature Selection algorithm outperforms the others and which aspects of the data have the greatest influence on the performance of these algorithms. Nevertheless, if the user wishes, any other learning algorithm may be used to induce the recommendation model. This versatility is another strong point of this proposal. Results show that by characterizing the data through statistical, information, and complexity measures, it is possible to reach an accuracy higher than 90%. Besides yielding recommendation models that are interpretable and robust to overfitting, the developed architecture is less computationally expensive than approaches recently proposed in the literature.
    Vol. 75, pp. 1-24. Funding: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq); no further information provided.