
    CNN training with graph-based sample preselection: application to handwritten character recognition

    In this paper, we present a study on sample preselection in large training data sets for CNN-based classification. To do so, we structure the input data set in a network representation, namely the Relative Neighbourhood Graph, and then extract some vectors of interest. The proposed preselection method is evaluated in the context of handwritten character recognition, using two data sets of up to several hundred thousand images. It is shown that the graph-based preselection can reduce the training data set without degrading the recognition accuracy of a shallow, non-pretrained CNN model.
    Comment: 10 pages. Accepted as an oral paper at the 13th IAPR International Workshop on Document Analysis Systems (DAS 2018).
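    The abstract's core data structure, the Relative Neighbourhood Graph, connects two samples p and q whenever no third sample r is strictly closer to both than they are to each other. A minimal brute-force sketch of that construction (function and variable names are hypothetical, not from the paper; real data sets would need a far faster algorithm):

    ```python
    from itertools import combinations

    def relative_neighbourhood_graph(points, dist):
        """Build the Relative Neighbourhood Graph (RNG) edge list.

        An edge (i, j) exists iff for every other sample k:
            dist(i, j) <= max(dist(i, k), dist(j, k)).
        Brute force, O(n^3); fine only for small illustrative inputs.
        """
        n = len(points)
        edges = []
        for i, j in combinations(range(n), 2):
            d_ij = dist(points[i], points[j])
            if all(d_ij <= max(dist(points[i], points[k]), dist(points[j], points[k]))
                   for k in range(n) if k not in (i, j)):
                edges.append((i, j))
        return edges

    # Toy 2-D example: the far-right point is not directly linked to the origin,
    # because the middle point is closer to both.
    euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    pts = [(0, 0), (1, 0), (5, 0)]
    print(relative_neighbourhood_graph(pts, euclid))  # → [(0, 1), (1, 2)]
    ```

    Preselection methods built on such a graph typically keep samples whose RNG edges cross class boundaries, on the assumption that those carry most of the decision-relevant information.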

    Reducing Dimensionality to Improve Search in Semantic Genetic Programming

    Genetic programming approaches are moving from analysing the syntax of individual solutions to looking into their semantics. One of the common definitions of the semantic space in the context of symbolic regression is an n-dimensional space, where n corresponds to the number of training examples. In problems where this number is high, the search process can become harder as the number of dimensions increases. Geometric semantic genetic programming (GSGP) explores the semantic space by performing geometric semantic operations; the fitness landscape seen by GSGP is guaranteed to be conic by construction. Intuitively, a lower number of dimensions can make search more feasible in this scenario, decreasing the chances of overfitting the data and reducing the number of evaluations required to find a suitable solution. This paper proposes two approaches for dimensionality reduction in GSGP: (i) applying existing instance selection methods as a pre-processing step before the training points are given to GSGP; (ii) incorporating instance selection into the evolution of GSGP. Experiments on 15 datasets show that GSGP performance is improved by using instance reduction during the evolution.
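    The link between instance selection and dimensionality here is direct: a program's semantics is its output vector on the training cases, so dropping training cases shrinks the semantic space. A minimal sketch of that idea (the helper names are illustrative, and random subsampling stands in for the paper's actual instance selection methods):

    ```python
    import random

    def semantics(program, training_inputs):
        """The semantics of a program: its output vector over the training
        cases. The semantic space has one dimension per training example."""
        return [program(x) for x in training_inputs]

    def random_instance_selection(X, y, keep_fraction, seed=0):
        """Baseline instance-selection pre-process (illustrative stand-in):
        keep a random subset of training cases, reducing the semantic
        space from len(X) dimensions to ~keep_fraction * len(X)."""
        rng = random.Random(seed)
        idx = sorted(rng.sample(range(len(X)), int(keep_fraction * len(X))))
        return [X[i] for i in idx], [y[i] for i in idx]

    X = list(range(100))
    y = [2 * x + 1 for x in X]           # target: a simple linear function
    Xs, ys = random_instance_selection(X, y, 0.2)
    print(len(semantics(lambda x: 2 * x + 1, Xs)))  # → 20 dimensions, not 100
    ```

    The paper's second approach differs in that selection happens during evolution rather than once up front, but the dimensionality effect is the same.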

    SOUL: Scala Oversampling and Undersampling Library for imbalance classification

    This work has been supported by the research project TIN2017-89517-P, by the UGR research contract OTRI 3940, and by research scholarships given to the authors Nestor Rodriguez and David Lopez by the University of Granada, Spain.
    The improvements in technology and computation have promoted a global adoption of Data Science, which is devoted to extracting significant knowledge from large amounts of information by applying Artificial Intelligence and Machine Learning tools. Among the different tasks within Data Science, classification is probably the most widespread. In the classification scenario, we often face datasets in which the number of instances for one of the classes is much lower than that of the remaining ones. This issue is known as the imbalanced classification problem, and it is mainly related to the need for boosting the recognition of the minority class examples. In spite of the large number of solutions proposed in the specialized literature to address imbalanced classification, there is a lack of open-source software that compiles the most relevant ones in an easy-to-use and scalable way. In this paper, we present a novel software approach named SOUL, which stands for Scala Oversampling and Undersampling Library for imbalanced classification. The main capabilities of this new library include a large number of different data preprocessing techniques, efficient execution of these approaches, and a graphical environment to contrast the output of the different preprocessing solutions.
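    SOUL itself is a Scala library, and its API is not shown in the abstract. Purely to illustrate the simplest family of techniques it covers, here is a Python sketch of random undersampling (all names are hypothetical, not SOUL's API):

    ```python
    import random
    from collections import Counter

    def random_undersample(X, y, seed=0):
        """Random undersampling: keep a random subset of each class so
        that every class ends with as many instances as the minority class."""
        rng = random.Random(seed)
        counts = Counter(y)
        target = min(counts.values())          # minority class size
        X_out, y_out = [], []
        for label in counts:
            idx = [i for i, lab in enumerate(y) if lab == label]
            for i in sorted(rng.sample(idx, target)):
                X_out.append(X[i])
                y_out.append(y[i])
        return X_out, y_out

    X = [[i] for i in range(12)]
    y = [0] * 10 + [1] * 2                     # 10 majority vs 2 minority
    Xb, yb = random_undersample(X, y)
    print(Counter(yb))                         # both classes reduced to 2
    ```

    Oversampling approaches such as SMOTE instead synthesize new minority-class examples; a library like SOUL bundles many variants of both strategies behind a common interface.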

    Leveraging Time Series Data in Similarity Based Healthcare Predictive Models: The Case of Early ICU Mortality Prediction

    Patient time series classification faces challenges from high dimensionality and missingness. In light of patient similarity theory, this study explores effective temporal feature engineering and reduction, missing value imputation, and change point detection methods that can afford similarity-based classification models a desirable accuracy enhancement. We select a piecewise aggregate approximation method to extract fine-grained temporal features and propose a minimalist method to impute missing values in temporal features. For dimensionality reduction, we adopt a gradient descent search method for feature weight assignment. We propose new patient status and directional change definitions based on medical knowledge or clinical guidelines about the value ranges for different patient status levels, and develop a method to detect change points indicating positive or negative patient status changes. We evaluate the effectiveness of the proposed methods in the context of early Intensive Care Unit (ICU) mortality prediction. The evaluation results show that a k-Nearest Neighbor algorithm incorporating the methods we select and propose significantly outperforms the relevant benchmarks for early ICU mortality prediction. This study contributes to time series classification and early ICU mortality prediction by identifying and enhancing temporal feature engineering and reduction methods for similarity-based time series classification.
    Keywords: time-series classification, similarity-based classification, mortality prediction, directional change point
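    The feature-extraction step named in the abstract, piecewise aggregate approximation (PAA), compresses a long time series into a short vector of per-window means. A minimal sketch (the heart-rate example data is invented for illustration):

    ```python
    def paa(series, n_segments):
        """Piecewise Aggregate Approximation: split the series into
        n_segments (near-)equal-width windows and represent each window
        by its mean, yielding a short, noise-smoothed feature vector."""
        n = len(series)
        out = []
        for s in range(n_segments):
            lo = s * n // n_segments
            hi = (s + 1) * n // n_segments
            window = series[lo:hi]
            out.append(sum(window) / len(window))
        return out

    hr = [60, 62, 61, 90, 92, 95, 70, 72]   # e.g. hourly vital-sign readings
    print(paa(hr, 4))                       # → [61.0, 75.5, 93.5, 71.0]
    ```

    Feature vectors like this, one per patient, are what a similarity-based classifier such as k-Nearest Neighbor then compares, with the learned feature weights scaling each dimension's contribution to the distance.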