928 research outputs found

    Melodic track identification in MIDI files considering the imbalanced context

    In this paper, the problem of identifying the melodic track of a MIDI file in imbalanced scenarios is addressed. A polyphonic MIDI file is a digital score that consists of a set of tracks, where usually only one contains the melody and the remaining tracks hold the accompaniment. This leads to a two-class imbalance problem that, unlike in previous work, is managed by over-sampling the melody class (the minority one) or by under-sampling the accompaniment class (the majority one) until both classes are the same size. Experimental results over three different music genres show that learning from balanced training sets clearly provides better results than the standard classification process.
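
    The balancing step described in this abstract can be illustrated with a minimal sketch. It assumes plain random resampling; the abstract does not say which over- or under-sampling algorithm the authors used, and the function name `balance`, its parameters, and the label strings are hypothetical.

```python
import random


def balance(tracks, labels, minority="melody", mode="over", seed=0):
    """Randomly resample a two-class set until both classes are the same size.

    Assumption: plain random resampling, not the specific method of the paper.
    """
    rng = random.Random(seed)
    minor = [(t, y) for t, y in zip(tracks, labels) if y == minority]
    major = [(t, y) for t, y in zip(tracks, labels) if y != minority]
    if not minor or len(minor) > len(major):
        raise ValueError("the minority class must be present and not the larger one")
    if mode == "over":
        # Duplicate random minority samples until the class sizes match.
        extra = [rng.choice(minor) for _ in range(len(major) - len(minor))]
        balanced = major + minor + extra
    elif mode == "under":
        # Keep a random subset of the majority class, as large as the minority.
        balanced = minor + rng.sample(major, len(minor))
    else:
        raise ValueError("mode must be 'over' or 'under'")
    rng.shuffle(balanced)
    new_tracks, new_labels = zip(*balanced)
    return list(new_tracks), list(new_labels)
```

    Over-sampling keeps every accompaniment track and repeats melody tracks, while under-sampling discards accompaniment tracks; resampling should be applied to the training set only, so that test results reflect the original class distribution.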

    On the suitability of combining feature selection and resampling to manage data complexity

    The effectiveness of a learning task depends on data complexity (class overlap, class imbalance, irrelevant features, etc.). When more than one complexity factor appears, two or more preprocessing techniques should be applied. Nevertheless, not much effort has been devoted to investigating the importance of the order in which they are applied. This paper focuses on the joint use of feature reduction and balancing techniques, and studies which application order leads to the best classification results. The analysis was carried out on a specific problem: identifying the melodic track in a MIDI file. Several experiments were performed on different imbalanced 38-dimensional training sets with many more accompaniment tracks than melodic tracks, and where features were aggregated without any correlation study. Results showed that the most effective combination was the ordered use of resampling and feature reduction techniques.

    The sensitivity of mapping methods to reference data quality: training supervised image classifications with imperfect reference data

    The accuracy of a map is dependent on the reference dataset used in its construction. Classification analyses used in thematic mapping can, for example, be sensitive to a range of sampling and data quality concerns. With particular focus on the latter, the effects of reference data quality on land cover classifications from airborne thematic mapper data are explored. Variations in sampling intensity and effort are highlighted in a dataset that is widely used in mapping and modelling studies; these may need accounting for in analyses. The quality of the labelling in the reference dataset was also a key variable influencing mapping accuracy. Accuracy varied with the amount and nature of mislabelled training cases, with the nature of the effects varying between classifiers. The largest impacts on accuracy occurred when mislabelling involved confusion between similar classes. Accuracy was also typically negatively related to the magnitude of mislabelled cases. The support vector machine (SVM), which has been claimed to be relatively insensitive to training data error, was the most sensitive of the classifiers investigated, with overall classification accuracy declining by 8% (significant at the 95% level of confidence) when the training set contained 20% mislabelled cases.

    Shallow vs deep learning architectures for white matter lesion segmentation in the early stages of multiple sclerosis

    In this work, we present a comparison of a shallow and a deep learning architecture for the automated segmentation of white matter lesions in MR images of multiple sclerosis patients. In particular, we train and test both methods on early stage disease patients, to verify their performance in challenging conditions, more similar to a clinical setting than what is typically provided in multiple sclerosis segmentation challenges. Furthermore, we evaluate a prototype naive combination of the two methods, which refines the final segmentation. All methods were trained on 32 patients, and the evaluation was performed on a pure test set of 73 cases. Results show low lesion-wise false positives (30%) for the deep learning architecture, whereas the shallow architecture yields the best Dice coefficient (63%) and volume difference (19%). Combining both shallow and deep architectures further improves the lesion-wise metrics (69% and 26% lesion-wise true and false positive rate, respectively). Comment: Accepted to the MICCAI 2018 Brain Lesion (BrainLes) workshop
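
    For reference, the Dice coefficient reported in this abstract measures the overlap between a predicted and a reference segmentation, 2|A ∩ B| / (|A| + |B|). The sketch below assumes flattened binary masks of equal length; the function name `dice` and the convention of returning 1.0 for two empty masks are illustrative choices, not taken from the paper.

```python
def dice(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as lesion in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total lesion voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

    A value of 1.0 means the masks coincide and 0.0 means they share no lesion voxel.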

    The detection of globular clusters in galaxies as a data mining problem

    We present an application of self-adaptive supervised learning classifiers, derived from the Machine Learning paradigm, to the identification of candidate Globular Clusters in deep, wide-field, single band HST images. Several methods provided by the DAME (Data Mining & Exploration) web application were tested and compared on the NGC1399 HST data described in Paolillo 2011. The best results were obtained using a Multi Layer Perceptron with a Quasi Newton learning rule, which achieved a classification accuracy of 98.3%, with a completeness of 97.8% and 1.6% of contamination. An extensive set of experiments revealed that the use of accurate structural parameters (effective radius, central surface brightness) does improve the final result, but only by 5%. It is also shown that the method is able to retrieve extreme sources (for instance, very extended objects) which are missed by more traditional approaches. Comment: Accepted 2011 December 12; Received 2011 November 28; in original form 2011 October 1

    Parallel classification and feature selection in microarray data using SPRINT

    The statistical language R is favoured by many biostatisticians for processing microarray data. In recent times, the quantity of data that can be obtained in experiments has risen significantly, making previously fast analyses time consuming or even not possible at all with the existing software infrastructure. High performance computing (HPC) systems offer a solution to these problems but at the expense of increased complexity for the end user. The Simple Parallel R Interface is a library for R that aims to reduce the complexity of using HPC systems by providing biostatisticians with drop‐in parallelised replacements of existing R functions. In this paper we describe parallel implementations of two popular techniques: exploratory clustering analyses using the random forest classifier and feature selection through identification of differentially expressed genes using the rank product method

    Handwritten digit recognition by bio-inspired hierarchical networks

    The human brain processes information showing learning and prediction abilities, but the underlying neuronal mechanisms still remain unknown. Recently, many studies have shown that neuronal networks are capable of both generalization and association of sensory inputs. In this paper, following a body of neurophysiological evidence, we propose a learning framework with strong biological plausibility that mimics prominent functions of cortical circuitries. We developed the Inductive Conceptual Network (ICN), a hierarchical bio-inspired network able to learn invariant patterns by means of Variable-order Markov Models implemented in its nodes. The outputs of the top-most node of the ICN hierarchy, representing the highest input generalization, allow for automatic classification of inputs. We found that the ICN clustered MNIST images with an error of 5.73% and USPS images with an error of 12.56%.

    Using Fourier coefficients in time series analysis for student performance prediction in blended learning environments

    In this work, it is shown that student access time series generated from Moodle log files contain sufficient information for successful prediction of student final results in blended learning courses. It is also shown that if a time series is transformed into the frequency domain using the discrete Fourier transform (DFT), the information contained in it is preserved. Hence, the resulting periodogram and its DFT coefficients can be used for generating student performance models with the algorithms commonly used for that purpose. The amount of data extracted from log files, especially for lengthy courses, can be huge. Nevertheless, by using the DFT, drastic compression of the data is possible. It is experimentally shown, by means of several commonly used modelling algorithms, that if on average all but the 5–10% most intensive and most frequently used DFT coefficients are removed from the datasets, modelling with the remaining data results in an increase in model accuracy. The resulting accuracy of the calculated models is in accordance with results reported in the literature for student performance models calculated for different dataset types. The advantage of this approach is its applicability, because the data are automatically collected in Moodle logs.
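
    The compression idea in this abstract, keeping only the strongest DFT coefficients of an access time series, can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it uses a naive O(n²) DFT from the standard library, ranks coefficients by magnitude only (the "most frequently used" criterion is not modelled), and the names `dft` and `keep_strongest` and the `fraction` parameter are hypothetical.

```python
import cmath


def dft(series):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(series)
    return [
        sum(series[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def keep_strongest(coeffs, fraction=0.1):
    """Zero out all but the given fraction of largest-magnitude coefficients."""
    keep = max(1, round(len(coeffs) * fraction))
    # Indices of coefficients sorted from largest to smallest magnitude.
    ranked = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(ranked[:keep])
    return [c if k in kept else 0j for k, c in enumerate(coeffs)]
```

    The surviving coefficients (or their squared magnitudes, as in a periodogram) would then serve as the feature vector passed to a modelling algorithm.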

    Parameterizing neural networks for disease classification

    Neural networks are one option to implement decision support systems for health care applications. In this paper, we identify optimal settings of neural networks for medical diagnoses: the study involves the application of supervised machine learning using an artificial neural network to distinguish between gout and leukaemia patients. With the objective of improving the base accuracy (calculated from the initial set-up of the neural network model), several enhancements are analysed, such as the use of the hyperbolic tangent activation function instead of the sigmoid function, the use of two hidden layers instead of one, and transforming the measurements with linear regression to obtain a smoothed data set. Another setting we study is the impact on accuracy of using a data set of reduced size but with higher data quality. We also discuss the tradeoff between accuracy and runtime efficiency.

    A Multiclassifier Approach for Drill Wear Prediction

    Classification methods have been widely used in recent years to predict patterns and trends of interest in data. In this paper, a multiclassifier approach that combines the output of some of the most popular data mining algorithms is presented. The approach is based on voting criteria: the confidence distribution of each algorithm is estimated individually, and the distributions are combined according to three different methods: confidence voting, weighted voting and majority voting. To illustrate its applicability to a real problem, drill wear detection in the machine-tool sector is addressed. In this study, the accuracy obtained by each isolated classifier is compared with the performance of the multiclassifier when characterizing the patterns of interest involved in the drilling process and predicting the drill wear. Experimental results show that, in general, the false positives obtained by the classifiers can be slightly reduced by using the multiclassifier approach.
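
    The three voting schemes named in this abstract can be sketched as below. The abstract does not define them, so the definitions here are common textbook readings and should be treated as assumptions: majority voting counts each classifier's top label, weighted voting scales each vote by a per-classifier weight (for example a validation accuracy), and confidence voting sums the confidence distributions. All function names and the dictionary representation are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(distributions):
    """Each classifier votes for its most confident label; most votes wins."""
    votes = Counter(max(d, key=d.get) for d in distributions)
    return votes.most_common(1)[0][0]


def weighted_vote(distributions, weights):
    """Like majority voting, but each vote counts with the classifier's weight."""
    scores = defaultdict(float)
    for dist, weight in zip(distributions, weights):
        scores[max(dist, key=dist.get)] += weight
    return max(scores, key=scores.get)


def confidence_vote(distributions):
    """Sum the per-label confidences over all classifiers and pick the largest."""
    scores = defaultdict(float)
    for dist in distributions:
        for label, confidence in dist.items():
            scores[label] += confidence
    return max(scores, key=scores.get)
```

    The schemes can disagree: one very confident classifier can outweigh two marginal ones under confidence voting while losing under majority voting.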