689 research outputs found

    On the class overlap problem in imbalanced data classification.

    Get PDF
    Class imbalance is an active research area in the machine learning community. However, existing and recent literature showed that class overlap had a higher negative impact on the performance of learning algorithms. This paper provides detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handle imbalanced datasets. Existing solutions from selective literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest development in this area are also discussed in detail. Experimental results in this paper are consistent with existing literature and show clearly that the performance of the learning algorithm deteriorates across varying degrees of class overlap whereas class imbalance does not always have an effect. The review emphasises the need for further research towards handling class overlap in imbalanced datasets to effectively improve learning algorithms’ performance

    Learning from class-imbalanced data: overlap-driven resampling for imbalanced data classification.

    Get PDF
    Classification of imbalanced datasets has attracted substantial research interest over the past years. This is because imbalanced datasets are common in several domains such as health, finance and security, but learning algorithms are generally not designed to handle them. Many existing solutions focus mainly on the class distribution problem. However, a number of reports showed that class overlap had a higher negative impact on the learning process than class imbalance. This thesis thoroughly explores the impact of class overlap on the learning algorithm and demonstrates how elimination of class overlap can effectively improve the classification of imbalanced datasets. Novel undersampling approaches were developed with the main objective of enhancing the presence of minority class instances in the overlapping region. This is achieved by identifying and removing majority class instances potentially residing in such a region. Seven methods under the two different approaches were designed for the task. Extensive experiments were carried out to evaluate the methods on simulated and well-known real-world datasets. Results showed that substantial improvement in the classification accuracy of the minority class was obtained with favourable trade-offs with the majority class accuracy. Moreover, successful application of the methods in predictive diagnostics of diseases with imbalanced records is presented. These novel overlap-based approaches have several advantages over other common resampling methods. First, the undersampling amount is independent of class imbalance and proportional to the degree of overlap. This could effectively address the problem of class overlap while reducing the effect of class imbalance. Second, information loss is minimised as instance elimination is contained within the problematic region. Third, adaptive parameters enable the methods to be generalised across different problems. It is also worth pointing out that these methods provide different trade-offs, which offer more alternatives to real-world users in selecting the best fit solution to the problem

    Support matrix machine: A review

    Full text link
    Support vector machine (SVM) is one of the most studied paradigms in the realm of machine learning for classification and regression problems. It relies on vectorized input data. However, a significant portion of the real-world data exists in matrix format, which is given as input to SVM by reshaping the matrices into vectors. The process of reshaping disrupts the spatial correlations inherent in the matrix data. Also, converting matrices into vectors results in input data with a high dimensionality, which introduces significant computational complexity. To overcome these issues in classifying matrix input data, support matrix machine (SMM) is proposed. It represents one of the emerging methodologies tailored for handling matrix input data. The SMM method preserves the structural information of the matrix data by using the spectral elastic net property which is a combination of the nuclear norm and Frobenius norm. This article provides the first in-depth analysis of the development of the SMM model, which can be used as a thorough summary by both novices and experts. We discuss numerous SMM variants, such as robust, sparse, class imbalance, and multi-class classification models. We also analyze the applications of the SMM model and conclude the article by outlining potential future research avenues and possibilities that may motivate academics to advance the SMM algorithm

    Adaptive Ensemble of Classifiers with Regularization for Imbalanced Data Classification

    Get PDF
    The dynamic ensemble selection of classifiers is an effective approach for processing label-imbalanced data classifications. However, such a technique is prone to overfitting, owing to the lack of regularization methods and the dependence of the aforementioned technique on local geometry. In this study, focusing on binary imbalanced data classification, a novel dynamic ensemble method, namely adaptive ensemble of classifiers with regularization (AER), is proposed, to overcome the stated limitations. The method solves the overfitting problem through implicit regularization. Specifically, it leverages the properties of stochastic gradient descent to obtain the solution with the minimum norm, thereby achieving regularization; furthermore, it interpolates the ensemble weights by exploiting the global geometry of data to further prevent overfitting. According to our theoretical proofs, the seemingly complicated AER paradigm, in addition to its regularization capabilities, can actually reduce the asymptotic time and memory complexities of several other algorithms. We evaluate the proposed AER method on seven benchmark imbalanced datasets from the UCI machine learning repository and one artificially generated GMM-based dataset with five variations. The results show that the proposed algorithm outperforms the major existing algorithms based on multiple metrics in most cases, and two hypothesis tests (McNemar's and Wilcoxon tests) verify the statistical significance further. In addition, the proposed method has other preferred properties such as special advantages in dealing with highly imbalanced data, and it pioneers the research on the regularization for dynamic ensemble methods.Comment: Major revision; Change of authors due to contribution

    Habitat Associations and Reproduction of Fishes on the Northwestern Gulf of Mexico Shelf Edge

    Get PDF
    Several of the northwestern Gulf of Mexico (GOM) shelf-edge banks provide critical hard bottom habitat for coral and fish communities, supporting a wide diversity of ecologically and economically important species. These sites may be fish aggregation and spawning sites and provide important habitat for fish growth and reproduction. Already designated as habitat areas of particular concern, many of these banks are also under consideration for inclusion in the expansion of the Flower Garden Banks National Marine Sanctuary. This project aimed to gain a more comprehensive understanding of the communities and fish species on shelf-edge banks by way of gonad histology, baited remote underwater video, and hydroacoustics, as well as traditional statistical analyses, Bayesian estimation, and machine learning techniques. The study had several objectives: (1) estimate size at sexual transition for six GOM grouper species, (2) determine the optimal number of cameras on a baited remote underwater video system, (3) create a predictive model to provide presence of fish species based on habitat, and (4) grow a model to predict fish backscatter and density based on habitat parameters. Bayesian estimation allowed for size at sexual transition determinations for the six grouper species, outperforming the tradition frequentist models, especially for situations where tradition models failed to converge. Random forests based on video data had mixed results, but models for several species were able to predict fish presences with class and overall accuracies of greater than 80%. Boosted regression trees based on hydroacoustic data reinforced the importance of depth as a driving factor in fish distributions. The study provided greater understanding and predictive ability regarding fish on the bank habitats

    Satellite and UAV Platforms, Remote Sensing for Geographic Information Systems

    Get PDF
    The present book contains ten articles illustrating the different possible uses of UAVs and satellite remotely sensed data integration in Geographical Information Systems to model and predict changes in both the natural and the human environment. It illustrates the powerful instruments given by modern geo-statistical methods, modeling, and visualization techniques. These methods are applied to Arctic, tropical and mid-latitude environments, agriculture, forest, wetlands, and aquatic environments, as well as further engineering-related problems. The present Special Issue gives a balanced view of the present state of the field of geoinformatics

    Statistical and Machine Learning Models for Remote Sensing Data Mining - Recent Advancements

    Get PDF
    This book is a reprint of the Special Issue entitled "Statistical and Machine Learning Models for Remote Sensing Data Mining - Recent Advancements" that was published in Remote Sensing, MDPI. It provides insights into both core technical challenges and some selected critical applications of satellite remote sensing image analytics

    On-the-fly synthesizer programming with rule learning

    Get PDF
    This manuscript explores automatic programming of sound synthesis algorithms within the context of the performative artistic practice known as live coding. Writing source code in an improvised way to create music or visuals became an instrument the moment affordable computers were able to perform real-time sound synthesis with languages that keep their interpreter running. Ever since, live coding has dealt with real time programming of synthesis algorithms. For that purpose, one possibility is an algorithm that automatically creates variations out of a few presets selected by the user. However, the need for real-time feedback and the small size of the data sets (which can even be collected mid-performance) are constraints that make existing automatic sound synthesizer programmers and learning algorithms unfeasible. Also, the design of such algorithms is not oriented to create variations of a sound but rather to find the synthesizer parameters that match a given one. Other approaches create representations of the space of possible sounds, allowing the user to explore it by means of interactive evolution. Even though these systems are exploratory-oriented, they require longer run-times. This thesis investigates inductive rule learning for on-the-fly synthesizer programming. This approach is conceptually different from those found in both synthesizer programming and live coding literature. Rule models offer interpretability and allow working with the parameter values of the synthesis algorithms (even with symbolic data), making preprocessing unnecessary. RuLer, the proposed learning algorithm, receives a dataset containing user labeled combinations of parameter values of a synthesis algorithm. Among those combinations sharing the same label, it analyses the patterns based on dissimilarity. These patterns are described as an IF-THEN rule model. The algorithm parameters provide control to define what is considered a pattern. As patterns are the base for inducting new parameter settings, the algorithm parameters control the degree of consistency of the inducted settings respect to the original input data. An algorithm (named FuzzyRuLer) able to extend IF-THEN rules to hyperrectangles, which in turn are used as the cores of membership functions, is presented. The resulting fuzzy rule model creates a map of the entire input feature space. For such a pursuit, the algorithm generalizes the logical rules solving the contradictions by following a maximum volume heuristics. Across the manuscript it is discussed how, when machine learning algorithms are used as creative tools, glitches, errors or inaccuracies produced by the resulting models are sometimes desirable as they might offer novel, unpredictable results. The evaluation of the algorithms follows two paths. The first focuses on user tests. The second responds to the fact that this work was carried out within the computer science department and is intended to provide a broader, nonspecific domain evaluation of the algorithms performance using extrinsic benchmarks (i.e not belonging to a synthesizer's domain) for cross validation and minority oversampling. In oversampling tasks, using imbalanced datasets, the algorithm yields state-of-the-art results. Moreover, the synthetic points produced are significantly different from those created by the other algorithms and perform (controlled) exploration of more distant regions. Finally, accompanying the research, various performances, concerts and an album were produced with the algorithms and examples of this thesis. The reviews received and collections where the album has been featured show a positive reception within the community. Together, these evaluations suggest that rule learning is both an effective method and a promising path for further research.Aquest manuscrit explora la programació automàtica d’algorismes de síntesi de so dins del context de la pràctica artística performativa coneguda com a live coding. L'escriptura improvisada de codi font per crear música o visuals es va convertir en un instrument en el moment en què els ordinadors van poder realitzar síntesis de so en temps real amb llenguatges que mantenien el seu intèrpret en funcionament. D'aleshores ençà, el live coding comporta la programació en temps real d’algorismes de síntesi de so. Per a aquest propòsit, una possibilitat és tenir un algorisme que creï automàticament variacions a partir d'alguns presets seleccionats. No obstant, la necessitat de retroalimentació en temps real i la petita mida dels conjunts de dades són restriccions que fan que els programadors automàtics de sintetitzadors de so i els algorismes d’aprenentatge no siguin factibles d’utilitzar. A més, el seu disseny no està orientat a crear variacions d'un so, sinó a trobar els paràmetres del sintetitzador que aplicats a l'algorisme de síntesi produeixen un so determinat (target). Altres enfocaments creen representacions de l'espai de sons possibles, per permetre a l'usuari explorar-lo mitjançant l'evolució interactiva, però requereixen temps més llargs. Aquesta tesi investiga l'aprenentatge inductiu de regles per a la programació on-the-fly de sintetitzadors. Aquest enfocament és conceptualment diferent dels que es troben a la literatura. Els models de regles ofereixen interpretabilitat i permeten treballar amb els valors dels paràmetres dels algorismes de síntesi, sense processament previ. RuLer, l'algorisme d'aprenentatge proposat, rep dades amb combinacions etiquetades per l'usuari dels valors dels paràmetres d'un algorisme de síntesi. A continuació, analitza els patrons, basats en la dissimilitud, entre les combinacions de cada etiqueta. Aquests patrons es descriuen com un model de regles IF-THEN. Els paràmetres de l'algorisme proporcionen control per definir el que es considera un patró. Llavors, controlen el grau de consistència dels nous paràmetres de síntesi induïts respecte a les dades d'entrada originals. A continuació, es presenta un algorisme (FuzzyRuLer) capaç d’estendre les regles IF-THEN a hiperrectangles, que al seu torn s’utilitzen com a nuclis de funcions de pertinença. El model de regles difuses resultant crea un mapa complet de l'espai de la funció d'entrada. Per això, l'algorisme generalitza les regles lògiques seguint una heurística de volum màxim. Al llarg del manuscrit es discuteix com, quan s’utilitzen algorismes d’aprenentatge automàtic com a eines creatives, de vegades són desitjables glitches, errors o imprecisions produïdes pels models resultants, ja que poden oferir nous resultats imprevisibles. L'avaluació dels algorismes segueix dos camins. El primer es centra en proves d'usuari. El segon, que respon al fet que aquest treball es va dur a terme dins del departament de ciències de la computació, pretén proporcionar una avaluació més àmplia, no específica d'un domini, del rendiment dels algorismes mitjançant benchmarks extrínsecs utilitzats per cross-validation i minority oversampling. En tasques d'oversampling, mitjançant imbalanced data sets, l'algorisme proporciona resultats equiparables als de l'estat de l'art. A més, els punts sintètics produïts són significativament diferents als creats pels altres algorismes i realitzen exploracions (controlades) de regions més llunyanesEste manuscrito explora la programación automática de algoritmos de síntesis de sonido dentro del contexto de la práctica artística performativa conocida como live coding. La escritura de código fuente de forma improvisada para crear música o imágenes, se convirtió en un instrumento en el momento en que las computadoras asequibles pudieron realizar síntesis de sonido en tiempo real con lenguajes que mantuvieron su interprete en funcionamiento. Desde entonces, el live coding ha implicado la programación en tiempo real de algoritmos de síntesis. Para ese propósito, una posibilidad es tener un algoritmo que cree automáticamente variaciones a partir de unos pocos presets seleccionados. Sin embargo, la necesidad de retroalimentación en tiempo real y el pequeño tamaño de los conjuntos de datos (que incluso pueden recopilarse durante la misma actuación), limitan el uso de los algoritmos existentes, tanto de programación automática de sintetizadores como de aprendizaje de máquina. Además, el diseño de dichos algoritmos no está orientado a crear variaciones de un sonido, sino a encontrar los parámetros del sintetizador que coincidan con un sonido dado. Otros enfoques crean representaciones del espacio de posibles sonidos, para permitir al usuario explorarlo mediante evolución interactiva. Aunque estos sistemas están orientados a la exploración, requieren tiempos más largos. Esta tesis investiga el aprendizaje inductivo de reglas para la programación de sintetizadores on-the-fly. Este enfoque es conceptualmente diferente de los que se encuentran en la literatura, tanto de programación de sintetizadores como de live coding. Los modelos de reglas ofrecen interpretabilidad y permiten trabajar con los valores de los parámetros de los algoritmos de síntesis (incluso con datos simbólicos), haciendo innecesario el preprocesamiento. RuLer, el algoritmo de aprendizaje propuesto, recibe un conjunto de datos que contiene combinaciones, etiquetadas por el usuario, de valores de parámetros de un algoritmo de síntesis. Luego, analiza los patrones, en función de la disimilitud, entre las combinaciones de cada etiqueta. Estos patrones se describen como un modelo de reglas lógicas IF-THEN. Los parámetros del algoritmo proporcionan el control para definir qué se considera un patrón. Como los patrones son la base para inducir nuevas configuraciones de parámetros, los parámetros del algoritmo controlan también el grado de consistencia de las configuraciones inducidas con respecto a los datos de entrada originales. Luego, se presenta un algoritmo (llamado FuzzyRuLer) capaz de extender las reglas lógicas tipo IF-THEN a hiperrectángulos, que a su vez se utilizan como núcleos de funciones de pertenencia. El modelo de reglas difusas resultante crea un mapa completo del espacio de las clases de entrada. Para tal fin, el algoritmo generaliza las reglas lógicas resolviendo las contradicciones utilizando una heurística de máximo volumen. A lo largo del manuscrito se analiza cómo, cuando los algoritmos de aprendizaje automático se utilizan como herramientas creativas, los glitches, errores o inexactitudes producidas por los modelos resultantes son a veces deseables, ya que pueden ofrecer resultados novedosos e impredecibles. La evaluación de los algoritmos sigue dos caminos. El primero se centra en pruebas de usuario. El segundo, responde al hecho de que este trabajo se llevó a cabo dentro del departamento de ciencias de la computación y está destinado a proporcionar una evaluación más amplia, no de dominio específica, del rendimiento de los algoritmos utilizando beanchmarks extrínsecos para cross-validation y oversampling. En estas últimas pruebas, utilizando conjuntos de datos no balanceados, el algoritmo produce resultados equiparables a los del estado del arte. Además, los puntos sintéticos producidos son significativamente diferentes de los creados por los otros algoritmos y realizan una exploración (controlada) de regiones más distantes. Finalmente, acompañando la investigación, realicé diversas presentaciones, conciertos y un ´álbum utilizando los algoritmos y ejemplos de esta tesis. Las críticas recibidas y las listas donde se ha presentado el álbum muestran una recepción positiva de la comunidad. En conjunto, estas evaluaciones sugieren que el aprendizaje de reglas es al mismo tiempo un método eficaz y un camino prometedor para futuras investigaciones.Postprint (published version

    Advances in Sensors, Big Data and Machine Learning in Intelligent Animal Farming

    Get PDF
    Animal production (e.g., milk, meat, and eggs) provides valuable protein production for human beings and animals. However, animal production is facing several challenges worldwide such as environmental impacts and animal welfare/health concerns. In animal farming operations, accurate and efficient monitoring of animal information and behavior can help analyze the health and welfare status of animals and identify sick or abnormal individuals at an early stage to reduce economic losses and protect animal welfare. In recent years, there has been growing interest in animal welfare. At present, sensors, big data, machine learning, and artificial intelligence are used to improve management efficiency, reduce production costs, and enhance animal welfare. Although these technologies still have challenges and limitations, the application and exploration of these technologies in animal farms will greatly promote the intelligent management of farms. Therefore, this Special Issue will collect original papers with novel contributions based on technologies such as sensors, big data, machine learning, and artificial intelligence to study animal behavior monitoring and recognition, environmental monitoring, health evaluation, etc., to promote intelligent and accurate animal farm management
    corecore