951 research outputs found

    Class-Imbalanced Complementary-Label Learning via Weighted Loss

    Complementary-label learning (CLL) is widely used in weakly supervised classification, but it faces a significant challenge on real-world datasets when the training samples are class-imbalanced. In such scenarios, the number of samples in one class is considerably lower than in the other classes, which reduces prediction accuracy. Existing CLL approaches have not investigated this problem. To address it, we propose a novel problem setting that enables learning from class-imbalanced complementary labels for multi-class classification, together with a novel CLL approach called Weighted Complementary-Label Learning (WCLL). The proposed method models a weighted empirical risk minimization loss by utilizing the class-imbalanced complementary labels, and is also applicable to multi-class imbalanced training samples. Furthermore, we derive an estimation error bound to provide theoretical assurance. To evaluate our approach, we conduct extensive experiments on several widely used benchmark datasets and a real-world dataset, and compare our method with existing state-of-the-art methods. The proposed approach shows significant improvement on these datasets, even in multiple class-imbalanced scenarios. Notably, the proposed method not only utilizes complementary labels to train a classifier but also solves the problem of class imbalance. Comment: 9 pages, 9 figures, 3 tables
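The idea of reweighting a complementary-label loss can be sketched as follows. This is a generic illustration only, not the WCLL loss from the paper: the loss form `-log(1 - p)` for the class a sample is known not to belong to, the inverse-frequency weighting, and the function names `inverse_frequency_weights` and `weighted_complementary_loss` are all assumptions made for this example.

```python
import math
from collections import Counter


def inverse_frequency_weights(comp_labels):
    """Map each complementary-label class to n / (k * count) (assumed scheme)."""
    counts = Counter(comp_labels)
    n, k = len(comp_labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_complementary_loss(probs, comp_labels, weights):
    """Weighted mean of -log(1 - p[comp_label]) over all samples.

    probs[i] is a sequence of class probabilities for sample i;
    comp_labels[i] is a class that sample i does NOT belong to.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    norm = 0.0
    for p, c in zip(probs, comp_labels):
        w = weights[c]
        total += w * -math.log(max(1.0 - p[c], eps))
        norm += w
    return total / norm


# Toy data: class 0 appears three times as a complementary label, class 1 once.
comp_labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(comp_labels)
probs = [[0.5, 0.5]] * 4
loss = weighted_complementary_loss(probs, comp_labels, weights)
```

In this toy case the rarer complementary label receives the larger weight, so its samples contribute more to the averaged loss.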

    Statistical Challenges and Methods for Missing and Imbalanced Data

    Missing data remains a prevalent issue in every area of research. If not carefully handled, missing data can be detrimental to any statistical analysis. Statistical challenges associated with missing data include loss of information, reduced statistical power, and non-generalizability of a study's findings. It is therefore crucial that researchers pay close attention when dealing with missing data. This multi-paper dissertation provides insight into missing data across different fields of study and addresses some of the above-mentioned challenges through simulation studies and applications to real datasets. The first paper addresses the dropout phenomenon in single-cell RNA (scRNA) sequencing through a comparative analysis of some existing scRNA sequencing techniques. The second paper uses simulation studies to assess whether it is appropriate to address non-detects in data using a traditional substitution approach, imputation, or a non-imputation-based approach. The final paper presents an efficient strategy for addressing imbalance in data at any degree (whether moderate or high) by combining random undersampling with different weighting strategies. We conclude generally, based on the findings of this dissertation, that missingness is not always a lack of information but rather something of interest that needs to be investigated
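Combining random undersampling with a weighting strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's procedure: the function name `undersample_with_weights`, the `keep_fraction` parameter, and the choice of weight (original majority count divided by kept count) are all hypothetical.

```python
import random
from collections import defaultdict


def undersample_with_weights(labels, keep_fraction, seed=0):
    """Randomly undersample the largest class and reweight what is kept.

    Returns (kept_indices, weights), where weights[i] belongs to kept_indices[i].
    Kept majority samples get weight original_count / kept_count; others get 1.0.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    majority = max(by_class, key=lambda c: len(by_class[c]))
    original = by_class[majority]
    n_keep = max(1, round(keep_fraction * len(original)))
    kept_majority = set(random.Random(seed).sample(original, n_keep))
    kept, weights = [], []
    for idx, y in enumerate(labels):
        if y != majority:
            kept.append(idx)
            weights.append(1.0)
        elif idx in kept_majority:
            kept.append(idx)
            weights.append(len(original) / n_keep)
    return kept, weights


# Toy data: 8 majority samples (class 0) and 2 minority samples (class 1).
labels = [0] * 8 + [1] * 2
kept, weights = undersample_with_weights(labels, keep_fraction=0.25)
```

With this assumed weighting, the kept majority samples together carry the same total weight as the original majority class, so the training set shrinks without changing the class totals seen by a weighted estimator.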

    Evaluating the benefits of key-value databases for scientific applications

    The convergence of Big Data applications with High-Performance Computing requires new methodologies to store, manage and process large amounts of information. Traditional storage solutions are unable to scale, which results in complex coding strategies. For example, the brain atlas of the Human Brain Project faces the challenge of processing large amounts of high-resolution brain images. Given the computing needs, we study the effects of replacing a traditional storage system with a distributed Key-Value database on a cell segmentation application. The original code uses HDF5 files on GPFS through an intricate interface, imposing synchronizations. By contrast, using Apache Cassandra or ScyllaDB through Hecuba greatly simplifies the application code. Thanks to the Key-Value data model, the number of synchronizations is reduced and the time dedicated to I/O scales when increasing the number of nodes. This project/research has received funding from the European Union's Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement No. 720270 (Human Brain Project SGA1) and the Specific Grant Agreement No. 785907 (Human Brain Project SGA2). This work has also been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), and by Generalitat de Catalunya (contract 2017-SGR-1414). Postprint (author's final draft)

    A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

    Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures for evaluating these algorithms. This work presents a taxonomy of algorithms for imbalanced data streams and proposes a standardized, exhaustive, and informative experimental testbed to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, and real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to the largest experimental study conducted so far in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental testbed is fully reproducible and easy to extend with new methods. In this way we propose the first standardized approach to conducting experiments on imbalanced data streams, which other researchers can use to create trustworthy and fair evaluations of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams

    Image forgery detection using textural features and deep learning

    The exponential growth and advancement of technology have made it quite convenient for people to share visual data, imagery, and video data through a vast range of available platforms. With the rapid development of Internet and multimedia technologies, efficient storage and management, fast transmission and sharing, and real-time analysis and processing of digital media resources have gradually become an indispensable part of many people's work and life. Undoubtedly, such technological growth has made forging visual data relatively easy and realistic without leaving any obvious visual clues. Abuse of such tampered data can deceive the public and spread misinformation amongst the masses. Considering these facts, image forensics must be used to authenticate and maintain the integrity of visual data. For this purpose, we propose a passive image forgery detection technique based on textural and noise inconsistencies introduced in an image by the tampering operation. Moreover, the proposed Image Forgery Detection Network (IFD-Net) uses a Convolution Neural Network (CNN) based architecture to classify the images as forged or pristine. The textural and noise residual patterns are extracted from the images using Local Binary Pattern (LBP) and the Noiseprint model. The images classified as forged are then used to conduct experiments analyzing the difficulties in localizing the forged parts in these images using different deep learning segmentation models. Experimental results show that the IFD-Net performs like other image forgery detection methods on the CASIA v2.0 dataset. The results also discuss the reasons behind the difficulties in segmenting the forged regions in the images of the CASIA v2.0 dataset

    Data Optimization in Deep Learning: A Survey

    Large-scale, high-quality data are considered an essential factor for the successful application of many deep learning techniques. Meanwhile, numerous real-world deep learning tasks still have to contend with the lack of sufficient amounts of high-quality data. Additionally, issues such as model robustness, fairness, and trustworthiness are also closely related to training data. Consequently, a huge number of studies in the existing literature have focused on the data aspect in deep learning tasks. Some typical data optimization techniques include data augmentation, logit perturbation, sample weighting, and data condensation. These techniques usually come from different deep learning divisions, and their theoretical inspirations or heuristic motivations may seem unrelated to each other. This study aims to organize a wide range of existing data optimization methodologies for deep learning from the previous literature and to construct a comprehensive taxonomy for them. The constructed taxonomy considers the diversity of split dimensions, and deep sub-taxonomies are constructed for each dimension. On the basis of the taxonomy, connections among the extensive data optimization methods for deep learning are built in terms of four aspects. We also outline several promising and interesting future directions. The constructed taxonomy and the revealed connections should support a better understanding of existing methods and the design of novel data optimization techniques. Furthermore, our aspiration for this survey is to promote data optimization as an independent subdivision of deep learning. A curated, up-to-date list of resources related to data optimization in deep learning is available at https://github.com/YaoRujing/Data-Optimization

    Surface loss for medical image segmentation

    The last decades have witnessed an unprecedented expansion of medical data in various large-scale and complex systems. Despite many successes on complex medical problems, some challenges remain. Class imbalance is one of the common problems of medical image segmentation. It occurs mostly when there is a severely unequal class distribution, for instance, when the size of the target foreground region is several orders of magnitude smaller than the size of the background region. In such problems, typical loss functions used for convolutional neural network (CNN) segmentation fail to deliver good performance. Widely used losses, e.g., Dice or cross-entropy, are based on regional terms. They assume that all classes are equally distributed. Thus, they tend to favor the majority class and misclassify the target class. To address this issue, the main objective of this work is to build a boundary loss, a distance-based measure on the space of contours rather than regions. We argue that a boundary loss can mitigate the problems of regional losses by introducing complementary distance-based information. Our loss is inspired by discrete (graph-based) optimization techniques for computing gradient flows of curve evolution. Following an integral approach for computing boundary variations, we express a non-symmetric L2 distance on the space of shapes as a regional integral, which completely avoids local differential computations. Our boundary loss is the sum of linear functions of the regional softmax probability outputs of the network. Therefore, it can easily be combined with standard regional losses and implemented with any existing deep network architecture for N-dimensional (N-D) segmentation. Experiments were carried out on three benchmark datasets corresponding to increasingly unbalanced segmentation problems: multi-modal brain tumor segmentation (BRATS17), ischemic stroke lesion (ISLES), and white matter hyperintensities (WMH). Used in conjunction with the region-based generalized Dice loss (GDL), our boundary loss improves performance significantly compared to GDL alone, reaching up to 8% improvement in Dice score and 10% improvement in Hausdorff score. It also yielded a more stable learning process
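The statement that the boundary loss is a sum of linear functions of the softmax outputs can be written out as below. This is a sketch under assumed notation, none of which appears in the abstract: $\Omega$ is the image domain, $s_\theta(q)$ the foreground softmax probability at pixel $q$, $\phi_G(q)$ a signed distance from $q$ to the ground-truth boundary (negative inside the ground-truth region, positive outside), and $\alpha$ a hypothetical weight for combining the term with GDL.

```latex
\mathcal{L}_B(\theta) = \sum_{q \in \Omega} \phi_G(q)\, s_\theta(q),
\qquad
\mathcal{L}(\theta) = \mathcal{L}_{GDL}(\theta) + \alpha\, \mathcal{L}_B(\theta)
```

Because $\phi_G$ is precomputed from the ground truth, each term is linear in the network output, which is why such a loss can be added to a regional loss without changing the architecture.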

    Multi-objective optimization in machine learning (Otimização multi-objetivo em aprendizado de máquina)

    Advisor: Fernando José Von Zuben. Thesis (doctorate) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Abstract: Regularized multinomial logistic regression, multi-label classification, and multi-task learning are examples of machine learning problems in which conflicting objectives, such as losses and regularization penalties, should be simultaneously minimized. Therefore, the narrow perspective of looking for the learning model with the best performance should be replaced by the proposition and further exploration of multiple efficient learning models, each one characterized by a distinct trade-off among the conflicting objectives. Committee machines and a posteriori preferences of the decision-maker may be implemented to properly explore this diverse set of efficient learning models toward performance improvement. The whole multi-objective framework for machine learning is supported by three stages: (1) The multi-objective modelling of each learning problem, explicitly highlighting the conflicting objectives involved; (2) Given the multi-objective formulation of the learning problem, for instance, considering loss functions and penalty terms as conflicting objective functions, efficient solutions well-distributed along the Pareto front are obtained by a deterministic and exact solver named NISE (Non-Inferior Set Estimation); (3) Those efficient learning models are then subject to a posteriori model selection, or to ensemble filtering and aggregation. Given that NISE is restricted to two objective functions, an extension for many objectives, named MONISE (Many Objective NISE), is also proposed here, being an additional contribution and expanding the applicability of the proposed framework. To properly assess the merit of our multi-objective approach, more specific investigations were conducted, restricted to regularized linear learning models: (1) What is the relative merit of the a posteriori selection of a single learning model, among the ones produced by our proposal, when compared with other single-model approaches in the literature? (2) Is the diversity level of the learning models produced by our proposal higher than the diversity level achieved by alternative approaches devoted to generating multiple learning models? (3) What about the prediction quality of ensemble filtering and aggregation of the learning models produced by our proposal on: (i) multi-class classification, (ii) unbalanced classification, (iii) multi-label classification, (iv) multi-task learning, (v) multi-view learning? The deterministic nature of NISE and MONISE, their ability to properly deal with the shape of the Pareto front in each learning problem, and the guarantee of always obtaining efficient learning models are advocated here as being responsible for the promising results achieved in all three of these specific investigations. Doctorate. Computer Engineering. Doctor in Electrical Engineering. 2014/13533-0. FAPES