8 research outputs found

    Why Does Rebalancing Class-unbalanced Data Improve AUC for Linear Discriminant Analysis?

    Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, by oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.
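
    The two ingredients named in this abstract, rebalancing by oversampling and the AUC, can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names `oversample_minority` and `auc` are hypothetical, labels are assumed to be 0/1, and both classes are assumed to be non-empty. It does not train LDA or reproduce the paper's asymptotic result.

```python
import random

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size.

    Assumes labels are 0/1 and that both classes are present (illustrative only).
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw minority indices with replacement to close the size gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = pos + neg + extra
    return [X[i] for i in idx], [y[i] for i in idx]

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

    In the setting the abstract describes, the rebalanced data would be used only for training, while `auc` would be evaluated on the original, unbalanced test data.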

    NutriFD: Proving the medicinal value of food nutrition based on food-disease association and treatment networks

    There is rising evidence of the health benefits associated with specific dietary interventions. Current food-disease databases focus on associations and treatment relationships but do not provide a reasonable assessment of the strength of those relationships, and they pay little attention to food nutrition. There is an unmet need for a large database that can guide dietary therapy. We fill the gap with NutriFD, a scoring network based on associations and therapeutic relationships between foods and diseases. NutriFD integrates 9 databases covering foods, nutrients, diseases, genes, miRNAs, compounds, disease ontology and their relationships. To the best of our knowledge, this database is the only one that can score the associations and therapeutic relationships between everyday foods and diseases by weighting inference scores of food compounds to diseases. In addition, NutriFD demonstrates the predictive nature of nutrients on the therapeutic relationships between foods and diseases through machine learning models, laying the foundation for a mechanistic understanding of food therapy.

    Deep ROC Analysis and AUC as Balanced Average Accuracy to Improve Model Selection, Understanding and Interpretation

    Optimal performance is critical for decision-making tasks from medicine to autonomous driving; however, common performance measures may be too general or too specific. For binary classifiers, diagnostic tests or prognosis at a timepoint, measures such as the area under the receiver operating characteristic curve, or the area under the precision recall curve, are too general because they include unrealistic decision thresholds. On the other hand, measures such as accuracy, sensitivity or the F1 score are measures at a single threshold that reflect an individual single probability or predicted risk, rather than a range of individuals or risk. We propose a method in between, deep ROC analysis, that examines groups of probabilities or predicted risks for more insightful analysis. We translate esoteric measures into familiar terms: AUC and the normalized concordant partial AUC are balanced average accuracy (a new finding); the normalized partial AUC is average sensitivity; and the normalized horizontal partial AUC is average specificity. Along with post-test measures, we provide a method that can improve model selection in some cases and provide interpretation and assurance for patients in each risk group. We demonstrate deep ROC analysis in two case studies and provide a toolkit in Python. Comment: 14 pages, 6 figures, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), currently under review.
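
    The statement that the normalized partial AUC is average sensitivity can be illustrated with a small sketch. It is not the toolkit mentioned in the abstract: `roc_points` and `average_sensitivity` are hypothetical names, the ROC curve is assumed to be linearly interpolated between points, and the normalization is assumed to be division by the width of the false-positive-rate range.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR) from (0, 0) to (1, 1), grouping tied scores."""
    P = sum(labels)
    N = len(labels) - P
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    pts = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Move the threshold past every instance sharing this score.
        while i < len(pairs) and pairs[i][0] == s:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        pts.append((fp / N, tp / P))
    return pts

def average_sensitivity(pts, lo, hi):
    """Partial AUC over FPR in [lo, hi], divided by the range width (hi - lo)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        a, b = max(x0, lo), min(x1, hi)
        if b <= a:
            continue  # vertical segment, or segment outside the range
        slope = (y1 - y0) / (x1 - x0)
        ya = y0 + slope * (a - x0)
        yb = y0 + slope * (b - x0)
        area += 0.5 * (ya + yb) * (b - a)  # trapezoid clipped to [lo, hi]
    return area / (hi - lo)
```

    Over the full range `lo=0, hi=1` the function returns the ordinary AUC; over a narrower range it returns the mean height of the ROC curve (the sensitivity) for that group of false positive rates.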

    Aplicação de machine learning para previsão de inadimplência (Application of machine learning to default prediction)

    This work applies machine learning algorithms to predict customer default at a Brazilian retail company and to identify the main variables related to default. The performance of the K-Nearest Neighbors, Random Forest, Symbolic Regression and Support Vector Machine algorithms was compared, along with the class-balancing techniques SMOTE and IHT. Variable selection techniques and cross-validation were also used. All of the work was developed in the Python programming language. Based on the measurement and analysis of several performance metrics, the combination that produced the best predictions was the Random Forest algorithm with the SMOTE class-balancing technique.
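
    The core idea behind SMOTE, the balancing technique named in this abstract, is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates that idea only; it is not the implementation used in the work, and the name `smote_sketch` and its parameters are hypothetical. It assumes at least two minority points with numeric features.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours.

    minority: list of numeric feature lists (at least two points, assumed).
    k: number of nearest minority neighbours to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic
```

    Each synthetic point lies on the line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.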

    Why Does Rebalancing Class-Unbalanced Data Improve AUC for Linear Discriminant Analysis?
