Why Does Rebalancing Class-unbalanced Data Improve AUC for Linear Discriminant Analysis?
Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, by oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.
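The relationship described above can be illustrated with a small population-level calculation. The sketch below is not the note's own derivation: it assumes two Gaussian classes with unequal covariances, assumes that the pooled LDA covariance is asymptotically the class-proportion-weighted average of the two class covariances, and uses made-up means and covariances. Under those assumptions the AUC of the LDA projection has a closed form, and it can be compared for an unbalanced and a fully rebalanced training mix.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def quad(m, v):
    """Quadratic form v' m v."""
    mv = matvec(m, v)
    return v[0] * mv[0] + v[1] * mv[1]

def lda_population_auc(pi_minority, mu0, mu1, cov0, cov1):
    """AUC of the LDA projection when the training mix has minority share pi_minority.

    Assumption: the pooled covariance is (1 - pi) * cov0 + pi * cov1.
    """
    pooled = [[(1 - pi_minority) * cov0[i][j] + pi_minority * cov1[i][j]
               for j in range(2)] for i in range(2)]
    delta = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = matvec(inv2(pooled), delta)          # LDA direction
    mean_gap = w[0] * delta[0] + w[1] * delta[1]
    spread = math.sqrt(quad(cov0, w) + quad(cov1, w))
    # AUC = P(w.X1 > w.X0) for independent Gaussian draws from each class.
    return normal_cdf(mean_gap / spread)

# Hypothetical class parameters, chosen only for illustration.
MU0, MU1 = [0.0, 0.0], [1.0, 1.0]
COV0 = [[1.0, 0.0], [0.0, 1.0]]
COV1 = [[4.0, 0.0], [0.0, 0.25]]

auc_unbalanced = lda_population_auc(0.05, MU0, MU1, COV0, COV1)
auc_balanced = lda_population_auc(0.50, MU0, MU1, COV0, COV1)
print(f"minority share 5%:  AUC = {auc_unbalanced:.4f}")
print(f"minority share 50%: AUC = {auc_balanced:.4f}")
```

With equal class covariances the pooled matrix does not depend on the mix, so in this sketch the two AUC values would coincide; the gap appears only because the covariances differ.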
NutriFD: Proving the medicinal value of food nutrition based on food-disease association and treatment networks
There is growing evidence of the health benefits associated with specific
dietary interventions. Current food-disease databases focus on associations and
treatment relationships but do not provide a reasonable assessment of the
strength of those relationships, and they pay little attention to food
nutrition. There is an unmet need for a large database that can guide dietary
therapy. We fill the gap with NutriFD, a scoring network based on associations
and therapeutic relationships between foods and diseases. NutriFD integrates 9
databases covering foods, nutrients, diseases, genes, miRNAs, compounds, disease
ontology and their relationships. To the best of our knowledge, this database is
the only one that can score the associations and therapeutic relationships of
everyday foods and diseases by weighting inference scores of food compounds to
diseases. In addition, NutriFD demonstrates the predictive nature of nutrients
on the therapeutic relationships between foods and diseases through machine
learning models, laying the foundation for a mechanistic understanding of food
therapy.
Deep ROC Analysis and AUC as Balanced Average Accuracy to Improve Model Selection, Understanding and Interpretation
Optimal performance is critical for decision-making tasks from medicine to
autonomous driving; however, common performance measures may be too general or
too specific. For binary classifiers, diagnostic tests or prognosis at a
timepoint, measures such as the area under the receiver operating
characteristic curve, or the area under the precision recall curve, are too
general because they include unrealistic decision thresholds. On the other
hand, measures such as accuracy, sensitivity or the F1 score are measures at a
single threshold that reflect an individual single probability or predicted
risk, rather than a range of individuals or risk. We propose a method in
between, deep ROC analysis, that examines groups of probabilities or predicted
risks for more insightful analysis. We translate esoteric measures into
familiar terms: AUC and the normalized concordant partial AUC are balanced
average accuracy (a new finding); the normalized partial AUC is average
sensitivity; and the normalized horizontal partial AUC is average specificity.
Along with post-test measures, we provide a method that can improve model
selection in some cases and provide interpretation and assurance for patients
in each risk group. We demonstrate deep ROC analysis in two case studies and
provide a toolkit in Python.
Comment: 14 pages, 6 figures, submitted to IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), currently under review
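The contrast drawn in this abstract between a whole-curve measure and single-threshold measures can be made concrete with a few lines of plain Python. This is only an illustrative sketch, not the paper's deep ROC toolkit: the function names, scores, labels, and the 0.5 threshold are all invented for the example. AUC is computed as the fraction of positive-negative pairs ranked correctly (ties count one half), while sensitivity, specificity, and accuracy are evaluated at one chosen threshold.

```python
def auc_concordance(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def threshold_measures(scores, labels, threshold):
    """Sensitivity, specificity, and accuracy at a single decision threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical predicted risks and true labels (1 = positive class).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

auc = auc_concordance(scores, labels)
at_half = threshold_measures(scores, labels, threshold=0.5)
print("AUC over all thresholds:", auc)
print("Measures at threshold 0.5:", at_half)
```

Changing the threshold changes the second result but not the first, which is the gap between "too general" and "too specific" that the abstract describes.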
Application of machine learning for default prediction
This work applies machine learning algorithms to predict customer default at a Brazilian retail company and to identify the main variables related to default. The performance of the K-Nearest Neighbors, Random Forest, Symbolic Regression, and Support Vector Machine algorithms was compared, along with the class-balancing techniques SMOTE and IHT. Feature selection and cross-validation techniques were also used. All of the work was developed in the Python programming language. Based on the measurement and analysis of several performance metrics, the combination that produced the best predictions was the Random Forest algorithm with the SMOTE class-balancing technique.