
    A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data

    Class imbalance presents a major hurdle in the application of classification methods. A commonly taken approach is to learn ensembles of classifiers using rebalanced data. Examples include bootstrap averaging (bagging) combined with either undersampling or oversampling of the minority class examples. However, rebalancing methods entail asymmetric changes to the examples of different classes, which in turn can introduce their own biases. Furthermore, these methods often require specifying the performance measure of interest a priori, i.e., before learning. An alternative is to employ the threshold-moving technique, which applies a threshold to the continuous output of a model, offering the possibility to adapt to a performance measure a posteriori, i.e., a plug-in method. Surprisingly, little attention has been paid to this combination of a bagging ensemble and threshold-moving. In this paper, we study this combination and demonstrate its competitiveness. In contrast to other resampling methods, we preserve the natural class distribution of the data, resulting in well-calibrated posterior probabilities. Additionally, we extend the proposed method to handle multiclass data. We validate our method on binary and multiclass benchmark data sets using both decision trees and neural networks as base classifiers. We perform analyses that provide insights into the proposed method. Keywords: Imbalanced data; Binary classification; Multiclass classification; Bagging ensembles; Resampling; Posterior calibration. Funding: Burroughs Wellcome Fund (Grant 103811AI
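
    The plug-in combination described above can be sketched with standard tooling. The following is a minimal illustration, not the authors' exact implementation: a bagging ensemble of decision trees is trained on the natural (imbalanced) class distribution, and the decision threshold is then chosen a posteriori to maximise a metric of interest. The dataset, base classifier, and metric (F1) are illustrative assumptions.

        # Minimal sketch of bagging + plug-in threshold moving (illustrative only).
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import BaggingClassifier
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import f1_score

        X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        # Bagging on the natural class distribution (no under- or oversampling).
        ens = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
        ens.fit(X_tr, y_tr)

        # Plug-in step: move the threshold on the posterior probability to maximise
        # the performance measure of interest (here F1), chosen after training.
        proba_tr = ens.predict_proba(X_tr)[:, 1]
        grid = np.linspace(0.05, 0.95, 19)
        best_t = max(grid, key=lambda t: f1_score(y_tr, proba_tr >= t))

        y_pred = (ens.predict_proba(X_te)[:, 1] >= best_t).astype(int)
        print(f"threshold={best_t:.2f}  test F1={f1_score(y_te, y_pred):.3f}")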

    Verifiability as a Complement to AI Explainability: A Conceptual Proposal

    Recent advances in the field of artificial intelligence (AI) are providing automated and in many cases improved decision-making. However, even very reliable AI systems can go terribly wrong without human users understanding the reason for it. Against this background, there are now widespread calls for models of “explainable AI”. In this paper we point out some inherent problems of this concept and argue that explainability alone is probably not the solution. We therefore propose another approach as a complement, which we call “verifiability”. In essence, it is about designing AI so that it makes available multiple verifiable predictions (given a ground truth) in addition to the one desired prediction that cannot be verified because the ground truth is missing. Such verifiable AI could help to further minimize serious mistakes despite a lack of explainability, increase trustworthiness, and in turn improve the societal acceptance of AI.

    Julearn: an easy-to-use library for leakage-free evaluation and inspection of ML models

    The fast-paced development of machine learning (ML) methods, coupled with its increasing adoption in research, poses challenges for researchers without extensive training in ML. In neuroscience, for example, ML can help understand brain-behavior relationships, diagnose diseases, and develop biomarkers using various data sources like magnetic resonance imaging and electroencephalography. The primary objective of ML is to build models that can make accurate predictions on unseen data. Researchers aim to prove the existence of such generalizable models by evaluating performance using techniques such as cross-validation (CV), which uses systematic subsampling to estimate the generalization performance. Choosing a CV scheme and evaluating an ML pipeline can be challenging and, if done improperly, can lead to overestimated results and incorrect interpretations. We created julearn, an open-source Python library that allows researchers to design and evaluate complex ML pipelines without falling into common pitfalls. In this manuscript, we present the rationale behind julearn's design and its core features, and showcase three examples of previously published research projects that can be easily implemented using this novel library. Julearn aims to simplify entry into the ML world by providing an easy-to-use environment with built-in guards against some of the most common ML pitfalls. With its design, unique features, and simple interface, it serves as a useful Python-based library for research projects. Comment: 13 pages, 5 figures
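
    The leakage-free principle that julearn automates can be illustrated with plain scikit-learn rather than julearn's own interface: preprocessing is placed inside the pipeline so that it is re-fit on each training fold of the cross-validation and never sees the held-out data. The dataset and estimator below are illustrative assumptions, not examples from the paper.

        # Leakage-free evaluation: the scaler is part of the pipeline, so it is fitted
        # only on the training portion of each CV fold (the principle julearn enforces;
        # this is plain scikit-learn, not julearn's API).
        from sklearn.datasets import load_breast_cancer
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_validate

        X, y = load_breast_cancer(return_X_y=True)
        pipe = make_pipeline(StandardScaler(), SVC())
        scores = cross_validate(pipe, X, y, cv=5, scoring="roc_auc")
        print(scores["test_score"].mean())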

    A Too-Good-to-be-True Prior to Reduce Shortcut Reliance

    Despite their impressive performance in object recognition and other tasks under standard testing conditions, deep networks often fail to generalize to out-of-distribution (o.o.d.) samples. One cause for this shortcoming is that modern architectures tend to rely on "shortcuts": superficial features that correlate with categories without capturing deeper invariants that hold across contexts. Real-world concepts often possess a complex structure that can vary superficially across contexts, which can make the most intuitive and promising solutions in one context fail to generalize to others. One potential way to improve o.o.d. generalization is to assume simple solutions are unlikely to be valid across contexts and avoid them, which we refer to as the too-good-to-be-true prior. A low-capacity network (LCN) with a shallow architecture should only be able to learn surface relationships, including shortcuts. We find that LCNs can serve as shortcut detectors. Furthermore, an LCN's predictions can be used in a two-stage approach to encourage a high-capacity network (HCN) to rely on deeper invariant features that should generalize broadly. In particular, items that the LCN can master are downweighted when training the HCN. Using a modified version of the CIFAR-10 dataset in which we introduced shortcuts, we found that the two-stage LCN-HCN approach reduced reliance on shortcuts and facilitated o.o.d. generalization.
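
    A minimal sketch of the two-stage idea, using synthetic data: a low-capacity network is trained first, and training items it already masters (those most likely to be carried by shortcuts) receive lower weight in the loss when training the high-capacity network. The architectures, weighting rule, and data below are illustrative assumptions rather than the paper's exact setup.

        # Two-stage LCN-HCN sketch with synthetic data (illustrative, not the paper's code).
        import torch
        import torch.nn as nn

        X = torch.randn(512, 32)
        y = torch.randint(0, 2, (512,))

        lcn = nn.Linear(32, 2)                                                 # low-capacity network
        hcn = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))    # high-capacity network

        def train(model, weights=None, epochs=50):
            opt = torch.optim.Adam(model.parameters(), lr=1e-2)
            loss_fn = nn.CrossEntropyLoss(reduction="none")                    # per-item losses
            for _ in range(epochs):
                opt.zero_grad()
                per_item = loss_fn(model(X), y)
                loss = (per_item * weights).mean() if weights is not None else per_item.mean()
                loss.backward()
                opt.step()

        # Stage 1: train the LCN and record how confidently it fits each training item.
        train(lcn)
        with torch.no_grad():
            p_correct = torch.softmax(lcn(X), dim=1)[torch.arange(len(y)), y]

        # Stage 2: items the LCN has mastered are downweighted when training the HCN.
        train(hcn, weights=1.0 - p_correct)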

    A Connectivity-Based Psychometric Prediction Framework for Brain-Behavior Relationship Studies.

    The recent availability of population-based studies with neuroimaging and behavioral measurements opens promising perspectives to investigate the relationships between interindividual variability in brain regions' connectivity and behavioral phenotypes. However, the multivariate nature of connectivity-based prediction models severely limits the insight into brain-behavior patterns for neuroscience. To address this issue, we propose a connectivity-based psychometric prediction framework based on individual regions' connectivity profiles. We first illustrate two main applications: 1) a single brain region's predictive power for a range of psychometric variables and 2) a single psychometric variable's predictive power variation across brain regions. We compare the patterns of brain-behavior relationships provided by these approaches to the brain-behavior relationships obtained from activation approaches. Then, capitalizing on the increased transparency of our approach, we demonstrate how various data processing and analysis choices can directly influence the patterns of brain-behavior relationships, as well as the unique insight into brain-behavior relationships offered by this approach.
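
    The region-wise logic of such a framework can be sketched as follows (synthetic data, illustrative model): each brain region's connectivity profile to all other regions serves as the feature set in a cross-validated regression predicting one psychometric score, yielding a per-region map of predictive power. The dimensions and the choice of ridge regression below are assumptions for illustration.

        # Region-wise psychometric prediction sketch (synthetic data, illustrative only).
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import cross_val_score

        n_subjects, n_regions = 200, 100
        connectivity = np.random.randn(n_subjects, n_regions, n_regions)  # subject x region x region
        score = np.random.randn(n_subjects)                               # one psychometric variable

        region_r2 = []
        for r in range(n_regions):
            profile = connectivity[:, r, :]        # region r's connectivity profile per subject
            r2 = cross_val_score(Ridge(alpha=1.0), profile, score, cv=5, scoring="r2").mean()
            region_r2.append(r2)

        print("most predictive region index:", int(np.argmax(region_r2)))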

    The PhyloPythiaS Web Server for Taxonomic Assignment of Metagenome Sequences

    Metagenome sequencing is becoming common and there is an increasing need for easily accessible tools for data analysis. An essential step is the taxonomic classification of sequence fragments. We describe a web server for the taxonomic assignment of metagenome sequences with PhyloPythiaS. PhyloPythiaS is a fast and accurate sequence composition-based classifier that utilizes the hierarchical relationships between clades. Taxonomic assignments with the web server can be made with a generic model, or with sample-specific models that users can specify and create. Several interactive visualization modes and multiple download formats allow quick and convenient analysis and downstream processing of taxonomic assignments. Here, we demonstrate usage of our web server by taxonomic assignment of metagenome samples from an acidophilic biofilm community of an acid mine and of a microbial community from cow rumen

    Neurobiological Divergence of the Positive and Negative Schizophrenia Subtypes Identified on a New Factor Structure of Psychopathology Using Non-negative Factorization:An International Machine Learning Study

    Objective: Disentangling psychopathological heterogeneity in schizophrenia is challenging and previous results remain inconclusive. We employed advanced machine learning to identify a stable and generalizable factorization of the Positive and Negative Syndrome Scale (PANSS), and used it to identify psychopathological subtypes as well as their neurobiological differentiation. Methods: PANSS data from the Pharmacotherapy Monitoring and Outcome Survey cohort (1545 patients, 586 followed up after 1.35±0.70 years) were used for learning the factor structure by an orthonormal projective non-negative factorization. An international sample, pooled from nine medical centers across Europe, the USA, and Asia (490 patients), was used for validation. Patients were clustered into psychopathological subtypes based on the identified factor structure, and the neurobiological divergence between the subtypes was assessed by classification analysis on functional MRI connectivity patterns. Results: A four-factor structure representing negative, positive, affective, and cognitive symptoms was identified as the most stable and generalizable representation of psychopathology. It showed higher internal consistency than the original PANSS subscales and previously proposed factor models. Based on this representation, the positive and negative subtypes were confirmed as the only robust psychopathological subtypes, and these subtypes were longitudinally stable in about 80% of the repeatedly assessed patients. Finally, the individual subtype could be predicted with good accuracy from functional connectivity profiles of the ventromedial frontal cortex, temporoparietal junction, and precuneus. Conclusions: Machine learning applied to multi-site data with cross-validation yielded a factorization generalizable across populations and medical systems. Together with subtyping and the demonstrated ability to predict subtype membership from neuroimaging data, this work further disentangles the heterogeneity in schizophrenia.
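
    As a rough illustration of factorizing item-level symptom ratings into a small number of non-negative factors, the sketch below applies scikit-learn's standard NMF to synthetic 30-item ratings. Note that the study uses an orthonormal projective non-negative factorization, which this generic decomposition does not reproduce; the data, component number, and solver settings here are assumptions.

        # Generic non-negative factorization sketch (synthetic ratings; not the study's
        # orthonormal projective variant).
        import numpy as np
        from sklearn.decomposition import NMF

        ratings = np.random.randint(1, 8, size=(300, 30)).astype(float)  # patients x items (1-7 scale)

        nmf = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
        patient_scores = nmf.fit_transform(ratings)   # patients x factors
        item_loadings = nmf.components_               # factors x items (the factor structure)
        print(item_loadings.shape)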

    Enhancing Cognitive Performance Prediction through White Matter Hyperintensity Connectivity Assessment: A Multicenter Lesion Network Mapping Analysis of 3,485 Memory Clinic Patients

    INTRODUCTION: White matter hyperintensities of presumed vascular origin (WMH) are associated with cognitive impairment and are a key imaging marker in evaluating cognitive health. However, WMH volume alone does not fully account for the extent of cognitive deficits, and the mechanisms linking WMH to these deficits remain unclear. We propose that lesion network mapping (LNM), which infers whether brain networks are connected to lesions, could be a promising technique for enhancing our understanding of the role of WMH in cognitive disorders. Our study employed this approach to test the following hypotheses: (1) LNM-informed markers surpass WMH volumes in predicting cognitive performance, and (2) WMH contributing to cognitive impairment map to specific brain networks. METHODS & RESULTS: We analyzed cross-sectional data of 3,485 patients from 10 memory clinic cohorts within the Meta VCI Map Consortium, using harmonized test results in 4 cognitive domains and WMH segmentations. WMH segmentations were registered to a standard space and mapped onto existing normative structural and functional brain connectome data. We employed LNM to quantify WMH connectivity across 480 atlas-based gray and white matter regions of interest (ROI), resulting in ROI-level structural and functional LNM scores. The capacity of total and regional WMH volumes and LNM scores to predict cognitive function was compared using ridge regression models in a nested cross-validation. LNM scores predicted performance in three cognitive domains (attention and executive function, information processing speed, and verbal memory) significantly better than WMH volumes. LNM scores did not improve prediction for language functions. ROI-level analysis revealed that higher LNM scores, representing greater disruptive effects of WMH on regional connectivity, in gray and white matter regions of the dorsal and ventral attention networks were associated with lower cognitive performance. CONCLUSION: Measures of WMH-related brain network connectivity significantly improve the prediction of current cognitive performance in memory clinic patients compared to WMH volume as a traditional imaging marker of cerebrovascular disease. This highlights the crucial role of network effects, particularly in attention-related brain regions, in improving our understanding of vascular contributions to cognitive impairment. Moving forward, refining WMH information with connectivity data could contribute to patient-tailored therapeutic interventions and facilitate the identification of subgroups at risk of cognitive disorders.
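
    A simplified sketch of the predictor comparison described above, using synthetic data: ROI-level LNM scores versus total WMH volume as predictors of a cognitive score, each evaluated with cross-validated ridge regression (the inner loop selects the ridge penalty, the outer loop estimates performance). Sample sizes, features, and the penalty grid are illustrative assumptions.

        # Ridge regression comparison of LNM scores vs. WMH volume (synthetic data, illustrative).
        import numpy as np
        from sklearn.linear_model import RidgeCV
        from sklearn.model_selection import KFold, cross_val_score

        n_patients, n_rois = 500, 480
        lnm_scores = np.random.randn(n_patients, n_rois)   # ROI-level LNM scores
        wmh_volume = np.random.rand(n_patients, 1)         # total WMH volume
        cognition = np.random.randn(n_patients)            # harmonized cognitive score

        outer = KFold(n_splits=5, shuffle=True, random_state=0)
        ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))     # inner selection of the penalty

        for name, X in [("LNM scores", lnm_scores), ("WMH volume", wmh_volume)]:
            r2 = cross_val_score(ridge, X, cognition, cv=outer, scoring="r2").mean()
            print(name, "mean R2:", round(r2, 3))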

    Global age-sex-specific mortality, life expectancy, and population estimates in 204 countries and territories and 811 subnational locations, 1950–2021, and the impact of the COVID-19 pandemic: a comprehensive demographic analysis for the Global Burden of Disease Study 2021

    Background: Estimates of demographic metrics are crucial to assess levels and trends of population health outcomes. The profound impact of the COVID-19 pandemic on populations worldwide has underscored the need for timely estimates to understand this unprecedented event within the context of long-term population health trends. The Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2021 provides new demographic estimates for 204 countries and territories and 811 additional subnational locations from 1950 to 2021, with a particular emphasis on changes in mortality and life expectancy that occurred during the 2020–21 COVID-19 pandemic period. Methods: 22 223 data sources from vital registration, sample registration, surveys, censuses, and other sources were used to estimate mortality, with a subset of these sources used exclusively to estimate excess mortality due to the COVID-19 pandemic. 2026 data sources were used for population estimation. Additional sources were used to estimate migration; the effects of the HIV epidemic; and demographic discontinuities due to conflicts, famines, natural disasters, and pandemics, which are used as inputs for estimating mortality and population. Spatiotemporal Gaussian process regression (ST-GPR) was used to generate under-5 mortality rates, which synthesised 30 763 location-years of vital registration and sample registration data, 1365 surveys and censuses, and 80 other sources. ST-GPR was also used to estimate adult mortality (between ages 15 and 59 years) based on information from 31 642 location-years of vital registration and sample registration data, 355 surveys and censuses, and 24 other sources. Estimates of child and adult mortality rates were then used to generate life tables with a relational model life table system. For countries with large HIV epidemics, life tables were adjusted using independent estimates of HIV-specific mortality generated via an epidemiological analysis of HIV prevalence surveys, antenatal clinic serosurveillance, and other data sources. Excess mortality due to the COVID-19 pandemic in 2020 and 2021 was determined by subtracting observed all-cause mortality (adjusted for late registration and mortality anomalies) from the mortality expected in the absence of the pandemic. Expected mortality was calculated based on historical trends using an ensemble of models. In location-years where all-cause mortality data were unavailable, we estimated excess mortality rates using a regression model with covariates pertaining to the pandemic. Population size was computed using a Bayesian hierarchical cohort component model. Life expectancy was calculated using age-specific mortality rates and standard demographic methods. Uncertainty intervals (UIs) were calculated for every metric using the 25th and 975th ordered values from a 1000-draw posterior distribution. Findings: Global all-cause mortality followed two distinct patterns over the study period: age-standardised mortality rates declined between 1950 and 2019 (a 62·8% [95% UI 60·5–65·1] decline), and increased during the COVID-19 pandemic period (2020–21; 5·1% [0·9–9·6] increase). In contrast with the overall reversal in mortality trends during the pandemic period, child mortality continued to decline, with 4·66 million (3·98–5·50) global deaths in children younger than 5 years in 2021 compared with 5·21 million (4·50–6·01) in 2019.
An estimated 131 million (126–137) people died globally from all causes in 2020 and 2021 combined, of which 15·9 million (14·7–17·2) were due to the COVID-19 pandemic (measured by excess mortality, which includes deaths directly due to SARS-CoV-2 infection and those indirectly due to other social, economic, or behavioural changes associated with the pandemic). Excess mortality rates exceeded 150 deaths per 100 000 population during at least one year of the pandemic in 80 countries and territories, whereas 20 nations had a negative excess mortality rate in 2020 or 2021, indicating that all-cause mortality in these countries was lower during the pandemic than expected based on historical trends. Between 1950 and 2021, global life expectancy at birth increased by 22·7 years (20·8–24·8), from 49·0 years (46·7–51·3) to 71·7 years (70·9–72·5). Global life expectancy at birth declined by 1·6 years (1·0–2·2) between 2019 and 2021, reversing historical trends. An increase in life expectancy was only observed in 32 (15·7%) of 204 countries and territories between 2019 and 2021. The global population reached 7·89 billion (7·67–8·13) people in 2021, by which time the populations of 56 of 204 countries and territories had peaked and subsequently declined. The largest proportion of population growth between 2020 and 2021 was in sub-Saharan Africa (39·5% [28·4–52·7]) and south Asia (26·3% [9·0–44·7]). From 2000 to 2021, the ratio of the population aged 65 years and older to the population aged younger than 15 years increased in 188 (92·2%) of 204 nations. Interpretation: Global adult mortality rates markedly increased during the COVID-19 pandemic in 2020 and 2021, reversing past decreasing trends, while child mortality rates continued to decline, albeit more slowly than in earlier years. Although COVID-19 had a substantial impact on many demographic indicators during the first 2 years of the pandemic, overall global health progress over the 72 years evaluated has been profound, with considerable improvements in mortality and life expectancy. Additionally, we observed a deceleration of global population growth since 2017, despite steady or increasing growth in lower-income countries, combined with a continued global shift of population age structures towards older ages. These demographic changes will likely present future challenges to health systems, economies, and societies. The comprehensive demographic estimates reported here will enable researchers, policy makers, health practitioners, and other key stakeholders to better understand and address the profound changes that have occurred in the global health landscape following the first 2 years of the COVID-19 pandemic, and longer-term trends beyond the pandemic.
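
    The excess-mortality definition used in the methods above can be written compactly (notation ours, not the study's): for a location l and pandemic year t in {2020, 2021},

        \[
        \text{Excess deaths}_{l,t} = \text{Observed all-cause deaths}_{l,t} - \text{Expected deaths}_{l,t},
        \]

    where expected deaths come from an ensemble of models fitted to historical, pre-pandemic trends, and a negative value indicates lower-than-expected all-cause mortality in that location-year.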