    Multiple-source cross-validation

    Cross-validation is an essential tool in machine learning and statistics. The typical procedure, in which data points are randomly assigned to one of the test sets, makes an implicit assumption that the data are exchangeable. A common case in which this does not hold is when the data come from multiple sources, in the sense used in transfer learning. In this case it is common to arrange the cross-validation procedure in a way that takes the source structure into account. Although common in practice, this procedure does not appear to have been theoretically analysed. We present new estimators of the variance of the cross-validation estimate, both in the multiple-source setting and in the standard iid setting. These new estimators allow much more accurate confidence intervals and hypothesis tests for comparing algorithms.
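
    The source-aware arrangement described here can be sketched with scikit-learn's GroupKFold, which keeps every point from one source in the same fold. A minimal sketch, assuming synthetic data; the paper's variance estimators themselves are not reproduced:

```python
# A minimal sketch of source-aware cross-validation, assuming synthetic data
# and using scikit-learn's GroupKFold so that no source is split across a
# training set and its test fold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)
sources = rng.integers(0, 6, size=300)  # each point comes from one of 6 sources

scores = cross_val_score(LogisticRegression(), X, y,
                         groups=sources, cv=GroupKFold(n_splits=6))
# The naive standard error below ignores correlations between folds, which
# is exactly the issue the paper's variance estimators address.
print(scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores)))
```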

    Penalized Clustering of Large Scale Functional Data with Multiple Covariates

    In this article, we propose a penalized clustering method for large scale data with multiple covariates through a functional data approach. In the proposed method, responses and covariates are linked together through nonparametric multivariate functions (fixed effects), which have great flexibility in modeling a variety of function features, such as jump points, branching, and periodicity. Functional ANOVA is employed to further decompose multivariate functions in a reproducing kernel Hilbert space and provide associated notions of main effect and interaction. Parsimonious random effects are used to capture various correlation structures. The mixed-effect models are nested under a general mixture model, in which the heterogeneity of functional data is characterized. We propose a penalized Henderson's likelihood approach for model fitting and design a rejection-controlled EM algorithm for estimation. Our method selects smoothing parameters through generalized cross-validation. Furthermore, Bayesian confidence intervals are used to measure the clustering uncertainty. Simulation studies and real-data examples are presented to investigate the empirical performance of the proposed method. Open-source code is available in the R package MFDA.
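
    As a rough illustration of the generalized cross-validation (GCV) criterion used for smoothing-parameter selection, the sketch below applies the standard GCV formula to a ridge-type linear smoother; this is a generic stand-in, not the MFDA implementation:

```python
# A rough illustration of generalized cross-validation (GCV) for smoothing
# parameter selection, applied to a ridge-type linear smoother as a generic
# stand-in; this is not the MFDA implementation.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
X = np.vander(x, 8)  # a simple polynomial basis

def gcv(lam):
    # Smoother matrix S_lam = X (X'X + lam I)^{-1} X', so yhat = S_lam y.
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    resid = y - S @ y
    # GCV(lam) = n * RSS(lam) / (n - tr(S_lam))^2
    return n * (resid @ resid) / (n - np.trace(S)) ** 2

lams = 10.0 ** np.arange(-8.0, 2.0)
print(min(lams, key=gcv))  # smoothing parameter minimizing GCV
```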

    MergeMaid: R Tools for Merging and Cross-Study Validation of Gene Expression Data

    Cross-study validation of gene expression investigations is critical in genomic analysis. We developed an R package and associated object definitions to merge and visualize multiple gene expression datasets. Our merging functions use arbitrary character IDs and generate objects that can efficiently support a variety of joint analyses. Visualization tools support exploration and cross-study validation of the data, without requiring normalization across platforms. Tools include “integrative correlation” plots, that is, scatterplots of all pairwise correlations in one study against the corresponding pairwise correlations in another, both for individual genes and for all genes combined. Gene-specific plots can be used to identify genes whose changes are reliably measured across studies. Visualizations also include scatterplots of gene-specific statistics quantifying relationships between expression and phenotypes of interest, using linear, logistic and Cox regression. Availability: Free open source from http://www.bioconductor.org. Contact: Xiaogang Zhong [email protected] Supplementary information: Documentation available with the package.
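
    The integrative-correlation idea can be sketched in a few lines: compute all pairwise gene-gene correlations within each study, then correlate the two sets of values. MergeMaid itself is an R/Bioconductor package; the numpy version below is purely illustrative, with random data standing in for real expression matrices:

```python
# A minimal sketch of the "integrative correlation" idea: compute all
# pairwise gene-gene correlations within each study, then correlate the two
# sets of values across studies.
import numpy as np

rng = np.random.default_rng(2)
genes = 20
study1 = rng.normal(size=(genes, 30))  # genes x samples, study 1
study2 = rng.normal(size=(genes, 40))  # genes x samples, study 2

c1 = np.corrcoef(study1)  # genes x genes correlations within study 1
c2 = np.corrcoef(study2)
iu = np.triu_indices(genes, k=1)  # each unique gene pair once
# The "all genes combined" version: agreement of pairwise correlations
# across the two studies. Per-gene versions use one row of c1 and c2.
print(np.corrcoef(c1[iu], c2[iu])[0, 1])
```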

    A simple tool to predict admission at the time of triage

    Aim To create and validate a simple clinical score to estimate the probability of admission at the time of triage. Methods This was a multicentre, retrospective, cross-sectional study of triage records for all unscheduled adult attendances in North Glasgow over 2 years. Clinical variables that had significant associations with admission on logistic regression were entered into a mixed-effects multiple logistic model. This provided weightings for the score, which was then simplified and tested on a separate validation group by receiver operating characteristic (ROC) analysis and goodness-of-fit tests. Results 215 231 presentations were used for model derivation and 107 615 for validation. Variables in the final model showing clinically and statistically significant associations with admission were: triage category, age, National Early Warning Score (NEWS), arrival by ambulance, referral source and admission within the last year. The resulting 6-variable score showed excellent admission/discharge discrimination (area under ROC curve 0.8774, 95% CI 0.8752 to 0.8796). Higher scores also predicted early returns for those who were discharged: the odds of subsequent admission within 28 days doubled for every 7-point increase (log odds=+0.0933 per point, p<0.0001). Conclusions This simple 6-variable score accurately estimates the probability of admission purely from triage information. Most patients could accurately be assigned to ‘admission likely’, ‘admission unlikely’, ‘admission very unlikely’ etc, by setting appropriate cut-offs. This could have uses in patient streaming, bed management and decision support. It also has the potential to control for demographics when comparing performance over time or between departments.
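
    A minimal sketch of the generic scorecard recipe used here: fit a logistic model on a derivation set, round scaled coefficients to integer point weights, and check discrimination by ROC AUC on a validation set. The variables and data below are synthetic stand-ins, not the study's model or dataset:

```python
# A minimal sketch of deriving a points score from logistic-regression
# coefficients and validating it by ROC AUC; all data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 5000
X = np.column_stack([
    rng.integers(1, 6, n),    # triage category (1-5)
    rng.integers(18, 95, n),  # age
    rng.integers(0, 15, n),   # NEWS
    rng.integers(0, 2, n),    # arrived by ambulance (0/1)
])
logit = -6 + 0.5 * X[:, 0] + 0.03 * X[:, 1] + 0.2 * X[:, 2] + 0.8 * X[:, 3]
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X_dev, X_val, y_dev, y_val = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
# Simplify to an integer points score, as triage scorecards typically do.
points = np.round(model.coef_[0] / np.abs(model.coef_[0]).min()).astype(int)
score = X_val @ points
print(points, roc_auc_score(y_val, score))
```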

    Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)

    We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies did not (a) assess classifiers on multiple criteria and did not (b) study how variations in the data affect the results. Hence, this paper applies (a) multi-criteria tests while (b) fixing the weaker regions of the training data (using SMOTUNED, a self-tuning version of SMOTE). This approach leads to dramatic improvements in software defect prediction. When applied in a 5*5 cross-validation study of 3,681 Java classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict quality. The same pattern of improvement was observed when SMOTE and SMOTUNED were compared against the most recent class-imbalance technique. In conclusion, for software analytics tasks like defect prediction, (1) data pre-processing can be more important than classifier choice, (2) ranking studies are incomplete without such pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing. Comment: 10 pages + 2 references. Accepted to the International Conference on Software Engineering (ICSE), 201
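
    The core idea behind SMOTUNED, tuning SMOTE's parameters per dataset rather than accepting defaults, can be sketched with the imbalanced-learn package. The paper tunes with differential evolution; the grid search below over k_neighbors is only a simplified stand-in:

```python
# A simplified stand-in for the SMOTUNED idea: tune SMOTE's parameters per
# dataset instead of accepting defaults (the paper uses differential
# evolution; this sketch grid-searches k_neighbors only).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", DecisionTreeClassifier(random_state=0))])
# The imblearn pipeline applies SMOTE only to training folds, not test folds.
grid = GridSearchCV(pipe, {"smote__k_neighbors": [3, 5, 7, 11]},
                    scoring="roc_auc", cv=5).fit(X, y)
print(grid.best_params_, grid.best_score_)
```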

    Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)

    Full text link
    We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies did not (a) assess classifiers on multiple criteria and they did not (b) study how variations in the data affect the results. Hence, this paper applies (a) multi-criteria tests while (b) fixing the weaker regions of the training data (using SMOTUNED, which is a self-tuning version of SMOTE). This approach leads to dramatically large increases in software defect predictions. When applied in a 5*5 cross-validation study for 3,681 JAVA classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict for quality. Same kind of pattern (improvement) was observed when a comparative analysis of SMOTE and SMOTUNED was done against the most recent class imbalance technique. In conclusion, for software analytic tasks like defect prediction, (1) data pre-processing can be more important than classifier choice, (2) ranking studies are incomplete without such pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing.Comment: 10 pages + 2 references. Accepted to International Conference of Software Engineering (ICSE), 201

    Learning Site-specific Styles for Multi-institutional Unsupervised Cross-modality Domain Adaptation

    Unsupervised cross-modality domain adaptation is a challenging task in medical image analysis, and it becomes more challenging when source and target domain data are collected from multiple institutions. In this paper, we present our solution to the multi-institutional unsupervised domain adaptation task of the crossMoDA 2023 challenge. First, we perform unpaired image translation to translate the source domain images to the target domain, where we design a dynamic network to generate synthetic target domain images with controllable, site-specific styles. Afterwards, we train a segmentation model using the synthetic images and further reduce the domain gap by self-training. Our solution achieved 1st place during both the validation and testing phases of the challenge. The code repository is publicly available at https://github.com/MedICL-VU/crossmoda2023. Comment: crossMoDA 2023 challenge 1st place solution.
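
    The self-training step can be illustrated with a toy classifier: a model trained on synthetic target-style data pseudo-labels the unlabeled target set, and confident predictions are folded back into training. This is only a schematic of the loop; the actual solution trains a segmentation network (see the linked repository):

```python
# A toy illustration of the self-training loop only: train on "synthetic
# target" data, pseudo-label unlabeled target data, keep confident
# predictions, and retrain. All data below are synthetic stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_syn, y_syn = make_classification(n_samples=500, random_state=0)  # labeled synthetic
X_tgt, _ = make_classification(n_samples=500, shift=0.3, random_state=1)  # unlabeled target

model = LogisticRegression().fit(X_syn, y_syn)
for _ in range(3):  # a few self-training rounds
    proba = model.predict_proba(X_tgt)
    keep = proba.max(axis=1) > 0.9  # keep only confident pseudo-labels
    X_aug = np.vstack([X_syn, X_tgt[keep]])
    y_aug = np.concatenate([y_syn, proba.argmax(axis=1)[keep]])
    model = LogisticRegression().fit(X_aug, y_aug)
print(f"pseudo-labels kept in final round: {keep.sum()}")
```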