36 research outputs found

    Detection of Malicious Websites Using Machine Learning Techniques

    Full text link
    In detecting malicious websites, a common approach is the use of blacklists which are not exhaustive in themselves and are unable to generalize to new malicious sites. Detecting newly encountered malicious websites automatically will help reduce the vulnerability to this form of attack. In this study, we explored the use of ten machine learning models to classify malicious websites based on lexical features and understand how they generalize across datasets. Specifically, we trained, validated, and tested these models on different sets of datasets and then carried out a cross-datasets analysis. From our analysis, we found that K-Nearest Neighbor is the only model that performs consistently high across datasets. Other models such as Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines also consistently outperform a baseline model of predicting every link as malicious across all metrics and datasets. Also, we found no evidence that any subset of lexical features generalizes across models or datasets. This research should be relevant to cybersecurity professionals and academic researchers as it could form the basis for real-life detection systems or further research work

    Estimating Example Difficulty using Variance of Gradients

    Full text link
    In machine learning, a question of great interest is understanding what examples are challenging for a model to classify. Identifying atypical examples helps inform safe deployment of models, isolates examples that require further human inspection, and provides interpretability into model behavior. In this work, we propose Variance of Gradients (VOG) as a proxy metric for detecting outliers in the data distribution. We provide quantitative and qualitative support that VOG is a meaningful way to rank data by difficulty and to surface a tractable subset of the most challenging examples for human-in-the-loop auditing. Data points with high VOG scores are more difficult for the model to classify and over-index on examples that require memorization.Comment: Accepted to Workshop on Human Interpretability in Machine Learning (WHI), ICML, 202

    Tailored for Real-World: A Whole Slide Image Classification System Validated on Uncurated Multi-Site Data Emulating the Prospective Pathology Workload.

    Get PDF
    Standard of care diagnostic procedure for suspected skin cancer is microscopic examination of hematoxylin & eosin stained tissue by a pathologist. Areas of high inter-pathologist discordance and rising biopsy rates necessitate higher efficiency and diagnostic reproducibility. We present and validate a deep learning system which classifies digitized dermatopathology slides into 4 categories. The system is developed using 5,070 images from a single lab, and tested on an uncurated set of 13,537 images from 3 test labs, using whole slide scanners manufactured by 3 different vendors. The system\u27s use of deep-learning-based confidence scoring as a criterion to consider the result as accurate yields an accuracy of up to 98%, and makes it adoptable in a real-world setting. Without confidence scoring, the system achieved an accuracy of 78%. We anticipate that our deep learning system will serve as a foundation enabling faster diagnosis of skin cancer, identification of cases for specialist review, and targeted diagnostic classifications

    Clinician-Driven AI: Code-Free Self-Training on Public Data for Diabetic Retinopathy Referral

    Get PDF
    Importance: Democratizing artificial intelligence (AI) enables model development by clinicians with a lack of coding expertise, powerful computing resources, and large, well-labeled data sets. // Objective: To determine whether resource-constrained clinicians can use self-training via automated machine learning (ML) and public data sets to design high-performing diabetic retinopathy classification models. // Design, Setting, and Participants: This diagnostic quality improvement study was conducted from January 1, 2021, to December 31, 2021. A self-training method without coding was used on 2 public data sets with retinal images from patients in France (Messidor-2 [n = 1748]) and the UK and US (EyePACS [n = 58 689]) and externally validated on 1 data set with retinal images from patients of a private Egyptian medical retina clinic (Egypt [n = 210]). An AI model was trained to classify referable diabetic retinopathy as an exemplar use case. Messidor-2 images were assigned adjudicated labels available on Kaggle; 4 images were deemed ungradable and excluded, leaving 1744 images. A total of 300 images randomly selected from the EyePACS data set were independently relabeled by 3 blinded retina specialists using the International Classification of Diabetic Retinopathy protocol for diabetic retinopathy grade and diabetic macular edema presence; 19 images were deemed ungradable, leaving 281 images. Data analysis was performed from February 1 to February 28, 2021. // Exposures: Using public data sets, a teacher model was trained with labeled images using supervised learning. Next, the resulting predictions, termed pseudolabels, were used on an unlabeled public data set. Finally, a student model was trained with the existing labeled images and the additional pseudolabeled images. Main Outcomes and Measures: The analyzed metrics for the models included the area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity, and F1 score. The Fisher exact test was performed, and 2-tailed P values were calculated for failure case analysis. // Results: For the internal validation data sets, AUROC values for performance ranged from 0.886 to 0.939 for the teacher model and from 0.916 to 0.951 for the student model. For external validation of automated ML model performance, AUROC values and accuracy were 0.964 and 93.3% for the teacher model, 0.950 and 96.7% for the student model, and 0.890 and 94.3% for the manually coded bespoke model, respectively. // Conclusions and Relevance: These findings suggest that self-training using automated ML is an effective method to increase both model performance and generalizability while decreasing the need for costly expert labeling. This approach advances the democratization of AI by enabling clinicians without coding expertise or access to large, well-labeled private data sets to develop their own AI models

    Do comprehensive deep learning algorithms suffer from hidden stratification? A retrospective study on pneumothorax detection in chest radiography

    Full text link
    ObjectivesTo evaluate the ability of a commercially available comprehensive chest radiography deep convolutional neural network (DCNN) to detect simple and tension pneumothorax, as stratified by the following subgroups: the presence of an intercostal drain; rib, clavicular, scapular or humeral fractures or rib resections; subcutaneous emphysema and erect versus non-erect positioning. The hypothesis was that performance would not differ significantly in each of these subgroups when compared with the overall test dataset.DesignA retrospective case–control study was undertaken.SettingCommunity radiology clinics and hospitals in Australia and the USA.ParticipantsA test dataset of 2557 chest radiography studies was ground-truthed by three subspecialty thoracic radiologists for the presence of simple or tension pneumothorax as well as each subgroup other than positioning. Radiograph positioning was derived from radiographer annotations on the images.Outcome measuresDCNN performance for detecting simple and tension pneumothorax was evaluated over the entire test set, as well as within each subgroup, using the area under the receiver operating characteristic curve (AUC). A difference in AUC of more than 0.05 was considered clinically significant.ResultsWhen compared with the overall test set, performance of the DCNN for detecting simple and tension pneumothorax was statistically non-inferior in all subgroups. The DCNN had an AUC of 0.981 (0.976–0.986) for detecting simple pneumothorax and 0.997 (0.995–0.999) for detecting tension pneumothorax.ConclusionsHidden stratification has significant implications for potential failures of deep learning when applied in clinical practice. This study demonstrated that a comprehensively trained DCNN can be resilient to hidden stratification in several clinically meaningful subgroups in detecting pneumothorax.</jats:sec

    Demographic Bias of Expert-Level Vision-Language Foundation Models in Medical Imaging

    Full text link
    Advances in artificial intelligence (AI) have achieved expert-level performance in medical imaging applications. Notably, self-supervised vision-language foundation models can detect a broad spectrum of pathologies without relying on explicit training annotations. However, it is crucial to ensure that these AI models do not mirror or amplify human biases, thereby disadvantaging historically marginalized groups such as females or Black patients. The manifestation of such biases could systematically delay essential medical care for certain patient subgroups. In this study, we investigate the algorithmic fairness of state-of-the-art vision-language foundation models in chest X-ray diagnosis across five globally-sourced datasets. Our findings reveal that compared to board-certified radiologists, these foundation models consistently underdiagnose marginalized groups, with even higher rates seen in intersectional subgroups, such as Black female patients. Such demographic biases present over a wide range of pathologies and demographic attributes. Further analysis of the model embedding uncovers its significant encoding of demographic information. Deploying AI systems with these biases in medical imaging can intensify pre-existing care disparities, posing potential challenges to equitable healthcare access and raising ethical questions about their clinical application.Comment: Code and data are available at https://github.com/YyzHarry/vlm-fairnes
    corecore