36 research outputs found
Detection of Malicious Websites Using Machine Learning Techniques
In detecting malicious websites, a common approach is the use of blacklists,
which are not exhaustive and cannot generalize to new
malicious sites. Detecting newly encountered malicious websites automatically
will help reduce vulnerability to this form of attack. In this study, we
explored the use of ten machine learning models to classify malicious websites
based on lexical features and to understand how they generalize across datasets.
Specifically, we trained, validated, and tested these models on different
datasets and then carried out a cross-dataset analysis. From our analysis,
we found that K-Nearest Neighbor is the only model that performs consistently
well across datasets. Other models such as Random Forest, Decision Trees,
Logistic Regression, and Support Vector Machines also consistently outperform a
baseline model of predicting every link as malicious across all metrics and
datasets. Also, we found no evidence that any subset of lexical features
generalizes across models or datasets. This research should be relevant to
cybersecurity professionals and academic researchers as it could form the basis
for real-life detection systems or further research work.
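To make "lexical features" and the "every link is malicious" baseline concrete, here is a minimal sketch. The specific features and the function names `lexical_features` and `all_malicious_baseline` are assumptions chosen for illustration; the abstract does not list the features or metrics the study actually used.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few simple features from the URL string alone.

    The feature set here is hypothetical, not the one used in the study.
    """
    # urlparse only fills in netloc when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "path_depth": len([part for part in parsed.path.split("/") if part]),
    }


def all_malicious_baseline(labels) -> dict:
    """Precision, recall and F1 when every link is predicted malicious (label 1)."""
    if not labels:
        raise ValueError("labels must not be empty")
    positives = sum(labels)
    precision = positives / len(labels)   # every prediction is "malicious"
    recall = 1.0 if positives else 0.0    # no malicious link is ever missed
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A trained classifier would be compared against the baseline's numbers on each dataset. The baseline's precision equals the share of malicious links, so it varies with dataset composition.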
Estimating Example Difficulty using Variance of Gradients
In machine learning, a question of great interest is understanding what
examples are challenging for a model to classify. Identifying atypical examples
helps inform safe deployment of models, isolates examples that require further
human inspection, and provides interpretability into model behavior. In this
work, we propose Variance of Gradients (VOG) as a proxy metric for detecting
outliers in the data distribution. We provide quantitative and qualitative
support that VOG is a meaningful way to rank data by difficulty and to surface
a tractable subset of the most challenging examples for human-in-the-loop
auditing. Data points with high VOG scores are more difficult for the model to
classify and over-index on examples that require memorization.
Comment: Accepted to Workshop on Human Interpretability in Machine Learning
(WHI), ICML, 202
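The idea of scoring examples by how much their gradients vary can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: it assumes that per-example input gradients have already been saved at several training checkpoints, and it uses the population variance averaged over input features. The names `vog_scores` and `rank_by_difficulty` are hypothetical, and any per-class normalization is omitted.

```python
from statistics import mean, pvariance


def vog_scores(gradients):
    """Score each example by the variance of its gradients across checkpoints.

    gradients[k][i][d] is the gradient at checkpoint k, for example i,
    with respect to input feature d (assumed to be precomputed).
    """
    num_examples = len(gradients[0])
    scores = []
    for i in range(num_examples):
        per_checkpoint = [checkpoint[i] for checkpoint in gradients]
        # zip(*...) groups the values each input feature takes across checkpoints.
        feature_variances = [pvariance(values) for values in zip(*per_checkpoint)]
        scores.append(mean(feature_variances))
    return scores


def rank_by_difficulty(scores):
    """Return example indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The top of the ranking gives a small subset of examples that could be handed to a human reviewer first.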
Tailored for Real-World: A Whole Slide Image Classification System Validated on Uncurated Multi-Site Data Emulating the Prospective Pathology Workload.
The standard-of-care diagnostic procedure for suspected skin cancer is microscopic examination of hematoxylin & eosin stained tissue by a pathologist. Areas of high inter-pathologist discordance and rising biopsy rates necessitate higher efficiency and diagnostic reproducibility. We present and validate a deep learning system which classifies digitized dermatopathology slides into 4 categories. The system is developed using 5,070 images from a single lab, and tested on an uncurated set of 13,537 images from 3 test labs, using whole slide scanners manufactured by 3 different vendors. The system's use of deep-learning-based confidence scoring as a criterion to consider the result as accurate yields an accuracy of up to 98%, and makes it adoptable in a real-world setting. Without confidence scoring, the system achieved an accuracy of 78%. We anticipate that our deep learning system will serve as a foundation enabling faster diagnosis of skin cancer, identification of cases for specialist review, and targeted diagnostic classifications.
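The gap between accuracy with and without confidence scoring can be illustrated with a small sketch. It assumes each prediction comes with a scalar confidence and that results below a threshold are set aside, for example for pathologist review. The abstract does not describe how the score is computed or how the threshold is chosen, so the function name `accuracy_with_confidence` and its parameters are hypothetical.

```python
def accuracy_with_confidence(predictions, labels, confidences, threshold):
    """Accuracy on the predictions whose confidence meets the threshold.

    Returns (accuracy, coverage), where coverage is the fraction of cases kept.
    Accuracy is None when no case meets the threshold.
    """
    kept = [
        (pred, label)
        for pred, label, conf in zip(predictions, labels, confidences)
        if conf >= threshold
    ]
    coverage = len(kept) / len(labels)
    if not kept:
        return None, coverage
    accuracy = sum(pred == label for pred, label in kept) / len(kept)
    return accuracy, coverage
```

Raising the threshold typically trades coverage for accuracy, so the two numbers are best reported together.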
Clinician-Driven AI: Code-Free Self-Training on Public Data for Diabetic Retinopathy Referral
Importance: Democratizing artificial intelligence (AI) enables model development by clinicians who lack coding expertise, powerful computing resources, and large, well-labeled data sets.
Objective: To determine whether resource-constrained clinicians can use self-training via automated machine learning (ML) and public data sets to design high-performing diabetic retinopathy classification models.
Design, Setting, and Participants: This diagnostic quality improvement study was conducted from January 1, 2021, to December 31, 2021. A self-training method without coding was used on 2 public data sets with retinal images from patients in France (Messidor-2 [n = 1748]) and the UK and US (EyePACS [n = 58 689]) and externally validated on 1 data set with retinal images from patients of a private Egyptian medical retina clinic (Egypt [n = 210]). An AI model was trained to classify referable diabetic retinopathy as an exemplar use case. Messidor-2 images were assigned adjudicated labels available on Kaggle; 4 images were deemed ungradable and excluded, leaving 1744 images. A total of 300 images randomly selected from the EyePACS data set were independently relabeled by 3 blinded retina specialists using the International Classification of Diabetic Retinopathy protocol for diabetic retinopathy grade and diabetic macular edema presence; 19 images were deemed ungradable, leaving 281 images. Data analysis was performed from February 1 to February 28, 2021.
Exposures: Using public data sets, a teacher model was trained with labeled images using supervised learning. Next, the resulting predictions, termed pseudolabels, were used on an unlabeled public data set. Finally, a student model was trained with the existing labeled images and the additional pseudolabeled images.
Main Outcomes and Measures: The analyzed metrics for the models included the area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity, and F1 score. The Fisher exact test was performed, and 2-tailed P values were calculated for failure case analysis.
Results: For the internal validation data sets, AUROC values for performance ranged from 0.886 to 0.939 for the teacher model and from 0.916 to 0.951 for the student model. For external validation of automated ML model performance, AUROC values and accuracy were 0.964 and 93.3% for the teacher model, 0.950 and 96.7% for the student model, and 0.890 and 94.3% for the manually coded bespoke model, respectively.
Conclusions and Relevance: These findings suggest that self-training using automated ML is an effective method to increase both model performance and generalizability while decreasing the need for costly expert labeling. This approach advances the democratization of AI by enabling clinicians without coding expertise or access to large, well-labeled private data sets to develop their own AI models.
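The teacher, pseudolabel, and student steps described under Exposures can be sketched conceptually as below. The study itself used a code-free automated ML platform on retinal images, so this is only an illustration of the data flow: the toy one-dimensional nearest-centroid "model" and the names `train_centroids`, `predict`, and `self_train` are assumptions, not the study's method.

```python
from statistics import mean


def train_centroids(features, labels):
    """Toy stand-in for model training: the mean feature value of each class."""
    return {
        cls: mean(x for x, y in zip(features, labels) if y == cls)
        for cls in set(labels)
    }


def predict(model, features):
    """Assign each feature value to the class with the nearest centroid."""
    return [min(model, key=lambda cls: abs(x - model[cls])) for x in features]


def self_train(labeled_x, labeled_y, unlabeled_x):
    """Run one round of self-training and return (teacher, student, pseudolabels)."""
    # 1. Train the teacher on labeled data only (supervised learning).
    teacher = train_centroids(labeled_x, labeled_y)
    # 2. Use the teacher's predictions as pseudolabels for the unlabeled data.
    pseudolabels = predict(teacher, unlabeled_x)
    # 3. Train the student on the labeled plus pseudolabeled data.
    student = train_centroids(labeled_x + unlabeled_x, labeled_y + pseudolabels)
    return teacher, student, pseudolabels
```

In practice the student would then be evaluated on held-out internal and external data, as the Results paragraph reports.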
Do comprehensive deep learning algorithms suffer from hidden stratification? A retrospective study on pneumothorax detection in chest radiography
Objectives: To evaluate the ability of a commercially available comprehensive chest radiography deep convolutional neural network (DCNN) to detect simple and tension pneumothorax, as stratified by the following subgroups: the presence of an intercostal drain; rib, clavicular, scapular or humeral fractures or rib resections; subcutaneous emphysema and erect versus non-erect positioning. The hypothesis was that performance would not differ significantly in each of these subgroups when compared with the overall test dataset.
Design: A retrospective case–control study was undertaken.
Setting: Community radiology clinics and hospitals in Australia and the USA.
Participants: A test dataset of 2557 chest radiography studies was ground-truthed by three subspecialty thoracic radiologists for the presence of simple or tension pneumothorax as well as each subgroup other than positioning. Radiograph positioning was derived from radiographer annotations on the images.
Outcome measures: DCNN performance for detecting simple and tension pneumothorax was evaluated over the entire test set, as well as within each subgroup, using the area under the receiver operating characteristic curve (AUC). A difference in AUC of more than 0.05 was considered clinically significant.
Results: When compared with the overall test set, performance of the DCNN for detecting simple and tension pneumothorax was statistically non-inferior in all subgroups. The DCNN had an AUC of 0.981 (0.976–0.986) for detecting simple pneumothorax and 0.997 (0.995–0.999) for detecting tension pneumothorax.
Conclusions: Hidden stratification has significant implications for potential failures of deep learning when applied in clinical practice. This study demonstrated that a comprehensively trained DCNN can be resilient to hidden stratification in several clinically meaningful subgroups in detecting pneumothorax.
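A subgroup comparison of this kind can be sketched as follows. The sketch computes a rank-based AUC on the full test set and within one subgroup, then flags a drop larger than the 0.05 margin stated in the abstract. It compares point estimates only and does not reproduce the study's non-inferiority testing or confidence intervals. The function names and the `in_subgroup` flag are assumptions for illustration.

```python
def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one.

    Ties count as half. Labels are 1 (finding present) or 0 (absent).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def subgroup_auc_gap(scores, labels, in_subgroup, margin=0.05):
    """Return (overall AUC, subgroup AUC, whether the drop exceeds the margin)."""
    overall = auc(scores, labels)
    sub_scores = [s for s, flag in zip(scores, in_subgroup) if flag]
    sub_labels = [y for y, flag in zip(labels, in_subgroup) if flag]
    subgroup = auc(sub_scores, sub_labels)
    return overall, subgroup, (overall - subgroup) > margin
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a sketch but would be replaced by a sort-based computation on large test sets.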
Demographic Bias of Expert-Level Vision-Language Foundation Models in Medical Imaging
Advances in artificial intelligence (AI) have achieved expert-level
performance in medical imaging applications. Notably, self-supervised
vision-language foundation models can detect a broad spectrum of pathologies
without relying on explicit training annotations. However, it is crucial to
ensure that these AI models do not mirror or amplify human biases, thereby
disadvantaging historically marginalized groups such as females or Black
patients. The manifestation of such biases could systematically delay essential
medical care for certain patient subgroups. In this study, we investigate the
algorithmic fairness of state-of-the-art vision-language foundation models in
chest X-ray diagnosis across five globally-sourced datasets. Our findings
reveal that compared to board-certified radiologists, these foundation models
consistently underdiagnose marginalized groups, with even higher rates seen in
intersectional subgroups, such as Black female patients. Such demographic
biases are present across a wide range of pathologies and demographic attributes.
Further analysis of the model embedding uncovers its significant encoding of
demographic information. Deploying AI systems with these biases in medical
imaging can intensify pre-existing care disparities, posing potential
challenges to equitable healthcare access and raising ethical questions about
their clinical application.
Comment: Code and data are available at
https://github.com/YyzHarry/vlm-fairnes
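One way to quantify underdiagnosis per demographic subgroup is sketched below. It assumes underdiagnosis means the model predicting "healthy" for a patient who actually has a finding, that is, a per-group false-negative rate. The abstract does not give the exact definition, so this definition and the function name `underdiagnosis_rates` are assumptions.

```python
from collections import defaultdict


def underdiagnosis_rates(has_disease, predicted_healthy, groups):
    """Fraction of diseased patients in each group that the model calls healthy.

    groups may hold single attributes ("female") or tuples for
    intersectional subgroups (("female", "Black")).
    """
    diseased = defaultdict(int)
    missed = defaultdict(int)
    for sick, pred_healthy, group in zip(has_disease, predicted_healthy, groups):
        if sick:
            diseased[group] += 1
            missed[group] += int(bool(pred_healthy))
    # Groups with no diseased patients are left out of the result.
    return {group: missed[group] / diseased[group] for group in diseased}
```

Comparing these rates across groups, and against the same rates computed from radiologist reads, would show whether one group is missed more often than another.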