Using Topological Data Analysis for diagnosing pulmonary embolism
Pulmonary Embolism (PE) is a common and potentially lethal condition; most patients die within the first few hours of the event. Despite diagnostic advances, delays and underdiagnosis in PE remain common. To increase diagnostic performance in PE, the current work-up of patients with suspected acute pulmonary embolism usually starts with an assessment of clinical pretest probability, using plasma D-dimer measurement and clinical prediction rules. The most validated and widely used clinical decision rules are the Wells and revised Geneva scores. We aimed to develop a new clinical prediction rule (CPR) for PE based on topological data analysis and an artificial neural network. Filter and wrapper methods for feature reduction could not be applied to our dataset, since these algorithms require datasets without missing data. Instead, we applied topological data analysis (TDA) to overcome the hurdle of processing datasets with missing (null) values. A topological network was developed using the Iris software (Ayasdi, Inc., Palo Alto). The PE patient topology identified two areas in the pathological group and hence two distinct clusters of PE patient populations. Additionally, the topological network detected several sub-groups among the healthy patients that are likely affected by non-PE diseases. TDA was further used to identify the key features most strongly associated with a PE diagnosis, and this information defined the input space for a back-propagation artificial neural network (BP-ANN). The area under the curve (AUC) of the BP-ANN is greater than the AUCs of the scores used by physicians (Wells and revised Geneva). The results demonstrate that topological data analysis and the BP-ANN, used in combination, can produce better predictive models than the Wells or revised Geneva scoring systems for the analyzed cohort.
Comment: 18 pages, 5 figures, 6 tables. arXiv admin note: text overlap with arXiv:cs/0308031 by other authors without attribution.
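To make the modelling stage concrete, the following is a minimal sketch of the kind of BP-ANN classifier the abstract describes, evaluated by AUC against a clinical-score baseline. The feature matrix, labels, and baseline scores below are simulated stand-ins, not the study's cohort, and the network architecture is an assumption.

```python
# Sketch: a back-propagation ANN scored by AUC against a clinical baseline.
# X, y, and wells_scores are simulated placeholders, not the study's data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 8))  # stand-in for TDA-selected clinical features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)  # stand-in PE labels
wells_scores = X[:, 0] + rng.normal(scale=2.0, size=n)  # stand-in clinical score

X_tr, X_te, y_tr, y_te, _, wells_te = train_test_split(
    X, y, wells_scores, test_size=0.3, random_state=0)

# One hidden layer trained by back-propagation; the architecture is assumed.
bp_ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
bp_ann.fit(X_tr, y_tr)

auc_ann = roc_auc_score(y_te, bp_ann.predict_proba(X_te)[:, 1])
auc_base = roc_auc_score(y_te, wells_te)
print(f"BP-ANN AUC: {auc_ann:.3f}  baseline-score AUC: {auc_base:.3f}")
```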
Learning the optimal scale for GWAS through hierarchical SNP aggregation
Motivation: Genome-Wide Association Studies (GWAS) seek to identify causal
genomic variants associated with rare human diseases. The classical statistical
approach for detecting these variants is based on univariate hypothesis
testing, with healthy individuals being tested against affected individuals at
each locus. Given that an individual's genotype is characterized by up to one
million SNPs, this approach lacks precision, since it may yield a large number
of false positives that can lead to erroneous conclusions about genetic
associations with the disease. One way to improve the detection of true genetic
associations is to reduce the number of hypotheses to be tested by grouping
SNPs. Results: We propose a dimension-reduction approach which can be applied
in the context of GWAS by making use of the haplotype structure of the human
genome. We compare our method with standard univariate and multivariate
approaches on both synthetic and real GWAS data, and we show that reducing the dimension of the predictor matrix by aggregating SNPs yields greater precision in detecting associations between the phenotype and genomic regions.
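As a rough illustration of the aggregation idea, the sketch below groups correlated SNPs by hierarchical clustering (a crude proxy for haplotype structure), averages the genotypes within each group, and runs one univariate test per group instead of one per SNP. The genotype matrix, phenotype, and number of groups are simulated assumptions, not the paper's actual method or data.

```python
# Sketch: hierarchical SNP aggregation, then one test per group.
# G (0/1/2 allele counts) and y are simulated stand-ins.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n, p = 300, 60
G = rng.integers(0, 3, size=(n, p)).astype(float)  # stand-in genotype matrix
y = G[:, :3].sum(axis=1) + rng.normal(size=n)      # phenotype driven by one region

# Cluster SNPs on 1 - |correlation| (crude LD proxy); cut at a chosen scale.
corr = np.corrcoef(G.T)
dist = 1.0 - np.abs(corr)
Z = linkage(dist[np.triu_indices(p, k=1)], method="average")  # condensed form
groups = fcluster(Z, t=10, criterion="maxclust")  # 10 groups; scale is assumed

# One aggregated predictor per group, then one univariate test per group.
for g in np.unique(groups):
    agg = G[:, groups == g].mean(axis=1)
    r, pval = pearsonr(agg, y)
    print(f"group {g:2d}: {np.sum(groups == g):2d} SNPs, p = {pval:.2e}")
```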
Malware Analysis with Machine Learning
Master's thesis, Information Security (Segurança Informática), Universidade de Lisboa, Faculdade de Ciências, 2022.
Malware attacks have been one of the most serious cyber risks in recent years. Almost every week, the number of vulnerability reports increases in the security communities. One of the key causes of this exponential growth is that malware authors started introducing mutations to avoid detection. This means that malicious files from the same malware family, with the same malicious behaviour, are constantly modified or obfuscated using a variety of techniques to make them appear to be different.
Characteristics retrieved from raw binary files or disassembled code are used in existing machine
learning-based malware categorization algorithms. The variety of such attributes has made it difficult to
develop generic malware categorization methods that operate well in a variety of operating scenarios.
To be effective in evaluating and categorizing such enormous volumes of data, it is necessary to divide them into groups and identify their respective families based on their behaviour. Malicious software is converted to a greyscale image representation, since this captures subtle changes while preserving the global structure, which helps to detect variations. Motivated by the machine learning results achieved in the ImageNet challenge, this dissertation proposes an agnostic deep learning solution for efficiently classifying malware into families based on a collection of discriminant patterns retrieved from its visualization as images.
In this thesis, we present Malwizard, an adaptable Python solution suited for companies or end users that allows them to obtain fast, automated malware analysis. The solution was implemented as an Outlook add-in and as an API service for SOAR platforms, since email is the primary vector for this type of attack, with companies being the most attractive targets.
The Microsoft Classification Challenge dataset was used to evaluate the novel approach. Its image representation was also ciphered, generating the corresponding ciphered images, to evaluate whether the same patterns could be identified using traditional machine learning techniques. This allows privacy concerns to be addressed, keeping the data analysed by the neural networks secure from unauthorized parties.
An experimental comparison demonstrates that the novel approach performed close to the best analysed model on a plain-text dataset, completing the task in one-third of the time. For the encrypted dataset, classical techniques need to be adapted in order to be efficient.
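The byte-to-greyscale-image conversion the thesis builds on can be sketched in a few lines of Python: one byte becomes one 8-bit pixel, and the byte stream is reshaped to a fixed width. The image width and file path below are illustrative assumptions, not Malwizard's actual parameters.

```python
# Sketch: render a malware binary as a greyscale image, one byte per pixel.
# Width choice and sample path are assumptions for illustration only.
import numpy as np
from PIL import Image

def malware_to_image(path: str, width: int = 256) -> Image.Image:
    """Render a binary file as a greyscale image, one byte per pixel."""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    height = len(data) // width
    pixels = data[: height * width].reshape(height, width)  # drop the ragged tail
    return Image.fromarray(pixels, mode="L")

# Example usage (hypothetical sample path):
# img = malware_to_image("samples/suspicious.bin")
# img.resize((64, 64)).save("suspicious.png")  # fixed size for a CNN input
```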
Evaluating classification accuracy for modern learning approaches
Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/149333/1/sim8103_am.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/149333/2/sim8103.pd
Highly accurate model for prediction of lung nodule malignancy with CT scans
Computed tomography (CT) examinations are commonly used to predict lung nodule malignancy in patients and have been shown to improve noninvasive early diagnosis of lung cancer. It remains challenging for computational approaches
to achieve performance comparable to experienced radiologists. Here we present
NoduleX, a systematic approach to predict lung nodule malignancy from CT data,
based on deep learning convolutional neural networks (CNN). For training and
validation, we analyze >1000 lung nodules in images from the LIDC/IDRI cohort.
All nodules were identified and classified by four experienced thoracic
radiologists who participated in the LIDC project. NoduleX achieves high
accuracy for nodule malignancy classification, with an AUC of ~0.99. This is
commensurate with the analysis of the dataset by experienced radiologists. Our
approach, NoduleX, provides an effective framework for highly accurate nodule
malignancy prediction with the model trained on a large patient population. Our
results are replicable with software available at
http://bioinformatics.astate.edu/NoduleX
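For illustration, a toy CNN of the general kind the abstract describes might look as follows in PyTorch: a small 3D-patch classifier emitting one malignancy score per nodule patch. The architecture, patch size, and framework are assumptions, not NoduleX's published design (the actual model and data are at the URL above).

```python
# Sketch: a tiny 3D CNN mapping a CT patch to a malignancy probability.
# Layer sizes and 32^3 patch dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class NoduleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8 * 8, 64), nn.ReLU(),  # 32 channels, 8^3 voxels
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.classifier(self.features(x)))

# Smoke test on random patches shaped (batch, channel, depth, height, width).
model = NoduleCNN()
patch = torch.randn(4, 1, 32, 32, 32)
print(model(patch).shape)  # torch.Size([4, 1]): one malignancy score per patch
```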
Spam Filter Improvement Through Measurement
This work supports the thesis that sound quantitative evaluation for
spam filters leads to substantial improvement in the classification
of email. To this end, new laboratory testing methods and datasets
are introduced, and evidence is presented that their adoption at the
Text REtrieval Conference (TREC) and elsewhere has led to an improvement
in state-of-the-art spam filtering. While many of these improvements have been discovered
by others, the best-performing method known at this time -- spam filter
fusion -- was demonstrated by the author.
This work describes four principal dimensions of spam filter evaluation
methodology and spam filter improvement. An initial study investigates
the application of twelve open-source filter configurations in a laboratory
environment, using a stream of 50,000 messages captured from a single
recipient over eight months. The study measures the impact of user
feedback and on-line learning on filter performance using methodology
and measures which were released to the research community as the
TREC Spam Filter Evaluation Toolkit.
The toolkit was used as the basis of the TREC Spam Track, which the
author co-founded with Cormack. The Spam Track, in addition to evaluating
a new application (email spam), addressed the issue of testing systems
on both private and public data. While streams of private messages
are most realistic, they are not easy to come by and cannot be shared
with the research community as archival benchmarks. Using the toolkit,
participant filters were evaluated on both, and the differences were
found not to substantially confound evaluation; as a result, public
corpora were validated as research tools. Over the course of TREC and similar
evaluation efforts, a dozen or more archival benchmarks --
some private and some public -- have become available.
The toolkit and methodology have spawned improvements in the state
of the art every year since their deployment in 2005. In 2005, 2006,
and 2007, the Spam Track yielded new best-performing systems based
on sequential compression models, orthogonal sparse bigram features,
logistic regression, and support vector machines. Using the TREC participant
filters, we develop and demonstrate methods for on-line filter fusion
that outperform all other reported on-line personal spam filters.
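As a hedged sketch of what on-line filter fusion can look like, the code below combines the scores of several simulated base filters with an on-line logistic regression and compares the fused AUC against each filter's own. The base filters, their scores, and the fusion rule are stand-ins; the thesis's actual fusion method may differ.

```python
# Sketch: on-line fusion of several spam filters' scores into one meta-score.
# The base "filters" here are simulated, not TREC participant systems.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n_msgs, n_filters = 2000, 4
truth = rng.integers(0, 2, size=n_msgs)  # 1 = spam, 0 = ham
# Each simulated filter emits a noisy spamminess score per message.
scores = truth[:, None] + rng.normal(scale=[[0.8, 1.0, 1.2, 1.5]],
                                     size=(n_msgs, n_filters))

fusion = SGDClassifier(loss="log_loss")  # on-line logistic regression
fused, classes_seen = [], 0
for i in range(n_msgs):                  # stream messages one at a time
    x = scores[i : i + 1]
    if classes_seen == 2:                # need both classes before scoring
        fused.append(fusion.decision_function(x)[0])
    else:
        fused.append(float(x.sum()))     # cold start: naive sum of scores
    fusion.partial_fit(x, truth[i : i + 1], classes=[0, 1])
    classes_seen = len(np.unique(truth[: i + 1]))

print("fused AUC:", round(roc_auc_score(truth, fused), 3))
for j in range(n_filters):
    print(f"filter {j} AUC:", round(roc_auc_score(truth, scores[:, j]), 3))
```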
Neyman-Pearson classification algorithms and NP receiver operating characteristics
In many binary classification applications, such as disease diagnosis and spam detection, practitioners commonly face the need to limit type I error (that is, the conditional probability of misclassifying a class 0 observation as class 1) so that it remains below a desired threshold. To address this need, the Neyman-Pearson (NP) classification paradigm is a natural choice; it minimizes type II error (that is, the conditional probability of misclassifying a class 1 observation as class 0) while enforcing an upper bound, α, on the type I error. Despite its century-long history in hypothesis testing, the NP paradigm has not been well recognized or properly implemented in classification schemes. Common practices that directly limit the empirical type I error to no more than α do not satisfy the type I error control objective, because the resulting classifiers are likely to have type I errors much larger than α. We develop the first umbrella algorithm that implements the NP paradigm for all scoring-type classification methods, such as logistic regression, support vector machines, and random forests. Powered by this algorithm, we propose a novel graphical tool for NP classification methods: NP receiver operating characteristic (NP-ROC) bands, motivated by the popular ROC curves. NP-ROC bands help choose α in a data-adaptive way and compare different NP classifiers. We demonstrate the use and properties of the NP umbrella algorithm and NP-ROC bands, available in the R package nproc, through simulation and real data studies.
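The core of the umbrella algorithm can be sketched compactly: choose the classification threshold as an order statistic of held-out class-0 scores so that the type I error exceeds α with probability at most δ, via a binomial tail bound. The simulated scores below are stand-ins; the nproc R package wraps this procedure around many base classifiers.

```python
# Sketch: NP-style threshold from held-out class-0 scores via a binomial bound.
# Scores are simulated; in practice they come from a trained scoring classifier.
import numpy as np
from scipy.stats import binom

def np_threshold(scores_class0, alpha=0.05, delta=0.05):
    """Smallest order statistic whose type-I-error violation prob. is <= delta."""
    s = np.sort(scores_class0)
    n = len(s)
    for k in range(1, n + 1):
        # P(type I error > alpha) when thresholding at the k-th smallest score:
        # sum_{j=k}^{n} C(n,j) (1-alpha)^j alpha^(n-j)  ==  binom.sf(k-1, n, 1-alpha)
        if binom.sf(k - 1, n, 1 - alpha) <= delta:
            return s[k - 1]
    raise ValueError("class-0 sample too small for this alpha/delta")

rng = np.random.default_rng(3)
s0 = rng.normal(0, 1, size=400)  # held-out class-0 scores (stand-in)
thr = np_threshold(s0, alpha=0.05, delta=0.05)
print("threshold:", round(thr, 3),
      "empirical type I error:", np.mean(s0 > thr))
```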