Using Topological Data Analysis for diagnosing pulmonary embolism
Pulmonary Embolism (PE) is a common and potentially lethal condition; most patients die within the first few hours of the event. Despite diagnostic advances, delays and underdiagnosis in PE remain common. To increase diagnostic performance in PE, the current work-up of patients with suspected acute pulmonary embolism usually starts with an assessment of clinical pretest probability, using plasma D-dimer measurement and clinical prediction rules. The most validated and widely used clinical decision rules are the Wells and revised Geneva scores. We aimed to develop a new clinical prediction rule (CPR) for PE based on topological data analysis and an artificial neural network. Filter and wrapper methods for feature reduction could not be applied to our dataset, since these algorithms require datasets without missing data. Instead, we applied topological data analysis (TDA) to overcome the hurdle of processing datasets with missing (null) values. A topological network was developed using the Iris software (Ayasdi, Inc., Palo Alto). The PE patient topology identified two areas in the pathological group and hence two distinct clusters of PE patient populations. Additionally, the topological network detected several sub-groups among the healthy patients that are likely affected by non-PE diseases. TDA was further used to identify the key features most strongly associated with a PE diagnosis, and this information defined the input space for a back-propagation artificial neural network (BP-ANN). The area under the curve (AUC) of the BP-ANN is greater than the AUCs of the scores used by physicians (Wells and revised Geneva). The results demonstrate that topological data analysis and the BP-ANN, used in combination, can produce better predictive models than the Wells or revised Geneva scoring systems for the analyzed cohort.
Comment: 18 pages, 5 figures, 6 tables. arXiv admin note: text overlap with arXiv:cs/0308031 by other authors without attribution.
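To make the modelling stage concrete, the following is a minimal sketch of the kind of BP-ANN classifier the abstract describes, evaluated by AUC against a clinical-score baseline. The feature matrix, labels, and baseline scores below are simulated stand-ins, not the study's cohort, and the network architecture is an assumption.

```python
# Sketch: a back-propagation ANN scored by AUC against a clinical baseline.
# X, y, and wells_scores are simulated placeholders, not the study's data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 8))  # stand-in for TDA-selected clinical features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)  # stand-in PE labels
wells_scores = X[:, 0] + rng.normal(scale=2.0, size=n)  # stand-in clinical score

X_tr, X_te, y_tr, y_te, _, wells_te = train_test_split(
    X, y, wells_scores, test_size=0.3, random_state=0)

# One hidden layer trained by back-propagation; the architecture is assumed.
bp_ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
bp_ann.fit(X_tr, y_tr)

auc_ann = roc_auc_score(y_te, bp_ann.predict_proba(X_te)[:, 1])
auc_base = roc_auc_score(y_te, wells_te)
print(f"BP-ANN AUC: {auc_ann:.3f}  baseline-score AUC: {auc_base:.3f}")
```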
Learning the optimal scale for GWAS through hierarchical SNP aggregation
Motivation: Genome-Wide Association Studies (GWAS) seek to identify causal
genomic variants associated with rare human diseases. The classical statistical
approach for detecting these variants is based on univariate hypothesis
testing, with healthy individuals being tested against affected individuals at
each locus. Given that an individual's genotype is characterized by up to one
million SNPs, this approach lacks precision, since it may yield a large number
of false positives that can lead to erroneous conclusions about genetic
associations with the disease. One way to improve the detection of true genetic
associations is to reduce the number of hypotheses to be tested by grouping
SNPs. Results: We propose a dimension-reduction approach which can be applied
in the context of GWAS by making use of the haplotype structure of the human
genome. We compare our method with standard univariate and multivariate
approaches on both synthetic and real GWAS data, and we show that reducing the dimension of the predictor matrix by aggregating SNPs yields greater precision in detecting associations between the phenotype and genomic regions.
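As a rough illustration of the aggregation idea, the sketch below groups correlated SNPs by hierarchical clustering (a crude proxy for haplotype structure), averages the genotypes within each group, and runs one univariate test per group instead of one per SNP. The genotype matrix, phenotype, and number of groups are simulated assumptions, not the paper's actual method or data.

```python
# Sketch: hierarchical SNP aggregation, then one test per group.
# G (0/1/2 allele counts) and y are simulated stand-ins.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n, p = 300, 60
G = rng.integers(0, 3, size=(n, p)).astype(float)  # stand-in genotype matrix
y = G[:, :3].sum(axis=1) + rng.normal(size=n)      # phenotype driven by one region

# Cluster SNPs on 1 - |correlation| (crude LD proxy); cut at a chosen scale.
corr = np.corrcoef(G.T)
dist = 1.0 - np.abs(corr)
Z = linkage(dist[np.triu_indices(p, k=1)], method="average")  # condensed form
groups = fcluster(Z, t=10, criterion="maxclust")  # 10 groups; scale is assumed

# One aggregated predictor per group, then one univariate test per group.
for g in np.unique(groups):
    agg = G[:, groups == g].mean(axis=1)
    r, pval = pearsonr(agg, y)
    print(f"group {g:2d}: {np.sum(groups == g):2d} SNPs, p = {pval:.2e}")
```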
Malware Analysis with Machine Learning
Master's thesis, Information Security (Segurança Informática), Universidade de Lisboa, Faculdade de Ciências, 2022.
Malware attacks have been one of the most serious cyber risks in recent years. Almost every week, the number of vulnerability reports increases in the security communities. One of the key causes of this exponential growth is that malware authors started introducing mutations to avoid detection. This means that malicious files from the same malware family, with the same malicious behaviour, are constantly modified or obfuscated using a variety of techniques to make them appear to be different.
Characteristics retrieved from raw binary files or disassembled code are used in existing machine
learning-based malware categorization algorithms. The variety of such attributes has made it difficult to
develop generic malware categorization methods that operate well in a variety of operating scenarios.
To be effective in evaluating and categorizing such enormous volumes of data, it is necessary to divide them into groups and identify their respective families based on their behaviour. Malicious software is converted to a greyscale image representation, since this captures subtle changes while preserving the global structure, which helps to detect variations. Motivated by the machine learning results achieved in the ImageNet challenge, this dissertation proposes an agnostic deep learning solution for efficiently classifying malware into families based on a collection of discriminant patterns retrieved from its visualization as images.
In this thesis, we present Malwizard, an adaptable Python solution suited for companies or end users that allows them to obtain fast, automated malware analysis. The solution was implemented as an Outlook add-in and as an API service for SOAR platforms, since email is the primary vector for this type of attack, with companies being the most attractive targets.
The Microsoft Classification Challenge dataset was used to evaluate the novel approach. Its image representation was also ciphered, generating the corresponding ciphered images, to evaluate whether the same patterns could be identified using traditional machine learning techniques. This allows privacy concerns to be addressed, keeping the data analysed by the neural networks secure from unauthorized parties.
An experimental comparison demonstrates that the novel approach performed close to the best analysed model on a plain-text dataset, completing the task in one-third of the time. For the encrypted dataset, classical techniques need to be adapted in order to be efficient.
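The byte-to-greyscale-image conversion the thesis builds on can be sketched in a few lines of Python: one byte becomes one 8-bit pixel, and the byte stream is reshaped to a fixed width. The image width and file path below are illustrative assumptions, not Malwizard's actual parameters.

```python
# Sketch: render a malware binary as a greyscale image, one byte per pixel.
# Width choice and sample path are assumptions for illustration only.
import numpy as np
from PIL import Image

def malware_to_image(path: str, width: int = 256) -> Image.Image:
    """Render a binary file as a greyscale image, one byte per pixel."""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    height = len(data) // width
    pixels = data[: height * width].reshape(height, width)  # drop the ragged tail
    return Image.fromarray(pixels, mode="L")

# Example usage (hypothetical sample path):
# img = malware_to_image("samples/suspicious.bin")
# img.resize((64, 64)).save("suspicious.png")  # fixed size for a CNN input
```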
Evaluating classification accuracy for modern learning approaches
Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/149333/1/sim8103_am.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/149333/2/sim8103.pd
Highly accurate model for prediction of lung nodule malignancy with CT scans
Computed tomography (CT) examinations are commonly used to predict lung nodule malignancy in patients and have been shown to improve noninvasive early diagnosis of lung cancer. It remains challenging for computational approaches
to achieve performance comparable to experienced radiologists. Here we present
NoduleX, a systematic approach to predict lung nodule malignancy from CT data,
based on deep learning convolutional neural networks (CNN). For training and
validation, we analyze >1000 lung nodules in images from the LIDC/IDRI cohort.
All nodules were identified and classified by four experienced thoracic
radiologists who participated in the LIDC project. NoduleX achieves high
accuracy for nodule malignancy classification, with an AUC of ~0.99. This is
commensurate with the analysis of the dataset by experienced radiologists. Our
approach, NoduleX, provides an effective framework for highly accurate nodule
malignancy prediction with the model trained on a large patient population. Our
results are replicable with software available at
http://bioinformatics.astate.edu/NoduleX
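For illustration, a toy CNN of the general kind the abstract describes might look as follows in PyTorch: a small 3D-patch classifier emitting one malignancy score per nodule patch. The architecture, patch size, and framework are assumptions, not NoduleX's published design (the actual model and data are at the URL above).

```python
# Sketch: a tiny 3D CNN mapping a CT patch to a malignancy probability.
# Layer sizes and 32^3 patch dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class NoduleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8 * 8, 64), nn.ReLU(),  # 32 channels, 8^3 voxels
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.classifier(self.features(x)))

# Smoke test on random patches shaped (batch, channel, depth, height, width).
model = NoduleCNN()
patch = torch.randn(4, 1, 32, 32, 32)
print(model(patch).shape)  # torch.Size([4, 1]): one malignancy score per patch
```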
Spam Filter Improvement Through Measurement
This work supports the thesis that sound quantitative evaluation for
spam filters leads to substantial improvement in the classification
of email. To this end, new laboratory testing methods and datasets
are introduced, and evidence is presented that their adoption at the
Text REtrieval Conference (TREC) and elsewhere has led to an improvement
in state-of-the-art spam filtering. While many of these improvements have been discovered
by others, the best-performing method known at this time -- spam filter
fusion -- was demonstrated by the author.
This work describes four principal dimensions of spam filter evaluation
methodology and spam filter improvement. An initial study investigates
the application of twelve open-source filter configurations in a laboratory
environment, using a stream of 50,000 messages captured from a single
recipient over eight months. The study measures the impact of user
feedback and on-line learning on filter performance using methodology
and measures which were released to the research community as the
TREC Spam Filter Evaluation Toolkit.
The toolkit was used as the basis of the TREC Spam Track, which the
author co-founded with Cormack. The Spam Track, in addition to evaluating
a new application (email spam), addressed the issue of testing systems
on both private and public data. While streams of private messages
are most realistic, they are not easy to come by and cannot be shared
with the research community as archival benchmarks. Using the toolkit,
participant filters were evaluated on both, and the differences were
found not to substantially confound evaluation; as a result, public
corpora were validated as research tools. Over the course of TREC and similar
evaluation efforts, a dozen or more archival benchmarks --
some private and some public -- have become available.
The toolkit and methodology have spawned improvements in the state
of the art every year since their deployment in 2005. In 2005, 2006,
and 2007, the Spam Track yielded new best-performing systems based
on sequential compression models, orthogonal sparse bigram features,
logistic regression, and support vector machines. Using the TREC participant
filters, we develop and demonstrate methods for on-line filter fusion
that outperform all other reported on-line personal spam filters.
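As a hedged sketch of what on-line filter fusion can look like, the code below combines the scores of several simulated base filters with an on-line logistic regression and compares the fused AUC against each filter's own. The base filters, their scores, and the fusion rule are stand-ins; the thesis's actual fusion method may differ.

```python
# Sketch: on-line fusion of several spam filters' scores into one meta-score.
# The base "filters" here are simulated, not TREC participant systems.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n_msgs, n_filters = 2000, 4
truth = rng.integers(0, 2, size=n_msgs)  # 1 = spam, 0 = ham
# Each simulated filter emits a noisy spamminess score per message.
scores = truth[:, None] + rng.normal(scale=[[0.8, 1.0, 1.2, 1.5]],
                                     size=(n_msgs, n_filters))

fusion = SGDClassifier(loss="log_loss")  # on-line logistic regression
fused, classes_seen = [], 0
for i in range(n_msgs):                  # stream messages one at a time
    x = scores[i : i + 1]
    if classes_seen == 2:                # need both classes before scoring
        fused.append(fusion.decision_function(x)[0])
    else:
        fused.append(float(x.sum()))     # cold start: naive sum of scores
    fusion.partial_fit(x, truth[i : i + 1], classes=[0, 1])
    classes_seen = len(np.unique(truth[: i + 1]))

print("fused AUC:", round(roc_auc_score(truth, fused), 3))
for j in range(n_filters):
    print(f"filter {j} AUC:", round(roc_auc_score(truth, scores[:, j]), 3))
```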
Neyman-Pearson classification algorithms and NP receiver operating characteristics
In many binary classification applications, such as disease diagnosis and spam detection, practitioners commonly face the need to limit type I error (that is, the conditional probability of misclassifying a class 0 observation as class 1) so that it remains below a desired threshold. To address this need, the Neyman-Pearson (NP) classification paradigm is a natural choice; it minimizes type II error (that is, the conditional probability of misclassifying a class 1 observation as class 0) while enforcing an upper bound, α, on the type I error. Despite its century-long history in hypothesis testing, the NP paradigm has not been well recognized or properly implemented in classification schemes. Common practices that directly limit the empirical type I error to no more than α do not satisfy the type I error control objective, because the resulting classifiers are likely to have type I errors much larger than α. We develop the first umbrella algorithm that implements the NP paradigm for all scoring-type classification methods, such as logistic regression, support vector machines, and random forests. Powered by this algorithm, we propose a novel graphical tool for NP classification methods: NP receiver operating characteristic (NP-ROC) bands, motivated by the popular ROC curves. NP-ROC bands help choose α in a data-adaptive way and compare different NP classifiers. We demonstrate the use and properties of the NP umbrella algorithm and NP-ROC bands, available in the R package nproc, through simulation and real data studies.
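The core of the umbrella algorithm can be sketched compactly: choose the classification threshold as an order statistic of held-out class-0 scores so that the type I error exceeds α with probability at most δ, via a binomial tail bound. The simulated scores below are stand-ins; the nproc R package wraps this procedure around many base classifiers.

```python
# Sketch: NP-style threshold from held-out class-0 scores via a binomial bound.
# Scores are simulated; in practice they come from a trained scoring classifier.
import numpy as np
from scipy.stats import binom

def np_threshold(scores_class0, alpha=0.05, delta=0.05):
    """Smallest order statistic whose type-I-error violation prob. is <= delta."""
    s = np.sort(scores_class0)
    n = len(s)
    for k in range(1, n + 1):
        # P(type I error > alpha) when thresholding at the k-th smallest score:
        # sum_{j=k}^{n} C(n,j) (1-alpha)^j alpha^(n-j)  ==  binom.sf(k-1, n, 1-alpha)
        if binom.sf(k - 1, n, 1 - alpha) <= delta:
            return s[k - 1]
    raise ValueError("class-0 sample too small for this alpha/delta")

rng = np.random.default_rng(3)
s0 = rng.normal(0, 1, size=400)  # held-out class-0 scores (stand-in)
thr = np_threshold(s0, alpha=0.05, delta=0.05)
print("threshold:", round(thr, 3),
      "empirical type I error:", np.mean(s0 > thr))
```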