341 research outputs found

    Improved Machine Learning-Based Predictive Models for Breast Cancer Diagnosis

    Breast cancer death rates are higher than those of any other cancer in American women. Machine learning-based predictive models promise earlier detection techniques for breast cancer diagnosis. However, evaluating models that diagnose cancer efficiently remains challenging. In this work, we proposed data exploratory techniques (DET) and developed four different predictive models to improve breast cancer diagnostic accuracy. Before building the models, we examined four layers of essential DET in depth (feature distribution, correlation, elimination, and hyperparameter optimization) to identify robust features for classification into malignant and benign classes. These proposed techniques and classifiers were implemented on the Wisconsin Diagnostic Breast Cancer (WDBC) and Breast Cancer Coimbra Dataset (BCCD) datasets. Standard performance metrics, including confusion matrices and K-fold cross-validation techniques, were applied to assess each classifier’s efficiency and training time. The models’ diagnostic capability improved with our DET: on the WDBC dataset, polynomial SVM reached 99.3% accuracy, LR 98.06%, KNN 97.35%, and EC 97.61%. We also compared our significant results with previous studies in terms of accuracy. The implementation procedure and findings can guide physicians to adopt an effective model for a practical understanding and prognosis of breast cancer tumors.
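
The evaluation protocol named in this abstract (K-fold cross-validation scored with a confusion matrix) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the seed, and the "M"/"B" label encoding for malignant and benign are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def confusion_matrix(y_true, y_pred, positive="M"):
    """Return (tp, fp, fn, tn) for a binary task; `positive` is the malignant label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of predictions that match the true label."""
    return (tp + tn) / (tp + fp + fn + tn)
```

In use, a classifier would be fitted on each `train_idx` split and scored on the matching `test_idx` split, with the per-fold accuracies averaged at the end.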

    Network Traffic Based Botnet Detection Using Machine Learning

    The field of information and computer security is developing rapidly as new security risks are discovered every day. The moment new software or a new product is launched in the market, a new exploit or vulnerability is exposed and exploited by attackers or malicious users for different motives. Many attacks are distributed in nature and carried out by botnets that cause widespread disruption of network activity through DDoS (Distributed Denial of Service) attacks, email spamming, click fraud, information and identity theft, virtual deceit, and distributed resource usage for cryptocurrency mining. Botnet detection is still an active area of research, as no single technique is available that can detect the entire ecosystem of a botnet like Neris, Rbot, and Virut. Botnets tend to have different configurations and are heavily armored by malware writers, who employ sophisticated evasion techniques to avoid detection systems. This report provides a detailed overview of a botnet and its characteristics and of the existing work in the domain of botnet detection. The study aims to evaluate preprocessing techniques such as variance thresholding and one-hot encoding to clean the botnet dataset, and feature selection techniques such as filter, wrapper, and embedded methods to boost machine learning model performance. This study addresses dataset imbalance through techniques such as undersampling, oversampling, ensemble learning, and gradient boosting, using random forest, decision tree, AdaBoost, and XGBoost. Lastly, the optimal model is trained and tested on datasets of different attacks to study its performance.
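
The two preprocessing steps named in this abstract, variance thresholding and one-hot encoding, can be sketched with the standard library alone. This is a hedged illustration: the function names, the default threshold of zero, and the toy protocol values are assumptions for the example, not details from the report.

```python
from statistics import pvariance


def variance_threshold(rows, threshold=0.0):
    """Drop numeric columns whose population variance is not above `threshold`.

    Returns the kept column indices and the reduced rows.
    """
    n_cols = len(rows[0])
    keep = [c for c in range(n_cols)
            if pvariance([row[c] for row in rows]) > threshold]
    return keep, [[row[c] for c in keep] for row in rows]


def one_hot(values):
    """Encode a categorical column as 0/1 vectors, one position per category.

    Returns the sorted category list and the encoded rows.
    """
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[position[value]] = 1
        encoded.append(vector)
    return categories, encoded
```

A constant column (for example, a flag that never changes across flows) has zero variance and is removed, while a categorical field such as the transport protocol becomes one binary column per distinct value.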

    Performance of Malware Classification on Machine Learning using Feature Selection

    The exponential growth of malware poses a significant threat to daily life, which relies heavily on computers running all kinds of software. Malware writers create new variants, new innovations, new infections, and more heavily obfuscated malware using techniques such as packing and encryption. Classifying and detecting malicious software is both important and a major challenge for cyber security research. Because of the increasing false alarm rate, accurate classification and detection of malware is a pressing problem. In this research, samples from eight malware families are classified by family. The research applies four feature selection algorithms to select the best features for this multiclass classification problem and compares them. The top 100 features found by these algorithms are then selected for performance evaluation. Five machine learning algorithms are compared to find the best model. The frequency distribution of features is then obtained from the feature ranking of the best model. The research concludes that the frequency distribution of every character of the API call sequence can be used to classify malware families.
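
The closing idea of this abstract, using the frequency distribution of characters in an API call sequence as features, can be illustrated as follows. The abstract does not say how the sequence is encoded, so this sketch assumes the calls are available as name strings; the function name, the lowercasing step, and the alphabet are hypothetical choices.

```python
from collections import Counter


def char_frequency_features(api_calls, alphabet):
    """Return the relative frequency of each alphabet character in the call sequence.

    `api_calls` is a list of API call names (an assumed representation);
    characters outside `alphabet` are ignored.
    """
    text = "".join(api_calls).lower()
    counts = Counter(ch for ch in text if ch in alphabet)
    total = sum(counts.values())
    # An empty sequence yields an all-zero vector instead of dividing by zero.
    return [counts[ch] / total if total else 0.0 for ch in alphabet]
```

Each sample then becomes a fixed-length numeric vector, one entry per character, which any multiclass classifier can consume.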

    Identifying Phenotypes Based on TCR Repertoire Using Machine Learning Methods

    The adaptive immune system can protect human beings from infection by pathogens. T cells, a kind of lymphocyte in adaptive immunity, recognise antigens via T cell receptors (TCRs) and then generate cell-mediated immune responses. After primary immune responses, adaptive immunity can generate corresponding immunological memory. TCRs are generated by a process of somatic gene rearrangement and therefore have high diversity. An individual's TCR repertoire can reveal that individual's pathogen exposure history, which can assist in biological studies such as disease diagnosis. This master's thesis aims to predict phenotype status from high-throughput TCR sequencing data using machine learning approaches, to see how accurate phenotype identification based on the TCR repertoire can be. The raw TCR data is preprocessed in three different ways, and each version then goes through the subsequent steps separately. Several feature selection approaches are applied to obtain the most important TCRs. The machine learning algorithms, including a Beta-binomial model (baseline), Logistic regression, Random forest, and the boosting algorithm LightGBM, are trained and evaluated. Two datasets, Cytomegalovirus (CMV) and rheumatoid arthritis (RA), are explored. For the CMV dataset, Random forest performs best, although only slightly better than the baseline model. The classification results on the RA dataset, however, are poor regardless of the model used; the best classifier is LightGBM. The results imply that the TCR data needs to be large enough to make powerful predictions. With a sufficiently large dataset, the predictive ability of the baseline model is strong, and certain algorithms, such as Random forest, may outperform it.

    Formalization and Detection of Host-Based Code Injection Attacks in the Context of Malware

    Host-Based Code Injection Attacks (HBCIAs) are a technique that malicious software uses to avoid detection or steal sensitive information. In a nutshell, this is a local attack where code is injected across process boundaries and executed in the context of a victim process. Malware employs HBCIAs on several operating systems, including Windows, Linux, and macOS. This thesis investigates the topic of HBCIAs in the context of malware. First, we conduct basic research on this topic. We formalize HBCIAs in the context of malware and show in several measurements, amongst others, the high prevalence of HBCIA-utilizing malware. Second, we present Bee Master, a platform-independent approach to dynamically detect HBCIAs. This approach applies the honeypot paradigm to operating system processes. Bee Master deploys fake processes as honeypots, which are attacked by malicious software. We show that Bee Master reliably detects HBCIAs on Windows and Linux. Third, we present Quincy, a machine learning-based system to detect HBCIAs in post-mortem memory dumps. It utilizes up to 38 features, including memory region sparseness, memory region protection, and the occurrence of HBCIA-related strings. We evaluate Quincy against two contemporary detection systems called Malfind and Hollowfind. This evaluation shows that Quincy outperforms them both. It is able to increase the detection performance by more than eight percent.

    Predictive analytics framework for electronic health records with machine learning advancements : optimising hospital resources utilisation with predictive and epidemiological models

    The primary aim of this thesis was to investigate the feasibility and robustness of predictive machine-learning models for improving the utilisation of hospital resources with data-driven approaches, and for predicting hospitalisation using hospital quality assessment metrics such as length of stay. The length of stay predictions include validating the proposed predictive framework on each hospital’s electronic health records data source. In this thesis, we relied on electronic health records (EHRs) to drive a data-driven predictive framework for inpatient length of stay (LOS) that suits the most demanding hospital facilities in the context of hospital resource utilisation. The thesis focused on the viability of the predictive LOS approaches in dynamic and demanding healthcare facilities and hospital settings such as intensive care units and emergency departments. While the hospital LOS predictions are an (internal) assessment of inpatient outcomes from admission to discharge, the thesis also considered (external) factors outside hospital control, such as forecasting future hospitalisations from the spread of infectious communicable disease during pandemics. The split into internal and external factors is the thesis’ main contribution. Accordingly, the thesis evaluated public health measures during events of uncertainty (e.g. pandemics) and measured the effect of non-pharmaceutical interventions during outbreaks on future hospitalised cases. To the best of our knowledge, this approach is the first contribution in the literature to examine the effect of epidemiological curves, using simulation models to project future hospitalisations, given their strong potential to affect hospital bed availability and to put stress on hospital workflow and workers.
The main research commonality between the chapters is the usefulness of ensemble learning models in the context of LOS for hospital resource utilisation. Ensemble learning models aim for better predictive performance by combining several base models into an optimal predictive model. These predictive models explored the internal LOS for various chronic and acute conditions using data-driven approaches to determine the most accurate and powerful predicted outcomes. This ultimately helps professionals working in hospital settings achieve the desired outcomes.
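
The idea of combining several base models into one predictor can be sketched with simple prediction averaging. This is only one of many ensemble schemes and is not claimed to be the one used in the thesis; the function name and the toy base models (each mapping a patient record to a predicted LOS in days) are hypothetical.

```python
from statistics import mean


def ensemble_predict(base_models, record):
    """Average the length-of-stay predictions of several base models.

    Each base model is any callable that maps a patient record to a
    predicted LOS in days (an assumed interface for this sketch).
    """
    return mean(model(record) for model in base_models)


# Toy base models standing in for fitted regressors.
def by_age(record):
    return 2.0 + 0.05 * record["age"]


def by_ward(record):
    return 6.0 if record["ward"] == "ICU" else 3.0
```

Averaging tends to reduce the variance of individual models; boosting or stacking would replace the plain mean with a learned combination.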

    Implementing a webserver for managing and detecting viral fusion proteins

    Master's dissertation in Bioinformatics. Viral fusion proteins are essential to allow enveloped viruses (such as Influenza, Dengue, HIV and SARS-CoV-2) to enter their hosts’ cells, in a mechanism referred to as membrane fusion. This makes these proteins (with special relevance to their fusion peptides, the component of the protein that can insert into the host’s membrane by itself) interesting potential therapeutic targets for preventing or treating some well-known diseases. However, there is no centralized data repository containing all the relevant information regarding viral fusion proteins. With that in mind, the main purpose of this work is to develop a CRUD (Create, Read, Update and Delete) web server that will allow researchers to find all the necessary data regarding viral fusion proteins through an easy-to-use web interface. The web application will also contain other bioinformatics functionalities, such as sequence alignment (through BLAST, Clustal and Weblogo), to allow researchers to retrieve key pieces of information regarding a fusion protein, as well as machine learning models capable of predicting the location of fusion peptides inside the viral fusion protein sequence. The implementation of the server used Django as its back-end, retrieving the data from a MySQL database, and Angular as its front-end. The main result of the work is, therefore, a working web server, with a web interface available online through the URL: https://viralfp.bio.di.uminho.pt/. The web application allows users to explore the gathered data related to viral fusion proteins in a user-friendly way. This tool contains all the proposed functionalities and machine learning models.
As expected in an application’s development, there are several aspects that require future work to improve the usefulness of this tool to the scientific community.
First and foremost, this dissertation is funded by COMPETE 2020, Portugal 2020 and FCT - Fundação para a Ciência e a Tecnologia, under the project “Using computational and experimental methods to provide a global characterization of viral fusion peptides”, through the funding program “02/SAICT/2017 - Projetos de Investigação Científica e Desenvolvimento Tecnologico (IC&DT)”, with the reference “NORTE-01-0145-FEDER-028200”, who I would like to thank for their trust.

    Automatic Malware Detection

    The problem of automatic malware detection presents challenges for antivirus vendors. Since manual investigation is not possible due to the massive number of samples being submitted every day, automatic malware classification is necessary. Our work is focused on an automatic malware detection framework based on machine learning algorithms. We proposed several static malware detection systems for the Windows operating system to achieve the primary goal of distinguishing between malware and benign software. We also considered the more practical goal of detecting as much malware as possible while maintaining a sufficiently low false positive rate. We proposed several malware detection systems using various machine learning techniques, such as an ensemble classifier, a recurrent neural network, and distance metric learning. We designed architectures of the proposed detection systems, which are automatic in the sense that extraction of features, preprocessing, training, and evaluation of the detection model can be automated. However, an antivirus program relies on a more complex system consisting of many components, several of which depend on malware analysts and researchers. Malware authors adapt their malicious programs frequently in order to bypass antivirus programs that are regularly updated. Our proposed detection systems are not automatic in the sense that they are not able to automatically adapt to detect the newest malware. However, we can partly solve this problem by running our proposed systems again when the training set contains the newest malware. Our work relied on static analysis only. In this thesis, we discuss its advantages and drawbacks in comparison to dynamic analysis. Static analysis still plays an important role, and it is used as one component of a complex detection system.