
    Interpretable Machine Learning for the Detection of Intrapartum Fetal Hypoxia

    Master in Research and Innovation on Computational Intelligence and Interactive Systems. Nowadays, Machine Learning (ML) has become a widely used tool in many fields thanks to its capacity to learn to solve problems automatically and to analyze large amounts of data efficiently. In recent years, real-world problems have been solved with very good results using ML methods. However, even for experts in the ML field, the results are sometimes difficult to interpret because the models act as black boxes. This can cause these models to lose much of their usefulness, especially in the clinical field, where interpretability is essential for adoption in real-world practice. For this reason, interpretable machine learning is a continuously growing area.

There are many clinical problems where ML methods can help healthcare staff. In particular, this Master Thesis focuses on the detection of intrapartum fetal hypoxia, since preserving the well-being of the fetus during pregnancy and delivery is of great importance to avoid possible harm.

To this end, we first studied the patterns most commonly used in the clinical field to detect fetal distress. We then studied and trained both models that are interpretable by definition and more complex models to solve the problem, specifically linear models, tree-based models, and kernel-based models. For the latter, external interpretability techniques such as LIME and SHAP were used to explain their behavior. In this way, it was possible to study which features the models use to solve the problem and to analyze whether they are similar to those used in the medical field, that is, whether the models act with clinical sense.

This document presents the different phases developed throughout this work. With the approach adopted, it has been shown that it is possible to give interpretability to ML models and to understand how and why a model makes its predictions. The proposed method provides a first positive study, and the encouraging results obtained in the classification tasks demonstrate the interest and feasibility of this approach for detecting intrapartum fetal hypoxia
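As a rough illustration of the post-hoc interpretability step described above, the sketch below applies SHAP (via the third-party shap package) to a tree-based classifier on synthetic, CTG-like tabular data. The feature names, data, and model choice are placeholders, not the thesis's actual setup.

```python
# Hedged sketch: post-hoc SHAP explanation of a tree-based classifier on
# synthetic "CTG-like" features. Feature names, data, and labels are illustrative only.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["baseline_fhr", "accelerations", "decelerations", "variability"]
X = rng.normal(size=(500, len(feature_names)))
# toy label: "hypoxia" loosely driven by low variability and frequent decelerations
y = ((X[:, 3] < -0.5) & (X[:, 2] > 0.3)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# TreeExplainer gives per-feature contributions for each prediction,
# which can then be compared against clinically meaningful patterns.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
# older shap returns a list (one array per class); newer versions return one stacked array
sv = shap_values[1] if isinstance(shap_values, list) else np.asarray(shap_values)[..., 1]
mean_abs = np.abs(sv).mean(axis=0)
for name, value in sorted(zip(feature_names, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: {value:.3f}")
```

The mean absolute SHAP value per feature gives a global ranking that can then be checked against the clinical patterns mentioned in the abstract.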

    A Software Vulnerability Prediction Model Using Traceable Code Patterns And Software Metrics

    Software security is an important aspect of ensuring software quality. The goal of this study is to help developers evaluate software security at an early stage of development using traceable patterns and software metrics. The concept of traceable patterns is similar to that of design patterns, but traceable patterns can be automatically recognized and extracted from source code. If these patterns predict vulnerable code better than traditional software metrics, they can be used to build a vulnerability prediction model that classifies code as vulnerable or not. By analyzing and comparing the performance of traceable patterns and metrics, we propose such a model. Objective: This study explores the performance of code patterns in vulnerability prediction, compares them with traditional software metrics, and uses the findings to build an effective vulnerability prediction model. Method: We designed and conducted experiments on the security vulnerabilities reported for Apache Tomcat (releases 6, 7 and 8), Apache CXF, and three stand-alone Java web applications from Stanford SecuriBench. We used machine learning and statistical techniques to predict vulnerabilities in these systems using traceable patterns and metrics as features. Result: We found that patterns have a lower false negative rate and higher recall in detecting vulnerable code than traditional software metrics. We also identified a set of patterns and metrics that yields higher recall in vulnerability prediction. Conclusion: Based on the experimental results, we proposed a prediction model using patterns and metrics to better predict vulnerable code with higher recall. We evaluated the model on the systems under study and also evaluated its performance in cross-dataset validation
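A minimal, hedged sketch of the kind of evaluation described above: classify code units as vulnerable or not from (hypothetical) traceable-pattern counts and classic software metrics, then report recall and false negative rate. The feature names, data-generating process, and classifier are illustrative, not the study's actual setup.

```python
# Hedged sketch: vulnerability prediction from pattern counts and software metrics,
# evaluated by recall and false-negative rate. All features and labels are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
# toy feature matrix: pattern occurrence counts plus classic metrics (illustrative names)
X = np.column_stack([
    rng.poisson(2, n),        # pattern_input_validation (count)
    rng.poisson(1, n),        # pattern_unchecked_return (count)
    rng.normal(200, 80, n),   # lines_of_code
    rng.normal(8, 3, n),      # cyclomatic_complexity
])
y = (X[:, 1] + 0.01 * X[:, 2] + rng.normal(0, 1, n) > 4).astype(int)  # toy label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=1)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("recall:", recall_score(y_te, pred))
print("false negative rate:", fn / (fn + tp))
```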

    CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS

    The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting was organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies; it is a non-profit, non-political scientific organization whose aim is to further classification research

    Hyper-heuristic decision tree induction

    A hyper-heuristic is any algorithm that searches or operates in the space of heuristics, as opposed to the space of solutions. Hyper-heuristics are increasingly used in function and combinatorial optimization: rather than attempting to solve a problem with a fixed heuristic, a hyper-heuristic approach tries to find a combination of heuristics that solves the problem (and may, in turn, be directly suitable for a class of problem instances). Hyper-heuristics have been little explored in data mining. This work presents novel hyper-heuristic approaches to data mining, searching a space of attribute selection criteria for a decision-tree building algorithm; the search is conducted by a genetic algorithm. The result of the hyper-heuristic search in this case is a strategy for selecting attributes while building decision trees. Most hyper-heuristics work by trying to adapt the heuristic to the state of the problem being solved, and ours is no different: it employs a strategy for adapting the heuristic used to build decision tree nodes according to a set of features of the training set it is working on. We introduce, explore, and evaluate five different ways in which this problem state can be represented for a hyper-heuristic that operates within a decision-tree building algorithm. In each case, the hyper-heuristic is guided by a rule set that tries to map features of the data set to be split by the decision-tree building algorithm to a heuristic to be used for splitting that data set. We also explore and evaluate three different sets of low-level heuristics that could be employed by such a hyper-heuristic. This work also distinguishes between specialist hyper-heuristics and generalist hyper-heuristics. The main difference between the two is the number of training sets used by the hyper-heuristic genetic algorithm. Specialist hyper-heuristics are created using a single data set from a particular domain for evolving the hyper-heuristic rule set; such algorithms are expected to outperform standard algorithms on the kind of data set used by the hyper-heuristic genetic algorithm. Generalist hyper-heuristics are trained on multiple data sets from different domains and are expected to deliver robust and competitive performance over these data sets when compared to standard algorithms. We evaluate both approaches for each kind of hyper-heuristic presented in this thesis, using both real and synthetic data sets. Our results suggest that none of the hyper-heuristics presented in this work are suited to specialization: in most cases, the hyper-heuristic's performance on the data set it was specialized for was not significantly better than that of the best-performing standard algorithm. The generalist hyper-heuristics, on the other hand, delivered results that were very competitive with the best standard methods, and in some cases achieved a significantly better overall performance than all of the standard methods
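As a rough, hedged sketch of the core mechanism described above, the snippet below builds a small decision tree in which the split heuristic at each node is chosen by a hand-written rule keyed on simple "problem state" features of the data reaching that node. In the thesis such a rule set is evolved by a genetic algorithm; the state features, thresholds, and low-level heuristics used here are purely illustrative.

```python
# Hedged sketch: adapt the split heuristic per node according to node state.
# The rule in choose_heuristic() stands in for an evolved hyper-heuristic rule set.
import numpy as np

def gini(y):
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

HEURISTICS = {"gini": gini, "entropy": entropy}

def choose_heuristic(y):
    """Toy hyper-heuristic rule set: map node state -> low-level heuristic."""
    imbalance = abs(np.mean(y) - 0.5)
    if len(y) < 30 or imbalance > 0.3:   # small or skewed nodes -> entropy
        return HEURISTICS["entropy"]
    return HEURISTICS["gini"]

def best_split(X, y):
    impurity = choose_heuristic(y)       # heuristic chosen from node state
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    if depth == max_depth or len(np.unique(y)) == 1:
        return {"leaf": int(np.round(np.mean(y)))}
    j, t, _ = best_split(X, y)
    mask = X[:, j] <= t
    return {"feature": j, "threshold": float(t),
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
print(build_tree(X, y))
```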

    Preterm labor prediction using uterine electromyography with Machine Learning and Deep Learning Models

    Master's Project Report, Biostatistics, 2023, Universidade de Lisboa, Faculdade de Ciências. According to the World Health Organization (WHO), preterm birth is defined as the birth of a baby before the completion of 37 weeks of gestation and is considered a high health risk for both the baby and the mother. Two thirds of these births have no specific diagnosis, while the remainder are usually associated with maternal factors such as multiple pregnancies, a history of preterm births, drug use, or an age below 18, among others. Prematurity is the leading cause of death worldwide for children under 5 years of age: at the time of birth the babies are not fully developed and may go on to suffer visual and hearing impairments as well as other health complications such as cardiovascular or respiratory problems. In Portugal, according to the Sociedade Portuguesa de Pediatria, 8% of babies are born preterm. Monitoring labor in order to predict preterm birth has therefore become essential. The two methods most commonly used to monitor uterine contractility are the intrauterine pressure catheter and the external tocogram, but both have limitations: the former is invasive and the latter is ineffective for pregnant women with a high body mass. The study of uterine contraction activity with the Electrohysterogram (EHG) has emerged as a strong alternative for predicting preterm birth. The EHG is a non-invasive method in which electrodes placed on the abdomen record the contractile activity of the uterus as an electrical signal. It remains effective in patients with a high body mass index and can indicate when a pregnant woman is about to go into labor. Today, analysis of the EHG signal is one of the most common approaches for studying and classifying preterm birth with Machine Learning (ML) and Deep Learning (DL) techniques. Frequency-domain, time-domain, and other characteristics extracted from the signal, called features, are used to represent it and are then fed into ML and DL algorithms capable of making predictions based on those characteristics. In the literature, the features most often used to represent EHG signals include frequency, amplitude, and entropy, among others, and have shown positive results with high predictive value in both ML and DL algorithms. Thus, from the EHG signal obtained while monitoring the uterus, it should be possible to predict whether a pregnancy will end in a preterm or a term birth. This classification is, however, still at an experimental stage, and there is a gap in the clinical context for automatic prediction of the type of birth. All of these studies face the problem of a shortage of preterm observations in the available databases. The solutions proposed to address this imbalance involve oversampling techniques, such as SMOTE, which produce synthetic observations for the minority class (preterm births).

The ideal number of synthetic samples is still an open question; most studies balance the data to a final 1:1 ratio, but this can reduce the classifier's ability to identify the majority class and lead to unrealistic, overly optimistic predictions. According to its authors, SMOTE achieves its best results when oversampling of the minority class is combined with undersampling of the majority class. In a processed EHG signal it is possible to distinguish contraction-like events such as Braxton-Hicks contractions, Alvarez waves, and LDBF (Longue Durée Basse Fréquence) waves. So far, features in the literature have been extracted from the complete signal rather than from the contractions themselves, namely the Alvarez and Braxton-Hicks events, which carry information relevant to prematurity. Contractions, however, are time series of different lengths. The solution adopted here is spectral analysis: each contraction is represented by its spectrum, obtained through a time-to-frequency transformation such as the Fourier transform, so that every contraction in the database can be represented in the same way. This technique is used for feature extraction and classification in medical diagnosis. Spectral estimation methods can be parametric or non-parametric; the Welch method is a non-parametric approach that computes the spectrum of each contraction detected in the EHG signal. It has given good results in contraction classification in other work, represents the EHG signal well, and always yields a representation of the same dimension regardless of the duration of the contraction. This study used the public TPEHG (Term-Preterm EHG) database, with a total of 300 records, 262 term and 38 preterm. The recordings use 4 electrodes forming 3 bipolar channels; only one channel was chosen, following the literature, since the vertical signal shows the greatest variation in signal potential. This signal was filtered to remove maternal ECG noise and other related artifacts and processed to a final sampling frequency of 4 Hz. Features were extracted through Welch spectral estimation, yielding a total of 200 features. The final dataset consisted of 4622 observations (contractions), 407 corresponding to preterm and 2829 to term births, with 200 features each. This dataset was then fed to four different ML algorithms, Random Forest, RUSBoosted Trees, Support Vector Machine, and a Shallow Neural Network, as well as a Long Short-Term Memory (LSTM) DL algorithm, with the goal of classifying preterm births. To date, no study has focused on using an LSTM algorithm or on using the spectra of the contractions as features.

In this study, the techniques mentioned above were applied to the ML algorithms in 5 different scenarios, in order to obtain the most robust model, avoid overfitting, and obtain the most realistic results possible: (1) training on the data without any additional method; (2) training with the same algorithms plus a synthetic oversampling technique, SMOTE; (3) training with SMOTE plus a dimensionality reduction technique, PCA; (4) training with a feature selection method, MRMR; (5) tuning the model parameters with Bayesian Optimization. The models were trained and validated, and those with the best predictive results were then tested. The DL algorithms were only tested on the original dataset and on the dataset with SMOTE applied. For all algorithms, accuracy, precision, recall, F1-score, false negative rate, false positive rate, and AUC (except for the DL models) were computed. The results indicate that using the first 200 points of the Welch spectral estimate as frequency features does not give better results than the more traditional time-frequency features used throughout the literature. Moreover, combining SMOTE with undersampling of the majority class produced worse results than applying SMOTE alone, as done by most authors. The ML algorithms behaved better than the DL ones, since they are simpler models that do not depend on large amounts of data. Despite promising results on the training set, with high accuracy, F1-score, and AUC, performance at test time fell below both the expected values and those reported in the literature. Based on these results, we conclude that although applying SMOTE after the train/test split is the more correct approach, it does not yield results comparable to the literature (where the order of these steps is reversed), since the algorithm is evaluated on a test set whose structure is very different from that of the training set, which can lead to lower precision and recall. In summary, using the spectra of the contractions as frequency features on a dataset oversampled with SMOTE, with the various ML and DL techniques mentioned, is not a better alternative to the time-frequency features found in the literature. Nevertheless, this work highlights the importance of recording more EHG data from preterm births, so as to improve future experiments and avoid techniques such as SMOTE. It also opens up the possibility of applying a complex neural network such as an LSTM, with promising results for the future, which may prove effective for preterm birth classification.

The World Health Organization defines premature birth as the birth of a baby before the completion of 37 weeks of gestation, which is considered a high health risk for both the baby and the mother. Prematurity is the leading cause of death in the world for children under 5 years old, so monitoring the uterus to predict preterm labor has become essential. Currently, the Intrauterine Pressure Catheter and External Tocography are the most used monitoring devices; however, the former is invasive and the latter does not perform well for patients with a high body mass index (BMI).
The Electrohysterogram (EHG) has emerged as a noninvasive method for predicting premature birth that performs well for mothers with a high BMI. It uses electrodes placed on the abdomen to record uterine contractions as an electrical signal that contains important information about the electrical activity of the uterus. Analysis of the EHG signal is one of the most used approaches for studying and classifying premature birth with Machine Learning (ML) and Deep Learning (DL) techniques: features such as frequency, amplitude, and others are extracted from the signal to represent it and are fed into algorithms capable of making predictions based on those characteristics. However, this classification method is still in the experimental phase, and there is a gap in the clinical context for automatic birth-type prediction. One of the challenges faced by this method is the lack of observations of premature births in the databases used. Oversampling techniques, such as SMOTE, address this by producing synthetic observations for the minority class. In this thesis, the Welch estimate of the power spectrum of each contraction from the TPEHG Ljubljana public database is used as the feature set, comprising 200 features. The Minimum Redundancy Maximum Relevance (MRMR) algorithm was used to search for the most relevant features of this dataset, with only 180 showing any relevance, and SMOTE was applied to address the skewed-dataset problem. Four different machine learning algorithms were used, the Support Vector Machine, RUSBoosted trees, a Shallow Neural Network, and a Random Forest classifier; a deep learning network was also tested. These were optimized with Bayesian hyperparameter optimization. All algorithms achieved high accuracy but showed low predictive power on the test group, probably due to a highly imbalanced test set. We conclude that using the spectral features of the contractions as an alternative to time-frequency features shows promising results on the training dataset but cannot accurately predict preterm labor on the test set, due to the imbalanced dataset problem. More samples should be collected in the future so that more meaningful conclusions can be drawn
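The following is a minimal, hedged sketch of the feature pipeline described above: each variable-length contraction is mapped to a fixed-length Welch power spectrum, SMOTE is applied to the training split only, and a simple classifier is evaluated. The synthetic data, segment lengths, and classifier are placeholders; the SMOTE implementation here is the one from the third-party imbalanced-learn package.

```python
# Hedged sketch: fixed-length Welch spectra as contraction features, SMOTE on the
# training split only, then a simple classifier. Data are synthetic stand-ins for EHG.
import numpy as np
from scipy.signal import welch
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

FS = 4.0          # final sampling frequency of the preprocessed EHG signal (Hz)
N_FEATURES = 200  # number of spectral points kept per contraction

def spectral_features(contraction, fs=FS, n_features=N_FEATURES):
    """Welch PSD of one contraction, truncated/zero-padded to a fixed length."""
    _, psd = welch(contraction, fs=fs, nperseg=min(len(contraction), 2 * n_features))
    psd = psd[:n_features]
    return np.pad(psd, (0, n_features - len(psd)))

# synthetic contractions of varying length (term = 0, preterm = 1), heavily imbalanced
rng = np.random.default_rng(3)
contractions = [rng.normal(size=rng.integers(120, 600)) for _ in range(600)]
labels = np.array([1 if i % 12 == 0 else 0 for i in range(600)])

X = np.vstack([spectral_features(c) for c in contractions])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, stratify=labels, random_state=3)

# SMOTE is applied only after the train/test separation, as discussed above
X_res, y_res = SMOTE(random_state=3).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=300, random_state=3).fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te)))
```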

    Development of a Web-enabled Spatial Decision Support System (SDSS) for Prevention of Tick Borne Disease in Kuantan, Malaysia

    Ticks are the second most common vectors of human disease after mosquitoes. They are found on many small mammal hosts and also blood-feed on humans, with the risk of transmitting diseases. Considering these disease risks, this study investigated the potential of a web-enabled spatial decision support system (SDSS) to assist government decision-makers in the control, resource management, and prevention of tick-borne diseases, specifically in the study area of Kuantan, Malaysia

    Minimal Infrastructure Radio Frequency Home Localisation Systems

    The ability to track the location of a subject in their home allows the provision of a number of location-based services, such as remote activity monitoring, context-sensitive prompts, and detection of safety-critical situations such as falls. Such pervasive monitoring offers the potential for elders to live at home for longer periods of their lives with minimal human supervision. The focus of this thesis is the investigation and development of a home room-level localisation technique that can be readily deployed in a realistic home environment with minimal hardware requirements. A conveniently deployed Bluetooth® localisation platform is designed and experimentally validated throughout the thesis. The platform adopts the convenience of a mobile phone and the processing power of a remote location-calculation computer. The use of Bluetooth® also ensures the extensibility of the platform to other home health supervision scenarios such as wireless body sensor monitoring. Central contributions of this work include the comparison of probabilistic and non-probabilistic classifiers for location prediction accuracy and the extension of probabilistic classifiers to a Hidden Markov Model Bayesian filtering framework. New location prediction performance metrics are developed, and significant performance improvements are demonstrated with the novel extension of Hidden Markov Models to higher-order Markov movement models. With the simple probabilistic classifiers, location is correctly predicted 80% of the time. This increases to 86% with the application of Hidden Markov Models and 88% when higher-order Hidden Markov Models are employed. Further novelty is exhibited in the derivation of a real-time Hidden Markov Model Viterbi decoding algorithm which retains all the advantages of the original algorithm while producing location estimates in real time. Significant contributions are also made to the field of human gait recognition by applying Bayesian filtering to the task of motion detection from accelerometers which are already present in many mobile phones. Bayesian filtering is demonstrated to enable a 35% improvement in motion recognition rate and even enables a floor recognition rate of 68% using only accelerometers. The unique application of time-varying Hidden Markov Models demonstrates the effect of integrating these freely available motion predictions into long-term location predictions
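A minimal sketch of the room-level HMM idea described above, assuming per-scan room likelihoods have already been produced by some probabilistic RSSI classifier. The room names, transition matrix, and likelihoods are illustrative, and this is the standard (offline) Viterbi decoder rather than the real-time variant derived in the thesis.

```python
# Hedged sketch: smooth per-scan room predictions with a first-order HMM over rooms.
import numpy as np

rooms = ["kitchen", "living_room", "bedroom"]

# P(room_t | room_{t-1}): strong self-transition models slow human movement
A = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
pi = np.full(3, 1.0 / 3.0)  # uniform prior over the starting room

def viterbi(obs_likelihoods, A, pi):
    """Most likely room sequence given per-scan likelihoods P(scan_t | room)."""
    T, n = obs_likelihoods.shape
    delta = np.zeros((T, n))
    back = np.zeros((T, n), dtype=int)
    delta[0] = np.log(pi) + np.log(obs_likelihoods[0])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # (from_room, to_room)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(obs_likelihoods[t])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy per-scan likelihoods, e.g. the output of a probabilistic RSSI classifier
obs = np.array([[0.7, 0.2, 0.1],
                [0.6, 0.3, 0.1],
                [0.2, 0.7, 0.1],   # a momentary misclassification the HMM can smooth
                [0.6, 0.3, 0.1],
                [0.1, 0.2, 0.7]])
print([rooms[i] for i in viterbi(obs, A, pi)])
```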

    The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics

    The small-sample-size issue is a prevalent problem in genomics and proteomics today. The bootstrap, a resampling method that aims to increase the efficiency of data usage, is one effort to overcome the problem of limited sample size. This dissertation studies the application of the bootstrap to two problems in supervised learning with small-sample data: estimation of the misclassification error of Gaussian discriminant analysis, and the bagging ensemble classification method. Estimating the misclassification error of discriminant analysis is a classical problem in pattern recognition and has many important applications in biomedical research. Bootstrap error estimation has been shown empirically to be one of the best estimation methods in terms of root mean squared error. In the first part of this work, we conduct a detailed analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA) classification rule under Gaussian populations. We derive exact formulas for the first and second moments of the zero bootstrap and convex bootstrap estimators, as well as their cross moments with the resubstitution estimator and the true error. Based on these results, we obtain exact formulas for the bias, variance, and root mean squared error of the deviation of these bootstrap estimators from the true error. This includes the moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weights for unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all the expressions involve Gaussian distributions, whereas in the multivariate case the results are written in terms of bivariate doubly non-central F distributions. In the second part of this work, we conduct an extensive empirical investigation of bagging, an application of the bootstrap to ensemble classification. We investigate the performance of bagging in the classification of small-sample gene-expression data and protein-abundance mass spectrometry data, as well as the accuracy of small-sample error estimation with this ensemble classification rule. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, the ensemble method did not improve the performance of these stable classifiers significantly. We give an explicit definition of the out-of-bag estimator, intended to remove estimator bias, by formulating carefully how the error count is normalized, and we investigate the performance of error estimation for bagging of common classification rules, including LDA, 3NN, and CART, applied to both synthetic and real patient data, using common error estimators such as resubstitution, leave-one-out, cross-validation, the basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, and semi-bolstering, in addition to the out-of-bag estimator. The results of the numerical experiments indicated that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. 
The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies. The results of this work are expected to provide helpful guidance to practitioners who are interested in applying the bootstrap in supervised learning applications
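As a rough, hedged illustration of the .632 bootstrap estimator mentioned above, the sketch below estimates the error of an LDA rule on synthetic two-class Gaussian data by combining the resubstitution error with the zero (out-of-sample) bootstrap error using the classical 0.632 weight. The data model, dimensions, and number of bootstrap replicates are arbitrary choices, not the dissertation's experimental settings.

```python
# Hedged sketch: .632 bootstrap error estimation for an LDA classification rule,
# on small synthetic Gaussian samples standing in for genomic data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def b632_error(X, y, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)

    # resubstitution error: train and test on the full sample
    resub = 1.0 - LinearDiscriminantAnalysis().fit(X, y).score(X, y)

    # zero bootstrap: average error on points left out of each bootstrap sample
    zero_errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # points not drawn
        if oob.size == 0 or len(np.unique(y[idx])) < 2:
            continue  # skip degenerate resamples
        clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
        zero_errors.append(1.0 - clf.score(X[oob], y[oob]))
    zero = float(np.mean(zero_errors))

    # convex combination with the classical .632 weight
    return 0.368 * resub + 0.632 * zero

rng = np.random.default_rng(1)
n_per_class, p = 15, 5   # deliberately small sample, as in genomics settings
X = np.vstack([rng.normal(0.0, 1, (n_per_class, p)),
               rng.normal(0.8, 1, (n_per_class, p))])
y = np.repeat([0, 1], n_per_class)
print("estimated .632 bootstrap error:", round(b632_error(X, y), 3))
```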

    Statistical Issues in Machine Learning

    Recursive partitioning methods from machine learning are widely applied in many scientific fields, such as genetics and bioinformatics. The present work is concerned, from a statistical point of view, with the two main problems that arise in recursive partitioning: instability and biased variable selection. With respect to the first issue, instability, this work covers the full range of methods, from standard classification trees through robustified classification trees to ensemble methods such as TWIX, bagging, and random forests. While ensemble methods prove to be much more stable than single trees, they also lose most of their interpretability. Therefore an adaptive cutpoint selection scheme is suggested with which a TWIX ensemble reduces to a single tree if the partition is sufficiently stable. With respect to the second issue, variable selection bias, the statistical sources of this artifact in single trees and a new form of bias inherent in ensemble methods based on bootstrap samples are investigated. For single trees, one unbiased split selection criterion is evaluated and another is newly introduced here. Based on the results for single trees and further findings on the effects of bootstrap sampling on association measures, it is shown that, in addition to using an unbiased split selection criterion, subsampling rather than bootstrap sampling should be employed in ensemble methods so that the variable importance scores of predictor variables of different types can be compared reliably. The statistical properties and the null hypothesis of a test for the random forest variable importance are critically investigated. Finally, a new, conditional importance measure is suggested that allows a fair comparison in the case of correlated predictor variables and better reflects the null hypothesis of interest
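As a small, hedged illustration of the variable-selection-bias theme above (not the thesis's own unbiased split criteria or conditional importance measure), the sketch below shows how an uninformative predictor with many distinct values can receive inflated impurity-based importance in a random forest, while permutation importance computed on held-out data is less affected.

```python
# Hedged sketch: impurity-based vs permutation importance when predictors differ in type.
# The data-generating process is illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
informative_binary = rng.integers(0, 2, n)    # truly predictive, only 2 distinct values
noise_continuous = rng.normal(size=n)         # pure noise, many distinct values
y = (informative_binary + rng.normal(0, 0.5, n) > 0.5).astype(int)
X = np.column_stack([informative_binary, noise_continuous])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# impurity importance can assign a sizeable share to the continuous noise variable
print("impurity importances   :", forest.feature_importances_)
# permutation importance on held-out data keeps the noise variable near zero
perm = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
print("permutation importances:", perm.importances_mean)
```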