FRI -- Feature Relevance Intervals for Interpretable and Interactive Data Exploration
Most existing feature selection methods are insufficient for analytic purposes when dealing with high-dimensional data or redundant sensor signals, since features can be selected due to spurious effects or correlations rather than causal effects. To support the identification of causal features in biomedical experiments, we present FRI, an open-source Python library that can be used to identify all-relevant variables in linear classification and (ordinal) regression problems. Built on the recently proposed feature relevance method, FRI provides a basis for further general experimentation and, in particular, can facilitate the search for alternative biomarkers. It can be used in an interactive context, through its model manipulation and visualization methods, or in a batch process as a filter method.
Comment: Addition of IEEE copyright notice. Accepted for CIBCB 2019 (https://cibcb2019.icas.xyz/).
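The abstract does not show FRI's actual API, and FRI's relevance intervals come from linear-programming bounds rather than correlations. As a rough, hypothetical illustration of the interval idea only, the sketch below bootstraps a simple per-feature relevance score (|Pearson correlation| with the target, a stand-in measure, not FRI's method) into a confidence interval, so that an irrelevant feature's interval hugs zero while a relevant feature's interval sits well above it:

```python
import random
import statistics

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def relevance_interval(feature, target, n_boot=500, seed=0):
    """Bootstrap a (low, high) 95% interval for |corr(feature, target)|."""
    rng = random.Random(seed)
    n = len(feature)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [feature[i] for i in idx]
        ys = [target[i] for i in idx]
        scores.append(abs(pearson(xs, ys)))
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot)]

# Toy data: x1 drives the target, x2 is pure noise.
rng = random.Random(1)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [rng.gauss(0, 1) for _ in range(200)]
y = [a + 0.3 * rng.gauss(0, 1) for a in x1]

lo1, hi1 = relevance_interval(x1, y)  # interval well above zero
lo2, hi2 = relevance_interval(x2, y)  # interval near zero
```

The library's own intervals additionally distinguish strongly from weakly relevant features, which a univariate score like this cannot do.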
Selection of informative geometric features of cell nuclei in fluorescent images of cancer cells
Methods for selecting informative geometric features of nuclei in fluorescent images of cancer cells are considered. A review of existing geometric features was carried out, covering both shape features invariant to image rotation and translation, and features describing location in space. Six feature selection methods were used: the median method, correlation using the Pearson correlation coefficient, correlation using the Spearman correlation coefficient, a logistic regression model, a random forest with CART trees and the Gini criterion, and a random forest with CART trees and an error-minimization criterion. As a result of the investigation, 11 of the 59 features were selected; classification quality (using the random forest method) and time costs were calculated as a function of the number of features used to describe the objects. For the random forest method, using 11 features is sufficient in terms of classification accuracy and reduces time costs by a factor of 2.3.
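Of the selection methods listed, the rank-correlation filter is the easiest to sketch. The following minimal, self-contained illustration (not the authors' code; data and feature count are invented) ranks candidate features by |Spearman's rho| with the target and keeps the top k:

```python
import random

def rankdata(values):
    """1-based ranks for Spearman's rho (ties broken by position)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mean = (n + 1) / 2
    num = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    den = sum((a - mean) ** 2 for a in rx)  # same for ry when there are no ties
    return num / den

def select_top_k(features, target, k):
    """Rank features by |Spearman rho| with the target; keep the top k."""
    ranked = sorted(range(len(features)),
                    key=lambda j: abs(spearman(features[j], target)),
                    reverse=True)
    return ranked[:k]

# Toy example: 5 candidate features, only f0 and f2 track the target.
rng = random.Random(0)
target = [rng.random() for _ in range(100)]
features = [
    [t + 0.1 * rng.random() for t in target],   # informative
    [rng.random() for _ in range(100)],         # noise
    [-t + 0.1 * rng.random() for t in target],  # informative (negative)
    [rng.random() for _ in range(100)],         # noise
    [rng.random() for _ in range(100)],         # noise
]
top = select_top_k(features, target, k=2)
```

A filter of this kind scores each feature independently, which is exactly why it is cheap but can miss features that are only informative jointly.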
Methods to Improve the Prediction Accuracy and Performance of Ensemble Models
The application of ensemble predictive models has been an important research area in medical diagnostics, engineering diagnostics, and other related smart devices and technologies. Most current predictive models are complex and unreliable despite numerous past efforts by the research community. The performance accuracy of predictive models has not always been realised, due to factors such as complexity and class imbalance. There is therefore a need to improve the predictive accuracy of current ensemble models and to enhance their applications, their reliability, and their use in non-invasive predictive tools.
The research work presented in this thesis adopts a pragmatic phased approach to propose and develop new ensemble models using multiple methods, validated through rigorous testing and implementation in different phases. The first phase comprises empirical investigations of standalone and ensemble algorithms, carried out to ascertain how the complexity and simplicity of the classifiers affect performance. The second phase comprises an improved ensemble model based on the integration of the Extended Kalman Filter (EKF), Radial Basis Function Network (RBFN) and AdaBoost algorithms. The third phase comprises an extended model based on early-stopping concepts, the AdaBoost algorithm, and the statistical performance of the training samples, designed to minimize overfitting in the proposed model. The fourth phase comprises an enhanced analytical multivariate logistic regression predictive model developed to minimize complexity and improve the prediction accuracy of the logistic regression model.
To facilitate the practical application of the proposed models, an ensemble non-invasive analytical tool is proposed and developed. The tool bridges the gap between theoretical concepts and their practical application in predicting breast cancer survivability. The empirical findings suggested that: (1) increasing the complexity and topology of algorithms does not necessarily lead to better algorithmic performance; (2) boosting by resampling performs slightly better than boosting by reweighting; (3) the proposed ensemble EKF-RBFN-AdaBoost model achieved better prediction accuracy than several established ensemble models; (4) the proposed early-stopped model converges faster and minimizes overfitting better compared with other models; (5) the proposed multivariate logistic regression concept minimizes model complexity; and (6) the proposed analytical non-invasive tool performed comparatively better than many of the benchmark analytical tools used in predicting breast cancer and diabetic ailments.
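Finding (2) contrasts boosting by resampling with boosting by reweighting. The sketch below is a minimal, illustrative version of the resampling variant with decision stumps as weak learners (the thesis's EKF-RBFN-AdaBoost model is far more elaborate; the toy data and coarse threshold grid are invented here). Each round draws a weighted bootstrap sample and trains the weak learner on it, instead of passing example weights into the learner:

```python
import math
import random

THRESHOLDS = [i / 5.0 - 1.0 for i in range(11)]  # coarse grid on [-1, 1]

def stump_train(X, y):
    """Best axis-aligned threshold classifier; labels are in {-1, +1}."""
    best = None
    for j in range(len(X[0])):
        for thr in THRESHOLDS:
            for sign in (1, -1):
                pred = [sign if row[j] <= thr else -sign for row in X]
                err = sum(p != t for p, t in zip(pred, y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda row: sign if row[j] <= thr else -sign

def adaboost_resample(X, y, rounds=20, seed=0):
    """AdaBoost 'by resampling': each round draws a bootstrap sample with
    the current example weights and trains the weak learner on it, rather
    than handing the weights to the learner ('by reweighting')."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        idx = rng.choices(range(n), weights=w, k=n)  # weighted bootstrap
        h = stump_train([X[i] for i in idx], [y[i] for i in idx])
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        err = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda row: 1 if sum(a * h(row) for a, h in ensemble) >= 0 else -1

# Toy data: label +1 when the two coordinates sum to a positive number.
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(80)]
y = [1 if a + b > 0 else -1 for (a, b) in X]
clf = adaboost_resample(X, y)
accuracy = sum(clf(xi) == yi for xi, yi in zip(X, y)) / len(X)
```

The only difference from the reweighting variant is the `rng.choices(...)` line: reweighting would call `stump_train` on the full set with `w` as per-example weights.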
The research contributions to ensemble practice are: (1) the integration and development of the EKF, RBFN and AdaBoost algorithms as an ensemble model; (2) the development and validation of an ensemble model based on early-stopping concepts, AdaBoost, and statistical properties of the training samples; (3) the development and validation of a predictive logistic regression model for breast cancer; and (4) the development and validation of non-invasive breast cancer analytic tools based on the predictive models proposed and developed in this thesis.
To validate the prediction accuracy of the ensemble models, the proposed models were applied in this thesis to modelling breast cancer survivability and diabetics' diagnostic tasks. In comparison with other established models, the simulation results showed improved predictive accuracy.
The research outlines the benefits of the proposed models and proposes new directions for future work that could further extend and improve the models discussed in this thesis.
FRI - Feature Relevance Intervals for Interpretable and Interactive Data Exploration
Pfannschmidt L, Göpfert C, Neumann U, Heider D, Hammer B. FRI - Feature Relevance Intervals for Interpretable and Interactive Data Exploration. Presented at the 16th IEEE International Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Certosa di Pontignano, Siena, Tuscany, Italy.
Predictive Modelling Approach to Data-Driven Computational Preventive Medicine
This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, this thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, models' training, estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus.
Midway through this research, this thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. This thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance.
In particular, the experimental results show that building predictive models with the methods guided by our new framework (Octopus) yields domain experts' approval of the new models' reliable performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to weigh predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performance, in line with experts' success criteria, than the traditional imbalanced data resampling techniques. Finally, the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics.
The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework and produce new reliable classifiers; in addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and the biological effective dose (BED). Second, new methods were developed within the new framework. Finally, the newly developed predictive models help detect adverse health events, namely visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.
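For context, the "traditional imbalanced data resampling techniques" that MPR is compared against include naive random oversampling, which simply duplicates minority-class examples. A minimal sketch (illustrative only, not the thesis's MPR method):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Naive random oversampling: duplicate randomly chosen minority-class
    rows until every class matches the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
        extra = target - len(rows)  # how many duplicates this class needs
        X_out.extend(rng.choice(rows) for _ in range(extra))
        y_out.extend([label] * extra)
    return X_out, y_out

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2            # 8:2 class imbalance
Xb, yb = random_oversample(X, y)
counts = Counter(yb)             # both classes now have 8 examples
```

Because it only repeats existing rows, this baseline adds no new information about the minority class, which is the weakness that reconstruction-based approaches like MPR aim to address.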
When the machine does not know: measuring uncertainty in deep learning models of medical images
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.
Recently, deep learning (DL), which involves powerful black-box predictors, has outperformed human experts in several medical diagnostic problems. However, these methods focus exclusively on improving the accuracy of point predictions without assessing the quality of their outputs, and they ignore the asymmetric costs involved in different types of misclassification errors. Neural networks also do not report confidence in their predictions and suffer from over- and underconfidence, i.e. they are not well calibrated. Knowing how much confidence there is in a prediction is essential for gaining clinicians' trust in the technology.
Calibrated uncertainty quantification is a challenging problem, as no ground truth is available. To address this, we make two observations: (i) cost-sensitive deep neural networks with DropWeights better quantify calibrated predictive uncertainty, and (ii) estimated uncertainty combined with point predictions in deep ensembles of Bayesian neural networks with DropWeights can lead to more informed decisions and improved prediction quality.
This dissertation focuses on quantifying uncertainty using concepts from cost-sensitive neural networks, confidence calibration, and the DropWeights ensemble method. First, we show how to improve predictive uncertainty with deep ensembles of neural networks with DropWeights, which learn an approximate distribution over their weights, in medical image segmentation and its application to active learning. Second, we use the jackknife resampling technique to correct bias in quantified uncertainty in image classification and propose metrics to measure uncertainty performance. The third part of the thesis is motivated by the discrepancy between the model's predictive error and the objective in quantified uncertainty when the costs of misclassification errors are asymmetric or datasets are unbalanced. We develop cost-sensitive modifications of neural networks for disease detection and propose metrics to measure the quality of quantified uncertainty. Finally, we leverage an adaptive binning strategy to measure an uncertainty calibration error that corresponds directly to estimated uncertainty performance, addressing problematic evaluation methods.
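The abstract does not give the exact adaptive-binning scheme used; one common form replaces fixed-width confidence bins with equal-frequency bins, so that every bin holds the same number of predictions. A minimal sketch of such a calibration-error estimate (an assumption about the scheme, not the thesis's implementation):

```python
def adaptive_ece(confidences, correct, n_bins=5):
    """Expected calibration error with adaptive (equal-frequency) bins:
    sort predictions by confidence, split them into bins of roughly equal
    size, and average |mean confidence - empirical accuracy| per bin,
    weighted by bin size."""
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        chunk = pairs[lo:hi]
        if not chunk:
            continue
        conf = sum(c for c, _ in chunk) / len(chunk)
        acc = sum(k for _, k in chunk) / len(chunk)
        ece += (len(chunk) / n) * abs(conf - acc)
    return ece

# Calibrated toy predictor: 60%-confident predictions are right 60% of the
# time and 90%-confident ones 90% of the time.
good = adaptive_ece([0.6] * 10 + [0.9] * 10,
                    [1] * 6 + [0] * 4 + [1] * 9 + [0] * 1, n_bins=2)

# Overconfident predictor: 99% confidence but only 50% accuracy.
bad = adaptive_ece([0.99] * 10, [1, 0] * 5)  # roughly 0.5
```

Equal-frequency bins avoid the empty or near-empty bins that fixed-width binning produces when a model's confidences cluster near 1.0, which is one of the "problematic evaluation methods" such strategies address.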
We evaluate the effectiveness of these tools on nuclei image segmentation, multi-class brain MRI image classification, multi-level cell-type-specific protein expression prediction in immunohistochemistry (IHC) images, and cost-sensitive classification for Covid-19 detection from X-ray and CT image datasets. Our approach is thoroughly validated by measuring the quality of uncertainty. It produces equally good or better results and paves the way for future work addressing practical problems at the intersection of deep learning and Bayesian decision theory.
In conclusion, our study highlights the opportunities and challenges of applying estimated uncertainty, representing the confidence of a model's prediction, in deep learning models of medical images; the uncertainty quality metrics show a significant improvement when using deep ensembles of Bayesian neural networks with DropWeights.
Sensing Depression
The hallmark indicator of depressive disorders is the presence of a sad, empty, or irritable mood, accompanied by somatic and cognitive changes that significantly affect the individual's capacity to function. The overall goal of our project is to provide a tool for doctors to effortlessly detect depression and, in effect, achieve greater coverage in detecting depression across the general population. We use machine learning techniques to create a mobile application that infers a smartphone user's severity of depression from data scraped from their phone and social media accounts. Through our study, we have demonstrated the feasibility of this approach to diagnosing depression, achieving an average test-set RMSE of 5.67 across all modalities in the task of PHQ-9 score prediction.
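The reported figure is a test-set RMSE of 5.67 on PHQ-9 scores, which range from 0 to 27, i.e. predictions off by roughly 5.7 points on average. RMSE itself is simple to compute; the scores below are invented for illustration:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and true PHQ-9 scores."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

# Hypothetical predictions against true PHQ-9 scores (scale 0-27).
true_scores = [4, 10, 17, 22, 8]
predictions = [6, 9, 12, 20, 10]
error = rmse(predictions, true_scores)  # about 2.76
```

Because the error is squared before averaging, RMSE penalizes the occasional badly missed score more heavily than mean absolute error would.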
Microarray Data Mining and Gene Regulatory Network Analysis
The novel molecular biology technology, the microarray, makes it feasible to obtain quantitative measurements of the expression of thousands of genes in a biological sample simultaneously. Genome-wide expression data generated by this technology promise to uncover implicit, previously unknown biological knowledge. In this study, several problems in microarray data mining were investigated, including feature (gene) selection, classifier gene identification, generation of reference genetic interaction networks for non-model organisms, and gene regulatory network (GRN) reconstruction using time-series gene expression data. The limitations of most existing computational models employed to infer gene regulatory networks lie in that they suffer from either low accuracy or high computational complexity. To overcome these limitations, the following strategies were proposed to integrate bioinformatics data mining techniques with existing GRN inference algorithms, enabling the discovery of novel biological knowledge. An integrated statistical and machine learning (ISML) pipeline was developed for feature selection and classifier gene identification, addressing the curse-of-dimensionality problem as well as the huge search space. Using the selected classifier genes as seeds, a scale-up technique was applied to search through major databases of genetic interaction networks, metabolic pathways, etc.
By curating relevant genes and blasting genomic sequences of non-model organisms against well-studied genetic model organisms, a reference gene regulatory network for less-studied organisms was built and used both as prior knowledge and model validation for GRN reconstructions. Networks of gene interactions were inferred using a Dynamic Bayesian Network (DBN) approach and were analyzed for elucidating the dynamics caused by perturbations. Our proposed pipelines were applied to investigate molecular mechanisms for chemical-induced reversible neurotoxicity
Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data
BACKGROUND: Molecular data, e.g. arising from microarray technology, are often used for predicting the survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially settings in which it is computationally too expensive to check all possible interactions. RESULTS: We use a boosting technique for the estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is interactions composed of variables that do not represent main effects, but our findings are promising in this regard as well. Results on real-world data illustrate that the effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. CONCLUSION: Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.
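The abstract does not spell out the orthogonalization pre-processing step; a common form (assumed here for illustration) residualizes the product term x1*x2 against the intercept and the main effects, so that any remaining signal in the product is attributable to the interaction alone. Sequential Gram-Schmidt projections give the exact joint residual when each basis vector is orthogonalized against the previous ones:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(v, basis):
    """Residual of v after projecting onto each (mutually orthogonal)
    basis vector in turn."""
    out = list(v)
    for b in basis:
        coef = dot(out, b) / dot(b, b)
        out = [o - coef * bi for o, bi in zip(out, b)]
    return out

rng = random.Random(0)
n = 50
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
inter = [a * b for a, b in zip(x1, x2)]  # raw interaction (product) term

ones = [1.0] * n
e1 = residual(x1, [ones])                # x1, centered
e2 = residual(x2, [ones, e1])            # x2, orthogonal to intercept and e1
inter_orth = residual(inter, [ones, e1, e2])  # interaction, main effects removed
```

After this step, `inter_orth` is orthogonal to the intercept and to both main effects, so a boosting model that selects it is picking up genuine interaction signal rather than a proxy for x1 or x2.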