FRI -- Feature Relevance Intervals for Interpretable and Interactive Data Exploration
Most existing feature selection methods are insufficient for analytic purposes when dealing with high-dimensional data or redundant sensor signals, since features can be selected due to spurious effects or correlations rather than causal effects. To support the identification of causal features in biomedical experiments, we present FRI, an open-source Python library that can be used to identify all-relevant variables in linear classification and (ordinal) regression problems. Built on the recently proposed feature relevance method, FRI provides a basis for further general experimentation and, in particular, can facilitate the search for alternative biomarkers. It can be used in an interactive context, through its model manipulation and visualization methods, or in a batch process as a filter method.
Comment: Addition of IEEE copyright notice. Accepted for CIBCB 2019 (https://cibcb2019.icas.xyz/).
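The abstract does not show FRI's actual API, and FRI's relevance intervals come from linear-programming bounds rather than correlations. As a rough, hypothetical illustration of the interval idea only, the sketch below bootstraps a simple per-feature relevance score (|Pearson correlation| with the target, a stand-in measure, not FRI's method) into a confidence interval, so that an irrelevant feature's interval hugs zero while a relevant feature's interval sits well above it:

```python
import random
import statistics

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def relevance_interval(feature, target, n_boot=500, seed=0):
    """Bootstrap a (low, high) 95% interval for |corr(feature, target)|."""
    rng = random.Random(seed)
    n = len(feature)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [feature[i] for i in idx]
        ys = [target[i] for i in idx]
        scores.append(abs(pearson(xs, ys)))
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot)]

# Toy data: x1 drives the target, x2 is pure noise.
rng = random.Random(1)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [rng.gauss(0, 1) for _ in range(200)]
y = [a + 0.3 * rng.gauss(0, 1) for a in x1]

lo1, hi1 = relevance_interval(x1, y)  # interval well above zero
lo2, hi2 = relevance_interval(x2, y)  # interval near zero
```

The library's own intervals additionally distinguish strongly from weakly relevant features, which a univariate score like this cannot do.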
Selection of informative geometric features of cell nuclei in fluorescent images of cancer cells
Methods for selecting informative geometric features of nuclei in fluorescent images of cancer cells are considered. A review of existing geometric features was carried out, covering both shape features invariant to image rotation and translation, and features describing location in space. Six feature selection methods were used: the median method, correlation using the Pearson correlation coefficient, correlation using the Spearman correlation coefficient, a logistic regression model, a random forest with CART trees and the Gini criterion, and a random forest with CART trees and an error-minimization criterion. As a result of the investigation, 11 of the 59 features were selected; classification quality (using the random forest method) and time costs were calculated as a function of the number of features used to describe the objects. For the random forest method, using 11 features is sufficient in terms of classification accuracy and reduces time costs by a factor of 2.3.
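Of the selection methods listed, the rank-correlation filter is the easiest to sketch. The following minimal, self-contained illustration (not the authors' code; data and feature count are invented) ranks candidate features by |Spearman's rho| with the target and keeps the top k:

```python
import random

def rankdata(values):
    """1-based ranks for Spearman's rho (ties broken by position)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mean = (n + 1) / 2
    num = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    den = sum((a - mean) ** 2 for a in rx)  # same for ry when there are no ties
    return num / den

def select_top_k(features, target, k):
    """Rank features by |Spearman rho| with the target; keep the top k."""
    ranked = sorted(range(len(features)),
                    key=lambda j: abs(spearman(features[j], target)),
                    reverse=True)
    return ranked[:k]

# Toy example: 5 candidate features, only f0 and f2 track the target.
rng = random.Random(0)
target = [rng.random() for _ in range(100)]
features = [
    [t + 0.1 * rng.random() for t in target],   # informative
    [rng.random() for _ in range(100)],         # noise
    [-t + 0.1 * rng.random() for t in target],  # informative (negative)
    [rng.random() for _ in range(100)],         # noise
    [rng.random() for _ in range(100)],         # noise
]
top = select_top_k(features, target, k=2)
```

A filter of this kind scores each feature independently, which is exactly why it is cheap but can miss features that are only informative jointly.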
Methods to Improve the Prediction Accuracy and Performance of Ensemble Models
The application of ensemble predictive models has been an important research area in medical diagnostics, engineering diagnostics, and other related smart devices and technologies. Most current predictive models are complex and unreliable despite numerous past efforts by the research community. The performance accuracy of predictive models has not always been realised, due to factors such as complexity and class imbalance. There is therefore a need to improve the predictive accuracy of current ensemble models and to enhance their applications, their reliability, and their use in non-invasive predictive tools.
The research work presented in this thesis adopts a pragmatic phased approach to propose and develop new ensemble models using multiple methods, validated through rigorous testing and implementation in different phases. The first phase comprises empirical investigations of standalone and ensemble algorithms, carried out to ascertain how the complexity and simplicity of the classifiers affect performance. The second phase comprises an improved ensemble model based on the integration of the Extended Kalman Filter (EKF), Radial Basis Function Network (RBFN) and AdaBoost algorithms. The third phase comprises an extended model based on early-stopping concepts, the AdaBoost algorithm, and the statistical performance of the training samples, designed to minimize overfitting in the proposed model. The fourth phase comprises an enhanced analytical multivariate logistic regression predictive model developed to minimize complexity and improve the prediction accuracy of the logistic regression model.
To facilitate the practical application of the proposed models, an ensemble non-invasive analytical tool is proposed and developed. The tool bridges the gap between theoretical concepts and their practical application in predicting breast cancer survivability. The empirical findings suggested that: (1) increasing the complexity and topology of algorithms does not necessarily lead to better algorithmic performance; (2) boosting by resampling performs slightly better than boosting by reweighting; (3) the proposed ensemble EKF-RBFN-AdaBoost model achieved better prediction accuracy than several established ensemble models; (4) the proposed early-stopped model converges faster and minimizes overfitting better compared with other models; (5) the proposed multivariate logistic regression concept minimizes model complexity; and (6) the proposed analytical non-invasive tool performed comparatively better than many of the benchmark analytical tools used in predicting breast cancer and diabetic ailments.
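Finding (2) contrasts boosting by resampling with boosting by reweighting. The sketch below is a minimal, illustrative version of the resampling variant with decision stumps as weak learners (the thesis's EKF-RBFN-AdaBoost model is far more elaborate; the toy data and coarse threshold grid are invented here). Each round draws a weighted bootstrap sample and trains the weak learner on it, instead of passing example weights into the learner:

```python
import math
import random

THRESHOLDS = [i / 5.0 - 1.0 for i in range(11)]  # coarse grid on [-1, 1]

def stump_train(X, y):
    """Best axis-aligned threshold classifier; labels are in {-1, +1}."""
    best = None
    for j in range(len(X[0])):
        for thr in THRESHOLDS:
            for sign in (1, -1):
                pred = [sign if row[j] <= thr else -sign for row in X]
                err = sum(p != t for p, t in zip(pred, y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda row: sign if row[j] <= thr else -sign

def adaboost_resample(X, y, rounds=20, seed=0):
    """AdaBoost 'by resampling': each round draws a bootstrap sample with
    the current example weights and trains the weak learner on it, rather
    than handing the weights to the learner ('by reweighting')."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        idx = rng.choices(range(n), weights=w, k=n)  # weighted bootstrap
        h = stump_train([X[i] for i in idx], [y[i] for i in idx])
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        err = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda row: 1 if sum(a * h(row) for a, h in ensemble) >= 0 else -1

# Toy data: label +1 when the two coordinates sum to a positive number.
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(80)]
y = [1 if a + b > 0 else -1 for (a, b) in X]
clf = adaboost_resample(X, y)
accuracy = sum(clf(xi) == yi for xi, yi in zip(X, y)) / len(X)
```

The only difference from the reweighting variant is the `rng.choices(...)` line: reweighting would call `stump_train` on the full set with `w` as per-example weights.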
The research contributions to ensemble practice are: (1) the integration and development of the EKF, RBFN and AdaBoost algorithms as an ensemble model; (2) the development and validation of an ensemble model based on early-stopping concepts, AdaBoost, and statistical properties of the training samples; (3) the development and validation of a predictive logistic regression model for breast cancer; and (4) the development and validation of non-invasive breast cancer analytic tools based on the predictive models proposed and developed in this thesis.
To validate the prediction accuracy of the ensemble models, the proposed models were applied in this thesis to modelling breast cancer survivability and diabetics' diagnostic tasks. In comparison with other established models, the simulation results showed improved predictive accuracy.
The research outlines the benefits of the proposed models and proposes new directions for future work that could further extend and improve the models discussed in this thesis.
FRI - Feature Relevance Intervals for Interpretable and Interactive Data Exploration
Pfannschmidt L, Göpfert C, Neumann U, Heider D, Hammer B. FRI - Feature Relevance Intervals for Interpretable and Interactive Data Exploration. Presented at the 16th IEEE International Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Certosa di Pontignano, Siena, Tuscany, Italy.
Predictive Modelling Approach to Data-Driven Computational Preventive Medicine
This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, this thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, models' training, estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus.
Midway through this research, this thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. This thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance.
In particular, the experimental results show that building predictive models with the methods guided by our new framework (Octopus) yields domain experts' approval of the new models' reliable performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to weigh predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performance, in line with experts' success criteria, than the traditional imbalanced data resampling techniques. Finally, the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics.
The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework and produce new reliable classifiers; in addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and the biological effective dose (BED). Second, new methods were developed within the new framework. Finally, the newly developed predictive models help detect adverse health events, namely visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.
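For context, the "traditional imbalanced data resampling techniques" that MPR is compared against include naive random oversampling, which simply duplicates minority-class examples. A minimal sketch (illustrative only, not the thesis's MPR method):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Naive random oversampling: duplicate randomly chosen minority-class
    rows until every class matches the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
        extra = target - len(rows)  # how many duplicates this class needs
        X_out.extend(rng.choice(rows) for _ in range(extra))
        y_out.extend([label] * extra)
    return X_out, y_out

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2            # 8:2 class imbalance
Xb, yb = random_oversample(X, y)
counts = Counter(yb)             # both classes now have 8 examples
```

Because it only repeats existing rows, this baseline adds no new information about the minority class, which is the weakness that reconstruction-based approaches like MPR aim to address.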
When the machine does not know: measuring uncertainty in deep learning models of medical images
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.
Recently, deep learning (DL), which involves powerful black-box predictors, has outperformed human experts in several medical diagnostic problems. However, these methods focus exclusively on improving the accuracy of point predictions without assessing the quality of their outputs, and they ignore the asymmetric costs involved in different types of misclassification errors. Neural networks also do not report confidence in their predictions and suffer from over- and underconfidence, i.e. they are not well calibrated. Knowing how much confidence there is in a prediction is essential for gaining clinicians' trust in the technology.
Calibrated uncertainty quantification is a challenging problem, as no ground truth is available. To address this, we make two observations: (i) cost-sensitive deep neural networks with DropWeights better quantify calibrated predictive uncertainty, and (ii) estimated uncertainty combined with point predictions in deep ensembles of Bayesian neural networks with DropWeights can lead to more informed decisions and improved prediction quality.
This dissertation focuses on quantifying uncertainty using concepts from cost-sensitive neural networks, confidence calibration, and the DropWeights ensemble method. First, we show how to improve predictive uncertainty with deep ensembles of neural networks with DropWeights, which learn an approximate distribution over their weights, in medical image segmentation and its application to active learning. Second, we use the jackknife resampling technique to correct bias in quantified uncertainty in image classification and propose metrics to measure uncertainty performance. The third part of the thesis is motivated by the discrepancy between the model's predictive error and the objective in quantified uncertainty when the costs of misclassification errors are asymmetric or datasets are unbalanced. We develop cost-sensitive modifications of neural networks for disease detection and propose metrics to measure the quality of quantified uncertainty. Finally, we leverage an adaptive binning strategy to measure an uncertainty calibration error that corresponds directly to estimated uncertainty performance, addressing problematic evaluation methods.
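The abstract does not give the exact adaptive-binning scheme used; one common form replaces fixed-width confidence bins with equal-frequency bins, so that every bin holds the same number of predictions. A minimal sketch of such a calibration-error estimate (an assumption about the scheme, not the thesis's implementation):

```python
def adaptive_ece(confidences, correct, n_bins=5):
    """Expected calibration error with adaptive (equal-frequency) bins:
    sort predictions by confidence, split them into bins of roughly equal
    size, and average |mean confidence - empirical accuracy| per bin,
    weighted by bin size."""
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        chunk = pairs[lo:hi]
        if not chunk:
            continue
        conf = sum(c for c, _ in chunk) / len(chunk)
        acc = sum(k for _, k in chunk) / len(chunk)
        ece += (len(chunk) / n) * abs(conf - acc)
    return ece

# Calibrated toy predictor: 60%-confident predictions are right 60% of the
# time and 90%-confident ones 90% of the time.
good = adaptive_ece([0.6] * 10 + [0.9] * 10,
                    [1] * 6 + [0] * 4 + [1] * 9 + [0] * 1, n_bins=2)

# Overconfident predictor: 99% confidence but only 50% accuracy.
bad = adaptive_ece([0.99] * 10, [1, 0] * 5)  # roughly 0.5
```

Equal-frequency bins avoid the empty or near-empty bins that fixed-width binning produces when a model's confidences cluster near 1.0, which is one of the "problematic evaluation methods" such strategies address.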
We evaluate the effectiveness of these tools on nuclei image segmentation, multi-class brain MRI image classification, multi-level cell-type-specific protein expression prediction in immunohistochemistry (IHC) images, and cost-sensitive classification for Covid-19 detection from X-ray and CT image datasets. Our approach is thoroughly validated by measuring the quality of uncertainty. It produces equally good or better results and paves the way for future work addressing practical problems at the intersection of deep learning and Bayesian decision theory.
In conclusion, our study highlights the opportunities and challenges of applying estimated uncertainty, representing the confidence of a model's prediction, in deep learning models of medical images; the uncertainty quality metrics show a significant improvement when using deep ensembles of Bayesian neural networks with DropWeights.
Sensing Depression
The hallmark indicator of depressive disorders is the presence of a sad, empty, or irritable mood, accompanied by somatic and cognitive changes that significantly affect the individual's capacity to function. The overall goal of our project is to provide a tool for doctors to effortlessly detect depression and, in effect, achieve greater coverage in detecting depression across the general population. We use machine learning techniques to create a mobile application that infers a smartphone user's severity of depression from data scraped from their phone and social media accounts. Through our study, we have demonstrated the feasibility of this approach to diagnosing depression, achieving an average test-set RMSE of 5.67 across all modalities in the task of PHQ-9 score prediction.
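The reported figure is a test-set RMSE of 5.67 on PHQ-9 scores, which range from 0 to 27, i.e. predictions off by roughly 5.7 points on average. RMSE itself is simple to compute; the scores below are invented for illustration:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and true PHQ-9 scores."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

# Hypothetical predictions against true PHQ-9 scores (scale 0-27).
true_scores = [4, 10, 17, 22, 8]
predictions = [6, 9, 12, 20, 10]
error = rmse(predictions, true_scores)  # about 2.76
```

Because the error is squared before averaging, RMSE penalizes the occasional badly missed score more heavily than mean absolute error would.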
Microarray Data Mining and Gene Regulatory Network Analysis
The novel molecular biology technology, the microarray, makes it feasible to obtain quantitative measurements of the expression of thousands of genes in a biological sample simultaneously. Genome-wide expression data generated by this technology promise to uncover implicit, previously unknown biological knowledge. In this study, several problems in microarray data mining were investigated, including feature (gene) selection, classifier gene identification, generation of reference genetic interaction networks for non-model organisms, and gene regulatory network (GRN) reconstruction using time-series gene expression data. The limitations of most existing computational models employed to infer gene regulatory networks lie in that they suffer from either low accuracy or high computational complexity. To overcome these limitations, the following strategies were proposed to integrate bioinformatics data mining techniques with existing GRN inference algorithms, enabling the discovery of novel biological knowledge. An integrated statistical and machine learning (ISML) pipeline was developed for feature selection and classifier gene identification, addressing the curse-of-dimensionality problem as well as the huge search space. Using the selected classifier genes as seeds, a scale-up technique was applied to search through major databases of genetic interaction networks, metabolic pathways, etc.
By curating relevant genes and blasting genomic sequences of non-model organisms against well-studied genetic model organisms, a reference gene regulatory network for less-studied organisms was built and used both as prior knowledge and model validation for GRN reconstructions. Networks of gene interactions were inferred using a Dynamic Bayesian Network (DBN) approach and were analyzed for elucidating the dynamics caused by perturbations. Our proposed pipelines were applied to investigate molecular mechanisms for chemical-induced reversible neurotoxicity
Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data
BACKGROUND: Molecular data, e.g. arising from microarray technology, are often used for predicting the survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially settings in which it is computationally too expensive to check all possible interactions. RESULTS: We use a boosting technique for the estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is interactions composed of variables that do not represent main effects, but our findings are promising in this regard as well. Results on real-world data illustrate that the effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. CONCLUSION: Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.
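The abstract does not spell out the orthogonalization pre-processing step; a common form (assumed here for illustration) residualizes the product term x1*x2 against the intercept and the main effects, so that any remaining signal in the product is attributable to the interaction alone. Sequential Gram-Schmidt projections give the exact joint residual when each basis vector is orthogonalized against the previous ones:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(v, basis):
    """Residual of v after projecting onto each (mutually orthogonal)
    basis vector in turn."""
    out = list(v)
    for b in basis:
        coef = dot(out, b) / dot(b, b)
        out = [o - coef * bi for o, bi in zip(out, b)]
    return out

rng = random.Random(0)
n = 50
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
inter = [a * b for a, b in zip(x1, x2)]  # raw interaction (product) term

ones = [1.0] * n
e1 = residual(x1, [ones])                # x1, centered
e2 = residual(x2, [ones, e1])            # x2, orthogonal to intercept and e1
inter_orth = residual(inter, [ones, e1, e2])  # interaction, main effects removed
```

After this step, `inter_orth` is orthogonal to the intercept and to both main effects, so a boosting model that selects it is picking up genuine interaction signal rather than a proxy for x1 or x2.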