1,634 research outputs found

    FRI -- Feature Relevance Intervals for Interpretable and Interactive Data Exploration

    Most existing feature selection methods are insufficient for analytic purposes as soon as high-dimensional data or redundant sensor signals are involved, since features can be selected due to spurious effects or correlations rather than causal effects. To support the finding of causal features in biomedical experiments, we hereby present FRI, an open source Python library that can be used to identify all-relevant variables in linear classification and (ordinal) regression problems. Using the recently proposed feature relevance method, FRI can provide the basis for further general experimentation or, more specifically, can facilitate the search for alternative biomarkers. It can be used in an interactive context, by providing model manipulation and visualization methods, or in a batch process as a filter method. Comment: Addition of IEEE copyright notice. Accepted for CIBCB 2019 (https://cibcb2019.icas.xyz/).

    Selection of informative geometric features of cell nuclei in fluorescent images of cancer cells

    Methods for selecting informative geometric features to describe nuclei in fluorescent images of cancer cells are considered. A review of existing geometric features was carried out, covering both shape features that are invariant to rotation and translation of the image and features describing location in space. Six methods were used to select the most informative features: the median method, correlation with the Pearson correlation coefficient, correlation with the Spearman correlation coefficient, a logistic regression model, a random forest with CART trees and the Gini criterion, and a random forest with CART trees and an error minimization criterion. As a result, 11 of the 59 features were selected as the most informative, classification quality was analyzed using the random forest method, and time costs were calculated as a function of the number of features used to describe the objects. For the random forest method, 11 features are sufficient in terms of classification accuracy and reduce time costs by a factor of 2.3.
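As a rough illustration of the two correlation-based filters named in this abstract, the following sketch ranks features by the absolute Pearson or Spearman correlation between each feature and the class label. It is not the authors' implementation: the function names, the toy feature names (`area`, `noise`) and the data are hypothetical, and the other four methods are not covered.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def rank_features(columns, labels, score=pearson):
    """Sort feature names by |correlation with the label|, strongest first."""
    scored = {name: abs(score(col, labels)) for name, col in columns.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical toy data: two features measured on six nuclei, binary labels.
columns = {
    "area": [1, 2, 3, 4, 5, 6],
    "noise": [3, 1, 4, 1, 5, 2],
}
labels = [0, 0, 0, 1, 1, 1]
print(rank_features(columns, labels))
print(rank_features(columns, labels, score=spearman))
```

A filter of this kind scores each feature independently, so it cannot detect redundancy between features; that is one reason studies such as this one combine it with model-based methods.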

    Methods to Improve the Prediction Accuracy and Performance of Ensemble Models

    The application of ensemble predictive models has been an important research area in medical diagnostics, engineering diagnostics, smart devices, and related technologies. Most current predictive models are complex and unreliable despite numerous past efforts by the research community. The expected accuracy of predictive models has not always been realised because of factors such as complexity and class imbalance. There is therefore a need to improve the predictive accuracy of current ensemble models and to enhance their applications, reliability, and non-visual predictive tools. The research presented in this thesis adopts a pragmatic phased approach to propose and develop new ensemble models using multiple methods, and validates the methods through rigorous testing and implementation in different phases. The first phase comprises empirical investigations on standalone and ensemble algorithms, carried out to ascertain their performance effects on the complexity and simplicity of the classifiers. The second phase comprises an improved ensemble model based on the integration of the Extended Kalman Filter (EKF), Radial Basis Function Network (RBFN), and AdaBoost algorithms. The third phase comprises an extended model based on early stop concepts, the AdaBoost algorithm, and the statistical performance of the training samples, intended to minimise overfitting of the proposed model. The fourth phase comprises an enhanced analytical multivariate logistic regression predictive model, developed to minimise the complexity and improve the prediction accuracy of the logistic regression model. To facilitate the practical application of the proposed models, an ensemble non-invasive analytical tool is proposed and developed. The tool bridges the gap between theoretical concepts and their practical application to predict breast cancer survivability.
The empirical findings suggested that: (1) increasing the complexity and topology of algorithms does not necessarily lead to better algorithmic performance, (2) boosting by resampling performs slightly better than boosting by reweighting, (3) the proposed ensemble EKF-RBFN-AdaBoost model predicted more accurately than several established ensemble models, (4) the proposed early stopped model converges faster and minimises overfitting better compared with other models, (5) the proposed multivariate logistic regression concept minimises the complexity of models, and (6) the proposed analytical non-invasive tool performed comparatively better than many of the benchmark analytical tools used in predicting breast cancer and diabetes. The research contributions to ensemble practice are: (1) the integration and development of the EKF, RBFN, and AdaBoost algorithms as an ensemble model, (2) the development and validation of an ensemble model based on early stop concepts, AdaBoost, and statistical concepts of the training samples, (3) the development and validation of a predictive logistic regression model based on breast cancer, and (4) the development and validation of non-invasive breast cancer analytic tools based on the predictive models proposed and developed in this thesis. To validate the prediction accuracy of ensemble models, the proposed models were applied to modelling breast cancer survivability and to diabetes diagnostic tasks. In comparison with other established models, the simulation results showed improved predictive accuracy. The research outlines the benefits of the proposed models and proposes new directions for future work that could further extend and improve them.

    FRI - Feature Relevance Intervals for Interpretable and Interactive Data Exploration

    Pfannschmidt L, Göpfert C, Neumann U, Heider D, Hammer B. FRI - Feature Relevance Intervals for Interpretable and Interactive Data Exploration. Presented at the 16th IEEE International Conference on Computational Intelligence in Bioinformatics and Computational Biology, Certosa di Pontignano, Siena - Tuscany, Italy. Most existing feature selection methods are insufficient for analytic purposes as soon as high-dimensional data or redundant sensor signals are involved, since features can be selected due to spurious effects or correlations rather than causal effects. To support the finding of causal features in biomedical experiments, we hereby present FRI, an open source Python library that can be used to identify all-relevant variables in linear classification and (ordinal) regression problems. Using the recently proposed feature relevance method, FRI can provide the basis for further general experimentation or, more specifically, can facilitate the search for alternative biomarkers. It can be used in an interactive context, by providing model manipulation and visualization methods, or in a batch process as a filter method.

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, the thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine-learning methods, data preprocessing techniques, models’ training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through the research, the thesis advances preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. The thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that building predictive models with the methods guided by the new framework (Octopus) yields domain experts' approval of the new reliable models’ performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performance in line with experts' success criteria than the traditional imbalanced data resampling techniques.
Finally, the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework to produce new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, new methods were developed within the new framework. Finally, the newly developed and accepted predictive models help detect adverse health events, namely, visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.

    Sensing Depression

    The hallmark indicator of depressive disorders is the presence of sad, empty, or irritable mood, accompanied by somatic and cognitive changes that significantly affect the individual's capacity to function. The overall goal of our project is to provide a tool for doctors to effortlessly detect depression, and in effect achieve greater coverage in detecting depression over the general population. We use machine learning techniques to create a mobile application that infers a smartphone user's severity of depression from data scraped off their phone and social media websites. Through our study, we have demonstrated the feasibility of this approach to diagnosing depression, achieving an average test-set RMSE of 5.67 across all modalities in the task of PHQ-9 score prediction.
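For readers unfamiliar with the metric quoted in this abstract, the standard definition of root-mean-square error is given below. The symbols are assumptions introduced here, not taken from the paper: $y_i$ is the true PHQ-9 score of test participant $i$, $\hat{y}_i$ the predicted score, and $n$ the number of test participants.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}
```

PHQ-9 totals range from 0 to 27 (nine items scored 0 to 3), so the reported value of 5.67 is expressed in points on that scale.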

    Microarray Data Mining and Gene Regulatory Network Analysis

    The novel molecular biological technology, microarray, makes it feasible to obtain quantitative measurements of the expression of thousands of genes present in a biological sample simultaneously. Genome-wide expression data generated by this technology hold promise for uncovering implicit, previously unknown biological knowledge. In this study, several problems in microarray data mining were investigated, including feature (gene) selection, classifier gene identification, generation of a reference genetic interaction network for non-model organisms, and gene regulatory network (GRN) reconstruction using time-series gene expression data. Most existing computational models employed to infer gene regulatory networks suffer from either low accuracy or high computational complexity. To overcome such limitations, the following strategies were proposed to integrate bioinformatics data mining techniques with existing GRN inference algorithms, which enables the discovery of novel biological knowledge. An integrated statistical and machine learning (ISML) pipeline was developed for feature selection and classifier gene identification to address the curse of dimensionality as well as the huge search space. Using the selected classifier genes as seeds, a scale-up technique is applied to search through major databases of genetic interaction networks, metabolic pathways, etc. By curating relevant genes and blasting genomic sequences of non-model organisms against well-studied genetic model organisms, a reference gene regulatory network for less-studied organisms was built and used both as prior knowledge and as model validation for GRN reconstructions. Networks of gene interactions were inferred using a Dynamic Bayesian Network (DBN) approach and were analyzed to elucidate the dynamics caused by perturbations.
Our proposed pipelines were applied to investigate molecular mechanisms for chemical-induced reversible neurotoxicity.

    Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data

    BACKGROUND: Molecular data, e.g. arising from microarray technology, is often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions. RESULTS: We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is interactions composed of variables that do not represent main effects, but our findings are also promising in this regard. Results on real world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. CONCLUSION: Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.