213 research outputs found

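    Data- og ekspertdreven variabelseleksjon for prediktive modeller i helsevesenet : mot økt tolkbarhet i underbestemte maskinlæringsproblemer [Data- and expert-driven feature selection for predictive models in healthcare: towards increased interpretability in underdetermined machine learning problems]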

    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e. the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating results in a meta-model. 
This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capability to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, such as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to positive effects on the use of human, technical, and financial resources if systematically integrated into the planning of patient treatment by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection.
Experimental evaluations demonstrate favorable properties in terms of predictive performance, stability, and interpretability of results, which carries a high potential for better data-driven decision support in clinical practice.
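
The ensemble idea summarised in this abstract (repeat a feature selector on random subsets of the samples, then aggregate the results) can be illustrated with a minimal sketch. This is not RENT or UBayFS: a simple correlation filter stands in for the elastic net models, and the toy data, the function names, and the parameters `k`, `n_runs`, `sample_frac`, and `threshold` are all assumptions made for illustration.

```python
import random
from statistics import mean


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def select_top_k(X, y, k):
    """Stand-in elementary selector: the k features most correlated with y."""
    n_features = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def ensemble_select(X, y, k=2, n_runs=50, sample_frac=0.8, threshold=0.6, seed=0):
    """Run the selector on random sample subsets; keep frequently chosen features."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    counts = [0] * n_features
    subset_size = max(2, int(sample_frac * n_samples))
    for _ in range(n_runs):
        idx = rng.sample(range(n_samples), subset_size)
        chosen = select_top_k([X[i] for i in idx], [y[i] for i in idx], k)
        for j in chosen:
            counts[j] += 1
    # Aggregate: selection frequency per feature across all runs.
    freqs = [c / n_runs for c in counts]
    selected = [j for j, f in enumerate(freqs) if f >= threshold]
    return selected, freqs


def make_toy_data(n_samples=40, n_features=8, seed=1):
    """Hypothetical data: only features 0 and 1 influence the target."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [row[0] + 0.8 * row[1] + rng.gauss(0, 0.3) for row in X]
    return X, y


X, y = make_toy_data()
selected, freqs = ensemble_select(X, y)
print("selection frequencies:", [round(f, 2) for f in freqs])
print("selected features:", selected)
```

The selection frequencies also give a rough view of stability: a feature chosen in nearly every run is robust to small changes in the sample, which is the property the abstract ties to interpretability.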

    Uncovering the potential role of oxidative stress in the development of periodontitis and establishing a stable diagnostic model via combining single-cell and machine learning analysis

    Background: Periodontitis is the primary pathogenic cause of tooth loss in adults, yet few reliable diagnostic methods are available in its early stages. The equilibrium between inflammatory defense mechanisms and oxidative stress has previously been regarded as one pathological factor that defines periodontitis. It is therefore necessary to construct a model of oxidative stress-related diagnostic markers for periodontitis through machine learning and bioinformatic analysis. Methods: We used LASSO, SVM-RFE, and Random Forest techniques to screen for periodontitis-related oxidative stress variables and constructed a diagnostic model by logistic regression. We then built a protein-protein interaction (PPI) network based on the modelled genes. Unsupervised clustering analysis was performed to screen for oxidative stress subtypes of periodontitis, and WGCNA was used to explore the pathways correlated with oxidative stress in periodontitis patients. Finally, we used single-cell data to screen the cellular subpopulations with the highest correlation by scoring oxidative stress genes and performed a pseudotime analysis of the subpopulations. Results: We discovered three periodontitis-associated genes (CASP3, IL-1β, and TXN). A nomogram based on these genes can be helpful for patients. A PPI network was constructed to screen for the primary hub genes, and immune-related and cellular metabolism-related pathways were significantly enriched. Consensus clustering analysis found two oxidative stress subtypes, with the C2 subtype showing higher immune cell infiltration and immune function scores. We therefore hypothesized that high expression of oxidative stress genes is correlated with the formation of the immune environment in patients with periodontitis. Using the WGCNA approach, we examined the co-expressed gene modules related to the oxidative stress subtypes. Finally, we selected monocytes for pseudotime analysis and analyzed the expression changes of oxidative stress genes along the pseudotime axis, in which the expression of JUN, TXN, and IL-1β differed with the change of cell state. Conclusion: This study identifies a diagnostic model based on three oxidative stress-related genes (OSRGs) from which patients can benefit and explores the importance of oxidative stress genes in building an immune environment in patients with periodontitis

    Blood Pressure Estimation from Speech Recordings: Exploring the Role of Voice-over Artists

    Hypertension, a prevalent global health concern, is associated with cardiovascular diseases and significant morbidity and mortality. Accurate and prompt Blood Pressure monitoring is crucial for early detection and successful management. Traditional cuff-based methods can be inconvenient, leading to the exploration of non-invasive and continuous estimation methods. This research aims to bridge the gap between speech processing and health monitoring by investigating the relationship between speech recordings and Blood Pressure estimation. Speech recordings offer promise for non-invasive Blood Pressure estimation due to the potential link between vocal characteristics and physiological responses. In this study, we focus on the role of Voice-over Artists, known for their ability to convey emotions through voice. By exploring the expertise of Voice-over Artists in controlling speech and expressing emotions, we seek valuable insights into the potential correlation between speech characteristics and Blood Pressure. This research presents an innovative and convenient approach to health assessment. By unraveling the specific role of Voice-over Artists in this process, the study lays the foundation for future advancements in healthcare and human-robot interactions. Through the exploration of speech characteristics and emotional expression, this investigation offers valuable insights into the correlation between vocal features and Blood Pressure levels. By leveraging the expertise of Voice-over Artists in conveying emotions through voice, this study enriches our understanding of the intricate relationship between speech recordings and physiological responses, opening new avenues for the integration of voice-related factors in healthcare technologies

    Applicability domains of neural networks for toxicity prediction

    In this paper, the term "applicability domain" refers to the range of chemical compounds for which a statistical quantitative structure-activity relationship (QSAR) model can accurately predict toxicity. This is a crucial concept in the development and practical use of these models. First, a multidisciplinary review is provided regarding the theory and practice of applicability domains in the context of toxicity problems using the classical QSAR model. Then, the advantages and improved performance of neural networks (NNs), which are the most promising machine learning algorithms, are reviewed. Within the domain of medicinal chemistry, nine different methods using NNs for toxicity prediction were compared with 29 alternative artificial intelligence (AI) techniques. Similarly, seven NN-based toxicity prediction methodologies were compared with six other AI techniques within the realm of food safety, 11 NN-based methodologies were compared with 16 different AI approaches in the environmental sciences category, and four specific NN-based toxicity prediction methodologies were compared with nine alternative AI techniques in the field of industrial hygiene. Within the reviewed approaches, given known toxic compound descriptors and behaviors, we observed a difficulty in extrapolating to and predicting the effects of untested chemical compounds. Different methods can be used for unsupervised clustering, such as distance-based approaches and consensus-based decision methods. Additionally, the importance of model validation has been highlighted within a regulatory context according to the Organization for Economic Co-operation and Development (OECD) principles: to predict the toxicity of potential new drugs in medicinal chemistry, to determine the limits of detection for harmful substances in food, to predict the toxicity limits of chemicals in the environment, and to predict the exposure limits to harmful substances in the workplace.
Despite its importance, a thorough application of toxicity models is still restricted in the field of medicinal chemistry and is virtually overlooked in other scientific domains. Consequently, only a small proportion of the toxicity studies conducted in medicinal chemistry consider the applicability domain in their mathematical models, thereby limiting their predictive power to untested drugs. Conversely, the applicability of these models is crucial; however, this has not been sufficiently assessed in toxicity prediction or in other related areas such as food science, environmental science, and industrial hygiene. Thus, this review sheds light on the prevalent use of Neural Networks in toxicity prediction, thereby serving as a valuable resource for researchers and practitioners across these multifaceted domains that could be extended to other fields in future research
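
To make the notion of an applicability domain concrete, the sketch below shows one common distance-based convention: a query compound counts as inside the domain when its mean distance to the k nearest training compounds does not exceed a threshold derived from the training set itself. This is an illustrative assumption rather than a method taken from the reviewed paper; the two-dimensional descriptor vectors, the function names, and the `k` and `quantile` settings are all hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_knn_distance(point, others, k):
    """Mean distance from `point` to its k nearest neighbours in `others`."""
    dists = sorted(euclidean(point, o) for o in others)
    return sum(dists[:k]) / k


def fit_domain_threshold(train, k=3, quantile=0.95):
    """Threshold: a quantile of each training point's mean k-NN distance
    to the remaining training points (leave-one-out)."""
    scores = sorted(
        mean_knn_distance(p, train[:i] + train[i + 1:], k)
        for i, p in enumerate(train)
    )
    pos = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[pos]


def in_domain(query, train, threshold, k=3):
    """True if the query lies within the distance-based applicability domain."""
    return mean_knn_distance(query, train, k) <= threshold


# Hypothetical descriptor vectors for six training compounds.
train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.15, 0.25), (0.25, 0.05)]
threshold = fit_domain_threshold(train)

print(in_domain((0.2, 0.2), train, threshold))  # close to the training compounds
print(in_domain((5.0, 5.0), train, threshold))  # far from every training compound
```

A prediction for a compound flagged as outside the domain would be treated as an extrapolation and reported with corresponding caution.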

    Experience-dependent plasticity in cortical and cerebellar regions of early- and late-trained musicians

    A body of current evidence suggests that there is a sensitive period for musical training: people who begin training before the age of seven show better performance on tests of musical skill, and also show differences in brain structure – especially in motor cortical and cerebellar regions – compared with those who start later. In two studies, we investigated distributed patterns of structural differences between early-trained (ET) and late-trained (LT) musicians. First, we examined structural covariation between cerebellar volume and cortical thickness (CT) in sensorimotor regions in ET and LT musicians and non-musicians (NMs). We found that early musical training had a specific effect on structural covariance between the cerebellum and cortex: NMs showed negative correlations between left lobule VI and right pre-supplementary motor area (preSMA) and premotor cortex (PMC), but this relationship was reduced in ET musicians. ETs instead showed a significant negative correlation between vermal IV and right pre-SMA and dPMC. In the second study, we used support vector machine models – a subtype of supervised machine learning – to investigate cortico-cerebellar structural covariation and to better understand the age boundaries of the sensitive period for early musicianship. Our model identified a combination of 17 regions, including 9 cerebellar and 8 sensorimotor regions, that accurately identified ET and LT musicians with high sensitivity and specificity. Critically, this model – which defined ET musicians as those who began their training before the age of 7 – outperformed all other models in which age of start was earlier or later (between ages 5-10). 
Our model’s ability to accurately classify ET and LT musicians provides additional evidence that musical training before age 7 affects cortico-cerebellar structure in adulthood, and is consistent with the hypothesis that connected brain regions interact during development to reciprocally influence brain and behavioural maturation. Together, these results suggest that early musical training has differential impacts on the maturation of cortico-cerebellar networks important for optimizing sensorimotor performance. This work enriches our understanding of how experience-dependent plasticity is affected by early musical training, providing a more nuanced understanding of the interrelated nature of brain development

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)


    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, the thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine-learning methods, data preprocessing techniques, models’ training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, this thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. This thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that building predictive models with the methods guided by our new framework (Octopus) yields domain experts' approval of the new reliable models’ performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performances in line with experts' success criteria than the traditional imbalanced data resampling techniques.
Finally, the use of the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework to produce new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, the newly developed methods within the new framework. Finally, the newly accepted developed predictive models help detect adverse health events, namely, visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining

    Advancing oral delivery of biologics: machine learning predicts peptide stability in the gastrointestinal tract

    The oral delivery of peptide therapeutics could facilitate precision treatment of numerous gastrointestinal (GI) and systemic diseases with simple administration for patients. However, the vast majority of licensed peptide drugs are currently administered parenterally due to prohibitive peptide instability in the GI tract. As such, the development of GI-stable peptides is receiving considerable investment. This study provides researchers with the first tool to predict the GI stability of peptide therapeutics based solely on the amino acid sequence. Both unsupervised and supervised machine learning techniques were trained on literature-extracted data describing peptide stability in simulated gastric and small intestinal fluid (SGF and SIF). Based on 109 peptide incubations, classification models for SGF and SIF were developed. The best models utilized k-Nearest Neighbor (for SGF) and XGBoost (for SIF) algorithms, with accuracies of 75.1% (SGF) and 69.3% (SIF), and f1 scores of 84.5% (SGF) and 73.4% (SIF) under 5-fold cross-validation. Feature importance analysis demonstrated that peptides’ lipophilicity, rigidity, and size were key determinants of stability. These models are now available to those working on the development of oral peptide therapeutics
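
The evaluation protocol named in this abstract (a k-Nearest Neighbor classifier scored by accuracy and F1 under 5-fold cross-validation) can be sketched as follows. This is a minimal illustration and not the study's pipeline: the two descriptors, the synthetic data, the choice of `k=3`, and the pooling of held-out predictions across folds are assumptions, and the resulting scores have no relation to the figures reported above.

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (binary labels 0/1)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = sum(label for _, label in dists[:k])
    return 1 if 2 * votes > k else 0


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


def cross_validate(X, y, n_folds=5, k=3, seed=0):
    """k-fold cross-validation; held-out predictions are pooled before scoring."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    y_true, y_pred = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in fold:
            y_true.append(y[i])
            y_pred.append(knn_predict(train_X, train_y, X[i], k))
    return accuracy_and_f1(y_true, y_pred)


# Synthetic stand-in data: two hypothetical descriptors per peptide,
# label 1 = "stable", label 0 = "unstable".
rng = random.Random(42)
X = [[rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)] for _ in range(50)]
X += [[rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)] for _ in range(50)]
y = [1] * 50 + [0] * 50

acc, f1 = cross_validate(X, y)
print(f"accuracy={acc:.3f}  F1={f1:.3f}")
```

Reporting both metrics matters when classes are imbalanced, since accuracy alone can look favorable for a model that mostly predicts the majority class.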