2,386 research outputs found

    Identifying the most informative features using a structurally interacting elastic net

    Feature selection can efficiently identify the most informative features with respect to the target feature used in training. However, state-of-the-art vector-based methods are unable to encapsulate the relationships between feature samples into the feature selection process, thus leading to significant information loss. To address this problem, we propose a new graph-based structurally interacting elastic net method for feature selection. Specifically, we commence by constructing feature graphs that can incorporate pairwise relationships between samples. With the feature graphs to hand, we propose a new information theoretic criterion to measure the joint relevance of different pairwise feature combinations with respect to the target feature graph representation. This measure is used to obtain a structural interaction matrix where the elements represent the proposed information theoretic measure between feature pairs. We then formulate a new optimization model through the combination of the structural interaction matrix and an elastic net regression model for the feature subset selection problem. This allows us to (a) preserve the information of the original vectorial space, (b) remedy the information loss of the original feature space caused by using graph representation, and (c) promote a sparse solution and also encourage correlated features to be selected. Because the proposed optimization problem is non-convex, we develop an efficient alternating direction method of multipliers (ADMM) to locate the optimal solutions. Extensive experiments on various datasets demonstrate the effectiveness of the proposed method.

    Kinematic Flexibility Analysis: Hydrogen Bonding Patterns Impart a Spatial Hierarchy of Protein Motion

    Elastic network models (ENM) and constraint-based, topological rigidity analysis are two distinct, coarse-grained approaches to study conformational flexibility of macromolecules. In the two decades since their introduction, both have contributed significantly to insights into protein molecular mechanisms and function. However, despite a shared purpose of these approaches, the topological nature of rigidity analysis, and thereby the absence of motion modes, has impeded a direct comparison. Here, we present an alternative, kinematic approach to rigidity analysis, which circumvents these drawbacks. We introduce a novel protein hydrogen bond network spectral decomposition, which provides an orthonormal basis for collective motions modulated by non-covalent interactions, analogous to the eigenspectrum of normal modes, and decomposes proteins into rigid clusters identical to those from topological rigidity. Our kinematic flexibility analysis bridges topological rigidity theory and ENM, and enables a detailed analysis of motion modes obtained from both approaches. Our analysis reveals that collectivity of protein motions, reported by the Shannon entropy, is significantly lower for rigidity theory versus normal mode approaches. Strikingly, kinematic flexibility analysis suggests that the hydrogen bonding network encodes a protein-fold specific, spatial hierarchy of motions, which goes nearly undetected in ENM. This hierarchy reveals distinct motion regimes that rationalize protein stiffness changes observed from experiment and molecular dynamics simulations. A formal expression for changes in free energy derived from the spectral decomposition indicates that motions across nearly 40% of modes obey enthalpy-entropy compensation. Taken together, our analysis suggests that hydrogen bond networks have evolved to modulate protein structure and dynamics
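The abstract reports collectivity via the Shannon entropy but does not give the formula. A minimal sketch, assuming the commonly used definition in which collectivity is the exponential of the entropy of the normalised squared atomic displacements divided by the number of atoms, could look like this. The function name and the toy modes are hypothetical.

```python
import math

def mode_collectivity(displacements):
    """Collectivity of one motion mode from per-atom displacement vectors.

    Returns exp(S) / N, where S is the Shannon entropy of the normalised
    squared displacement magnitudes. The value lies in [1/N, 1].
    """
    squared = [sum(c * c for c in vec) for vec in displacements]
    total = sum(squared)
    if total == 0.0:
        raise ValueError("mode has zero amplitude")
    probs = [s / total for s in squared]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy) / len(probs)

# Toy modes (assumptions): every atom moves equally vs. a single atom moves.
uniform_mode = [(1.0, 0.0, 0.0)] * 4
local_mode = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

uniform_value = mode_collectivity(uniform_mode)
local_value = mode_collectivity(local_mode)
```

Under this definition a mode in which all atoms move equally scores 1, and a mode confined to one atom scores 1/N, so lower values indicate more localised motion.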

    Coarse-grained models for self-assembling systems

    In recent years, a considerable amount of work has been devoted to understanding, and hence harnessing, the physical principles that underpin the general properties of self-assembling systems. In particular, theoretical and computational modelling have been used extensively to obtain a detailed description of the assembly process. This thesis reports on computational work focusing on two different self-assembling systems, studied from two distinct perspectives. In the first part, a computational study of the self-assembly of string-like rigid templates in solution explores to what extent the assembly of the templates into knotted or linked structures can be directed by suitably tuning geometrical parameters of the system. The second part is devoted to some of the smallest instances of molecular self-assembly in nature, namely viral capsids. We report on the development of a physics-based algorithm that subdivides the structure of a capsid into quasi-rigid units, helping to elucidate the assembly pathway by identifying its building blocks with a top-down approach.

    Data- and expert-driven feature selection for predictive models in healthcare: towards increased interpretability in underdetermined machine learning problems

    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e. the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating results in a meta-model. 
This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capability to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, such as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to positive effects on the use of human, technical, and financial resources if systematically integrated into the planning of patient treatment by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection. 
Experimental evaluations demonstrate favorable properties in terms of predictive performance, stability, and interpretability of results, which carry high potential for better data-driven decision support in clinical practice.
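The ensemble idea described in this abstract (repeat feature selection on subsets of the samples, then aggregate) can be sketched as below. This is not RENT or UBayFS: the abstract does not state their selection criteria, so the selection-frequency cutoff, the proximal-gradient elastic net base learner, all parameter values, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def elastic_net_ista(X, y, lam1, lam2, n_iter=300):
    """Proximal-gradient elastic net: 0.5/n*||Xw-y||^2 + lam1*||w||_1 + 0.5*lam2*||w||^2."""
    n, p = X.shape
    # Step size from the Lipschitz constant of the smooth part of the objective.
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + lam2)
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n + lam2 * w
        v = w - step * grad
        w = np.sign(v) * np.maximum(np.abs(v) - step * lam1, 0.0)  # soft-threshold
    return w

def ensemble_select(X, y, n_models=50, subsample=0.8, cutoff=0.9,
                    lam1=0.5, lam2=0.1, seed=0):
    """Return features with a non-zero weight in at least `cutoff` of the models."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_models):
        # Fit each model on a random subset of the samples (without replacement).
        rows = rng.choice(n, size=int(subsample * n), replace=False)
        w = elastic_net_ista(X[rows], y[rows], lam1, lam2)
        counts += (w != 0)
    frequency = counts / n_models
    return np.flatnonzero(frequency >= cutoff), frequency

# Underdetermined toy problem (an assumption): 40 samples, 60 features, 3 informative.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 60))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + 3.5 * X[:, 2] + 0.1 * rng.standard_normal(40)

selected, frequency = ensemble_select(X, y)
```

Aggregating over many subsamples is what targets the stability requirement mentioned in the abstract: a feature that is picked only for a particular subset of patients gets a low selection frequency and falls below the cutoff.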

    Computational Approaches in Molecular and Systems Pharmacology: Application to Neurosignaling Membrane Proteins

    Computer-aided drug discovery methods have played a major role in the development of therapeutically important molecules for decades, and more advanced and effective methods have been introduced in recent years. Those methods are generally classified as either molecular pharmacology methods or quantitative systems pharmacology methods. In this thesis, with regard to molecular pharmacology computations, we assess the druggability of the N-terminal domains (NTDs) of ionotropic glutamate receptors (iGluRs) using molecular dynamics (MD) simulations. The simulations are performed in the presence of probe molecules that contain fragments shared by drug-like molecules. iGluRs are ligand-gated ion channels that mediate excitatory neurotransmission events in the central nervous system. Alterations in those receptors, especially in AMPA receptors (AMPARs) and NMDA receptors (NMDARs), are responsible for many neurological diseases such as Huntington’s disease and Parkinson’s disease. Our study provides insights into the ligand-binding landscape of iGluR NTD dimers and monomers. Moreover, we build PMs for AMPARs and NMDARs, which are then used in a virtual screening scheme to identify lead compounds. Our quantitative systems pharmacology studies focus on drug repurposing based on computational analysis of known drug-target interactions. We use the probabilistic matrix factorization (PMF) method for this purpose, which is particularly useful for analyzing large interaction networks. Our method is shown to outperform those recently introduced for identifying new drug-target associations. Finally, we integrate the results from our druggability simulations and PMF calculations by comparing the drug candidates predicted to bind AMPARs or NMDARs by either of those methods. In addition, we analyze the structure and dynamics of sodium-coupled neurotransmitter transporters that share the leucine transporter (LeuT) fold. 
We explore how the collective motions predicted for LeuT using the elastic network models agree with the structural changes experimentally observed upon ligand binding
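To make "collective motions predicted using elastic network models" concrete, here is a minimal sketch of one common ENM variant, the Gaussian network model. The abstract does not say which variant or parameters were used for LeuT, so the model choice, the 7.0 distance cutoff, the function name, and the toy coordinates are all assumptions.

```python
import numpy as np

def gnm_modes(coords, cutoff=7.0):
    """Gaussian network model: eigen-decompose the Kirchhoff (connectivity) matrix."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise distances between all nodes.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Nodes closer than the cutoff are joined by a spring (self-contacts excluded).
    contact = (dists <= cutoff) & ~np.eye(len(coords), dtype=bool)
    kirchhoff = -contact.astype(float)
    np.fill_diagonal(kirchhoff, contact.sum(axis=1))
    eigvals, eigvecs = np.linalg.eigh(kirchhoff)
    # The first eigenvalue is ~0 (trivial mode) for a connected network; skip it.
    return eigvals[1:], eigvecs[:, 1:]

# Toy "structure" (an assumption): 10 nodes on a straight line, 3.8 apart.
coords = np.column_stack([3.8 * np.arange(10), np.zeros(10), np.zeros(10)])
eigvals, modes = gnm_modes(coords)
slowest_mode = modes[:, 0]  # lowest-frequency, most collective mode
```

In practice the predicted low-frequency modes would be compared with the displacement between two experimental structures, for example by an overlap (normalised dot product) measure, which is the kind of comparison the abstract describes.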

    Network modeling of the transcriptional effects of copy number aberrations in glioblastoma

    DNA copy number aberrations (CNAs) are a characteristic feature of cancer genomes. In this work, Rebecka Jörnsten, Sven Nelander and colleagues combine network modeling and experimental methods to analyze the systems-level effects of CNAs in glioblastoma