10 research outputs found

    L2-norm multiple kernel learning and its application to biomedical data fusion

    Abstract
    Background: This paper introduces the notion of optimizing different norms in the dual problem of support vector machines with multiple kernels. The selection of norms yields different extensions of multiple kernel learning (MKL), such as L∞, L1, and L2 MKL. In particular, L2 MKL is a novel method that leads to non-sparse optimal kernel coefficients, in contrast to the sparse kernel coefficients optimized by the existing L∞ MKL method. In real biomedical applications, L2 MKL may have advantages over sparse integration methods for thoroughly combining complementary information in heterogeneous data sources.
    Results: We provide a theoretical analysis of the relationship between the L2 optimization of kernels in the dual problem and the L2 coefficient regularization in the primal problem. Understanding the dual L2 problem grants a unified view of MKL and enables us to extend the L2 method to a wide range of machine learning problems. We implement L2 MKL for ranking and classification problems and compare its performance with the sparse L∞ and the averaging L1 MKL methods. The experiments are carried out on six real biomedical data sets and two large-scale UCI data sets. L2 MKL yields better performance on most of the benchmark data sets. In particular, we propose a novel L2 MKL least squares support vector machine (LSSVM) algorithm, which is shown to be an efficient and promising classifier for processing large-scale data sets.
    Conclusions: This paper extends the statistical framework of genomic data fusion based on MKL. Allowing non-sparse weights on the data sources is an attractive option in settings where we believe most data sources to be relevant to the problem at hand and want to avoid the "winner-takes-all" effect seen in L∞ MKL, which can be detrimental to performance in prospective studies. The notion of optimizing L2 kernels can be straightforwardly extended to ranking, classification, regression, and clustering algorithms. To tackle the computational burden of MKL, this paper proposes several novel LSSVM-based MKL algorithms. Systematic comparison on real data sets shows that LSSVM MKL has performance comparable to the conventional SVM MKL algorithms. Moreover, large-scale numerical experiments indicate that, when cast as semi-infinite programming, LSSVM MKL can be solved more efficiently than SVM MKL.
    Availability: The MATLAB code of the algorithms implemented in this paper can be downloaded from http://homes.esat.kuleuven.be/~sistawww/bioi/syu/l2lssvm.html.
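    As an illustration of the non-sparse weighting idea, the following is a minimal Python sketch of an L2-MKL LSSVM, assuming an alternating scheme: solve the LSSVM linear system for fixed kernel weights, then update the weights in proportion to each kernel's contribution and renormalize to unit L2 norm. The update rule and the names mu and gamma are illustrative, not the paper's exact semi-infinite programming formulation.

```python
import numpy as np

def lssvm_solve(K, y, gamma=1.0):
    """Solve the LSSVM dual linear system for a fixed combined kernel."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0               # top row: sum of dual variables equals zero
    A[1:, 0] = 1.0               # first column: bias term
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]       # bias b, dual variables alpha

def l2_mkl_lssvm(kernels, y, gamma=1.0, iters=20):
    """Alternate between the LSSVM solve and a non-sparse L2 weight update."""
    p = len(kernels)
    mu = np.full(p, 1.0 / np.sqrt(p))        # uniform start, unit L2 norm
    for _ in range(iters):
        K = sum(m * Kk for m, Kk in zip(mu, kernels))
        b, alpha = lssvm_solve(K, y, gamma)
        # each kernel's contribution alpha^T K_k alpha drives its weight;
        # L2 normalization keeps every weight strictly positive
        s = np.array([alpha @ Kk @ alpha for Kk in kernels])
        mu = s / np.linalg.norm(s)
    return mu, alpha, b
```

    Because the L2 normalization keeps every source's weight strictly positive, no data source is discarded outright, which is what avoids the winner-takes-all behavior of the sparse L∞ variant.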

    Dynamic ensemble selection methods for heterogeneous data mining

    Big data is often collected from multiple sources with possibly different features, representations, and granularity, and is hence defined as heterogeneous data. Such multiple datasets need to be fused in some way for further analysis. Data fusion at the feature level requires domain knowledge and can be time-consuming and ineffective, but it can be avoided if decision-level fusion is applied properly. Ensemble methods are an appropriate paradigm for doing just that, as each subset of heterogeneous data sources can be used separately to induce models independently, whose decisions are then aggregated by a decision fusion function in an ensemble. This study investigates how heterogeneous data can be used to generate more diverse classifiers and thereby build more accurate ensembles. A Dynamic Ensemble Selection Optimisation (DESO) framework is proposed, using the local feature space of heterogeneous data to increase diversity among classifiers and Simulated Annealing for optimisation. An implementation example of DESO, BaggingDES, is provided with Bagging as the base platform, to test its performance and to explore the relationship between diversity and accuracy. Experiments are carried out with heterogeneous datasets derived from real-world benchmark datasets. The statistical analyses of the results show that BaggingDES performed significantly better than the baseline method, a decision tree, and reasonably better than classic Bagging.
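    To make the decision-level fusion concrete, here is a minimal Python sketch of local-accuracy dynamic ensemble selection over a bagged pool: for each query point, the classifiers that are most accurate on its nearest validation neighbors are selected and combined by majority vote. The DESO framework additionally optimizes the selection with Simulated Annealing, which is omitted here; the names train_pool and des_predict are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def train_pool(X, y, n_estimators=25, seed=0):
    """Train a bagged pool of decision trees on bootstrap samples."""
    rng = np.random.RandomState(seed)
    pool = []
    for _ in range(n_estimators):
        Xb, yb = resample(X, y, random_state=rng)
        pool.append(DecisionTreeClassifier(random_state=rng).fit(Xb, yb))
    return pool

def des_predict(pool, X_val, y_val, x, k=7, n_select=5):
    """Dynamically select the locally best classifiers for query x."""
    # local region: the k validation points nearest to the query
    idx = np.argsort(np.linalg.norm(X_val - x, axis=1))[:k]
    # rank pool members by accuracy on that region and keep the best few
    acc = [np.mean(clf.predict(X_val[idx]) == y_val[idx]) for clf in pool]
    chosen = np.argsort(acc)[-n_select:]
    votes = [pool[i].predict(x.reshape(1, -1))[0] for i in chosen]
    return max(set(votes), key=votes.count)   # majority vote
```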

    A Clustering Algorithm Based on an Ensemble of Dissimilarities: An Application in the Bioinformatics Domain

    Clustering algorithms such as k-means depend heavily on choosing an appropriate distance metric that accurately reflects object proximities. A wide range of dissimilarities may be defined, often leading to different clustering results. Choosing the best dissimilarity is an ill-posed problem, and learning a general distance from the data is a complex task, particularly for high-dimensional problems. An appealing approach is therefore to learn an ensemble of dissimilarities. In this paper, we have developed a semi-supervised clustering algorithm that learns a linear combination of dissimilarities, considering incomplete knowledge in the form of pairwise constraints. The minimization of the loss function is based on a robust and efficient quadratic optimization algorithm. In addition, a regularization term controls the complexity of the learned distance metric, avoiding overfitting. The algorithm has been applied to the identification of tumor samples from gene expression profiles, where domain experts often provide incomplete knowledge in the form of pairwise constraints. We report that the proposed algorithm outperforms a standard semi-supervised clustering technique from the literature, as well as clustering results based on a single dissimilarity. The improvement is particularly relevant for applications with a high level of noise.
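    The following is a minimal sketch of the central step, learning a combination of dissimilarities from must-link and cannot-link constraints with an L2 penalty. The projected-gradient loss below is an illustrative stand-in for the paper's robust quadratic optimization, not its exact formulation.

```python
import numpy as np

def learn_weights(D_list, must_link, cannot_link, lam=0.1, lr=0.05, iters=500):
    """D_list: list of (n, n) dissimilarity matrices; constraints are index pairs."""
    p = len(D_list)
    w = np.full(p, 1.0 / p)
    for _ in range(iters):
        grad = lam * w                         # L2 regularization of the weights
        for i, j in must_link:                 # combined distance should shrink
            grad += np.array([D[i, j] for D in D_list])
        for i, j in cannot_link:               # combined distance should grow
            grad -= np.array([D[i, j] for D in D_list])
        w = np.clip(w - lr * grad, 0.0, None)  # projected gradient step
        total = w.sum()
        w = w / total if total > 0 else np.full(p, 1.0 / p)
    return w                                   # convex combination weights
```

    The combined matrix sum_k w_k D_k can then be passed to any clustering routine that accepts precomputed dissimilarities, such as k-medoids or average-linkage hierarchical clustering.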

    ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples

    Abstract
    Background: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, identifying the disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates by exploiting the wealth of information available about the genes in various databases.
    Results: We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.
    Conclusions: ProDiGe implements a new machine learning paradigm for gene prioritization, which could help in the identification of new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.
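    As a rough illustration of the positive-and-unlabeled idea, the sketch below treats unlabeled genes as down-weighted negatives in a standard kernel classifier and ranks all genes by decision score. This is a generic biased-classifier stand-in, not ProDiGe's multitask kernel machinery; all names and weight values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def pu_rank(X, positive_idx, c_pos=10.0, c_unl=1.0):
    """Rank genes by disease-likeness from positives and unlabeled examples."""
    n = X.shape[0]
    y = -np.ones(n)
    y[positive_idx] = 1.0                     # known disease genes
    # penalize errors on the (trusted) positives more than on unlabeled genes
    weights = np.where(y > 0, c_pos, c_unl)
    clf = SVC(kernel="rbf", C=1.0).fit(X, y, sample_weight=weights)
    scores = clf.decision_function(X)         # higher = more disease-like
    return np.argsort(-scores)                # gene indices, best candidates first
```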

    Multiple kernel learning with random effects for predicting longitudinal outcomes and data integration

    Predicting disease risk and progression is one of the main goals in many clinical research studies. Cohort studies on the natural history and etiology of chronic diseases span years, and data are collected at multiple visits. Although kernel-based statistical learning methods have proven powerful for a wide range of disease prediction problems, they have been well studied only for independent data, not for longitudinal data. It is thus important to develop time-sensitive prediction rules that make use of the longitudinal nature of the data. In this paper, we develop a novel statistical learning method for longitudinal data by introducing subject-specific short-term and long-term latent effects through a designed kernel that accounts for within-subject correlation of longitudinal measurements. Since the presence of multiple sources of data is increasingly common, we embed our method in a multiple kernel learning framework and propose regularized multiple kernel statistical learning with random effects to construct effective nonparametric prediction rules. Our method allows easy integration of various heterogeneous data sources and takes advantage of correlation among longitudinal measures to increase prediction power. We use a different kernel for each data source, taking advantage of the distinctive features of each data modality, and then optimally combine the data across modalities. We apply the developed methods to two large epidemiological studies, one on Huntington's disease and the other on Alzheimer's disease (the Alzheimer's Disease Neuroimaging Initiative, ADNI), where we explore a unique opportunity to combine imaging and genetic data to predict mild cognitive impairment, and we show a substantial gain in performance while accounting for the longitudinal aspect of the data.
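    A minimal sketch of the kernel-design idea, under stated assumptions: a block kernel that is constant for measurements of the same subject captures a long-term latent effect, while a within-subject component that decays with the gap between visits captures a short-term effect. The variance and length-scale parameters below are illustrative; the paper's designed kernel and its MKL embedding are more elaborate.

```python
import numpy as np

def longitudinal_kernel(K_feature, subject_ids, times,
                        sigma_long=1.0, sigma_short=1.0, ell=1.0):
    """Add subject-specific long- and short-term random-effect terms to a kernel."""
    s = np.asarray(subject_ids)
    t = np.asarray(times, dtype=float)
    same = (s[:, None] == s[None, :]).astype(float)  # 1 iff same subject
    K_long = sigma_long ** 2 * same                  # constant within subject
    # short-term effect decays with the time gap between a subject's visits
    K_short = sigma_short ** 2 * same * np.exp(-np.abs(t[:, None] - t[None, :]) / ell)
    return K_feature + K_long + K_short
```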

    Interpretable Machine Learning Methods for Prediction and Analysis of Genome Regulation in 3D

    With the development of chromosome conformation capture-based techniques, we now know that chromatin is packed in three-dimensional (3D) space inside the cell nucleus. Changes in the 3D chromatin architecture have already been implicated in diseases such as cancer. A better understanding of this 3D conformation is therefore of interest to help enhance our comprehension of the complex, multipronged regulatory mechanisms of the genome. The work described in this dissertation largely focuses on the development and application of interpretable machine learning methods for the prediction and analysis of long-range genomic interactions output by chromatin interaction experiments.
    In the first part, we demonstrate that the genetic sequence information at genomic loci is predictive of the long-range interactions of a particular locus of interest (LoI). For example, the genetic sequence information at and around enhancers can help predict whether they interact with a promoter region of interest. This is achieved by building string kernel-based support vector classifiers together with two novel, intuitive visualization methods. These models suggest a potential general role of short tandem repeat motifs in 3D genome organization. The insights gained from these models are, however, still coarse-grained. To this end, we devised a machine learning method, called CoMIK (Conformal Multi-Instance Kernels), capable of providing more fine-grained insights. When comparing sequences of variable length in the supervised learning setting, CoMIK can not only identify the features important for classification but also locate them within the sequence. Such precise identification of important segments of the whole sequence can help in gaining de novo insights into any role played by the intervening chromatin in long-range interactions. Although CoMIK primarily uses only genetic sequence information, it can also simultaneously utilize other information modalities, such as the numerous functional genomics data, if available.
    The second part describes our pipeline, pHDee, for easy manipulation of large amounts of 3D genomics data. We used the pipeline to analyze HiChIP experimental data for studying the 3D architectural changes in Ewing sarcoma (EWS), a rare cancer affecting adolescents. In particular, HiChIP data for two experimental conditions, doxycycline-treated and untreated, and for primary tumor samples are analyzed. We demonstrate that pHDee facilitates the processing and easy integration of large amounts of 3D genomics data analysis together with other data-intensive bioinformatics analyses.
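    For readers unfamiliar with string kernels, the sketch below shows a basic k-mer spectrum kernel of the kind underlying string kernel-based support vector classifiers on genomic sequence. CoMIK's conformal multi-instance kernels build on richer variants, so this is only the foundational idea; function names are illustrative.

```python
import numpy as np
from collections import Counter
from itertools import product

def spectrum_features(seqs, k=4):
    """Map DNA sequences to vectors of k-mer counts over the ACGT alphabet."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocab)}
    X = np.zeros((len(seqs), len(vocab)))
    for r, s in enumerate(seqs):
        counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
        for kmer, c in counts.items():
            if kmer in index:                  # skip k-mers with ambiguous bases
                X[r, index[kmer]] = c
    return X

def spectrum_kernel(seqs, k=4):
    """Spectrum kernel: a linear kernel on the k-mer count vectors."""
    X = spectrum_features(seqs, k)
    return X @ X.T
```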

    Kernel-based techniques for texture analysis in biomedical imaging

    [Abstract] In real-world problems it is important to study the relevance of all the variables obtained, so that noise can be removed; this is where variable selection techniques arise. These techniques aim to find the subset of variables that best describes the useful information contained in the data, allowing improved performance. In high-dimensional spaces, kernel-based techniques are of special relevance, as they have demonstrated high efficiency due to their ability to generalize in such spaces. In this work, a new approach for texture analysis in biomedical imaging is proposed based on integration: kernel-based techniques are used with different types of texture data to select the most representative variables, in order to improve the results obtained in classification and the interpretability of the selected variables.
    To validate this proposal, an experimental design was carried out consisting of four phases: 1) data extraction; 2) data pre-processing; 3) learning; and 4) selection of the best model, ensuring the reproducibility of results while making a comparison under equal conditions. Two-dimensional electrophoresis gel images were used.
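    A minimal sketch of the kind of variable selection described above, assuming feature-level concatenation of texture descriptor blocks followed by SVM-based recursive feature elimination. The concatenation step is a simplification of the kernel-based integration developed in the thesis, and the function name is illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

def select_texture_features(feature_blocks, y, n_keep=20):
    """feature_blocks: list of (n_samples, d_k) arrays of texture descriptors."""
    X = np.hstack(feature_blocks)              # simple feature-level fusion
    # recursively drop the variables with the smallest linear-SVM weights
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=n_keep)
    rfe.fit(X, y)
    return rfe.support_, rfe.ranking_          # boolean mask and rank per variable
```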