187 research outputs found

    Advanced Probabilistic Models for Clustering and Projection

    Probabilistic modeling for data mining and machine learning problems is a fundamental research area. The general approach is to assume a generative model underlying the observed data and to estimate the model parameters via likelihood maximization. It rests on probability theory as its mathematical foundation and draws on a large body of methods from statistical learning, sampling theory and Bayesian statistics. In this thesis we study several advanced probabilistic models for data clustering and feature projection, two important unsupervised learning problems. The goal of clustering is to group similar data points together to uncover the data clusters. While numerous methods exist for various clustering tasks, one important question remains: how to automatically determine the number of clusters. The first part of the thesis answers this question from a mixture modeling perspective. A finite mixture model is first introduced for clustering, in which each mixture component is assumed to be an exponential family distribution for generality. The model is then extended to an infinite mixture model, and its strong connection to the Dirichlet process (DP), a non-parametric Bayesian framework, is uncovered. A variational Bayesian algorithm called VBDMA is derived from this new insight to learn the number of clusters automatically, and empirical studies on some 2D data sets and an image data set verify the effectiveness of this algorithm. In feature projection, we are interested in dimensionality reduction and aim to find a low-dimensional feature representation for the data. We first review the well-known principal component analysis (PCA) and its probabilistic interpretation (PPCA), and then generalize PPCA to a novel probabilistic model that can handle the non-linear projection known as kernel PCA. An expectation-maximization (EM) algorithm is derived for kernel PCA such that it is fast and applicable to large data sets. Then we propose a novel supervised projection method called MORP, which can take the output information into account in a supervised learning context. Empirical studies on various data sets show much better results compared to unsupervised projection and other supervised projection methods. Finally, we generalize MORP probabilistically to propose SPPCA for supervised projection, and we naturally extend the model to S2PPCA, a semi-supervised projection method. This allows us to incorporate both the label information and the unlabeled data into the projection process. In the third part of the thesis, we introduce a unified probabilistic model which can handle data clustering and feature projection jointly. The model can be viewed as a clustering model with projected features, and a projection model with structured documents. A variational Bayesian learning algorithm can be derived, and it turns out to iterate the clustering operations and projection operations until convergence. Superior performance can be obtained for both clustering and projection.

    Prevalence of human herpesvirus 8 infection in systemic lupus erythematosus

    Background: For decades, scientists have tried to understand the environmental factors involved in the development of systemic lupus erythematosus (SLE), among which viral infections are included. Previous studies have identified Epstein-Barr virus (EBV) as an inciting factor for SLE. Human herpesvirus 8 (HHV-8), another member of the gammaherpesvirus family, has much in common with EBV. The characteristics of HHV-8 make it a well-suited candidate to trigger SLE. Results: In the present study, serum samples from patients (n = 108) with diagnosed SLE and matched controls (n = 122) were collected, and the prevalence of HHV-8 was compared by a virus-specific nested PCR and a whole virus enzyme-linked immunoassay (EIA). There was a significant difference in the prevalence of HHV-8 DNA between SLE patients and healthy controls (11 of 107 vs 1 of 122, p = 0.001); a significant difference was also found in the detection of HHV-8 antibodies (19 of 107 vs 2 of 122, p < 0.001). We also detected the antibodies to Epstein-Barr virus viral capsid antigen (EBV-VCA) and Epstein-Barr nuclear antigen-1 (EBNA-1). Both patients and controls showed high seroprevalence with no significant difference (106 of 107 vs 119 of 122, p = 0.625). Conclusion: Our findings indicate that there might be an association between HHV-8 and the development of SLE.

    Privacy-Preserving Predictive Models for Lung Cancer Survival Analysis

    MAASTRO clinic, the Netherlands. Privacy-preserving data mining (PPDM) is a recently emerging research area that deals with incorporating privacy-preserving concerns into data mining techniques. We consider a real clinical setting where the data is horizontally distributed among different institutions. Each of the medical institutions involved in this work provides a database containing a subset of patients. Recent work shows the potential of the PPDM approach in medical applications. However, there is little work on developing or implementing PPDM for predictive personalized medicine. In this paper we use real data from several institutions across Europe to build models for survival prediction for non-small-cell lung cancer patients while addressing the potential privacy-preserving issues that may arise when sharing data across institutions located in different countries. Our experiments in a real clinical setting show that the privacy-preserving approach may result in improved models while avoiding the burdens of traditional data sharing (legal and/or anonymization expenses).

    Autoantibodies against the Catalytic Domain of BRAF Are Not Specific Serum Markers for Rheumatoid Arthritis

    BACKGROUND: Autoantibodies to the catalytic domain of v-raf murine sarcoma viral oncogene homologue B1 (BRAF) have been recently identified as a new family of autoantibodies involved in rheumatoid arthritis (RA). The objective of this study was to determine antibody responses to the catalytic domain of BRAF in RA and other autoimmune diseases. The association between RA-related clinical indices and these antibodies was also assessed. METHODOLOGY/PRINCIPAL FINDINGS: The presence of autoantibodies to the catalytic domain of BRAF (anti-BRAF) or to peptide P25 (amino acids 656-675 of the catalytic domain of BRAF; anti-P25) was determined in serum samples from patients with RA, primary Sjögren's syndrome (pSS), systemic lupus erythematosus (SLE), and healthy controls by using indirect enzyme-linked immunosorbent assays (ELISAs) based on the recombinant catalytic domain of BRAF or a synthesized peptide, respectively. Associations of anti-BRAF or anti-P25 with disease variables of RA patients were also evaluated. Our results show that the BRAF-specific antibodies anti-BRAF and anti-P25 are equally present in RA, pSS, and SLE patients. However, the erythrocyte sedimentation rate (ESR) used to detect inflammation was significantly different between patients with and without BRAF-specific antibodies. The anti-BRAF-positive patients were found to have prolonged disease, and active disease occurred more frequently in anti-P25-positive patients than in anti-P25-negative patients. A weak but significant correlation between anti-P25 levels and ESRs was observed (r = 0.319, p = 0.004). CONCLUSIONS/SIGNIFICANCE: The antibody response against the catalytic domain of BRAF is not specific for RA, but the higher titers of BRAF-specific antibodies may be associated with increased inflammation in RA

    Prolonged dual antiplatelet therapy in patients with non-ST-segment elevation myocardial infarction: 2-year findings from EPICOR Asia.

    BACKGROUND: Patients with non-ST-segment elevation myocardial infarction (NSTEMI) have a generally poor prognosis, and antithrombotic management patterns (AMPs) used post-acute coronary syndrome (ACS) remain unclear. Duration of dual antiplatelet therapy (DAPT) and patient characteristics were evaluated in NSTEMI patients enrolled in EPICOR Asia. HYPOTHESIS: Patients stopping DAPT early may benefit from more intensive monitoring. METHODS: EPICOR Asia was a prospective, real-world, primary data collection, cohort study in adults with an ACS, conducted in eight countries/regions in Asia, with 2-year follow-up. Eligible patients were hospitalized within 48 hours of symptom onset and survived to discharge. We describe AMPs and baseline characteristics in NSTEMI patients surviving ≥12 months with DAPT duration ≤12 and >12 months post-discharge. Clinical outcomes (composite of death, myocardial infarction, and stroke; and bleeding) were also explored. RESULTS: At discharge, 90.8% of patients were on DAPT (including clopidogrel, 99%). At 1- and 2-year follow-up, this was 79.2% and 60.0%. Patients who stopped DAPT ≤12 months post-discharge tended to be older, female, less obese, have prior cardiovascular disease, and have renal dysfunction. While causality cannot be inferred, the incidence of the composite endpoint over the subsequent 12 months was 10.6% and 3.1% with shorter vs longer use of DAPT, and mortality risk over the same period was 8.4% and 1.6%. CONCLUSIONS: Over 90% of NSTEMI patients were discharged on DAPT, with 60% on DAPT at 2 years. Patients stopping DAPT early were more likely to have higher baseline risk and may therefore benefit from more intensive monitoring during long-term follow-up.