188 research outputs found

    Advanced Probabilistic Models for Clustering and Projection

    Probabilistic modeling is a fundamental research area in data mining and machine learning. The general approach is to assume a generative model underlying the observed data and to estimate the model parameters by maximizing the likelihood. It is grounded in probability theory and draws on a large body of methods from statistical learning, sampling theory, and Bayesian statistics. In this thesis we study several advanced probabilistic models for data clustering and feature projection, two important unsupervised learning problems. The goal of clustering is to group similar data points together to uncover the clusters in the data. While numerous methods exist for various clustering tasks, one important question remains: how to determine the number of clusters automatically. The first part of the thesis answers this question from a mixture modeling perspective. A finite mixture model is first introduced for clustering, in which each mixture component is assumed to be an exponential family distribution for generality. The model is then extended to an infinite mixture model, and its strong connection to the Dirichlet process (DP), a non-parametric Bayesian framework, is uncovered. A variational Bayesian algorithm called VBDMA is derived from this insight to learn the number of clusters automatically, and empirical studies on several 2D data sets and an image data set verify its effectiveness. In feature projection, we are interested in dimensionality reduction and aim to find a low-dimensional feature representation of the data. We first review the well-known principal component analysis (PCA) and its probabilistic interpretation (PPCA), and then generalize PPCA to a novel probabilistic model that can handle the non-linear projection known as kernel PCA. An expectation-maximization (EM) algorithm is derived for kernel PCA so that it is fast and applicable to large data sets. We then propose a novel supervised projection method called MORP, which takes the output information into account in a supervised learning context. Empirical studies on various data sets show much better results than unsupervised projection and other supervised projection methods. Finally, we generalize MORP probabilistically to obtain SPPCA for supervised projection, and naturally extend the model to S2PPCA, a semi-supervised projection method. This allows us to incorporate both the label information and the unlabeled data into the projection process. In the third part of the thesis, we introduce a unified probabilistic model that handles data clustering and feature projection jointly. The model can be viewed as a clustering model with projected features, and as a projection model with structured documents. A variational Bayesian learning algorithm is derived, which turns out to iterate the clustering and projection operations until convergence. Superior performance can be obtained for both clustering and projection.
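
The abstract links infinite mixture models to the Dirichlet process as a way of letting the data decide the number of clusters. As a rough illustration of that idea only, the sketch below samples cluster sizes from a Chinese restaurant process, a standard representation of the partition a DP prior induces. It is not the VBDMA algorithm from the thesis; the function name `sample_crp_partition` and the parameters `alpha` and `seed` are hypothetical names chosen for this example.

```python
import random


def sample_crp_partition(n_points, alpha, seed=0):
    """Sample cluster sizes from a Chinese restaurant process (CRP) prior.

    alpha > 0 is the concentration parameter (an assumption of this sketch):
    larger values make opening a new cluster more likely.
    """
    rng = random.Random(seed)
    sizes = []  # sizes[k] = number of points currently assigned to cluster k
    for n_seen in range(n_points):
        # An existing cluster k is chosen with probability sizes[k] / (n_seen + alpha);
        # a new cluster is opened with probability alpha / (n_seen + alpha).
        u = rng.uniform(0.0, n_seen + alpha)
        cumulative = 0.0
        for k, size in enumerate(sizes):
            cumulative += size
            if u < cumulative:
                sizes[k] += 1
                break
        else:
            # u fell in the last segment of width alpha: open a new cluster.
            sizes.append(1)
    return sizes


if __name__ == "__main__":
    # The number of clusters is not fixed in advance; it varies with alpha.
    for alpha in (0.5, 2.0, 10.0):
        sizes = sample_crp_partition(200, alpha, seed=1)
        print(alpha, len(sizes), sorted(sizes, reverse=True)[:5])
```

The number of clusters here is an outcome of sampling rather than an input, which is the property a DP-based mixture model exploits when the cluster count is learned from data.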

    Prevalence of human herpesvirus 8 infection in systemic lupus erythematosus

    Background: For decades, scientists have tried to understand the environmental factors involved in the development of systemic lupus erythematosus (SLE), among them viral infections. Previous studies have identified Epstein-Barr virus (EBV) as an inciting factor for SLE. Human herpesvirus 8 (HHV-8), another member of the gammaherpesvirus family, has much in common with EBV, and its characteristics make it a well-suited candidate to trigger SLE. Results: In the present study, serum samples from patients with diagnosed SLE (n = 108) and matched controls (n = 122) were collected, and the prevalence of HHV-8 was compared using a virus-specific nested PCR and a whole-virus enzyme-linked immunoassay (EIA). There was a significant difference in the prevalence of HHV-8 DNA between SLE patients and healthy controls (11 of 107 vs 1 of 122, p = 0.001); a significant difference was also found in the detection of HHV-8 antibodies (19 of 107 vs 2 of 122, p < 0.001). We also tested for antibodies to Epstein-Barr virus viral capsid antigen (EBV-VCA) and Epstein-Barr nuclear antigen-1 (EBNA-1). Both patients and controls showed high seroprevalence, with no significant difference (106 of 107 vs 119 of 122, p = 0.625). Conclusion: Our findings indicate that there might be an association between HHV-8 and the development of SLE.
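
The prevalence comparisons above are 2x2 count tables, so they can be checked with a standard test. The abstract does not say which test produced the reported p-values; the sketch below assumes a two-sided Fisher exact test, so its results need not match the reported figures exactly. The function name `fisher_exact_two_sided` is a hypothetical helper written for this example, using only the Python standard library.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k positives in row 1 with margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    # (a small relative tolerance guards against floating-point ties).
    return sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_obs * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Counts taken from the abstract: positives / negatives in patients, then controls.
    print("HHV-8 DNA:       ", fisher_exact_two_sided(11, 107 - 11, 1, 122 - 1))
    print("HHV-8 antibodies:", fisher_exact_two_sided(19, 107 - 19, 2, 122 - 2))
```

Fisher's exact test is used here because one cell in each table is very small (1 or 2 positive controls), where large-sample approximations are less reliable.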

    Lyapunov exponents and Lagrangian chaos suppression in compressible homogeneous isotropic turbulence

    We study Lyapunov exponents of tracers in compressible homogeneous isotropic turbulence at different turbulent Mach numbers $M_t$ and Taylor-scale Reynolds numbers $Re_\lambda$. We demonstrate that statistics of finite-time Lyapunov exponents have the same form as in incompressible flow due to density-velocity coupling. The modulus of the smallest Lyapunov exponent $\lambda_3$ provides the principal Lyapunov exponent of the time-reversed flow, a property that usually does not hold in a compressible flow. This exponent, along with the principal Lyapunov exponent $\lambda_1$, determines all the exponents because the sum of all Lyapunov exponents vanishes. Numerical results from high-order schemes for solving the Navier-Stokes equations and tracking particles verify these theoretical predictions. We found that: 1) The largest normalized Lyapunov exponent $\lambda_1 \tau_\eta$, where $\tau_\eta$ is the Kolmogorov time scale, is a decreasing function of $M_t$. Its dependence on $Re_\lambda$ is weak when the driving force is solenoidal, while it is an increasing function of $Re_\lambda$ when the solenoidal and compressible forces are comparable. Similar facts hold for $|\lambda_3|$, in contrast with the well-studied short-correlated model; 2) The ratio of the first two Lyapunov exponents $\lambda_1/\lambda_2$ decreases with $Re_\lambda$, and is virtually independent of $M_t$ for $M_t \le 1$ in the case of solenoidal force but decreases as $M_t$ increases when solenoidal and compressible forces are comparable; 3) For purely solenoidal force, $\lambda_1 : \lambda_2 : \lambda_3 \approx 4:1:-5$ for $Re_\lambda > 80$, which is consistent with incompressible turbulence studies; 4) The ratio of dilation-to-vorticity is a more suitable parameter to characterize LEs than $M_t$. Comment: 25 pages, 18 figures
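
The statement that two exponents determine the third can be written out in one line. The block below is only a worked restatement of the abstract's claims, assuming the ordering $\lambda_1 \ge \lambda_2 \ge \lambda_3$ with $\lambda_3 < 0$ so that $|\lambda_3| = -\lambda_3$; it also checks that the reported ratio for purely solenoidal forcing is consistent with a vanishing sum.

```latex
\lambda_1 + \lambda_2 + \lambda_3 = 0
\quad\Longrightarrow\quad
\lambda_2 = -(\lambda_1 + \lambda_3) = |\lambda_3| - \lambda_1 ,
\qquad
\lambda_1 : \lambda_2 : \lambda_3 \approx 4 : 1 : -5
\quad\Longrightarrow\quad
4 + 1 - 5 = 0 .
```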

    Privacy-Preserving Predictive Models for Lung Cancer Survival Analysis

    MAASTRO clinic, the Netherlands. Privacy-preserving data mining (PPDM) is a recently emerged research area that deals with incorporating privacy-preserving concerns into data mining techniques. We consider a real clinical setting where the data are horizontally distributed among different institutions: each of the medical institutions involved in this work provides a database containing a subset of the patients. Recent work shows the potential of the PPDM approach in medical applications; however, there is little work on developing or implementing PPDM for predictive personalized medicine. In this paper we use real data from several institutions across Europe to build survival prediction models for non-small-cell lung cancer patients while addressing the privacy issues that may arise when sharing data across institutions located in different countries. Our experiments in a real clinical setting show that the privacy-preserving approach may result in improved models while avoiding the burdens of traditional data sharing (legal and/or anonymization expenses).

    Autoantibodies against the Catalytic Domain of BRAF Are Not Specific Serum Markers for Rheumatoid Arthritis

    BACKGROUND: Autoantibodies to the catalytic domain of v-raf murine sarcoma viral oncogene homologue B1 (BRAF) have been recently identified as a new family of autoantibodies involved in rheumatoid arthritis (RA). The objective of this study was to determine antibody responses to the catalytic domain of BRAF in RA and other autoimmune diseases. The association between RA-related clinical indices and these antibodies was also assessed. METHODOLOGY/PRINCIPAL FINDINGS: The presence of autoantibodies to the catalytic domain of BRAF (anti-BRAF) or to peptide P25 (amino acids 656-675 of the catalytic domain of BRAF; anti-P25) was determined in serum samples from patients with RA, primary Sjögren's syndrome (pSS), systemic lupus erythematosus (SLE), and healthy controls by using indirect enzyme-linked immunosorbent assays (ELISAs) based on the recombinant catalytic domain of BRAF or a synthesized peptide, respectively. Associations of anti-BRAF or anti-P25 with disease variables of RA patients were also evaluated. Our results show that the BRAF-specific antibodies anti-BRAF and anti-P25 are equally present in RA, pSS, and SLE patients. However, the erythrocyte sedimentation rate (ESR) used to detect inflammation was significantly different between patients with and without BRAF-specific antibodies. The anti-BRAF-positive patients were found to have prolonged disease, and active disease occurred more frequently in anti-P25-positive patients than in anti-P25-negative patients. A weak but significant correlation between anti-P25 levels and ESRs was observed (r = 0.319, p = 0.004). CONCLUSIONS/SIGNIFICANCE: The antibody response against the catalytic domain of BRAF is not specific for RA, but the higher titers of BRAF-specific antibodies may be associated with increased inflammation in RA.