191 research outputs found
Advanced Probabilistic Models for Clustering and Projection
Probabilistic modeling for data mining and machine learning problems is a fundamental research area. The general approach is to assume a generative model underlying the observed data and to estimate the model parameters via likelihood maximization. It rests on probability theory and draws on a large body of methods from statistical learning, sampling theory and Bayesian statistics. In this thesis we study several advanced probabilistic models for data clustering and feature projection, two important unsupervised learning problems.
The goal of clustering is to group similar data points together to uncover the data clusters. While numerous methods exist for various clustering tasks, one important question remains: how to determine the number of clusters automatically. The first part of the thesis answers this question from a mixture modeling perspective. A finite mixture model is first introduced for clustering, in which each mixture component is assumed to be an exponential family distribution for generality. The model is then extended to an infinite mixture model, and its strong connection to the Dirichlet process (DP), a non-parametric Bayesian framework, is uncovered. A variational Bayesian algorithm called VBDMA is derived from this insight to learn the number of clusters automatically, and empirical studies on some 2D data sets and an image data set verify the effectiveness of this algorithm.
In feature projection, we are interested in dimensionality reduction and aim to find a low-dimensional feature representation for the data. We first review the well-known principal component analysis (PCA) and its probabilistic interpretation (PPCA), and then generalize PPCA to a novel probabilistic model that can handle the non-linear projection known as kernel PCA. An expectation-maximization (EM) algorithm is derived for kernel PCA so that it is fast and applicable to large data sets. We then propose a novel supervised projection method called MORP, which takes the output information into account in a supervised learning context. Empirical studies on various data sets show much better results than unsupervised projection and other supervised projection methods. Finally, we generalize MORP probabilistically to propose SPPCA for supervised projection, and naturally extend the model to S2PPCA, a semi-supervised projection method. This allows us to incorporate both the label information and the unlabeled data into the projection process.
In the third part of the thesis, we introduce a unified probabilistic model which can handle data clustering and feature projection jointly. The model can be viewed as a clustering model with projected features, and a projection model with structured documents. A variational Bayesian learning algorithm can be derived, which turns out to alternate between clustering operations and projection operations until convergence. Superior performance can be obtained for both clustering and projection
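As a rough illustration of how a Dirichlet process mixture leaves the number of clusters open, the sketch below draws mixture weights with the standard truncated stick-breaking construction. It is not the VBDMA algorithm from the thesis; the function name `stick_breaking_weights`, the concentration `alpha`, and the truncation level are assumptions chosen for illustration only.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha      -- DP concentration parameter (hypothetical value, see below)
    truncation -- maximum number of components kept in the approximation
    rng        -- a random.Random instance, so that draws are reproducible
    """
    weights = []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, alpha)  # fraction broken off the rest
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # last component absorbs what is left
    return weights


rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, truncation=20, rng=rng)
# Components carrying more than 1% of the mass; the rest are effectively unused.
active = [w for w in weights if w > 0.01]
print(len(active), "of", len(weights), "components carry noticeable weight")
```

With a small `alpha`, most of the mass tends to fall on a few components, which is the sense in which the effective number of clusters is learned rather than fixed in advance.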
Prevalence of human herpesvirus 8 infection in systemic lupus erythematosus
Abstract
Background: For decades, scientists have tried to understand the environmental factors involved in the development of systemic lupus erythematosus (SLE), including viral infections. Previous studies have implicated Epstein-Barr virus (EBV) as a trigger of SLE. Human herpesvirus 8 (HHV-8), another member of the gammaherpesvirus family, shares many features with EBV. The characteristics of HHV-8 make it a well-suited candidate to trigger SLE.
Results: In the present study, serum samples from patients (n = 108) with diagnosed SLE and matched controls (n = 122) were collected, and the prevalence of HHV-8 was compared by a virus-specific nested PCR and a whole virus enzyme-linked immunoassay (EIA). There was a significant difference in the prevalence of HHV-8 DNA between SLE patients and healthy controls (11 of 107 vs 1 of 122, p = 0.001); a significant difference was also found in the detection of HHV-8 antibodies (19 of 107 vs 2 of 122, p < 0.001). We also detected the antibodies to Epstein-Barr virus viral capsid antigen (EBV-VCA) and Epstein-Barr nuclear antigen-1 (EBNA-1). Both patients and controls showed high seroprevalence with no significant difference (106 of 107 vs 119 of 122, p = 0.625).
Conclusion: Our findings indicate that there might be an association between HHV-8 and the development of SLE.
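The abstract does not name the significance test behind its p-values. Assuming a two-sided Fisher exact test on each 2x2 table (an assumption, not something the source states), the comparisons can be sketched with the standard library alone. The function name `fisher_exact_two_sided` is hypothetical; the counts are the ones quoted above.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (patients, controls); columns are (positive, negative).
    Assumption: this is only one plausible choice of test for such counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k positives falling in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is at most as likely as the observed one.
    return sum(
        p
        for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-9)
    )


p_dna = fisher_exact_two_sided(11, 96, 1, 121)       # HHV-8 DNA
p_antibody = fisher_exact_two_sided(19, 88, 2, 120)  # HHV-8 antibodies
p_ebv = fisher_exact_two_sided(106, 1, 119, 3)       # EBV seroprevalence
print(p_dna, p_antibody, p_ebv)
```

Under this assumption the first two comparisons come out significant and the EBV comparison does not, with the EBV value landing close to the reported 0.625.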
Research on Methods for Very Large Scale Integration Track Assignment Routing
Routing is a crucial stage in the physical design of Very Large Scale Integration (VLSI) circuits, comprising three phases: global routing, track assignment routing, and detailed routing. With the development of VLSI circuits, scholars have proposed various track assignment routing algorithms. However, improving the efficiency of track assignment routing and optimizing conflicting design rules have become bottlenecks in track assignment routing problems. This study addresses these bottlenecks by utilizing single-level horizontal and vertical Steiner trees to extract routability information of local wire nets, resolving the adaptation issue between global routing and detailed routing. The proposed algorithm enhances routability information by an average of 16.07% across ten benchmark circuits. Additionally, a Generative Neural Network model based on Conditional Variational Autoencoder (CVAE) is employed to improve the efficiency of track assignment routing, yielding significant simulation results. Furthermore, a negotiation-based tear-and-reassign approach is utilized to address track congestion issues, resulting in an average optimization of 26.03% in overlap cost, with a tradeoff of sacrificing 6.67% of wirelength on average
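To give a feel for what a negotiation-based tear-and-reassign loop does, here is a deliberately small sketch. It is not the algorithm from the study: wire segments are reduced to one-dimensional intervals, the history penalty is kept per segment and track, and all names (`negotiate`, `track_cost`, `history_weight`) are hypothetical.

```python
def overlap(a, b):
    """Overlap length of two intervals given as (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def track_cost(seg_id, track, segments, assignment):
    """Overlap of one segment with the other segments placed on `track`."""
    return sum(
        overlap(segments[seg_id], segments[other])
        for other, t in assignment.items()
        if t == track and other != seg_id
    )


def total_overlap(segments, assignment):
    """Total pairwise overlap between segments that share a track."""
    ids = sorted(assignment)
    return sum(
        overlap(segments[i], segments[j])
        for n, i in enumerate(ids)
        for j in ids[n + 1:]
        if assignment[i] == assignment[j]
    )


def negotiate(segments, num_tracks, max_iters=50, history_weight=1.0):
    """Toy tear-and-reassign loop with a growing history penalty."""
    assignment = {i: 0 for i in range(len(segments))}  # start congested
    history = [[0.0] * num_tracks for _ in segments]
    for _ in range(max_iters):
        conflicted = [
            i for i in assignment
            if track_cost(i, assignment[i], segments, assignment) > 0
        ]
        if not conflicted:
            break
        for i in conflicted:
            history[i][assignment[i]] += history_weight  # remember the clash
            del assignment[i]                            # tear
            assignment[i] = min(                         # reassign
                range(num_tracks),
                key=lambda t: track_cost(i, t, segments, assignment)
                + history[i][t],
            )
    return assignment


segments = [(0, 4), (2, 6), (5, 9), (1, 3)]  # made-up example intervals
result = negotiate(segments, num_tracks=3)
print(result, total_overlap(segments, result))
```

The history term is what makes the loop a negotiation: a track that keeps causing conflicts for a segment becomes progressively more expensive, so segments gradually spread out instead of oscillating between the same tracks.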
Lyapunov exponents and Lagrangian chaos suppression in compressible homogeneous isotropic turbulence
We study Lyapunov exponents of tracers in compressible homogeneous isotropic
turbulence at different turbulent Mach number $M_t$ and Taylor-scale Reynolds
number $Re_\lambda$. We demonstrate that statistics of finite-time Lyapunov
exponents have the same form as in incompressible flow due to density-velocity
coupling. Modulus of the smallest Lyapunov exponent $\lambda_3$ provides the
principal Lyapunov exponent of the time-reversed flow, which usually is wrong
in a compressible flow. This exponent, along with the principal Lyapunov
exponent $\lambda_1$, determines all the exponents due to the vanishing of the
sum of all Lyapunov exponents. Numerical results by high-order schemes for
solving the Navier-Stokes equations and tracking particles verify these
theoretical predictions. We found that: 1) The largest normalized Lyapunov
exponent $\lambda_1 \tau_\eta$, where $\tau_\eta$ is the Kolmogorov time scale,
is a decreasing function of $M_t$. Its dependence on $Re_\lambda$ is weak when
the driving force is solenoidal, while it is an increasing function of
$Re_\lambda$ when the solenoidal and compressible forces are comparable.
Similar facts hold for $|\lambda_3|$, in contrast with well-studied
short-correlated model; 2) The ratio of the first two Lyapunov exponents
$\lambda_1/\lambda_2$ decreases with $Re_\lambda$, and is virtually independent
of $M_t$ for […] in the case of solenoidal force but decreases as $M_t$
increases when solenoidal and compressible forces are comparable; 3) For purely
solenoidal force, […] for
[…], which is consistent with incompressible turbulence studies;
4) The ratio of dilation-to-vorticity is a more suitable parameter to
characterize LEs than $M_t$.
Comment: 25 pages, 18 figures
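The statement that two exponents determine the third can be written out in one line. The notation here is an assumption: $\lambda_1 \ge \lambda_2 \ge \lambda_3$ denote the three Lyapunov exponents of tracers in three dimensions, with $\lambda_1$ the principal exponent and $\lambda_3 < 0$ the smallest one.

```latex
\lambda_1 + \lambda_2 + \lambda_3 = 0
\quad\Longrightarrow\quad
\lambda_2 = -(\lambda_1 + \lambda_3) = |\lambda_3| - \lambda_1 .
```

So once $\lambda_1$ and $|\lambda_3|$ are measured, the intermediate exponent follows without a separate computation.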
ADCNet: a unified framework for predicting the activity of antibody-drug conjugates
Antibody-drug conjugates (ADCs) have revolutionized the field of cancer
treatment in the era of precision medicine due to their ability to precisely
target cancer cells and release highly effective drugs. Nevertheless, the
rational design of ADCs is very difficult because the relationship between
their structures and activities is hard to understand. In the present study,
we introduce a unified deep learning framework called ADCNet to help design
potential ADCs. ADCNet integrates the protein representation learning
language model ESM-2 and the small-molecule representation learning language
model FG-BERT to achieve activity prediction by learning meaningful features
from the antigen and antibody protein sequences of an ADC, the SMILES strings
of the linker and payload, and the drug-antibody ratio (DAR) value. Based on a
carefully designed and manually tailored ADC data set, extensive evaluation
results reveal that ADCNet performs best on the test set compared to baseline
machine learning models across all evaluation metrics. For example, it
achieves an average prediction accuracy of 87.12%, a balanced accuracy of
0.8689, and an area under the receiver operating characteristic curve of
0.9293 on the test set. In addition, cross-validation, ablation experiments,
and external independent testing results further demonstrate the stability,
advancement, and robustness of the ADCNet architecture. For the convenience of
the community, we develop the first online platform
(https://ADCNet.idruglab.cn) for the prediction of ADC activity based on the
optimal ADCNet model, and the source code is publicly available at
https://github.com/idrugLab/ADCNet
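For readers unfamiliar with the three reported metrics, the following sketch shows how accuracy, balanced accuracy, and the area under the ROC curve are commonly computed for a binary classifier. It is independent of the ADCNet code base; the function names and the toy labels and scores are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels 0/1."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    sensitivity = sum(p == 1 for p in pos) / len(pos)
    specificity = sum(p == 0 for p in neg) / len(neg)
    return (sensitivity + specificity) / 2


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (not from the paper): 2 active and 3 inactive examples.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # hypothetical 0.5 threshold

print(accuracy(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Balanced accuracy is reported alongside plain accuracy because it is not inflated when one class dominates the data set, and the AUC is independent of the chosen threshold.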