64 research outputs found
Sparse supervised dimension reduction in high dimensional classification
Supervised dimension reduction has proven effective in analyzing data with complex structure. The primary goal is to seek the reduced subspace of minimal dimension that is sufficient for summarizing the data structure of interest. This paper investigates supervised dimension reduction in the high-dimensional classification context and proposes a novel method for estimating the dimension reduction subspace while retaining the ideal classification boundary based on the original dataset. The proposed method combines the techniques of margin-based classification and shrinkage estimation, and can estimate the dimension and the directions of the reduced subspace simultaneously. Both theoretical and numerical results indicate that the proposed method is highly competitive against existing methods, especially when the dimension of the covariates exceeds the sample size.
Penalized Cluster Analysis With Applications to Family Data
Cluster analysis is the assignment of observations into clusters so that observations in the same cluster are similar in some sense, and many clustering methods have been developed. However, these methods cannot be applied to family data, which possess intrinsic familial structure. To take the familial structure into account, we propose a form of penalized cluster analysis with a tuning parameter controlling its influence. The tuning parameter can be selected based on the concept of clustering stability. The method can also be applied to other clustered data such as panel data. The method is illustrated via simulations and an application to a family study of asthma.
Analysis of presence-only data via semi-supervised learning approaches
Presence-only data, which occur in classification, consist of a sample of observations from the presence class and a large number of background observations with unknown presence/absence. Since absence data are generally unavailable, conventional semi-supervised learning approaches are no longer appropriate, as they tend to degenerate and assign all observations to the presence class. In this article, we propose a generalized class balance constraint, which can be combined with semi-supervised learning approaches to prevent them from degenerating. Furthermore, to circumvent the difficulty of model tuning with presence-only data, a selection criterion based on classification stability is developed, which measures the robustness of any given classification algorithm against sampling randomness. The effectiveness of the proposed approach is demonstrated through a variety of simulated examples, along with an application to gene function prediction.
Selection of the number of clusters via the bootstrap method
Here the problem of selecting the number of clusters in cluster analysis is considered. Recently, the concept of clustering stability, which measures the robustness of any given clustering algorithm, has been utilized in Wang (2010) for selecting the number of clusters through cross validation. In this manuscript, an estimation scheme for clustering instability is developed based on the bootstrap, and then the number of clusters is selected so that the corresponding estimated clustering instability is minimized. The proposed selection criterion’s effectiveness is demonstrated on simulations and real examples.
Consistent community detection in inter-layer dependent multi-layer networks
Community detection in multi-layer networks, which aims at finding groups of nodes with similar connective patterns among all layers, has attracted tremendous interest in multi-layer network analysis. Most existing methods are extended from those for single-layer networks, which assume that different layers are independent. In this paper, we propose a novel community detection method in multi-layer networks with inter-layer dependence, which integrates the stochastic block model (SBM) and the Ising model. The community structure is modeled by the SBM and the inter-layer dependence is incorporated via the Ising model. An efficient alternative updating algorithm is developed to tackle the resultant optimization task. Moreover, the asymptotic consistencies of the proposed method in terms of both parameter estimation and community detection are established, which are supported by extensive simulated examples and a real example on a multi-layer malaria parasite gene network.
Regularized k-means clustering of high-dimensional data and its asymptotic consistency
K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering.
Charge Transfer from n-Doped Nanocrystals: Mimicking Intermediate Events in Multielectron Photocatalysis
In multielectron photocatalytic reactions, an absorbed photon triggers charge transfer from the light-harvester to the attached catalyst, leaving behind a charge of the opposite sign in the light-harvester. If this charge is not scavenged before the absorption of the following photons, photoexcitation generates not neutral but charged excitons from which the extraction of charges should become more difficult. This is potentially an efficiency-limiting intermediate event in multielectron photocatalysis. To study the charge dynamics in this event, we doped CdS nanocrystal quantum dots (QDs) with an extra electron and measured hole transfer from n-doped QDs to attached acceptors. We find that the Auger decay of charged excitons lowers the charge separation yield to 68.6% from 98.4% for neutral excitons. In addition, the hole transfer rate in the presence of two electrons (1290 ps) is slower than that in the presence of one electron (776 ps), and the recombination rate of charge separated states is about 2 times faster in the former case. This model study provides important insights into possible efficiency-limiting intermediate events involved in photocatalysis.
Electron Transfer into Electron-Accumulated Nanocrystals: Mimicking Intermediate Events in Multielectron Photocatalysis II
The overall efficiency of multielectron photocatalytic reactions is often much lower than the charge-separation yield reported for the first charge-transfer (CT) event. Our recent study has partially linked this gap to CT from charge-accumulated light harvesters. Another possible intermediate event lowering the efficiency is CT into charge-accumulated nanocatalysts. To study this event, we built a “toy” system using nanocrystal quantum dots (QDs) doped with extra electrons to mimic charge-accumulated nanocatalysts. We measured electron transfer (ET) from photoexcited molecular light harvesters into doped QDs using transient absorption spectroscopy. The measurements reveal that the pre-existing electron slows down ET from 37.8 ± 2.2 ps in the neutral sample to 93.4 ± 8.6 ps in the singly doped sample, accelerates charge recombination (CR) from 7.02 ± 0.84 to 3.69 ± 0.25 ns, and lowers the electron-injection yield by ∼14%. This study uncovers yet another possible intermediate event lowering the efficiency of multielectron photocatalysis.
Classification With Unstructured Predictors and an Application to Sentiment Analysis
Unstructured data refer to information that lacks certain structures and cannot be organized in a predefined fashion. Unstructured data often involve words, texts, graphs, objects, or multimedia types of files that are difficult to process and analyze with traditional computational tools and statistical methods. This work explores ordinal classification for unstructured predictors with ordered class categories, where imprecise information concerning strengths of association between predictors is available for predicting class labels. However, imprecise information here is expressed in terms of a directed graph, with each node representing a predictor and a directed edge containing pairwise strengths of association between two nodes. One of the targeted applications for unstructured data arises from sentiment analysis, which identifies and extracts the relevant content or opinion of a document concerning a specific event of interest. We integrate the imprecise predictor relations into linear relational constraints over classification function coefficients, where large margin ordinal classifiers are introduced, subject to many quadratically linear constraints. The proposed classifiers are then applied in sentiment analysis using binary word predictors. Computationally, we implement ordinal support vector machines and ψ-learning through a scalable quadratic programming package based on sparse word representations. Theoretically, we show that using relationships among unstructured predictors improves prediction accuracy of classification significantly. We illustrate an application for sentiment analysis using consumer text reviews and movie review data. Supplementary materials for this article are available online.
Additional file 1 of Flexible combination of multiple diagnostic biomarkers to improve diagnostic accuracy
The Appendix includes the proof of Proposition 1. (PDF 28.6 kb)