64 research outputs found

    Sparse supervised dimension reduction in high dimensional classification

    Supervised dimension reduction has proven effective in analyzing data with complex structure. The primary goal is to find the reduced subspace of minimal dimension that is sufficient to summarize the data structure of interest. This paper investigates supervised dimension reduction in the context of high-dimensional classification and proposes a novel method for estimating the dimension reduction subspace while retaining the ideal classification boundary based on the original dataset. The proposed method combines margin-based classification with shrinkage estimation and can estimate the dimension and the directions of the reduced subspace simultaneously. Both theoretical and numerical results indicate that the proposed method is highly competitive with existing methods, especially when the dimension of the covariates exceeds the sample size.

    Penalized Cluster Analysis With Applications to Family Data

    Cluster analysis assigns observations to clusters so that observations in the same cluster are similar in some sense, and many clustering methods have been developed. However, these methods cannot be applied to family data, which possess intrinsic familial structure. To take the familial structure into account, we propose a form of penalized cluster analysis with a tuning parameter controlling the influence of that structure. The tuning parameter can be selected based on the concept of clustering stability. The method can also be applied to other clustered data such as panel data. The method is illustrated via simulations and an application to a family study of asthma.

    Analysis of presence-only data via semi-supervised learning approaches

    Presence-only data arise in classification and consist of a sample of observations from the presence class and a large number of background observations with unknown presence/absence status. Since absence data are generally unavailable, conventional semi-supervised learning approaches are no longer appropriate, as they tend to degenerate and assign all observations to the presence class. In this article, we propose a generalized class balance constraint, which can be combined with semi-supervised learning approaches to prevent this degeneration. Furthermore, to circumvent the difficulty of model tuning with presence-only data, a selection criterion based on classification stability is developed, which measures the robustness of any given classification algorithm against sampling randomness. The effectiveness of the proposed approach is demonstrated through a variety of simulated examples, along with an application to gene function prediction.

    Selection of the number of clusters via the bootstrap method

    Here the problem of selecting the number of clusters in cluster analysis is considered. Recently, the concept of clustering stability, which measures the robustness of any given clustering algorithm, was used in Wang (2010) to select the number of clusters through cross-validation. In this manuscript, an estimation scheme for clustering instability is developed based on the bootstrap, and the number of clusters is then selected to minimize the estimated clustering instability. The effectiveness of the proposed selection criterion is demonstrated on simulations and real examples.
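
The selection idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the use of plain k-means as the clustering algorithm, the pairwise co-membership disagreement as the distance between two clusterings, and the hypothetical function names `kmeans`, `assign`, `clustering_distance`, `bootstrap_instability`, and `select_k`.

```python
import numpy as np

def assign(X, centers):
    """Label each row of X by its nearest center (squared Euclidean distance)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm (an assumed base clusterer); returns k centers."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def clustering_distance(a, b):
    """Fraction of ordered point pairs on which two labelings disagree
    about whether the pair shares a cluster (invariant to label names)."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    n = len(a)
    return (same_a != same_b).sum() / (n * (n - 1))

def bootstrap_instability(X, k, n_boot=20, seed=0):
    """Average distance between clusterings fitted on pairs of bootstrap
    samples and then evaluated on the original data."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    dists = []
    for _ in range(n_boot):
        c1 = kmeans(X[rng.integers(0, n, size=n)], k, rng)
        c2 = kmeans(X[rng.integers(0, n, size=n)], k, rng)
        dists.append(clustering_distance(assign(X, c1), assign(X, c2)))
    return float(np.mean(dists))

def select_k(X, candidates, **kwargs):
    """Return the candidate k with the smallest estimated instability."""
    scores = {k: bootstrap_instability(X, k, **kwargs) for k in candidates}
    return min(scores, key=scores.get), scores
```

Calling `select_k(X, [2, 3, 4, 5])` returns the chosen number of clusters together with the per-candidate instability estimates. Ties between candidates and the choice of `n_boot` are left unaddressed in this sketch.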

    Consistent community detection in inter-layer dependent multi-layer networks

    Community detection in multi-layer networks, which aims to find groups of nodes with similar connective patterns across all layers, has attracted tremendous interest in multi-layer network analysis. Most existing methods are extended from those for single-layer networks and assume that different layers are independent. In this paper, we propose a novel community detection method for multi-layer networks with inter-layer dependence, which integrates the stochastic block model (SBM) and the Ising model. The community structure is modeled by the SBM, and the inter-layer dependence is incorporated via the Ising model. An efficient alternative updating algorithm is developed to tackle the resultant optimization task. Moreover, the asymptotic consistency of the proposed method in terms of both parameter estimation and community detection is established and is supported by extensive simulated examples and a real example on a multi-layer malaria parasite gene network.

    Regularized k-means clustering of high-dimensional data and its asymptotic consistency

    K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data, where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate k-means clustering as a regularization problem, with an adaptive group lasso penalty term on the cluster centers. To optimally balance the trade-off between clustering model fit and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to general model-based clustering.
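
One plausible way to write the penalized objective described in this abstract is sketched below. The notation is assumed here for illustration and does not come from the abstract: $x_i$ are the $n$ observations in $\mathbb{R}^p$, $c_1, \dots, c_K$ are the cluster centers, $a(i)$ is the cluster assigned to observation $i$, $C_{(j)} = (c_{1j}, \dots, c_{Kj})$ collects the $j$-th coordinate of all centers, $w_j$ are adaptive weights, and $\lambda$ is the tuning parameter.

```latex
\min_{c_1, \dots, c_K,\; a}\;
  \sum_{i=1}^{n} \bigl\| x_i - c_{a(i)} \bigr\|_2^2
  \;+\; \lambda \sum_{j=1}^{p} w_j \bigl\| C_{(j)} \bigr\|_2
```

Under this formulation, and assuming the variables have been centered, a variable $j$ with $C_{(j)} = 0$ has the same center coordinate in every cluster and so plays no role in the assignment, which is how a group penalty of this kind can remove uninformative variables.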

    Charge Transfer from n-Doped Nanocrystals: Mimicking Intermediate Events in Multielectron Photocatalysis

    In multielectron photocatalytic reactions, an absorbed photon triggers charge transfer from the light harvester to the attached catalyst, leaving behind a charge of the opposite sign in the light harvester. If this charge is not scavenged before the following photons are absorbed, photoexcitation generates charged rather than neutral excitons, from which the extraction of charges should become more difficult. This is potentially an efficiency-limiting intermediate event in multielectron photocatalysis. To study the charge dynamics in this event, we doped CdS nanocrystal quantum dots (QDs) with an extra electron and measured hole transfer from n-doped QDs to attached acceptors. We find that the Auger decay of charged excitons lowers the charge separation yield to 68.6% from 98.4% for neutral excitons. In addition, hole transfer in the presence of two electrons (1290 ps) is slower than in the presence of one electron (776 ps), and the recombination of charge-separated states is about 2 times faster in the former case. This model study provides important insights into possible efficiency-limiting intermediate events involved in photocatalysis.

    Electron Transfer into Electron-Accumulated Nanocrystals: Mimicking Intermediate Events in Multielectron Photocatalysis II

    The overall efficiency of multielectron photocatalytic reactions is often much lower than the charge-separation yield reported for the first charge-transfer (CT) event. Our recent study has partially linked this gap to CT from charge-accumulated light harvesters. Another possible intermediate event lowering the efficiency is CT into charge-accumulated nanocatalysts. To study this event, we built a “toy” system using nanocrystal quantum dots (QDs) doped with extra electrons to mimic charge-accumulated nanocatalysts. We measured electron transfer (ET) from photoexcited molecular light harvesters into doped QDs using transient absorption spectroscopy. The measurements reveal that the pre-existing electron slows down ET from 37.8 ± 2.2 ps in the neutral sample to 93.4 ± 8.6 ps in the singly doped sample, accelerates charge recombination (CR) from 7.02 ± 0.84 to 3.69 ± 0.25 ns, and lowers the electron-injection yield by ∼14%. This study uncovers yet another possible intermediate event lowering the efficiency of multielectron photocatalysis.

    Classification With Unstructured Predictors and an Application to Sentiment Analysis

    Unstructured data refer to information that lacks certain structures and cannot be organized in a predefined fashion. Unstructured data often involve words, texts, graphs, objects, or multimedia files that are difficult to process and analyze with traditional computational tools and statistical methods. This work explores ordinal classification for unstructured predictors with ordered class categories, where imprecise information concerning strengths of association between predictors is available for predicting class labels. Here, the imprecise information is expressed in terms of a directed graph, with each node representing a predictor and each directed edge carrying the pairwise strength of association between two nodes. One of the targeted applications for unstructured data arises from sentiment analysis, which identifies and extracts the relevant content or opinion of a document concerning a specific event of interest. We integrate the imprecise predictor relations into linear relational constraints over classification function coefficients, where large margin ordinal classifiers are introduced, subject to many quadratically linear constraints. The proposed classifiers are then applied in sentiment analysis using binary word predictors. Computationally, we implement ordinal support vector machines and ψ-learning through a scalable quadratic programming package based on sparse word representations. Theoretically, we show that using relationships among unstructured predictors significantly improves classification accuracy. We illustrate an application for sentiment analysis using consumer text reviews and movie review data. Supplementary materials for this article are available online.