35 research outputs found

    Robust Independent Component Analysis via Minimum Divergence Estimation

    Independent component analysis (ICA) has been shown to be useful in many applications. However, most ICA methods are sensitive to data contamination and outliers. In this article we introduce a general minimum U-divergence framework for ICA, which covers some standard ICA methods as special cases. Within the U-family we further focus on the gamma-divergence due to its desirable property of super robustness, which gives the proposed method gamma-ICA. Statistical properties and technical conditions for the consistency of gamma-ICA are rigorously studied. In the limiting case, it leads to a necessary and sufficient condition for the consistency of MLE-ICA. This necessary and sufficient condition is weaker than the condition known in the literature. Since the parameter of interest in ICA is an orthogonal matrix, a geometrical algorithm based on gradient flows on the special orthogonal group is introduced to implement gamma-ICA. Furthermore, a data-driven selection for the gamma value, which is critical to the achievement of gamma-ICA, is developed. The performance, especially the robustness, of gamma-ICA in comparison with standard ICA methods is demonstrated through experimental studies using simulated data and image data.

    Robust Independent Component Analysis via Minimum γ-Divergence Estimation

    Independent component analysis (ICA) has been shown to be useful in many applications. However, most ICA methods are sensitive to data contamination. In this article we introduce a general minimum U-divergence framework for ICA, which covers some standard ICA methods as special cases. Within the U-family we further focus on the γ-divergence due to its desirable property of super robustness against outliers, which gives the proposed method γ-ICA. Statistical properties and technical conditions for recovery consistency of γ-ICA are studied. In the limiting case, it improves the recovery condition of MLE-ICA known in the literature by giving a necessary and sufficient condition. Since the parameter of interest in γ-ICA is an orthogonal matrix, a geometrical algorithm based on gradient flows on the special orthogonal group is introduced. Furthermore, a data-driven selection for the γ value, which is critical to the achievement of γ-ICA, is developed. The performance, especially the robustness, of γ-ICA is demonstrated through experimental studies using simulated data and image data.

    Plasma pharmacokinetics after combined therapy of gemcitabine and oral S-1 for unresectable pancreatic cancer

    Background: The combination of gemcitabine (GEM) and S-1, an oral 5-fluorouracil (5-FU) derivative, has been shown to be a promising regimen for patients with unresectable pancreatic cancer. Methods: Six patients with advanced pancreatic cancer were enrolled in this pharmacokinetics (PK) study. These patients were treated by oral administration of S-1 30 mg/m² twice daily for 28 consecutive days, followed by a 14-day rest period and intravenous administration of GEM 800 mg/m² on days 1, 15 and 29 of each course. The PK parameters of GEM and/or 5-FU after GEM single-administration, S-1 single-administration, and co-administration of GEM with pre-administration of S-1 at 2-h intervals were analyzed. Results: The maximum concentration (Cmax), the area under the curve from the drug administration to the infinite time (AUCinf), and the elimination half-life (T1/2) of GEM were not significantly different between GEM administration with and without S-1. The Cmax, AUCinf, T1/2, and the time required to reach Cmax (Tmax) were not significantly different between S-1 administration with and without GEM. Conclusion: There were no interactions between GEM and S-1 regarding plasma PK of GEM and 5-FU.

    A boosting method for maximizing the partial area under the ROC curve

    Background: The receiver operating characteristic (ROC) curve is a fundamental tool for assessing the discriminant performance not only of a single marker but also of a score function combining multiple markers. The area under the ROC curve (AUC) for a score function measures the intrinsic ability of the score function to discriminate between controls and cases. Recently, the partial AUC (pAUC) has received more attention than the AUC, because it can focus on a range of the false positive rate suited to the clinical situation. However, existing pAUC-based methods only handle a few markers and do not take nonlinear combinations of markers into consideration. Results: We have developed a new statistical method that focuses on the pAUC based on a boosting technique. The markers are combined componentwise to maximize the pAUC in the boosting algorithm, using natural cubic splines or decision stumps (single-level decision trees) according to whether the marker values are continuous or discrete. We show that the resulting score plots are useful for understanding how each marker is associated with the outcome variable. We compare the performance of the proposed boosting method with that of other existing methods, and demonstrate its utility using real data sets. As a result, we obtain much better discrimination performance in the sense of the pAUC in both simulation studies and real data analysis. Conclusions: The proposed method addresses how to combine the markers after a pAUC-based filtering procedure in a high-dimensional setting. Hence, it provides a consistent way of analyzing data based on the pAUC, from marker selection to marker combination, for discrimination problems. The method can capture not only linear but also nonlinear association between the outcome variable and the markers, and such nonlinearity is known to be necessary in general for maximizing the pAUC. The method also emphasizes classification accuracy as well as interpretability of the association, by offering simple and smooth score plots for each marker.
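As a rough illustration of the quantity this abstract is concerned with, the empirical pAUC over a false positive rate range [0, fpr_max] can be computed directly from marker scores. This is a minimal sketch of the pAUC itself, not of the boosting method described above; the function names `roc_points` and `partial_auc` and the toy scores are assumptions made for illustration.

```python
import numpy as np


def roc_points(controls, cases):
    """Return empirical ROC points (FPR, TPR), starting at (0, 0).

    A subject is called positive when its score is >= the threshold.
    """
    controls = np.asarray(controls, dtype=float)
    cases = np.asarray(cases, dtype=float)
    # Sweep thresholds from the largest observed score to the smallest.
    thresholds = np.unique(np.concatenate([controls, cases]))[::-1]
    fpr, tpr = [0.0], [0.0]
    for t in thresholds:
        fpr.append(float(np.mean(controls >= t)))
        tpr.append(float(np.mean(cases >= t)))
    return np.array(fpr), np.array(tpr)


def partial_auc(controls, cases, fpr_max=1.0):
    """Trapezoidal area under the empirical ROC curve for FPR in [0, fpr_max]."""
    fpr, tpr = roc_points(controls, cases)
    area = 0.0
    for i in range(1, len(fpr)):
        x0, x1 = fpr[i - 1], fpr[i]
        y0, y1 = tpr[i - 1], tpr[i]
        if x0 >= fpr_max:
            break  # the rest of the curve lies outside the FPR range
        if x1 > fpr_max:
            # Clip the last segment at fpr_max by linear interpolation.
            y1 = y0 + (y1 - y0) * (fpr_max - x0) / (x1 - x0)
            x1 = fpr_max
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area


# Hypothetical toy scores: two controls and two cases.
controls = [1.0, 3.0]
cases = [2.0, 4.0]
full_auc = partial_auc(controls, cases, fpr_max=1.0)
p_auc = partial_auc(controls, cases, fpr_max=0.5)
```

With `fpr_max=1.0` the function returns the ordinary AUC, so the pAUC can be read as the AUC restricted to the clinically relevant low false positive range.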

    Phosphorylated Smad2 in Advanced Stage Gastric Carcinoma

    Background: Transforming growth factor β (TGFβ) receptor signaling is closely associated with the invasion ability of gastric cancer cells. Although the Smad signal is a critical integrator of TGFβ receptor signaling transduction systems, not much is known about the role of Smad2 expression in gastric carcinoma. The aim of the current study is to clarify the role of phosphorylated Smad2 (p-Smad2) in gastric adenocarcinomas at advanced stages. Methods: Immunohistochemical staining with anti-p-Smad2 was performed on paraffin-embedded specimens from 135 patients with advanced gastric adenocarcinomas. We also evaluated the relationship between the expression levels of p-Smad2 and clinicopathologic characteristics of patients with gastric adenocarcinomas. Results: The p-Smad2 expression level was high in 63 (47%) of 135 gastric carcinomas. The p-Smad2 expression level was significantly higher in diffuse type carcinoma (p = 0.007), tumours with peritoneal metastasis (p = 0.017), and tumours with lymph node metastasis (p = 0.047). The prognosis for p-Smad2-high patients was significantly (p = 0.035, log-rank) poorer than that of p-Smad2-low patients, while a multivariate analysis revealed that p-Smad2 expression was not an independent prognostic factor. Conclusion: The expression of p-Smad2 is associated with malignant phenotype and poor prognosis in patients with advanced gastric carcinoma.

    Statistical methods for imbalanced data in ecological and biological studies

    This book presents a fresh approach in that it provides a comprehensive recent review of challenging problems caused by imbalanced data in prediction and classification, and also in that it introduces several of the latest statistical methods for dealing with these problems. The book discusses the imbalance of data from two points of view. The first is quantitative imbalance, meaning that the sample size in one population greatly outnumbers that in another population. It includes presence-only data as an extreme case, where the presence of a species is confirmed, whereas the information on its absence is uncertain, which is especially common in ecology in predicting habitat distribution. The second is qualitative imbalance, meaning that the data distribution of one population can be well specified whereas that of the other one shows a highly heterogeneous property. A typical case is the existence of outliers commonly observed in gene expression data, and another is heterogeneous characteristics often observed in a case group in case-control studies. The extension of the logistic regression model, maxent, and AdaBoost for imbalanced data is discussed, providing a new framework for improvement of prediction, classification, and performance of variable selection. Weight functions introduced in the methods play an important role in alleviating the imbalance of data. This book also furnishes a new perspective on these problems and shows some applications of the recently developed statistical methods to real data sets.

    A Unified Formulation of k-Means, Fuzzy c-Means and Gaussian Mixture Model by the Kolmogorov–Nagumo Average

    Clustering is a major unsupervised learning task and is widely applied in data mining and statistical data analyses. Typical examples include k-means, fuzzy c-means, and Gaussian mixture models, which are categorized into hard, soft, and model-based clusterings, respectively. We propose a new clustering, called Pareto clustering, based on the Kolmogorov–Nagumo average, which is defined by a survival function of the Pareto distribution. The proposed algorithm incorporates all the aforementioned clusterings plus maximum-entropy clustering. We introduce a probabilistic framework for the proposed method, in which the underlying distribution to give consistency is discussed. We build the minorize-maximization algorithm to estimate the parameters in Pareto clustering. We compare the performance with existing methods in simulation studies and in benchmark dataset analyses to demonstrate its practical utility.
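For readers unfamiliar with the term, the Kolmogorov–Nagumo average of values x_1, ..., x_n is the generalized mean φ⁻¹((1/n) Σ φ(x_i)) for a continuous, strictly monotone generator φ. The sketch below shows only that general definition with three textbook generators; it does not reproduce the Pareto-based generator or the clustering algorithm of the abstract, and the name `kn_average` is hypothetical.

```python
import math
from typing import Callable, Sequence


def kn_average(
    xs: Sequence[float],
    phi: Callable[[float], float],
    phi_inv: Callable[[float], float],
) -> float:
    """Kolmogorov-Nagumo (quasi-arithmetic) mean: phi_inv(mean(phi(x)))."""
    if len(xs) == 0:
        raise ValueError("xs must contain at least one value")
    return phi_inv(sum(phi(x) for x in xs) / len(xs))


values = [1.0, 4.0]

# phi(x) = x recovers the arithmetic mean.
arithmetic = kn_average(values, lambda x: x, lambda y: y)

# phi(x) = log(x) recovers the geometric mean (positive inputs only).
geometric = kn_average(values, math.log, math.exp)

# phi(x) = 1/x recovers the harmonic mean (nonzero inputs only).
harmonic = kn_average(values, lambda x: 1.0 / x, lambda y: 1.0 / y)
```

Changing the generator φ changes how strongly large or small values dominate the average, which is the kind of freedom a unified formulation of several clustering criteria can exploit.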

    Binary Classification with a Pseudo Exponential Model and Its Application for Multi-Task Learning

    In this paper, we investigate the basic properties of binary classification with a pseudo model based on the Itakura–Saito distance and reveal that the Itakura–Saito distance is a unique appropriate measure for estimation with the pseudo model in the framework of general Bregman divergence. Furthermore, we propose a novel multi-task learning algorithm based on the pseudo model in the framework of the ensemble learning method. We focus on a specific setting of the multi-task learning for binary classification problems. The set of features is assumed to be common among all tasks, which are our targets of performance improvement. We consider a situation where the shared structures among the dataset are represented by divergence between underlying distributions associated with multiple tasks. We discuss statistical properties of the proposed method and investigate the validity of the proposed method with numerical experiments.

    Multiple Suboptimal Solutions for Prediction Rules in Gene Expression Data

    This paper discusses mathematical and statistical aspects of analysis methods applied to microarray gene expression data. We focus on pattern recognition to extract informative features embedded in the data for prediction of phenotypes. It has been pointed out that severe difficulties arise from the imbalance between the number of observed genes and the number of observed subjects. We reanalyze published microarray gene expression data and detect many other gene sets with almost the same performance. We conclude that, at the current stage, it is not possible to extract only the informative genes with high performance from among all observed genes. We investigate why this difficulty persists even though analysis methods and learning algorithms are actively being proposed in statistical machine learning. We focus on the mutual coherence, or the absolute value of the Pearson correlation between two genes, and describe the distributions of the correlation for the selected set of genes and for the total set. We show that the problem of finding informative genes in high-dimensional data is ill-posed and that the difficulty is closely related to the mutual coherence.
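The mutual coherence mentioned above can be sketched numerically as the largest absolute Pearson correlation between two distinct genes. The code below is an illustration under that assumed definition, with rows as subjects and columns as genes; the function name `mutual_coherence` and the toy matrix are hypothetical and are not taken from the paper.

```python
import numpy as np


def mutual_coherence(X):
    """Largest absolute Pearson correlation between two distinct columns.

    X has shape (n_subjects, n_genes). Columns with zero variance give
    undefined correlations (NaN) and should be removed beforehand.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)  # genes are columns
    abs_corr = np.abs(corr)
    np.fill_diagonal(abs_corr, 0.0)  # ignore each gene's correlation with itself
    return float(abs_corr.max())


# Hypothetical toy data: the second gene is exactly twice the first,
# so the two are perfectly correlated.
X_toy = np.array(
    [
        [1.0, 2.0, 0.0],
        [2.0, 4.0, 1.0],
        [3.0, 6.0, 0.0],
        [4.0, 8.0, 1.0],
    ]
)
coherence = mutual_coherence(X_toy)
```

A coherence close to 1 means at least one pair of genes is nearly interchangeable as a predictor, which is one way to see why many different gene sets can reach almost the same performance.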