3,755 research outputs found
A Stream-Suitable Kolmogorov-Smirnov-Type Test for Big Data Analysis
Big Data has become an ever more commonplace setting encountered by
data analysts. In the Big Data setting, analysts are faced with very large
numbers of observations as well as data that arrive as a stream, both of which
are phenomena that many traditional statistical techniques are unable to
contend with. Many of these traditional techniques remain useful, however,
and cannot simply be discarded. One such technique is the Kolmogorov-Smirnov (KS) test
for goodness-of-fit (GoF). A Big Data and stream-appropriate KS-type test is
derived via the chunked-and-averaged (CA) estimator paradigm. The new test is
termed the CAKS GoF test. The CAKS test statistic is proved to be
asymptotically normal, allowing for the large sample testing of GoF.
Furthermore, theoretical results demonstrate that the CAKS test is consistent
against both fixed alternatives, where the null and the true data generating
distribution are a fixed distance apart, and alternatives that approach the
null at a slow enough rate. Numerical results demonstrate that the CAKS test is
effective in identifying deviation in the distribution with respect to changes
in mean, variance, and shape. Furthermore, it is found that the CAKS test is
faster than the KS test for large numbers of observations, and can be applied
to sample sizes of $10^{9}$ and beyond.
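To make the chunked-and-averaged construction concrete, here is a minimal sketch (not the paper's implementation: the helper name caks_statistic is an assumption, and the paper's centering and scaling constants for the asymptotic normal test are omitted) that splits a sample into fixed-size chunks, computes the one-sample KS statistic on each, and averages:

    import numpy as np
    from scipy.stats import kstest

    def caks_statistic(stream, null_cdf, chunk_size):
        """Average of per-chunk one-sample KS statistics (full chunks only)."""
        chunks = [stream[i:i + chunk_size]
                  for i in range(0, len(stream) - chunk_size + 1, chunk_size)]
        stats = [kstest(chunk, null_cdf).statistic for chunk in chunks]
        return np.mean(stats), len(stats)

    # Example: 10**6 standard-normal draws tested against N(0, 1).
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000)
    avg_ks, n_chunks = caks_statistic(x, "norm", chunk_size=1_000)
    print(avg_ks, n_chunks)

Because each chunk is processed independently, the average can be updated as chunks arrive, which is what makes the construction stream-suitable; the paper's asymptotic normality result is what turns the centered and scaled average into a usable test statistic.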
A Novel Algorithm for Clustering of Data on the Unit Sphere via Mixture Models
A new maximum approximate likelihood (ML) estimation algorithm for mixtures
of Kent distributions is proposed. The new algorithm is constructed via
the BSLM (block successive lower-bound maximization) framework and incorporates
manifold optimization procedures within it. The BSLM algorithm is iterative and
monotonically increases the approximate log-likelihood function in each step.
Under mild regularity conditions, the BSLM algorithm is proved to be convergent
and the approximate ML estimator is proved to be consistent. A Bayesian
information criterion-like (BIC-like) model selection criterion is also derived,
for the task of choosing the number of components in the mixture distribution.
The approximate ML estimator and the BIC-like criterion are both demonstrated
to be successful via simulation studies. A model-based clustering rule is
proposed and also assessed favorably via simulations. Example applications of
the developed methodology are provided via an image segmentation task and a
neural imaging clustering problem.
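The BIC-like selection step can be sketched generically. Below, fit_kent_mixture is a hypothetical stand-in for the paper's BSLM-based approximate ML fitter (it is not a real library call), and the parameter count assumes five parameters per Kent component on the 2-sphere plus the mixing weights; the paper's exact penalty may differ:

    import numpy as np

    def bic_like(log_lik, n_params, n_obs):
        """Generic BIC-style criterion; smaller is better."""
        return -2.0 * log_lik + n_params * np.log(n_obs)

    def select_n_components(data, max_g, fit_kent_mixture):
        # fit_kent_mixture(data, g) is assumed (hypothetically) to return
        # the maximized approximate log-likelihood of a g-component mixture.
        best = None
        for g in range(1, max_g + 1):
            log_lik = fit_kent_mixture(data, g)
            n_params = 6 * g - 1  # 5 per component + (g - 1) mixing weights
            crit = bic_like(log_lik, n_params, len(data))
            if best is None or crit < best[1]:
                best = (g, crit)
        return best  # (chosen number of components, criterion value)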
Concentration-based confidence intervals for U-statistics
Concentration inequalities have become increasingly popular in machine
learning, probability, and statistical research. Using concentration
inequalities, one can construct confidence intervals (CIs) for many quantities
of interest. Unfortunately, many of these CIs require the knowledge of
population variances, which are generally unknown, making these CIs impractical
for numerical application. However, recent results regarding the simultaneous
bounding of the probabilities of quantities of interest and their variances
have permitted the construction of empirical CIs, where variances are replaced
by their sample estimators. Among these new results are two-sided empirical CIs
for U-statistics, which are useful for the construction of CIs for a rich class
of parameters. In this article, we derive a number of new one-sided empirical
CIs for U-statistics and their variances. We show that our one-sided CIs can be
used to construct tighter two-sided CIs for U-statistics than those currently
reported. We also demonstrate how our CIs can be used to construct new
empirical CIs for the mean, which provide tighter bounds than currently known
CIs for the same number of observations, under various settings.
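The article's new one-sided intervals are not reproduced here, but the flavor of an empirical CI, with the unknown population variance replaced by its sample estimator, is captured by the classical two-sided empirical Bernstein interval of Maurer and Pontil (2009) for the mean of [0, 1]-valued data; note that the sample variance is itself a U-statistic of order two, with kernel h(x, y) = (x - y)^2 / 2:

    import numpy as np

    def empirical_bernstein_ci(x, delta):
        """Two-sided empirical Bernstein CI for the mean of i.i.d. data in
        [0, 1]; the sample variance replaces the unknown population variance
        (Maurer & Pontil, 2009)."""
        x = np.asarray(x, dtype=float)
        n = x.size
        log_term = np.log(4.0 / delta)  # each one-sided bound run at delta/2
        slack = (np.sqrt(2.0 * x.var(ddof=1) * log_term / n)
                 + 7.0 * log_term / (3.0 * (n - 1)))
        return x.mean() - slack, x.mean() + slack

    rng = np.random.default_rng(1)
    lo, hi = empirical_bernstein_ci(rng.uniform(size=10_000), delta=0.05)
    print(lo, hi)  # should bracket the true mean 0.5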
A Note on the Convergence of the Gaussian Mean Shift Algorithm
Mean shift (MS) algorithms are popular methods for mode finding in pattern
analysis. Each MS algorithm can be phrased as a fixed-point iteration scheme,
which operates on a kernel density estimate (KDE) based on some data. The
ability of an MS algorithm to obtain the modes of its KDE depends on whether or
not the fixed-point scheme converges. The convergence of MS algorithms has
recently been proved under some general conditions via first-principles
arguments. We complement the recent proofs by demonstrating that the MS
algorithm operating on a Gaussian KDE can be viewed as an MM
(minorization-maximization) algorithm, and thus permits the application of
convergence techniques for such constructions. For the Gaussian case, we
extend the previous results by showing that the fixed points of the MS
algorithm are all stationary points of the KDE, even in cases where the
stationary points are not necessarily isolated.
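For concreteness, here is a minimal sketch of the Gaussian mean shift fixed-point iteration analyzed in the note; the bandwidth, tolerance, and stopping rule are illustrative choices:

    import numpy as np

    def gaussian_mean_shift(data, x0, bandwidth, tol=1e-8, max_iter=1000):
        """Fixed-point iteration x <- sum_i w_i x_i / sum_i w_i with Gaussian
        KDE weights; its fixed points are stationary points of the KDE."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            w = np.exp(-np.sum((data - x) ** 2, axis=1)
                       / (2.0 * bandwidth ** 2))
            x_new = w @ data / w.sum()
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        return x

    # Two well-separated blobs; the iteration climbs to the nearest mode.
    rng = np.random.default_rng(2)
    data = np.vstack([rng.normal(0.0, 0.3, (200, 2)),
                      rng.normal(3.0, 0.3, (200, 2))])
    print(gaussian_mean_shift(data, x0=[2.0, 2.0], bandwidth=0.5))

The MM reading corresponds to the fact that each update maximizes a surrogate that minorizes the Gaussian KDE at the current iterate, which forces the KDE value to increase monotonically along the iterates.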
A Simple Online Parameter Estimation Technique with Asymptotic Guarantees
In many modern settings, data are acquired iteratively over time, rather than
all at once. Such settings are known as online, as opposed to offline or batch.
We introduce a simple technique for online parameter estimation, which can
operate in low-memory settings and in settings where data are correlated, and
which requires only a single inspection of the available data at each time
period. We show
that the estimators---constructed via the technique---are asymptotically normal
under generous assumptions, and present a technique for the online computation
of the covariance matrices for such estimators. A set of numerical studies
demonstrates that our estimators can be as efficient as their offline
counterparts, and that our technique generates estimates and confidence
intervals that match their offline counterparts in various parameter estimation
settings.
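The abstract does not spell out the estimator itself, so the following shows only the generic single-pass pattern (a Welford-style running mean and covariance) on which online, low-memory estimation rests; it is a sketch of the setting, not the paper's technique:

    import numpy as np

    class OnlineMeanCov:
        """Single-pass running mean and covariance: each observation is
        inspected once and discarded; memory is O(d^2) regardless of
        stream length."""
        def __init__(self, dim):
            self.n = 0
            self.mean = np.zeros(dim)
            self._m2 = np.zeros((dim, dim))

        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self._m2 += np.outer(delta, x - self.mean)

        def cov(self):
            return self._m2 / (self.n - 1)

    est = OnlineMeanCov(dim=2)
    rng = np.random.default_rng(3)
    for _ in range(10_000):
        est.update(rng.multivariate_normal([1.0, -1.0],
                                           [[1.0, 0.3], [0.3, 2.0]]))
    print(est.mean, est.cov())

Given asymptotic normality, a coordinate-wise 95% confidence interval would then take the familiar form mean[i] +/- 1.96 * sqrt(cov()[i, i] / n).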
Construction of Complete Embedded Self-Similar Surfaces under Mean Curvature Flow. Part II
We study the Dirichlet problem associated with the self-similar surface
equation for graphs over the Euclidean plane with a disk removed. We show the
existence of a solution provided the boundary conditions on the boundary circle
are small enough and satisfy some symmetries. This is the second step towards
the construction of new examples of complete embedded self-similar surfaces
under mean curvature flow. Comment: 30 pages.
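For reference, the self-shrinker equation in question, in one common normalization (signs and constants vary by convention), together with its form for a graph $x_3 = u(y)$ over a planar domain:

    % Self-shrinker equation for mean curvature flow:
    H = \frac{\langle x, \nu \rangle}{2},
    % which for a graph x_3 = u(y) becomes the quasilinear equation
    \operatorname{div}\!\left(\frac{Du}{\sqrt{1 + |Du|^2}}\right)
      = \frac{u - y \cdot Du}{2\sqrt{1 + |Du|^2}}.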
The Decision-Theoretic Interactive Video Advisor
The need to help people choose among large numbers of items and to filter
through large amounts of information has led to a flood of research in
the construction of personal recommendation agents. One of the central issues in
constructing such agents is the representation and elicitation of user
preferences or interests. This topic has long been studied in Decision Theory,
but surprisingly little work in the area of recommender systems has made use of
formal decision-theoretic techniques. This paper describes DIVA, a
decision-theoretic agent for recommending movies that contains a number of
novel features. DIVA represents user preferences using pairwise comparisons
among items, rather than numeric ratings. It uses a novel similarity measure
based on the concept of the probability of conflict between two orderings of
items. The system has a rich representation of preference, distinguishing
between a user's general taste in movies and his immediate interests. It takes
an incremental approach to preference elicitation in which the user can provide
feedback if not satisfied with the recommendation list. We empirically evaluate
the performance of the system using the EachMovie collaborative filtering
database. Comment: Appears in Proceedings of the Fifteenth Conference on
Uncertainty in Artificial Intelligence (UAI 1999).
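A plain reading of the probability-of-conflict similarity, the chance that a uniformly random pair of commonly ranked items is ordered oppositely by two users, can be sketched as follows (the function name and exact normalization are assumptions, not DIVA's published formula):

    from itertools import combinations

    def conflict_probability(order_a, order_b):
        """Fraction of common item pairs ranked in opposite order by the
        two preference orderings (normalized Kendall disagreement)."""
        common = set(order_a) & set(order_b)
        pos_a = {item: i for i, item in enumerate(order_a)}
        pos_b = {item: i for i, item in enumerate(order_b)}
        pairs = list(combinations(common, 2))
        if not pairs:
            return 0.0
        conflicts = sum((pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) < 0
                        for u, v in pairs)
        return conflicts / len(pairs)

    # Two users' movie rankings, best first: 2 of the 6 pairs conflict.
    print(conflict_probability(["Alien", "Brazil", "Clue", "Dune"],
                               ["Brazil", "Alien", "Dune", "Clue"]))  # 1/3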
On the necessary condition for entire function with the increasing second quotients of Taylor coefficients to belong to the Laguerre-P\'olya class
For an entire function $f(z) = \sum_{k=0}^{\infty} a_k z^k$, $a_k > 0$, we
show that $f$ does not belong to the Laguerre-P\'olya class if the quotients
$q_n := a_{n-1}^2/(a_{n-2} a_n)$ are increasing in $n$, and
$\lim_{n\to\infty} q_n$ is smaller than an absolute constant
($q_\infty \approx 3.2336$).
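A quick numerical illustration of the second quotients: for $f(z) = e^z$, which does belong to the Laguerre-P\'olya class, one gets $q_n = n/(n-1)$, a decreasing sequence, so the theorem's increasing-quotient hypothesis does not apply:

    from fractions import Fraction
    from math import factorial

    def second_quotients(a, n_max):
        """q_n = a_{n-1}^2 / (a_{n-2} a_n) for Taylor coefficients a_k."""
        return [a(n - 1) ** 2 / (a(n - 2) * a(n)) for n in range(2, n_max + 1)]

    # For f(z) = e^z, a_k = 1/k!, so q_n = n/(n-1), decreasing toward 1.
    print(second_quotients(lambda k: Fraction(1, factorial(k)), 6))
    # [Fraction(2, 1), Fraction(3, 2), Fraction(4, 3), Fraction(5, 4), Fraction(6, 5)]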
Shrinking doughnuts via variational methods
We use variational methods and a modified curvature flow to give an
alternative proof of the existence of a self-shrinking torus under mean
curvature flow. As a consequence of the proof, we establish an upper bound for
the weighted energy of our shrinking doughnuts. Comment: 18 pages, 2 figures.
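The weighted energy referred to is, up to a normalizing constant, the Gaussian (Huisken) functional whose critical points are precisely the self-shrinkers:

    % Gaussian weighted energy (normalizing constants vary by convention):
    F(\Sigma) = \int_{\Sigma} e^{-|x|^2/4} \, d\mu .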
Mean curvature flow of an entire graph evolving away from the heat flow
We present two initial graphs over the entire $\mathbb{R}^n$, $n \geq 2$, for
which the mean curvature flow behaves differently from the heat flow. In the
first example, the two flows stabilize at different heights. With our second
example, the mean curvature flow oscillates indefinitely while the heat flow
stabilizes. These results highlight the difference between dimensions
$n \geq 2$ and dimension $n = 1$, where Nara-Taniguchi proved that entire
graphs in $C^{2,\alpha}(\mathbb{R})$ evolving under curve shortening flow
converge to solutions to the heat equation with the same initial data.
Comment: To appear in Proceedings of the AMS.
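The two evolutions being contrasted, written for a graph $u(\cdot, t)$ over $\mathbb{R}^n$:

    % Graphical mean curvature flow versus heat flow:
    u_t = \sqrt{1 + |Du|^2}\,
          \operatorname{div}\!\left(\frac{Du}{\sqrt{1 + |Du|^2}}\right),
    \qquad\text{vs.}\qquad
    u_t = \Delta u .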
- …