
    Search Result Clustering via Randomized Partitioning of Query-Induced Subgraphs

    In this paper, we present an approach to search result clustering based on partitioning of the underlying link graph. We define the notion of a "query-induced subgraph" and formulate the problem of search result clustering as one of efficiently partitioning a given subgraph into topic-related clusters. We also propose a novel algorithm for approximate partitioning of such a graph, which yields cluster quality comparable to that obtained by deterministic algorithms while running in more efficient computation time, suitable for practical implementations. Finally, we present a practical clustering search engine developed as part of this research and use it to report on the real-world performance of the proposed concepts.
    Comment: 16th Telecommunications Forum TELFOR 2008
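    The abstract does not spell out the partitioning algorithm itself, so the following is only a minimal sketch of the two ingredients it names: restricting the link graph to a query's result set, and clustering that subgraph with a randomized procedure. All names here (`query_induced_subgraph`, `randomized_partition`) are hypothetical, and seeded label propagation stands in for the paper's approximate partitioner.

```python
import random
from collections import defaultdict

def query_induced_subgraph(link_graph, result_pages):
    """Restrict the full link graph to the pages returned for a query.

    `link_graph` maps a page id to the set of page ids it links to;
    both names are hypothetical stand-ins for the paper's definitions.
    """
    pages = set(result_pages)
    return {u: {v for v in link_graph.get(u, ()) if v in pages} for u in pages}

def randomized_partition(subgraph, rounds=10, seed=0):
    """One plausible randomized partitioner: seeded label propagation.

    Each node starts in its own cluster; in every round the nodes adopt,
    in random order, the most common label among their neighbours.
    """
    rng = random.Random(seed)
    # treat links as undirected for clustering purposes
    neighbours = defaultdict(set)
    for u, vs in subgraph.items():
        for v in vs:
            neighbours[u].add(v)
            neighbours[v].add(u)
    label = {u: u for u in subgraph}
    nodes = list(subgraph)
    for _ in range(rounds):
        rng.shuffle(nodes)
        for u in nodes:
            if not neighbours[u]:
                continue
            counts = defaultdict(int)
            for v in neighbours[u]:
                counts[label[v]] += 1
            label[u] = max(counts, key=counts.get)
    clusters = defaultdict(set)
    for u, lab in label.items():
        clusters[lab].add(u)
    return list(clusters.values())
```

    Randomness enters only through the visiting order, so the procedure trades the exactness of a deterministic cut for near-linear work per round, which is the efficiency/quality trade-off the abstract describes.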

    Robustness in sparse linear models: relative efficiency based on robust approximate message passing

    Understanding efficiency in high-dimensional linear models is a longstanding problem of interest. Classical work on smaller-dimensional problems, dating back to Huber and Bickel, has illustrated the benefits of efficient loss functions. When the number of parameters $p$ is of the same order as the sample size $n$, $p \approx n$, an efficiency pattern different from Huber's was recently established. In this work, we consider the effects of model selection on the estimation efficiency of penalized methods. In particular, we explore whether sparsity results in new efficiency patterns when $p > n$. In the interest of deriving the asymptotic mean squared error for regularized M-estimators, we use the powerful framework of approximate message passing. We propose a novel, robust and sparse approximate message passing algorithm (RAMP) that is adaptive to the error distribution. Our algorithm accommodates many non-quadratic and non-differentiable loss functions. We derive its asymptotic mean squared error and show its convergence, while allowing $p, n, s \to \infty$, with $n/p \in (0,1)$ and $n/s \in (1,\infty)$. We identify new patterns of relative efficiency for a number of penalized $M$-estimators when $p$ is much larger than $n$. We show that the classical information bound is no longer reachable, even for light-tailed error distributions. We show that the penalized least absolute deviation estimator dominates the penalized least squares estimator in the case of heavy-tailed distributions. We observe this pattern for all choices of the number of non-zero parameters $s$, both $s \leq n$ and $s \approx n$. In non-penalized problems, where $s = p \approx n$, the opposite regime holds. Therefore, we discover that the presence of model selection significantly changes the efficiency patterns.
    Comment: 49 pages, 10 figures
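    To make the "robust plus sparse" combination concrete, here is a toy AMP-style loop in which a bounded (Huber) score handles heavy-tailed errors and soft thresholding keeps the iterate sparse. This is a hedged sketch only: `ramp_sketch` and its tuning constants are hypothetical, and the paper's actual RAMP recursion, Onsager correction, state evolution and adaptive loss are considerably more careful.

```python
import numpy as np

def soft_threshold(x, t):
    # sparsity-inducing denoiser used by AMP for penalized problems
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def huber_score(r, k=1.345):
    # derivative of the Huber loss: bounds the influence of outlier residuals
    return np.clip(r, -k, k)

def ramp_sketch(A, y, theta=1.0, iters=50):
    """Toy AMP-style iteration combining a robust score with soft thresholding.

    Assumes the columns of A are normalised to unit Euclidean norm. This only
    illustrates the robust-plus-sparse shape of the iteration, not the exact
    RAMP updates derived in the paper.
    """
    n, p = A.shape
    beta = np.zeros(p)
    z = np.zeros(n)
    for _ in range(iters):
        # Onsager-style corrected residual, passed through the robust score
        z = huber_score(y - A @ beta + (np.count_nonzero(beta) / n) * z)
        # denoise the pseudo-data to keep the iterate sparse
        beta = soft_threshold(beta + A.T @ z, theta)
    return beta
```

    Replacing `huber_score` with the identity recovers the familiar LASSO-flavoured AMP loop; bounding the score is what buys robustness to heavy-tailed errors.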

    Breaking the curse of dimensionality in regression

    Models with many signals, high-dimensional models, often impose structure on the signal strengths. The common assumption is that only a few signals are strong and most of the signals are zero or (collectively) close to zero. However, such a requirement might not be valid in many real-life applications. In this article, we are interested in conducting large-scale inference in models that might have signals of mixed strengths. The key challenge is that the signals not under testing might be collectively non-negligible (although individually small) and cannot be accurately learned. This article develops a new class of tests that arise from a moment-matching formulation. A virtue of these moment-matching statistics is their ability to borrow strength across features, adapt to the sparsity size and adjust for testing a growing number of hypotheses. The GRoup-level Inference of Parameter (GRIP) test harvests effective sparsity structures with hypothesis formulation for an efficient multiple testing procedure. Simulated data showcase that GRIP's error control is far better than that of the alternative methods. We develop a minimax theory demonstrating the optimality of GRIP for a broad range of models, including those where the model is a mixture of sparse and high-dimensional dense signals.
    Comment: 51 pages
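    As a hedged illustration of what "moment matching" can mean here, the simplest instance is second-moment matching of a group's z-scores against their null law: under a global null the z-scores are standard normal, so an inflated second moment signals that the group is collectively non-negligible even when no single signal is large. The function names below are hypothetical, and GRIP's sparsity adaptation and multiplicity adjustment are omitted.

```python
import numpy as np
from scipy.stats import norm

def moment_matching_stat(z):
    # under a global null the z-scores are N(0, 1), so E[z^2] = 1 and
    # Var(z^2) = 2; the centred, scaled sum is approximately N(0, 1)
    p = len(z)
    return np.sum(z**2 - 1.0) / np.sqrt(2.0 * p)

def group_level_test(z, alpha=0.05):
    # reject when the group's second moment is inflated beyond chance;
    # this borrows strength across all features in the group at once
    t = moment_matching_stat(np.asarray(z, dtype=float))
    pval = norm.sf(t)
    return t, pval, pval < alpha
```

    Summing over the whole group is what lets many individually insignificant signals add up to a detectable effect, which is the "borrowing strength across features" the abstract refers to.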

    Historical development of judo

    This is an Accepted Manuscript of a book chapter published by Routledge in The Science of Judo on 14 June 2018, available online: https://www.routledge.com/The-Science-of-Judo/Callan/p/book/9780815349136
    Judo has its roots in the pre-history and mythology of Japan. Legend has it that the origin of the Imperial line is the result of a hand-to-hand wrestling match between two gods, in which Takemikazuchi threw Takeminakata (Ashkenazi, 2008), whilst Nomi no Sukune is regarded as the creator of the earliest form of sumo, following a famous match in 23 BC at the request of Emperor Suinin (Guttmann & Thompson, 2001); Mifune refers to him as the very founder of judo (Mifune, 1956). Early sumo was known as sumai (to struggle). Sumai, applied to combat, became known as kumiuchi (grappling in armour) (Levinson & Christensen, 1996). Kumiuchi is still seen today within Koshiki no Kata, which must be demonstrated for a Kōdōkan promotion to 8th Dan. Through the Muromachi and Sengoku periods of Japanese history (1333-1568), combat systems involving archery, swordsmanship and spearmanship were developed as the various clans battled with each other (Nippon-Budōkan, 2009). The feudal warrior class, the samurai or bushi, trained in several martial arts, collectively termed bugei. The samurai or bushi culture and lifestyle was known as bushido. However, the introduction of the musket in 1543 changed warfare and led to armour becoming lighter, which meant greater possibilities for movement in combat once the warrior was unarmed (Hoare, 2009).

    Synthetic learner: model-free inference on treatments over time

    Understanding the effect of a particular treatment or policy pertains to many areas of interest, ranging from political economy and marketing to health care and personalized treatment studies. In this paper, we develop a non-parametric, model-free test for detecting the effects of treatment over time that extends the widely used Synthetic Control tests. The test is built on counterfactual predictions arising from many learning algorithms. In the Neyman-Rubin potential outcome framework with possible carry-over effects, we show that the proposed test is asymptotically consistent for stationary, beta-mixing processes. We do not assume that the class of learners necessarily captures the correct model. We also discuss estimates of the average treatment effect, and we provide regret bounds on the predictive performance. To the best of our knowledge, this is the first set of results that allow, for example, any Random Forest to be useful for provably valid statistical inference in the Synthetic Control setting. In experiments, we show that our Synthetic Learner is substantially more powerful than classical methods based on Synthetic Control or Difference-in-Differences, especially in the presence of non-linear outcome models.
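    A hedged sketch of the core idea: train a learner on pre-treatment data to predict the treated unit's outcome from the control units, then ask whether post-treatment prediction errors are inflated relative to pre-treatment ones. The function name and the naive permutation calibration below are illustrative only; the paper's test is constructed to remain valid under the serial dependence (beta mixing) that a plain permutation ignores.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def synthetic_learner_test(X_pre, y_pre, X_post, y_post, n_perm=999, seed=0):
    """Counterfactual-prediction test for a treatment effect over time.

    X_* hold control-unit outcomes per period, y_* the treated unit's
    outcome. A genuine treatment effect should inflate the learner's
    post-treatment prediction errors relative to pre-treatment ones.
    """
    rng = np.random.default_rng(seed)
    model = RandomForestRegressor(random_state=seed).fit(X_pre, y_pre)
    err_pre = np.abs(y_pre - model.predict(X_pre))
    err_post = np.abs(y_post - model.predict(X_post))
    observed = err_post.mean() - err_pre.mean()
    # naive permutation calibration; assumes exchangeable errors, which
    # time dependence violates -- the paper's calibration is more careful
    pooled = np.concatenate([err_pre, err_post])
    n_post = len(err_post)
    perm_stats = np.empty(n_perm)
    for b in range(n_perm):
        rng.shuffle(pooled)
        perm_stats[b] = pooled[-n_post:].mean() - pooled[:-n_post].mean()
    pval = (1 + np.sum(perm_stats >= observed)) / (n_perm + 1)
    return observed, pval
```

    Nothing in the test depends on the learner being correctly specified; any predictor (here a Random Forest) can be plugged in, which is the model-free property the abstract emphasizes.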

    High-dimensional semi-supervised learning: in search for optimal inference of the mean

    We provide a high-dimensional semi-supervised inference framework focused on the mean and variance of the response. Our data comprise an extensive set of observations of the covariate vectors and a much smaller set of labeled observations in which we observe both the response and the covariates. We allow the dimension of the covariates to be much larger than the sample size and impose weak conditions on the statistical form of the data. We provide new estimators of the mean and variance of the response that extend some of the recent results established for low-dimensional models. In particular, we do not always require consistent estimation of the functional form of the data. Together with estimation of the population mean and variance, we provide their asymptotic distributions and confidence intervals, where we showcase gains in efficiency compared to the sample mean and variance. Our procedure, with minor modifications, is then shown to make important contributions to inference about average treatment effects. We also investigate the robustness of estimation and coverage and showcase the widespread applicability and generality of the proposed method.
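    One classical instance of this idea, offered here only as a hedged sketch, is the regression-adjusted mean estimator: fit any predictor on the labeled data, average its predictions over the large unlabeled set, and add the labeled-sample residual mean as a bias correction. The helper name is hypothetical, and ridge regression merely stands in for the paper's high-dimensional machinery, which also targets the variance.

```python
import numpy as np
from sklearn.linear_model import Ridge

def semi_supervised_mean(X_lab, y_lab, X_unlab):
    """Regression-adjusted estimate of E[Y]:

        mean_hat = mean(f(X_unlab)) + mean(y_lab - f(X_lab)).

    The correction term keeps the estimator consistent even when f is
    misspecified, while a predictive f transfers information from the
    unlabeled covariates and shrinks the variance below that of the
    naive labeled-sample mean.
    """
    f = Ridge(alpha=1.0).fit(X_lab, y_lab)
    return f.predict(X_unlab).mean() + np.mean(y_lab - f.predict(X_lab))
```

    The efficiency gain over the plain sample mean grows with how much of the response's variance the covariates explain; when f is uninformative the estimator simply falls back to the labeled-sample mean.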