
    Fast and modular regularized topic modelling

    Topic modelling is an area of text mining that has been actively developed in the last 15 years. A probabilistic topic model extracts a set of hidden topics from a collection of text documents. It defines each topic by a probability distribution over words and describes each document with a probability distribution over topics. In applications, there are often many requirements, such as problem-specific knowledge and additional data, to be taken into account. Therefore, it is natural for topic modelling to be considered a multiobjective optimization problem. However, historically, Bayesian learning became the most popular approach for topic modelling. In the Bayesian paradigm, all requirements are formalized in terms of a probabilistic generative process. This approach is not always convenient due to some limitations and technical difficulties. In this work, we develop a non-Bayesian multiobjective approach called the Additive Regularization of Topic Models (ARTM). It is based on regularized Maximum Likelihood Estimation (MLE), and we show that many of the well-known Bayesian topic models can be re-formulated in a much simpler way using the regularization point of view. We review some of the most important types of topic models: multimodal, multilingual, temporal, hierarchical, graph-based, and short-text. The ARTM framework enables easy combination of different types of models to create new models with the desired properties for applications. This modular “lego-style” technology for topic modelling is implemented in the open-source library BigARTM.
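
    The additive-regularization idea can be summarized by a single objective: the log-likelihood of the collection plus a weighted sum of regularizers. The sketch below uses standard ARTM notation (Φ the word-topic matrix, Θ the topic-document matrix, n_dw word counts, regularizers R_i with weights τ_i); the notation is assumed from the ARTM literature rather than taken from this abstract.

```latex
% ARTM objective (sketch): regularized maximum likelihood over the
% stochastic matrices \Phi (word-topic) and \Theta (topic-document).
\max_{\Phi,\Theta}\;
    \sum_{d \in D} \sum_{w \in W} n_{dw}
        \ln \sum_{t \in T} \varphi_{wt}\,\theta_{td}
    \;+\; \sum_{i} \tau_i\, R_i(\Phi,\Theta)
\quad \text{s.t.}\quad
    \varphi_{wt} \ge 0,\ \sum_{w} \varphi_{wt} = 1,\qquad
    \theta_{td} \ge 0,\ \sum_{t} \theta_{td} = 1 .
```

    Combining models then amounts to adding their regularizers R_i with suitable weights τ_i, which is the “lego-style” modularity the abstract refers to.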

    Adaptive Regularized Submodular Maximization

    In this paper, we study the problem of maximizing the difference between an adaptive submodular (revenue) function and a non-negative modular (cost) function. The input of our problem is a set of n items, where each item has a particular state drawn from some known prior distribution. The revenue function g is defined over items and states, and the cost function c is defined over items, i.e., each item has a fixed cost. The state of each item is unknown initially and one must select an item in order to observe its realized state. A policy π specifies which item to pick next based on the observations made so far. Denote by g_{avg}(π) the expected revenue of π and let c_{avg}(π) denote the expected cost of π. Our objective is to identify the best policy π^o ∈ arg max_π g_{avg}(π) − c_{avg}(π) under a k-cardinality constraint. Since our objective function can take on both negative and positive values, the existing results of submodular maximization may not be applicable. To overcome this challenge, we develop a series of effective solutions with performance guarantees. Let π^o denote the optimal policy. For the case when g is adaptive monotone and adaptive submodular, we develop an effective policy π^l such that g_{avg}(π^l) − c_{avg}(π^l) ≥ (1 − 1/e − ε) g_{avg}(π^o) − c_{avg}(π^o), using only O(nε^{-2} log ε^{-1}) value oracle queries. For the case when g is adaptive submodular, we present a randomized policy π^r such that g_{avg}(π^r) − c_{avg}(π^r) ≥ (1/e) g_{avg}(π^o) − c_{avg}(π^o).
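
    The adaptive policies π^l and π^r above are what carry the stated guarantees; as a rough illustration of the objective's structure only, here is a Python sketch of a cost-aware greedy rule for the much simpler non-adaptive, deterministic special case (function and variable names are hypothetical, not from the paper).

```python
# Illustrative sketch only: cost-aware greedy for the non-adaptive,
# deterministic special case of  max g(S) - c(S)  subject to |S| <= k.
# It does NOT implement the paper's adaptive policies pi^l / pi^r and
# carries no approximation guarantee; it just shows the objective shape.

def greedy_revenue_minus_cost(items, g, cost, k):
    """items: iterable of hashable items
    g: set function (revenue), callable on a frozenset
    cost: dict mapping item -> fixed cost
    k: cardinality budget"""
    selected = frozenset()
    for _ in range(k):
        best_item, best_gain = None, 0.0
        for e in items:
            if e in selected:
                continue
            gain = g(selected | {e}) - g(selected) - cost[e]
            if gain > best_gain:
                best_item, best_gain = e, gain
        if best_item is None:          # no remaining item improves g - c
            break
        selected = selected | {best_item}
    return selected
```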

    An update on statistical boosting in biomedicine

    Statistical boosting algorithms have triggered a lot of research during the last decade. They combine a powerful machine-learning approach with classical statistical modelling, offering various practical advantages such as automated variable selection and implicit regularization of effect estimates. They are extremely flexible, as the underlying base-learners (regression functions defining the type of effect for the explanatory variables) can be combined with any kind of loss function (target function to be optimized, defining the type of regression setting). In this review article, we highlight the most recent methodological developments on statistical boosting regarding variable selection, functional regression and advanced time-to-event modelling. Additionally, we provide a short overview on relevant applications of statistical boosting in biomedicine.
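
    As a concrete and deliberately minimal example of the kind of algorithm the review covers, the Python sketch below implements component-wise L2 boosting: per-covariate linear base-learners, a squared-error loss, implicit variable selection through the choice of the best-fitting base-learner in each step, and regularization through a small step length and early stopping. It is a generic textbook version, not code from the review.

```python
import numpy as np

def componentwise_l2_boost(X, y, n_steps=100, nu=0.1):
    """Component-wise L2 boosting: only one coefficient is updated per
    iteration, so stopping early yields sparse, shrunken estimates."""
    n, p = X.shape
    coef = np.zeros(p)
    offset = y.mean()
    resid = y - offset
    for _ in range(n_steps):                        # n_steps acts as early stopping
        # least-squares slope of each univariate base-learner on the residuals
        betas = X.T @ resid / (X ** 2).sum(axis=0)
        sse = ((resid[:, None] - X * betas) ** 2).sum(axis=0)
        j = int(np.argmin(sse))                     # best-fitting covariate
        coef[j] += nu * betas[j]                    # shrunken update (step length nu)
        resid = y - offset - X @ coef
    return offset, coef
```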

    An Algorithmic Theory of Dependent Regularizers, Part 1: Submodular Structure

    We present an exploration of the rich theoretical connections between several classes of regularized models, network flows, and recent results in submodular function theory. This work unifies key aspects of these problems under a common theory, leading to novel methods for working with several important models of interest in statistics, machine learning and computer vision. In Part 1, we review the concepts of network flows and submodular function optimization theory foundational to our results. We then examine the connections between network flows and the minimum-norm algorithm from submodular optimization, extending and improving several current results. This leads to a concise representation of the structure of a large class of pairwise regularized models important in machine learning, statistics and computer vision. In Part 2, we describe the full regularization path of a class of penalized regression problems with dependent variables that includes the graph-guided LASSO and total variation constrained models. This description also motivates a practical algorithm that efficiently finds the regularization path of the discretized version of TV-penalized models. Ultimately, our new algorithms scale up to high-dimensional problems with millions of variables.
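
    For concreteness, the pairwise regularized models referred to above (graph-guided LASSO and total-variation constrained models) can be written in the generic form below, where G = (V, E) is the dependency graph, w_{ij} are edge weights and λ the regularization parameter; the notation is assumed for illustration rather than taken from the abstract.

```latex
% Generic graph-guided / total-variation penalized problem (sketch):
\min_{x \in \mathbb{R}^{V}}\;
    \tfrac{1}{2}\,\lVert y - x \rVert_2^2
    \;+\; \lambda \sum_{(i,j) \in E} w_{ij}\,\lvert x_i - x_j \rvert .
```

    For a fixed λ this proximal problem is known to reduce to a minimum-cut / parametric maximum-flow computation, which is the network-flow connection exploited here; the regularization path of Part 2 is the solution x(λ) traced over all values of λ.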

    Regularization approaches in clinical biostatistics: a review of methods and their applications

    A range of regularization approaches have been proposed in the data sciences to overcome overfitting, to exploit sparsity or to improve prediction. Using a broad definition of regularization, namely controlling model complexity by adding information in order to solve ill-posed problems or to prevent overfitting, we review a range of approaches within this framework, including penalization, early stopping, ensembling and model averaging. Aspects of their practical implementation are discussed, including available R packages, and examples are provided. To assess the extent to which these approaches are used in medicine, we conducted a review of three general medical journals. It revealed that regularization approaches are rarely applied in practical clinical applications, with the exception of random effects models. Hence, we suggest a more frequent use of regularization approaches in medical research. In situations where other approaches also work well, the only downside of regularization approaches is the increased complexity of the analyses, which can pose challenges in terms of computational resources and expertise on the part of the data analyst. In our view, both can and should be overcome by investments in appropriate computing facilities and educational resources.
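
    As a small illustration of two of the approaches covered (penalization and early stopping), the sketch below uses Python and scikit-learn; the review itself discusses R packages, so this is only an analogous example on synthetic data, not code from the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with many noise features, mimicking an overfitting-prone setting.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Penalization: L1-penalized logistic regression yields sparse coefficients.
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso_logit.fit(X_tr, y_tr)

# Early stopping: boosting halts once the validation loss stops improving.
gbm = GradientBoostingClassifier(n_estimators=1000, validation_fraction=0.2,
                                 n_iter_no_change=10, random_state=0)
gbm.fit(X_tr, y_tr)

print("lasso-logit test accuracy:", lasso_logit.score(X_te, y_te))
print("early-stopped boosting test accuracy:", gbm.score(X_te, y_te))
```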

    rags2ridges: A One-Stop-ℓ2-Shop for Graphical Modeling of High-Dimensional Precision Matrices

    A graphical model is an undirected network representing the conditional independence properties between random variables. Graphical modeling has become part and parcel of systems or network approaches to multivariate data, in particular when the variable dimension exceeds the observation dimension. rags2ridges is an R package for graphical modeling of high-dimensional precision matrices through ridge (ℓ2) penalties. It provides a modular framework for the extraction, visualization, and analysis of Gaussian graphical models from high-dimensional data. Moreover, it can handle the incorporation of prior information as well as multiple heterogeneous data classes. As such, it provides a one-stop-ℓ2-shop for graphical modeling of high-dimensional precision matrices. The functionality of the package is illustrated with an example dataset pertaining to blood-based metabolite measurements in persons suffering from Alzheimer’s disease.
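
    The idea behind ridge (ℓ2) penalized precision-matrix estimation can be illustrated in a few lines of Python. Note that the sketch uses a simplified ridge-type estimate, (S + λI)^{-1}, followed by thresholding of partial correlations; it is not the estimator or workflow of the rags2ridges R package itself.

```python
import numpy as np

def ridge_precision_graph(X, lam=0.5, threshold=0.1):
    """Simplified sketch: ridge-regularized precision matrix and the graph
    read off its partial correlations (not the rags2ridges estimator)."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)
    omega = np.linalg.inv(S + lam * np.eye(p))   # ridge-type precision estimate
    d = np.sqrt(np.diag(omega))
    partial_corr = -omega / np.outer(d, d)       # partial correlations
    np.fill_diagonal(partial_corr, 1.0)
    adjacency = (np.abs(partial_corr) > threshold) & ~np.eye(p, dtype=bool)
    return omega, adjacency

# Works even when p > n, where the sample covariance itself is singular.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 100))
omega, A = ridge_precision_graph(X)
```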

    Modular Regularization Algorithms
