354 research outputs found

    MUSE: Modularizing Unsupervised Sense Embeddings

    Full text link
    This paper proposes to address the word sense ambiguity issue in an unsupervised manner, where word sense representations are learned along with a word sense selection mechanism given contexts. Prior work focused on designing a single model to deliver both mechanisms, and thus suffered from either coarse-grained representation learning or inefficient sense selection. The proposed modular approach, MUSE, implements flexible modules that optimize the two mechanisms separately, achieving the first purely sense-level representation learning system with linear-time sense selection. We leverage reinforcement learning to enable joint training of the proposed modules, and introduce various exploration techniques on sense selection for better robustness. Experiments on benchmark data show that the proposed approach achieves state-of-the-art performance on synonym selection as well as on contextual word similarity in terms of MaxSimC.
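    As a rough illustration of the evaluation side (not the paper's actual model), the sketch below shows context-based sense selection followed by the MaxSimC score; the dot-product selection rule and the vector shapes are assumptions.

```python
import numpy as np

def select_sense(sense_vecs, context_vec):
    """Pick the sense whose embedding best matches the context.

    An argmax over a word's senses is linear in the number of senses,
    mirroring the linear-time selection the abstract claims; the exact
    scoring function used by MUSE is an assumption here.
    """
    scores = sense_vecs @ context_vec  # (num_senses, dim) @ (dim,)
    return int(np.argmax(scores))

def max_sim_c(senses_w1, senses_w2, ctx1, ctx2):
    """MaxSimC: cosine similarity between the single sense of each
    word selected in its respective context."""
    s1 = senses_w1[select_sense(senses_w1, ctx1)]
    s2 = senses_w2[select_sense(senses_w2, ctx2)]
    return float(s1 @ s2 / (np.linalg.norm(s1) * np.linalg.norm(s2)))
```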

    Topic Similarity Networks: Visual Analytics for Large Document Sets

    Full text link
    We investigate ways in which to improve the interpretability of LDA topic models by better analyzing and visualizing their outputs. We focus on examining what we refer to as topic similarity networks: graphs in which nodes represent latent topics in text collections and links represent similarity among topics. We describe efficient and effective approaches to both building and labeling such networks. Visualizations of topic models based on these networks are shown to be a powerful means of exploring, characterizing, and summarizing large collections of unstructured text documents. They help to "tease out" non-obvious connections among different sets of documents and provide insights into how topics form larger themes. We demonstrate the efficacy and practicality of these approaches through two case studies: 1) NSF grants for basic research spanning a 14-year period and 2) the entire English portion of Wikipedia.
    Comment: 9 pages; 2014 IEEE International Conference on Big Data (IEEE BigData 2014).
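    A minimal sketch of the core construction, assuming topics come from a fitted LDA model as a topic-word probability matrix; cosine similarity and a fixed edge threshold are assumptions, as the paper explores several similarity and labeling schemes.

```python
import numpy as np
import networkx as nx

def topic_similarity_network(topic_word, threshold=0.2):
    """Build a graph whose nodes are topics and whose edges link
    topics with similar word distributions.

    topic_word: (num_topics, vocab_size) matrix of per-topic word
    probabilities, e.g. from a fitted LDA model.
    """
    norms = np.linalg.norm(topic_word, axis=1, keepdims=True) + 1e-12
    unit = topic_word / norms
    sim = unit @ unit.T  # pairwise cosine similarities between topics
    g = nx.Graph()
    g.add_nodes_from(range(topic_word.shape[0]))
    for i in range(sim.shape[0]):
        for j in range(i + 1, sim.shape[0]):
            if sim[i, j] >= threshold:
                g.add_edge(i, j, weight=float(sim[i, j]))
    return g
```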

    Streaming and Sketch Algorithms for Large Data NLP

    Get PDF
    The availability of large and rich quantities of text data is due to the emergence of the World Wide Web, social media, and mobile devices. Such vast data sets have led to leaps in the performance of many statistically-based problems. Given the large magnitude of text data available, it is computationally prohibitive to train many complex Natural Language Processing (NLP) models on large data. This motivates the hypothesis that simple models trained on big data can outperform more complex models trained on small data. My dissertation provides a solution to effectively and efficiently exploit large data in many NLP applications. Datasets are growing at an exponential rate, much faster than available memory. To provide a memory-efficient solution for handling large datasets, this dissertation shows the limitations of existing streaming and sketch algorithms when applied to canonical NLP problems and proposes several new variants to overcome those shortcomings. Streaming and sketch algorithms process large data sets in one pass and represent them with a compact summary, much smaller than the full size of the input. These algorithms can easily be implemented in a distributed setting and provide a solution that is both memory- and time-efficient. However, the memory and time savings come at the expense of approximate solutions. In this dissertation, I demonstrate that approximate solutions achieved on large data are comparable to exact solutions on large data and outperform exact solutions on smaller data. I focus on NLP problems that boil down to tracking many statistics: storing approximate counts, computing approximate association scores like pointwise mutual information (PMI), finding frequent items (such as n-grams), building streaming language models, and measuring distributional similarity. First, I introduce the concept of approximate streaming large-scale language models in NLP. Second, I present a novel variant of the Count-Min sketch that maintains approximate counts of all items. Third, I conduct a systematic study comparing many sketch algorithms that approximate counts of items, with a focus on large-scale NLP tasks. Last, I develop the fast large-scale approximate graph (FLAG), a system that quickly constructs a large-scale approximate nearest-neighbor graph from a large corpus.
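    The dissertation builds on the standard Count-Min sketch; a textbook version is sketched below for reference (the proposed variants modify this baseline, and Python's built-in hash stands in for proper pairwise-independent hash functions).

```python
import random

class CountMinSketch:
    """Textbook Count-Min sketch: approximate counts of stream items
    using width*depth counters instead of a full table. With
    width = ceil(e/eps) and depth = ceil(ln(1/delta)), estimates
    overcount by at most eps*N with probability at least 1 - delta."""

    def __init__(self, width=2048, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]
        # One independent hash seed per row.
        self.seeds = [rng.getrandbits(32) for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.seeds[row], item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.tables[r][self._index(r, item)] += count

    def query(self, item):
        # Minimum over rows: estimates never undercount, only overcount.
        return min(self.tables[r][self._index(r, item)]
                   for r in range(self.depth))
```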

    Context-Aware Model Hierarchies for Quantifying Higher-Dimensional Uncertainty

    Get PDF
    We formulate four novel context-aware algorithms based on model hierarchies, aimed at enabling an efficient quantification of uncertainty in complex, computationally expensive problems such as fluid-structure interaction and plasma microinstability simulations. Our results show that our algorithms are more efficient than standard approaches and that they are able to cope with the challenges of quantifying uncertainty in higher-dimensional, complex problems.
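    The abstract does not spell out the four algorithms, but a standard building block for model hierarchies in uncertainty quantification is a two-level control-variate Monte Carlo estimator, sketched below with hypothetical model callables: the mean of many cheap low-fidelity runs is corrected by a bias estimate from a few expensive high-fidelity runs.

```python
import numpy as np

def two_level_estimate(f_hi, f_lo, sampler, n_hi=50, n_lo=5000, seed=0):
    """Two-level control-variate Monte Carlo estimate of E[f_hi(X)].

    f_hi, f_lo: high- and low-fidelity models (hypothetical callables);
    sampler(n, rng): draws n input samples. The cheap model's mean is
    estimated on many samples, the bias E[f_hi - f_lo] on few.
    """
    rng = np.random.default_rng(seed)
    x_lo = sampler(n_lo, rng)
    x_hi = sampler(n_hi, rng)
    mean_lo = np.mean([f_lo(x) for x in x_lo])
    correction = np.mean([f_hi(x) - f_lo(x) for x in x_hi])
    return mean_lo + correction
```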

    A Deep Topical N-gram Model and Topic Discovery on COVID-19 News and Research Manuscripts

    Get PDF
    Topic modeling with latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and the biterm topic model (BTM) has been successfully implemented and used in many areas, including movie reviews, recommender systems, and text summarization. However, these models can become computationally intensive when applied to a very large corpus. Considering the wide acceptance of machine learning based on deep neural networks, this research proposes two deep neural network (NN) variants, a 2-layer NN and a 3-layer NN, of the LDA modeling technique. The primary goal is to deal with problems with a large corpus using manageable computational resources. This thesis analyzes two datasets related to COVID-19 to explore their underlying structures. The first dataset includes over 7,000 CBC COVID-19-related news articles for the period of January 9, 2020 to May 3, 2020. The second dataset, called CORD-19, includes over 100,000 research manuscripts related to COVID-19 for the period of January 2, 2020 to August 1, 2020. We discovered that in the first dataset, 14 topics, including "traveling", "lockdown", and "masks", were the focus of social media attention during the period of January to May 2020. For the second dataset, 17 topics, including "vaccine", "treatment", and "social distancing", were identified as the focus of research articles for the period of January to August 2020. Compared to the traditional LDA, our proposed model requires less computation time and shows better performance.
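    For reference, a minimal example of the classical LDA baseline the thesis compares against (not the proposed neural variants), using scikit-learn on a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "travel restrictions and border lockdown measures",
    "masks and social distancing slow the spread",
    "vaccine trials and treatment candidates",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Top words per topic, analogous to labels like "masks" or "vaccine".
vocab = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [vocab[i] for i in comp.argsort()[-4:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```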

    Efficient Bayesian travel-time tomography with geologically-complex priors using sensitivity-informed polynomial chaos expansion and deep generative networks

    Full text link
    Markov chain Monte Carlo (MCMC) methods commonly confront two fundamental challenges: the accurate characterization of the prior distribution and the efficient evaluation of the likelihood. In the context of Bayesian studies on tomography, principal component analysis (PCA) can in some cases facilitate the straightforward definition of the prior distribution, while simultaneously enabling the implementation of accurate surrogate models based on polynomial chaos expansion (PCE) to replace computationally intensive full-physics forward solvers. In scenarios where PCA does not offer a direct means of easily defining the prior distribution, alternative methods such as deep generative models (e.g., variational autoencoders (VAEs)) can be employed as viable options. However, accurately producing a surrogate capable of capturing the intricate non-linear relationship between the latent parameters of a VAE and the outputs of forward modeling presents a notable challenge. Indeed, while PCE models provide high accuracy when the input-output relationship can be effectively approximated by relatively low-degree multivariate polynomials, this condition is typically unmet when utilizing latent variables derived from deep generative models. In this contribution, we present a strategy that combines the excellent reconstruction performance of VAEs in terms of prior representation with the accuracy of PCA-PCE surrogate modeling in the context of Bayesian ground-penetrating radar (GPR) travel-time tomography. Within the MCMC process, the parametrization of the VAE is leveraged for prior exploration and sample proposal. Concurrently, modeling is conducted using PCE, which operates on either globally or locally defined principal components of the VAE samples under examination.
    Comment: 25 pages, 15 figures.
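    A minimal sketch of the described workflow, random-walk Metropolis in the VAE latent space with a surrogate likelihood; decode and surrogate are hypothetical stand-ins for the paper's VAE decoder and PCA-PCE surrogate, and the standard-normal latent prior, latent dimensionality, and noise level are assumptions.

```python
import numpy as np

def latent_mcmc(decode, surrogate, data, n_steps=10000, step=0.1, seed=0):
    """Random-walk Metropolis over VAE latent variables z.

    decode(z): maps a latent vector to a subsurface model (hypothetical);
    surrogate(m): approximate travel-time forward model (hypothetical);
    data: observed travel times as a 1-D array.
    """
    rng = np.random.default_rng(seed)
    dim = 20     # assumed latent dimensionality
    sigma = 1.0  # assumed data-noise standard deviation

    def log_post(z):
        log_prior = -0.5 * z @ z  # standard-normal VAE latent prior
        resid = data - surrogate(decode(z))
        return log_prior - 0.5 * resid @ resid / sigma**2

    z = rng.standard_normal(dim)
    lp = log_post(z)
    samples = []
    for _ in range(n_steps):
        z_new = z + step * rng.standard_normal(dim)
        lp_new = log_post(z_new)
        # Metropolis accept/reject on the log-posterior ratio.
        if np.log(rng.random()) < lp_new - lp:
            z, lp = z_new, lp_new
        samples.append(z.copy())
    return np.array(samples)
```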