MUSE: Modularizing Unsupervised Sense Embeddings
This paper proposes to address the word sense ambiguity issue in an
unsupervised manner, where word sense representations are learned alongside a
word sense selection mechanism given contexts. Prior work focused on designing a
single model to deliver both mechanisms, and thus suffered from either
coarse-grained representation learning or inefficient sense selection. The
proposed modular approach, MUSE, implements flexible modules to optimize
distinct mechanisms, achieving the first purely sense-level representation
learning system with linear-time sense selection. We leverage reinforcement
learning to enable joint training on the proposed modules, and introduce
various exploration techniques on sense selection for better robustness. The
experiments on benchmark data show that the proposed approach achieves the
state-of-the-art performance on synonym selection as well as on contextual word
similarities in terms of MaxSimC.
Topic Similarity Networks: Visual Analytics for Large Document Sets
We investigate ways in which to improve the interpretability of LDA topic
models by better analyzing and visualizing their outputs. We focus on examining
what we refer to as topic similarity networks: graphs in which nodes represent
latent topics in text collections and links represent similarity among topics.
We describe efficient and effective approaches to both building and labeling
such networks. Visualizations of topic models based on these networks are shown
to be a powerful means of exploring, characterizing, and summarizing large
collections of unstructured text documents. They help to "tease out"
non-obvious connections among different sets of documents and provide insights
into how topics form larger themes. We demonstrate the efficacy and
practicality of these approaches through two case studies: 1) NSF grants for
basic research spanning a 14 year period and 2) the entire English portion of
Wikipedia.
Comment: 9 pages; 2014 IEEE International Conference on Big Data (IEEE BigData 2014).
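The core construction described above can be sketched in a few lines: given a topic-word matrix from a fitted LDA model, treat each topic as a node and link pairs of topics whose word distributions are sufficiently similar. This is a minimal illustrative sketch, assuming cosine similarity and a fixed threshold; the paper's own similarity measure and labeling approach are described in its full text.

```python
import numpy as np

def topic_similarity_network(topic_word, threshold=0.5):
    """Build a topic similarity network from a topic-word matrix.

    topic_word: (K, V) array, each row a topic's distribution over the vocabulary.
    Returns node indices and weighted links whose similarity meets the threshold.
    """
    # Cosine similarity between topic-word distributions (one of several
    # plausible similarity measures; an assumption for this sketch).
    norms = np.linalg.norm(topic_word, axis=1, keepdims=True)
    unit = topic_word / norms
    sim = unit @ unit.T
    links = [(i, j, float(sim[i, j]))
             for i in range(len(sim)) for j in range(i + 1, len(sim))
             if sim[i, j] >= threshold]
    return list(range(len(topic_word))), links
```

The resulting node and link lists can be handed directly to any graph-drawing library for the kind of visual exploration the abstract describes.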
Streaming and Sketch Algorithms for Large Data NLP
The availability of large and rich quantities of text data is due to the emergence of the World Wide Web, social media, and mobile devices. Such vast data sets have led to leaps in the performance of many statistically-based problems. Given a large magnitude of text data available, it is computationally prohibitive to train many complex Natural Language Processing (NLP) models on large data. This motivates the hypothesis that simple models trained on big data can outperform more complex models with small data. My dissertation provides a solution to effectively and efficiently exploit large data on many NLP applications.
Datasets are growing at an exponential rate, much faster than the increase in memory. To provide a memory-efficient solution for handling large datasets, this dissertation shows the limitations of existing streaming and sketch algorithms when applied to canonical NLP problems and proposes several new variants to overcome those shortcomings. Streaming and sketch algorithms process large data sets in one pass and represent them with a compact summary, much smaller than the full size of the input. These algorithms can easily be implemented in a distributed setting and provide a solution that is both memory- and time-efficient. However, the memory and time savings come at the expense of approximate solutions. In this dissertation, I demonstrate that approximate solutions achieved on large data are comparable to exact solutions on large data and outperform exact solutions on smaller data.
I focus on many NLP problems that boil down to tracking many statistics, like storing approximate counts, computing approximate association scores like pointwise mutual information (PMI), finding frequent items (like n-grams), building streaming language models, and measuring distributional similarity. First, I introduce the concept of approximate streaming large-scale language models in NLP. Second, I present a novel variant of the Count-Min sketch that maintains approximate counts of all items. Third, I conduct a systematic study and compare many sketch algorithms that approximate count of items with focus on large-scale NLP tasks. Last, I develop fast large-scale approximate graph (FLAG), a system that quickly constructs a large-scale approximate nearest-neighbor graph from a large corpus
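For reference, the textbook Count-Min sketch that the second contribution builds on maintains approximate counts as follows. This is a minimal sketch of the standard baseline, not the dissertation's novel variant; the salted-md5 hashing scheme is an assumption made for simplicity.

```python
import hashlib

class CountMinSketch:
    """Standard Count-Min sketch: approximate counts in sublinear memory."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive an independent-looking hash per row from a salted digest.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def query(self, item):
        # Minimum across rows: may overestimate (collisions), never underestimates.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

The one-sided error (counts are never underestimated) is exactly the property the dissertation's variant modifies; statistics such as PMI can then be computed from these approximate counts.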
Context-Sensitive Model Hierarchies for Quantifying Higher-Dimensional Uncertainty
We formulate four novel context-aware algorithms based on model hierarchies, aimed at enabling an efficient quantification of uncertainty in complex, computationally expensive problems such as fluid-structure interaction and plasma microinstability simulations. Our results show that our algorithms are more efficient than standard approaches and that they are able to cope with the challenges of quantifying uncertainty in higher-dimensional, complex problems.
A Deep Topical N-gram Model and Topic Discovery on COVID-19 News and Research Manuscripts
Topic modeling with latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and the biterm topic model (BTM) has been successfully implemented and used in many areas, including movie reviews, recommender systems, and text summarization. However, these models may become computationally intensive when applied to a very large corpus. Considering the wide acceptance of machine learning based on deep neural networks, this research proposes two deep neural network (NN) variants of the LDA modeling technique, a 2-layer NN and a 3-layer NN. The primary goal is to deal with problems with a large corpus using manageable computational resources.
This thesis analyzes two datasets related to COVID-19 to explore their underlying structures. The first dataset includes over 7,000 CBC COVID-19-related news articles for the period of January 9, 2020 to May 3, 2020. The second dataset, called CORD-19, includes over 100,000 research manuscripts related to COVID-19 for the period of January 2, 2020 to August 1, 2020. In the first dataset, we discovered 14 topics, including “traveling”, “lockdown”, and “masks”, that were the focus of social media attention during the period of January to May of 2020. For the second dataset, 17 topics, including “vaccine”, “treatment”, and “social distancing”, were identified as the focus of research articles for the period of January to August of 2020. Compared to the traditional LDA, our proposed model requires less computation time and shows better performance.
Efficient Bayesian travel-time tomography with geologically-complex priors using sensitivity-informed polynomial chaos expansion and deep generative networks
Monte Carlo Markov Chain (MCMC) methods commonly confront two fundamental
challenges: the accurate characterization of the prior distribution and the
efficient evaluation of the likelihood. In the context of Bayesian studies on
tomography, principal component analysis (PCA) can in some cases facilitate the
straightforward definition of the prior distribution, while simultaneously
enabling the implementation of accurate surrogate models based on polynomial
chaos expansion (PCE) to replace computationally intensive full-physics forward
solvers. When faced with scenarios where PCA does not offer a direct means of
easily defining the prior distribution, alternative methods like deep generative
models (e.g., variational autoencoders (VAEs)) can be employed as viable
options. However, accurately producing a surrogate capable of capturing the
intricate non-linear relationship between the latent parameters of a VAE and
the outputs of forward modeling presents a notable challenge. Indeed, while PCE
models provide high accuracy when the input-output relationship can be
effectively approximated by relatively low-degree multivariate polynomials,
this condition is typically unmet when utilizing latent variables derived from
deep generative models. In this contribution, we present a strategy that
combines the excellent reconstruction performance of VAEs in terms of prior
representation with the accuracy of PCA-PCE surrogate modeling in the context
of Bayesian ground penetrating radar (GPR) travel-time tomography. Within the
MCMC process, the parametrization of the VAE is leveraged for prior exploration
and sample proposal. Concurrently, modeling is conducted using PCE, which
operates on either globally or locally defined principal components of the VAE
samples under examination.
Comment: 25 pages, 15 figures.
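The global-PCA branch of this surrogate strategy can be illustrated schematically: project model inputs onto leading principal components, then fit a low-degree polynomial in the latent coordinates by least squares. The sketch below is a simplified stand-in for the paper's PCA-PCE construction; it assumes a per-dimension power basis rather than a full total-degree PCE basis, and the function name is hypothetical.

```python
import numpy as np

def fit_pca_pce_surrogate(X, y, n_components=2, degree=2):
    """Fit a polynomial surrogate on PCA coordinates of the inputs.

    X: (n, d) model inputs; y: (n,) expensive forward-model outputs.
    Returns a callable surrogate for new input points.
    """
    # PCA: center and project onto the leading principal components.
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    Z = Xc @ components.T  # latent coordinates

    # PCE-style step: least-squares fit of a low-degree polynomial in the
    # latent coordinates (per-dimension powers only, for brevity).
    Phi = np.hstack([Z ** p for p in range(degree + 1)])
    coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    def surrogate(x_new):
        z = (x_new - mean) @ components.T
        phi = np.hstack([z ** p for p in range(degree + 1)])
        return phi @ coeffs

    return surrogate
```

In an MCMC setting such as the one described above, the cheap surrogate replaces the full-physics solver inside the likelihood evaluation, which is where the computational savings arise.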