Semi-supervised and Unsupervised Methods for Categorizing Posts in Web Discussion Forums
Web discussion forums are used by millions of people worldwide to share
information belonging to a variety of domains such as automotive vehicles,
pets, sports, etc. They typically contain posts that fall into different
categories such as problem, solution, feedback, spam, etc. Automatic
identification of these categories can aid information retrieval that is
tailored for specific user requirements. Previously, a number of supervised
methods have attempted to solve this problem; however, these depend on the
availability of abundant training data. A few existing unsupervised and
semi-supervised approaches are either focused on identifying a single category
or do not report category-specific performance. In contrast, this work proposes
unsupervised and semi-supervised methods that require no or minimal training
data to achieve this objective without compromising on performance. A
fine-grained analysis is also carried out to discuss their limitations. The
proposed methods are based on sequence models (specifically, Hidden Markov
Models) that can model language for each category using word and part-of-speech
probability distributions, and manually specified features. Empirical
evaluations across domains demonstrate that the proposed methods are better
suited for this task than existing ones.
Focused Meeting Summarization via Unsupervised Relation Extraction
We present a novel unsupervised framework for focused meeting summarization
that views the problem as an instance of relation extraction. We adapt an
existing in-domain relation learner (Chen et al., 2011) by exploiting a set of
task-specific constraints and features. We evaluate the approach on a decision
summarization task and show that it outperforms unsupervised utterance-level
extractive summarization baselines as well as an existing generic
relation-extraction-based summarization method. Moreover, our approach produces
summaries competitive with those generated by supervised methods in terms of
the standard ROUGE score.
Comment: SIGDIAL 201
Information Extraction from Scientific Literature for Method Recommendation
As a research community grows, more and more papers are published each year.
As a result there is increasing demand for improved methods for finding
relevant papers, automatically understanding the key ideas and recommending
potential methods for a target problem. Despite advances in search engines, it
is still hard to identify new technologies according to a researcher's need.
Due to the large variety of domains and extremely limited annotated resources,
there has been relatively little work on leveraging natural language processing
in scientific recommendation. In this proposal, we aim at making scientific
recommendations by extracting scientific terms from a large collection of
scientific papers and organizing the terms into a knowledge graph. In
preliminary work, we trained a scientific term extractor using a small amount
of annotated data and obtained state-of-the-art performance by leveraging a large
number of unannotated papers through multiple semi-supervised
approaches. We propose to construct a knowledge graph in a way that can make
minimal use of hand annotated data, using only the extracted terms,
unsupervised relational signals such as co-occurrence, and structural external
resources such as Wikipedia. Latent relations between scientific terms can be
learned from the graph. Recommendations will be made through graph inference
for both observed and unobserved relational pairs.
Comment: Thesis Proposal. arXiv admin note: text overlap with arXiv:1708.0607
GLEAKE: Global and Local Embedding Automatic Keyphrase Extraction
Automated methods for granular categorization of large corpora of text
documents have become increasingly important as the volume of scientific,
news, medical, and web documents has grown over the last few years. Automatic
keyphrase extraction (AKE) aims to detect a small set of single- or
multi-word phrases within a single textual document that captures the main
topics of the document. AKE plays an important role in various NLP and
information retrieval tasks such as document summarization and categorization,
full-text indexing, and article recommendation. Due to the lack of sufficient
human-labeled data in different textual contents, supervised learning
approaches are not ideal for automatic detection of keyphrases from the content
of textual bodies. With the state-of-the-art advances in text embedding
techniques, NLP researchers have focused on developing unsupervised methods to
obtain meaningful insights from raw datasets. In this work, we introduce Global
and Local Embedding Automatic Keyphrase Extractor (GLEAKE) for the task of AKE.
GLEAKE utilizes single and multi-word embedding techniques to explore the
syntactic and semantic aspects of the candidate phrases and then combines them
into a series of embedding-based graphs. Moreover, GLEAKE applies network
analysis techniques on each embedding-based graph to refine the most
significant phrases as a final set of keyphrases. We demonstrate the high
performance of GLEAKE by evaluating its results on five standard AKE datasets
from different domains and writing styles and by showing its superiority over
other state-of-the-art methods.
Using Syntax-Based Machine Translation to Parse English into Abstract Meaning Representation
We present a parser for Abstract Meaning Representation (AMR). We treat
English-to-AMR conversion within the framework of string-to-tree, syntax-based
machine translation (SBMT). To make this work, we transform the AMR structure
into a form suitable for the mechanics of SBMT and useful for modeling. We
introduce an AMR-specific language model and add data and features drawn from
semantic resources. Our resulting AMR parser improves upon state-of-the-art
results by 7 Smatch points.
Comment: 10 pages, 8 figures
Monolingual sentence matching for text simplification
This work improves monolingual sentence alignment for text simplification,
specifically for text in standard and simple Wikipedia. We introduce a
convolutional neural network structure to model similarity between two
sentences. Due to the limitation of available parallel corpora, the model is
trained in a semi-supervised way, by using the output of a knowledge-based high
performance aligning system. We apply the resulting similarity score to rescore
the knowledge-based output, and adapt the model using a small hand-aligned
dataset. Experiments show that both rescoring and adaptation improve the
performance of the knowledge-based method.
Extract and Edit: An Alternative to Back-Translation for Unsupervised Neural Machine Translation
The overreliance on large parallel corpora significantly limits the
applicability of machine translation systems to the majority of language pairs.
Back-translation has been predominantly used in previous approaches for
unsupervised neural machine translation, where pseudo sentence pairs are
generated to train the models with a reconstruction loss. However, the pseudo
sentences are usually of low quality as translation errors accumulate during
training. To avoid this fundamental issue, we propose an alternative but more
effective approach, extract-edit, to extract and then edit real sentences from
the target monolingual corpora. Furthermore, we introduce a comparative
translation loss to evaluate the translated target sentences and thus train the
unsupervised translation systems. Experiments show that the proposed approach
consistently outperforms the previous state-of-the-art unsupervised machine
translation systems across two benchmarks (English-French and English-German)
and two low-resource language pairs (English-Romanian and English-Russian) by
more than 2 (up to 3.63) BLEU points.
Comment: 11 pages, 3 figures. Accepted to NAACL 201
Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models
Many business documents processed in modern NLP and IR pipelines are visually
rich: in addition to text, their semantics can also be captured by visual
traits such as layout, format, and fonts. We study the problem of information
extraction from visually rich documents (VRDs) and present a model that
combines the power of large pre-trained language models and graph neural
networks to efficiently encode both textual and visual information in business
documents. We further introduce new fine-tuning objectives to improve in-domain
unsupervised fine-tuning to better utilize large amounts of unlabeled in-domain
data. We experiment on real world invoice and resume data sets and show that
the proposed method outperforms strong text-based RoBERTa baselines by 6.3%
absolute F1 on invoices and 4.7% absolute F1 on resumes. When evaluated in a
few-shot setting, our method requires up to 30x less annotation data than the
baseline to achieve the same level of performance at ~90% F1.
Comment: 10 pages, to appear in SIGIR 2020 Industry Track
Universal, Unsupervised (Rule-Based), Uncovered Sentiment Analysis
We present a novel unsupervised approach for multilingual sentiment analysis
driven by compositional syntax-based rules. On the one hand, we exploit some of
the main advantages of unsupervised algorithms: (1) the interpretability of
their output, in contrast with most supervised models, which behave as a black
box and (2) their robustness across different corpora and domains. On the other
hand, by introducing the concept of compositional operations and exploiting
syntactic information in the form of universal dependencies, we tackle one of
their main drawbacks: their rigidity on data that are structured differently
depending on the language concerned. Experiments show an improvement both over
existing unsupervised methods, and over state-of-the-art supervised models when
evaluating outside their corpus of origin. Experiments also show how the same
compositional operations can be shared across languages. The system is
available at http://www.grupolys.org/software/UUUSA/
Comment: 19 pages, 5 Tables, 6 Figures. This is the authors' version of a work
that was accepted for publication in Knowledge-Based Systems
GLoMo: Unsupervisedly Learned Relational Graphs as Transferable Representations
Modern deep transfer learning approaches have mainly focused on learning
generic feature vectors from one task that are transferable to other tasks,
such as word embeddings in language and pretrained convolutional features in
vision. However, these approaches usually transfer unary features and largely
ignore more structured graphical representations. This work explores the
possibility of learning generic latent relational graphs that capture
dependencies between pairs of data units (e.g., words or pixels) from
large-scale unlabeled data and transferring the graphs to downstream tasks. Our
proposed transfer learning framework improves performance on various tasks
including question answering, natural language inference, sentiment analysis,
and image classification. We also show that the learned graphs are generic
enough to be transferred to different embeddings on which the graphs have not
been trained (including GloVe embeddings, ELMo embeddings, and task-specific
RNN hidden units), or to embedding-free units such as image pixels.