641 research outputs found
TiFi: Taxonomy Induction for Fictional Domains [Extended version]
Taxonomies are important building blocks of structured knowledge bases, and their construction from text sources and Wikipedia has received much attention. In this paper we focus on the construction of taxonomies for fictional domains, using noisy category systems from fan wikis or text extraction as input. Such fictional domains are archetypes of entity universes that are poorly covered by Wikipedia, as are enterprise-specific knowledge bases and highly specialized verticals. Our fiction-targeted approach, called TiFi, consists of three phases: (i) category cleaning, by identifying candidate categories that truly represent classes in the domain of interest, (ii) edge cleaning, by selecting subcategory relationships that correspond to class subsumption, and (iii) top-level construction, by mapping classes onto a subset of high-level WordNet categories. A comprehensive evaluation shows that TiFi is able to construct taxonomies for a diverse range of fictional domains such as Lord of the Rings, The Simpsons, or Greek Mythology with very high precision, and that it outperforms state-of-the-art baselines for taxonomy induction by a substantial margin.
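The three phases described above can be sketched as a minimal pipeline. This is an illustrative toy, not the paper's implementation: the plural-head heuristic, the example categories, and the root mapping are hypothetical stand-ins for TiFi's trained classifiers and its WordNet-based top-level construction.

```python
# Toy sketch of a TiFi-style three-phase pipeline (illustrative heuristics only;
# the actual system uses trained classifiers and WordNet mapping).

def clean_categories(categories):
    """Phase (i): keep candidate categories that plausibly denote classes.
    Toy heuristic: class-like category names end in a plural head noun."""
    return {c for c in categories if c.split()[-1].endswith("s")}

def clean_edges(edges, classes):
    """Phase (ii): keep subcategory edges whose endpoints are both classes,
    as a stand-in for checking genuine class subsumption."""
    return {(child, parent) for child, parent in edges
            if child in classes and parent in classes}

def attach_top_level(classes, roots):
    """Phase (iii): map each class onto a high-level root.
    Toy heuristic: look up the head noun; the paper uses WordNet categories."""
    return {c: roots.get(c.split()[-1].lower(), "entity") for c in classes}

# Tiny fan-wiki-style input (hypothetical example categories).
categories = {"Hobbits", "Rings of Power", "Characters", "Middle-earth"}
edges = {("Hobbits", "Characters"), ("Rings of Power", "Middle-earth")}

classes = clean_categories(categories)          # drops non-class categories
taxonomy = clean_edges(edges, classes)          # drops non-subsumption edges
roots = attach_top_level(classes, {"characters": "person", "hobbits": "person"})
print(classes, taxonomy, roots)
```

The point of the sketch is the staged design: each phase prunes the noisy category system before the next one runs, so errors do not propagate into the final taxonomy.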
Automatic Event Salience Identification
Identifying the salience (i.e. importance) of discourse units is an important
task in language understanding. While events play important roles in text
documents, little research exists on analyzing their salience. This
paper empirically studies the Event Salience task and proposes two salience
detection models based on content similarities and discourse relations. The
first is a feature based salience model that incorporates similarities among
discourse units. The second is a neural model that captures more complex
relations between discourse units. Tested on our new large-scale event salience
corpus, both methods significantly outperform the strong frequency baseline,
while our neural model further improves the feature based one by a large
margin. Our analyses demonstrate that our neural model captures interesting
connections between salience and discourse unit relations (e.g., scripts and
frame structures).
Comment: EMNLP 2018, 11 pages. Datasets, models and codes: https://github.com/hunterhector/EventSalienc
Improved Neural Relation Detection for Knowledge Base Question Answering
Relation detection is a core component for many NLP applications including
Knowledge Base Question Answering (KBQA). In this paper, we propose a
hierarchical recurrent neural network enhanced by residual learning that
detects KB relations given an input question. Our method uses deep residual
bidirectional LSTMs to compare questions and relation names via different
hierarchies of abstraction. Additionally, we propose a simple KBQA system that
integrates entity linking and our proposed relation detector so that each can
enhance the other. Experimental results show that our approach not only
achieves outstanding relation detection performance but, more importantly,
helps our KBQA system achieve state-of-the-art accuracy on both the
single-relation (SimpleQuestions) and multi-relation (WebQSP) QA benchmarks.
Comment: Accepted by ACL 2017 (updated for camera-ready)
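The hierarchical matching idea can be sketched without any neural machinery. In the toy below, token overlap stands in for the paper's residual BiLSTM encoders; the relation inventory and question are hypothetical. What carries over is the structure: the question is compared against each relation both word by word and as a whole name, and the two granularities are combined.

```python
# Hedged sketch of hierarchical question-relation matching: score each relation
# at two levels of abstraction (single words vs. the full relation name) and
# combine them. Jaccard overlap stands in for learned sentence encoders.

def overlap(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def relation_score(question_tokens, relation):
    words = relation.replace("_", " ").split()            # word-level view
    word_sim = max(overlap(question_tokens, [w]) for w in words)
    name_sim = overlap(question_tokens, words)            # relation-level view
    return max(word_sim, name_sim)                        # combine hierarchies

def detect_relation(question, relations):
    q = question.lower().split()
    return max(relations, key=lambda r: relation_score(q, r))

relations = ["place_of_birth", "spouse", "date_of_birth"]
best = detect_relation("what is the place of birth of obama", relations)
print(best)
```

Matching at both granularities is what distinguishes relations that share words (here, `place_of_birth` vs. `date_of_birth`): the relation-level view rewards the candidate whose full name matches more of the question.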
DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension
We present DREAM, the first dialogue-based multiple-choice reading
comprehension dataset. Collected from English-as-a-foreign-language
examinations designed by human experts to evaluate the comprehension level of
Chinese learners of English, our dataset contains 10,197 multiple-choice
questions for 6,444 dialogues. In contrast to existing reading comprehension
datasets, DREAM is the first to focus on in-depth multi-turn multi-party
dialogue understanding. DREAM is likely to present significant challenges for
existing reading comprehension systems: 84% of answers are non-extractive, 85%
of questions require reasoning beyond a single sentence, and 34% of questions
also involve commonsense knowledge.
We apply several popular neural reading comprehension models that primarily
exploit surface information within the text and find them to, at best, just
barely outperform a rule-based approach. We next investigate the effects of
incorporating dialogue structure and different kinds of general world knowledge
into both rule-based and (neural and non-neural) machine learning-based reading
comprehension models. Experimental results on the DREAM dataset show the
effectiveness of dialogue structure and general world knowledge. DREAM will be
available at https://dataset.org/dream/.
Comment: To appear in TAC
Word, graph and manifold embedding from Markov processes
Continuous vector representations of words and objects appear to carry
surprisingly rich semantic content. In this paper, we advance both the
conceptual and theoretical understanding of word embeddings in three ways.
First, we ground embeddings in semantic spaces studied in
cognitive-psychometric literature and introduce new evaluation tasks. Second,
in contrast to prior work, we take metric recovery as the key object of study,
unify existing algorithms as consistent metric recovery methods based on
co-occurrence counts from simple Markov random walks, and propose a new
recovery algorithm. Third, we generalize metric recovery to graphs and
manifolds, relating co-occurrence counts on random walks in graphs and random
processes on manifolds to the underlying metric to be recovered, thereby
reconciling manifold estimation and embedding algorithms. We compare embedding
algorithms across a range of tasks, from nonlinear dimensionality reduction to
three semantic language tasks, including analogies, sequence completion, and
classification.
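The metric-recovery view can be demonstrated on a toy graph. The sketch below is our illustration of the general idea, not the paper's algorithm: run a simple Markov random walk, count co-occurrences within a window, and treat the negative log of normalized counts as a proxy for the underlying graph distance. The path graph and the window size are arbitrary choices for the example.

```python
# Sketch of metric recovery from random-walk co-occurrences: on a path graph
# 0-1-2-3-4, nearby nodes co-occur within a window far more often than distant
# ones, so -log(co-occurrence frequency) tracks the true graph distance.
import math, random
from collections import Counter

random.seed(0)
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 4] for i in range(5)}

# Simulate a long Markov random walk on the graph.
walk, node = [], 0
for _ in range(200_000):
    walk.append(node)
    node = random.choice(neighbors[node])

# Count unordered co-occurrences within a sliding window.
window = 3
cooc = Counter()
for t, u in enumerate(walk):
    for s in range(t + 1, min(t + 1 + window, len(walk))):
        if u != walk[s]:
            cooc[frozenset((u, walk[s]))] += 1

def recovered_distance(i, j):
    """-log of normalized co-occurrence count, a proxy for the graph metric."""
    return -math.log(cooc[frozenset((i, j))] / len(walk))

print(recovered_distance(0, 1), recovered_distance(0, 2), recovered_distance(0, 3))
```

The recovered quantity is monotone in the true distance on this graph, which is the property that lets the paper unify word-embedding objectives, graph embeddings, and manifold estimation as instances of one recovery problem.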
A Survey on Explainability in Machine Reading Comprehension
This paper presents a systematic review of benchmarks and approaches for
explainability in Machine Reading Comprehension (MRC). We present how the
representation and inference challenges evolved and the steps which were taken
to tackle these challenges. We also present the evaluation methodologies to
assess the performance of explainable systems. In addition, we identify
persisting open research questions and highlight critical directions for future
work.
A Survey on Multi-hop Question Answering and Generation
The problem of Question Answering (QA) has long attracted significant research
interest. Its relevance to language understanding and knowledge retrieval
tasks, along with its simple setting, makes QA crucial for strong AI systems.
Recent success on simple QA tasks has shifted the focus to
more complex settings. Among these, Multi-Hop QA (MHQA) is one of the most
researched tasks over the recent years. The ability to answer multi-hop
questions and perform multi step reasoning can significantly improve the
utility of NLP systems. Consequently, the field has seen a sudden surge of
high-quality datasets, models, and evaluation strategies. The notion of
`multiple hops' is somewhat abstract, which results in a large variety of tasks
that require multi-hop reasoning. As a consequence, datasets and models differ
significantly, which makes the field challenging to generalize and survey. This
work aims to provide a general and formal definition of the MHQA task, and to
organize and summarize existing MHQA frameworks. We also outline best practices
for creating MHQA datasets. The paper provides a systematic and thorough
introduction to, and a structuring of, the existing attempts at this highly
interesting yet quite challenging task.
Comment: 45 pages, 4 figures, 3 tables
Arabic Text Summarization Challenges using Deep Learning Techniques: A Review
Text summarization is a challenging field in Natural Language Processing because of the difficulty of language modeling and of the techniques used to produce concise summaries. Arabic increases this challenge further, given the many distinctive features of the Arabic language, the lack of Arabic tools and resources, and the need to adapt algorithms and models. In this paper, we present several studies on Arabic text summarization that apply different algorithms to several datasets. We then compare these studies and draw conclusions to guide researchers in their future work.
Graph-based Patterns for Local Coherence Modeling
Coherence is an essential property of well-written texts. It distinguishes a multi-sentence text from a sequence of randomly strung sentences. The task of local coherence modeling concerns the way sentences in a text link up with one another. Solving this task is beneficial for assessing the quality of texts. Moreover, a coherence model can be integrated into text generation systems such as text summarizers to produce coherent texts.
In this dissertation, we present a graph-based approach to local coherence modeling that accounts for the connectivity structure among sentences in a text. Graphs give our model the capability to take into account relations between non-adjacent sentences as well as those between adjacent sentences. Besides, the connectivity style among nodes in graphs reflects the relationships among sentences in a text.
We first employ the entity graph approach, proposed by Guinaudeau and Strube (2013), to represent a text via a graph. In the entity graph representation of a text, nodes encode sentences and edges depict the existence of a pair of coreferent mentions in sentences. We then devise graph-based features to capture the connectivity structure of nodes in a graph, and accordingly the connectivity structure of sentences in the corresponding text. We extract all subgraphs of entity graphs as features which encode the connectivity structure of graphs. Frequencies of subgraphs correlate with the perceived coherence of their corresponding texts. Therefore, we refer to these subgraphs as coherence patterns.
To complete our approach to coherence modeling, we propose a new graph representation of texts that goes beyond the entity graph. Our approach employs lexico-semantic relations among words in sentences, instead of only entity coreference relations, to model relationships between sentences via a graph. This new lexical graph representation of text, together with our method for mining coherence patterns, constitutes our coherence model.
We evaluate our approach on the readability assessment task because a primary factor of readability is coherence. Coherent texts are easy to read and consequently demand less effort from their readers. Our extensive experiments on two separate readability assessment datasets show that frequencies of coherence patterns in texts correlate with the readability ratings assigned by human judges. By training a machine learning method on our coherence patterns, our model outperforms its counterparts on ranking texts with respect to their readability. As one of the ultimate goals of coherence models is to be used in text generation systems, we show how our coherence patterns can be integrated into a graph-based text summarizer to produce informative and coherent summaries. Our coherence patterns improve the performance of the summarization system based on both standard summarization metrics and human evaluations. An implementation of the approaches discussed in this dissertation is publicly available.
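The entity-graph construction and the pattern counting can be sketched in a few lines. This toy is our illustration: it approximates entity mentions by shared capitalized words rather than by coreference resolution, and it counts only 3-node connectivity patterns, whereas the dissertation mines arbitrary subgraphs as coherence patterns.

```python
# Toy sketch of the entity-graph idea: sentences are nodes, an edge connects two
# sentences that mention a common entity (approximated here by shared capitalized
# words), and frequencies of small subgraph patterns act as coherence features.
from itertools import combinations

def entity_graph(sentences):
    mentions = [{w.strip(".,") for w in s.split() if w[0].isupper()}
                for s in sentences]
    return {(i, j) for i, j in combinations(range(len(sentences)), 2)
            if mentions[i] & mentions[j]}

def triad_pattern_counts(n, edges):
    """Count 3-node subgraph patterns by their number of edges (0..3).
    Texts whose sentence triads are densely connected tend to read as coherent."""
    counts = {0: 0, 1: 0, 2: 0, 3: 0}
    for trio in combinations(range(n), 3):
        k = sum((a, b) in edges for a, b in combinations(trio, 2))
        counts[k] += 1
    return counts

text = [
    "Frodo left the Shire.",
    "Frodo carried the Ring.",
    "The Ring was forged by Sauron.",
    "It rained all day.",
]
edges = entity_graph(text)
print(triad_pattern_counts(len(text), edges))
```

In this tiny example the off-topic final sentence contributes only sparse triads, while the three connected sentences form a chain; pattern frequencies of exactly this kind are what the dissertation correlates with perceived coherence.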
Semi-supervised and Unsupervised Methods for Categorizing Posts in Web Discussion Forums
Web discussion forums are used by millions of people worldwide to share
information belonging to a variety of domains such as automotive vehicles,
pets, sports, etc. They typically contain posts that fall into different
categories such as problem, solution, feedback, spam, etc. Automatic
identification of these categories can aid information retrieval that is
tailored for specific user requirements. Previously, a number of supervised
methods have attempted to solve this problem; however, these depend on the
availability of abundant training data. A few existing unsupervised and
semi-supervised approaches are either focused on identifying a single category
or do not report category-specific performance. In contrast, this work proposes
unsupervised and semi-supervised methods that require no or minimal training
data to achieve this objective without compromising on performance. A
fine-grained analysis is also carried out to discuss their limitations. The
proposed methods are based on sequence models (specifically, Hidden Markov
Models) that can model language for each category using word and part-of-speech
probability distributions, and manually specified features. Empirical
evaluations across domains demonstrate that the proposed methods are better
suited for this task than existing ones.
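The sequence-model idea can be sketched as a tiny HMM. All probabilities and vocabulary below are made-up illustrative values, not the paper's learned parameters: categories are hidden states, each post emits words from a category-specific unigram distribution, and Viterbi decoding recovers the most likely category sequence for a thread.

```python
# Minimal HMM sketch for forum-post categorization: hidden states are post
# categories, emissions are per-category unigram word distributions, and the
# transition matrix captures thread structure (problems tend to be followed by
# solutions, solutions by feedback). All numbers are illustrative.
import math

states = ["problem", "solution", "feedback"]
start = {"problem": 0.6, "solution": 0.2, "feedback": 0.2}
trans = {
    "problem":  {"problem": 0.2, "solution": 0.6, "feedback": 0.2},
    "solution": {"problem": 0.2, "solution": 0.2, "feedback": 0.6},
    "feedback": {"problem": 0.4, "solution": 0.3, "feedback": 0.3},
}
emit = {  # P(word | category), with a small floor for unseen words
    "problem":  {"error": 0.3, "crash": 0.3, "help": 0.2},
    "solution": {"try": 0.3, "update": 0.3, "fix": 0.2},
    "feedback": {"thanks": 0.4, "worked": 0.3},
}

def log_emit(state, words):
    return sum(math.log(emit[state].get(w, 1e-3)) for w in words)

def viterbi(posts):
    """posts: list of token lists, one per forum post in thread order."""
    v = {s: math.log(start[s]) + log_emit(s, posts[0]) for s in states}
    back = []
    for post in posts[1:]:
        nv, bp = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] + math.log(trans[p][s]))
            nv[s] = v[prev] + math.log(trans[prev][s]) + log_emit(s, post)
            bp[s] = prev
        back.append(bp)
        v = nv
    path = [max(states, key=lambda s: v[s])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return path[::-1]

posts = [["error", "crash"], ["try", "update"], ["thanks", "worked"]]
print(viterbi(posts))  # problem -> solution -> feedback
```

The appeal of this formulation for the semi-supervised setting is that both the emission distributions and the transitions can be estimated from little or no labeled data, which matches the paper's motivation.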