Time Aware Knowledge Extraction for Microblog Summarization on Twitter
Microblogging services like Twitter and Facebook collect millions of
user-generated posts every moment about trending news, occurring events,
and so on. Nevertheless, finding information of interest among the huge
amount of available posts, which are often noisy and redundant, is a real
challenge. In general, social media analytics services have attracted
increasing attention from both research and industry. Specifically, the
dynamic context of microblogging requires managing not only the meaning
of information but also the evolution of knowledge over the timeline.
This work defines the Time Aware Knowledge Extraction (briefly, TAKE)
methodology, which relies on a temporal extension of Fuzzy Formal Concept
Analysis. In particular, a microblog summarization algorithm has been
defined that filters the concepts organized by TAKE into a time-dependent
hierarchy. The algorithm addresses topic-based summarization on Twitter.
Besides considering the timing of the concepts, another distinguishing
feature of the proposed microblog summarization framework is the
possibility of producing more or less detailed summaries, according to
the user's needs, with good levels of quality and completeness, as
highlighted in the experimental results.
Comment: 33 pages, 10 figures
Explicit diversification of event aspects for temporal summarization
During major events, such as emergencies and disasters, a large volume of information is reported on newswire and social media platforms. Temporal summarization (TS) approaches are used to automatically produce concise overviews of such events by extracting text snippets from related articles over time. Current TS approaches rely on a combination of event relevance and textual novelty for snippet selection. However, for events that span multiple days, textual novelty is often a poor criterion for selecting snippets, since many snippets are textually unique but are semantically redundant or non-informative. In this article, we propose a framework for the diversification of snippets using explicit event aspects, building on recent works in search result diversification. In particular, we first propose two techniques to identify explicit aspects that a user might want to see covered in a summary for different types of event. We then extend a state-of-the-art explicit diversification framework to maximize the coverage of these aspects when selecting summary snippets for unseen events. Through experimentation over the TREC TS 2013, 2014, and 2015 datasets, we show that explicit diversification for temporal summarization significantly outperforms classical novelty-based diversification, as the use of explicit event aspects reduces the amount of redundant and off-topic snippets returned, while also increasing summary timeliness.
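As a rough illustration, the aspect-coverage selection described above can be sketched as a greedy trade-off between snippet relevance and coverage of not-yet-seen aspects, in the spirit of xQuAD-style diversification. All names, weights, and the data layout here are illustrative assumptions, not the article's exact model:

```python
def greedy_aspect_selection(snippets, aspects, k, lam=0.7):
    """Greedily pick k snippets, balancing relevance against coverage of
    explicit event aspects not yet covered (an xQuAD-like sketch).

    snippets: list of (snippet_id, relevance, aspects_covered) tuples,
    where aspects_covered is a set; lam weighs aspect coverage (assumed).
    """
    selected, covered = [], set()
    pool = list(snippets)
    for _ in range(min(k, len(pool))):
        def gain(s):
            _, rel, asp = s
            # Fraction of the aspect space this snippet would newly cover.
            new_cov = len(asp - covered) / len(aspects) if aspects else 0.0
            return (1 - lam) * rel + lam * new_cov
        best = max(pool, key=gain)
        pool.remove(best)
        selected.append(best[0])
        covered |= best[2]
    return selected
```

With lam close to 1 the selection degenerates to pure aspect coverage; with lam = 0 it reduces to plain relevance ranking.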
Improving search effectiveness in sentence retrieval and novelty detection
In this thesis we study sentence retrieval and novelty detection in
depth. We analyze the strengths and weaknesses of current
state-of-the-art methods and, subsequently, propose new mechanisms to
address sentence retrieval and novelty detection.
Retrieval and novelty detection are related tasks: usually, we initially
apply a retrieval model that properly estimates the relevance of passages
(e.g. sentences) and generates a ranking of passages sorted by relevance.
Next, this ranking is used as the input of a novelty detection module,
which tries to filter out redundant passages in the ranking.
The estimation of relevance at sentence level is difficult. Standard
methods used to estimate relevance are simply based on matching query and
sentence terms. However, queries usually contain two or three terms and
sentences are also short. Therefore, the matching between query and
sentences is poor. In order to address this problem, we study how to
enrich this process with additional information: the context. The context
refers to the information provided by the surrounding sentences or the
document where the sentence is located. Such context reduces ambiguity
and supplies additional information not included in the sentence itself.
Additionally, it is important to estimate how important (central) a
sentence is within the document. These two components are studied within
a formal framework based on Statistical Language Models. In this respect,
we demonstrate that these components yield improvements over current
sentence retrieval methods.
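A minimal sketch of such a context-smoothed score: the sentence language model is interpolated with its enclosing document (the context) and the whole collection. The mixture weights and function names here are assumptions for illustration, not the thesis's exact estimator:

```python
import math
from collections import Counter

def lm_sentence_score(query, sentence, document, collection,
                      lam=0.5, mu=0.3):
    """Query log-likelihood under a three-level mixture: the sentence
    model is smoothed by its document (context) and the collection.
    All arguments are token lists; lam and mu are assumed weights."""
    s_tf, d_tf, c_tf = Counter(sentence), Counter(document), Counter(collection)
    s_len, d_len, c_len = len(sentence), len(document), len(collection)
    score = 0.0
    for term in query:
        p_s = s_tf[term] / s_len if s_len else 0.0
        p_d = d_tf[term] / d_len if d_len else 0.0
        p_c = c_tf[term] / c_len if c_len else 0.0
        # Sentence evidence first, then document context, then collection.
        p = lam * p_s + (1 - lam) * (mu * p_d + (1 - mu) * p_c)
        score += math.log(p) if p > 0 else math.log(1e-12)
    return score
```

Because the document model backs off for terms absent from a short sentence, a sentence drawn from an on-topic document is not penalized as harshly as under pure term matching.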
In this thesis we work with collections of sentences that were extracted
from news. News not only explain facts but also express opinions that people
have about a particular event or topic. Therefore, the proper estimation of
which passages are opinionated may help to further improve the estimation
of relevance for sentences. We apply a formal methodology that helps us to
incorporate opinions into standard sentence retrieval methods. Additionally,
we propose simple empirical alternatives to incorporate query-independent
features into sentence retrieval models. We demonstrate that the
incorporation of opinions to estimate relevance is an important factor
that makes sentence retrieval methods more effective. Throughout this
study, we also analyze query-independent features based on sentence
length and named entities.
The combination of the context-based approach with the incorporation
of opinion-based features is straightforward. We study how to combine these
two approaches and their impact. We demonstrate that context-based models
are implicitly promoting sentences with opinions and, therefore, opinion-
based features do not help to further improve context-based methods.
The second part of this thesis is dedicated to novelty detection at
sentence level. Because novelty actually depends on a retrieval ranking,
we consider here two approaches: a) the perfect-relevance approach, which
consists of using a ranking where all sentences are relevant; and b) the
non-perfect-relevance approach, which consists of first applying a
sentence retrieval method.
We first study which baseline performs best and, next, we propose a
number of variations. One of the mechanisms proposed is based on
vocabulary pruning. We demonstrate that considering terms from the
top-ranked sentences in the original ranking helps to guide the
estimation of novelty. The application of Language Models to support
novelty detection is another challenge that we face in this thesis. We
apply different smoothing methods in the context of alternative
mechanisms to detect novelty. Additionally, we test a mechanism based on
mixture models that uses the Expectation-Maximization algorithm to
automatically obtain the novelty score of a sentence.
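A minimal term-overlap sketch of such a redundancy filter: a sentence is kept only when its maximum cosine similarity to the sentences already kept stays below a threshold. The threshold value and names are illustrative assumptions, not a method from the thesis:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    num = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

def novelty_filter(ranking, threshold=0.8):
    """Walk the relevance ranking and drop any sentence too similar to a
    sentence already kept (sentences are token lists)."""
    kept, kept_vecs = [], []
    for sent in ranking:
        vec = Counter(sent)
        if all(cosine(vec, v) < threshold for v in kept_vecs):
            kept.append(sent)
            kept_vecs.append(vec)
    return kept
```

Lowering the threshold makes the filter more aggressive, trading recall of relevant sentences for less redundancy.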
In the last part of this work we demonstrate that most novelty methods
lead to a strong re-ordering of the initial ranking. However, we show that the
top ranked sentences in the initial list are usually novel and re-ordering them
is often harmful. Therefore, we propose different mechanisms that determine
the position threshold where novelty detection should be initiated. In this
respect, we consider query-independent and query-dependent approaches.
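The position-threshold idea above can be sketched as follows: the top of the ranking is passed through untouched, and a novelty filter is applied only from a given position onward. The default cutoff here is an arbitrary assumption; the thesis studies query-independent and query-dependent ways to set it:

```python
def thresholded_novelty(ranking, novelty_fn, start=10):
    """Leave the first `start` positions as-is (top-ranked sentences are
    usually novel) and apply novelty_fn only to the rest of the ranking."""
    return ranking[:start] + novelty_fn(ranking[start:])
```

With start = 0 this reduces to ordinary full-ranking novelty detection, so the threshold strictly generalizes the baseline.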
Summing up, we identify important limitations of current sentence
retrieval and novelty methods, and propose novel and effective methods.