Summarize Dates First: A Paradigm Shift in Timeline Summarization
Timeline summarization aims to present long news stories in a compact manner. State-of-the-art approaches first select the most relevant dates from the original event timeline and then produce per-date news summaries. Date selection is driven by either per-date news content or date-level references. When coping with complex event data, characterized by inherent news-flow redundancy, this pipeline can run into significant issues in both date selection and summarization, owing to a limited use of news content during date selection and no use of high-level temporal references (e.g., the past month). This paper proposes a paradigm shift in timeline summarization aimed at overcoming these issues. It presents a new approach, namely Summarize Dates First, which first generates date-level summaries and then selects the most relevant dates on top of the summarized knowledge. In the latter stage, it also performs date aggregations to account for high-level temporal references. The proposed pipeline supports frequent incremental timeline updates more efficiently than previous approaches. We tested our unsupervised approach both on existing benchmark datasets and on a newly proposed benchmark dataset describing the COVID-19 news timeline. The achieved results were superior to state-of-the-art unsupervised methods and competitive against supervised ones.
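The summarize-first pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-date summarizer and the relevance score (summary length) are toy placeholders for the real components.

```python
def summarize_date(articles):
    """Toy per-date summarizer: keeps the longest sentence as a stand-in
    for a real extractive or abstractive summarizer."""
    sentences = [s.strip() for a in articles for s in a.split(".") if s.strip()]
    return max(sentences, key=len) if sentences else ""

def summarize_dates_first(dated_articles, k):
    """Summarize every date first, then select the k dates whose summaries
    score highest (here, summary length is a toy relevance score).
    dated_articles maps a date string to a list of article strings."""
    summaries = {d: summarize_date(arts) for d, arts in dated_articles.items()}
    top_dates = sorted(summaries, key=lambda d: len(summaries[d]), reverse=True)[:k]
    return {d: summaries[d] for d in sorted(top_dates)}
```

Note that date selection operates only on the already-summarized knowledge, which is also what makes incremental updates cheap: a new date only requires summarizing its own articles before re-ranking.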
Event summarization on social media stream: retrospective and prospective tweet summarization
User-generated content on social media, such as Twitter, often provides the latest news before traditional media,
which allows having a retrospective summary of events and being updated in a timely fashion whenever a new development
occurs. However, social media, while being a valuable source of information, can also be overwhelming given the volume and
the velocity of published information. To shield users from being overwhelmed by irrelevant and redundant posts,
retrospective summarization and prospective notification (real-time summarization) were introduced as two complementary
tasks of information seeking on document streams. The former aims to select a list of relevant and non-redundant tweets that
capture "what happened". In the latter, systems monitor the live posts stream and push relevant and novel notifications as
soon as possible.
Our work falls within these frameworks and focuses on developing tweet summarization approaches for the two
aforementioned scenarios. It aims at providing summaries that capture the key aspects of the event of interest, to help users efficiently acquire information and follow the development of long, ongoing events from social media. Nevertheless, the tweet summarization task faces many challenges that stem from, on the one hand, the high volume, velocity and variety of the published information and, on the other hand, the quality of tweets, which can vary significantly.
In prospective notification, the core task is real-time relevance and novelty detection. A system may choose to push new updates immediately or to trade timeliness for higher notification quality. Our contributions address these aspects. First, we introduce the Word Similarity Extended Boolean Model (WSEBM), a relevance model that does not rely on stream statistics and takes advantage of word embeddings. We use word similarity instead of traditional term-weighting techniques, which mitigates the shortness and word-mismatch issues in tweets. The intuition behind our proposition is that the context-aware similarity captured by word2vec can match different words with the same meaning, offsetting the word-mismatch issue when computing the similarity between a tweet and a topic. Second, we compute the novelty score of an incoming tweet against the words of all tweets already pushed to the user, instead of using pairwise tweet-to-tweet comparison. The proposed novelty detection method scales better and reduces
the execution time, which fits real-time tweet filtering. Third, we propose an adaptive Learning to Filter approach that
leverages social signals as well as query-dependent features. To overcome the issue of relevance threshold setting, we use a
binary classifier that predicts the relevance of the incoming tweet. In addition, we show the gain that can be achieved by
taking advantage of ongoing relevance feedback. Finally, we adopt a real-time push strategy and show that the proposed approach achieves promising performance in terms of quality (relevance and novelty) at a low latency cost, whereas state-of-the-art approaches tend to trade latency for higher quality.
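The two core ideas of the prospective pipeline can be sketched as follows. This is an illustration under stated assumptions, not the thesis's actual scoring formulas: the word vectors are toy stand-ins for a trained word2vec model, and WSEBM's extended Boolean combination is simplified to an average of best-match similarities.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy vectors standing in for a trained word2vec model.
EMB = {
    "quake": [1.0, 0.1], "earthquake": [0.9, 0.2],
    "tremor": [0.8, 0.3], "election": [0.0, 1.0],
}

def relevance(tweet_words, topic_words):
    """Soft term matching: each topic word counts by its best embedding
    similarity to any tweet word, so 'earthquake' can match 'quake'."""
    scores = []
    for t in topic_words:
        if t not in EMB:
            scores.append(0.0)
            continue
        best = max((cosine(EMB[t], EMB[w]) for w in tweet_words if w in EMB),
                   default=0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0

def novelty(tweet_words, pushed_vocab):
    """Novelty against the pooled vocabulary of already-pushed tweets:
    one set lookup per word instead of pairwise tweet-to-tweet comparison."""
    words = set(tweet_words)
    return len(words - pushed_vocab) / len(words) if words else 0.0
```

The novelty function illustrates the scalability argument: its cost grows with the size of the incoming tweet, not with the number of tweets already pushed.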
This thesis also explores a novel approach to generate a retrospective summary that follows a different paradigm than the
majority of state-of-the-art methods. We consider the summary generation as an optimization problem that takes into account
both topical and temporal diversity. Tweets are filtered and incrementally clustered into two cluster types: topical clusters based on content similarity and temporal clusters based on publication time. Summary generation is formulated as an integer linear program with binary variables, in which the objective function is maximized under constraints ensuring that at most one post per cluster is selected and the predefined summary length limit is respected.
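The integer linear program described above can be sketched on toy inputs. The brute-force enumeration below is for illustration only (a real system would hand the model to an ILP solver); the scores, lengths and cluster assignments are invented inputs.

```python
from itertools import product

def select_tweets(scores, lengths, clusters, max_len):
    """Brute-force the ILP for illustration: one binary variable x_i per
    tweet, maximize sum(scores[i] * x_i) subject to (a) at most one tweet
    per cluster and (b) total length <= max_len."""
    best, best_score = (), -1.0
    n = len(scores)
    for x in product((0, 1), repeat=n):
        chosen = [i for i in range(n) if x[i]]
        if sum(lengths[i] for i in chosen) > max_len:
            continue  # summary length limit violated
        picked = [clusters[i] for i in chosen]
        if len(picked) != len(set(picked)):
            continue  # more than one tweet from the same cluster
        score = sum(scores[i] for i in chosen)
        if score > best_score:
            best_score, best = score, tuple(chosen)
    return best
```

The per-cluster constraint is what enforces diversity in the summary: two highly scored but near-duplicate tweets share a cluster, so at most one of them can be selected.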
A framework for technology-assisted sensitivity review: using sensitivity classification to prioritise documents for review
More than a hundred countries implement freedom of information laws. In the UK, the Freedom of Information Act 2000 (FOIA) states that the government's documents must be made freely available, or opened, to the public. Moreover, all central UK government departments' documents that have a historic value, for example the minutes from significant meetings, must be transferred to The National Archives (TNA) within twenty years of the document's creation. However, government documents can contain sensitive information, such as personal information or information that would likely damage the international relations of the UK if it was opened to the public. Therefore, all government documents that are to be publicly archived must be sensitivity reviewed to identify and redact the sensitive information, or close the document until the information is no longer sensitive. Historically, government documents have been stored in a structured file-plan that can reliably inform a sensitivity reviewer about the subject-matter and the likely sensitivities in the documents. However, the lack of structure in digital document collections and the volume of digital documents that are to be sensitivity reviewed mean that the traditional manual sensitivity review process is not practical for digital sensitivity review.
In this thesis, we argue that the automatic classification of documents that contain sensitive information, sensitivity classification, can be deployed to assist government departments and human reviewers to sensitivity review born-digital government documents. However, classifying sensitive information is a complex task, since sensitivity is context-dependent. For example, identifying whether information is sensitive can require a human to judge the likely effect of releasing the information into the public domain. Moreover, sensitivity is not necessarily topic-oriented, i.e., it is usually dependent on a combination of what is being said and about whom. Furthermore, the vocabulary and entities that are associated with particular types of sensitive information, e.g., confidential information, can vary greatly between different collections.
We propose to address sensitivity classification as a text classification task. Moreover, through a thorough empirical evaluation, we show that text classification is effective for sensitivity classification and can be improved by identifying the vocabulary, syntactic and semantic document features that are reliable indicators of sensitive or non-sensitive text. Furthermore, we propose to reduce the number of documents that have to be reviewed to learn an effective sensitivity classifier through an active learning strategy in which a sensitivity reviewer redacts any sensitive text in a document as they review it, to construct a representation of the sensitivities in a collection.
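The active learning strategy above can be sketched minimally as follows. This is not the thesis's classifier: the vocabulary-overlap scorer and the uncertainty-sampling selection criterion are simplifying assumptions chosen for illustration, standing in for the richer document features and prioritisation used in the work.

```python
def sensitivity_score(doc_words, sensitive_vocab):
    """Fraction of a document's words found in the learned sensitive
    vocabulary -- a stand-in for a real sensitivity classifier."""
    return sum(w in sensitive_vocab for w in doc_words) / len(doc_words)

def active_review(pool, redactions, rounds):
    """Pool-based active learning sketch: each round reviews the document
    whose score is closest to 0.5 (most uncertain) and folds the reviewer's
    redacted terms into the sensitive vocabulary.
    pool maps doc id -> word list; redactions maps doc id -> redacted terms."""
    vocab, reviewed = set(), []
    for _ in range(min(rounds, len(pool))):
        pick = min((d for d in pool if d not in reviewed),
                   key=lambda d: abs(sensitivity_score(pool[d], vocab) - 0.5))
        vocab |= redactions.get(pick, set())
        reviewed.append(pick)
    return vocab, reviewed
```

The key point carried over from the text is that the redactions themselves, a by-product of normal reviewing, are what incrementally build the representation of the collection's sensitivities.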
With this in mind, we propose a novel framework for technology-assisted sensitivity review that can prioritise the most appropriate documents to be reviewed at specific stages of the review process. Furthermore, our framework can provide the reviewers with useful information to assist them in making their reviewing decisions. Our framework consists of four components, namely the Document Representation, Document Prioritisation, Feedback Integration and Learned Predictions components, that can be instantiated to learn from the reviewers' feedback about the sensitivities in a collection or provide assistance to reviewers at different stages of the review. In particular, firstly, the Document Representation component encodes the document features that can be reliable indicators of the sensitivities in a collection. Secondly, the Document Prioritisation component identifies the documents that should be prioritised for review at a particular stage of the reviewing process, for example to provide the sensitivity classifier with information about the sensitivities in the collection or to focus the available reviewing resources on the documents that are the most likely to be released to the public. Thirdly, the Feedback Integration component integrates explicit feedback from a reviewer to construct a representation of the sensitivities in a collection and identify the features of a reviewer's interactions with the framework that indicate the amount of time that is required to sensitivity review a specific document. Finally, the Learned Predictions component combines the information that has been generated by the other three components and, as the final step in each iteration of the sensitivity review process, the Learned Predictions component is responsible for making accurate sensitivity classification and expected reviewing time predictions for the documents that have not yet been sensitivity reviewed.
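One iteration of the four-component framework might be skeletonised as below. The component interfaces are hypothetical (the abstract does not specify them); each is passed in as a callable so the skeleton only fixes the order of the steps, not their implementations.

```python
def review_iteration(documents, state, n_batch,
                     represent, prioritise, integrate, predict):
    """One pass of the four-component loop: represent the unreviewed
    documents, prioritise a batch for human review, integrate the
    reviewer feedback, then predict sensitivity and expected reviewing
    time for the documents that remain unreviewed."""
    features = {d: represent(d) for d in documents if d not in state["reviewed"]}
    batch = prioritise(features, n_batch)
    for doc in batch:
        state = integrate(state, doc)  # e.g. record the reviewer's redactions
    predictions = {d: predict(features[d], state)
                   for d in features if d not in batch}
    return state, predictions
```

Making the predictions the final step of each iteration mirrors the framework's design: the Learned Predictions component always sees the latest feedback integrated during that iteration.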
In this thesis, we identify two realistic digital sensitivity review scenarios as user models and conduct two user studies to evaluate the effectiveness of our proposed framework for assisting digital sensitivity review. Firstly, in the limited review user model, which addresses a scenario in which there are insufficient reviewing resources available to sensitivity review all of the documents in a collection, we show that our proposed framework can increase the number of documents that can be reviewed and released to the public with the available reviewing resources. Secondly, in the exhaustive review user model, which addresses a scenario in which all of the documents in a collection will be manually sensitivity reviewed, we show that providing the reviewers with useful information about the documents in the collection that contain sensitive information can increase the reviewers' accuracy, reviewing speed and agreement.
This is the first thesis to investigate automatically classifying FOIA sensitive information to assist digital sensitivity review. The central contributions of this thesis are our proposed framework for technology-assisted sensitivity review and our sensitivity classification approaches. Our contributions are validated using a collection of government documents that are sensitivity reviewed by expert sensitivity reviewers to identify two FOIA sensitivities, namely international relations and personal information. The thesis draws insights from a thorough evaluation and analysis of our proposed framework and sensitivity classifier. Our results demonstrate that our proposed framework is a viable technology for assisting digital sensitivity review.