6,193 research outputs found
Explicit diversification of event aspects for temporal summarization
During major events, such as emergencies and disasters, a large volume of information is reported on newswire and social media platforms. Temporal summarization (TS) approaches are used to automatically produce concise overviews of such events by extracting text snippets from related articles over time. Current TS approaches rely on a combination of event relevance and textual novelty for snippet selection. However, for events that span multiple days, textual novelty is often a poor criterion for selecting snippets, since many snippets are textually unique but semantically redundant or non-informative. In this article, we propose a framework for the diversification of snippets using explicit event aspects, building on recent work in search result diversification. In particular, we first propose two techniques to identify explicit aspects that a user might want to see covered in a summary for different types of events. We then extend a state-of-the-art explicit diversification framework to maximize the coverage of these aspects when selecting summary snippets for unseen events. Through experimentation over the TREC TS 2013, 2014, and 2015 datasets, we show that explicit diversification for temporal summarization significantly outperforms classical novelty-based diversification, as the use of explicit event aspects reduces the number of redundant and off-topic snippets returned, while also increasing summary timeliness.
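The aspect-coverage idea in this abstract can be illustrated with a small greedy selector. This is a minimal sketch in the style of explicit search result diversification, not the authors' implementation: the abstract does not name the framework it extends, and the function name `diversify`, the trade-off parameter `lam`, and the dictionary-based inputs are all assumptions made for illustration.

```python
def diversify(candidates, relevance, aspect_weights, coverage, k, lam=0.5):
    """Greedily pick up to k snippets, balancing relevance and aspect coverage.

    candidates:     list of snippet ids (hypothetical input format)
    relevance:      dict snippet -> relevance to the event, in [0, 1]
    aspect_weights: dict aspect -> importance of the aspect, in [0, 1]
    coverage:       dict (snippet, aspect) -> how well the snippet covers it
    lam:            assumed trade-off; 0 = relevance only, 1 = diversity only
    """
    selected = []
    remaining = list(candidates)
    # For each aspect, the probability that no selected snippet covers it yet.
    not_covered = {aspect: 1.0 for aspect in aspect_weights}

    while remaining and len(selected) < k:
        def score(snippet):
            # Reward snippets that cover aspects the summary still lacks.
            diversity = sum(
                weight * coverage.get((snippet, aspect), 0.0) * not_covered[aspect]
                for aspect, weight in aspect_weights.items()
            )
            return (1.0 - lam) * relevance[snippet] + lam * diversity

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        # Aspects covered by the chosen snippet count for less from now on.
        for aspect in aspect_weights:
            not_covered[aspect] *= 1.0 - coverage.get((best, aspect), 0.0)

    return selected


# Toy data: two snippets about casualties, one about the emergency response.
relevance = {"s1": 0.9, "s2": 0.85, "s3": 0.5}
aspects = {"casualties": 0.5, "response": 0.5}
coverage = {("s1", "casualties"): 1.0,
            ("s2", "casualties"): 1.0,
            ("s3", "response"): 1.0}

summary = diversify(["s1", "s2", "s3"], relevance, aspects, coverage, k=2)
```

With these toy values, the second pick is the less relevant snippet `s3`, because the casualties aspect is already covered by `s1`. Setting `lam=0` reduces the selector to plain relevance ranking.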
Alexandria: Extensible Framework for Rapid Exploration of Social Media
The Alexandria system under development at IBM Research provides an
extensible framework and platform for supporting a variety of big-data
analytics and visualizations. The system is currently focused on enabling rapid
exploration of text-based social media data. The system provides tools to help
with constructing "domain models" (i.e., families of keywords and extractors to
enable focus on tweets and other social media documents relevant to a project),
to rapidly extract and segment the relevant social media and its authors, to
apply further analytics (such as finding trends and anomalous terms), and to
visualize the results. The system architecture is centered around a variety
of REST-based service APIs to enable flexible orchestration of the system
capabilities; these are especially useful to support knowledge-worker driven
iterative exploration of social phenomena. The architecture also enables rapid
integration of Alexandria capabilities with other social media analytics
systems, as has been demonstrated through an integration with IBM Research's
SystemG. This paper describes a prototypical usage scenario for Alexandria,
along with the architecture and key underlying analytics.
An improved system for sentence-level novelty detection in textual streams
Novelty detection in news events has long been a difficult problem. A number of models have performed well on specific data streams, but certain issues are far from solved, particularly in large data streams from the WWW, where the unpredictability of new terms requires adaptation in the vector space model. We present a novel event detection system based on incremental Term Frequency-Inverse Document Frequency (TF-IDF) weighting combined with Locality Sensitive Hashing (LSH). Our system adapts efficiently and effectively to new terms appearing in the data streams through continual updates to the vector space model. In terms of miss probability, our proposed novelty detection framework outperforms a recognised baseline system by approximately 16% when evaluated on a benchmark dataset from Google News.
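The combination of incremental TF-IDF and LSH described in this abstract can be sketched as follows. This is an illustrative toy rather than the paper's system: the class name `NoveltyDetector`, the random-hyperplane hashing scheme, the smoothed IDF formula, and the whitespace tokenizer are all assumptions, since the abstract gives no implementation details.

```python
import math
import random
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


class NoveltyDetector:
    """Toy novelty scorer: incremental TF-IDF plus random-hyperplane LSH."""

    def __init__(self, num_bits=8, seed=0):
        self.num_bits = num_bits
        self.seed = seed
        self.num_docs = 0
        self.doc_freq = Counter()         # term -> number of documents seen
        self.buckets = defaultdict(list)  # signature -> stored vectors
        self._planes = {}                 # term -> hyperplane components

    def _plane_components(self, term):
        # Hyperplane components are created lazily per term, so previously
        # unseen terms can be hashed without rebuilding anything.
        if term not in self._planes:
            rng = random.Random(f"{self.seed}:{term}")
            self._planes[term] = [rng.gauss(0.0, 1.0)
                                  for _ in range(self.num_bits)]
        return self._planes[term]

    def _signature(self, vec):
        # One bit per hyperplane: the sign of the projection onto it.
        bits = []
        for i in range(self.num_bits):
            dot = sum(w * self._plane_components(t)[i] for t, w in vec.items())
            bits.append(dot >= 0.0)
        return tuple(bits)

    def score(self, text):
        """Return a novelty score in [0, 1]; 1 means no similar document."""
        tokens = text.lower().split()
        # Update corpus statistics first, so IDF reflects the stream so far.
        self.num_docs += 1
        self.doc_freq.update(set(tokens))
        vec = {
            t: c * (math.log((1 + self.num_docs) / (1 + self.doc_freq[t])) + 1.0)
            for t, c in Counter(tokens).items()
        }
        sig = self._signature(vec)
        # Compare only against documents that share the same LSH bucket.
        best = max((cosine(vec, old) for old in self.buckets[sig]), default=0.0)
        self.buckets[sig].append(vec)
        return max(0.0, 1.0 - best)


detector = NoveltyDetector()
first = detector.score("earthquake hits city")
repeat = detector.score("earthquake hits city")
other = detector.score("football final tonight")
```

In this sketch the first sentence and the unrelated sentence both score as fully novel, while the exact repeat scores as not novel. Stored vectors keep the IDF weights from their arrival time, a simplification that a production system would likely revisit.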
PDF-Malware Detection: A Survey and Taxonomy of Current Techniques
Portable Document Format, more commonly known as PDF, has become, in the last 20 years, a standard for document exchange and dissemination due to its portable nature and widespread adoption. The flexibility and power of this format are leveraged not only by benign users but also by hackers, who have been working to exploit various types of vulnerabilities, overcome security restrictions, and turn the PDF format into one of the leading vectors for spreading malicious code. Analyzing the content of malicious PDF files to extract the main features that characterize the malware's identity and behavior is a fundamental task for modern threat intelligence platforms that need to learn how to automatically identify new attacks. This paper surveys the state of the art in systems for the detection of malicious PDF files and organizes them in a taxonomy that separately considers the approaches used and the data analyzed to detect the presence of malicious code. © Springer International Publishing AG, part of Springer Nature 2018
An Automatic Intelligent System for Document Processing and Fruition
With the increasing number of documents available on-line, the need for intelligent
digital libraries that automate document processing tasks and suitably
organize and make available the documents, so as to provide personalized and focused access,
becomes more and more pressing. This paper proposes an integrated system that merges
intelligent modules covering all the phases involved in a document lifecycle, from acquisition,
to processing, to information extraction, to personalized fruition for final users. The role and
possible cooperation of Machine Learning and Data Mining techniques in the system are
highlighted and discussed, along with their importance in providing effective support to both the
building and the fruition of the Digital Library and the underlying knowledge base.