Topic Detection and Tracking in Personal Search History
This thesis describes a system for detecting and tracking topics in personal search history. In particular, we developed a time-tracking tool that helps users analyze how they spend their time and discover their activity patterns. The user specifies topics of interest to monitor, each with a keyword description. The system then tracks the log and the time spent on each document, and produces a time graph showing how much time has been spent on each monitored topic. The system can also detect new topics and potentially recommend relevant information about them to the user. This work has been integrated with the UCAIR Toolbar, a client-side agent. Given the limited resources on the client side, we designed an efficient incremental algorithm for topic tracking and detection. Various unsupervised learning approaches have been considered to improve the accuracy of categorizing the user log into appropriate categories. Experiments show that our tool is effective in categorizing documents into existing categories and in detecting useful new categories. Moreover, the quality of categorization improves over time as more log data becomes available.
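The tracking idea in this abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm, which the abstract does not specify: it assumes each topic is a bag of keywords, assigns each visited document to the topic with the largest keyword overlap, and accumulates time incrementally, one document at a time. The class name `TopicTracker`, the `min_overlap` threshold, and the sample topics are all hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().split())


class TopicTracker:
    """Hypothetical incremental tracker: one update per visited document."""

    def __init__(self, topics, min_overlap=1):
        # topics maps a topic name to its keyword description (assumed format).
        self.topics = {name: tokenize(desc) for name, desc in topics.items()}
        self.min_overlap = min_overlap
        self.seconds = defaultdict(float)  # time accumulated per topic
        self.unassigned = []               # candidates for new-topic detection

    def update(self, doc_text, seconds_spent):
        """Assign one document to its best-matching topic and add its time."""
        words = tokenize(doc_text)
        best, best_score = None, 0
        for name, keywords in self.topics.items():
            score = len(words & keywords)
            if score > best_score:
                best, best_score = name, score
        if best is not None and best_score >= self.min_overlap:
            self.seconds[best] += seconds_spent
            return best
        # No monitored topic matches: keep the document for later clustering.
        self.unassigned.append(doc_text)
        return None


tracker = TopicTracker({
    "travel": "flight hotel booking",
    "python": "python pandas numpy",
})
first = tracker.update("cheap flight and hotel deals", 120)
second = tracker.update("pandas merge tutorial python", 300)
third = tracker.update("sourdough bread recipe", 60)
```

Because each update touches only the per-topic totals, the cost per document does not grow with the length of the log, which is the property a client-side agent would need. Documents left in `unassigned` are where a new-topic detector would look.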
Discovering conversational topics and emotions associated with Demonetization tweets in India
Social media platforms contain a great wealth of information, which gives us
opportunities to explore hidden patterns or unknown correlations, and understand
people's satisfaction with what they are discussing. As one showcase, in this
paper, we summarize a data set of Twitter messages related to the recent
demonetization of all Rs. 500 and Rs. 1000 notes in India and explore insights
from Twitter's data. Our proposed system automatically extracts the popular
latent topics in conversations regarding demonetization discussed on Twitter
via the Latent Dirichlet Allocation (LDA) based topic model and also identifies
the correlated topics across different categories. Additionally, it
discovers people's opinions expressed through their tweets related to the event
under consideration via the emotion analyzer. The system also employs an
intuitive and informative visualization to show the uncovered insight.
Furthermore, we use an evaluation measure, Normalized Mutual Information (NMI),
to select the best LDA models. The obtained LDA results show that the tool can
be effectively used to extract discussion topics and summarize them for further
manual analysis.
Comment: 6 pages, 11 figures. arXiv admin note: substantial text overlap with
arXiv:1608.02519 by other authors; text overlap with arXiv:1705.08094 by
other author
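As a rough illustration of the evaluation measure named in this abstract, the sketch below computes Normalized Mutual Information between two labelings and uses it to pick among candidate models. The abstract does not say how NMI is applied, so the setup is an assumption: each tweet's dominant LDA topic is compared with a reference category, and the model with the highest score wins. The normalization by the arithmetic mean of the two entropies is one common convention, and the `reference` and `candidates` data are hypothetical.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized Mutual Information, normalized by the mean entropy."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # I(A;B) = sum over (a, b) of p(a,b) * log(p(a,b) / (p(a) * p(b)))
    mutual_info = sum(
        (c / n) * math.log((c * n) / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    mean_entropy = (entropy(count_a) + entropy(count_b)) / 2
    # Both labelings are a single cluster: treat them as identical.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


# Hypothetical reference categories and dominant-topic assignments per model.
reference = ["econ", "econ", "politics", "politics"]
candidates = {
    "k=2": [0, 0, 1, 1],
    "k=3": [0, 1, 2, 2],
}
scores = {name: nmi(reference, topics) for name, topics in candidates.items()}
best_model = max(scores, key=scores.get)
```

NMI ignores how the clusters are named, so a model that recovers the reference grouping under different topic ids still scores 1.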
Automated Big Text Security Classification
In recent years, traditional cybersecurity safeguards have proven ineffective
against insider threats. Famous cases of sensitive information leaks caused by
insiders, including the WikiLeaks release of diplomatic cables and the Edward
Snowden incident, have greatly harmed the U.S. government's relationship with
other governments and with its own citizens. Data Leak Prevention (DLP) is a
solution for detecting and preventing information leaks from within an
organization's network. However, state-of-the-art DLP detection models are only
able to detect very limited types of sensitive information, and research in the
field has been hindered by the lack of available sensitive texts. Many
researchers have focused on document-based detection with artificially labeled
"confidential documents" for which security labels are assigned to the entire
document, when in reality only a portion of the document is sensitive. This
type of whole-document based security labeling increases the chances of
preventing authorized users from accessing non-sensitive information within
sensitive documents. In this paper, we introduce Automated Classification
Enabled by Security Similarity (ACESS), a new and innovative detection model
that penetrates the complexity of big text security classification/detection.
To analyze the ACESS system, we constructed a novel dataset, containing
formerly classified paragraphs from diplomatic cables made public by the
WikiLeaks organization. To our knowledge this paper is the first to analyze a
dataset that contains actual formerly sensitive information annotated at
paragraph granularity.
Comment: Pre-print of Best Paper Award IEEE Intelligence and Security
Informatics (ISI) 2016 Manuscript
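The access problem this abstract attributes to whole-document labeling can be shown with a small sketch. It does not implement ACESS, whose internals the abstract does not describe; it only contrasts what a reader sees under the two labeling granularities. The level names, the `LEVELS` ordering, and the sample paragraphs are hypothetical.

```python
# Hypothetical ordering of security levels, lowest to highest.
LEVELS = {"unclassified": 0, "confidential": 1, "secret": 2}


def visible_whole_document(paragraphs, clearance):
    """Whole-document labeling: the document takes its most sensitive label."""
    doc_level = max((LEVELS[label] for _, label in paragraphs), default=0)
    if LEVELS[clearance] >= doc_level:
        return [text for text, _ in paragraphs]
    return []  # the reader is locked out of every paragraph


def visible_per_paragraph(paragraphs, clearance):
    """Paragraph-level labeling: each paragraph is checked on its own."""
    return [
        text for text, label in paragraphs
        if LEVELS[label] <= LEVELS[clearance]
    ]


# Hypothetical document: (paragraph text, security label) pairs.
document = [
    ("Meeting scheduled for Tuesday.", "unclassified"),
    ("Source identity and location.", "secret"),
    ("Press summary attached.", "unclassified"),
]
coarse = visible_whole_document(document, "confidential")
fine = visible_per_paragraph(document, "confidential")
```

With a single secret paragraph, the coarse scheme hides the whole document from a reader cleared only to the confidential level, while the paragraph-level scheme still releases the two unclassified paragraphs.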