97,353 research outputs found
SOTXTSTREAM: Density-based self-organizing clustering of text streams
A streaming data clustering algorithm is presented building upon the density-based selforganizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets
A Semantic Graph-Based Approach for Mining Common Topics From Multiple Asynchronous Text Streams
In the age of Web 2.0, a substantial amount of unstructured
content are distributed through multiple text streams in an
asynchronous fashion, which makes it increasingly difficult
to glean and distill useful information. An effective way to
explore the information in text streams is topic modelling,
which can further facilitate other applications such as search,
information browsing, and pattern mining. In this paper, we
propose a semantic graph based topic modelling approach
for structuring asynchronous text streams. Our model in-
tegrates topic mining and time synchronization, two core
modules for addressing the problem, into a unified model.
Specifically, for handling the lexical gap issues, we use global
semantic graphs of each timestamp for capturing the hid-
den interaction among entities from all the text streams.
For dealing with the sources asynchronism problem, local
semantic graphs are employed to discover similar topics of
different entities that can be potentially separated by time
gaps. Our experiment on two real-world datasets shows that
the proposed model significantly outperforms the existing
ones
Time Aware Knowledge Extraction for Microblog Summarization on Twitter
Microblogging services like Twitter and Facebook collect millions of user
generated content every moment about trending news, occurring events, and so
on. Nevertheless, it is really a nightmare to find information of interest
through the huge amount of available posts that are often noise and redundant.
In general, social media analytics services have caught increasing attention
from both side research and industry. Specifically, the dynamic context of
microblogging requires to manage not only meaning of information but also the
evolution of knowledge over the timeline. This work defines Time Aware
Knowledge Extraction (briefly TAKE) methodology that relies on temporal
extension of Fuzzy Formal Concept Analysis. In particular, a microblog
summarization algorithm has been defined filtering the concepts organized by
TAKE in a time-dependent hierarchy. The algorithm addresses topic-based
summarization on Twitter. Besides considering the timing of the concepts,
another distinguish feature of the proposed microblog summarization framework
is the possibility to have more or less detailed summary, according to the
user's needs, with good levels of quality and completeness as highlighted in
the experimental results.Comment: 33 pages, 10 figure
Growing Story Forest Online from Massive Breaking News
We describe our experience of implementing a news content organization system
at Tencent that discovers events from vast streams of breaking news and evolves
news story structures in an online fashion. Our real-world system has distinct
requirements in contrast to previous studies on topic detection and tracking
(TDT) and event timeline or graph generation, in that we 1) need to accurately
and quickly extract distinguishable events from massive streams of long text
documents that cover diverse topics and contain highly redundant information,
and 2) must develop the structures of event stories in an online manner,
without repeatedly restructuring previously formed stories, in order to
guarantee a consistent user viewing experience. In solving these challenges, we
propose Story Forest, a set of online schemes that automatically clusters
streaming documents into events, while connecting related events in growing
trees to tell evolving stories. We conducted extensive evaluation based on 60
GB of real-world Chinese news data, although our ideas are not
language-dependent and can easily be extended to other languages, through
detailed pilot user experience studies. The results demonstrate the superior
capability of Story Forest to accurately identify events and organize news text
into a logical structure that is appealing to human readers, compared to
multiple existing algorithm frameworks.Comment: Accepted by CIKM 2017, 9 page
Automated speech and audio analysis for semantic access to multimedia
The deployment and integration of audio processing tools can enhance the semantic annotation of multimedia content, and as a consequence, improve the effectiveness of conceptual access tools. This paper overviews the various ways in which automatic speech and audio analysis can contribute to increased granularity of automatically extracted metadata. A number of techniques will be presented, including the alignment of speech and text resources, large vocabulary speech recognition, key word spotting and speaker classification. The applicability of techniques will be discussed from a media crossing perspective. The added value of the techniques and their potential contribution to the content value chain will be illustrated by the description of two (complementary) demonstrators for browsing broadcast news archives
- âŚ