Real-time Event Detection on Social Data Streams
Social networks are quickly becoming the primary medium for discussing what
is happening around real-world events. The information that is generated on
social platforms like Twitter can produce rich data streams for immediate
insights into ongoing matters and the conversations around them. To tackle the
problem of event detection, we model events as a list of clusters of trending
entities over time. We describe a real-time system for discovering events that
is modular in design and novel in scale and speed: it applies clustering on a
large stream with millions of entities per minute and produces a dynamically
updated set of events. In order to assess clustering methodologies, we build an
evaluation dataset derived from a snapshot of the full Twitter Firehose and
propose novel metrics for measuring clustering quality. Through experiments and
system profiling, we highlight key results from the offline and online
pipelines. Finally, we visualize a high profile event on Twitter to show the
importance of modeling the evolution of events, especially those detected from
social data streams.
Comment: Accepted as a full paper at KDD 2019 on April 29, 201
Deep Clustering Survival Machines with Interpretable Expert Distributions
Conventional survival analysis methods are typically ineffective at
characterizing heterogeneity in the population, even though such information can be used
to assist predictive modeling. In this study, we propose a hybrid survival
analysis method, referred to as deep clustering survival machines, that
combines discriminative and generative mechanisms. As in mixture
models, we assume that the timing information of survival data is generatively
described by a mixture of a fixed number of parametric distributions, i.e.,
expert distributions. We learn weights of the expert distributions for
individual instances according to their features discriminatively such that
each instance's survival information can be characterized by a weighted
combination of the learned constant expert distributions. This method also
facilitates interpretable subgrouping/clustering of all instances according to
their associated expert distributions. Extensive experiments on both real and
synthetic datasets have demonstrated that the method is capable of obtaining
promising clustering results and competitive time-to-event prediction
performance.
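The mixture-of-experts idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Weibull form of the expert distributions, the softmax gating, and every name and parameter value below (`EXPERTS`, `gate_scores`, and so on) are assumptions made purely for illustration. Each instance receives its own weights over a shared, fixed set of expert distributions, and its most heavily weighted expert serves as its cluster.

```python
import math

# Hypothetical fixed experts: (shape, scale) pairs of Weibull distributions.
# The abstract only says "parametric distributions"; Weibull is an assumption.
EXPERTS = [(1.5, 2.0), (0.8, 10.0)]


def softmax(scores):
    """Turn arbitrary real-valued scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weibull_survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t / scale) ** shape), t >= 0."""
    return math.exp(-((t / scale) ** shape))


def mixture_survival(t, gate_scores, experts=EXPERTS):
    """Instance-level survival as a weighted combination of expert curves.

    gate_scores stands in for the output of a discriminative model applied
    to the instance's features (one score per expert).
    """
    weights = softmax(gate_scores)
    return sum(w * weibull_survival(t, k, lam)
               for w, (k, lam) in zip(weights, experts))


def assign_cluster(gate_scores):
    """Subgroup an instance by its most heavily weighted expert."""
    weights = softmax(gate_scores)
    return max(range(len(weights)), key=weights.__getitem__)
```

Because the experts are shared and constant across instances, only the weights vary per instance, which is what makes the resulting subgroups directly comparable.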
A framework for clustering and adaptive topic tracking on evolving text and social media data streams.
Recent advances in and widespread usage of online web services and social media platforms, coupled with ubiquitous low-cost devices, mobile technologies, and the increasing capacity of lower-cost storage, have led to a proliferation of Big Data, ranging from news, e-commerce clickstreams, and online business transactions to continuous event logs and social media expressions. These large amounts of online data, often referred to as data streams because they are generated at extremely high throughput or velocity, can make conventional and classical data analytics methodologies obsolete. For these reasons, the management and analysis of data streams have been researched extensively in recent years. The special case of social media Big Data brings additional challenges, particularly because of the unstructured nature of the data, specifically free text. One classical approach to mining text data has been Topic Modeling. Topic Models are statistical models that can be used for discovering the abstract "topics" that may occur in a corpus of documents. Topic models have emerged as a powerful technique in machine learning and data science, providing a great balance between simplicity and complexity. They also provide sophisticated insight without the need for real natural language understanding. However, they have not been designed to cope with the type of text data that is abundant on social media platforms, but rather for traditional medium-size corpora consisting of longer documents, adhering to a specific language and typically spanning a stable set of topics. Unlike traditional document corpora, social media messages tend to be very short, sparse, and noisy, and do not adhere to a standard vocabulary, linguistic patterns, or stable topic distributions.
They are also generated at a high velocity that imposes high demands on topic modeling, and their evolving or dynamic nature makes any set of results from topic modeling quickly become stale in the face of changes in the textual content and topics discussed within social media streams. In this dissertation, we propose an integrated topic modeling framework built on top of an existing stream-clustering framework called Stream-Dashboard, which can extract, isolate, and track topics over any given time period. In this new framework, Stream-Dashboard first clusters the data stream points into homogeneous groups. Then data from each group is passed to the topic modeling framework, which extracts finer topics from the group. The proposed framework tracks the evolution of the clusters over time to detect milestones corresponding to changes in topic evolution, and to trigger an adaptation of the learned groups and topics at each milestone. The proposed approach to topic modeling is different from a generic Topic Modeling approach because it works in a compartmentalized fashion, where the input document stream is split into distinct compartments, and Topic Modeling is applied on each compartment separately. Furthermore, we propose extensions to existing topic modeling and stream clustering methods, including: an adaptive query reformulation approach to help focus the topic discovery over time; a topic modeling extension with adaptive hyper-parameters and an infinite vocabulary; and an adaptive stream clustering algorithm incorporating the automated estimation of dynamic, cluster-specific temporal scales for adaptive forgetting, to help facilitate clustering in a fast-evolving data stream.
Our experimental results show that the proposed adaptive forgetting clustering algorithm can mine better quality clusters; that our proposed compartmentalized framework is able to mine topics of better quality compared to competitive baselines; and that the proposed framework can automatically adapt to focus on changing topics using the proposed query reformulation strategy.
Multivariate sensor data analysis for oil refineries and multi-mode identification of system behavior in real-time
Large-scale oil refineries are equipped with mission-critical heavy machinery (boilers, engines, turbines, and so on) and are continuously monitored by thousands of sensors for process efficiency, environmental safety, and predictive maintenance purposes. However, sensors themselves are also prone to errors and failure. The quality of data received from these sensors should be verified before being used in system modeling. There is a need for reliable methods and systems that can provide data validation and reconciliation in real time with high accuracy. In this paper, we develop a novel method for real-time data validation, gross error detection, and classification over multivariate sensor data streams. The validated, high-quality data obtained from these processes is used for pattern analysis and modeling of industrial plants. We obtain sensor data from the power and petrochemical plants of an oil refinery and analyze them using various time-series modeling and data mining techniques that we integrate into a complex event processing engine. Next, we study the computational performance implications of the proposed methods and uncover regimes where they are sustainable over fast streams of sensor data. Finally, we detect shifts among steady states of the data, which represent the systems' multiple operating modes, and use the DBSCAN clustering algorithm to identify the time when a model reconstruction is required.
Turkish Petroleum Refineries Inc. (TUPRAS) RD Center
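As a rough illustration of the last step described in this abstract, the sketch below clusters multivariate sensor readings with a small, self-contained DBSCAN and reports the time indices at which the stream moves from one cluster (treated here as an operating mode) to another. It is not the authors' pipeline: the `eps` and `min_pts` values, the helper `mode_shifts`, the sample `READINGS`, and the interpretation of a cluster change as the moment to rebuild a model are all assumptions for illustration.

```python
import math

NOISE = -1  # label for readings that belong to no dense region


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


def mode_shifts(labels):
    """Time indices where the (non-noise) cluster label changes."""
    shifts, current = [], None
    for idx, label in enumerate(labels):
        if label == NOISE:
            continue
        if current is not None and label != current:
            shifts.append(idx)
        current = label
    return shifts


# Hypothetical two-sensor readings in time order: two steady states, one outlier.
READINGS = [(1.0, 10.0), (1.1, 10.1), (0.9, 9.9), (1.0, 10.2),
            (5.0, 20.0), (5.1, 20.1), (4.9, 19.9), (5.0, 20.2),
            (30.0, 0.0)]

if __name__ == "__main__":
    labels = dbscan(READINGS, eps=0.5, min_pts=3)
    print(labels, mode_shifts(labels))
```

Noise points are skipped when looking for shifts, so a single faulty reading does not register as a change of operating mode.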
Capturing Evolution Genes for Time Series Data
The modeling of time series is becoming increasingly critical in a wide
variety of applications. Overall, data evolves by following different patterns,
which are generally caused by different user behaviors. Given a time series, we
define the evolution gene to capture the latent user behaviors and to describe
how the behaviors lead to the generation of time series. In particular, we
propose a uniform framework that recognizes different evolution genes of
segments by learning a classifier, and adopt an adversarial generator to
implement the evolution gene by estimating the segments' distribution.
Experimental results based on a synthetic dataset and five real-world datasets
show that our approach not only achieves good prediction results (e.g.,
+10.56% on average in terms of F1), but is also able to provide explanations of
the results.
Comment: a preprint version. arXiv admin note: text overlap with
arXiv:1703.10155 by other author
The Block Point Process Model for Continuous-Time Event-Based Dynamic Networks
We consider the problem of analyzing timestamped relational events between a
set of entities, such as messages between users of an on-line social network.
Such data are often analyzed using static or discrete-time network models,
which discard a significant amount of information by aggregating events over
time to form network snapshots. In this paper, we introduce a block point
process model (BPPM) for continuous-time event-based dynamic networks. The BPPM
is inspired by the well-known stochastic block model (SBM) for static networks.
We show that networks generated by the BPPM follow an SBM in the limit of a
growing number of nodes. We use this property to develop principled and
efficient local search and variational inference procedures initialized by
regularized spectral clustering. We fit BPPMs with exponential Hawkes processes
to analyze several real network data sets, including a Facebook wall post
network with over 3,500 nodes and 130,000 events.
Comment: To appear at The Web Conference 201
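For readers unfamiliar with the exponential Hawkes processes mentioned in this abstract, the following sketch evaluates the standard conditional intensity with an exponential kernel, lambda(t) = mu + sum over past events s < t of alpha * exp(-beta * (t - s)). The function name and parameter values are illustrative assumptions. The sketch covers a single event stream only and does not implement the block structure or the inference procedures of the BPPM.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity of an exponential-kernel Hawkes process at time t.

    mu    : baseline rate of events
    alpha : jump in intensity caused by each past event
    beta  : exponential decay rate of that jump
    Only events strictly before t contribute.
    """
    return mu + sum(alpha * math.exp(-beta * (t - s))
                    for s in event_times if s < t)


if __name__ == "__main__":
    # Hypothetical timestamps of events (e.g., wall posts) on one stream.
    events = [0.0, 0.4, 0.5]
    for t in (0.6, 1.0, 3.0):
        print(t, hawkes_intensity(t, events, mu=0.2, alpha=0.5, beta=1.0))
```

Each event temporarily raises the rate of further events, which is why such processes suit bursty interaction data like message exchanges.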