124 research outputs found
Anomaly Sequences Detection from Logs Based on Compression
Mining information from logs is an old and still active research topic. In
recent years, with the rapid emergence of cloud computing, log mining has
become increasingly important to industry. This paper focuses on one major
task of log mining, anomaly detection, and proposes a novel method for mining
abnormal sequences from large logs. Unlike previous anomaly detection systems
based on statistics, probabilities, and the Markov assumption, our approach
measures the strangeness of a sequence using compression. It first trains a
grammar of normal behaviors using grammar-based compression, then measures
the information quantity and density of questionable sequences according to
the increase in grammar length. We have applied our approach to mining
real bugs from fine-grained execution logs. We have also tested its ability on
intrusion detection using publicly available system call traces. The
experiments show that our method successfully selects the strange sequences
related to bugs or attacks.Comment: 7 pages, 5 figures, 6 tables
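The compression-based scoring idea can be sketched with a general-purpose compressor standing in for the paper's grammar-based compression (the abstract does not give the grammar algorithm, so `zlib` and the score definition below are assumptions for illustration):

```python
import zlib

def anomaly_score(train: bytes, seq: bytes) -> float:
    """Information density of `seq` relative to a model of normal behavior:
    the extra compressed length `seq` adds on top of the training log,
    normalized by its own length."""
    base = len(zlib.compress(train))
    joint = len(zlib.compress(train + seq))
    return (joint - base) / len(seq)

# Sequences whose patterns match the training log compress well (low score);
# unfamiliar sequences add more information (high score).
normal_log = b"open read write close " * 200
familiar = b"open read write close " * 5
strange = b"segv trap oops panic! " * 5
assert anomaly_score(normal_log, strange) > anomaly_score(normal_log, familiar)
```

The grammar-based variant would measure the growth of the learned grammar instead of the deflate stream, but the ranking principle is the same.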
Imbalanced Sentiment Classification Enhanced with Discourse Marker
Imbalanced data commonly exists in the real world, especially in
sentiment-related corpora, making it difficult to train a classifier to
distinguish latent sentiment in text data. We observe that humans often express
transitional emotion between two adjacent discourses with discourse markers
like "but", "though", "while", etc., and that the head discourse and the tail
discourse usually indicate opposite emotional tendencies. Based on this
observation, we propose a novel plug-and-play method, which first samples
discourses according to transitional discourse markers and then validates
their sentimental polarities with the help of a pretrained attention-based
model. Our method increases sample diversity and can serve as an upstream
preprocessing step in data augmentation. We conduct experiments on three
public sentiment datasets with several frequently used algorithms. Results
show that our method is consistently effective, even in highly imbalanced
scenarios, and can easily be integrated with oversampling methods to boost
performance on imbalanced sentiment classification.Comment: 12 pages, 1 figure
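The sampling step can be sketched as plain string processing. The marker list and the polarity-flip rule below are assumptions for illustration; the paper validates polarities with a pretrained attention-based model, which this sketch omits:

```python
TRANSITIONAL_MARKERS = ("but", "though", "while")  # illustrative subset

def sample_discourses(sentence: str, label: str):
    """Split a labeled sentence at a transitional marker, yielding a head
    discourse and a tail discourse with opposite emotional tendencies.
    The tail keeps the sentence label; the head gets the opposite one."""
    opposite = {"pos": "neg", "neg": "pos"}[label]
    words = sentence.split()
    for i, w in enumerate(words):
        if w in TRANSITIONAL_MARKERS and 0 < i < len(words) - 1:
            head = " ".join(words[:i])
            tail = " ".join(words[i + 1:])
            return [(head, opposite), (tail, label)]
    return [(sentence, label)]  # no marker: keep the sentence as-is

print(sample_discourses("the food was great but the service was slow", "neg"))
# → [('the food was great', 'pos'), ('the service was slow', 'neg')]
```

Each split thus turns one labeled sentence into two shorter, oppositely labeled samples, which is where the extra sample diversity comes from.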
Conditional BERT Contextual Augmentation
We propose a novel data augmentation method for labeled sentences called
conditional BERT contextual augmentation. Data augmentation methods are often
applied to prevent overfitting and improve generalization of deep neural
network models. Recently proposed contextual augmentation augments labeled
sentences by randomly replacing words with more varied substitutions predicted
by a language model. BERT demonstrates that a deep bidirectional language model
is more powerful than either a unidirectional language model or the shallow
concatenation of a forward and a backward model. We retrofit BERT to
conditional BERT by introducing a new conditional masked language
model\footnote{The term "conditional masked language model" appeared once in
the original BERT paper, where it indicates context-conditioning and is
equivalent to the term "masked language model". In our paper, "conditional
masked language model" indicates that we apply an extra label-conditional
constraint to the "masked language model".} task. The well-trained conditional
BERT can be applied to enhance contextual augmentation. Experiments on six
different text classification tasks show that our method can be easily applied
to both convolutional and recurrent neural network classifiers to obtain
notable improvements.Comment: 9 pages, 1 figure
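The label-conditioning mechanism can be illustrated at the input-preparation level: in a vanilla masked LM each token carries a segment id, and the change described here is to feed the sentence label through that channel instead. The tokenizer, label ids, and masking rate below are simplified stand-ins, not the real BERT vocabulary or training recipe:

```python
import random

MASK, LABELS = "[MASK]", {"negative": 0, "positive": 1}

def make_conditional_mlm_example(tokens, label, mask_prob=0.15, seed=0):
    """Build a (masked tokens, label ids, targets) triple: the label id is
    broadcast over the positions where vanilla BERT would put segment ids,
    so every masked-token prediction is conditioned on the sentence label."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)  # position not trained on
    label_ids = [LABELS[label]] * len(tokens)  # label replaces segment ids
    return masked, label_ids, targets

masked, label_ids, targets = make_conditional_mlm_example(
    "this movie is really awesome".split(), "positive", mask_prob=0.5)
```

For augmentation, the trained model then predicts label-compatible substitutions at the masked positions, so a positive sentence is never rewritten with negative words.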
ESA: Entity Summarization with Attention
Entity summarization aims at creating brief but informative descriptions of
entities from knowledge graphs. While previous work mostly focused on
traditional techniques such as clustering algorithms and graph models, we ask
how to apply deep learning methods to this task. In this paper we propose
ESA, a neural network with supervised attention mechanisms for entity
summarization. Specifically, we calculate attention weights for the facts in
each entity and rank the facts to generate reliable summaries. We explore
techniques to solve the difficult learning problems presented by ESA, and
demonstrate the effectiveness of our model in comparison with state-of-the-art
methods. Experimental results show that our model improves the quality of the
entity summaries in both F-measure and MAP.Comment: 12 pages, accepted in EYRE@CIKM'201
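The fact-ranking step can be sketched as softmax attention over fact embeddings. In the real model the scoring vector is learned under supervision; here it is a fixed array, and the embeddings are toy values:

```python
import numpy as np

def rank_facts(fact_embeddings: np.ndarray, query: np.ndarray, k: int):
    """Score each fact by dot-product attention against a query vector,
    normalize with softmax, and return the indices of the top-k facts."""
    scores = fact_embeddings @ query        # (num_facts,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax attention weights
    return list(np.argsort(-weights)[:k]), weights

# Three toy facts; facts 0 and 2 are close to the query, fact 1 is not.
facts = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])                # hypothetical entity query vector
top, w = rank_facts(facts, query, k=2)      # top == [0, 2]
```

Taking the top-k attention weights per entity directly yields the summary, which is why supervising the weights themselves (rather than only a downstream loss) is attractive.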
Reliable Diversity-Based Spatial Crowdsourcing by Moving Workers
With the rapid development of mobile devices and crowdsourcing platforms,
spatial crowdsourcing has attracted much attention from the database
community. Specifically, spatial crowdsourcing refers to sending a
location-based request to workers according to their positions. In this paper,
we consider an important spatial crowdsourcing problem, namely reliable
diversity-based spatial crowdsourcing (RDB-SC), in which spatial tasks (such as
taking videos/photos of a landmark or firework shows, and checking whether or
not parking spaces are available) are time-constrained, and workers are moving
in certain directions. Our RDB-SC problem is to assign workers to spatial
tasks such that the completion reliability and the spatial/temporal diversities
of spatial tasks are maximized. We prove that the RDB-SC problem is NP-hard and
intractable. Thus, we propose three effective approximation approaches,
including greedy, sampling, and divide-and-conquer algorithms. In order to
improve the efficiency, we also design an effective cost-model-based index,
which can dynamically maintain moving workers and spatial tasks with low cost,
and efficiently facilitate the retrieval of RDB-SC answers. Through extensive
experiments, we demonstrate the efficiency and effectiveness of our proposed
approaches over both real and synthetic data sets.Comment: 16 pages
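The greedy approximation can be sketched as assigning each worker to the task with the largest marginal gain in completion reliability. The reliability model (each worker independently succeeds with a known probability) and the probabilities below are assumptions for illustration, not the paper's exact formulation:

```python
def greedy_assign(workers, tasks, success_prob):
    """Greedy assignment: each worker goes to the task whose completion
    reliability it improves most, where a task's reliability is
    1 - prod(1 - p) over its assigned workers."""
    fail = {t: 1.0 for t in tasks}   # prob. that task t has no success yet
    assignment = {}
    for w in workers:
        # Marginal reliability gain of sending w to t is fail[t] * p(w, t).
        best = max(tasks, key=lambda t: fail[t] * success_prob[w, t])
        assignment[w] = best
        fail[best] *= 1.0 - success_prob[w, best]
    return assignment

probs = {("w1", "t1"): 0.9, ("w1", "t2"): 0.2,
         ("w2", "t1"): 0.8, ("w2", "t2"): 0.5}
plan = greedy_assign(["w1", "w2"], ["t1", "t2"], probs)  # {"w1": "t1", "w2": "t2"}
```

Note how the diminishing marginal gain sends w2 to t2 even though its own success probability is higher at t1; this is the mechanism that spreads workers across tasks.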
Communicating Is Crowdsourcing: Wi-Fi Indoor Localization with CSI-based Speed Estimation
Numerous indoor localization techniques have been proposed recently to meet
the intensive demand for location-based services, and Wi-Fi fingerprint-based
approaches are the most popular and inexpensive solutions. Among them, one of
the main trends is to incorporate the built-in sensors of smartphones and to
exploit crowdsourcing potentials. However, the noisy built-in sensors and the
multi-tasking limitations of the underlying OS often hinder the effectiveness
of these schemes. In this work, we propose a passive crowdsourcing CSI-based
indoor localization scheme, C2IL. Our scheme C2IL only requires the locating
device (e.g., a phone) to have an 802.11n wireless connection, and it does not
rely on inertial sensors, which exist only in some smartphones. C2IL is built
upon our innovative method to accurately estimate the moving distance purely
based on 802.11n Channel State Information (CSI). Our extensive evaluations
show that the moving distance estimation error of our scheme is within 3% of
the actual moving distance regardless of varying speeds and environments.
Relying on the accurate moving distance estimation as a constraint, we are
able to construct a more accurate mapping between RSS fingerprints and
locations. To address the challenges of collecting fingerprints, a
crowdsourcing-based scheme is designed to gradually establish the mapping and
populate the fingerprints. In C2IL, we design a trajectory clustering-based
localization algorithm to provide precise real-time indoor localization and
tracking. We developed and deployed a practical working system of C2IL in a
large office environment. Extensive evaluation results indicate that our
scheme C2IL provides accurate localization with an error within 2 m at the
80th percentile in a very complex indoor environment with minimal overhead.
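The role of the distance estimate can be sketched as simple dead reckoning: per-interval speeds (which the scheme derives from 802.11n CSI, a step not reproduced here) are integrated into a moving distance that then constrains fingerprint positions. The speed values below are made up for illustration:

```python
def moving_distance(speeds_mps, dt_s=0.1):
    """Integrate per-interval speed estimates (m/s) into a distance (m).
    In the real system the speeds come from CSI; here they are given."""
    return sum(v * dt_s for v in speeds_mps)

# A short walk: accelerate to ~1.4 m/s, hold, then stop.
speeds = [0.5, 1.0, 1.4, 1.4, 1.4, 1.0, 0.3]
dist = moving_distance(speeds)  # 0.7 m over 0.7 s
```

A per-distance error bound of 3% then translates directly into a positional constraint between consecutive fingerprint samples.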
"Mask and Infill" : Applying Masked Language Model to Sentiment Transfer
This paper focuses on the task of sentiment transfer on non-parallel text,
which modifies sentiment attributes (e.g., positive or negative) of sentences
while preserving their attribute-independent content. Due to the limited
capability of the RNN-based encoder-decoder structure to capture deep and long-range
dependencies among words, previous works can hardly generate satisfactory
sentences from scratch. When humans convert the sentiment attribute of a
sentence, a simple but effective approach is to only replace the original
sentimental tokens in the sentence with target sentimental expressions, instead
of building a new sentence from scratch. Such a process is very similar to the
task of Text Infilling or Cloze, which could be handled by a deep bidirectional
Masked Language Model (e.g., BERT). We therefore propose a two-step approach, "Mask and
Infill". In the mask step, we separate style from content by masking the
positions of sentimental tokens. In the infill step, we retrofit MLM to
Attribute Conditional MLM, to infill the masked positions by predicting words
or phrases conditioned on the context and target sentiment. We evaluate our
model on two review datasets with quantitative, qualitative, and human
evaluations. Experimental results demonstrate that our models improve
state-of-the-art performance.Comment: IJCAI 201
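The two-step pipeline can be sketched with a toy sentiment lexicon standing in for the mask step and a dictionary lookup standing in for the Attribute Conditional MLM in the infill step (the real model predicts infills from the context plus the target label; everything below is illustrative):

```python
SENTIMENT_LEXICON = {"awful", "great", "boring", "delicious"}  # toy lexicon

# Stand-in for the Attribute Conditional MLM: target label -> filler words.
FILLERS = {"positive": ["great", "delicious"], "negative": ["awful", "boring"]}

def mask_step(tokens):
    """Separate style from content by masking sentimental tokens."""
    return ["[MASK]" if t in SENTIMENT_LEXICON else t for t in tokens]

def infill_step(masked_tokens, target_label):
    """Fill each masked slot with a word conditioned on the target label."""
    fillers = iter(FILLERS[target_label])
    return [next(fillers) if t == "[MASK]" else t for t in masked_tokens]

tokens = "the food was awful and the plot was boring".split()
masked = mask_step(tokens)
transferred = infill_step(masked, "positive")
# → "the food was great and the plot was delicious"
```

Because only sentimental slots are rewritten, the attribute-independent content ("the food", "the plot") survives unchanged, which is the point of masking instead of generating from scratch.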
Early Detection of Fake News by Utilizing the Credibility of News, Publishers, and Users Based on Weakly Supervised Learning
The dissemination of fake news significantly affects personal reputation and
public trust. Recently, fake news detection has attracted tremendous attention,
and previous studies mainly focused on finding clues from news content or
diffusion path. However, the required features of previous models are often
unavailable or insufficient in early detection scenarios, resulting in poor
performance. Thus, early fake news detection remains a tough challenge.
Intuitively, the news from trusted and authoritative sources or shared by many
users with a good reputation is more reliable than other news. Using the
credibility of publishers and users as prior weakly supervised information, we
can quickly locate fake news among massive amounts of news and detect it in the early
stages of dissemination.
In this paper, we propose a novel Structure-aware Multi-head Attention
Network (SMAN), which combines the news content, publishing, and reposting
relations of publishers and users, to jointly optimize the fake news detection
and credibility prediction tasks. In this way, we can explicitly exploit the
credibility of publishers and users for early fake news detection. We conducted
experiments on three real-world datasets, and the results show that SMAN can
detect fake news within 4 hours with an accuracy of over 91%, which is much faster
than the state-of-the-art models.Comment: Accepted as a long paper at COLING 202
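The joint optimization can be sketched as a weighted multi-task loss; the binary cross-entropy stand-ins and the weighting coefficient are assumptions, since the abstract only states that the two tasks are optimized jointly:

```python
import math

def cross_entropy(pred: float, label: int) -> float:
    """Binary cross-entropy for a single probability prediction."""
    eps = 1e-12
    return -(label * math.log(pred + eps)
             + (1 - label) * math.log(1 - pred + eps))

def joint_loss(fake_pred, fake_label, cred_pred, cred_label, lam=0.5):
    """Fake-news detection loss plus a weighted credibility-prediction
    loss, so the credibility signal supervises the shared encoder."""
    return (cross_entropy(fake_pred, fake_label)
            + lam * cross_entropy(cred_pred, cred_label))

loss = joint_loss(0.9, 1, 0.8, 1)
```

Because credibility labels are available before a story has any diffusion history, this auxiliary term is what enables detection early in the dissemination process.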
Jointly embedding the local and global relations of heterogeneous graph for rumor detection
The development of social media has revolutionized the way people
communicate, share information and make decisions, but it also provides an
ideal platform for publishing and spreading rumors. Existing rumor detection
methods focus on finding clues from text content, user profiles, and
propagation patterns. However, the local semantic relation and global
structural information in the message propagation graph have not been well
utilized by previous works.
In this paper, we present a novel global-local attention network (GLAN) for
rumor detection, which jointly encodes the local semantic and global structural
information. We first generate a better integrated representation for each
source tweet by fusing the semantic information of related retweets with the
attention mechanism. Then, we model the global relationships among all source
tweets, retweets, and users as a heterogeneous graph to capture the rich
structural information for rumor detection. We conduct experiments on three
real-world datasets, and the results demonstrate that GLAN significantly
outperforms the state-of-the-art models in both rumor detection and early
detection scenarios.Comment: 10 pages, Accepted to the IEEE International Conference on Data
Mining 201
Beyond Statistical Relations: Integrating Knowledge Relations into Style Correlations for Multi-Label Music Style Classification
Automatically labeling multiple styles for every song is a comprehensive
application on all kinds of music websites. Recently, some studies explore
review-driven multi-label music style classification and exploit style
correlations for this task. However, their methods focus on mining the
statistical relations between different music styles and only consider shallow
style relations. Moreover, these statistical relations suffer from the
underfitting problem because some music styles have little training data.
To tackle these problems, we propose a novel knowledge relations integrated
framework (KRF) to capture the complete style correlations, which jointly
exploits the inherent relations between music styles according to external
knowledge and their statistical relations. Based on the two types of relations,
we use a graph convolutional network to learn the deep correlations between
styles automatically. Experimental results show that our framework
significantly outperforms state-of-the-art methods. Further studies demonstrate
that our framework can effectively alleviate the underfitting problem and learn
meaningful style correlations. The source code is available at
https://github.com/Makwen1995/MusicGenre.Comment: Accepted as WSDM 2020 Regular Paper
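The graph-convolution step over the combined style graph can be sketched as a single GCN layer, H' = ReLU(Â H W), where Â is the symmetrically normalized adjacency built from the knowledge and statistical relations; all matrices below are toy stand-ins:

```python
import numpy as np

def gcn_layer(adj: np.ndarray, features: np.ndarray, weights: np.ndarray):
    """One GCN propagation step: symmetrically normalize the adjacency
    (with self-loops), aggregate neighbor features, apply ReLU."""
    a_hat = adj + np.eye(adj.shape[0])         # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt   # D^-1/2 (A+I) D^-1/2
    return np.maximum(a_norm @ features @ weights, 0.0)

# Toy style graph: styles 0 and 1 related (say, "rock" and "metal"),
# style 2 isolated; one-hot features and identity weights for clarity.
adj = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 0.]])
out = gcn_layer(adj, np.eye(3), np.eye(3))
```

After one step a data-poor style already shares representation mass with its knowledge-related neighbors, which is the mechanism claimed to alleviate the underfitting problem.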