POISED: Spotting Twitter Spam Off the Beaten Paths
Cybercriminals have found in online social networks a propitious medium to
spread spam and malicious content. Existing techniques for detecting spam
include predicting the trustworthiness of accounts and analyzing the content of
these messages. However, advanced attackers can still successfully evade these
defenses.
Online social networks bring people who have personal connections or share
common interests to form communities. In this paper, we first show that users
within a networked community share some topics of interest. Moreover, content
shared on these social networks tends to propagate according to the interests of
people. Dissemination paths may emerge where some communities post similar
messages, based on the interests of those communities. Spam and other malicious
content, on the other hand, follow different spreading patterns.
In this paper, we follow this insight and present POISED, a system that
leverages the differences in propagation between benign and malicious messages
on social networks to identify spam and other unwanted content. We test our
system on a dataset of 1.3M tweets collected from 64K users, and we show that
our approach is effective in detecting malicious messages, reaching 91%
precision and 93% recall. We also show that POISED's detection is more
comprehensive than previous systems, by comparing it to three state-of-the-art
spam detection systems that have been proposed by the research community in the
past. POISED significantly outperforms each of these systems. Moreover, through
simulations, we show how POISED is effective in the early detection of spam
messages and how it is resilient against two well-known adversarial machine
learning attacks.
Why (and How) Networks Should Run Themselves
The proliferation of networked devices, systems, and applications that we
depend on every day makes managing networks more important than ever. The
increasing security, availability, and performance demands of these
applications suggest that these increasingly difficult network management
problems be solved in real time, across a complex web of interacting protocols
and systems. Alas, just as the importance of network management has increased,
the network has grown so complex that it is seemingly unmanageable. In this new
era, network management requires a fundamentally new approach. Instead of
optimizations based on closed-form analysis of individual protocols, network
operators need data-driven, machine-learning-based models of end-to-end and
application performance based on high-level policy goals and a holistic view of
the underlying components. Instead of anomaly detection algorithms that operate
on offline analysis of network traces, operators need classification and
detection algorithms that can make real-time, closed-loop decisions. Networks
should learn to drive themselves. This paper explores this concept, discussing
how we might attain this ambitious goal by more closely coupling measurement
with real-time control and by relying on learning for inference and prediction
about a networked application or system, as opposed to closed-form analysis of
individual protocols.
Learning from networked examples
Many machine learning algorithms are based on the assumption that training
examples are drawn independently. However, this assumption does not hold
anymore when learning from a networked sample because two or more training
examples may share some common objects, and hence share the features of these
shared objects. We show that the classic approach of ignoring this problem
can potentially harm the accuracy of statistics, and then
consider alternatives. One of these is to only use independent examples,
discarding other information. However, this is clearly suboptimal. We analyze
sample error bounds in this networked setting, providing significantly improved
results. An important component of our approach is formed by efficient sample
weighting schemes, which lead to novel concentration inequalities.
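The "only use independent examples" alternative mentioned in this abstract can be illustrated with a minimal sketch. It assumes each training example is represented as a set of object identifiers and uses a simple greedy pass; the function name `independent_subset` and the toy data are hypothetical, and this is not the weighting scheme the paper proposes.

```python
def independent_subset(examples):
    """Greedily keep examples that share no object with those already kept.

    `examples` is a list of sets of object identifiers (an assumed
    representation). Returns the indices of the kept examples.
    """
    used = set()   # objects already covered by a kept example
    kept = []      # indices of mutually independent examples
    for index, objects in enumerate(examples):
        if used.isdisjoint(objects):
            kept.append(index)
            used.update(objects)
    return kept


# Toy networked sample: examples 0 and 1 share "b", examples 1 and 2 share "c".
examples = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"e", "f"}]
kept = independent_subset(examples)
```

In this toy sample one of the four examples is discarded, which shows the information loss the abstract calls suboptimal.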
Probing the topological properties of complex networks modeling short written texts
In recent years, graph theory has been widely employed to probe several
language properties. More specifically, the so-called word adjacency model has
been proven useful for tackling several practical problems, especially those
relying on textual stylistic analysis. The most common approach to treat texts
as networks has simply considered either large pieces of texts or entire books.
This approach has certainly worked well -- many informative discoveries have
been made this way -- but it raises an uncomfortable question: could there be
important topological patterns in small pieces of texts? To address this
problem, the topological properties of subtexts sampled from entire books were
probed. Statistical analyses performed on a dataset comprising 50 novels
revealed that most of the traditional topological measurements are stable for
short subtexts. When the performance of the authorship recognition task was
analyzed, it was found that a proper sampling yields a discriminability similar
to the one found with full texts. Surprisingly, the support vector machine
classification based on the characterization of short texts outperformed the
one performed with entire books. These findings suggest that a local
topological analysis of large documents might improve their global
characterization. Most importantly, it was verified, as a proof of principle,
that short texts can be analyzed with the methods and concepts of complex
networks. As a consequence, the techniques described here can be extended in a
straightforward fashion to analyze texts as time-varying complex networks.
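The word adjacency model referred to in this abstract links words that appear next to each other in a text. A minimal sketch follows, assuming lowercase whitespace tokenization with basic punctuation stripping and an undirected, unweighted graph. The abstract does not describe the preprocessing actually used, and the function names are hypothetical.

```python
from collections import defaultdict


def word_adjacency_network(text):
    """Build an undirected word adjacency network from raw text."""
    # Assumption: simple lowercase tokenization; no stopword removal or lemmatization.
    tokens = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    tokens = [w for w in tokens if w]
    graph = defaultdict(set)
    # Connect each word to the word that immediately follows it.
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops from repeated words
            graph[left].add(right)
            graph[right].add(left)
    return dict(graph)


def degrees(graph):
    """Node degree, one of the simplest topological measurements."""
    return {word: len(neighbors) for word, neighbors in graph.items()}


sample = "the cat sat on the mat and the dog sat on the cat"
network = word_adjacency_network(sample)
deg = degrees(network)
```

Applying the same function to short windows of a longer token sequence would give the kind of subtext networks the abstract discusses.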
Crisis Analytics: Big Data Driven Crisis Response
Disasters have long been a scourge for humanity. With the advances in
technology (in terms of computing, communications, and the ability to process
and analyze big data), our ability to respond to disasters is at an inflection
point. There is great optimism that big data tools can be leveraged to process
the large amounts of crisis-related data (in the form of user generated data in
addition to the traditional humanitarian data) to provide an insight into the
fast-changing situation and help drive an effective disaster response. This
article introduces the history and the future of big crisis data analytics,
along with a discussion on its promise, challenges, and pitfalls.
Integration of Legacy Appliances into Home Energy Management Systems
The progressive installation of renewable energy sources requires the
coordination of energy consuming devices. At consumer level, this coordination
can be done by a home energy management system (HEMS). Interoperability issues
need to be solved among smart appliances as well as between smart and
non-smart, i.e., legacy devices. We expect current standardization efforts to
soon provide technologies to design smart appliances in order to cope with the
current interoperability issues. Nevertheless, common electrical devices affect
energy consumption significantly and therefore deserve consideration within
energy management applications. This paper discusses the integration of smart
and legacy devices into a generic system architecture and, subsequently,
elaborates the requirements and components which are necessary to realize such
an architecture including an application of load detection for the
identification of running loads and their integration into existing HEM
systems. We assess the feasibility of such an approach with a case study based
on a measurement campaign on real households. We show how the information of
detected appliances can be extracted in order to create device profiles
allowing for their integration and management within a HEMS.
Distributionally Robust Semi-Supervised Learning for People-Centric Sensing
Semi-supervised learning is crucial for alleviating labelling burdens in
people-centric sensing. However, human-generated data inherently suffer from
distribution shift in semi-supervised learning due to the diverse biological
conditions and behavior patterns of humans. To address this problem, we propose
a generic distributionally robust model for semi-supervised learning on
distributionally shifted data. Considering both the discrepancy and the
consistency between the labeled data and the unlabeled data, we learn the
latent features that reduce person-specific discrepancy and preserve
task-specific consistency. We evaluate our model in a variety of people-centric
recognition tasks on real-world datasets, including intention recognition,
activity recognition, muscular movement recognition and gesture recognition.
The experimental results demonstrate that the proposed model outperforms the
state-of-the-art methods. Comment: 8 pages, accepted by AAAI201