MacroBase: Prioritizing Attention in Fast Data
As data volumes continue to rise, manual inspection is becoming increasingly
untenable. In response, we present MacroBase, a data analytics engine that
prioritizes end-user attention in high-volume fast data streams. MacroBase
enables efficient, accurate, and modular analyses that highlight and aggregate
important and unusual behavior, acting as a search engine for fast data.
MacroBase is able to deliver order-of-magnitude speedups over alternatives by
optimizing the combination of explanation and classification tasks and by
leveraging a new reservoir sampler and heavy-hitters sketch specialized for
fast data streams. As a result, MacroBase delivers accurate results at speeds
of up to 2M events per second per query on a single core. The system has
delivered meaningful results in production, including at a telematics company
monitoring hundreds of thousands of vehicles.
Comment: SIGMOD 201
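The specialized reservoir sampler mentioned above builds on classic reservoir sampling; MacroBase's variant additionally damps older items, but the underlying primitive can be sketched as Algorithm R (the names here are illustrative, not from the paper's code):

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items from a stream of
    unknown length, using O(k) memory (classic Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random reservoir slot with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample
```

Each stream item ends up in the sample with probability k/n, which is what makes sampling-based stream analytics feasible at millions of events per second.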
DPASF: A Flink Library for Streaming Data preprocessing
Data preprocessing techniques are devoted to correct or alleviate errors in
data. Discretization and feature selection are two of the most extended data
preprocessing techniques. Although we can find many proposals for static Big
Data preprocessing, there is little research devoted to the continuous Big Data
problem. Apache Flink is a recent and novel Big Data framework, following the
MapReduce paradigm, focused on distributed stream and batch data processing. In
this paper we propose a data stream library for Big Data preprocessing, named
DPASF, under Apache Flink. We have implemented six of the most popular data
preprocessing algorithms, three for discretization and the rest for feature
selection. The algorithms have been tested using two Big Data datasets.
Experimental results show that preprocessing can not only reduce the size of
the data, but also maintain or even improve the original accuracy, all in a
short time. DPASF thus contains useful algorithms for dealing with Big Data
streams: the preprocessing algorithms included in the library are able to
tackle Big Datasets efficiently and to correct imperfections in the data.
Comment: 19 pages
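The abstract does not name the six algorithms, so as a purely illustrative stand-in (not one of DPASF's actual implementations), a minimal online equal-width discretizer for a numeric stream could look like this:

```python
class OnlineEqualWidthDiscretizer:
    """Discretize a numeric stream into n_bins equal-width intervals,
    updating the observed value range incrementally."""

    def __init__(self, n_bins):
        self.n_bins = n_bins
        self.low = float("inf")
        self.high = float("-inf")

    def partial_fit(self, x):
        # Update running range estimates from the stream.
        self.low = min(self.low, x)
        self.high = max(self.high, x)

    def transform(self, x):
        # Map a value to a bin index, clamping to the known range.
        if self.high == self.low:
            return 0
        width = (self.high - self.low) / self.n_bins
        b = int((x - self.low) / width)
        return min(max(b, 0), self.n_bins - 1)
```

Real streaming discretizers must also handle concept drift, which this sketch ignores.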
Towards Robust Human Activity Recognition from RGB Video Stream with Limited Labeled Data
Human activity recognition based on video streams has received considerable
attention in recent years. Due to the lack of depth information, RGB video based
activity recognition performs poorly compared to RGB-D video based solutions.
On the other hand, acquiring depth information, inertia etc. is costly and
requires special equipment, whereas RGB video streams are available in ordinary
cameras. Hence, our goal is to investigate whether similar or even higher
accuracy can be achieved with RGB-only modality. In this regard, we propose a
novel framework that couples skeleton data extracted from RGB video and deep
Bidirectional Long Short Term Memory (BLSTM) model for activity recognition. A
big challenge of training such a deep network is the limited training data, and
exploring the RGB-only stream significantly exacerbates the difficulty. We
therefore propose a set of algorithmic techniques to train this model
effectively, e.g., data augmentation, dynamic frame dropout and gradient
injection. The experiments demonstrate that our RGB-only solution surpasses the
state-of-the-art approaches that all exploit RGB-D video streams by a notable
margin. This makes our solution widely deployable with ordinary cameras.
Comment: To appear in ICMLA 201
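The abstract names "dynamic frame dropout" among the training techniques but does not specify it; as a simplified sketch of the general idea, frame-level dropout augments a sequence by randomly removing frames while preserving temporal order (the fixed `drop_prob` here is an assumption; the paper's "dynamic" scheme is presumably adaptive):

```python
import random

def frame_dropout(frames, drop_prob=0.2):
    """Randomly drop frames from a sequence, preserving order.
    Always keeps at least one frame so the sequence stays non-empty."""
    kept = [f for f in frames if random.random() >= drop_prob]
    return kept if kept else [frames[0]]
```

Applied on the fly during training, each epoch sees a different temporal subsampling of the same clip, which helps a deep recurrent model generalize from limited labeled data.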
Beyond Sharing Weights for Deep Domain Adaptation
The performance of a classifier trained on data coming from a specific domain
typically degrades when applied to a related but different one. While
annotating many samples from the new domain would address this issue, it is
often too expensive or impractical. Domain Adaptation has therefore emerged as
a solution to this problem; it leverages annotated data from a source domain,
in which it is abundant, to train a classifier to operate in a target domain,
in which it is either sparse or even lacking altogether. In this context, the
recent trend consists of learning deep architectures whose weights are shared
for both domains, which essentially amounts to learning domain invariant
features.
Here, we show that it is more effective to explicitly model the shift from
one domain to the other. To this end, we introduce a two-stream architecture,
where one operates in the source domain and the other in the target domain. In
contrast to other approaches, the weights in corresponding layers are related
but not shared. We demonstrate that this both yields higher accuracy than
state-of-the-art methods on several object recognition and detection tasks and
consistently outperforms networks with shared weights in both supervised and
unsupervised settings.
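One simple way to keep corresponding layers "related but not shared" is to train the two streams with a penalty on the distance between their weights; the paper's actual regularizer may differ (e.g., allowing a learned linear transformation between the two weight sets). A minimal sketch over flat per-layer weight vectors:

```python
def weight_link_penalty(source_layers, target_layers, lam=1e-3):
    """Squared-difference penalty between corresponding layer weights,
    scaled by lam. A zero penalty would mean fully shared weights;
    a small lam lets the target stream drift to model the domain shift."""
    total = 0.0
    for ws, wt in zip(source_layers, target_layers):
        total += sum((a - b) ** 2 for a, b in zip(ws, wt))
    return lam * total
```

Added to the task loss, this term lets the optimizer trade off domain-specific adaptation against staying close to the source-domain features.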
Clustering Time Series Data Stream - A Literature Survey
Mining time series data has seen a tremendous growth of interest in recent
years. To provide an overview, various implementations are studied and
summarized to identify the different problems in existing applications.
Clustering time series is a problem with applications in a wide variety of
fields, and it has recently attracted a large amount of research. Time series
data are frequently large and may contain outliers. In addition, time series
are a special type of data set whose elements have a temporal ordering.
Clustering such data streams is therefore an important issue in the data
mining process. Numerous techniques and clustering algorithms have been
proposed to assist the clustering of time series data streams. The clustering
algorithms and their effectiveness in various applications are compared in
order to develop new methods that address the problems in existing work. This
paper presents a survey of the clustering algorithms available for time
series datasets. Moreover, the strengths and limitations of previous research
are discussed and several achievable topics for future study are identified.
Furthermore, the areas that utilize time series clustering are summarized.
Comment: IEEE Publication format, International Journal of Computer Science
and Information Security, IJCSIS, Vol. 8 No. 1, April 2010, USA. ISSN
1947-5500, http://sites.google.com/site/ijcsis
Learn on Source, Refine on Target: A Model Transfer Learning Framework with Random Forests
We propose novel model transfer-learning methods that refine a decision
forest model M learned within a "source" domain using a training set sampled
from a "target" domain, assumed to be a variation of the source. We present two
random forest transfer algorithms. The first algorithm searches greedily for
locally optimal modifications of each tree structure by trying to locally
expand or reduce the tree around individual nodes. The second algorithm does
not modify the structure, but only the parameters (thresholds) associated with
decision nodes. We also propose to combine both methods by considering an
ensemble that contains the union of the two forests. The proposed methods
exhibit impressive experimental results over a range of problems.
Comment: 2 columns, 14 pages, TPAMI submitted
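As a toy illustration of the second algorithm's idea (refining decision thresholds on target data while leaving the tree structure fixed), here is a greedy local search over a single decision stump's threshold; the function names and the step schedule are assumptions for illustration, not the paper's method:

```python
def refine_threshold(xs, ys, threshold, step=1.0, iters=50):
    """Greedily nudge a stump's split threshold to reduce error on
    target-domain samples; the 'structure' (the stump) is untouched."""
    def err(t):
        # Misclassifications for the rule: predict True iff x > t.
        return sum((x > t) != y for x, y in zip(xs, ys))
    best = threshold
    for _ in range(iters):
        # Try moving the threshold left, staying, or moving right.
        best = min([best - step, best, best + step], key=err)
    return best
```

In a full forest, this kind of refinement would run per decision node, using only the (small) target training set.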
Efficient Classification of Multi-Labelled Text Streams by Clashing
We present a method for the classification of multi-labelled text documents
explicitly designed for data stream applications that require processing a
virtually infinite sequence of data using constant memory and constant
processing time. Our method is composed of an online procedure used to
efficiently map text into a low-dimensional feature space and a partition of
this space into a set of regions for which the system extracts and keeps
statistics used to predict multi-label text annotations. Documents are fed into
the system as a sequence of words, mapped to a region of the partition, and
annotated using the statistics computed from the labelled instances colliding
in the same region. This approach is referred to as clashing. We illustrate the
method in real-world text data, comparing the results with those obtained using
other text classifiers. In addition, we provide an analysis about the effect of
the representation space dimensionality on the predictive performance of the
system. Our results show that the online embedding indeed approximates the
geometry of the full corpus-wise TF and TF-IDF space. The model obtains
competitive F measures with respect to the most accurate methods, using
significantly fewer computational resources. In addition, the method achieves a
higher macro-averaged F measure than methods with similar running time.
Furthermore, the system is able to learn faster than the other methods from
partially labelled streams.
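A rough sketch of the "clashing" idea under stated assumptions: documents are mapped to a region (here via a naive hash of the word set, standing in for the paper's learned low-dimensional embedding and partition), and each region accumulates label statistics from the documents that collide there:

```python
from collections import defaultdict

class ClashingClassifier:
    """Hash texts into regions; predict labels from the statistics of
    previously seen documents colliding in the same region (sketch)."""

    def __init__(self, n_regions=1024):
        self.n_regions = n_regions
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def _region(self, text):
        # Stand-in for the paper's online embedding + space partition.
        return hash(" ".join(sorted(set(text.split())))) % self.n_regions

    def learn(self, text, labels):
        r = self._region(text)
        self.totals[r] += 1
        for lab in labels:
            self.counts[r][lab] += 1

    def predict(self, text, threshold=0.5):
        r = self._region(text)
        if self.totals[r] == 0:
            return set()
        return {lab for lab, c in self.counts[r].items()
                if c / self.totals[r] >= threshold}
```

Both memory (a fixed number of regions) and per-document time are constant, matching the streaming constraints the abstract describes.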
Disc-aware Ensemble Network for Glaucoma Screening from Fundus Image
Glaucoma is a chronic eye disease that leads to irreversible vision loss.
Most of the existing automatic screening methods firstly segment the main
structure, and subsequently calculate the clinical measurement for detection
and screening of glaucoma. However, these measurement-based methods rely
heavily on the segmentation accuracy, and ignore various visual features. In
this paper, we introduce a deep learning technique to gain additional
image-relevant information, and screen glaucoma from the fundus image directly.
Specifically, a novel Disc-aware Ensemble Network (DENet) for automatic
glaucoma screening is proposed, which integrates the deep hierarchical context
of the global fundus image and the local optic disc region. Four deep streams
on different levels and modules are respectively considered as global image
stream, segmentation-guided network, local disc region stream, and disc polar
transformation stream. Finally, the output probabilities of different streams
are fused as the final screening result. The experiments on two glaucoma
datasets (SCES and new SINDI datasets) show our method outperforms other
state-of-the-art algorithms.
Comment: Project homepage: https://hzfu.github.io/proj_glaucoma_fundus.html, and accepted by IEEE Transactions on Medical Imaging
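The abstract says the output probabilities of the four streams are fused into a final screening result; the fusion below is a generic weighted average whose weights are a hypothetical placeholder (DENet may fix or learn them differently), but it conveys the ensembling step:

```python
def fuse_streams(probs, weights=None):
    """Fuse per-stream screening probabilities by (weighted) averaging.
    probs: one probability per stream, e.g. the four DENet-style streams."""
    if weights is None:
        weights = [1.0] * len(probs)  # unweighted average by default
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)
```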
Smartphone Fingerprinting Via Motion Sensors: Analyzing Feasibility at Large-Scale and Studying Real Usage Patterns
Advertisers are increasingly turning to fingerprinting techniques to track
users across the web. As web browsing activity shifts to mobile platforms,
traditional browser fingerprinting techniques become less effective; however,
device fingerprinting using built-in sensors offers a new avenue for attack. We
study the feasibility of using motion sensors to perform device fingerprinting
at scale, and explore countermeasures that can be used to protect privacy.
We perform a large-scale user study to demonstrate that motion sensor
fingerprinting is effective even with 500 users. We also develop a model to
estimate prediction accuracy for larger user populations; our model provides a
conservative estimate of at least 12% classification accuracy with 100,000
users. We then investigate the use of motion sensors on the web and find,
distressingly, that many sites send motion sensor data to servers for storage
and analysis, paving the way to potential fingerprinting. Finally, we consider
the problem of developing fingerprinting countermeasures; we evaluate a
previously proposed obfuscation technique and a newly developed quantization
technique via a user study. We find that both techniques are able to
drastically reduce fingerprinting accuracy without significantly impacting the
utility of the sensors in web applications.
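The quantization countermeasure works by coarsening sensor readings so that small per-device calibration offsets (the fingerprintable signal) are masked while the gross motion signal survives. A minimal sketch; the step size of 0.05 is an illustrative assumption, not the paper's calibrated value:

```python
def quantize(value, step=0.05):
    """Snap a raw sensor reading to a coarse grid, hiding the tiny
    per-device bias that fingerprinting exploits."""
    return round(value / step) * step
```

A browser would apply this to every accelerometer/gyroscope sample before exposing it to scripts, so all devices report values on the same grid.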
Parallel Programming Models for Heterogeneous Many-Cores : A Survey
Heterogeneous many-cores are now an integral part of modern computing systems
ranging from embedded systems to supercomputers. While heterogeneous many-core
design offers the potential for energy-efficient, high-performance computing, such
potential can only be unlocked if the application programs are suitably
parallel and can be made to match the underlying heterogeneous platform. In
this article, we provide a comprehensive survey of parallel programming models
for heterogeneous many-core architectures and review compilation techniques
for improving programmability and portability. We examine various software
optimization techniques for minimizing the communication overhead between
heterogeneous computing devices. We provide a road map for a wide variety of
different research areas. We conclude with a discussion on open issues in the
area and potential research directions. This article provides both an
accessible introduction to the fast-moving area of heterogeneous programming
and a detailed bibliography of its main achievements.
Comment: Accepted to be published at CCF Transactions on High Performance Computing