On Modelling and Prediction of Total CPU Usage for Applications in MapReduce Environments
Recently, businesses have started using MapReduce as a popular computation
framework for processing large amounts of data, such as spam detection and
various data mining tasks, in both public and private clouds. Two of the
challenging questions in such environments are (1) choosing suitable values for
MapReduce configuration parameters (e.g., number of mappers, number of
reducers, and DFS block size), and (2) predicting the amount of resources that
a user should lease from the service provider. Currently, the tasks of both
choosing configuration parameters and estimating required resources are solely
the users' responsibilities. In this paper, we present an approach to provision
the total CPU usage in clock cycles of jobs in a MapReduce environment. For a
MapReduce job, a profile of total CPU usage in clock cycles is built from the
job's past executions with different values of two configuration parameters,
namely the number of mappers and the number of reducers. Then, a polynomial regression is
used to model the relation between these configuration parameters and total CPU
usage in clock cycles of the job. We also briefly study the influence of input
data scaling on measured total CPU usage in clock cycles. This derived model
along with the scaling result can then be used to provision the total CPU usage
in clock cycles of the same jobs with different input data sizes. We validate
the accuracy of our models using three realistic applications (WordCount, Exim
MainLog parsing, and TeraSort). Results show that the predicted total CPU usage
in clock cycles of the generated resource provisioning options deviates by less
than 8% from the measured total CPU usage in clock cycles in our 20-node
virtual Hadoop cluster.
Comment: This paper has been accepted to the 12th International Conference on
Algorithms and Architectures for Parallel Processing (ICA3PP 2012).
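To make the regression step concrete, here is a minimal sketch, assuming a hypothetical CPU-usage profile and scikit-learn: a second-degree polynomial is fit to total CPU clock cycles as a function of the number of mappers and reducers, and then queried for an unprofiled configuration. The configurations and cycle counts below are invented for illustration, not taken from the paper.

```python
# Illustrative sketch (not the paper's exact model): fit a polynomial
# regression of total CPU clock cycles against two MapReduce configuration
# parameters and query it for an unprofiled configuration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical profile from past executions of one job:
# columns are (number of mappers, number of reducers).
configs = np.array([[4, 2], [8, 2], [8, 4], [16, 4], [16, 8], [32, 8]])
cpu_cycles = np.array([3.1e12, 2.9e12, 2.6e12, 2.5e12, 2.4e12, 2.45e12])

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(configs, cpu_cycles)

# Provision total CPU cycles for a configuration that was never profiled.
print(model.predict(np.array([[24, 6]])))
```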
HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud
Eliminating duplicate data in the primary storage of clouds improves the
cost-efficiency of cloud service providers and reduces the cost users pay for
cloud services. Existing primary deduplication techniques either use
inline caching to exploit locality in primary workloads or use post-processing
deduplication running in system idle time to avoid the negative impact on I/O
performance. However, neither of them works well in the cloud servers running
multiple services or applications for the following two reasons: Firstly, the
temporal locality of duplicate data writes may not exist in some primary
storage workloads, so inline caching often fails to achieve a good deduplication
ratio. Secondly, post-processing deduplication allows duplicate data to be
written to disk and therefore provides no I/O deduplication benefit while
requiring high peak storage capacity. This paper presents HPDedup, a Hybrid
Prioritized data Deduplication mechanism, to deal with a storage system shared
by applications running in co-located virtual machines or containers by fusing
an inline phase and a post-processing phase for exact deduplication. In the inline
deduplication phase, HPDedup introduces a fingerprint caching mechanism that
estimates the temporal locality of duplicates in data streams from different
VMs or applications and prioritizes the cache allocation for these streams
based on the estimation. HPDedup also allows different deduplication thresholds
for streams based on their spatial locality, to reduce disk fragmentation.
The post-processing phase removes from disk the duplicates whose fingerprints
could not be cached due to weak temporal locality. Our experimental
results show that HPDedup clearly outperforms the state-of-the-art primary
storage deduplication techniques in terms of inline cache efficiency and
primary deduplication efficiency.
Comment: 14 pages, 11 figures, submitted to MSST201
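The inline phase described above can be pictured with a small sketch of locality-driven cache allocation, assuming hypothetical stream names, a simple LRU fingerprint cache, and a made-up locality estimate per stream; this is an illustration of the idea, not HPDedup's actual implementation.

```python
# A minimal sketch of locality-driven fingerprint cache allocation across
# streams; the stream names, locality estimates, and LRU policy are
# assumptions, not HPDedup's actual implementation.
from collections import OrderedDict

def allocate_cache(total_slots, locality):
    """Split the fingerprint cache budget across streams in proportion to
    their estimated temporal locality of duplicate writes."""
    total = sum(locality.values()) or 1.0
    return {s: int(total_slots * l / total) for s, l in locality.items()}

class FingerprintCache:
    """Per-stream LRU cache of block fingerprints for inline deduplication."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup_or_insert(self, fp):
        if fp in self.entries:
            self.entries.move_to_end(fp)
            return True                # duplicate caught inline
        if self.capacity > 0:
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)
            self.entries[fp] = True
        return False                   # written to disk; handled by post-processing

# Streams with stronger estimated locality receive a larger cache share.
shares = allocate_cache(10000, {"vm-a": 0.7, "vm-b": 0.2, "vm-c": 0.1})
caches = {s: FingerprintCache(n) for s, n in shares.items()}
```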
A Supervised Learning Methodology for Real-Time Disguised Face Recognition in the Wild
Facial recognition has always been a challenging task for computer vision
scientists and experts. Despite complexities arising due to variations in
camera parameters, illumination and face orientations, significant progress has
been made in the field with deep learning algorithms now competing with
human-level accuracy. However, in contrast to the recent advances in face
recognition techniques, Disguised Facial Identification continues to be a
tougher challenge in the field of computer vision. In the modern-day scenario,
where security is of prime concern, regular face identification techniques do
not perform as required when faces are disguised, which calls for a
different approach to handle situations where intruders have their faces
masked. Along these lines, we propose a deep learning architecture for
disguised facial recognition (DFR). The algorithm put forward in this paper
detects 20 facial key-points in the first stage, using a 14-layered
convolutional neural network (CNN). These facial key-points are later utilized
by a support vector machine (SVM) for classifying the disguised faces based on
the Euclidean distance ratios and angles between different facial key-points.
This overall architecture imparts a basic intelligence to our system. Our
key-point feature prediction accuracy is 65% while the classification rate is
72.4%. Moreover, the architecture works at 19 FPS, thereby performing in almost
real-time. The efficiency of our approach is also compared with the
state-of-the-art Disguised Facial Identification methods.
Comment: Accepted at the 2018 International Conference on Robotics and Computer
Vision.
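As a rough illustration of the second stage, the sketch below turns detected key-points into Euclidean distance ratios and angles and feeds them to an SVM. The normalising pair, kernel, and helper names are assumptions; the paper's exact feature construction may differ.

```python
# Rough sketch of the classification stage: detected facial key-points are
# turned into Euclidean distance ratios and angles and classified with an SVM.
# The normalising pair, kernel, and helper names are assumptions.
import numpy as np
from sklearn.svm import SVC

def keypoint_features(pts):
    """pts: array of shape (20, 2) holding (x, y) facial key-points."""
    feats = []
    ref = np.linalg.norm(pts[0] - pts[1]) + 1e-8   # normalising distance
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            v = pts[j] - pts[i]
            feats.append(np.linalg.norm(v) / ref)  # distance ratio
            feats.append(np.arctan2(v[1], v[0]))   # angle
    return np.array(feats)

def train_classifier(keypoint_sets, labels):
    """keypoint_sets: list of (20, 2) arrays predicted by the CNN stage;
    labels: identity labels (hypothetical placeholders)."""
    X = np.stack([keypoint_features(p) for p in keypoint_sets])
    return SVC(kernel="rbf").fit(X, labels)
```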
Zero-Shot Learning with Generative Latent Prototype Model
Zero-shot learning (ZSL), which studies the problem of object classification for
categories for which we have no training examples, is gaining increasing
attention from the community. Most existing ZSL methods exploit deterministic
transfer learning via an in-between semantic embedding space. In this paper, we
try to attack this problem from a generative probabilistic modelling
perspective. We assume that, for any category, the observed representations, e.g.,
images or texts, are generated from a unique prototype in a latent space, in
which the semantic relationship among prototypes is encoded via linear
reconstruction. Taking advantage of this assumption, virtual instances of
unseen classes can be generated from the corresponding prototype, giving rise
to a novel ZSL model which can alleviate the domain shift problem inherent in
direct transfer learning. Extensive experiments on three benchmark
datasets show that our proposed model can achieve state-of-the-art results.
Comment: This work was completed in Oct, 201
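A toy sketch of the generative idea follows, assuming that an unseen class's prototype is recovered as a least-squares linear reconstruction of seen-class prototypes (with weights derived from semantic vectors) and that virtual instances are drawn with Gaussian noise around it; both choices are illustrative assumptions, not the paper's exact model.

```python
# Toy sketch: an unseen class prototype is written as a linear reconstruction
# of seen-class prototypes (weights obtained from semantic vectors), and
# virtual instances are sampled around it. Variable names and the Gaussian
# noise model are assumptions.
import numpy as np

def reconstruction_weights(sem_unseen, sem_seen):
    """Least-squares weights expressing an unseen class's semantic vector as a
    linear combination of the seen classes' semantic vectors.
    sem_seen: (num_seen, sem_dim), sem_unseen: (sem_dim,)."""
    w, *_ = np.linalg.lstsq(sem_seen.T, sem_unseen, rcond=None)
    return w

def virtual_instances(proto_seen, sem_seen, sem_unseen, n=50, noise=0.1):
    """proto_seen: (num_seen, latent_dim) prototypes of the seen classes."""
    w = reconstruction_weights(sem_unseen, sem_seen)
    proto_unseen = proto_seen.T @ w                  # prototype in latent space
    return proto_unseen + noise * np.random.randn(n, proto_seen.shape[1])
```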
Physics-based polynomial neural networks for one-shot learning of dynamical systems from one or a few samples
This paper discusses an approach for incorporating prior physical knowledge
into the neural network to improve data efficiency and the generalization of
predictive models. If the dynamics of a system approximately follows a given
differential equation, the Taylor mapping method can be used to initialize the
weights of a polynomial neural network. This allows the fine-tuning of the
model from one training sample of real system dynamics. The paper describes
practical results from real experiments with both a simple pendulum and one of
the largest X-ray sources worldwide. It is demonstrated in practice that the
proposed approach allows recovering complex physics from noisy, limited, and
partial observations and provides meaningful predictions for previously unseen
inputs. The approach mainly targets the learning of physical systems when
state-of-the-art models are difficult to apply given the lack of training data.
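The initialization idea can be sketched for the pendulum case: a low-order Taylor expansion of the known ODE yields weights for a small polynomial map, which then plays the role of the physics-informed model before fine-tuning. The constants, step size, and truncation order below are assumptions for illustration only.

```python
# Highly simplified sketch: a low-order Taylor map derived from the pendulum
# ODE (theta'' = -(g/L) sin(theta)) initializes a small polynomial model.
# The constants, Euler step, and cubic truncation are illustrative assumptions.
import numpy as np

g, L, dt = 9.81, 1.0, 0.01

def taylor_map_weights():
    """Weights of a polynomial map [theta, omega] -> next state, using
    sin(theta) ~ theta - theta**3 / 6."""
    W1 = np.array([[1.0, dt],
                   [-(g / L) * dt, 1.0]])      # linear part (Euler step)
    W3 = np.array([[0.0],
                   [(g / L) * dt / 6.0]])      # cubic correction on theta**3
    return W1, W3

def poly_step(state, W1, W3):
    theta, _ = state
    return W1 @ state + W3 @ np.array([theta ** 3])

# Physics-informed initial model, before fine-tuning on measured data.
W1, W3 = taylor_map_weights()
x = np.array([0.5, 0.0])
for _ in range(100):
    x = poly_step(x, W1, W3)
```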
Malware Task Identification: A Data Driven Approach
Identifying the tasks a given piece of malware was designed to perform (e.g.,
logging keystrokes, recording video, or establishing remote access) is a
difficult and time-consuming operation that is largely human-driven in
practice. In this paper, we present an automated method to identify malware
tasks. Using two different malware collections, we explore various
circumstances for each - including cases where the training data differs
significantly from the test data; where the malware being evaluated employs packing to
thwart analytical techniques; and conditions with sparse training data. We find
that this approach consistently outperforms the current state-of-the-art
software for malware task identification as well as standard machine learning
approaches - often achieving an unbiased F1 score of over 0.9. In the near
future, we look to deploy our approach for use by analysts in an operational
cyber-security environment.
Comment: 8 pages, full paper, accepted at FOSINT-SI (2015).
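For comparison, a generic multi-label baseline in the spirit of the "standard machine learning approaches" mentioned above might look like the sketch below; the feature representation, classifier choice, and task labels are assumptions and do not reflect the paper's own method.

```python
# Generic multi-label baseline (assumed features, classifier, and task labels;
# not the paper's method): each sample maps to a set of tasks, and macro F1
# is reported on held-out data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier

def train_and_eval(X_train, Y_train, X_test, Y_test):
    """X: behavioural/static feature vectors; Y: binary task-indicator matrix
    (columns such as keylogging, video capture, remote access)."""
    clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=200))
    clf.fit(X_train, Y_train)
    return f1_score(Y_test, clf.predict(X_test), average="macro")
```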
Semantic Part Detection via Matching: Learning to Generalize to Novel Viewpoints from Limited Training Data
Detecting semantic parts of an object is a challenging task in computer
vision, particularly because it is hard to construct large annotated datasets
due to the difficulty of annotating semantic parts. In this paper we present an
approach which learns from a small training dataset of annotated semantic
parts, where the object is seen from a limited range of viewpoints, but
generalizes to detect semantic parts from a much larger range of viewpoints.
Our approach is based on a matching algorithm for finding accurate spatial
correspondence between two images, which enables semantic parts annotated on
one image to be transplanted to another. In particular, this enables images in
the training dataset to be matched to a virtual 3D model of the object (for
simplicity, we assume that the object viewpoint can be estimated by standard
techniques). Then a clustering algorithm is used to annotate the semantic parts
of the 3D virtual model. This virtual 3D model can be used to synthesize
annotated images from a large range of viewpoints. These can be matched to
images in the test set, using the same matching algorithm, to detect semantic
parts in novel viewpoints of the object. Our algorithm is very simple and
intuitive, and contains very few parameters. We evaluate our approach on the
car subclass of the VehicleSemanticPart dataset. We show it outperforms
standard deep network approaches and, in particular, performs much better on
novel viewpoints. To facilitate future research, code is available at:
https://github.com/ytongbai/SemanticPartDetectio
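A skeletal sketch of the annotation-transplanting step is given below, with the dense matching algorithm left abstract (a toy affine map stands in for it); the part names and helper functions are illustrative assumptions.

```python
# Skeletal sketch of annotation transplanting: given a correspondence from a
# source image to a target image, part keypoints annotated on the source are
# mapped onto the target. The matcher is left abstract; a toy affine map
# stands in for it, and all names are illustrative.
import numpy as np

def transplant_parts(src_parts, correspondence):
    """src_parts: dict part_name -> (x, y) in the source image.
    correspondence: function mapping source (x, y) to target coordinates."""
    return {name: correspondence(xy) for name, xy in src_parts.items()}

# Toy affine correspondence standing in for the real matching algorithm.
A = np.array([[1.02, 0.01], [0.00, 0.98]])
t = np.array([5.0, -3.0])
affine = lambda xy: tuple(A @ np.asarray(xy, dtype=float) + t)

print(transplant_parts({"wheel_front": (120, 200)}, affine))
```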
On the Complexity of One-class SVM for Multiple Instance Learning
In traditional multiple instance learning (MIL), both positive and negative
bags are required to learn a prediction function. However, a high human cost is
needed to determine the label of each bag (positive or negative). Only positive bags
contain our focus (positive instances), while negative bags consist of noise or
background (negative instances), so we would prefer not to spend much effort labeling
the negative bags. Contrary to this expectation, nearly all existing MIL methods
require enough negative bags besides positive ones. In this paper we propose an
algorithm called "Positive Multiple Instance" (PMI), which learns a classifier
given only a set of positive bags. So the annotation of negative bags becomes
unnecessary in our method. PMI is constructed based on the assumption that the
unknown positive instances in positive bags are similar to each other and
constitute one compact cluster in feature space, while the negative instances
lie outside this cluster. The experimental results demonstrate that PMI
achieves performance close to, or only slightly worse than, that of the
traditional MIL algorithms on benchmark and real data sets. However, the number
of training bags in PMI is reduced significantly compared with traditional MIL
algorithms.
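A minimal sketch of this setting, assuming a standard one-class SVM from scikit-learn over instances pooled from positive bags, is shown below; the hyperparameters and the any-instance bag rule are assumptions rather than PMI's exact formulation.

```python
# Minimal sketch of learning from positive bags only with a one-class SVM;
# the hyperparameters and the any-instance bag rule are assumptions, not
# PMI's exact formulation.
import numpy as np
from sklearn.svm import OneClassSVM

def fit_from_positive_bags(positive_bags, nu=0.1, gamma="scale"):
    """positive_bags: list of arrays, each of shape (n_i, d). Instances are
    pooled and modelled as one compact cluster in feature space."""
    X = np.vstack(positive_bags)
    return OneClassSVM(nu=nu, gamma=gamma).fit(X)

def predict_bag(model, bag):
    """Label a bag positive if any of its instances falls inside the learned
    cluster (positive decision value)."""
    return int((model.decision_function(bag) > 0).any())
```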
On Preempting Advanced Persistent Threats Using Probabilistic Graphical Models
This paper presents PULSAR, a framework for preempting Advanced Persistent
Threats (APTs). PULSAR employs a probabilistic graphical model (specifically a
Factor Graph) to infer the time evolution of an attack based on observed
security events at runtime. PULSAR (i) learns the statistical significance of
patterns of events from past attacks; (ii) composes these patterns into FGs to
capture the progression of the attack; and (iii) decides on preemptive actions.
PULSAR's accuracy and its performance are evaluated in three experiments at
SystemX: (i) a study with a dataset containing 120 successful APTs over the
past 10 years (PULSAR accurately identifies 91.7% of them); (ii) replay of a set of
ten unseen APTs (PULSAR stops 8 out of 10 replayed attacks before system
integrity violation, and all ten before data exfiltration); and (iii) a
production deployment of PULSAR (during a month-long deployment, PULSAR took an
average of one second to make a decision).
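As a loose illustration of the inference step, the sketch below uses a simple log-linear scoring of candidate attack stages given observed events; this stands in for the paper's factor-graph inference, and the events, stages, weights, and decision threshold are all invented.

```python
# Toy stand-in for the inference step: log-linear scoring of candidate attack
# stages given observed security events. The events, stages, weights, and
# threshold are invented; the paper uses factor-graph inference instead.
import math
from collections import defaultdict

# Weight of observing an event while the attack is in a given stage,
# playing the role of factors learned from past attacks.
weights = {
    ("port_scan", "reconnaissance"): 2.0,
    ("priv_escalation", "foothold"): 2.5,
    ("large_outbound_transfer", "exfiltration"): 3.0,
}
stages = ["benign", "reconnaissance", "foothold", "exfiltration"]

def stage_posterior(observed_events):
    scores = defaultdict(float)
    for stage in stages:
        for ev in observed_events:
            scores[stage] += weights.get((ev, stage), 0.0)
    z = sum(math.exp(s) for s in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

post = stage_posterior(["port_scan", "priv_escalation", "large_outbound_transfer"])
print(post)
if post["exfiltration"] > 0.4:        # assumed preemptive-action threshold
    print("take preemptive action before data exfiltration")
```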
Modular Resource Centric Learning for Workflow Performance Prediction
Workflows provide an expressive programming model for fine-grained control of
large-scale applications in distributed computing environments. Accurate
estimates of complex workflow execution metrics on large-scale machines have
several key advantages. The performance of scheduling algorithms that rely on
estimates of execution metrics degrades when the accuracy of predicted
execution metrics decreases. This in-progress paper presents a technique being
developed to improve the accuracy of predicted performance metrics of
large-scale workflows on distributed platforms. The central idea of this work
is to train resource-centric machine learning agents to capture complex
relationships between a set of program instructions and their performance
metrics when executed on a specific resource. This resource-centric view of a
workflow exploits the fact that predicting execution times of sub-modules of a
workflow requires monitoring and modeling of a few dynamic and static features.
We transform the input workflow, which is essentially a directed acyclic graph of
actions, into a Physical Resource Execution Plan (PREP). This transformation
enables us to model an arbitrarily complex workflow as a set of simpler
programs running on physical nodes. We delegate a machine learning model to
capture performance metrics for each resource type when it executes different
program instructions under varying degrees of resource contention. Our
algorithm takes the prediction metrics from each resource agent and composes
the overall workflow performance metrics by utilizing the structure of the
corresponding Physical Resource Execution Plan.
Comment: This paper was presented at the 6th Workshop on Big Data Analytics:
Challenges, and Opportunities (BDAC) at the 27th IEEE/ACM International
Conference for High Performance Computing, Networking, Storage, and Analysis
(SC 2015).
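The composition idea can be sketched as follows, assuming the PREP is a networkx DAG whose nodes carry a resource type and a feature vector, that each resource type has a fitted regressor, and that the workflow estimate is taken as the critical-path length; these are illustrative assumptions rather than the paper's exact procedure.

```python
# Condensed sketch of the composition step: per-resource models predict each
# task's runtime in the Physical Resource Execution Plan, and the workflow
# estimate is taken as the critical-path length through the DAG. Node
# attributes, feature vectors, and the critical-path choice are assumptions.
import networkx as nx

def predict_workflow_runtime(prep, resource_models):
    """prep: networkx.DiGraph whose nodes carry 'resource' and 'features';
    resource_models: dict mapping resource type -> fitted regressor."""
    for node, data in prep.nodes(data=True):
        model = resource_models[data["resource"]]
        data["runtime"] = float(model.predict([data["features"]])[0])

    # Critical path: latest finish time over a topological traversal.
    finish = {}
    for node in nx.topological_sort(prep):
        start = max((finish[p] for p in prep.predecessors(node)), default=0.0)
        finish[node] = start + prep.nodes[node]["runtime"]
    return max(finish.values())
```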