Optimizing CMS build infrastructure via Apache Mesos
The Offline Software of the CMS Experiment at the Large Hadron Collider (LHC)
at CERN consists of 6M lines of in-house code, developed over a decade by
nearly 1000 physicists, as well as a comparable amount of general use
open-source code. A critical ingredient to the success of the construction and
early operation of the WLCG was the convergence, around the year 2000, on the
use of a homogeneous environment of commodity x86-64 processors and Linux.
Apache Mesos is a cluster manager that provides efficient resource isolation
and sharing across distributed applications, or frameworks. It can run Hadoop,
Jenkins, Spark, Aurora, and other applications on a dynamically shared pool of
nodes. We present how we migrated our continuous integration system to schedule
jobs on a relatively small Apache Mesos enabled cluster and how this resulted
in better resource usage, higher peak performance and lower latency thanks to
the dynamic scheduling capabilities of Mesos.
Comment: Submitted to proceedings of the 21st International Conference on
Computing in High Energy and Nuclear Physics (CHEP2015), Okinawa, Japan
Efficient Implementation of the Room Simulator for Training Deep Neural Network Acoustic Models
In this paper, we describe how to efficiently implement an acoustic room
simulator to generate large-scale simulated data for training deep neural
networks. Even though Google Room Simulator in [1] was shown to be quite
effective in reducing the Word Error Rates (WERs) for far-field applications by
generating simulated far-field training sets, it requires a very large number
of Fast Fourier Transforms (FFTs) of large size. Room Simulator in [1] used
approximately 80 percent of Central Processing Unit (CPU) usage in our CPU +
Graphics Processing Unit (GPU) training architecture [2]. In this work, we
implement an efficient OverLap Addition (OLA) based filtering using the
open-source FFTW3 library. Further, we investigate the effects of the Room
Impulse Response (RIR) lengths. Experimentally, we conclude that we can cut the
tail portions of RIRs whose power is less than 20 dB below the maximum power
without sacrificing the speech recognition accuracy. However, we observe that
cutting RIR tail more than this threshold harms the speech recognition accuracy
for rerecorded test sets. Using these approaches, we were able to reduce CPU
usage for the room simulator portion down to 9.69 percent in CPU/GPU training
architecture. Profiling results show that we obtain a 22.4 times speed-up on a
single machine and a 37.3 times speed-up on Google's distributed training
infrastructure.
Comment: Published at INTERSPEECH 2018.
(https://www.isca-speech.org/archive/Interspeech_2018/abstracts/2566.html)
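The overlap-add filtering and RIR tail truncation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses the FFTW3 library, whereas this sketch uses a small pure-Python radix-2 FFT so that it has no dependencies. The function names, the block length, and the reading of the 20 dB rule (drop the tail after the last sample whose power is within 20 dB of the peak power) are assumptions made for illustration.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def next_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p


def overlap_add(signal, rir, block_len=8):
    """Convolve signal with rir block by block, adding the overlapping tails."""
    m = len(rir)
    nfft = next_pow2(block_len + m - 1)
    # The filter spectrum is computed once and reused for every block.
    H = fft(list(rir) + [0.0] * (nfft - m))
    out = [0.0] * (len(signal) + m - 1)
    for start in range(0, len(signal), block_len):
        block = list(signal[start:start + block_len])
        X = fft(block + [0.0] * (nfft - len(block)))
        y = fft([a * b for a, b in zip(X, H)], inverse=True)
        for i in range(len(block) + m - 1):
            out[start + i] += y[i].real / nfft  # normalize the inverse FFT
    return out


def truncate_rir(rir, threshold_db=20.0):
    """Assumed rule: keep samples up to the last one within threshold_db of peak power."""
    peak = max(h * h for h in rir)
    floor = peak * 10.0 ** (-threshold_db / 10.0)
    last = max(i for i, h in enumerate(rir) if h * h >= floor)
    return rir[:last + 1]
```

A shorter RIR shrinks the FFT size needed per block, which is one plausible reason why truncation and overlap-add together reduce the filtering cost.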
DC-Prophet: Predicting Catastrophic Machine Failures in DataCenters
When will a server fail catastrophically in an industrial datacenter? Is it
possible to forecast these failures so preventive actions can be taken to
increase the reliability of a datacenter? To answer these questions, we have
studied what are probably the largest, publicly available datacenter traces,
containing more than 104 million events from 12,500 machines. Among these
samples, we observe and categorize three types of machine failures, all of
which are catastrophic and may lead to information loss, or even worse,
reliability degradation of a datacenter. We further propose a two-stage
framework, DC-Prophet, based on One-Class Support Vector Machine and Random
Forest. DC-Prophet extracts surprising patterns and accurately predicts the
next failure of a machine. Experimental results show that DC-Prophet achieves
an AUC of 0.93 in predicting the next machine failure, and an F3-score of 0.88
(out of 1). On average, DC-Prophet outperforms other classical machine learning
methods by 39.45% in F3-score.
Comment: 13 pages, 5 figures, accepted by 2017 ECML PKDD
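For readers unfamiliar with the F3-score reported above, the following sketch computes the general F-beta score from confusion counts. It uses the standard definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), in which beta = 3 weights recall more heavily than precision. The function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
def f_beta(tp, fp, fn, beta=3.0):
    """F-beta score from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 failures caught, 4 false alarms, 2 failures missed.
score = f_beta(tp=8, fp=4, fn=2, beta=3.0)
```

With beta = 3, a missed failure costs far more than a false alarm, which suits failure prediction where an undetected failure is the expensive outcome.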
Coded Distributed Tracking
We consider the problem of tracking the state of a process that evolves over
time in a distributed setting, with multiple observers each observing parts of
the state, which is a fundamental information processing problem with a wide
range of applications. We propose a cloud-assisted scheme where the tracking is
performed over the cloud. In particular, to provide timely and accurate
updates, and alleviate the straggler problem of cloud computing, we propose a
coded distributed computing approach where coded observations are distributed
over multiple workers. The proposed scheme is based on a coded version of the
Kalman filter that operates on data encoded with an erasure correcting code,
such that the state can be estimated from partial updates computed by a subset
of the workers. We apply the proposed scheme to the problem of tracking
multiple vehicles. We show that replication achieves significantly higher
accuracy than the corresponding uncoded scheme. The use of maximum distance
separable (MDS) codes further improves accuracy for larger update intervals. In
both cases, the proposed scheme approaches the accuracy of an ideal centralized
scheme when the update interval is large enough. Finally, we observe a
trade-off between age-of-information and estimation accuracy for MDS codes.
Comment: Accepted for publication at IEEE GLOBECOM 201
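The idea of recovering a result from a subset of workers can be sketched with a toy coded computation. This is not the coded Kalman filter of the paper. It only shows the underlying erasure-coding principle: k data blocks are encoded into n coded blocks with a Vandermonde generator matrix (an MDS code over the rationals), each worker applies the same linear function to its coded block, and the results of any k workers suffice to recover the k uncoded results. All names and values are hypothetical.

```python
from fractions import Fraction


def encode(blocks, n):
    """Encode k equal-length blocks into n coded blocks (Vandermonde generator)."""
    k = len(blocks)
    width = len(blocks[0])
    coded = []
    for i in range(1, n + 1):
        coeffs = [Fraction(i) ** j for j in range(k)]
        coded.append([sum(c * b[t] for c, b in zip(coeffs, blocks))
                      for t in range(width)])
    return coded


def solve(A, b):
    """Gauss-Jordan elimination over exact fractions (A must be invertible)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]


def decode(results, k):
    """results maps 1-based worker index to its scalar output; any k entries suffice."""
    workers = sorted(results)[:k]
    A = [[Fraction(i) ** j for j in range(k)] for i in workers]
    b = [Fraction(results[i]) for i in workers]
    return solve(A, b)


def dot(w, v):
    return sum(a * b for a, b in zip(w, v))


# Three data blocks, five workers; workers 2 and 4 are stragglers and never reply.
blocks = [[1, 2], [3, 4], [5, 6]]
w = [2, -1]
coded = encode(blocks, n=5)
results = {i: dot(w, coded[i - 1]) for i in (1, 3, 5)}
recovered = decode(results, k=3)  # equals [dot(w, b) for b in blocks]
```

Because any k rows of a Vandermonde matrix with distinct nodes are linearly independent, the decoder never has to wait for a specific worker, which is how coding alleviates the straggler problem.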
ClouNS - A Cloud-native Application Reference Model for Enterprise Architects
The capability to operate cloud-native applications can generate enormous
business growth and value. But enterprise architects should be aware that
cloud-native applications are vulnerable to vendor lock-in. We investigated
cloud-native application design principles, public cloud service providers, and
industrial cloud standards. All results indicate that most cloud service
categories seem to foster vendor lock-in situations which might be especially
problematic for enterprise architectures. This might sound disillusioning at
first. However, we present a reference model for cloud-native applications that
relies only on a small subset of well standardized IaaS services. The reference
model can be used for codifying cloud technologies. It can guide technology
identification, classification, adoption, research and development processes
for cloud-native applications and for vendor lock-in aware enterprise
architecture engineering methodologies.
A Big Data Analyzer for Large Trace Logs
The current generation of Internet-based services is typically hosted on large
data centers that take the form of warehouse-size structures housing tens of
thousands of servers. Continued availability of a modern data center is the
result of a complex orchestration among many internal and external actors
including computing hardware, multiple layers of intricate software, networking
and storage devices, electrical power and cooling plants. During the course of
their operation, many of these components produce large amounts of data in the
form of event and error logs that are essential not only for identifying and
resolving problems but also for improving data center efficiency and
management. Most of these activities would benefit significantly from data
analytics techniques to exploit hidden statistical patterns and correlations
that may be present in the data. The sheer volume of data to be analyzed makes
uncovering these correlations and patterns a challenging task. This paper
presents BiDAl, a prototype Java tool for log-data analysis that incorporates
several Big Data technologies in order to simplify the task of extracting
information from data traces produced by large clusters and server farms. BiDAl
provides the user with several analysis languages (SQL, R and Hadoop MapReduce)
and storage backends (HDFS and SQLite) that can be freely mixed and matched so
that a custom tool for a specific task can be easily constructed. BiDAl has a
modular architecture so that it can be extended with other backends and
analysis languages in the future. In this paper we present the design of BiDAl
and describe our experience using it to analyze publicly-available traces from
Google data clusters, with the goal of building a realistic model of a complex
data center.
Comment: 26 pages, 10 figures
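As a rough illustration of the SQL-over-SQLite combination mentioned in the abstract, the sketch below loads a few trace events into an in-memory SQLite database and runs an aggregate query. It does not use BiDAl itself, which is a Java tool. The table schema, the event type names, and the sample rows are hypothetical and only loosely modeled on machine-event traces.

```python
import sqlite3


def load_events(rows):
    """Create an in-memory database with a hypothetical machine-events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (time INTEGER, machine_id INTEGER, event_type TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def removals_per_machine(conn):
    """Count how often each machine was removed from the cluster."""
    cur = conn.execute(
        "SELECT machine_id, COUNT(*) FROM events "
        "WHERE event_type = 'REMOVE' "
        "GROUP BY machine_id ORDER BY machine_id"
    )
    return cur.fetchall()


# Hypothetical sample rows: (time, machine_id, event_type).
sample = [
    (0, 1, "ADD"),
    (0, 2, "ADD"),
    (50, 1, "REMOVE"),
    (60, 1, "ADD"),
    (90, 1, "REMOVE"),
    (95, 2, "REMOVE"),
]
conn = load_events(sample)
counts = removals_per_machine(conn)
```

Keeping the storage backend behind plain SQL is what lets a query like this move between backends with little change, in the spirit of the mix-and-match design the abstract describes.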