12,054 research outputs found
Large Scale Clustering with Variational EM for Gaussian Mixture Models
How can we efficiently find large numbers of clusters in large data sets with
high-dimensional data points? Our aim is to explore the current efficiency and
large-scale limits in fitting a parametric model for clustering to data
distributions. To do so, we combine recent lines of research that have each
previously focused on a specific method for complexity reduction. We
first show theoretically how the clustering objective of variational EM (which
reduces complexity for many clusters) can be combined with coreset objectives
(which reduce complexity for many data points). Second, we realize a concrete,
highly efficient iterative procedure that combines and translates the
theoretical complexity gains of truncated variational EM and coresets into a
practical algorithm. For very large scales, the high efficiency of parameter
updates then requires (A) highly efficient coreset construction and (B) highly
efficient initialization procedures (seeding) in order to avoid computational
bottlenecks. Fortunately, very efficient coreset construction has become
available in the form of lightweight coresets, and very efficient
initialization has become available in the form of AFK-MC² seeding. The
resulting algorithm features balanced computational costs across all of its
constituent components. In applications to standard large-scale clustering
benchmarks, we investigate the algorithm's efficiency/quality trade-off.
Compared to the best recent approaches, we observe speedups of up to one order
of magnitude, and of up to two orders of magnitude compared to the k-means++
baseline. To demonstrate that the observed efficiency enables applications
previously considered infeasible, we cluster the entire, unscaled 80 Million
Tiny Images dataset into up to 32,000 clusters. To the authors' knowledge,
this represents the largest-scale fit of a parametric data model for
clustering reported so far.
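The two complexity-reduction ingredients named in the abstract can be made concrete in a short sketch. The Python code below is a minimal illustration, not the authors' implementation: lightweight_coreset follows the lightweight-coreset sampling scheme of Bachem et al., and truncated_em_step shows a truncated variational E/M step for an isotropic GMM in which each data point assigns responsibility only to its C nearest clusters. The function names, the isotropic-GMM simplification, and the brute-force candidate search are our assumptions.

```python
import numpy as np

def lightweight_coreset(X, m, rng=None):
    """Lightweight coreset sampling (after Bachem et al.): draw m points with
    probability q(x) = 1/(2n) + d(x, mean)^2 / (2 * sum_i d(x_i, mean)^2),
    and weight each sample by 1/(m * q(x)) so the weighted subset
    approximates the full clustering objective. Assumes X is (n, d) and
    the points are not all identical."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    dist_sq = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    q = 0.5 / n + 0.5 * dist_sq / dist_sq.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    return X[idx], 1.0 / (m * q[idx])

def truncated_em_step(X, w, means, var, C=5):
    """One truncated variational EM step for an isotropic GMM on weighted
    data: each point assigns responsibility only to its C nearest clusters
    (C < K), so the E-step cost scales with C rather than with K. A real
    implementation maintains candidate sets across iterations instead of
    recomputing all n x K distances as this sketch does."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)  # (n, K) sq. distances
    cand = np.argpartition(d2, C, axis=1)[:, :C]             # C nearest clusters
    logp = -0.5 * np.take_along_axis(d2, cand, axis=1) / var
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)                        # truncated responsibilities
    rw = r * w[:, None]                                      # fold in coreset weights
    num, den = np.zeros_like(means), np.zeros(means.shape[0])
    np.add.at(den, cand, rw)
    for c in range(C):                                       # scatter-add weighted points
        np.add.at(num, cand[:, c], rw[:, [c]] * X)
    return np.where(den[:, None] > 0, num / np.maximum(den, 1e-12)[:, None], means)
```

On a coreset (Xc, w) = lightweight_coreset(X, m), iterating means = truncated_em_step(Xc, w, means, var) updates each mean from only the points that list it among their candidates, which is where the combined speedup of the coreset and truncation objectives comes from.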
Single-Board-Computer Clusters for Cloudlet Computing in Internet of Things
The number of connected sensors and devices is expected to increase to billions in the near
future. However, centralised cloud-computing data centres face various challenges in meeting the
requirements inherent to Internet of Things (IoT) workloads, such as low latency, high throughput,
and bandwidth constraints. Edge computing is becoming the standard computing paradigm for
latency-sensitive real-time IoT workloads, since it addresses the aforementioned limitations related
to centralised cloud-computing models. Such a paradigm relies on bringing computation close to
the source of data, which presents serious operational challenges for large-scale cloud-computing
providers. In this work, we present an architecture composed of low-cost Single-Board-Computer
clusters located near data sources, combined with centralised cloud-computing data centres. The proposed
cost-efficient model may be employed as an alternative to fog computing to meet real-time IoT
workload requirements while preserving scalability. We include an extensive empirical analysis to
assess the suitability of single-board-computer clusters as cost-effective edge-computing micro data
centres. Additionally, we compare the proposed architecture with traditional cloudlet and cloud
architectures, and evaluate them through extensive simulation. Finally, we show that acquisition costs
can be drastically reduced while maintaining performance levels in data-intensive IoT use cases.
Funding: Ministerio de Economía y Competitividad TIN2017-82113-C2-1-R; Ministerio de Economía y Competitividad RTI2018-098062-A-I00; European Union’s Horizon 2020 No. 754489; Science Foundation Ireland grant 13/RC/209
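As a rough illustration of the trade-off such simulations explore, the toy Python model below compares end-to-end request latency against a nearby SBC cloudlet (low network round trip, modest compute) and a remote cloud data centre (high round trip, fast compute). All numeric parameters here are invented for illustration and are not taken from the paper's experiments.

```python
import random

# Illustrative-only parameters (assumed, not from the paper): an SBC cloudlet
# sits one network hop from the sensors, the cloud data centre is WAN-distant.
SBC_RTT_MS, SBC_PROC_MS_PER_MB = 2.0, 40.0
CLOUD_RTT_MS, CLOUD_PROC_MS_PER_MB = 80.0, 5.0

def request_latency_ms(payload_mb, rtt_ms, proc_ms_per_mb, jitter=0.1):
    """End-to-end latency of one IoT request: network round trip plus
    processing time proportional to payload size, with multiplicative jitter."""
    base = rtt_ms + payload_mb * proc_ms_per_mb
    return base * random.uniform(1.0 - jitter, 1.0 + jitter)

def mean_latency_ms(payload_mb, rtt_ms, proc_ms_per_mb, trials=10_000):
    return sum(request_latency_ms(payload_mb, rtt_ms, proc_ms_per_mb)
               for _ in range(trials)) / trials

for mb in (0.1, 1.0, 10.0):
    sbc = mean_latency_ms(mb, SBC_RTT_MS, SBC_PROC_MS_PER_MB)
    cloud = mean_latency_ms(mb, CLOUD_RTT_MS, CLOUD_PROC_MS_PER_MB)
    print(f"{mb:>5.1f} MB: SBC cloudlet {sbc:7.1f} ms | cloud {cloud:7.1f} ms")
```

Under these toy numbers the cloudlet wins for small, latency-sensitive requests while the cloud wins once per-request computation dominates, which is the regime boundary the proposed hybrid architecture is meant to straddle.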
Pregelix: Big(ger) Graph Analytics on A Dataflow Engine
There is a growing need for distributed graph processing systems that are
capable of gracefully scaling to very large graph datasets. Unfortunately, this
challenge has not been easily met, due to the intense memory pressure imposed by
the process-centric, message-passing designs that many graph processing systems
follow. Pregelix is a new open source distributed graph processing system that
is based on an iterative dataflow design that is better tuned to handle both
in-memory and out-of-core workloads. As such, Pregelix offers improved
performance characteristics and scaling properties over current open source
systems (e.g., we have seen up to 15x speedup compared to Apache Giraph and up
to 35x speedup compared to distributed GraphLab), and makes more effective use
of available machine resources to support Big(ger) Graph Analytics.
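Pregelix implements the Pregel vertex-centric programming model on top of an iterative dataflow runtime, which is what lets it spill gracefully to disk rather than hold all messages in memory. The toy, single-process Python loop below illustrates only the superstep/message-passing model itself, here computing PageRank; Pregelix's actual API is in Java, and this sketch is not its implementation.

```python
# A toy, single-process rendition of the Pregel superstep model that systems
# like Pregelix implement at scale. Purely illustrative.

def pagerank_superstep_loop(graph, num_supersteps=30, damping=0.85):
    """graph: dict mapping vertex -> list of out-neighbours.
    Each superstep, every vertex combines its incoming messages, updates its
    value, and sends value/out_degree along its outgoing edges."""
    n = len(graph)
    values = {v: 1.0 / n for v in graph}
    messages = {v: [] for v in graph}
    for _ in range(num_supersteps):
        next_messages = {v: [] for v in graph}
        for v, out in graph.items():
            # "Compute" phase: consume incoming messages, update vertex value.
            if messages[v]:
                values[v] = (1 - damping) / n + damping * sum(messages[v])
            # "Send" phase: distribute rank evenly to out-neighbours.
            for u in out:
                next_messages[u].append(values[v] / len(out))
        messages = next_messages
    return values

ranks = pagerank_superstep_loop({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
print(ranks)
```

Expressing this compute/send pattern as joins and group-bys over vertex and message relations is what allows a dataflow engine to execute the same program both in-memory and out-of-core.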