1,042 research outputs found
Map-Reduce for Processing GPS Data from Public Transport in Montevideo, Uruguay
This article addresses the problem of processing large volumes of historical GPS data from buses to compute quality-of-service metrics for urban transportation systems. We designed and implemented a solution to distribute the data processing on multiple processing units in a distributed computing infrastructure. For the experimental analysis we used historical data from Montevideo, Uruguay. The proposed solution scales properly when processing large volumes of input data, achieving a speedup of up to 22× when using 24 computing resources.
As case studies, we used the historical data to compute the average speed of bus lines in Montevideo and to identify troublesome locations, according to the delay and deviation of the times to reach each bus stop. Similar studies can be used by control authorities and policy makers to gain insight into the transportation system and improve the quality of service. Sociedad Argentina de Informática e Investigación Operativa (SADIO)
SOLAR: A Highly Optimized Data Loading Framework for Distributed Training of CNN-based Scientific Surrogates
CNN-based surrogates have become prevalent in scientific applications to
replace conventional time-consuming physical approaches. Although these
surrogates can yield satisfactory results with significantly lower computation
costs over small training datasets, our benchmarking results show that
data-loading overhead becomes the major performance bottleneck when training
surrogates with large datasets. In practice, surrogates are usually trained
with high-resolution scientific data, which can easily reach the terabyte
scale. Several state-of-the-art data loaders have been proposed to improve
loading throughput in general CNN training; however, they are sub-optimal when
applied to surrogate training. In this work, we propose SOLAR, a surrogate
data loader that can increase loading throughput during training.
It builds on three key observations from our benchmarking and
contains three novel designs. Specifically, SOLAR first generates a
pre-determined shuffled index list and accordingly optimizes the global access
order and the buffer eviction scheme to maximize the data reuse and the buffer
hit rate. It then proposes a tradeoff between lightweight computational
imbalance and heavyweight loading workload imbalance to speed up the overall
training. It finally optimizes its data access pattern with HDF5 to achieve a
better parallel I/O throughput. Our evaluation with three scientific surrogates
and 32 GPUs illustrates that SOLAR can achieve up to 24.4X speedup over PyTorch
Data Loader and 3.52X speedup over state-of-the-art data loaders.
Comment: 14 pages, 15 figures, 5 tables, submitted to VLDB '2
Classification algorithms for Big Data with applications in the urban security domain
A classification algorithm is a versatile tool that can serve as a predictor for the
future or as an analytical tool to understand the past. Several obstacles prevent
classification from scaling to a large Volume, Velocity, Variety or Value. The aim
of this thesis is to scale distributed classification algorithms beyond current limits,
assess the state-of-practice of Big Data machine learning frameworks and validate
the effectiveness of a data science process in improving urban safety.
We found that massive datasets with a number of large-domain categorical features
pose a difficult challenge for existing classification algorithms. We propose associative
classification as a possible answer, and develop several novel techniques to distribute
the training of an associative classifier among parallel workers and improve the final
quality of the model. The experiments, run on a real large-scale dataset with more
than 4 billion records, confirmed the quality of the approach.
To assess the state-of-practice of Big Data machine learning frameworks and
streamline the process of integration and fine-tuning of the building blocks, we
developed a generic, self-tuning tool to extract knowledge from network traffic
measurements. The result is a system that offers human-readable models of the data
with minimal user intervention, validated by experiments on large collections of
real-world passive network measurements.
A good portion of this dissertation is dedicated to the study of a data science
process to improve urban safety. First, we shed some light on the feasibility of a
system to monitor social messages from a city for emergency relief. We then propose
a methodology to mine temporal patterns in social issues, like crimes. Finally,
we propose a system to integrate the findings of Data Science on the citizenry’s
perception of safety and communicate its results to decision makers in a timely
manner. We applied and tested the system in a real Smart City scenario, set in Turin,
Italy.
Optimum Parallel Processing Schemes to Improve the Computation Speed for Renewable Energy Allocation and Sizing Problems
The optimum penetration of distributed generation into the distribution grid provides several technical and economic benefits. However, the computational time required to solve the constrained optimization problems increases with the network scale and may be too long for online implementations. This paper presents a parallel solution of a multi-objective distributed generation (DG) allocation and sizing problem to handle a large number of computations. The aim is to find the optimum number of processors in addition to energy loss and DG cost minimization. The proposed formulation is applied to a 33-bus test system, and the results are compared with themselves and with the base case operating conditions using the optimal values and three popular multi-objective optimization metrics. The results show that comparable solutions with high-efficiency values can be obtained up to a certain number of processors.
A GPU-accelerated package for simulation of flow in nanoporous source rocks with many-body dissipative particle dynamics
Mesoscopic simulations of hydrocarbon flow in source shales are challenging,
in part due to the heterogeneous shale pores with sizes ranging from a few
nanometers to a few micrometers. Additionally, the sub-continuum fluid-fluid
and fluid-solid interactions in nano- to micro-scale shale pores, which are
physically and chemically sophisticated, must be captured. To address those
challenges, we present a GPU-accelerated package for simulation of flow in
nano- to micro-pore networks with a many-body dissipative particle dynamics
(mDPD) mesoscale model. Based on a fully distributed parallel paradigm, the
code offloads all intensive workloads on GPUs. Other advancements, such as
smart particle packing and no-slip boundary condition in complex pore
geometries, are also implemented for the construction and the simulation of the
realistic shale pores from 3D nanometer-resolution stack images. Our code is
validated for accuracy and compared against the CPU counterpart for speedup. In
our benchmark tests, the code delivers nearly perfect strong scaling and weak
scaling (with up to 512 million particles) on up to 512 K20X GPUs on Oak Ridge
National Laboratory's (ORNL) Titan supercomputer. Moreover, a single-GPU
benchmark on ORNL's SummitDev and IBM's AC922 suggests that the host-to-device
NVLink can boost performance over PCIe by a remarkable 40%. Lastly, we
demonstrate, through a flow simulation in realistic shale pores, that the CPU
counterpart requires 840 Power9 cores to rival the performance delivered by our
package with four V100 GPUs on ORNL's Summit architecture. This simulation
package enables quick-turnaround and high-throughput mesoscopic numerical
simulations for investigating complex flow phenomena in nano- to micro-porous
rocks with realistic pore geometries.
Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications
Huge amounts of georeferenced data streams arrive daily at data stream management systems that are deployed to serve highly scalable and dynamic applications. There are innumerable ways in which those loads can be exploited to gain deep insights in various domains. Decision makers require an interactive visualization of such data in the form of maps and dashboards for decision making and strategic planning. Data streams normally exhibit fluctuation and oscillation in arrival rates, as well as skewness. Those are the two predominant factors that greatly impact the overall quality of service. This requires data stream management systems to be attuned to those factors, in addition to the spatial shape of the data, which may exaggerate their negative impact. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user, which is challenging and cumbersome. Three workloads are predominant for any data stream: batch processing, scalable storage and stream processing. In this thesis, we have designed a quality-of-service-aware system, SpatialDSMS, that comprises several subsystems covering those loads and any mixed load that results from intermixing them. Most importantly, we have natively incorporated quality-of-service optimizations for processing avalanches of georeferenced data streams in highly dynamic application scenarios. This has been achieved transparently on top of the codebases of emerging de facto standard best-in-class representatives, thus relieving users in the presentation layer from having to reason about those services. Instead, users express their queries with quality goals, and our system optimizer compiles them down into query plans with an embedded quality guarantee and leaves logistic handling to the underlying layers.
We have developed standard-compliant prototypes for all the subsystems that constitute SpatialDSMS.
MobileTrust: Secure Knowledge Integration in VANETs
Vehicular Ad hoc NETworks (VANETs) are becoming popular due to the emergence of the Internet of Things and ambient intelligence applications. In such networks, secure resource sharing functionality is accomplished by incorporating trust schemes. Current solutions adopt peer-to-peer technologies that can cover the large operational area. However, these systems fail to capture some inherent properties of VANETs, such as fast and ephemeral interaction, making robust trust evaluation of crowdsourcing challenging. In this article, we propose MobileTrust, a hybrid trust-based system for secure resource sharing in VANETs. The proposal is a breakthrough in centralized trust computing that utilizes cloud and upcoming 5G technologies to provide robust trust establishment with global scalability. The ad hoc communication is energy-efficient and protects the system against threats that are not countered by the current settings. To evaluate its performance and effectiveness, MobileTrust is modelled in the SUMO simulator and tested on the traffic features of the small-size German city of Eichstatt. Similar schemes are implemented in the same platform to provide a fair comparison. Moreover, MobileTrust is deployed on a typical embedded system platform and applied on a real smart car installation for monitoring traffic and road-state parameters of an urban application. The proposed system is developed under the EU-funded THREAT-ARREST project, to provide security, privacy, and trust in an intelligent and energy-aware transportation scenario, bringing closer the vision of a sustainable circular economy.