An Analysis of Distributed Systems Syllabi With a Focus on Performance-Related Topics
We analyze a dataset of 51 current (2019-2020) Distributed Systems syllabi
from top Computer Science programs, focusing on finding the prevalence and
context in which topics related to performance are being taught in these
courses. We also study the scale of the infrastructure mentioned in DS courses,
from small client-server systems to cloud-scale, peer-to-peer, global-scale
systems. We make eight main findings, covering goals such as performance, and
scalability and its variant elasticity; activities such as performance
benchmarking and monitoring; eight selected performance-enhancing techniques
(replication, caching, sharding, load balancing, scheduling, streaming,
migrating, and offloading); and control issues such as trade-offs that include
performance and performance variability.
Comment: Accepted for publication at WEPPE 2021, to be held in conjunction
with ACM/SPEC ICPE 2021 (https://doi.org/10.1145/3447545.3451197). This article
is a follow-up of our prior ACM SIGCSE publication, arXiv:2012.0055.
Efficient clustering techniques for big data
Clustering is an essential data mining technique that divides observations into
groups where each group contains similar observations. K-Means is one of the
most popular and widely used clustering algorithms that has been used for over
fifty years. The majority of the running time in the original K-Means algorithm
(known as Lloyd’s algorithm) is spent on computing distances from each data
point to all cluster centres to find the closest centre to each data point. Due to
the current exponential growth of the data, it became a necessity to improve KMeans
even further to cope with large-scale datasets, known as Big Data. Hence,
the main aim of this thesis is to improve the efficiency and scalability of Lloyd’s
K-Means.
One of the most efficient techniques for accelerating K-Means is to use the triangle
inequality. Implementing such techniques on a reliable distributed model
creates a powerful combination, which can lead to an efficient and
highly scalable parallel version of K-Means that offers a practical solution to the
problem of clustering Big Data.
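As an illustration only (not the thesis's own algorithm), the sketch below shows how the triangle inequality can prune distance computations in Lloyd's assignment step: each point keeps an upper bound on the distance to its assigned centre, and whenever that bound is below half the distance from that centre to its nearest other centre, all k distance computations for that point can be skipped. Function and variable names are ours.

```python
import numpy as np

def kmeans_triangle(X, k, iters=20, seed=0):
    """Lloyd's K-Means with a simple triangle-inequality pruning step.

    Keeps one upper bound per point on the distance to its assigned centre;
    if that bound is below half the distance from the centre to its nearest
    other centre, the point cannot change cluster, so its k distance
    computations are skipped in this iteration.
    """
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)].copy()

    # Initial exact assignment and upper bounds.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    upper = dists[np.arange(len(X)), assign]

    for _ in range(iters):
        # Half the distance from each centre to its nearest other centre.
        cc = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
        np.fill_diagonal(cc, np.inf)
        half_nearest = cc.min(axis=1) / 2.0

        # Only points whose bound exceeds the threshold may move: recompute exactly.
        unsettled = upper > half_nearest[assign]
        if unsettled.any():
            d = np.linalg.norm(X[unsettled, None, :] - centres[None, :, :], axis=2)
            assign[unsettled] = d.argmin(axis=1)
            upper[unsettled] = d.min(axis=1)

        # Update step: move centres and relax the bounds by the centre drift.
        new_centres = np.array([X[assign == j].mean(axis=0) if (assign == j).any()
                                else centres[j] for j in range(k)])
        drift = np.linalg.norm(new_centres - centres, axis=1)
        upper += drift[assign]   # the triangle inequality keeps the bound valid
        centres = new_centres

    return assign, centres

# Tiny usage example on synthetic data.
if __name__ == "__main__":
    pts = np.vstack([np.random.randn(100, 2) + off for off in ([0, 0], [5, 5], [0, 6])])
    labels, cs = kmeans_triangle(pts, k=3)
    print(cs)
```

Variants in this family differ mainly in how many bounds they keep per point and in how those bounds are maintained from one iteration to the next.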
MapReduce, and its popular open-source implementation known as Hadoop,
provides a distributed computing framework that efficiently stores, manages, and
processes large-scale datasets over a large cluster of commodity machines. Many
studies introduced a parallel implementation of Lloyd’s K-Means on Hadoop in
order to improve the algorithm's scalability. This research examines methods
based on the triangle inequality to achieve further improvements in the efficiency of
parallel Lloyd's K-Means on Hadoop.
Variants of K-Means that use the triangle inequality usually require extra information,
such as distance bounds and cluster assignments, from the previous iteration
in order to work efficiently. This is challenging to achieve on Hadoop for two reasons:
1) Hadoop does not directly support iterative algorithms; and 2) Hadoop does not
allow information to be exchanged between two consecutive iterations. Hence, two
techniques are proposed to give Hadoop the ability to pass information from one
iteration to the next. The first technique uses a data structure, referred to as an
Extended Vector (EV), that appends the extra information to the original data
vector. The second technique stores the extra information in files, where each file
is referred to as a Bounds File (BF).
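A minimal sketch of how the Extended Vector idea can be realised, as we read the description above; the field names and the text record layout are our own assumptions, not necessarily those used in the thesis. Each record written by one MapReduce iteration carries the original data vector together with the per-point state (cluster assignment and a distance bound) that the next iteration's mappers need, so no side channel between jobs is required.

```python
# Hypothetical record layout for an "Extended Vector": the original data
# vector followed by the state a triangle-inequality K-Means variant needs
# from the previous iteration (assigned cluster and an upper distance bound).
from dataclasses import dataclass
from typing import List

@dataclass
class ExtendedVector:
    point: List[float]   # original data vector
    assign: int          # cluster assigned in the previous iteration
    upper: float         # upper bound on the distance to that cluster's centre

    def to_line(self) -> str:
        # Text form suitable for a Hadoop Streaming style mapper/reducer pipeline.
        coords = ",".join(f"{x:.6g}" for x in self.point)
        return f"{coords}\t{self.assign}\t{self.upper:.6g}"

    @classmethod
    def from_line(cls, line: str) -> "ExtendedVector":
        coords, assign, upper = line.rstrip("\n").split("\t")
        return cls([float(x) for x in coords.split(",")], int(assign), float(upper))

# Round-trip example: the mapper of iteration t+1 recovers the state
# written by iteration t directly from its input records.
ev = ExtendedVector([1.5, -2.0, 0.25], assign=3, upper=0.81)
assert ExtendedVector.from_line(ev.to_line()).assign == 3
```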
To evaluate the two proposed techniques, two K-Means variants are implemented
on Hadoop using the two techniques. Each variant is tested against varying
numbers of clusters, dimensions, data points, and mappers. Furthermore, the
performance of various implementations of K-Means on Hadoop and Spark is investigated.
The results show a significant improvement in the efficiency of the
new implementations compared to Lloyd's K-Means on Hadoop, on both real and
artificial datasets.
Graphy: Exploring the potential of the Contacts application
The number of mobile devices is growing very fast. Smartphones and tablets are, step by step, replacing desktops and laptops as the primary method of computing in daily life. Along with the rapid evolution of mobile devices, the applications on them are undergoing a fast transformation. We can see many improvements in traditional applications (messaging, calling, etc.), such as multimedia text messages, video calls, voice over IP, and so forth. However, the Contacts application has not changed much, even though it has a great deal of potential. In this thesis, we propose a new model which improves the Contacts application by introducing three novel capabilities: searching for contacts by their miscellaneous information, retaining knowledge about contacts via a tag system, and establishing a Personal Social Network which consists of the relationships between the contacts. By introducing these capabilities, the model helps its users accomplish new tasks which are not currently handled by modern Contacts applications. Furthermore, the model is implemented as a fully functional prototype on iOS and Android. The prototype is then evaluated in a user study and a system performance test. The studies yield positive results which indicate that the three new capabilities are valuable and should be included in today's Contacts applications.
Move Big Data to the Cloud: an Online Cost-Minimizing Approach
Semantically-aware data discovery and placement in collaborative computing environments
As the size of scientific datasets and the demand for interdisciplinary collaboration grow in modern science, it becomes imperative that better ways of discovering and placing datasets generated across multiple disciplines be developed to facilitate interdisciplinary scientific research. For discovering relevant data in large-scale interdisciplinary datasets, the development and integration of cross-domain metadata is critical, as metadata serves as the key guideline for organizing data. To develop and integrate cross-domain metadata management systems in an interdisciplinary collaborative computing environment, three key issues need to be addressed: the development of a cross-domain metadata schema; the implementation of a metadata management system based on this schema; and the integration of the metadata system into existing distributed computing infrastructure. Current research on metadata management in distributed computing environments largely focuses on relatively simple schemas that lack the descriptive power to adequately address the semantic heterogeneity often found in interdisciplinary science, and it does not take adequate account of scalability in large-scale data management. Another key issue in data management is data placement: owing to the increasing size of scientific datasets, the overhead incurred by transferring data among different nodes has grown into a significant factor affecting overall performance, yet few data placement strategies take into consideration semantic information concerning data content. In this dissertation, we propose a cross-domain metadata system in a collaborative distributed computing environment, and we identify and evaluate the key factors and processes involved in a successful cross-domain metadata system, with the goal of facilitating data discovery in collaborative environments. This will allow researchers to conduct interdisciplinary science in the context of large-scale datasets, making it easier to access interdisciplinary datasets, reducing the barriers to collaboration, and reducing the cost of future development of similar systems. We also investigate data placement strategies that use semantic information about the hardware and network environment as well as domain information in the form of semantic metadata, so that semantic locality can be exploited in data placement, potentially reducing the overhead of accessing large-scale interdisciplinary datasets.
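As a toy illustration of semantics-aware placement (our own simplification, not the dissertation's strategy), a new dataset can be placed on the node whose already-hosted domains overlap most with the dataset's semantic metadata, so that semantically related data ends up co-located; the scoring rule and metadata fields below are assumptions.

```python
# Toy sketch of semantics-aware placement: each node advertises the domains
# it already hosts, and a new dataset is placed on the node whose hosted
# domains overlap most with the dataset's semantic metadata.

def place(dataset_domains, nodes):
    """nodes: mapping of node name -> set of domains hosted there."""
    return max(nodes, key=lambda n: len(dataset_domains & nodes[n]))

nodes = {
    "node-a": {"climatology", "oceanography"},
    "node-b": {"genomics", "proteomics"},
}
print(place({"oceanography", "hydrology"}, nodes))   # -> node-a
```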
Cost-Based Optimization of Integration Flows
Integration flows are increasingly used to specify and execute data-intensive integration tasks between heterogeneous systems and applications. There are many different application areas, such as real-time ETL and data synchronization between operational systems. Owing to increasing amounts of data, highly distributed IT infrastructures, and high requirements for data consistency and up-to-dateness of query results, many instances of integration flows are executed over time. Due to this high load and the blocking of synchronous source systems, the performance of the central integration platform is crucial for the whole IT infrastructure. To tackle these high performance requirements, we introduce the concept of cost-based optimization of imperative integration flows, which relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we introduce the concept of periodical re-optimization, including novel cost-based optimization techniques that are tailor-made for integration flows. Furthermore, we refine periodical re-optimization to on-demand re-optimization in order to overcome the problems of many unnecessary re-optimization steps and of adaptation delays, during which optimization opportunities are missed. This approach ensures low optimization overhead and fast workload adaptation.
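To make the contrast with purely periodical re-optimization concrete, the following sketch (our own simplification, with illustrative statistic names and an illustrative threshold) triggers re-optimization only when incrementally maintained statistics drift beyond a threshold from the values the current plan was optimized for.

```python
# Sketch of an on-demand re-optimization trigger: statistics are maintained
# incrementally after every flow instance, and the plan is re-optimized only
# when they drift too far from the statistics the current plan assumed.

class FlowOptimizer:
    def __init__(self, plan, drift_threshold=0.2):
        self.plan = plan
        self.drift_threshold = drift_threshold
        self.assumed = {}    # statistics the current plan was optimized for
        self.observed = {}   # incrementally maintained statistics

    def record(self, stats):
        # Exponential smoothing keeps the per-instance maintenance cost low.
        for key, value in stats.items():
            old = self.observed.get(key, value)
            self.observed[key] = 0.9 * old + 0.1 * value

    def maybe_reoptimize(self, optimize):
        if not self.assumed:
            self.assumed = dict(self.observed)
            return False
        drift = max(
            abs(self.observed[k] - self.assumed.get(k, self.observed[k]))
            / max(self.assumed.get(k, 1e-9), 1e-9)
            for k in self.observed
        )
        if drift > self.drift_threshold:   # only re-optimize when it pays off
            self.plan = optimize(self.plan, self.observed)
            self.assumed = dict(self.observed)
            return True
        return False
```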
Reliable & Efficient Data Centric Storage for Data Management in Wireless Sensor Networks
Wireless Sensor Networks (WSNs) have become a mature technology aimed at performing environmental monitoring and data collection. Nonetheless, harnessing the power of a WSN presents a number of research challenges. WSN application developers have to deal both with the business logic of the application and with WSN's issues, such as those related to networking (routing), storage, and transport. A middleware can cope with this emerging complexity, and can provide the necessary abstractions for the definition, creation and maintenance of applications.
The final goal of most WSN applications is to gather data from the environment and to transport such data to the user applications, which usually reside outside the WSN.
Techniques for data collection can be based on external storage, local storage and in-network storage.
External storage sends data to the sink (a centralized data collector that provides data to the users through other networks)
as soon as they are collected.
This paradigm implies the continuous presence of a sink in the WSN, and data can hardly be pre-processed before being sent to the sink.
Moreover, these transport mechanisms create a hotspot on the sensors around the sink. Local storage stores data on a set of sensors that depends on the identity of the sensor collecting them, and it implies that requests for data must be broadcast to all the sensors, since the sink can hardly know in advance the identity of the sensors that collected the data it is interested in.
In-network storage, and in particular Data Centric Storage (DCS), stores data on a set of sensors that depends on a meta-datum describing the data.
DCS is a promising paradigm for Data Management in WSNs, since it addresses the problem of scalability (DCS employs unicast communications to manage WSNs), allows in-network data pre-processing, and can mitigate the emergence of hotspots.
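As a generic illustration of the DCS principle (not the Q-NiGHT or Z-DaSt protocols introduced below), the meta-datum describing the data can be hashed to a rendezvous location, so producers and consumers of data with the same meta-datum address the same sensors via unicast; the hashing scheme here is an assumption made only for the sketch.

```python
# Generic Data Centric Storage illustration: hash the meta-datum describing
# the data to a coordinate in the sensing area; the sensors nearest that
# coordinate store the data, and queries for the same meta-datum are routed
# to the same place. This mirrors geographic-hash-table style DCS in spirit.
import hashlib

AREA = (100.0, 100.0)   # assumed size of the sensing field, in metres

def storage_point(meta: str, area=AREA):
    digest = hashlib.sha1(meta.encode()).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32 * area[0]
    y = int.from_bytes(digest[4:8], "big") / 2**32 * area[1]
    return (x, y)

# Producer and consumer independently compute the same rendezvous point,
# so data named "temperature" is stored and looked up at the same location.
print(storage_point("temperature"))
print(storage_point("humidity"))
```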
This thesis studies the use of DCS for Data Management
in middleware for WSNs.
Since WSNs can feature different paradigms for data routing (geographical routing and more traditional tree routing), this thesis introduces two different DCS protocols for these two kinds of WSNs.
Q-NiGHT is based on geographical routing; it can manage the quantity of resources assigned to the storage of different meta-data, and it balances the data-storage load over the sensors in the WSN.
Z-DaSt is built on top of ZigBee networks and exploits standard ZigBee mechanisms to harness the power of the ZigBee routing protocol and network-formation procedures.
Dependability is another issue that was the subject of research work. Most current approaches employ replication as the means to ensure data availability.
A possible enhancement is the use of erasure coding to improve the persistence of data while saving on memory usage on the sensors.
Finally, erasure coding was also applied to gossiping algorithms in order to realize efficient data management. The technique is compared to the state of the art to identify the benefits it can provide to data collection algorithms and to data availability techniques.
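For intuition about the memory argument, the following deliberately simple single-parity example (a (k+1, k) code, far weaker than the codes actually studied) shows how erasure coding survives the loss of one fragment at a storage overhead of 1/k, instead of the 100% overhead of keeping a full replica.

```python
# Single-XOR-parity erasure coding: k data fragments plus one parity fragment
# survive the loss of any one fragment, with only 1/k extra storage.

def encode(fragments):
    parity = bytes(len(fragments[0]))
    for f in fragments:
        parity = bytes(a ^ b for a, b in zip(parity, f))
    return fragments + [parity]

def recover(stored, missing_index):
    # XOR of all surviving fragments reconstructs the missing one.
    out = bytes(len(next(f for f in stored if f is not None)))
    for i, f in enumerate(stored):
        if i != missing_index:
            out = bytes(a ^ b for a, b in zip(out, f))
    return out

data = [b"ab", b"cd", b"ef"]      # three equally sized data fragments
stored = encode(data)              # four fragments spread over four sensors
stored_lossy = list(stored)
stored_lossy[1] = None             # one sensor fails
assert recover(stored_lossy, 1) == b"cd"
```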
A shared-disk parallel cluster file system
Dissertation presented to obtain the degree of Doctor in Informatics from the Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia.
Today, clusters are the de facto cost-effective platform both for high performance
computing (HPC) and for IT environments. HPC and IT are quite different environments,
and their differences include, among others, their choices of file systems and storage: HPC favours parallel file systems geared towards maximum I/O bandwidth, which however are not fully POSIX-compliant and were devised to run on top of (fault-prone) partitioned storage; conversely, IT data centres favour both external disk arrays (to provide highly available storage) and POSIX-compliant file systems (either general-purpose or shared-disk cluster file systems, CFSs).
These specialised file systems perform very well in their target environments, provided that applications do not require certain lateral features: there is no file locking on parallel file systems, and no high-performance writes over cluster-wide shared files on CFSs. In brief, we can say
that none of the above approaches solves the problem of providing high levels of reliability and performance to both worlds.
Our pCFS proposal makes a contribution towards changing this situation: the rationale is to take advantage of the best of both – the reliability of cluster file systems and the high performance of parallel file systems. We do not claim to provide the absolute best of each, but we aim at full POSIX compliance, a rich feature set, and levels of reliability and performance good enough
for broad usage – e.g., traditional as well as HPC applications, support for clustered DBMS engines that may run over regular files, and video streaming. pCFS' main ideas include:
· Cooperative caching, a technique that has been used in file systems for distributed disks but, as far as we know, was never used either in SAN-based cluster file systems or in parallel file systems. As a result, pCFS may use all infrastructures (LAN and SAN) to move data.
· Fine-grain locking, whereby processes running on distinct nodes may define non-overlapping byte-range regions in a file (instead of locking the whole file) and access them in parallel, reading and writing over those regions at the infrastructure's full speed (provided that no major metadata changes are required); a generic sketch of this access pattern is given below.
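The sketch below illustrates that access pattern using ordinary POSIX advisory byte-range locks as a stand-in for pCFS' own fine-grain locking mechanism: each writer locks and writes only its own disjoint region of a shared file, so writers never serialise on a whole-file lock. Region size, file name, and helper names are arbitrary.

```python
# Each writer takes a byte-range lock on its own non-overlapping region of a
# shared file and writes only there (POSIX advisory locks via fcntl; this is
# generic POSIX locking, not pCFS' internal mechanism).
import fcntl, os

REGION = 1 << 20   # 1 MiB region per writer

def write_region(path, writer_id, payload: bytes):
    offset = writer_id * REGION
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Lock only [offset, offset + REGION): other writers' regions stay free.
        fcntl.lockf(fd, fcntl.LOCK_EX, REGION, offset, os.SEEK_SET)
        os.pwrite(fd, payload[:REGION], offset)
        fcntl.lockf(fd, fcntl.LOCK_UN, REGION, offset, os.SEEK_SET)
    finally:
        os.close(fd)

# Two "nodes" (here simply two calls) write disjoint regions of the same file.
write_region("/tmp/shared.dat", 0, b"node-0 data")
write_region("/tmp/shared.dat", 1, b"node-1 data")
```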
A prototype was built on top of GFS (a Red Hat shared disk CFS): GFS’ kernel code was
slightly modified, and two kernel modules and a user-level daemon were added. In the
prototype, fine-grain locking is fully implemented and a cluster-wide coherent cache is maintained through the movement of data (page fragments) over the LAN.
Our benchmarks for non-overlapping writers over a single file shared among processes
running on different nodes show that pCFS’ bandwidth is 2 times greater than NFS’ while
being comparable to that of the Parallel Virtual File System (PVFS), both requiring about 10 times more CPU. pCFS' bandwidth also surpasses GFS' (600 times for small record sizes, e.g., 4 KB, decreasing down to 2 times for large record sizes, e.g., 4 MB), at about the same CPU usage.
Lusitania, Companhia de Seguros S.A.; IBM Shared University Research (SUR) Programme.