Efficient High Performance Protocols For Long Distance Big Data File Transfer
Data sets are collected daily in large amounts (Big Data), and they are growing rapidly due to the variety of use cases and the increasing number of devices in use. Researchers require easy access to Big Data in order to analyze and process it. At some point this data may need to be transferred over the network to distant locations for further processing and analysis by researchers around the globe. Such transfers require data transfer protocols that ensure efficient and fast delivery on high speed networks.
Several new data transfer protocols, either TCP-based or UDP-based, have been introduced, and the literature contains comparative analyses of such protocols, but no side-by-side comparison of the protocols used in this work. I considered several data transfer protocols and congestion control mechanisms (GridFTP, FASP, QUIC, BBR, and LEDBAT) as candidates for comparison in various scenarios. These protocols aim to share the available bandwidth fairly among competing flows and to provide reduced packet loss, reduced latency, and fast delivery of data.
In this thesis, I have investigated the behaviour and performance of these data transfer protocols in various scenarios, including transfers with various file sizes, multiple flows, and background and competing traffic. The results show that FASP and GridFTP had the best performance among all the protocols in most of the scenarios, especially for long distance transfers with a large bandwidth delay product (BDP). The performance of QUIC was the lowest due to the nature of its current implementation, which limits the size of the transferred data and the bandwidth used. TCP BBR performed well in short distance scenarios, but its performance degraded as the distance increased. The performance of LEDBAT was unpredictable, so a complete evaluation was not possible. Comparing the performance of the protocols with background and competing traffic showed that most of the protocols were fair, except for FASP, which was aggressive. Resource utilization on the sender and receiver side was also measured for each protocol, with QUIC and FASP having the highest CPU utilization.
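The bandwidth delay product that drives the long distance results above is simple to compute; as a minimal illustration (the link parameters below are made up for the example, not taken from the thesis):

```python
def bandwidth_delay_product(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be kept in flight to saturate the path."""
    return bandwidth_bps * rtt_s / 8  # convert bits to bytes

# Illustrative long-distance path: a 10 Gbit/s link with a 150 ms RTT.
bdp = bandwidth_delay_product(10e9, 0.150)
print(f"BDP: {bdp / 1e6:.1f} MB")  # windows/buffers below this cap throughput
```

A protocol whose congestion window or application buffer stays below this value cannot fill the pipe, which is why large-BDP paths separate the protocols so sharply.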
Design and Implementation of Network Transfer Protocol for Big Genomic Data
Genomic data is growing exponentially due to next generation sequencing (NGS) technologies and their ability to produce massive amounts of data in a short time. NGS technologies generate big genomic data that needs to be exchanged between different locations efficiently and reliably. Current network transfer protocols rely on the Transmission Control Protocol (TCP) or the User Datagram Protocol (UDP), ignoring data size and type. Universal application layer protocols such as HTTP are designed for a wide variety of data types and are not particularly efficient for genomic data. Therefore, we present a new data-aware transfer protocol for genomic data, called the Genomic Text Transfer Protocol (GTTP), that increases network throughput and reduces latency. In this paper, we design and implement this new network transfer protocol for big genomic DNA datasets on top of the Hypertext Transfer Protocol (HTTP), modifying the content-encoding of HTTP to transfer big genomic DNA datasets over machine-to-machine (M2M) and client(s)-server topologies. Our results show that our modification to HTTP reduces the transmitted data by 75% of the original size, while the client remains able to regenerate the data for bioinformatics analysis. Consequently, transfer using GTTP is shown to be much faster (about 8 times faster) than regular HTTP.
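The reported 75% reduction is consistent with packing each DNA base (A, C, G, T) into 2 bits instead of an 8-bit character; the sketch below illustrates that idea only and is not the actual GTTP content-encoding:

```python
# Hypothetical illustration: 2 bits per base instead of 1 byte per character
# yields exactly the 75% reduction GTTP reports for DNA text.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {v: k for k, v in ENCODE.items()}

def pack(seq: str) -> bytes:
    """Pack a DNA string, four bases per output byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | ENCODE[base]
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    """Recover the original string; length tells how many bases the last byte holds."""
    bases, remaining = [], length
    for byte in data:
        n = min(4, remaining)
        chunk = []
        for _ in range(n):
            chunk.append(DECODE[byte & 0b11])
            byte >>= 2
        bases.extend(reversed(chunk))
        remaining -= n
    return "".join(bases)

packed = pack("ACGTACGT")
print(len(packed), unpack(packed, 8))  # 2 bytes instead of 8: a 75% reduction
```

The real protocol layers this kind of encoding into HTTP's content-encoding negotiation, so existing HTTP infrastructure carries the smaller payload transparently.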
PIRE ExoGENI – ENVRI: Preparation for Big Data Science
Big Data is a new field in both scientific research and the IT industry, focusing on collections of data sets that are so huge and complex that they create numerous difficulties not only in processing but also in transferring and storing them. Big Data science tries to overcome problems and optimize performance based on the "5V" concept: Volume, Variety, Velocity, Variability, and Value. A Big Data infrastructure integrates advanced IT technologies such as Cloud computing, databases, networking, and HPC, providing scientists with all the functionality required for high level research activities. The EU project ENVRI is an example of a Big Data infrastructure developed for environmental scientists, with a special focus on issues such as architecture, metadata frameworks, and data discovery.
In Big Data infrastructures like ENVRI, aggregating huge amounts of data from different sources and transferring them between distributed locations are important processes in many experiments [5]. Efficient data transfer is thus a key service required in a big data infrastructure.
At the same time, Software Defined Networking (SDN) is a promising new approach to networking. SDN decouples the control interface from network devices and allows high level applications to manipulate network behavior. However, most existing high level data transfer protocols treat the network as a black box and do not include control of network-level functionality.
There is a scientific gap between Big Data science and Software Defined Networking: until now, no work has combined these two technologies. This gap motivates our research in this project.
Benchmarking Apache Arrow Flight -- A wire-speed protocol for data transfer, querying and microservices
Moving structured data between different big data frameworks and/or data warehouses/storage systems often causes significant overhead: more than 80% of the total time spent accessing data frequently goes to the serialization/de-serialization step. Columnar data formats are gaining popularity in both analytics and transactional databases. Apache Arrow, a unified columnar in-memory data format, promises to provide efficient data storage, access, manipulation, and transport. In addition, with the introduction of the Arrow Flight communication capabilities, which are built on top of gRPC, Arrow enables high-performance data transfer over TCP networks. Arrow Flight allows parallel Arrow RecordBatch transfer over networks in a platform- and language-independent way, and offers high performance, parallelism, and security based on open-source standards.
In this paper, we bring together some recently implemented use cases of Arrow Flight with their benchmarking results. These use cases include bulk Arrow data transfer, querying subsystems, and Flight as a microservice integration into different frameworks, showing the throughput and scalability of this protocol. We show that Flight is able to achieve up to 6000 MB/s and 4800 MB/s throughput for DoGet() and DoPut() operations, respectively. On Mellanox ConnectX-3 or Connect-IB interconnect nodes, Flight can utilize up to 95% of the total available bandwidth. Flight is scalable and can efficiently use up to half of the available system cores for bidirectional communication. For query systems like Dremio, Flight is an order of magnitude faster than the ODBC and turbodbc protocols: an Arrow Flight based implementation on Dremio performs 20x and 30x better than turbodbc and ODBC connections, respectively.
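The serialization cost that Arrow sidesteps can be seen even with standard-library tools; the toy comparison below (standard-library Python, not Arrow itself) contrasts row-wise pickling, which touches every value, with a contiguous columnar buffer that could be shipped as-is:

```python
import pickle
from array import array

N = 100_000

# Row-wise: every value is boxed and serialized individually.
rows = [(float(i), float(i * 2)) for i in range(N)]
row_bytes = pickle.dumps(rows)

# Columnar: contiguous numeric buffers need no per-value work before sending,
# which is the property Arrow Flight exploits to approach wire speed.
col_a = array("d", (float(i) for i in range(N)))
col_b = array("d", (float(i * 2) for i in range(N)))
col_bytes = col_a.tobytes() + col_b.tobytes()

print(len(row_bytes), len(col_bytes))  # columnar payload is smaller, zero-copy
```

The gap widens further in practice because the columnar buffers can also be received straight into analysis-ready memory, skipping deserialization on the other side as well.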
Efficient HTTP based I/O on very large datasets for high performance computing with the libdavix library
Remote data access for data analysis in high performance computing is commonly done with specialized data access protocols and storage systems. These protocols are highly optimized for high throughput on very large datasets, multi-stream transfers, high availability, low latency, and efficient parallel I/O. The purpose of this paper is to describe how we have adapted a generic protocol, the Hypertext Transfer Protocol (HTTP), to make it a competitive alternative for high performance I/O and data analysis applications in a global computing grid: the Worldwide LHC Computing Grid. In this work, we first analyze the design differences between the HTTP protocol and the most common high performance I/O protocols, pointing out the main performance weaknesses of HTTP. Then, we describe in detail how we solved these issues. Our solutions have been implemented in a toolkit called davix, available through several recent Linux distributions. Finally, we describe the results of our benchmarks, where we compare the performance of davix against an HPC-specific protocol for a data analysis use case. (Presented at Very Large Data Bases (VLDB) 2014, Hangzhou.)
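One standard HTTP feature that clients like davix can lean on for parallel I/O is the Range request header; a small sketch of the chunking logic only (no network calls, and not davix's actual code):

```python
# Split a large remote file into Range header values, one per parallel
# stream; each value would be sent as a 'Range:' header in its own GET.
def byte_ranges(total_size: int, n_streams: int) -> list:
    chunk = -(-total_size // n_streams)  # ceiling division
    ranges = []
    for start in range(0, total_size, chunk):
        end = min(start + chunk, total_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges

print(byte_ranges(1000, 4))  # ['bytes=0-249', 'bytes=250-499', ...]
```

Because Range requests are part of plain HTTP, this kind of parallel partial read works against ordinary web servers, one of the reasons a generic protocol can compete with HPC-specific ones.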
When private set intersection meets big data : an efficient and scalable protocol
Large scale data processing brings new challenges to the design of privacy-preserving protocols: how to meet the increasing speed and throughput requirements of modern applications, and how to scale up smoothly when the data being protected is big. Efficiency and scalability have become critical criteria for privacy-preserving protocols in the age of Big Data. In this paper, we present a new Private Set Intersection (PSI) protocol that is extremely efficient and highly scalable compared with existing protocols. The protocol is based on a novel approach that we call oblivious Bloom intersection. It has linear complexity and relies mostly on efficient symmetric key operations, and it scales well because most operations can be parallelized easily. The protocol has two versions, a basic protocol and an enhanced protocol; the security of the two variants is analyzed and proved in the semi-honest model and the malicious model, respectively. A prototype of the basic protocol has been built. We report the results of a performance evaluation and compare it against the two previously fastest PSI protocols. Our protocol is orders of magnitude faster than these two protocols: to compute the intersection of two million-element sets, it needs only 41 seconds (80-bit security) and 339 seconds (256-bit security) on moderate hardware in parallel mode.
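The plain, non-private Bloom filter intersection underlying the paper's oblivious variant can be sketched in a few lines; everything cryptographic that makes the real protocol privacy-preserving is deliberately omitted here:

```python
import hashlib

# Plain (non-private) Bloom filter intersection: the building block behind
# the paper's *oblivious* Bloom intersection. Parameters are illustrative.
M, K = 1024, 3  # filter size in bits, number of hash functions

def positions(item: str) -> list:
    """K pseudo-random bit positions for an item, derived from SHA-256."""
    return [int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:4],
                           "big") % M
            for i in range(K)]

def build_filter(items: list) -> set:
    bits = set()
    for item in items:
        bits.update(positions(item))
    return bits

def query(bits: set, item: str) -> bool:
    return all(p in bits for p in positions(item))  # false positives possible

server_filter = build_filter(["alice", "bob", "carol"])
client_set = ["bob", "dave"]
intersection = [x for x in client_set if query(server_filter, x)]
print(intersection)  # expected ['bob'] (up to Bloom false positives)
```

In the actual protocol, the server's filter is never revealed in the clear; the client evaluates it obliviously via symmetric-key machinery, which is what gives the linear, parallelizable cost structure described above.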
The Crypto-democracy and the Trustworthy
In the current architecture of the Internet, there is a strong asymmetry in terms of power between the entities that gather and process personal data (e.g., major Internet companies, telecom operators, cloud providers, ...) and the individuals from whom this personal data originates. In particular, individuals have no choice but to blindly trust that these entities will respect their privacy and protect their personal data. In this position paper, we address this issue by proposing a utopian crypto-democracy model based on existing scientific achievements from the field of cryptography. More precisely, our main objective is to show that cryptographic primitives, including in particular secure multiparty computation, offer a practical solution to protect privacy while minimizing trust assumptions. In the crypto-democracy envisioned, individuals do not have to trust a single physical entity with their personal data; rather, their data is distributed among several institutions. Together these institutions form a virtual entity called the Trustworthy, which is responsible for the storage of this data but can also compute on it (provided that all the institutions agree to do so). Finally, we also propose a realistic proof-of-concept of the Trustworthy, in which the roles of the institutions are played by universities. This proof-of-concept would have an important impact in demonstrating the possibilities offered by the crypto-democracy paradigm. (Comment: DPM 201)
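The secure multiparty computation the Trustworthy would rely on can be illustrated with additive secret sharing, one of its simplest primitives; the modulus and values below are illustrative, not from the paper:

```python
import secrets

# Additive secret sharing over Z_p: each institution holds one share, and no
# single share reveals anything about the secret, yet the institutions can
# jointly compute sums on shared data.
P = 2**61 - 1  # a public prime modulus

def share(secret: int, n_institutions: int) -> list:
    shares = [secrets.randbelow(P) for _ in range(n_institutions - 1)]
    shares.append((secret - sum(shares)) % P)  # shares sum to the secret mod P
    return shares

def reconstruct(shares: list) -> int:
    return sum(shares) % P

salary = 52_000
shares = share(salary, 3)           # distributed among three institutions
assert reconstruct(shares) == salary

# Addition is share-wise: each institution adds its shares of two secrets
# locally, and the local sums reconstruct to the sum of the secrets.
other = share(48_000, 3)
combined = [(a + b) % P for a, b in zip(shares, other)]
print(reconstruct(combined))  # 100000, computed without revealing either input
```

Only when all institutions pool their shares does the value emerge, which matches the paper's requirement that computation on stored data proceeds only if all institutions agree.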