
    Efficient High Performance Protocols For Long Distance Big Data File Transfer

    Data sets are collected daily in large amounts (Big Data) and they are increasing rapidly due to various use cases and the number of devices used. Researchers require easy access to Big Data in order to analyze and process it. At some point this data may need to be transferred over the network to distant locations for further processing and analysis by researchers around the globe. Such data transfers require data transfer protocols that ensure efficient and fast delivery over high speed networks. Several new data transfer protocols have been introduced, either TCP-based or UDP-based, and the literature contains comparative analysis studies of such protocols, but not a side-by-side comparison of the protocols used in this work. I considered several data transfer protocols and congestion control mechanisms, namely GridFTP, FASP, QUIC, BBR, and LEDBAT, as candidates for comparison in various scenarios. These protocols aim to utilize the available bandwidth fairly among competing flows and to provide reduced packet loss, reduced latency, and fast delivery of data. In this thesis, I have investigated the behaviour and performance of these data transfer protocols in various scenarios, including transfers with various file sizes, multiple flows, and background and competing traffic. The results show that FASP and GridFTP had the best performance among all the protocols in most of the scenarios, especially for long distance transfers with a large bandwidth delay product (BDP). The performance of QUIC was the lowest due to the nature of its current implementation, which limits the size of the transferred data and the bandwidth used. TCP BBR performed well in short distance scenarios, but its performance degraded as the distance increased. The performance of LEDBAT was unpredictable, so a complete evaluation was not possible. Comparing the performance of the protocols with background traffic and competing traffic showed that most of the protocols were fair, except for FASP, which was aggressive. The resource utilization of each protocol on the sender and receiver side was also measured, with QUIC and FASP showing the highest CPU utilization.
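
    To put the bandwidth delay product (BDP) into concrete terms, the short Python sketch below computes the BDP for an illustrative long distance path; the link speed and round-trip time are assumed example values rather than figures from the thesis.

    # Illustrative bandwidth-delay product (BDP) calculation.
    # The link speed and RTT below are assumed example values,
    # not measurements taken from the thesis.

    def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
        """BDP = bandwidth * round-trip time: the amount of data that
        must be kept in flight to fill the pipe."""
        return bandwidth_bps * rtt_seconds / 8  # bits -> bytes

    if __name__ == "__main__":
        bandwidth = 10e9   # assumed 10 Gbit/s link
        rtt = 0.150        # assumed 150 ms intercontinental round-trip time
        bdp = bdp_bytes(bandwidth, rtt)
        print(f"BDP is roughly {bdp / 1e6:.0f} MB of in-flight data")

    A sender whose window of unacknowledged data stays well below this figure cannot saturate the link, which is one reason long distance, large-BDP paths separate the protocols so clearly.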

    Design and Implementation of Network Transfer Protocol for Big Genomic Data

    Genomic data is growing exponentially due to next generation sequencing (NGS) technologies and their ability to produce massive amounts of data in a short time. NGS technologies generate big genomic data that needs to be exchanged between different locations efficiently and reliably. Current network transfer protocols rely on the Transmission Control Protocol (TCP) or the User Datagram Protocol (UDP), ignoring data size and type. Universal application layer protocols such as HTTP are designed for a wide variety of data types and are not particularly efficient for genomic data. Therefore, we present a new data-aware transfer protocol for genomic data, called the Genomic Text Transfer Protocol (GTTP), that increases network throughput and reduces latency. In this paper, we design and implement a new network transfer protocol for big genomic DNA datasets that relies on the Hypertext Transfer Protocol (HTTP). We modify the content-encoding of HTTP so that big genomic DNA datasets can be transferred using machine-to-machine (M2M) and client(s)-server topologies. Our results show that our modification to HTTP reduces the transmitted data by 75% of the original size while the client is still able to regenerate the data for bioinformatics analysis. Consequently, data transfer using GTTP is shown to be much faster (about 8 times faster) than regular HTTP.
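
    The reported 75% reduction is consistent with packing the four-letter DNA alphabet into two bits per base instead of one byte per character. The sketch below is a generic two-bit packing illustration in Python, assuming a plain A/C/G/T alphabet; it is not the actual GTTP content-encoding, whose details are not reproduced here.

    # Illustrative 2-bit packing of a DNA sequence (A, C, G, T).
    # Generic sketch of why a 4x reduction over 1-byte-per-base text is
    # possible; not the actual GTTP content-encoding.

    CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    BASE = {v: k for k, v in CODE.items()}

    def pack(seq: str) -> bytes:
        out = bytearray()
        for i in range(0, len(seq), 4):
            group = seq[i:i + 4]
            byte = 0
            for base in group:
                byte = (byte << 2) | CODE[base]
            byte <<= 2 * (4 - len(group))  # left-align a partial last group
            out.append(byte)
        return bytes(out)

    def unpack(data: bytes, length: int) -> str:
        bases = []
        for byte in data:
            for shift in (6, 4, 2, 0):
                bases.append(BASE[(byte >> shift) & 0b11])
        return "".join(bases[:length])

    seq = "ACGTACGTAC"
    packed = pack(seq)
    assert unpack(packed, len(seq)) == seq
    print(len(seq), "text bytes ->", len(packed), "packed bytes")  # 10 -> 3

    Real sequence files also carry headers and ambiguity codes, which any such encoding would need to handle separately.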

    PIRE ExoGENI – ENVRI: Preparation for Big Data Science

    Big Data is a new field in both scientific research and the IT industry, focusing on collections of data sets that are so huge and complex that they create numerous difficulties not only in processing them but also in transferring and storing them. Big Data science tries to overcome these problems or optimize performance based on the “5V” concept: Volume, Variety, Velocity, Variability and Value. A Big Data infrastructure integrates advanced IT technologies such as Cloud computing, databases, networking and HPC, providing scientists with all the functionality required for performing high level research activities. The EU project ENVRI is an example of developing a Big Data infrastructure for environmental scientists, with a special focus on issues like architecture, metadata frameworks, and data discovery. In Big Data infrastructures like ENVRI, aggregating huge amounts of data from different sources and transferring them between distributed locations are important processes in many experiments [5]. Efficient data transfer is thus a key service required in a big data infrastructure. At the same time, Software Defined Networking (SDN) is a promising new approach to networking. SDN decouples the control interface from network devices and allows high level applications to manipulate network behavior. However, most existing high level data transfer protocols treat the network as a black box and do not include control of network level functionality. There is a scientific gap between Big Data science and Software Defined Networking and, until now, no work has been done combining these two technologies. This gap motivates our research in this project.
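
    As a concrete example of how a high level application can manipulate network behavior through an SDN controller, the hedged sketch below uses the Ryu OpenFlow controller framework to install a table-miss flow rule on any switch that connects; it is a generic OpenFlow 1.3 example, not code from the PIRE ExoGENI / ENVRI project.

    # Minimal Ryu (OpenFlow 1.3) controller app that installs a table-miss
    # rule sending unmatched packets to the controller.
    # Generic SDN illustration; run with: ryu-manager this_file.py
    from ryu.base import app_manager
    from ryu.controller import ofp_event
    from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
    from ryu.ofproto import ofproto_v1_3

    class TableMissApp(app_manager.RyuApp):
        OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

        @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
        def on_switch_features(self, ev):
            dp = ev.msg.datapath
            ofp, parser = dp.ofproto, dp.ofproto_parser
            # Match everything; send unmatched packets to the controller.
            match = parser.OFPMatch()
            actions = [parser.OFPActionOutput(ofp.OFPP_CONTROLLER,
                                              ofp.OFPCML_NO_BUFFER)]
            inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS,
                                                 actions)]
            dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=0,
                                          match=match, instructions=inst))

    A data transfer service could use the same mechanism to, for example, steer its bulk flows onto a dedicated path before a transfer starts, which is the kind of coupling between transfer protocols and the network that this project targets.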

    Benchmarking Apache Arrow Flight -- A wire-speed protocol for data transfer, querying and microservices

    Moving structured data between different big data frameworks and/or data warehouses/storage systems often causes significant overhead. In most cases, more than 80% of the total time spent accessing data is consumed by the serialization/deserialization step. Columnar data formats are gaining popularity in both analytics and transactional databases. Apache Arrow, a unified columnar in-memory data format, promises to provide efficient data storage, access, manipulation and transport. In addition, with the introduction of Arrow Flight, a communication framework built on top of gRPC, Arrow enables high performance data transfer over TCP networks. Arrow Flight allows parallel Arrow RecordBatch transfer over networks in a platform- and language-independent way, and offers high performance, parallelism and security based on open-source standards. In this paper, we bring together some recently implemented use cases of Arrow Flight with their benchmarking results. These use cases include bulk Arrow data transfer, querying subsystems, and Flight as a microservice integrated into different frameworks, to show the throughput and scalability of this protocol. We show that Flight is able to achieve up to 6000 MB/s and 4800 MB/s throughput for DoGet() and DoPut() operations, respectively. On nodes with Mellanox ConnectX-3 or Connect-IB interconnects, Flight can utilize up to 95% of the total available bandwidth. Flight is scalable and can use up to half of the available system cores efficiently for bidirectional communication. For query systems like Dremio, Flight is an order of magnitude faster than the ODBC and turbodbc protocols. An Arrow Flight based implementation on Dremio performs 20x and 30x better compared to turbodbc and ODBC connections, respectively.
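
    For readers unfamiliar with the DoGet() and DoPut() calls being benchmarked, the sketch below shows what a minimal Flight client looks like in Python using pyarrow.flight; the endpoint and dataset path are assumed placeholders, not the benchmark configuration used in the paper.

    # Minimal Arrow Flight client sketch using pyarrow.flight.
    # The endpoint "grpc://localhost:8815" and path "example.parquet" are
    # assumed placeholders, not the paper's benchmark setup.
    import pyarrow as pa
    import pyarrow.flight as flight

    client = flight.FlightClient("grpc://localhost:8815")

    # DoPut(): stream a RecordBatch-based table to the server.
    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    descriptor = flight.FlightDescriptor.for_path("example.parquet")
    writer, _ = client.do_put(descriptor, table.schema)
    writer.write_table(table)
    writer.close()

    # DoGet(): fetch the data back as a stream of RecordBatches.
    info = client.get_flight_info(descriptor)
    reader = client.do_get(info.endpoints[0].ticket)
    print(reader.read_all())

    A Flight server implements the matching handlers for these calls; the throughput figures above measure exactly these two streaming paths.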

    Efficient HTTP based I/O on very large datasets for high performance computing with the libdavix library

    Remote data access for data analysis in high performance computing is commonly done with specialized data access protocols and storage systems. These protocols are highly optimized for high throughput on very large datasets, multi-stream transfers, high availability, low latency and efficient parallel I/O. The purpose of this paper is to describe how we have adapted a generic protocol, the Hypertext Transfer Protocol (HTTP), to make it a competitive alternative for high performance I/O and data analysis applications in a global computing grid: the Worldwide LHC Computing Grid. In this work, we first analyze the design differences between the HTTP protocol and the most common high performance I/O protocols, pointing out the main performance weaknesses of HTTP. Then, we describe in detail how we solved these issues. Our solutions have been implemented in a toolkit called davix, available through several recent Linux distributions. Finally, we describe the results of our benchmarks, where we compare the performance of davix against an HPC-specific protocol for a data analysis use case. (Presented at Very Large Data Bases (VLDB) 2014, Hangzhou.)
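
    One HTTP feature that matters for this kind of partial, parallel I/O is the byte-range request. The sketch below shows a plain HTTP Range request in Python; it is a generic illustration of the mechanism, not the davix implementation or its API, and the URL is an assumed placeholder.

    # Illustrative HTTP byte-range request: the building block for partial
    # and parallel reads of a large remote file over plain HTTP.
    # Generic sketch, not the davix toolkit's API; the URL is a placeholder.
    import urllib.request

    url = "https://example.org/dataset.root"  # assumed placeholder URL
    req = urllib.request.Request(url, headers={"Range": "bytes=0-1048575"})

    with urllib.request.urlopen(req) as resp:
        # Status 206 (Partial Content) means the server honoured the range.
        print(resp.status, resp.headers.get("Content-Range"))
        chunk = resp.read()  # first 1 MiB of the remote file
        print(len(chunk), "bytes read")

    Issuing several such range requests concurrently is the basic recipe for parallel reads over plain HTTP.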

    When private set intersection meets big data : an efficient and scalable protocol

    Large scale data processing brings new challenges to the design of privacy-preserving protocols: how to meet the increasing speed and throughput requirements of modern applications, and how to scale up smoothly when the data being protected is big. Efficiency and scalability become critical criteria for privacy-preserving protocols in the age of Big Data. In this paper, we present a new Private Set Intersection (PSI) protocol that is extremely efficient and highly scalable compared with existing protocols. The protocol is based on a novel approach that we call oblivious Bloom intersection. It has linear complexity and relies mostly on efficient symmetric key operations. It is highly scalable because most of its operations can be parallelized easily. The protocol has two versions: a basic protocol and an enhanced protocol; the security of the two variants is analyzed and proven in the semi-honest model and the malicious model, respectively. A prototype of the basic protocol has been built. We report the results of a performance evaluation and compare the protocol against the two previously fastest PSI protocols. Our protocol is orders of magnitude faster than these two protocols. To compute the intersection of two million-element sets, our protocol needs only 41 seconds (80-bit security) or 339 seconds (256-bit security) on moderate hardware in parallel mode.
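
    To make the data structure behind "oblivious Bloom intersection" concrete, the sketch below shows a plain (non-oblivious) Bloom filter and how candidate intersection elements can be found with it; it only illustrates the underlying structure, not the privacy-preserving protocol itself.

    # Plain (non-private) Bloom filter sketch: the structure underlying the
    # oblivious Bloom intersection approach. This shows only the filter and
    # candidate intersection, not the PSI protocol's security machinery.
    import hashlib

    class BloomFilter:
        def __init__(self, m: int = 1 << 16, k: int = 3):
            self.m, self.k = m, k
            self.bits = bytearray(m)  # m single-bit slots

        def _positions(self, item: str):
            # Derive k positions from salted SHA-256 digests of the item.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, item: str):
            for pos in self._positions(item):
                self.bits[pos] = 1

        def contains(self, item: str) -> bool:
            return all(self.bits[pos] for pos in self._positions(item))

    # Party A encodes its set; party B tests its own elements against it.
    # (In the real protocol this test is performed obliviously, so neither
    # party learns anything beyond the intersection.)
    bf = BloomFilter()
    for x in ["alice@example.com", "bob@example.com"]:
        bf.add(x)
    candidates = ["bob@example.com", "carol@example.com"]
    print([c for c in candidates if bf.contains(c)])  # ['bob@example.com']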

    The Crypto-democracy and the Trustworthy

    In the current architecture of the Internet, there is a strong asymmetry in terms of power between the entities that gather and process personal data (e.g., major Internet companies, telecom operators, cloud providers, ...) and the individuals from whom this personal data originates. In particular, individuals have no choice but to blindly trust that these entities will respect their privacy and protect their personal data. In this position paper, we address this issue by proposing a utopian crypto-democracy model based on existing scientific achievements from the field of cryptography. More precisely, our main objective is to show that cryptographic primitives, including in particular secure multiparty computation, offer a practical solution to protect privacy while minimizing trust assumptions. In the crypto-democracy envisioned, individuals do not have to trust a single physical entity with their personal data; rather, their data is distributed among several institutions. Together these institutions form a virtual entity called the Trustworthy, which is responsible for the storage of this data but can also compute on it (provided that all the institutions agree). Finally, we also propose a realistic proof-of-concept of the Trustworthy, in which the roles of the institutions are played by universities. This proof-of-concept would have an important impact in demonstrating the possibilities offered by the crypto-democracy paradigm.
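
    As a small illustration of the kind of cryptographic building block the Trustworthy would rely on, the sketch below shows additive secret sharing of a value among several institutions, where the secret can only be recovered when all shares are combined; it is a toy example of one MPC primitive, not the paper's construction.

    # Toy additive secret sharing over a prime field: a value is split among
    # n institutions and can only be reconstructed from all n shares.
    # Illustrates one MPC building block, not the paper's construction.
    import secrets

    P = 2**127 - 1  # large prime modulus (illustrative choice)

    def share(secret: int, n: int) -> list[int]:
        shares = [secrets.randbelow(P) for _ in range(n - 1)]
        shares.append((secret - sum(shares)) % P)  # last share fixes the sum
        return shares

    def reconstruct(shares: list[int]) -> int:
        return sum(shares) % P

    salary = 52_000
    shares = share(salary, n=3)  # one share per institution
    assert reconstruct(shares) == salary
    # Adding two sharings share-wise yields a sharing of the sum, so the
    # institutions can compute on the data without ever seeing the inputs.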