    Network Traffic Measurements, Applications to Internet Services and Security

    The Internet has become over the years a pervasive network interconnecting billions of users and now acts as a collector for a multitude of tasks, ranging from professional activities to personal interactions. From a technical standpoint, novel architectures, e.g., cloud-based services and content delivery networks, innovative devices, e.g., smartphones and connected wearables, and security threats, e.g., DDoS attacks, pose new challenges in understanding network dynamics. In such a complex scenario, network measurements play a central role in guiding traffic management, improving network design, and evaluating application requirements. In addition, increasing importance is devoted to the quality of experience provided to final users, which requires thorough investigations of both the transport network and the design of Internet services. In this thesis, we stress the centrality of users by focusing on the traffic they exchange with the network. To do so, we design methodologies that complement passive and active measurements, as well as post-processing techniques from the machine learning and statistics domains. Traffic exchanged by Internet users can be classified into three macro-groups: (i) outbound, produced by users’ devices and pushed to the network; (ii) unsolicited, part of malicious attacks threatening users’ security; and (iii) inbound, directed to users’ devices and retrieved from remote servers. For each of the above categories, we address a specific research topic: the benchmarking of personal cloud storage services, the automatic identification of Internet threats, and the assessment of quality of experience in the Web domain, respectively. Results comprise several contributions in the scope of each research topic. In short, they shed light on (i) the interplay among design choices of cloud storage services, which severely impact the performance provided to end users; (ii) the feasibility of designing a general-purpose classifier to detect malicious attacks without chasing threat specificities; and (iii) the relevance of appropriate means to evaluate the perceived quality of Web page delivery, strengthening the need for users’ feedback for a factual assessment.
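
    To make the second result concrete, the following is a minimal, hypothetical sketch of a general-purpose traffic classifier in the spirit described above; it is not the thesis’ methodology, and the flow-level features, value ranges, and synthetic data are illustrative assumptions only.

```python
# Minimal sketch of a general-purpose traffic classifier (illustrative only;
# feature names, value ranges, and data are hypothetical, not from the thesis).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical per-flow features: packets sent, bytes sent,
# mean inter-arrival time (s), distinct destination ports contacted.
rng = np.random.default_rng(0)
benign = np.column_stack([
    rng.poisson(40, 500),            # packets
    rng.normal(30_000, 5_000, 500),  # bytes
    rng.exponential(0.2, 500),       # inter-arrival time
    rng.integers(1, 5, 500),         # destination ports
])
scan_like = np.column_stack([
    rng.poisson(200, 100),
    rng.normal(12_000, 3_000, 100),
    rng.exponential(0.01, 100),
    rng.integers(50, 500, 100),
])
X = np.vstack([benign, scan_like])
y = np.array([0] * len(benign) + [1] * len(scan_like))  # 0 = benign, 1 = unsolicited

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```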

    Doctor of Philosophy

    In the past few years, we have seen a tremendous increase in digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of these require an efficient storage solution. To improve space efficiency, compression and deduplication are widely used. Compression works by identifying repeated strings and replacing them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate blocks. While we have seen great improvements in space efficiency from these two approaches, there are still some limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range, since they search for redundant data at a fine granularity (string level). For deduplication, metadata embedded in an input file changes frequently, which introduces unnecessary unique chunks and leads to poor deduplication. In addition, cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, and a new IO scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to utilize similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse granularity (block level) and then groups similar blocks together. It can be used as a preprocessing stage for any traditional compressor. We find that metadata have a huge impact in reducing the benefit of deduplication. To isolate the impact of metadata, we propose to separate metadata from data. Three approaches are presented for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and also a new tar format that deduplicates better. We also present a case study in which we use deduplication to reduce storage consumption for storing disk images, while at the same time achieving high performance in image deployment. Finally, we apply the same principle of utilizing similarity in IO scheduling to prevent interference between random and sequential workloads, leading to efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
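
    For readers unfamiliar with deduplication, the following is a minimal sketch of the fixed-size-chunk variant summarized above; the chunk size and the in-memory chunk store are illustrative assumptions, and real systems (including the variable-size, content-defined chunking mentioned here) are considerably more sophisticated.

```python
# Minimal sketch of fixed-size-chunk deduplication (illustrative only):
# split a byte stream into chunks, index them by content hash, and store
# each unique chunk once.
import hashlib

CHUNK_SIZE = 4096  # bytes; real systems often use variable-size, content-defined chunks

def deduplicate(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Store unique chunks of `data` in `store`; return the recipe (list of hashes)."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:          # duplicate chunks are stored only once
            store[digest] = chunk
        recipe.append(digest)
    return recipe

def reconstruct(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the original byte stream from its chunk recipe."""
    return b"".join(store[digest] for digest in recipe)

store: dict[str, bytes] = {}
payload = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # repeated content deduplicates
recipe = deduplicate(payload, store)
assert reconstruct(recipe, store) == payload
print(f"logical chunks: {len(recipe)}, unique chunks stored: {len(store)}")
```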

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications that process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
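
    As a reminder of the programming model the survey builds on, here is a minimal word-count sketch of MapReduce run locally; it only illustrates the map, shuffle, and reduce phases and omits the distribution, scheduling, and fault tolerance that an actual framework provides.

```python
# Minimal sketch of the MapReduce programming model (word count), run
# locally for illustration; a real framework would distribute map tasks,
# shuffle intermediate pairs by key, and run reduce tasks on a cluster.
from collections import defaultdict
from typing import Iterable, Iterator

def map_fn(document: str) -> Iterator[tuple[str, int]]:
    # Emit one (word, 1) pair per word occurrence.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    # Sum all partial counts emitted for the same key.
    return (word, sum(counts))

def run_mapreduce(documents: list[str]) -> dict[str, int]:
    # Shuffle phase: group intermediate values by key.
    grouped: dict[str, list[int]] = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(run_mapreduce(["the quick brown fox", "the lazy dog", "the fox"]))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```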

    BIG DATA AND ANALYTICS AS A NEW FRONTIER OF ENTERPRISE DATA MANAGEMENT

    Big Data and Analytics (BDA) promises significant value generation opportunities across industries. Even though companies increase their investments, their BDA initiatives fall short of expectations and they struggle to guarantee a return on investment. In order to create business value from BDA, companies must build and extend their data-related capabilities. While BDA literature has emphasized the capabilities needed to analyze the increasing volumes of data from heterogeneous sources, enterprise data management (EDM) researchers have suggested organizational capabilities to improve data quality. However, to date, little is known about how companies actually orchestrate the allocated resources, especially regarding the quality and use of data, to create value from BDA. Considering these gaps, this thesis investigates, through five interrelated essays, how companies adapt their EDM capabilities to create additional business value from BDA. The first essay lays the foundation of the thesis by investigating how companies extend their Business Intelligence and Analytics (BI&A) capabilities to build more comprehensive enterprise analytics platforms. The second and third essays contribute to fundamental reflections on how organizations are changing and designing data governance in the context of BDA. The fourth and fifth essays look at how companies provide high-quality data to an increasing number of users with innovative EDM tools, namely machine learning (ML) and enterprise data catalogs (EDCs). The thesis outcomes show that BDA has profound implications for EDM practices. In the past, operational data processing and analytical data processing were two “worlds” that were managed separately from each other. With BDA, these “worlds” are becoming increasingly interdependent, and organizations must manage the lifecycles of data and analytics products in close coordination. Also, with BDA, data have become the long-expected, strategically relevant resource. As such, data must now be viewed as a distinct value driver, separate from IT, as it requires specific mechanisms to foster value creation from BDA. BDA thus extends data governance goals: in addition to data quality and regulatory compliance, governance should facilitate data use by broadening data availability and enabling data monetization. Accordingly, companies establish comprehensive data governance designs, including structural, procedural, and relational mechanisms, to enable a broad network of employees to work with data. Existing EDM practices therefore need to be rethought to meet the emerging BDA requirements. While ML is a promising solution for improving data quality in a scalable and adaptable way, EDCs help companies democratize data for a broader range of employees.
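
    As one illustration of how ML might support data quality at scale (not the specific approach studied in the thesis), the sketch below flags records with implausible values using an unsupervised model; the record schema, value ranges, and contamination rate are hypothetical assumptions.

```python
# Illustrative sketch (not the thesis's approach): use an unsupervised model
# to flag records whose values look anomalous, as one way ML can support
# data quality checks at scale. Column semantics are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Hypothetical "order" records: [quantity, unit_price, discount_pct]
clean = np.column_stack([
    rng.integers(1, 20, 1000),
    rng.normal(50, 10, 1000),
    rng.uniform(0, 0.2, 1000),
])
# A few records with implausible values (negative price, 95% discount, huge quantity).
suspect = np.array([[5, -49.0, 0.1], [3, 52.0, 0.95], [5000, 48.0, 0.0]])
records = np.vstack([clean, suspect])

model = IsolationForest(contamination=0.01, random_state=0).fit(records)
flags = model.predict(records)          # -1 = flagged for review, 1 = looks normal
print("records flagged for review:", np.where(flags == -1)[0])
```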

    Just-in-time Analytics Over Heterogeneous Data and Hardware

    Industry and academia are continuously becoming more data-driven and data-intensive, relying on the analysis of a wide variety of datasets to gain insights. At the same time, data variety increases continuously across multiple axes. First, data comes in multiple formats, such as the binary tabular data of a DBMS, raw textual files, and domain-specific formats. Second, different datasets follow different data models, such as the relational and the hierarchical one. Data location also varies: some datasets reside in a central "data lake", whereas others lie in remote data sources. In addition, users execute widely different analysis tasks over all these data types. Finally, the process of gathering and integrating diverse datasets introduces several inconsistencies and redundancies in the data, such as duplicate entries for the same real-world concept. In summary, heterogeneity significantly affects the way data analysis is performed. In this thesis, we aim for data virtualization: abstracting data out of its original form and manipulating it regardless of the way it is stored or structured, without a performance penalty. To achieve data virtualization, we design and implement systems that i) mask heterogeneity through the use of heterogeneity-aware, high-level building blocks and ii) offer fast responses through on-demand adaptation techniques. Regarding the high-level building blocks, we use a query language and algebra to handle multiple collection types, such as relations and hierarchies, to express transformations between these collection types, and to express complex data cleaning tasks over them. In addition, we design a location-aware compiler and optimizer that masks away the complexity of accessing multiple remote data sources. Regarding on-demand adaptation, we present a design that produces a new system per query. The design uses customization mechanisms that trigger runtime code generation to mimic the system most appropriate to answer a query fast: query operators are created based on the query workload and the underlying data models, and the data access layer is created based on the underlying data formats. In addition, we exploit emerging hardware by customizing the system implementation based on the available heterogeneous processors (CPUs and GPGPUs), pairing each workload with its ideal processor type. The end result is a just-in-time database system that is specific to the query, data, workload, and hardware instance. This thesis redesigns the data management stack to natively cater for data heterogeneity and exploit hardware heterogeneity. Instead of centralizing all relevant datasets, converting them to a single representation, and loading them into a monolithic, static, suboptimal system, our design embraces heterogeneity. Overall, our design decouples the type of analysis performed from the original data layout; users can perform their analysis across data stores, data models, and data formats, and at the same time experience the performance offered by a custom system built on demand to serve their specific use case.
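
    The idea of producing a specialized system per query can be pictured, in highly simplified form, as generating and compiling a filter function for one concrete predicate at runtime instead of interpreting a generic expression tree for every row; the sketch below is an assumption-laden toy, not the system built in the thesis.

```python
# Minimal sketch of on-demand specialization via runtime code generation
# (illustrative, not the thesis's system): compile a filter function
# specialized to one query predicate, then reuse it for every row.
OPS = {"=": "==", "<": "<", ">": ">"}

def compile_filter(column: str, op: str, literal):
    # Generate Python source for this exact predicate, then compile it once.
    src = f"def _filter(row):\n    return row[{column!r}] {OPS[op]} {literal!r}\n"
    namespace: dict = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["_filter"]

rows = [
    {"name": "a.csv", "size": 120},
    {"name": "b.json", "size": 3042},
    {"name": "c.csv", "size": 87},
]
predicate = compile_filter("size", ">", 100)            # specialized for: size > 100
print([row["name"] for row in rows if predicate(row)])  # ['a.csv', 'b.json']
```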

    Proceedings of the 12th International Conference on Digital Preservation

    The 12th International Conference on Digital Preservation (iPRES) was held on November 2-6, 2015 in Chapel Hill, North Carolina, USA. There were 327 delegates from 22 countries. The program included 12 long papers, 15 short papers, 33 posters, 3 demos, 6 workshops, 3 tutorials and 5 panels, as well as several interactive sessions and a Digital Preservation Showcase.

    Context Aware Computing for The Internet of Things: A Survey

    As we move towards the Internet of Things (IoT), the number of sensors deployed around the world is growing at a rapid pace. Market research has shown significant growth of sensor deployments over the past decade and has predicted a significant increase in the growth rate in the future. These sensors continuously generate enormous amounts of data. However, in order to add value to raw sensor data we need to understand it. Collection, modelling, reasoning, and distribution of context in relation to sensor data play a critical role in this challenge. Context-aware computing has proven to be successful in understanding sensor data. In this paper, we survey context awareness from an IoT perspective. We present the necessary background by introducing the IoT paradigm and context-aware fundamentals, and then provide an in-depth analysis of the context life cycle. We evaluate a subset of 50 projects, which represent the majority of research and commercial solutions proposed in the field of context-aware computing over the last decade (2001-2011), based on our own taxonomy. Finally, based on our evaluation, we highlight the lessons to be learnt from the past and some possible directions for future research. The survey addresses a broad range of techniques, methods, models, functionalities, systems, applications, and middleware solutions related to context awareness and the IoT. Our goal is not only to analyse, compare, and consolidate past research work but also to appreciate their findings and discuss their applicability to the IoT.
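
    The context life cycle analysed in the survey (acquisition, modelling, reasoning, and distribution) can be pictured with a minimal sketch; the sensor readings, rule, and subscriber below are hypothetical examples, not taken from the paper.

```python
# Minimal sketch of a context life cycle: acquisition, modelling,
# reasoning, and distribution. All values and rules are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Context:
    room: str
    temperature_c: float
    occupied: bool

def acquire() -> dict:
    # Acquisition: raw readings as they might arrive from sensors.
    return {"room": "lab-1", "temp_raw": "29.5", "motion_events": 3}

def model(raw: dict) -> Context:
    # Modelling: turn raw values into a typed context representation.
    return Context(room=raw["room"],
                   temperature_c=float(raw["temp_raw"]),
                   occupied=raw["motion_events"] > 0)

def reason(ctx: Context) -> list[str]:
    # Reasoning: derive higher-level context / actions with a simple rule.
    actions = []
    if ctx.occupied and ctx.temperature_c > 26.0:
        actions.append(f"turn_on_cooling:{ctx.room}")
    return actions

def distribute(actions: list[str], subscribers: list[Callable[[str], None]]) -> None:
    # Distribution: notify interested consumers of the derived context.
    for action in actions:
        for notify in subscribers:
            notify(action)

distribute(reason(model(acquire())), subscribers=[print])
```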