
    Improving capacity-performance tradeoffs in the storage tier

    Data-set sizes are growing, and new techniques are emerging to organize and analyze these data-sets. A key access pattern common to these techniques is the large sequential file access. The trend toward bigger files helps amortize the cost of data accesses from the storage layer, since many workloads are recognized to be I/O bound and the storage layer is widely recognized as the slowest layer in the system. This work focuses on the tradeoff one can make with storage capacity to improve system performance.

    Capacity can be leveraged for improved availability or improved performance. This tradeoff is key in the storage layer, as it allows for data loss prevention and bandwidth aggregation. Typically these tradeoffs do not allow much choice with regard to capacity use. This work leverages replication as the enabling mechanism to improve the capacity-performance tradeoff in the storage tier, while still providing for availability.

    This capacity-performance tradeoff can be made at both the local and the distributed file system level. I propose two techniques that allow for an improved tradeoff of capacity. The local file system can be employed on scale-out or scale-up infrastructures to improve performance. The distributed file system is targeted at distributed frameworks, such as MapReduce, to improve cluster performance. The local file system design is MorphStore, and the distributed file system is BoostDFS.

    MorphStore is a file system that significantly improves performance when accessing large files through two innovations: (a) load-adaptive I/O access scheduling to dynamically optimize throughput (aggregation), and (b) utility-driven replication to best use capacity for performance. Adaptive access scheduling can additionally be utilized to optimize the scheduling of requests (for throughput) on systems with a large number of storage devices. Replication is utilized to make high-utility files available and then to optimize the throughput of these high-utility files based on system load.

    BoostDFS is a distributed file system that allows a better capacity-performance tradeoff via inter-node file replication. BoostDFS is built on the observation that distributed file systems currently use inter-node replication for availability but provide no mechanism to further improve performance. Replication for availability yields diminishing performance returns because locality saturates. BoostDFS exploits the common case of locally scheduled tasks by improving their I/O performance. This is done via intra-node replication, leveraging MorphStore as the local file system. This technique allows capacity to be traded for availability as well as performance, with a small capacity overhead under constant availability.

    Both MorphStore and BoostDFS utilize replication, which allows for both bandwidth aggregation and availability. This work primarily focuses on the performance utility of replication, but does not sacrifice availability in the process. These techniques provide an improved capacity-performance tradeoff while allowing the desired level of availability.
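    The abstract gives no code, but the utility-driven replication idea can be sketched. Below is a minimal, hypothetical sketch, not MorphStore's actual policy: files are ranked by utility per byte and granted extra replicas until a capacity budget is exhausted. The function name, the greedy ranking, and all parameters are illustrative assumptions.

```python
# Hypothetical sketch of utility-driven replication: higher-utility files
# receive more replicas, subject to a capacity budget. Extra replicas enable
# bandwidth aggregation; the single baseline copy preserves availability.
# This greedy policy is an assumption, not MorphStore's published algorithm.

def assign_replicas(files, capacity_budget, max_replicas=4):
    """files: list of (name, size, utility) tuples; returns {name: replica_count}."""
    # Rank files by utility per byte so the budget favors hot, small files.
    ranked = sorted(files, key=lambda f: f[2] / f[1], reverse=True)
    replicas = {name: 1 for name, _, _ in files}   # one copy for availability
    remaining = capacity_budget
    for name, size, _ in ranked:
        while replicas[name] < max_replicas and remaining >= size:
            replicas[name] += 1                    # extra copy for aggregation
            remaining -= size
    return replicas

print(assign_replicas([("a.dat", 10, 90), ("b.dat", 50, 60)], capacity_budget=70))
```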

    RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)

    The original RAID proposal advocated replacing large disks with arrays of PC disks, but as the capacity of small disks increased 100-fold in the 1990s, the production of large disks was discontinued. Storage dependability is increased via replication or erasure coding. Cloud storage providers store multiple copies of data, obviating the need for further redundancy. Variations of RAID based on local recovery codes and partial MDS codes reduce recovery cost. NAND flash Solid State Disks (SSDs) have lower latency and higher bandwidth, are more reliable, consume less power, and have a lower TCO than Hard Disk Drives, making them more viable for hyperscalers.
    Comment: Submitted to ACM Computing Surveys. arXiv admin note: substantial text overlap with arXiv:2306.0876
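    As a concrete illustration of the single-parity erasure coding the tutorial covers (the RAID-5 idea), the minimal sketch below computes a parity block as the bytewise XOR of the data blocks, so any one lost block can be rebuilt from the survivors. Block contents and sizes are illustrative.

```python
# RAID-5-style single parity: parity = XOR of all data blocks, so any single
# lost block equals the XOR of the remaining blocks plus the parity block.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks on three disks
parity = xor_blocks(data)            # parity block stored on a fourth disk

# Simulate losing block 1 and rebuilding it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```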

    Monitoring and analysis system for performance troubleshooting in data centers

    It was not long ago, on Christmas Eve 2012, that a war of troubleshooting began in Amazon's data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing service (ELB for short), which was not realized at the time. The mistake first led to a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one in which EC2 customers were significantly affected. For example, Netflix, which was using hundreds of Amazon ELB services, experienced an extensive streaming service outage in which many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world's attention.
    As the Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed, and terminated automatically, on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running these anomaly detection algorithms, VScope notifies data center operators when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze these interactions to find out which components are relevant to a performance issue. VScope's capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope's ability to support fast operation and online queries against a comprehensive set of application- to system/platform-level metrics, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a 'thin layer' that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with over 400% less perturbation than brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.
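    As a hedged illustration of the kind of anomaly-detection primitive a system like VScope could run over streamed metrics (the dissertation's actual algorithms are not reproduced here), the sketch below flags any sample whose z-score against a sliding window of recent samples exceeds a threshold. The window size, threshold, and function names are all assumptions.

```python
# Sliding-window z-score anomaly detection: flag a metric sample that lies
# more than `threshold` standard deviations from the recent-window mean.
# Illustrative only; not VScope's published detection algorithm.
from collections import deque
from statistics import mean, stdev

def detect(stream, window=30, threshold=3.0):
    history = deque(maxlen=window)            # recent samples only
    for t, value in enumerate(stream):
        if len(history) >= 2:                 # stdev needs two points
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield t, value                # anomalous sample
        history.append(value)

metrics = [10, 11, 9, 10, 10, 11, 10, 95, 10, 9]   # e.g., latency samples (ms)
print(list(detect(metrics)))                        # -> [(7, 95)]
```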

    A Study of Client-based Caching for Parallel I/O

    The trend in parallel computing toward large-scale cluster computers running thousands of cooperating processes per application has led to an I/O bottleneck that has only gotten more severe as the number of processing cores per CPU has increased. Current parallel file systems are able to provide high-bandwidth file access for large contiguous file region accesses; however, applications repeatedly accessing small file regions on unaligned file region boundaries continue to experience poor I/O throughput due to the high overhead associated with accessing parallel file system data. In this dissertation we demonstrate how client-side file data caching can improve parallel file system throughput for applications performing frequent small and unaligned file I/O. We explore the impacts of cache page size and cache capacity using the popular FLASH I/O benchmark and explore a novel cache sharing approach that leverages the trend toward multi-core processors. We also explore a technique we call progressive page caching that represents cache data using dynamic data structures rather than fixed-size pages of file data. Finally, we explore a cache aggregation scheme that leverages the high-level file I/O interfaces provided by the PVFS file system to provide further performance enhancements. In summary, our results indicate that a correctly configured middleware-based file data cache can dramatically improve the performance of I/O workloads dominated by small unaligned file accesses. Further, we demonstrate that a well-designed cache can offer stable performance even when the selected cache page granularity is not well matched to the provided workload. Finally, we have shown that high-level file system interfaces can significantly accelerate application performance, and interfaces beyond those currently envisioned by the MPI-IO standard could provide further performance benefits.
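    To make the client-side caching idea concrete, here is a minimal sketch of an LRU page cache that serves small unaligned reads from fixed-size cached pages instead of issuing a backend request each time. The page size, the LRU policy, and the `fetch_page` backend hook are illustrative assumptions, not the dissertation's exact design.

```python
# Client-side page cache sketch: reads are satisfied from fixed-size pages;
# a miss fetches the page from the backend (e.g., a parallel file system
# request) and evicts the least-recently-used page when over capacity.
from collections import OrderedDict

PAGE_SIZE = 64 * 1024   # illustrative page granularity

class PageCache:
    def __init__(self, capacity_pages, fetch_page):
        self.pages = OrderedDict()       # (path, page_no) -> bytes, LRU order
        self.capacity = capacity_pages
        self.fetch_page = fetch_page     # backend read callback (assumption)

    def read(self, path, offset, length):
        out = bytearray()
        while length > 0:
            page_no, in_page = divmod(offset, PAGE_SIZE)
            key = (path, page_no)
            if key in self.pages:
                self.pages.move_to_end(key)            # cache hit
            else:
                self.pages[key] = self.fetch_page(path, page_no)
                if len(self.pages) > self.capacity:
                    self.pages.popitem(last=False)     # evict LRU page
            take = min(length, PAGE_SIZE - in_page)
            out += self.pages[key][in_page:in_page + take]
            offset, length = offset + take, length - take
        return bytes(out)

# Usage with a dummy backend that returns zero-filled pages.
cache = PageCache(capacity_pages=128, fetch_page=lambda p, n: bytes(PAGE_SIZE))
print(len(cache.read("/data/file", offset=100, length=200)))   # -> 200
```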

    Detection of outliers and outliers clustering on large datasets with distributed computing

    Master's thesis in Informatics, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2012.
    Outlier detection is a data analysis problem of great importance in diverse science fields and with many applications. Without a definitive formal definition, and holding several other designations – deviations, anomalies, exceptions, noise, atypical data – outliers are, succinctly, the samples in a dataset that, for some reason, are different from the rest of the set. It can be of interest either to remove them, as a filtering process to smooth the data, or to collect them as a new dataset holding potentially relevant additional information. Their importance can be seen from the broad range of applications, like fraud or intrusion detection, specialized pattern recognition, data filtering, scientific data mining, medical diagnosis, etc. Although an old problem, with roots in Statistics, outlier detection has become more pertinent than ever and yet more difficult to deal with. Better and more ubiquitous means of data acquisition, together with constantly increasing storage capacities, have made datasets grow considerably in recent years, along with their number and availability. Larger volumes of data become harder to explore and filter, while data treatment and analysis simultaneously emerge as ever more demanded and fundamental in today's life.
    Distributed computing is a computer science paradigm for distributing hard, complex problems across several independent machines connected in a network. A problem is broken down into simpler sub-problems that are solved simultaneously by the autonomous machines, and all the resulting sub-solutions are collected and combined into a final solution. Distributed computing provides a solution to the economic and physical limitations of hardware scaling by building up computational capacity as needed with the addition of new machines, not necessarily new or advanced models, but any commodity hardware.
    This work presents several distributed computing algorithms for outlier detection, starting from a distributed version of an existing algorithm, CURIO[9], and introducing a series of optimizations and variants that lead to a new method, Curio3XD, which resolves both of the issues typical of this problem: the constraints imposed by the size and by the dimensionality of the datasets. The final version, and its variant, is applicable to any volume of data, by scaling the hardware in the distributed computation, and to high-dimensionality datasets, by replacing the original exponential dependency on the dimension with a quadratic dependency on the local density of the data, easily tunable with an algorithm parameter, the precision. Intermediate versions are presented for the sake of clarifying the process that led to the final method, and as an alternative approach, possibly useful with very sparse datasets.
    For a distributed computing environment with full support for the distributed system and the underlying hardware infrastructure, Apache Hadoop[23] was chosen as the platform for development, implementation, and testing, due to its power and flexibility combined with relatively easy usability. It constitutes an open-source solution, well studied and documented, employed by several major companies, and with excellent applicability to both clouds and local clusters. The different algorithms and variants were developed within the MapReduce programming model and implemented in the Hadoop framework, which supports that model. MapReduce was conceived to permit the deployment of distributed computing applications in a simple, developer-oriented way, with the main focus on the programmatic solutions of the problems, leaving the underlying distributed network control and maintenance absolutely transparent. The developed implementations are included in the appendix. Tests with an adapted real-world dataset showed very good performance for the final versions of the referred algorithms, with excellent scalability in both the size and the dimensionality of the data, as predicted theoretically. Performance tests with the precision parameter and comparative tests between all the variants developed are also presented and discussed.
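    To make the cell-based counting idea behind CURIO concrete, the following single-process sketch mimics the map step (quantize each point into a grid cell at a given precision) and the reduce step (count cell populations), then reports points in sparse cells as outlier candidates. The parameter names and the density threshold are assumptions for illustration, not Curio3XD's actual API.

```python
# Grid-based outlier detection sketch in the spirit of CURIO: points falling
# in cells whose population is below a density threshold are flagged.
from collections import Counter

def cell(point, precision):
    # Map step: quantize each coordinate to a grid cell id.
    return tuple(int(x * precision) for x in point)

def outliers(points, precision=2, density_threshold=2):
    counts = Counter(cell(p, precision) for p in points)   # reduce step
    return [p for p in points if counts[cell(p, precision)] < density_threshold]

data = [(0.1, 0.1), (0.12, 0.11), (0.13, 0.09), (5.0, 5.0)]
print(outliers(data))   # -> [(5.0, 5.0)], the point in a sparse cell
```

    In the actual MapReduce deployment, the mapper would emit (cell id, 1) pairs and the reducer would sum the counts per cell; the single-process version above keeps only that logic.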

    Network flow optimization for distributed clouds

    Internet applications, which rely on large-scale networked environments such as data centers for their back-end support, are often geo-distributed and typically have stringent performance constraints. The interconnecting networks, within and across data centers, are critical in determining these applications' performance. Data centers can be viewed as composed of three layers: the physical infrastructure consisting of servers, switches, and links; the control platforms that manage the underlying resources; and the applications that run on the infrastructure. This dissertation shows that network flow optimization can improve the performance of distributed applications in the cloud by designing high-throughput schemes spanning all three layers. At the physical infrastructure layer, we devise a framework for measuring and understanding the throughput of network topologies. We develop a heuristic for estimating the worst-case performance of any topology and propose a systematic methodology for comparing the performance of networks built with different equipment. At the control layer, we put forward a source-routed data center fabric which can achieve near-optimal throughput by leveraging a large number of available paths while using limited memory in switches. At the application layer, we show that current Application Network Interfaces (ANIs), abstractions that translate an application's performance goals into actionable network objectives, fail to capture the requirements of many emerging applications. We put forward a novel ANI that can capture application intent more effectively and quantify the performance gains achievable with it. We also tackle resource optimization in the inter-data center context of cellular providers. In this emerging environment, large amounts of resources are geographically fragmented across thousands of micro data centers, each with a limited share of resources, necessitating cross-application optimization to satisfy diverse performance requirements and improve network and server utilization. Our solution, Patronus, employs hierarchical optimization to handle multiple performance requirements and temporally partitioned scheduling for scalability.
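    As a hedged illustration of cut-based throughput reasoning of the kind involved in estimating a topology's worst-case performance (the dissertation's actual heuristic is not reproduced here), the sketch below samples random bisections of the node set and divides each cut's crossing capacity by the number of all-to-all flows that must cross it, giving an upper bound on the per-flow rate any schedule can sustain. The sampling strategy and all names are assumptions.

```python
# Cut-based upper bound on uniform all-to-all throughput: no flow schedule
# can push more per flow than (cut capacity) / (flows crossing the cut),
# for any node bisection. Random sampling of bisections is an assumption.
import random

def cut_capacity(links, part):
    # links: {(u, v): capacity} on undirected node pairs
    return sum(c for (u, v), c in links.items() if (u in part) != (v in part))

def throughput_upper_bound(nodes, links, samples=1000, seed=0):
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(samples):
        part = set(rng.sample(nodes, len(nodes) // 2))
        crossing_flows = len(part) * (len(nodes) - len(part))  # all-to-all
        best = min(best, cut_capacity(links, part) / crossing_flows)
    return best   # per-flow rate no schedule can exceed

nodes = ["a", "b", "c", "d"]                                  # 4-node ring
links = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("d", "a"): 1}
print(throughput_upper_bound(nodes, links))                   # -> 0.5
```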

    Data Auditing and Security in Cloud Computing: Issues, Challenges and Future Directions

    Cloud computing is one of the significant developments that utilizes progressive computational power and upgrades data distribution and data storage facilities. With cloud information services, it is essential for information to be saved in the cloud and also distributed across numerous customers. Cloud information repositories face issues of information integrity, data security, and information access by unapproved users. Hence, an autonomous reviewing and auditing facility is necessary to guarantee that the information is effectively accommodated and used in the cloud. In this paper, a comprehensive survey of state-of-the-art techniques in data auditing and security is presented. Challenging problems in information repository auditing and security are identified. Finally, directions for future research in data auditing and security are discussed.
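    As a minimal illustration of the integrity-auditing idea the paper surveys, the sketch below shows a naive challenge-response check: the data owner keeps per-block HMAC tags and verifies a randomly challenged block returned by the cloud against its tag. Real auditing schemes (e.g., provable data possession) are far more space- and bandwidth-efficient; every name here is an assumption for illustration.

```python
# Naive challenge-response integrity audit: the owner stores one HMAC tag
# per block, challenges the cloud for a random block, and verifies it.
import hmac, hashlib, os, random

def tag_blocks(key, blocks):
    return [hmac.new(key, b"%d|" % i + blk, hashlib.sha256).digest()
            for i, blk in enumerate(blocks)]

def audit(key, tags, fetch_block, rng=random):
    i = rng.randrange(len(tags))                 # random challenge index
    blk = fetch_block(i)                         # cloud returns the block
    expected = hmac.new(key, b"%d|" % i + blk, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tags[i])

key = os.urandom(32)                             # owner's secret key
blocks = [b"block-0 data", b"block-1 data"]
tags = tag_blocks(key, blocks)                   # kept by the owner
print(audit(key, tags, lambda i: blocks[i]))     # True if storage is intact
```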