317 research outputs found

    Technical alignment

    This essay discusses the importance of infrastructure and testing in helping digital preservation services demonstrate reliability, transparency, and accountability. It encourages practitioners to build a strong culture in which transparency and collaborations between technical frameworks are valued highly. It also argues for devising and applying agreed-upon metrics that will enable the systematic analysis of preservation infrastructure. The essay begins by defining technical infrastructure and testing in the digital preservation context, provides case studies that exemplify both progress and challenges for technical alignment in both areas, and concludes with suggestions for achieving greater degrees of technical alignment going forward.

    Distributed top-k aggregation queries at large

    Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially in distributed settings, where the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments with three different real-life datasets, using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.

    Experimenting with Gnutella Communities

    Computer networks and distributed systems in general may be regarded as communities where the individual components, be they entire systems, application software, or users, interact in a shared environment. Such communities evolve dynamically, with components or nodes joining and leaving the system. Their individual activities affect the community's behaviour and vice versa. This paper discusses various experiments undertaken to investigate the behaviour of a real system, the Gnutella network, which represents such a community. Gnutella is a distributed peer-to-peer data-sharing system without any central control. It turns out that most interactions between nodes do not last long and much of their activity is devoted to finding appropriate partners in the network. Good, longer-lasting connections appear only as rare events. For example, out of 42,000 connections, only 57 hosts were found to be available on a regular basis. This means that, in contrast to the common belief that peer-to-peer networks or sub-communities of this kind are always large, they are actually quite small. However, those sub-communities exemplify very dynamic behaviour because their actual composition can change very quickly. The experimental results presented were obtained from a Java implementation of Gnutella running in the open Internet environment, and thus in unknown and quickly changing network structures heavily dependent on chance.
    Computer networks and distributed systems can be regarded as communities in which the components, whether complete systems, programs, or users, interact in a shared environment. These communities are dynamic because elements can join or leave at any time. The article presents the results of a series of experiments and measurements made on Gnutella, a large-scale peer-to-peer system that operates without any centralized control. We observed that a large share of the messages exchanged are erroneous or redundant and that interactions between nodes do not last very long. In particular, connections lasting more than one minute are rare. Nodes therefore spend most of their time replacing lost partners and, contrary to the widespread idea that peer-to-peer networks are immense, we noted that the effective communities were fairly limited. Gnutella is a highly dynamic environment with little stability. For example, of the 42,000 sites with which we established a connection, it was possible to communicate again on a regular basis with only 57. In such an environment, chance plays an important role in the observed performance, but we developed an experimental protocol that makes it possible to compare various options.
    Keywords: Gnutella, peer-to-peer networks, Internet communities, virtual communities, distributed systems, protocols, telecommunication protocols

    Top-k aggregation queries in large-scale distributed systems

    Distributed top-k query processing has recently become an essential functionality in a large number of emerging application classes such as Internet traffic monitoring and peer-to-peer Web search. This work addresses efficient algorithms for distributed top-k queries in wide-area networks where the index lists for the attribute values (or text terms) of a query are distributed across a number of data peers. More precisely, in this thesis, we make the following contributions. We present the family of KLEE algorithms, which are a fundamental building block towards efficient top-k query processing in distributed systems. We present means to model score distributions and show how these score models can be used to reason about parameter values that play an important role in the overall performance of KLEE. We present GRASS, a family of novel algorithms based on three optimization techniques that significantly increase the overall performance of KLEE and related algorithms. We present probabilistic guarantees for the result quality. Moreover, we present Minerva, a distributed search engine. Minerva offers a highly distributed (in both the data dimension and the computational dimension), scalable, and efficient solution toward the development of Internet-scale search engines.
    Top-k queries play a major role in a wide range of applications, particularly in information systems where a small, carefully selected subset of the results is to be presented to users. Examples are search engines such as Google, Yahoo, or MSN. Although research in this area has made great progress in recent years, top-k queries in distributed systems, where the data is spread across different machines, have received comparatively little attention. In this thesis we address the efficient processing of exactly these queries. The main contributions are as follows. We present KLEE, a family of novel top-k algorithms. We develop models with which data distributions can be described. These models are the basis for estimating various parameters that have a large influence on the performance of KLEE and other similar algorithms. We present GRASS, a family of algorithms based on three novel optimization techniques that improve the performance of KLEE and similar algorithms. We present probabilistic guarantees for the result quality. We present Minerva, a novel distributed peer-to-peer search engine.

    Neighborhood-based Tag Prediction

    We consider the problem of tag prediction in collaborative tagging systems where users share and annotate resources on the Web. We put forward HAMLET, a novel approach to automatically propagate tags along the edges of a graph which relates similar documents. We identify the core principles underlying tag propagation, for which we derive suitable scoring models combined in one overall ranking formula. Leveraging these scores, we present an efficient top-k tag selection algorithm that infers additional tags by carefully inspecting neighbors in the document graph. Experiments using real-world data demonstrate the viability of our approach in large-scale environments where tags are scarce.

    BIGhybrid: A Simulator for MapReduce Applications in Hybrid Distributed Infrastructures Validated with the Grid5000 Experimental Platform

    Cloud computing has increasingly been used as a platform for running large business and data processing applications. Conversely, Desktop Grids have been successfully employed in a wide range of projects, because they are able to take advantage of a large number of resources provided free of charge by volunteers. A hybrid infrastructure created from the combination of Cloud and Desktop Grid infrastructures can provide a low-cost and scalable solution for Big Data analysis. Although frameworks like MapReduce have been designed to exploit commodity hardware, taking advantage of a hybrid infrastructure poses significant challenges because of its large resource heterogeneity and high churn rate. This paper proposes BIGhybrid, a simulator for two existing classes of MapReduce runtime environments: BitDew-MapReduce, designed for Desktop Grids, and BlobSeer-Hadoop, designed for Cloud computing. The goal is to carry out accurate simulations of MapReduce executions in a hybrid infrastructure composed of Cloud computing and Desktop Grid resources. This work describes the principles of the simulator and its validation with the Grid5000 experimental platform. With BIGhybrid, developers can investigate and evaluate new algorithms to enable MapReduce to be executed in hybrid infrastructures. This includes topics such as resource allocation and data splitting. Concurrency and Computation: Practice and Experience

    Towards Measuring and Understanding Performance in Infrastructure- and Function-as-a-Service Clouds

    Context. Cloud computing has become the de facto standard for deploying modern software systems, which makes its performance crucial to the efficient functioning of many applications. However, the unabated growth of established cloud services, such as Infrastructure-as-a-Service (IaaS), and the emergence of new services, such as Function-as-a-Service (FaaS), has led to an unprecedented diversity of cloud services with different performance characteristics. Objective. The goal of this licentiate thesis is to measure and understand performance in IaaS and FaaS clouds. My PhD thesis will extend and leverage this understanding to propose solutions for building performance-optimized FaaS cloud applications. Method. To achieve this goal, quantitative and qualitative research methods are used, including experimental research, artifact analysis, and literature review. Findings. The thesis proposes a cloud benchmarking methodology to estimate application performance in IaaS clouds, characterizes typical FaaS applications, identifies gaps in literature on FaaS performance evaluations, and examines the reproducibility of reported FaaS performance experiments. The evaluation of the benchmarking methodology yielded promising results for benchmark-based application performance estimation under selected conditions. Characterizing 89 FaaS applications revealed that they are most commonly used for short-running tasks with low data volume and bursty workloads. The review of 112 FaaS performance studies from academic and industrial sources found a strong focus on a single cloud platform using artificial micro-benchmarks and discovered that the majority of studies do not follow reproducibility principles on cloud experimentation. Future Work. Future work will propose a suite of application performance benchmarks for FaaS, which is instrumental for evaluating candidate solutions towards building performance-optimized FaaS applications.