1,041 research outputs found

    Bulk Scheduling with the DIANA Scheduler

    Full text link
    Results from the research and development of a Data Intensive and Network Aware (DIANA) scheduling engine, to be used primarily for data intensive sciences such as physics analysis, are described. In Grid analyses, tasks can involve thousands of computing, data handling, and network resources. The central problem in the scheduling of these resources is the coordinated management of computation and data at multiple locations and not just data replication or movement. However, this can prove to be a rather costly operation and efficient sing can be a challenge if compute and data resources are mapped without considering network costs. We have implemented an adaptive algorithm within the so-called DIANA Scheduler which takes into account data location and size, network performance and computation capability in order to enable efficient global scheduling. DIANA is a performance-aware and economy-guided Meta Scheduler. It iteratively allocates each job to the site that is most likely to produce the best performance as well as optimizing the global queue for any remaining jobs. Therefore it is equally suitable whether a single job is being submitted or bulk scheduling is being performed. Results indicate that considerable performance improvements can be gained by adopting the DIANA scheduling approach.Comment: 12 pages, 11 figures. To be published in the IEEE Transactions in Nuclear Science, IEEE Press. 200

    Compact and efficient representations of graphs

    Get PDF
    [Resumen] En esta tesis estudiamos el problema de la creación de representaciones compactas y eficientes de grafos. Proponemos nuevas estructuras para persistir y consultar grafos de diferentes dominios, prestando especial atención al diseño de soluciones eficientes para grafos generales y grafos RDF. Hemos diseñado una nueva herramienta para generar grafos a partir de fuentes de datos heterogéneas mediante un sistema de definición de reglas. Es una herramienta de propósito general y, hasta nuestro conocimiento, no existe otra herramienta de estas características en el Estado del Arte. Otra contribución de este trabajo es una representación compacta de grafos generales, que soporta el acceso eficiente a los atributos y aristas del grafo. Así mismo, hemos estudiado el problema de la distribución de grafos en un entorno paralelo, almacenados sobre estructuras compactas, y hemos propuesto nueve alternativas diferentes que han sido evaluadas experimentalmente. También hemos propuesto un nuevo índice para RDF que soporta la resolución básica de SPARQL de forma comprimida. Por último, presentamos una nueva estructura compacta para almacenar relaciones ternarias cuyo diseño se enfoca a la representación eficiente de datos RDF. Todas estas propuestas han sido experimentalmente validadas con conjuntos de datos ampliamente aceptados, obteniéndose resultados competitivos comparadas con otras alternativas del Estado del Arte.[Resumo] Na presente tese estudiamos o problema da creación de representacións compactas e eficientes de grafos. Para isto propoñemos novas estruturas para persistir e consultar grafos de diferentes dominios, facendo especial fincapé no deseño de solucións eficientes nos casos de grafos xerais e grafos RDF. Deseñamos unha nova ferramenta para a xeración de grafos a partires de fontes de datos heteroxéneas mediante un sistema de definición de regras. Trátase dunha ferramenta de propósito xeral e, até onde chega o noso coñecemento, non existe outra ferramenta semellante no Estado do Arte. Outra das contribucións do traballo é unha representación compacta de grafos xerais, con soporte para o acceso eficiente aos atributos e aristas do grafo. Así mesmo, estudiamos o problema da distribución de grafos nun contorno paralelo, almacenados sobre estruturas compactas, e propoñemos nove alternativas diferentes que foron avaliadas de xeito experimental. Propoñemos tamén un novo índice para RDF que soporta a resolución básica de SPARQL de xeito comprimido. Para rematar, presentamos unha nova estrutura compacta para almacenar relacións ternarias, cun diseño especialmente enfocado á representación eficiente de datos RDF. Todas estas propostas foron validadas experimentalmente con conxuntos de datos amplamente aceptados, obténdose resultados competitivos comparadas con outras alternativas do Estado do Arte.[Abstract] In this thesis we study the problem of creating compact and efficient representations of graphs. We propose new data structures to store and query graph data from diverse domains, paying special attention to the design of efficient solutions for attributed and RDF graphs. We have designed a new tool to generate graphs from arbitrary data through a rule definition system. It is a general-purpose solution that, to the best of our knowledge, is the first with these characteristics. Another contribution of this work is a very compact representation for attributed graphs, providing efficient access to the properties and links of the graph. We also study the problem of graph distribution on a parallel environment using compact structures, proposing nine different alternatives that are experimentally compared. We also propose a novel RDF indexing technique that supports efficient SPARQL solution in compressed space. Finally, we present a new compact structure to store ternary relationships whose design is focused on the efficient representation of RDF data. All of these proposals were experimentally evaluated with widely accepted datasets, obtaining competitive results when they are compared against other alternatives of the State of the Art

    DIANA Scheduling Hierarchies for Optimizing Bulk Job Scheduling

    Get PDF
    The use of meta-schedulers for resource management in large-scale distributed systems often leads to a hierarchy of schedulers. In this paper, we discuss why existing meta-scheduling hierarchies are sometimes not sufficient for Grid systems due to their inability to re-organise jobs already scheduled locally. Such a job re-organisation is required to adapt to evolving loads which are common in heavily used Grid infrastructures. We propose a peer-to-peer scheduling model and evaluate it using case studies and mathematical modelling. We detail the DIANA (Data Intensive and Network Aware) scheduling algorithm and its queue management system for coping with the load distribution and for supporting bulk job scheduling. We demonstrate that such a system is beneficial for dynamic, distributed and self-organizing resource management and can assist in optimizing load or job distribution in complex Grid infrastructures.Comment: 8 pages, 9 figures. Presented at the 2nd IEEE Int Conference on eScience & Grid Computing. Amsterdam Netherlands, December 200

    Distributed Iterative Graph Processing Using NoSQL with Data Locality

    Get PDF
    A tremendous amount of data is generated every day from a wide range of sources such as social networks, sensors, and application logs. Among them, graph data is one type that represents valuable relationships between various entities. Analytics of large graphs has become an essential part of business processes and scientific studies because it leads to deep and meaningful insights into the related domain based on the connections between various entities. However, the optimal processing of large-scale iterative graph computations is very challenging due to the issues like fault tolerance, high memory requirement, parallelization, and scalability. Most of the contemporary systems focus either on keeping the entire graph data in memory and minimizing the disk access or on processing the graph data completely on a single node with a centralized disk system. GraphMap is one of the state-of-the-art scalable and efficient out-of-core disk-based iterative graph processing systems that focus on using the secondary storage and optimizing the I/O access. In this thesis, we investigate two new extensions to the existing out-of-core NoSQL-based distributed iterative graph processing system: 1) Intra-worker data locality and 2) Mincut-based partitioning. We design an additional suite of data locality that moves the computation towards the data rather than the other way around. A significant improvement in performance, up to 39\%, is demonstrated by this locality implementation. Similarly, we use the mincut-based graph partitioning technique to distribute the graph data uniformly across the workers for parallelization so that the inter-worker communication volume is minimized. By extensive experiments, we also show that the mincut-based graph partitioning technique can lead to improper parallelization due to sub-optimal load-balancing
    corecore