Bulk Scheduling with the DIANA Scheduler
Results from the research and development of a Data Intensive and Network
Aware (DIANA) scheduling engine, to be used primarily for data intensive
sciences such as physics analysis, are described. In Grid analyses, tasks can
involve thousands of computing, data handling, and network resources. The
central problem in the scheduling of these resources is the coordinated
management of computation and data at multiple locations and not just data
replication or movement. However, this can prove to be a rather costly
operation, and efficient scheduling can be a challenge if compute and data
resources are mapped without considering network costs. We have implemented an adaptive
algorithm within the so-called DIANA Scheduler which takes into account data
location and size, network performance and computation capability in order to
enable efficient global scheduling. DIANA is a performance-aware and
economy-guided Meta Scheduler. It iteratively allocates each job to the site
that is most likely to produce the best performance, while also optimizing the
global queue for any remaining jobs. It is therefore equally suitable for
submitting a single job and for performing bulk scheduling. Results
indicate that considerable performance improvements can be gained by adopting
the DIANA scheduling approach.
Comment: 12 pages, 11 figures. To be published in the IEEE Transactions on
Nuclear Science, IEEE Press. 200
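The allocation idea summarized in this abstract can be illustrated with a minimal sketch. It is not the DIANA implementation: the cost model (queue wait plus data transfer time plus compute time), the class names `Site` and `Job`, and all field names are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    compute_rate: float    # hypothetical: work units processed per second
    bandwidth_mb_s: float  # hypothetical: MB/s available for fetching remote data
    queue_s: float = 0.0   # estimated seconds of work already queued

@dataclass
class Job:
    name: str
    work_units: float
    data_mb: float
    data_site: str         # name of the site that already holds the input data

def schedule(jobs, sites):
    """Greedily place each job on the site with the lowest estimated cost."""
    queue = {s.name: s.queue_s for s in sites}  # local copy, inputs stay unchanged

    def cost(job, site):
        # No transfer cost when the job runs where its data already lives.
        transfer = 0.0 if site.name == job.data_site else job.data_mb / site.bandwidth_mb_s
        compute = job.work_units / site.compute_rate
        return queue[site.name] + transfer + compute

    placement = {}
    for job in jobs:
        best = min(sites, key=lambda s: cost(job, s))
        queue[best.name] = cost(job, best)  # the chosen site's queue grows
        placement[job.name] = best.name
    return placement
```

With two sites, a large job whose data sits at the slower site stays there because moving the data would dominate the cost, while a later job with little data moves to the faster, idle site. The sketch only covers the per-job placement step; it does not model the global queue optimization the abstract mentions.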
Compact and efficient representations of graphs
[Abstract] In this thesis we study the problem of creating compact and efficient representations
of graphs. We propose new data structures to store and query graph data from
diverse domains, paying special attention to the design of efficient solutions for
attributed and RDF graphs.
We have designed a new tool to generate graphs from arbitrary data through
a rule definition system. It is a general-purpose solution that, to the best of our
knowledge, is the first with these characteristics. Another contribution of this work
is a very compact representation for attributed graphs, providing efficient access
to the properties and links of the graph. We also study the problem of
distributing graphs stored in compact structures across a parallel environment,
proposing nine different alternatives that are experimentally compared. We also
propose a novel RDF index that supports efficient resolution of basic SPARQL
queries in compressed space. Finally, we present a new compact structure to store ternary relationships
whose design is focused on the efficient representation of RDF data.
All of these proposals were experimentally evaluated with widely accepted
datasets, obtaining competitive results compared with other state-of-the-art
alternatives.
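To make "ternary relationships" and basic SPARQL resolution concrete, here is a minimal sketch. It is not the compact structure proposed in the thesis, which the abstract does not describe. It assumes a plain dictionary encoding of RDF terms to integer IDs and answers a single triple pattern by linear scan; the class name `TripleStore` and its methods are hypothetical.

```python
class TripleStore:
    """Dictionary-encoded (subject, predicate, object) triples with pattern matching."""

    def __init__(self, triples):
        self.ids = {}    # term -> integer ID
        self.terms = []  # integer ID -> term
        encoded = {tuple(self._encode(term) for term in triple) for triple in triples}
        self.triples = sorted(encoded)  # ID triples, deduplicated and ordered

    def _encode(self, term):
        # Assign the next free ID the first time a term is seen.
        if term not in self.ids:
            self.ids[term] = len(self.terms)
            self.terms.append(term)
        return self.ids[term]

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        pattern = []
        for term in (s, p, o):
            if term is None:
                pattern.append(None)
            elif term not in self.ids:
                return []  # an unknown term cannot match anything
            else:
                pattern.append(self.ids[term])
        return [
            tuple(self.terms[i] for i in triple)
            for triple in self.triples
            if all(q is None or q == x for q, x in zip(pattern, triple))
        ]
```

Replacing repeated strings with small integers is a common first step in RDF stores. A compressed index would replace the sorted list and the linear scan with a succinct structure, which this sketch does not attempt.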
DIANA Scheduling Hierarchies for Optimizing Bulk Job Scheduling
The use of meta-schedulers for resource management in large-scale distributed
systems often leads to a hierarchy of schedulers. In this paper, we discuss why
existing meta-scheduling hierarchies are sometimes not sufficient for Grid
systems due to their inability to re-organise jobs already scheduled locally.
Such a job re-organisation is required to adapt to evolving loads which are
common in heavily used Grid infrastructures. We propose a peer-to-peer
scheduling model and evaluate it using case studies and mathematical modelling.
We detail the DIANA (Data Intensive and Network Aware) scheduling algorithm and
its queue management system for coping with the load distribution and for
supporting bulk job scheduling. We demonstrate that such a system is beneficial
for dynamic, distributed and self-organizing resource management and can assist
in optimizing load or job distribution in complex Grid infrastructures.
Comment: 8 pages, 9 figures. Presented at the 2nd IEEE Int Conference on
eScience & Grid Computing. Amsterdam, Netherlands, December 200
Distributed Iterative Graph Processing Using NoSQL with Data Locality
A tremendous amount of data is generated every day from a wide range of sources such as social networks, sensors, and application logs. Graph data is one such type, representing valuable relationships between entities. Analytics of large graphs has become an essential part of business processes and scientific studies because the connections between entities yield deep and meaningful insights into the related domain. However, optimal processing of large-scale iterative graph computations is very challenging due to issues such as fault tolerance, high memory requirements, parallelization, and scalability. Most contemporary systems focus either on keeping the entire graph in memory and minimizing disk access, or on processing the graph entirely on a single node with a centralized disk system. GraphMap is a state-of-the-art, scalable and efficient out-of-core, disk-based iterative graph processing system that focuses on using secondary storage and optimizing I/O access. In this thesis, we investigate two new extensions to the existing out-of-core, NoSQL-based distributed iterative graph processing system: 1) intra-worker data locality and 2) mincut-based partitioning. We design an additional data locality mechanism that moves the computation towards the data rather than the other way around. This locality implementation improves performance by up to 39%. Similarly, we use the mincut-based graph partitioning technique to distribute the graph data uniformly across the workers for parallelization so that the inter-worker communication volume is minimized. Through extensive experiments, we also show that the mincut-based graph partitioning technique can lead to improper parallelization due to sub-optimal load balancing.
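The tension between a small edge cut and an even load can be shown on a toy graph. The sketch below is not taken from the thesis; the helper names `edge_cut` and `load_imbalance`, the example graph, and the use of vertex count as the load measure are all assumptions for illustration.

```python
from collections import Counter

def edge_cut(edges, part):
    """Number of edges whose endpoints are assigned to different workers."""
    return sum(1 for u, v in edges if part[u] != part[v])

def load_imbalance(part, workers):
    """Largest worker load divided by the average load (1.0 means perfectly even)."""
    counts = Counter(part.values())
    loads = [counts.get(w, 0) for w in range(workers)]
    return max(loads) / (sum(loads) / workers)

# Toy graph: a dense cluster {0, 1, 2, 3} with a short tail 3-4-5.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

# Assignment that minimizes the cut: only edge (3, 4) crosses workers.
low_cut = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}

# Assignment with equal vertex counts: three edges cross workers.
balanced = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

Here `low_cut` has the smaller communication volume but gives one worker twice as many vertices as the other, while `balanced` evens the load at the price of a larger cut. A real partitioner would also weigh vertices by their computation cost, which this sketch ignores.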