84 research outputs found
Data Tiling for Sparse Computation
Many real-world datasets contain internal relationships. Efficient analysis of such relationship data is crucial for important problems including genome alignment, network vulnerability analysis, and web page ranking, among others. Relationship data is frequently sparse, and analysis on it is called sparse computation. We demonstrate that the important technique of data tiling is more powerful than previously known by broadening its application space. We focus on three important sparse computation areas: graph analysis, linear algebra, and bioinformatics. We demonstrate data tiling's power by addressing key issues and providing significant improvements, to both runtime and solution quality, in each area. For graph analysis, we focus on fast data tiling techniques that can produce well-structured tiles, and we demonstrate theoretical hardness results. These tiles are suitable for graph problems as they reduce data movement and ultimately improve end-to-end runtime performance. For linear algebra, we introduce a new cache-aware tiling technique and apply it to the key kernel of sparse matrix by sparse matrix multiplication. This technique tiles the second input matrix and then uses a small summary matrix to guide access to the tiles during computation. Our approach results in the fastest known implementation across three distinct CPU architectures. In bioinformatics, we develop a tiling-based de novo genome assembly pipeline. We start with reads and build either a graph or a hypergraph that captures internal relationships between reads. This is then tiled to minimize connections while maintaining balance. We then treat each resulting tile independently as the input to an existing shared-memory assembler. Our pipeline improves existing state-of-the-art de novo genome assemblers, bringing both runtime and quality improvements to them on real-world and simulated datasets. Ph.D.
Execution Trace Graph Based Multi-criteria Partitioning of Stream Programs
One of the problems proven to be NP-hard in the field of many-core architectures is the partitioning of stream programs. To maximize execution parallelism and obtain the maximal data throughput for a streaming application, it is essential to find an appropriate assignment of actors. The paper proposes a novel approach for finding a close-to-optimal partitioning configuration, based on the execution trace graph of a dataflow network and its analysis. We present some aspects of dataflow programming that make the partitioning problem different in this paradigm and build the heuristic methodology on them. Our optimization criteria include balancing the total processing workload with regard to data dependencies, minimizing actor idle time, and reducing data exchanges between processing units. Finally, we validate our approach with experimental results for a video decoder design case and compare them with some state-of-the-art solutions.
Towards Performance Portable Graph Algorithms
In today's data-driven world, our computational resources have become heterogeneous, making it crucial to process large-scale graphs in an architecture-agnostic manner. Traditionally, hand-optimized high-performance computing (HPC) solutions have been studied and used to implement highly efficient and scalable graph algorithms. In recent years, several graph processing and management systems have also been proposed. Hand-optimized HPC approaches require high levels of expertise, and graph processing frameworks suffer from limited expressibility and performance. Portability is a major concern for both approaches. The main thesis of this work is that block-based graph algorithms offer a compromise between efficient parallelism and architecture-agnostic algorithm design for a wide class of graph problems. This dissertation seeks to prove this thesis by focusing on three pillars: data/computation partitioning, block-based algorithm design, and performance portability.
In this dissertation, we first show how we can partition the computation and the data to design efficient block-based algorithms for solving graph merging and triangle counting problems. Then, generalizing from our experiences, we propose PGAbB, an algorithmic framework for implementing block-based graph algorithms on shared-memory, heterogeneous machines. PGAbB aims to maximally leverage different architectures by implementing a task-based execution on top of a block-based programming model. In this talk we will discuss PGAbB's programming model, algorithmic optimizations for scheduling, and load-balancing strategies for graph problems on real-world and synthetic inputs. Ph.D.
A Methodology for Profiling and Partitioning Stream Programs on Many-core Architectures
Maximizing data throughput is a very common implementation objective for streaming applications. Such a task is particularly challenging for implementations based on many-core and multi-core target platforms because, in general, it implies tackling several NP-complete combinatorial problems. Moreover, efficient design space exploration requires accurate evaluation based on dataflow program execution profiling. The focus of the paper is on the methodological challenges of obtaining accurate profiling measures. Experimental results validate a many-core platform built from an array of Transport Triggered Architecture processors for exploring the partitioning search space based on execution trace analysis.
Independent task assignment for heterogeneous systems
Ankara: The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent Univ., 2013. Thesis (Ph.D.) -- Bilkent University, 2013. Includes bibliographical references (leaves 136-150).
We study the problem of assigning nonuniform tasks onto heterogeneous systems.
We investigate two distinct problems in this context. The first problem is the
one-dimensional partitioning of nonuniform workload arrays with optimal load
balancing. The second problem is the assignment of nonuniform independent
tasks onto heterogeneous systems.
For one-dimensional partitioning of nonuniform workload arrays, we investigate
two cases: chain-on-chain partitioning (CCP), where the order of the processors
is specified, and chain partitioning (CP), where processor permutation
is allowed. We present polynomial time algorithms to solve the CCP problem
optimally, while we prove that the CP problem is NP-complete. Our empirical
studies show that our proposed exact algorithms for the CCP problem produce
substantially better results than the state-of-the-art heuristics while the solution
times remain comparable.
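
To make the CCP setting concrete, here is a minimal sketch that assumes each processor has a relative speed and that the load of a processor is the sum of its task weights divided by that speed. It uses a plain O(KN^2) dynamic program, which is one polynomial-time exact method and is not claimed to be the algorithm developed in the thesis; the names `ccp_dp`, `weights`, and `speeds` are hypothetical.

```python
from itertools import accumulate


def ccp_dp(weights, speeds):
    """Exact chain-on-chain partitioning by dynamic programming (illustrative).

    weights: task weights in chain order.
    speeds:  processor speeds in the fixed processor order (assumed model).
    Returns (bottleneck, cuts); cuts[p] is the index one past the last task
    assigned to processor p, so processor p gets weights[cuts[p-1]:cuts[p]].
    """
    n, k = len(weights), len(speeds)
    prefix = [0] + list(accumulate(weights))
    inf = float("inf")
    # best[p][i]: smallest achievable bottleneck when the first i tasks
    # are split, in order, among the first p processors.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    choice = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for p in range(1, k + 1):
        for i in range(n + 1):
            for j in range(i + 1):  # processor p takes tasks j..i-1
                load = (prefix[i] - prefix[j]) / speeds[p - 1]
                cand = max(best[p - 1][j], load)
                if cand < best[p][i]:
                    best[p][i] = cand
                    choice[p][i] = j
    # Walk the recorded choices backwards to recover the cut points.
    cuts, i = [], n
    for p in range(k, 0, -1):
        cuts.append(i)
        i = choice[p][i]
    cuts.reverse()
    return best[k][n], cuts


if __name__ == "__main__":
    print(ccp_dp([4, 2, 3, 1, 5], [2, 1]))
```

In the CP variant described above, the processor order is itself a decision variable, which is why a fixed-order dynamic program like this one no longer suffices.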
For the independent task assignment problem, we investigate improving the
performance of the well-known and widely used constructive heuristics MinMin,
MaxMin and Sufferage. All three heuristics are known to run in O(KN^2) time in
assigning N tasks to K processors. In this thesis, we present our work on an algorithmic
improvement that asymptotically decreases the running time complexity
of MinMin to O(KN log N) without affecting its solution quality. Furthermore,
we combine the newly proposed MinMin algorithm with MaxMin as well as Sufferage,
obtaining two hybrid algorithms. The motivation behind the former hybrid
algorithm is to address the drawback of MaxMin in solving problem instances
with highly skewed cost distributions while also improving the running time performance
of MaxMin. The latter hybrid algorithm improves the running time
performance of Sufferage without degrading its solution quality. The proposed
algorithms are easy to implement, and we illustrate them with detailed pseudocode.
The experimental results over a large number of real-life datasets show
that the proposed fast MinMin algorithm and the proposed hybrid algorithms
perform significantly better than their traditional counterparts as well as more
recent state-of-the-art assignment heuristics. For the large datasets used in the
experiments, MinMin, MaxMin, and Sufferage, as well as recent state-of-the-art
heuristics, require days, weeks, or even months to produce a solution, whereas all
of the proposed algorithms produce solutions within only two or three minutes.
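
For reference, the traditional MinMin heuristic mentioned above can be sketched as follows. This is the textbook O(KN^2) baseline, not the O(KN log N) variant proposed in the thesis (whose details are not given in the abstract); the function name `min_min` and the `cost[t][p]` matrix layout are assumptions made for illustration.

```python
def min_min(cost):
    """Baseline MinMin task assignment (illustrative sketch).

    cost[t][p]: execution time of task t on processor p.
    Returns (assignment, makespan), where assignment[t] is the processor
    chosen for task t.
    """
    n, k = len(cost), len(cost[0])
    ready = [0.0] * k              # time at which each processor becomes free
    assignment = [None] * n
    unassigned = list(range(n))
    while unassigned:
        best = None                # (completion_time, task, processor)
        # Among all unassigned tasks, find the (task, processor) pair with
        # the smallest completion time: O(KN) work per round, N rounds.
        for t in unassigned:
            for p in range(k):
                completion = ready[p] + cost[t][p]
                if best is None or completion < best[0]:
                    best = (completion, t, p)
        completion, t, p = best
        assignment[t] = p
        ready[p] = completion
        unassigned.remove(t)
    return assignment, max(ready)


if __name__ == "__main__":
    print(min_min([[3, 5], [2, 4], [6, 1]]))
```

MaxMin differs only in the selection rule: each round it picks the task whose best completion time is largest rather than smallest.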
For the independent task assignment problem, we also investigate adopting
the multi-level framework which was successfully utilized in several applications
including graph and hypergraph partitioning. For the coarsening phase of the
multi-level framework, we present an efficient matching algorithm which runs in
O(KN) time in most cases. For the uncoarsening phase, we present two refinement
algorithms: an efficient O(KN)-time move-based refinement and an efficient
O(K^2 N log N)-time swap-based refinement. Our results indicate that the multi-level
approach improves the quality of task assignments, while also improving the running
time performance, especially for large datasets.
As a realistic distributed application of the independent task assignment problem,
we introduce the site-to-crawler assignment problem, where a large number
of geographically distributed web servers are crawled by a multi-site distributed
crawling system and the objective is to minimize the duration of the crawl. We
show that this problem can be modeled as an independent task assignment problem.
As a solution to the problem, we evaluate a large number of state-of-the-art
task assignment heuristics selected from the literature as well as the improved
versions and the newly developed multi-level task assignment algorithm. We
compare the performance of different approaches through simulations on very
large, real-life web datasets. Our results indicate that multi-site web crawling
efficiency can be considerably improved using the independent task assignment
approach, when compared to relatively easy-to-implement yet naive baselines. Tabak, E Kartal. Ph.D.
A Polyhedral Study of Mixed 0-1 Set
We consider a variant of the well-known single node fixed charge network flow set with constant capacities. This set arises from the relaxation of more general mixed integer sets such as lot-sizing problems with multiple suppliers. We provide a complete polyhedral characterization of the convex hull of the given set.
Efficient bulk-loading methods for temporal and multidimensional index structures
Nearly all fields of the natural sciences benefit from the latest methods for analyzing and processing large volumes of data. These methods presuppose efficient processing of spatial and temporal data, since time and position are important attributes of many data items. Efficient query processing is made possible in particular by the use of index structures. This thesis focuses on two index structures: the multiversion B-tree (MVBT) and the R-tree. The first is used to manage temporal data, the second to index multidimensional rectangle data.
Constantly and rapidly growing data volumes pose a major challenge for computer science. Building and updating indexes with conventional methods (record by record) is no longer efficient. To enable timely and cost-efficient data processing, methods for fast loading of index structures are urgently needed.
In the first part of the thesis, we address the question of whether there is a loading method for the MVBT that has the same I/O complexity as external sorting. Until now, this question has remained unanswered. In this thesis, we developed a new construction method and showed that it has the same time complexity as external sorting. To do so, we employed two algorithmic techniques: weight balancing and buffer trees. Our experiments show that the result is not only of theoretical significance.
In the second part of the thesis, we consider whether and how statistical information about spatial queries can be exploited to improve the query performance of R-trees. Our new method uses information such as the aspect ratio and side lengths of a representative query rectangle to build a good R-tree with respect to a commonly used cost model. If this information is not available, we optimize R-trees with respect to the sum of the volumes of the minimum bounding rectangles of the leaf nodes. Since constructing optimal R-trees with respect to this cost measure is NP-hard, we first reduce the problem to a one-dimensional partitioning problem by sorting the data along optimized space-filling curves. We then solve this problem using dynamic programming. The I/O complexity of the method equals that of external sorting, since the I/O running time of the method is dominated by the running time of sorting.
In the last part of the thesis, we applied the developed partitioning methods to the construction of spatial histograms, since these, similar to R-trees, produce a disjoint partitioning of space. Results of extensive experiments show that the new partitioning techniques yield both R-trees with better query performance and spatial histograms with better estimation quality than competing methods.
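
The one-dimensional partitioning step described in this abstract can be illustrated with a small sketch. It assumes two-dimensional rectangles that are already sorted along a space-filling curve, and leaf capacity bounds `b_min` and `b_max`. These parameters, the function names, and the exact recurrence are assumptions for illustration, not details taken from the thesis. The dynamic program splits the sorted sequence into consecutive leaves so that the sum of the leaves' minimum bounding rectangle areas is minimal.

```python
def mbr_area(rects):
    """Area of the minimum bounding rectangle of (xlo, ylo, xhi, yhi) tuples."""
    xlo = min(r[0] for r in rects)
    ylo = min(r[1] for r in rects)
    xhi = max(r[2] for r in rects)
    yhi = max(r[3] for r in rects)
    return (xhi - xlo) * (yhi - ylo)


def partition_leaves(rects, b_min, b_max):
    """Split an already-sorted rectangle sequence into consecutive leaves.

    Each leaf holds between b_min and b_max rectangles (assumed bounds).
    Returns (total_area, groups), where groups is a list of (start, end)
    half-open index ranges into rects.
    """
    n = len(rects)
    inf = float("inf")
    cost = [inf] * (n + 1)   # cost[i]: best total area for the first i rects
    back = [0] * (n + 1)     # back[i]: start index of the last leaf
    cost[0] = 0.0
    for i in range(1, n + 1):
        for size in range(b_min, min(b_max, i) + 1):
            j = i - size
            if cost[j] == inf:
                continue     # the first j rectangles cannot be packed
            c = cost[j] + mbr_area(rects[j:i])
            if c < cost[i]:
                cost[i] = c
                back[i] = j
    if cost[n] == inf:
        raise ValueError("no partition satisfies the leaf capacity bounds")
    groups, i = [], n
    while i > 0:
        groups.append((back[i], i))
        i = back[i]
    groups.reverse()
    return cost[n], groups


if __name__ == "__main__":
    data = [(0, 0, 1, 1), (1, 0, 2, 1), (2, 0, 3, 1),
            (10, 10, 11, 11), (11, 10, 12, 11)]
    print(partition_leaves(data, 2, 3))
```

Recomputing each bounding rectangle from scratch keeps the sketch short; an incremental update of the bounding rectangle while the leaf grows would avoid the repeated scans.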