    Framework for Supporting Genomic Operations

    Next Generation Sequencing (NGS) is a family of technologies for reading DNA or RNA that can produce whole-genome sequences at impressive speed, revolutionizing both biological research and medical practice. In this exciting scenario, while a huge number of specialized bioinformatics programs extract information from sequences, there is an increasing need for a new generation of systems and frameworks capable of integrating such information and providing holistic answers to the needs of biologists and clinicians. To respond to this need, we developed GMQL, a new query language for genomic data management that operates on heterogeneous genomic datasets. In this paper, we focus on three domain-specific GMQL operations over genomic regions and describe their efficient implementation; the paper develops a theory of binning strategies as a generic approach to the parallel execution of genomic operations, and then describes how binning is embedded into two efficient implementations of the operations using Flink and Spark, two emerging frameworks for data management on the cloud.
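
    The binning idea sketched in the abstract can be illustrated with a small, hypothetical Python example (the bin width, function names, and the plain interval-overlap join are assumptions for illustration, not the GMQL implementation): each region is replicated into every fixed-width bin it overlaps, so the overlap test between two datasets can run independently, and in parallel, per bin.

    from collections import defaultdict

    BIN_SIZE = 10_000  # assumed bin width in base pairs

    def assign_bins(regions):
        """Replicate each region (start, stop) into every bin it overlaps."""
        bins = defaultdict(list)
        for start, stop in regions:
            for b in range(start // BIN_SIZE, stop // BIN_SIZE + 1):
                bins[b].append((start, stop))
        return bins

    def binned_overlap_join(reference, experiment):
        """Find overlapping region pairs; each bin could be handled by a separate worker."""
        ref_bins, exp_bins = assign_bins(reference), assign_bins(experiment)
        results = set()  # a set removes duplicates caused by replication across bins
        for b, ref_regions in ref_bins.items():
            for r in ref_regions:
                for e in exp_bins.get(b, []):
                    if r[0] < e[1] and e[0] < r[1]:  # half-open intervals overlap
                        results.add((r, e))
        return results

    print(binned_overlap_join([(5_000, 25_000)], [(24_000, 26_000), (40_000, 50_000)]))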

    Efficient processing of large-scale spatio-temporal data

    Millions of location-aware devices, such as mobile phones, cars, and environmental sensors, constantly report their positions, often together with a timestamp and further payload data, to a server for different kinds of analyses. The location information of the devices and reported events is represented as points and polygons; raster data, produced for example by cameras and sensors, is another type of spatial data. These big spatio-temporal data volumes need to be processed on scalable platforms such as Hadoop and Apache Spark, which, however, are unaware of properties such as spatial neighborhood, making certain queries practically impossible to execute. The repeated execution of analysis programs during development and by different users results in long execution times and potentially high costs for rented cluster resources, which can be reduced by reusing commonly computed intermediate results. This thesis tackles the two challenges described above. First, we present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack.
For operators, we identify several possible algorithms and study how they can benefit from the properties of the underlying platform. We further investigate how indexes can be realized in the distributed and parallel architecture of big data processing engines, and we compare data partitioning methods, which differ in how well they cope with data skew and dataset size. Furthermore, an approach to reduce the amount of data to be processed at the operator level is presented. To shorten execution times, we introduce an approach to transparently recycle intermediate results of dataflow programs, based on a decision model that uses the actual operator costs. To compute these costs, we instrument the programs with profiling code that gathers the execution time and result size of each operator. In the evaluation, we first compare the various implementation and configuration options in STARK and identify scenarios in which partitioning and indexing should be applied. We further compare STARK to related systems and show that we achieve significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that the transparent, cost-based materialization and recycling of intermediate results can reduce program execution times significantly.
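
    As a rough illustration of the spatial partitioning discussed above, the following is a minimal, hypothetical Python sketch (grid resolution and function names are assumptions, not STARK's API): points are mapped to fixed grid cells so that nearby points land in the same partition and each cell can be processed by a separate worker. Such a fixed grid is simple but sensitive to data skew, which is exactly why the thesis compares several partitioning methods.

    GRID_CELLS = 4  # assumed number of cells per axis

    def grid_partition(points, min_x, max_x, min_y, max_y):
        """Map each (x, y) point to a grid cell; the cells act as partitions."""
        cell_w = (max_x - min_x) / GRID_CELLS
        cell_h = (max_y - min_y) / GRID_CELLS
        partitions = {}
        for x, y in points:
            cx = min(int((x - min_x) / cell_w), GRID_CELLS - 1)
            cy = min(int((y - min_y) / cell_h), GRID_CELLS - 1)
            partitions.setdefault((cx, cy), []).append((x, y))
        return partitions

    parts = grid_partition([(1, 1), (1.5, 2), (9, 9)], 0, 10, 0, 10)
    print(parts)  # the two nearby points end up in the same cell, enabling local spatial joins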

    Cost estimation of spatial join in SpatialHadoop

    Spatial join is an important operation in geo-spatial applications, since it is frequently used for performing data analysis involving geographical information. Many efforts have been made over the past decades to provide efficient algorithms for spatial join, and this becomes particularly important as the amount of spatial data to be processed increases. In recent years, the MapReduce approach has become a de-facto standard for processing large amounts of data (big data), and some attempts have been made to extend existing frameworks to the processing of spatial data. In this context, several different MapReduce implementations of spatial join have been defined, which mainly differ in the use of a spatial index and in the way this index is built and used. In general, none of these algorithms can be considered better than the others; rather, the choice depends on the characteristics of the involved datasets. The aim of this work is to analyse them in depth and to define a cost model for ranking them based on the characteristics of the dataset at hand (i.e., selectivity or spatial properties). This cost model has been extensively tested against a set of synthetic datasets in order to prove its effectiveness.
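
    To make the idea of ranking join algorithms by a cost model concrete, here is a toy, hypothetical Python sketch; the formula, parameters, and numbers are assumptions for illustration, not the paper's actual cost model. The point is only that each algorithm gets a cost estimate from dataset characteristics (cardinalities, selectivity) and the cheapest one is picked.

    def estimated_join_cost(n_r, n_s, selectivity, per_pair_cost, overhead):
        """Toy cost: candidate pairs actually examined plus a fixed setup overhead."""
        candidate_pairs = n_r * n_s * selectivity
        return candidate_pairs * per_pair_cost + overhead

    # Hypothetical algorithm profiles: an index-based join examines fewer candidate
    # pairs but pays an up-front cost for building or reading the index.
    algorithms = {
        "non-indexed join": dict(selectivity=1e-3, per_pair_cost=1.0, overhead=0.0),
        "indexed join":     dict(selectivity=1e-4, per_pair_cost=1.0, overhead=5e4),
    }
    n_r, n_s = 100_000, 50_000
    ranked = sorted(algorithms, key=lambda a: estimated_join_cost(n_r, n_s, **algorithms[a]))
    print(ranked)  # the algorithm with the lowest estimated cost comes first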

    MapReduce Particle Filtering with Exact Resampling and Deterministic Runtime

    Particle filtering is a numerical Bayesian technique that has great potential for solving sequential estimation problems involving non-linear and non-Gaussian models. Since the estimation accuracy achieved by particle filters improves as the number of particles increases, it is natural to consider as many particles as possible. MapReduce is a generic programming model that makes it possible to scale a wide variety of algorithms to Big Data. However, despite the application of particle filters across many domains, little attention has been devoted to implementing particle filters using MapReduce. In this paper, we describe an implementation of a particle filter using MapReduce. We focus on the component that would otherwise be a bottleneck to parallel execution: the resampling component. We devise a new implementation of this component which requires no approximations, has O(N) spatial complexity and deterministic O((log N)^2) time complexity. Results demonstrate the utility of this new component and culminate in consideration of a particle filter with 2^24 particles being distributed across 512 processor cores.
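
    To make the role of the resampling step concrete, here is a minimal serial Python sketch of systematic resampling, one standard exact resampling scheme; this is only an illustration of what resampling computes, not the paper's MapReduce formulation with deterministic O((log N)^2) parallel time.

    import random

    def systematic_resample(weights):
        """Return indices of particles selected in proportion to their weights."""
        n = len(weights)
        total = sum(weights)
        # Cumulative sum of the normalized weights.
        cumulative, acc = [], 0.0
        for w in weights:
            acc += w / total
            cumulative.append(acc)
        # One random offset, then n evenly spaced positions across [0, 1).
        u0 = random.random() / n
        indices, j = [], 0
        for i in range(n):
            u = u0 + i / n
            while cumulative[j] < u:
                j += 1
            indices.append(j)
        return indices

    print(systematic_resample([0.1, 0.1, 0.7, 0.1]))  # the heavy particle is usually duplicated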

    What makes spatial data big? A discussion on how to partition spatial data

    The amount of available spatial data has increased significantly in recent years, to the point that traditional analysis tools have become inadequate for managing it effectively. Therefore, many attempts have been made to extend existing MapReduce tools, such as Hadoop or Spark, with spatial capabilities in terms of data types and algorithms. Such extensions are mainly based on the partitioning techniques implemented for textual data, where size is measured in terms of the number of occupied bytes. However, spatial data are characterized by other features that determine their size, such as the number of vertices or the MBR extent of geometries, and these greatly affect the performance of operations, like the spatial join, during data analysis. As a result, traditional partitioning techniques prevent the parallel execution provided by a MapReduce environment from being fully exploited. This paper extensively analyses the problem, using the spatial join operation as a use case and performing both a theoretical and an experimental analysis for it. Moreover, it provides a solution based on a different partitioning technique, which splits complex or extensive geometries. Finally, we validate the proposed solution by means of experiments on synthetic and real datasets.
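
    The idea of partitioning by geometric complexity rather than by byte size can be illustrated with a small, hypothetical Python sketch (the vertex threshold and splitting rule are assumptions, not the paper's algorithm): a polyline with too many vertices is cut into smaller pieces so that no single partition element dominates the work of a spatial join.

    MAX_VERTICES = 1_000  # assumed complexity threshold per partition element

    def split_by_vertex_count(polyline, max_vertices=MAX_VERTICES):
        """Split a long polyline into pieces of at most max_vertices vertices.

        Consecutive pieces share one vertex, so the original shape is preserved."""
        pieces = []
        for i in range(0, len(polyline) - 1, max_vertices - 1):
            pieces.append(polyline[i:i + max_vertices])
        return pieces

    line = [(x, x * 0.5) for x in range(2_500)]
    print([len(p) for p in split_by_vertex_count(line)])  # [1000, 1000, 502]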

    JedAI-spatial: a system for 3-dimensional Geospatial Interlinking

    Geospatial data constitutes a considerable part of Semantic Web data, but so far its sources lack sufficient links in the Linked Open Data cloud. Geospatial Interlinking aims to cover this gap by associating geometries through established topological relations, such as those of the Dimensionally Extended 9-Intersection Model. Various algorithms have already been proposed in the literature for this task. In the context of this master thesis, we develop JedAI-spatial, a novel, open-source system that organizes the main existing algorithms along three dimensions: i. Space Tiling distinguishes interlinking algorithms into grid-, tree- and partition-based ones, according to their approach for reducing the search space and, thus, the computational cost of this inherently quadratic task. The grid-based category includes Semantic Web techniques that define a static or dynamic EquiGrid and verify pairs of geometries whose minimum bounding rectangles intersect in at least one common cell. Tree-based algorithms encompass established main-memory spatial join techniques from the database community, while the partition-based category includes variations of the cornerstone of computational geometry, i.e., the plane sweep algorithm. ii. Budget-awareness distinguishes interlinking algorithms into budget-agnostic and budget-aware ones.
The former constitute batch techniques that produce results only after completing their processing over the entire input data, while the latter operate in a pay-as-you-go manner that produces results progressively; their goal is to verify topologically related geometries before the non-related ones. iii. Execution mode distinguishes interlinking algorithms into serial ones, which are carried out on a single CPU core, and parallel ones, which leverage massive parallelization on top of Apache Spark. Extensive experimental evaluations were performed along these three dimensions, with the experimental outcomes providing interesting insights into the relative performance of the considered algorithms.
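
    The grid-based filtering described under Space Tiling can be sketched in a few lines of hypothetical Python (cell size, function names, and the brute-force verification step are assumptions, not the JedAI-spatial implementation): geometries are indexed by the grid cells their minimum bounding rectangles intersect, and only pairs that share at least one cell are verified with the exact topological predicates.

    CELL = 10.0  # assumed grid cell size

    def cells_of(mbr, cell=CELL):
        """Yield all grid cells intersected by an MBR given as (min_x, min_y, max_x, max_y)."""
        min_x, min_y, max_x, max_y = mbr
        for cx in range(int(min_x // cell), int(max_x // cell) + 1):
            for cy in range(int(min_y // cell), int(max_y // cell) + 1):
                yield (cx, cy)

    def candidate_pairs(source, target):
        """Pairs of geometry ids whose MBRs share at least one grid cell."""
        index = {}
        for sid, mbr in source.items():
            for c in cells_of(mbr):
                index.setdefault(c, []).append(sid)
        pairs = set()
        for tid, mbr in target.items():
            for c in cells_of(mbr):
                for sid in index.get(c, []):
                    pairs.add((sid, tid))
        return pairs  # only these pairs need the expensive topological verification

    print(candidate_pairs({"s1": (0, 0, 5, 5)}, {"t1": (4, 4, 12, 6), "t2": (50, 50, 55, 55)}))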

    R*-Grove: Balanced Spatial Partitioning for Large-scale Datasets

    The rapid growth of big spatial data has urged the research community to develop several big spatial data systems. Regardless of their architecture, one of the fundamental requirements of all these systems is to spatially partition the data efficiently across machines. The core challenge of big spatial partitioning is to build partitions of high spatial quality while simultaneously taking advantage of distributed processing models by providing load-balanced partitions. Previous work on big spatial partitioning reuses existing index search trees as-is, e.g., the R-tree family, STR, Kd-tree, and Quad-tree, by building a temporary tree over a sample of the input and using its leaf nodes as partition boundaries. However, we show in this paper that none of those techniques addresses the mentioned challenges completely. This paper proposes a novel partitioning method, termed R*-Grove, which can partition very large spatial datasets into high-quality partitions with excellent load balance and block utilization. This appealing property allows R*-Grove to outperform existing techniques in spatial query processing. R*-Grove can be easily integrated into any big data platform, such as Apache Spark or Apache Hadoop. Our experiments show that R*-Grove outperforms the existing partitioning techniques for big spatial data systems. With all the proposed work publicly available as open source, we envision that R*-Grove will be adopted by the community to better serve big spatial data research. Comment: 29 pages, to be published in Frontiers in Big Data.
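
    The sample-then-split approach the abstract attributes to previous work can be illustrated with a small, hypothetical Python sketch of STR-style partitioning (all names and numbers are assumptions): a sample is sorted by x and cut into strips, each strip is sorted by y and cut into cells, and the resulting cell boundaries serve as partition boundaries. R*-Grove itself replaces this with an R*-tree-inspired grouping that gives better spatial quality and load balance.

    import math
    import random

    def str_partition_boundaries(sample, num_partitions):
        """STR-style partition boundaries computed from a sample of (x, y) points."""
        s = math.ceil(math.sqrt(num_partitions))      # strips per axis
        pts = sorted(sample)                          # sort the sample by x
        per_strip = math.ceil(len(pts) / s)
        boundaries = []
        for i in range(0, len(pts), per_strip):
            strip = sorted(pts[i:i + per_strip], key=lambda p: p[1])  # sort each strip by y
            per_cell = math.ceil(len(strip) / s)
            for j in range(0, len(strip), per_cell):
                cell = strip[j:j + per_cell]
                xs, ys = [p[0] for p in cell], [p[1] for p in cell]
                boundaries.append((min(xs), min(ys), max(xs), max(ys)))
        return boundaries

    random.seed(0)
    sample = [(random.random(), random.random()) for _ in range(1_000)]
    print(len(str_partition_boundaries(sample, 9)))  # roughly 9 balanced cells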

    Efficient processing of similarity queries with applications

    Today, a myriad of data sources, from the Internet to business operations to scientific instruments, produce large volumes of data of many different types. Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, call for identifying and processing similarities in big data. As a result, it is imperative to develop new similarity query processing approaches and systems that scale from low-dimensional to high-dimensional data, from a single machine to clusters of hundreds of machines, and from disk-based to memory-based processing. This dissertation introduces and studies several similarity-aware query operators and analyzes and optimizes their performance. The first contribution of this dissertation is an SQL-based Similarity Group-by operator (SGB, for short) that extends the semantics of the standard SQL Group-by operator to group data with similar but not necessarily equal values. We realize these SGB operators by extending the standard SQL Group-by and introduce two new SGB operators for multi-dimensional data. We implement and test the new SGB operators and their algorithms inside an open-source centralized database server (PostgreSQL). In the second contribution of this dissertation, we study how to efficiently process Hamming-distance-based similarity queries (Hamming-distance select and Hamming-distance join) that are crucial to many applications. We introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming-distance range queries (namely, the selects and joins). In the third and last contribution of this dissertation, we develop a system for similarity query processing and optimization in an in-memory and distributed setup for big spatial data. We propose a query scheduler and a distributed query optimizer that use a new cost model to optimize the cost of similarity query processing in this in-memory distributed setup. The scheduler and query optimizer generate query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on Bloom filters to forward queries to the appropriate local sites. The proposed query processing and optimization techniques are prototyped inside Spark, a distributed main-memory computation system.
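
    As a small illustration of the kind of query the second contribution targets, here is a brute-force, hypothetical Python sketch of a Hamming-distance range select over bit-string signatures; it shows only the semantics of the query, not the HA-Index, whose structure is not reproduced here.

    def hamming(a: int, b: int) -> int:
        """Number of differing bits between two equal-length bit strings stored as ints."""
        return bin(a ^ b).count("1")

    def hamming_select(query: int, data: list, radius: int) -> list:
        """Brute-force Hamming-distance range select: all items within `radius` of the query."""
        return [x for x in data if hamming(query, x) <= radius]

    signatures = [0b1010_1100, 0b1010_1111, 0b0101_0011]
    print(hamming_select(0b1010_1101, signatures, radius=2))  # [172, 175]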