Partial Replica Location And Selection For Spatial Datasets
As the size of scientific datasets continues to grow, we will not be able to store enormous datasets on a single grid node, but must distribute them across many grid nodes. The implementation of partial or incomplete replicas, which represent only a subset of a larger dataset, has been an active topic of research. Partial spatial replicas extend this functionality to spatial data, allowing us to distribute a spatial dataset in pieces over several locations. We investigate solutions to the partial spatial replica selection problem. First, we describe and develop two designs for a Spatial Replica Location Service (SRLS), which must return the set of replicas that intersect a query region. Integrating a relational database, a spatial data structure and grid computing software, we build a scalable solution that works well even for several million replicas. In our SRLS, we improve performance by designing an R-tree structure in the backend database and by aggregating several queries into one larger query, which reduces overhead. We also use the Morton space-filling curve during R-tree construction, which improves spatial locality. In addition, we describe R-tree Prefetching (RTP), which effectively utilizes modern multi-processor architectures. Second, we present and implement a fast replica selection algorithm in which a set of partial replicas is chosen from a set of candidates so that retrieval performance is maximized. Using an R-tree-based heuristic algorithm, we achieve O(n log n) complexity for this NP-complete problem. We describe a model for disk access performance that takes filesystem prefetching into account and is sufficiently accurate for spatial replica selection. Making a few simplifying assumptions, we present a fast replica selection algorithm for partial spatial replicas. 
The algorithm uses a greedy approach that attempts to maximize performance by choosing a collection of replica subsets that allow fast data retrieval by a client machine. Experiments show that the performance of the solution found by our algorithm is on average at least 91% and 93.4% of the performance of the optimal solution in 4-node and 8-node tests, respectively.
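The Morton ordering used during R-tree construction can be sketched with a plain bit-interleaving function; this is an illustrative implementation of the general technique, not the authors' code, and the sample replica centroids are made up:

```python
def morton_2d(x, y, bits=16):
    """Interleave the bits of two integer coordinates into a Z-order
    (Morton) key; points close in 2-D space tend to receive close
    keys, which improves locality when bulk-loading an R-tree."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits -> even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits -> odd positions
    return key

# Sorting replica bounding-box centroids by Morton key groups
# spatially nearby replicas into the same R-tree leaves.
replicas = [(5, 9), (6, 8), (60, 2), (61, 3)]
ordered = sorted(replicas, key=lambda p: morton_2d(*p))
```

Here the two nearby centroids (5, 9) and (6, 8) end up adjacent in the sorted order, so they land in the same or neighbouring leaves.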
Optimizing Spatiotemporal Analysis Using Multidimensional Indexing with GeoWave
The open source software GeoWave bridges the gap between geographic information systems and distributed computing. It does this by preserving the locality of multidimensional data when indexing it into a single-dimensional key-value store, using space-filling curves. This means that similar values in each dimension are stored physically close together in the datastore. We demonstrate the efficiency and benefits of the GeoWave indexing algorithm for storing and querying billions of spatiotemporal data points. We show how this indexing strategy can reduce query and processing times by multiple orders of magnitude, using publicly available taxi trip data published by the New York City Taxi & Limousine Commission. Furthermore, we demonstrate how this efficiency lends itself to analysis that would otherwise be infeasible.
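The core idea of mapping multidimensional data into a single-dimensional key-value store can be sketched as follows. This is a minimal, coarse illustration of the general Z-order technique, not GeoWave's actual implementation (GeoWave decomposes a query into many tighter key sub-ranges); the point data are made up:

```python
import bisect

def interleave(x, y, bits=8):
    # Z-order key: bit-interleave x and y so nearby cells get nearby keys.
    k = 0
    for i in range(bits):
        k |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return k

# A sorted key-value "store": key = Z-order key of the cell, value = the point.
points = [(2, 3), (2, 4), (3, 3), (200, 7), (9, 250)]
store = sorted((interleave(x, y), (x, y)) for x, y in points)
keys = [k for k, _ in store]

def range_query(xmin, ymin, xmax, ymax):
    # Z-order keys are monotone in each coordinate, so every point in the
    # box has a key between the two corner keys. Scan that single key
    # range, then filter out false positives.
    lo = bisect.bisect_left(keys, interleave(xmin, ymin))
    hi = bisect.bisect_right(keys, interleave(xmax, ymax))
    return [p for _, p in store[lo:hi]
            if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax]
```

A 2-D box query thus becomes a contiguous scan of a 1-D key range, which is exactly the access pattern distributed key-value stores are fast at.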
Voronoi classified and clustered constellation data structure for three-dimensional urban buildings
In the past few years, urban areas have grown rapidly, resulting in an immense number of urban datasets. This situation contributes to the difficulties in handling and managing issues related to urban areas. Huge and massive datasets can degrade the performance of data retrieval and information analysis. In addition, urban environments are very difficult to manage because they involve various types of data, such as multiple types of zoning themes in urban mixed-use development. Thus, a special technique for efficient data handling and management is necessary. In this study, a new three-dimensional (3D) spatial access method, the Voronoi Classified and Clustered Data Constellation (VOR-CCDC), is introduced. The VOR-CCDC data structure operates on the basis of two filters, classification and clustering. To boost the performance of data retrieval, VOR-CCDC offers a minimal percentage of overlap among nodes and a minimal coverage area in order to avoid repetitive data entry and multi-path queries. In addition, the VOR-CCDC data structure is supplemented with an extra element of nearest neighbour information. Encoded neighbouring information in the Voronoi diagram allows VOR-CCDC to explore the data optimally. Three types of nearest neighbour queries are presented in this study to verify VOR-CCDC's ability to find nearest neighbour information: the single-search nearest neighbour query, the k nearest neighbour (kNN) query and the reverse k nearest neighbour (RkNN) query. Each query is tested with two types of 3D datasets, single-layer and multi-layer. The tests demonstrate that VOR-CCDC performs less input/output than its best competitor, the 3D R-tree. VOR-CCDC is also evaluated for query performance; the results indicate that it outperforms its competitor by responding 60 to 80 percent faster to query operations. 
In the future, the VOR-CCDC structure is expected to be extended to temporal and dynamic objects. It could also be used in other applications, such as a brain cell database for analysing the spatial arrangement of neurons, or for analysing protein chain reactions in bioinformatics applications.
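The three query types can be pinned down with brute-force baselines; these sketches define the semantics that an index such as VOR-CCDC answers efficiently via encoded Voronoi neighbours, and the toy 3-D building centroids are made up:

```python
from math import dist  # Euclidean distance, Python 3.8+

def knn(points, q, k):
    """k nearest neighbours of q by Euclidean distance (brute force);
    k = 1 is the single-search nearest neighbour query."""
    return sorted(points, key=lambda p: dist(p, q))[:k]

def rknn(points, q, k):
    """Reverse kNN: the points that would count q among their own k
    nearest neighbours if q were inserted (brute-force baseline)."""
    return [p for p in points
            if q in knn([r for r in points if r != p] + [q], p, k)]

buildings = [(0, 0, 0), (1, 0, 0), (5, 5, 5), (6, 5, 5)]  # toy centroids
query = (0.4, 0, 0)
```

Note that the RkNN result is not simply the kNN result: a point belongs to RkNN(q) only if q is near from *its* perspective, which is why the baseline re-runs kNN once per point.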
R*-Grove: Balanced Spatial Partitioning for Large-scale Datasets
The rapid growth of big spatial data has urged the research community to develop several big spatial data systems. Regardless of their architecture, one of the fundamental requirements of all these systems is to spatially partition the data efficiently across machines. The core challenge of big spatial partitioning is to build partitions of high spatial quality while simultaneously taking advantage of distributed processing models by providing load-balanced partitions. Previous work on big spatial partitioning reuses existing index search trees as-is, e.g., the R-tree family, STR, Kd-tree, and Quad-tree, by building a temporary tree for a sample of the input and using its leaf nodes as partition boundaries. However, we show in this paper that none of those techniques addresses the mentioned challenges completely. This paper proposes a novel partitioning method, termed R*-Grove, which can partition very large spatial datasets into high-quality partitions with excellent load balance and block utilization. This appealing property allows R*-Grove to outperform existing techniques in spatial query processing. R*-Grove can be easily integrated into any big data platform such as Apache Spark or Apache Hadoop. Our experiments show that R*-Grove outperforms the existing partitioning techniques for big spatial data systems. With all the proposed work publicly available as open source, we envision that R*-Grove will be adopted by the community to better serve big spatial data research.
Comment: 29 pages, to be published in Frontiers in Big Data
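The sample-based scheme the abstract describes as prior work can be made concrete with Sort-Tile-Recursive (STR), one of the named baselines: sort a sample by x into vertical slices, sort each slice by y, and cut tiles whose boundaries become the partition boundaries. This is a sketch of the generic STR idea, not the R*-Grove algorithm itself, and the grid of sample points is made up:

```python
from math import ceil, sqrt

def str_partition(points, capacity):
    """Sort-Tile-Recursive partitioning of 2-D points. Returns tiles of
    at most `capacity` points; their bounding boxes would serve as
    partition boundaries in a sample-based partitioner."""
    n = len(points)
    slices = ceil(sqrt(n / capacity))      # number of vertical slices
    per_slice = ceil(n / slices)
    by_x = sorted(points)                  # slice along x first
    tiles = []
    for i in range(0, n, per_slice):
        strip = sorted(by_x[i:i + per_slice], key=lambda p: p[1])
        for j in range(0, len(strip), capacity):
            tiles.append(strip[j:j + capacity])
    return tiles

# 4x4 sample grid, capacity 4 -> four balanced quadrant-like tiles.
tiles = str_partition([(x, y) for x in range(4) for y in range(4)], 4)
```

STR yields perfectly load-balanced tiles on uniform samples, but, as the paper argues, tile quality degrades on skewed data, which is the gap R*-Grove targets.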
Visualization and analysis of cancer genome sequencing studies
Large-scale genomics projects such as the Cancer Genome Atlas (TCGA) and the Encyclopedia of DNA Elements (ENCODE) involve the generation of data at an unprecedented scale, requiring new computational techniques for analysis and interpretation. In the three studies I present in this thesis, I utilize these data sources to derive biological insights or create visualization tools that enable others to obtain insights more easily. First, I examine the distribution of lengths of copy number variations (CNVs) in the cancer genome. This analysis shows that a small number of genes are altered at a greater frequency than expected from a power law distribution, suggesting that a large number of genomes must be sequenced for a given tumor type to achieve a comprehensive discovery of somatic mutations. Second, I investigate germline CNVs in thousands of TCGA samples using single nucleotide polymorphism (SNP) array data to find variants that may confer increased susceptibility to cancer. This CNV-based genome-wide association study identified many germline CNVs that potentially increase risk in brain, breast, colorectal, renal, or ovarian cancers. Finally, I apply several visualization techniques to create tools for the TCGA and ENCODE projects in order to help investigators better process and synthesize meaning from large volumes of data. Seqeyes combines linear and circular genomic views to explore predicted structural variations to help guide experimental validation. The modEncode browser visualizes chromatin organization by integrating data from a multitude of histone marks and chromosomal proteins. These results present visualization as a useful strategy for rapid identification of salient genomic features from large, heterogeneous genomic datasets.
Efficient bulk-loading methods for temporal and multidimensional index structures
Nearly all scientific fields benefit from the latest methods for analysing and processing large volumes of data. These methods presuppose the efficient processing of spatial and temporal data, since time and position are important attributes of many datasets. Efficient query processing is enabled in particular by the use of index structures. This thesis focuses on two index structures: the Multiversion B-tree (MVBT) and the R-tree. The first is used to manage temporal data, the second to index multidimensional rectangle data.
The ever and rapidly growing volume of data poses a major challenge for computer science. Building and updating indexes with conventional methods (record by record) is no longer efficient. To enable timely and cost-efficient data processing, methods for the fast bulk loading of index structures are urgently needed.
In the first part of the thesis we address the question of whether a method for loading the MVBT exists that has the same I/O complexity as external sorting. Until now, this question remained unanswered. In this thesis we developed a new construction method and showed that it has the same time complexity as external sorting. To this end we employed two algorithmic techniques: weight balancing and buffer trees. Our experiments show that the result is not only of theoretical significance.
In the second part of the thesis we address the question of whether and how statistical information about spatial queries can be exploited to improve the query performance of R-trees. Our new method uses information such as the aspect ratio and side lengths of a representative query rectangle to build a good R-tree with respect to a commonly used cost model. If this information is not available, we optimize R-trees with respect to the sum of the volumes of the minimum bounding rectangles of the leaf nodes. Since the problem of building optimal R-trees with respect to this cost measure is NP-hard, we first reduce it to a one-dimensional partitioning problem by sorting the data according to optimized space-filling curves. We then solve this problem using dynamic programming. The I/O complexity of the method equals that of external sorting, since the I/O running time of the method is dominated by the running time of the sorting.
In the last part of the thesis we used the developed partitioning methods to build spatial histograms, since these, like R-trees, create a disjoint partitioning of space. Results of extensive experiments show that, using the new partitioning techniques, both R-trees with better query performance and spatial histograms with better estimation quality can be generated compared to competing methods.
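The reduction described in the second part — sort the rectangles along a space-filling curve, then choose leaf boundaries by dynamic programming — can be sketched for the 2-D case as follows. This is an illustrative sketch, not the thesis's implementation; the leaf-capacity bounds `lo` and `hi` and the sample boxes are assumptions:

```python
def area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def partition(boxes, lo, hi):
    """Dynamic program over a curve-sorted sequence of boxes: choose
    leaf boundaries (between lo and hi boxes per leaf) minimising the
    total area of leaf MBRs. cost[i] is the best total for boxes[:i];
    assumes a feasible split into leaves of size lo..hi exists."""
    n = len(boxes)
    INF = float("inf")
    cost = [INF] * (n + 1)
    cut = [0] * (n + 1)
    cost[0] = 0.0
    for i in range(1, n + 1):
        mbr = boxes[i - 1]
        for k in range(1, min(hi, i) + 1):   # candidate leaf boxes[i-k:i]
            mbr = union(mbr, boxes[i - k])   # grow the leaf MBR backwards
            if k < lo:
                continue                     # leaf would be underfull
            c = cost[i - k] + area(mbr)
            if c < cost[i]:
                cost[i], cut[i] = c, i - k
    leaves, i = [], n                        # recover the chosen cuts
    while i > 0:
        leaves.append(boxes[cut[i]:i])
        i = cut[i]
    return leaves[::-1]

# Three coincident unit boxes, then two distant ones: the DP prefers
# the 3+2 split over 2+3, which would merge the two clusters.
leaves = partition([(0, 0, 1, 1)] * 3 + [(5, 5, 6, 6)] * 2, lo=2, hi=3)
```

Each of the n positions considers at most `hi` predecessors, so the DP runs in O(n · hi) time once the curve sort is done, consistent with the claim that the I/O cost is dominated by sorting.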