4 research outputs found

    GPU-aided edge computing for processing the k nearest-neighbor query on SSD-resident data

    Edge computing aims at improving performance by storing and processing data closer to their source. The k Nearest-Neighbor (k-NN) query is a common spatial query in several applications. For example, this query can be used for distance classification of a group of points against a big reference dataset to derive the dominating feature class. Typically, GPU devices have much larger numbers of processing cores than CPUs and faster device memory than the main memory accessed by CPUs, thus providing higher computing power. However, since device and/or main memory may not be able to host an entire reference dataset, the use of secondary storage is inevitable. Solid State Disks (SSDs) could be used for storing such a dataset. In this paper, we propose an architecture of a distributed edge-computing environment where large-scale processing of the k-NN query can be accomplished by executing an efficient algorithm for processing the k-NN query on its (GPU and SSD enabled) edge nodes. We also propose a new algorithm for this purpose, a GPU-based partitioning algorithm for processing the k-NN query on big reference data stored on SSDs. We implement this algorithm in a GPU-enabled edge-computing device, hosting reference data on an SSD. Using synthetic datasets, we present an extensive experimental performance comparison of the new algorithm against two existing ones (working on memory-resident data) proposed by other researchers and two existing ones (working on SSD-resident data) recently proposed by us. The new algorithm excels in all the conducted experiments and outperforms its competitors.

    Efficient Distance Join Query Processing in Distributed Spatial Data Management Systems

    Due to the ubiquitous use of spatial data applications and the large amounts of such data these applications use, the processing of large-scale distance joins in distributed systems is becoming increasingly popular. Distance Join Queries (DJQs) are important and frequently used operations in numerous applications, including data mining, multimedia and spatial databases. DJQs (e.g., k Nearest Neighbor Join Query, k Closest Pair Query, ε Distance Join Query, etc.) are costly operations, since they involve both the join and distance-based search, and performing DJQs efficiently is a challenging task. Recent Big Data developments have motivated the emergence of novel technologies for distributed processing of large-scale spatial data in clusters of computers, leading to Distributed Spatial Data Management Systems (DSDMSs). Distributed cluster-based computing systems can be classified as Hadoop-based or Spark-based systems. Based on this classification, in this paper, we compare two of the most recent and leading DSDMSs, SpatialHadoop and LocationSpark, by evaluating the performance of several existing and newly proposed parallel and distributed DJQ algorithms under various settings with large spatial real-world datasets. A general conclusion arising from the execution of the distributed DJQ algorithms studied is that, while SpatialHadoop is a robust and efficient system when large spatial datasets are joined (since it is built on top of the mature Hadoop platform), LocationSpark is the clear winner in total execution time efficiency when medium spatial datasets are combined (due to in-memory processing provided by Spark). However, LocationSpark requires higher memory allocation when large spatial datasets are involved in DJQs (even more so when k and ε are large). Finally, this detailed performance study has demonstrated that the new distributed DJQ algorithms we have proposed are efficient, robust and scalable with respect to different parameters, such as dataset sizes, k, ε and number of computing nodes.

    Streaming Data Algorithm Design for Big Trajectory Data Analysis

    Trajectory streams consist of large volumes of time-stamped spatial data that are constantly generated from diverse and geographically distributed sources. Discovery of traveling patterns on trajectory streams, such as gathering and companies, needs to process each record when it arrives and correlate across multiple records in near real-time. Thus, techniques for handling high-speed trajectory streams should scale on distributed cluster computing. The main issues encapsulate three aspects, namely a data model to represent the continuous trajectory data, the parallelism of a discovery algorithm, and end-to-end performance improvement. In this thesis, I propose two parallel discovery methods, namely the snapshot model and the slot model, each of which consists of 1) a model for partitioning trajectories sampled on different time intervals; 2) a definition of distance measurements of trajectories; and 3) a parallel discovery algorithm. I develop these methods in a stream processing workflow. I evaluate the solution with a public dataset on an Amazon Web Services (AWS) cloud cluster. From a parallelization point of view, I investigate system performance, scalability and stability, and pinpoint the principal operations that contribute most to the run-time cost of computation and data shuffling. I improve data locality with fine-tuned data partition and data aggregation techniques. I observe that both models can scale on a cluster of nodes as the intensity of trajectory data streams grows. Generally, the snapshot model has higher throughput and thus lower latency, while the slot model produces more accurate trajectory discovery.