
    Parallel R-trees

    We consider the problem of exploiting parallelism to accelerate spatial access methods, specifically R-trees [14]. Our goal is to design a server for spatial data that maximizes the throughput of range queries. This can be achieved by (a) maximizing parallelism for large range queries and (b) engaging as few disks as possible on point queries [26]. We propose a simple hardware architecture consisting of one processor with several disks attached to it. On this architecture, we propose to distribute the nodes of a traditional R-tree across the disks, with cross-disk pointers (the 'Multiplexed' R-tree). The R-tree code is identical to that of a single-disk R-tree; the only addition is that we must decide on which disk a newly created R-tree node should be stored. We propose and examine several criteria for choosing a disk for a new node. The most successful one, termed 'proximity index' or PI, estimates the similarity of the new node to the other R-tree nodes already on a disk and chooses the disk with the lowest similarity. Experimental results show that our scheme consistently outperforms all the other heuristics for node-to-disk assignment, achieving gains of up to 55% over Round Robin. Experiments also indicate that the multiplexed R-tree with the PI heuristic gives better response time than the disk-striping (="Super-node") approach and imposes a lighter load on the I/O subsystem. The speed-up of our method is close to linear and increases with the size of the queries. (Also cross-referenced as UMIACS-TR-92-1
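
The node-to-disk decision described in this abstract can be illustrated with a minimal Python sketch. The abstract does not define the proximity index, so the `similarity` function below is a stand-in (total overlap area between bounding rectangles) and is an assumption, not the paper's measure; the names `choose_disk`, `similarity`, and `overlap_area` and the tie-breaking rule are likewise hypothetical.

```python
from typing import Dict, List, Tuple

# A minimum bounding rectangle as (xmin, ymin, xmax, ymax).
Rect = Tuple[float, float, float, float]


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles (0.0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def similarity(new_node: Rect, disk_nodes: List[Rect]) -> float:
    """Stand-in similarity: total overlap with the nodes already on a disk.

    Assumption: this is only an illustrative proxy, NOT the paper's
    proximity index, which the abstract does not define.
    """
    return sum(overlap_area(new_node, r) for r in disk_nodes)


def choose_disk(new_node: Rect, disks: Dict[int, List[Rect]]) -> int:
    """Pick the disk whose resident nodes are least similar to new_node.

    Ties are broken by the smaller node count, then by the smaller disk id
    (a hypothetical rule chosen for determinism).
    """
    return min(disks, key=lambda d: (similarity(new_node, disks[d]), len(disks[d]), d))


# Three disks, each already holding some R-tree node rectangles.
disks = {
    0: [(0, 0, 4, 4)],
    1: [(10, 10, 14, 14)],
    2: [(0, 0, 1, 1), (20, 20, 22, 22)],
}
new_node = (1, 1, 5, 5)
target = choose_disk(new_node, disks)
disks[target].append(new_node)
print(target)
```

Placing a node away from the nodes it resembles means that a large range query touching several nearby nodes tends to read them from different disks at the same time, which is the parallelism the abstract aims for.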

    Multidimensional Range Queries on Modern Hardware

    Range queries over multidimensional data are an important part of database workloads in many applications. Their execution may be accelerated by using multidimensional index structures (MDIS), such as kd-trees or R-trees. As with most index structures, the usefulness of this approach depends on the selectivity of the queries, and conventional wisdom holds that a simple scan beats MDIS for queries accessing more than 15%-20% of a dataset. However, this wisdom is largely based on evaluations that are almost two decades old, performed on disk-resident data, applying IO-optimized data structures, and using single-core systems. The question is whether this rule of thumb still holds when multidimensional range queries (MDRQ) are performed on modern architectures with large main memories holding all data, multi-core CPUs, and data-parallel instruction sets. In this paper, we study whether and how much modern hardware influences the performance ratio between index structures and scans for MDRQ. To this end, we conservatively adapted three popular MDIS, namely the R*-tree, the kd-tree, and the VA-file, to exploit features of modern servers and compared their performance to different flavors of parallel scans using multiple (synthetic and real-world) analytical workloads over multiple (synthetic and real-world) datasets of varying size, dimensionality, and skew. We find that all approaches benefit considerably from using main memory and parallelization, yet to varying degrees. Our evaluation indicates that, on current machines, scanning should be favored over parallel versions of classical MDIS even for very selective queries.
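
To make the terms "scan" and "selectivity" concrete, here is a minimal Python sketch of a partitioned scan answering a multidimensional range query. It is not the paper's implementation: the function names, the chunking scheme, and the toy grid data are assumptions, and the thread pool only illustrates how a scan splits into independent partitions (it says nothing about the performance results reported above).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def in_range(p: Point, lower: Sequence[float], upper: Sequence[float]) -> bool:
    """True if lower[i] <= p[i] <= upper[i] holds in every dimension."""
    return all(lo <= x <= hi for x, lo, hi in zip(p, lower, upper))


def scan_chunk(chunk: List[Point], lower: Sequence[float], upper: Sequence[float]) -> List[Point]:
    """Check every point of one partition against the query box."""
    return [p for p in chunk if in_range(p, lower, upper)]


def partitioned_scan(points: List[Point], lower: Sequence[float],
                     upper: Sequence[float], workers: int = 4) -> List[Point]:
    """Split the data into contiguous chunks and scan each chunk independently."""
    size = max(1, -(-len(points) // workers))  # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: scan_chunk(c, lower, upper), chunks)
        return [p for part in parts for p in part]


# Toy dataset: a 10 x 10 grid of 2-D points (hypothetical data).
points = [(float(x), float(y)) for x in range(10) for y in range(10)]
hits = partitioned_scan(points, lower=(2, 3), upper=(4, 5))
selectivity = len(hits) / len(points)  # fraction of the dataset the query returns
print(len(hits), selectivity)
```

A scan always reads every point, so its cost is independent of selectivity, whereas an index tries to skip regions that cannot match; the abstract's question is where the break-even point between the two lies on current hardware.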

    Comparing four methods for decision-tree induction: a case study on the invasive Iberian gudgeon (Gobio lozanoi; Doadrio & Madeira, 2004)

    The invasion of freshwater ecosystems is a particularly alarming phenomenon in the Iberian Peninsula. Habitat suitability modelling is a proficient approach to extract knowledge about species ecology and to guide adequate management actions. Decision-trees are an interpretable modelling technique widely used in ecology, able to handle strongly nonlinear relationships with high-order interactions and diverse variable types. Decision-trees recursively split the input space into two parts, maximising child node homogeneity. This recursive partitioning is typically performed with axis-parallel splits in a top-down fashion. However, the recently developed R packages oblique.tree, which builds decision-trees with oblique splits, and evtree, which searches for globally optimal trees with evolutionary algorithms, seem to outperform the standard axis-parallel top-down algorithms, CART and C5.0. To evaluate their possible use in ecology, the two new partitioning algorithms were compared with the two well-known, standard axis-parallel algorithms. The entire process was performed in R by simultaneously tuning the decision-tree parameters and the variable subset with a genetic algorithm and modelling the presence-absence of the Iberian gudgeon (Gobio lozanoi; Doadrio & Madeira, 2004), an invasive fish species that has spread across the Iberian Peninsula. The accuracy and complexity of the trees, the modelled patterns of mesohabitat selection, and the variable importance were compared. Neither of the new R packages, oblique.tree and evtree, outperformed the C5.0 algorithm. They rendered almost the same decision-trees as the CART algorithm. Although these trees were completely interpretable, they performed only four to eight partitions, whereas C5.0 produced a more complex structure with 17 partitions. Oblique.tree proved to be affected by prevalence and does not allow the observations to be weighted, which may discourage its use. Although evtree did not suggest a major improvement over the remaining packages, it allowed the development of regression trees, which may be informative for additional modelling tasks such as abundance estimation. According to the resulting decision-trees, the optimal habitats for the Iberian gudgeon were large pools in lowland river segments with depositional areas and aquatic vegetation present, typically in the form of scattered macrophyte clumps. Furthermore, the Iberian gudgeon seems to avoid habitats characterised by scouring phenomena and limited availability of vegetated cover. Accordingly, we can assume that river regulation and artificial impoundment would have favoured the spread of the Iberian gudgeon across the entire peninsula.
    The study has been partially funded by the national Research project IMPADAPT (CGL2013-48424-C2-1-R) with MINECO (Spanish Ministry of Economy) and Feder funds and by the Confederacion Hidrografica del Jucar (Spanish Ministry of Agriculture, Food and Environment). This study was also supported in part by the University Research Administration Center of the Tokyo University of Agriculture and Technology. Finally, we are grateful to the colleagues who worked in the field data collection, especially Juan Diego Alcaraz-Henandez, Rui M. S. Costa and Aina Hernandez.
    Muñoz Mas, R.; Fukuda, S.; Vezza, P.; Martinez-Capel, F. (2016). Comparing four methods for decision-tree induction: a case study on the invasive Iberian gudgeon (Gobio lozanoi; Doadrio & Madeira, 2004). Ecological Informatics. 34:22-34. https://doi.org/10.1016/j.ecoinf.2016.04.011
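
The abstract above describes standard decision-tree induction as recursive, axis-parallel splitting that maximises child node homogeneity. A minimal Python sketch of one such split follows. It assumes Gini impurity as the homogeneity measure (the abstract does not name one), it is not the procedure of oblique.tree or evtree, and the feature names and presence-absence data are invented for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a set of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_axis_parallel_split(
    X: List[Tuple[float, ...]], y: List[int]
) -> Optional[Tuple[int, float, float]]:
    """Return (feature index, threshold, weighted child impurity), or None.

    Every candidate split tests a single feature against a threshold
    (x[j] <= t), which is what makes it axis-parallel.
    """
    n = len(y)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # midpoint between consecutive observed values
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


# Hypothetical data: (depth in m, velocity in m/s) -> presence (1) or absence (0).
X = [(0.25, 0.9), (0.5, 0.1), (1.0, 0.8), (1.25, 0.2), (1.5, 0.7), (1.75, 0.3)]
y = [0, 0, 1, 1, 1, 1]
split = best_axis_parallel_split(X, y)
print(split)  # (0, 0.75, 0.0): splitting on depth at 0.75 m separates the classes
```

A full tree would apply this search recursively to each child. An oblique split would instead test a linear combination of features, and a globally optimal search would evaluate whole trees rather than choosing one greedy split at a time.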

    Implementation and Computation-Time Analysis of the Random Forest Algorithm with Parallel Computing in R

    Random forest is a method that builds a model by combining decision trees generated from bootstrap samples and randomly selected features. A common problem when implementing random forest is long processing time: the method works on large amounts of data and builds many trees, and it does so on a single processor. This research proposes a random forest method with parallel computing, implemented in the R programming language. The cases used in this research are the Iris flower dataset, the wine quality dataset, and the Pima Indian women's diabetes diagnosis data. Overall, the results show that the computation time of random forest with parallel computing is shorter than that of a regular random forest running on a single processor. Keywords: decision trees, random forest, parallel computing, R programming language
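
Because each tree depends only on its own bootstrap sample and random feature choice, the trees can be built independently, which is what makes the method easy to parallelise. The Python sketch below illustrates that structure under simplifying assumptions: the thesis used R, each "tree" here is a one-split stump, the data are invented, and all function names are hypothetical. A thread pool is used only to show the independent tasks; in CPython, CPU-bound work usually needs a process pool to gain real speed.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fit_stump(X, y, seed):
    """Fit a one-split tree on a bootstrap sample, using one random feature."""
    rng = random.Random(seed)
    n = len(y)
    idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
    j = rng.randrange(len(X[0]))                # randomly selected feature
    majority = Counter(y[i] for i in idx).most_common(1)[0][0]
    # (feature, threshold, left label, right label, training errors)
    best = (j, float("inf"), majority, majority, n + 1)
    values = sorted(set(X[i][j] for i in idx))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = Counter(y[i] for i in idx if X[i][j] <= t)
        right = Counter(y[i] for i in idx if X[i][j] > t)
        l_lab = left.most_common(1)[0][0]
        r_lab = right.most_common(1)[0][0]
        errors = n - left[l_lab] - right[r_lab]
        if errors < best[4]:
            best = (j, t, l_lab, r_lab, errors)
    return best[:4]


def train_forest(X, y, n_trees=25, workers=4, seed=0):
    """Build the trees as independent tasks; each task gets its own seed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: fit_stump(X, y, s), range(seed, seed + n_trees)))


def predict(forest, x):
    """Majority vote over all trees."""
    votes = Counter(l if x[j] <= t else r for j, t, l, r in forest)
    return votes.most_common(1)[0][0]


# Hypothetical two-class data with well-separated clusters.
X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2),
     (5.0, 5.0), (5.2, 4.8), (4.9, 5.1), (5.1, 5.3)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(X, y)
print(predict(forest, (1.0, 1.0)), predict(forest, (5.0, 5.0)))
```

Giving every task its own seed keeps the result reproducible regardless of the order in which the workers finish.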

    Capacitated Vehicle Routing with Non-Uniform Speeds

    The capacitated vehicle routing problem (CVRP) involves distributing (identical) items from a depot to a set of demand locations, using a single capacitated vehicle. We study a generalization of this problem to the setting of multiple vehicles having non-uniform speeds (which we call Heterogenous CVRP), and present a constant-factor approximation algorithm. The technical heart of our result lies in achieving a constant approximation to the following TSP variant (called Heterogenous TSP). Given a metric denoting distances between vertices and a depot r containing k vehicles with possibly different speeds, the goal is to find a tour for each vehicle (starting and ending at r), so that every vertex is covered in some tour and the maximum completion time is minimized. This problem is precisely Heterogenous CVRP when vehicles are uncapacitated. The presence of non-uniform speeds introduces difficulties for employing standard tour-splitting techniques. In order to get a better understanding of this technique in our context, we appeal to ideas from the 2-approximation for scheduling on parallel machines of Lenstra et al. This motivates the introduction of a new approximate MST construction called Level-Prim, which is related to Light Approximate Shortest-path Trees. The last component of our algorithm involves partitioning the Level-Prim tree and matching the resulting parts to vehicles. This decomposition is more subtle than usual, since we now need to enforce correlation between the size of the parts and their distances to the depot.
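
The Heterogenous TSP objective stated above (minimize the maximum completion time over all vehicles) can be evaluated for a given set of tours with a few lines of Python. This sketch only computes the objective; it is not the approximation algorithm or the Level-Prim construction. It assumes a Euclidean metric and that a vehicle's completion time equals its tour length divided by its speed; the vehicle names and coordinates are invented.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def tour_length(depot: Point, stops: List[Point]) -> float:
    """Length of the closed tour depot -> stops (in order) -> depot."""
    route = [depot] + stops + [depot]
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))


def makespan(depot: Point, tours: Dict[str, List[Point]], speeds: Dict[str, float]) -> float:
    """Maximum completion time over all vehicles (tour length / speed)."""
    return max(tour_length(depot, stops) / speeds[v] for v, stops in tours.items())


depot = (0.0, 0.0)
tours = {
    "fast": [(3.0, 4.0), (3.0, 0.0)],  # tour length 5 + 4 + 3 = 12
    "slow": [(0.0, 1.0)],              # tour length 1 + 1 = 2
}
speeds = {"fast": 2.0, "slow": 1.0}
print(makespan(depot, tours, speeds))  # 6.0: the fast vehicle finishes last
```

The example shows why speeds matter for the partition: the fast vehicle covers six times the distance of the slow one yet finishes only three times later, so a good solution gives faster vehicles proportionally longer tours.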