Parallel R-trees
We consider the problem of exploiting parallelism to accelerate the
performance of spatial access methods and specifically, R-trees [14].
Our goal is to design a server for spatial data so as to maximize
the throughput of range queries. This can be achieved by (a) maximizing
parallelism for large range queries, and (b) engaging as few disks
as possible on point queries [26].
We propose a simple hardware architecture consisting of one processor
with several disks attached to it. On this architecture, we propose to
distribute the nodes of a traditional R-tree, with cross-disk pointers
('Multiplexed' R-tree). The R-tree code is identical to the one for a
single-disk R-tree, with the only addition that we have to decide
which disk a newly created R-tree node should be stored in. We propose
and examine several criteria to choose a disk for a new node. The most
successful one, termed 'proximity index' or PI, estimates the similarity
of the new node with the other R-tree nodes already on a disk, and chooses
the disk with the lowest similarity. Experimental results show that our
scheme consistently outperforms all the other heuristics for node-to-disk
assignments, achieving up to 55% gains over the Round Robin one.
Experiments also indicate that the multiplexed R-tree with the PI heuristic
gives better response time than the disk-striping (="Super-node") approach,
and imposes lighter load on the I/O sub-system.
The speedup of our method is close to linear, increasing with
the size of the queries.
(Also cross-referenced as UMIACS-TR-92-1)
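The disk-selection rule described in the Parallel R-trees abstract (place a new node on the disk whose existing nodes are least similar to it) can be sketched as follows. This is only an illustration: the abstract does not define the proximity index, so the rectangle overlap area used here is a stand-in similarity measure, and the names `Rect`, `overlap_area`, and `choose_disk` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding rectangle of an R-tree node (2-D for brevity)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def overlap_area(a: Rect, b: Rect) -> float:
    """Stand-in similarity: area shared by two rectangles (assumption, not the paper's PI)."""
    w = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    h = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return max(w, 0.0) * max(h, 0.0)


def choose_disk(new_node: Rect, disks: list) -> int:
    """Return the index of the disk whose most similar node is least similar to new_node.

    `disks` is a list of lists of Rect, one inner list per disk.
    Ties are broken in favour of the disk holding fewer nodes.
    """
    def score(nodes):
        return max((overlap_area(new_node, n) for n in nodes), default=0.0)

    return min(range(len(disks)), key=lambda d: (score(disks[d]), len(disks[d])))


disks = [[Rect(0, 0, 2, 2)], [Rect(10, 10, 12, 12)]]
print(choose_disk(Rect(1, 1, 3, 3), disks))  # the new node overlaps only disk 0's node
```

Spreading similar nodes across disks means that a large range query, which tends to touch nodes near each other, is likely to hit several disks at once, while a point query still touches few.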
Multidimensional Range Queries on Modern Hardware
Range queries over multidimensional data are an important part of database
workloads in many applications. Their execution may be accelerated by using
multidimensional index structures (MDIS), such as kd-trees or R-trees. As for
most index structures, the usefulness of this approach depends on the
selectivity of the queries, and conventional wisdom has held that a simple scan beats
MDIS for queries accessing more than 15%-20% of a dataset. However, this wisdom
is largely based on evaluations that are almost two decades old, performed on
data being held on disks, applying IO-optimized data structures, and using
single-core systems. The question is whether this rule of thumb still holds
when multidimensional range queries (MDRQ) are performed on modern
architectures with large main memories holding all data, multi-core CPUs and
data-parallel instruction sets. In this paper, we study the question whether
and how much modern hardware influences the performance ratio between index
structures and scans for MDRQ. To this end, we conservatively adapted three
popular MDIS, namely the R*-tree, the kd-tree, and the VA-file, to exploit
features of modern servers and compared their performance to different flavors
of parallel scans using multiple (synthetic and real-world) analytical
workloads over multiple (synthetic and real-world) datasets of varying size,
dimensionality, and skew. We find that all approaches benefit considerably from
using main memory and parallelization, yet to varying degrees. Our evaluation
indicates that, on current machines, scanning should be favored over parallel
versions of classical MDIS even for very selective queries.
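For readers unfamiliar with the baseline in the abstract above, a multidimensional range query answered by a plain scan can be sketched as below. This is a minimal single-threaded illustration, not the parallel or data-parallel scans evaluated in the paper; `range_scan` and `selectivity` are hypothetical helper names.

```python
def range_scan(points, lower, upper):
    """Return every point p with lower[d] <= p[d] <= upper[d] in all dimensions d."""
    return [
        p for p in points
        if all(lo <= x <= hi for x, lo, hi in zip(p, lower, upper))
    ]


def selectivity(points, lower, upper):
    """Fraction of the dataset that the range query returns."""
    return len(range_scan(points, lower, upper)) / len(points)


points = [(1, 1), (2, 5), (3, 3), (9, 9)]
print(range_scan(points, (1, 1), (3, 3)))   # [(1, 1), (3, 3)]
print(selectivity(points, (1, 1), (3, 3)))  # 0.5
```

The selectivity value is the quantity the 15%-20% rule of thumb refers to: the fraction of the dataset a query touches.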
Comparing four methods for decision-tree induction: a case study on the invasive Iberian gudgeon (Gobio lozanoi; Doadrio & Madeira, 2004)
The invasion of freshwater ecosystems is a particularly alarming phenomenon in the Iberian Peninsula. Habitat suitability modelling is an effective approach to extract knowledge about species ecology and to guide adequate management actions. Decision-trees are an interpretable modelling technique widely used in ecology, able to handle strongly nonlinear relationships with high order interactions and diverse variable types. Decision-trees recursively split the input space into two parts maximising child node homogeneity. This recursive partitioning is typically performed with axis-parallel splits in a top-down fashion. However, two recently developed R packages, oblique.tree, which builds decision-trees from oblique splits, and evtree, which searches for globally optimal trees with evolutionary algorithms, seem to outperform the standard axis-parallel top-down algorithms, CART and C5.0. To evaluate their possible use in ecology, the two new partitioning algorithms were compared with the two well-known, standard axis-parallel algorithms. The entire process was performed in R by simultaneously tuning the decision-tree parameters and the variable subset with a genetic algorithm and modelling the presence-absence of the Iberian gudgeon (Gobio lozanoi; Doadrio & Madeira, 2004), an invasive fish species that has spread across the Iberian Peninsula. The accuracy and complexity of the trees, the modelled patterns of mesohabitat selection and the variable importance were compared. Neither of the new R packages, oblique.tree and evtree, outperformed the C5.0 algorithm. They rendered almost the same decision-trees as the CART algorithm; these trees were completely interpretable, with four to eight partitions, whereas C5.0 produced a more complex structure with 17 partitions.
Oblique.tree proved to be affected by prevalence, and it does not include the possibility of weighting the observations, which potentially discourages its actual use. Although the use of evtree did not suggest a major improvement compared with the remaining packages, it allowed the development of regression trees, which may be informative for additional modelling tasks such as abundance estimation. Looking at the resulting decision-trees, the optimal habitats for the Iberian gudgeon were large pools in lowland river segments with depositional areas and aquatic vegetation present, which typically appeared in the form of scattered macrophyte clumps. Furthermore, Iberian gudgeon seem to avoid habitats characterised by scouring phenomena and limited availability of vegetated cover. Accordingly, we can assume that river regulation and artificial impoundment would have favoured the spread of the Iberian gudgeon across the entire peninsula.
The study has been partially funded by the national Research project IMPADAPT (CGL2013-48424-C2-1-R) with MINECO (Spanish Ministry of Economy) and Feder funds and by the Confederacion Hidrografica del Jucar (Spanish Ministry of Agriculture, Food and Environment). This study was also supported in part by the University Research Administration Center of the Tokyo University of Agriculture and Technology. Finally, we are grateful to the colleagues who worked in the field data collection, especially Juan Diego Alcaraz-Henandez, Rui M. S. Costa and Aina Hernandez.
Muñoz Mas, R.; Fukuda, S.; Vezza, P.; Martinez-Capel, F. (2016). Comparing four methods for decision-tree induction: a case study on the invasive Iberian gudgeon (Gobio lozanoi; Doadrio & Madeira, 2004). Ecological Informatics. 34:22-34. https://doi.org/10.1016/j.ecoinf.2016.04.011
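The axis-parallel, top-down partitioning mentioned in the abstract above (split the input space in two so that the child nodes are as homogeneous as possible) can be illustrated with a single split search. This is a sketch in Python rather than the R packages used in the study, and it assumes Gini impurity as the homogeneity measure; `gini` and `best_axis_parallel_split` are hypothetical names.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_axis_parallel_split(X, y):
    """Exhaustively search for the split 'feature f <= threshold t' that
    minimises the size-weighted impurity of the two child nodes.

    Returns (weighted_impurity, feature_index, threshold), or None if no
    feature has more than one distinct value.
    """
    n = len(X)
    best = None
    for f in range(len(X[0])):
        # The largest value is skipped because it would leave the right child empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [y[i] for i in range(n) if X[i][f] <= t]
            right = [y[i] for i in range(n) if X[i][f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


X = [(1, 5), (2, 1), (3, 4), (4, 2)]
y = [0, 0, 1, 1]
print(best_axis_parallel_split(X, y))  # (0.0, 0, 2): feature 0 separates the classes
```

A top-down algorithm applies this search recursively to each child node; oblique and globally optimised trees differ in that they split on combinations of variables or search over whole trees instead of one greedy split at a time.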
Implementation and Analysis of Computation Time of the Random Forest Algorithm with Parallel Computing in R
Random forest is a method that builds a model by combining decision trees generated from bootstrap samples and random features. A common problem when implementing random forest is long processing time, because it uses a large amount of data and builds many trees on a single processor. This research proposes a random forest method with parallel computing, implemented in the R programming language. The cases used in this research are the Iris flower dataset, the wine quality dataset, and the Pima Indian women diabetes diagnosis data. Overall, the results show that the computation time when running random forest with parallel computing is shorter than when running a regular random forest that uses only a single processor.
Keywords: decision trees, random forest, parallel computing, R programming language
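Because every tree in a random forest is trained on its own bootstrap sample, the trees are independent tasks and can be distributed across workers. The sketch below illustrates that pattern in Python rather than the R implementation the paper describes; it uses one-split decision stumps in place of full trees, and all function names are hypothetical. A thread pool is shown for portability; for CPU-bound training in CPython a process pool would typically be needed to see an actual speedup.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(job):
    """Fit one decision stump on a bootstrap sample, using one randomly chosen feature."""
    X, y, seed = job
    rng = random.Random(seed)                    # per-tree seed keeps results reproducible
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample n rows with replacement
    f = rng.randrange(len(X[0]))                 # random feature for this stump
    best = None
    for i in idx:
        t = X[i][f]                              # candidate threshold taken from the sample
        left = [y[j] for j in idx if X[j][f] <= t]
        right = [y[j] for j in idx if X[j][f] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
        errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
        if best is None or errors < best[0]:
            best = (errors, f, t, l_lab, r_lab)
    return best[1:]                              # (feature, threshold, left label, right label)


def fit_forest(X, y, n_trees, executor=None):
    """Train n_trees stumps, serially or through any executor that provides map()."""
    jobs = [(X, y, seed) for seed in range(n_trees)]
    mapper = executor.map if executor is not None else map
    return list(mapper(train_stump, jobs))


def predict(forest, x):
    """Majority vote over all stumps."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


X = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (1.2, 1.1)]
y = [0, 0, 1, 1]
with ThreadPoolExecutor(max_workers=4) as pool:
    forest = fit_forest(X, y, n_trees=8, executor=pool)
print(len(forest), predict(forest, X[0]))
```

Seeding each tree separately means the parallel run produces the same forest as the serial one, so any timing comparison between the two measures only the scheduling, not a different model.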
Capacitated Vehicle Routing with Non-Uniform Speeds
The capacitated vehicle routing problem (CVRP) involves distributing
(identical) items from a depot to a set of demand locations, using a single
capacitated vehicle. We study a generalization of this problem to the setting
of multiple vehicles having non-uniform speeds (that we call Heterogenous
CVRP), and present a constant-factor approximation algorithm.
The technical heart of our result lies in achieving a constant approximation
to the following TSP variant (called Heterogenous TSP). Given a metric denoting
distances between vertices, a depot r containing k vehicles with possibly
different speeds, the goal is to find a tour for each vehicle (starting and
ending at r), so that every vertex is covered in some tour and the maximum
completion time is minimized. This problem is precisely Heterogenous CVRP when
vehicles are uncapacitated.
The presence of non-uniform speeds introduces difficulties for employing
standard tour-splitting techniques. In order to get a better understanding of
this technique in our context, we appeal to ideas from the 2-approximation for
scheduling on parallel machines by Lenstra et al. This motivates the
introduction of a new approximate MST construction called Level-Prim, which is
related to Light Approximate Shortest-path Trees. The last component of our
algorithm involves partitioning the Level-Prim tree and matching the resulting
parts to vehicles. This decomposition is more subtle than usual since now we
need to enforce correlation between the size of the parts and their distances
to the depot.
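The Heterogenous TSP objective described above (every vehicle starts and ends at the depot, and the maximum completion time is minimized) can be made concrete by evaluating a candidate solution. The sketch below only computes the objective value; it does not implement the paper's approximation algorithm. It assumes that a vehicle's completion time is its tour length divided by its speed, and the helper names are hypothetical.

```python
def tour_length(tour, dist):
    """Total length of a tour given as a vertex sequence that starts and ends at the depot."""
    return sum(dist(a, b) for a, b in zip(tour, tour[1:]))


def max_completion_time(tours, speeds, dist):
    """Objective value: the latest finishing time over all vehicles.

    tours[i] is the route of vehicle i and speeds[i] its speed;
    completion time is assumed to be tour length / speed.
    """
    return max(tour_length(t, dist) / s for t, s in zip(tours, speeds))


# Vertices on a line, depot at 0; the faster vehicle takes the farther vertex.
def line_dist(a, b):
    return abs(a - b)


tours = [[0, 4, 0], [0, -1, 0]]
speeds = [2, 1]
print(max_completion_time(tours, speeds, line_dist))  # 4.0
```

Swapping the two tours in this example gives completion times of 8 and 1, which shows why the parts assigned to vehicles must be matched to speeds and to distances from the depot.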