Individuals in Pittsburgh rule-based classifiers represent a complete solution to the classification problem, and each individual is a variable-length set of rules. Therefore, these systems usually demand a high level of computational resources and run-time, which increase with the complexity and size of the data sets. This computational cost is known to be mainly due to the recurring evaluation of the rules and of the individuals as rule sets. In this paper we propose a parallel evaluation model of rules and rule sets on GPUs based on the NVIDIA CUDA programming model, which significantly reduces the run-time and speeds up the algorithm. The results obtained from the experimental study support the efficiency and high performance of the GPU model, which is scalable to multiple GPU devices. The GPU model achieves a rule interpreter performance of up to 64 billion operations per second, and the evaluation of the individuals is sped up by up to 3,461× compared to the CPU model. This gives the GPU model a significant advantage, especially when addressing large and complex problems within reasonable time, where the CPU run-time is not acceptable.
Introduction
Evolutionary computation and its application to machine learning and data mining, and specifically to classification problems, has attracted the attention of researchers over the last decade [1, 2, 3, 4, 5]. Classification is a supervised machine learning task which consists of predicting the class membership of uncategorised examples using the properties of a set of training examples from which a classification model has been induced [6].
Rule-based classification systems are especially useful in applications and domains which require comprehensibility and clarity in the knowledge discovery process, expressing information in the form of IF-THEN classification rules. Evolutionary rule-based algorithms take advantage of fitness-biased generational inheritance evolution to obtain rule sets (classifiers) which cover the training examples and produce class predictions for new examples.
Rules are encoded into the individuals within the population of the algorithm in two different ways: individual = rule, or individual = set of rules. Most evolutionary rule-based algorithms follow the first approach due to its simplicity and efficiency, whereas the latter, also known as Pittsburgh-style algorithms, are used less often because they are considered to perform slowly [7]. However, Pittsburgh approaches have other advantages, such as providing individuals that are complete solutions to the problem and allowing relations between the rules to be considered within the evolutionary process.
The efficiency, computational cost, and run-time of Pittsburgh rule-based systems are a primary concern and a challenge for researchers [8, 9], especially when seeking scalability to large-scale databases [10, 11], which requires processing vast amounts of data within a reasonable amount of time. Therefore, it becomes crucial to design efficient parallel algorithms capable of handling these large amounts of data [12, 13, 14, 15].
Parallel implementations have been employed to speed up evolutionary algorithms, including multi-core and distributed computing [16, 17], master-slave models [18], and grid computing environments [19, 20]. Over the last few years, increasing attention has focused on graphics processing units (GPUs). GPUs are devices with multi-core architectures and massively parallel processor units, which provide fast parallel hardware for a fraction of the cost of a traditional parallel system. Since the introduction of the Compute Unified Device Architecture (CUDA) in 2007, researchers all over the world have harnessed the power of the GPU for general purpose GPU computing (GPGPU) [21, 22, 23, 24].
The use of GPGPU has already been studied for speeding up algorithms within the framework of evolutionary computation and data mining [25, 26, 27, 28], achieving high performance and promising results. Specifically, there are GPU-accelerated genetic rule-based systems for individual = rule approaches, which have been shown to achieve high performance [29, 30, 31]. Franco et al. [29] reported a speedup of up to 58× using the BioHEL system. Cano et al. [30] reported a speedup of up to 820×, considering a scalable model using multiple GPU devices. Augusto [31] reported a speedup of up to 100× compared to a single-threaded model, and almost 10× the throughput of a twelve-core CPU. These proposals all focus on speeding up individual = rule approaches. However, as far as we know, there are no GPU-based approaches to date using an individual = set of rules representation.
In this paper we present an efficient evaluation model for Pittsburgh individuals on GPUs which parallelises the fitness computation for both rules and rule sets, and is applicable to any individual = set of rules evolutionary algorithm. The GPU model is scalable to multiple GPU devices, which makes it possible to address larger data sets and population sizes. The rule interpreter, which checks the coverage of the rules over the instances, is carefully designed to maximize its efficiency compared to traditional stack-based rule interpreters. Experimental results demonstrate the high performance and efficiency of the proposed model, which achieves a rule interpreter performance of up to 64 billion operations per second. In addition, the individual evaluation achieves a speedup of up to 3,461× compared to the single-threaded CPU implementation, and a speedup of 1,311× versus the parallel CPU version using 12 threads.
This paper is organized as follows. In the next section, genetic rule-based systems and their encodings are introduced, together with the definition of the CUDA programming model on the GPU. Section 3 presents the GPU evaluation model and its implementation in CUDA kernels. Section 4 introduces the experimental study setup, whose results are given in Section 5. Finally, Section 6 collects some concluding remarks.
Background
This section introduces genetic rule-based systems and the encoding of the individuals. Then, the CUDA programming model on the GPU is presented.
Genetic rule-based systems
Genetic algorithms (GAs) evolve a population of individuals which correspond to candidate solutions to a problem. GAs have been used for learning rules (Genetic rule-based systems), including crisp and fuzzy rules, and they follow two approaches for encoding rules within a population.
The first one represents an individual as a single rule (individual = rule). The rule base is formed by combining several individuals from the population (rule cooperation) or via different evolutionary runs (rule competition). This representation results in three approaches:
• Michigan: they employ reinforcement learning and the GA is used to learn new rules that replace the older ones via competition through the evolutionary process. These systems are usually called learning classifier systems [32], such as XCS [33], UCS [34], Fuzzy-XCS [35], and Fuzzy-UCS [36].
• Iterative Rule Learning (IRL): individuals compete to be chosen in every GA run. The rule base is formed by the best rules obtained when the algorithm is run multiple times. SLAVE [37], SIA [38], and HIDER [39] are examples which follow this model.
• Genetic Cooperative-Competitive Learning (GCCL): the whole population or a subset of individuals encodes the rule base. In this model, the individuals compete and cooperate simultaneously. This approach makes it necessary to introduce a mechanism to maintain the diversity of the population in order to avoid a convergence of all the individuals in the population. GP-COACH [40] and COGIN [41] follow this approach.
The second one represents an individual as a complete set of rules (individual = set of rules), which is also known as the Pittsburgh approach. The main advantage of this approach compared to the first one is that it addresses the cooperation-competition problem, involving the interaction between rules in the evolutionary process [42, 43]. Pittsburgh systems (especially naive implementations) are slower, since they evolve more complex structures and they assign credit at a less specific (and hence less informative) level [44]. Moreover, one of their main problems is controlling the number of rules, which increases the complexity of the individuals, adds computational cost to their evaluation, and can become unmanageable. This problem is known as the bloat effect [45], i.e., uncontrolled growth of the size of the individuals.
One method based on this approach is the Memetic Pittsburgh Learning Classifier System (MPLCS) [8]. In order to avoid the bloat effect, it employs a rule deletion operator and a fitness function based on the minimum description length [46], which balances the complexity and accuracy of the rule set. Moreover, this system uses a windowing scheme [47] that reduces the run-time of the system by dividing the training set into many non-overlapping subsets over which the fitness is computed at each GA iteration.
CUDA programming model
Compute Unified Device Architecture (CUDA) [48] is a parallel computing architecture developed by NVIDIA that allows programmers to take advantage of the parallel computing capacity of NVIDIA GPUs in a general purpose manner. The CUDA programming model executes kernels as batches of parallel threads. These kernels comprise thousands to millions of lightweight GPU threads per kernel invocation.
CUDA threads are organised into thread blocks, which in turn form a grid. Thread blocks are executed on streaming multiprocessors. A streaming multiprocessor can perform zero-overhead scheduling to interleave warps (a warp is a group of threads that execute together) and hide the overhead of long-latency arithmetic and memory operations. The GPU architecture was rearranged from SIMD (Single Instruction, Multiple Data) to MIMD (Multiple Instruction, Multiple Data), which runs independent, separate program codes. Thus, up to 16 kernels can be executed concurrently as long as there are multiprocessors available. Moreover, asynchronous data transfers can be performed concurrently with the kernel executions. These two features speed up the execution compared to the sequential kernel pipeline and synchronous data transfers of previous GPU architectures.
There are four different main memory spaces: global, constant, shared, and local. These GPU memories are specialised and have different access times, lifetimes, and output limitations.
• Global memory: a large long-latency memory that exists physically as an off-chip dynamic device memory. Threads can read and write global memory to share data, and must write the kernel's output there for it to be readable after the kernel terminates. However, a better way to share data and improve performance is to take advantage of shared memory.
• Shared memory: a small low-latency memory that exists physically as on-chip memory. Its contents are only maintained during thread block execution and are discarded when the thread block completes. Kernels that read or write a known range of global memory with spatial or temporal locality can employ shared memory as a software-managed cache. Such caching potentially reduces global memory bandwidth demands and improves overall performance.
• Local memory: each thread also has its own local memory space as registers, so the number of registers a thread uses determines the number of concurrent threads executed in the multiprocessor, which is called multiprocessor occupancy. To avoid wasting hundreds of cycles while a thread waits for a long-latency global-memory load or store to complete, a common technique is to execute batches of global accesses, one per thread, exploiting the hardware's warp scheduling to overlap the threads' access latencies.
• Constant memory: this memory is specialised for situations in which many threads will read the same data simultaneously. This type of memory stores data written by the host thread, is accessed constantly, and does not change during the execution of the kernel. A value read from the constant cache is broadcast to all threads in a warp, effectively serving all loads from memory with a single-cache access. This enables a fast, single-ported cache to feed multiple simultaneous memory accesses.
There are some recommendations for improving the performance on a GPU [49]. Accesses to global memory should be coalesced. Global memory resides in device memory and is accessed via 32, 64, or 128-byte segment memory transactions. It is recommended to perform fewer but larger memory transactions. When a warp executes an instruction which accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions, depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. In general, the more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput accordingly.
To maximise global memory throughput, it is therefore important to maximise the coalescing, by following optimal access patterns, using data types that meet the size and alignment requirements, or padding data. For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size.
Parallel Pittsburgh evaluation on GPU
This section first introduces the encoding of the Pittsburgh individuals on the GPU. Then, it will present the evaluation procedure of an individual's rules. Finally, it will describe the evaluation process of an individual's fitness.
Pittsburgh individual encoding
Pittsburgh individuals are variable-length sets of rules which may include a default class prediction, which is of interest when decision lists [50] are used as the individual representation. Rules are one of the formalisms most often used to represent classifiers (decision trees can be easily converted into a rule set [51]). The IF part of the rule is called the antecedent and contains a combination of attribute-value conditions on the predicting attributes. The THEN part is called the consequent and contains the predicted value for the class. This way, a rule assigns a data instance to the class pointed out by the consequent if the values of the predicting attributes satisfy the conditions expressed in the antecedent. Rule specification can be formally defined by means of a context-free grammar [52] such as the one shown in Figure 1.
Traditional stack-based interpreters [53, 54] perform push and pop operations on a stack, involving the operators and operands found in the rule. The rule encoding we employ lets the interpreter read the rules from left to right while minimizing the number of push and pop operations on the stack. Attribute-value comparisons are expressed in prefix notation, which places operators to the left of their operands, whereas logical operators are expressed in postfix notation, in which the operator is placed after the operands. The interpreter thus avoids pushing or popping unnecessary operands and behaves as a finite-state machine.
For example, when interpreting the first rule of the individual from Figure 2, the interpreter reads the first element and finds the > operator. The interpreter knows the cardinality of the > operator, which has two operands. Thus, it directly computes > At_1 V_1 and pushes the result onto the stack. The next element is <, so it computes < At_2 V_2 and pushes the result. Finally, the AND operator is found: the interpreter pops the two operands from the stack and returns the result of the Boolean AND.
This interpreter model provides a natural representation which allows dealing with all types of logical operators with different cardinalities and operand types while keeping an efficient performance.
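The mixed prefix/postfix interpretation described in this section can be illustrated with a small CPU-side C++ sketch. It is not the paper's interpreter: the token layout (`Token`, `Op`), the function name `covers`, and the choice to reject malformed rules by returning false are all assumptions made for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical token kinds (illustrative, not the paper's encoding).
enum Op { OP_GT, OP_LT, OP_EQ, OP_AND, OP_OR, OP_NOT };

// A comparison token carries its operands inline (prefix form: op, attribute, value).
// Logical tokens ignore 'attribute' and 'value' and act on the stack (postfix form).
struct Token {
    Op op;
    std::size_t attribute;
    float value;
};

// Returns true if the antecedent encoded in 'rule' covers 'instance'.
// Malformed rules (assumption of this sketch) are treated as not covering.
bool covers(const std::vector<Token>& rule, const std::vector<float>& instance) {
    std::vector<char> stack;  // Boolean partial results only, never raw operands
    for (const Token& t : rule) {
        switch (t.op) {
        case OP_GT: stack.push_back(instance[t.attribute] > t.value); break;
        case OP_LT: stack.push_back(instance[t.attribute] < t.value); break;
        case OP_EQ: stack.push_back(instance[t.attribute] == t.value); break;
        case OP_NOT:
            if (stack.empty()) return false;
            stack.back() = !stack.back();
            break;
        case OP_AND:
        case OP_OR: {
            if (stack.size() < 2) return false;
            bool rhs = stack.back() != 0;
            stack.pop_back();
            bool lhs = stack.back() != 0;
            stack.back() = (t.op == OP_AND) ? (lhs && rhs) : (lhs || rhs);
            break;
        }
        }
    }
    return stack.size() == 1 && stack.back() != 0;
}
```

With this layout, the rule `> At_1 V_1 < At_2 V_2 AND` needs only two pushes and one pop pair, because each comparison is resolved as soon as its operator is read.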
Evaluation of particular rules
Rules within individuals must be evaluated over the instances of the data set in order to assign a fitness to the rules. The evaluation of the rules is divided into two steps, which are implemented in two GPU kernels. The first one, the coverage kernel, checks the coverage of the rules over the instances of the data set. The second one, the reduction kernel, performs a reduction count of the predictions of the rules, to compute the confusion matrix from which the fitness metrics for a classification rule can be obtained.
Rule coverage kernel
The coverage kernel executes the rule interpreter and checks whether the instances of the data set satisfy the conditions comprised in the rules within the individuals. The interpreter takes advantage of the efficient representation of the individuals described in Section 3.1 to implement an efficient stack-based procedure in which the partial results coming from the child nodes are pushed onto a stack and popped when necessary.
The interpreter behaves as a single task executed on the Single Instruction Multiple Data (SIMD) processor, while the rules and instances are treated as data. Therefore, the interpreter parallelises the fitness computation over individuals, rules, and instances. Each thread is responsible for the coverage of a single rule over a single instance, and stores in an array the result of matching the coverage against the actual class of the instance. Threads are grouped into a 3D grid of thread blocks, whose size depends on the number of individuals (width), instances (height), and rules (depth), as represented in Figure 3. Thus, a thread block is a collection of threads which interpret a common rule over a subset of different instances. This avoids divergence within the kernel, which is known to be one of the major efficiency problems of NVIDIA CUDA programming.
The number of threads per block is recommended to be a multiple of the warp size, usually 128, 192, 256, ..., up to 1024 threads per block. This number is important as it concerns the scalability of the model to future GPU devices with a larger number of processors. NVIDIA recommends running at least twice as many thread blocks as there are multiprocessors in the GPU, and provides an occupancy calculator which reports the GPU occupancy as a function of the register and shared memory pressure and the number of threads per block. Table 1 shows the GPU occupancy, which is to be maximized, for different block sizes. 192 threads per block is the best choice, since it achieves 100% occupancy and provides more active thread blocks per multiprocessor to hide the latency arising from register dependencies, and therefore gives the scheduler a wider range of possibilities for issuing concurrent blocks to the multiprocessors. Moreover, as long as the occupancy is maximal, the smaller the number of threads per block, the higher the number of blocks, which provides better scalability to future GPU devices capable of handling more active blocks concurrently. Scalability to multiple GPU devices is achieved by splitting the population across as many GPUs as are available, with each GPU responsible for evaluating a subset of the population.
Thread accesses to global memory must be coalesced to achieve maximum performance and memory throughput, using data types that meet the size and alignment requirements, or padding data arrays. For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size. Therefore, the results array employs intra-array padding to align the memory addresses to the memory transfer segment sizes [30, 55]. Since the number of threads per block is set to 192, the intra-array padding of the results array forces the memory alignment to 192 float values, i.e., 768 bytes. Thus, memory accesses are fully coalesced and the best throughput is achieved. Memory alignment and padding details can be found in Section 5.3.2 of the NVIDIA CUDA programming guide. Threads within a warp should request consecutive memory addresses, which can be serviced in fewer memory transactions. All the threads in a warp evaluate the same rule but over different instances. Thus, the data set must be stored transposed in memory so that the memory requests of the threads of a warp are fully coalesced.
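The padding and transposed layout described above come down to two index computations, sketched here in C++ on the host side. The helper names `paddedLength` and `transposedIndex` are hypothetical and do not appear in the paper; the sketch assumes an attribute-major array whose rows are padded to a multiple of the block width.

```cpp
#include <cstddef>

// Round 'n' up to the next multiple of 'alignment' (for example, the
// 192-thread block width), so that each row starts on an aligned boundary.
std::size_t paddedLength(std::size_t n, std::size_t alignment) {
    return ((n + alignment - 1) / alignment) * alignment;
}

// Attribute-major (transposed) layout: all instances of attribute 0 first,
// then all instances of attribute 1, and so on. Threads of a warp that handle
// consecutive instances of the same attribute therefore touch consecutive
// addresses. 'paddedInstances' is the padded row length.
std::size_t transposedIndex(std::size_t attribute, std::size_t instance,
                            std::size_t paddedInstances) {
    return attribute * paddedInstances + instance;
}
```

For a data set of 150 instances and an alignment of 192, each attribute row occupies 192 elements, which is 768 bytes when the elements are 4-byte floats.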
The code for the coverage kernel and the rule interpreter is shown in Listing 1. The coverage kernel receives four arrays as input: an array of attribute values, an array of class values of the instances of the data set, an array containing the rules to evaluate, and an array containing the consequents of the rules. It computes the matching results and returns them in an array. The result of matching the rule prediction against the actual class of an instance can take four possible values: true positive (TP), true negative (TN), false positive (FP), or false negative (FN). Threads and blocks within the kernel are identified by the built-in CUDA variables threadIdx, blockIdx and blockDim, which give the thread index, the block index, and the block dimensions, following the 3D representation shown in Figure 3. Further information about CUDA thread indices can be found in Section B.4 of the CUDA programming guide.
Rule fitness kernel
The rule fitness kernel calculates the fitness of the rules by means of the performance metrics obtained from the confusion matrix. The confusion matrix is a two-dimensional table which counts the number of true positives, false positives, true negatives, and false negatives resulting from the matching of a rule over the instances of the data set. There are many well-known performance metrics for classification, such as sensitivity, specificity, precision, recall, F-Measure, etc. The algorithm assigns the fitness values corresponding to the objective or objectives to optimize, e.g., to maximize both sensitivity and specificity at the same time.
The rule fitness kernel is implemented using a 2D grid of thread blocks, whose size depends on the number of individuals (width) and the number of rules (height). The kernel performs a parallel reduction operation over the matching results of the coverage kernel. A naive reduction operation sums the values of an array in parallel, iteratively reducing the information.
Our approach does not need to sum the values, but to count the number of TP, TN, FP, and FN. An O(log_2 N) parallel reduction is known to perform most efficiently on multi-core CPU processors with large arrays. However, our best results on GPUs were achieved using a 2-level parallel reduction with sequential addressing using 128 threads per block, which is shown in Figure 4. Accessing sequential memory addresses in parallel is more efficient than accessing non-contiguous addresses, since contiguous data are transferred in a single memory transaction and the accesses of the threads are coalesced.
Finally, the code for the rule fitness kernel is shown in Listing 2. The kernel takes as input the array of matching results and returns an array of fitness values. The 2-level parallel reduction takes advantage of GPU shared memory, in which threads within a block collaborate to compute partial counts of the confusion matrix values. Each thread is responsible for counting the results from the base index to the top index. Therefore, contiguous threads address contiguous memory indexes, achieving maximum throughput.
    confusionMatrix[0] += confusionMatrix[i];     // # true positives
    confusionMatrix[1] += confusionMatrix[i + 1]; // # true negatives
    confusionMatrix[2] += confusionMatrix[i + 2]; // # false positives
    confusionMatrix[3] += confusionMatrix[i + 3]; // # false negatives
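The two-level counting reduction can be emulated sequentially on the CPU to make the access pattern explicit. The C++ sketch below is an assumption-laden illustration, not Listing 2: the names `countMatches`, `kThreads`, and the `MATCH_*` codes are hypothetical, and the outer loop over `t` stands in for the 128 threads of a block that would run in parallel on the GPU.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical encoding of the matching results (same order as the listing).
enum Match : unsigned char { MATCH_TP = 0, MATCH_TN = 1, MATCH_FP = 2, MATCH_FN = 3 };

constexpr std::size_t kThreads = 128;  // threads per block, as in the text

// Counts TP, TN, FP and FN over 'results' in two levels.
std::array<unsigned, 4> countMatches(const std::vector<unsigned char>& results) {
    // Level 1: emulated thread t visits t, t + 128, t + 256, ... so that at
    // each step neighbouring threads read neighbouring addresses.
    std::vector<unsigned> partial(kThreads * 4, 0);  // stands in for shared memory
    for (std::size_t t = 0; t < kThreads; ++t) {
        for (std::size_t i = t; i < results.size(); i += kThreads) {
            ++partial[t * 4 + results[i]];
        }
    }
    // Level 2: accumulate the per-thread partial counts into the final matrix.
    std::array<unsigned, 4> total = {0, 0, 0, 0};
    for (std::size_t t = 0; t < kThreads; ++t) {
        for (std::size_t c = 0; c < 4; ++c) {
            total[c] += partial[t * 4 + c];
        }
    }
    return total;
}
```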

Evaluation of rule sets
Pittsburgh individuals encode sets of rules as complete solutions to the classification problem (classifiers). Many performance measures of a classifier can be evaluated using the confusion matrix. The standard performance measure for classification is the accuracy rate, which is the number of successful predictions relative to the total number of classifications.
The evaluation of the classifiers is divided into two steps, which are implemented in two GPU kernels. The first one, the classification kernel, performs the class prediction for the instances of the data set. The second one, the rule set fitness kernel, performs a reduction count of the classifier predictions to compute the confusion matrix, from which the fitness metrics for a classifier can be obtained.
Rule set classification kernel
The rule set classification kernel performs the class prediction for the instances of the data set using the classification rules, which are linked as a decision list. An instance is assigned to the class pointed out by the consequent of the first rule whose antecedent conditions it satisfies. If no rule covers the instance, it is classified using the default class.
In order to save time, the classification kernel reuses the matching results from the rule coverage kernel, and therefore the rules do not need to be interpreted again. The classifier follows the decision list inference procedure to perform the class prediction. Notice that the class prediction is only triggered when the rule is known to cover the instance (true positive or false positive).
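The decision list inference over precomputed matching results can be sketched for a single instance as follows. This C++ function is a CPU-side illustration under stated assumptions: the name `correctlyClassified` and the `MATCH_*` codes are hypothetical, and the matching results are assumed to be given in rule order.

```cpp
#include <vector>

// Hypothetical encoding of the matching results of one rule on one instance.
enum Match : unsigned char { MATCH_TP = 0, MATCH_TN = 1, MATCH_FP = 2, MATCH_FN = 3 };

// 'ruleMatches' holds, in decision-list order, the result of each rule of one
// individual on one instance. The first rule that covers the instance decides:
// a true positive is a correct prediction, a false positive an incorrect one.
// If no rule covers the instance, the default class is compared with the
// actual class.
bool correctlyClassified(const std::vector<unsigned char>& ruleMatches,
                         int actualClass, int defaultClass) {
    for (unsigned char m : ruleMatches) {
        if (m == MATCH_TP) return true;   // covered, consequent matches the class
        if (m == MATCH_FP) return false;  // covered, consequent differs
        // MATCH_TN and MATCH_FN: the rule does not cover, try the next one
    }
    return defaultClass == actualClass;
}
```

Because the rules are not interpreted again, the cost per instance is at most one pass over the individual's rules.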
The classification kernel is implemented using a 2D grid of thread blocks, whose size depends on the number of individuals (width) and instances (height). The kernel setup is similar to that of the rule coverage kernel. The number of threads per block is also 192, to maximize the occupancy of the streaming multiprocessors. Listing 3 shows the code for the classification kernel. The inputs of the kernel are the array of matching results, an array with the classes of the instances, and the default class, which applies when none of the rules covers the instance (default hypothesis).

    __global__ void classificationKernel(unsigned char* result, int* Class, int defaultClass)
    {
        int instance = blockDim.y * blockIdx.y + threadIdx.y;
        int index = blockIdx.x * maxRules * numberInstances + instance;
        ...
Rule set fitness kernel
The rule set fitness kernel performs a reduction operation over the classifier predictions to count the number of successful predictions. The reduction operation is similar to the one in the rule fitness kernel from Section 3.2.2 and counts the number of correctly classified instances to compute the accuracy of the classifier. The settings for the kernel and the reduction operation are the same. The kernel is implemented using a 1D grid of thread blocks whose length depends only on the number of individuals. The code for the rule set fitness kernel is shown in Listing 4. The kernel receives as input the array of prediction results from the rule set classification kernel, and returns an array of fitness values which gives the accuracy of the classifiers. As in the rule fitness kernel, shared memory is employed to count partial results and guarantee contiguous and coalesced memory accesses.

    __global__ void ruleSetFitnessKernel(float* fitness, unsigned char* result)
    {
        __shared__ int shmCount[128];

        shmCount[threadIdx.y] = 0;
        int base = blockIdx.x * numberInstances * maxRules + threadIdx.y;
        int top = blockIdx.x * numberInstances * maxRules + numberInstances - base;
        ...
Experimental setup
This section describes the experimental study setup, the hardware configuration, and the experiments designed to evaluate the efficiency of the GPU model.
Hardware configuration
The experiments were run on a cluster of machines equipped with dual Intel Xeon E5645 processors running at 2.4 GHz and 24 GB of DDR3 host memory. The GPUs employed were two NVIDIA GTX 480 video cards equipped with 1.5 GB of GDDR5 video RAM. The GTX 480 GPU comprised 15 multiprocessors and 480 CUDA cores. The host operating system was a GNU/Linux Rocks cluster 5.4.3 64 bit together with CUDA 4.1 runtime.
Problem domains
The performance of the GPU model was evaluated on a series of data sets collected from the UCI machine learning repository [56] and the KEEL data sets repository [57]. These data sets are very varied, with different degrees of complexity. The number of instances ranges from 150 for the simplest data set to one million for the most complex. The number of attributes and classes also differs significantly, to represent a wide variety of real-world data problems. This information is summarized in Table 2. The wide variety of data sets allowed us to evaluate the model performance on problems of both low and high complexity.
Experiments
The experimental study comprises three experiments designed to evaluate the performance and efficiency of the model. Firstly, the performance of the rules interpreter was evaluated. Then, the times required for evaluating individuals by CPU and GPU were compared. Finally, the efficiency of the model was analysed regarding performance and power consumption. 
Rule interpreter performance
The efficiency of rule interpreters is often reported by means of the number of primitives interpreted by the system per second, similarly to Genetic Programming interpreters, which determine the number of Genetic Programming operations per second (GPops/s) [31, 53, 54] .
In this experiment, the performance of the rules interpreter was evaluated by running the interpreter with a different number of rules over data sets with varied number of instances and attributes. Thus, the efficiency of the interpreter was analysed regarding its scalability to larger numbers of rules and instances.
Individual evaluation performance
The second experiment evaluated the performance of the evaluation of the individuals and their rules in order to compute their fitness values. This experiment compared the execution times required to evaluate populations of different sizes over the data sets. For the CPU cluster, these times include the data transfers between compute nodes; for the GPU, they include the data transfers between host and GPU memory. The population sizes range from 10 to 100 individuals. This range is commonly used in most classification problems and algorithms, and represents a realistic scenario for real-world data. The number of rules of each individual is equal to the number of classes of the data set, and the length of the rules varies stochastically with the number of attributes of the data set, i.e., rules are created in accordance with the problem complexity. Thus, the experiments are not biased towards unrealistically complex rules and individuals, which would obtain better speedups. The purpose of this experiment was to obtain the speedups of the GPU model and check its scalability to large data sets and multiple GPU devices. Extension to multiple GPUs is simple: the population is divided across as many GPUs as are available, and each GPU is responsible for evaluating a subset of the population. Therefore, the model scales to larger population sizes and to a larger number of GPU devices.
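The division of the population among GPU devices can be sketched as a simple range split. The C++ helper below is hypothetical: the text only states that each GPU evaluates a subset of the population, so the balanced, contiguous assignment used here is an assumption.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Splits 'populationSize' individuals into 'numDevices' contiguous,
// near-equal ranges [begin, end), one per GPU device. When the division is
// not exact, the first ranges receive one extra individual each.
std::vector<std::pair<std::size_t, std::size_t>>
splitPopulation(std::size_t populationSize, std::size_t numDevices) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (numDevices == 0) return ranges;
    const std::size_t base = populationSize / numDevices;
    const std::size_t extra = populationSize % numDevices;
    std::size_t begin = 0;
    for (std::size_t d = 0; d < numDevices; ++d) {
        const std::size_t count = base + (d < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + count);
        begin += count;
    }
    return ranges;
}
```

Each device would then run the evaluation kernels only for the individuals in its own range, with no communication between devices during the evaluation.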
Performance per Watt
Power consumption has increasingly become a major concern for highperformance computing, due not only to the associated electricity costs, but also to environmental factors [58] . The power efficiency is analysed based on the throughput results on the evaluated cases. To simplify the estimates, it is assumed that the devices work at their full occupancy, that is, at maximum power consumption [31] . One NVIDIA GTX 480 GPU consumes up to 250 W, whereas one Intel Xeon E5645 consumes up to 80 W. The efficiency of the model is evaluated regarding the performance per Watt (GPops/s/W). The power consumption is reported to the CPU or GPU itself and it does not take into account the base system power consumption. We followed this approach because it is the commonly accepted way both in academia and industry [31] to report the performance per watt efficiency. Table 3 shows the rule interpreter execution times and performance in terms of the number of primitives interpreted per second (GPops/s). Each row represents the case of a stack-based interpretation of the rules from the population over the instances of the data sets. The number of rules of each individual is equal to the number of classes of the data set. The number of primitives, Genetic Programming operations (GPops), reflects the total number of primitives to be interpreted for that case, which depends on the variable number of rules, their length, and the number of instances, representing the natural variable length of Pittsburgh problems.
Results
The single-threaded CPU interpreter achieves a performance of up to 9.63 million GPops/s, whereas multi-threading with 4 CPU threads brings the performance up to 34.70 million GPops/s. The dual-socket cluster platform provides two 6-core CPUs and a total of 12 CPU threads, which are capable of running up to 92.06 million GPops/s in parallel. On the other hand, the GPU implementation obtains high performance in all cases, especially over large-scale data sets with a higher number of instances. One GPU obtains up to 31 billion GPops/s, whereas scaling to two GPU devices raises the interpreter performance up to 64 billion GPops/s. The best scaling is achieved when a higher number of instances and individuals are considered, i.e., the GPU achieves its maximum performance and occupancy when there are enough threads to fill the GPU multiprocessors. Figure 5 shows the GPops/s scaling achieved by the GPU model with respect to the number of nodes to interpret. The higher the number of nodes to interpret, the higher the occupancy of the GPU and thus the higher the efficiency. Table 4 shows the evaluation times and the speedups of the GPUs versus the single-threaded and 12-threaded CPU implementations. The GPU model has high performance and efficiency, which increase as the number of individuals and instances increases. The highest speedup over the single-threaded CPU version is achieved for the Connect-4 data set using 100 individuals (1,880× using one GPU and 3,461× using two GPU devices). On the other hand, compared to the parallel 12-threaded CPU version, the highest speedup is 933× using one GPU and 1,311× using two GPUs. The evaluation times for the Poker data set using 100 individuals are reduced from 818 seconds (13 minutes and 38 seconds) to 0.2390 seconds using two NVIDIA GTX 480 GPUs. Since evolutionary algorithms evaluate the population in each generation, the total amount of time dedicated to evaluating individuals over the generations becomes a major concern.
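As a worked check of the Poker figures quoted above, the speedup is simply the ratio of the two evaluation times. The short sketch below assumes that the 818 seconds correspond to the CPU baseline against which the two-GPU time is compared; the function name `speedup` is introduced only for this example.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Speedup as the ratio of baseline time to accelerated time."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated time must be positive")
    return baseline_seconds / accelerated_seconds


# Poker data set, 100 individuals: 818 s reduced to 0.2390 s with two GPUs.
poker = speedup(818.0, 0.2390)
print(f"Poker speedup: about {poker:.0f}x")
```

The resulting ratio is roughly 3,400×, of the same order as the maximum speedup reported for the Connect-4 data set.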
GPU devices greatly speed up the evaluation process and save much time. Figure 6 shows the speedup obtained by comparing the evaluation time when using two NVIDIA GTX 480 GPUs with that of the single-threaded CPU evaluator. The figure represents the speedup over the four largest data sets, those with the highest number of instances. The higher the number of instances, the greater the number of parallel and concurrent threads to evaluate and thus the higher the occupancy of the GPU.
Figure 6 : Model speedup using two GPUs (speedup for the Shuttle, Connect-4, Kddcup, and Poker data sets)
Finally, Table 5 shows the efficiency of the model with respect to the computing devices, their power consumption, and their performance in terms of GPops/s. Parallel threaded CPU solutions increase their performance as more threads are employed. However, their efficiency per Watt decreases as more CPU cores are used. On the other hand, GPUs require many Watts, but their power consumption is justified by a higher efficiency per Watt. Specifically, the single-threaded CPU performs around 0.7 million GPops/s/W, whereas using two GPUs increases the efficiency up to 129.96 million GPops/s/W, which is higher than the efficiency reported in related works [31] , which achieve a performance of up to 52.7 million GPops/s per Watt.
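The performance-per-Watt metric can be reproduced approximately from the rounded figures quoted in the text. The sketch below assumes that the power of a multi-device configuration is the sum of the maximum power of each device (250 W per GTX 480), and it uses the rounded throughput of 64 billion GPops/s, so the result differs slightly from the 129.96 million GPops/s/W given above; the function name `perf_per_watt` is introduced only for this example.

```python
def perf_per_watt(gpops_per_second, watts_per_device, num_devices=1):
    """Throughput divided by the assumed total device power (GPops/s/W)."""
    return gpops_per_second / (watts_per_device * num_devices)


# Two GTX 480 GPUs at 250 W each, rounded throughput of 64 billion GPops/s.
two_gpus = perf_per_watt(64e9, 250.0, num_devices=2)
print(f"Two GPUs: {two_gpus / 1e6:.1f} million GPops/s/W")

# Ratio between the reported efficiency and that of the related work [31].
print(f"Ratio to related work: {129.96 / 52.7:.2f}x")
```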
Conclusions
In this paper we have presented a high-performance and efficient evaluation model for individual = rule set (Pittsburgh) genetic rule-based algorithms. The rule interpreter and the GPU kernels have been designed to maximize the GPU occupancy and throughput, reducing the evaluation time of the rules and rule sets. The experimental study has analysed the performance and scalability of the model over a series of varied data sets with different numbers of instances. It is concluded that the GPU implementation is highly efficient and scalable to multiple GPU devices. The best performance was achieved when the number of instances or the population size was large enough to fill the GPU multiprocessors. The speedup of the model was up to 3,461× when addressing large-scale classification problems with two GPUs, significantly higher than the speedup achieved by the parallel 12-threaded CPU solution. The rule interpreter obtained a performance above 64 billion GPops/s, and the efficiency per Watt is up to 129 million GPops/s/W.
