Abstract. Solver competitions have been used in many areas of AI to assess the current state of the art and guide future research and development. AI planning is no exception, and the International Planning Competition (IPC) has been frequently run for nearly two decades. Due to the organisational and computational burden involved in running these competitions, solvers are generally compared using a single homogeneous hardware and software environment for all competitors. To what extent does the specific choice of hardware and software environment have an effect on solver performance, and is that effect distributed equally across the competing solvers?
Introduction
Competitions in AI are a useful focal point for researchers, help to drive forward research and development of solver algorithms, and provide incentives for widely sharing tools and benchmarks. Competitions also play a prominent role in evaluating and improving the state of the art of their particular research areas. Examples include AI planning, SAT, ASP, CSP, and machine learning [1-5]. Among those, the International Planning Competition (IPC) is one of the best-known, longest-running and most thoroughly designed competition series [6, 7]. Organised periodically since 1998, the IPC provides a good example of the impact of competition results on AI planning research, and on planning applications. While all the planning engines tested in the IPC are available to be used after the competition, top-ranked planners receive much of the attention and drive the research direction of the field in the years thereafter.
The great impact and success of top-performing solvers implicitly rests on the assumption that, at least from a qualitative point of view, conclusions derived from competition results generalise well to other, even significantly different, hardware and software environments than those used for running the competition. It is well known that competition results are already strongly affected by the set of benchmark instances, the evaluation function used to assess solver performance, the way problem instances or planning domains are modelled, and the set of competitors [8-14]. Moreover, an analysis performed on the SAT competition showed that ranks of solvers are also affected by the pseudo-random number seeds used in randomised solvers [15].
Interestingly, an investigation performed by Howe and Dahlman in 2002 showed that the relative (qualitative) performance of planners can vary when run using different hardware configurations [8]. However, as their work focused on identifying potential sources of performance variation, their analysis of the impact of hardware and software configuration was limited to assessing differences between two machines with the same software configuration. In this work, we present the first thorough study of the impact of hardware and software environment choices, as well as resource limits, individually and jointly, on competition outcome. Specifically, we aim to identify and isolate aspects that have an unequal impact on planner performance, in the hope that these findings will help with performing (and interpreting) future comparisons between planning engines.
We focus our analysis on two deterministic tracks of the 2014 International Planning Competition, the Optimal and Agile tracks. These two tracks provide a very interesting test-bed, as they rank competitors using nearly opposite metrics. In the Optimal track planner running time is of limited importance: planners are assessed according to their ability to generate optimal solution plans within the (large) given cutoff time. In contrast, in the Agile track the quality of solutions is irrelevant, as planners are ranked according to their ability to quickly find a solution. The selected tracks also differ in terms of the benchmark instance sets used.
Our experimental analysis involves two hardware configurations and eight different software configurations. The software configurations include the choice of C++ compiler version, Python interpreter version and Java version. When running experiments on all possible combinations of hardware and software configurations, we also evaluate the impact of solver stochasticity and different running time and memory limits. Our results show that, in addition to verifying the well-known impact of memory and running time limits [8, 12], competition rankings can be affected by both hardware and software configurations. The source code for all planners, problem instances and domains, and all experimental results have been made publicly available. 1

The remainder of this paper is organised as follows. First, we give some background on the 2014 International Planning Competition. Then, in Section 3, we describe potential sources of planner performance variation. Section 4 describes the experiment design used in our work. We present and discuss our experimental results in Section 5 and then conclude with a brief discussion of the effect of our results on the IPC.
The International Planning Competition
Automated planning studies the problem of finding a totally or partially ordered sequence of actions that transform a given problem environment from an initial state to a goal state (of which there may be several) [16] . Actions are usually expressed in terms of preconditions and effects. Preconditions indicate the requirements that must hold to apply an action, while effects are the consequence (including the cost) of applying the action to modify the state of the world.
The International Planning Competition has been organised since 1998, with the aims of fostering the development and comparison of planning approaches, assessing the current state of the art in planning, and identifying new and challenging benchmarks. In this paper we focus on the eighth edition of the IPC, held in 2014. For a summary of the history of previous IPCs, the interested reader is referred to López et al. [17] .
IPC 2014 was held in three distinct parts: the deterministic part focused on fully observable environments where actions are atomic with deterministic effects and planning is episodic, with the presence of action costs, negative preconditions and conditional effects; the learning part, which relaxes the episodic assumption to allow planners to learn from prior experience; and the probabilistic part, with stochastic transitions and partial observability. The deterministic part is the longest-running part of the IPC, and is the part that traditionally has the highest number of participants (67 in IPC 2014). Hereinafter we will focus on this part. Among the five tracks of the deterministic part of IPC 2014, here we consider the Agile (15 participants) and Optimal (17) tracks. The Agile track was introduced in 2014, while the Optimal track is among the longest-standing in the IPC.
The set of benchmarks used in the Agile track includes the following 14 domains: Barman, Cave Diving, Child-Snack, CityCar, Floortile, GED, Hiking, Maintenance, Openstacks, Parking, Tetris, Thoughtful, Transport, and Visitall. In the Optimal track, the Tidybot domain has been used in place of Thoughtful. For each track, 20 instances per domain were selected following a specifically-designed protocol [18] .
In the Optimal track, SymBA-2, which is based on a symbolic bidirectional blind search with perimeter abstraction heuristics, was declared the winner and cGamer-bd, a bidirectional symbolic search approach that extends the Gamer planner (winner of the corresponding track in IPC 2008), was declared as the runner-up. Finally, Yahsp3, which performs a search embedding delete-relaxed heuristics, was declared as the winner of the Agile track of IPC 2014 and Madagascar-pC, which exploits a SAT-based approach to planning, was declared the runner-up.
For more information about the competition, including complete results, source code of planning systems, and domain models, the interested reader is referred to the analysis of the IPC 2014 results [18] , and to the official competition website. 2 Detailed descriptions of the planning systems can be found in the IPC 2014 booklet [19] .
Sources of performance variation
When performing empirical analyses or comparisons, there are many potential influences on software performance variation. We attempt below to introduce as many such sources as possible, although we acknowledge that a full cataloguing is impossible. We investigate a subset of these sources in this work, but include all of them as a contribution and reference for future work.
Solver randomisation and other stochastic effects
Many solvers take advantage of randomisation to improve average-case performance and to avoid manual deterministic development choices. This randomisation can result in very different solver trajectories in repeated runs with different random seeds, with a correspondingly wide variation in the resulting performance. In satisfiability and other domains, empirical results demonstrate that the running time to find a valid solution is often exponentially distributed for randomised solvers [8, 20-23]. Even in the (practically unattainable) case of holding everything in the execution environment constant other than the random seeds, solver stochasticity would cause repeated runs to differ significantly in performance. This has also been shown empirically by Hurley and O'Sullivan [15].
Other stochastic effects on solver performance can come from the use of shared machines for experiment execution, such as large compute clusters or virtualised commodity environments such as Amazon EC2. CPU core allocation also has effects, as cache connections can vary depending on the core assignment in modern CPUs. Finally, even with no other jobs running on the same machine the operating system can (and will) context switch an experiment process in the middle of execution, causing variance in measurements of running time.
2 https://helios.hud.ac.uk/scommv/IPC-14/
Running time and memory limits
Generally, allocating more running time or memory to solver executions will result in more problem instances solved. However, this improved performance with increased limits tends to not be distributed evenly across all solvers. For example, solvers with extensive caching or precomputation (including use of pattern databases) may benefit from increased memory limits more than other solvers. Most solver competitions, including those in automated planning, evaluate competitors with fixed limits for running time and memory. There has historically been little investigation into competitor performance outside of these limits, and the 4 GB limit used in the IPC is now less than that available in many commodity laptop computers. 3 We perform an investigation in this work studying how the benefits of higher or lower limits are distributed among competing solvers.
Hardware architecture
It is clear that hardware choices such as the CPU can affect solver performance. However, there are many choices differentiating the hardware environment of different machines, and CPU clock speed is no longer the primary source of increased performance. Performance differences can also come from CPU cache levels, processor architecture, memory bandwidth, local storage medium, network interconnection where applicable, and more. We are not aware of any solver competitions measuring competitor performance across several different machine configurations, and in this work we make a small step toward this goal by evaluating solvers on two distinct compute clusters.
Software architecture
In addition to hardware configuration, there are many aspects of the software environment configuration that can affect solver performance. These choices include the operating system used, system library options and versions (e.g. LIBC), and the compilation toolchain used for building and linking a given solver and all of its dependencies. Furthermore, there can be performance differences based on the interpreter version and configuration settings for interpreted or JIT-compiled languages like Python or Java. We investigate the effects of 8 different software configurations in this work.
Choice of benchmark distribution
Evaluating solver performance requires runs on one or more problem instances forming a benchmark set. (In planning, these instances are additionally drawn from distinct planning domains.) Benchmarks should be challenging for the participating solvers, and need to allow for performance differences between solvers to be identified. In a nutshell, benchmarks must be neither too challenging, where no solver is able to provide any solution within the given resource limits, nor trivially solvable: in both cases, no differences between solvers can be identified. In some areas of AI the complexity of problems can be evaluated statistically, without running any solver, by considering the phase transition [24, 25]. This is usually not the case in planning. A phase transition has been demonstrated for randomly generated graphs [26, 27], but is typically unknown on instances from newly-designed domain models. However, recent work by Cohen and Beck [21] provides an empirical investigation of the phase transition phenomenon for heuristic search, focusing on the exploitation of greedy best-first search. In fact, the difficulty of planning problems has mainly been assessed experimentally, i.e., by running solvers. For instance, the recently introduced Torchlight tool [28] allows planner developers and users to analyse the search space topology of planning problems under the delete-relaxation heuristic.
Besides the issue of assessing problem instance complexity, planning instances are often created using randomised generators, where a few parameters define the size and the complexity of the resulting instances. The choice of problem instance domains, randomised generator settings, and instance set size and distribution will all have an (uneven) effect on competing solver performance. We consider this source of variation out of scope for this work, and focus solely on the official benchmark instances used in the 2014 IPC. A discussion of protocols for benchmark selection in planning competitions is provided by Vallati and Vaquero [14], and recommendations on benchmark selection were also provided by Howe and Dahlman [8].
Choice of performance aggregation and ranking mechanism
Given a set of solvers and benchmark problem instances, competition organisers and others interested in empirical performance evaluation must make further evaluation decisions. These decisions include how solver performance is aggregated across the set of benchmark problems, and the metric used (running time, instance set coverage, solution quality). Some competitions use an absolute scoring mechanism (such as mean running time), while others like the IPC use scoring mechanisms where each competitor can have an effect on the score of other competitors. In fact, each track of the IPC typically uses its own scoring mechanism. Tiebreaking mechanisms can also affect the final solver rankings. All of the above choices have been held constant in the IPC for some time, and while the question of whether there are qualitatively better choices is an interesting one, we consider a full treatment of this topic to be out of scope for this work. However, performing an initial analysis of these effects is useful and straightforward with existing competition data, and can help contextualise and characterise the relative impact of different scoring mechanisms on solver rankings. In section 5.4 we provide the results of such an empirical analysis, looking at the effect of alternative scoring mechanisms on the two IPC 2014 tracks considered in this paper.
Experiment design
For our experimental analysis, we chose two sequential, deterministic tracks of the 2014 International Planning Competition (IPC): the Agile and the Optimal tracks. These two tracks provide a very interesting test-bed, as they rank competitors using nearly opposite metrics and also differ in terms of the benchmark sets. The Optimal track is among the longest-standing tracks in the IPC series, with many participating planners and substantial impact on the field of AI planning. While the Agile track was new for IPC 2014, its emphasis on planner running time and low resource requirements made it ideally suited for our analysis.
In the Agile track, competing planners are evaluated based on the running time required to find any satisficing plan, with no regard to the quality of that plan. There were 15 competing planners in the IPC 2014 Agile track, evaluated on 20 benchmark instances from each of 14 planning domains (280 instances in total). These planners were given a running time limit of 300 CPU seconds, on a single CPU core, and performance was evaluated using the IPC running time score. For each problem instance i, let t*_i be the minimum running time required for any competing planner to produce a satisficing plan. Any planner that successfully produces a satisficing plan in time t will receive a score of 1 / (1 + log10(t / t*_i)) for i. Failure to produce a satisficing plan within the CPU time limit results in a score of 0 for i. If t is less than 1 CPU second, the score is set to 1, to prevent large score differences on trivial instances. The final score for each competing planner is the sum of the scores for that planner over all instances i of the benchmark set.

Table 1 Hardware specification of the Orcinus and Galileo computer clusters used in our experiments, and of the cluster used to run the official IPC-2014 competition. Unfortunately, the IPC-2014 cluster has been upgraded since the competition and the CPU model used at the time of the competition is unknown.
In the Optimal track, competing planners are evaluated based on the ability to find an optimal-cost plan. There were 17 planners competing in the Optimal track of IPC 2014, again evaluated on 280 benchmark instances (20 instances from each of 14 domains). These planners were given 30 CPU minutes of running time, on a single CPU core. The running time required to produce this plan plays no role in scoring, and planners are simply assigned a score of 1 if an optimal plan was produced for instance i, and 0 otherwise. As for the Agile track, the final score for each competing planner is defined as the sum of the individual instance scores.
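The two scoring rules can be expressed compactly. The following Python sketch computes the IPC Agile runtime score and the Optimal-track coverage score from per-instance results; the data layout and function names are our own illustration and are not part of the competition infrastructure.

```python
import math

def agile_time_score(runtimes, cutoff=300.0):
    """IPC Agile runtime score.

    runtimes maps planner -> {instance: runtime in CPU seconds of a
    successful run, or None if the planner failed on that instance}.
    A failed run (or one exceeding the cutoff) scores 0, a run under
    1 CPU second scores 1, and any other run scores
    1 / (1 + log10(t / t*_i)), where t*_i is the fastest successful
    time on instance i. The final score is the sum over all instances.
    """
    instances = {i for runs in runtimes.values() for i in runs}
    scores = {p: 0.0 for p in runtimes}
    for i in instances:
        solved = {p: runs[i] for p, runs in runtimes.items()
                  if runs.get(i) is not None and runs[i] <= cutoff}
        if not solved:
            continue
        t_star = min(solved.values())
        for p, t in solved.items():
            scores[p] += 1.0 if t < 1.0 else 1.0 / (1.0 + math.log10(t / t_star))
    return scores

def optimal_coverage_score(solved_instances):
    """IPC Optimal score: 1 point per instance solved optimally.

    solved_instances maps planner -> set of instances solved optimally.
    """
    return {p: len(insts) for p, insts in solved_instances.items()}
```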
Many of the planners from the selected tracks required some modification in order to run successfully on our hardware and software configurations, for example to avoid writing temporary files into their source directories and polluting results when executing runs concurrently. The planners requiring the most modifications were cGamer-bd, DPMPlan, MIPlan and NuCeLaR; these planners all use the same parser for grounded PDDL, which writes files into the directory containing the planner's source code. In order to adapt the parser to our environment and support runs performed in parallel, we modified the source code to instead write these files into the planner's working directory. An analogous solution was applied to RIDA, where the planner also dynamically wrote files into its own source directory [19] .
We consider these modifications minor and do not believe that they had any effect on planner execution or running times. There were two planners that could not be made to work in our environments, namely the Freelunch planner from the Agile track and the AllPaca planner from the Optimal track. In the case of Freelunch, we could not successfully run the planner on either of the computer clusters or with any version of Java at our disposal. As far as we can determine, this was caused by the high-memory shared environment on each cluster node, as Freelunch would crash immediately on launch with a Java JVM memory allocation exception. In the case of AllPaca, the planner relied on the presence of a specific commercial Lisp variant, and we were unable to modify it to work with any of the Lisp distributions available on our systems. These two planners have therefore been removed from our results. Both planners were ranked outside of the top 5 planners in their respective tracks, placing approximately in the middle of the competition rankings. We fully expect that if these issues were to be fixed, they would not significantly change our data or conclusions.
In order to investigate the performance effects of hardware architecture, we utilised two large compute clusters: the Compute-Calcul Canada WestGrid Orcinus cluster 4 and the Italian CINECA Galileo cluster. 5
More details on the hardware and software configuration of these clusters are given in Table 1 . Hyperthreading was disabled on both clusters. We note that both
Orcinus and Galileo have more (and more powerful) CPU cores than those of the cluster used for running the IPC 2014. It is also noticeable that they share the same OS, though in different versions. Both the hardware architecture and the OS are elements that are often beyond our control, but can still have an impact on the performance of solvers, as we demonstrate in this work. Due to the significant resource requirements of reproducing the IPC competition results, we were limited to these two clusters, as we had existing large resource allocation grants for both. We expect that the variance results in this paper will be similar or more significant on other hardware architectures, especially those based on CPUs other than the newer Intel Xeon chips utilised in Orcinus and Galileo.
For the analysis of performance variation over different software configurations, we chose to investigate three major software components: GCC compiler version, Python interpreter version, and Java version. Nearly every planner was entirely or partially reliant on components compiled with GCC, and different compiler versions are very likely to produce different executables even when identical command-line options are used. We selected GCC versions 4.7.2 and 4.8.2 as the two configuration options, since 4.7.2 was the version used in the competition and several of the planners do not successfully compile with versions of GCC more recent than 4.8.2.
Python and Java were by far the next most common software dependencies for the planners we considered. We selected Python 2.7.3 and Oracle Java 1.7.0_45, the versions used in IPC-2014, as well as Python 2.7.10 and Oracle Java 1.8.0_65, the most recent versions on which all relevant planners would execute successfully. The combination of these options resulted in 8 potential software configurations, all of which were used in this work. We will frequently refer in our results to the configuration provided as default in IPC 2014 (GCC 4.7.2, Python 2.7.3, Java 1.7.0) as the base configuration, and the configuration with the most recent of each option (GCC 4.8.2, Python 2.7.10, Java 1.8.0) as the newest configuration. However, it should be noted that IPC 2014 participants were allowed to require a specific version of software dependencies to be used for running their planner.
We then evaluated all considered planners from each track on the entire competition benchmark sets, for each of the 16 (hardware, GCC version, Python version, Java version) configuration options. In order to account for and measure solver stochasticity, we performed 5 independent runs of each configuration. This resulted in 80 complete reproductions of the IPC 2014 Agile and Optimal tracks. All planner runs were performed independently in parallel, with each run assigned 1 CPU core, 8 GB of RAM, and a running time limit of 1800 (Optimal track) or 300 (Agile track) CPU seconds. These running time limits are the same as those used in the IPC tracks, but the competition memory limit was only 4 GB.
We have used a memory limit of 8 GB in this paper, primarily to offset the increased memory usage when forcing compilation for 64-bit execution. As our focus in this work is the impact of hardware and software configuration on planner performance, we also wanted to avoid memory limits being exceeded as much as possible. We use our results with an 8 GB memory limit to produce hallucinated results using a 4 GB limit, as follows: for each considered planner run, if the 8 GB data shows a peak memory usage for that planner higher than 4 GB, that problem instance is counted as unsolved for that planner in the hallucinated data. A set of experiments performed using hard 4 GB limits (described in Section 5.2) indicates that these hallucinated results are consistent with those obtained by setting hard RAM limits.
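The hallucination step is a straightforward post-processing of the 8 GB runs. Below is a minimal Python sketch, assuming per-run records that include the observed peak memory usage; the record field names are our own and purely illustrative.

```python
def hallucinate_memory_limit(runs, limit_gb=4.0):
    """Re-label runs as unsolved when the peak memory observed under the
    8 GB limit exceeds a hypothetical lower limit.

    runs: iterable of dicts with (illustrative) keys 'planner',
    'instance', 'solved' and 'peak_mem_gb'. Returns new records with
    'solved' forced to False where peak memory exceeded limit_gb.
    Note: this does not work for planners that deliberately pre-allocate
    memory up to the enforced limit.
    """
    hallucinated = []
    for record in runs:
        record = dict(record)  # avoid mutating the original record
        if record["peak_mem_gb"] > limit_gb:
            record["solved"] = False
        hallucinated.append(record)
    return hallucinated
```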
All planners were explicitly compiled for 64-bit execution, as neither of our clusters has support for 32-bit execution. The cluster on which IPC 2014 was run had a 64-bit architecture; however, competitors were allowed to require their planners to be compiled and executed as 32-bit. Running time and memory limits in our experiments were monitored and enforced using tools from AClib [29]. 6 These tools are built on top of standard system tools such as ulimit, and the limit enforcement is consistent with that used in the IPC.
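We did not alter the enforcement tooling; purely to illustrate the ulimit-style mechanism it relies on, the sketch below applies hard CPU-time and address-space limits to a child process using Python's standard resource module (this is not the AClib tool itself, and the limit values are only examples).

```python
import resource
import subprocess

def run_with_limits(cmd, cpu_seconds=300, mem_bytes=8 * 1024 ** 3):
    """Launch a planner command with hard CPU-time and memory limits.

    The limits are installed in the child process just before exec,
    mirroring the ulimit-style enforcement used by standard
    benchmarking wrappers.
    """
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(cmd, preexec_fn=set_limits)
```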
Results
This section is devoted to the empirical evaluation of the influence of the sources of performance variation on the results of the Agile and Optimal tracks of IPC 2014. In many of our results, we make use of so-called bump charts to graphically represent the performance variation of our considered planners across several hardware and software configuration options; an example is shown in Figure 1. In these charts, each vertical "column" represents the performance of a different set of runs of our considered planners. The points for each planner are connected and coloured to better illustrate the performance differences between configurations.
The effects of solver randomisation
It is common practice to include randomised components in AI planning systems. Randomisation is useful, e.g., for breaking ties during the heuristic search process, introducing some noise in the heuristic search state evaluation, and for diversification when performing search restarts. Evidently, solver performance can be affected by this source of stochasticity.
In order to quantify this variability, we examine five independent runs of all the participants of the competition tracks considered in our study, on the same platform, in this case the "base" configuration of Orcinus. For planners that allowed it, we fixed the seed parameter used for the randomised component. The underlying assumption is that this performance variability is orthogonal to hardware and software configuration choices, and the results for the other 15 configurations are indeed very similar. Figure 1 shows the variation in competition rank of planners that took part in the optimal track, for each of the 5 independent runs. Table 2 presents the corresponding numerical values in terms of instance coverage, along with the performance variance for each planner. 7 Very few planners of the Optimal track show significant variability in terms of coverage. SPMaS and Dynamic-Gamer are the planners that show the largest difference in terms of instances solved within the allotted time; most of the planners show a discrepancy of only 1 or 2 instances.
A similar picture emerges when the performances of the Agile track planners are analysed. In fact, these planners show less variation in terms of instance coverage. However, in the IPC Agile track, planners have been evaluated according to their IPC runtime score. From this perspective, ArvandHerd and Jasper show the largest score fluctuations: the score of the former ranges between 94.0 and 84.8, while the IPC score of Jasper stays in the 90.0-81.6 range. The impact of this IPC score variation on the competition ranks is limited: in each run, at most two pairs of planners swapped their ranks. However, these results do show that the ranks can change in repeated runs.
For both of the considered clusters, we were allowed to reserve cores, with the corresponding amount of dedicated RAM, for our experiments, hence minimising the variability due to different processes sharing such resources. We are confident that, to the extent possible without deep planner source code modification, we isolated the impact of planner randomisation and reduced the impact of variance due to hardware and software factors not considered in this work.
These experimental results confirm that, while the performance variation of these planners due to stochasticity tends to be limited, it cannot be ignored. For this reason, in order to try to correctly account for stochasticity when assessing the impact of the different sources of variation in the following subsections, the presented results will be derived by considering average performance over the five independent runs per instance. However, it is important to mention that there is still a possibility that a portion of the planner performance variation observed in our experiments is due to stochastic noise that has not been removed by considering the average of multiple runs.
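Concretely, all per-configuration numbers reported in the remainder of this section are averages over the five repetitions. A small sketch of this aggregation (with our own, illustrative data layout) is:

```python
from statistics import mean, pvariance

def aggregate_runs(coverage_by_run):
    """coverage_by_run maps planner -> list of per-repetition coverage
    values. Returns planner -> (mean coverage, population variance)."""
    return {planner: (mean(values), pvariance(values))
            for planner, values in coverage_by_run.items()}
```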
The effects of memory limits
It is well known that the amount of RAM available for a planner has a strong impact on its performance [8, 12]. In addition to providing further confirmation of this previous work, in this section we are interested in investigating whether this source of variation affects all of the planners considered in our study in a similar way. To investigate this, we considered the five independent runs of our "base" configuration on Orcinus. The memory limit used for these runs was 8 GB, as with the other experiments in this work. We recorded the peak memory usage for each run, which we used to hallucinate the result of running each planner with a memory limit lower than 8 GB. This approach does not work for planners that pre-allocate resources to fill their memory allocation, but in practice we did not see this behaviour in the planners we studied. We performed an additional set of runs with an explicit 4 GB RAM limit to test the effect of hallucinating lower memory limits, and planner performance was very similar to the hallucinated predictions (with the exception of some planners from the Gamer family, discussed further below).

Figure 2 shows the hallucinated cumulative coverage of the Optimal track planners with respect to the available amount of RAM. Interestingly, most of the planners show a significant performance improvement when the amount of available RAM ranges between 4 and 5 GB. Moreover, planners based on Java (from the Gamer family) show a very peculiar behaviour: almost no solutions are found when less than 3 GB of RAM are available. The effect of the Java Virtual Machine on memory consumption can be clearly observed by looking at the performance of NuCeLaR, a portfolio planner that exploits Gamer as a basic solver. In fact, we observed that, when manually configured using the JVM Xms and Xmx parameters, the RAM requirements of the planners that use Java can be significantly reduced.
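For the Java-based planners, the heap bounds can be pinned when the JVM is launched. The sketch below shows one way to do this from a Python wrapper; the planner jar path, arguments and heap size are placeholders, and we simply refer to the standard -Xms/-Xmx JVM flags.

```python
import subprocess

def run_java_planner(jar_path, domain_file, problem_file, heap_gb=3):
    """Launch a Java-based planner with explicit initial and maximum
    heap sizes, so that JVM memory usage stays within the enforced limit.
    The jar path and argument order are illustrative placeholders.
    """
    cmd = ["java", f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g",
           "-jar", jar_path, domain_file, problem_file]
    return subprocess.run(cmd)
```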
Hallucinated instance set coverage results for planners that took part in the Agile track of IPC 2014 were similar to those discussed above: Java-based planners required at least 3 GB of RAM in order to solve any instance. However, one difference between the Agile and Optimal planners was that the coverage of the Agile planners typically did not improve when more than 4 GB of RAM were available. The only exceptions were Madagascar-p and Yahsp3-mt, which were able to exploit the higher memory limit to solve more instances. Finally, by examining the performance of Java-based planners across our Java 1.7 and Java 1.8 configurations, we conclude that the latter typically forces planners to use a larger amount of RAM, on average around 1 GB more.
The effects of running time limits
It comes as no surprise that increasing or decreasing the available running time has an impact on the performance of many planners. However, it is also a common belief in the literature that most classical planners either solve a problem quickly or not at all within reasonable running time [8, 30]. In this section, we aim to investigate this hypothesis, as well as to examine whether the performance differences resulting from different running time limits are evenly distributed across planners.

Figure 3 shows the cumulative number of solved instances for planners that took part in the IPC 2014 Optimal track. Experiments were run on our Orcinus hardware configuration, using our "base" software configuration. These results indicate that many of the ranks do not change when the cutoff time is higher than 10 seconds. Moreover, after 10 seconds many of the planners continue to solve additional instances as the available running time increases. However, there are a number of exceptions. DPMPlan solves a significant number of instances using approximately 70-80 seconds: analyses indicate that this is due to the fact that this planner uses a sequential portfolio, and around that running time a new component planner is typically started. RIDA solves the vast majority of its instances using more than 120 seconds while, in contrast, the SymBA-1 and SymBA-2 planners do not solve any additional instances using running time cutoffs of more than 100 seconds.

The Agile track planners show a similar overall behaviour in terms of coverage. Our analyses indicate that increasing the available running time leads to an expected improvement in instance set coverage for most of the planners: only the Yahsp3 planners show a flat cumulative coverage function before the 5 minute limit. Additionally, it seems that the competition rankings are not stable and are significantly affected by the chosen cutoff time. Figure 4 shows how the IPC runtime score of planners that took part in the IPC 2014 Agile track is affected by the cutoff time. As is apparent, many ranks change even when the cutoff time is higher than 100 seconds, indicating that planners are still solving instances and improving their IPC runtime score. Cedalion and ArvandHerd provide a good example of the described behaviour: their IPC runtime scores keep growing for cutoff times higher than 1 CPU second. In contrast, planners like Yahsp3 and Madagascar are able to solve a significant number of instances in a very short time (less than 60 seconds), but after that their IPC runtime scores do not improve significantly, as the additional CPU time does not allow them to solve many additional instances.
According to the results shown in Figures 3 and 4 , the past observation that planners either solve a problem quickly or fail to solve it within reasonable time does not appear to hold any longer. Most of the considered planners can utilise all of the running time provided in the competition. Remarkably, this observation not only applies to portfolio-based planners, but also to those based on a single planning approach. In terms of the relative impact on planner performance, the impact of running time differs substantially between planners. In the Optimal track, performance ranks do not vary significantly when more than 10 seconds are available. Agile track rankings, on the other hand, are strongly affected by the cutoff time.
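The cumulative coverage curves underlying Figures 3 and 4 can be recomputed directly from the per-instance runtimes. A small sketch (using the same illustrative data layout as before) is:

```python
def cumulative_coverage(runtimes, cutoffs):
    """For each planner, count the instances solved within each cutoff.

    runtimes maps planner -> {instance: runtime in CPU seconds of a
    successful run, or None}. cutoffs is an increasing list of cutoff
    times in seconds. Returns planner -> list of coverage counts.
    """
    curves = {}
    for planner, runs in runtimes.items():
        solved_times = [t for t in runs.values() if t is not None]
        curves[planner] = [sum(1 for t in solved_times if t <= c)
                           for c in cutoffs]
    return curves
```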
The effects of scoring mechanisms
When comparing the performance of two or more planners, there are several potential alternatives for producing a ranked ordering. These options include the number of problem instances solved by each planner (instance set coverage), IPC runtime/quality scores, and penalised average runtime (PAR) scores. PAR-10 (PAR-1) is a metric often used in automated algorithm design and empirical analysis experiments, where average runtime is modified by counting runs that did not find a plan as ten (one) times the running time cutoff. The choice of a specific scoring mechanism incentivises competitors to optimise their submissions with respect to that mechanism, and therefore this choice can affect the resulting rankings. In order to investigate the effect of scoring mechanisms on planner rankings, we computed instance set coverage, IPC runtime score, PAR-1 and PAR-10 scores using the independent Agile and Optimal track runs gathered on the Orcinus base environment configuration.

Table 3 summarises the various scoring mechanisms for the Agile track. The scoring mechanism used in the competition was the IPC runtime score. This metric penalises planners for failing to solve problem instances solved by other planners, and ignores timing differences between planners successfully solving a problem instance in less than 1 CPU second. Planners such as IBaCoP2 appear to have solved almost the same number of problem instances as the winning planners, but with running times that reduced their IPC runtime score substantially. The PAR-1 and PAR-10 scores appear to be a compromise between instance set coverage and IPC runtime score.

Table 4 summarises the scoring results for the Optimal track. In this case, the competition scoring metric was strictly instance set coverage over a 1800 CPU second running time cutoff. When the running time to produce an optimal solution is taken into account, cGamer-bd is still the top-performing planner, but there are changes in many of the other ranks. For example, the performance of Metis and the SymBA planners improves significantly, and the RIDA, MIPlan and NuCeLaR planners see a decrease.
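For reference, the PAR scores used in this comparison can be computed as follows (a sketch with the same illustrative data layout as the earlier scoring example):

```python
def par_score(runtimes, cutoff, penalty_factor=10):
    """Penalised average runtime (PAR-k).

    runtimes maps planner -> {instance: runtime in CPU seconds or None}.
    Runs that failed or exceeded the cutoff count as penalty_factor
    times the cutoff; penalty_factor=1 gives PAR-1, 10 gives PAR-10.
    Lower scores are better.
    """
    scores = {}
    for planner, runs in runtimes.items():
        total = 0.0
        for t in runs.values():
            if t is not None and t <= cutoff:
                total += t
            else:
                total += penalty_factor * cutoff
        scores[planner] = total / len(runs)
    return scores
```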
The effects of hardware architecture configuration
Previous work in this area (such as that of Howe and Dahlman [8] ) has presented evidence that the hardware platform used can influence planner performance. Different CPU-clock speeds can behave like a different running time cutoff, different amounts of RAM can change the problems that can be solved by one algorithm significantly, and the architecture design of the CPU and other factors can influence performance in a manner that is harder to predict. What we want to analyse here is the relative performance changes between planners from these factors, since this can lead to a different competition ranking based on the specific hardware configuration chosen. Here we use the Orcinus and Galileo clusters as our two considered hardware configurations, and we also investigate the differences between the performance on our clusters and the official IPC 2014 results. While this analysis does make every effort to isolate the effects of the hardware configuration alone, some software influences are still present since the two clusters run different operating system versions and several system libraries out of our control are not identical. 
Differences between hardware configurations
Looking at the second and third columns of Figures 5 and 6, which respectively represent the results for Orcinus and Galileo on each of the considered IPC 2014 tracks, we can see a mixed set of trends. The instance coverage and IPC runtime score for each track are given in Tables 5 and 6, respectively. In the Optimal track, the trend is largely neutral to positive, with most planners obtaining equal or slightly higher instance set coverage on Galileo than on Orcinus. However, there are three exceptions: MIPlan, NuCeLaR, and to a lesser extent SPMaS. These trends are more significant for the Agile track due to the use of the IPC runtime score, with rank changes occurring from both performance improvement (Jasper, Mercury, BFS-f, Madagascar-pc, SIW) and performance degradation (ArvandHerd, Probe, Madagascar, Yahsp3-mt) between Orcinus and Galileo.
We believe that the mostly-positive trends can be partially explained by the better single-core hardware performance of the Galileo cluster. Even though Orcinus has a "faster" CPU in terms of clock speed, Galileo has a newer hardware architecture, more available cache, and better memory bandwidth. What we consider most interesting in these experimental results are those planners that significantly deviate from the neutral-to-positive trend, and that we see both significant performance improvements for some planners and significant performance degradation for others. A full explanation for these deviations is difficult, but our observations suggest that the hardware platform does not affect all planners in the same way.

Table 5 Comparison of planner performance on our two hardware configurations ("base" software configuration), for the IPC 2014 Optimal track. We also show the official IPC 2014 results.
Comparison with IPC 2014 results
We now turn our attention to the differences between the official results of IPC 2014 and the performance on our two hardware configurations, referring again to Figures 5 and 6, along with Tables 5 and 6.
It should be noted that during the 2014 IPC, participating teams were allowed to require (or select) the most appropriate version of every software component needed by the submitted planner, as well as whether to use 64- or 32-bit execution, and RAM was limited to 4 GB (while we consider an 8 GB limit). Unfortunately, the cluster used for running the competition is no longer available, as it was replaced shortly after the competition. For these reasons, a direct comparison between the results obtained on our machines and the official IPC 2014 results is not possible. However, we try to analyse here some general trends that can be observed even in a very rough comparison.
In the Optimal track, the performance differences causing rank changes between planners are mainly caused by a few planners with significantly different performance between IPC 2014 and our hardware configurations. The SymBA-1, SymBA-2 and SPMaS planners show a significant performance degradation, whereas MIPlan, DPMPlan, Metis and NuCeLaR show significant performance improvements. The specific causes of these changes in each planner are unclear, but we did identify in the IPC 2014 published planner logs that MIPlan had several crashed runs in the competition due to a library dependency error, and that DPMPlan was disproportionally affected by the competition's 4 GB RAM limit.

Table 6 Comparison of planner performance on our two hardware configurations ("base" configuration), for the IPC 2014 Agile track. We also show the official IPC 2014 results.
In the Agile track we observe a general trend of improved IPC runtime score between the IPC 2014 results and our hardware configurations. As in the Optimal track, several planners are affected differently and exhibit significant performance degradation, in this case the two Madagascar and the two Yahsp3 planners. Table 6 demonstrates that the positive performance trend is also not distributed evenly among the competing planners, which is the main cause of the ranking changes other than the planners with performance degradation. The Cedalion planner is the system that gains the most when run on both of our hardware configurations.
Hardware and software synergies
In order to better understand the role of the hardware configuration in performance variation, we repeated the previous analysis for a different software configuration. In this case, we used the "newest" configuration rather than our "base" configuration. We present side-by-side bump charts for the two analyses in Figures 7 and 8, respectively, for the Optimal and Agile tracks. Many of the planners show very similar performance changes between our hardware configurations in both scenarios, but several planners change their behaviour dramatically. In the Optimal track, the significant performance degradation seen for the MIPlan and NuCeLaR planners in the "base" configuration disappears completely in the "newest" configuration. In the Agile track, there are performance differences due to the running time changes between the two software configurations, but the trends are largely similar, other than for the two IBaCoP planners, which show opposite trends.
The effects of software architecture configuration
In order to evaluate the impact of the software environment configuration on planner performance, we examine the results of our eight software configurations using the consistent hardware configuration provided by the Orcinus cluster. These eight software configurations reflect the choice of GCC compiler version (4.7.2 or 4.8.2), Python interpreter version (2.7.3 or 2.7.10) and the version of the Java JDK and virtual machine (1.7 or 1.8).

Figure 9 shows how the instance set coverage (and therefore the competition ranking) of the Optimal track planners is affected by our software configurations. The corresponding data can be found in Table 7. It appears that a number of planners have a tangible performance drop when Java 1.8 is used instead of version 1.7. The planners most affected by this are MIPlan and NuCeLaR; the planners based on Gamer, i.e., Gamer, cGamer-bd and Dynamic-Gamer, also show a performance drop, although not as significant as for the previously-mentioned solvers. The remaining planners of the Optimal track show limited, but in the case of RIDA and SPMaS still noticeable, performance fluctuation.

Figure 10 shows how the software configuration affects the instance coverage of planners from the Agile track. The corresponding coverage data is presented in Table 8, along with the IPC runtime scores for the sake of completeness. We focus on coverage because this allows for an objective assessment of the performance of each planner. The IPC score of each planner depends on the performance of all other planners, and may obscure performance differences in individual planners. Instead, the instance set coverage of a planner is an absolute measure that does not depend on any other planner. However, IPC score and coverage are also closely related: if a planner is unable to solve a given benchmark instance, the corresponding IPC runtime score for the instance will be 0.0.
As shown in Figure 10, some planners from the Agile track demonstrated a sensitivity to the GCC compiler version. For example, extreme variation can be observed in the performance of IBaCoP and IBaCoP2. The performance of the other Agile track planners, in particular Use, ArvandHerd and Jasper, is affected by a combination of GCC compiler and Python versions. However, we note that ArvandHerd, Jasper and Yahsp3-mt, which according to the results in Figure 10 show remarkable performance variation, are also among the planners with the highest variance over multiple runs (see Section 5.1). Therefore, there is a possibility that a portion of the observed software configuration variation is due to stochastic noise that was not removed by considering the average of five runs.

To summarise our software configuration analyses: the specific choice of software configuration has an impact on the performance of planners, in many cases a significant one. The impact may be the result of a number of factors, such as:
• Planner dependence on a variety of software technologies. Intuitively, this is due to the fact that such a planner can be affected by multiple changes simultaneously. This is especially true for portfolio planners, such as NuCeLaR.
• Planners that have been highly optimised for specific versions of a particular software package are sensitive to subsequent version changes. SymBA, for instance, explicitly required a specific version of the GCC compiler to be installed on the IPC 2014 benchmarking environment.
• Major changes in the way in which software components work, such as Java JVM memory allocation and garbage collection from version 1.7 to 1.8, strongly affect planners relying on those components.
As impacts are not distributed evenly among all planners, the choice of specific software configuration can dramatically affect competition results.
Conclusions and future work
In this work we presented an empirical investigation of solver performance variation across several options for hardware and software architecture configuration. For each of our 8 software and 2 hardware configurations, we ran the planners used in the deterministic Optimal and Agile tracks of the 2014 International Planning Competition (IPC 2014), effectively repeating the 2014 competition multiple times, independently for each considered configuration.
Our analysis shows that the hardware and software environment has a significant effect on solver performance, and that this effect can also vary significantly for different solvers. As a result, rankings in competitions such as the IPC cannot be expected to generalise completely to hardware or software environments different from those used in the competition. This may partially be due to the fact that many planners show similar (comparable) performance, but the impact of hardware and software environment can also dramatically change performance. In fact, for both of our considered IPC tracks we empirically observed a different top planner than in the official competition results. These hardware and software environment changes can be as minor as the version of the compiler used to create each solver executable: in our case GCC 4.7.2 vs. 4.8.2.
Furthermore, we also provide empirical evidence for the common belief that the choice of competition running time cutoff and memory limit affects solver performance differently for different solvers, and thus can affect the resulting rankings.
While our experimental observations suggest that competition performance results should be carefully interpreted, we caution that these observations should not be taken as making past competition results somehow invalid, or as diminishing the utility of solver competitions in general. Given our experimental results, we do recommend that users evaluate as many of the top-ranked solvers as possible in their own hardware and software environments when making decisions about the "best" solver for a specific problem.
Attempting to compensate for many of the sources of performance variation discussed in this paper would place a heavy burden on competition organisers, both in terms of time and additional computational resources. Specifically for the IPC, increasing the memory limit to 8 GB does appear to result in a general improvement of planners' performance on existing benchmark instances, thus possibly providing a better snapshot of the actual performance of the considered planners, and performing multiple planner runs on each benchmark instance would also help limit variance with only minimal additional human effort. Allowing competitors to customise their own software configuration for the competition would potentially reduce this source of variation, but would also have the side effect of newly biasing the competition results toward competitors with the sophisticated knowledge, computational resources and time to do the required performance tuning.

We see several possible avenues for future work: first, a deeper investigation into specific planners such as SymBA, SPMaS, MIPlan and DPMPlan, which exhibited extremely large performance swings between our two hardware configurations; second, using the knowledge gained in this work, the study and development of a competition measuring solver performance across several distinct hardware and software environments; finally, a thorough analysis of additional sources of performance variation not covered in this paper, including benchmark instance set selection and solver stochasticity.
