Abstract. Answer Set Programming (ASP) has become the paradigm of choice in the field of logic programming and non-monotonic reasoning. Thanks to the availability of efficient solvers, ASP has been successfully employed in a large number of application domains. The term GPU-computing indicates a recent programming paradigm aimed at enabling the use of modern parallel Graphical Processing Units (GPUs) for general-purpose computing. In this paper we describe an approach to ASP-solving that exploits GPU parallelism. The design of a GPU-based solver poses various challenges due to the peculiarities of GPUs' software and hardware architectures and to the intrinsic nature of the satisfiability problem.
Introduction
Answer Set Programming (ASP) is an expressive and purely declarative framework developed in recent decades in the Logic Programming and Knowledge Representation communities. Thanks to its extensively studied mathematical foundations and the continuous improvement of efficient and competitive solvers, ASP has become the paradigm of choice in many fields of AI. It has been fruitfully employed in many areas, such as knowledge representation and reasoning, planning, bioinformatics, multi-agent systems, data integration, language processing, declarative problem solving, semantic web, and robotics, to mention a few among many [7, 8].
The clear and highly declarative nature of ASP enables excellent opportunities for the introduction of parallelism and concurrency in implementations of ASP-solvers.
Steps have been made in the last decade toward the parallelization of the basic components of Logic Programming systems. Implementations of solvers exploiting multicore architectures, distributed systems, or portfolio approaches have been proposed [4]. In this direction, a recent stream of research concerns the design and development of parallel ASP systems that can take advantage of the massive degree of parallelism offered by modern Graphical Processing Units (GPUs).
GPUs are multicore devices designed to operate with a very large number of lightweight threads, executing in a rigid synchronous manner. They present a significantly complex memory organization. To take full advantage of GPUs' computational power, one
Preliminaries
We briefly recall the basic notions on ASP needed in the rest of the paper (for a detailed treatment see [12, 11] and the references therein). Similarly, we also recall a few notions on CUDA parallelism [17, 18].
Answer Set Programming. An ASP program Π is a set of ASP rules of the form $p_0 \leftarrow p_1, \ldots, p_m, \mathit{not}\ p_{m+1}, \ldots, \mathit{not}\ p_n$, where $n \geq 0$ and each $p_i$ is an atom. If $n = 0$, the rule is a fact. If $p_0$ is missing, the rule is a constraint. Notice that such a constraint can be rewritten as a headed rule of the form $q \leftarrow p_1, \ldots, p_m, \mathit{not}\ p_{m+1}, \ldots, \mathit{not}\ p_n, \mathit{not}\ q$, where $q$ is a fresh atom. Hence, constraints do not increase the expressive power of ASP.
A rule including variables is simply seen as a shorthand for the set of its ground instances. Without loss of generality, in what follows we consider the case of ground programs only. (Hence, each $p_i$ is a propositional atom.)
Given a rule r, $p_0$ is referred to as the head of the rule (head(r)), while the set $\{p_1, \ldots, p_m, \mathit{not}\ p_{m+1}, \ldots, \mathit{not}\ p_n\}$ is referred to as the body of r (body(r)). Moreover, we put $body^+(r) = \{p_1, \ldots, p_m\}$, $body^-(r) = \{p_{m+1}, \ldots, p_n\}$, $\varphi^+(r) = p_1 \wedge \cdots \wedge p_m$ and $\varphi^-(r) = \neg p_{m+1} \wedge \cdots \wedge \neg p_n$. We will denote the set of all atoms in Π by atoms(Π) and the set of all rules defining the atom p by rules(p) = {r | head(r) = p}. The completion $\Pi_{cc}$ of a program Π is defined as the formula:
$$\Pi_{cc} = \bigwedge_{p \in atoms(\Pi)} \Big( p \leftrightarrow \bigvee_{r \in rules(p)} \big( \varphi^+(r) \wedge \varphi^-(r) \big) \Big)$$
Semantics of ASP programs is expressed in terms of answer sets. An interpretation is a set M of atoms; $p \in M$ (resp. $p \notin M$) denotes that p is true (resp. false). An interpretation M is a model of a rule r if $head(r) \in M$, or $body^+(r) \setminus M \neq \emptyset$, or $body^-(r) \cap M \neq \emptyset$.
M is a model of a program Π if it is a model of each rule in Π. M is an answer set of Π if it is the subset-minimal model of the reduct program $\Pi^M$. An important connection exists between the answer sets of Π and the minimal models of $\Pi_{cc}$. In fact, any answer set of Π is a minimal model of $\Pi_{cc}$. The converse is not true, but it can be shown [14] that the answer sets of Π are the minimal models of $\Pi_{cc}$ satisfying the loop formulas of Π. The number of loop formulas can be, in general, exponential in the size of Π. Hence, modern ASP solvers adopt some form of lazy approach to generate loop formulas only "when needed". We refer the reader to [14, 11] for the details; in what follows we will describe an alternative approach to answer set computation that avoids the generation of loop formulas. The new approach exploits ASP computations to avoid the introduction of loop formulas and the need to perform unfounded set checks [11] during the search for answer sets.
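The notion of ASP computations originates from a computation-based characterization of answer sets [15] based on an incremental construction process, where at each step choices determine which rules are actually applied to extend the partial answer set. More specifically, for a program Π let $T_\Pi$ be the immediate consequence operator of Π. Namely, if I is an interpretation, then
$$T_\Pi(I) = \{\, head(r) \mid r \in \Pi,\ body^+(r) \subseteq I,\ body^-(r) \cap I = \emptyset \,\}$$
An ASP computation of Π is a sequence of interpretations $I_0, I_1, I_2, \ldots$ (where $I_0$ can be any set of atoms that are logical consequences of Π) satisfying these conditions:
• $I_i \subseteq I_{i+1}$ for all $i \geq 0$ (persistence of beliefs);
• $I_\infty = \bigcup_{i=0}^{\infty} I_i$ is such that $T_\Pi(I_\infty) = I_\infty$ (convergence);
• $I_{i+1} \subseteq T_\Pi(I_i)$ for all $i \geq 0$ (revision);
• if $p \in I_{i+1} \setminus I_i$, then there is a rule $r \in rules(p)$ such that $I_j$ is a model of body(r) for each $j \geq i$ (persistence of reasons).
Following [15], an interpretation I is an answer set of Π if and only if there exists an ASP computation such that $I = \bigcup_{i=0}^{\infty} I_i$.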
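To make the notions of immediate consequence operator and answer set concrete, the following is a minimal sequential sketch. It assumes the standard Gelfond-Lifschitz reduct (the reduct is not spelled out in the text), represents a ground rule as a hypothetical triple (head, positive body, negative body), ignores constraints, and enumerates candidate interpretations by brute force, which is only viable for toy programs.

```python
from itertools import chain, combinations

def t_pi(program, interp):
    """Immediate consequences: heads of rules whose body holds in interp."""
    return {h for (h, pos, neg) in program
            if set(pos) <= interp and not (set(neg) & interp)}

def reduct(program, interp):
    """Assumed Gelfond-Lifschitz reduct: drop rules blocked by interp,
    then drop the negative literals of the remaining rules."""
    return [(h, pos, ()) for (h, pos, neg) in program
            if not (set(neg) & interp)]

def least_model(positive_program):
    """Least fixpoint of T on a negation-free program, starting from {}."""
    model = set()
    while True:
        nxt = t_pi(positive_program, model)
        if nxt == model:
            return model
        model = nxt

def is_answer_set(program, interp):
    return least_model(reduct(program, interp)) == set(interp)

def answer_sets(program):
    """Brute-force enumeration over all subsets of the atoms (toy sizes only)."""
    atoms = sorted({h for h, _, _ in program}
                   | {a for _, p, n in program for a in chain(p, n)})
    subsets = chain.from_iterable(
        combinations(atoms, k) for k in range(len(atoms) + 1))
    return [set(s) for s in subsets if is_answer_set(program, set(s))]

# p <- not q.    q <- not p.    r <- p.
PROGRAM = [("p", (), ("q",)), ("q", (), ("p",)), ("r", ("p",), ())]
print(answer_sets(PROGRAM))
```

On this three-rule program the two answer sets correspond to the two ways of resolving the mutual negation between p and q.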
GPU-computing and the CUDA framework. Graphical Processing Units (GPUs) are massively parallel devices, originally developed to support efficient computer graphics and fast image processing. The use of such multicore systems has become pervasive in general-purpose applications that are not directly related to computer graphics, but demand massive computational power. The term GPU-computing indicates the use of modern GPUs for such general-purpose computing. NVIDIA is one of the pioneering manufacturers in promoting GPU-computing, especially through the support for its Compute Unified Device Architecture (CUDA) [18]. A GPU contains hundreds or thousands of identical computing units (cores) and provides access to both on-chip memory (used for registers and shared memory) and off-chip memory (used for cache and global memory). Cores are grouped in a collection of Streaming MultiProcessors (SMs); in turn, each SM contains a fixed number of computing cores. The underlying conceptual model for parallelism is Single-Instruction Multiple-Thread (SIMT), where the same instruction is executed by different threads that run on cores, while data and operands may differ from thread to thread. A logical view of computations is introduced by CUDA, in order to define abstract parallel work and to schedule it among different hardware configurations. A typical CUDA program is a C/C++ program that includes parts meant for execution on the CPU (referred to as the host) and parts meant for parallel execution on the GPU (referred to as the device).
The CUDA API supports interaction, synchronization, and communication between host and device. Each device computation is described as a collection of concurrent threads, each executing the same device function (called a kernel, in CUDA terminology). These threads are hierarchically organized in blocks of threads and grids of blocks. The host program contains all the instructions needed to initialize the data in the GPU, specify the number of grids/blocks/threads, and manage the kernels. Each thread in a block executes an instance of the kernel, and has a thread ID within its block. A grid is a 3D array of blocks that execute the same kernel, read input data from the global memory, and write results to the global memory. When a CUDA program on the host launches a kernel, the blocks of the grid are scheduled to the SMs with available execution capacity. The threads in the same block can share data, using high-throughput on-chip shared memory, while threads belonging to different blocks can only share data through the global memory. Thus, the block size allows the programmer to define the granularity of thread cooperation.
It should be noticed that the most efficient access pattern to be adopted by threads in reading/storing data depends on the kind of memory. We briefly mention here two possibilities (see [17] for a comprehensive description). Shared memory is organized in banks. If threads of the same block access locations in the same bank, a bank conflict occurs and the accesses are serialized. To avoid bank conflicts, a strided access pattern has to be adopted. On the contrary, concerning global memory, to reach the highest throughput, coalesced accesses have to be executed. Intuitively, this can be achieved if consecutive threads access contiguous global memory locations.
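The effect of the access pattern can be illustrated with a small back-of-the-envelope model. The sketch below is not CUDA code: it merely counts, for a group of 32 threads where thread i accesses word i·stride, how many threads fall on the same shared-memory bank and how many global-memory segments are touched. The figures of 32 banks, 4-byte words, and 128-byte segments are assumptions made for the example, and the function names are illustrative.

```python
from collections import Counter

BANKS = 32  # assumed number of shared-memory banks, one 4-byte word wide

def bank_conflict_degree(stride, threads=32):
    """Largest number of threads mapped to one bank when thread i
    accesses word i * stride (1 means no conflict)."""
    hits = Counter((i * stride) % BANKS for i in range(threads))
    return max(hits.values())

def segments_touched(stride, threads=32, word_bytes=4, segment_bytes=128):
    """Distinct global-memory segments read when thread i accesses
    word i * stride (1 means a fully coalesced access)."""
    return len({(i * stride * word_bytes) // segment_bytes
                for i in range(threads)})

for s in (1, 2, 32, 33):
    print(s, bank_conflict_degree(s), segments_touched(s))
```

In this model contiguous accesses (stride 1) touch a single segment, while a stride equal to the number of banks maps every thread to the same bank; an odd stride such as 33 spreads the threads over all banks again.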
Threads of each block are grouped in warps of 32 threads each. The threads of the same warp share the fetch of the instruction code to be executed. Hence, the maximum efficiency is achieved when all 32 threads execute the same instruction (possibly, on different data). Whenever two (or more) groups of threads belonging to the same warp fetch/execute different instructions, thread divergence occurs. In this case the execution of the different groups is serialized and the overall performance decreases.
A simple CUDA application presents the following basic components:
• MEMORY ALLOCATION AND DATA TRANSFER. Before being processed by kernels, data must be copied to the global memory of the device. The CUDA API supports memory allocation and data transfer to/from the host.
• KERNELS DEFINITION. Kernels are defined as standard C functions; the annotation used to communicate to the CUDA compiler that a function should be treated as a kernel has the form: __global__ void kernelName(Formal Arguments).
• KERNELS EXECUTION. A kernel can be launched from the host program using: kernelName<<<GridDim, TPB>>>(Actual Arguments), where GridDim describes the number of blocks of the grid and TPB specifies the number of threads in each block.
• DATA RETRIEVAL. After the execution of the kernel, the host retrieves the results with a transfer operation from global memory to host memory.
Conflict-driven ASP-Solving
Conflict-driven nogood learning (CDNL) is one of the techniques successfully used by ASP-solvers, such as the CLINGO system [11]. The first attempt at exploiting GPU parallelism for conflict-driven ASP solving has been made in [5, 6]. The approach adopts a conventional architecture of an ASP solver which starts by translating the completion $\Pi_{cc}$ of a given ground program Π into a collection of nogoods (see below). Then, the search for the answer sets of Π is performed by exploring a search space composed of all interpretations for the atoms in Π, organized as a binary tree. Branches of the tree correspond to (partial) assignments of truth values to program atoms (i.e., partial interpretations). The computation of an answer set proceeds by alternating decision steps and propagation phases. Intuitively: (1) A decision consists in selecting an atom and assigning it a truth value. (This step is usually guided by powerful heuristics analogous to those developed for SAT [2].) (2) Propagation extends the current partial assignment by adding all consequences of the decision. The process repeats until a model is found (if any). It may be the case that inconsistent truth values are propagated for the same atom after i decisions (i.e., while visiting a node at depth i in the tree-shaped search space).
In such cases a conflict arises at decision level i, testifying that the current partial assignment cannot be extended to a model of the program. Then, a conflict analysis procedure is run to detect the reasons for the failure. The analysis identifies which decisions should be undone in order to restore consistency of the assignment. It also produces a new learned nogood to be added to the program at hand, so as to exclude repeating the same failing sequence of decisions in the subsequent part of the computation. Consequently, the program is extended with the learned nogood and the search backjumps to a previous (consistent) point in the search space, at a decision level j < i. Whenever a conflict occurs at the top decision level (i = 1), the computation ends because no (more) solutions exist.
Following [5, 6], let us outline how CDNL can be combined with ASP computation in order to obtain a solver that does not need to use loop formulas. We describe both assignments A and nogoods δ as sets of signed atoms, i.e., entities of the form $Tp$ or $Fp$, denoting that p ∈ atoms(Π) has been assigned true or false, respectively. Plainly, an assignment contains at most one element between $Tp$ and $Fp$ for each atom p. Given an assignment A, let $A^T = \{p \mid Tp \in A\}$. Note that $A^T$ is an interpretation for Π. A total assignment A is such that, for every atom p, $\{Tp, Fp\} \cap A \neq \emptyset$. Given a (possibly partial) assignment A and a nogood δ, we say that δ is violated if δ ⊆ A. In turn, A is a solution for a set of nogoods ∆ if no δ ∈ ∆ is violated by A. Nogoods can be used to perform deterministic propagation (unit propagation) and extend an assignment. Given a nogood δ and a partial assignment A such that $\delta \setminus A = \{Fp\}$ (resp., $\delta \setminus A = \{Tp\}$), we can infer the need to add $Tp$ (resp., $Fp$) to A in order to avoid violation of δ.
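The notions of violation and unit propagation can be sketched sequentially as follows. This is only an illustration under assumed representations: a signed atom is a hypothetical pair ("T", p) or ("F", p), nogoods are frozensets of signed atoms, and the loop rescans every nogood until a fixpoint is reached, with none of the optimizations a real solver would use.

```python
def complement(lit):
    """Signed atom with the opposite sign: ('T', p) <-> ('F', p)."""
    sign, atom = lit
    return ("F" if sign == "T" else "T", atom)

def unit_propagate(nogoods, assignment):
    """Extend the assignment to a fixpoint.
    Returns (assignment, violated_nogood), where the second component
    is None if no nogood is violated."""
    A = set(assignment)
    changed = True
    while changed:
        changed = False
        for delta in nogoods:
            missing = delta - A
            if not missing:            # delta is a subset of A: violation
                return A, delta
            if len(missing) == 1:      # exactly one signed atom left
                (lit,) = missing
                inferred = complement(lit)
                if inferred not in A:
                    A.add(inferred)
                    changed = True
    return A, None

T = lambda p: ("T", p)
F = lambda p: ("F", p)

chain_nogoods = [frozenset({T("a"), F("b")}),   # a implies b
                 frozenset({T("b"), F("c")}),   # b implies c
                 frozenset({T("c"), T("d")})]   # c and d exclude each other
result, conflict = unit_propagate(chain_nogoods, {T("a")})
print(sorted(result), conflict)
```

Starting from the single decision T a, the three nogoods force T b, T c, and F d in turn, and no conflict is reported.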
Given a program Π, a set of completion nogoods $\Delta_{\Pi_{cc}}$ is derived from $\Pi_{cc}$ as follows. For each rule r ∈ Π and each atom p ∈ atoms(Π), we introduce the formulas:
$$b_r \leftrightarrow t_r \wedge n_r, \qquad t_r \leftrightarrow \varphi^+(r), \qquad n_r \leftrightarrow \varphi^-(r), \qquad p \leftrightarrow \bigvee_{r \in rules(p)} b_r$$
where $b_r$, $t_r$, $n_r$ are new atoms (if rules(p) = ∅, then the last formula reduces to ¬p). The completion nogoods reflect the structure of the implications in these formulas:
• From the first formula we have the nogoods: $\{F b_r, T t_r, T n_r\}$, $\{T b_r, F t_r\}$, and $\{T b_r, F n_r\}$.
• From the second and third formulas we have the nogoods: $\{T t_r, F p\}$ for each $p \in body^+(r)$; $\{T n_r, T q\}$ for each $q \in body^-(r)$; $\{F t_r\} \cup \{T p \mid p \in body^+(r)\}$; and $\{F n_r\} \cup \{F q \mid q \in body^-(r)\}$.
• From the last formula we have the nogoods: $\{F p, T b_r\}$ for each $r \in rules(p)$ and $\{T p\} \cup \{F b_r \mid r \in rules(p)\}$.
Moreover, for each constraint $\leftarrow p_1, \ldots, p_m, \mathit{not}\ p_{m+1}, \ldots, \mathit{not}\ p_n$ in Π we introduce a constraint nogood of the form $\{T p_1, \ldots, T p_m, F p_{m+1}, \ldots, F p_n\}$. The set $\Delta_{\Pi_{cc}}$ is the set of all the nogoods so defined.
The basic CDNL procedure described earlier can be easily combined with the notion of ASP computation. Indeed, it suffices to apply a specific heuristic during the selection steps to satisfy the four properties defined in Sect. 1. This can be achieved by assigning the value true to a selected atom only if this atom is supported by a rule with true body. More specifically, let A be the current partial assignment; the selection step acts as follows. For each unassigned atom p occurring as head of a rule in the original program, all nogoods reflecting the rule $b_r \leftarrow t_r, n_r$ such that $r \in rules(p)$ are analyzed to check whether $T t_r \in A$ and $F n_r \notin A$ (i.e., the rule is applicable [15]). One of the rules r that pass this test is selected. Then, $T b_r$ is added to A. In the subsequent propagation phase $T p$ and $T n_r$ are also added to A, and $T n_r$ imposes that all the atoms of $body^-(r)$ are set to false. This, in particular, ensures the persistence of beliefs of the ASP computation. (In the real implementation (see Sect. 4) all applicable rules r, and their heads, are evaluated according to a heuristic weight and the rule r with highest ranking is selected.)
It might be the case that no selection is possible because no unassigned atom p exists such that there is an applicable $r \in rules(p)$. In this situation the computation ends by assigning the value false to all unassigned heads in Π. This completes the assignment, which is validated by a final propagation step in order to check that no constraint nogoods are violated. In the positive case the assignment so obtained is an answer set of Π.
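The translation from rules to completion and constraint nogoods described in this section can be sketched as follows. The representation is assumed for the example: a rule is a triple (head, positive body, negative body), a constraint is a pair (positive body, negative body), signed atoms are pairs ("T", p) / ("F", p), and the auxiliary atoms of the i-th rule are given the hypothetical names b{i}, t{i}, n{i}, which are assumed not to clash with program atoms.

```python
def completion_nogoods(rules, constraints=()):
    """Build the completion nogoods of a ground program, plus one
    nogood per constraint, as a list of frozensets of signed atoms."""
    T = lambda a: ("T", a)
    F = lambda a: ("F", a)
    nogoods, by_head, atoms = [], {}, set()
    for i, (head, pos, neg) in enumerate(rules):
        b, t, n = f"b{i}", f"t{i}", f"n{i}"   # illustrative auxiliary names
        by_head.setdefault(head, []).append(b)
        atoms.update([head, *pos, *neg])
        # b_r <-> t_r and n_r
        nogoods += [{F(b), T(t), T(n)}, {T(b), F(t)}, {T(b), F(n)}]
        # t_r <-> conjunction of the positive body
        nogoods += [{T(t), F(p)} for p in pos]
        nogoods.append({F(t)} | {T(p) for p in pos})
        # n_r <-> conjunction of the negated atoms of the negative body
        nogoods += [{T(n), T(q)} for q in neg]
        nogoods.append({F(n)} | {F(q) for q in neg})
    # p <-> disjunction of b_r over rules(p); just {T p} if rules(p) is empty
    for p in sorted(atoms):
        bodies = by_head.get(p, [])
        nogoods += [{F(p), T(b)} for b in bodies]
        nogoods.append({T(p)} | {F(b) for b in bodies})
    for pos, neg in constraints:
        nogoods.append({T(p) for p in pos} | {F(q) for q in neg})
    return [frozenset(d) for d in nogoods]

# p <- q, not s.
ng = completion_nogoods([("p", ("q",), ("s",))])
print(len(ng))
```

For the single rule above, the atoms q and s have no defining rule, so each contributes the unit nogood {T q} or {T s}, matching the remark that the last formula reduces to ¬p when rules(p) is empty.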
ASP as an irregular application
The design of GPU-based ASP-solvers poses various challenges due to the structure and intrinsic nature of the satisfiability problem. The same holds for GPU-based approaches to SAT [3]. As a matter of fact, the parallelization of SAT/ASP-solving shares many aspects with other applications of GPU-computing where problems/instances are characterized by the presence of large, sparse, and unstructured data. Parallel graph algorithms constitute significant examples that, like SAT/ASP solving, exhibit irregular and low-arithmetic intensity combined with data-dependent control flow and memory access patterns. Typically, in these contexts, large instances/graphs have to be modeled and represented using sparse data structures (e.g., matrices in Compressed Sparse Row/Column formats). The parallelization of such algorithms struggles to achieve scalability due to lack of data locality, irregular access patterns, and unpredictable computation [16]. Although, in the case of some graph algorithms, several techniques have been established in order to improve performance on parallel architectures [13] and accelerators [1], the different character of the algorithms used in SAT/ASP might prevent one from obtaining a comparable impact on performance by directly applying the same techniques. This is because, first, the time-to-solution of a SAT/ASP problem is dominated by heuristic selection and learning procedures able to cut the exponential search space. In several cases, smart heuristics might be more effective than advanced parallel solutions. Second, because of intrinsic data-dependencies, procedures like propagation or learning often require accessing large parts of the data/graph sequentially.
Similarly to what has been experienced in other complex graph-based problems [9], the kind of computation involved differs from that of traversal-like algorithms (such as Breadth-First Search), which process a subset of the graph in an iterative/incremental manner and for which advanced GPU-solutions exist. Furthermore, aspects specific to the underlying architecture come into play, such as coalesced memory access and CUDA-thread balancing, which are major objectives in parallel algorithm design. In this scenario, our GPU-based proposal for ASP solving also implements:
• Efficient parallel propagation able to maximize memory throughput and minimize thread divergence.
• A fast parallel learning algorithm which avoids the bottleneck represented by the intrinsically sequential resolution-like learning procedures commonly used in CDNL solvers.
• Specific thread-data mapping solutions able to regularize the access to data stored in global, local, and shared memories.
In what follows we will describe how to achieve these requirements in the GPU-based solver for ASP.
The CUDA-based ASP-solver YASMIN
In this section, we present a solver that exploits ASP computation, nogoods handling, and GPU parallelism. The ground program Π, as produced by the grounder GRINGO [11], is read by the CPU. The CPU also computes the completion nogoods $\Delta_{\Pi_{cc}}$ and transfers them to the device. The rest of the computation is performed completely on the GPU. During this process, the only memory transfers between the host and device involve control-flow flags (e.g., an "exit" flag, used to communicate whether the computation is terminated) and the computed answer set (from the GPU to the CPU).
As concerns the representation and storage of data on the device, nogoods are represented using the Compressed Sparse Row (CSR) format. The atoms of each nogood are stored contiguously and all nogoods are stored in consecutive locations of an array allocated in global memory. An indexing array contains the offset of each nogood, to enable direct access to them. (Such indexes are then used as identifiers for the corresponding nogoods.) Nogoods are sorted in increasing order of length. Each atom in Π is uniquely identified by an index, say p. An array A of integers is used to store in global memory the set of assigned atoms (with their truth values) in this manner:
• A[p] = 0 if and only if the atom p is unassigned;
• A[p] = i, with i > 0 (resp., A[p] = −i), means that atom p has been assigned true (resp., false) at the decision level i.
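The storage layout just described can be mimicked on the host with plain Python lists. In this sketch, a few choices are assumptions made for illustration and are not stated in the text: a signed atom is encoded as the integer +p (for T p) or −p (for F p), atoms are indexed from 1, and the offsets array carries one extra trailing entry so that nogood k spans literals[offsets[k]:offsets[k+1]].

```python
def build_csr(nogoods):
    """Flatten nogoods (lists of nonzero ints) into CSR arrays,
    sorted by increasing length."""
    ordered = sorted(nogoods, key=len)
    literals, offsets = [], [0]
    for delta in ordered:
        literals.extend(delta)
        offsets.append(len(literals))
    return literals, offsets

def nogood(literals, offsets, k):
    """Direct access to the k-th stored nogood through its offset."""
    return literals[offsets[k]:offsets[k + 1]]

def assign(A, p, value, level):
    """Record atom p as true (+level) or false (-level); levels start at 1."""
    A[p] = level if value else -level

def holds(A, lit):
    """True if the signed atom lit belongs to the assignment stored in A."""
    p = abs(lit)
    return A[p] != 0 and (A[p] > 0) == (lit > 0)

literals, offsets = build_csr([[1, -2, 3], [2, -3], [-1]])
A = [0] * 4                 # index 0 unused, atoms 1..3 unassigned
assign(A, 2, True, 1)       # T 2 at decision level 1
assign(A, 3, False, 2)      # F 3 at decision level 2
print(literals, offsets, A)
```

Because the sign of A[p] carries the truth value and its magnitude the decision level, a single integer read answers both "is p assigned?" and "at which level?".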
The basic structure of the YASMIN solver is shown in Alg. 1. We adopt the following notation: for each signed atom p, let $\overline{p}$ represent the same atom with opposite sign. Moreover, let us refer to the stored set of nogoods simply by the variable ∆. The variable cdl (initialized in line 1) represents the current decision level. As mentioned, cdl acts as a counter that keeps track of the current number of decisions that have been made.
Since the set of input nogoods may include some unitary nogoods, a preliminary parallel computation partially initializes A accordingly (line 2). It may be the case that inconsistent assignments occur in this phase. In such a case a flag Viol is set, the given program Π is declared unsatisfiable (line 3), and the computation ends. Notice that the algorithm can be restarted several times; typically, this happens when more than one solution is requested or if a restart strategy is activated by command-line options. (For simplicity, we did not include the code for restarting the solver in Alg. 1.) In such cases, InitialPropagation() also handles unit nogoods that have been learned in the previous execution. The kernel invocation in line 2 specifies a grid of b blocks each composed of t threads. The mapping is one-to-one between threads and unitary nogoods. In particular, if k is the number of unitary nogoods, $b = \lceil k/TPB \rceil$ and $t = TPB$, where TPB is the number of threads-per-block specified via command-line option. The loop in lines 4-14 computes the answer set, if any. Propagation is performed by the procedure PropagateAndCheck() in line 5, which also checks whether nogood violations occur. To better exploit the SIMT parallelism and maximize the number of concurrently active threads, in each device computation the workload has to be divided among the threads of the grid as uniformly as possible. To this aim, PropagateAndCheck() launches multiple kernels: one kernel deals with all nogoods with exactly two literals; a second one processes the nogoods composed of three literals, and a further kernel processes all remaining nogoods. In this manner, threads of the same grid process a uniform number of atoms, reducing the divergence between them and minimizing the number of inactive threads. Moreover, because, as mentioned, nogoods of the same length are stored contiguously, threads of the same grid are expected to realize coalesced accesses to global memory.
A more detailed description of the third of such device functions is given in Sect. 4.1. A similar technique is used in PropagateAndCheck() to process those nogoods that are learned at run-time through the conflict analysis step (cf. Sect. 4.2). These nogoods are partitioned depending on their cardinality and processed by different kernels, accordingly. In general, if n is the number of nogoods of one partition, the corresponding kernel has $b = \lceil n/TPB \rceil$ blocks of $t = TPB$ threads each. Each thread processes one learned nogood.
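The partitioning of nogoods by length and the resulting grid sizes amount to simple integer arithmetic, sketched below on the host side. The sketch assumes that nogoods are indexed through an offsets array with one extra trailing entry (so that the length of nogood k is offsets[k+1] − offsets[k]) and that unit nogoods are handled separately; the group names and the launch_plan helper are hypothetical.

```python
def grid_dim(n, tpb):
    """Number of blocks so that n work items each get a thread: ceil(n / tpb)."""
    return (n + tpb - 1) // tpb

def partition_by_length(offsets):
    """Group nogood identifiers into lengths 2, 3, and greater than 3."""
    groups = {"len2": [], "len3": [], "long": []}
    for k in range(len(offsets) - 1):
        length = offsets[k + 1] - offsets[k]
        if length == 2:
            groups["len2"].append(k)
        elif length == 3:
            groups["len3"].append(k)
        elif length > 3:
            groups["long"].append(k)
    return groups

def launch_plan(offsets, tpb):
    """(blocks, threads-per-block) for each non-empty partition."""
    return {name: (grid_dim(len(ids), tpb), tpb)
            for name, ids in partition_by_length(offsets).items() if ids}

offsets = [0, 1, 3, 5, 8, 12, 17]   # nogood lengths 1, 2, 2, 3, 4, 5
print(partition_by_length(offsets))
print(launch_plan(offsets, tpb=256))
```

With one thread per nogood, the last block of each grid is in general only partially used, which is why a kernel body normally guards against thread indices beyond the partition size.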
Propagation stops because either a fixpoint is reached (no more propagations are possible) or one or more conflicts occur. In the latter case, if the current decision level is the top one, the solver ends: no solution exists (line 6). Otherwise (lines 7-9), conflict analysis (Learning()) is performed and then the solver backjumps to a previous decision point (line 9). The learning procedure is described in Sect. 4.2. A specific kernel Backjump() takes care of updating the value of cdl and the array that stores the assignment. A one-to-one mapping between threads and atoms in A is used.
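Given the integer encoding of the assignment (0 for unassigned, ±i for true/false at decision level i), the effect of a backjump can be sketched as a per-atom update. The loop below stands in for the one-thread-per-atom mapping; the function name and signature are illustrative and are not taken from the solver's code.

```python
def backjump(A, target_level):
    """Unassign every atom assigned above target_level and return the
    new current decision level."""
    for p in range(len(A)):            # on the GPU: one thread per atom p
        if abs(A[p]) > target_level:
            A[p] = 0
    return target_level

A = [0, 1, -2, 3, -3, 2]
cdl = backjump(A, 2)
print(A, cdl)
```

Each iteration touches only its own entry of A, so the update is free of data dependencies between atoms.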
On the other hand, if no conflict occurs and A is not complete, a new Decision() is made (line 11). As mentioned, the purpose of this kernel is to determine an unassigned atom p which is head of an applicable rule r. All candidates p and applicable r are evaluated in parallel according to a typical heuristic to rank the atoms. Possible criteria, selectable by command-line options, use the number of positive/negative occurrences of atoms in the program (by either simply counting the occurrences or by applying the Jeroslow-Wang heuristic) or the "activity" of atoms [2]. The first access to global memory to retrieve the needed data is done in a coalesced manner (a one-to-one mapping between threads and rules is used). Then, a logarithmic parallel reduction scheme, implemented using thread-shuffling to avoid further accesses to global memory, yields the rule r with highest ranking. Its head is selected and set to true in the assignment. Decision() also communicates to the solver whether no applicable rule exists (line 12). In this case all unassigned heads in Π are assigned false (by the kernel CompleteAssignment() in line 13). A subsequent invocation of PropagateAndCheck() validates the answer set and the solver ends in line 14.
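The ranking-and-reduction step can be illustrated with a sequential sketch. Two simplifications are assumed here: the Jeroslow-Wang score is computed per atom over nogoods encoded as lists of signed integers (a simplified adaptation, not necessarily the weighting used in the solver), and the warp-shuffle reduction is replaced by an in-place pairwise reduction over a list, which has the same logarithmic number of rounds. Ties are broken in favor of the lower index, an arbitrary choice.

```python
def jeroslow_wang(atom, nogoods):
    """Sum 2^-|delta| over the nogoods delta in which the atom occurs."""
    return sum(2.0 ** -len(d) for d in nogoods
               if any(abs(l) == atom for l in d))

def tree_argmax(scores):
    """Pairwise reduction over a non-empty list; returns (best_score, index).
    Each pass of the outer loop corresponds to one parallel round."""
    items = [(s, i) for i, s in enumerate(scores)]
    stride = 1
    while stride < len(items):
        for i in range(0, len(items) - stride, 2 * stride):
            items[i] = max(items[i], items[i + stride],
                           key=lambda t: (t[0], -t[1]))
        stride *= 2
    return items[0]

nogoods = [[1, -2], [2, 3, -4], [-1, 2, 4]]
scores = [jeroslow_wang(a, nogoods) for a in (1, 2, 3, 4)]
print(scores, tree_argmax(scores))
```

For a list of n candidates the reduction needs about log2(n) rounds, each of which could be executed by independent threads.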
The propagate-and-check procedure
After each assignment of an atom of the current partial assignment A, each nogood δ needs to be analyzed to detect whether: (1) it is violated, or (2) there is exactly one literal p in it that is unassigned in A, in which case an inference step adds p to A (cf., Sect. 2). The procedure is repeated until a fixpoint is reached. As seen earlier, this task is performed by the kernels launched by the procedure PropagateAndCheck().
Alg. 2 shows the device code of the generic kernel dealing with nogoods of length greater than three (the others are simpler). The execution of each iteration is driven by the atoms that have been assigned a truth value in the previous iteration (array Last in Alg. 2). Thus, each kernel involves a number of blocks that is equal to the number of such assigned atoms. The threads in each block process the nogoods that share the same assigned atom. The number of threads of each block is established by considering the number of occurrences of each assigned atom in the input nogoods. Observe that the dimension of the grid may change between two consecutive invocations of the same kernel, and, as such, it is evaluated each time. Specific data structures (initialized once during a pre-processing phase and stored in the sparse matrix Map[][] in Alg. 2) are used in order to determine, after each iteration and for each assigned atom, which input nogoods are to be considered. A further technique is adopted to improve performance. Namely, the processing of nogoods is realized by implementing a standard technique based on watched literals [2]. In this case, each thread accesses the watched literals of a nogood and acts accordingly. The combination of nogood sorting and the use of watched literals improves the workload balancing among threads and mitigates thread divergence. (Watched literals are exploited also for long learned nogoods.)
Concerning Alg. 2, each thread of the grid first retrieves one of the atoms propagated during the previous step (line 1). Threads of the same block obtain the same atom L. In line 2, threads access the data structure Map, mentioned earlier, to retrieve the number ngInBlock of nogoods to be processed by the block. In line 5 each thread of the block determines which nogood has to be processed and retrieves its watched literals (lines 6-7). In case one or both literals belong to the current assignment A, suitable substitutes are sought (lines 10 and 14). Violation might be detected (lines 12 and 19, resp.) or propagation might occur (lines 16-18). Notice that concurrent threads might try to propagate the same atom (possibly with different sign), originating race conditions. The use of atomic functions (line 16) allows one nondeterministically chosen thread t to perform the propagation. Other threads may discover agreement or detect inconsistency w.r.t. the value set by t (line 19). In line 17 the thread t updates the set Next of propagated atoms (to be used in the subsequent iteration) and stores (line 18) information needed in future conflict analysis steps (by means of mk_dl_bitmap(), to be described in Sect. 4.2) concerning the causes of the propagation.
The learning procedure
As mentioned, the Learning() procedure is used to resolve a conflict detected by PropagateAndCheck() and to identify a decision level the computation should backjump to, in order to remove the violation. The analysis usually performed in ASP solvers such as CLINGO [11] proved rather unsuitable for SIMT parallelism. This is due to the fact that an inherently sequential chain of resolution-like steps must be encoded.
In the case of the parallel solver YASMIN, more than one conflict might be detected by PropagateAndCheck(). The solver selects one or more of them (heuristics can be applied to perform such a selection; for instance, priority can be assigned to shorter nogoods). For each selected conflict, a grid of a single block, to facilitate synchronization, is run to perform a sequence of resolution steps, starting from the conflicting nogood (say, δ), and proceeding backward, by resolving upon the last but one assigned atom σ ∈ δ. The step involves δ and a nogood ε including σ. Resolution steps end as soon as the last two assigned atoms in δ correspond to different decision levels. This approach identifies the first UIP [2]. Alg. 3 shows the pseudo-code of such a procedure.
Table 1. Some instances used in experiments. The table shows: shorthand IDs, instance names (taken from [6]), the numbers of nogoods/atoms given as input to the solving phase of YASMIN.
Experimental Results
In this section we briefly report on some experiments we ran to compare the two learning techniques described in the previous section. Table 1 shows a selection of the instances (taken from [6]) we used. For each instance the table indicates, together with an ID, the number of nogoods and the number of atoms.
Experiments were run on a Linux PC (running Ubuntu Linux v.19.04), used as host machine, and using as device a Tesla K40c Nvidia GPU with these characteristics: 2880 CUDA cores at 0.75 GHz, 12GB of global device memory. We used on such GPU the CUDA runtime version 10.1. The compute capability was 3.5. 
Conclusions
In this paper we described the main traits of a CUDA-based solver for Answer Set Programming. The fact that the algorithms involved in ASP-solving present an irregular and low-arithmetic intensity, usually combined with data-dependent control flows, makes it difficult to achieve high performance without adopting proper sophisticated solutions and fulfilling suitable programming directives. We dealt with the basic software architecture of a parallel prototypical solver with the main aim of demonstrating that GPU-computing can be exploited in ASP solving. Much is left to do in order to obtain a full-blown parallel solver able to compete with the existing state-of-the-art solvers. First, effort has to be made in enhancing the parallel solver with the collection of heuristics proficiently used to guide the search in sequential solvers. Indeed, experimental comparisons [6] show that good heuristics might be the most effective component of a solver. Second, the applicability of further techniques and refinements has to be investigated. For instance, techniques such as parallel lookahead [4] and multiple learning [10] should be considered. Also, the possibility of developing a distributed parallel solver that operates on multiple GPUs represents a challenging theme of research.
