Computing the partition function of a spin lattice is known to be a NP-hard problem that becomes intractable as the size of the lattice increases. More in detail, the time needed to compute the transfer matrix for a given strip lattice grows exponentially as a function T (|V |, |E|) of vertices and edges. There is a high interest in knowing the exact partition functions of many different strip lattices since they provide a first approach on knowing the eventual phase transitions at the thermodynamical equilibrium in the full plane. With the fast evolution of CPUs and GPUs towards a higher number of cores, parallel computing becomes fundamental for reducing the computational time of T (|V |, |E|). In this work, we implement a parallel method for computing the transfer matrix of any strip lattice. The solution utilizes a forest of subtrees of heigh h = 1 that work as a cache for the computation of the rows. Also, the implementation is NUMA-aware and achieves scalable performance in multi-socket machines. Our experiments report speedups of up to 3x using a quad core machine and there is also an extra speedup of 2x when the problem is symmetric in topology.
Introduction
The Potts model [16] has been widely used to study physical phenomena of spin lattices such as phase transitions in the thermodynamical equilibrium. Topologies such as triangular, honeycomb, square, kagome among others are of high interest and are being studied frequently (see [6, 18, 5, 4] ). When the number of possible spin states is set to q = 2, the Potts model becomes the classic Ising model [13] which has been solved analytically for the whole plane by Onsager in 1944 [15] . Unfortunately, for higher values of q no solution has been found yet. Therefore, studying strip lattices becomes a natural approach for achieving an exact but finite representation of the bidimensional plane. The wider the strip, the better the representation. Hopefully, by increasing the width enough, some properties of the full plane would emerge. One known technique for obtaining the partition function of a strip lattice is to compute a transfer matrix based on the periodic information of the system. Building the transfer matrix is not free of expensive combinatorial computations and exponential cost algorithms. In fact, the problem requires the computation of partition functions which are NP-hard problems [23] . With the fast evolution of multi-core CPUs and GPUs, parallel computing is not anymore limited to clusters or super-computing, it also includes the shared memory model for workstation parallel computing. Latest work in the field of computational physics for the Potts model has been focused on parallel probabilistic simulations [9, 20] and new sequential methods for computing the exact partition function of a lattice [12, 2, 17] . Even when the exact methods have much higher cost than probabilistic ones, they are still important because by knowing the exact invariants one can obtain many thermodynamical properties of the system. In the case of strip lattices, this is traduced to being able to compute the transfer matrix efficiently and fast in order to keep stuying widther and more sophisticated topologies. In this work we propose two parallelization schemes for computing the transfer matrix of a strip lattice; (1) a parallel and partially reductable deletion-contraction method and (2) a parallel building strategy for the transfer matrix based on a combinatorial hash table. As an additional improvement, symmetric lattices take approximately half the time to compute. The paper is organized as follows: section (2) defines the fundamental physical concepts used along this manuscript as well as related work, sections (3) and (4) explain the details of the algorithm and the additional optimizations available appart from parallelism itself. In section (5) we present experimental measures for both parallelization schemes implemented for a multi-core CPU. Section (6) concludes our work and identifies future work.
Preliminaries and related work
Let G = (V, E) be a lattice with |V | vertices, |E| edges and σ i be the state of a spin of G with i ∈ [1..q]. The Potts partition function Z(G, q, β) is defined as
where β = 1 KB T , K B is Boltzmann constant and h(G r ) is the energy of the lattice at a given state G r
1 . The Potts model defines the energy of a state G r with the following Hamiltonian:
Where i, j corresponds to the edge from vertex v i to v j , r ∈ [1..q |V | ], J is the interaction energy (J < 0 for antiferromagnetic and J > 0 for ferromagnetic) and δ si,sj corresponds to the Kronecker delta evaluated at the pair of spins i, j with states σ i , σ j and expressed as
The free energy of the system is computed as:
As the lattice becomes bigger in the number of vertices and edges, the computation of equation (1) becomes rapidly intractable with an exponential cost of Θ(q |V | ). In practice, an equivalent recursive method is more convenient than the original definition. The deletion-contraction method, or DC method, was initially used to compute the Tutte polynomial [21] and was then extended to the Potts model after the relation found between the two (see [24, 19] ). DC re-defines Z(..) as the following recursive equation
G − e is the deletion operation, G/e is the contraction operation and the auxiliary variabale v = e −βJ − 1 makes Z(..) a polynomial. There are three special cases when DC can perform a recursive step with linear cost:
if {e} is a spike.
The computational complexity of DC has a direct upper bound of O(2 |E| ). When |E| >> |V | then a tighter bound is known based on the Fibonacci sequence complexity [22] ; O((
2 ) |V |+|E| ). In general, the time complexity of DC can be written as
Haggard's et al. [11] work is considered the best implementation of DC for computing the Tutte polynomial for any given graph. Their algorithm, even when it's exponential in time, reduces the computation tree in the presence of loops, multi-edges, cycles and biconnected graphs (as one-step reductions). An important contribution by the authors is that by using a cache, some computations can be reused (i.e sub-graphs that are isomorphic to the ones stored in the cache dont need to be computed again). An alternative algorithm was proposed by Björklund et al. [3] which accomplishes exponential time only on the number of vertices; O(2 n n O(1) ) with n = |V |. Asymtotically their method is better than DC considering that many interesting lattices have more edges than vertices. However, Haggard et al [11] has stated that their memory usage is too high and limits the practical usage of the algorithm. For the case of strip lattices, a full application of DC is not practicable since the exponential cost would grow as a function of the strip length, making any computation rapidly intractable. The transfer matrix technique, mixed with DC is a better choice since the exponential cost will not depend on the length of the strip, but instead just on the size of the period. As soon as the matrix is built, computing for the strips length has a polynomial cost. A transfer matrix algorithm that works for any strip lattice is not trivial; an special DC algorithm is needed, called partial DC that stops a recursion branch when edges of the next period are met, resulting in partial partition functions asociated to combinatorial labels that act as incognits. Also, hash tables or binary search tables are needed in order to search and asociate the the partial partition functions with the combinatorial labels consistently and fast enough. A bidimensional strip lattice is as a graph G = (V, E) that is periodic in one dimension. It can be perceived as a concatenation of n subgraphs K sharing their boundary vertices and edges. Let P be the set of all possible strip lattices, then we formally define G:
is the special operator for concatenating the periods defined as
Each period connects to the other periods K i−1 and K i+1 , except for K 1 and K n which are the external ones and connect to K 2 and K n−1 respectively (see Figure 1 ). The main computational challenge when using a matrix transfer based algorithm is the cost of building it because its size increases exponentially as a function of the width of the strip. The problem of matrix size has been improved by analytic techniques [14] . However, the authors specify that these techniques are only applicable to square and triangular lattices using values of q = 2 and q = 3 (Ising and three-state Potts respectively). For this reason, we preffer to use a more traditional transfer matrix technique based on a combinatorial amount of DC repetitions with the advantage of being able to q ∈ R and apply parallelism into the process.
3 Algorithm overview.
Data structure
Our definition of K from section (2) (Figure 1) will be used to model the data structures. Given any strip lattice G, only the last part of the strip lattice is needed, that is, K n . We will refer to K n as K σ1 to denote the basic case of K n without modifications. For consistency, some vertices of K σ1 are labeled as internal or external. Internal vertices correspond to the left-most ones, connect with K n−1 (see Figure 2 ) and are indexed top-down from 0 to m − 1 with m being the width of the strip. External vertices are the ones of the periphery of K σ1 and are indexed top-down from m to 2m − 1. 
Compute transfer matrix M and initial conditions Z 1 in parallel.
Computing the transfer matrix M is a process that involves combinatorial operations over the basic period K σ1 . For a better explanation of the algorithm, we introduce two terminologies; initial configurations and terminal configurations. These configurations define a combinatorial sequence of identifications for vertices and belong to the set of all non-crossing partitions. The number of initial and terminal configurations is the same, we denote it C(m) as a function of the width of the strip and it can be computed using the expression for the Catalan numbers:
, define a combinatorial sequence of identifications just on the external vertices. The terminal configurations, denoted ϕ j with j ∈ [0..C(m) − 1], define a combinatorial sequence of identifications just on the internal vertices. Initial configurations generate terminal ones (using the DC method) but not vice versa. As stated before, K σ1 is the basic case and matches K n . That is, σ 1 is the initial configuration where no identifications are applied to the external vertices (vertices remain intact). It is also equivalent to say σ 1 is the null configuration. Similarly, ϕ 1 corresponds to the base case and matches the internal vertices of K n (then ϕ 1 is the null configuration for the internal vertices). Additionally, terminal configurations imply that the period has a maximum of m vertices, all internal (i.e remaining vertices of K σi after applying DC to all except the internal vertices and the edges connecting them). The transfer matrix M is computed in rows by repeatedly applying DC, each time to a different K σi , a total of C(m) times. Each repetition contributes to a row of M . By default, the algorithm cannot know the C(m) different sequences of terminal and initial configurations except for K σ1 which is given as part of the strip lattice and is the one that triggers the computation. This is indeed a problem for parallelization. H, S ) ) r e t u r n f o r ( i n t k=0; k<A. s i z e ( ) ; k++) f o r ( i n t j =1; j <A[ k ] . s i z e ( ) ; j ++) f o r ( i n t i =0; i <j ; i ++) i f ( c a n i d e n t i f y (A[ k ] , i , j ) ) cA := copy (A) c s := copy ( s ) i d e n t i f y ( cA , i , j , k , c s )
cs , H, S ) }
Once g(..) is computed, H and S serve for building the matrix in parallel. Let p be the number of processors available and with the i-th processor computing C(m)/p rows consecutievly starting from row i ′ = iC(m)/p of M . Thanks to S, the initial configuration sequence can be obtained in parallel under the CREW variation of the PRAM model (processors read and apply S[i ′ ] to the external vertices of their own local copy of the base case K σ1 ). After each processor builds their corresponding K σ ′ i graph, they perform the DC procedure in parallel. This DC procedure is only partial because edges that connect two internal vertices must never be deleted neither contracted. This menas that the DC procedure stops with a remanent of the graph (internal edges connected to internal vertices) and up to C(m) different terminal configurations can exist in the leaves of the recursion tree, asociated with an expression in q and v (sometimes not all C(m) terminal configuration sequences are generated but only some of them. That is why the generator function is so much needed in order to work in parallel). The job is to group and add together all the expressions from the leaves of the DC recursion tree that share the same terminal configuration sequence. For each terminal configuration ϕ j , its sequence is known because it was built dynamically along the branches of the DC recursion tree. Dynamic creation of the terminal configuration sequences is achieved by using a small algebra to combine the identifications found on the internal vertices. An identification of two internal vertices [v i , v j ] will be denoted as δ(i, j). Each additional identification adds up to the previous ones to finally form a sequence of a terminal configuration δ(x 1 , y 2 )+δ(x 2 , y 2 )+...+δ(x n , y n ). Additionally, the following properties are defined to reduce the sequence to a minimal expression:
A lexicographical order is applied to each reduced sequence for faster comparison and avoid checking properties (10) and (11) . The terminal configuration sequence is necessary but not sufficient for knowing its index j in M . This is where H becomes useful for knowing with O (1) cost what is the actual index j for a given terminal configuration sequence to build the coordinate i, j and form the partial partition function z i,j (q, v) in M i,j (the row index is already known by the processor index).
The main idea of the parallelization scheme can be illustrated by using Foster's [10] four-step strategy for building parallel algorithms; partitioning, communication, agglomeration, mapping. Figure 3 shows an example using a dual core CPU. Basically, the idea is to give a small amount of k consecutive rows to each processor (for example k = 2, 3, 4, 5) with an offset of b = pk rows per processor (p is the number of processors) assuming that work is balanced and processing is synchronous (i.e blocks of computation). If the work per row is unbalanced, then processing is better to be asynchronous. This means that the offset must be dynamic depending on which processor is the first one to finnish their k consecutive rows. The asymptotic complexity for computing M under the PRAM model using the CREW variation is upper-bounded by:
The complexity equals the cost of applying DC times C(m) in parallel with p processors. When all processors end, the final transfer matrix M is size C(m) x C(m) as shown ahead:
The partition function for a strip lattice of length n and width m is computed as a vector Z:
By solving the recurrence, Z becomes:
Z 1 is the initial conditions vector and is computed by applying DC to each one of the C(m) terminal configurations:
The computation of Z 1 has very little impact on the overall cost of the algorithm. In fact the cost is practically O(m) because a terminal configuration contains mostly spikes and/or loops. Moreover, terminal states can be computed even faster by using the serial and parallel optimizations. Finally, the first element of Z is the partition function of studied the strip lattice. After this point, Z is used to study a wide range of physical phenomena.Álvarez and Canfora et al. [1] have reported new exact results for strip lattices such as the kagome of width m = 5.
4 Algorithm improvements
Serial and Parallel paths
The DC contraction procedure can be further optimized for graphs that present serial or parallel paths along their computation (see Figure 4 ). Let v a and v b be the first and last vertexes of a path, respectively. A serial path s is a set of edges e 1 , e 2 , ..., e n that connect sequentially n − 1 internal vertexes between v a and v b . It is possible to process a serial path of n edges in one recursion step by using the following expression;
On the other hand, a parallel path p is a set of edges e 1 , e 2 , ..., e n where each one connects redundantly v a and v b , forming n possible paths between v a and v b . It is also possible to process a parallel path of n edges in one recursion step by using the following expression;
Lattice Symmetry
A very important optimization is to detect the lattice's symmetry when building the matrix. By detecting symmetry, the matrix size is significantly lower because all symmetric pair of terminal states are grouped as one terminal state. As the width of the strip lattice increases, the number of symmetric states increases too, leading to matrices almost a quarter of the size of the original M . We stablish that two terminal states ϕ a and ϕ b with keys δ a (a 1 , ..., a n ) and δ b (b 1 , ..., b n ) respectively are symmetric if:
The resulting matrix M s has a different numerical sequence of sizes than the original M which obeyed the Calatan numbers. In this case, we denote the size of the matrix as D(m) and it's equal to:
As m grows, the 
Experimental results
We performed experimental tests for the parallel transfer matrix method implemented with OpenMP [7] . The computer used for all tests is the following. We measure the execution time for building the transfer matrix and from this metric we compute other performance measures such as speedup, efficiency and knee [8] . On Figure 5 we show the four performance measures for the parallel computation of the transfer matrix with two different strip lattices; kagome width m = 6 and square with m = 10. As expected, execution time decreases as more processors are used, achieving up to 3x speedup when using 4 processors. For the case of the kagome lattice, speedup obtained was approximately 2.25x. The problem is memory intensive so speedup will vary depending on the amount of reads and writes from main memory (i.e depends on the cost of DC). Efficiency decreases linearly and is lower on intensive DC problems and higher in big matrix problems. The knee tells how many processors give the best efficiency and computing time relation (i.e the point in the lower-right most corner). For our experiments, our knee point occurs when using 3 processors, both for the kagome and square cases.
Conclusions
Parallelizing the computation of a partition function for strip lattices is not a trivial task from the point of view of combinatorics and memory usage. In this work we presented a method for computing transfer matrices that benefits from parallelism and improves its performance by using the lattice's symmetry and serial/parallel paths. The small algebra rules used for the terminal configurations permited the dynamic creation of sequences under the DC process, in a lexicographic order. For strip lattices of width m ≥ 7, we reccomend treating the matrix numerically with q and v already set in order to fit the problem in memory. For smaller problems the matrix can be handled symbolically with the help of a library. Based on our experiments, we can confirm that the solution achieves a speedup of 3x using four cores. Special care must be put for NUMA architectures in order to achieve scalable performance. To use the NUMA potential, each processor should have its own copy of the H hash-table and S array, as well as have its corresponding rows allocated in its local memory. Even when this work was aimed at multicore architectures, we are aware that a distributed enviroment can become very useful to achieve even higher parallelism. Since rows are fully independant, sets of rows can be computed on different nodes. In terms of communication, just two data transfers are needed per slave node; master sends the hash tables and the slave node sends back the computed row as a numerical array of rows. Considering how computer architectures are evolving nowadays, we think that the solution proposed here is very likely to gain importance in the future with the upcoming family of multicore CPUs.
