The computational cost of transfer matrix methods for the Potts model is related to the question into how many ways can two layers of a lattice be connected?. Answering the question leads to the generation of a combinatorial set of lattice configurations. This set defines the configuration space of the problem, and the smaller it is, the faster the transfer matrix can be computed. The configuration space of generic (q, v) transfer matrix methods for strips is in the order of the Catalan numbers, which grows asymptotically as O(4 m ) where m is the width of the strip. Other transfer matrix methods with a smaller configuration space indeed exist but they make assumptions on the temperature, number of spin states, or restrict the structure of the lattice. In this paper we propose a parallel algorithm that uses a sub-Catalan configuration space of O(3 m ) to build the generic (q, v) transfer matrix in a compressed form. The improvement is achieved by grouping the original set of Catalan configurations into a forest of family trees, in such a way that the solution to the problem is now computed by solving the root node of each family. As a result, the algorithm becomes exponentially faster than the Catalan approach while still highly parallel. The resulting matrix is stored in a compressed form using O(3 m × 4 m ) of space, making numerical evaluation and decompression to be faster than evaluating the matrix in its O(4 m × 4 m ) uncompressed form. Experimental results for different sizes of strip lattices show that the parallel family trees (PFT) strategy indeed runs exponentially faster than the Catalan Parallel Method (CPM), especially when dealing with dense transfer matrices. In terms of parallel performance, we report strongscaling speedups of up to 5.7X when running on an 8-core shared memory machine and 28X for a 32-core cluster. The best balance of speedup and efficiency for the multi-core machine was achieved when using p = 4 processors, while for the cluster scenario it was in the range p ∈ [8, 10]. Because of the parallel capabilities of the algorithm, a large-scale execution of the parallel family trees strategy in a supercomputer could contribute to the study of wider strip lattices.
Introduction
The Potts model [1] has been widely used to study physical phenomena of spin lattices such as phase transitions [2] in the thermodynamical equilibrium. Lattices such as square, triangular, honeycomb and kagome are of high interest and are being studied frequently [3, 4, 5, 6] . When the number of possible spin states is set to q = 2, the Potts model becomes the classic Ising model [7] , which was solved by Onsager [8] for the infinite-volume limit on a torus. For higher values of q the problem becomes much harder and no solution has been found yet. Nevertheless, it is of interest to study the problem in the form of a strip lattice. Hopefully, the study of sufficiently wide strips could contribute at understanding the physical properties of such complex systems under different boundary conditions.
An effective technique for obtaining the partition function of strip lattices is to compute its transfer matrix, denoted M . The transfer matrix technique allows the study of strips that repeat their lattice structure along one of its dimensions. M can be computed symbolically or numerically (fully or partial) evaluated on (q, v). When there is enough disk space, we find that it is more convenient to compute M using polynomials on (q, v). Indeed, computing M with general (q, v) has an impact on performance and memory, but it gives the advantage that M will not have to be re-computed many times when doing numerical sweeps for q and v. Another advantage is that from the general (q, v) transfer matrix one can generate many partially evaluated instances of the transfer matrix that can be used later for numerical sweeps on the remaining parameter. For limited computational resources, generating M partially or fully evaluated is a practical choice.
If the strip lattice represents an infinite band, then analysis can be performed by computing the eigenvalues of M . If the strip lattice is finite, then a initial condition vector Z 1 is needed. In that case, boundary conditions have to be specified. Typical boundary conditions are free, periodic, cylindrical and cyclic. M and Z 1 together form a partition function vector Z based on the following recursion:
Computing the powers of M n−1 is done in a numerical context, otherwise memory usage would become intractable. When M n−1 is computed, the first element of Z becomes the partition function of the strip lattice. This work focuses on the process of building M , which is an NP-hard problem [9] where exponential cost algorithms are involved in the process, with the width m as the exponent. There are different approaches for building M : (1) In the spin representation approach, an integer value is chosen for q and the transfer matrix T is obtained by combining the different spin configurations in the graph layer. Under this approach, the size of M becomes q |V | × q |V | , where |V | is the number of spins in the layer of the strip. A more detailed explanation on the spin representation approach is available in the first of the six works by Salas, Sokal and Jacobsen series of papers [10] . (2) One can also obtain M as a product of sparse matrices of asymptotic size O(4 m ) [11] , one per edge and practically linear in the number of edges, where M is not constructed explicitly but only its action on a given vector of states. (3) Alternatively one can compute M with a generic (q, v) method where the configuration space grows proportional to the Catalan numbers [12] or asymptotically as O(4 m ), leading to a matrix of size O(4 m × 4 m ). Indeed there are other strategies that can achieve smaller transfer matrices [13, 14, 15] , but they assume special properties for the lattice, work only for finite graphs or need to fix the values of v and/or q in order to take any advantage. We believe it is worth studying what are the possibilities for algorithmic improvements in the generic (q, v) Catalan based approach since it is a general method applicable to any planar strip.
In the light of these aspects just mentioned, we ask question 1: Is there a generic (q, v) method that can compute the transfer matrix for any planar strip lattice, using a sub-Catalan configuration space?. From our research we have found that: a hierarchical symmetry exists among elements of the configuration space that define the transfer matrix. This symmetry is revealed when first applying deletion-contraction to certain edges of the strip layer. If this symmetry is used so that the configuration space is re-organized as a forest of hierarchical families, then a parallel computation only on the root nodes is sufficient for generating a compressed transfer matrix. When exploiting this symmetry, the configuration space is reduced from O(4 m ) to O(3 m ), which is an improvement to the actual bound on general transfer matrix methods for strips. This result allows us to answer positively to question 1.
With the evolution of computer architectures towards a higher amount of cores [16, 17] , parallel computing is not anymore limited to clusters or super-computing; workstations can also provide high performance for solving physical problems [18] . It is in this last category where most of the scientific community lies, therefore parallel implementations for multicore machines are the ones to have the largest impact on the community. Considering how technology is changing, we ask question 2: Can transfer matrix methods work in parallel for modern multi-core architectures and scale their performance efficiently as more processors are used?. Given the amount of data-parallelism on the number of root nodes, the performance of the algorithm scales efficiently as more processors are used. Results on a multi-core 8-core machine show a speedup of 5.7X is achieved when using p = 8 processors, and an efficiency of 95% is achieved when using p = 4. Results on a 32-core cluster confirm that the implementation can scale in a distributed scenario, achieving a speedup of 28X when using p = 32 processors and an efficiency of over 90% for the full range p ∈ [1, 32] when dealing with large square strips. We can also confirm that a compressed transfer matrix not only saves data space in comparison to the original one, but it is also faster to load considering that it must be first evaluated for any practical usage. In the case of cluster performance, a dynamic scheduler is mandatory in order to bypass potential performance valleys that are caused by the combination of unbalanced work and a static scheduler. Again, this result allows a positive answer for question 2.
The paper is organized as follows: Section 2 covers preliminary concepts of the Potts model, Section 3 describes related work. Sections 4 and 5 explain the algorithm and the additional optimizations. Section 6 provides details about the implementation while in section 7 we present detailed results for running time, speedup, efficiency and knee, using different amount of processors. We also compare performance against the Catalan Parallel Method (CPM) [19] . Section 8 is devoted to the validation of the algorithm by computing some physical results; from limiting curves to energy and specific heat, and comparing them to the results obtained by other authors. Section 9 discusses our main results and concludes the impact of our work.
Preliminaries
Let G = (V, E) be a lattice with |V | vertices, |E| edges and s i be the state of a spin of G with s i ∈ [1..q] and i ∈ [1, |V |]. The partition function Z(G, q, β) is defined as Z(G, q, β) = r e −βh(Gr) (2) where β = 
Where i, j corresponds to the nearest neighbor edge from vertex v i to v j , r ∈ [1..q |V | ], J is the interaction energy (J < 0 for anti-ferromagnetic and J > 0 for ferromagnetic) and δ s i ,s j corresponds to the Kronecker delta evaluated at the pair of spins i, j with states s i , s j and expressed as
As the lattice becomes larger in the number of vertices and edges, the computation of equation (2) becomes rapidly intractable with an exponential cost of Θ(q |V | ). In practice, one can use equivalent methods that, while still exponential, in practice run faster than the original definition.
The deletion-contraction method [20] , or DC method, was initially used to compute the Tutte polynomial [21] and was then extended to the Potts model after a relation of duality was found between the two (see [22, 23] ). DC re-defines Z(..) as the following recursive equation:
Where G − e is the deletion operation, G/e is the contraction operation and the auxiliary variable v = e −βJ − 1 makes Z(..) a polynomial. There are three special cases where DC can perform a recursive step with linear cost:
The computational complexity of DC has a direct upper bound of O(2 |E| ). When |E| >> |V | a tighter bound is known based on the Fibonacci sequence complexity [20] ; O((
2 ) |V |+|E| ). In general, the time complexity of DC can be written as
A strip lattice is a bidimensional graph G = (V, E) that repeats its pattern at least along one dimension. It can be built as the concatenation of layers K 1 , K 2 , ..., K n sharing their boundary vertices and edges. Figure 1 illustrates how the notion of strip lattice applies to the case of the square and kagome lattices. The transfer matrix, denoted M , takes advantage of the repeating nature of the lattice, allowing the study of very long graphs. In the limit of infinite length the free energy per site becomes: where n K is the number of non-shared vertices per layer and λ + is the dominant eigenvalue of M with nontrivial coefficient associated. The dimension of M grows proportional to a combinatorial function Γ(m), which depends on the size of the base (i.e., the width of G(V, E)) and it represents the different ways in which two layers can connect by combining spin states and identifications. The set of configurations generated by the base corresponds to the configuration space of the problem. The computational cost of a transfer matrix method comes from two sources; (1) the size of the configuration space and (2) the cost of the local algorithm. The sequence generated by Γ(m) corresponds to the size of the configuration space of the problem and, as mentioned earlier, it defines the size of M . The local algorithm is in charge of computing the partition functions for each element of the configuration space.
Related Works
The transfer matrix methods were introduced by Derrida et. al. in 1980 [24] as an approach to study percolation and phenomenological re-normalization. In 1982, Baxter used transfer matrix techniques in his seminal works as a tool for solving statistical mechanics problems [25] . Salas, Sokal and Jacobsen have greatly contributed with a series of results, plus an additional unnumbered one that follows the same line, in which they study the physics of square and triangular strip lattices through the transfer matrix technique [10, 26, 27, 28, 29, 13, 30] . In those works, the authors use different types of algorithmic optimizations for the construction of M based on the symmetries available. Different scenarios are considered along the works, such as the zero temperature (chromatic polynomial) case, ferromagnetic and antiferromagnetic cases, and different boundary conditions such as free, periodic, cylindrical and a special boundary condition that consists of adding two extra vertices on the sides of the strip. Some of the contributions made in these works include the use of non-nearest neighbors partitions for v = −1, sparse matrix factorization, algebraic input from the representation of the Temperley-Lieb algebra, symmetries for different boundary conditions and the computation of the limiting curves or partition function zeroes for the different boundary conditions up to m ≤ 13. State of the art works on the square lattice normally study strips in the range 3 ≤ m ≤ 13. For the case of the square lattice with free boundary conditions, Salas et. al. achieved m = 12 using v = −1 [29] . It should be noted that if v = −1 and free boundary conditions are used, then the configuration space is the one proportional to the Catalan numbers and the problem becomes computationally harder to handle. The problem of the matrix size has also been improved by algebraic techniques [14] in the spin representation, reducing the matrix size when working with q = 2 and q = 3.
The authors studied the square and triangular strips with layers of up to r = 11 spins, which is equivalent to a square strip of width m ≈ 5. Jacobsen et. al. have studied the q-state Potts model for q = 4cos 2 (π/p) being a Beraha number with p > 2 and integer [28] . In the work, the authors study strips of widths in the range m ∈ [2, 6] . The relevance of their work is that they manage to compute the partition function using the RSOS representation. Alvarez et. al. [31] have reported exact results for the kagome strip of width m = 5 using the generic (q, v) Catalan based transfer matrix technique. In contrast to these related works, we are interested in exploring a general (q, v) method that can allow the study of strips in the state of the art range for free boundary conditions using generic (q, v). For simplicity, we will restrict our physical results just to the computation and validation of the limiting curves using free boundary conditions in order to stay within the scope of our work, but not restrict the proposed strategy to these conditions.
More general methods for computing the exact partition function of a lattice have also been proposed [32, 15, 33] . Bedini et. al. [15] proposed a transfer matrix method for computing the partition function of arbitrary graphs using a tree-decomposed transfer matrix technique. For arbitrary graphs, they mean any type of finite graph; i.e., random or regular planar/non-planar graphs. In their work, the authors obtain a sub-exponential complexity when processing random planar graphs. Their algorithm is considered the best so far for arbitrary graphs and the authors manage to achieve results for regular lattices of up to 18×18 sites. If the tree-decomposed transfer matrix method is applied to a strip, the configuration space to explore becomes the same as the traditional transfer matrix methods for strips, i.e., the tree-width becomes the width of the strip and the cost is proportional to the Catalan number of the tree-width. The work is closely related to another result by Jacobsen in which large regular lattices of up to 20 × 21 sites were studied [11] by using a sparse transfer matrix method based on the product of sparse matrices, of dimension 3 m for v = −1 and 4 m for v = −1. The work of Haggard et. al. [34] is considered to have the best implementation of a deletion-contraction technique for the computation of the Tutte polynomial for any arbitrary graph (the Tutte polynomial is the dual of the partition function [22] ). Their algorithm reduces the computation tree in the presence of loops, multi-edges, cycles and biconnected graphs (as one-step reductions). By using a cache, some computations can be reused (i.e., sub-graphs that are isomorphic to the ones stored in the cache do not need to be computed again). An alternative algorithm to Haggard et. al. was proposed by Björklund et. al. [35] which achieves exponential time only in the number of vertices; O(2 n n O(1) ) with n = |V |. Asymptotically their method is better than deletion-contraction considering that many interesting lattices have more edges than vertices. However, Haggard et. al. [34] have stated that the memory usage of Björklund's method is too high for practical use. These techniques, which are more general than the ones from the beginning of this section, cannot be directly compared against the classic transfer matrix approach, nevertheless they still needed to be mentioned as part of the related work background. General techniques compute the transfer matrix efficiently for arbitrary graphs, but do not take advantage of the regular graph structure when it is available. On the other hand, classic transfer matrix methods for strips indeed take advantage of the regular graph structure but for arbitrary graphs are not so efficient because for each layer there is a new non-sparse transfer matrix to be computed. Both strategies play an important role in the study of spin lattices. In our case, we focus on strips with regular graph structure, therefore our approach should be considered as a classic transfer matrix method.
Research on transfer matrices for strip lattices in the Potts model have not reported experimental results on the parallel performance, except for a prior work of the authors [19] that consists of a parallel method for computing general (q, v) transfer matrices using the Catalan approach, which will be named the Catalan Parallel Method (CPM) for the ease of referencing it later on. The CPM method was successfully used to study new widths of the kagome strip [31] with generic (q, v). The present work is a substantial improvement from CPM.
Algorithm overview

Data structure
The definition of G from Section 2 (see Figure 1 ) will be used in this section to explain the input data structure needed by the algorithm. Since the graph is a strip lattice, only layer K n of the graph G is explicitly needed. The following naming scheme is now introduced for distinguishing two types of boundary vertices in the layer: shared vertices and external vertices. For convention, shared vertices are indexed top-down from 0 to m−1 and correspond to the left-most ones of K n , which are being shared with layer K n−1 . External vertices are the right-most ones of K n and are indexed bottom-up from |V | − m to |V | − 1. Figure 2 illustrates the data structure for an square strip of m = 3. 
DC-based transfer matrix computation
When using (q, v) polynomials, the configuration space of generic q transfer matrix methods turns out to be the set of all non-crossing partitions on a sequence of m serially connected vertices. The size of this configuration space is defined by the Catalan numbers:
We will first explain how the transfer matrix can be built from partial DC repetitions and then proceed to the parallel family trees strategy. At this point we introduce two terminologies that are important for the rest of the section; initial configurations and terminal configurations. These configurations define a combinatorial sequence of identifications 2 on the external and shared vertices of layer K n . Initial configurations, denoted σ i with i ∈ [0..C m − 1], define a combinatorial sequence of identifications just on the external vertices of K n . The terminal configurations, denoted ϕ j with j ∈ [0..C m − 1], define a combinatorial sequence of identifications just on the shared vertices of K n . Initial configurations generate terminal ones, through the DC method.
The case of σ 1 is the basic case and matches K n . That is, σ 1 is the initial configuration where no identifications are applied to the external vertices of K n . It is equivalent as saying that σ 1 is the empty partition of the Catalan set. Similarly, ϕ 1 corresponds to the base case where no shared vertices are identified. In other words, ϕ 1 is the empty configuration for the Catalan set on the shared vertices of K n . For illustration, Figure 3 shows the configuration space for the square lattice of width m = 3: In order to compute the transfer matrix M (row by row), one must apply C m partial DCs, each time to a different initial configuration σ i . Each one of the C m partial DC applications generates a row of M in the form of partial partition functions on (q, v), distributed into a maximum of C m terminal configurations. By partial DC we mean to perform DC on the layer, with the corresponding initial configuration σ i applied, but stopping the recursion branches whenever they meet and edge that connects two shared vertices. The stop condition on the recursion branches is needed otherwise one would be processing vertices and edges of the next layer of the strip, breaking the idea of a transfer matrix. For the example of Figure 2 with m = 3, the partial DC is applied to σ 1 , σ 2 , σ 3 , σ 4 and σ 5 from Figure 3 .
An example of a partial DC for the example of m = 3 is illustrated in Figure 4 for the case when computing the first row. The process is analogous for the other four rows of M (i.e., σ 2 , σ 3 , σ 4 and σ 5 ). Once a recursion branch has been stopped, partial partition functions z i,j (q, v) appear associated to remanents of the graph layer. Remanents are parts of the graph layer that cannot be computed (i.e, edges connecting shared vertices) and they match one of the C m possible terminal configurations that can exist. For some initial configurations, not all terminal configurations may be generated from a single DC, but only a subset of them.
A terminal configuration ϕ j contains a unique sequence of planar identifications on the shared vertices that is useful to differentiate one from another. We use the term key to denote such sequences since they allow fast search and modification in a hash table. Proper construction of keys are achieved by using a simple algebra that defines how multiple identifications on shared vertices are combined. A key of n identifications is denoted as Π = π x 1 ,y 1 + π x 2 ,y 2 + ... + π xn,yn . The following properties hold true for keys:
Properties (10) and (11) allow the application of a lexicographical order on the keys, while property (12) allows to combine them using transitivity. There are important differences when comparing this algebra to the partition algebras studied by Halverson and Ram [36] , specially because the former is much simpler and defines operations on a single layer of points, while the latter defines a different set of operations for a partition monoid that is represented as a graph of two layers of points. Nevertheless, we can still find a relation with the number of partitions in the case of the planar sub-monoid P k , which is C 2k for two layers of length k, and the number of keys for a single layer of length m, which is C m . Using Stirling's approximation, we have that
, which is consistent with the upper bound:
Dutton and Brigham proved in 1986 that the Stirling approximation of the Catalan numbers is in fact already a valid upper bound [37] . In addition, they obtain tighter lower and upper bounds for the Catalan numbers. The cost of the DC-based transfer matrix method is the product of the cost of the partial DC and the size of the configuration space C m . So far, the worst case running time of the algorithm for computing M is:
In the following sub-section, we show how a finer analysis can lead to a smaller configuration space of Γ(m) = O(3 m ) for computing a compressed transfer matrix M .
Family trees strategy
It is possible to reduce the Catalan configuration space by exploiting a symmetry present in the deletion-contraction (DC) method, resulting in an exponentially faster algorithm. Basically, the idea is the following: if the DC procedure is forced to act first on certain external edges of the layer, and act later on the rest of the graph, then symmetries appear between nodes of the recursion tree and other initial configurations. Exploiting such symmetry allows one to group many Catalan configurations into families of configurations, where a single DC procedure applied to the root node of a family contributes to the solution of the whole family.
Forcing DC to start on the external edges results in a recursion tree composed of two phases; (1) a perfect binary tree (PBT) of height h = m − 1 − b and (2) several sub-trees t j with j ∈ [1..2 h ] (see Figure 5 ).
Figure 5: When DC is forced to start on the external edges, the recursion is divided into two phases.
Variable b is the number of external edges that sit in between an identification π ij where at least one of its vertices is i or j. These b edges are left for phase (2) because they do not produce the symmetries needed for the family trees strategy. Each node of the PBT of phase (1) that comes from a contraction produces a unique algebraic symmetry to one of the configurations found in the original Catalan set. The configuration of a contracted node from the recursion tree is denoted χ i and the symmetric correspondence is χ i ←→ σ i . All χ i configurations that share the same PBT, together form a family tree. Following the example of the square strip with m = 3, its configuration space would be grouped into two family trees (see Figure 6 ); {σ 1 , σ 2 , σ 3 , σ 4 } and {σ 5 }, being σ 1 and σ 5 their root configurations, respectively. The solution of a configuration, namely σ i , is defined in terms of its symmetric χ i found in the PBT:
Variable d denotes the number of deletions (i.e., holes in the external layer) and variable c the contractions accumulated along its path, both starting from the root. The (1 + v) c coefficient corresponds to the expression for the c loops that are present in the external layer of σ i , but are missing in χ i . For the example of the square strip of width m = 3, c = 0, 1, 1, 2, 0 for χ 1 , χ 2 , χ 3 , χ 4 , χ 5 , respectively. Function b(k) counts the number of non-zero bits of k and the expression χ k i is the application of the binary mask k just on the holes of χ i . The mask works as follows: if bit k j = 1, with j ∈ [0..d − 1], then the j-th hole is filled with an edge, otherwise it is left as a hole.
When d = 0, χ i represents exactly the starting point of an eventual solution σ i , algebraically symmetric in (1 + v) c . When d > 0, χ i is no longer the starting point of σ i , but instead it is the left-most node in an eventual recursion tree of the solution σ i , at level d. In order to compute σ i , 2 d − 1 variations of χ i are needed to build the missing steps and eventually reach σ i in a bottom-up way. An important property of the variations of χ i is that they actually correspond to other family members within the PBT that will be eventually solved too. This means that there is no need to compute these variations, instead one has to make the correct relations between the different family members. We propose a hash map of the type (χ i , r[]) so that for each χ i , represented by its unique key, there is an array of related configurations r[] that need χ i . Each time a contracted configuration is reached in the PBT, equation (15) is applied and 2 d − 1 relations are inserted in the hash map. Figure 7 illustrates the example of the strip of width m = 3 when processing χ 3 ; it needs χ 4 in order to build the solution σ 3 . The solution for each family member χ i can be written in terms of the solutions of the 2 h subtrees. A convenient way for storing the solution for a whole family is to write a system of equations, using a linear combination of the 2 h sub-trees. A v c coefficient is included, where c is the amount of contractions found in the path from the familiar to the sub-tree. For the example of the strip of m = 3, the solution for the family of σ 1 is:
Note how σ 3 includes χ 4 , as shown in Figure 7 . The solution for the family of σ 5 is:
These equations, plus the solutions of the sub-trees, conform the compressed transfer matrix for the example strip of width m = 3. It is important to mention that the sub-trees are stored only once and the system of equations use indices to the sub-trees. Given how DC works, identification can only occur on pairs of vertices that are neighbors. This aspect of DC allows us to establish a formal definition for a family. Definition 1. A family is a set of configurations in which for any chosen pair σ i and σ j of the set, the difference of their corresponding keys Π i and Π j is
In other words, the difference between σ i and σ j must only consist of identifications of length l = 1. Configurations that differ at least by one identification of length l > 1 belong to a different family. Each family is identified by its root configuration, therefore it is important to know which configurations are root and which are not.
Definition 2.
A root configuration is an instance of K n where its key Π = π
That is, a root configuration is one that does not have identifications of length l = 1. The number of root configurations will be denoted ∆ m as a function of the width m. We formulate the following expression for ∆ m , based on Definition 2 and using the inclusionexclusion principle:
Proof. Using (13) into (21) leads to the following bound:
Step 23 is obtained by using the Binomial formula with x = 4 and y = −1.
The number of root configurations ∆ m corresponds to the number of non-crossing nonnearest-neighbor partitions (nc-nnn). The number of nc-nnn can also be counted with the Motzkin number evaluated at m − 1; ∆ m = M m−1 , where M m is:
The asymptotic number of nc-nnn partitions has been previously studied by Chang et. al. in [38] by using the asymptotic behavior of M m :
Although the asymptotic bound was already obtained in two earlier works [38, 13] in the context of nc-nnn partitions, the proof of Theorem 1 still remains interesting as a short and alternative way to establish the O(3 m ) upper bound coming from an inclusion-exclusion formulation that has not considered the Motzkin numbers.
Upper bound for relating k-hole familiars
Counting the amount of family relations within a DC procedure allows one to precise an upper bound on the number of accesses made to the hash map. For each DC application, the cost of relating family members is defined as:
Where r(k) = 2 k −1 is the cost of performing the relations for a k-hole configuration. Function c(k, h) counts the number of k-hole configurations, which is a subset of the total number of familiars. Since familiars can only be contracted nodes within the PBT, the size of a family is 2 h−1 . A direct upper bound can be computed assuming the worst case for r(k):
A tighter upper bound is possible when c(k, h) is analyzed more carefully. The following pattern can be found when counting the number of k-hole configurations.
The recursion for c(k, h) is:
Function c(k, h) is equivalent to counting the number of k-faces in a regular (h − 1)-simplex [39] . A regular (h − 1)-simplex is a (h − 1)-dimensional polytope that is the convex hull of h vertices in a regular spatial distribution. A regular simplex can also be seen as the generalization of the notion of a triangle or a tetrahedron, for an arbitrary dimension. A regular (h − 1)-simplex can be drawn in the plane by placing h vertices inscribed in a circle, with all pairs connected (see Figure 8 ). The number of k-faces in a (h − 1)-simplex [40] is defined as:
Using (35) in (27), we have that
Theorem 2. The cost of relating all configurations within a PBT is upper bounded as
Proof. For simplicity, we will assume that every DC application processes the default initial configuration. This configuration is the one that spans the largest family, hence the worst case where b = 0, that is h = m − 1.
Both summations obey the following form:
Using the Binomial theorem for the summation, we get
Using a = 2 and a = 1 leads to the first and second terms of Eq. (37)
Running time of the family trees strategy
The asymptotic sequential running time of the family trees algorithm applied to a layer K(V ′ , E ′ ) of a strip lattice is:
The extra cost provided by g(m − 1) does not incur in too much extra computation compared to the cost of DC itself, where the amount of edges of K(V ′
Parallel family trees
By default, the algorithm does not know the ∆ m different root configurations except for σ 1 which is given as part of the input of the strip lattice and is the one that triggers the computation. Under this scheme, the configuration space would have to be explored incrementally, each time adding a sub-set of configurations from the terminal configurations found from a DC application. This is indeed a problem for parallelization because the data-parallel elements are being discovered sequentially, limiting the efficiency and scalability of a parallel computation. In order to solve this problem, we use a recursive generator g( Basically, g(..) performs a recursive partition of the domain A. If |j −i| ≤ 3 then no further identifications can be carried on, otherwise the identification would be of length l = 1 and the generated configuration would not be a root configuration. Similarly, for the top and bottom parts if |j −i| ≤ 2 then no more identifications are possible. Each time a new identification i, j is added, the resulting configuration is checked in the hash table. If it is a new configuration, then it is added, else it is discarded as well as all further recursion computations continuing from that point. By using this approach we ensure that redundant recursion branches are never computed. Once g(..) has finished, S becomes the array of all possible configurations and H the hash that maps configurations to indices.
Parallel family trees are achieved by first generating all root configurations with g(.. Figure 9 . The work for each processor p i is divided in the following steps: (1) pick one root configuration key from S[], (2) apply it to its local copy of the K σ 1 layer, (3) perform the DC procedure, (4) write the results into non-volatile memory, i.e., sub-tree results as well as the linear equations into disk, and (5) go to step (1) if there are still root configurations remaining. For step (3), familiars of a root configuration are detected at runtime within the PBT by computing its key, each time the recursion comes from a contraction. When the beginning of a sub-tree is reached, no more familiars are guaranteed to be found on what is left of the recursion, therefore the algorithm can proceed to compute the whole sub-tree without needing to check for the existence of familiars. The solution of a sub-tree t i is a vector of expressions z i,j (q, v) that associates a j index to a terminal configuration ϕ j within the sub-tree t i . The hash-map H from the generator becomes useful for searching with average cost O(1) the index j of a terminal configuration ϕ j . Also, H ensures that all vectors are consistent with the order established in the generator and in the transfer matrix. The 2 m−1 sub-tree vectors and the coefficients for the set of equations provide the solution for a whole family. Both of these results are saved locally for each processor. This output format based on sub-trees and coefficients makes the matrix compressed in the same proportion of the improvement in the running time.
The asymptotic running time for the parallel family trees algorithm using p processors is:
Further computations for achieving physical results require decompression of the matrix, leading to a matrix of Catalan dimensions again. In practice, large symbolic matrices need first to be evaluated before doing any analysis. If the numerical evaluation is performed before decompressing the matrix, then the process is much faster than first decompressing and then evaluating, even faster than evaluating an uncompressed transfer matrix on (q, v). Numerical evaluation has the potential to be exponentially faster as a consequence of the parallel family trees compression, which is in the same order of the running time improvement.
The analysis of the algorithm has been made for the case of free boundary conditions but it is not restricted to it. For different boundary conditions such as cylindrical, full periodic or cyclic, the parallel family trees can be still applied following the same principle, while taking advantage of additional symmetries like the dihedral group in the cylindrical case. The rest of the paper assumes free boundary conditions unless we explicitly mention the contrary.
For the case of a finite strip, the initial conditions vector Z 1 is computed by applying DC to each one of the C m terminal configurations:
The computation of Z 1 has very little impact on the overall cost of the algorithm and practically costs O(mC m ) in time because a terminal configuration contains mostly spikes and/or loops, which are linear in cost for DC.
Algorithm improvements
Serial and Parallel paths
The DC contraction procedure can be improved for graphs that present serial or parallel paths between two endpoints v a and v b , as shown in Figure 10 . A serial path , denoted s, is a set of edges e 1 , e 2 , ..., e n that connect sequentially n−1 vertices between v a and v b . It is possible to process a serial path of n edges in one recursion step by using the following expression;
A parallel path p is a set of edges e 1 , e 2 , ..., e n that reduntandly connect v a and v b . It is possible to process a parallel path of n edges in one recursion step by using the following expression;
Axial Symmetry
One practical optimization is to detect the lattice's reflection symmetry when computing the root configurations as well as the Catalan configurations. When detecting reflection symmetry, the size of the configuration space is decreased for all symmetric pairs of configurations, no matter if it is initial, terminal or root. As the width of the strip lattice increases, the number of symmetric states increases too, leading to configuration spaces almost half the size of the original. We establish reflection symmetry between two configurations ϕ a and ϕ b with keys π a 1 ,...,an and π b 1 ,...,bn respectively in the following way:
Exploiting this symmetry results in a matrix size C s m :
For large values of m, C s m ≈ Cm 2 . For the case of root configurations, Chang et. al. [38] proved that the number of noncrossing non nearest-neighbor partitions under reflection symmetry, which we denote ∆ s m , is:
where
. The expression was also obtained by Salas and Sokal [13] for studying the square lattice symmetries when v = −1. When m → ∞ we have:
Table (1) shows how the amount of Catalan and root configurations increase for nonsymmetric and symmetric lattices up to m = 14. If cylindrical boundary conditions are used, then the reflection symmetry can be replaced by the symmetry of the dihedral group which further reduces the size of the matrix. For this manuscript we limit our work to the case of free boundary conditions.
Implementation
We tried two implementations for the parallel family trees parallel algorithm; one using OpenMP [42] and the other one using MPI [43] . We observed that the MPI implementation achieved better performance in the multi-core scenario and allows parallel computation in a distributed scenario. For this, we decided to continue the research with the MPI implementation for both multi-core and distributed scenarios. Basic mathematical operations on symbolic expressions are handled through the GiNaC C++ library [44] . Parallel execution of the algorithm receives two parameters; the number of processors p and the block size B, which is the amount of consecutive jobs per process. When the parallelization is unbalanced, the value of B plays an important role for efficiently distributing work to all processors. In our implementation we make each process to generate its own H lookup table and S array. This small sacrifice in memory leads to better performance than if H and S were shared among all processes. There are mainly three reasons why the replication approach is better than the sharing approach: (1) caches will not have to deal with consistency of shared data, (2) there is no sending/receiving of data structures and (3) the allocation of the replicated data is correctly placed on memory modules when working under a NUMA architecture. The last claim is true because on NUMA systems memory allocations on a given process are automatically placed in its fastest location according to the NUMA topology between memory and CPU cores. It is responsibility of the OS (or make manual mapping) to stick the process to the same processor throughout the entire computation.
The implementation writes each row to a persistent secondary memory (i.e., HDD or SSD) as soon as it is computed. Each processor does this with its own file, therefore the matrix is fragmented into p files. In practice, a fragmented matrix is not a problem at all, because numerical evaluation is needed before using the matrix in its full form. Furthermore, a fragmented matrix allows parallel numerical evaluation.
Performance results
We have realized performance tests for the parallel transfer matrix method implemented with MPI for both shared and distributed memory scenarios. The experimental design consists of measuring the main performance metrics (i.e., running time, speedup, efficiency, knee) of the implementation by computing the compressed transfer matrix several times, each time varying the number of processors p. We also compute the improvement factor with respect to previous work [19] . The experiments are divided into two categories; (1) multi-core and (2) cluster. For each case, we measure performance with two strip lattices; (1) square and (2) kagome, respectively (see Figure 11 ). Explicit algebraic expressions for the sparse-matrix factorization of M for all the Archimedean lattices (which include the square and kagome lattices) have been computed by Jacobsen [45] , on finite lattice regions of up to |E| = 882 edges. The approach taken by the sparse-matrix differs from the standard transfer matrix technique, since the former processes a whole finite lattice region, using one sparse matrix computation per edge, while the latter computes a dense T M for each different graph layer of width m.
Note: P F T refers to the actual Parallel Family Trees strategy and P CM to the Parallel Catalan Method from [19] .
Multi-core results
The machine used for the multi-core performance tests has an 8-core CPU AMD FX-8350 at 4.0 GHz, 8GB of RAM and uses the openMPI implementation of the MPI standard [43] .
Square strip lattice test
For the square lattice, we measure performance for 9 different strip widths in the range m ∈ [2, 10] . For each width, we measure 8 average execution times, one for each value of p ∈ [1, 8] . As a whole, we perform a total of 72 average measurements for the square test. The standard error for each average execution time is below 5%. Different block sizes where tested, giving no significant difference on performance. For this reason, we kept a block size of B = 1. The other performance measures include speedup, efficiency and the knee 3 [46] . In this case we took advantage of the reflection symmetry for all sizes of m. Figure 12 shows all four performance measures for the square lattice. From the results, we observe that the running time grows at an exponential rate which is compatible with the upper bound in (44) , assuming that the cost of DC had a little impact on the algorithm. Indeed it is possible for DC to have a little impact, considering that algorithmic improvements are linear and they occur with more or less frequency depending on the edge selection order [34] and the lattice structure. For the speedup, there is improved performance for every value of p as long as m > 4. For m ≤ 4, the problem is not large enough to justify parallel computation, hence the overhead from MPI makes the implementation perform poorly and sometimes even worse than the sequential version. The plot of the execution times confirms this behavior since the curves cross each other for in the transition from m = 3 to m = 4. The maximum speedup obtained was 5.7 when using p = 8 processors. From the lower left plot we can see that efficiency decreases as p increases, which is expected in every parallel implementation. What is important is that for large enough problems (i.e., m > 6), efficiency is over 62% for all p. For the case of p = 4, we report at least 95% of efficiency, which is close to perfect linear speedup. For m ≤ 6, the implementation is not so efficient because the amount of computation involved is not enough to keep all cores working at full capacity. The knee is useful for finding the optimal value of p for a balance between efficiency and computing time. It is called knee because the hint for the optimal value of p is located in the knee of the curve (thought as a leg), that is, its lower right part. In order to know the value of p suggested by the knee, one has to count the position of the closest point to the knee region, in reverse order. Our results of the knee for m > 6 show that the best balance of performance and efficiency is achieved with p = 4 (for m ≤ 6, the knee is not effective since there was no speedup in the first place). In other words, while p = 8 is faster, it is not as efficient as with p = 4.
Kagome strip lattice test
For the test of the kagome lattice, we used 6 different strip widths in the range m ∈ [2, 7] . For each width, we measured 8 average execution times, one for each value of p ∈ [1, 8] . As a whole, we performed a total of 48 measurements for the kagome test. The standard error for each average execution time is below 5%. Additional performance measures such as speedup, efficiency and knee have also been computed. Different values of block size were tested, achieving noticeable differences on performance as B changed. We found by experimentation that B = 1 makes the work assignment slightly more balanced. In this test we can only use lattice axial symmetry for m = 2, 4, 6, 8, . .. . For this reason we decided to run the whole kagome benchmark without axial symmetry in order to maintain a coherence between odd and even values of m. we have that the parallel performance is still scalable even for dense layers; the maximum speedup is over 4.7 for p = 8 on the largest problems. When m > 5, the efficiency of the parallel implementation is approximately over 60% for all values of p. In this test the knee is harder to identify, however for the largest problems one can see a small curve that suggests p = 4 which is in fact 90% efficient when solving large problems.
Cluster results
The cluster used for the tests has a total four nodes; each one with 32GB RAM and two quad-core processors Xeon 5500 2.26 GHz. The full systems offers a total of 32 processing cores and 128GB RAM. The network is Ethernet gigabit centralized and the implementation of MPI is openMPI.
Square results
For the test of the square strip lattice in the cluster environment, we tested 9 different strip widths in the range m ∈ [2, 10] . For each width, we measure 32 average execution times, one for each value of p ∈ [1, 32] . This process is repeated for both static and dynamic scheduling. The standard error for each average execution time is below 5%. For the dynamic scheduler we have chosen a block size value of B = 1. This value of B produces the highest amount of communication between the worker processes and the scheduler, hence the most dynamic scenario. Advantage of axial symmetry has also been taken. Figure 14 shows the performance measures of the running time, speedup, efficiency and the knee [46] for the cluster environment. Note that for each color (size), the solid and dashed lines represent static and dynamic scheduling, respectively. From the results we observe that the reduction of the running time becomes effective starting from problems of size m ≥ 6. Speedup has an overall linear behavior for the full range p ∈ [1, 32] which tells good scalability. Interestingly, near p = 4 there is a region of super-linear speedup [47] that occurs only for sizes m = 6, 8. For p > 10, super-linear speedup vanishes for all problem sizes. In the cluster environment, the behavior between static (solid lines) and dynamic scheduling (dashed lines) is notorious; the former behaves irregularly producing several performance valleys, while the latter behaves regularly, gives higher performance and produces close to zero performance valleys. The maximum speedup achieved is approximately 28X for p = 32, being superior in the dynamic case by a small margin. The efficiency of the parallel algorithm stays above 90% for the largest case of m = 10. Again, dynamic scheduler proves to be much more efficient than the static one when m > 6, and overall the algorithm is over 70% efficient for large enough problems, that is m ≥ 8. The knee suggests that p ∈ [8, 10] gives the best balance of running time and efficiency whenever m ≥ 8.
Kagome results
For the test of the kagome strip lattice in the cluster environment, we tested 5 different strip widths in the range m ∈ [3, 7] . For each width, we measure 32 average execution times, one for each value of p ∈ [1, 32] . This process is repeated for both static and dynamic scheduling. The standard error for each average execution time is below 5%. For the dynamic scheduler we have chosen a block size value of B = 1, same as in the square cluster test. Figure 14 shows the performance measures of running time, speedup, efficiency and the knee [46] for the cluster environment. Note that for each color (size), the solid and dashed lines represent static and dynamic scheduling, respectively. The results show that the reduction of the running time becomes effective in a cluster as long as m ≥ 6. In this case, speedup is closer to a logarithmic curve rather than a linear one. It is interesting to note that speedup gets stuck at specific values for sizes m = 4, 5, 6. The reason why is because the size of the configuration space is not large enough for cluster execution; ∆ m ≤ 32 for m = 4, 5, 6. In fact, the values of p where speedup starts to get stuck actually match the values found for ∆ 4 , ∆ 5 , ∆ 6 in Table 1 . This phenomenon is totally normal in cluster or supercomputer environments, where the amount of work needed to reach full system occupancy is not always provided by the problem input. In order for speedup to take off, the configuration space must be equal or greater than the amount of processors available in the system.
There is a notorious difference in performance between static and dynamic scheduling. With dynamic scheduling, the performance valleys are practically non-existent, giving a much more stable parallel performance for the full range of p. Efficiency is not as good as in the square test; the largest problem is solved with an efficiency over 55%, while the others reach below 50% at some point of p. Dynamic scheduling proves to be in average more efficient than static scheduling, by-passing the performance valleys. The Knee curve suggests a value p ≈ 8 for a good balance between running time and efficiency.
Impact of DC on algorithm performance
We observed from the results that the running time of PFT applied to the kagome strip is slower than in the square strip. DC may cost too much in layers with a dense number of edges if optimizations do not occur too frequently. For the square lattice layer, we can write the DC worst case cost as O (2 2m For dense layers the performance depends on how good the optimizations are and how frequently one can make them appear for a specific strip type. In our case the optimizations for kagome did not occur as frequent as in the square case because we programmed the heuristics in a very general way, nevertheless the method still managed to perform at least two times faster than the Catalan approach. It should be possible to make DC become more aware of the kagome structure and make it to generate the maximum number of optimization opportunities, as mentioned in the work of Haggard et. al. [34] .
Performance on wider strips
We ran the PFT method to compute general (q, v) transfer matrices on square strips at m = {11, 12, 13} and kagome strips at m = {8, 9}, using free boundary conditions and all the 32 processors we had available. For the square strip, the computation of the TM took ∼ 5.5 minutes for width m = 11, ∼ 46 minutes for width m = 12 and ∼ 6.7 hours for width m = 13. For the kagome strip, the computation of the TM took between 11 ∼ 12 hours at width m = 8 and ∼ 3 months at width m = 9. These results were not included in the performance plots because it would have required excessive amount of time to benchmark for all values of p, specially for p = 1 where the computation is sequential. For the kagome strip we consider that we have reached the limit of tractability and wider kagome strips would become intractable 4 with our hardware resources. For the square strip, we believe it is still possible to go further with our hardware resources, possibly up to m = 14 or in the best scenario m = 15 before reaching intractability. Moreover, if cylindrical boundary conditions are used, then it should be possible to go further beyond by using the symmetry of the dihedral group.
An important aspect of having a parallel solution is that if enough processors are used, that is p = ∆ m , then the time for computing the transfer matrix becomes proportional to the depth of the largest directed-acyclic graph (DAG) of computation, which would correspond to the time required to solve the deepest family. The DAG concept allows to know what to expect when having more processors (i.e., a supercomputer) and gives insights on the limits of computation regarding parallelism. If we apply the DAG concept to our results, we have that the time needed to compute the TM for the square strip would have been less than 5 seconds for m = 11 using p = 1142 processors, less than 10 seconds for m = 12 using p = 2947 processors and less than 5 minutes for m = 13 using p = 7889 processors. Analogous for kagome; the time needed to compute the TM would have been between 2 ∼ 3 hours for m = 8 using p = 70 processors and ∼ 1 week for m = 9 using p = 323 processors. As we mentioned earlier, DC heuristics that are aware of the kagome structure should improve the performance further.
Comparison with related work
In this subsection we compare the Parallel Family Trees (PFT) strategy against the Catalan Parallel Method (CPM) [19] by using the following metrics: (1) running time (2) matrix evaluation time and (3) matrix space. Figure 16 shows the results. The first aspect to note from the running time results is that there is an non-linear improvement with respect to CPM that is independent of the amount of processors used. This improvement corresponds to the asymptotic reduction from O(4 m ) to O(3 m ) in configuration space. The improvement is less clear in the kagome strip test, but we expect that it should manifest when exploring larger sizes of m or when using better heuristics for the DC optimizations. For the space metric, we observe that the size of the compressed matrices is indeed smaller than in the CPM case. Moreover, for the square strip the amount of compression increases non-linearly as we expected from the theoretical bound. For the kagome test, the compression factor stabilizes at approximately 1.5. We believe that the reason why kagome compression stays fixed is because the kagome matrix is more sparse than in the square case, making the method to group zero-elements instead of large polynomials, reducing the compression factor from the maximum possible if the matrix was dense. For the results of Matrix evaluation, we observe that evaluation and decompression on a PFT-matrix is faster than just evaluation on a CPM-matrix. The improvement seems to be a consequence of the compression factor achieved previously, since the behavior is similar.
Dynamic scheduler and block size
The role of the block size under dynamic scheduling can be viewed as the amount of staticness induced to the program. A value of B = 1 means a fully dynamic scheduler, while a value of B = ⌈n/p⌉ means a fully static scheduler. Given that the dynamic scheduler of our implementation communicates via 1-byte messages, it is safe to use B as long as the network is sufficiently fast and dedicated to the cluster, like in our case. In a limited and shared network environment, one could consider exploring the range 1 < B < ⌈n/p⌉ until a good local minimum is found.
Axial Symmetry
When using axial symmetry, we observed an extra improvement in performance of up to 2X for the largest values of m. This improvement applies to both sequential and parallel execution. The size of the transfer matrix is improved under axial symmetry, in the best cases we achieved almost half the dimension of the original matrix, which in practice translates into up to 1/4 of the space of the original non-symmetric matrix. Lattices as the kagome will only have certain values of m where it is axial symmetric. In the other cases, one must perform a non-symmetric computation.
Validation
In this section we present some physical results we have computed for different widths of the square strip using free boundary conditions, as a way to validate the correctness of the parallel family trees method by comparing the curves with the ones from related works.
The first set of results are shown in Figure 17 . In the graphics we present the limiting curves on the complex q-plane for different values of the temperature-like parameter; v = {−1.0, −0.5, −0.1}, at different strip widths in the range m ∈ [2, 8] . The curves were obtained by using the direct-search approach method which consists of scanning the complex domain in small discrete steps, and checking on each discrete location the condition |λ 1 | = |λ 2 | where λ 1 and λ 2 are the first and second dominant eigenvalues, respectively. If the condition is true, then the pair (x, y) is a point of the curve, where x and y are the real and imaginary parts of q, respectively. Due to numerical precision limits, we allowed %1 of numerical error for accepting the condition |λ 1 | = |λ 2 |. [38] . Limiting curves for 6 ≤ m ≤ 8 did not appear in the cited work.
For the next set of physical results we are interested in fixing the q parameter at values q = {2, 3, 4} and compute the dimensionless reduced internal energy E r as well as the reduced function C H of the specific heat C, for different strip widths in the range m ∈ [2, 8] . The dimensionless reduced internal energy is defined as
where f is the free energy density as defined in equation (8) , J the coupling constant which is J > 0 for the ferromagnetic case (0 < v < ∞) and J < 0 for the antiferromagnetic case (−1 < v < 0). The specific heat is defined as
and C H uses the reduced form
The results are presented in Figure 18 , where each row presents the results for a given q value. The curves for 2 ≤ m ≤ 5 agree with the ones presented by Chang et. al. [38] . Results for 6 ≤ m ≤ 8 did not appear in the cited work. Although the computation of new physical curves for wider strips is indeed possible, it would require more time with our resources, or a much larger cluster than ours for faster results. Nevertheless, our present results already show that with the PFT strategy known results are obtained faster than with CPM. We would like to remind the reader that the focus of this work is on the algorithmic improvements and the possibilities to compute the general (q, v) transfer matrix for strips, using a configuration space that is asymptotically O(3 m ) .
Discussion
We have presented a parallel strategy for computing the general (q, v) transfer matrix of strip lattices in the Potts model. Our main result is the asymptotic reduction of the configuration space, from O(4 m ) to O(3 m ), by re-organizing the problem domain as parallel family trees (PFT). Using this strategy, the transfer matrix can now be computed by just processing the root configurations, which are O(3 m ) in number. Computation of the family trees can be performed completely in parallel because family trees are independent from each other, and the configuration space is generated a priori, removing any potential time-dependence. We have compared the experimental results of PFT and indeed it runs exponentially faster than the Catalan Parallel Method (CPM) [19] , both in sequential and parallel execution. The resulting matrix of PFT is a compressed structure based on systems of linear equations. Numerical evaluation on the matrix, including decompression time, is actually faster than numerical evaluation using the CPM method, by a factor that is proportional to the improvement we measured for running time. Therefore, it is not only faster to generate the matrix using PFT, but it is also faster to use it later for extracting the physical information.
Multi-core results have shown that PFT benefits from shared-memory parallelism, achieving a maximum of 5.7X of speedup for the square strip test when using p = 8 processors. At p = 4, the efficiency of the implementation is still over 95%, which is worth mentioning. By plotting the knee curve, we have managed to confirm that choosing p = 4 is in fact a wise decision for a balance of speed and efficiency. In the Multi-core scenario, a dynamic scheduler did not produce a beneficial change in performance, therefore static scheduling still remains convenient.
For the cluster results, we achieved up to 28X of speedup using p = 32 for the square strip tests, with an efficiency above 90% for a strip of width m = 10 (largest one). For the kagome strip test, efficiency stayed above 55% for a strip of m ≥ 7 and the maximum value of speedup reached was close to 20X when using p = 31. A small super-linear speedup region emerged near p = 4 when solving square strips of sizes m = 6, 8, giving an efficiency of up to 120%. We believe that this is just a particular fortunate event, possibly produced by the reduction of cache misses, which is caused when partitioned data fits entirely in cache. In general, we do not expect super-linear behavior since we are measuring fixed-size speedup which is upper bounded as S p ≤ p [48] . The knee curve suggests that p ∈ [8, 10] produces a good balance between speed and efficiency. An important result in cluster execution is that dynamic scheduling is mandatory in order to achieve a performance curve that will not fall into performance valleys, as static scheduling did. On average, dynamic scheduling achieves considerable higher performance than static scheduling.
One of the goals of this work was to present an algorithmic improvement that is implicitly parallel and scalable. For this, we introduced a preprocessing step that generates all possible root configurations and Catalan configurations, which are critical for processing the family trees in parallel. This step takes a small amount of time compared to the whole problem. Other technical improvements had been introduced, some of them being already known in the literature [34] ; (1) fast computation of serial and parallel paths of the graph, (2) exploiting axial symmetry, (3) a set of algebra rules for making consistent keys in all leaf nodes and (4) a hash table for accessing column values of the transfer matrix. In particular, when taking advantage of axial symmetry, the implementation achieved extra improvement of up to 2X in performance, using almost a quarter of the matrix space used in a non-symmetric computation.
In order to achieve a scalable parallel implementation, some small data structures were replicated among processors while some other data structures per processor were created within the corresponding worker process context, not in any master process. This allocation strategy results in faster cache performance and brings up the possibility to scale better under NUMA architectures. It is not a problem to store the matrix fragmented into many files as long as the matrix is in its symbolic form. In practice, it is first necessary to evaluate the matrix on q and v before doing any further numerical analysis. Therefore, the fragmented parts can be evaluated at runtime as they become read. This evaluation can also be done in parallel.
The only technical restriction of the parallel family trees strategy in order to work is that vertices of the left and right boundaries of the layer need to be connected sequentially. This restriction is not a problem, because any planar strip lattice can be rotated so that the restriction is satisfied. Additionally, PFT allows any graph structure along the vertical direction, that is, one can study strips where its K i layer is composed by a sequence of different tiles.
In the kagome tests, the performance results were not as good as we expected, because the number of edges in the layer is much higher than in the square case, making DC to take a considerable amount of time for each configuration. We believe that the dependence of DC on the number of edges in the layer is a sensible aspect for the PFT algorithm, and an extrapolation of this situation would suggest that the largest Archimedean lattices could be much harder to the point of being intractable. However, it is important to consider that DC can significantly improve its performance if the heuristics are improved so that they choose the best sequence of edges based on the connectivity of the graph layer [34] . These heuristics, combined with the linear-cost optimizations, can make the PFT method more resistant to the number of edges in the layer. Furthermore, if more processors are used to the point that p = ∆ m , then the time for computing the TM will be much lower than in our case with p = 32, and will correspond to the time taken to solve the deepest DAG of computation. For this reason, we expect that an execution on a large cluster or supercomputer could allow the computation of transfer matrices of strips wider than what has been reached before.
