In this paper, we provide a fine-grained parallel scheme for anisotropic mesh adaptation on NUMA 1 architectures. Data dependencies are expressed by a graph for each kernel, and concurrency is extracted through fine-grained graph coloring. Tasks are structured into bulk-synchronous steps to avoid data races and to aggregate shared-data accesses. To ensure performance prediction, time cost and load imbalance are theoretically characterized. The devised scheme was evaluated on a 4 NUMA node (2-socket) machine, and a mean efficiency of 70% was reached on 32 cores for 3 kernels out of 4. The impact of irregular degree distribution and data layout on scalability is highlighted. 
Introduction
Numerical simulations of complex physical phenomena (such as turbulence, noise propagation) may require billions of mesh elements to achieve a high level of accuracy. Coupled with a parallel numerical scheme, mesh adaptation is a relevant alternative to reduce the computational effort while maintaining the required accuracy. By equidistributing the error of the solution field while modifying the domain discretization, it decreases the required number of degrees of freedom. That said, the whole computational chain must be efficiently parallelized to be applicable, according to Amdahl's law [1] . But mesh adaptation is known as being irregular [2] and difficult to parallelize at a fine granularity level. Indeed, mesh topology is explicitly stored but evolves in an unpredictable way. Data dependencies cannot be resolved statically, involving a poor cache locality due to irregular memory access patterns. On the other hand, manycore architectures have been recently emerging in high performance computing platforms, with an increasing number of cores per node but a decreasing memory/clock rate per core, and a deep cache hierarchy. Furthermore, inter-socket memory accesses have higher latency than intra-socket ones in a NUMA multicore machine.
Therefore, coarse-grained schemes based on serial kernels are no longer suitable, as pointed out in [3] . The challenge is to devise an efficient fine-grained scheme which exhibit a high degree of concurrency and a high locality to ease memory accesses penalties. The idea is no longer to process larger mesh in a given time frame T , but to shorten T .
Related works. Parallel meshing is an active field of research [4] . Most of up-to-date schemes are coarse-grained ones, and rely on serial kernels, domain decomposition and dynamic cell migration for load balancing. In this case, the parallelization effort mainly focuses on domain interfaces management in order to reduce the synchronization between submeshes, and efficient heuristics for load balancing [5] . Fine-grained schemes have also been developed, but many of them concerns lock-based Delaunay remeshing [6] .
In [7] , a lock-free scheme for the isotropic case was proposed. Mesh operations were expressed by a graph, and an independent set I is extracted to ensure topological conformity and data updates consistency. The critical point is the extraction of I in the swapping stage because it involves a distance-2 vicinity for each vertex. Such a choice greatly reduces the degree of concurrency since a 2-coloring requires more colors.
In [8] , an extension to the anisotropic case was given with an additional contraction kernel. It uses a node-to-node and node-to-element data structure and a unique graph was considered for all kernels. To ensure data consistency, adjacency lists updates are deferred to the end of each remeshing iteration in a way that a unique thread k commits the list of a given point p, even if N p is partially updated by other threads. It was designed in a practical way and no theoretical guarantees were provided.
Contributions. We provide a fine-grained scheme for anisotropic remeshing on NUMA architectures based on [7, 8] . Data dependencies are expressed by a graph G = (V, E) and concurrency is extracted through fine-grained graph coloring. Locality is the key-point of our scheme and its originality relies on:
• kernel-specific graphs to increase |V| and the degree of concurrency |I|, and to reduce |E|, unlike [8] . (sec. 3.1)
That said, no graph is required for refinement and only distance-1 edges are considered, unlike [7] .
• a locality-aware data layout management in order to ease cache and NUMA effects (sec. 3.2)
• a bulk-synchronous task structuring in order to avoid data races and to aggregate data accesses (sec. 4).
• the use of a bridging model to ensure performance portability, unlike [8] (sec. 5).
Serial adaptive scheme
The purpose is to control the discretization of the domain Ω to accurately capture characteristic features of a given solution field (u p ) p∈Ω (such as shocks, flows, interfaces motion), while reducing the number of degrees of freedom. It aims at building a discretization M of Ω where the interpolation error ε = ||u − Π u || is bounded and equally distributed, with Π u the interpolate of u on M. It can be expressed as the following optimization problem:
This non-linear problem is resolved by applying an iterative procedure involving both a finite element/volume solver and a adaptive remesher. The convergence of the couple mesh-solution is achieved when the error is smaller than a given threshold. The sequence of operations we use is given in Algorithm 1.
Metric construction
The error of the solution field is closely related to mesh quality: a control of elements size is required to reduce it. For this purpose, one may use a local a posteriori error estimate from which a nodewise metric tensor field (M p ) p∈Ω is derived, and will guide the remeshing procedure. In addition, phenomena involved in computational fluid dynamics often admit anisotropic features: the solution field evolves differently depending on the considered direction. To properly capture them, a direction prescription is thus necessary. For each tensor, its eigenvalues (
, with d is the dimension of Ω. Then a gradation procedure may be applied in order to smooth out sudden changes in size requirements [9] .
Many error models can be found in the literature, each one being suitable to a specific class of numerical schemes. In our case, we use a generic multi-scale model based on the continuous mesh formalism [10] , which gives us a control in L k norm of the error, for all k ∈ N. The main benefit of this approach is to give an automatic normalization of the tensor field (M p ) p∈Ω . Indeed, it finely captures characteristic features with different amplitudes (shock within a flow for instance), without a minimal size requirement. For a target number of points N, the metric tensor of a point p is:
The local normalization factor aims to control the point density in large gradient variation regions of Ω, whereas the global normalization factor is required to reach N according to the chosen L k norm. The metric tensor definition involves the Hessian matrix H u of u, which is only known at mesh nodes. The hessian field is recovered by means of a double L2-projection, which is accurate and does not tend to smooth the calculated field [11] .
Remeshing
It consists in applying topological and geometric modifications to the mesh with respect to metric field requirements. Formally, the purpose is to build an uniform unit mesh in the Riemannian space induced by (M p ) p∈Ω , where the length of a segment [ab] and the area of an element K are given by:
It is an iterative procedure consisting of 4 stages:
• refinement: it aims at splitting edges whose lengths are larger than √ 2. We use a recursive dissection kernel, since data updates remain local to the element in this case. This aspect is important for its parallelization. Finally, the Steiner point related each marked edge is its midpoint in the Riemannian space.
• contraction: it aims at removing edges whose lengths are smaller than 1/ √ 2. Here we use a vertex collapse kernel, in which the smallest edge (p 1 , p 2 ) incident to a point p 1 is selected at each time. Task ordering is actually important since processing nodes in a breadth-first way rather than a depth-first way avoids the creation of long edges induced by a subsequent collapse of a node p.
• swapping: it aims at improving the quality of any element pair (K 1 , K 2 ) by flipping their shared edge if quality is improved i.e
• smoothing: it aims at relocating each internal point p so that the quality of its stencil N p is strictly improved.
Several kernel choices are possible, we opted for a smart laplacian where the location of p is:
: relaxation weights such that p remains in the convex hull of N p
Regarding quality measure, we opted for the one in [12] which takes both element size and shape into account:
• |K| is the Riemannian area of K and ∂|K| is its perimeter according to the tensor interpolated at its vertices.
• ϕ is defined by
It reaches its maximum value 1 for x = 1 and decreases smoothly for x > 1. Thus Q(K) = 1 when K is equilateral with unit sides with respect to M.
An example of anisotropic mesh adapted to an analytical solution field is given in Fig. 7 . No gradation were applied but all scales of the solution field were correctly captured. ⇒ completely local (e) data updates 
Issues. Apart from being irregular, the remeshing part is one of those memory-intensive algorithms in which computation is cheap but limited by memory transfers. This aspect is accentuated by irregular memory accesses, induced by mutable data dependencies. It reduces the efficiency of prefetching techniques for memory latency hiding [13] . On the other hand, this level of granularity involves frequent thread synchronization and idle times. To reach scalability, the challenge is to increase task locality, especially in a NUMA context, and reduce synchronization count while keeping the execution safe.
Philosophy of the method. Locality is the key-point of our devised scheme. Data dependencies are expressed by an undirected graph G = (V, E), from which a task partition P is extracted by a fine-grained parallel graph coloring. Task granularity is refined to increase the degree of concurrency |I|, whereas data dependencies is reduced to increase locality. Thus, graphs are chosen such that |V| is increased and |E| is reduced. Each stage is structured into bulksynchronous steps in order to avoid conflictual data load/store, and to aggregate shared-data accesses for memory latency hiding. By the way, reordering instructions allows us to exhibit theoretical prediction on execution time and load imbalance.
Task extraction
To ensure correctness, two issues must be addressed: topological conformity and data consistency. Overlapping operations may invalid the mesh (edge cross, holes), whereas cell data may be corrupted by data races. In our context, a remeshing task is defined by a kernel and a related dataset. Task graphs are built from data dependencies of each kernel, and are chosen such that |V| is increased and |E| is reduced. Indeed, we aim at reducing the degree ∆ of each graph, as well as the deviation σ of their degree distribution to reduce load imbalance. On the other hand, the number of colors n c required to build P (and I ∈ P) is bounded by ∆ + 1, and must be reduced since |I| decrease linearly to n c . Data involved by each task are described in Fig. 2 . Related graphs are given in Fig. 3 and recapitulated in Table 1 . • dissection: it involves the vicinity of each element, however this dependency may be avoided by carefully choosing the mesh data structure. No graph is required here, unlike [7] .
• collapse: it involves the vicinity (stencil) of each node. G = (V, E) is built from the mesh primal graph, where V is a set of marked nodes, and E is a set of pairs of V corresponding to a mesh edge. Here G is planar and we have a bound on the chromatic number χ ≤ 6, as stated in [14] . Hence a small number of colors is expected for the extraction of a partition P, despite irregular degree distribution induced by anisotropy.
• flip: it involves the shell of the edge (i.e the two elements sharing it). G is built from the mesh dual edge graph. As each shell contains exactly 4 edges in 2D, we have ∆ = 4 (whereas ∆ = 12 in [7] and ∆ ≈ 6 in [8] ).
• laplacian: it involves the stencil of each internal node. G is built from the primal graph deprived of its boundary nodes. As the topology remains unchanged, the same partition P is kept throughout the stage, and task propagation is ensured by processing V i ∈ P at iteration i.
The devised scheme relies on an efficient independent task extraction which is NP-hard. Related heuristics are not trivially parallelizable due to intrinsic dependencies between iterations [15] . Since the extraction of I is only a preliminary step of a given remeshing stage, its cost must remain negligible. However, the quality of the solution must remain acceptable in order to increase the degree of concurrency |I|. Given these constraints, we opted for a speculative fine-grained graph coloring [16] . It gives a good tradeoff between accuracy and performance. The idea is to perform a pseudo-coloring of V using first-first heuristic [18] without worrying about data races, and to handle defective vertices in a separate stage. In the first stage, each thread maintain a local mapping of forbidden colors for each v ∈ V, because they are already assigned to a v i ∈ N v . In the second stage, threads recheck colors of a subset of marked vertices R, which will be colored again in the next iteration. If conflict happens, only one of the 2 defective vertices is recolored, which are discriminated by their ID. The algorithm includes 2 synchronization points.
The heuristic was evaluated on 3 instances of the recursive matrix graph model (RMAT) [17] , to assess its applicability in our context. Indeed, the primal graph (involved in contraction and smoothing stages) has an irregular degree distribution because of anisotropy. The graph model allows to generate instances with degree distribution parameters (see [17] for details). Results are shown in table 2. Less than 4 seconds is required to process 16 · 10 6 vertices and the number of defective vertices remains negligible compared to |V|. The ratio of independent vertices is not altered by parallelism, but depends on the degree distribution.
Locality enhancement
Issues. In our context, data layout must be considered to improve cache locality despite irregular memory accesses. However the initial memory layout is hard to preserve because cell data dependencies evolve in a unpredictable way. On the other hand, index reordering schemes (Hilbert curves, matrix renumbering) are still expensive and hard to parallelize in a fine-grained way. Therefore, the challenge is to reduce data updates on one hand, and keep neighboring cells as close as possible in memory on the other hand, subject to these constraints.
Data layout and synchronization. Data are stored in flat arrays in order to increase spatial locality due to memory addresses contiguity. No reallocation is performed within a remeshing stage but a parallel mesh compression stage is done after a contraction/refinement. To reduce remote accesses in a NUMA context, memory cells are first-touched by each thread in a round-robin fashion for each shared tasklist. Therefore, memory cells are allocated on the closest DRAM of the core where a given thread is statically bound. Data stores are synchronized in a way that newly inserted neighboring mesh cells are kept as close as possible in memory (see Fig. 5 ). Indeed, mesh cell index offsets n α , n β , and n γ are precalculated according to the number of long edges α j of each element K j during a refinement stage:
, and
A lock-free synchronization scheme is used for shared tasklist updates: a reduction is performed on index offsets (k i ) for a shared tasklist L, in order to find the current index range where a given thread t i may store its private data. For that, a atomic capture mechanism inspired by [8] is used: the size n L of L is incremented atomically while its old value is cached in k i . Thus, t i knows its index range [k i , k i + n L ] and can resume its local computation or copy his local data. Notice that offsets could be retrieved by means of prefix sum, but it would require log(p) synchronization barriers. Data structure. A combinatorial map data structure is used such that updates remain local to the considered patch [19] (see Fig. 1 ). Each dart (or halfedge) stores the indices of its vertices, its next, its twin, and the element containing it. In addition, each node stores the index of an outgoing dart, idem for elements. The stencil of each node can be easily found via a circulator. Boundary darts are linked so that the whole boundary may be retrieved via references traversal. Remeshing stages are structured into bulk-synchronous steps for 3 reasons:
Steps structuring
• avoid data races: threads use local copies of shared-data, and updates are made in a synchronized way at the end of each step, using the mechanism described in section 3.2. Thus, conflictual data load/stores are avoided.
• hide memory latency: shared-data accesses are aggregated in a single communication substep, and may be performed in a pipelined way.
• ensure performance prediction: the time cost of each step may be finely estimated through bridging models for a particular machine, given a set of parameters (unit task costs, bandwidth, latency, caches etc) [20] [21] [22] .
(see Fig. 6 )
Refinement. Each iteration consist of 4 steps: (1) element marking, (2) steiner point and offsets calculation, (3) element process, (4) twin darts repairing. In addition to the mesh, threads share 2 arrays: L storing the indices of marked darts, and S for the Steiner point related to each e ∈ L. Related index offsets n L and n S are also maintained.
If an edge e k ∈ K is too long, then the element K is marked and the index k is added to a local list L. A reduction on n L is performed by each thread in order to know the current offset of L where it may copy the content of L. The same applies to n S with S. Here, only 3 reductions per thread is necessary to update them in the mesh data structure. Finally when a dart e : (a, b) ∈ L is split into (e 1 , e 2 ), the stored indices of (a, b) are replaced by those of (e 1 , e 2 ). Thus no additional shared array is required to perform the last step (see Algorithm 2).
Contraction. Each iteration consist of 4 steps: (1) stencil analysis and graph construction, (2) graph coloring P, (3) process of I ∈ P, (4) boundary edges repairing. Threads share a graph G = (V, E) and 3 additional arrays: S indexing the dart to be collapsed for each point, L for active nodes indices, and C storing the colors of each point p. An offset n L is also maintained for the latter. First, the stencil N i of each p i is retrieved. Then edges e ∈ N i are sorted according to their lengths, such that
is a valid index then k is added to a local list L, and a reduction is performed on n L in order to find the current offset of L from where each thread may copy the content of their respective list L. Afterward G is built by storing each p k ∈ L in V and its stencil N k in E. Then the partition P is extracted from G, and I from P by comparing |P c | c∈[1,n c ] . Each point p ∈ I is then processed. Finally, each boundary dart is fixed by updating its next reference (see Algorithm 3).
Swapping. Here we have 3 steps per iteration: (1) edge filtering and graph construction (2) graph coloring P, (3) process of I ∈ P. Threads share a graph G = (V, E) and an array L referencing darts which need to be processed. An index offset n L is also maintained.
If Q M (K) ≤ q min then the index of each e k ∈ K is added to a local list L. A reduction on n L is performed in order to find the current offset of L from which each thread may copy the content of its list L. Afterward G is built by storing each e k ∈ L in V and indices of the shell of e k is retrieved and stored in E. Then the partition P and I ∈ P are extracted. Each e ∈ I is then processed (see Algorithm 5) .
Smoothing. Unlike the 3 others, this stage consist of a preprocessing step and 1 step per iteration. The preprocessing consists of: (a) quality analysis, (b) graph construction, (c) graph coloring. Since mesh topology remains unchanged, many steps are deported in the preprocessing. Afterward, each subset (V i ) i∈[1,n c ] ∈ P is processed (see Algorithm 4).
for element K do in parallel no wait STEP 1 if there is a short edge e i ∈ K then mark K. repair twin index of e, for each dart e ∈ K barrier until L = ∅ Algorithm 2: bulk-synchronous refinement
for point p i ∈ M do in parallel no wait STEP 1 retrieve and cache
extract partition P from G in parallel, and I ∈ P STEP 2 barrier for i from 1 to |I| do in parallel
for dart e ∈ ∂Ω do in parallel STEP 4 repair next reference of e using a circulator. barrier until L = ∅ 
retrieve the shell of V[i] and store to E[i] barrier extract partition P from G in parallel, and I ∈ P STEP 2 barrier for 1 ≤ i ≤ |I| do in parallel 
Performance prediction
The time cost of each remeshing stage is estimated with the queuing shared memory bridging model [22, 23] . It consists of p cores, each with its own private memory, and communicating through load/stores within n memory cells in a shared-memory or distributed shared-memory (DSM). Any cell j ≤ n may be simultaneously accessed, but there is a cost κ j for such contention. The gap between local instruction rate and communication rate is given by a parameter g ≥ 1. In this framework, the time cost of a bulk-synchronous remeshing stage is given by:
Finer prediction can be obtained by providing unit task costs per step (cycles, bytes), memory bandwidth (GB/s). The mean time complexity and load imbalance of each stage are given in table 3, and detailed below. 
Proposition 1 (Refinement). Let n be the number of mesh elements. The time cost of a refinement iteration is in Ω((1 + g) n p ) and no substantial load imbalance is expected with a static task partitioning. Proof. For each core i and at a given iteration, we have:
• step 1: let n 1,i be the number of elements assigned to the core i. Hence r i = n 1,i are read.
For each element, at most 3 lengths computations are performed and 1 local value is added to local list, thus c i = 4n 1,i . A reduction on n L is performed involving 1 read/1 write per core, thus κ = 2. Finally, at most n i ≤ p k=1 n 1,k values are written into L so w i = Ω(n 1,i ).
• step 2: let n 2,i be the number of long darts assigned to the core i. Thus r i = n 2,i . Here, r i steiner point calculations are done, and r i values are stored locally, so c i = 2n 2,i . A reduction on n α is performed, thus κ = 2. Finally, n p = r i values are written to mesh, so w i = r i .
• step 3: let n 3,i be the number of marked elements assigned to the core i. Thus r i = n 3,i . For each element, 2 offset calculations are done, 1 value is stored locally and 1 refinement is done. Hence c i = 4n 3,i . Then 2 reductions on n β and n γ are done involving 1 read/1 write each other, so κ = 4. Finally (n β + n γ ) values are written by all cores, with n β ≤ 3n γ and n γ ≤ 4n, making a total of n(12 + 4). Thus w i = Ω(16n 3,i ).
• step 4: let n i,4 be the number of new elements assigned to the core i. Hence r i = n i,4 .
For each element, 3 dart updates are performed, so c i = 3n i,4 . In addition, 3 values per element are written in mesh, thus w i = 3n i,4 . No reduction is performed so κ = 0.
Therefore we have:
[Ω(n s,i )] + 8 for any scheduling policy. For a static task partitioning, we have:
p ), with n K the nb of elements. Since tasks are equi-distributed and have a uniform unit cost Ω(g), the load imbalance of a step s is ι s = 0 Proposition 2 (Graph coloring). Let n i be the number of vertices assigned to core i,∆ the mean degree of G. For each core i, the local instructions, shared-data accesses and contention count of a graph coloring iteration are: r i = Ω(∆n i ), w i = Ω(2n i ), c i = n i , and κ = 2.
Proof. Let R be the shared array storing the indices of conflictual vertices, and n R its related offset. Then we have:
• pseudo-coloring: for each vertex v, the color of each v k ∈ N v is retrieved locally and the color of v is updated accordingly. Thus r i = Ω(∆n i ) and w i = n i .
• defective vertices detection: the colors of each v k ∈ N v is re-checked for each vertex v. If a same color was assigned to v and v k , then v k is added to a local list R i . Thus r i = Ω(2∆n i ) at this point, and c i = n i . A reduction on n R is performed, thus κ = 2. Then the contents of R i are copied to R, so w i = Ω(2n i ).
At this point, we have r i = Ω(∆n i ), w i = Ω(2n i ), c i = n i and κ = 2. The whole procedure is repeated by processing R instead of V, and terminates when n R = |R| = 0.
Proposition 3 (Contraction).
Let n be the number of nodes,∆ i the mean degree of the graph on each core i, k the number of iterations for graph coloring, and σ max the maximum deviation of (∆ i ) on all cores. The cost of a contraction iter. is in Ω((1 + g)(∆ + k) n p ), and the load imbalance of each step is in Ω(σ max ) with a static task partitioning. Proof. Let (n s,i ) i=1,p be the number of nodes assigned to core i at step s, and (∆ s,i ) i=1,p the mean degree of G on i.
• step 1: the stencil N p of each point p is retrieved in local memory, thus r i = Ω(∆ 1,i · n 1,i ).
Then each p i ∈ N p is evaluated, and at most n 1,i point indices are stored locally, so c i = (∆ 1,i + 1)n 1,i . Then n 1,i cells of S are written, so w i = n 1,i at this point. A reduction on n L is performed, thus κ = 2. Afterward the construction of G involves at most n 1,i writes into V and (∆n) 1,i writes into E, thus w i = (∆n) 1,i .
• step 2: it refers to graph coloring (see Proposition 2). Thus we have r i = Ω(∆kn 2,i ), w i = Ω(2kn 2,i ), c i = kn 2,i and κ = 2k, with k the number of iterations for coloring G.
• step 3: the stencil N p of each point p is retrieved locally, thus r i = (∆n) 3,i . The point merge is performed by replacing each stored reference of p k in N k by S[k] and by deleting 2 elements. Thus w i = c i = (∆n) 3,i and no reduction is performed, so κ = 0.
• step 4. Again, the stencil N k of each point p k is retrieved locally, thus r i = Ω(∆ 4,i n 4,i ). Then each edge e ∈ N k is evaluated so c i = r i . Afterward one dart is updated so w i = 1. No reduction is performed, so κ = 0.
Thus we have:
for any scheduling policy. For a static task partitioning, we have: n s,i = Ω( 
Proposition 4 (Swapping). Let n be the number of darts, and k the number of iterations for graph coloring. The time cost of a swapping iteration is in Ω((1 + g)(2k + 4) n p ), and no substantial load imbalance is expected with a static task partitioning.
Proof. For each core i and at a given iteration, we have:
• step 1: let n 1,i be the number of mesh elements assigned to the core i. Hence r i = n 1,i .
The quality of each element is evaluated, at most 3 values are locally stored in L, so c i = Ω(3n
values are written into L by all cores, so w i = Ω(n 1,i ).
• step 2: it refers to graph coloring (see Proposition 2). Let n 2,i be the number of vertices of V assigned to the core i. Thus we have r i = Ω(∆kn 2,i ), w i = Ω(2kn 2,i ), c i = kn 2,i and κ = 2k, with k the number of iterations.
• step 3: let n 3 = |I|, and n 3,i the number of darts assigned to the core i. The 4 darts (e k ) k=1,4 surrounding any e are retrieved locally , so r i = 4n 3,i . The flip is done by overwriting references stored in each e k , and the quality of the 2 resulting elements are calculated. Thus w i = 4n 3,i and c i = 2n 3,i . No reduction is done, so κ = 0.
for any scheduling policy. For a static task partitioning, we have: n s,i = Ω(
with n h the number of half-edges. Tasks are equi-distributed and unit task cost is in Ω((1 + g)(2k + ∆). As k is fixed and ∆ = 4, thus task cost distribution is uniform. Therefore the load imbalance of a step s is ι s = 0
Numerical results
Accuracy. The devised scheme was implemented in C++ using OpenMP 4. An example of mesh adapted to a multiscale solution field (u p ) p∈Ω is given in Fig. 7 , with u = 0.1 sin(50x)
Input mesh: n = 13 102 nodes and m = 25 760 elements, no metric gradation, target number of nodes N = n. Run parameters: 4 threads, n a = 5 adaptations, n r = 10 stages per adapt and n s ≤ 15 iterations per stage. The error is defined as the gap between mesh and the cartesian surface defined by the metric field [24] . It is given by t i, j,k with n a = 3, n r = 10, n s ≤ 15, and speedup per iteration. Refinement and swapping scale well, even on 4 NUMA nodes. Data involved here are mainly local and no substantial load imbalance is expected. Smoothing scales even better, since its more compute-intensive on one hand, and since primal graph remains unchanged on the other hand. Contraction suffers from load imbalance due to irregular stencil size distribution, and cache misses penalties.
The mean elapsed time per iteration t = (1/(n a n r n s )) n a i=1 n r j=1 n s k=1 t i, j,k of each remeshing stage is given on Fig. 8 . Overall scalability is good with an efficiency of 70% for 3 kernels out of 4 on 32 cores. Refinement and swapping scale well, even on 4 NUMA nodes. Indeed, data involved in these stages are mainly local and no substantial load imbalance is expected, as stated in Proposition 3 (see Table 3 ). Smoothing stages scale even better. As it is more compute-intensive, data accesses penalties are significantly mitigated. Moreover, it benefits from the reuse of cached cell data, since mesh topology -and the underlying data layout -remains unchanged. Despite efforts to improve locality, contraction stages suffer from load imbalance due to irregular stencil size distribution (induced by anisotropy), and high last level cache (LLC) misses penalties (see Fig. 10 ). The time spent on each step for the refinement, contraction and swapping stage on 32 cores is given in Fig. 9 Fig. 9 : Ratio of time spent on each bulk-synchronous step for a refinement, contraction and swapping stage on 32 cores. The overhead spent on task graphs construction and coloring is small compared to other steps, especially for the contraction stage.
In the latter, most of the elapsed time is spent on stencil analysis: the collapse operation is simulated for each target node v, and the validity of the resulting patch is verified at each time (in terms of areas and edge length). Its cost depends on the size and geometrical configuration of the stencil. Fig. 10 : Degree distribution evolution, cache misses with/without index reordering (using Metis [25] and PAPI counters [26] ), and elapsed time profile for one contraction stage on 32 cores. The degree distribution of the primal graph evolves in an irregular way throughout iterations, resulting in load imbalance. Data layout have been significantly evolved throughout iterations, resulting in a high cache misses count.
Further investigation has been done for contraction (see Fig. 10 ). As pointed in Proposition 3, load imbalance is mainly impacted by degree distribution of mesh primal graph. The deviation on degree distribution of the primal graph shows that stencil sizes evolve in an irregular way throughout iterations. As shown in Fig. 9 , most of the elapsed time is spent on stencil analysis step, which is the most irregular part of this stage. Indeed, the collapse operation is simulated for each potential target node v, and the validity of the resulting patch is verified at each time (in terms of element area and edge length). Depending on the size and geometrical configuration of the stencil, this step may involve an important load imbalance. An index reordering using a nested dissection heuristic [25] has been performed in order to show the impact of data layout on cache performance. It confirms that data layout has been significantly changed throughout iterations, resulting in a high cache misses count.
Conclusion and perspectives
A scalable fine-grained lock-free scheme for anisotropic mesh adaptation on NUMA architectures was provided. A mean efficiency of 70% was achieved on 32 cores for 3 kernels out of 4, but further efforts have to be done for the contraction. Two tracks may be considered for this purpose: (1) a locality-aware work-stealing scheme to ease load imbalance and (2) the integration of a fine-grained index reordering stage in the remeshing loop to ease cache penalties (the related overhead is actually the main obstacle). Further measures (unit task costs -not only for remeshing ones -effective memory bandwidth/latency) have to be done to compare the predicted execution time with the real elapsed time, and to assess the accuracy of the cost model. An extension to a multi-grained scheme (shared/distributed memory) is expected, with the constraint that the bulksynchronous structure of the algorithm should be kept. A hierarchical bridging model [21] may be used in this case. It will allow to explicitely characterize the latency at each level of the memory hierarchy (caches, local/remote DRAM) for finer prediction. Finally, an extension to the 3D case may be considered.
