This article focuses on the optimization of PCDM, a parallel 2D Delaunay mesh generation application, and on its interaction with parallel architectures based on simultaneous multithreading (SMT) processors. We first present the step-by-step effect of a series of optimizations on performance. These optimizations improve the performance of PCDM by up to a factor of six. They target issues that very often limit the performance of scientific computing codes. We then evaluate the interaction of PCDM with a real SMT-based SMP system, using both high-level metrics, such as execution time, and low-level information from hardware performance counters. We identify specific bottlenecks that prevent SMT processors from efficiently executing fine-grain codes. Finally, we evaluate the effect of limited, realistic hardware extensions, using a simulated SMT-based SMP. These hardware extensions allow SMT processors to execute efficiently even the finest granularity of parallelism available in PCDM.
Introduction
Simultaneous multithreading (SMT) and multicore (CMP) processors have lately found their way into the product lines of all major hardware manufacturers [25, 26, 27]. These processors allow more than one thread to execute simultaneously on the same physical CPU. The degree of resource sharing inside the processor may range from sharing one or more levels of the cache (CMP processors), to almost fully sharing all processor resources (SMT processors).
SMT and CMP chips offer a series of competitive advantages over conventional ones. They are, for example, characterized by better price-to-performance and power-to-performance ratios. As a consequence, they are becoming increasingly popular as building blocks of both multi-layer, high-performance compute servers and off-the-shelf desktop systems.
The pervasiveness of SMT and CMP processors radically changes the software development process. Traditionally, evolution across processor generations alone allowed single-threaded programs to execute more and more efficiently. This trend, however, is diminishing. SMT and CMP processors support, instead, thread-level parallelism within a single chip. As a result, parallel software is necessary for a single application to unleash the computational power of these chips. Needless to say, rewriting existing sequential software, or developing parallel software from scratch, comes at an increased cost and complexity. In addition, the development of efficient code for SMT and CMP processors is not an easy task. Resource sharing inside the chip makes performance hard to analyze and optimize, since performance depends not only on the interaction between individual threads and the hardware, but also on non-trivial interference between threads on resources such as caches, TLBs, instruction queues, and branch predictors.
The trend of massive code development or rewriting restates traditional software engineering tradeoffs between ease of code development and performance.
For example, programmers may either reuse functionality offered by system libraries (synchronization primitives, STL data structures, memory management etc.), or reimplement it from scratch, targeting high performance. They may or may not opt for complex algorithmic optimizations, balancing code simplicity and maintainability with performance.
In this paper we focus on 2D Parallel Constrained Delaunay Mesh (PCDM) generation. Mesh generation is a central building block of many applications, in the areas of engineering, medicine, weather prediction etc. PCDM is an irregular, adaptive, memory-intensive, multi-level and multi-grain parallel implementation of Delaunay mesh generation. We discuss in detail the exploitation of each of the three parallelism granularities present in PCDM on a real, SMT-based multiprocessor. We present the step-by-step optimization of the code and quantify the effect of each particular optimization on performance.
This gradual optimization process results in code that is up to 6 times faster than the original, unoptimized one. Moreover, the optimized code has sequential performance within 12.3% of Triangle [36], the best known sequential Delaunay mesh generation code. The exploitation of parallelism in PCDM allows it to outperform Triangle, even on a single physical (SMT) processor. As a next step, we use low-level performance metrics and information obtained from hardware performance counters to accurately characterize the interaction of PCDM with the underlying architecture. We find that current SMT processors do not offer adequate support for the execution of very fine-grained, irregular codes with high synchronization requirements. Consequently, we discuss limited, realistic hardware extensions for the efficient execution of codes with these characteristics. The evaluation of the hardware extensions on a simulated system demonstrates their effectiveness. Although this study is focused on PCDM, its results are applicable to a whole class of applications with similar characteristics.
The rest of the paper is organized as follows: In Section 2 we discuss related work in the context of performance analysis and optimization for layered parallel architectures. In Section 3 we briefly describe the parallel Delaunay mesh refinement algorithm. Section 4 discusses the implementation and optimization of the multi-grain PCDM on an SMT-based multiprocessor. We study the performance of the application on the target architecture both macroscopically and using low-level metrics. Based on the observations from the performance evaluation and analysis, in Section 5 we discuss and evaluate hardware support for the efficient execution of applications with characteristics similar to those of PCDM on SMT-based multiprocessors. Finally, Section 6 concludes the paper.
Related Work
Although layered multiprocessors have established a strong presence in the server and desktop markets, there is still considerable skepticism for deploying these platforms in supercomputing environments. One reason seems to be that the understanding of the interaction between computationally-intensive scientific applications and these architectures is rather limited. Most existing studies of SMT and CMP processors originate from the computer architecture domain and use conventional uniprocessor benchmarks such as SPEC CPU [23] and shared-memory parallel benchmarks such as SPEC OMP [7] and SPLASH-2 [44] . There is a notable absence of studies that investigate application-specific optimizations for SMT and CMP chips, as well as the architectural implications of SMT and CMP processing cores on real-world applications that demand high FPU performance and high intra-chip and off-chip memory bandwidth. Interestingly, in some real supercomputing installations based on multi-core and SMT processor cores, multi-core execution is often de-activated, primarily due to concerns about the high memory bandwidth demands of multithreaded versions of complex scientific applications [2] . This paper builds upon an earlier study of a realistic application, PCDM, on multi-SMT systems [5] , to investigate the issues pertinent to application optimization and adaptation to layered shared-memory architectures. Similar studies appeared recently in other application domains, such as databases [17, 45] and have yielded results that stir the database community to develop more architecture-aware DataBase Management System (DBMS) infrastructure [22] . 
Another recent study of several realistic applications, including molecular dynamics and material science codes, on a Power5-based system with dual SMT-core processors [21] indicated both advantages and disadvantages from activating SMT. However, the study was confined to execution times and speedups of out-of-the-box codes, without providing further details.
Delaunay Mesh Generation
In this paper we focus on the parallel constrained Delaunay refinement algorithm for 2D geometries. Delaunay mesh generation offers mathematical guarantees on the quality of the resulting mesh [15, 20, 28, 35, 37]. In particular, one can prove that for a user-defined lower bound on the minimal angle (below 20.7°) the algorithm will terminate while matching this bound and produce a size-optimal mesh. It has been proven [29] that a lower bound on the minimal angle is equivalent to an upper bound on the circumradius-to-shortest-edge ratio, which we will use in the description of the algorithm. Another commonly used criterion is an upper bound on triangle area, which makes it possible to obtain sufficiently small triangles.
The sequential Delaunay refinement algorithm works by inserting additional points (so-called Steiner points) into an existing mesh with the goal of removing poor quality triangles, in terms of either shape or size, and replacing them with better quality triangles. Throughout the execution of the algorithm the Delaunay property of the mesh is maintained: the mesh is said to satisfy the Delaunay property if no triangle's circumscribing disk (circumdisk) includes any of the mesh vertices. Usually Steiner points are chosen at the centers (circumcenters) of the circumdisks of bad triangles, although other choices are also possible [12]. For our analysis and implementation we use the Bowyer-Watson (B-W) point insertion procedure [8, 43], which consists of the following steps: (1) the triangles whose circumdisks include the new Steiner point p are identified; they are called the cavity C(p); (2) the triangles in C(p) are deleted from the mesh; as a result, an untriangulated space with closed polygonal boundary ∂C(p) is created; (3) p is connected with each edge of ∂C(p), and the newly created triangles are inserted into the mesh.
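Step (1) of the B-W procedure relies on a test of whether a point lies inside a triangle's circumdisk. The following is a minimal sketch of that test using the standard in-circle determinant; the names `Point`, `orient2d` and `in_circumdisk` are illustrative assumptions and are not taken from PCDM. It uses plain floating-point arithmetic, so it is not robust against round-off for nearly co-circular points.

```cpp
#include <utility>

// Hypothetical 2D point type, used for illustration only.
struct Point { double x, y; };

// Twice the signed area of triangle (a, b, c);
// positive when a, b, c are in counter-clockwise order.
double orient2d(const Point& a, const Point& b, const Point& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Returns true when p lies strictly inside the circumdisk of triangle (a, b, c).
bool in_circumdisk(Point a, Point b, Point c, const Point& p) {
    // The sign convention of the determinant assumes counter-clockwise order.
    if (orient2d(a, b, c) < 0.0) std::swap(b, c);

    // Translate so that p becomes the origin.
    const double ax = a.x - p.x, ay = a.y - p.y;
    const double bx = b.x - p.x, by = b.y - p.y;
    const double cx = c.x - p.x, cy = c.y - p.y;

    // 3x3 in-circle determinant, expanded along the "lifted" column.
    const double det = (ax * ax + ay * ay) * (bx * cy - cx * by)
                     - (bx * bx + by * by) * (ax * cy - cx * ay)
                     + (cx * cx + cy * cy) * (ax * by - bx * ay);
    return det > 0.0;
}
```

A cavity search would call such a predicate on each candidate triangle reached from the triangle that contains p.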
We explore three levels of granularity in parallel Delaunay refinement: coarse, medium, and fine. At the coarse level, the triangulation domain Ω is decomposed into subdomains Ω_i which are distributed among MPI processes and used as units of refinement. When Steiner points are inserted close to subdomain boundaries, the corresponding edges are subdivided, and split messages are sent to the MPI processes refining subdomains that share the specific edge, to ensure boundary conformity [13]. At the medium granularity level, the units of refinement are cavities; in other words, multiple Steiner points are inserted concurrently into a single subdomain. Since the candidate Steiner points can have mutual dependencies, we check for conflicts and cancel some of the insertions if necessary. The problem of Delaunay-independent point insertion, along with parallel algorithms which avoid conflicts, is described in [10, 11, 12, 14]. In this paper, however, we study a different approach which avoids the use of auxiliary lattices and quadtrees, at the cost of rollbacks. Finally, at the fine granularity level, we explore the parallel construction of a single cavity (cavity expansion). This is achieved by having multiple threads check different triangles for inclusion into the cavity.
Implementation, Optimization and Performance Evaluation
In the following paragraphs we discuss the implementation and the optimization process of the three granularities of parallelism in PCDM and their combinations into a new multi-grain implementation we describe in [6]. We also provide insight on the interaction of the application with the hardware on a commercial, low-cost, SMT-based multiprocessor platform.

Table 1: Configuration of the Intel HT Xeon-based SMP system used to evaluate the multigrain implementation of PCDM and its interaction with layered parallel systems.

Apart from their popularity, another reason we focus on Intel HT-based SMPs is that Intel HT processors offer ample opportunities for performance analysis through the performance monitoring counters integrated in the processor [24]. The performance counters offer valuable information on the interaction between software and the underlying hardware. They can be used either directly [34], or through higher level data acquisition and analysis tools [1, 9, 18].

1 The cost of an Intel HT processor was initially the same as that of a conventional processor of the same family and frequency. Gradually conventional processors of the IA-32 family were withdrawn.
Experimental results from larger scale parallel systems, as well as a detailed, direct comparative evaluation of the performance of different parallelism granularities in PCDM are presented in [6] .
Throughout this section we present experimental results applying PCDM on a rocket engine pipe 2D cross-cut domain. The specific engine pipe has been

Table 2: Execution time (in sec) of the original (unoptimized) and the optimized coarse-grain PCDM implementation.

Figure 1 (caption fragment): Similarly, the right diagram depicts the % performance improvement after the application of each additional optimization over the version that incorporates all previous optimizations.

Due to space limitations, we report the effect of optimizations on the coarse-grain PCDM configurations using 1 MPI process per physical processor. However, their effect on configurations using 2 MPI processes per physical processor is quantitatively very similar.
Substitution of Generic STL Data-Structures
The original, unoptimized version of coarse-grain PCDM makes extensive use of STL structures. Although using STL constructs has several software engineering advantages in terms of code readability and code reuse, such constructs often introduce unacceptable overhead.
During the cavity expansion phase, PCDM performs a depth-first search of the triangle graph, the graph in which a triangle is connected with the 3 neighbors it shares faces with. The algorithm identifies triangles included in the cavity, and those that belong to the closure of the cavity, i.e. triangles that share an edge with the boundary of the cavity. The population of these two sets for each cavity is a priori unknown, thus the original PCDM uses STL vectors for the implementation of the respective data structures, taking advantage of the fact that STL vectors can be extended dynamically. Similarly, triangles newly created during cavity re-triangulation are stored in an STL vector as well.
We replaced these STL vectors by array-based LIFO queues. We have conservatively set the maximum size of each queue to 20 elements, since our experiments indicate that the typical population of these queues is only 5-6 triangles for 2D geometries. In any case, a dynamic queue growth mechanism is present and is activated in the infrequent case triangles overflow one of the queue arrays.
Replacing the STL vectors with array-based queues improved the execution time of coarse-grain PCDM by an average 36.98%.
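The array-based LIFO queue with an infrequently used growth path can be sketched as follows. This is an illustration under assumptions, not the PCDM code: the class name `TriangleStack`, the stand-in `Triangle` type and the doubling growth policy are hypothetical; only the initial capacity of 20 elements comes from the text.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the mesh triangle type.
struct Triangle { int id; };

// LIFO of triangle pointers backed by an in-object array.
// The heap is touched only if more than kInlineCapacity elements are pushed.
class TriangleStack {
public:
    static constexpr std::size_t kInlineCapacity = 20;

    TriangleStack() : data_(inline_), size_(0), capacity_(kInlineCapacity) {}
    ~TriangleStack() { if (data_ != inline_) delete[] data_; }
    TriangleStack(const TriangleStack&) = delete;
    TriangleStack& operator=(const TriangleStack&) = delete;

    void push(Triangle* t) {
        if (size_ == capacity_) grow();  // rare overflow path
        data_[size_++] = t;
    }

    // Returns the most recently pushed triangle, or nullptr when empty.
    Triangle* pop() { return size_ > 0 ? data_[--size_] : nullptr; }

    bool empty() const { return size_ == 0; }
    std::size_t size() const { return size_; }
    void clear() { size_ = 0; }  // reuse the same storage for the next cavity

private:
    // Assumed growth policy: double the capacity and copy the live elements.
    void grow() {
        const std::size_t new_capacity = capacity_ * 2;
        Triangle** bigger = new Triangle*[new_capacity];
        std::copy(data_, data_ + size_, bigger);
        if (data_ != inline_) delete[] data_;
        data_ = bigger;
        capacity_ = new_capacity;
    }

    Triangle* inline_[kInlineCapacity];
    Triangle** data_;
    std::size_t size_;
    std::size_t capacity_;
};
```

Because typical cavities hold only a handful of triangles, the common case never leaves the in-object array, which is where the saving over a general-purpose container would be expected to come from.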
Memory Management
Mesh generation is a memory-intensive process which, by its nature, triggers frequent memory management (allocation/deallocation) operations. Even the unoptimized implementation of coarse-grain PCDM includes a custom memory manager. The memory manager focuses on efficiently recycling and managing triangles, since they are by far the most frequently used data structure of PCDM.
After a cavity is expanded, the triangles included in the cavity are deleted and the resulting empty space is then re-triangulated. The memory allocated for deleted triangles is never returned to the system. Deleted triangles are, instead, inserted in a recycling list. The next time the program requires memory for a new triangle (during retriangulation), it reuses deleted triangles from the recycling list. Memory is allocated from the system only when the recycling list is empty.
During mesh refinement, the memory footprint of the mesh is monotonically increasing, since during the refinement of a single cavity the number of deleted triangles is always less than or equal to the number of created triangles. As a result, memory is requested from the system during every single cavity expansion. The optimized PCDM implementation pre-allocates pools (batches) of objects instead of allocating individual objects upon request. We experimentally determined that memory pools spanning the size of 1 page (4 KB for our experimental platform) resulted in the best performance. When all the memory from the pool is used, a new pool is allocated from the system. Batch memory allocation significantly reduces the pressure on the system's memory manager and improves the execution time of coarse-grain PCDM by approximately an additional 6.5%.
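The combination of a recycling list with page-sized pools can be sketched as below. The `Triangle` layout, the intrusive `next_free` link and the class name `TrianglePool` are assumptions made for illustration; the 4096-byte default mirrors the page size reported in the text.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical triangle layout; next_free links recycled triangles together.
struct Triangle {
    double vertices[6];
    Triangle* next_free;
};

class TrianglePool {
public:
    explicit TrianglePool(std::size_t pool_bytes = 4096)
        : per_pool_(pool_bytes / sizeof(Triangle) > 0
                        ? pool_bytes / sizeof(Triangle) : 1),
          free_list_(nullptr), cursor_(nullptr), end_(nullptr) {}

    // Prefer a recycled triangle; otherwise carve one out of the current pool.
    Triangle* allocate() {
        if (free_list_ != nullptr) {
            Triangle* t = free_list_;
            free_list_ = t->next_free;
            return t;
        }
        if (cursor_ == end_) refill();  // only here is the system allocator used
        return cursor_++;
    }

    // Deleted triangles are never returned to the system, only recycled.
    void recycle(Triangle* t) {
        t->next_free = free_list_;
        free_list_ = t;
    }

    std::size_t triangles_per_pool() const { return per_pool_; }
    std::size_t pools_allocated() const { return pools_.size(); }

private:
    void refill() {
        pools_.emplace_back(new Triangle[per_pool_]);
        cursor_ = pools_.back().get();
        end_ = cursor_ + per_pool_;
    }

    std::size_t per_pool_;
    Triangle* free_list_;
    Triangle* cursor_;
    Triangle* end_;
    std::vector<std::unique_ptr<Triangle[]>> pools_;
};
```

With this layout the system allocator is called once per pool rather than once per triangle, which is the effect the batch allocation described above aims for.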
Algorithmic Optimizations
Balancing algorithmic optimizations that target higher performance or lower resource usage, with code simplicity, readability and maintainability is an interesting exercise during code development for scientific applications. When high performance is the main consideration, the decision is usually in favor of the optimized code.
In the case of PCDM, we performed limited, localized modifications in a single, critical computational kernel of the original version. The modifications targeted the reduction or elimination of costly floating-point operations on the critical path of the algorithm.
The specific kernel evaluates the quality of a triangle by comparing its minimum angle with a predefined, user-provided threshold. Let us assume that C is the minimum angle of triangle ABC and L is the threshold angle. The original code would calculate C from the coordinates of the triangle points, using the inner product formula

$$C = \arccos\left(\frac{\vec{a}\cdot\vec{b}}{\|\vec{a}\|\,\|\vec{b}\|}\right)$$

for the calculation of the angle C between vectors $\vec{a}$ and $\vec{b}$. The kernel would then compare C with L to decide whether the specific triangle fulfilled the user-defined quality criteria or not.

However, the calculation of C involves costly arccos and sqrt operations (the latter for the calculation of $\|\vec{a}\|\,\|\vec{b}\|$).

The algorithmic optimizations are based on the observation that, since C and L represent minimum angles of triangles, they are both less than

The specific algorithmic optimizations further improved the execution time of coarse-grain PCDM by an average of 8.82%.
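The exact reformulation used in PCDM is not recoverable from the text above, so the following is only one standard way to remove arccos and sqrt from such a comparison, under the assumption that the threshold L lies strictly between 0 and π/2. On that range the cosine is positive and decreasing, so C < L is equivalent to comparing squared cosines. The type `Vec2` and both function names are hypothetical.

```cpp
#include <cmath>

// Hypothetical 2D vector type, used for illustration only.
struct Vec2 { double x, y; };

// Reference version: computes the angle explicitly (two sqrt calls and one acos).
bool angle_below_threshold_reference(Vec2 a, Vec2 b, double L) {
    const double dot = a.x * b.x + a.y * b.y;
    const double norms = std::sqrt(a.x * a.x + a.y * a.y) *
                         std::sqrt(b.x * b.x + b.y * b.y);
    return std::acos(dot / norms) < L;
}

// Cheaper version, assuming 0 < L < pi/2 and cos2_L == cos(L)^2,
// precomputed once because the threshold is fixed for a whole run.
bool angle_below_threshold_fast(Vec2 a, Vec2 b, double cos2_L) {
    const double dot = a.x * b.x + a.y * b.y;
    if (dot <= 0.0) return false;  // angle is at least pi/2, so it is not below L
    const double len2 = (a.x * a.x + a.y * a.y) * (b.x * b.x + b.y * b.y);
    // cos(C) > cos(L) > 0  <=>  dot^2 > cos(L)^2 * |a|^2 * |b|^2
    return dot * dot > cos2_L * len2;
}
```

The fast version uses only multiplications, additions and comparisons per triangle; the single cosine evaluation moves out of the critical path.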
Medium-grain PCDM
The medium-grain PCDM implementation spawns threads inside each MPI process. These threads cooperate for the refinement of a single subdomain, by simultaneously expanding different cavities. The threads of each MPI process are bound one-by-one to the execution contexts of a physical processor.

Table 3: Execution time (in sec) of the original (unoptimized) and the optimized medium+coarse multi-grain PCDM implementation.

inside each SMT processor (2 execution contexts per processor for our experimental platform, executing one medium-grain thread each). The unoptimized multi-grain implementation performs almost 3 times worse than the unoptimized coarse-grain one. However, our optimizations result in code that is approximately 6 times faster than the original, unoptimized implementation.

The exploitation of the second execution context of each SMT processor allows optimized multi-grain PCDM to outperform the optimized coarse-grain configuration which exploits only one SMT execution context on each physical processor. On up to 4 processors, however, it is slightly less efficient than the coarse-grain configuration that executes 2 MPI processes on each CPU (see footnote 3).

Similarly with Figure 1, the diagrams of Figure 2 itemize the effect of each

Figure 2 (caption fragment): Similarly, the right diagram depicts the % performance improvement after the application of each additional optimization over the version that incorporates all previous optimizations.

3 In [6] we evaluate PCDM on larger-scale systems. We find that the use of additional MPI processes comes at the cost of additional preprocessing overhead and we identify cases in which the combination of coarse-grain and medium-grain (coarse+medium) PCDM proves more efficient than a single-level coarse-grain approach. Furthermore, in [6], we evaluate the medium-grain implementation of PCDM on IBM Power5 processors, in which the cores have a seemingly more scalable implementation of the SMT architecture, compared to the older Intel HT processors used in this study.
Synchronization
A major algorithmic concern for medium-grain PCDM is the potential occurrence of conflicts while threads are simultaneously expanding cavities. Multiple threads may work on different cavities at the same time, within the same domain. A conflict occurs if any two cavities, processed simultaneously by different threads, overlap, i.e., have a common triangle or share an edge. In this case, only a single cavity expansion may continue; the rest need to be canceled. This necessitates a conflict detection and recovery mechanism. Each triangle is tagged with a flag (taken). Whenever a triangle is touched during a cavity expansion (either because it is actually part of the cavity itself or of its closure), the flag is set. The closure of the cavity, namely the extra layer of triangles that surround the cavity without being part of it, prevents two cavities from sharing an edge (Figure 3) [30, 31]. If, during a cavity expansion, a thread touches a triangle whose flag has already been set, the thread detects a conflict. The cavity expansion must then be canceled.
Updates of the flag variable need to be atomic, since two or more threads may access the same triangle simultaneously. Every access to the triangle's flag is performed through atomic fetch_and_store() operations. On the vast majority of modern shared-memory architectures these instructions incur less overhead than conventional locks or semaphores under high contention, while providing additional advantages such as immunity to preemption. The use of atomic instructions resulted in 33% to 39% faster code than an alternative, naive implementation using POSIX lock/unlock operations for the protection of the flag.
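The flag-based conflict detection and rollback can be sketched with C++ `std::atomic`, whose `exchange` member plays the role of a fetch-and-store. This is an illustration, not the PCDM code: the function names are hypothetical, and the sketch assumes the caller never passes the same triangle twice within one cavity.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical triangle carrying only the conflict-detection flag.
struct Triangle {
    std::atomic<bool> taken{false};
};

// Atomically sets the flag and reports whether this caller set it first.
bool try_claim(Triangle& t) {
    return !t.taken.exchange(true, std::memory_order_acq_rel);
}

void release(Triangle& t) {
    t.taken.store(false, std::memory_order_release);
}

// Claims every triangle touched by a cavity expansion (cavity plus closure).
// On the first conflict, the flags set so far are cleared again (rollback)
// and false is returned, so that the expansion can be canceled and retried.
bool claim_cavity_or_rollback(const std::vector<Triangle*>& touched) {
    for (std::size_t i = 0; i < touched.size(); ++i) {
        if (!try_claim(*touched[i])) {
            for (std::size_t j = 0; j < i; ++j) release(*touched[j]);
            return false;
        }
    }
    return true;
}
```

Storing `true` over a flag that is already set changes nothing, so a losing thread never disturbs the state of the thread that owns the triangle.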
Reduction of Conflicts
The cancellation of cavity expansions as a consequence of conflicts directly results in the discarding of already performed computation. The canceled cavity expansion has to be restarted from the beginning. It is, thus, critical for performance to minimize the occurrence of conflicts.

The optimized multi-grain PCDM implementation isolates each thread to a single area of the sub-domain (Figure 4). We apply a straightforward, computationally inexpensive decomposition, using simple, straight segments, by occur only close to the borders between areas. Moreover, the probability of conflicts decreases as the quality of the mesh improves [10].

Table 4: Number of conflicts before and after splitting (in two) the working area inside each sub-domain.

The implementation of the conflict reduction technique is interdependent with the design and implementation of the work-queue hierarchy, presented later in Section 4.2.3. As a result, the effect of each of these two optimizations on execution time cannot be isolated and evaluated separately.
Work-Queues Hierarchy
PCDM maintains a global queue of "bad" triangles, i.e., triangles that violate quality criteria. Whenever a cavity is re-triangulated, the quality of the new triangles is checked, and any offending triangle is placed into the queue.
Throughout the refinement process threads poll the queue. As long as it is not empty, they retrieve a triangle from the top, and start a new cavity expansion. In medium-grain PCDM, the queue is concurrently accessed by multiple threads and thus needs to be protected.
A straightforward solution for reducing the overhead due to contention is to use local, per thread queues of bad triangles. Bad triangles that belong to a specific working area of the sub-domain are inserted in the local list of the thread working in that area. Since, however, a cavity can cross the working area boundaries, a thread can produce bad triangles situated at areas assigned to other threads. As a result, local queues of bad triangles still need to be protected, although they are significantly less contended than a single global queue.
A hierarchical queue scheme with two local queues of bad triangles per thread is applied to further reduce locking and contention overhead. One queue is strictly private to the owning thread, while the other can be shared with other threads, and therefore needs to be protected. If a thread, during a cavity retriangulation, creates a new bad triangle whose circumcenter is included in its assigned working area, the new triangle is inserted in the private local queue. If, however, the circumcenter of the triangle is located in the area assigned to another thread, the triangle is inserted in the shared local queue of that thread (Figure 5). Each thread dequeues triangles from its private queue as long as the private queue is not empty. A thread polls its shared local queue only when the private queue is found empty.
As expected, the private local queue of bad triangles is accessed much more frequently than the shared local one. During the creation of the mesh of 10M triangles for the pipe domain, using two threads to exploit medium-grain parallelism, the shared queues of bad triangles are accessed 800,000 times, while the private ones are accessed more than 12,000,000 times. Therefore, the synchronization overhead for the protection of the shared queues is practically negligible.
The average performance improvement after reducing cavity expansion conflicts and using the 2-level queue scheme is 40.52%.
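The two-level queue scheme can be sketched as follows. The names `WorkerQueues` and `enqueue_bad_triangle`, the use of `std::mutex` for the shared queue and the LIFO order are assumptions made for illustration; the text specifies only that the private queue is unprotected, the shared one is protected, and the private one is drained first.

```cpp
#include <mutex>
#include <vector>

// Hypothetical bad-triangle record; only the circumcenter matters here.
struct Triangle { double cx, cy; };

// Per-thread pair of queues: private_ is touched only by the owning thread,
// shared_ may receive triangles from any thread and is therefore locked.
class WorkerQueues {
public:
    void push_private(Triangle* t) { private_.push_back(t); }  // owner only

    void push_shared(Triangle* t) {
        std::lock_guard<std::mutex> guard(shared_lock_);
        shared_.push_back(t);
    }

    // Owner only: drain the private queue first, then fall back to the shared one.
    Triangle* next() {
        if (!private_.empty()) {
            Triangle* t = private_.back();
            private_.pop_back();
            return t;
        }
        std::lock_guard<std::mutex> guard(shared_lock_);
        if (shared_.empty()) return nullptr;
        Triangle* t = shared_.back();
        shared_.pop_back();
        return t;
    }

private:
    std::vector<Triangle*> private_;
    std::mutex shared_lock_;
    std::vector<Triangle*> shared_;
};

// Routes a new bad triangle according to the working area that contains its
// circumcenter; 'self' is the creating thread, 'owner' the owner of that area.
void enqueue_bad_triangle(Triangle* t, int self, int owner, WorkerQueues* queues) {
    if (owner == self) {
        queues[self].push_private(t);   // common case: no locking at all
    } else {
        queues[owner].push_shared(t);   // rare case: crosses an area border
    }
}
```

Since most bad triangles stay within the creating thread's area, the lock on the shared queue is taken only for the small fraction that crosses a border.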
Memory Management
The memory recycling mechanism of PCDM, described in Section 4.1.2, is not efficient enough in the case of medium-grain PCDM for two reasons:
• The recycling list is shared between threads and thus accesses to it need to be protected.
• Memory allocation/deallocation requests from different threads cause contention inside the system's memory allocator. Such contention may result in severe performance degradation for applications with frequent memory management operations.
In the optimized medium-grain PCDM we associate a local memory recycling list with each thread. Local lists alleviate the problem of contention at the level of the recycling list and eliminate the respective synchronization overhead. A typical concern whenever private, per thread lists are used is the potential imbalance in the population of the lists. This is, however, not an issue in the case of PCDM since, as explained in section 4.1.2, the population of triangles either remains the same or increases during every single cavity refinement.
To reduce pressure on the system's memory allocator, medium-grain PCDM also uses memory pools. The difference with coarse-grain PCDM is that memory pools are thread-local and thus do not need to be protected.
The execution time of coarse+medium grain PCDM further improved by 13.49% on average after the memory management-related optimizations were applied.
Substitution of STL Data-Structures
In section 4.1.1 we described the substitution of STL constructs with generic data structures (arrays) in the code related to cavity expansion. This optimization is applicable to the medium-grain implementation of PCDM code as well.
The average performance improvement from substituting STL constructs with generic data structures is on the order of 44.21%, roughly 7.2 percentage points higher than the improvement attained by the same substitution in the coarse-grain PCDM implementation (36.98%). STL data structures introduce additional overhead when used in multi-threaded code, due to the mechanisms used by STL to guarantee thread-safety.
Algorithmic Optimizations
The algorithmic optimizations described in section 4. 
Load Balancing
As explained in Section 4.2.2, each sub-domain is divided in distinct areas, and the refinement of each area is assigned to a single thread. The decomposition is performed by equipartitioning a rectangle enclosing the subdomain, using straight lines as separators. Despite being straightforward and computationally inexpensive, this type of decomposition can introduce load imbalance between threads for irregular subdomains (Figure 6).
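The equipartitioning of the enclosing rectangle can be sketched as a mapping from a point to the index of the thread that owns it. The sketch assumes vertical separators, i.e. equal-width strips along the x axis; the text states only that straight lines are used. `BoundingBox` and `area_of_point` are hypothetical names.

```cpp
#include <algorithm>

// Horizontal extent of the rectangle enclosing the sub-domain (assumed layout).
struct BoundingBox { double xmin, xmax; };

// Maps an x coordinate to the index of the working area (and thread) that owns
// it, assuming num_threads equal-width vertical strips over the bounding box.
int area_of_point(double x, const BoundingBox& box, int num_threads) {
    const double strip_width = (box.xmax - box.xmin) / num_threads;
    const int index = static_cast<int>((x - box.xmin) / strip_width);
    // Clamp so that points on or just outside the box border stay valid.
    return std::max(0, std::min(num_threads - 1, index));
}
```

The mapping ignores how much of the sub-domain actually falls inside each strip, which is why an irregular sub-domain can leave one thread with far more triangles than another.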
Fine-grain PCDM
Fine-grain PCDM also spawns threads (a master and one or more workers) inside each MPI process. The difference with the medium-grain PCDM implementation is that in the fine-grain case the threads cooperate for the expansion of a single cavity. Cavity expansions account for 59% of the total PCDM execution time.

The master thread behaves similarly to a coarse-grain MPI process. Worker threads assist the master during cavity expansions and idle otherwise. Triangles that have already been tested for inclusion in the cavity have to be tagged so that they are not checked again during the expansion of the same cavity. Similarly to the medium-grain PCDM implementation, we use atomic test_and_set() operations to atomically test and set the value of a flag. Each thread queues/dequeues unprocessed triangles to/from a thread-local queue. As soon as the local queue is empty, threads try to steal work from the local queues of other threads. Since the shape of a cavity is, unlike the shape of a sub-domain, not a priori known, techniques such as the multi-level queue
Experimental Study
We executed a version of PCDM which exploits both the fine and the coarse granularities of parallelism (Coarse+Fine On the specific system, Triangle is 12.3% faster than the optimized, sequential PCDM. The multilevel PCDM code (Coarse+Fine) does not perform well. In fact, a slowdown of 44.5% occurs as soon as a second thread is used to take advantage of the second execution context of the HT processor. The absolute performance is improved as more physical processors are used (2 and the multi-grain one (by 43.6% on 2 processors and by 45.5% on 4 processors). The performance difference is even higher compared with the coarse-grain configuration using 2 MPI processes per processor. In any case, whether single- or multi-level (coarse+fine), 2 processors are sufficient for PCDM to outperform the extensively optimized, sequential Triangle, whereas Coarse (2 MPI/proc) manages to outperform Triangle even on a single SMT processor.
We used the hardware performance counters available on Intel HT processors

The number of stall cycles (Fig. 9a) is a single metric that provides insight into the extent of contention between the two threads running on the execution contexts of the same processor. It indicates the number of cycles each thread spent waiting because an internal processor resource was occupied by either the other thread or by previous instructions of the same thread. The average per-stall latency, on the other hand, indicates how much of a performance penalty each stall introduces. Whenever two threads share the same processor, the stall cycles are 3.6 to 3.7 times more for Coarse+Fine and 3.9 times more for Coarse (2 MPI/proc). Exploiting the two execution contexts of each HT processor with two MPI processes seems to introduce more stalls. It should, however, be noted that the worker thread in the Coarse+Fine implementation performs useful computation only during cavity expansions, which account for 59% of the execution time of sequential PCDM. In contrast, the MPI processes of Coarse (2 MPI/proc) perform useful computation throughout the execution life of the application.

Resource sharing inside the processor has a negative effect on the average latency associated with each stall as well. The average latency is 10 cycles when one thread is executed on each physical processor. When two MPI processes share the same processor it rises to approximately 15 cycles. When two threads that exploit the fine-grain parallelism of PCDM are co-located on the same processor, the average latency ranges between 11.3 and 11.9 cycles.
Interesting information is also revealed by the number of retired instructions (Fig. 9b). Whenever two processors are used, the total number of instructions increases by a factor of approximately 1.4, with respect to the corresponding single-processor experiments, for the two coarse configurations and the Coarse+Fine version. We have traced the source of this problematic behavior to the internal implementation of the MPI library, which attempts to minimize response time by performing active spinning whenever a thread has to wait for the completion of an MPI operation. Active spinning produces very tight loops of "fast" instructions with memory references that hit in the L1 cache. If more than two processors are used, the cycles spent spinning inside the MPI library are reduced, with an immediate effect on the total number of instructions.
Another interesting observation is that the multilevel version of the algorithm [...] instructions typically result in hits in the L1 cache, do not create dependencies, and retire quickly; thus, even when they occupy processor resources, they do not introduce high latencies.
A significant side-effect of active spinning is the performance penalty suffered by computational threads that share the same processor with spinning threads. Both threads share a common set of processor resources, such as execution units and instruction queues. The instructions issued by the spinning threads tend to fill the queues, thus delaying potentially useful instructions issued by the other execution context. Our experimental evaluation indicates a slowdown of more than 25% when a PCDM thread is executed together with an active spinning thread on the same physical CPU.
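The difference between active spinning and blocking can be sketched as follows. This is a minimal illustration, not the MPI library's code; the names `wait_by_spinning`, `wait_by_blocking` and `run` are hypothetical. Python's interpreter lock means the sketch cannot reproduce the SMT contention measured above; it only shows that a spinning waiter keeps issuing instructions while a blocked one issues none.

```python
import threading


def wait_by_spinning(ready):
    """Poll the flag in a tight loop; return how many polls were issued."""
    polls = 0
    while not ready.is_set():   # busy loop: keeps the execution context occupied
        polls += 1
    return polls


def wait_by_blocking(ready):
    """Sleep until the flag is set; no polling instructions are issued."""
    ready.wait()
    return 0


def run(wait_fn, delay_s=0.05):
    """Set the flag from a timer thread after delay_s and report the polls spent waiting."""
    ready = threading.Event()
    timer = threading.Timer(delay_s, ready.set)
    timer.start()
    polls = wait_fn(ready)
    timer.join()
    return polls
```

Calling `run(wait_by_spinning)` typically reports many thousands of polls for a 50 ms wait, whereas `run(wait_by_blocking)` reports none.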
Detailed profiling of the code revealed that up to 24.5% of the cycles are spent on synchronization operations, both for the protection of work-queues and for tagging each triangle upon checking it for inclusion in a cavity. Synchronization is always limited to the two threads co-located on the same physical processor, and memory references due to synchronization operations always hit in the cache. However, the massive number of processed triangles results in a high cumulative synchronization overhead. Moreover, the software implementation of mutual exclusion algorithms introduces active spinning at the entrance of protected code regions whenever access to these regions is contended.
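The per-triangle tagging described above can be sketched as a small claim operation. This is an illustrative sketch only: the `Triangle` class, `try_tag`, `claim_all` and `demo` are hypothetical names, and a per-triangle `threading.Lock` stands in for whatever atomic primitive PCDM actually uses.

```python
import threading


class Triangle:
    """A triangle that can be tagged by exactly one thread."""

    def __init__(self, tid):
        self.tid = tid
        self.tagged_by = None
        self._lock = threading.Lock()   # stand-in for an atomic test-and-set

    def try_tag(self, owner):
        """Claim the triangle for owner; return True only for the first claimant."""
        with self._lock:
            if self.tagged_by is None:
                self.tagged_by = owner
                return True
            return False


def claim_all(triangles, owner, claimed):
    """Try to tag every triangle and record how many this owner won."""
    claimed[owner] = sum(1 for t in triangles if t.try_tag(owner))


def demo(n=1000):
    """Let two threads race to tag the same n triangles."""
    triangles = [Triangle(i) for i in range(n)]
    claimed = {}
    threads = [threading.Thread(target=claim_all, args=(triangles, owner, claimed))
               for owner in ("main", "worker")]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return claimed, triangles
```

One lock operation is paid per triangle check, which is why the cumulative cost grows with the number of processed triangles even when every operation hits in the cache.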
A fundamental reason that prevents Intel HT processors from efficiently supporting fine-grain parallelism is the lack of hardware support for light-weight threading. Such support includes hardware primitives for efficient thread spawning and joining, queuing of hardware threads, and dispatching to the execution contexts of the processor. The lack of such support forces programmers to implement similar functionality in software. As a result, the PCDM implementation organizes unprocessed triangles in queues and uses kernel threads as "virtual processors" which dispatch and process triangles. If adequate hardware functionality were available, unprocessed triangles would be naturally translated to hardware threads, rendering all the aforementioned software support unnecessary. Such an implementation would also eliminate problems identified earlier, such as active spinning whenever the worker thread is idling and synchronization for access to the queues.
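The software scheme described above, with queues of unprocessed items served by threads acting as "virtual processors", can be sketched as follows. This is a minimal sketch under our own assumptions: `virtual_processor` and `run_virtual_processors` are hypothetical names, Python threads stand in for kernel threads, and the `process` callback stands in for the per-triangle work.

```python
import queue
import threading


def virtual_processor(work, done, process):
    """Dispatch items from the shared queue until a None sentinel arrives."""
    while True:
        item = work.get()
        if item is None:          # sentinel: no more work for this thread
            break
        done.put(process(item))


def run_virtual_processors(items, process, n_workers=2):
    """Process all items with n_workers threads; results come back in completion order."""
    items = list(items)
    work, done = queue.Queue(), queue.Queue()
    for item in items:
        work.put(item)
    for _ in range(n_workers):
        work.put(None)            # one sentinel per virtual processor
    threads = [threading.Thread(target=virtual_processor, args=(work, done, process))
               for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [done.get() for _ in range(len(items))]
```

Every `get` and `put` here hides a lock acquisition, which is the queue-protection overhead that hardware thread queuing and dispatching would remove.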
Alternative Methods for the Exploitation of Execution Contexts
As is the case with most pointer-chasing codes, PCDM suffers from poor cache locality. Previous literature has suggested the use of speculative precomputation (SPR) [16] for speeding up such codes on SMTs and CMPs [16, 41]. SPR exploits one of the execution contexts of the processor in order to precompute the addresses of memory accesses that lead to cache misses and to pre-execute these accesses ahead of the computation thread. In many cases, the precomputation thread manages to execute faster than, and ahead of, the computation thread. As a result, data are prefetched into the caches in a timely manner.
We have evaluated the use of the second hyperthread for indiscriminate precomputation, by cloning the code executed by the computation thread and stripping it of everything but data accesses and memory address calculations. The precomputation thread successfully prefetched all data touched by the computation thread. However, the execution time was higher than that of the versions with one thread per CPU or two computation threads per CPU. As explained in the previous section, Intel HT processors do not provide mechanisms for low-overhead thread suspension and resumption. As a result, when the precomputation thread prefetches an element, it performs active spinning until the next element to be prefetched is known. However, as reported earlier, active spinning slows down the computation thread by more than 25%.
We tried to suspend and resume the precomputation thread using the finest-grain sleep/wakeup primitives provided by the OS. In this case, the computation thread does not suffer a slowdown; however, as explained earlier, the la- [...]

The discussion in section 4.3 revealed weaknesses in the design of current, commercially available SMTs that do not allow the efficient exploitation of fine-grain parallelism. In this section we discuss a set of potential architectural extensions, similar to extensions that have already been proposed for fine-grain and speculative multithreaded processors, and project the impact that these architectural optimizations will have on performance. We focus on hardware support for synchronization and thread management, since emerging processor architectures can easily provide realistic support for almost zero-cost synchronization and thread spawning/joining.
Extensions for Fine-Grain Synchronization
The problem of synchronization latency on multithreaded processors has been addressed in earlier work. Fine-grain synchronization on a word-by-word basis can be enabled by a full/empty bit, an architectural feature first used in the Tera MTA [4, 38], by special-purpose synchronization registers, such as those found on the Cray X-MP, and by other mechanisms. In this work we consider the use of a lock box [39] as an efficient synchronization mechanism that can be implemented at modest hardware cost. A lock box is a small buffer in the processor with one entry per thread, holding the lock address, the address of the locking instruction, and a valid bit. On a failed attempt to acquire a lock with a read-modify-write instruction, the acquiring thread blocks and is flushed from the processor. On a release, the address of the lock is compared, with a parallel associative search, against all the contents of the lock box. If a match is found, the matching thread is woken up. We estimate the latency of the entire critical path of a critical section using the lock box at 10 cycles, following the design suggestions in [39].
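The lock-box behavior described above can be modeled in a few lines. This is a functional toy model, not a hardware description: the `LockBox` class and its methods are hypothetical names, the `held` set stands in for the lock words in memory, and we assume that a woken thread simply re-executes its locking instruction.

```python
class LockBox:
    """Toy model of a lock box: one entry per hardware thread."""

    def __init__(self, n_threads):
        # Each entry holds the lock address, the locking instruction's address, and a valid bit.
        self.entries = [{"lock_addr": None, "pc": None, "valid": False}
                        for _ in range(n_threads)]
        self.held = set()   # lock addresses currently held (models memory state)

    def acquire(self, thread_id, lock_addr, pc):
        """Return True if the lock was taken; otherwise block the thread in its entry."""
        if lock_addr not in self.held:
            self.held.add(lock_addr)
            return True
        self.entries[thread_id] = {"lock_addr": lock_addr, "pc": pc, "valid": True}
        return False        # thread is flushed from the pipeline until woken

    def release(self, lock_addr):
        """Free the lock and return the id of the woken thread, or None."""
        self.held.discard(lock_addr)
        # Hardware would compare all entries in parallel; the model scans them.
        for thread_id, entry in enumerate(self.entries):
            if entry["valid"] and entry["lock_addr"] == lock_addr:
                entry["valid"] = False
                return thread_id   # assumed to restart at entry["pc"] and retry
        return None
```

The key property is that a blocked thread issues no instructions while it waits, in contrast to a software lock that spins at the entrance of the critical section.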
Extensions for Fine-Grain Thread Spawning
Several multithreaded processor designs, including simultaneous multithreaded processors with embedded support for dynamic precomputation [16] , threaded multipath execution processors [40] and implicitly multithreaded processors [33] , support automatic thread spawning in hardware. Besides multiple hardware contexts and program counters, this hardware support includes a mechanism for communicating the live-in register values to a newly spawned thread, and instructions to spawn and join threads. Some designs allow thread spawning in the context of the same basic block, while others extend this mechanism to support function calls issued on separate hardware contexts [3] . Although many of the related studies assume negligible thread spawning latencies, communicating register values requires extra processor cycles, and in some cases, register spilling to memory. Related studies estimate the latency between 2 and 10 cycles depending on the assumptions [3, 33, 40] . In this work, we simulate a hardware thread spawning mechanism which supports register communication between threads executing across basic blocks and function boundaries. We conservatively assume a latency of 10 cycles for thread spawning. This latency includes register communication and potential queuing of threads.
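To put the assumed 10-cycle spawning latency in perspective, the following sketch computes the fraction of a spawned thread's lifetime that is pure spawn overhead. The function name and the example work sizes are hypothetical and are not measurements from PCDM.

```python
def spawn_overhead_fraction(work_cycles, spawn_cycles=10):
    """Fraction of (spawn + work) cycles spent spawning the thread."""
    if work_cycles <= 0:
        raise ValueError("work_cycles must be positive")
    return spawn_cycles / (spawn_cycles + work_cycles)
```

Under this simple model a thread that performs 90 cycles of work pays 10% overhead, while one that performs 990 cycles pays 1%, so a 10-cycle spawn is only negligible when each unit of fine-grain work is at least a few hundred cycles long.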
Experimental Evaluation of Hardware Extensions
We have used a multi-SMT simulator based on SimICS [19] to evaluate the impact of limited, realistic hardware support for thread execution and synchronization on the performance of the fine-grain implementation of PCDM. Table 5 shows the parameters of our multi-SMT simulator. We simulated the functionality of a lock box and a hardware thread spawning mechanism. Notice that, whenever possible, our simulated system is configured with exactly the same amount of resources offered by our real multi-SMT platform. This allows us to isolate the impact of our hardware extensions on the performance of the fine-grain parallel execution.
Table 5: Simulation parameters for the multi-SMT system used to evaluate the fine-grain implementation of PCDM on emerging microprocessors.

We conducted complete system simulations, including system calls and OS overhead, using different levels of hardware support, with the multilevel im- [...]

We expect more aggressive hardware mechanisms for thread management and synchronization to be present in the upcoming generations of multithreaded processors. More aggressive support will be a natural aftereffect of advances in technology and the need to meet the requirements of applications with fine-grain parallelism.
Conclusions
As SMT processors become more widespread, parallel systems are being built using one or more of these processors. The ubiquity of SMT processors necessitates a shift towards parallel programming, especially in the context of scientific computing. The development of parallel codes is not an easy undertaking, especially if high performance is the end goal. Code optimization is a valuable step of the development process; however, the programmer has to both identify performance bottlenecks and evaluate complex tradeoffs. At the same time, adaptive and irregular applications are a challenging target for any parallel architecture. Investigating whether emerging parallel architectures are well suited for such applications is, therefore, an important undertaking. Our paper makes contributions in these directions, focusing on PCDM, a multi-level, multi-grain parallel mesh generation code.
We first presented a step-by-step optimization of the two outer granularities of PCDM. Despite the fact that PCDM is the direct target of these optimizations, most of them are generic enough to be applicable to other applications of the same class. We evaluated and presented the effect of each individual optimization on performance. The resulting optimized code was up to 6 times more efficient than the original one.
We then evaluated the interaction of the finest granularity of PCDM with multi-level, SMT-based parallel architectures, using both high-level metrics, such as execution time, and low-level ones, such as stall cycles, stall latency and number of retired instructions. The evaluation revealed weaknesses in the design of commercially available SMT processors, which do not allow them to efficiently exploit parallelism at very fine granularities.
Based on these findings, we simulated an SMT-based multiprocessor including minimal, realistic hardware support for efficient intra-processor synchronization and efficient hardware thread management. Our experiments indicate that the hardware modifications, although minimal, enabled the efficient execution of fine-grain PCDM, allowing a multi-level (Coarse+Fine) implementation to outperform a single-level one by up to 19.8%, even when the latter used both execution contexts of each physical processor.
As modern parallel systems integrate many execution contexts organized, due to technical limitations, in more and more levels, system architects are faced with a choice between performance and programmability. They can present all the computational resources of the system to the programmer in a uniform way, in order to facilitate programming. Alternatively, they can [...] Next-generation system software has a significant role in this emerging environment; it can bridge these two alternatives. New compilers, operating system kernels and run-time libraries need to be developed specifically for layered parallel architectures, with the goal of hiding complex architectural details from the programmer while exploiting, in an informed manner, the structural organization of the hardware in order to unleash the performance potential of modern parallel architectures.
