Finite state machines (FSMs) are the backbone of many applications, but are difficult to parallelize due to their inherent dependencies. Speculative FSM parallelization has shown promise on multicore machines with up to eight cores. However, as hardware parallelism grows (e.g., Xeon Phi has up to 288 logical cores), a fundamental question arises: How does speculative FSM parallelization scale as the number of cores increases? Without answering this question, existing methods for speculative FSM parallelization simply use all available cores, which might not only waste computing resources, but also result in suboptimal performance.
INTRODUCTION
Scalability is fundamental to high-performance applications. An accurate scalability analysis helps realize the optimal performance and avoid unnecessary use of additional computing resources. This work provides an accurate scalability analysis for the speculative parallelization of finite state machine (FSM) computations.
As a classic computation model, FSMs have been widely used in many critical applications, such as intrusion detection [16, 36], data decoding [14, 34], genome and protein motif searching [4, 33], and high-performance web analytics [24]. Given their fundamental role in many performance-critical applications, it is anticipated that emerging architectures will feature hardware support for FSM computations, such as the automata processor [6].
However, due to the tight dependencies among state transitions, FSM computations are extremely difficult to parallelize. As shown in the code snippet below, at each transition, the current state (state) depends not only on the input symbol (c) but also on the prior state (prior). Such dependencies form a dependence chain that inherently prevents FSM computations from running in parallel.
State of the Art. To overcome the dependencies, existing methods often rely on speculative parallelization [22, 26, 38]. Basically, they first partition the input sequence evenly into $N_{core}$ chunks, where $N_{core}$ is the number of available cores, then process the chunks in parallel, each (except the first) with a predicted starting state.
In the case where a prediction fails, they need to reprocess the wrong part to ensure correctness (see Section 2). This strategy has shown promise on small-scale multicore processors (up to eight cores). However, it is still poorly understood how well it can scale to larger parallel platforms with tens or even hundreds of processing cores.¹ Without answering these questions, existing methods may not only suffer from suboptimal performance, but also waste computing resources that could otherwise be used for other computations. As illustrated by Figure 1, when executed on a Xeon Phi processor with 256 logical cores, existing methods [37, 38] result in suboptimal speedup on two FSM benchmarks, with up to nearly 5X performance degradation compared to the optimal configurations. It is also important to note that the optimal number of cores varies across different FSMs.
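The dependence chain and the predict-and-reprocess strategy described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by the cited methods: the three-state FSM, the input string, the predictor, and all function names are hypothetical, and the chunks are processed one after another instead of on separate threads.

```python
def run_fsm(transitions, start, symbols):
    """Sequential run: each transition needs the state produced by the previous one."""
    state = start
    for c in symbols:
        state = transitions[(state, c)]
    return state


def speculative_run(transitions, start, symbols, n, predict):
    """Predict-validate-reprocess over n chunks (chunks are simulated serially here)."""
    size = max(1, -(-len(symbols) // n))  # ceiling division, at least 1
    chunks = [symbols[i:i + size] for i in range(0, len(symbols), size)]
    # Speculate: the first chunk uses the real start state, the rest use guesses.
    guesses = [start] + [predict(i) for i in range(1, len(chunks))]
    ends = [run_fsm(transitions, g, ch) for g, ch in zip(guesses, chunks)]
    # Validate in order; on a wrong guess, reprocess only until the two paths meet.
    reprocessed = 0
    state = start  # true starting state of the current chunk
    for guess, chunk, spec_end in zip(guesses, chunks, ends):
        s_true, s_pred = state, guess
        for c in chunk:
            if s_true == s_pred:
                break
            s_true = transitions[(s_true, c)]
            s_pred = transitions[(s_pred, c)]
            reprocessed += 1
        # If the paths met, the speculative end state is valid; otherwise the
        # whole chunk was reprocessed and s_true is its true end state.
        state = spec_end if s_true == s_pred else s_true
    return state, reprocessed


# Hypothetical FSM: 'x' advances a counter modulo 3, 'r' resets it to 0.
fsm = {(s, "x"): (s + 1) % 3 for s in range(3)}
fsm.update({(s, "r"): 0 for s in range(3)})
text = "xxrx" * 3

final_state, redone = speculative_run(fsm, 0, text, 3, predict=lambda i: 0)
```

In this toy run the predictor always guesses state 0, which is wrong for the second and third chunks, yet only the symbols up to the first reset have to be reprocessed because the wrong and the correct paths converge there.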
Unlike prior work that focuses on either designing speculation techniques [38] or reducing profiling cost [37], this work aims to achieve the optimal performance gain for speculative FSM parallelization with an accurate scalability analysis.
However, accurately analyzing the scalability of speculative FSM parallelization is challenging for four reasons. First, by nature, speculative parallelization is non-deterministic. Its overall performance depends heavily on the accuracy of the speculation. Second, when a speculation fails (misspeculation), the incorrect part must be reprocessed. But the reprocessing cost may vary across different chunks, depending on the convergence. Consequently, the total cost of misspeculation correlates with the number of cores nonlinearly, making existing models fail to capture its scalability. Finally, the actual scalability of FSM computations is also constrained by the machines on which they are executed, through resource contention and relative execution speed.
To address the above complexities, this work introduces a series of scalability models for speculative FSM parallelization. The models integrate a probabilistic analysis to capture the non-deterministic behaviors of speculation and an offline sample-based conditional regression (SCR) technique to characterize the cost variation of misspeculation. Unlike existing FSM characterization [37, 38], which requires profiling the convergence property for every pair of states, SCR only profiles state pairs that are more likely to appear in the actual speculative execution. Based on the probabilistic models and SCR, this work designs both an architecture-independent scalability analysis and an architecture-aware scalability analysis. The former analyzes the scalability solely based on the design of speculative parallelization and the properties of an FSM. It guides designers in tuning the speculative parallelization scheme and helps developers compare the scalability of various FSMs. In comparison, the latter further characterizes the architecture factors that may affect the actual scalability, making the scalability analysis practical in real-world computing environments.
To effectively leverage the above scalability analyses, this work develops S3, a scalability-sensitive speculative parallelization framework for FSM computations. At a high level, S3 works in three steps: (1) it first characterizes the FSM's properties and measures the architectural factors of the machine; (2) with the measurements, S3 next automatically reasons about the scalability and infers the optimal number of cores $n^*$ to use; (3) finally, it feeds $n^*$ into the speculative parallelization to maximize its performance gain.
Experiments on a set of real-world FSM benchmarks demonstrate the accuracies of the proposed models and show that S3 can boost the performance of existing techniques up to 5X, with significant energy savings in most cases (up to 77%).
In sum, this work makes the following four-fold contributions.
• This work, for the first time, points out the suboptimality of existing speculative FSM parallelization methods when moving to larger-scale parallel platforms.
• It provides a series of rigorous scalability models, with a sample-based conditional regression technique. Together, they enable an accurate characterization of complex scaling behaviors in speculative FSM parallelization.
• To facilitate the use of the proposed models, it designs S3, a scalability-sensitive speculative parallelization framework that automatically reasons about the scalability and guides the configuration of speculative parallelization at runtime.
• It evaluates S3 on both a many-core processor (Xeon Phi) and multi-socket multi-core architectures, and demonstrates large performance improvements and significant energy reduction.
MOTIVATION
In this section, we first illustrate the basic approach of speculative parallelization used by existing work for FSM computations, then point out the performance suboptimality of existing solutions due to their unawareness of scalability, and hence the necessity of scalability-sensitive speculative parallelization.
Speculative Parallelization. To address the tight dependence among state transitions in FSM computations, existing solutions often rely on speculative parallelization techniques, which are based on a predict-validate-reprocess strategy. Next, we describe the high-level ideas of speculative parallelization of FSM computations, which consists of three major phases.
According to the first phase, the basic speculative parallelization approach assumes that it can scale up to the total number of CPU cores $N_{core}$. While this might be true for small-scale multicore processors (e.g., quad/oct-core processors), it may not hold for larger-scale platforms with tens or hundreds of CPU cores.
Performance Suboptimality. Figure 2 shows the speedup curve for an FSM benchmark on a Xeon Phi machine with 256 cores. As the blue line shows, the actual speedup increases linearly at the beginning before reaching about 10 cores, which is confirmed by prior work [37, 38]. However, the increase becomes non-linear thereafter, and the speedup even starts dropping after about 30 cores. Finally, the speedup drops to merely 5.7X when all 256 cores are used.
On the other hand, unlike many parallel applications, the speedup curve of speculative FSM parallelization is difficult to model using the classic Amdahl's law and its simple extensions. As Figure 2 shows, the speedup curve predicted by Amdahl's law with 3% of serial execution (green line) follows an obviously different trend compared with the actual speedup curve. The principal reason for this discrepancy is the inherent complexity of speculative FSM parallelization. First, during a speculative FSM execution, not only the parallel part (i.e., Phase 2) but also the sequential execution part (i.e., Phase 3) depends on the number of cores. In comparison, the serial part in Amdahl's law is, by default, assumed to be a constant. Furthermore, the relation between the sequential execution performance and the number of cores in the parallel part is non-linear, due to the variation of state convergence. These two complexities make existing scalability models fail to faithfully capture the scalability of this advanced parallelization technique.
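For reference, the Amdahl's-law baseline discussed above is easy to reproduce. The sketch below uses the standard formula S(n) = 1 / (s + (1 - s)/n) with the 3% serial fraction mentioned in the text; the function name and the list of core counts are illustrative choices.

```python
def amdahl_speedup(n, serial_fraction):
    """Classic Amdahl's law: the serial fraction is a constant, independent of n."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


# Predicted speedup with a 3% serial part for a few core counts.
for n in (1, 10, 30, 256):
    print(n, round(amdahl_speedup(n, 0.03), 1))
```

Under this formula the predicted speedup only ever grows with n and is bounded by 1/0.03 (about 33X), so it cannot reproduce a curve that peaks and then declines as more cores are added.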
To address the challenges and seek the optimal performance, we propose S3, a speculative parallelization framework that can automatically characterize the scalability of a given FSM and calculate the best configuration to maximize the performance. We next give an overview of S3 before presenting its details.
OVERVIEW
At a high level, S3 includes three layers. From bottom to top, they are characterization, modeling, and guidance, as shown in Figure 3. We next briefly present each of the three layers in order.
Characterization. As FSMs and the underlying architecture affect the scalability of speculative parallelization in entirely different ways, it is natural to separate their characterizations. We refer to the characterization results on the application side and the architecture side, symbolically, as FSM.properties and Arch.properties, respectively.
① On one side, FSMs exhibit dramatically different behaviors when executed speculatively. Some FSMs are easier to speculate while others may be much more challenging (such as div in [38]), depending on their transition structures and input characteristics. Furthermore, when a misspeculation happens, the penalty varies not only across FSMs, but also across speculatively processed chunks, depending on how fast the predicted (wrong) state $s_{pred}$ converges with the correct starting state $s_{true}$. The faster they converge, the less penalty the misspeculation incurs.
Prior work [37, 38] introduces metrics like state feasibility and expected convergence length to tune the design of the predictor. They are inadequate to accurately model the details of the non-deterministic behaviors of speculation, such as the distribution of misspeculation penalty across different chunks.
In comparison, S3 maintains a short list of raw convergence length samples (typically < 150) for each state pair, rather than simply averaging them. The list of samples encapsulates not only the average convergence length, but also its distribution, which is the key to accurately modeling the variation of misspeculation penalty (Section 4.2). To reduce the overhead of characterization, unlike existing methods, which require profiling the convergence property for every pair of states, S3 only profiles the state pairs that are more likely to appear in actual speculative executions. We will elaborate on FSM characterization and its uses in Section 4.
② On the other side, the characteristics of the architecture also directly affect the scalability of speculative FSM parallelization in various ways, depending on the specific design of the architecture. In this work, we focus on two main factors that play critical roles in the scalability analysis: resource contention and relative execution speed. Resource contention happens when different threads share the same computing resources, such as last level cache (LLC) and memory bandwidth. Depending on the design of the architecture and the number of concurrent threads, such contention could vary significantly. Note that resource contention only happens in the parallel phase of speculative FSM execution. When moving into the reprocessing phase, only a single thread is left due to dependencies, and the contention hence reduces to zero. However, due to the tracking of state convergence, the execution speed in the reprocessing phase might be slightly slower than in the parallel phase. This difference directly influences the scalability, but may vary across architectures. Therefore, it is necessary to capture the relative execution speed between the two phases in order to precisely quantify the scalability. We will present architecture characterizations in Section 5.1.
Scalability Modeling.
In this work, the scalability is defined as the capability of speculative parallelization to scale up to a larger number of computing units (i.e., CPU cores). In particular, given an FSM with a fixed-size input, the scalability concerns how the execution time varies with the number of CPU cores used.³
③ With the characterization results, S3 can automatically reason about the scalability using a series of scalability models that are derived based on the design of speculative FSM parallelization.
Specifically, the speedup $S$ is defined as a non-linear function of the number of cores employed $n$, along with other parameters, such as the properties of FSM computations FSM.properties and the architecture properties Arch.properties.⁴ Depending on whether Arch.properties is considered, the models fall into two types:
architecture-independent scalability model:
$$S = f(n, \text{FSM.properties}) \tag{1}$$
architecture-aware scalability model:
$$S = f(n, \text{FSM.properties}, \text{Arch.properties}) \tag{2}$$
where $n$ is the number of cores used in speculative execution, FSM.properties represents convergence properties of the given FSM, and Arch.properties contains architecture characteristics such as resource contention among threads and relative execution speed of different FSM operations. The architecture-independent models can be used to compare the scalability of different FSMs and guide the design of speculative parallelization; the architecture-aware models provide more accurate scalability analysis results that are customized for a specific architecture.
At a high level, the entire speculative FSM execution time $T_{spec}$ is broken down into two parts: the parallel processing time $T_{para}$ and the sequential reprocessing time $T_{repr}$. Let $T_{seq}$ be the sequential execution time; then the speedup of speculative parallelization can be defined as follows:
$$S = \frac{T_{seq}}{T_{spec}} = \frac{T_{seq}}{T_{para} + T_{repr}} \tag{3}$$
Unlike the classic Amdahl's law, which assumes a constant ratio for the sequential part, in Equation 3 both the parallel part $T_{para}$ and the sequential part $T_{repr}$ primarily depend on the number of cores $n$. Moreover, the relation between $T_{repr}$ and $n$ follows a non-linear pattern, making standard scalability models fail to faithfully capture its scalability. We address the challenges with a novel sample-based conditional regression (SCR) technique. Different from traditional regression models, SCR conditionally accepts convergence length samples based on the parameters of speculative parallelization. With such fine-grained customization, SCR can precisely model the above non-linear relation.
We will describe the basic scalability analysis in Section 4.2 and the integration of architecture properties in Section 5.2.
Scalability-Sensitive Speculative Parallelization. The goal of S3 is to maximize the efficiency of speculative parallelization by reasoning about its scalability and discovering the optimal number of cores to use (i.e., $n^*$). ④ With the scalability models, this problem can be formalized as the following discrete optimization problem:
$$n^* = \arg\max_{1 \le n \le N_{core}} S(n) \tag{4}$$
where the number of cores used by speculative FSM parallelization is bounded by the total number of cores on the machine. When $n = 1$, the FSM execution becomes sequential. To solve the optimization problem, depending on the models, S3 either simply enumerates each configuration and chooses the one with the highest speedup, or directly computes the optimal configuration from a closed-form expression. By setting $n^*$ in the speculative parallelization, S3 maximizes the speedup. We refer to this scheme as scalability-sensitive speculative parallelization.
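The enumeration strategy mentioned above amounts to a one-line search once a speedup model is available. The following Python sketch assumes the model is given as a callable; `toy_speedup` is a hypothetical stand-in (parallel time I/n plus a cost that grows with n) and is not one of the models developed in this paper.

```python
def optimal_cores(speedup, n_core):
    """Return the n in [1, n_core] with the highest predicted speedup."""
    return max(range(1, n_core + 1), key=speedup)


def toy_speedup(n, input_size=1_000_000, penalty=100):
    """Hypothetical stand-in model: parallel time I/n plus a cost growing with n."""
    return input_size / (input_size / n + penalty * (n - 1))


best_small = optimal_cores(toy_speedup, 8)    # small machine: every core helps
best_large = optimal_cores(toy_speedup, 256)  # large machine: optimum is below 256
```

With this stand-in model the search selects all 8 cores on the small machine but stops at 100 of the 256 cores on the large one, which is the kind of gap the optimization is meant to expose.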
In sum, the three layers closely depend on each other from top to bottom. They together enable a new speculative parallelization scheme for FSM computations on larger-scale parallel platforms. In the following, we elaborate architecture-independent scalability analysis and architecture-aware scalability analysis, respectively.
ARCHITECTURE-INDEPENDENT SCALABILITY ANALYSIS
This section presents the scalability analysis that does not assume any particular architecture, but is based solely on the properties of the FSM computations and the design of speculative parallelization.
FSM Characterization
As a basic computation model, FSMs feature many properties. In this work, we focus on a type of characteristic that has a significant influence on the penalty of misspeculation: the convergence length.
As mentioned in Section 2, when a misspeculation happens, the speculative parallelization framework may not have to reprocess the whole chunk, thanks to the fact that the predicted (wrong) state $s_{pred}$ may converge with the actual state $s_{true}$. The sooner they converge, the less penalty the misspeculation incurs.
To effectively model such behaviors, we leverage the concept of state convergence length [38], defined as follows. Given an FSM, its convergence length properties can be profiled either offline using a set of training inputs [38], or online using the testing inputs [37]. Note that prior work requires profiling the average state convergence length for every pair of states, in order to guide the design of the starting state predictor [37, 38]. In comparison, S3 maintains a pool of raw convergence length samples for state pairs that are more likely to appear in actual speculative executions, in order to facilitate a high-precision scalability analysis, as explained in the next subsection.
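As an illustration of how one convergence length sample could be collected, the Python sketch below assumes that the convergence length of a state pair on a given input is the number of symbols consumed before the two runs reach the same state. This reading, the function name, and the small reset-style FSM are assumptions made for the example, not the formal definition from [38].

```python
def convergence_length(transitions, s_i, s_j, symbols):
    """Symbols consumed until runs from s_i and s_j meet; None if they never do."""
    if s_i == s_j:
        return 0
    for consumed, c in enumerate(symbols, start=1):
        s_i = transitions[(s_i, c)]
        s_j = transitions[(s_j, c)]
        if s_i == s_j:
            return consumed
    return None


# Hypothetical FSM: 'x' advances a counter modulo 3, 'r' resets it to 0.
fsm = {(s, "x"): (s + 1) % 3 for s in range(3)}
fsm.update({(s, "r"): 0 for s in range(3)})

sample = convergence_length(fsm, 0, 1, "xxrx")  # the two runs meet at the reset
```

Repeating this over several pieces of training input for the same state pair yields the kind of raw sample list that S3 keeps instead of a single average.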
Scalability Analysis
The goal of scalability analysis is to examine how the speedup $S$ varies as the number of cores used $n$ changes. Based on the definition of $S$ in Equation 3, this requires modeling the ratio between the sequential execution time and its corresponding speculative execution time. In architecture-independent scalability analysis, we use the number of state transitions to quantify the relative execution time, instead of the concrete execution time, which may vary across different architectures. For example, given an input of length $I$, when processed sequentially, the execution time is $T_{seq} = I$.
For speculative execution, the total time $T_{spec}$ mainly consists of the parallel processing time $T_{para}$ and the sequential reprocessing time $T_{repr}$ (Phases 2 and 3 in Section 2).⁵
In the parallel processing phase, each thread first predicts the starting state, then processes its corresponding chunk with the predicted starting state. In general, there is a tradeoff between the prediction accuracy and the prediction cost. However, after the design of the predictor is fixed, the prediction cost becomes a constant (more details in [37, 38]). We use $C_{pred}$ to represent the prediction cost and $T_{proc}$ to represent the processing time of chunks. Since the input is evenly partitioned based on the number of cores $n$, we have $T_{proc} = I/n$. Hence, the parallel phase execution time $T_{para}$ is
$$T_{para} = C_{pred} + T_{proc} = C_{pred} + \frac{I}{n} \tag{6}$$
Next, we analyze the execution time of the reprocessing phase, which is more challenging due to two complexities inherent in the design of speculative FSM parallelization.
Complexity I: Non-deterministic behaviors of speculation. By its nature, speculation is non-deterministic. If a speculation succeeds, there is no cost of reprocessing; otherwise, the speculation framework has to initiate reprocessing to correct the mistakenly processed parts. As only the latter case degrades the scalability, an effective scalability model needs to distinguish the two cases. However, since the speculation happens during the actual runs, such a distinction is as hard as the speculation itself.
Complexity II: Variation of reprocessing costs. To reduce the penalty of misspeculation, existing methods [37, 38] leverage the convergence property of FSMs (see Section 4.1) by tracking whether the misspeculated state $s_{pred}$ converges with the actual state $s_{true}$. Once they converge, the reprocessing can safely stop. On one hand, this design helps reduce the reprocessing costs of misspeculation. On the other hand, it also complicates the modeling of speculative parallelization, as the reprocessing costs for different chunks may vary significantly, depending on their convergence lengths.
In the following, we present an analytical model that addresses both complexities together, referred to as the sample-based conditional regression model. Before introducing the model, we first formalize the total execution time of reprocessing.
In the reprocessing phase, due to dependencies, all chunks, except the first one, have to be validated and reprocessed sequentially. Therefore, the total reprocessing time $T_{repr}$ is the sum of the reprocessing times $T^{i}_{repr}$ of the individual chunks, where $2 \le i \le n$:
$$T_{repr} = \sum_{i=2}^{n} T^{i}_{repr} \tag{7}$$
Two points are worth mentioning here. First, to address the above two complexities together, we unify the representations by referring to a successful speculation as a "misspeculation" with a reprocessing length of zero, that is,
$$T^{i}_{repr} = 0 \quad \text{if the speculation for chunk } i \text{ succeeds} \tag{8}$$
Second, the reprocessing of a chunk cannot go beyond the size of the chunk, hence the following constraint holds:
$$0 \le T^{i}_{repr} \le \frac{I}{n} \tag{9}$$
Sample-based Conditional Regression. A key challenge in the scalability analysis of speculative FSM parallelization is precisely estimating Equation 7 in practice. We address this challenge with sample-based conditional regression (SCR). Different from a classic regression analysis, SCR considers samples conditionally -only if they satisfy the given constraint.
In the context of reprocessing time modeling, a sample in SCR is the convergence length for a pair of states $L_k(s_i, s_j)$ on a piece of training input $k$. The constraint for the samples is the chunk size $I/n$. For a given state pair $(s_i, s_j)$, SCR maintains a short list of $K$ samples.⁶ However, during the regression analysis, SCR only chooses samples with convergence length shorter than the constraint (i.e., $L_k(s_i, s_j) < I/n$). Note that the constraint can vary, depending on the input size $I$ and the number of cores $n$. This flexibility allows the customization of SCR based on the needs of scalability analysis. On the other hand, with a pool of samples, the differences among samples resemble the variation of reprocessing costs among different chunks.
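The conditional acceptance step can be illustrated with a short Python sketch. `accepted_samples` follows the rule stated above (keep a sample only if it is shorter than the chunk size I/n). `expected_reprocessing` goes one step further and is an assumption of this example only: it charges every rejected sample one full chunk, in line with the constraint that reprocessing cannot exceed the chunk size, and should not be read as the exact SCR estimator. The sample values are hypothetical.

```python
def accepted_samples(samples, input_size, n):
    """Keep only convergence-length samples shorter than the chunk size I/n."""
    chunk = input_size / n
    return [length for length in samples if length < chunk]


def expected_reprocessing(samples, input_size, n):
    """ASSUMPTION: a rejected sample costs one full chunk of reprocessing."""
    chunk = input_size / n
    accepted = accepted_samples(samples, input_size, n)
    rejected = len(samples) - len(accepted)
    return (sum(accepted) + rejected * chunk) / len(samples)


samples = [20, 50, 400, 3000]  # hypothetical samples for one state pair
input_size = 100_000

for n in (10, 100, 1000):
    print(n, accepted_samples(samples, input_size, n),
          expected_reprocessing(samples, input_size, n))
```

As n grows the chunk size shrinks, so fewer samples pass the condition. This is how the same pool of samples yields a different cost estimate for each configuration.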
Note that convergence length profiling for different state pairs is already required by existing speculative FSM parallelization [37, 38] , in order to improve the starting state prediction. In these cases, SCR does not require any extra profiling.
However, maintaining a list of samples for every pair of states could be expensive in terms of both space cost and the cost of regression analysis, depending on the number of high-frequency state pairs, their convergence lengths and the size of a sample set. To reduce the total amount of samples, SCR maintains samples only for state pairs that are more likely to appear in actual runs. To find out these state pairs, SCR performs a lightweight state pair frequency profiling offline, by invoking a speculative execution with a large number of parallel threads.⁷ Let the set of high-frequency state pairs be $S^2_f$, then
where $H_f$ is a predefined frequency threshold.
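Selecting the high-frequency state pairs can be sketched as follows. The example assumes that frequency means the relative frequency of a (predicted, actual) state pair among the pairs observed in the profiling run and that a pair is kept when its frequency strictly exceeds the threshold. Both choices, the function name, and the observations are illustrative assumptions.

```python
from collections import Counter


def high_frequency_pairs(observed_pairs, threshold):
    """Return the state pairs whose relative frequency exceeds the threshold."""
    counts = Counter(observed_pairs)
    total = sum(counts.values())
    return {pair for pair, count in counts.items() if count / total > threshold}


# Hypothetical (s_pred, s_true) pairs seen at chunk boundaries during profiling.
observed = [(0, 1)] * 6 + [(0, 2)] * 3 + [(1, 2)]

frequent = high_frequency_pairs(observed, threshold=0.2)
everything = high_frequency_pairs(observed, threshold=0.0)
```

Raising the threshold shrinks the set of pairs for which samples are stored, while a threshold of zero keeps every pair that was observed at least once.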
With the samples of high-frequency state pairs, SCR computes the total reprocessing cost estimate $T_{repr}$ for a configuration $n$ with the following equation:
Note that each sample $L_k(s, s')$ is considered only if it satisfies the constraint, that is, $L_k(s, s') < I/n$. Statistically speaking, if $H_f$ is set to 0, then we have
Model M1. Putting it all together, we have the estimated speculative execution time
As the sequential execution time $T_{seq} = I$, we have the first scalability model M1:
Based on Equation 14, for a given FSM and an input size $I$, Model M1 can compute the speedup of speculative FSM parallelization for any configuration $n$, with the help of SCR (Equation 11), and hence find the optimal configuration $n^*$, such that
$$n^* = \arg\max_{1 \le n \le N_{core}} S(n) \tag{15}$$
⁶ $K$ is tunable to balance accuracy and cost. In our evaluation, $K$ is set to 120.
⁷ In our experiments, this number is set to 1000.
Depending on the number of state pairs in $S^2_f$ and the number of samples for each state pair $K$, the calculation of Equation 11 may introduce some runtime cost. One way to reduce the cost is by tuning the state pair frequency threshold $H_f$ and the number of samples $K$, which in turn may compromise the accuracy. Next, we will discuss another way to balance the accuracy and modeling cost, by simplifying the SCR model. The simplification will lead to a closed-form representation of the optimal configuration.
Model M2. Considering the SCR model in Equation 11, there are two scenarios in which the model can be simplified by eliminating the constraint. First, when the input size is large enough or the convergence lengths between state pairs are relatively short, such that $L_k(s, s')$ is often smaller than $I/n$, then we can assume that
where $\bar{L}$ is the average convergence length among all samples and $P_s$ is the probability of successful speculation, that is, $P(s = s')$. Depending on the ratio between the input size and the number of cores, the new model $T_{repr}(n)$ switches between two equations. By substituting the corresponding term in Equation 14 with the new model, we get the second scalability model M2:
One advantage of Model M2 is that the optimal number of cores $n^*$ can be represented in a closed-form expression, and hence calculated directly without going through the pool of samples (required by Model M1). Considering Equation 16 and Equation 17 together, we can solve the speedup maximization problem in Equation 3 and get the following optimal configuration:
where $\bar{L}$ and $P_s$ capture the convergence properties of the FSM and the speculation accuracy, respectively. Equation 18 quantitatively reflects two basic intuitions behind scalability analysis. First, as the convergence length $\bar{L}$ increases, the optimal configuration $n^*$ should be reduced. Second, when the speculation accuracy $P_s$ increases, the speedup tends to be better when choosing to use more available cores.
Discussion. Comparing models M1 and M2, there is a tradeoff between the accuracy and the modeling cost. On one hand, with SCR, Model M1 captures more details of misspeculation cost variation, and hence tends to be more accurate in most cases. Meanwhile, M1 incurs more overhead as it needs to go through the pool of samples to calculate the speedup for each configuration. On the other hand, though Model M2 directly computes the optimal configuration, it may lose some accuracy, especially when the average convergence length $\bar{L}$ is close to the chunk size $I/n$.
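The two intuitions behind the closed-form result can be checked numerically under an explicit assumption, since the exact forms of Equations 16 to 18 are not reproduced in this text. The Python sketch below assumes that each of the n - 1 speculated chunks fails with probability 1 - P_s and then costs min(L̄, I/n) transitions. Under that assumption, and while L̄ < I/n, setting the derivative of the execution time to zero gives n = sqrt(I / ((1 - P_s) L̄)). All function names and parameter values are hypothetical.

```python
import math


def m2_speedup(n, input_size, c_pred, avg_len, p_success):
    """ASSUMED simplified model: (n - 1) chunks, each failing with prob. 1 - P_s."""
    chunk = input_size / n
    t_repr = (n - 1) * (1.0 - p_success) * min(avg_len, chunk)
    return input_size / (c_pred + chunk + t_repr)


def m2_optimal_cores(input_size, c_pred, avg_len, p_success, n_core):
    """Closed-form optimum of the assumed model, clamped to [1, n_core]."""
    if p_success >= 1.0 or avg_len <= 0:
        return n_core  # no misspeculation cost: more cores always help
    n_real = math.sqrt(input_size / ((1.0 - p_success) * avg_len))
    candidates = {max(1, min(n_core, k))
                  for k in (math.floor(n_real), math.ceil(n_real))}
    return max(candidates,
               key=lambda n: m2_speedup(n, input_size, c_pred, avg_len, p_success))


base = m2_optimal_cores(1_000_000, 50, 200, 0.5, 256)
longer_convergence = m2_optimal_cores(1_000_000, 50, 800, 0.5, 256)
better_prediction = m2_optimal_cores(1_000_000, 50, 200, 0.875, 256)
```

In this toy setting a longer average convergence length lowers the optimal core count, while a more accurate predictor raises it, matching the two intuitions stated above. The closed form also agrees with a brute-force search over all 256 configurations of the same assumed model.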
Both models M1 and M2 are solely based on FSMs' properties, and can be used for comparing the scalability of different FSMs when being executed speculatively. In practice, the actual scalability also depends on the characteristics of underlying architecture. Next, we will discuss how to extend the FSM properties-based scalability models to architecture-aware scalability models.
TOWARDS ARCHITECTURE-AWARE SCALABILITY ANALYSIS
On different architectures, the scalability of a type of computations may vary significantly, not only depending on the characteristics of the architecture, but also depending on their interaction with the computations. To enable accurate and practical scalability analysis for a given computing platform, this section presents architecture characterizations and discusses how to integrate them into the scalability models introduced in Section 4.
Architecture Effects
Considering the complexity of modern architectures, the interplay between an architecture and an application can be quite involved.
Here, we focus on the end-to-end architecture effects that are closely relevant to the performance of speculative FSM parallelization. In other words, our architecture characterizations are customized for speculative FSM parallelization.
Since the execution time of speculative FSM execution mainly consists of two phases, the parallel processing phase and the sequential reprocessing phase (Section 4.2), we discuss the two phases separately. Specifically, for each phase, we identify the major factor(s) that directly influence the performance.
Resource Contention in Parallel Phase. During the parallel phase, a group of $n$ threads is created, each occupying a separate (logical) core. Based on their predicted starting states, these threads proceed with their own input chunks individually, and do not need to communicate with each other. Thus, they do not suffer from any lock contention that is often caused by concurrent access to shared data structures.⁸ However, different cores physically share hardware resources, such as last level cache (LLC) and memory bandwidth, and logical cores in a hyper-threaded core share even more resources. The sharing of resources leads to contention that directly influences the performance of this phase.
As the number of cores used increases, the resource contention tends to increase as well. However, the contention may not increase linearly or even monotonically, depending on the design of the architecture as well as the mapping between threads and logical cores. Without loss of generality, this work assumes that S3 uses the default mapping that is chosen by the operating system (defined in /proc/cpuinfo).
To quantitatively measure the resource contention, we introduce the metric contention factor, denoted as $\alpha(n)$:
$$\alpha(n) = \frac{T(n)}{T(1)} \tag{19}$$
where $T(i)$ is the execution time of processing $i$ input chunks of the same length with $i$ cores. The contention factor $\alpha(n)$ captures the degree of resource contention when executing with $n$ parallel threads, compared with a single-thread execution. For commonly used architectures, it is expected that $\alpha(n) > 1$. For a given architecture, the contention factor $\alpha(n)$ can be easily measured by running a micro benchmark $N_{core}$ times.
Relative Execution Speed of Reprocessing. After entering the reprocessing phase, only one thread is left, responsible for validating the correctness of each speculation and correcting the mistakenly processed parts caused by misspeculation. This implies there is no resource contention in this phase (i.e., $\alpha(1) = 1$). However, due to the tracking of state convergence (between $s_{pred}$ and $s_{true}$), the execution speed (i.e., processing time per symbol) in the reprocessing phase might be relatively slower than regular state transitions. The actual difference depends on the architecture and, in turn, affects the scalability: the slower the reprocessing is, the less scalability the speculative parallelization can achieve.
To capture the relative execution speed, we introduce the relative speed factor, denoted as $\gamma$:
$$\gamma = \frac{T^{0}_{repr}}{T^{0}_{seq}} \tag{20}$$
where $T^{0}_{repr}$ and $T^{0}_{seq}$ are the processing times of a single symbol during reprocessing and during a sequential execution, respectively. It is also expected that $\gamma > 1$. Similar to the contention factor, $\gamma$ can be measured with a micro FSM benchmark, which needs to run only twice.
For a given architecture, $\alpha(n)$ and $\gamma$ only need to be profiled once. Next, we discuss how to integrate these two architecture factors into the scalability models presented in Section 4. The integration will lead to a pair of architecture-aware scalability models that are more accurate and practical than their counterparts.
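Given timing measurements from such micro benchmarks, the two factors follow directly from their definitions. The Python sketch below assumes the timings have already been collected. The numbers are hypothetical, not measurements from this work, and the function names are illustrative.

```python
def contention_factors(chunk_times):
    """alpha(n) = T(n) / T(1), where T(i) is the time for i chunks on i cores."""
    base = chunk_times[1]
    return {n: t / base for n, t in sorted(chunk_times.items())}


def relative_speed_factor(t_repr_per_symbol, t_seq_per_symbol):
    """gamma = per-symbol reprocessing time / per-symbol sequential time."""
    return t_repr_per_symbol / t_seq_per_symbol


# Hypothetical measurements in seconds (not taken from any real machine).
measured = {1: 2.0, 64: 2.5, 256: 4.0}
alpha = contention_factors(measured)
gamma = relative_speed_factor(1.2e-9, 1.0e-9)
```

Without contention, processing i equal chunks on i cores would take the same time for every i, so any value of `alpha[n]` above 1 reflects slowdown from shared resources.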
Integration of Architecture Factors
We first consider the resource contention factor α (n) in the parallel phase, then the relative speed factor γ in the reprocessing phase, and finally put them together.
The parallel phase execution time model in Equation 6 assumes that the parallel processing time T_proc equals the sequential processing time (modeled as I) divided by the number of cores n. When considering the resource contention factor (Equation 19), that is, α(n) = T_proc / (T_seq / n), we can easily infer

$$T_{proc} = \alpha(n) \cdot \frac{I}{n} \qquad (21)$$
Equation 21 implies that the higher the resource contention is, the longer the parallel processing time would become.
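A short worked example may help here. Rearranging α(n) = T_proc / (T_seq / n) gives T_proc = α(n) · T_seq / n. The numbers below are hypothetical and chosen only for easy arithmetic; they are not measurements from any of the evaluated machines.

```latex
% Hypothetical values: T_seq = 6.4 s, n = 64, alpha(64) = 1.5
\begin{align*}
T_{proc}^{ideal} &= \frac{T_{seq}}{n} = \frac{6.4\,\mathrm{s}}{64} = 0.1\,\mathrm{s} \\
T_{proc}         &= \alpha(n)\cdot\frac{T_{seq}}{n} = 1.5 \times 0.1\,\mathrm{s} = 0.15\,\mathrm{s}
\end{align*}
```

In this example, contention alone makes the parallel phase 50% slower than a contention-free model would predict.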
Similarly, we can extend the reprocessing phase model in Equation 7 by integrating the relative speed factor γ.
Putting it all together, we have two enhanced scalability models, M1+ and M2+, corresponding to models M1 and M2, respectively.
Model M1+. Based on Model M1 in Equation 14, we have the following extended Model M1+ with architecture factors.
Model M2+. Similarly, based on Equations 16 and 17, we extend Model M2 to Model M2+ as follows.
where T_repr(n) is defined as in Equation 16.
Augmented with the architecture factors, models M1+ and M2+ are expected to provide more accurate scalability analysis results that are customized to a specific architecture.
IMPLEMENTATION
We implemented S3 on top of the OptSpec library [37, 38], which is implemented in C and uses Pthreads for multithreading. At a high level, there are three major components: (i) an FSM property collector for profiling state convergence properties. The collector can be tuned either online using testing inputs or offline using training inputs. The cost of online profiling has been optimized with techniques from prior work [37] (typically less than 5%); (ii) an offline architecture property collector, which runs a small set of micro FSM benchmarks on the target machine to measure the resource contention factor α(n) and the relative speed factor γ; and (iii) a runtime controller that implements the scalability models. Based on the collected FSM and architectural properties, the controller calculates the optimal configuration n* and feeds it into the speculative parallelization setting at runtime.
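The role of the runtime controller can be sketched in a few lines of C. This is a simplified illustration rather than the S3 code: it assumes that the predicted total time with n cores is the contention-adjusted parallel time α(n) · T_seq / n plus a predicted reprocessing time, and it takes the latter as a caller-supplied array (`t_repr`, assumed to be already scaled by γ) instead of evaluating a specific model. The function name `pick_core_count` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the core count n in [1, n_core] with the smallest predicted time
 *   alpha[n - 1] * t_seq / n + t_repr[n - 1].
 * alpha[n - 1] is the contention factor for n threads and t_repr[n - 1] is
 * the predicted reprocessing time with n cores (both assumed given). */
size_t pick_core_count(double t_seq, const double *alpha,
                       const double *t_repr, size_t n_core)
{
    assert(n_core > 0);
    size_t best_n = 1;
    double best_t = alpha[0] * t_seq + t_repr[0];
    for (size_t n = 2; n <= n_core; n++) {
        double t = alpha[n - 1] * t_seq / (double)n + t_repr[n - 1];
        if (t < best_t) {       /* ties keep the smaller core count */
            best_t = t;
            best_n = n;
        }
    }
    return best_n;
}
```

Keeping the smaller core count on ties is a design choice made here so that the sketch never uses more cores than the prediction justifies.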
EVALUATION
In this section, we evaluate S3 on large-scale shared-memory architectures, including a standalone Xeon Phi processor with 256 logical cores. The evaluation mainly focuses on two aspects: the accuracy of the scalability analysis, and the performance and energy benefits of using S3. We also discuss the scalability of some specific FSM computations based on the experimental results.
Methodology
We compare S3 with two methods. One is the default setting of OptSpec [37, 38], which uses all available cores on the machine. The other is exhaustive search, which provides the ground truth of the optimal configuration. Specifically, given an FSM, the input size, and an architecture with N_core cores, exhaustive search executes the FSM with its inputs on the architecture using 1 to N_core cores to find the optimal number of cores. Obviously, it is unreasonable to use exhaustive search in real situations, as trying one configuration is already at least as costly as executing the best configuration, not to mention enumerating all configurations.
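For illustration, exhaustive search can be sketched as the loop below. The callback type `run_fn`, the function names, and the table-backed stand-in `table_runtime` are hypothetical; in a real experiment the callback would execute the FSM on its input with the given number of cores and return the measured time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: runs the FSM with n_cores cores, returns the time. */
typedef double (*run_fn)(size_t n_cores, void *ctx);

/* Tries every core count from 1 to n_core and returns the fastest one.
 * If total_cost is non-NULL, it receives the summed time of all trials. */
size_t exhaustive_search(run_fn run, void *ctx, size_t n_core,
                         double *total_cost)
{
    assert(run != NULL && n_core > 0);
    size_t best_n = 1;
    double best_t = 0.0, sum = 0.0;
    for (size_t n = 1; n <= n_core; n++) {
        double t = run(n, ctx);
        sum += t;
        if (n == 1 || t < best_t) {
            best_t = t;
            best_n = n;
        }
    }
    if (total_cost != NULL)
        *total_cost = sum;
    return best_n;
}

/* Stand-in for a real timed run: looks the runtime up in a table. */
double table_runtime(size_t n_cores, void *ctx)
{
    const double *table = ctx;
    return table[n_cores - 1];
}
```

With the hypothetical runtimes {8, 5, 4, 6} for one to four cores, the search selects three cores but spends 23 time units doing so, compared with 4 for the best configuration alone, which is why it serves only as a ground truth.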
We run our experiments on three different architectures, which are summarized in Table 1. Due to space limits, we mainly focus on the results on the Xeon Phi architecture. Xeon Phi runs Linux 3.10.0 with GCC 4.8.5, while the other two run Linux 3.10.0 with GCC 4.47. All programs are compiled with the "-O3" optimization flag. The timing results reported are the average of 10 runs on 10 inputs, with all runtime costs included. We use PAPI [3] to access hardware performance counters for measuring energy consumption.
The benchmarks are collected from real-world FSM applications, primarily from Snort [32], one of the most widely used open-source Network Intrusion Detection Systems (NIDSs). It has a rich body of signatures/rules, most of which are specified by Perl-compatible regular expressions (PCREs). We converted the PCREs to FSMs using standard regular-expression-to-FSM conversion algorithms [1]. The inputs to the FSMs are network traffic traces collected from a Linux server and a laptop via tcpdump, with a total size of 18GB. Table 2 summarizes the 14 benchmarks used in our evaluation, including the number of states and the average convergence length collected from high-frequency state pairs, each with 120 samples.
Model Accuracy
Table 3 reports the optimal configuration n* found by exhaustive search and by the four models of S3 on the Xeon Phi and Haswell architectures (the results on Ivy Bridge follow similar patterns). "Exs" shows the actual optimal number of cores obtained by enumerating all configurations (i.e., the "ground truth"). Note that the predicted optimal number of cores is bounded by the total number of cores on the tested platforms. Overall, the architecture-aware models (M1+ and M2+) are more accurate than the architecture-independent models (M1 and M2), especially for benchmarks with better scalability, thanks to their consideration of the architecture factors α(n) and γ. The differences between M1+ and M2+ are not significant for most benchmarks, similar to those between M1 and M2.
The largest difference occurs on benchmark rtf, where M2 turns out to be overly optimistic (80 vs. 42). In comparison, the result of M1 is very close to the actual optimum (43 vs. 42). Also note that the results of M2 are closer to the ground truth than those of M1 in general. On one hand, due to the simplification of the SCR model, M2 predicts the reprocessing length less accurately than M1 (more pessimistically in most cases). On the other hand, both M1 and M2 miss the architecture factors as mentioned above, and tend to be more optimistic. Because these two effects happen to balance each other in M2, the results of M2 turn out to be closer to the real cases than those of M1.
To further examine the overall accuracy of the scalability analysis, we collected the speedup curves for each benchmark on each architecture. Due to space limits, we only report some representative results in Figure 5. In general, the speedup curves clearly show the effectiveness of the two architecture-aware models (M1+ and M2+), whose speedup curves align closely with the actual speedup curve for most benchmarks. Between M1+ and M2+, M2+ performs less reliably than M1+, especially on benchmarks openview and buffer, due to its simplification of the SCR model (Section 4). Model M1+ shows a slight discrepancy on benchmarks buffer, openview, and mutiny. This is mainly caused by the characteristic differences between the samples and the testing inputs.
Performance Improvement
We next present the performance benefits of S3, compared with the default setting of speculative FSM parallelization [37, 38]. Figure 6 shows the speedup (the baseline is sequential FSM execution) of all five methods on 14 benchmarks and three architectures. "Exhaust" represents the ideal speedup that can be achieved by tuning the number of cores. The largest performance gains appear on Xeon Phi, owing to its larger number of available cores. On average, S3 boosts the speedup from 6.1X to 16.7X with Model M1+. For the architecture-independent models (M1 and M2), the improvements are slightly smaller, but still reach 15X. For benchmark buffer, the speedup improves by nearly a factor of five (3.2X vs. 15.6X). Results on Haswell and Ivy Bridge follow similar trends in general, but are less significant due to their limited number of available cores. Overall, the results imply the necessity of scalability-aware speculative FSM parallelization, especially considering future parallel platforms with even more processing units.
Energy Saving
Finally, we briefly discuss one side benefit of scalability-sensitive speculative FSM parallelization: energy saving. The energy saving primarily comes from the use of fewer processors. Table 4 reports the energy saving in percentage on the Xeon Phi and Haswell architectures. On Xeon Phi, the energy saving is more significant, up to 77%, because there is more room to reduce the number of cores used. However, on Haswell, we also observed cases with higher energy consumption. This is because when using all available cores (64 cores), although the power consumption is higher, the execution time becomes shorter (due to the smaller chunk size I/n), compared with the optimal number of cores predicted by the models (e.g., 22 cores for M2 and M2+). In other words, for speculative FSM parallelization, using more cores does not necessarily lead to higher energy consumption. We leave further investigation as future work.
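The trade-off can be seen from the basic relation that energy equals average power multiplied by execution time. The figures below are purely hypothetical and are not taken from Table 4; they only illustrate how a configuration with more cores can consume less energy when its shorter runtime outweighs its higher power draw.

```latex
% Hypothetical values: 64 cores draw 200 W for 1.0 s; 22 cores draw 120 W for 2.0 s
\begin{align*}
E &= P \cdot T \\
E_{64} &= 200\,\mathrm{W} \times 1.0\,\mathrm{s} = 200\,\mathrm{J} \\
E_{22} &= 120\,\mathrm{W} \times 2.0\,\mathrm{s} = 240\,\mathrm{J}
\end{align*}
```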
RELATED WORK
Scalability Analysis. Scalability analysis is critical for studying the performance of parallel applications, especially on large-scale parallel computers. It has been used to find the best algorithm-architecture combinations for problems with different constraints and to predict the performance of a parallel algorithm on a parallel platform with more processors [2, 10, 17]. The most widely used scalability models include Amdahl's law and its variations [2], which state the existence of a speedup bound caused by the serial part. Gustafson further models the case in which system resources are improved, known as Gustafson's law [8]. Though providing valuable insights, these scalability models cannot be applied to speculative FSM parallelization, whose behaviors are probabilistic with non-linear variation of the parallelization cost.
Speculative FSM Parallelization. Due to the dependencies in state transitions, existing ways to parallelize FSMs rely on either enumeration-based parallel prefix-sum and its variations [18, 22], or speculative parallelization [27, 37, 38]. The former can be treated as a special case of speculative parallelization, where the "speculation" enumerates all the states and hence always covers the correct one. From this perspective, the models proposed in this work can be reused for the former with simple extensions.
Some other FSM parallelization work focuses on a few specific FSM applications, such as browser front-ends [11] and JPEG decoders [13]. The basic ideas in this work were later formalized by Zhao and others [38] by introducing a concept called principled speculation. Other examples include hot state prediction for FSMs in intrusion detection [21] and speculative parsing [12]. Some studies also design and implement parallel Non-deterministic Finite Automata (NFAs) [39], which naturally expose parallelism and hence are relatively easier to parallelize, compared with their DFA counterparts.
Some prior work has also explored bit-parallel fine-grained parallelism for FSMs by converting FSM computations into a sequence of bit operations [19, 23], as well as the combination of fine-grained and coarse-grained speculative parallelism [27].
Speculative Parallelization of Other Applications. The idea of speculative parallelization has been studied for years. This body of work includes designing new programming language constructs [26] and developing parallelization frameworks [5, 7, 29, 30, 35]. Some of these studies have explored parallelism in irregular programs [9, 15, 20, 25, 31], which share some similarities with the parallelization of FSM computations, given that FSMs essentially run on an irregular data structure (a graph). Quinones and others [28] use pre-computation for speculative threading, which shares an idea with speculative FSM parallelization in that both exploit some constraints of the computation to facilitate speculative execution.
CONCLUSION
With a systematic scalability study, this work points out a principal fallacy in the existing design of speculative FSM parallelization when it is ported to a larger parallel platform. To address the issue, this work introduces a series of scalability analysis models, which are tailored to both the properties of FSM computations and the characteristics of the underlying architecture. To leverage the proposed models, this work develops an automatic speculative FSM parallelization framework, S3, which, for the first time, enables scalability-sensitive speculative parallelization for FSM computations. For a given FSM, its input size, and the architecture, S3 can automatically compute the optimal number of cores to use and guide the speculative parallelization towards the best performance. Evaluation on FSM benchmarks with a spectrum of scalabilities demonstrates the effectiveness of the new speculative parallelization scheme, showing up to 5X speedup compared with previous methods as well as up to 77% energy saving.
