Abstract. Many cryptographic protocols derive their security from the apparent computational intractability of the integer factorization problem. Currently, the best known integer-factoring algorithms run in subexponential time. Efficient parallel implementations of these algorithms constitute an important area of practical research. Most reported implementations use multi-core and/or distributed parallelization. In this paper, we use SIMD-based parallelization to speed up the sieving stage of integer-factoring algorithms. We experiment on the two fastest variants of factoring algorithms: the number-field sieve method and the multiple-polynomial quadratic sieve method. Using Intel's SSE2 and AVX intrinsics, we have been able to speed up index calculations in each core during sieving. This performance enhancement is attributed to a reduction in the packing and unpacking overheads associated with SIMD registers. We handle both line sieving and lattice sieving. We also propose improvements to make our implementations cache-friendly. We obtain speedup figures in the range 5-40%. To the best of our knowledge, no public discussions on SIMD parallelization in the context of integer-factoring algorithms are available in the literature.
Introduction
Let n be a large composite integer having the factorization
The integer factorization problem deals with the determination of all the prime divisors p 1 , p 2 , . . . , p k of n and their corresponding multiplicities v p 1 , v p 2 , . . . , v p k . The earlier algorithms proposed to solve this problem run in exponential time in the input size (the number of bits in n, that is, log 2 n or lg n). Modern factoring algorithms have subexponential running times of the form L(n, ω, c) = exp (c + o(1))(ln n) ω (ln ln n) 1−ω , where c and ω are positive real constants with 0 < ω < 1. The smaller the constants ω and c are, the faster these algorithms are. The fastest known general-purpose factoring algorithm is the (general) number-field sieve method (NFSM) for which ω = 1/3 and c = (64/9) 1/3 ≈ 1.923. The multiple-polynomial quadratic sieve method (MPQSM) is reported as the second fastest factoring algorithm. Its running time corresponds to ω = 1/2. Although the NFSM is theoretically faster than the MPQSM, the asymptotic behavior shows up only when the input integer n is moderately large (having at least a few hundred bits). Better than fully exponential-time algorithms as they are, these subexponential algorithms are still prohibitively slow, and it is infeasible to factor a 1024-bit composite integer in a reasonable amount of time (like a few years). Many cryptographic protocols (including RSA) derive their security from this apparent intractability of the integer-factoring problem.
The most time-consuming part in the NFSM and the MPQSM is the sieving stage in which a large number of candidates (suitably small positive or negative integers) are generated. Only those candidates which factor completely over a predetermined set of small primes yield useful relations. We combine these relations using linear-algebra tools in order to arrive at a congruence of the form x 2 ≡ y 2 (mod n). If x ≡ ±y (mod n), the nontrivial factor gcd(x − y, n) is discovered.
In this paper, we deal with efficient implementations of the sieving stage in the NFSM and the MPQSM. Sieving turns out to be massively parallelizable on multi-core and even distributed platforms. All recent implementations of sieving in the literature deal with issues of such large-scale parallelization. In this paper, we look at the problem from a different angle. We plan to exploit SIMD features commonly available in modern desktop and server machines. We investigate whether SIMD-based parallelization can lead to some speedup in the computation of each core. The sieving procedure involves two data-parallel operations: computation of indices and subtraction of log values. We theoretically and experimentally demonstrate that the index-calculation process can be effectively parallelized by SIMD intrinsics. This effectiveness banks on that we can avoid packing of data in SIMD registers in each iteration of the sieving loop. The subtraction operations, on the other hand, do not benefit from such parallelization efforts. First, we need packing and unpacking of SIMD registers in each iteration of the loop. Second, the individual data items that are packed in SIMD registers do not belong to contiguous locations in the memory.
The main technical contribution of this paper is the successful parallelization of index calculations in the sieving stage using Intel's SSE2 (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) features on a Sandy Bridge platform. We have been able to speed up the sieving stage of both the NFSM and the MPQSM by 5-40%. Intel's recently released AVX2 instruction set is expected to increase this speedup further (AVX uses 256-bit vectorization only on floating-point values, whereas AVX2 has this feature for integer operands too). We study two variants of sievingthe line sieve and the lattice sieve-in the NFSM implementation. We also propose a cache-friendly implementation strategy. To the best of our knowledge, no SIMD-based parallelization attempts on sieving algorithms for integer factorization are reported in the literature. The implementations described by [1] and [2] use the term SIMD but are akin to multi-core parallelization in a 16K MasPar SIMD machine with a 128 × 128 array of processing elements.
The rest of the paper is organized as follows. Section 2 introduces the background relevant for the remaining sections, namely, the sieving procedures in the MPQSM and in the NFSM, the lattice sieve, and Intel's SSE2 and AVX components. In Section 3, we detail our implementation strategies for the MPQSM and the NFSM sieves. Section 4 presents our experimental results achieved on a Sandy Bridge platform. We analyze the speedup figures, and demonstrate the effectiveness of our cache-friendly implementation. We conclude the paper in Section 5 after highlighting some possible future extensions of our work.
Background

A Summary of Known Integer-Factoring Algorithms
Most modern integer-factoring algorithms construct Fermat congruences of the form x 2 ≡ y 2 (mod n). If x ≡ ±y (mod n), then gcd(x −y, n) is a non-trivial factor of n. If these congruences are available for many random values of x and y, at least some of these pairs is/are expected to reveal non-trivial factors of n with high probability. In order to generate such congruences, one collects many relations, each of which supplies a linear equation modulo 2. Relation collection is efficiently carried out by sieving. After sufficiently many relations are collected, the linear equations are combined to find nonzero vectors in the nullspace of the system. Each such non-zero vector yields a Fermat congruence. The factoring algorithms differ from one another in the way the relations are generated, that is, in their sieving techniques. The linear-algebra phase turns out to be identical in all these algorithms.
J. D. Dixon [3] proposes the simplest variant of such a factoring method. Based on the work of Lehmer and Powers [4] , Morrison and Brillhart introduce another variant known as the CFRAC method [5] , where relations are obtained from the continued fraction expansion of √ n. In Pomerance's quadratic sieve method (QSM) [6] , the polynomial T (c) = J + 2Hc + c 2 (where H = √ n and J = H 2 − n) is evaluated for small values of c (in the range −M c M). If some T (c) value factors completely over the first t primes p 1 , p 2 , . . . , p t , we get a relation. In Dixon's method, the smoothness candidates are O(n), whereas in CFRAC and QSM, these are O( √ n), resulting in a larger proportion of smooth integers (than Dixon's method) in the pool of smoothness candidates. Moreover, QSM replaces trial divisions by sieving (subtractions after some preprocessing). This gives QSM a better running time than Dixon's method and CFRAC.
R. D. Silverman introduces a variant of quadratic sieve method, called the multiple polynomial quadratic sieve method (MPQSM) [7] . Instead of using the fixed polynomial T (c), the MPQSM uses a more general polynomial T (c) = W c 2 + 2V c +U so that the smoothness candidates are somewhat smaller than those in the QSM.
The number field sieve method (NFSM) is originally proposed for integers of a special form [8] , and is later extended to factor arbitrary integers [9] . Pollard introduces the concept of lattice sieving [10] as an efficient implementation of the sieves in the NFSM. The conventional sieving is called line sieving.
Some other methods for factoring large integers include the cubic sieve method (CSM) [11] and the elliptic curve method (ECM) [12] .
The linear-algebra phase in factoring algorithms can be reasonably efficiently solved using sparse system solvers like the block Lanczos method [13] . We do not deal with this phase in this paper. We focus our attention only to the relation-collection stage (more precisely, the sieving part of it).
A Brief Introduction to SIMD Features
All our implementations of sieving use SIMD (Single Instruction Multiple Data) parallelization. We carry out our experiments on a Sandy Bridge architecture from Intel. This architecture provides two types of vectorization primitives SSE2 and AVX. SSE (Streaming SIMD Extensions) [14] is an extension of the previous x86 instruction set, and SSE2 enhances the SSE instruction set further. Intel introduces SSE2 [15] in Pentium 4. SSE2 instructions apply to 128-bit SIMD registers (XMM). In each XMM register, we can accommodate multiple data of some basic types (like four 32-bit integers, four single-precision floating-point numbers, and two double-precision floatingpoint numbers). There are special instructions to perform vector operations (like arithmetic, logical, shift, conversion, and comparison) on these XMM registers. The basic idea to exploit this architecture is to pack these registers with multiple data, perform a single vector instruction, and finally unpack the output XMM register to obtain the desired individual results.
AVX (Advanced Vector Extensions) [16] is a recent extension to the general x86 instruction set. It is introduced in Intel's Sandy Bridge processor. This architecture is designed with sixteen 256-bit SIMD registers (YMM). Now, we can accommodate eight single-precision or four double-precision floating-point numbers in one YMM register. AVX is designed with three-operand non-destructive SIMD instructions, where the destination register is different from the two source registers.
Another important point to note here is that mixing of AVX instructions and legacy (that is, non-VEX-encoded) SSE instructions is not recommended. XMM registers are aliased as the lower 128 bits of YMM registers. Every AVX-SSE or SSE-AVX transition incurs severe clock-cycle penalties. To eliminate AVX-SSE transitions, legacy SSE instructions can be converted to their equivalent VEX-encoded instructions by zeroing the upper 128 bits of the YMM registers, as suggested in [17] .
Programming languages come with intrinsics for high-level access to SIMD instructions. We can use these intrinsics directly in our implementations to exploit data parallelism.
The MPQSM Sieve
The MPQSM is a variant of the quadratic sieve method (QSM). Instead of using a single polynomial (with fixed coefficients), the MPQSM deals with a general polynomial and tunes its coefficients to generate small smoothness candidates. This variant is parallelizable in the sense that different polynomials can be assigned to different cores.
The MPQSM sieve deals with a polynomial
with V 2 − UW = n. 
We include W too in the factor base. Now, we briefly discuss the sieving step in the MPQSM. The smoothness candidates are the values T (c) of Eqn (1) for all c in the range [−M, M]. We need to locate those values of c for which T (c) is smooth over the factor base. We take an array A indexed by c. Initially, we store log |T (c)| in A[c], truncated after three decimal places. Indeed, we can avoid floating-point operations by storing 1000 log |T (c)| .
After this initialization, we try to find solutions of the congruence T (c) ≡ 0 (mod p h ), where p is a small prime in the factor base, and h is a small positive exponent. The solutions of the congruence
For h = 1, we use a root-finding algorithm to compute the square roots of n modulo p (see [18, 19] ). For h > 1, we obtain the solutions modulo p h by lifting the solutions modulo p h−1 (see [20] ). Let s 1 , s 2 be the solutions of T (c) ≡ 0 (mod p h ) chosen to be as small as possible in the range [−M, M]. Then, all the solutions of T (c) ≡ 0 (mod p h ) are s 1 + j p h and s 2 + j p h for j = 0, 1, 2, 3, . . . . We subtract 1000 log p from these array locations. For p = 2 and h = 1, there is only one initial solution of the congruence. For p = 2 and h > 1, there may be more than two initial solutions.
When all appropriate values of p h are considered, the array locations storing A[c] ≈ 0 correspond to the smooth values of T (c). We apply trial division on these smooth values of T (c). If some T (c) value is not smooth, then the corresponding array entry A[c] holds an integer 1000 log p t+1 for the smallest prime p t+1 > B. This justifies the correctness of the calculations using approximate (truncated) log values.
The NFSM Sieve
The NFSM [9] 
We have f (m) = n, implying that f (m) ≡ 0 (mod n), as desired.
If θ ∈ C is a root of a monic irreducible polynomial f (x) of degree d with integral coefficients and m ∈ Z is an integer such that f (m) ≡ 0 (mod n), then there exists a ring homomorphism φ : Z[θ ] → Z n defined by φ (θ ) = m (mod n) (and φ (1) = 1 (mod n)).
1
Now, let E be a set of pairs of integers (a, b) satisfying
for some α ∈ Z[θ ] and y ∈ Z. Let φ (α) = x ∈ Z n . Then, we get the Fermat congruence
The NFSM uses a rational factor base (RFB) and an algebraic factor base (AFB). The RFB consists of the first t 1 primes p 1 , p 2 , . . . , p t 1 , where t 1 is chosen based on a bound B rat . The AFB consists of some primes of small norms in O. Application of the homomorphism φ lets us rewrite the AFB in terms of t 2 rational (integer) primes p 1 , p 2 , . . . , p t 2 , where t 2 is chosen based on a bound B alg . Now, we describe the rational sieve and the algebraic sieve of the NFSM. Here, we deal with incomplete sieving (see [20] ), that is, higher powers of factor-base primes are not considered in sieving.
Let T (a, b) = a + bm. First, we calculate the values of T (a, b) for all b in the range 1 b u and a in the range −u a u. Now, we try to find those (a, b) pairs with gcd(a, b) = 1 and b ≡ 0 (mod p), for which T (a, b) is B rat -smooth, that is, T (a, b) factors completely into the primes p 1 , p 2 , . . . , p t 1 . The determination whether a small prime p i divides some a + bm is equivalent to solving the congruence a + bm ≡ 0 (mod p i ). The sieving bound u is determined based upon certain formulas which probabilistically guarantee that we can obtain the requisite number of relations from the entire sieving process.
We take a two-dimensional array A indexed by a and b. Initially, we store log |T (a, b)| in A[a, b], truncated after three decimal places (or the integers 1000 log |T (a, b)| ). After this initialization, we find solutions of the congruence T (a, b) ≡ 0 (mod p), where p is a small prime in the RFB. For a fixed b, the solutions are a ≡ −bm (mod p). Let s be the least possible solution of
Then, all the solutions of T (a, b) ≡ 0 (mod p) for that b are s + j p, j = 0, 1, 2, 3, . . . . We subtract 1000 log p from all the array locations A[a, b] such that a = s + j p. We repeat this procedure for all small primes p in the RFB and for all allowed values of b. As we deal with only small primes with exponent h = 1, the array locations storing A[a, b] values less than 1000ξ log p t 1 +1 (where p t 1 +1 is the smallest prime greater than B rat , and ξ lies between 1.0 and 2.5) are subjected to trial division to verify the smoothness of T (a, b).
The algebraic sieve uses the norm function N : Q(θ ) → Q. Its restriction to Z[θ ] yields norm values in Z. For an element of the form a + bθ ∈ Z[θ ], we have the explicit formula:
An element α ∈ Z[θ ] is smooth with respect to the small primes of O if and only if N(α) ∈ Z is B alg -smooth. For each small prime p in the AFB, we compute the set of zeros of f modulo p, that is, all r values satisfying the congruence f (r) ≡ 0 (mod p). For a particular b with b ≡ 0 (mod p) and 1 b u, the norm values with N(a + bθ ) ≡ 0 (mod p) correspond to the a values given by a ≡ −br (mod p) for some root r of f modulo p. It follows that the same sieving technique as discussed for the rational sieve can be easily adapted to the case of the algebraic sieve.
An (a, b) pair for which both a + bm and a + bθ are smooth gives us a relation. Sufficiently many relations obtained from the two sieves are combined using linear algebra to get the set E of (a, b) pairs such that ∏ 
The Lattice Sieve Method
Pollard [10] introduces the lattice sieve method based on an idea proposed in [21] . Here, we describe the lattice sieve (its sieving-by-rows variant) for the rational sieve only. The same technique can be used in the algebraic sieve too. First, we take the rational factor base B rat , and partition B rat in the following way. By S, we denote the set consisting of all small primes p of the factor base with p B rat . Moreover, we denote by M the set consisting of all medium primes q of the factor base with B rat < q B rat . Usually, B rat is selected in such a way that the ratio B rat /B rat lies between 0.1 and 0.5.
Let q ∈ M be a medium prime (also called a special q in the notation of [21] ). Unlike the line sieve described in earlier chapters, we now take into account a region R in the (a, b) plane. For a particular q, we sieve only those (a, b) pairs which satisfy the congruence a + bm ≡ 0 (mod q). Furthermore, we consider only the small primes p with p < q while sieving the integers a + bm for a fixed q.
All the pairs (a, b) satisfying a+bm ≡ 0 (mod q) form a lattice L q in the (a, b) plane. The vectors w 1 = (q, 0) and w 2 = (−m, 1) generate the lattice L q . This basis is reduced using Lagrange's lattice-basis reduction algorithm for two-dimensional lattices (see [22, 23] ). The algorithm is similar to Euclid's gcd computation and is given as Algorithm 1. The iterations of the while loop monotonically decrease the Euclidean norms of v 1 and v 2 by subtracting an appropriate multiple of the shorter vector from the other.
From Algorithm 1, we get a reduced basis {v 1 , v 2 } of the lattice L q with v 1 being the shorter one. Let us write v 1 = (a 1 , b 1 ) and v 2 = (a 2 , b 2 ). All the points of L q can Algorithm 1 : Lagrange's rank-2 lattice-basis reduction algorithm
uv 1 is the orthogonal projection of v 2 on v 1 Here, · denotes the inner or dot product of two vectors
To ensure
be uniquely represented as
The lattice sieving takes place in
We take two sieving bounds L 1 and L 2 and a two-dimensional array A indexed by l 1 and l 2 . The indices vary in the ranges −L 1 l 1 L 1 and 1 l 2 L 2 . The array location (l 1 , l 2 ) represents the value a + bm = l 1 u 1 + l 2 u 2 , where u 1 = a 1 + b 1 m and u 2 = a 2 +b 2 m with gcd(u 1 , u 2 ) = q. As v 1 is shorter than v 2 , we choose L 1 > L 2 in order to make sieving efficient. All the array elements are initialized to zero. Subsequently, 1000 log p values are added to A[l 1 , l 2 ] for all primes p < q for which l 1 u 1 + l 2 u 2 ≡ 0 (mod p). After all primes p are considered, the sum stored in A[l 1 , l 2 ] is subtracted from 1000 log(a + bm) for the corresponding (a, b) pair. If the difference is smaller than a threshold, trial division is used to verify whether a + bm is actually smooth. The threshold is usually chosen to be 1000ξ log p t 1 +1 for the smallest prime p t 1 +1 > B rat and with 1.0 ξ 2.5. This liberal selection is necessitated by the fact that the congruence a + bm ≡ 0 (mod p h ) is now solved only for h = 1.
The above steps are repeated for each medium prime q ∈ M.
Implementation Details
Currently, the largest general-purpose integer that has been factored by a subexponential algorithm is a 768-bit (232-digit) RSA modulus (see [24] ). Consequently, we restrict our experiments to integers with no more than 250 decimal digits. For the NFSM, we experiment with two integers n 1 and n 2 having 60 and 120 decimal digits, respectively. Each of these is like an RSA modulus (that is, a product of two primes of the same bit length). In the MPQSM, we have implemented the complete sieve, that is, the congruence T (c) ≡ 0 (mod p h ) is solved for all relevant p, h values. For the NFSM, on the other hand, we have implemented only the incomplete sieve, that is, the value h = 1 is only considered. We have not gone for any sophisticated polynomial-selection procedure in the NFSM. We have instead chosen f (x) to be the polynomial obtained from the m-ary representation of n, where m = n 1/d . As suggested in [2, 25] , we take d = 3 if the number of digits in n is less than 80, and d = 5 for larger integers.
Sequential Implementations
The sequential implementations are straightforward. Let us consider the MPQSM sieve first. For a small prime p in the factor base, let H be the largest exponent h for which Eqn (3) is satisfied by some c ∈ [−M, M]. For each h ∈ {1, 2, . . . , H}, these solutions are precomputed before the sieving loop. Let s be a solution for some p, h values. We take s to be the minimum location in the sieving interval [−M, M]. We sieve the array A by subtracting 1000 log p from all the array locations c = s + j p h with j ∈ N. This procedure is repeated for all small primes in the factor base. Algorithm 2 presents a pseudocode of our sequential implementation of the MPQSM sieve. In the NFSM, the rational and the algebraic sieves are carried out independently in two two-dimensional arrays. If for some pair (a, b) with gcd(a, b) = 1, both a + bm and a + bθ are found to be smooth from the two sieves, we obtain a relation. The number of relations depends on the sieving bound u. Our basic goal is to investigate the benefits of SIMD parallelization instead of generating a complete solvable system, so we have experimented with small sieving bounds only.
In the lattice sieve method, we choose a medium prime q ∈ M. Then, we determine a reduced basis {v 1 , v 2 } for L q using Algorithm 1 (see Section 2.5). Subsequently, the rational sieve is carried out in a two-dimensional array whose rows and columns are indexed by l 2 and l 1 , respectively. We consider sieving by rows only, that is, we fix l 2 , start with the least array location l 1 in the sieving interval satisfying l 1 u 1 + l 2 u 2 ≡ 0 (mod p), and sieve all the other locations for the particular row l 2 . We precompute u 
we sieve all elements of every p-th row or every p-th column, respectively. We consider only the coprime (l 1 , l 2 ) pairs. Finally, the gcd of every pair (a, b) indicating smoothness is calculated. If this gcd is one, the (a, b) pair provides a relation. If gcd(a, b) = q, we take the pair (a/q, b/q) into account. If the value of b turns out to be negative, we change the signs of both a and b.
Parallel Index Calculations
The procedure for data-parallel index calculations is shown in Figure 1 k , respectively. These initial solutions are taken to be as small as possible in the sieving range. Then, k parallel additions take place with the help of a single SIMD addition instruction. The individual indices are obtained by extracting the components of the SIMD register storing the sum. However, a repacking of these indices is not needed in the next iteration. The output SIMD register of one iteration is directly fed as an input in the next iteration. In all these index calculations, the second SIMD register (holding the p h values) remains constant. The parallel index-calculation loop terminates when the value of any of the k components goes beyond the sieving range. The smoothness candidates in the NFSM are polynomial expressions in two variables a, b. If we fix b and vary a, sieving becomes identical to the univariate case of T (c). A couple of general points concerning these data-parallel implementations are in order.
In both the SSE2 and AVX implementations, we use 32-bit chunks (integers or floating-point numbers) to populate SIMD registers. This means that a single vector addition in SSE2 computes four sums in parallel. For AVX, eight sums are carried out by each such vector addition. Both SSE2 and AVX support packing of 64-bit units (double-precision integers or floating-point numbers). But this reduces the extent of parallelism. Moreover, so long as array indices are restricted to single-precision values (as is usually the case), use of 64-bit units not only is wasteful of space but also increases the relative overhead associated with packing and unpacking.
AVX provides 256-bit vector operations on floating-point numbers only. Therefore, we transform 32-bit integers to 32-bit single-precision floating-point numbers and pack these floating-point numbers in an SIMD register. After a vector addition, the floatingpoint units are converted back to integers after unpacking. This, however, comes at a cost. First, this introduces overheads associated with conversion between integer and floating-point formats. Second, this reduces the maximum index value from 2 32 to 2 23 (since the length of the mantissa segment of a 32-bit single-precision floating-point number is 23 bits according to the IEEE Floating-Point Standard). Converting integers to double-precision floating-point numbers eliminates this restriction. But as mentioned above, it is preferable to avoid 64-bit units in SIMD registers.
SSE2 Implementations
SSE2 provides 128-bit SIMD registers, that is, we take k = 4 in Fig 1. We take four integer solutions s i as small as possible in the sieving interval. These four solutions correspond to four p If we take the same p and use different h i values in the components of the SIMD register, the problem just mentioned becomes quite acute. It is instead preferable to use the same or close p i values and take the same h value in all the components of the SIMD register. This means that we first sieve for all the solutions of T (c) ≡ 0 (mod p i ) for i = 1, 2, 3, . . . ,t in groups of four. Next, we take h = 2 and do sieving for the solutions of T (c) ≡ 0 (mod p 2 i ), again in groups of four, and so on. For this choice, the iteration counts (2M +1)/p h i i are nearly the same for all the four components, and the extent of parallelism increases. Moreover, the spatial proximity during the access of A improves for small values of p and h. Notice that our NFSM and lattice sieve implementations use h = 1 only, and therefore we only need to take the p i values in the four components as close to one another as possible (like consecutive primes). end while end for end for end for Fig. 2 . SSE2 intrinsics used for index calculations m128i xmm p = mm load si128 ( m128i *P); m128i xmm l = mm load si128 ( m128i *L); m128i xmm l = mm add epi32 ( m128i xmm l, m128i xmm p); mm store si128 ( m128i *L, m128i xmm l);
We have used some other optimization tricks. If T (c) ≡ 0 (mod p h ) has only a few solutions in the sieving interval, then it is preferable to sieve for these values sequentially, since in this case the benefits of parallelization is shadowed by packing and unpacking overheads. The threshold, up to which parallelizing solutions modulo p h remains beneficial, is determined experimentally.
The MPQSM deals with a quadratic congruence T (c) ≡ 0 (mod p h ) which usually has two solutions (if p is odd). If there is a unique solution of this congruence (this is rather infrequent), we handle the sieving for this solution sequentially in order not to lose the alignment of using a pair of two distinct prime powers in a round of dataparallel sieving. For the NFSM and the lattice sieve, we solve a linear congruence to generate the initial solutions. In this case, there is always a unique initial solution s i for each p i (or for each root of f (x) modulo p i in the algebraic sieve of the NFSM). Here, we carry out sieving for four solutions in parallel.
Intrinsics Used for SSE2 The SSE2 intrinsics used in our implementation are shown in Figure 2 . 2 The header file emmintrin.h defines the 128-bit data type m128i and the prototypes for intrinsics mm load si128, mm add epi32 and mm store si128. The registers xmm p (for p h or p values) and xmm l (for s values) are packed each with four contiguous 32-bit integers starting from the locations P and L, respectively, using mm load si128. Then, they are added with a single vector instruction defined by mm add epi32. Finally, the output SIMD register xmm l is unpacked and its content is stored in the location L. However, unpacking is not destructive, that is, we can reuse this output register later, if required. Now, we subtract the log values from the array locations stored in [2] and L [3] . To use the intrinsics mm load si128 and mm store si128, it is necessary that the addresses P and L are 16-byte aligned. If they are not, we have to use the intrinsics mm loadu si128 and mm storeu si128. However, these intrinsics are more time-consuming compared to the ones for 16-byte aligned memory. Another important point is that the packing overhead is much larger in case we attempt to pack from four non-contiguous locations using mm set epi32 or similar intrinsics. So, we avoid them in our implementations. Fig. 3 . AVX intrinsics used for index calculations m256 ymm p = mm256 load ps (float *P); m256 ymm l = mm256 load ps (float *L); m256 ymm l = mm256 add ps ( m256 ymm l, m256 ymm p); mm256 store ps (float *L, m256 ymm l);
AVX Implementations
Following the same basic idea discussed above for SSE2, we implement data-parallel index calculations using AVX intrinsics. The AVX instruction set of Sandy Bridge does not support 256-bit vector integer operations. In order to exploit the power of 256-bit registers, we carry out floating-point index calculations. But then, we also need conversions between floating-point numbers and integers, since array indices must be integers.
Intrinsics Used for AVX
The intrinsics we employ in our AVX implementation are shown in Figure 3 . The 256-bit data type m256 and the prototypes for the intrinsics mm256 load ps, mm256 add ps and mm256 store ps are defined in the header file immintrin.h. Two 256-bit SIMD registers (ymm p and ymm l) are packed each with eight contiguous 32-bit floating-point numbers starting from the locations P (for p h or p values) and L (for s values), respectively, using mm256 load ps. A single vector instruction defined by mm256 add ps is used to add them. The individual results in the output SIMD register ymm l are then extracted in the location starting from the address L. Now, we need to convert the floating-point values [6] and L [7] to integers to obtain the array locations for sieving. The addresses P and L should be 32-byte aligned, for otherwise we need to use the inefficient intrinsics mm256 loadu ps and mm256 storeu ps. Moreover, we avoid using mm256 set ps or similar inefficient intrinsics which are used to pack eight floating-point numbers from arbitrary non-contiguous locations.
Experimental Results and Analysis
Experimental Setup
Our focus is on implementing the sieving part efficiently using SIMD parallelization. Version 4.6.3 of GCC supports SSE2 and AVX intrinsics. Our implementation platform is a 2.40GHz Intel Xeon machine (Sandy Bridge microarchitecture with CPU Number E5-2609). Version 2.5.0 of the GP/PARI calculator (see [26] ) is used to calculate the log values of large integers and to find the zeros of f modulo p (for only the algebraic sieve in the NFSM). We use the optimization flag -O3 with GCC for all sequential and parallel implementations. To avoid the AVX-SSE and SSE-AVX conversions, we use the flag -mavx in the AVX implementation. To handle large integers and operations on them, we use version 5.0.5 of the library [27] . Table 1 summarizes the timing (in milliseconds) and speedup figures for the implementations of the MPQSM sieve. For each n, M, B values used in our experiments, we take the average of the times taken by fifty executions. In all the tables, the speedup figures are calculated with respect to the sequential implementations. We have incorporated all the improvement possibilities discussed in Section 3.3. The rows in the same cluster have the same values for M and B, but differ in the count of digits in the integer being factored. NFSM Timings for our implementations are reported in milliseconds. For each data set, we record the average of the times taken over fifty executions. We take B rat = B alg = B as the bounds on the small primes in the two sieves. We document the results for 1 b 10. Timing and speedup figures for the implementations (sequential and SIMD-based) are summarized in Table 2 and Table 3 for the 60-and 120-digit integers n 1 and n 2 , respectively (see Section 3). The experimental results for the rational sieve and the algebraic sieve are shown separately. Lattice Sieve Method We have implemented the rational sieve of the NFSM using the lattice sieve method. For each data set, we record the average of the times taken by fifty executions. Table 4 and Table 5 show the experimental results for the 60-and 120-digit integers n 1 and n 2 , respectively (see Section 3), with B rat = 10000 and L 2 = 10. Different values of B rat are considered. Times are measured in seconds.
Speeding up Implementations of the Sieving Using SSE2 and AVX MPQSM
Observations Tables 1-5 suggest the following points.
-For the MPQSM sieve, the speedup varies in the range 20-35% on an average, except in the last two clusters where both M and B values are large. -For the NFSM sieve, we get a speedup in the range 15-40%.
-The speedup obtained in the lattice sieve (in the range 5-25%) is less compared to what is achieved by the line sieve. This is attributed to the fact that the lattice sieve involves some extra work like resetting the sieving array and calculating a reduced basis for each medium prime q ∈ M. -The speedup increases when the sieving limit increases. This is due to fact that larger sieving bounds allow parallel index calculations to proceed for a larger number of iterations. -We get higher speedup for smaller bounds on the primes in the factor base. In this case too, the number of iterations in the sieving loop increases. -Because of frequent conversions between integer and floating-point formats in each iteration of the sieving loop, the speedup with AVX, despite the use of 256-bit registers, is below our expectation, and turns out to be almost the same as that with SSE2. -For the NFSM, the speedup is found to be somewhat higher for the algebraic sieve compared to the rational sieve. The rational sieve packs four (or eight) different primes in a SIMD register. The largest of these determines how many times the loop iterates. In the algebraic sieve, the primes packed in the SIMD register may be repeated, since the polynomial f may have multiple roots modulo a small prime. Repeated primes indicate the possibility of an increased number of iterations in the sieving loop. This is the most likely reason why we obtain better speedup in the algebraic sieve.
Cache-Profile Analysis of Our Implementations of the MPQSM Sieve Using SSE2
The idea of packing close p h i i values in SIMD registers, as discussed in Section 3.3, was motivated by the heuristic assumption that the improvement makes our implementations more cache-friendly than the basic one (which corresponds to the same p i value and different h i values). In order to verify the effectiveness of our cache-friendly implementation, we have done cache-profile analysis our implementations (both the basic and the improved versions) using SSE2.
We have used the Cachegrind tool along with Valgrind (Version 3.7.0) to profile our basic and improved SIMD implementations using SSE2. From those profile information, we get an idea about the L1 and LL (last level, L3 in this case) cache misses. The sum of L1 misses includes L1 instruction-fetch misses, L1 data-read misses, and L1 data-write misses. Similarly, the sum of L3 misses includes L3 instruction-fetch misses, L3 data-read misses, and L3 data-write misses. The sum of L1 cache misses and the sum of L3 cache misses for an integer with 241 digits for different M and B values are reported in Table 6 . The table clearly indicates that our cache-friendly implementation encounters about 10% less cache misses compared to the unoptimized version.
Conclusion
Sieving is used to make the relation-collection phase in modern integer-factoring algorithms faster than trial divisions. In this work, we have exploited SIMD features, commonly available in modern microprocessors, in our implementations of the sieving procedure. We have chosen the two most efficient factoring algorithms (the MPQSM and the NFSM) in our experiments. We have implemented both the line sieve and the lattice sieve. We have used SSE2 and AVX features provided by Intel's Sandy Bridge architecture. Our data-parallel implementations have achieved noticeable speedup over sequential implementations. Line sieving enjoys slightly higher speedup than lattice sieving. We have also employed some optimization techniques in our SIMD-based implementations, and these have been experimentally verified to increase the cache-friendliness of our implementations. We conclude this paper after mentioning some directions in which this research can be extended.
-Although we have been able to speed up index calculations using SIMD-based parallelization, a similar approach was not effective during data-parallel subtractions of log values. Our efforts on parallelizing the subtraction operations, as described in Figure 4 , have not produced any benefit. Here, the bottleneck is that we cannot reuse the content of the output SIMD register (used for storing array elements) in the next iteration. Consequently, packing is required in every iteration of the sieving loop. Moreover, both packing and unpacking access non-contiguous memory locations, leading to additional slowdown in the implementations. Efficient data parallelization of the subtraction operations seems to be quite challenging, since a straightforward use of SIMD intrinsics is not beneficial, as mentioned above. The main bulk in the sieving computations includes index calculations and subtractions of log values. It requires non-trivial experimental investigations to settle whether subtractions can at all be effectively handled by SIMD intrinsics.
-Our implementations using 256-bit SIMD registers can be easily ported to Intel's recently released Haswell architecture, where 256-bit vector integer instructions are available. Using this AVX2 instruction set helps us to avoid frequent conversions between floating-point numbers and integers. This has the potential to increase the performance gains significantly. - Table 1 for sieving in the MPQSM shows that the speedup is somewhat small when both M and B are large. Improving the performance of our SIMD-based implementations for large values of M and B deserves further attention. -For the lattice sieve method, Pollard [10] proposes two different variants: sieving by rows and sieving by vectors. In our work, we have dealt with the first variant only. It is an interesting experimental investigation to determine how SIMD features can benefit sieving by vectors. -Our implementations of the MPQSM and NFSM sieves are not readily portable to polynomial sieves used in the computation of discrete logarithms over finite fields of small characteristics (for example, see [28, 29] ). Fresh experimentation is needed to investigate the effects of SIMD parallelization on polynomial sieves. Data-parallel implementations of the pinpointing algorithm of Joux [30] also merits attention.
