Abstract-Geiselmann and Steinwandt proposed an ASIC-based hardware design, "YASD", for the sieving step of the number field sieve (NFS) method of integer factorization in 2004. The design is attractive because its regular structure seems well suited to implementation; however, no performance evaluation for 1024-bit integers has been provided. This paper gives the first evaluation of the performance of YASD for 1024-bit integers, by a simple extrapolation under the same assumptions as the original YASD. In our estimation, an optimized YASD for 1024-bit integers requires 42200 mm² and about 10600 years for the sieving. Since we did not consider the wiring problem or the mini-factoring problem, a YASD for 1024-bit integers would require further circuit area and time if it were actually manufactured.
I. INTRODUCTION
It is strongly believed that breaking RSA is as hard as factoring large integers. Thus the number field sieve (NFS) method, currently the best factoring algorithm, is a fundamental threat to RSA. Recently, a 768-bit integer was factored by a software implementation of NFS [5]. Based on this fact, it is considered that factoring 1024-bit integers will remain infeasible for at most another 10 years. However, this expectation does not take into account recent research results on dedicated NFS factoring devices, which originated from Bernstein's circuit [1]. NFS consists of 4 steps: the polynomial selection step, the sieving (or relation finding) step, the linear algebra step, and the square root step. Since the sieving step is the most time-consuming part, a lot of research has focused on this step. In 2003, Shamir and Tromer proposed an ASIC-based hardware design, "TWIRL", for the sieving step [9]. According to their estimation, a single TWIRL could sieve for a 768-bit integer within 2.3 years at a cost of 750 dollars (excluding non-recurring engineering, defects, and power supply). More surprisingly, a single TWIRL could sieve even for a 1024-bit integer within 194 years at a cost of 15000 dollars, which implies that TWIRL could finish within 1 year at a cost of 10M dollars (including defects). Since TWIRL uses various kinds of circuits (an irregular structure), it requires an LSI as large as a full wafer of 300 mm diameter. Recently, Franke et al. proposed a sophisticated hardware design, "SHARK" [2]. According to their estimation, a single SHARK could factor a 1024-bit integer within 2300 years at a cost of 40000 dollars under the same assumptions as TWIRL. These results are summarized in Table I (the cost estimation excludes NRE, defects, and power supply). On the other hand, Geiselmann and Steinwandt proposed another novel design, "YASD" [3], which is very attractive because of its small size and regular structure.
According to their estimation, a single YASD requires 34.5 years at a cost of 250 dollars for a 768-bit integer. However, no estimation for 1024-bit integers has been provided so far. This paper evaluates the performance of YASD for 1024-bit integers based on the same assumptions as those of the original YASD. In particular, we assume that the process rule is 130 nm, the frequency is 500 MHz, the diameter of a full wafer is 300 mm, and the manufacturing cost of a wafer is 5000 dollars. We establish general formulas for the circuit area and sieving time of YASD for 768-bit integers (YASD768) and YASD for 1024-bit integers (YASD1024). By optimizing the parameters with regard to the area-time product (AT product), we conclude that YASD1024 requires at least 42200 mm² and 10652 years for the sieving with k = 27 and m = 10 (where YASD consists of $2^m \times 2^m$ nodes and each node has $2^k$ registers for the sieving). Since we did not consider the wiring problem and the mini-factoring problem in detail in our estimation, YASD1024 would require considerably more area and time if it were actually manufactured and operated.
II. PRELIMINARIES
This section reviews the number field sieve (NFS) method and the sieving device YASD [3] for later discussion.
A. Number Field Sieve Method
The number field sieve (NFS) method is currently the best algorithm for integer factorization. It consists of four steps: the polynomial selection, the sieving (or relation finding), the linear algebra, and the square root. Let n be an integer to be factored. First, two univariate polynomials $f_r(x)$, $f_a(x)$ and an integer m such that $f_r(m) \equiv f_a(m) \equiv 0 \pmod{n}$ are selected in the polynomial selection step. These polynomials are converted to bivariate homogeneous polynomials $F_r(x, y)$, $F_a(x, y)$. The purpose of the sieving step is to find a large number of relations, namely coprime integer pairs (a, b) in the sieving region such that $F_r(a, b)$ is $B_r$-smooth and $F_a(a, b)$ is $B_a$-smooth. Here an integer x is called B-smooth if x factors completely over the primes less than B. Procedures corresponding to $F_r$ ($F_a$) are sometimes called 'rational' ('algebraic'), respectively. The parameters $H_a$, $H_b$ determine the sieving region and depend on the target integer n. In practice, the core sieving step picks up candidate relations, and an additional step checks via mini-factoring whether each candidate really is a relation. After a set of relations has been found, Gaussian elimination over a matrix generated from these relations is computed in the linear algebra step. Then, a factor of n is output by the square root step.
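Let us look at the sieving step in detail. For a fixed b and each prime p in the algebraic (rational) factor base, whenever p divides $F_a(a_j, b)$ ($F_r(a_j, b)$), $\log_{\sqrt{2}} p$ is added to the algebraic (rational) register corresponding to the pair $(a_j, b)$, respectively. After checking all primes less than $B_a$ ($B_r$), we pick up the pairs (a, b) whose corresponding registers are almost equal to $\log_{\sqrt{2}} F_a(a, b)$ and $\log_{\sqrt{2}} F_r(a, b)$, respectively.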
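The register-based sieving just described can be illustrated by a small software sketch. This is only a toy model under stated assumptions, not the procedure of YASD or of [3]: the function names (`primes_below`, `line_sieve`), the brute-force root search, the rounding of the log values, and the `slack` threshold that models "almost equals" are all hypothetical choices made for illustration, and prime powers are ignored.

```python
import math


def log_sqrt2(x: float) -> float:
    """Logarithm to base sqrt(2), i.e. 2 * log2(x)."""
    return 2.0 * math.log2(x)


def primes_below(bound: int) -> list[int]:
    """All primes smaller than bound (bound >= 2), by a plain sieve."""
    sieve = bytearray([1]) * bound
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(bound ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, bound, i)))
    return [i for i in range(bound) if sieve[i]]


def line_sieve(F, b, a_min, a_max, bound, slack):
    """Sieve one line (fixed b) over a_min <= a < a_max.

    F is an integer polynomial in (a, b). For every prime p < bound and
    every root r of F(., b) mod p, a rounded log value is added to the
    registers of all a with a = r (mod p). Returns the a whose register
    comes within `slack` of log_sqrt2 |F(a, b)| (candidates only; a real
    sieve would confirm them by mini-factoring).
    """
    registers = [0] * (a_max - a_min)
    for p in primes_below(bound):
        log_p = round(log_sqrt2(p))
        for r in range(p):  # toy root search, fine for small p only
            if F(r, b) % p == 0:
                start = a_min + ((r - a_min) % p)
                for a in range(start, a_max, p):
                    registers[a - a_min] += log_p
    candidates = []
    for a in range(a_min, a_max):
        value = abs(F(a, b))
        if value == 0:
            continue
        if registers[a - a_min] >= log_sqrt2(value) - slack:
            candidates.append(a)
    return candidates


if __name__ == "__main__":
    # Toy polynomial F(a, b) = a^2 + b^2, chosen only for illustration.
    print(line_sieve(lambda a, b: a * a + b * b, 1, 1, 50, 20, 1.0))
```

In YASD the inner additions are not performed locally as in this loop; each addition corresponds to one packet that is routed to the node holding the target register.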
B. YASD
YASD (Yet Another Sieving Device) is an ASIC-based dedicated sieving device for NFS proposed by Geiselmann and Steinwandt [3]. YASD consists of a $2^m \times 2^m$ processor mesh, in which each processor node is connected to its horizontal and vertical neighbors. Each node conceptually has three parts: Main part, Mesh part, and Memory part. In addition, each node has a buffer between Main and Mesh part, and a buffer between Mesh and Memory part. The primes of the factor base are distributed over the Main parts. Each Memory part has $2 \times 2^{k-2m}$ registers to store log values. In YASD, a log value is generated in Main part as a packet. At the same time, the target register in which it is to be stored, and thus the target node, are computed and packed into the packet. These packets are routed via Mesh parts under the control of the clockwise transposition routing algorithm.

Main Part: The main function of Main part is to generate packets from the factor base and to send them to Mesh part. More precisely, for a prime p in the factor base, Main part outputs a packet corresponding to an integer r such that
If r + p is in the current subinterval S, a packet corresponding to r + p is also generated and sent to Mesh part. Otherwise, the same procedure is performed for the next prime.
A packet has 8 values: $x_t$, $y_t$, $c_x$, $c_y$, log, a footprint, $z_t$, and a flag. $(x_t, y_t)$ are the coordinates of the target node and $z_t$ is the address of the register to which the log value will be added. $c_x$, $c_y$ are cross-border flags. The "footprint" is factoring information consisting of the coordinates of the target node concatenated with some lower bits of p (more precisely, the lowest bit, which is always 1, is excluded since p is prime). This information will be used in the mini-factoring. The flag indicates whether the packet is algebraic or rational.
The coordinate values $(x_t, y_t)$ and the target address $z_t$ are computed in Main part. If the target node is the node itself, the packet is sent directly to the buffer between Mesh and Memory part. Otherwise, the packet is sent to the buffer between Main and Mesh part. In Mesh parts, 25% of the nodes are kept empty in order to avoid packet congestion. The factor base is distributed over the $2^m \times 2^m$ nodes. In addition, small primes are stored redundantly in several nodes for efficiency. Primes smaller than $Z = 2^{k-2m}$ are stored in all nodes because very many packets corresponding to such primes are generated. Also, primes p such that $Z < p < 2^{k-2m+2}$ are stored and shared in $2^2$ nodes, primes p such that $2^{k-2m+2} < p < 2^{k-2m+4}$ are stored and shared in $2^4$ nodes, and so on (see Table 2). All factor bases have almost the same number of primes.

Memory Part: Memory part of each node has two registers for each of the $2^k$ z-values that the node has to take care of. One DRAM is for the algebraic side and the other is for the rational side. These memories are initialized with 0 and hold sums of $\log_{\sqrt{2}} p$ values. In addition, Memory part stores footprints in a special DRAM. One such DRAM is shared by two neighboring nodes. Thus, when a footprint is stored, the position $z_t$, a 1-bit algebraic/rational flag, and a 1-bit flag identifying the node are needed in addition to the footprint itself. Like Main part, Memory part requires two receiving buffers connecting Memory and Mesh part.
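The sharing rule for the factor base described above (primes below Z kept in every node, then groups of $2^2$, $2^4$, ... nodes as p grows) can be written as a short function. This is a sketch under assumptions: the name `sharing_group_size` is hypothetical, primes above $2^k$ are taken to be shared by the whole mesh as stated later in Section III, and a return value of 1 means that every node keeps its own copy.

```python
def sharing_group_size(p: int, k: int, m: int) -> int:
    """Number of mesh nodes that share one stored copy of the prime p.

    Assumes a 2^m x 2^m mesh, subintervals of size 2^k and Z = 2^(k-2m).
    """
    z = 1 << (k - 2 * m)
    if p < z:
        return 1  # replicated: each node stores the prime itself
    if p >= (1 << k):
        return 1 << (2 * m)  # shared by the whole mesh
    i = 0
    # Find i with 2^(k-2m+2i) <= p < 2^(k-2m+2(i+1)).
    while p >= (z << (2 * (i + 1))):
        i += 1
    return 1 << (2 * (i + 1))


if __name__ == "__main__":
    # Parameters k = 24, m = 8 (so Z = 256), as quoted for YASD in Section III.
    for prime in (3, 257, 1031, 16777259):
        print(prime, sharing_group_size(prime, 24, 8))
```

With these parameters the group size grows from 1 for p < 256 up to the full mesh of $2^{16}$ nodes once p exceeds $2^{22}$.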
Mesh Part: The main function of Mesh part is to route packets to their target nodes. The routing is controlled by the clockwise transposition routing algorithm, and the complete routing logic is located in Mesh part. This part requires two registers to hold packets travelling on the mesh. According to the analysis by Lenstra et al. [6], clockwise transposition routing over an $L \times L$ mesh requires at most $2(L - 1)$ steps.
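For a mesh of $2^m \times 2^m$ nodes, the bound quoted above can be evaluated directly with $L = 2^m$. The helper below is a minimal sketch; the function name is hypothetical, and the bound is taken from the text as stated rather than derived here.

```python
def routing_step_bound(m: int) -> int:
    """Upper bound 2(L - 1) on routing steps for an L x L mesh, L = 2^m."""
    side = 1 << m
    return 2 * (side - 1)


if __name__ == "__main__":
    print(routing_step_bound(8))   # 510
    print(routing_step_bound(10))  # 2046
```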
III. PARAMETERIZATION OF YASD
According to the analysis of YASD for 768-bit integers by Geiselmann and Steinwandt, an LSI as large as 24 cm² is required, and the running time for the whole sieving is 34.5 years at a frequency of 500 MHz. In this section, we establish formulas for the running time and area of a generalized YASD.
A. Running Time
In YASD, the Main, Mesh, and Memory parts all operate in parallel. Among them, the dominant procedures are packet generation in Main parts and routing in Mesh parts. In the following, we analyze these procedures in detail.

Packet Generation: Suppose a mesh consists of $2^m \times 2^m$ nodes and the size of a subinterval is $2^k$. We divide the primes into 3 groups according to their sizes: $p < Z = 2^{k-2m}$, $Z < p < 2^k$, and $2^k < p < B_a$. When $p < Z$, each node generates $Z/p$ packets for each prime. When $Z < p < 2^k$, each prime is stored and shared by some neighboring nodes. For a prime p such that $2^{k-2m+2i} < p < 2^{k-2m+2(i+1)}$, $2^k/p$ packets will be generated. Since this prime is shared by $2^{2(i+1)}$ nodes, the packet generation requires $\sum_{Z<p<2^k} Z/p$ per node. When $2^k < p < B_a$, at most 1 packet will be generated (since these primes are shared by a mesh). Thus the packet generation requires
where $\pi(x)$ denotes the number of primes smaller than x. By substituting YASD's parameters (k = 24, m = 8, Z = 256) into Equation (1), we have $\mathrm{Time}_{gen} \approx 3000$ clocks (assuming that each packet is generated in 1 clock), which is much smaller than 40000 (the number of clocks YASD requires for sieving a subinterval). Thus, this procedure is not dominant.

Therefore, we assume that the difference between the above size and the size evaluated in [3] is due to input/output circuitry proportional to the number of nodes.

Main Part: The main functions of Main part are to store the prime numbers of the factor base, to generate packets, and to send them to the buffer between Main and Mesh part or between Mesh and Memory part. The Main-Mesh buffers are included here.
Let us consider the number of transistors. Main part requires a $\max(L_p, k)$-bit adder for packet generation and two $PAC$-bit latches. Thus we have the number of transistors required in Main part as
where $H_{add}$ denotes the number of transistors of a 1-bit adder, $L_p$ denotes the bit length of the largest prime, $PAC$ denotes the bit length of a packet, $H_{latch}$ denotes the number of transistors of a 1-bit latch, and the constant 1000 accounts for the logic circuits of packet generation (see Tables VI and VII). We also have $PAC = k + L_{log} + 2m + 13$, where $L_{log}$ is the bit length of $\log_{\sqrt{2}} p$. The DRAM of Main part is used for storing the factor base, whose size N depends on the bit length of the composite integer n. Since YASD distributes the factor base over the mesh, the number of DRAM bits of Main part equals
where $L_{fb}$ denotes the average bit length of a prime in the factor base.

Memory Part: The main functions of Memory part are to extract the log value $\log_{\sqrt{2}} p$ from a packet and to add it to the register at the address specified in the packet. The Mesh-Memory buffers are included here. Thus Memory part requires an $L_{mem}$-bit adder and two $(PAC - 2m - 2)$-bit buffers (since the target address $(x_t, y_t)$ and the cross-border flags $(c_x, c_y)$ are no longer needed once the packet has reached the target node). Thus we have the number of transistors required in Memory part as
where $L_{mem}$ denotes the bit length of a sum of $\log_{\sqrt{2}} p$ values. Again, the constant 500 accounts for the logic circuits for output.
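The component lists given above for Main part and Memory part suggest transistor counts of the following shape. This is only a reconstruction from the prose, not the paper's own equations: the symbols $\mathrm{Tr}_{main}$ and $\mathrm{Tr}_{mem}$ are assumed by analogy with $\mathrm{Tr}_{mesh}$, and it is assumed that the Mesh-Memory buffers are built from 1-bit latches.

```latex
\begin{align*}
\mathrm{Tr}_{main} &\approx H_{add} \cdot \max(L_p, k) + 2\, H_{latch} \cdot PAC + 1000, \\
\mathrm{Tr}_{mem}  &\approx H_{add} \cdot L_{mem} + 2\, H_{latch} \cdot (PAC - 2m - 2) + 500.
\end{align*}
```

In both expressions the first term is the adder, the second the pair of latches or buffers, and the constant the control logic mentioned in the text.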
Memory part has $2 \cdot L_{mem}$-bit DRAM entries (one for the algebraic side and one for the rational side) for each of the $Z = 2^{k-2m}$ pairs, and additional DRAM for storing footprints. A footprint consists of a target node address $(x_t, y_t)$ and 10 bits of information on the prime p (10 low-order bits of p). In addition to the footprint itself, a Z-bit sieving position, a 1-bit algebraic/rational flag, and a 1-bit flag identifying the node are required (the DRAM for footprints is shared by 2 nodes). Thus the number of DRAM bits of Memory part is
where F denotes the number of footprints.

Mesh Part: The main functions of Mesh part are to compare packets and to exchange them if needed. Thus the number of transistors of Mesh part, $\mathrm{Tr}_{mesh}$, is given by
where $H_{dff}$ denotes the number of transistors of a 1-bit D flip-flop and $H_{comp}$ that of a 1-bit comparator. From the above discussion, we have the following general formula for the circuit area of YASD (for N-bit integers):
where we assumed that 1 DRAM bit requires 0.3 μm² and 1 transistor requires 2.8 μm² with 130 nm technology.

In this section, we formulate the processing time and circuit area of YASD for 1024-bit integers (YASD1024). Then we optimize the parameters k, m with regard to the area-time product. The sieving parameters used in the following are summarized in Table V.
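The two area constants of the 130 nm model stated above (0.3 μm² per DRAM bit, 2.8 μm² per transistor) turn any pair of transistor and DRAM-bit counts into a silicon area. The helper below is a minimal sketch of that unit conversion only; the function and constant names are hypothetical, and the counts passed in the example are arbitrary illustrative numbers, not values from the paper.

```python
DRAM_BIT_UM2 = 0.3     # area of one DRAM bit in square micrometres
TRANSISTOR_UM2 = 2.8   # area of one logic transistor in square micrometres


def area_mm2(transistors: float, dram_bits: float) -> float:
    """Silicon area in mm^2 for the given transistor and DRAM-bit counts."""
    area_um2 = TRANSISTOR_UM2 * transistors + DRAM_BIT_UM2 * dram_bits
    return area_um2 / 1e6  # 1 mm^2 = 10^6 um^2


if __name__ == "__main__":
    # Arbitrary example: 10^6 transistors and 10^9 DRAM bits.
    print(area_mm2(1e6, 1e9))
```

Because a DRAM bit is roughly nine times smaller than a transistor in this model, the DRAM contribution dominates only when the bit count is much larger than the transistor count.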
A. Processing Time for YASD1024
Since the total number of packets being sent is given by
from Equation (2), we have the routing time
.27−2 log log 2
In NFS, relations are collected from a region
Since a and b should be coprime, a pair with even a and even b does not need to be sieved. Thus, 25% of the whole sieving region can be excluded. Consequently, the processing time of a single YASD1024 is evaluated by the following with frequency 500 MHz:
B. Circuit Area for YASD1024
Evaluation of Parameters: Before evaluating the circuit area, some parameters must be re-evaluated because of the change from 768-bit to 1024-bit integers.
$L_p$: For 1024-bit integers, the largest prime is approximately $B_a = 2.6 \times 10^{10}$. Thus $\lceil \log_2 \log_{\sqrt{2}} (2.6 \times 10^{10}) \rceil = 7$ bits are enough for $L_p$. However, allowing for a margin, we set $L_p = 8$ in the following.
$L_{mem}$: Let us discuss the maximal value of a sum of $\log_{\sqrt{2}} p$ values. Since this sum is bounded by $\log_{\sqrt{2}} F_a(a, b)$ with $F_a(a, b) \approx 10^{105}$ [7], $\lceil \log_2 \log_{\sqrt{2}} (10^{105}) \rceil = 10$ bits are enough for $L_{mem}$. Since $L_{mem}$ also has to hold a sum of rounded-up values, 11 bits seem adequate and we set $L_{mem} = 11$.
$L_{fb}$: The factor base includes the difference of consecutive primes $\Delta p / 2$ and a root r. The majority of the factor base consists of so-called hugish primes, primes larger than $B_r$, which have only 1 algebraic root each. Thus, with regard to $\Delta p / 2$, it is sufficient to make it 2 bits longer than $L_{log}$, and the same holds for $L_p$.
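The two bit lengths quoted above for $L_p$ and $L_{mem}$ can be checked numerically, using $\log_{\sqrt{2}} x = 2 \log_2 x$. The snippet is a small sanity check; the function name is hypothetical, and it computes only the minimal bit lengths before the safety margins chosen in the text are added.

```python
import math


def log_value_bits(x: float) -> int:
    """Bits needed to store log_{sqrt(2)}(x) = 2 * log2(x) as an integer."""
    return math.ceil(math.log2(2.0 * math.log2(x)))


if __name__ == "__main__":
    print(log_value_bits(2.6e10))  # 7  (largest prime, B_a)
    print(log_value_bits(1e105))   # 10 (size of F_a(a, b))
```

Adding one bit of margin to each result gives the values 8 and 11 used in the text.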
The values of these parameters are summarized in Table VI. Using these values, the bit length of a packet is $PAC = k + L_{log} + 2m + 10 + 3$, where the constant 10 comes from the footprint information and the last term 3 is for the flags (see Table II).

Formula: By substituting these values into Equation (8), we obtain the following area of YASD1024:
C. Optimizations of Parameters
We optimize the parameters k, m with regard to the area-time product. Table VIII shows concrete values of circuit area, processing time, and AT-product. From these data, we obtain the optimized parameters k = 27 and m = 10. In this case, a single chip is as large as about 422 cm² and requires 10652 years for the sieving. When we use 600 wafers (as assumed for TWIRL1024 [9]), about 18 years will be required.
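The figures in this paragraph can be cross-checked with simple arithmetic. The sketch below assumes one YASD1024 chip per wafer (an assumption suggested by comparing the chip area with the area of a 300 mm wafer, not a statement from the paper) and ideal linear speed-up across devices; all function names are hypothetical.

```python
import math


def wafer_area_cm2(diameter_mm: float = 300.0) -> float:
    """Area of a circular wafer in cm^2."""
    radius_cm = diameter_mm / 20.0
    return math.pi * radius_cm ** 2


def years_with_devices(single_device_years: float, devices: int) -> float:
    """Sieving time assuming perfect parallelism over identical devices."""
    return single_device_years / devices


if __name__ == "__main__":
    chip_cm2 = 42200 / 100.0  # 42200 mm^2 expressed in cm^2
    print(chip_cm2)                                        # 422.0
    print(math.floor(wafer_area_cm2() / chip_cm2))         # 1 chip per wafer by area
    print(round(years_with_devices(10652, 600)))           # 18
```

The result 10652 / 600 ≈ 17.8 matches the "about 18 years" stated above.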
Optimized m is around 10 regardless of k, since a dominant term of the AT-product is
which is minimal when m = 10. A reduction of the constant $1.83 \times 10^{10}$ may lead to a reduction of the AT-product. However, since this constant comes from the size of the factor base, a straightforward reduction seems difficult.
V. CONCLUDING REMARKS
This paper evaluated the performance of YASD for 1024-bit integers. With the optimized parameters, a single YASD requires at least 422 cm² and about 10000 years for the sieving. Only a simple extrapolation was used in this paper: it is implicitly assumed that the wiring of YASD1024 poses no problem. Moreover, we did not consider the mini-factoring problem: after the sieving, candidates must be checked by mini-factoring. In YASD1024, the mini-factoring part may be another dominant step, and further circuitry will be required (as discussed in [2]). Thus, YASD1024 would require considerably more time and circuit area if it were actually manufactured and operated.

APPENDIX
This appendix provides concrete formulas for the processing time and circuit area of YASD for 768-bit integers (YASD768).
A. Processing Time and Circuit Area for YASD768
Similarly to the formulas for YASD1024 in the main text, we obtain the following formulas by substituting the values in Tables 5 and 6 into Equations (9), (10), and (11):
B. Optimizations of Parameters for YASD768
Table IX shows concrete values of circuit area, processing time, and AT-product. From these data, we obtain the optimized parameters k = 23 and m = 8. The optimized m appears to be 8 regardless of k, since a dominant factor of the AT-product of YASD768 is $(190.4m + 8977.6) \cdot 2^m + 9.99 \cdot 10^8$, which is minimal when m = 8.
