Extensible Markup Language (XML) is playing an increasingly important role in web services and database systems. However, XML parsing is often the bottleneck and, as a result, a target for acceleration using custom hardware or multicore CPUs. In this paper, we detail the design of the first complete field programmable gate array (FPGA) accelerator capable of XML well-formed checking, schema validation, and tree construction at a throughput of 1 cycle per byte (CPB). This is a significant advance over 40 CPB, the best previously reported commercial result. We demonstrate our design on a Xilinx Virtex-5 board, where it successfully saturates a 1 Gbps Ethernet link.
INTRODUCTION
Extensible Markup Language (XML) has become a standard for data representation and exchange. It is prevalent in a wide variety of applications such as web services, database systems, content-based routing, and scientific applications, thanks to its platform independence, interoperability and flexibility. As a result, XML processing has become an important workload for web servers, database servers, etc. However, XML parsing consumes a significant portion of the execution time of web servers, and has become a threat to database performance [5] .
XML parsing consists of three major tasks: well-formed checking, which checks the document against syntactic rules; schema validation, which checks the document against semantic rules; and tree construction, which builds the in-memory data structure for further processing. To characterize the performance of XML parsers, the metric of cycles per byte (CPB) is often used. Similar to cycles per instruction (CPI) in computer architecture, CPB counts the average number of cycles used to process each byte of an XML document. Since it is independent of the clock frequency, whose scaling can arguably be enjoyed by all platforms, it is a preferred figure of merit for the parallelism achieved by a design.
Current commercial software XML parsers, such as libxml, Xerces and XML4C, achieve at best a processing rate of 40 CPB on tree construction and 70 CPB on schema validation [4] [5] [11] [23] . A large body of research has been reported, which often exploits the SIMD instruction set extensions of CPUs, or multicore CPUs, to speed up XML processing in software [9] [13] [14] [16] . However, these results are often incomplete, e.g. covering only well-formed checking. While leading IT companies such as IBM, Intel, HP and Dell offer hardware-accelerated solutions for different XML processing tasks, neither performance metrics nor design details have been revealed. The latest commercial result for a full ASIC-based XML accelerator, presumably the one with the highest performance, achieves 10 CPB on well-formed checking, 40 CPB on schema validation, and 20 CPB on tree construction [18] .
In this paper, we present a high performance XML Parsing Accelerator (XPA) capable of performing all three tasks at 1 CPB. More specifically, we make the following contributions. First, we identify recurring computational idioms in XML processing, and devise corresponding hardware structures to implement them efficiently. Second, we devise a speculative pipeline structure such that tree construction can be initiated before the document has been validated. Third, we devise a skewed pipeline structure that achieves high throughput in the common case where the XML document being parsed is correct, and stalls the pipeline for long-latency operations only in uncommon cases. Last but not least, we detail the design of a complete hardware accelerator, which, to the best of our knowledge, has not been found in the literature. Although our design employs many techniques reported elsewhere in other contexts, we believe that a synthesis of these techniques to achieve a record performance milestone is valuable to the community by itself.
We believe our contributions are particularly relevant to FPGAs, beyond the fact that our design is demonstrated on an FPGA platform. First, we take advantage of the availability of on-chip memory resources and bandwidth, as well as the availability of network I/Os and intellectual property cores. Second, as web services evolve at a fast rate, FPGAs present an inherent advantage over ASICs due to their field programmability. Our results show that, through architectural and design innovations, FPGA implementations can outperform existing ASICs. Combined with the fact that web services belong to the low-volume infrastructure market where FPGAs have the economic advantage, we hope our contributions make a case that XML processing is a promising area for FPGAs to win more sockets and expand their market.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA '10, February 21-23, 2010, Monterey, California, USA. Copyright 2010 ACM 978-1-60558-911-4/10/02…$10.00.
The rest of the paper is organized as follows. In Section 2, we review the related work. In Section 3, we describe some background information about XML. In Section 4, we describe our key ideas. In Section 5, we detail our design. In Section 6, we discuss our experimental results.
Related Work
There are two styles of XML parsers, depending on whether an in-memory data structure is constructed for later "random access". The popular style builds the Document Object Model (DOM) [3] tree, a standard data structure for web processing. The less popular but faster style, called Simple API for XML (SAX), relies on the assumption that the XML data can be processed by later stages in the same order in which it is transmitted. It is therefore less flexible and of limited use.
The software community has reported many implementations of XML parsers, with varying styles and compromises. In 2004, Zhang et al. developed the VTD-XML (Virtual Token Descriptor) parser [10] . They employ the concept of binary XML to avoid the performance bottleneck of XML parsing, and achieve a performance of 20 to 27 CPB on tree construction and schema validation. However, binary XML is not an industry standard and their parsed data cannot be used directly by other XML applications. In 2006, Lu et al. presented a parallel approach to XML parsing [9] . Their technique uses a lightweight XML parser to build a skeleton of the XML document in a first parsing pass; the skeleton guides the partition of the document into chunks that can be processed independently on different threads. Using this technique, the parser achieves a tree construction performance of 30 CPB on a 4-core processor. However, the extra skeleton building process, done sequentially, may become a performance bottleneck. In 2006, Kostoulas et al. presented a schema-based XML parsing technique named XML Screamer [14] , which improves performance through schema-dependent compilation and tight integration across layers of software. The parser achieves a performance of 22 to 43 CPB on SAX parsing and schema validation. However, a new parser needs to be generated for each different type of XML document. In 2008, Cameron et al. developed Parabix (parallel bit streams for XML), an open-source non-validating XML parser that exploits the SIMD capabilities of modern commodity processors to process multiple characters at the same time, achieving a performance of 6 to 15 CPB on SAX parsing [13] . However, no in-memory tree was built and schema validation was not implemented.
In the hardware community, Lunteren et al. proposed in 2004 an approach to building an efficient and scalable general purpose state machine for accelerating XML processing [4] . However, no full system was demonstrated. In 2007, Moscola et al. presented a technique to automatically map regular expressions directly onto FPGA hardware and implemented a simple XML parser for demonstration [7] . Their technique is useful but not sufficient on its own, since XML syntax cannot be described by a regular language. In addition, hardware recompilation is required each time it is applied to a different type of XML document. In 2008, Krishnamoorthy presented a hardware XML parser [6] , which imposes constraints on the length and types of tokens. In 2009, Leventhal et al. presented an ASIC-based XML accelerator, which achieves a performance of 20 CPB on tree construction and 40 CPB on schema validation [18] . In addition, there are a number of commercial products provided by leading IT companies, such as IBM's WebSphere DataPower XML accelerator XA35 [26] ; however, neither performance metrics nor design details have been revealed.
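The achieved performance of previous work, along with our proposed design, is summarized in Table 1 ('?' means the data is not reported, and '-' means not implemented).

| Work | Well-formed checking / SAX parsing (CPB) | Tree construction (CPB) | Schema validation (CPB) |
|---|---|---|---|
| Zhang [10] | ? | 20-27 | 20-27 |
| Lu [9] | ? | 27 | 33 |
| Kostoulas [14] | 22-43 | - | 22-43 |
| Cameron [13] | 6-15 | - | - |
| Leventhal [18] | 10 | 20 | 40 |
| MIT-libxml [24] | ? | 64 | 71 |
| XPA | 1 | 1 | 1 |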
Background
XML parsing consists of three major tasks: well-formed checking, schema validation and in-memory data construction. Other XML applications including XSLT, XPATH, XQuery are based on the results of these 3 basic tasks.
Well-formed Checking
The task of well-formed checking is to perform syntax checking on an XML document to ensure that it conforms to the XML syntax rules given in the XML specification [1] . A sample XML document is shown in Figure 1 .
The content of the document is organized in a tree structure with a unique root. Each element is delimited by an opening ('<>') and a closing tag ('</>') and may contain multiple attributes delimited by a space.
<?xml version = "1.0" encoding = "UTF-8" ?>
<!--this is an example xml document -->
<University>
  <Department name = "ECE">
    <Students>
      <freshman>310</freshman>
      <sophomore>298</sophomore>
      <junior>213</junior>
      <senior>178</senior>
      <graduate>86</graduate>
      …
    </Students>
    <Professors>
      <professor name="Mike" field="network"/>
      …
    </Professors>
  </Department>
  …
</University>

Figure 1. A sample XML document.

A well-formed checker scans the characters of an XML document, checks whether the characters are valid, extracts tokens from the scanned characters and performs syntax checking on the extracted tokens. Syntax rules include: a) the opening tag of an element must match its closing tag; b) an attribute name must be unique within its parent element; c) element tags must be properly nested.
Schema Validation
Due to the flexibility of user-defined markup in XML, servers commonly accept only specific types of XML documents that conform to a set of rules described in one of two formats: DTD (Document Type Definition) or its successor XSD (XML Schema Definition) [2] . An example of an XSD file, which is itself an XML file, is shown in Figure 2 .
A schema validator needs to interpret XSD files and apply the rules to the tokens extracted by the well-formed checker. The challenge of schema validation is to select the correct rule to apply to each token out of a set of candidates, and then to validate the token content against the selected rule.
<?xml version ="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="University">
    <xs:complexType>
      <xs:element name="Department" minOccurs="2" >
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Students">
              <xs:complexType>
                <xs:all>
                  <xs:element name="freshman" type="xs:string" />
                  <xs:element name="sophomore" type="xs:string" />
                  <xs:element name="junior" type="xs:string" />
                  <xs:element name="senior" type="xs:string" />
                  <xs:element name="graduate" type="xs:string" />
                </xs:all>
              </xs:complexType>
            </xs:element>
            <xs:element name="Professors" type="professorType"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:complexType>
  </xs:element>
</xs:schema>

Figure 2. A sample XML Schema Definition (XSD) file. Element "University" is defined as a complexType that is only allowed to have "Department" as its child. "Department" requires "Students" and "Professors", in that order. Finally, "Students" may contain "freshman" to "graduate" in any order.
In-memory Data Construction
Given that an XML file can be very large, the DOM representation, which captures the parent-child relationships between elements and attributes (collectively, nodes), must be stored in DRAM. Such a tree data structure requires extra headers with pointers to connect parent, sibling and child nodes. This not only increases the memory footprint, but also causes non-uniform memory accesses, because previously written memory locations must be updated to connect a new node to the rest of the tree. Such accesses might cross DRAM pages and degrade performance.
Key Ideas
In this section, we first identify several recurring computational idioms (fondly referred to as dwarfs in recent literature [21] ). Not surprisingly, in the context of XML processing, these idioms are all related to the processing of strings. Isolating these idioms allows us to devise or choose efficient hardware structures to implement them. We then describe the key architectural decisions by refining a familiar baseline architecture, which ultimately leads to the 1 CPB performance target.
Recurring Idioms

One-to-one String Match
This idiom tests whether a subject string equals a reference string. Because the reference string is already known when the subject string arrives, the matching need not wait until the subject string is present in its entirety. Instead, the matching can be executed in a streaming fashion. This not only achieves the best latency, but also scales well to long, variable-length strings because of its minimal storage requirement.
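The streaming behaviour described above can be illustrated with a minimal software sketch. The class and function names (`StreamingMatcher`, `streaming_equal`) are hypothetical and are not part of the XPA design; the sketch only shows that a position counter and a flag are enough state to decide equality while the subject string is still arriving.

```python
class StreamingMatcher:
    """Compare an arriving subject string against a known reference, one character at a time."""

    def __init__(self, reference: str) -> None:
        self.reference = reference
        self.pos = 0      # index of the next reference character to compare
        self.ok = True    # stays True while every character so far has matched

    def feed(self, ch: str) -> None:
        # Consume one subject character; the subject itself is never buffered.
        if self.ok and self.pos < len(self.reference) and self.reference[self.pos] == ch:
            self.pos += 1
        else:
            self.ok = False

    def finish(self) -> bool:
        # Equal only if nothing mismatched and both strings have the same length.
        return self.ok and self.pos == len(self.reference)


def streaming_equal(reference: str, subject_chars) -> bool:
    """Drive the matcher over any iterable of characters."""
    matcher = StreamingMatcher(reference)
    for ch in subject_chars:
        matcher.feed(ch)
    return matcher.finish()
```

For instance, `streaming_equal("Students", iter("Students"))` consumes the subject from an iterator, so the comparison finishes as soon as the last character has been seen.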
One-to-many String Membership Test
This idiom tests whether a subject string equals any member of a set of reference strings.
Example 2. There are rules in both well-formed checking and schema validation that require an element/attribute name or its value to be unique within a certain range. This is equivalent to asking whether an incoming element/attribute name matches one of the previously seen names.
In general, performing such a test requires a string comparison against every reference string, which can be prohibitively expensive. However, the number of full comparisons can be reduced if one can filter out "obvious" cases, where a simple test can determine that an incoming string does not belong to the set. We employ the concept of the Bloom Filter, which defines a set of independent hash values for each reference string. The set of reference strings is then approximated by a bit vector in which the bits addressed by all hash values of the reference strings are set to '1'. If any hash value of the subject string addresses a bit that is still '0', that is, if inserting the subject string would produce a new '1' in the bit vector, then we can conclude that the subject string does not belong to the set.
One-to-many String Search
This idiom finds a subject string among a set of reference strings. Note that while seemingly similar, the previous idiom only needs to return a binary answer, whereas this idiom effectively performs a lookup into an associative array (dictionary) of strings.
Example 3: During schema validation, each element or attribute needs to search for its corresponding schema rule among a set of candidates.
This idiom is commonly implemented as a hash table in software. We employ the BART scheme [8] , originally proposed in the context of network routing table lookup. Unlike a software hash table, where the lookup time can be nondeterministic in the presence of hash value conflicts, the BART scheme guarantees that the number of conflicts is bounded by a predefined value. Therefore, a string search amounts to only one on-chip memory access and a bounded number of parallel comparisons.
Key Architectural Decisions
Before describing our architectural decisions, it is instructive to describe the naïve baseline architecture shown in Figure 3 . The architecture mimics a textbook decomposition of a compiler front end, and suffers from poor performance even when the individual blocks are pipelined. First, the number of pipeline stages is large, leading to long processing latency. Second, the blocks have widely differing worst-case latencies, leading to poor overall throughput. In the sequel, we describe architectural techniques to improve the baseline architecture.
Speculative Pipeline
While a compiler usually constructs a syntax tree only after the input passes the correctness checks, we choose to construct the DOM tree immediately after lexical analysis, as shown in Figure 4 . This is speculative, since we may construct a tree only to find out later that the document is invalid. Although in this case the tree has to be discarded, this mechanism allows the DOM tree construction stage to run independently of the well-formed checking and schema validation stages, thereby significantly reducing the latency of the accelerator.
Multi-rate Pipeline
Well-formed checking and schema validation differ in processing rate and granularity. Well-formed checking performs syntax checks on each single character, while schema validation validates the semantics of the extracted token flow. Well-formed rules are also simpler than schema rules. To achieve a balanced pipeline design, we devise a 3-level multi-rate pipeline structure as shown in Figure 5 . In the first level, the well-formed rules are checked against each character. In the second level, a rule match unit inside the schema validation stage searches for the corresponding schema rule for each token, which is hashed into a 16-bit integer. In the third level, the rule checking units perform checking on multiple bytes of data simultaneously, so that they have a time budget of multiple cycles while achieving the same throughput as the other stages.
Common Case Optimized Stallable Pipeline
High-bandwidth On-chip Data Structure
To perform schema validation, many rules have to be checked against an XML construct under parsing. Typically, the types of checks need to be encoded in memory. To reduce latency, it is desirable to parallelize the rule checking, which dictates that the encoded rule information needs to be accessible in parallel.
FPGAs offer very large bandwidth on-chip memories. We devised a custom schema rule representation. The schema rules are divided into three portions and distributed into three local memories. Each memory has a wide data bus, allowing a single-cycle access of all schema rules associated with the XML construct under validation.
Final Architecture of the XPA
The final architecture of the XPA, as a result of the above decisions, is shown in Figure 6 . The lexical analysis stage is merged into the well-formed checking stage, since some well-formed rules are also checked during lexical analysis.
Design
This section presents the detailed implementation of each functional unit of the XPA.
Well-formed Checking Stage
Character Scanner Unit
The Character Scanner Unit retrieves data from the Embedded Ethernet MAC (EMAC), and outputs data byte by byte to the next unit in the XPA. The block diagram of Character Scanner Unit is shown in Figure 7 .
A 1 Gbps PHY is connected to the Embedded MAC through an SGMII interface. We implemented a simple UDP receiving logic block to deliver the payload of incoming packets, sent from the host PC, into the parser. In addition, a 1 KB asynchronous FIFO is used to bridge the different clock domains of the Character Scanner Unit and the units that follow it.
Token Extractor Unit
The Token Extractor Unit is responsible for recognizing all the tokens in the input stream. It is implemented as a finite state machine that makes state transitions on valid input characters. In contrast to the state machine of a software parser, the goal of our finite state machine is not to perform the entire well-formed checking but to extract the tokens and output their types together with position signals such as "begin", "enable" and "end". The finite state machine and sample signal behavior are shown in Figure 8 . The core well-formed checking functions are then executed in the Token Handler Unit.
Token Handler Unit
The Token Handler Unit performs a series of operations on each token extracted by the Token Extractor. The main operations include: A) Checking the correct nesting of each element, and the uniqueness of the root element name. B) Checking the uniqueness of each attribute name within every element. C) Generating the type, length and hash code of each token, and passing them down to the schema validation stage through a FIFO. D) Storing useful characters into the XML Cyclic Buffer for schema validation. The first two tasks are described in detail below.
Element Name Correct Nesting Checking
To check the correct nesting of each element, the closing tag of each element needs to be compared with the last opening tag. As described in section 4.1.1, the comparison is carried out on each input character. This task is done with the help of an Element Name Stack. Whenever an element opens, its name is pushed into the Element Name Stack character by character. When it is being closed, one character is popped from the Element Name Stack per cycle and compared with the incoming character. Because the element tags are required to nest properly, a mismatch in the input character of closing element with the output of Element Name Stack always means a violation. The usage example is shown in Figure 9 . 
Figure 9. Example of Element Name Stack operation. When the 'Students' element is being closed, SP starts at 'S' and moves cycle by cycle to '8'. At the end of the matching, the whole element is popped off the Element Name Stack by updating NSP = NSP - 8 - 1 and SP = NSP - 10 - 1.
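The nesting check can be sketched in software as follows. This is an illustrative model under simplifying assumptions, not the hardware implementation: the class `ElementNameStack` and its `starts` list are hypothetical, and the list of start offsets replaces the SP/NSP pointer arithmetic of Figure 9.

```python
class ElementNameStack:
    """Software model of a character stack that holds the names of all open elements."""

    def __init__(self) -> None:
        self.chars = []    # characters of every open element name, outermost first
        self.starts = []   # start offset of each open name (assumed bookkeeping)

    def open_element(self, name: str) -> None:
        # Push the name character by character, as the opening tag streams in.
        self.starts.append(len(self.chars))
        for ch in name:
            self.chars.append(ch)

    def close_element(self, name: str) -> bool:
        """Return True if `name` matches the innermost open element, then pop it."""
        if not self.starts:
            return False                  # closing tag without any open element
        start = self.starts.pop()
        expected = self.chars[start:]
        del self.chars[start:]            # pop the whole name off the stack
        if len(expected) != len(name):
            return False
        # One comparison per incoming character of the closing tag.
        for stored, incoming in zip(expected, name):
            if stored != incoming:
                return False
        return True
```

Because tags must nest properly, only the innermost open name ever needs to be compared, which is why a stack suffices.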
Attribute Name Uniqueness Checking
The uniqueness check requires each attribute name to be compared against multiple previously processed names. This problem is identified as the membership test dwarf in section 4.1.2. We employed the concept of the Bloom Filter [19] [20] and implemented a 3-stage pipeline for this task, as shown in Figure 10 . In the first stage, a HashCode Generator generates k independent hash codes for each attribute name. In the second stage, the k hash codes are used to access k different bits in a bit array. In the third stage, the k fetched bits are examined: if any bit is '0' (the initial value), the attribute name is guaranteed to be unique. Once uniqueness is confirmed, all k corresponding bits in the bit array are set to '1' and the attribute name is stored in the Attribute Name Stack. If all k locations return '1', a violation is possible; the whole pipeline is then stalled and the attribute name is compared, character by character, against each string previously stored in the Attribute Name Stack to rule out a false positive ( Figure 11 ).
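The three stages and the fallback comparison can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than the XPA logic: the salted BLAKE2b hash family, the class name `AttributeUniquenessChecker`, the `stalls` counter, and the reset at every new element are all choices made for illustration.

```python
import hashlib


def hash_indexes(name: str, k: int, m: int) -> list:
    """Stage 1 (assumed hash family): k salted BLAKE2b digests reduced modulo m."""
    data = name.encode("utf-8")
    return [
        int.from_bytes(hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest(), "big") % m
        for i in range(k)
    ]


class AttributeUniquenessChecker:
    def __init__(self, m_bits: int = 1024, k: int = 3) -> None:
        self.m = m_bits
        self.k = k
        self.bits = [0] * m_bits
        self.names = []    # stands in for the Attribute Name Stack
        self.stalls = 0    # how often the slow exact comparison was needed

    def start_element(self) -> None:
        # Assumption: uniqueness is scoped to one element, so state is cleared here.
        self.bits = [0] * self.m
        self.names.clear()

    def check(self, name: str) -> bool:
        """Return True if `name` has not been seen in the current element."""
        idx = hash_indexes(name, self.k, self.m)      # stage 1: hash codes
        fetched = [self.bits[i] for i in idx]         # stage 2: bit array access
        if all(fetched):                              # stage 3: all '1' means possible duplicate
            self.stalls += 1
            if name in self.names:                    # exact comparison removes false positives
                return False
        for i in idx:
            self.bits[i] = 1
        self.names.append(name)
        return True
```

A true duplicate always takes the slow path, so the cost of the exact comparison is paid only for duplicates and for the rare false positives.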
Schema Validation Stage
A valid element/attribute token not only needs to be syntactically correct, but must also have a conforming definition in its XSD file in the correct context. Because schema files change relatively rarely, we first pre-compile the current schema file into a custom local memory format that is efficient for lookup. We use three tables to store the contents: the Rule Header Table (RHT), the Rule Name Table (RNT) and the Rule Content Table (RCT), which hold the tree structure of every rule, the name of each rule and the rule contents, respectively.
Rule Match Unit
The Rule Match Unit is responsible for selecting the corresponding schema rule for each element name and attribute name among a set of candidate rules.

BART is based on a novel hash function with the special property that the maximum number of collisions for any hash index can be limited by a configurable bound P. The hash index is extracted from bit positions within the input hash code, which are selected to realize the maximum collision bound P. The value of the bound P is based on the memory access granularity, to ensure that all collisions for a given hash index can be resolved by a single memory access and by at most P parallel comparisons. A simple illustration is shown in Figure 12 .

Figure 12. Example of the BART scheme. From Figure 1 …
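The idea of selecting bit positions so that no hash index collects more than P entries can be illustrated with a brute-force sketch. This is not the BART algorithm of [8]; it is an exhaustive search written for illustration only, and the function names and the 16-bit default code width are assumptions.

```python
from collections import Counter
from itertools import combinations


def extract_index(code: int, positions) -> int:
    """Build the hash index from the selected bit positions of a hash code."""
    idx = 0
    for p in positions:
        idx = (idx << 1) | ((code >> p) & 1)
    return idx


def choose_positions(codes, index_bits: int, bound_p: int, code_bits: int = 16):
    """Return the first set of bit positions whose largest bucket holds at most bound_p codes."""
    for positions in combinations(range(code_bits), index_bits):
        buckets = Counter(extract_index(c, positions) for c in codes)
        if max(buckets.values(), default=0) <= bound_p:
            return positions
    return None   # no selection of this width satisfies the bound


def build_table(codes, positions) -> dict:
    """Group the codes by hash index; each bucket models one wide memory word."""
    table = {}
    for c in codes:
        table.setdefault(extract_index(c, positions), []).append(c)
    return table


def lookup(table: dict, positions, code: int) -> bool:
    # One table access, then at most P comparisons within the bucket.
    return code in table.get(extract_index(code, positions), [])
```

Because the bound is enforced when the table is built, every lookup has the same worst-case cost, which is what makes the scheme attractive for a fixed-latency pipeline stage.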
The Rule Match Unit consists of a two-stage pipeline in which the first stage selects at most P rules (we chose P = 4 for our design), using the XORed value of the input hash code and a bit mask as an index into the Rule Header 
Rule Check Unit
The Rule Check Unit is responsible for the schema validation of the contents. It is further divided into 2 sub-units: the Rule Name Check Unit and the Rule Content Check Unit. The Rule Name Check Unit verifies that the rule selected by the Rule Match Unit is free of hash code collision errors. The Rule Content Check Unit checks whether the contents of elements and attributes conform to the selected rule.
Rule Name Check Unit
The logic of the Rule Name Check Unit is shown in Figure 14 . When a rule arrives from the Rule Match Unit, the unit starts reading characters from two different local memories: the XML Cyclic Buffer (XCB), which contains the actual string of the input token pushed in by the Token Handler Unit, and the Rule Name Table entry pointed to by RNTAddr in Figure 13 . The two strings are read out and compared 8 bytes at a time to verify the match.
Rule Content Check Unit
The Rule Content Check Unit (Figure 15 ) is responsible for performing schema validation on element and attribute contents as well as checking their arrival sequence. Once the Rule Match Unit finds a rule, this unit fetches the corresponding rule contents (Figure 15 a). The Sequence check block ensures that the sequence of incoming tokens follows the order specified in the XSD (e.g. <xs:sequence> in Figure 2 ). We use a stack to record the latest sequence number at each level of the tree hierarchy and compare it against the SeqNO field in Figure 15 . The Type check block checks whether the pattern of the content is correct. We currently support string, integer, decimal and date. The Range check block checks whether a content value falls within the allowed range, such as "minOccurs" in Figure 2 . Last but not least, the Key check block checks whether a token marked as "key" is unique throughout a document. This problem belongs to the membership test dwarf of section 4.1.2, because each "key" value needs to be compared against all previously parsed "key" values to verify its uniqueness. The same Bloom Filter approach as described in section 5.1.3.2 is used, except that the actual strings of the key contents are stored in DRAM instead of local memory, because the set can grow to thousands of entries (e.g. a list of student numbers).
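The stack-based sequence check can be sketched as follows. The encoding is an assumption made for illustration: each rule is taken to carry a 1-based sequence number (the SeqNO field), and a child is accepted when its number is not smaller than the latest one recorded at its level. The class name `SequenceChecker` is hypothetical.

```python
class SequenceChecker:
    """Track the latest sequence number seen at each open level of the tree."""

    def __init__(self) -> None:
        # One entry per open level; 0 means no child has been seen at that level yet.
        self.last_seq = [0]

    def open_element(self, seq_no: int) -> bool:
        """Return True if an element with this SeqNO may appear at this point."""
        ok = seq_no >= self.last_seq[-1]
        self.last_seq[-1] = max(self.last_seq[-1], seq_no)
        self.last_seq.append(0)           # new level for the children of this element
        return ok

    def close_element(self) -> None:
        if len(self.last_seq) > 1:
            self.last_seq.pop()           # discard the level of the element being closed
```

With "Students" numbered 1 and "Professors" numbered 2 under "Department", a "Students" element that arrives after "Professors" is rejected because 1 is smaller than the recorded 2.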
DOM Construction Stage
The DOM Constructor Unit is responsible for building a DOM tree of the input XML document in DRAM, which can then be used to implement a DOM Application Programming Interface. In order to support the efficient tree operations specified by the industry, the base data structure should contain enough pointers in each node that every part of the XML data is tightly connected. In our current design, a simple and straightforward 32-byte aligned data structure is employed for DOM construction (Figure 16 ). With this data structure, each element name requires format a) as its header and format c) to hold its name string. Each attribute name uses format b) as its header and format c) for its name string. Contents use only format c), with a parent link pointing back to their parents.
The DOM Constructor performs three main tasks: new node allocation, parent update and sibling update. When a new token other than a closing element is parsed, the unit allocates a new node in DRAM in the appropriate format. If the token has a previous sibling at the same level of the hierarchy, the NextSibling pointer in the previous sibling's header is updated in the next cycle. When a closing element is parsed, the ChildList pointer of the corresponding element header is updated if the element is the parent of already parsed nodes. In addition, we keep track of DRAM addresses locally: a stack stores the DRAM addresses of active parent nodes for the parent update, and a register holds the last closed element for the sibling update. Because the DOM Constructor requires no DRAM read operations, the data structure is optimized for write-only access by reducing page crossing.
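The write-only construction can be modelled in software as shown below. This is an illustrative sketch, not the 32-byte node format of Figure 16: the dictionary fields, the class `DomBuilder` and its three local stacks are assumptions. The point is that every pointer update targets an address already held locally, so the tree memory is never read during construction.

```python
class DomBuilder:
    def __init__(self) -> None:
        self.mem = []                 # stands in for DRAM; list index = node address
        self.parents = []             # addresses of the currently open elements
        self.prev_sibling = [None]    # per level: address of the most recent child
        self.first_child = [None]     # per level: address of the first child

    def _alloc(self, kind: str, text: str) -> int:
        addr = len(self.mem)
        parent = self.parents[-1] if self.parents else None
        self.mem.append({"kind": kind, "text": text, "parent": parent,
                         "child_list": None, "next_sibling": None})
        prev = self.prev_sibling[-1]
        if prev is None:
            self.first_child[-1] = addr             # remembered until the parent closes
        else:
            self.mem[prev]["next_sibling"] = addr   # sibling update, a write only
        self.prev_sibling[-1] = addr
        return addr

    def open_element(self, name: str) -> int:
        addr = self._alloc("element", name)
        self.parents.append(addr)
        self.prev_sibling.append(None)
        self.first_child.append(None)
        return addr

    def add_text(self, text: str) -> int:
        return self._alloc("text", text)

    def close_element(self) -> None:
        addr = self.parents.pop()
        self.prev_sibling.pop()
        first = self.first_child.pop()
        if first is not None:
            self.mem[addr]["child_list"] = first    # parent update, a write only
```

Feeding the builder the token stream of Figure 1 yields a tree in which each element's `child_list` points to its first child and siblings are chained through `next_sibling`.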
Evaluation
In this section, we carry out a comparative performance study of our design against 4 publicly accessible software XML processors. We use throughput in both CPB and Gbps as performance metrics. In addition, we examine the implementation cost and speed on the FPGA, as well as scalability issues.
Hardware Experimental Setup
Our design is implemented and tested on a Xilinx Virtex-5 XC5VSX50T FPGA on the ML506 evaluation board. To perform the test in a practical environment, we connect the input of the XPA to a Tri-mode Ethernet MAC, configured to work with a 1 Gbps SGMII PHY device. A simple UDP receiving protocol is used to extract data and commands from incoming UDP packets. The test files are fed from a laptop to the Xilinx board through a 1 Gbps Ethernet link. The output data of the XPA is written to an on-board 256 MB SODIMM DDR2-533 memory module through the Memory Controller (MC). A serial port is integrated to display experimental results. The structure of the XPA test bed is shown in Figure 17 .
Software Experimental Setup
To compare the performance of the XPA against other XML processors, we test 4 software XML parsers with the same set of benchmarks. The parsers are chosen from well-known open-source commercial tools that have the best reported performance according to [24] . The tests of the software XML parsers are carried out under the configurations listed in Table 2 . We used the XML Benchmark Tool [23] from Intel to gather performance results for the 4 software XML parsers. For each benchmark, the XML Benchmark Tool performs multiple iterations of warm-up and test to obtain stable results, so that the overheads of memory loading and operating system management are minimized. All the benchmarks are read from the local hard drive for the software tests.
Benchmarks
The benchmarks are chosen from different XML projects. Each benchmark contains multiple test files from the same project. The file size varies from 3 KB to 116MB. The benchmarks are separated into 2 groups: DOM parsing benchmarks and schema validation benchmarks. Schema validation benchmarks contain one XSD file for each benchmark. Table 3 lists the names of these projects, the maximum sizes of test files as well as the source of the projects. 
Measurement
Throughput
The detailed test results on performance of different XML processors are presented in Table 4 and Table 5 ('-' indicates that certain functionality is not implemented and thus result unavailable). Table 4 lists the throughput of different tests in Gbps. Table 5 presents the same results in CPB.
As illustrated by Table 4 , XPA achieves the raw throughput it is designed for, 1 Gbps. The throughput is in fact bounded by the Ethernet link speed. Since the maximum frequency achieved is 130 MHz, which we did not make an effort to improve further, the actual raw throughput can be slightly higher: 1.04 Gbps. Note that although the software parsers run on processors with a clock frequency of 2.5 GHz, XPA is still faster. For the parsing benchmarks, which involve only well-formed checking and tree construction, XPA outperforms the best performing software parser (JAXP) by 2.8 times.
For the much more difficult validation benchmarks, XPA outperforms the best performing software parser (libxml) by 3.7 times. As illustrated by Table 5 , XPA outperforms the other tested software XML processors by more than 66 times in terms of CPB, which illustrates the potential of the XPA architecture when implemented in an ASIC with more aggressive frequency optimizations.
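The relation between clock frequency, CPB and throughput used above can be checked with a short calculation. The helper names are hypothetical; the second helper simply inverts the first.

```python
def throughput_gbps(clock_hz: float, cpb: float) -> float:
    """Throughput in Gbps for a design that spends `cpb` cycles on each byte."""
    bytes_per_second = clock_hz / cpb
    return bytes_per_second * 8 / 1e9


def cpb_from_throughput(clock_hz: float, gbps: float) -> float:
    """Cycles per byte implied by a measured throughput at a given clock."""
    bytes_per_second = gbps * 1e9 / 8
    return clock_hz / bytes_per_second


# 130 MHz at 1 CPB corresponds to the 1.04 Gbps quoted in the text.
xpa_raw = throughput_gbps(130e6, 1.0)

# Illustrative only, not a measurement: a 2.5 GHz processor running at the
# 40 CPB cited in the introduction would deliver 0.5 Gbps.
software_example = throughput_gbps(2.5e9, 40.0)
```

The calculation also shows why CPB isolates architectural efficiency: a design at 1 CPB needs only a 125 MHz clock to saturate a 1 Gbps link.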
Stall Rate
To further understand the performance of the XPA, Table 6 lists the statistics of pipeline stalls in each parsing stage and memory controller (only the maximum number of stalls for each benchmark is shown). In addition, the average memory bandwidth requirement for each benchmark is also shown in Table 6 . All observed stalls occur in the DOM construction stage. This is because the DOM constructor often needs to generate multiple write requests at the same clock cycle on cases when multiple pointers in a DOM tree need to be updated. Normally the extra requests are buffered. When these cases happen too close to each other, the buffer might become full. However, these types of stalls do not happen frequently as illustrated by Table 6 : 1263 stalls occur for a 116 MB input file, which contributes to tiny portion of the whole processing time.
No stall is observed in the Memory Controller either, thanks to the large command FIFO deployed in the Memory Controller and the high performance of DDR2 memory. For each benchmark, the memory bandwidth requirement is calculated by counting the number of memory accesses. As shown in Table 6, the average memory bandwidth requirement across all benchmarks is 908 MB/s. Because DDR2-533 memory has a maximum available bandwidth of 4.2 GB/s [25], it is sufficient to serve the memory requests generated by the DOM constructor. Therefore, the Memory Controller is not likely to generate a stall.
Area and Clock Frequency
The device utilization of our design is shown in 
Scalability Study
In this section, we study the sensitivity of various design parameters to XML file sizes and characteristics, to ensure the robustness of our design.
Bloom Filter Requirement
The Bloom Filter is one of the key enabling techniques of our design. However, its false positive rate also has a great impact on scalability. It is therefore important to examine the requirements for achieving a low false positive rate.
The false positive rate of the Bloom Filter depends on the size of the tested set n, the size of the bit array m, and the number of independent hash functions k. A false positive occurs when all k hashed locations are equal to 1. Its probability p can be calculated using the following equation, as presented in [20]:

$$ p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k} $$
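The dependence on n, m, and k can be explored numerically. The Python sketch below evaluates the standard textbook expression for the Bloom filter false positive probability, (1 - (1 - 1/m)^(kn))^k, together with its usual exponential approximation. The value n = 8 attribute names per element is a hypothetical choice for illustration and is not taken from the measurements.

```python
import math


def bloom_fp_exact(n: int, m: int, k: int) -> float:
    """Probability that all k probed bits are set after inserting n items into m bits."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def bloom_fp_approx(n: int, m: int, k: int) -> float:
    """Common approximation of the same probability, (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


if __name__ == "__main__":
    n = 8  # hypothetical number of attribute names on one element
    for m in (64, 256, 1024, 2048):
        for k in (2, 3):
            print(f"m={m:5d} k={k}  exact={bloom_fp_exact(n, m, k):.3e}  "
                  f"approx={bloom_fp_approx(n, m, k):.3e}")
```

The sketch only reproduces the theoretical trend (a larger bit array or, for lightly loaded filters, an extra hash function lowers the rate); it does not model the hash functions used in the hardware.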
For every false positive, we assume an overhead of 100 clock cycles for performing the real string comparison. We want fewer than 10 false positives, so that the extra cycles can be absorbed by the 1 KB buffer in the Character Scanner Unit (at 1 CPB, 10 × 100 = 1,000 extra cycles correspond to roughly 1,000 buffered input bytes). A reasonable test case is one in which attribute tokens make up 25% of all tokens and each token of any type has an average size of 4 characters. A 100 KB file would then require a practical false positive rate of:
(number of false positives / number of attribute name tokens) = (10/((100 K/4)*25%))=0.01%
This means that for every 10,000 attribute name tokens there should be fewer than 1 false positive. Therefore, the test results illustrated by Table 7 show that a configuration of a 1-kb bit array with 3 hash functions or a 2-kb bit array with 2 hash functions should be practical enough for the Attribute Name Uniqueness test task.

Configuration (bit array size_number of hash functions) and results for the six tests:
64b_2h     1    66   509     6   129   502
256b_2h    0     5    60     1     8    56
256b_3h    0     0    14     1     3     9
1kb_2h     0     1     6     1     2     2
1kb_3h     0     0     1     0     0     0
2kb_2h     0     0     1     0     0     0
2kb_3h     0     0     0     0     0     0

Table 8 also shows that increasing the bit array size from 64 bits to 256 bits reduces the false positive rate of all tests by about 10 times. In addition, increasing the number of independent hash functions from 2 to 3 reduces the false positive rate by more than 5 times in most test cases.
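The interplay between the Bloom Filter and the fallback string comparison can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the hardware design: the class name, the use of salted BLAKE2b as a stand-in for k independent hash functions, and the exact-name set used for the slow path are all illustrative choices.

```python
import hashlib


class AttributeNameChecker:
    """Software model of a Bloom-filter-based attribute name uniqueness test."""

    def __init__(self, m_bits: int = 1024, k_hashes: int = 3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0            # bit array stored in a Python integer
        self.names = set()       # exact names, consulted only on a Bloom hit
        self.slow_path_checks = 0
        self.false_positives = 0

    def _positions(self, name: str):
        # Assumption: salted BLAKE2b stands in for k independent hardware hashes.
        for i in range(self.k):
            digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8,
                                     salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, name: str) -> bool:
        """Return True if `name` is new for the current element, False if duplicated."""
        positions = list(self._positions(name))
        bloom_hit = all((self.bits >> p) & 1 for p in positions)
        if bloom_hit:
            # Only a hit requires the expensive real string comparison.
            self.slow_path_checks += 1
            if name in self.names:
                return False
            self.false_positives += 1
        for p in positions:
            self.bits |= 1 << p
        self.names.add(name)
        return True

    def reset(self) -> None:
        """Clear the state when the next element starts."""
        self.bits = 0
        self.names.clear()


if __name__ == "__main__":
    checker = AttributeNameChecker(m_bits=1024, k_hashes=3)
    for attr in ("id", "class", "id"):
        print(attr, checker.add(attr))
```

In this model a Bloom miss proves that the name is new, so the costly comparison runs only on hits; a hit that turns out not to be a duplicate is counted as a false positive.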
On-chip Storage Requirement
In this section, we analyze the scalability of XPA in terms of the on-chip memory required to process XML files of different sizes.
The required size of the Element Name Stack is determined by the depth of the XML document tree, and the size of the Attribute Name Stack is determined by the largest number of attributes that any one element has in a document. Neither is likely to scale with XML file size.
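The two quantities that drive these stack sizes can be measured for any sample document. The following Python sketch uses the standard library's streaming parser to report the maximum nesting depth and the largest attribute count on a single element; the function name and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def stack_requirements(xml_bytes: bytes):
    """Return (maximum element depth, maximum attributes on one element)."""
    depth = max_depth = max_attrs = 0
    stream = io.BytesIO(xml_bytes)
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            depth += 1
            max_depth = max(max_depth, depth)
            # Attributes are already available when the start tag is reported.
            max_attrs = max(max_attrs, len(elem.attrib))
        else:
            depth -= 1
            elem.clear()  # drop finished subtrees to keep memory use low
    return max_depth, max_attrs


if __name__ == "__main__":
    sample = b'<a x="1"><b y="1" z="2"><c/></b><d/></a>'
    print(stack_requirements(sample))
```

Running such a measurement over documents of increasing size is one way to check that depth and per-element attribute count stay roughly constant while file size grows.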
The Schema Rule Memory used in the schema validation stage consists of the Rule Header Table, the Rule Name Table, and the Rule Content Table. To support various types of XML documents, the Schema Rule Memory needs to be large enough to store their XSD files. A typical schema file such as the XHTML schema requires less than 70 KB. Thus, the storage requirement of the Schema Rule Memory is not likely to become a limit.
Conclusion
In this paper, we present an innovative XML processing architecture and design that achieves 1 CPB performance on both tree construction and schema validation with very good scalability. The architecture is implemented on a Virtex-5 FPGA board and successfully saturates a 1 Gbps Ethernet link when running at a 125 MHz clock frequency. With our demonstration, we believe FPGAs can become a valid contender in winning the enterprise XML processing sockets.
Limitations
We acknowledge the following omissions in our design, made in the interest of time. First, our token extractor does not handle the full UTF-8 character set and supports only the ASCII character set. Second, our schema validation does not yet handle full regular expression checking. It can be argued that neither feature is likely to be a performance bottleneck, as efficient implementations have been demonstrated elsewhere [7] [12] [22].
ACKNOWLEDGEMENTS
