Giving Text Analytics a Boost by Polig, Raphael et al.
arXiv:1806.01103v1 [cs.DC] 25 Apr 2018
© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing
this material for advertising or promotional purposes, creating new collective works, for resale
or redistribution to servers or lists, or reuse of any copyrighted component of this work in other
works.
DOI Link: https://doi.org/10.1109/MM.2014.69
IEEE MICRO BIG DATA 2
Giving Text Analytics a Boost
Raphael Polig, Kubilay Atasu, Laura Chiticariu, Christoph Hagleitner, H. Peter
Hofstee, Frederick R. Reiss, Eva Sitaridi, Huaiyu Zhu
Abstract—The amount of textual data has reached a new scale
and continues to grow at an unprecedented rate. IBM’s SystemT
software is a powerful text analytics system, which offers a
query-based interface to reveal the valuable information that lies
within these mounds of data. However, traditional server architectures
are not capable of analyzing the so-called “Big Data”
in an efficient way, despite the high memory bandwidth that is
available. We show that by using a streaming hardware accel-
erator implemented in reconfigurable logic, the throughput rates
of the SystemT’s information extraction queries can be improved
by an order of magnitude. We present how such a system can
be deployed by extending SystemT’s existing compilation flow
and by using a multi-threaded communication interface that can
efficiently use the bandwidth of the accelerator.
Index Terms—Text analytics, Big Data, field programmable gate
arrays, heterogeneous systems, hardware accelerators
1 INTRODUCTION
We all compose text messages in our daily lives. We send emails to our colleagues
and share our movie reviews on social media platforms;
some of us write medical reports or
publications like this one. Moreover, machines
often produce logs that can be easily consumed
by a human, but the unstructured or
semi-structured nature of these messages poses
a challenge for a compute system. Within all
these messages lie pieces of information that
scientists, doctors or marketeers would like to
extract and work with [1]. The amount of data
has reached an enormous volume and continues
to double each year [2], and generating value
from it is a key competitive advantage.
R. Polig, K. Atasu and C. Hagleitner are with IBM Research - Zurich.
L. Chiticariu, F. Reiss and H. Zhu are with IBM Research - Almaden.
E. Sitaridi is with Columbia University, New York (work performed
while at IBM Research - Almaden).
H. P. Hofstee is with IBM Research - Austin.
Information extraction is the task of extract-
ing desired information from textual data and
transforming it into a tabular data structure.
A number of frameworks exist to perform this
task, like the open-source applications GATE [3]
and NLTK [4]. IBM’s SystemT software [5] cou-
ples a declarative rule language with a mod-
ular runtime based on relational algebra, aug-
mented with special operators for information
extraction primitives such as regular expres-
sions and gazetteers. This approach improves
the expressive power of the rule language,
while enabling cost-based rule optimization
that significantly improves extraction through-
put. The desired information to be extracted can
be formulated as a query written in an annotation
rule language called AQL, which is similar
to SQL but also includes text-specific operators.
The AQL query gets compiled into an operator
graph (AOG) which can be executed by the
SystemT runtime on a given set of documents.
A user typically creates and refines an AQL
query in a development environment running
on a set of sample documents before deploying
the query on a compute cluster.
The SystemT software uses a document-
per-thread execution model, enabling each
software thread to work on an independent
document in parallel. A similar approach is
taken also by the GATE software through the
GATECloud.net service, which enables deploy-
ment of an annotation pipeline on a compute
cloud. Measurements have shown an up to
ten-fold speedup [6] compared with a single-node
server system. However, the experiments
nearly doubled the CPU time, which makes the
efficiency of such an approach questionable.
The mismatch between the modern scale-
out workloads and the existing server processor
designs is significant [7]. These workloads often
cannot make use of or do not profit from fea-
tures, such as wide instruction windows, cache
coherence, and out-of-order execution. As a re-
sult, modern server processor architectures use
only a fraction of their available internal and
external memory bandwidth when executing
such tasks [8]. Although text analytics might
not be the classic scale-out workload, it has
similar symptoms when deployed.
To overcome this inefficiency in mod-
ern server processors, two main trends have
emerged in recent years. The first one is the
use of many simple parallel processing cores
either at the chip level [9] or at the node level in
the form of micro-servers [10]. A typical scale-
out workload executes simple instructions prof-
iting from small and efficient cores, while many
cores operate independently as there is little
or no data dependency. Another trend is the
use of specialized and heterogeneous architec-
tures [11], [12], such as system-on-chip proces-
sors in mobile devices or network processors
in the telecommunication industry. These archi-
tectures either have custom instruction sets or
include dedicated accelerators that are tailored
to an application domain.
Dedicated hardware accelerators can yield
high performance and efficiency gains, but of-
ten lack flexibility when different or new tasks
need to be executed. On the one hand, a text
analytics query remains unchanged for a long
period of time, and operates on large volumes
of data. On the other hand, the query is hand-
crafted by a domain expert and can become
very complex. A fixed architecture might not
be flexible enough to execute new and complex
queries. Thus, the text analytics system must
provide the flexibility of processing arbitrary
text analytics queries while identifying and ac-
celerating bottleneck operations to improve the
overall efficiency and the processing rates.
In this work, we propose a reconfigurable
accelerator to accelerate text analytics queries.
The main contributions of this work are:
1) a deployment flow for a hardware-
accelerated text analytics system that
exploits the reconfigurability of field
programmable gate arrays (FPGAs) to
adapt to a wide range of text analytics
queries;
2) a multi-threaded hardware-software in-
terface to support scale-out systems
that operate on streams of text docu-
ments;
3) implementation and evaluation of the
proposed flow on real text analytics
queries estimating an up to 16-fold
speed-up with respect to the multi-
threaded software implementation.
2 RELATED WORK
The use of hardware accelerators for efficient
query processing has been explored by several
research groups. Such approaches have also
been incorporated into commercial appliances.
One of the earliest examples of such an ap-
proach is given in [13], in which Kung and
Lehman described systolic array-based accelerators
for relational algebra operations. More recently,
Mueller et al. proposed a query compiler
that produces FPGA bitstreams for complex
event detection queries that consist mainly of
relational algebra operations [14].
Dennl et al. propose a system that en-
ables on-the-fly composition of FPGA-based
SQL query accelerators by combining a static
stream-based communication interface and par-
tially relocatable module libraries on the
FPGA [15]. Such an approach enables creation
of FPGA bit streams for dynamically changing
relational queries without going through time-
consuming FPGA synthesis tools. Sukhwani
et al. [16] describe an FPGA-based accelera-
tor engine for database processing that offers
a software-programmed interface to eliminate
the need for FPGA reconfiguration. Chung et
al. [17] present a query compiler for a domain-
specific language called LINQ that can be
mapped to accelerator templates. Wu et al. [18]
describe a programmable hardware accelerator
for range partitioning that is directly attached
to a CPU core. The accelerator operates in
a streaming fashion, but only accelerates the
range partitioning step of query processing.
Fig. 1. Example of partitioning an operator graph (a) into a supergraph
that is executed by the runtime (b) and an accelerated
subgraph that gets compiled into a hardware netlist (c).
Our approach is inspired by IBM’s PureData
System [19], which attaches FPGAs directly
to storage devices to deal with large volumes
of data. Although our accelerator architecture
uses a shared-memory setup, a direct I/O at-
tachment can be beneficial for specific use-
cases, e.g. when documents are read from a
database. To the best of our knowledge, our
work is the first to produce FPGA-based accel-
erators that support a combination of informa-
tion extraction operations (i.e., regular expres-
sions) and relational algebra operations.
3 A RECONFIGURABLE ACCELERATOR
Our system improves information extraction
throughput by executing selected operators on
a reconfigurable device, such as an FPGA. One
advantage of a reconfigurable device is that
once it has been configured, it does not require
any instructions to execute its tasks. The only
data that is required to be transferred between
the memories and the FPGA is the actual data
to be processed, together with some negligible
control information. In the case of text analyt-
ics applications, which are typically applied to
large volumes of data, the same query is run
for several hours or several days. Thus, fast
reconfiguration capability is not needed.
Another advantage of FPGAs is their capa-
bility to compute in space. On the one hand,
a reconfigurable device can implement a deep
custom pipeline working on different data sets
at different stages. On the other hand, mul-
tiple parallel instances can operate simultane-
ously on the same data set executing different
tasks, such as our architectures for the extrac-
tion operators [20], [21]. This high degree of
parallelism makes up for the comparably low
clock frequencies FPGAs provide. By moving
not only single operators to the FPGA but also
larger subgraphs of the operator graph, the par-
allelism can be fully exploited and the amount
of communication between the software-based
operators and the hardware-accelerated opera-
tors can be minimized.
Fig. 1 illustrates how an operator graph (a)
can be partitioned into a supergraph (b) and
a hardware-accelerated subgraph (c). Operators
that are moved to the accelerator are removed
from the original operator graph and replaced
with a new subgraph operator. It is also possi-
ble to extract multiple independent subgraphs
that can be executed in parallel or in sequence
on the FPGA for the same or a different set of
text documents. In this way, most of the unnec-
essary data gets filtered out before reaching the
software modules running on the server pro-
cessors, which greatly improves the processing
rates. In this work, we have used the concept
of maximal convex subgraphs [22] to identify
the subgraphs that are maximal in size and that
can be atomically executed without processor
intervention.
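The convexity condition can be made concrete with a small sketch. A subgraph is convex if no path between two of its nodes passes through a node outside it, which is what allows it to run on the accelerator without processor intervention. The Python below only checks this property for a candidate set of operators; it is not the enumeration algorithm of [22], and the graph encoding and operator names are hypothetical.

```python
from collections import deque

def is_convex(edges, subset):
    """Return True if no path leaves `subset` and later re-enters it.

    edges: dict mapping each node to a list of successor nodes (a DAG).
    subset: set of nodes proposed for offloading.
    """
    subset = set(subset)
    # Start from every outside node that directly follows a subset node.
    frontier = deque(
        succ
        for node in subset
        for succ in edges.get(node, [])
        if succ not in subset
    )
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        for succ in edges.get(node, []):
            if succ in subset:
                # An outside node feeds back into the subset: not convex.
                return False
            if succ not in seen:
                seen.add(succ)
                frontier.append(succ)
    return True

# Hypothetical operator graph: two extractors feed a join, whose output
# passes through a software-only filter before being consolidated.
graph = {
    "regex": ["join"],
    "dictionary": ["join"],
    "join": ["udf_filter", "consolidate"],
    "udf_filter": ["consolidate"],
    "consolidate": [],
}

convex_candidate = is_convex(graph, {"regex", "dictionary", "join"})
non_convex_candidate = is_convex(
    graph, {"regex", "dictionary", "join", "consolidate"}
)
```

In this example the first candidate is convex, whereas adding `consolidate` makes the set non-convex, because the path through the software-only `udf_filter` would force a round trip to the processor.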
To automate the generation of query-specific
accelerators, we have extended the compilation
flow of SystemT. Fig. 2 shows the acceleration
flow added to the original SystemT text ana-
lytics system. The AQL query gets compiled
into the operator graph, which is further pro-
cessed by the original SystemT optimizer. Be-
fore deploying the operator graph, we perform
a partitioning step that generates the software
supergraph and the subgraphs that are run on
the FPGA. We have also developed a query
compiler [23] that uses a set of configurable
operator modules which can be linked using an
elastic interface to generate a streaming hard-
ware design for a given subgraph.
The document is processed on the FPGA
as a sequence of ASCII characters and is the
only variable-length data structure used.
Fig. 2. Extending SystemT’s compilation flow to support FPGA-based
hardware accelerators.
The main data structure used is a so-called span that
defines a segment within the document text. A
span is composed of a start and an end offset,
both of which are represented as 32-bit integers.
Additional data types are integers, floats, and
Booleans. The same type of operator can have
different input schemas consisting of
different numbers and types of fields. However,
all of these schemas are known at compile time,
and our compiler generates a custom operator
for each node in the operator graph.
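To illustrate the span data type, the following Python sketch represents a span as a pair of offsets and produces spans by running a regular expression over a document. It is only a software analogy of an extraction operator: the half-open offset convention, the unsigned 32-bit range check, and the names `Span`, `extract_spans`, and `span_text` are assumptions made for this example.

```python
import re
from typing import List, NamedTuple

class Span(NamedTuple):
    """A segment of the document text, given as [begin, end) offsets."""
    begin: int
    end: int

# Assumption: offsets are treated as unsigned 32-bit integers.
MAX_OFFSET = 2**32 - 1

def extract_spans(pattern: str, text: str) -> List[Span]:
    """Run one regular expression over the whole document, return spans."""
    spans = []
    for match in re.finditer(pattern, text):
        if match.end() > MAX_OFFSET:
            raise ValueError("offset does not fit in 32 bits")
        spans.append(Span(match.start(), match.end()))
    return spans

def span_text(span: Span, text: str) -> str:
    """Recover the covered text; a span itself stores only offsets."""
    return text[span.begin:span.end]

document = "Call 555-0100 or 555-0199 today."
phone_spans = extract_spans(r"\d{3}-\d{4}", document)
```

Because a span carries only two fixed-width offsets, every tuple flowing between operators has a fixed size, and the document text remains the only variable-length item.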
The compiler leverages the possibility to
implement a large set of operators in streaming
fashion when the input data is sorted in a
certain direction. Sorting itself is a blocking op-
eration, but many operators produce sorted or
nearly sorted output data naturally. By adding
simple sorting buffers or configuring preceding
operators properly, the compiler ensures the
streaming operation of the accelerator. After
the compiler generates the hardware descrip-
tion, it is synthesized and the configuration is
loaded onto the FPGA. The supergraph will be
executed by the SystemT runtime on the host
CPU, whereas the subgraphs are run on the
reconfigurable accelerator.
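The idea of a simple sorting buffer for nearly sorted data can be sketched in software as follows. This is a minimal illustration and not the hardware design: the function name `sorting_buffer`, the `depth` parameter, and the sample spans are hypothetical. The buffer restores full order as long as no item arrives more than `depth` positions away from its sorted position, so it never has to block on the whole input.

```python
import heapq
from typing import Iterable, Iterator, Tuple

Span = Tuple[int, int]  # (begin, end) offsets

def sorting_buffer(stream: Iterable[Span], depth: int) -> Iterator[Span]:
    """Reorder a nearly sorted stream while keeping `depth` items buffered.

    The output is fully sorted provided that no item is displaced by more
    than `depth` positions from where it belongs in sorted order.
    """
    heap = []
    for item in stream:
        heapq.heappush(heap, item)
        if len(heap) > depth:
            # The smallest buffered item can no longer be overtaken.
            yield heapq.heappop(heap)
    # End of stream: drain whatever is still buffered, in order.
    while heap:
        yield heapq.heappop(heap)

# Hypothetical operator output in which each span is at most one
# position out of order.
nearly_sorted = [(0, 4), (10, 14), (5, 9), (15, 20), (12, 18)]
in_order = list(sorting_buffer(nearly_sorted, depth=1))
```

A shallow buffer suffices here because the disorder is bounded; a fully unsorted stream would need a buffer as large as the input, which is the blocking case the compiler tries to avoid.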
The SystemT runtime uses multiple worker
threads, all of which execute the same su-
pergraph on different documents. When a
worker thread reaches a subgraph operator,
it signals that to a dedicated communication
thread, which coordinates the data transfers
between the runtime and the FPGA. Because of
the document-per-thread execution model, we
put the worker thread to sleep while the subgraph
is being executed on the accelerator. To
keep the CPU cores from idling, a large number
of worker threads is run in parallel to hide
the execution time of the FPGA. Ideally, the
reconfigurable device would have a very low
latency when accessing the data after receiving
the instruction to execute its configured sub-
graph. However, traditionally, FPGA accelera-
tors are attached via the system bus and access
the data via DMA transfers, which have an at
least three- to four-fold higher memory access
latency than the processor itself [24].
Our accelerators use the load-store units
of an early version of the coherent accelerator
processor interface (CAPI) [25]. A service layer
implemented on the FPGA enables the accel-
erators to access the processor’s main memory
and operate in a common virtual address space
with the applications running on the processor.
The address translation is software-based in the
system that is available to us, and occurs within
our communication thread, resulting in an ad-
ditional communication overhead. To minimize
the impact of this overhead, larger data blocks
(> 1000 bytes) should be transferred at once
to fully use the system bus bandwidth. There-
fore, the communication thread collects the data
submitted by some of the worker threads and
generates a larger combined work package. It
then sends the data to the accelerator’s work
queue and starts again to check for submissions
from the worker threads. When the FPGA fin-
ishes working on a work package, it signals
it via a status register to the communication
thread, which wakes up the software threads
that belong to this work package.
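The batching performed by the communication thread can be sketched as below. This is a simplified, single-threaded illustration under stated assumptions: the 1000-byte threshold comes from the text, but the grouping policy, the `Submission` layout, and the name `build_work_packages` are hypothetical, and the work queue, status register, and thread wake-up are left out.

```python
from typing import Iterable, Iterator, List, Tuple

# Threshold taken from the text; the exact policy is an assumption.
MIN_PACKAGE_BYTES = 1000

Submission = Tuple[int, bytes]  # (worker thread id, document bytes)

def build_work_packages(
    submissions: Iterable[Submission],
    min_bytes: int = MIN_PACKAGE_BYTES,
) -> Iterator[List[Submission]]:
    """Group worker submissions until a package exceeds `min_bytes`."""
    package: List[Submission] = []
    size = 0
    for worker_id, document in submissions:
        package.append((worker_id, document))
        size += len(document)
        if size > min_bytes:
            yield package
            package, size = [], 0
    if package:
        # Flush the remainder so that no worker waits indefinitely.
        yield package

# Eight hypothetical workers, each submitting one 256-byte document.
packages = list(build_work_packages((i, bytes(256)) for i in range(8)))
```

Keeping the worker ids in each package mirrors the requirement that, once the FPGA reports completion, exactly the threads belonging to that package are woken up.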
4 EXPERIMENTS
We carried out a number of performance and
profiling experiments on an IBM POWER7
server running at 3.55 GHz, with 64 logical
threads and 64 GB of DDR3 memory. We
synthesized the accelerator designs for an Al-
tera Stratix IV FPGA running at 250 MHz. The
FPGA was attached to a proprietary bus inter-
face that is capable of 2.5 GB/s DMA transfers.
4.1 Software measurements
We evaluated five customer queries that we
ran over the same set of input documents.
Fig. 3. Communication scheme using multiple SystemT software
threads. The communication thread orchestrates the transfers
between the SystemT runtime and the hardware accelerator.
Fig. 4. Relative time spent on executing different operators for
five real-life text analytics queries.
The SystemT profiler captures the time spent at each
operator and accumulates it over the total run-
time. From these numbers we derived a relative
distribution to get comparable profiles of our
testcases, as shown in Fig. 4. Queries T1 to T4
are dominated by the processing time spent
on extraction operators (RegularExpression &
Dictionaries), whereas query T5 spends more
than 80% of its time in relational operators.
Extraction primitives operate across the en-
tire document, whereas relational operators
usually work on the results produced by ex-
tractors. The extraction operators are typically
the slowest operations in software. As a result,
the throughput for testcase T5 is higher than for
T1-T4. Fig. 5 shows the system throughput for
all testcases running with different numbers of
threads. Initially, the throughput scales nearly
linearly with the number of threads before
starting to roll off at eight. Surprisingly, the
throughput increases again strongly between 32
and 40 worker threads.
Fig. 5. Throughput of the original software vs. the number of
threads for 256-byte documents.
This behavior appears
to be caused by the operating system scheduler,
which uses all logical threads on one processor
before moving on to another one.
4.2 Accelerator measurements
As the query profiles show, a significant
amount of time is spent on extraction operators
that operate on the entire document data. As a
result, we have optimized our HW-SW interface
for this type of input so that the extraction
operators on the FPGA determine the max-
imum achievable throughput rate regardless
of the subgraph configured on the accelerator.
A significant backpressure from the relational
operators was never observed in our test cases,
and could be removed by using shallow buffers
at critical stages. In our experiments, we mea-
sured the throughput rate for an accelerator
with four parallel streams and a maximum
peak bandwidth of 500 MB/s.
Fig. 6 shows the measured throughput rate
for different document sizes, which are submit-
ted by parallel SystemT worker threads to the
interface. We observe that we achieve the peak
bandwidth when using document sizes of 2 kB
or larger. News entries typically have a few kBs
of text, and thus can be processed at the peak
bandwidth of the accelerator. In contrast, when
using 128-byte-sized documents, the through-
put diminishes by a factor of ten, and when
using 256-byte-sized documents, the through-
put diminishes by a factor of five even though
the communication thread combines small documents
into a larger work package.
Fig. 6. Throughput of the FPGA executing all extraction operators
of query T1 using four parallel text streams for different
document sizes.
Although
these sizes are not those of a
typical text document, they are representative
of the typical size of Twitter messages and RSS
feeds.
5 ANALYSIS
Our existing implementation of the SystemT
runtime is not capable of executing the gener-
ated supergraph indicated by the dashed line
in Fig. 2. Therefore we estimate the achievable
overall system bandwidth by analyzing the re-
sults from section 4. We observe that the run-
time of most queries is dominated by extraction
type operators consuming up to 82% of the
overall runtime. As all of these operators op-
erate on the same document data source, they
are an ideal candidate for acceleration on the
reconfigurable device, where they can operate
in parallel on a single document pass. Together with the
relational operators that are supported
for hardware processing, the offloadable share rises to as much as 97% of
the total runtime.
The software throughput rate varies with
the profile of the query, whereas the hardware
throughput is determined by the input operator
of the subgraph. We chose to always
offload the extraction operators, which allowed
us to focus on the document data transfers.
The document size has a significant impact
on the throughput of the accelerator as Fig. 6
shows. Although the peak bandwidth can only
be reached by using larger documents, the
throughput rate for smaller documents is still
much higher than that of the pure software.
Fig. 7. Throughput using 64 software threads and estimated
throughput when executing the extraction operators, a single
subgraph or multiple subgraphs on the accelerator for 256 and
2048 byte documents.
We estimate the overall system throughput
using (1), in which we add the remaining
time spent on software processing, $rt_{SW}$,
to the time spent on the accelerator using the
measured throughput rates $tp_{SW}$ and $tp_{HW}$. The
interface cost is included in our measurements
for the accelerator throughput and does not
need to be added as an extra penalty. We esti-
mate the throughput achieved 1) by offloading
only the extraction operations to the FPGA, 2)
by offloading a single maximal convex sub-
graph that contains all extraction operations
and as many hardware-supported operators as
possible, and 3) by offloading all hardware-
supported operators to the FPGA using mul-
tiple maximal convex subgraphs. In the first
two cases, the estimations we present are pes-
simistic because we do not take into account po-
tential processing overlaps between the FPGA
and the CPU. In the third case, our estimations
are optimistic because we do not take into
account the communication overhead incurred
by the additional subgraphs. Fig. 7 summa-
rizes our estimations when using 64 software
threads, four hardware streams and average
document sizes of 256 or 2048 bytes.
\[ tp_{est} = \frac{1}{\dfrac{1}{tp_{HW}} + \dfrac{rt_{SW}}{tp_{SW}}} \tag{1} \]
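As a worked illustration of (1), the Python sketch below evaluates the estimate for one set of inputs. It assumes that the remaining software time is expressed as a fraction of the original software runtime, which is our reading of the definition. Of the example inputs, 500 MB/s is the measured peak accelerator bandwidth and 0.18 is what remains when 82% of the runtime is offloaded, whereas the software throughput of 50 MB/s is a hypothetical value and not a measurement.

```python
def estimated_throughput(tp_hw: float, tp_sw: float, rt_sw: float) -> float:
    """Evaluate (1): accelerator time plus remaining software time.

    tp_hw: accelerator throughput (for example in MB/s)
    tp_sw: software-only throughput (same unit as tp_hw)
    rt_sw: fraction of the software runtime that stays on the CPU
           (assumed here to lie between 0 and 1)
    """
    if tp_hw <= 0 or tp_sw <= 0:
        raise ValueError("throughput rates must be positive")
    if not 0.0 <= rt_sw <= 1.0:
        raise ValueError("rt_sw must be a fraction between 0 and 1")
    return 1.0 / (1.0 / tp_hw + rt_sw / tp_sw)

# tp_sw = 50.0 is a hypothetical value chosen for illustration only.
example = estimated_throughput(tp_hw=500.0, tp_sw=50.0, rt_sw=0.18)
speedup = example / 50.0
```

With these inputs the estimate is about 179 MB/s, a speed-up of roughly 3.6 over the assumed software rate. The formula also behaves as expected at the extremes: with nothing left in software the estimate equals the accelerator throughput, and the more runtime remains in software, the more the software term dominates.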
Although the throughput rates of queries
T1-T4 increase up to 4.8-fold by offloading the
extraction operators, query T5 sees a limited
impact. Only by running multiple subgraphs
on the accelerator does query T5 gain an up to
three-fold improvement. Query T1 improves by
a factor of ten by offloading multiple subgraphs
to the accelerator for small documents and by a
factor of 16 for larger documents.
6 CONCLUSION
As we enter the Big Data era, deriving value
from large amounts of data efficiently becomes
a necessity. We believe that text analytics will
be a key application of this new era, but that it
is challenged by the growing complexity of the
queries and ever more data to process. We have
presented a prototype system that includes an
FPGA as a reconfigurable accelerator and a
hardware compiler that enables offloading se-
lected parts of a given text analytics query.
Projections based on profiling results and actual
measurements on the FPGA-attached system
promise an up to 16-fold speed-up over purely
software-based solutions.
The speed-up results reported in this paper
can be further improved by including support
for additional relational operators in our hard-
ware compiler. Further optimizations to the
interface are also being investigated to mini-
mize the latency penalty of small documents.
Our future work will cover hardware/software
partitioning algorithms to maximize the overall
system’s throughput rate under resource con-
straints of the FPGA. We also plan to identify
the most power-efficient design choices for a
given query.
REFERENCES
[1] G. Weikum, “From text to entities and from entities to in-
sight: A perspective on unstructured big data.” Presented
at Microsoft Research Big Data Analytics Workshop 2013,
Cambridge, UK, 2013.
[2] J. Manyika et al., “Big data: The next frontier for inno-
vation, competition, and productivity,” McKinsey Global
Institute, Tech. Rep., May 2011.
[3] H. Cunningham, “GATE, a general architecture for text
engineering,” Computers and the Humanities, vol. 36, no. 2,
pp. 223–254, 2002.
[4] S. Bird, “NLTK: the natural language toolkit,” in Proc.
COLING/ACL on Interactive Presentation Sessions. Association
for Computational Linguistics, 2006, pp. 69–72.
[5] R. Krishnamurthy et al., “SystemT: A system for declar-
ative information extraction,” ACM SIGMOD Record,
vol. 37, no. 4, pp. 7–13, 2009.
[6] V. Tablan et al., “Gatecloud.net: A platform for large-scale,
open-source text processing on the cloud,” Philosophical
Transactions of the Royal Society A: Mathematical, Physical
and Engineering Sciences, vol. 371, no. 1983, 2013.
[7] M. Ferdman et al., “Clearing the clouds,” in Proc. ASPLOS.
ACM, 2012, pp. 37–48.
[8] M. Dimitrov et al., “Memory system characterization of
big data workloads,” in Proc. 2013 IEEE International Con-
ference on Big Data. IEEE, 2013, pp. 15–22.
[9] P. Lotfi-Kamran et al., “Scale-out processors,” in Proc.
ISCA. IEEE Press, 2012, pp. 500–511.
[10] Whitepaper, “Flexible, low power microservers for
lightweight scale-out workloads,” Intel Corporation, Tech.
Rep., 2011.
[11] Y. S. Shao and D. Brooks, “ISA-independent workload
characterization and its implications for specialized archi-
tectures,” in Proc. ISPASS, 2013, pp. 245–255.
[12] E. S. Chung et al., “Single-chip heterogeneous computing:
Does the future include custom logic, FPGAs, and GPG-
PUs?” in Proc. MICRO. IEEE Computer Society, 2010, pp.
225–236.
[13] H. Kung and P. L. Lehman, “Systolic (VLSI) arrays for
relational database operations,” in Proc. SIGMOD. ACM,
1980, pp. 105–116.
[14] R. Mueller et al., “Streams on wires: A query compiler for
FPGAs,” Proceedings of the VLDB Endowment, vol. 2, no. 1,
pp. 229–240, 2009.
[15] C. Dennl et al., “On-the-fly composition of FPGA-based
SQL query accelerators using a partially reconfigurable
module library,” in Proc. FCCM. IEEE, 2012, pp. 45–52.
[16] B. Sukhwani et al., “Database analytics acceleration using
FPGAs,” in Proc. PACT. New York, NY, USA: ACM, 2012,
pp. 411–420.
[17] E. S. Chung et al., “Linqits: Big data on little clients,” in
Proc. ISCA. ACM, 2013, pp. 261–272.
[18] L. Wu et al., “Navigating big data with high-throughput,
energy-efficient data partitioning,” in Proc. ISCA. ACM,
2013, pp. 249–260.
[19] Datasheet, “IBM PureData system for
Analytics N2001,” 2013. [Online]. Available:
http://www-01.ibm.com/software/data/puredata/analytics/
[20] K. Atasu et al., “Hardware-accelerated regular expression
matching for high-throughput text analytics,” in Proc. FPL.
IEEE, 2013, pp. 1–7.
[21] R. Polig et al., “Token-based dictionary pattern matching
for text analytics,” in Proc. FPL. IEEE, 2013, pp. 1–6.
[22] J. Reddington and K. Atasu, “Complexity of computing
convex subgraphs in custom instruction synthesis,” IEEE
Trans. VLSI Syst., vol. 20, no. 12, pp. 2337–2341, 2012.
[23] R. Polig et al., “TAPAS: Compiling text
analytics queries to FPGAs,” IBM Research
Report RZ3864, 2014. [Online]. Available:
http://domino.research.ibm.com/library/cyberdig.nsf/papers
[24] B. Holden, “Latency comparison between HyperTrans-
port and PCI-express in communication systems,” Hyper-
Transport Consortium, Tech. Rep., 2006.
[25] J. Stuecheli, “Next generation power micropro-
cessor,” in Proc. Hot Chips: A Symposium on
High Performance Chips, 2013. [Online]. Available:
http://www.hotchips.org/archives/hc25
