Mining Dynamic Document Spaces with Massively Parallel Embedded Processors by Jacobs, Jan W.M. et al.
Jan W. M. Jacobs1, Rui Dai2, and Gerard J.M. Smit3
1 Océ Technologies BV, PO Box 101,
5900MA Venlo, The Netherlands
jan.wm.jacobs@oce.com
2 National University of Singapore, Design Technology Institute Faculty of Engineering,
10 Kent Ridge Crescent, Singapore 119260
3 University of Twente, PO Box 217,
7500AE Enschede, The Netherlands
Abstract. Océ is currently investigating future document management services. One
of these services is accessing dynamic document spaces, i.e. improving access to
document spaces which are frequently updated (like newsgroups). This
process is rather computationally intensive.
This paper describes the research conducted on software development for massively
parallel processors. A prototype has been built which processes streams of
information from specified newsgroups and transforms them into personal
information maps.
Although this technology does speed up the training part compared to a general
purpose processor implementation, its real benefits emerge with larger problem
dimensions because of the scalable approach.
1 Introduction
We are living in a society that is flooded with information. People need tools to struc-
ture this information and/or inform users on new trends or remarkable events. One way
of visualising the unknown structure of the targeted information sources is by using the
Self Organising Map (SOM) neural network [1][2][3]. This network can be visualised
by a rectangular map, see Fig. 1. In the map, similarity between newsgroup articles,
indicated by the labelled dots, is expressed as proximity (see footnote 1). The colour of the neurons
indicates whether neighbouring neurons are similar or different. Clusters of similar ar-
ticles are grouped into a “country”, which has been given a name and is bordered by
red lines.
For recurring visualisations the map is only useful if its global structure does not
change much when new articles are incorporated. Only then will the user be able to
quickly reorient and see the new changes (cognitive spatial memory effect).
The generation of these maps, however, is very demanding in compute power. Earlier
implementations based on Intel’s Pentium lack the required responsiveness but do
show a straightforward development process, a property of programmable systems.
Fortunately, many data mining tasks exhibit simple massive parallelism.
1 The partial example map shown covers the newsgroups BBC News and BBC Sports in June
2005. It is built up from a grid of 16 by 32 square tiles (neurons) and each tile can accommodate
one or more samples (newsgroup articles).
Fig. 1. Part of map of newsgroup articles
Fig. 2. SOM reduces dimensions with good preservation of structure. The original space
$doc \in \mathbb{R}^N = (f_0, f_1, \ldots, f_{N-1})$ is mapped onto a
more comprehensible space $doc \in \mathbb{N}^2 = (i, j)$
This research is inspired by the potential advantages of massively parallel embed-
ded processors, namely flexibility and shorter design cycles compared to an FPGA ap-
proach, while attaining better performance (by scalable design) than a general purpose
processor. An important departure point, and in our eyes a novelty, is the reuse of the
same hardware for other demanding tasks like colour image processing [4] in Océ’s
problem domain. This research has been conducted in co-operation with Aspex Semi-
conductor, a fabless semiconductor company specialising in high performance, software
programmable, parallel processors based on associative technology [5].
The problem addressed in this paper is the implementation of the performance de-
manding SOM neural network training on a massively parallel processor in both an
effective and efficient way.
Section 2 introduces the concepts relevant to this paper: SOM, data mining and
hardware architectures used for SOM. Section 3 elaborates on the particular
application, and the subsequent sections cover implementation issues (Section 4)
and results (Section 5). Finally, Section 6 draws some conclusions.
2 Related Work
Data mining is an application that tries to find hidden patterns and relationships in data
that can be used for various purposes such as data analyses, observing trends, prediction
etc. One nice way to visualise the hidden relationships is by using the SOM neural
network. The network reduces the data volume of the original space while preserving
its original structure as faithfully as possible, see Fig. 2. The SOM network projects
the data in the N-dimensional space to a two-dimensional space. The original space is
encoded by sparse vectors, typically having over $10^4$ dimensions, containing the relative
frequency of occurrence of significant words in the whole collection. After training, the
network exhibits a topological ordering, i.e. data samples (or newsgroup articles) which
are similar to each other are positioned close to each other. Successful applications of
SOM networks can be found in visualising document spaces such as newsgroup articles
[2] and conference abstracts [3].
The reasons for choosing SOM as a clustering and visualisation tool are: it is better
suited to human interpretation (2D graphic presentation versus a 1D list as in Google),
it maintains the original structure as closely as possible, it allows for associativity
(topological ordering) and it is less computationally intensive and more robust than its
competitor Multi Dimensional Scaling (MDS) [6].
SOM training is in general a relatively computationally intensive step in data mining
applications [7][6]. For that reason many hardware mappings for SOM have been
described since its conception in 1982. Because of its inherently parallel structure,
parallel implementations have also been made. The most advanced ones have been written
for SIMD architectures such as CNAPS, Hypercube, Connection Machine and MasPar,
which, however, are expensive, bulky and have extremely high power consumption
[8][7][9]. Other, more embedded parallel solutions have also been devised, such as
Transputer [10] or FPGA [11]. The latter, however, exhibits rather long development cycles.
Fast development is supported by a general purpose processor with special SIMD
extensions [12], but this is too costly to be a serious contender for embedded applications.
Traditional computers rely upon a memory that stores and retrieves data by its address
rather than its content. In such an organisation (von Neumann architecture), every
accessed data word must travel individually between the processing unit and the mem-
ory. The simplicity of this retrieval-by-address approach has ensured its success, but
has also produced some inherent disadvantages. One is the von Neumann bottleneck,
where the memory-access path becomes the limiting factor for system performance.
A related disadvantage is the inability to proportionally increase the performance of a
unit transfer between the memory and the processor as the size of the memory scales
up. Associative memory, in contrast, provides a naturally parallel and scalable form of
data retrieval for both structured data (e.g. sets, arrays, tables, trees and graphs) and
unstructured data (raw text and digitised signals). An associative memory can be easily
extended to process the retrieved data in place, thus becoming an associative processor.
This extension is merely the capability of writing a value in parallel into selected cells
[5]. Applications range from handheld gaming, multi-media, base transceiver stations
(BTSs), on-line transaction processing to heavy image processing, pattern recognition
and data mining [13][5].
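The idea of an associative processor described above, matching on content and then writing a value into all selected cells at once, can be illustrated with a small software model. The following is only a sketch in plain Python: the function name `associative_write` and the example data are hypothetical, and the list comprehensions merely emulate what the hardware would do in all cells simultaneously.

```python
def associative_write(memory, key, value):
    """Emulate one associative step: tag by content, then write into tagged cells."""
    # Content match: every cell compares its own content with the key.
    tags = [cell == key for cell in memory]
    # Parallel write: only the tagged (selected) cells take the new value.
    updated = [value if tagged else cell for cell, tagged in zip(memory, tags)]
    return updated, tags


# Hypothetical memory of eight cells, each holding a small integer.
cells = [7, 3, 7, 1, 9, 7, 2, 3]
updated, tags = associative_write(cells, key=7, value=0)
print(updated)
```

No address is used anywhere in the lookup; the cells to be modified are selected purely by what they contain.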
Aspex’s Linedancer is an implementation of a parallel associative processor. The
approach taken by Aspex Semiconductor is to use many simple associative processors
in a SIMD arrangement. Each of the 4096 processing elements on the Linedancer device
has about 200 bits of memory (of which 64 bits are fully associative) and a single-bit
ALU, which can perform a 1-bit operation in 1 clock cycle. Operations on larger
data types take multiple clock cycles. The aggregate processing power of Linedancer
depends entirely on parallel processing. For example: a 32-bit add will take many times
the number of clock cycles taken by a high-end scalar processor, but due to the paral-
lelism 4096 additions can be performed in parallel. Multiple Linedancer devices can be
easily connected together to create an even wider SIMD array.
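To make the trade-off between bit-serial arithmetic and wide parallelism concrete, the sketch below emulates a 32-bit addition performed one bit position at a time across 4096 processing elements. It is an illustrative software model only, not Aspex's instruction set: the function name `bit_serial_add` is hypothetical, the inner loop over PEs stands in for what the array would do simultaneously, and the returned step count is the number of bit positions, not a measured cycle count.

```python
import random


def bit_serial_add(a_values, b_values, width=32):
    """Add two lists of unsigned integers bit by bit, as a 1-bit ALU per PE would."""
    n = len(a_values)
    result = [0] * n
    carry = [0] * n
    steps = 0
    for bit in range(width):
        # Conceptually, all PEs execute this full-adder step at the same time.
        for pe in range(n):
            a_bit = (a_values[pe] >> bit) & 1
            b_bit = (b_values[pe] >> bit) & 1
            result[pe] |= (a_bit ^ b_bit ^ carry[pe]) << bit
            carry[pe] = (a_bit & b_bit) | (carry[pe] & (a_bit ^ b_bit))
        steps += 1
    return result, steps


NUM_PES = 4096  # number of processing elements on one device (from the text)
rng = random.Random(0)
a = [rng.randrange(2**31) for _ in range(NUM_PES)]
b = [rng.randrange(2**31) for _ in range(NUM_PES)]
total, steps = bit_serial_add(a, b)
print(steps, "bit steps for", NUM_PES, "additions")
```

Each individual addition needs one step per bit, but the same number of steps delivers all 4096 results, which is where the aggregate throughput comes from.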
The Linedancer device (shown in Fig. 3) includes an intelligent DMA controller,
to ensure that data is moved in and out of the ASProCore concurrently with data
processing, and a RISC processor, to issue high level commands to the ASProCore and
to set up the DMA controller. All parts of the device run at the same clock frequency,
which can be up to 400 MHz.
Fig. 3. Aspex Semiconductor’s Linedancer
A Linedancer is programmed in an extended version of C, with additional syntax
for controlling the ASProCore.
3 Specification of the Application
The purpose of the system presented in this paper is to transform the personal news-
group feeds into a personal 2D map. In this way the user will have a quicker overview
of the changes in their area of interest. The whole pipeline is described in Fig. 4. In order
to cluster newsgroups all articles have to be expressed in a common notation. As in
[2] we use multi-word terms extracted from the corpus (a collection of documents) of
newsgroup articles. Currently a tool named Sigmund is used, a Prolog project developed
at the University of Amsterdam [14].
Fig. 4. Processing pipeline; the amount of data communicated between the modules is indicated
The number of features in a newsgroup collection can become very large, even
with a modest number of articles. Since these document spaces are very sparse, simple
compression methods suffice and good results have been reported [2].
One of the most time consuming tasks in the pipeline is the training of the SOM
neural network. The purpose of a neural network is to generalise from its training input
so that new, previously unseen samples can be clustered or classified correctly. The process
boils down to a controlled annealing of a set of neurons arranged in a rectangular
grid, as will be described in section 3.1. The SOM exhibits spatial ordering, that is,
neighbouring neurons have similar content.
The spatial order in the SOM is now exploited: similar newsgroup articles are positioned
near each other. This allows for associativity since related articles are positioned
in each other's vicinity. The final module prepares a Scalable Vector Graphics (SVG) [15]
file for a lightweight client. The SVG format allows for operations like zooming,
panning and selection for viewing the article itself.
3.1 SOM training
In this section we will go into some of the details of SOM training. First some
definitions are given, followed by a mathematical framework, and finally an example is
included. The most important concepts are:
– a neuron $m_{ij}(t) \in \mathbb{R}^N$ with dimension $N$ on a fixed position in a grid $r = (i, j)$,
– an input sample $x_s \in \mathbb{R}^N$ with the same dimension $N$ as the neurons,
– a learning rate $\alpha(t) \in \mathbb{R}$ to control the amount of learning,
– a neighbourhood matrix $\Lambda_{ij}(t) \in \mathbb{R}^2$, defined on the same grid $r = (i, j)$ to provide
for spatial ordering and finally
– a scalar $\sigma(t) \in \mathbb{R}$ to control the effective size of the neighbourhood matrix.
In the annealing process all samples $x_s$ are repeatedly offered (in so-called epochs)
to all neurons $m_{ij}(t)$ in the grid. The neurons are tuned towards a particular sample
$x_s$ by a certain fraction, see (1) below. This fraction is determined by the difference
between the sample $x_s$ and the neuron $m_{ij}(t)$, the learning rate $\alpha(t)$ and the
neighbourhood matrix $\Lambda_{ij}(t)$. The learning rate is relatively large in the early epochs
to allow for large changes and is small towards the end. In order to realise spatial ordering the
neighbourhood matrix $\Lambda_{ij}(t)$ is controlled by the neighbourhood parameter $\sigma(t)$. The
neighbourhood is as large as the network in the beginning and small in the end, see (2).
The function exp refers to the standard exponential function with base Euler’s number
$e$. The norm or length of a vector $x$ can in general be defined by
$L_p(x) = \sqrt[p]{\sum_{i}^{N} |x_i|^p}$,
where $p \in \mathbb{R}$, $p \ge 1$. For $p = 1, 2, \infty$ the norm represents respectively the Manhattan
distance (1-norm), the Euclidean distance (2-norm) and the max-norm, which is equivalent to $\max_i |x_i|$.
The neighbourhood function, often a Gaussian function, is positioned in the 2D grid at
the location $r_{win} \in \mathbb{N}^2$ of the best matching neuron, i.e. the neuron which is most similar
to the sample $x_s$, see (3).
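$$m_{ij}(t+1) = m_{ij}(t) + \alpha(t) \cdot \Lambda_{ij}(t) \cdot \bigl(x_s - m_{ij}(t)\bigr) \qquad \text{(update rule)} \tag{1}$$
$$\Lambda_{ij}(t) = \exp\left(-\frac{\| r - r_{win}(t) \|^2}{2\sigma^2(t)}\right) \qquad \text{(neighbourhood)} \tag{2}$$
$$r_{win}(t) = \arg\min_{r = (i,j) \in \mathbb{N}^2} \| x_s - m_{ij}(t) \| \qquad \text{(winning neuron)} \tag{3}$$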
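As a small worked example of the three norms mentioned in this section, the following Python sketch evaluates the 1-norm, 2-norm and max-norm of one difference vector. The function name `p_norm` and the example vector are illustrative assumptions and are not taken from the prototype.

```python
import math


def p_norm(x, p):
    """Length of vector x under the p-norm; p = math.inf gives the max-norm."""
    if math.isinf(p):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


# Hypothetical difference vector between a sample and a neuron.
diff = [3.0, -4.0, 0.0]

manhattan = p_norm(diff, 1)        # |3| + |-4| + |0|
euclidean = p_norm(diff, 2)        # sqrt(9 + 16 + 0)
max_norm = p_norm(diff, math.inf)  # largest absolute component

print(manhattan, euclidean, max_norm)
```

The max-norm needs only comparisons, whereas the 1-norm needs a summation over all components, which is relevant when the components are spread over many processing elements.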
The following 5 steps will compute the update for a single neuron for a given sample
within an epoch:
Step 1. determine the high dimensional distance: $\forall ij: \| x_s - m_{ij}(t) \|$
Step 2. determine winning neuron location: $r_{win}(t) = \arg\min_{ij} \| x_s - m_{ij}(t) \|$
Step 3. 2D distance computation: $\forall ij: \| r_{ij} - r_{win}(t) \|$
Step 4. neighbourhood computation: see equation (2)
Step 5. compute the update for the neurons: see equation (1)
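The five steps above can be sketched for one sample as follows. This is a minimal, sequential Python illustration and not the implementation used in the prototype: the function name `som_update`, the dictionary layout of the grid, the use of the 1-norm in step 1 and the Gaussian neighbourhood with a negative exponent in step 4 are all assumptions made for the example.

```python
import math


def som_update(neurons, sample, alpha, sigma):
    """Apply one training step for one sample; neurons maps (i, j) -> weight list."""
    # Step 1: high dimensional distance between the sample and every neuron (1-norm assumed).
    dist = {r: sum(abs(x - m) for x, m in zip(sample, w)) for r, w in neurons.items()}
    # Step 2: the winning neuron is the one with the smallest distance.
    r_win = min(dist, key=dist.get)
    for (i, j), w in neurons.items():
        # Step 3: squared 2D grid distance to the winner.
        d2 = (i - r_win[0]) ** 2 + (j - r_win[1]) ** 2
        # Step 4: Gaussian neighbourhood value for this neuron.
        lam = math.exp(-d2 / (2.0 * sigma ** 2))
        # Step 5: move the neuron towards the sample.
        for k in range(len(w)):
            w[k] += alpha * lam * (sample[k] - w[k])
    return r_win


# Hypothetical 2 x 2 map with N = 2.
neurons = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
           (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0]}
winner = som_update(neurons, sample=[1.0, 0.9], alpha=0.5, sigma=1.0)
print(winner, neurons[winner])
```

The winner moves halfway towards the sample, while the neuron in the opposite corner moves by a much smaller fraction because its neighbourhood value is lower.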
3.2 Complexity analysis for SOM training
For this research we restricted ourselves to the SOM training. The algorithm of the
training process is given in Table 1. The training consists of a sequence of epochs
(training sessions) in which 2 parameters are decreased in a controlled
way: the learning rate (α) and the neighbourhood (σ). Typical values are 250 epochs,
500 samples, map sizes W = 32 and H = 16, and N = 256.

    for all epochs do
        decrease α; decrease σ;
        for each sample do
            dist_ND = compute_ND_distance(sample, all_neurons);                      (1)
            winning_neuron = determine_winner(dist_ND);                              (2)
            dist_2D = compute_2D_distance(winning_neuron, all_neurons);              (3)
            neighbourhood = compute_neighbourhood(dist_2D, σ);                       (4)
            all_neurons = all_neurons + α · neighbourhood · (sample - all_neurons);  (5)
        end
    end
Table 1. SOM training program

Table 2 not only summarises the sequential complexity but also includes concrete
operation counts for the aforementioned values (in cycles per epoch per sample).

Table 2. Base complexity, for comparison purposes, and projected gain by parallelisation
Training step                 Sequential complexity    # sequential    Projected order of
                              (order of operations)    operations      parallel operations
1. Distance in highD          O(W·H·N)                 393216          O(H + log2 N)
2. Winner selection           O(W·H)                   768             constant
3. Distance in 2D             O(W·H)                   2560            constant
4. Determine neighbourhood    O(W·H)                   512             O(H)
5. Update neurons             O(W·H·N)                 393216          O(H)

In the second column cycles are expressed in (big O) order notation. Conversion to concrete
numbers of operations is straightforward; the distance computations (steps 1 and 3),
however, have to account for the subtraction, taking the absolute value (for the 1-norm) and
finally adding all component values together. The third column contains an estimate of the
number of operations for a sequential processor. The last column shows the projected
parallel complexity for a particular parallel architecture, which is parallel in W × N but
sequential in H. The additional O(log2 N) accounts for the time to compute a binary
adding tree in parallel.
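The sequential operation counts can be reproduced with a few lines of arithmetic. In the sketch below, the factor of 3 operations per component (subtract, absolute value, add) for the high dimensional distance follows the text; the per-neuron factors for steps 2 to 4 and the factor 3 for step 5 are assumptions chosen because they reproduce the tabulated numbers, and are not stated in the paper.

```python
import math

W, H, N = 32, 16, 256  # map width, map height and vector dimension (from the text)

# Operations per vector component for steps 1 and 5.
# Step 1: subtract, absolute value, accumulate (from the text).
# Step 5: assumed to be 3 as well, which matches the tabulated count.
OPS_PER_COMPONENT = 3

# Assumed operations per neuron for steps 2, 3 and 4 (inferred from the table).
WINNER_OPS = 1.5
DIST_2D_OPS = 5
NEIGHBOURHOOD_OPS = 1

counts = {
    "1. Distance in highD": OPS_PER_COMPONENT * W * H * N,
    "2. Winner selection": int(WINNER_OPS * W * H),
    "3. Distance in 2D": DIST_2D_OPS * W * H,
    "4. Determine neighbourhood": NEIGHBOURHOOD_OPS * W * H,
    "5. Update neurons": OPS_PER_COMPONENT * W * H * N,
}

# Depth of a binary adding tree over N components, the log2 N term of step 1.
adder_tree_depth = int(math.log2(N))

for step, ops in counts.items():
    print(f"{step:28s} {ops:7d}")
print("adder tree depth:", adder_tree_depth)
```

Steps 1 and 5 dominate by more than two orders of magnitude, which is why the projected parallel orders for these two steps matter most.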
4 Implementation restrictions and choices
In order to map the SOM algorithm on the Linedancer in a performance optimal way
the following observations are important. It is shown in [11][7] that SOM is flexible
in the sense that it is somewhat robust to 1) lower precision (e.g. to 8 bit), 2) using a
simple distance metric (e.g. 1-norm or Manhattan distance) and 3) approximating the
neighbouring function as a box function.
With a 2 Linedancer system we have an 8K PE budget. Every PE is equipped with
128 bit Extended Memory (EM) and 64 bit Content Addressable Memory (CAM). We
have chosen to store the neurons in the array and the input samples in off-chip DRAM.
For the choice of dimension N, [16] has shown that for newsgroup articles N = 315
is adequate. For our application we used N = 256. Reference [11] reports a precision
of 8 to 16 bit; we used 8 bit. Although we have not tested this extensively, our
impression is that these values are sufficient for our purposes. The same applies to the
box function [7], our choice for the neighbouring function.
This leaves us with the following choice for map dimensions: W ×H×N = 32×
16×256. The EM is used to store the neurons, the CAM is used to host the temporary
work registers. Each set of 256 PEs covers a row of H = 16 neurons with dimension
N = 256 and precision 8 bits, which fits in EM (i.e. 16 neurons × 8 bit=128 bits), see
Fig. 5. The DMA engine will copy the current input sample into I/O memory in a fast
way and in parallel to the computations.
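The fit of the chosen map dimensions to the available hardware can be checked with simple arithmetic. The following Python sketch only restates the numbers given above; the variable names are illustrative.

```python
PES_PER_LINEDANCER = 4096  # processing elements per device
NUM_LINEDANCERS = 2
EM_BITS_PER_PE = 128       # extended memory per PE
CAM_BITS_PER_PE = 64       # content addressable memory per PE

W, H, N = 32, 16, 256      # chosen map dimensions
PRECISION_BITS = 8         # chosen precision per component

# PE budget: the array is parallel in W x N, one PE per (row set, component) pair.
pe_budget = PES_PER_LINEDANCER * NUM_LINEDANCERS
pes_needed = W * N

# Extended memory: each PE stores one component of each of the H neurons it covers.
em_bits_needed = H * PRECISION_BITS

print("PEs needed / available:", pes_needed, "/", pe_budget)
print("EM bits needed / available:", em_bits_needed, "/", EM_BITS_PER_PE)
```

Both the PE budget and the extended memory are used completely by this choice, so a larger map or a higher precision would require more devices or a different layout.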
The algorithm above is mapped in the following way, see also Fig. 6:
Step 1. The distance between the current sample and all neurons allows for parallel
computation of steps (1) and (5). For the current neuron column these absolute
differences are stored in the middle column of CAM (Fig. 6). The max-norm is used to
compute the length of these 32 differences in parallel instead of a time consuming
parallel adding tree. These results are stored in location 0 at the bottom of each
256-segment. Subsequently the 15 other neuron columns have their distances with
this sample computed and stored at locations 1..15. The end result is a single byte
wide column (left column in CAM), covering all 32×16=512 distances between
the current input sample and the neurons.
Step 2. The winner selection is performed by a global minimum operation on these 512 dis-
tances, which can be done in a relatively fast way by using the associative property
of the array. The winning neuron is indicated by one exclusively tagged PE.
Step 3. From now on steps 3, 4 and 5 are performed per neuron column (so H times in
sequence). The (x,y) location of each neuron is conveniently stored in an adjacent,
rightmost, 16 bit column in the CAM. Hence the winning location is selected to be
broadcast to each neuron, after which the 2D distance is determined.
Step 4. The neighbourhood matrix is computed by a parallel comparison of the locally
computed 2D distance from the previous step with the current global neighbourhood
parameter, see equation (2).
Step 5. The final update step is computed in parallel by multiplying the global learning rate
with the recomputed difference between neurons and input sample, only for those
neurons which were selected in the previous step, see equation (1).
Fig. 5. Vertical arrangement of neurons over PEs in extended memory
Fig. 6. Selecting the winning neuron and computation of the neighbourhood in CAM memory
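The simplified variant described in these steps (8 bit values, max-norm distance and a box neighbourhood) can be sketched sequentially as follows. This is an illustration under stated assumptions rather than the Linedancer code: the function name `train_step_8bit`, the dictionary layout, the use of the max-norm for the 2D grid distance and the truncation of the update to an integer are all hypothetical choices.

```python
def train_step_8bit(neurons, sample, alpha, radius):
    """One training step; neurons maps (i, j) -> list of integers in 0..255."""
    # Step 1: max-norm distance between the sample and every neuron.
    dist = {r: max(abs(x - m) for x, m in zip(sample, w)) for r, w in neurons.items()}
    # Step 2: winner selection by a global minimum over all distances.
    r_win = min(dist, key=dist.get)
    for (i, j), w in neurons.items():
        # Step 3: 2D distance to the winner (max-norm on the grid is an assumption).
        d2d = max(abs(i - r_win[0]), abs(j - r_win[1]))
        # Step 4: box neighbourhood, a plain comparison with the current radius.
        if d2d <= radius:
            # Step 5: update only the selected neurons; int() truncates towards zero.
            for k in range(len(w)):
                w[k] += int(alpha * (sample[k] - w[k]))
    return r_win


# Hypothetical 2 x 2 map with N = 2 and 8 bit components.
neurons = {(0, 0): [0, 0], (0, 1): [0, 200],
           (1, 0): [200, 0], (1, 1): [200, 200]}
winner = train_step_8bit(neurons, sample=[200, 180], alpha=0.5, radius=0)
print(winner, neurons[winner])
```

With a radius of 0 only the winner is updated; a larger radius selects a square block of neurons around the winner, which replaces the exponential of equation (2) by a single comparison.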
5 Results
The performance measurements are now compared with two Pentium implementations,
one with SSE instructions and one without. See Table 3 for a detailed
comparison. Since SSE can operate on 4 single precision floats at a time, steps 1 and
5 can be sped up by at most this factor relative to the sequential computation. Both Pentium
versions use the 1-norm for computing the length of a vector. The SSE version is
derived by compiling the algorithm of Table 1 with the Intel C++ compiler (version 8.0).
The Linedancer results are measured cycles; the Pentium results are estimates derived
from assembly code.

Table 3. Comparison
Training step                 Pentium                                  Aspex Linedancers
                              sequential version    SSE estimate      1-norm      max-norm
                              estimate [cycles]     [cycles]          [cycles]    [cycles]
1. Distance in highD          393216                65024             43384       18028
2. Winner selection           768                   768               2158        2158
3. Distance in 2D             2560                  2560              100         100
4. Determine neighbourhood    512                   512               38          38
5. Update neurons             393216                98304             5590        5536

The Linedancers do speed up step 5 significantly beyond the clock
frequency ratio (2 GHz / 300 MHz). However, the performance of step 1 is disappointing
for the 1-norm as well as for the max-norm, especially for the max-norm, which
was expected to take fewer cycles because there is no need to sum up all components
as in the 1-norm. In comparison with a Pentium, a speed-up of a factor 7 is achieved by a
2 Linedancer system, see Fig. 7. Using the 1-norm distance metric a speed-up of 3.5 is
achieved. The achieved speed-up is somewhat disappointing because the inherently parallel
nature of the SOM training process should map efficiently onto the massively parallel
Linedancer.
The main reason is the relatively high communication overhead in the time
spent in the inner loop. For the most dominant part, the high dimensional
distance (step 1), we measured how many cycles were spent in communication and how many in
computation, see Fig. 8. This figure shows that the communication overhead dominates
the computation cost.
If inter-PE communication were improved, the performance of this step
could match O(H + log2 N) for the 1-norm and O(H) for the max-norm. If processing and
communication were perfectly balanced, this would result in a 3× performance
improvement for the 1-norm and 5× for the max-norm.
Fig. 7. Comparison of implementation alternatives
Fig. 8. Distribution of communication and computation in High Dimensional Distance computation
6 Conclusions
A single Linedancer is 3.5 times faster than a Pentium implementation in training a
SOM neural network.
Improving on inter PE communication such that computation and communication
are better balanced would not only increase the performance significantly (factor of
3 for 1-norm and 5 for max-norm) but would also improve the scalability to larger
network dimensions using multiple Linedancers.
It is recommended to improve the performance of the inter PE communication. A
solution could be to introduce a chordal ring communication structure [17] or wired-OR
functionality.
References
1. Meij, J., ed.: Introduction to Multidimensional Scaling. In: Dealing with the data flood.
Mining data, text and multimedia. STT/Beweton, The Hague, The Netherlands (2002)
2. Perelomov, I., Azcarraga, A.P., Tan, J., Chua, T.S.: Using structured self-organizing maps in
news integration websites (2002) http://citeseer.ist.psu.edu/perelomov02using.html.
3. Skupin, A.: A cartographic approach to visualizing conference abstracts. In: IEEE Computer
Graphics and Applications. (2002) 50–58
4. Jacobs, J., Bond, W., Pouls, R., Smit, G.: Colour image processing with massively parallel
embedded processors. To appear in Parallel Computing (2005)
5. Aspex Semiconductor Ltd: Linedancer - overview (2005) http://www.aspex-
semi.com/pages/products/products linedancer overview.shtml.
6. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley-Interscience (2000)
7. Kohonen, T.: Self-Organizing Maps. Springer (1997)
8. Nordstrom, T.: Designing parallel computers for self-organizing maps (1992)
http://citeseer.ist.psu.edu/nordstrom92designing.html.
9. Schikuta, E., Weidmann, C.: Data parallel simulation of self-organizing maps on hypercube
architectures. In: Proceedings of WSOM’97, Workshop on Self-Organizing Maps, Espoo,
Finland, June 4-6. Helsinki University of Technology, Neural Networks Research Centre,
Espoo, Finland (1997) 142–147 http://citeseer.ist.psu.edu/72587.html.
10. Wu, C.H., Hodges, R.E., Wang, C.J.: Parallelizing the self-organizing feature map on multi-
processor systems. Parallel Computing 17(6-7) (1991) 821–832
11. Pohl, C., Franzmeier, M., Porrmann, M., Rückert, U.: gnbx reconfigurable hardware
acceleration of self-organizing maps. In: Proceedings of the IEEE International Conference on
Field Programmable Technology (FPT’04), Brisbane, Australia (2004) 97–104
12. Garcia, C., Prieto, M., Pascual-Montano, A.: A speculative parallel algorithm for self-
organizing maps. To appear in Parallel Computing (2005)
13. Krikelis, A., Weems, C.: Associative Processing and Processors. IEEE Computer Society
(1997)
14. Anjewierden, A., de Hoog, R., Brussee, R., Efimova, L.: Knowledge flows in weblogs. In:
Proceedings of the 13th International Conference on Conceptual Structures (ICCS 2005),
Kassel, Germany (2005)
15. W3Schools: Introduction into svg (2006) http://www.w3schools.com/svg/svg intro.asp
[Online, accessed 12/04/2006].
16. Azcarraga, A.P., Teddy N. Yap, J.: Extracting meaningful labels for websom text archives. In:
CIKM ’01: Proceedings of the tenth international conference on Information and knowledge
management, New York, NY, USA, ACM Press (2001) 41–48
17. NeoMagic Corporation: The technology of associative processor array (2002)
http://www.neomagic.com/product/apa version3 1.pdf.
