Constructive Induction Machines for Data Mining by Perkowski, Marek et al.
Portland State University
PDXScholar
Electrical and Computer Engineering Faculty
Publications and Presentations Electrical and Computer Engineering
1999
Constructive Induction Machines for Data Mining
Marek Perkowski
Portland State University
Stanislaw Grygiel
Portland State University
Qihong Chen
Portland State University
Dave Mattson
Portland State University
This Conference Proceeding is brought to you for free and open access. It has been accepted for inclusion in Electrical and Computer Engineering
Faculty Publications and Presentations by an authorized administrator of PDXScholar. For more information, please contact pdxscholar@pdx.edu.
Citation Details
Perkowski, Marek, Stanislaw Grygiel, Qihong Chen, and Dave Mattson. "Constructive induction machines for data mining." (1999)
CONSTRUCTIVE INDUCTION MACHINES FOR DATA MINING.

Marek Perkowski, Stanislaw Grygiel, Qihong Chen, and Dave Mattson 
Portland State University, Dept. of Electr. Engn., Portland, Oregon 97207, 

Tel: 503-725-5411, Fax: 503-725-4882, mperkows@ee.pdx.edu 

Abstract 
The "Learning Hardware" approach involves creating a computational network based on feedback from the environment (for instance, positive and negative examples from the trainer) and realizing this network in an array of Field Programmable Gate Arrays (FPGAs). Computational networks can be built by incremental supervised learning (neural net training) or by global construction (decision tree design). Here we advocate an approach to Learning Hardware based on the Constructive Induction methods of Machine Learning (ML) using multi-valued functions. This is contrasted with the Evolvable Hardware (EHW) approach, in which learning/evolution is based on the genetic algorithm only.
Various approaches to supervised inductive learning for Data Mining and Machine Learning applications require fast operations on complex logic expressions and the solution of NP-complete problems such as graph coloring or set covering. These operations should therefore be realized in hardware to obtain the necessary speed-ups. Using a fast prototyping tool, the DEC-PERLE-1 board based on an array of Xilinx FPGAs, we are developing virtual processors that accelerate the design and optimization of decomposed networks of arbitrary logic blocks.
I. INTRODUCTION. EVOLVING IN HARDWARE VERSUS LEARNING IN HARDWARE
In recent years the scientific community has witnessed very fast developments in the area of Soft Computing. Artificial Neural Nets (ANNs), Cellular Neural Nets (CNNs), Fuzzy Logic, Rough Sets, Genetic Algorithms (GA), Genetic and Evolutionary Programming and other approaches have been developed, all concerned with the notions of learning, adapting, modifying, evolving or emerging. Several mixed approaches are being developed that combine elements of these areas in many different ways, with the goal of solving very complex and poorly defined problems that could not be tackled by previous analytic models. What is common to all these approaches is that they propose a way of automatic learning by the system: the computer is taught by examples rather than completely programmed (instructed) what to do. This philosophy also dominates the areas of Artificial Life, problem solving by analogy to Nature, decision making, knowledge acquisition, new approaches to intelligent robotics, and many others. Machine Learning is thus becoming a new and most general system design paradigm, unifying many previously disconnected research areas. It is starting to become a new hardware construction paradigm as well.
Recently, the term Evolvable Hardware (EHW) has been coined [17, 16, 18, 20] to denote the realization of the genetic algorithm (GA) in reconfigurable hardware. Our approach, the Universal Logic Machine [31, 33, 38, 21], proposes to build a learning machine based on logic principles, especially the Constructive Induction [26, 27, 44] and Rough Set Theory [30] approaches. The Genetic Algorithm is a very simple and practically blind mechanism of Nature, and it is therefore easily realizable in hardware. We are afraid, however, that this mechanism alone cannot produce good results. In contrast, the logic algorithms that use previous human knowledge are optimal and mathematically sophisticated, and their software realizations use such complex data structures and control that they are very difficult to realize in hardware; but they lead to high-quality learning results. Since software/hardware realizations may suffer from the consequences of Amdahl's Law, interesting software-hardware design trade-offs must be resolved to realize the learning algorithms based on logic optimally.
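Amdahl's Law makes this trade-off quantitative: if a fraction p of the run time is accelerated by a factor s while the rest stays in software, the overall speedup is 1/((1 - p) + p/s). The minimal Python sketch below (the function name is ours, for illustration only) shows why the part left in software dominates: with a 1000x hardware kernel, accelerating 50%, 90% and 99% of the run time yields only about 2x, 9.9x and 91x overall.

```python
def amdahl_speedup(accelerated_fraction: float, acceleration: float) -> float:
    """Overall speedup predicted by Amdahl's Law.

    accelerated_fraction: share p of the original run time moved to hardware (0..1).
    acceleration: factor s by which that share is sped up.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if acceleration <= 0.0:
        raise ValueError("acceleration must be positive")
    software_share = 1.0 - accelerated_fraction  # part that stays in software
    return 1.0 / (software_share + accelerated_fraction / acceleration)


if __name__ == "__main__":
    # A 1000x hardware kernel is still capped by whatever remains in software.
    for p in (0.5, 0.9, 0.99):
        print(f"p = {p:.2f}: overall speedup = {amdahl_speedup(p, 1000.0):.1f}x")
```

The 1/(1 - p) ceiling is what the argument above relies on: any software decision making left in the loop bounds the benefit of even very fast FPGA processors.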
When we talk about "Learning Hardware", we understand the term "learning" very broadly, as any mechanism that leads to improved operation; evolution-based learning is thus included. Although specific learning concepts and their formalisms differ from one learning approach to another, what they have in common is that in the process of learning some kind of network is constructed/evolved/adapted/grown that stores the knowledge acquired in the learning phase (the network can become equivalent to a state machine or a fuzzy automaton by adding some discrete or continuous memory elements). The learned network is then run (executed, evaluated, etc.) on old or new data given to it, thus producing its responses: expected behaviors (decisions, controls) in unfamiliar situations (new data sets). The responses may be correct or erroneous; the network's behavior is then evaluated by some fitness (cost) functions, and the learning and running phases are interspersed. The process of solving problems is thus always reduced to two phases: the phase of learning, that is, constructing and tuning the network, and the phase of using knowledge, that is, running the network on data sets. Compared to the process of developing and using a computer, the first phase corresponds to the entire process of conceptualizing, designing and optimizing a computer on all its levels: system, behavioral, architectural, logic design, and physical design (partitioning, placement, routing). The second phase corresponds to using this computer to perform calculations. Standard computer hardware, however, cannot be redesigned when it fails to solve a problem correctly, while Learning Hardware redesigns itself automatically based on the new learning examples given to it.
Let us observe that, from the operational point of view of the entire system, it is irrelevant what kind of network is being taught. It can be a combinational network, in which the outputs are functions of the states of the input signals, or it can be a network with memory. It can be digital or analog, discretized in signal value or in time, synchronous or asynchronous (for simplicity, in this paper we restrict ourselves to combinational digital circuits). It is only important that we have some way of designing this network from positive and negative examples, and then some way of evaluating the network's behavior on data sets (similar clustering methods have also been designed to acquire knowledge through feedback from the environment, without direct intelligent supervision). Thus, the structure of the network must be created, and its elements must also be designed, adapted, selected from a menu, or tuned in the learning process. The network can be realized in software, in hardware, or as a software-hardware co-design (Amdahl's Law can be used as an argument against software/hardware approaches as opposed to a purely hardware approach).
Observe also that once the network has been found, it can be transformed into another network, either completely equivalent to it or a generalization of it. For instance, an integer-based neural net and a multi-valued (MV) decision tree can both be compiled to binary logic gates. The net can be designed by constructive methods all at once from the complete set of examples (the approach used for diagnostic trees), or it can be built incrementally (as is done for neural nets).
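As a small illustration of such a compilation, the Python sketch below takes a hypothetical MV decision tree over a ternary attribute x and a binary attribute y, encodes x with two bits (x1, x0) in plain binary, and checks that a hand-derived two-level gate network f = x1'x0' + x1'y agrees with the tree on every valid code. The tree, the encoding and all function names are our assumptions, not taken from this paper; the unused code x1 x0 = 11 is treated as a don't care.

```python
from itertools import product


def mv_tree(x: int, y: int) -> int:
    """Hypothetical MV decision tree: x is ternary (0, 1, 2), y is binary."""
    if x == 0:
        return 1
    if x == 1:
        return y
    return 0  # x == 2


def compile_to_binary(tree) -> dict:
    """Truth table over bits (x1, x0, y), where x = 2*x1 + x0.

    The unused code x1 x0 = 11 (x = 3) is marked None, i.e. a don't care.
    """
    table = {}
    for x1, x0, y in product((0, 1), repeat=3):
        x = 2 * x1 + x0
        table[(x1, x0, y)] = tree(x, y) if x <= 2 else None
    return table


def gate_network(x1: int, x0: int, y: int) -> int:
    """Two-level AND/OR realization f = x1'x0' + x1'y (don't cares set to 0)."""
    return int((not x1 and not x0) or (not x1 and y))


def equivalent(table: dict, network) -> bool:
    """True if the network matches every specified (non-don't-care) entry."""
    return all(network(*bits) == value
               for bits, value in table.items() if value is not None)


if __name__ == "__main__":
    table = compile_to_binary(mv_tree)
    print("equivalent:", equivalent(table, gate_network))
```

The don't cares created by the unused code are what gives the binary logic minimizer freedom; here they are simply assigned 0.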
It should be clear from the above remarks that there are close links between various learning approaches, so ideas developed in one area, say ANNs, can be mapped to another area, say Fuzzy Logic. Therefore, many new approaches can be created and investigated by combining basic learning models and methods in various ways. For instance, the ANN built in the Brain Builder approach [18] can be compiled directly to binary hardware without the intermediate medium of cellular automata used in [18], or a learning algorithm other than the genetic algorithm of [18] can be used to construct the ANN.
It is also irrelevant from the point of view of the entire system, once it has been taught, whether its "black box", the learning module realized as, say, an array of programmed FPGAs, was taught in an incremental learning process, constructed as a fully specified system, or constructed by a learning algorithm. Different construction methods differ only in their convergence speeds, network sizes, learning errors, network speeds, testability, power consumption, etc. It is in the selection of the network model and of the network construction methods that the different philosophies of designing Learning Hardware and Evolvable Hardware essentially disagree.
The plan of this paper is as follows. Section II briefly compares logic approaches to learning with ANN and GA approaches. In Section III we introduce the concept of Learning Hardware, and in Section IV we explain the methods of knowledge representation in the Universal Logic Machine (ULM), our realization of the Learning Hardware concept. Section V briefly introduces the DEC-PERLE-1 board, to which the virtual processors of the ULM are mapped [51]. Section VI clarifies the programming/design environment for DEC-PERLE/Xilinx. Next we illustrate our approach with two different concepts of designing Learning Hardware using the DEC-PERLE-1 board. The first method is to design a general-purpose computer with instructions specialized to operate on logic data; the second is to design a processor for only one application. In Section VII we present a virtual general-purpose computer that operates on multi-valued cube calculus, an algebra to solve combinatorial problems in multiple-valued logic. The data path of this computer is entirely specialized for efficient realization of cube calculus operations, and its control unit implements basic algorithms that use these operations, for instance two-level AND/OR logic minimization. The virtual processor of Section VIII realizes one algorithm only: the generalized Ashenhurst/Curtis decomposition of functions [2, 7, 52] and relations [36]. The unifying concept of both architectures is the use of cellular automata and regular logic structures, because they can be easily specified in VHDL or schematic capture tools and mapped nicely to the regular chip-level and board-level "FPGA Array" resources of DEC-PERLE-1. In the conclusion section we point to some difficulties, evaluate the design and discuss future improvements.
II. LOGIC METHODS FOR LEARNING 
In this paper we present a new approach to designing a learning machine, based on FPGA technology and associated logic development methods (called logic synthesis by the design automation community and constructive induction by the Machine Learning community) rather than on neural or genetic algorithms. Michie [28] distinguishes between black-box and knowledge-oriented concept learning systems in terms of weak and strong criteria. A system satisfies the weak criterion when it uses sample data to generate an updated basis for improved performance on subsequent data. The strong criterion is satisfied if the system can, moreover, communicate its learned concepts in symbolic form [27]. Let us observe that ANNs, CNNs and similar approaches satisfy only the weak criterion, while our approach satisfies the strong criterion. For instance, a medical doctor who uses the aid of a knowledge-based system has to understand the explanation given by the system before making a decision, for which only the doctor will be responsible. The doctor therefore cannot rely on a "black box"-type decision from an ANN.
We believe that the results of the learning process, and even the process itself, should be understandable by humans. The processes should therefore be similar to those in humans, that is, based on symbolic logic and not on the methods of Nature. Human thinking is perceived by other humans as the abstract use of symbols, not as the tuning of numeric neuron weights. Our approach to learning still allows for fuzziness, imprecise formulation, and random search or probabilistic solving mechanisms. It operates, however, on higher and more natural symbolic representation levels. Also, the built-in mathematical optimization techniques make it possible to satisfy the Occam's Razor principle, thus finding solutions that are provably good in the sense of Computational Learning Theory (COLT) [1, 44]. This is the first main point of our philosophy.
In our past research we have used and compared, in software, various network structures for learning: two-level AND/OR (Sum-of-Products (SOP), or Disjunctive Normal Forms (DNF)) [29], Exclusive-Or-Sum-of-Products (ESOP) [49, 35], three-level NAND/AND/OR networks [39], Three-Level AND/NOT (TANT) networks [34], decision trees (C4.5), and multi-level decomposition structures [52, 36, 13, 14], as well as various logic, non-logic and mixed optimization methods: search [37], rule-based methods, set covering, graph coloring, genetic algorithms [11, 9] (including mixtures of logic and GA approaches), genetic programming [10], artificial neural nets, and simulated annealing. We compared the resulting networks on their complexity (Occam's Razor) as well as on various ways of calculating the learning error [13, 14, 44, 24]. The Decomposed Function Cardinality (DFC) and its extensions for MV logic [1, 44, 13, 14] were used as common measures of complexity because of their theoretically proven properties [1, 44]. Based on these investigations, we can definitely state that logic approaches, and especially the MV decomposition techniques combined with smart heuristic strategies and good data representations, are usually superior to other approaches with respect to smaller network complexity and learning error.
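For readers unfamiliar with DFC, the minimal Python sketch below assumes the usual binary definition: a block with k binary inputs costs 2^k, the size of its truth table, and the DFC of a network is the sum of its block costs. For MV blocks it simply uses the product of the input domain sizes; this generalization is our simplifying assumption, and the MV extensions cited above may weight blocks differently (for instance, by output cardinality). The function names are ours.

```python
from math import prod


def block_cardinality(input_domains: list[int]) -> int:
    """Size of a block's lookup table: product of its input domain sizes."""
    return prod(input_domains)


def dfc(blocks: list[list[int]]) -> int:
    """Decomposed Function Cardinality: sum of the block cardinalities."""
    return sum(block_cardinality(b) for b in blocks)


if __name__ == "__main__":
    # Binary f(a, b, c, d): undecomposed vs. f = H(G(a, b), c, d).
    print(dfc([[2, 2, 2, 2]]), dfc([[2, 2], [2, 2, 2]]))
    # Same structure with ternary variables and a ternary G output.
    print(dfc([[3, 3, 3, 3]]), dfc([[3, 3], [3, 3, 3]]))
```

In the binary case the decomposition lowers the cost from 16 to 4 + 8 = 12, and in the ternary case from 81 to 9 + 27 = 36; this reduction in cardinality is what the Occam's Razor comparison above rewards.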
With respect to complexity and error, especially poor results were obtained using genetic algorithms [11, 9, 10]. Maybe GA performs well in other applications, but both in our experience and in the literature we were simply not able to find a single problem domain in which a GA-based algorithm was superior to a hand-crafted, human-designed algorithm for designing a binary or multi-valued logic network of any kind. This is perhaps because humans have long experience in creating efficient logic minimization algorithms (for instance, more papers have perhaps been written on SOP minimization than on any other engineering topic). We want to make use of this accumulated human experience in our approach, rather than "reinvent" algorithms using GA. On the other hand, for large data sets the logic algorithms are relatively slow, and hence must be sped up in hardware.
III. FROM LEARNING HARDWARE TO DATA MINING MACHINES
There is one point of general agreement among the various developers of evolvable and learning systems: realized with current software or even parallel programming technologies, the learning phase and/or the execution phase are too slow for real-life problems, especially real-time problems, regardless of whether exhaustive combinatorial search, simulated annealing, or evolutionary algorithms involving millions of populations are used. Researchers have therefore proposed speeding up some phases by migrating them from software to hardware.
In general, five approaches to implementing the learning algorithms are possible:
A1. Both learning and execution are done in software (this standard approach still dominates Artificial Neural Nets, Constructive Induction, Data Mining, and the so-called "extrinsic" Evolvable Hardware [16]).
A2. The learning phase is performed in software, and the network is then downloaded to hardware for execution (this approach has been used with fuzzy logic controllers, FPGAs realizing binary and multi-valued networks, etc.).
A3. The learning phase is performed in hardware and the execution phase in software. This approach is thus a hardware-accelerated design of a knowledge-based expert system. (This approach can be used for Data Mining (DM) and Knowledge Discovery in Databases (KDD) on extremely large data. So far, we do not know of any research other than ours based on this principle.)
A4. Both the learning and the execution phases are performed in hardware. This is the area of classical ANNs, CNNs and "intrinsic" evolvable hardware [16]. The approach of evolving ANNs realized with cellular automata using genetic algorithms [18] requires hardware (Intrinsic Evolvable Hardware), because the slow evaluation process must be repeated on millions of population members to give sufficiently good results. This is an extreme and purist, but very innovative and ambitious, approach. Only time will show whether it will be successful, but this philosophy is supported by Amdahl's Law. Satisfying Amdahl's Law would, however, require removing software decision making from the process, which in our opinion will prove impossible.
A5. Software-hardware co-design in one or both phases. This approach is the most promising in our view.
Many ambitious projects based on ANNs, cellular logic, DNA, simulated evolution and biologically motivated hardware have been proposed that will perhaps some day be realized at the molecular or quantum level. However, many of them are quite impractical in current technologies. The following general observations related to the practical hardware realization of the Learning Hardware concept can be made:
1. Most approaches to learning and evolutionary hardware use binary Field Programmable Gate Arrays, simply because no other mass-scale, reconfigurable (reprogrammable) and relatively inexpensive hardware technologies are widely available now. Other potential realization technologies are either too primitive to allow for large networks or are in very early development stages. For instance, Electronically Programmable Logic Devices are too small, and Field Programmable Analog Arrays are in their current state not flexible enough, although they have high potential in the longer run [42]. Chips for Cellular Neural Nets [43] and Artificial Neural Nets [25] have high potential, but their small markets have not yet made them commercially successful. Finally, Multi-Valued FPGAs, Fuzzy Logic Programmable Arrays, and Mixed FPAAs are still in very early stages. On the other hand, binary FPGAs now make it possible to realize various conceptual networks in hardware, including neural networks, fuzzy systems, and decision trees. A practical task should therefore be to compare FPGA-based realizations of various machine learning paradigms.
2. Thus, because in binary FPGAs everything is realized at the level of binary logic gates, in our opinion the learning process should also be performed at the level of logic gates. This level is more natural than the level of arithmetic operations of ANNs or Fuzzy Logic functions, or that of the switching-transistor sequences responsible for routing connection paths. This is the second main point of our philosophy.
3. Once we decide to realize the network using logic gates in FPGAs, we should reuse all the powerful EDA (Electronic Design Automation) tools that engineers have developed over many years in the area of digital design automation, especially for reconfigurable computers: state machines, logic synthesis, technology mapping, placement and routing, partitioning, timing analysis, etc. "Airplanes are not evolved and they do not fly like birds; they are constructed based on accumulated human knowledge and sophisticated mathematical algorithms." It seems nonsensical to try to teach the evolvable system to do everything that was done by hundreds of thousands of scientists and engineers in the VLSI and microprocessor industry. Do we believe that Intel's Pentium chip can be evolved from examples like an ANN? A neural net uses multiplications; can even a 16-bit Booth multiplier be evolved? Cellular automata-based realizations need connection-routing algorithms; should we really try to evolve practically useful "physical design" algorithms, while excellent algorithms for placement, routing and partitioning already exist in commercial EDA tools and can be used? De Garis' group evolves ANNs realized as cellular automata for pattern recognition; while this approach demonstrates the power of evolutionary algorithms, is it the best approach to building pattern recognition hardware? And why realize the genetic algorithm in hardware rather than Tabu Search, Simulated Annealing or any other general problem-solving mechanism?
4. The Occam's Razor principle should be used whenever possible, because only it can lead to meaningful discoveries.
In conclusion, we do not believe that the "purist strategies" of evolutionary hardware, such as those of de Garis and the Brain Builder Group [17, 3, 16, 18], will be practically acceptable for most commercial applications of Learning Hardware.
Therefore, we propose here principles of Learning Hardware that use previous human problem-solving experience and apply many mathematical algorithms and problem-solving strategies, rather than relying on only two generic methods of Evolvable Hardware: ANNs and GA. We also believe that all methods that exist in VLSI design, and especially the powerful CAD and EDA tools, should be reused in their entirety rather than duplicated by naive low-level evolutionary algorithms. Learning/evolution should still remain the main principle of building new-generation hardware, but it should be restricted to high abstraction levels. The evaluation/selection of variants should also be performed at abstract levels, before mapping to low-level field-programmable resources such as switches, for which chromosomes are long and the operation of GA is inefficient.
Our Learning Hardware approach is thus oriented towards modern FPGA technologies and similar technologies that can be predicted within a short time horizon, and it is not necessarily the best approach for future realization technologies of learning networks. The Learning Hardware methodology that we propose can be summarized as follows:
1. Based on sets of examples classified into several (at least two) categories and on various network requirements (background knowledge), the hardware processors create the logic network description using logic/mathematical algorithms. This network can have two, three, or an arbitrary number of levels, and either binary or multi-valued variables (attributes, signals). It can use simple gates such as ANDs, ORs and EXORs, or complex gates such as arbitrary 4-input, 2-output lookup tables. In ULM, to realize (construct, design, learn, evolve) the network we use hardware realizations of well-known logic synthesis algorithms such as two-level AND/OR minimizers [29], two-level AND/EXOR minimizers [49], three-level OR/AND/OR minimizers [34], and functional (Ashenhurst-Curtis) decomposers [36].
2. The (quasi-)optimally constructed network is mapped to standard FPGAs and realized using standard partitioning, placement and routing, and other EDA tools from Xilinx and commercial EDA software companies.
3. The knowledge of the machine is stored in memory patterns representing logic nets. Under supervision of the software program in the main processor, the hardware multiplexes between various learned nets, depending on rules that can also be acquired automatically. This phase is therefore similar to the CAM-Brain Machine (CBM) approach [18].
4. As the network solves new problems, the new data sets and training decisions are accumulated and the network is automatically and repeatedly redesigned. The old network can serve as a redesign plan for the new network, or the net can be "redesigned from scratch" to avoid any bias.
Thus, we replace the process of evolving on all design levels used in EHW with the ULM model of learning at a high level and then compiling to a low level using standard FPGA-based tools. This can be used in approaches A3-A5 above.
Observe also that the same physical FPGA resources are multiplexed to implement the virtual, human-designed "learning hardware" and the automatically learned "data hardware". While the "learning hardware" is designed once by humans and cannot be changed, the "data hardware" can be modified constantly.
We currently model our algorithms in software or implement them on a prototype reconfigurable platform from DEC, the DEC-PERLE-1 board. We consider the ULM to be an early prototype of Data Mining machines that will some day be able to collect data from online databases, for instance from WWW pages and the Internet. Other variants of such machines will acquire data in real time from industrial, agricultural, military, or other application areas, through sensors, microphones and TV cameras, using Image Processing and Digital Signal Processing pre-processing techniques. In contrast to similar projects, our goal is not to build the Artificial Brain [3, 17], a superintelligent robot-pet, or a model of instinctual animal behavior, but rather to develop a system able to make meaningful discoveries in narrowly defined areas, thus speeding up both the learning and execution phases of the application software programs now used in Machine Learning, Knowledge Discovery from Databases, Data Mining, and robotics.

                                    | Brain Builder                   | Universal Logic Machine
Model of learning                   | Artificial                      | Multi-Valued Logic Language
How the net is constructed          | Genetic Algorithm, ANN Training | Multi-Valued Logic Synthesis, Constructive Induction, Rough Set Theory
Virtual intermediate representation | Cellular Automata               | MV logic networks and state machines with arbitrary structures and arbitrary operators realized as look-up tables
What is learned                     | automata tables                 | language expressions
Construction                        | hardware                        | hardware and software
Mapped to                           | array of binary FPGAs,          | array of binary FPGAs,
                                    | Xilinx 6000 series, CBM         | Xilinx 3090 + on-board memory, DEC-PERLE-1 board + DEC workstation
Table 1: Brain Builder versus Universal Logic Machine.
In this respect, we would consider our project successful if the system were able to solve the following problems in a few seconds, with an error as small as the learning error of our current software:
1. Every data set from the U.C. Irvine Repository of Data Mining benchmarks [50], as well as our benchmarks [41], including all versions of large examples such as Breast Cancer I and Michalski's Trains with 30 trains [19]. These now take up to 30 minutes in software.
2. The recognition of cervical mucus ferning microscope images for ovulation prediction [32]. This now takes minutes in software.
3. The recognition of three-dimensional images of rooms and corridors for mobile robot orientation [47]. This task now takes up to 17 minutes in software.
Table 1 compares the approaches of Brain Builder's CAM-Brain Machine (CBM) and the Universal Logic Machine.
IV. MULTI-VALUED LOGIC LANGUAGE TO REPRESENT THE LEARNING DATA IN HARDWARE

Because we want the system to learn at a higher level than that of elementary gates and their connections, we first need to develop a higher-level language in which expressions, the virtual nets, will be automatically created, evaluated, selected and optimized, and then realized as hardware FPGA nets by top-down automatic design methods. Since in the learning phase we want to operate on elements of this language in hardware, our choices are limited to languages whose operations are easily realizable in hardware.
Several such languages have been created in the past, mostly for applications in Logic Synthesis, Automatic Theorem Proving, Database Theory, and Information Engineering, and we have adopted some of them for hardware representation. They include binary and Multi-Valued Cube Calculus [8, 33], Decision Tables, Rough Sets [30], Rough Partitions [22, 23], Labeled Rough Partitions [19], Binary [5] and Multi-Valued Decision Diagrams [46]. Observe that in ML, DM and KDD the functions are very strongly incompletely specified (99% or more of don't cares). Although Decision Diagrams nowadays seem to be the most successful representation of discrete data, and their applications span the whole spectrum of modern Computer Science, we (and other researchers) have so far not been able to find good hardware architectures to process them efficiently. Therefore we have restricted our attention to the tabular representation of data [6], which has found applications in Logic Synthesis, state machines, automatic theorem proving, database theory, Rough Sets, and pattern recognition. An example of such a two-dimensional tabular representation is shown in Table 2.
      X1    X2   Y1    Y2
a     0,2   1    -     2
b     0,1   0    0,2   1
c     2     0    1,2   0
d     1     1    1,2   2
Table 2: MV multi-output relation.
Rows correspond to objects a, b, c and d, and columns to the input variables (attributes) X1 and X2 and the output variables Y1 and Y2. Symbol Y1 denotes a relation output: inputs X1 and X2 together with output Y1 specify an (oriented) relation. A relation can be used to express such facts as: this color is red or white but not yellow or black. Symbol Y2 denotes a function output, Y2(X1, X2). Rows c and d have only one value for each attribute, so they are minterms. Rows a and b have more than one value for some attribute, so they are cubes. Each row can be thought of as a record from a database, or a set of records, or a collection of image features after image preprocessing. The dash in Y1 is a standard don't care; it corresponds to any possible value of this variable (or to all possible values of this variable in another interpretation of cubes). All other entries in column Y1 are called generalized don't cares; they correspond to subsets of the possible values of this (ternary) variable. Rows and columns of such a table can be compared partially in parallel, which serves to find certain patterns in the data. Such patterns can be used to generate prime implicants, decompose functions, find a "bound set" or "free set" of variables for decomposition [36], remove redundant variables, find essential variables, find essential implicants, etc. Finding and analyzing patterns in such tables is a subject of Rough Set Theory [30], Logic Synthesis [8, 46], and Database Theory [6]. Many similar or competing algorithms for the same task have been developed independently in these areas for the tabular data model. The basic operations of these algorithms remove rows or columns, copy and modify them, merge rows or columns, etc.
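Row comparisons of this kind map naturally onto bit operations. The Python sketch below encodes the rows of Table 2 in a positional (one-hot) form, one bitmask per variable with bit i set when value i is allowed, then identifies minterms and looks for conflicting rows, i.e., rows whose input parts intersect while their allowed output values are disjoint. The domain sizes (X2 binary; X1, Y1 and Y2 ternary) and all names are our assumptions for illustration; a hardware realization would perform these ANDs on all fields in parallel.

```python
# Domain sizes are assumptions: X2 is taken as binary, the other variables as ternary.
DOMAINS = {"X1": 3, "X2": 2, "Y1": 3, "Y2": 3}
INPUTS = ("X1", "X2")


def mask(values, size: int) -> int:
    """Positional (one-hot) field: bit i is set when value i is allowed.
    None stands for the dash, i.e. all values allowed."""
    if values is None:
        return (1 << size) - 1
    field = 0
    for v in values:
        if not 0 <= v < size:
            raise ValueError(f"value {v} outside a domain of size {size}")
        field |= 1 << v
    return field


def encode(row: dict) -> dict:
    """Encode one table row as a dict of per-variable bitmasks."""
    return {var: mask(row.get(var), size) for var, size in DOMAINS.items()}


# Rows of Table 2; Y1 = None is the standard don't care (dash) of row a.
TABLE2 = {
    "a": encode({"X1": [0, 2], "X2": [1], "Y1": None, "Y2": [2]}),
    "b": encode({"X1": [0, 1], "X2": [0], "Y1": [0, 2], "Y2": [1]}),
    "c": encode({"X1": [2], "X2": [0], "Y1": [1, 2], "Y2": [0]}),
    "d": encode({"X1": [1], "X2": [1], "Y1": [1, 2], "Y2": [2]}),
}


def is_minterm(cube: dict) -> bool:
    """A row is a minterm when every input field has exactly one bit set."""
    return all(bin(cube[v]).count("1") == 1 for v in INPUTS)


def inputs_intersect(c1: dict, c2: dict) -> bool:
    """Two rows overlap when every input field shares at least one value."""
    return all(c1[v] & c2[v] for v in INPUTS)


def conflicts(table: dict, output: str = "Y2") -> list:
    """Pairs of rows that overlap on inputs but allow no common output value."""
    names = sorted(table)
    return [(p, q) for i, p in enumerate(names) for q in names[i + 1:]
            if inputs_intersect(table[p], table[q])
            and not table[p][output] & table[q][output]]


if __name__ == "__main__":
    print({name: is_minterm(cube) for name, cube in TABLE2.items()})
    print("conflicts on Y2:", conflicts(TABLE2), "on Y1:", conflicts(TABLE2, "Y1"))
```

For Table 2 this reports rows c and d as minterms, in agreement with the text, and finds no conflicting pairs, since every pair of rows is disjoint in X1 or in X2.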
The main advantage of the two-dimensional model
is that it can be partitioned regularly into smaller
parts. For instance, smaller tables can be extracted
as scannable windows in the big table, similar to what
is done in convolution-based algorithms of Image
Processing and Digital Signal Processing. Most often,
however, the two-dimensional representations are
partitioned into one-dimensional representations.
This can be done vertically or horizontally.
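Both kinds of partitioning are simple index manipulations. The following hypothetical Python sketch (the table and the function names are made up for illustration) cuts a small table horizontally into rows, vertically into columns, and into scannable h x w windows.

```python
from typing import List

Table = List[List[int]]

def horizontal(table: Table) -> Table:
    """Row-wise partition: one one-dimensional string per object (record)."""
    return [list(row) for row in table]

def vertical(table: Table) -> Table:
    """Column-wise partition: one one-dimensional string per attribute."""
    return [list(col) for col in zip(*table)]

def windows(table: Table, h: int, w: int) -> List[Table]:
    """All h x w sub-tables, scanned left to right, then top to bottom."""
    rows, cols = len(table), len(table[0])
    return [[row[c:c + w] for row in table[r:r + h]]
            for r in range(rows - h + 1)
            for c in range(cols - w + 1)]

t = [[0, 1, 2],
     [1, 1, 0]]
print(vertical(t))        # [[0, 1], [1, 1], [2, 0]]
print(windows(t, 2, 2))   # [[[0, 1], [1, 1]], [[1, 2], [1, 0]]]
```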
The main advantage of one-dimensional representations
is that they can be efficiently processed in
one-dimensional cellular automata, systolic, ping-pong,
SIMD and pipelined hardware architectures. Regularity
is the key to success in the "Array of FPGAs"
environment, where routing long connections is the main
design bottleneck. This environment is similar to
Cellular Automata (CA), but does not require all
automata to be the same or to be entirely regularly
connected. This model is thus more flexible than CAs,
which are a very restricted design environment.
In addition, these one-dimensional representations
resemble chromosomes in Genetic Algorithms, which
allows them to be used in evolutionary computations.
The two-dimensional representations composed
of one-dimensional strings include the following.
Standard Binary Cube Calculus (CC) of Roth,
Karp, and Dietmeyer [8, 33]. It represents product
terms as cubes where the state of each input variable
is specified by a symbol: positive (1), negative (0),
non-existing (a don't care) (X), or contradictory (ε).
Each of these symbols is encoded in positional notation
with two bits as follows: 1 = 01, 0 = 10, X = 11,
ε = 00. For instance, the positional notation for cube
0X1 is 10-11-01. Dashes have no hardware meaning; they
only help the reader to separate the variables visually.
Thus, each position represents a state of the variable
by the presence of a "one" in it: left bit - value 0,
right bit - value 1.
This encoding therefore reduces cube operations to
simple set-theoretical ones. A cube can represent a
product, a sum, a set of symmetry coefficients of a
symmetric function, a spectrum of the function, or
another piece of data on which some symbol-manipulation
operations (usually set-theoretical) [8, 33] are
executed. Usually the cube corresponds to a product
term of literals. For instance, assume the following
order of binary variables: age, sex and color_of_hair.
Assume also that the discretization of variable age is:
age = 0 for a person's age < 18 and age = 1 otherwise.
Men are encoded by value 0 of attribute sex and women
by value 1. color_of_hair is 0 for blond and 1 for
black. Then a blond woman of age 19 is denoted by 110,
and a black-haired seven-year-old person of unknown sex
is described by cube 0X1. Cube XXX is the set of all
possible people for the selected set of attribute
variables and their discretized values. A
two-dimensional representation is just a set of cubes
where the connecting operator is implicitly understood
as: OR for SOP; EXOR for ESOP; concatenation for a
spectrum, or other. For instance, assuming that each
cube corresponds to an AND operator and that OR is the
connecting operator, the list {0X1, 110} is the SOP
which represents the two people mentioned above (or the
set of all people with these properties). Multi-valued
and integer data can be encoded with binary strings in
this representation, so that all subsequent operations
are executed in binary (we use this model in the
decomposition machine). For instance, if there were
three age categories, young, medium and old, they could
be encoded as values 0, 1 and 2 of the ternary variable
age, respectively. Variable age could then be
represented in hardware as a pair of binary variables
age1 and age2, where 0 = 00, 1 = 01, 2 = 10, thus
encoding: young = $\overline{age_1}\,\overline{age_2}$,
medium = $\overline{age_1}\,age_2$,
old = $age_1\,\overline{age_2}$. Recall that the basic
hardware operations in CC are executed only on binary
variables.
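The positional notation turns cube intersection into a bitwise AND, which is what makes CC attractive for hardware. The sketch below is illustrative Python, not the CCM microcode; it uses the variable order age, sex, color_of_hair from the example, and the letter "e" stands in for the contradiction symbol ε. A result containing "e" denotes the empty cube.

```python
# Positional notation of binary cube calculus: 0 -> 10, 1 -> 01, X -> 11;
# 00 marks a contradictory variable (written "e" for epsilon below).
CODE = {"0": 0b10, "1": 0b01, "X": 0b11}
SYMBOL = {0b10: "0", 0b01: "1", 0b11: "X", 0b00: "e"}

def encode(cube: str) -> list[int]:
    """One 2-bit field per variable, e.g. '0X1' -> [0b10, 0b11, 0b01]."""
    return [CODE[s] for s in cube]

def to_positional(cube: str) -> str:
    """Render the fields with dashes, as in the text: '0X1' -> '10-11-01'."""
    return "-".join(format(field, "02b") for field in encode(cube))

def intersect(c1: str, c2: str) -> str:
    """Cube intersection: bitwise AND of corresponding 2-bit fields."""
    return "".join(SYMBOL[a & b] for a, b in zip(encode(c1), encode(c2)))

def is_empty(cube: str) -> bool:
    """A cube with at least one contradictory variable denotes no object."""
    return "e" in cube

print(to_positional("0X1"))      # 10-11-01
print(intersect("X1X", "0X1"))   # 011
print(intersect("0X1", "110"))   # e1e
```

Testing whether a person belongs to an SOP such as {0X1, 110} then amounts to intersecting the person's minterm with each cube and checking for a non-empty result.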
Multi-Valued Cube Calculus (MVCC) [48, 33].
This is a superset of CC. It represents product terms
as cubes where each input variable can have as its
value any subset of the finite set of its possible
values. Each element of the set is represented by a
single bit, which makes this representation inefficient
for large sets of values. In the above example we could
have, for instance, a 5-valued variable age for five
age categories, and a quaternary variable color_of_hair.
Each position of a variable corresponds to one of its
possible values. For instance, 10000-10-0100 describes
a 7-year-old boy with black hair. This is an example of
a minterm cube, i.e., a cube with a single value in
each variable. 01100-11-1100 describes a group G1 of
people, men and women, who are either in the second or
in the third age category and have either blond or
black hair. This is an example of a cube that is not a
minterm. 10000-00-1000 describes a first-age-category
person with blond hair who has some conflicting
information in the sex attribute, for instance a
missing value (this is also how contradictions are
signaled during cube calculus calculations [33]).
The hardware operations in MVCC are done directly on
such MV cubes, so that a separate encoding into binary
variables is not necessary.
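With the field widths of the example (5 age categories, 2 values of sex, 4 hair colors), the three cubes above can be checked mechanically. The following Python sketch is illustrative only; DOMAINS and the function names are assumptions of the example, and an all-zero field is treated as the conflict marker described above.

```python
# MVCC cubes in positional notation: one bit per possible value, left to right.
# Assumed field widths from the example: age (5), sex (2), color_of_hair (4).
DOMAINS = (5, 2, 4)

def parse(cube: str) -> list[set[int]]:
    """Turn '10000-10-0100' into the list of value sets [{0}, {0}, {1}]."""
    fields = cube.split("-")
    assert tuple(len(f) for f in fields) == DOMAINS, "field widths must match DOMAINS"
    return [{i for i, bit in enumerate(f) if bit == "1"} for f in fields]

def is_minterm(cube: str) -> bool:
    return all(len(values) == 1 for values in parse(cube))

def is_contradictory(cube: str) -> bool:
    """An all-zero field allows no value: a conflict (or missing value)."""
    return any(len(values) == 0 for values in parse(cube))

def intersect(c1: str, c2: str) -> str:
    """Bitwise AND per field, the same operation as in binary cube calculus."""
    return "-".join("".join("1" if a == b == "1" else "0" for a, b in zip(f1, f2))
                    for f1, f2 in zip(c1.split("-"), c2.split("-")))

boy = "10000-10-0100"      # 7-year-old boy with black hair
group = "01100-11-1100"    # group G1 from the text
print(is_minterm(boy), is_minterm(group))   # True False
print(is_contradictory("10000-00-1000"))    # True
print(intersect(boy, group))                # 00000-10-0100 (age field empty)
```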
Generalized MV Cube Calculus (GMVCC)
[33]. This is a superset of MVCC. It has cubes where
each output variable can also be a subset of values.
Such cubes can be directly used to represent MV
relations, as in Table 2. Its operations are more
general than those of MVCC, because more
interpretations can be given to cubes. This calculus
has more descriptive power, but the respective hardware
processors are much more complicated.
Simplified Binary Cube Calculus (SBCC). This
is a subset of CC. It operates only on minterms and
has applications in the decomposition of functions.
The hardware of this machine is much simplified: its
operations are only set-theoretical. This is the
simplest virtual machine we have realized, so it can
process larger data, because more of the machine fits
into the limited FPGA array resources of DEC-PERLE-1.
Simplified MV Cube Calculus. It has cubes
where, for every input variable, either only a single
one of its possible values is selected (denoted by a
binary code, such as a byte, of the symbol
corresponding to this value), the variable is missing
(denoted by a selected symbol, X), or the variable is
contradictory (denoted by another symbol, ∅).
This representation is used for Rough Sets [30] and
variable-valued logic [26]. For instance, assuming 10
age categories, 0 = 0-9 years, 1 = 10-19 years,
2 = 20-29 years, etc., and 3 hair categories:
0 = blond, 1 = black, 2 = red, the 7-year-old boy with
black hair is described as 0-0-1, the 18-year-old girl
with black hair as 1-1-1, the 28-year-old woman with
red hair as 2-1-2, and the set of all people with red
hair as X-X-2. There is now no way to describe in one
cube the people under 20 with red or black hair, which
was possible in MVCC or GMVCC. This simplification of
the language, however, brings a big speedup of
algorithms and a storage reduction when applied to data
with many attribute values. At the same time, the
control of the algorithms becomes more complicated,
while the data path is simplified.
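The loss of expressive power can be seen in a small Python sketch. It is illustrative only: the value codes follow the example above, the contradiction symbol is not modelled, and the function name matches is hypothetical.

```python
# Simplified MV cubes: each position holds one value code or "X" (variable missing).
# Order: age category, sex (0 = male, 1 = female), hair (0 = blond, 1 = black, 2 = red).
X = "X"

def matches(cube: tuple, record: tuple) -> bool:
    """A record is covered when every position equals the cube's value or the cube has X."""
    return all(c == X or c == r for c, r in zip(cube, record))

people = {
    "boy, 7, black hair": (0, 0, 1),
    "girl, 18, black hair": (1, 1, 1),
    "woman, 28, red hair": (2, 1, 2),
}
red_hair = (X, X, 2)
red_haired_people = [name for name, rec in people.items() if matches(red_hair, rec)]
print(red_haired_people)   # ['woman, 28, red hair']

# Age category 0 or 1 with red or black hair: a single cube in MVCC, but here
# one cube is needed for each combination of single values.
cover = [(age, X, hair) for age in (0, 1) for hair in (1, 2)]
print(len(cover))          # 4
print([name for name, rec in people.items() if any(matches(c, rec) for c in cover)])
# ['boy, 7, black hair', 'girl, 18, black hair']
```

The matching itself is only equality tests on fixed-width codes, which is why the data path becomes simpler while the control has to iterate over more cubes.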
Spectral Representations. Examples: Reed-Muller
FPRM and GRM spectra [11, 9], the Walsh spectrum [12],
and various orthogonal spectra. These representations
describe a function as a sequence of spectral
coefficients, or as selected coefficient values
together with their indices. Some spectral
representations are useful to represent data for
genetic algorithms: the sequence of spectral
coefficients is a chromosome. For instance, in the
Fixed-Polarity Reed-Muller (FPRM) canonical AND/EXOR
forms for n variables, every variable can have two
polarities, 0 and 1. Thus there are $2^n$ different
polarities for a function, and the GA has to search
for the polarity that has the minimum number of ones
in the chromosome. This way, every solution is
correct, and the fitness function is used only to
evaluate the cost of the design (100% correctness of
the circuit is in general very difficult to achieve
with a GA [11, 9, 10]). Therefore, our GA-based
approaches to logic synthesis use a representation
that guarantees 100% correctness and let the GA
search only for net minimization. This approach,
however, involves a fitness function that is more
difficult to calculate in hardware than in the pure GA
or Genetic Programming approaches [11, 9, 10]. Let us
observe that evaluating the fitness function is much
more difficult to realize in hardware than all other
operations of the GA combined, and logic
transformations are necessary to achieve it.
Similarly, the other AND/EXOR canonical form, called
the Generalized Reed-Muller form (GRM), has
$n \cdot 2^{n-1}$ binary coefficients, so there are
$2^{n \cdot 2^{n-1}}$ different GRM forms. Because
there are more GRM forms, it is more probable to find
a shorter form among them than among the FPRM forms
[11, 9]. But the chromosomes are much longer and the
evaluation is more difficult. Trade-offs of this kind
are quite common in spectral representations. Spectral
methods allow for a high degree of parallelism.
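The fitness function for the FPRM chromosome is the number of ones in the spectrum. The Python sketch below computes FPRM spectra with the standard Reed-Muller butterfly over GF(2) and, instead of a GA, simply tries all 2^n polarities, which is only practical for small n and is used here purely to illustrate the fitness evaluation. The function names and the truth-table convention (bit i of the index is the value of variable i) are assumptions of the sketch.

```python
def fprm_spectrum(truth_table: list[int], n: int, polarity: int) -> list[int]:
    """FPRM coefficients of f; bit i of `polarity` set means variable i is complemented.

    truth_table[x] is f(x), where bit i of the index x is the value of variable i.
    """
    # Complementing variables permutes the truth table: g(x) = f(x XOR polarity).
    coeffs = [truth_table[x ^ polarity] for x in range(1 << n)]
    # Reed-Muller (positive Davio) butterfly over GF(2), one variable at a time.
    for i in range(n):
        for x in range(1 << n):
            if x & (1 << i):
                coeffs[x] ^= coeffs[x ^ (1 << i)]
    return coeffs

def best_polarity(truth_table: list[int], n: int) -> tuple[int, int]:
    """Exhaustive search: (polarity, number of ones) with the fewest product terms."""
    return min(((p, sum(fprm_spectrum(truth_table, n, p))) for p in range(1 << n)),
               key=lambda pc: pc[1])

f_or = [0, 1, 1, 1]                 # f = x0 OR x1
print(fprm_spectrum(f_or, 2, 0))    # [0, 1, 1, 1]: x0 XOR x1 XOR x0*x1
print(best_polarity(f_or, 2))       # (3, 2): 1 XOR x0'*x1'
```

A GA would replace the exhaustive loop in best_polarity by evolving polarity vectors, while keeping the same fitness evaluation.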
Rough Partitions (RP) represented as Bit Sets
[22, 23]. This representation stores the
two-dimensional table column-wise, and not row-wise as
MVCC does. In an r-partition, every variable (a column
of the table) induces a partition of the set of rows
(cubes) into blocks, one block for each value the
variable can take (there are two blocks for a binary
variable, and k blocks for a k-valued variable). Rough
Partitions are an interesting and novel idea, but they
do not really form a representation of a function.
Since the values of a variable are not stored together
with the partition blocks, essential information about
the function is lost, and the original data cannot be
recovered from it. An r-partition is rather a kind of
abstraction of a function, useful for instance in
various decomposition algorithms.
Labeled Rough Partitions (LRP) represented as
Bit Sets [19]. This is a new representation (a
generalization of RP) which has very interesting
properties and allows different kinds of patterns in
data to be found. It is useful for decomposition of MV
relations, and it preserves all information about the
relation or function. It can also be made canonical
when created for special cubes. Most of its operations
are reduced to set-theoretical operations, so hardware
realization is relatively easy. Relations occur in
tables created from real databases and from image
features; for instance, the MV relations hayes, flare1
and flare2 are benchmarks from [50]. An example of an
application of relations in the logic synthesis area is
a modulo-3 counter (a non-deterministic state machine
is a special case of a multiple-valued, multi-output
relation) that counts in the sequence
s0 → s1 → s2 → s0; if state s3 happens to be the
initial state of the counter, the counter should
transit to any of the states s0, s1, s2, but not to
state s3 itself.
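A relation of this kind can be held as a map from each present state to the set of permitted next states. The Python sketch below is a hypothetical illustration: it checks that the counter is a relation rather than a function and enumerates the deterministic next-state functions compatible with it, among which a synthesis tool would pick the cheapest one.

```python
from itertools import product

# Each present state maps to the set of permitted next states.
NEXT = {
    "s0": {"s1"},
    "s1": {"s2"},
    "s2": {"s0"},
    "s3": {"s0", "s1", "s2"},   # any state except s3 itself
}

def is_function(rel: dict) -> bool:
    """A relation is a function when every input has exactly one permitted output."""
    return all(len(outs) == 1 for outs in rel.values())

def compatible_functions(rel: dict) -> list[dict]:
    """Every deterministic next-state function that satisfies the relation."""
    states = sorted(rel)
    return [dict(zip(states, choice))
            for choice in product(*(sorted(rel[s]) for s in states))]

print(is_function(NEXT))                 # False
print(len(compatible_functions(NEXT)))   # 3
```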
Generalized values for input variables are already
known from cube calculus, but generalized values for
output variables are a new concept, which allows
relations to be represented and manipulated in LRP.
Definition 1. Separation of the elements of a
nonempty set $S$ into nonempty subsets $S_i$,
$\bigcup_i S_i = S$, is called a rough partition
(r-partition) of $S$.
Notice that the definition of a rough partition
allows subsets $S_i$ to overlap.
Definition 2. [relation]. Let $S_1$ and $S_2$ be sets.
A relation $R$ from $S_1$ to $S_2$ is a subset of the
Cartesian product $S_1 \times S_2$. A relation $R$ on
$S_1$ is a subset of $S_1 \times S_1$.
A function is a special case of a relation $R$ from
$S_1$ to $S_2$ where every element $s_1 \in S_1$ is the
first member of precisely one ordered pair
$(s_1, s_2) \in R$.
Definition 3. [labeled partition block] Let $C(X)$ be
a set of MV cubes, and let relation $R_k$ be defined by
a cube $c_k(X_1)$, $X_1 \subseteq X$, as follows:
$c_i(X)\,R_k\,c_j(X)$ iff $c_k(X_1) \subseteq c_i(X_1)$
and $c_k(X_1) \subseteq c_j(X_1)$, where $c_k(X_1)$ is
given and $c_i(X), c_j(X) \in C(X)$. The set of all
cubes $c_i(X)$ being in relation $R_k$ to each other
and labeled by the cube $c_k(X_1)$ will be called a
labeled partition block and denoted by $B_{c_k(X_1)}$.
The cube $c_k(X_1)$ will be called a block label.
Every cube in $C(X)$ can be enumerated with a different
symbol (an integer number in particular); consequently,
a partition block can be represented by a set of
symbols. The label added to the partition block
establishes a correspondence between the set of
symbols in the partition block and the cubes in $C(X)$.
In Table 2 we have $X = \{x_1, x_2\}$ and
$Y = \{y_1, y_2\}$ (the variables of columns X1, X2, Y1
and Y2), and $C(X \cup Y) = \{a, b, c, d\}$, where a,
b, c, d are symbols denoting cubes $c_1(X \cup Y)$,
$c_2(X \cup Y)$, $c_3(X \cup Y)$, and $c_4(X \cup Y)$,
respectively. Let $X_1 = \{x_1\}$. Then cube
$c_1(X_1) = \{\{0\}\}$ defines relation
$R_1 = \{a, b\}$, cube $c_2(X_1) = \{\{1\}\}$ defines
relation $R_2 = \{b, d\}$, and cube
$c_3(X_1) = \{\{2\}\}$ defines relation
$R_3 = \{a, c\}$. The corresponding labeled partition
blocks are $\{a, b\}_0$, $\{b, d\}_1$, and $\{a, c\}_2$.
Definition 4. [labeled rough partition]. The
collection of nonempty labeled partition blocks
$B_{c_k(X_1)}$ forming a rough partition of the set
$C(X)$, $X_1 \subseteq X$, will be called a labeled
rough partition (lr-partition) and denoted by
$P(X_1) = \{B_{c_k(X_1)}\}$.
Notice that every lr-partition $P(X_i)$,
$X_i \subseteq X$, forms a cover of the set of symbols
enumerating the cubes in $C(X)$.
Given the example from Table 2, we have
$P(X_1) = P(x_1) = \{\{a, b\}_0, \{b, d\}_1, \{a, c\}_2\}_{x_1}$.
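Storing each block as a bit set, with bit k standing for the k-th cube, reduces block operations to word-wide logic. The Python sketch below rebuilds P(x1) for Table 2 under that assumption; it handles only labels that are single values (the definition allows any cube as a label), and the helper names are hypothetical.

```python
# Labeled rough partition P(x1) of Table 2; bit k of a block stands for CUBES[k].
CUBES = ["a", "b", "c", "d"]
X1 = {"a": {0, 2}, "b": {0, 1}, "c": {2}, "d": {1}}   # column X1 of Table 2

def lr_partition(column: dict, cubes: list[str]) -> dict[int, int]:
    """Map each label (a single value) to the bit set of cubes whose value set contains it."""
    blocks: dict[int, int] = {}
    for k, name in enumerate(cubes):
        for value in column[name]:
            blocks[value] = blocks.get(value, 0) | (1 << k)
    return blocks   # only values that occur get a block, so no block is empty

def members(bits: int, cubes: list[str]) -> set[str]:
    return {name for k, name in enumerate(cubes) if bits >> k & 1}

P_x1 = lr_partition(X1, CUBES)
print({label: sorted(members(b, CUBES)) for label, b in sorted(P_x1.items())})
# {0: ['a', 'b'], 1: ['b', 'd'], 2: ['a', 'c']}
```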
Definition 5. [labeled partition block product].
The product of two labeled partition blocks
$B_{c_i(X_1)}$ and $B_{c_j(X_2)}$ is the labeled
partition block $B_{c_k(X_3)}$, which is the
intersection of the partition blocks,
$B_{c_k(X_3)} = B_{c_i(X_1)} \cap B_{c_j(X_2)}$, and
whose label $c_k(X_3)$ is the concatenation of the
labels $c_i(X_1)$ and $c_j(X_2)$.
Definition 6. [lr-partition product]. The product
$P(X_1)P(X_2)$ of lr-partitions $P(X_1)$ and $P(X_2)$
of a set of cubes $C(X)$, $X_1, X_2 \subseteq X$, is
the lr-partition $P(X_3)$, $X_3 = X_1 \cup X_2$, whose
blocks are the nonempty products of the blocks of
$P(X_1)$ and $P(X_2)$.
Theorem. For any set of cubes $C(X)$ and any set
of subsets $X_i$ of $X$, $P(\bigcup_i X_i) = \prod_i P(X_i)$.
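The product and the Theorem can be checked on Table 2. In the Python sketch below, blocks are bit sets, so block intersection is a single AND; labels are tuples of single values combined by concatenation, which is our reading of Definition 5. Column X2 is taken from Table 2 with X2 = 1 for row a, an assumption about the extracted table. All names are hypothetical.

```python
from itertools import product as cartesian

CUBES = ["a", "b", "c", "d"]
X1 = {"a": {0, 2}, "b": {0, 1}, "c": {2}, "d": {1}}
X2 = {"a": {1}, "b": {0}, "c": {0}, "d": {1}}   # assumption: row a of Table 2 has X2 = 1

def lr_partition(*columns):
    """lr-partition over one or more columns; labels are tuples of single values,
    blocks are bit sets (bit k stands for CUBES[k])."""
    blocks = {}
    for k, name in enumerate(CUBES):
        for label in cartesian(*(sorted(col[name]) for col in columns)):
            blocks[label] = blocks.get(label, 0) | (1 << k)
    return blocks

def lr_product(p1, p2):
    """Definition 6: intersect every pair of blocks, keep the non-empty ones,
    and concatenate the labels."""
    return {l1 + l2: b1 & b2
            for l1, b1 in p1.items() for l2, b2 in p2.items() if b1 & b2}

def show(p):
    return {label: "".join(n for k, n in enumerate(CUBES) if bits >> k & 1)
            for label, bits in sorted(p.items())}

P12 = lr_product(lr_partition(X1), lr_partition(X2))
print(show(P12))
# {(0, 0): 'b', (0, 1): 'a', (1, 0): 'b', (1, 1): 'd', (2, 0): 'c', (2, 1): 'a'}
print(P12 == lr_partition(X1, X2))   # True, as the Theorem predicts for this table
```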
Characteristics of the lr-partition representation can
be summarized as follows:
1.  Multiple values of both input and output variables
    can be easily represented. This is especially
    important in ML and complex Finite State Machine
    (FSM) controller optimization applications, to
    express uncertainty in the choice of a variable's
    value.
2.  It can easily handle situations where a variable
    is not present in a given cube (Michalski's train
    benchmark [26] and "~" in Espresso format).
3.  By selection of sets $X_i$ and $Y_j$, lr-partitions
    can be dynamically adjusted to a given type of data
    (completely vs. incompletely specified, many cubes
    vs. few cubes) to minimize memory requirements (see
    Theorem 1).
4.  lr-partitions can be used for decomposition of
    large functions and relations. We implemented a
    decomposer of MV relations which can decompose
    large functions and relations from the ML and
    controller domains. It was shown in [36] that this
    representation is not only compact but also allows
    for fast processing.
All these representations have certain advantages
and disadvantages, depending on the type of data
processed and the algorithm realized in hardware. For
instance, among our virtual machines, the Cube Calculus
Machine realizes Multi-Valued Cube Calculus, the
Decomposition Machine realizes the Simplified Binary
Cube Calculus, and the Rough Set Machine realizes the
Simplified MV Cube Calculus. This way, we will be able
to compare hardware realizations of various
representations and operations for the same or similar
tasks, to better understand the trade-offs between
generality and efficiency. We also illustrate the
concept of ULM with two kinds of approaches. The CCM
represents a general-purpose computer with a special
instruction set for MVCC [33]; it is microprogrammed in
a special language called CCM Assembly. The
Decomposition Machine is a processor for only one
application, functional decomposition. It is thus
completely hardwired and optimized just for this task,
to make it as efficient as possible.
V. DEC-PERLE-1 BOARD FOR FAST PROTOTYPING
In this section we present a minimal description of
our board, to give the reader some feeling for a fast
prototyping environment based on arrays of FPGAs, and
to show the difficulties that exist in creating
Learning Hardware. Digital's Paris Research Laboratory
developed its third-generation board, DEC-PERLE-1, in
1992. The overall structure of the DEC-PERLE-1 is shown
in Figure 1. The board is organized around a central
computational matrix made up of 16 Xilinx XC3090 LCAs
(Logic Cell Arrays; M00 to M15 in the figure),
surrounded by four 1 MB RAM banks and 7 other LCAs that
implement switching and control functions. The data
buses and their widths are also shown in the figure.
The user has to understand all programmable resources
of the board well; otherwise the logic design becomes
non-mappable to the FPGA wiring resources. Moreover,
the designer needs to take this architecture into
account from the very beginning of the hardware design,
rather than designing first and then trying to map.
Regularity is the key issue.
Computational matrix. The central computational
matrix is a 4 x 4 matrix of Xilinx XC3090 LCAs, which
are interconnected with each other and named LCA_M00 to
LCA_M15. This matrix can be used to develop any kind of
digital circuitry: data path, control unit and others,
but it is typically used to develop the data path of
the application. The interconnection resources between
the LCAs can be classified into the following three
categories: direct connections, buses, and rings.
Direct Connections. These wires connect the adjacent
sides of adjacent LCAs. The main purpose of direct
connections is to extend the internal regularity of
the LCA to the matrix level. The matrix can be seen as
a large FPGA with 64 x 80 Configurable Logic Blocks
(CLBs) (one XC3090 FPGA has 16 x 20 CLBs). Each LCA has
16 such wires on each side.
The FPGA matrix is shown in Figure 2, which shows
its regularity and the local and global connections.
It allows the reader to evaluate the complexity of
designs that can be practically implemented in
DEC-PERLE-1. The direct connections at the edges of the
FPGA matrix form four 64-bit-wide connections to
external connectors, which can be used to connect other
devices, for example another DEC-PERLE-1 board. Buses.
The horizontal or vertical wires connect the
corresponding side of all 4 LCAs in the same row or
column. They can thus efficiently distribute global
data in one direction, and are comparable to the
longline interconnection resources in the Xilinx
internal architecture. Each LCA has 16 such wires on
each side. According to their directions, these buses
are named the matrix North, East, South and West buses,
and are abbreviated MBusN, MBusE, MBusS and MBusW. Each
bus has 64 wires, which are connected to the switch on
the corresponding side of the FPGA matrix.
Rings. These wires connect all the matrix LCAs
and the two control LCAs. These connections are very
useful for distributing global control signals, since
they reach all the matrix LCAs. There are 10 such
wires. Note that because of their electrical loading
(they connect 18 LCAs: the 16 matrix LCAs and the two
control LCAs), these wires are slower than the buses
and should be used with care in high-performance
designs.
Figure 1: DEC-PERLE-1 architecture (diagram not reproduced; its surviving labels include Host, Adapter and FIFOs).
Switches and I/O buses. The FIFOs, RAM banks
and the central matrix are connected through two 32-bit
data buses and five programmable switches (FPGAs).
There is one matrix switch on each side of the matrix,
called North Switch (SWN), East Switch (SWE), South
Switch (SWS) and West Switch (SWW), respectively; each
connects the corresponding matrix data bus and the
corresponding RAM bank. These 4 switches (SWN, SWE,
SWS, SWW) are also connected to two 32-bit I/O buses,
called North-East Bus (DBusNE) and South-West Bus
(DBusSW) after the names of the switches they
respectively connect. The two I/O buses (DBusNE,
DBusSW) connect to the input and output FIFOs through
the fifth switch, called Fifo Switch (FSW), and also
connect to the corresponding controllers, called
North-East Controller (CNE) and South-West Controller
(CSW). As their names imply, the FPGAs CNE and CSW are
typically used to develop the controller of the
application, because they connect to all other parts of
the DEC-PERLE-1: FIFOs, memory banks and other FPGAs.
Control resource. The control resource is the
programmable resource that can be used to develop the
control part of the application, as opposed to the data
path part. The data path resources (matrix, RAM banks,
FIFOs and switches) need the following set of control
wires:
MATRIX RINGS: There are 10 matrix global wires.
RAM ADDRESS: Each RAM bank has an 18-bit-wide address
that specifies the word address of the current read or
write operation. Since our CCM design uses two of the
four memory banks, two addresses are used in our CCM
design.
RAM CONTROLS: Each RAM bank has 4 control signals.
SWITCH CONTROLS: Each pair of matrix switches (North
and East / South and West) has 10 control wires that
are the equivalent of the matrix rings and are called
the switch ring. The Fifo Switch has 6 control wires.
In addition, each of the matrix switches has 2
dedicated control wires.
FIFO CONTROLS: Each of the two FIFOs has one status
wire (empty flag for the input FIFO / full flag for the
output FIFO) and one control wire (read for the input
FIFO / write for the output FIFO).
TAGS: Four "tag" wires accompany the input data wires
of the input FIFO.
CLOCK CONTROL: The clock generator has two control
wires that can be driven by the application design.
LCBus: There is a 24-bit-wide communication path
between the board and the host, called LCBus.
All these control wires are connected to one or both of
the two controller LCAs (CNE, CSW). These two
controllers are identical, except that each of them
controls two of the four switches and memory banks.
Each controller also connects to the corresponding I/O
bus, so that it can communicate with the main data
path.
Figure 2: DEC-PERLE-1 matrix (diagram not reproduced). Legend: DCN, DCE, DCS and DCW: North/East/South/West matrix side to connectors; MDN, MDE, MDS and MDW: matrix North/East/South/West direct connections; MBN, MBE, MBS and MBW: North/East/South/West matrix buses.
Memory subsystem. DEC-PERLE-1 contains 4 MB of
high-speed static RAM organized in 4 banks of 256K
32-bit words (4 bytes per word). These banks are named
North, South, East and West after the matrix switch to
which they are connected. Each bank is completely
independent of the others and has its own data, address
and control signals:
DATA BUS: 32 data wires connect to the corresponding
matrix switch. They are represented by RamDataX, where
X is one of N, S, E, W, specifying the North, South,
East or West RAM bank, respectively.
ADDRESS BUS: 18 address wires
($1\,\text{MB} = 2^{20}$ bytes $= 2^{18}$ words
$\times$ 4 bytes) connect to the corresponding
controller. They are represented by RamAddrX.
CONTROL BUS: 4 active-low control signals, which
specify the read/write operation, connect to the
corresponding controller. RamReadX: read command.
RamWriteX: write command. RamDisLowX: disable lower
half-word (bits 0 to 15). RamDisHighX: disable upper
half-word (bits 16 to 31).
In our CCM design, we always read/write memory one
32-bit word at a time. Therefore, the signals
RamDisLowX and RamDisHighX are always set to 1 (not
asserted).
The basic read and write transactions both last one
clock cycle, and either may be performed at every
cycle.
Read memory. To read a particular word of 
memory, the word address (RamAddrX) must be 
presented and the read command (RamReadX) must 
be asserted at the beginning of a cycle; the data word 
read from memory will be available on the data wires 
at the end of the same cycle and may be latched on 
the next clock tick. A RAM bank can be seen as a 
combinational device when read. 
Write memory. To write a particular word of memory,
the word address (RamAddrX), the data (RamDataX) and
the write command (RamWriteX) must be asserted during
the same cycle; the word will have been written by the
end of the same cycle, and the address and the data may
be removed after the next clock tick. The reading or
writing of either half of the data word may be
independently disabled by asserting the corresponding
disable command (RamDisLowX or RamDisHighX) during the
transaction cycle. The memory system is clocked by the
clock1 signal.
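The transaction rules above amount to a simple behavioural model. The Python sketch below is hypothetical (the class RamBank and its method names are not part of the DEC-PERLE-1 software); it models one bank with 2^18 32-bit words, combinational reads, and writes in which the active-low RamDisLowX / RamDisHighX signals protect one half-word. Disabling half of a read is not modelled.

```python
ADDR_BITS = 18
WORDS = 1 << ADDR_BITS          # 262144 = 256K words
BYTES_PER_WORD = 4

class RamBank:
    """Behavioural model of one RAM bank (illustration only)."""

    def __init__(self) -> None:
        self.mem = [0] * WORDS

    def read(self, addr: int) -> int:
        # Combinational read: the word is available in the same cycle.
        return self.mem[addr]

    def write(self, addr: int, data: int, dis_low_n: int = 1, dis_high_n: int = 1) -> None:
        keep = 0
        if dis_low_n == 0:      # RamDisLowX asserted (active-low): keep bits 0-15
            keep |= 0x0000FFFF
        if dis_high_n == 0:     # RamDisHighX asserted (active-low): keep bits 16-31
            keep |= 0xFFFF0000
        self.mem[addr] = (self.mem[addr] & keep) | (data & ~keep & 0xFFFFFFFF)

assert WORDS * BYTES_PER_WORD == 1 << 20       # 1 MB per bank
assert 4 * WORDS * BYTES_PER_WORD == 4 << 20   # 4 MB on the board

bank = RamBank()
bank.write(5, 0x12345678)                      # full-word write, as in the CCM design
bank.write(5, 0xAAAABBBB, dis_high_n=0)        # only the lower half-word changes
print(hex(bank.read(5)))                       # 0x1234bbbb
```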
Clock subsystem. Two global synchronous clock
signals, clock0 and clock1, are available to all
DEC-PERLE-1 LCAs for proper synchronous operation.
These clock signals are generated by a phase-locked-loop
oscillator synchronized to the host bus master clock.
When DEC-PERLE-1 is connected to a DEC 5000/24
workstation (25 MHz TURBOchannel), its frequency can be
programmed under software control to any value from
360 kHz to 120 MHz, with an average resolution of
0.01%.
Clock modes. Under control of the software (the
program running on the host), the clock generator may
be put in the following operation modes:
STOP MODE: No clock is generated in this mode.
FREE-RUN MODE: This is the normal operating mode, where
the clock continuously runs at the prescribed
frequency.
BURST MODE: In this mode, under software control, the
clock generator generates a burst of 1 to 31 clock
ticks at the prescribed frequency and then stops. This
is useful to implement step and double-step debugging
modes.
AUTOSTOP MODE: There are two autostop modes,
FifoIn-Autostop and FifoOut-Autostop. In the
FifoIn-Autostop mode, clock0 automatically stops
whenever the design attempts to read an empty input
FIFO. Similarly, in the FifoOut-Autostop mode, clock0
automatically stops whenever the design attempts to
write a full output FIFO. These two modes can be
enabled at the same time; for instance, the CCM design
runs in this mode.
CLOCK1-DIV2: This mode is useful for very high
performance designs. clock1 runs at half the speed of
clock0, which allows the RAM and FIFOs to be operated
at half the speed of the matrix.
clock0 stop. clock0 may be stopped under control of the
application on the board. This is usually used to
implement flow control, where the entire data path is
stopped waiting for input data (when the input FIFO is
empty) or for output space (when the output FIFO is
full). This is much more efficient, and easier to
implement, than the global distribution of a clock
enable signal. In effect, when the application runs
entirely on clock0 and both autostop modes are enabled,
the application can be seen as a perfectly synchronous
system without flow-control concerns. The clock0 signal
stops under one or more of the following conditions:
(1) the active-low ClkStop signal is asserted by one of
the controllers; (2) in the FifoIn-Autostop mode, the
input FIFO is empty and the active-low FifoInRead
signal is asserted by one of the controllers; (3) in
the FifoOut-Autostop mode, the output FIFO is full and
the active-low FifoOutWrite signal is asserted by one
of the controllers. The memory subsystem and the FIFOs
are clocked by clock1. This means that it is still
possible to perform memory and/or FIFO operations even
when clock0 is stopped.
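The three stop conditions listed above combine into a single Boolean function. In the Python sketch below, the argument names are our own paraphrases of the signals in the text; arguments ending in _n are active-low (0 = asserted), and the FIFO status flags are passed as plain booleans (True = empty / full) for readability. It is an illustration, not controller code.

```python
def clock0_stops(clk_stop_n: int,
                 fifo_in_autostop: bool, fifo_in_empty: bool, fifo_in_read_n: int,
                 fifo_out_autostop: bool, fifo_out_full: bool, fifo_out_write_n: int) -> bool:
    """True when clock0 is stopped. *_n arguments are active-low (0 = asserted)."""
    forced = clk_stop_n == 0                                                  # (1)
    starved = fifo_in_autostop and fifo_in_empty and fifo_in_read_n == 0      # (2)
    blocked = fifo_out_autostop and fifo_out_full and fifo_out_write_n == 0   # (3)
    return forced or starved or blocked

# Both autostop modes enabled, as in the CCM design: reading an empty input FIFO stalls.
print(clock0_stops(1, True, True, 0, True, False, 1))   # True
# Same situation, but the design is not reading: clock0 keeps running.
print(clock0_stops(1, True, True, 1, True, False, 1))   # False
```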
Slow mode. Under control of an application on
the board, it is possible to slow down the clock
(divide its frequency by 4) by asserting the active-low
ClkSlow signal from one of the controllers. This is
useful when an application can run at a very high speed
but must infrequently perform an operation that cannot
be performed at the high speed (like stopping the
clock, or accessing the FIFOs). ClkSlow can be asserted
at any speed, but its operation is asynchronous; that
is, it takes an unpredictable number of cycles to
become effective. If the operating frequency is less
than 80 MHz, however, this number of cycles is
guaranteed to be less than or equal to 6.
Host interface. The DEC-PERLE-1 application runs
under the control of a software program executed on the
host computer. Communication between the DEC-PERLE-1
application and its driving software program can be
done through the FIFOs or the LCBus.
FIFOs. There is a 32-bit-wide, 512-word-deep FIFO
in each direction. These FIFOs are called the input
FIFO for the Host-to-PAM direction and the output FIFO
for the PAM-to-Host direction, respectively. On the
application side, their data wires are connected to
the Fifo Switch LCA and their control wires to the two
Controller LCAs. Both FIFOs are purely synchronous
devices when operated from the application side, and
in autostop mode they appear to be always available for
reading or writing. They offer two active-low status
signals, FifoInEmpty and FifoOutFull, and two
active-low command signals, FifoInRead and
FifoOutWrite. These four signals are connected to the
two Controller LCAs, CNE and CSW.
The input FIFO can be written and the output
FIFO can be read by the driving software through
the runtime library.
LCBus. The LCBus is a 24-bit-wide general-purpose
register that can be read and written by both the
software and the application design. The LCBus can be
used for asynchronous communication between the
Controller LCAs and the software program. Under
software control, the direction of each bit can be set
independently of the others. Initially (after
download), all bits are set for PAM-to-Host
communication.
Tags. Every word that the software (the program running on the host) pushes into the input FIFO is "tagged" with a 4-bit value. These tag bits are read from the input FIFO together with the data word and are available on both Controller LCAs and on the Fifo Switch.
The user of the board has to know the delays of the different kinds of connections in order to make reasonable trade-off decisions. For instance, the delay of the matrix rings is 43 ns and the delay of a matrix direct connection is 24 ns, so for a signal that can use either, the matrix direct connection is the better choice. It would be very difficult to have a GA make good timing decisions of this kind.
The hardware resources described above were created for a class of applications, so they are not necessarily optimal for any particular application. The most useful features in designs are the large memories, the vertical and horizontal buses and direct connections, the global connections, the clock control modes, and the debugging modes. However, the designer is often confronted with too few FPGA connections to map his virtual architecture, which requires frequent modifications or even a total redesign. The most difficult to map are architectures such as the CCM, which have many buses and many global signals between the control units and the data paths.
In conclusion, the DEC-PERLE-1 board, like other FPGA boards, favors very regular design styles without many long control signals. It is therefore well suited to small SIMD processors, pipelines, systolic processors, cellular automata, and complex Boolean functions. The basic design principle is: "map two-dimensional tables to two-dimensional logic resource arrays". A design can be developed incrementally thanks to the easy memory access, the host interface with FIFOs, the clock debugging modes, and the tags.
VI. PROGRAMMING THE DEC-PERLE-1 BOARD
To use the DEC-PERLE-1 board, we must run an application-specific program on the host computer connected to the board. In addition, the 23 FPGA chips of the DEC-PERLE-1 must be programmed to realize application-specific hardware. Therefore, a DEC-PERLE-1 program consists of two parts:
• the driving program, which runs on the host and controls the DEC-PERLE-1 hardware;
• a 1.5 MB bitstream, which programs the 23 XC3090 FPGAs of the DEC-PERLE-1 to realize application-specific hardware.
The driving program is written in C or C++ and is linked to the runtime library, which encapsulates a device driver. Developing the driving program requires a C or C++ programming environment and the DEC-PERLE-1 runtime library.
The runtime library. The runtime library of DEC-PERLE-1 is essential to anyone who develops the driving program that runs on the host computer and controls the DEC-PERLE-1 hardware, since it is the only way for the driving program to access that hardware. The runtime library developed by DEC's Paris Research Laboratory provides a few essential controls to the application driving program: (1) a UNIX I/O interface, with open, close, read, and write; (2) downloading the configuration bitstreams from the host to DEC-PERLE-1 and/or reading back the values of all the flip-flops of all the LCAs; (3) reading and writing the static RAM on DEC-PERLE-1 from the software program; (4) controlling the mode and speed of the DEC-PERLE-1 clock from the software program.
Generating the 1.5 MB bitstream that programs the XC3090 FPGAs to realize the application-specific hardware involves the following steps:
1. Design Partition. In this step the design is mapped onto the 23 FPGA chips according to the logic design and the constraints of the DEC-PERLE-1 board. Some of the FPGA chips may not be used. For example, the CCM design uses only 17 of the 23 FPGA chips, because despite many efforts we were not able to find a better mapping. Steps 2 and 3 must be carried out separately for each FPGA chip used in the design.
2. Design Entry. In this step, the design is created separately for each FPGA used. This step produces a Xilinx netlist file (XNF file) for the next step. There are three kinds of design entry methods: (1) Schematic editor: the XNF file is created with a schematic editor. (2) Hardware description language: the designer can use VHDL (or another hardware description language) to create the design; synthesis software then synthesizes and optimizes the design and produces the XNF file. (3) PerleDC library: the design is described by a C++ program using the PerleDC library, in which the individual configuration of each FPGA involved in the design is specified. Compiling and running this C++ program generates the XNF file of the design.
There are many tools that can be used. For 
instance, there are four sets of tools avail­
able at EE of PSU as of this writing: Xilinx 
Foundation Series, OrCAD Express 7.0, Men­
tor's Leonardo, Summit. Both Xilinx Foun­
dation Series and OrCAD Express 7.0 support 
schematic editor and hardware description lan­
guage. 
3. Design Implementation. Map, place, and route the design, and finally generate the bitstream file using the Xilinx development tools. Since all FPGAs on the DEC-PERLE-1 board are XC3090 FPGAs, the user needs Xilinx development tools that support the XC3090.
4. Design Verification. At this step, the bitstream generated in the previous steps is downloaded into the DEC-PERLE-1 board and the design is tested. If something goes wrong, you may need to modify the design at the design entry step, regenerate the bitstream file, download it to the DEC-PERLE-1 board, and test the design again.
VII. CUBE CALCULUS MACHINE
In our design, the Cube Calculus Machine (CCM) is a coprocessor to the host computer and is realized as a virtual processor in DEC-PERLE-1. The simplified block diagram of the CCM is shown in Figure 3; the thick arrows stand for data buses and the thin arrows for control buses. As shown in the figure, the CCM communicates with the host computer through the input and output FIFOs. The Iterative Logic Unit (ILU) is realized as a one-dimensional iterative network of combinational modules and cellular automata. Partial descriptions of it are given in [21, 38], and a complete description can be found in [40]. The ILU is composed of ITs, each of which processes a single binary variable or two values of a multi-valued variable. Variables with any even number of values can be processed; the only limits are the size of the board and the bus widths (currently a total of 32 values, that is, at most 16 binary variables, 8 quaternary variables, 4 8-valued variables, or any mixture of even-valued variables totaling 32 values). The ILU realizes several operations of MVCC [33].
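To make the capacity figures above concrete, the Python sketch below packs a cube of multiple-valued variables into positional notation (one bit per value, a set bit meaning the value is allowed) and checks whether a mix of variables fits the 32 value positions. The bit layout (variables packed from the least significant bit), the helper names, and the even-valued rule as coded here are assumptions for illustration; the actual CCM register layout is described in [40].

```python
from typing import Sequence, Set

ILU_VALUE_POSITIONS = 32  # total number of values the current ILU handles (from the text)


def fits_in_ilu(radices: Sequence[int]) -> bool:
    """True if variables with these numbers of values fit the ILU.

    Assumes each IT handles two value positions, so every variable must have
    an even number of values (a binary variable occupies exactly one IT).
    """
    return all(r >= 2 and r % 2 == 0 for r in radices) and sum(radices) <= ILU_VALUE_POSITIONS


def its_needed(radices: Sequence[int]) -> int:
    """Number of ITs occupied: one IT per pair of value positions."""
    return sum(radices) // 2


def encode_cube(literals: Sequence[Set[int]], radices: Sequence[int]) -> int:
    """Positional cube notation: bit (offset + v) is set when value v is allowed.

    Hypothetical layout: variables are packed from the least significant bit upward.
    """
    word, offset = 0, 0
    for allowed, r in zip(literals, radices):
        for v in allowed:
            if not 0 <= v < r:
                raise ValueError(f"value {v} out of range for a {r}-valued variable")
            word |= 1 << (offset + v)
        offset += r
    return word


if __name__ == "__main__":
    print(fits_in_ilu([2] * 16), fits_in_ilu([4] * 8), fits_in_ilu([8] * 4))  # True True True
    print(fits_in_ilu([2] * 17))  # False: 34 value positions
    # x1 (binary) is a don't care, x2 (quaternary) takes values {1, 3}
    print(bin(encode_cube([{0, 1}, {1, 3}], [2, 4])))  # 0b101011
```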
The ILU can take its input from the register file and the memory, and can write its output to the register file, the memory, and the output FIFO. The ILU executes cube operations under the control of the Operation Control Unit (OCU). The Global Control Unit (GCU) controls all parts of the CCM and makes them work together.
The machine realizes the set of operations [33] listed in Table 3, which also gives their programming information. Each row of Table 3 describes one cube operation. Each operation is specified in terms of rel, the elementary relation type between input values; and_or, the global relation type; and the internal state of the elementary cellular automaton: before, active, and after. From left to right, the table lists the operation name, its notation, the output value of the rel (partial relation) function in every IT, and_or (the relation type), and the output values of the before, active, and after functions. The partial relation rel is an elementary relation on an elementary piece of data (a pair of bits); such relations are set-theoretical relations such as inclusion and equality. A value of and_or equal to 1 means that the relation type is AND; otherwise, the relation type is OR. The global relation is created by composing the elementary relations from the ITs and variables.
The machine is microprogrammable both in its OCU control unit (by means of the CCM Assembly Language) and in its data path, through the programmability of the ILU operations. For instance, each operation is described by the binary pattern in its row of Table 3; by creating other binary patterns in the fields of Table 3, new operations can be programmed for execution by the ILU. As the reader can appreciate, there are very many such combinations, and thus very many CCM micro-operations. We call this horizontal data-path microprogramming. Higher-order CCM operations are created by sequencing low-level operations. This is called vertical control microprogramming and is executed by the OCU (within the ILU) and the GCU (for operations with memories and I/O). Thus, the user has many ways to (micro)program sequences of elementary instructions. This is done in the CCM Assembly Language [40].
Figure 3: The simplified block diagram of the CCM
Evaluation.
To compare the performance of the CCM with that of a software approach, a program executing the disjoint sharp operation on two arrays of cubes was written in C. This program and the CCM were then used to solve the following problems: (1) three-variable problem: 1 # (all minterms of 3 binary variables); (2) four-variable problem: 1 # (all minterms of 4 binary variables); (3) five-variable problem: 1 # (all minterms of 5 binary variables). The C program was compiled with GNU C compiler version 2.7.2 and run on a Sun Ultra5 workstation with 64 MB of real memory. A 1.33 MHz clock (clock period: 750 ns) was used as the clock of the CCM. The results are shown in Table 4.
Table 4 shows that our CCM is about 3 to 4 times slower than the software approach. However, the CPU clock of the Sun Ultra5 workstation is 270 MHz, about 200 times faster than the clock of the CCM. Therefore, we can still say that the design of the CCM is very efficient for cube calculus operations.
Table 4 also shows that the more variables the input cubes have, the more efficient the CCM is. This is because the software approach needs to iterate through a loop for each variable present in the input cubes.
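The entries of Tables 4 and 5 follow from cycle count times clock period, and the speedup is the software time divided by the CCM time. The short Python check below recomputes them from the numbers printed in the tables; it only restates the paper's arithmetic, and the function names are our own. Recomputed, the 3-variable entry of Table 5 is about 0.727, which the table reports as 0.72.

```python
def ccm_time_us(cycles: int, period_ns: float) -> float:
    """CCM execution time in microseconds: cycle count times clock period."""
    return cycles * period_ns / 1000.0


def speedup(software_us: float, ccm_us: float) -> float:
    """Software time over CCM time; values below 1 mean the CCM is slower."""
    return software_us / ccm_us


# (number of variables, CCM cycles, Ultra5 time in microseconds), from Tables 4 and 5
TABLE4 = [(4, 1285, 268), (5, 3405, 812)]                  # 750 ns period (1.33 MHz)
TABLE5 = [(3, 611, 111), (4, 1486, 268), (5, 4078, 812)]   # 250 ns period (4 MHz)

for label, period, rows in (("1.33 MHz", 750, TABLE4), ("4 MHz", 250, TABLE5)):
    for nvars, cycles, sw in rows:
        t = ccm_time_us(cycles, period)
        print(f"{label}: {nvars} variables, CCM {t:.2f} us, speedup {speedup(sw, t):.3f}")

# Ratio of the 270 MHz workstation clock to the 750 ns CCM clock
print(f"clock ratio: {270e6 / (1e9 / 750):.1f}")  # clock ratio: 202.5
```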
However, the clock period of 750 ns is too slow. The state diagram of the GCU shows that the delays of the empty carry path and the counter carry path matter only in a few states. Thus, if we give more time to just these states, we can speed up the clock of the whole CCM. This is easy to achieve. For example, state P2 of the GCU needs more time for the delay of the counter carry path, so we add two more states in series between states P2 and P3. These two extra states do nothing but give the CCM two more clock periods to evaluate the signal preLres, which means that, after adding the two "delay" states, the CCM has 3 clock periods to evaluate preLres in state P2. After making similar modifications to all such states, the CCM can run with a 4 MHz clock (clock period of 250 ns). The results are shown in Table 5.
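The delay-state technique amounts to giving each slow state enough whole clock periods for its critical path. Below is a minimal Python sketch, assuming the delay of the path evaluated in each state is known or estimated. It treats the state diagram as a linear sequence, which a real GCU is not, and the state names other than P2 and P3, the wait-state naming scheme, and the delay values in the example are hypothetical.

```python
import math
from typing import Dict, List


def extra_delay_states(path_delay_ns: float, period_ns: float) -> int:
    """Idle states needed after a state so its path gets ceil(delay / period) clock periods."""
    return max(math.ceil(path_delay_ns / period_ns) - 1, 0)


def insert_delay_states(sequence: List[str], path_delay_ns: Dict[str, float],
                        period_ns: float) -> List[str]:
    """Return the state sequence with do-nothing states inserted after slow states."""
    result: List[str] = []
    for state in sequence:
        result.append(state)
        n = extra_delay_states(path_delay_ns.get(state, 0.0), period_ns)
        result.extend(f"{state}_wait{i}" for i in range(1, n + 1))
    return result


if __name__ == "__main__":
    # Hypothetical delays: P2 evaluates a slow carry path, the other states are fast.
    delays = {"P1": 120.0, "P2": 700.0, "P3": 200.0}
    print(insert_delay_states(["P1", "P2", "P3"], delays, period_ns=250.0))
    # ['P1', 'P2', 'P2_wait1', 'P2_wait2', 'P3']
```

With a 250 ns period, any path that fitted in the original 750 ns period needs at most two extra states, which matches the two states added between P2 and P3.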
It is very hard to increase the clock frequency further with this mapping, because some other paths, such as the memory path, have delays greater than 150 ns.
From the above comparison, we can conclude that a design like the CCM, with a complex control unit and a complex data path, is not a good fit for the architecture of the DEC-PERLE-1 board. Our CCM mapping shows that, since many signals must go through multiple FPGA chips, the signal delays become larger. For instance, if we could connect the memory banks and the registers directly, the memory path would have a delay of only 35 ns, but our current memory path has a delay of 160 ns. Another issue is that the XC3090 FPGA is now rather "old" (6 to 8 years old technology). The latest FPGAs from Xilinx and other vendors have more powerful CLBs and more routing resources, and they are made using deep sub-micron process technology.
If we could map the entire CCM inside one FPGA chip, we could speed up the CCM in the following ways:
• If the entire CCM is mapped into one FPGA chip, the signals no longer need to go through multiple chips, which reduces the routing delay.
• Since the new FPGA chip has more powerful CLBs and routing resources, the CCM can be mapped more densely. This also reduces the routing delays.
• Since new FPGA chips are made using deep sub-micron technology, the delays of both the CLBs and the routing wires are reduced. For example, the CLB delay of the XC3090A is 4.5 ns, while the CLB delay of the XC4085XL (0.35 micron technology) is only 1.2 ns. This suggests that a mapping about 3 times faster should be easy to achieve.
The XC4085XL, a new FPGA from Xilinx, has a 56 x 56 CLB matrix and up to 448 user I/O pins. The CCM should fit into one XC4085XL FPGA. With this new chip, it should not be difficult to run the CCM with a 20 MHz clock (clock period: 50 ns). This means that our CCM would be about 4 times faster than the software approach, while the system clock of the CCM would still be 5 times slower than that of the workstation.

                              Relation          Output Function
Operation       Notation      rel     and/or    before   after
crosslink                     1110    1         0011     0101
sharp           A #basic B    0010    0         0011     0011
disjoint sharp  A #dbasic B   0010    0         0011     01
consensus       A *basic B    1111    1         0001     0001
intersection    A ∩ B         0
super cube      A ∪ B         0111
prime           A'B           001     0         0011     0111
cofactor        A |basic B    1011    1         001      1111
Table 3: The Output Values of Bitwise Functions Used in Cube Operations

Problem    4 variables                  5 variables
Ultra5     268 usec                     812 usec
CCM        1285 x 0.75 = 963.75 usec    3405 x 0.75 = 2553.75 usec
speedup    0.28                         0.32
Table 4: Comparison of the CCM (1.33 MHz) with the software approach
As stated by the designers of the DEC-PERLE-1 board, PAM technology is currently best applied to low-level, massively repetitive tasks such as image or signal processing. Example applications are a long integer multiplier, RSA cryptography, and the Fast Hough transform. All these applications have no control unit or a very simple one, and their data paths can easily be pipelined. The CCM has a complex control unit and a complex data path, and its data path is difficult to pipeline. Therefore, the DEC-PERLE-1 board is not the best choice for the CCM.
Problem    3 variables                 4 variables                5 variables
Ultra5     111 usec                    268 usec                   812 usec
CCM        611 x 0.25 = 152.75 usec    1486 x 0.25 = 371.5 usec   4078 x 0.25 = 1019.5 usec
speedup    0.72                        0.72                       0.80
Table 5: Comparison of the CCM (4 MHz) with the software approach

VIII. LEARNING BY FUNCTIONAL DECOMPOSITION MACHINE
While the previous section presented a complete general-purpose, memory-programmable processor for cube calculus, in this section we show a different design philosophy: the FPGA implementation of a single ("point") algorithm. The phases of the algorithm are executed sequentially; each phase is loaded from the host memory, while the intermediate data are stored in the DEC-PERLE-1 memories between stages. We also show how generic combinatorial problems are used in logic learning algorithms. Here the ideas of graph coloring [52, 45] are used for decomposing functions, and thus in Machine Learning.
The decision table represents a data set with labeled instances, each relating a set of attribute values to a class (the output concept). Decomposition splits the initial table into a hierarchy of decision tables, none of which can be decomposed further. Each of these new tables, as well as the entire network, is less complex and easier to interpret than the original table. Regularities not visible in the original table can be found, and the intermediate functions correspond to features (concepts) of the data set. Ashenhurst/Curtis decomposition has been adapted to multiple-valued logic [23, 36]. It iteratively applies a single decomposition step, whose goal is to decompose a function y = F(X) into y = G(A, H(B)), where X is the set of input attributes x1, x2, ..., xn, and y is the class. F, G, and H are functions represented as decision tables, i.e., possibly incomplete sets of attribute-value vectors with assigned classes. A and B are subsets of the input attributes, called the free set and the bound set, respectively, such that A ∪ B = X. Functions G and H are developed during the decomposition process and are not predefined in any way. A new concept c = H(B) is thus found. The goal is to find the decomposition of the smallest complexity (DFC [1]).
Let us illustrate the decomposition on a simple example. Consider the decision table in Table 6. It relates the input attributes x1, x2, and x3 to the class y, such that y = F(x1, x2, x3).
There are three possible non-trivial partitions of the attributes, which yield three different decompositions: y = G1(x1, H1(x2, x3)), y = G2(x2, H2(x1, x3)), and y = G3(x3, H3(x1, x2)). The first two are given in Figure 4e and Figure 4f, respectively.
The comparison shows that:
(1) the decision tables in the decomposition y = G1(x1, H1(x2, x3)) are smaller than those for y = G2(x2, H2(x1, x3));
(2) the new concept c1 = H1(x2, x3) uses only three values, whereas that for H2(x1, x3) uses five;
(3) we found it hard to interpret decision tables G2 and H2, whereas by inspecting H1 and G1 it is easy to see that c1 = MIN(x2, x3) and y = MAX(x1, c1). This is even more evident with the assignment of the values 0, 1, and 2 of a multi-valued variable xi as xi^0 = lo, xi^1 = me, xi^2 = hi.
The following problems must be solved by an efficient decomposition algorithm:
(1) how to select the sets A and B;
(2) how to evaluate the quality of decompositions.
Unfortunately, all known methods require nearly exhaustive searches that involve huge numbers of repetitions of basic operations.
A. Simple Decomposition Algorithm for Functions 
The decomposition algorithm constructs a partition matrix with the attributes of the bound set in the columns and those of the free set in the rows. Each column of the partition matrix describes the behavior of F when the attributes in the bound set are constant. Some columns can then be represented with the same value of c, and the number of different columns is equal to the minimal number of values of c needed for the decomposition. In this way, every column is assigned a value of c, and G and H are straightforwardly derived from the annotated partition matrix. For each of the three partitions of our sample decision table F, the partition matrices with the corresponding values of c are given in Figure 4b, c, and d, respectively.
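The single decomposition step for a completely specified table can be sketched in Python. The code builds the partition matrix for a chosen bound set, gives identical columns the same value of c, and reads off H and G. Run on the data of Table 6 with the bound set {x2, x3}, it yields three values of c, matching c1 = MIN(x2, x3). The encoding lo = 0, me = 1, hi = 2 and the function names are our own; incompletely specified tables need the graph-coloring step instead of exact column equality.

```python
from typing import Dict, Sequence, Tuple

VALUE = {"lo": 0, "me": 1, "hi": 2}
Row = Tuple[int, ...]

# Table 6: x1 x2 x3 y
TABLE6_TEXT = """
lo lo lo lo
lo lo hi lo
lo me lo lo
lo me hi me
lo hi lo lo
lo hi hi hi
me lo lo me
me lo hi me
me me lo me
me me hi me
me hi lo me
me hi hi hi
hi lo lo hi
hi lo hi hi
hi me lo hi
hi me hi hi
hi hi lo hi
hi hi hi hi
"""


def parse(text: str) -> Dict[Row, int]:
    """Map each attribute vector (x1, x2, x3) to its class y."""
    table: Dict[Row, int] = {}
    for line in text.split("\n"):
        if line.strip():
            *xs, y = (VALUE[w] for w in line.split())
            table[tuple(xs)] = y
    return table


def decompose(table: Dict[Row, int], bound: Sequence[int]):
    """One step y = G(A, H(B)) for a completely specified table.

    Returns (H, G, number of values of c); identical columns share a value of c.
    """
    n = len(next(iter(table)))
    free = [i for i in range(n) if i not in bound]
    free_rows = sorted({tuple(x[i] for i in free) for x in table})
    bound_cols = sorted({tuple(x[i] for i in bound) for x in table})

    def lookup(a: Row, b: Row) -> int:
        x = [0] * n
        for i, v in zip(free, a):
            x[i] = v
        for i, v in zip(bound, b):
            x[i] = v
        return table[tuple(x)]

    codes: Dict[Row, int] = {}           # column contents -> value of c
    H: Dict[Row, int] = {}
    G: Dict[Tuple[Row, int], int] = {}
    for b in bound_cols:
        column = tuple(lookup(a, b) for a in free_rows)
        H[b] = codes.setdefault(column, len(codes))
        for a, y in zip(free_rows, column):
            G[(a, H[b])] = y
    return H, G, len(codes)


if __name__ == "__main__":
    table = parse(TABLE6_TEXT)
    print(decompose(table, [1, 2])[2])  # 3 values of c for the bound set {x2, x3}
```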
The assignment of values of c is trivial in the case of a completely specified function, that is, when the decision table instances completely cover the attribute space. Otherwise, when the function is incompletely specified, the compatibility relation on columns is no longer transitive, and the graph coloring approach is used. Column functions are calculated by a cofactor operation on the original function f. The cofactor f_PROD of function f with respect to the literals from PROD is the function obtained by setting all literals from PROD to their maximum constant value (the constant 1 in the case of binary logic). All functions are represented by arrays of cubes.
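The cofactor on arrays of cubes can be sketched in Python using positional cube notation, in which every variable of a cube is a bit mask of its allowed values. This is the usual cube-level cofactor: keep the cubes that intersect PROD and, in each of them, also allow the values that PROD excludes. The representation and the function names are assumptions, not the internal format of the DP.

```python
from typing import List, Optional, Sequence, Tuple

Cube = Tuple[int, ...]  # one bit mask per variable; bit v set = value v allowed


def intersect(a: Cube, b: Cube) -> Optional[Cube]:
    """Cube intersection, or None if some variable has no common value."""
    c = tuple(x & y for x, y in zip(a, b))
    return None if any(m == 0 for m in c) else c


def cofactor(f: List[Cube], p: Cube, radices: Sequence[int]) -> List[Cube]:
    """Cofactor f_p of a cube array: keep cubes that meet p and, in each,
    free the values excluded by p (per variable: c_i OR NOT p_i)."""
    result = []
    for c in f:
        if intersect(c, p) is not None:
            result.append(tuple(ci | (((1 << r) - 1) & ~pi)
                                for ci, pi, r in zip(c, p, radices)))
    return result


if __name__ == "__main__":
    # Binary variables: 0b01 = x', 0b10 = x, 0b11 = don't care.
    f = [(0b10, 0b10, 0b11),   # x1 x2
         (0b01, 0b11, 0b10)]   # x1' x3
    print(cofactor(f, (0b10, 0b11, 0b11), [2, 2, 2]))  # [(3, 2, 3)], i.e. x2
    print(cofactor(f, (0b01, 0b11, 0b11), [2, 2, 2]))  # [(3, 3, 2)], i.e. x3
```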
x1   x2   x3   y
lo   lo   lo   lo
lo   lo   hi   lo
lo   me   lo   lo
lo   me   hi   me
lo   hi   lo   lo
lo   hi   hi   hi
me   lo   lo   me
me   lo   hi   me
me   me   lo   me
me   me   hi   me
me   hi   lo   me
me   hi   hi   hi
hi   lo   lo   hi
hi   lo   hi   hi
hi   me   lo   hi
hi   me   hi   hi
hi   hi   lo   hi
hi   hi   hi   hi
Table 6: An example decision table y = F(x1, x2, x3)
For a completely specified binary function, two columns n1 and n2 are compatible if the Boolean functions corresponding to them are equivalent, that is, if their equivalence is a Boolean tautology:
$$n_1 \text{ compatible } n_2 \iff \left(f_{PROD_1} \Leftrightarrow f_{PROD_2}\right) \equiv 1$$
which is equivalent to:
$$n_1 \text{ compatible } n_2 \iff \left(ON(n_1) \,\#\, ON(n_2) = \emptyset\right) \wedge \left(ON(n_2) \,\#\, ON(n_1) = \emptyset\right)$$
where # denotes the sharp (difference) operation on arrays of cubes, and ON is the set of true cubes in SOP form.
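The compatibility test for completely specified columns can be sketched with the cube sharp operation. The Python below implements the usual non-disjoint cube sharp and the array sharp F # G, then checks that both differences of the ON-sets are empty. As a usage example it also evaluates 1 # (all minterms of 3 binary variables), the smallest benchmark of Section VII, whose result is the empty array; note that those benchmarks used the disjoint sharp, which additionally keeps the resulting cubes non-overlapping. Names and the positional-cube representation are illustrative assumptions.

```python
from itertools import product
from typing import List, Optional, Tuple

Cube = Tuple[int, ...]  # positional cube notation: one bit mask per variable


def intersect(a: Cube, b: Cube) -> Optional[Cube]:
    c = tuple(x & y for x, y in zip(a, b))
    return None if any(m == 0 for m in c) else c


def sharp_cube(a: Cube, b: Cube) -> List[Cube]:
    """a # b: cubes covering the minterms of a that are not in b (they may overlap)."""
    if intersect(a, b) is None:
        return [a]
    result = []
    for i, (ai, bi) in enumerate(zip(a, b)):
        rest = ai & ~bi  # values of variable i allowed by a but not by b
        if rest:
            result.append(a[:i] + (rest,) + a[i + 1:])
    return result


def sharp(f: List[Cube], g: List[Cube]) -> List[Cube]:
    """Array sharp F # G: subtract the cubes of G one after another."""
    result = list(f)
    for gc in g:
        result = [c for fc in result for c in sharp_cube(fc, gc)]
    return result


def compatible_complete(on1: List[Cube], on2: List[Cube]) -> bool:
    """Columns of a completely specified function are compatible iff ON-sets are equal."""
    return not sharp(on1, on2) and not sharp(on2, on1)


if __name__ == "__main__":
    minterms = list(product((0b01, 0b10), repeat=3))
    print(sharp([(0b11, 0b11, 0b11)], minterms))  # []
    print(compatible_complete([(0b10, 0b11)], [(0b10, 0b10), (0b10, 0b01)]))  # True
```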
For an incompletely specified binary function, two nodes of the graph for coloring are incompatible if the corresponding columns are not compatible (cannot be merged into one column):
$$n_1 \text{ incompatible } n_2 \iff \left(ON(n_1) \cap OFF(n_2) \neq \emptyset\right) \vee \left(ON(n_2) \cap OFF(n_1) \neq \emptyset\right)$$
where OFF denotes the set of false cubes.
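For incompletely specified columns, the test above needs only cube intersection. The following Python sketch checks whether any ON cube of one column meets any OFF cube of the other; the example columns over two binary variables are hypothetical, with the minterms not listed in ON or OFF left as don't cares.

```python
from typing import List, Optional, Tuple

Cube = Tuple[int, ...]  # positional cube notation: 0b01 = x', 0b10 = x, 0b11 = don't care


def intersect(a: Cube, b: Cube) -> Optional[Cube]:
    c = tuple(x & y for x, y in zip(a, b))
    return None if any(m == 0 for m in c) else c


def arrays_intersect(f: List[Cube], g: List[Cube]) -> bool:
    """True if some cube of F and some cube of G share a minterm."""
    return any(intersect(a, b) is not None for a in f for b in g)


def incompatible(on1: List[Cube], off1: List[Cube],
                 on2: List[Cube], off2: List[Cube]) -> bool:
    """Two columns cannot be merged if one is 1 where the other is 0."""
    return arrays_intersect(on1, off2) or arrays_intersect(on2, off1)


if __name__ == "__main__":
    n1 = ([(0b10, 0b11)], [(0b01, 0b10)])   # ON = x1,    OFF = x1' x2
    n2 = ([(0b11, 0b10)], [(0b10, 0b01)])   # ON = x2,    OFF = x1 x2'
    n3 = ([(0b10, 0b10)], [(0b01, 0b11)])   # ON = x1 x2, OFF = x1'
    print(incompatible(*n1, *n2), incompatible(*n1, *n3), incompatible(*n2, *n3))
    # True False True
```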
As we see, only two basic operations, cofactor and sharp, are used for completely specified functions; similarly, only cofactor and intersection are used for incompletely specified functions. In both cases, however, these operations are repeated many times on cubes from the cube arrays. The basic (multiple-valued) logic operators can likewise be used to check the compatibility of columns of multiple-valued functions while creating the graph for coloring. After creation, the graph is colored so that every two nodes linked by an edge obtain different colors, with the goal of using the minimum number of colors. Graph coloring can also be reduced to sequences of basic logic operators.
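Because exact minimum coloring is expensive, a heuristic is the natural choice here. Below is a minimal Python sketch of one standard heuristic, largest-degree-first greedy coloring; it is not necessarily the algorithm used in the DP (see [24, 45] for the coloring methods the authors studied). Nodes stand for columns of the partition matrix, edges for incompatibility, and columns that receive the same color can share one value of c.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def greedy_coloring(nodes: List[Hashable],
                    edges: Iterable[Tuple[Hashable, Hashable]]) -> Dict[Hashable, int]:
    """Largest-degree-first greedy coloring: each node gets the smallest color
    not used by an already colored neighbor. Always proper, not always minimal."""
    adj: Dict[Hashable, Set[Hashable]] = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color: Dict[Hashable, int] = {}
    # sorted() is stable, so nodes of equal degree keep their input order
    for n in sorted(nodes, key=lambda n: len(adj[n]), reverse=True):
        used = {color[m] for m in adj[n] if m in color}
        c = 0
        while c in used:
            c += 1
        color[n] = c
    return color


if __name__ == "__main__":
    # Hypothetical columns: n1-n2 and n2-n3 are incompatible, n1 and n3 can merge.
    print(greedy_coloring(["n1", "n2", "n3"], [("n1", "n2"), ("n2", "n3")]))
    # {'n2': 0, 'n1': 1, 'n3': 1}
```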
In summary, in addition to cofactoring, our hardware decomposition processor DP solves the following combinatorial subproblems: set covering, graph coloring, and maximum clique. They are all NP-hard, and they all have many other applications in ML.
Function decomposition is an NP-complete problem. Moreover, every stage of function decomposition, except the creation of the G and H function blocks, is an NP-complete problem. Hence, even if more efficient exact algorithms are found, they are likely to remain too slow for large problems, while fast algorithms give solutions of inferior quality. One approach to NP-hard problems is not to attempt an exact solution, but to be satisfied with one that is near-optimal yet obtainable in reasonable time. This type of algorithm is based on heuristics, that is, rules whose application is likely to improve the solution. Such algorithms, when implemented in hardware, can bring orders of magnitude of speed-up [24]. We have chosen algorithms that are simple, fast, and relatively easy to implement in hardware. In addition,
in paper [4] we showed that the decomposition of fuzzy functions and relations can be reduced to the decomposition of multi-valued functions and relations. So, assuming that the fast stages of converting a fuzzy relation to an MV relation, and of converting the MV relations (functions) of each decomposed block back to fuzzy logic, are executed in software, our hardware machine will still perform the most complex part of the synthesis process.
[Figure 4, panels (a)-(f)]
Figure 4: Two one-step decompositions of the decision table from Table 6
In addition to the two virtual processors presented in this paper, we developed and simulated the Rough Set Machine (RSM) [30] and the Satisfiability Machine (SM) [31]. RSM is a SIMD processor that realizes the basic operations of the Rough Sets theory of Zdzisław Pawlak. SM is a systolic processor that solves satisfiability and related problems occurring in many combinatorial optimization problems.
IX. CONCLUSIONS
We presented the principles of Learning Hardware as a competing approach to Evolvable Hardware, and also as its generalization. The concept of Data Mining machines has been outlined, and the Universal Logic Machine with several virtual processors was briefly sketched as just one possible realization of such machines. Although DEC-PERLE-1 is a good medium for prototyping such machines, its XC3090A chip is now obsolete. This can be much improved by using the XC4085XL FPGA and redesigning the board. Massively parallel architectures such as CBM based on the new Xilinx 6000 series chips should allow even higher speedups. We hope that such machines will be used to implement improved models of ULM.
REFERENCES
[1] Y. Abu-Mostafa (ed.), "Complexity in Information Theory," Springer Verlag, New York, 1988, p. 184.
[2] R.L. Ashenhurst, "The Decomposition of Switching Functions," Proc. Int. Symp. on the Theory of Switching, 1957.
[3] A. Buller, "Artificial Brain. Phantasies no more," Prószyński i S-ka, Warsaw, 1998 (in Polish).
[4] P. Burkey, M. Perkowski, and A. Wielgus, "Ashenhurst/Curtis Decomposition of Fuzzy Functions and Relations," submitted to Multiple-Valued Logic. An International Journal, Gordon and Breach Science Publishers, 1999.
[5] R.E. Bryant, "Graph-based algorithms for boolean function manipulation," IEEE Transactions on Computers, C-35, No. 8, pp. 677-691, 1986.
[6] E.F. Codd, "A Relational Model of Data for Large Shared Data Banks," Comm. ACM, 13, pp. 377-387, 1970.
[7] H.A. Curtis, "A New Approach to the Design of Switching Circuits," Princeton, N.J., Van Nostrand, 1962.
[8] D.L. Dietmeyer, "Logic Design of Digital Systems," Allyn and Bacon, Boston, MA, 1971.
[9] K. Dill and M. Perkowski, "Minimization of Generalized Reed-Muller Forms with Genetic Operators," Proc. Genetic Programming '97 Conf., July 1997, Stanford Univ., CA.
[10] K. Dill, J. Herzog, and M. Perkowski, "Genetic Programming and its Application to the Synthesis of Digital Logic," Proc. PACRIM '97, Canada, August 20-22, 1997.
[11] K. Dill and M. Perkowski, "Evolutionary Minimization of Generalized Reed-Muller Forms," Proc. ICCIMA '98 Conference, pp. 121-133, February 1998, Australia, published by World Scientific.
[12] B. Falkowski, I. Schaefer, M. Perkowski, "Effective Computer Methods for the Calculation of Rademacher-Walsh Spectrum for Completely and Incompletely Specified Boolean Functions," IEEE Trans. on Computer-Aided Design, pp. 1207-1226, October 1992.
[13] C. Files, M. Perkowski, "An Error Reducing Approach to Machine Learning Using Multi-Valued Functional Decomposition," Proc. ISMVL'98, pp. 167-172, May 1998.
[14] C. Files, M. Perkowski, "Multi-Valued Functional Decomposition as a Machine Learning Method," Proc. ISMVL'98, pp. 173-178, May 1998.
[15] J.M. Francioni and A. Kandel, "Decomposable Fuzzy-valued Switching Functions," Fuzzy Sets and Systems, Vol. 9, No. 1, pp. 41-68, 1983.
[16] H. DeGaris, "Evolvable Hardware: Genetic Programming of a Darwin Machine," in "Artificial Nets and Genetic Algorithms," R.F. Albrecht, C.R. Reeves and N.C. Steele (eds), Springer Verlag, pp. 441-449, 1993.
[17] H. DeGaris, "Evolvable Hardware: Principles and Practice," CACM Journal, August 1997.
[18] http://www.hip.atr.co.jp/~degaris
[19] S. Grygiel and M. Perkowski, "New Compact Representation of Multiple-Valued Functions, Relations, and Non-deterministic State Machines," Proc. ICCD'98, October 1998.
[20] T. Higuchi, M. Iwata, and W. Liu (eds), "Evolvable Systems: From Biology to Hardware," Lecture Notes in Computer Science, No. 1259, Proc. First Intern. Conf. ICES'96, Tsukuba, Japan, October 1996. Springer Verlag, 1997.
[21] L. Jozwiak, M.A. Perkowski, D. Foote, "Massively Parallel Structures of Specialized Reconfigurable Cellular Processors for Fast Symbolic Computations," Proc. MPCS'98, The Third International Conference on Massively Parallel Computing Systems, Colorado Springs, Colorado, USA, April 6-9, 1998.
[22] T. Luba, J. Rybnik, "Algorithmic Approach to Discernibility Function with Respect to Attributes and Object Reduction," Int. Workshop on Rough Sets, Poznan, 1992.
[23] T. Luba, "Decomposition of multiple-valued functions," Proc. 25th ISMVL, 1995, pp. 256-261.
[24] R. Malvi, M. Perkowski, and L. Jozwiak, "Exact Graph Coloring for Functional Decomposition: Do we Need it?," pp. 1-10, Proceedings of 3rd International Workshop on Boolean Problems, Freiberg University of Mining and Technology, Institute of Computer Science, September 17-18, 1998.
[25] C. Mead, "Analog VLSI And Neural Systems," Addison Wesley Pub., April 1989.
[26] R.S. Michalski and J.B. Larson, "Inductive inference of VL decision rules," in Workshop in Pattern-Directed Inference Systems, Hawaii, May 1977.
[27] R.S. Michalski, I. Bratko, and M. Kubat, "Machine Learning and Data Mining: Methods and Applications," Wiley and Sons, 1998.
[28] D. Michie, "Machine Learning in the next five years," Proc. EWSL'88, 3rd European Working Session on Learning, Glasgow, Pitman, London, 1988.
[29] L. Nguyen, M. Perkowski, N. Goldstein, "PALMINI - Fast Boolean Minimizer for Personal Computers," Proc. of the IEEE/ACM 24th Design Automation Conference, pp. 615-621, Miami, Florida, June 28 - July 1, 1987.
[30] Z. Pawlak, "Rough Sets. Theoretical Aspects of Reasoning about Data," Kluwer Academic Publishers, 1991.
[31] M. Perkowski, "Systolic Architecture for the Logic Design Machine," Proc. of the IEEE and ACM International Conference on Computer Aided Design, ICCAD'85, pp. 133-135, Santa Clara, 19-21 November 1985.
[32] M. Perkowski, S. Wang, W.K. Spiller, A. Legate, E. Pierzchala, "Ovulo-Computer: Application of Image Processing and Recognition to Mucus Ferning Patterns," Proc. of the Third IEEE Symposium on Computer-Based Medical Systems, pp. 52-59, Chapel Hill, North Carolina, June 3-6, 1990.
[33] M.A. Perkowski, "A Universal Logic Machine," invited address, Proc. of the 22nd IEEE International Symposium on Multiple Valued Logic, ISMVL'92, pp. 262-271, Sendai, Japan, May 27-29, 1992.
[34] M.A. Perkowski, M. Chrzanowska-Jeske, "Multiple-Valued-Input TANT Networks," Proc. ISMVL'94, pp. 334-341, Boston, MA, May 25-27, 1994.
[35] M.A. Perkowski, T. Ross, D. Gadd, J.A. Goldman, and N. Song, "Application of ESOP Minimization in Machine Learning and Knowledge Discovery," Proc. of the Second Workshop on Applications of Reed-Muller Expansion in Circuit Design, Chiba City, Japan, 27-29 August 1995, pp. 102-109.
[36] M. Perkowski, M. Marek-Sadowska, L. Jozwiak, T. Luba, S. Grygiel, M. Nowicka, R. Malvi, Z. Wang, and J.S. Zhang, "Decomposition of Multiple-Valued Relations," Proc. ISMVL'97, Halifax, Nova Scotia, Canada, May 1997, pp. 13-18.
[37] M. Perkowski, P. Lech, Y. Khateeb, R. Yazdi, and K. Regupathy, "Software-Hardware Codesign Approach to Generalized Zakrevskij Staircase Method for Exact Solutions of Arbitrary Canonical and Non-Canonical Expressions in Galois Logic," Booklet of 6th Intern. Workshop on Post-Binary ULSI Systems, Nova Scotia, Canada, May 27, 1997, pp. 41-44.
[38] M.A. Perkowski, L. Jozwiak, and D. Foote, "Architecture of a Programmable FPGA Coprocessor for Constructive Induction Approach to Machine Learning and other Discrete Optimization Problems," in Reiner W. Hartenstein and Victor K. Prasanna (eds), "Reconfigurable Architectures. High Performance by Configware," IT Press Verlag, Bruchsal, Germany, 1997, pp. 33-40.
[39] M. Perkowski, L. Jozwiak, and S. Mohamed, "New Approach to Learning Noisy Boolean Functions," Proc. ICCIMA'98 Conference, February 1998, Australia, published by World Scientific, pp. 693-706.
[40] M. Perkowski, "Do It Yourself Reconfigurable Supercomputer that Learns," book preprint, Portland, Oregon, 1999.
[41] 	 PSU POLO Directory with DMjML Benchmarks, soft­
ware and papers: http://www.ee.pdx.edujpolo/ 
[42] E. Pierzchala and M. Perkowski, "A High-Frequency Field-Programmable Analog Array (FPAA), Part 1: Design, Part 2: Applications," in Field-Programmable Analog Arrays (E. Pierzchala, ed.), Kluwer Academic Publishers, 1998.
[43] L. O. Chua and T. Roska, "The CNN Paradigm," IEEE Trans. on Circuits and Systems-I, Vol. 40, No. 3, pp. 148-156, March 1993.
[44] T. D. Ross, M. J. Noviskey, T. N. Taylor, and D. A. Gadd, "Pattern Theory: An Engineering Paradigm for Algorithm Design," Final Technical Report WL-TR-91-1060, Wright Laboratories, USAF, WL/AART/WPAFB, OH 45433-6543, August 1991.
[45] P. Sapiecha, M. A. Perkowski, and T. Luba, "Decomposition of Information Systems Based on Graph Coloring Heuristics," Symposium on Modelling, Analysis and Simulation, CESA'96 IMACS Multiconference, Lille, France, July 9-12, 1996.
[46] T. Sasao (editor), "Representation of Boolean Functions," Kluwer Academic Publishers, 1996.
[47] K. B. Stanton, P. R. Sherman, M. L. Rohwedder, Ch. P. Fleskes, D. Gray, D. T. Minh, C. Espinosa, D. Mayi, M. Ishaque, and M. A. Perkowski, "PSUBOT - A Voice-Controlled Wheelchair for the Handicapped," Proc. of the 33rd Midwest Symp. on Circuits and Systems, pp. 669-672, Alberta, Canada, August 1990.
[48] Y. H. Su and P. T. Cheung, "Computer Minimization of Multiple-Valued Switching Functions," IEEE Transactions on Computers, Vol. C-21, pp. 995-1003, 1972.
[49] N. Song and M. Perkowski, "Minimization of Exclusive Sum of Products Expressions for Multi-Output Multiple-Valued Input, Incompletely Specified Functions," IEEE Transactions on Computer Aided Design, Vol. 15, No. 4, April 1996, pp. 385-395.
[50] U. C. Irvine, "Repository of Machine Learning Databases and Domain Theories," ftp://ftp.ics.uci.edu/pub/machine-learning-databases/
[51] J. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati, and Ph. Boucard, "Programmable Active Memories: Reconfigurable Systems Come of Age," IEEE Trans. on VLSI Systems, Vol. 4, No. 1, pp. 56-69, March 1996.
[52] W. Wan and M. Perkowski, "A New Approach to the Decomposition of Incompletely Specified Multi-Output Function Based on Graph Coloring and Local Transformations and Its Application to FPGA Mapping," Proc. Euro-DAC, pp. 230-235, 1992.