Optimization of a hardware/software coprocessing platform for EEG eyeblink detection and removal by Chaudhuri, Matthew Alan
Rochester Institute of Technology
RIT Scholar Works
Theses Thesis/Dissertation Collections
12-1-2008
Optimization of a hardware/software coprocessing
platform for EEG eyeblink detection and removal
Matthew Alan Chaudhuri
Follow this and additional works at: http://scholarworks.rit.edu/theses
This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion
in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.
Recommended Citation
Chaudhuri, Matthew Alan, "Optimization of a hardware/software coprocessing platform for EEG eyeblink detection and removal"
(2008). Thesis. Rochester Institute of Technology. Accessed from
Optimization of a Hardware/Software Coprocessing
Platform for EEG Eyeblink Detection and Removal
by
Matthew Alan Chaudhuri
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Computer Engineering
Supervised by
Associate Professor Dr. Daniel Phillips
Department of Electrical Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
December 2008
Approved By:
Dr. Daniel Phillips
Associate Professor
Primary Adviser
Dr. Juan Cockburn
Associate Professor, Department of Computer Engineering
Dr. Marcin Lukowiak
Assistant Professor, Department of Computer Engineering
Thesis Release Permission Form
Rochester Institute of Technology
Kate Gleason College of Engineering
Title: Optimization of a Hardware/Software Coprocessing Platform for
EEG Eyeblink Detection and Removal
I, Matthew Alan Chaudhuri, hereby grant permission to the Wallace Memorial Library
reporduce my thesis in whole or part.
Matthew Alan Chaudhuri
Date
Acknowledgments
I would like to thank Dr. Phillips for his dedication toward helping me complete this
research, and also his continuous encouragement and support. I would also like to thank Dr.
Cockburn and Dr. Lukowiak for teaching me just about all I know about signal processing
and embedded design. If it wasn’t for Dr. Berg’s support, none of this research would have
been possible - thank you.
iii
Abstract
The feasibility of implementing a real-time system for removing eyeblink artifacts from
electroencephalogram (EEG) recordings utilizing a hardware/software coprocessing plat-
form was investigated. A software based wavelet and independent component analysis
(ICA) eyeblink detection and removal process was extended to enable variation in its pro-
cessing parameters. Exploiting the efficiency of hardware and the reconfigurability of soft-
ware, it was ported to a field programmable gate array (FPGA) development platform which
was found to be capable of implementing the revised algorithm, although not in real-time.
The implemented hardware and software solution was applied to a collection of both
simulated and clinically acquired EEG data with known artifact and waveform characteris-
tics to assess its speed and accuracy. Configured for optimal accuracy in terms of minimal
false positives and negatives as well as maintaining the integrity of the underlying EEG, es-
pecially when encountering EEG waveform patterns with an appearance similar to eyeblink
artifacts, the system was capable of processing a 10 second EEG epoch in an average of
123 seconds. Configured for efficiency, but with diminished accuracy, the system required
an average of 34 seconds.
Varying the ICA contrast function showed that the gaussian nonlinearity provided the
best combination of reliability and accuracy, albeit with a long execution time. The cu-
bic nonlinearity was fast, but unreliable, while the hyperbolic tangent contrast function
frequently diverged.
It is believed that the utilization of programmable logic with increased logic capac-
ity and processing speed may enable this approach to achieve the objective of real-time
operation.
iv
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Algorithm and Proposed Modifications . . . . . . . . . . . . . . . 3
1.1.2 Platform and Implementation . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Testing and Benchmarking . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Supporting Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Hardware/Software Co-design . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Multichannel Data Processing . . . . . . . . . . . . . . . . . . . . 6
1.2.3 EEG Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Square Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Block Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Accumulated Matrix Multiplication . . . . . . . . . . . . . . . . . 12
2.2 The Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Filter Banks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Independent Component Analysis . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Centering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Whitening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Measures of Nonguassianity . . . . . . . . . . . . . . . . . . . . . 20
v
2.3.4 Iterative Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.5 Ambiguities of ICA . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Eyeblink Artifact Removal . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Phase I Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.2 Ocular Component Separation . . . . . . . . . . . . . . . . . . . . 28
2.4.3 Phase II Detection and Removal . . . . . . . . . . . . . . . . . . . 28
3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Changes to Pesin Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Centering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 Initial Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 Ocular IC Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.1 Supporting Libraries . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 Arithmetic Precision . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.3 Memory Organization . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Development Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Xilinx Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Development Platform . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.2 Auxiliary Processing Unit Interface . . . . . . . . . . . . . . . . . 38
3.4.3 Floating Point Unit . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.4 Register File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Overhead Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 Reconfigurability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7 Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.7.1 Graphical Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.7.2 Data Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4 Method of Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1 Simulated EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.1 Simulated EEG Generation . . . . . . . . . . . . . . . . . . . . . . 63
4.1.2 Eyeblink Removal Parameters . . . . . . . . . . . . . . . . . . . . 65
4.2 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.1 Control Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.2 Clinical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
vi
4.2.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1 Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.1 Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1.2 Euclidean Distance . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1.3 Standard Deviation Ratio . . . . . . . . . . . . . . . . . . . . . . . 77
5.2 Control Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Clinical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.1 FIRDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3.2 Triphasic Waves . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.1 Interpretation of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.1.1 Optimal Contrast Function . . . . . . . . . . . . . . . . . . . . . . 84
6.2 Observations on the Pesin Process . . . . . . . . . . . . . . . . . . . . . . 84
6.2.1 Detection Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3 Independent Component Analysis Observations . . . . . . . . . . . . . . . 86
6.3.1 Initial Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3.2 Attempted Optimization . . . . . . . . . . . . . . . . . . . . . . . 87
6.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.1 Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.2 Further Optimizations . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.3 Limitations of Present Approach . . . . . . . . . . . . . . . . . . . 89
7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A main.cpp File Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
B matrix mult.c File Reference . . . . . . . . . . . . . . . . . . . . . . . . . . 105
vii
List of Figures
1.1 An example of an EEG recording contaminated by an eyeblink. . . . . . . . 2
2.1 A single-level wavelet decomposition filter bank. . . . . . . . . . . . . . . 14
2.2 A single-level wavelet reconstruction filter bank. . . . . . . . . . . . . . . 14
2.3 A three-level wavelet decomposition tree. . . . . . . . . . . . . . . . . . . 15
2.4 A recursive wavelet decomposition filter bank. . . . . . . . . . . . . . . . . 15
2.5 A recursive wavelet reconstruction filter bank. . . . . . . . . . . . . . . . . 16
2.6 Pesin algorithm overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7 Phase I blink detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.8 Phase II blink detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 External library dependency diagram. . . . . . . . . . . . . . . . . . . . . 33
3.2 The ML410 development platform architecture . . . . . . . . . . . . . . . 38
3.3 Execution time profile for the system with an FPU . . . . . . . . . . . . . . 41
3.4 The register file bridges the APU and the matrix multiplier . . . . . . . . . 43
3.5 The APU instruction decoder state machine . . . . . . . . . . . . . . . . . 44
3.6 Comparison of fixed and floating point number representation. . . . . . . . 47
3.7 The structure of the matrix multiplier [27]. . . . . . . . . . . . . . . . . . . 49
3.8 The data flow within PE0 for one row/column of operand matrices. . . . . 51
3.9 Execution time profile for system with hardware matrix multiplier. . . . . . 54
3.10 The hardware architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.11 Device utilization summary for the target Virtex-4 FX60 FPGA . . . . . . . 56
3.12 Execution time profile for system with final optimizations . . . . . . . . . . 58
3.13 Screenshots of each GUI panel . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1 Top view of electrode placement for 10-20 system . . . . . . . . . . . . . . 64
5.1 Simulated eyeblink and original uncontaminated EEG . . . . . . . . . . . . 71
5.2 Comparison of MATLAB and FPGA waveforms for simulated eyeblink . . 72
5.3 Comparison of MATLAB and FPGA statistics for a simulated eyeblink . . 74
5.4 MATLAB and FPGA correlation coefficient for simulated eyeblink . . . . . 75
viii
5.5 MATLAB and FPGA Euclidean distance for simulated eyeblink . . . . . . 76
5.6 MATLAB and FPGA standard deviation ratio for simulated eyeblink . . . . 77
5.7 10 second EEG with FIRDA. . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.8 10 second EEG with triphasic waves. . . . . . . . . . . . . . . . . . . . . . 80
5.9 Histogram of gaussian nonlinearity iteration counts. . . . . . . . . . . . . . 82
6.1 Proposed future architecture include exponential unit . . . . . . . . . . . . 90
ix
List of Tables
3.1 Execution time for the system with an FPU . . . . . . . . . . . . . . . . . 42
3.2 Matrix multiplication performance comparison. . . . . . . . . . . . . . . . 52
3.3 Execution time comparison for the system with a matrix multiplier. . . . . . 54
3.4 Execution time comparison for system with final optimizations . . . . . . . 59
4.1 EOG Transmission Coefficients . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 Algorithm parameters for each platform . . . . . . . . . . . . . . . . . . . 66
5.1 Summary of the eyeblink removal results for all contrast functions. . . . . . 78
5.2 Summary of eyeblink removal for clinical data. . . . . . . . . . . . . . . . 81
5.3 Execution time for each test case, with mean (standard deviation) in seconds 82
x
Assumptions
Development was completed under Windows XP Service Pack 3. The development
software used included:
• Xilinx ISE 10.1
• Xilinx Platform Studio (XPS) 10.1
• MathWorks MATLAB R14
• Sun Java Development Kit 1.6
• Java RXTX 2.1-7 Library
xi
Chapter 1
Introduction
1.1 Overview
An electroencephalogram (EEG) is the primary tool for diagnosing epilepsy and moni-
toring or predicting epileptic seizures. However, an EEG is highly susceptible to noise and
artifacts caused by eyeblinks, heartbeat, and muscle movement, thereby making analysis
difficult. Figure 1.1 shows an example of a 10 second EEG recording contaminated by two
eyeblink artifacts.
Processing and analysis of EEG data has been done “offline” on computer clusters
in the past. Such a method provides no opportunity for seizure prediction - a technique
which is currently the focus of much research, given its potential benefits for epilepsy pa-
tients. Since this type of EEG processing would likely be automated, artifact detection and
removal would become necessary, and potentially more reliable when the unique charac-
teristics of the artifacts found in an individual patient’s EEG are accounted for explicitly.
A multichannel signal processing platform optimized to perform an eyeblink detection and
removal algorithm in real-time would complement current research on seizure prediction.
A system which is optimized to perform real-time eyeblink detection and removal in
such a way that the algorithm parameters can be reconfigured at runtime is proposed. A
field programmable gate array (FPGA) is a viable platform for a combined hardware/soft-
ware solution. An FPGA is an ideal choice for the proposed signal processing platform
1
Figure 1.1: An example of an EEG recording contaminated by an eyeblink.
since it provides abundant resources, especially multipliers, which can be used for hard-
ware acceleration of algorithm kernels. Certain FPGAs also provide embedded micro-
processors, which are not only capable of efficiently executing software routines, but also
useful for exploiting the flexibility of software to achieve reconfigurability.
The goal of this research is to develop a signal processing platform that efficiently per-
forms a real-time eyeblink detection and removal algorithm on EEG data. This algorithm is
implemented with a combined hardware/software solution. The hardware is implemented
in the reconfigurable fabric of the target FPGA, which contains an algorithm-dependant
coprocessor as well as an interface to an embedded processor. This embedded processor
executes reconfigurable portions of the algorithm in software.
2
1.1.1 Algorithm and Proposed Modifications
Past research has produced an effective process for removing eyeblink artifacts from
EEGs [17]. This real-time eyeblink detection and removal algorithm is based on the dis-
crete wavelet transform (DWT) and independent component analysis (ICA). This type of
processing is essential because it has the potential to eliminate the artifacts introduced into
the EEG by eyelid movement. Automation of this task is necessary for ambulatory EEGs
since the subject is not in a controlled environment. Subsequently, there is no human ob-
server actively monitoring the patient and annotating the EEG, nor is there video of the
subject from which an annotation could be generated. This research may also serve to aid
future seizure detection algorithms by enabling the removal of movement-related EEG ar-
tifacts induced by seizures [24]. Reducing the number of clinically irrelevant components
of EEG waveforms through artifact removal may also decrease the likelihood of a seizure
detection false positive.
Several modifications are made to the eyeblink detection and removal algorithm in [17]
in order to improve its generality, as well as to take advantage of the benefits of hardware/-
software codesign.
In particular, the wavelet transform is implemented in such a way that the prototype
wavelet is customizable at runtime. This feature allows for the eyeblink detection algorithm
to be tailored to an individual patient, since the wavelet transform is the primary tool for
artifact detection. This feature may be of further benefit since the prototype wavelet could
be chosen in such a way that the occurrence of other events could be detected in the EEG
data, for instance from heart activity or other muscle movements. This reconfigurability
allows the Pesin algorithm in [17] to be adapted for other artifact removal tasks, though
such work is beyond the scope of this thesis.
Additionally, the independent component analysis algorithm is designed to allow for
a runtime reconfigurable contrast function. A customizable contrast function allows for
optimizations to the algorithm to be made based on factors such as desired robustness,
computational complexity, and the statistics of the independent sources [12].
3
1.1.2 Platform and Implementation
Previous research found that the proposed signal processing task is difficult to perform
in real-time [17]. This issue is confounded by advances in EEG technology which some-
times incorporate additional channels of data, thus making real-time processing even more
challenging.
Processing in real-time means that a certain operation is guaranteed to complete prior
to some deadline. In particular, the eyeblink detection and removal algorithm considers
non-overlapping, consecutive, 10 second epochs [17]. Therefore, for the purposes of this
research, real-time processing shall be taken to mean that a processed value will be output
from the system no later than 10 seconds after the original sample is present at the sys-
tem input. The design presented here enables runtime reconfigurability, which allows a
customized setting or parameter to be applied at the beginning of the epoch immediately
following the epoch in which the reconfiguration was requested by the user.
An FPGA containing an embedded microprocessor was used to implement this system
by way of a combined hardware/software approach. The available logic resources were
used to optimize time-critical portions of the required algorithms in an attempt to meet
the real-time processing goal, while the embedded microprocessor executed portions of
the algorithms that were targeted for reconfigurability. The architecture for the proposed
platform was constructed in such a way that the multichannel data was accessible in an
efficient manner, optimized in terms of performance for this application.
The heart of such a processing system lies in the optimization of memory management.
Since modern FPGAs provide abundant logic resources, and most importantly numerous
multipliers, efficient data access are the key to achieving real-time performance. Meth-
ods for obtaining high-throughput interfaces between the embedded processor and custom
coprocessors have been investigated and implemented.
4
1.1.3 Testing and Benchmarking
In order to validate the functionality of the proposed signal processing platform, a PC
was used to interface with the platform via bidirectional serial interfaces. Test data from
real EEG recordings was sent to the processing platform to simulate real-world data cap-
ture, while processed data was returned to the PC for display and analysis, and to demon-
strate the capabilities of the processing platform.
1.2 Supporting Work
1.2.1 Hardware/Software Co-design
General purpose processors (GPPs) are designed to be flexible enough to perform nearly
any task. However, the cost of this flexibility is efficiency, since a complex process which
may require many instructions (and subsequently many clock cycles) to complete on a GPP
may require only a single clock cycle when executing on a custom processor.
The hardware/software codesign paradigm involves simultaneous design of the soft-
ware and custom processors required to implement a system [21]. To that end, some com-
bined hardware/software systems have incorporated a GPP and an application specific in-
tegrated circuit (ASIC). In contrast, other hardware/software systems may be implemented
on a single device: an FPGA.
FPGAs are ideal platforms for rapid prototyping, and consequently, they are well-suited
to the implementation of combined hardware/software systems. The logic resources on
some FPGAs can be used to realize a soft-core microprocessor, while functionality pre-
viously included in ASICs can be implemented in reconfigurable FPGA fabric. More re-
cently, some FPGAs have begun to include hard-core microprocessors, where the processor
core is implemented in silicon alongside reconfigurable fabric.
Given the well-established practice of hardware/software codesign, a multitude of de-
sign tools have been created to aid engineers in the design, implementation, and verification
5
of hardware/software systems [3]. A crucial step of the hardware/software codesign flow
is the partitioning of the algorithm into portions which will either remain in software or
be implemented in hardware [3]. The profiling abilities provided by the Xilinx Software
Development Kit were utilized to determine which sections of the eyeblink detection and
removal algorithm were most appropriate to accelerate with a custom hardware coproces-
sor.
1.2.2 Multichannel Data Processing
A software-only processing platform, such as the kind that may be implemented on a
general purpose processor, microcontroller, or digital signal processor, would not necessar-
ily be capable of satisfying the throughput requirements necessary to perform multichannel
signal processing in real-time [1]. After characterizing the improvement of a combined
hardware/software coprocessing platform over a software-only platform for a multichan-
nel discrete cosine transform compression algorithm, it was found that the number of EEG
channels that could be compressed in real-time increased nearly sevenfold with a combined
approach.
Multichannel data processing on an FPGA has been considered as part of other biomed-
ical data processing applications, such as in the case of magnetic resonance imaging (MRI)
analysis [4]. In this application, an FPGA was successfully used as a front-end for a 16
channel data acquisition system. In order to reduce the amount of offline-processing on
general purpose PCs that would need to be performed following an MRI scan, a reconfig-
urable architecture was developed to utilize FPGA resources to perform a 2D fast Fourier
transform (FFT). It was found that, when implemented on the FPGA, a custom coprocessor
designed to perform a 16 channel 2D FFT reconstruction was sufficiently powerful to op-
erate in real-time - a feat not previously attainable when using general purpose computing
resources.
6
1.2.3 EEG Processing
There are many different kinds of useful processing that may be performed on an EEG.
Processing as simple as amplitude or slope analysis may be valuable for some applica-
tions, whereas other applications may require wavelet decomposition or analysis of the
power spectral density of an EEG waveform [22]. Such diversity in the types of useful
signal processing algorithms necessitates reconfigurability at the heart of an EEG process-
ing platform. This need is further driven by improvements in algorithm robustness and
reliability that might be gained by customizing certain processing tasks to an individual’s
unique needs. The premise of this claim is that each patient may conceivably have a unique
eyeblink artifact pattern [7].
Epilepsy patients would benefit from some warning at the onset of a seizure. For such
patients, EEG monitoring may be performed in either a clinical setting (e.g.in an inten-
sive care unit, typically a 24 hour recording) or ambulatory setting (e.g.in-home, a 72 hour
recording or longer). In either case a massive amount of EEG data is generated, thus, EEG
analysis is becoming increasingly automated. Automation may allow medical profession-
als with no neurological background to determine whether events of clinical significance
occurred, or it may enable detection of significant events that might be imperceptible even
to trained neurologists. This necessitates removal of artifacts such as those caused by eye
movements, which is imperative so that the automated analysis does not falsely identify
artifacts as diagnostically significant events [9, 22].
Recently, a wavelet- and independent component analysis-based eyeblink detection and
removal algorithm, hereafter referred to as the Pesin algorithm, was developed and simu-
lated to a good degree of success [17]. Since the MATLAB source code for this algorithm
was available, this author’s methods were demonstrated in an FPGA-based hardware/soft-
ware realization of his algorithm. This platform was also used to validate the reconfigura-
bility of the proposed architecture.
The advantage to implementing a customizable wavelet transform is twofold. First, an
analysis could be performed on the eyeblink artifacts generated by an individual patient,
7
and an appropriate custom prototype wavelet could be derived that may detect eyeblink
artifacts with better resolution than a generic, “one size fits all” prototype wavelet [18].
Second, the algorithm could be adapted to detect artifacts caused by non-ocular sources -
such as those created by heartbeat or muscle movements - by generation of an appropriate
prototype wavelet.
Customization of the discrete wavelet transform is accomplished by changing the coef-
ficients used in the filter bank which performs the wavelet decomposition. The prototype
wavelet used in analysis is uniquely described by these coefficients [18]. The goal of this
research is to investigate methods of modifying these coefficients at runtime; the derivation
of these coefficients is beyond the scope of this work.
A “fast and robust” version of the ICA algorithm, FastICA, has been previously devel-
oped [12]. This is the same algorithm that was used in [17] for development of the eyeblink
detection and removal algorithm. This FastICA algorithm was successfully implemented
on an FPGA to perform blind source separation in real-time on dual channel audio signals
sampled at 192 kHz [19], albeit the current work requires a far greater number of channels
at a higher sample rate.
One of several discrete steps in the iterative ICA algorithm is the application of a con-
trast function, which is a measure of the non-gaussianity of the independent sources, a
metric used to determine algorithm convergence [12, 10]. Several contrast functions have
been proposed, and selection of an appropriate contrast function for a particular set of signal
sources is dependent on the desired robustness of the algorithm, the maximum allowable
computational complexity, and the statistics of the independent source signals [12].
1.3 Document Outline
The theoretical background for the methods used in the Pesin algorithm, including
wavelets, independent component analysis, and the Pesin algorithm itself are examined
8
in Chapter 2. Details regarding the implementation and optimization of the hardware/soft-
ware coprocessing platform which supports the Pesin algorithm are explored in Chapter 3.
An experiment, designed to demonstrate the reconfigurability of the system and determine
an appropriate configuration for eyeblink artifact removal, is devised in Chapter 4. The
results of the testing performed during this experiment are provided in Chapter 5, while
these results, observations during development, and future work are discussed in Chapter
6. Concluding remarks are given in Chapter 7.
9
Chapter 2
Background
It is necessary to understand the theoretical basis for the tools used in performing the
eyeblink artifact removal process before it can be optimized. First, an overview of several
matrix multiplication methods is provided. Following that, overviews of the wavelet trans-
form and independent component analysis are be given, concluding with a description of
the Pesin process, and how it incorporates these methods.
2.1 Matrix Multiplication
Matrix multiplication is a common signal-processing operation central to the compu-
tations in ICA. Details of several methods of computing a matrix product are examined to
provide a foundation for identifying potential optimizations.
2.1.1 Square Matrix Multiplication
Multiplication of a pair of n × n matrices A, with elements aij , and B, with elements
bij yields an n× n result C, with elements cij as
a00 . . . a0n
... . . .
...
an0 . . . ann


b00 . . . b0n
... . . .
...
bn0 . . . bnn
 =

c00 . . . c0n
... . . .
...
cn0 . . . cnn
 . (2.1)
10
Alternately, each element of the product is expressed as the dot product of a row of A
with a column of B as
cij =
n−1∑
k=0
aikbkj. (2.2)
From this expression, it is clear that the time complexity of n×n matrix multiplication
on a single multiplier is O(n3) since each element of C requires n multiplications, and C
contains n2 elements. Square matrix multiplication can be used as the basis to multiply
rectangular matrices of other dimensions.
2.1.2 Block Matrix Multiplication
In general, block matrix multiplication is a method of computing the product of matri-
ces A and B which are carefully subdivided into m smaller submatrices Ai and Bi, such
as
A =

A0
...
Am−1
 (2.3)
or
B =
[
B0 . . . Bm−1
]
(2.4)
For this particular application, it is necessary to multiply anmn×nmatrix with an n×n
matrix, where m is restricted to a value in the the set of positive integers for convenience.
This problem decomposes to m separate n× n square matrix multiplications as
A0
. . .
Am−1
[ B ] =

A0B
. . .
Am−1B
 . (2.5)
Similarly, it is also possible to compute the product of an n× n matrix with an n×mn
matrix, which decomposes to m square n× n matrix products as[
A
] [
B0 . . . Bm−1
]
=
[
AB0 . . . ABm−1
]
. (2.6)
11
An efficient design can leverage data reuse since every block matrix in the result is
determined by multiplication with B, and likewise for A in the second case. A software
implementation benefits from improved cache performance by virtue of spatial locality,
since fewer cache “misses” occur. A hardware implementation benefits by way of a re-
duction in the amount of I/O since the reusable matrix must only be sent to the hardware
execution unit once.
2.1.3 Accumulated Matrix Multiplication
When multiplying anm×nmatrixA and an n×mmatrixB, data reuse is not possible.
However, if n is evenly divisible by m, the matrix A can be subdivided every n columns
into m
n
smaller n× n submatrices A0 through Am/n−1 such that
A =

A0
...
Am
n
−1
 . (2.7)
Similarly, B can be subdivided every n rows by
B =
[
B0 . . . Bm
n
−1
]
. (2.8)
The problem can then be expressed as the summation of m
n
products of n× n matrices as
[
A0 . . . An
]
B0
...
Bn
 =
mn −1∑
k=0
AkBk
 . (2.9)
In this fashion, the number of elements that either a software or hardware matrix multi-
plier would be required to consider is substantially reduced. For software implementations,
the the cache performance benefit is far greater than for block algorithms since only two
small matrices need to be considered, rather than two large matrices. For a hardware im-
plementation, the memory requirements are considerably less since both of the operands
have been reduced in size by a factor of m
n
.
12
2.2 The Wavelet Transform
2.2.1 Overview
The wavelet transform is an important tool for multiresolution analysis. One of the
primary benefits of the wavelet transform is that it is localized in both time and frequency,
whereas other classical methods like the Fourier transform are localized in frequency, only.
Moreover, the wavelet transform offers good time resolution for low-frequency components
and good frequency resolution for high-frequency components of the signal being analyzed.
It overcomes shortcomings of other similar methods, such as the short-time Fourier trans-
form, wherein time-frequency localization is constant for all frequencies. The result is that
a wavelet transform can be designed to detect specific signal transitions localized in time
and frequency [15].
For the purpose of this research, the design of wavelet transforms is of no concern.
Rather, methods of efficiently transforming a signal given an arbitrary wavelet transform
are sought.
2.2.2 Filter Banks
The discrete wavelet transform (DWT) of a signal can be computed by performing fil-
tering and downsampling operations (this process may also be referred to as either wavelet
analysis or decomposition). This discrete forward transform is invertible and, under cer-
tain conditions, perfect reconstruction of the original discrete time signal is possible (this
process may also be referred to as wavelet synthesis or reconstruction). The inverse dis-
crete wavelet transform (IDWT) can be computed by performing upsampling and filtering
operations.
An analysis filter bank is comprised of two decomposition FIR filters - one lowpass
with coefficients h(n) and one highpass with coefficients g(n). The outputs of each filter
are subsequently downsampled by a factor of two. The outputs from the lowpass channel
are the approximation coefficients, whereas the outputs from the highpass channel are the
13
detail coefficients. Given a discrete time signal x(n), a single level decomposition with
approximation coefficients a1(n) and detail coefficients d1(n) is shown in Figure 2.1.
2
2
2
x[n]
2
2 2
2
d1[n]
d2[n]
d3[n]
a3[n]
h[n]
g[n]
h[n]
g[n]
h[n]
g[n]
2
2
x[n]
h[n]
g[n] d1[n]
a1[n]
x[n]
h[n]
g[n]d1[n]
a1[n]
+2
2
2
2h[n]
g[n] dj+1[n]
aj+1[n]
aj-1[n]dj[n]
aj[n]
+2
aj[n]
h[n]
g[n]
Figure 2.1: A single-level wavelet decomposition filter bank.
The synthesis filter bank is similarly implemented by upsampling the input approxima-
tion and detail coefficients by a factor of two. Each of the resulting signals is then passed
through a filter that is the dual of the corresponding synthesis filter, with lowpass dual co-
efficients h¯(n) and highpass dual coefficients g¯(n) [15]. Given a first-level approximation
coefficients a1(n) and detail coefficients d1(n), a single level reconstruction of the original
signal x(n) is shown in Figure 2.2.
2
2
2
x[n]
2
2 2
2
d1[n]
d2[n]
d3[n]
a3[n]
h[n]
g[n]
h[n]
g[n]
h[n]
g[n]
2
2
x[n]
h[n]
g[n] d1[n]
a1[n]
x[n]
h[n]
g[n]d1[n]
a1[n]
+2
2
2
2h[n]
g[n] dj+1[n]
aj+1[n]
aj-1[n]dj[n]
aj[n]
+2
aj[n]
h[n]
g[n]
Figure 2.2: A single-level wavelet reconstruction filter bank.
Multiple filter banks can be cascaded to form a tree capable of decomposing or recon-
structing wavelets at multiple levels. When performing decomposition, only the highest-
level approximation coefficients are stored, while the detail coefficients from every level
are stored. The result is a number of coefficients equal to the number of samples in the
original input signal. This process is illustrated in the 3-level decomposition tree shown in
Figure 2.3.
Practical realizations of the wavelet filter banks may be implemented recursively. In
this manner, hardware can be reused for each transform level, rather than replicated for
each level as in the tree structure. This is possible because the output of one stage becomes
14
22
2
x[n]
2
2 2
2
d1[n]
d2[n]
d3[n]
a3[n]
h[n]
g[n]
h[n]
g[n]
h[n]
g[n]
2
2
x[n]
h[n]
g[n] d1[n]
a1[n]
x[n]
h[n]
g[n]d1[n]
a1[n]
+2
2
2
2h[n]
g[n] dj+1[n]
aj+1[n]
aj-1[n]dj[n]
aj[n]
+2
aj[n]
h[n]
g[n]
Figure 2.3: A three-level wavelet decomposition tree.
the input to the next, while the filter coefficients remain identical for every level of the
transform.
For the decomposition filter bank, the approximation coefficients from a level j decom-
position become the input to the filter bank and yield level j+1 approximation coefficients
aj+1(n) and detail coefficients dj+1(n). These detail coefficients are stored, whereas the
approximation coefficients become the input to the filter bank for the next level of decom-
position. At the final level of decomposition, both the approximation and detail coefficients
are stored. This process is shown in Figure 2.4.
2
2
2
x[n]
2
2 2
2
d1[n]
d2[n]
d3[n]
a3[n]
h[n]
g[n]
h[n]
g[n]
h[n]
g[n]
2
2
x[n]
h[n]
g[n] d1[n]
a1[n]
x[n]
h[n]
g[n]d1[n]
a1[n]
+2
2
2
2h[n]
g[n] dj+1[n]
aj+1[n]
aj-1[n]dj[n]
aj[n]
+2
aj[n]
h[n]
g[n]
Figure 2.4: A recursive wavelet decomposition filter bank.
For the reconstruction filter bank, the highest level approximation and detail coeffi-
cients aj(n) and dj(n), respectively, are the inputs. The output then becomes the level
j − 1 approximation coefficients aj−1(n). Further levels of reconstruction are performed
by applying these approximation coefficients in conjunction with the next-lowest level de-
tail coefficients to the filter bank inputs. This process is illustrated in Figure 2.5.
15
22
2
x[n]
2
2 2
2
d1[n]
d2[n]
d3[n]
a3[n]
h[n]
g[n]
h[n]
g[n]
h[n]
g[n]
2
2
x[n]
h[n]
g[n] d1[n]
a1[n]
x[n]
h[n]
g[n]d1[n]
a1[n]
+2
2
2
2h[n]
g[n] dj+1[n]
aj+1[n]
aj-1[n]dj[n]
aj[n]
+2
aj[n]
h[n]
g[n]
Figure 2.5: A recursive wavelet reconstruction filter bank.
Downsampling
Downsampling, the first operation performed during wavelet decomposition, is the pro-
cess by which samples are removed from a discrete time signal. For such a signal x(n)
of length L, downsampling by a factor of M would result in a new signal y(n) containing
approximately L/M samples [16] as per
y(n) = x(Mn). (2.10)
When computing a DWT, we are concerned only with downsampling by a factor of
M = 2. The relation in (2.10) thus implies that samples at odd indices be removed from x.
If samples at even-numbered indices were removed instead, the resulting operation would
take the form of
y(n) = x(Mn− 1), (2.11)
which incorporates a unit delay.
Thus, “pure” downsampling by a factor of two requires that only samples at odd-
numbered indices be removed from the original signal.
Convolution
Convolution is the process by which a finite impulse response (FIR) filter can be applied
to a signal. Given a discrete time signal x(n) and an FIR filter of length N with impulse
response h(n), the convolution y(n) = x(n) ∗ h(n) is given by
y(n) =
N−1∑
k=0
h(k)x(n− k). (2.12)
16
If x(n) is a finite-length signal of L elements, undefined values such as x(−1) or x(L)
may be necessary to compute the convolution. Thus, support for x(n) must be extended
from [0, . . . , L − 1] to [−(N − 1), . . . , L + N − 2]. There are many ways to perform this
necessary extension, including, but not limited to
1. Extension by zero-padding,
2. Extension by wraparound,
3. Extension by reflection.
Extension by zero-padding is the simplest method, but it usually creates an undesirable
jump in the signal since x(0) and x(L− 1) are not generally equal to 0. A signal filtered in
this manner may also show signs of degradation near its boundaries [20].
Extension by wraparound is slightly more complex than zero-padding, but can be easily
computed when sample indices are taken modulo L. For most signals, this “periodic”
wraparound will introduce a jump since x(0) and x(L−1) are not generally equal. However
this method may perform well for signals that are truly periodic [20].
Extension by reflection (also called “symmetric extension”) does not introduce a jump
in the extended signal. Though its implementation may be more complex than other meth-
ods, symmetric extension works best with symmetric filters [20]. Coiflets, which are used
in the Pesin process, are considered “near-symmetric1”due to their increased symmetry
over other families of wavelets.
Upsampling
Upsampling a discrete time signal x(n) increases its length L by a factorM by inserting
M − 1 zeros between each of the original L samples. This relation is given by
y(n) =
x(
n
M
), if n
M
is an integer
0, otherwise
. (2.13)
1From MATLAB’s waveinfo for coiflets.
17
Since we are interested in reconstructing a signal decomposed in part by a downsam-
pling operation that removed samples at odd-numbered indices, we will perform this oper-
ation by inserting zeros at odd sample indices to preserve the first sample.
2.3 Independent Component Analysis
By making multiple observations of mixed signals, it may be possible to recover (un-
mix) the signal sources. This problem is known as blind source separation (BSS), to which
a popular solution is independent component analysis (ICA).
A common example of a BSS problem is the cocktail-party problem, wherein there is
a room containing m individuals speaking simultaneously, and m microphones in different
locations inside the room recording simultaneously. Each microphone records a different
mixed signal, which is a linear mixture of each speaker’s voice, collectively referred to as
the source signals. The problem is then to recover the source signals from only the recorded
mixed signals.
If the source signals si and mixed signals xi are taken as row vectors, they can be
concatenated to form the matrices s and x, respectively. A separating matrix W then
relates the mixed signals to the source signals by
s = Wx. (2.14)
This relationship can restated using a mixing matrix A. For simplicity, it is assumed
that the number of source signals is equal to the number of mixed signals. It can then be
said that the mixing matrix is the inverse of the separating matrix, or A = W−1. The
mixed signals can then be written as
x = As. (2.15)
If the columns of A are denoted ai, then (2.15) can be rewritten as
x =
m∑
i=1
aisi. (2.16)
18
Many implementations of ICA algorithms exist. For this work, we focus on the FastICA
algorithm developed by Aapo Hyvarinen [12, 11].
2.3.1 Centering
Centering is used as a preprocessing step wherein the mean of the observed data is
subtracted from each observation sample. Form vectors of observed data xi, each of length
n samples, centering is carried out by
yi = xi − E{xi}, i = {1, ...,m}, (2.17)
where yi are the centered observation vectors. Since, in general, the expected value of the
random process xi is not known, it is estimated by
yi = xi − 1
n
n∑
j=1
xi(j), i = {1, ...,m}. (2.18)
2.3.2 Whitening
Whitening the observed data causes the observations to become uncorrelated, but with
unit variance. Though whiteness is therefore a stricter form of uncorrelatedness, it is less
strict than independence. A zero-mean random vector z is said to be white if its correlation
matrix follows
E{zzT} = I. (2.19)
For the observation vectors x, a whitening transform V can be performed to yield a
whitened observation such that
z = Vx. (2.20)
This is possible for any observation vector since an eigenvalue decomposition of
E{xxT} = EDET , (2.21)
and
V = ED−1/2ET , (2.22)
19
exists with E the orthogonal matrix of eigenvectors and D the diagonal matrix of eigenval-
ues.
Equation (2.20) can be restated using (2.15) as
z = VAs. (2.23)
By whitening the observed data as a preprocessing step prior to performing ICA, the
search space for potential mixing matrices is reduced from the set of arbitrarym×m square
matrices A to the much smaller set of orthogonal matrices B with
B = VA. (2.24)
2.3.3 Measures of Nonguassianity
A central assumption of ICA is that no more than one independent component (IC) is
normally distributed. The central limit theorem states that the distribution of the sum of
independent random variables tends to be more gaussian than any of their original distri-
butions. ICA leverages this by finding a mixture of the observed data with a maximum
nongaussianity. Such a mixture is then said to be one of the independent components [11].
Kurtosis and negentropy are statistics used to measure nongaussianity [11]. Though
the proof is beyond the scope of this work, an algorithm which performs ICA iteratively
navigates the gradient of some approximation of the nongaussianity of the “mixtures” of
observed signals until a local maximum is reached.
One approximation of negentropy g1(y) and its derivative g′1(y) are given by the fol-
lowing:
g1(y) = tanh(ay) (2.25)
g′1(y) = a(1− tanh2(ay)) (2.26)
An alternate form of negentropy g2(y) and its derivative g′2(y) have the form:
g2(y) = y exp(−y2/2) (2.27)
20
g′2(y) = (1− ay2) exp(−ay2/2) (2.28)
The approximation of kurtosis g3(y) and corresponding derivative g′3(y) are expressed by:
g3(y) = y
3 (2.29)
g′3(y) = 3y
2 (2.30)
Kurtosis tends to be sensitive to outliers yet computationally efficient, whereas negen-
tropy tends to be more robust but expensive to compute. A portion of this work is devoted
to determining whether the additional computation expense incurred by calculating the al-
ternate form of negentropy as originally proposed by Pesin is indeed necessary.
2.3.4 Iterative Algorithm
There are two common approaches to performing the separation of independent com-
ponents from a mixed signal: symmetric and deflationary. The deflationary approach es-
timates independent components sequentially, whereas the symmetric approach estimates
all independent components in parallel. This work focuses on the symmetric approach
since the deflationary approach may lead to the accumulation of errors in estimation [11].
Moreover, it may be possible to exploit parallelism inherent in the symmetric approach to
improve the performance of the Pesin algorithm.
The FastICA algorithm with symmetric orthogonalization, adapted from [11], is given
below.
1. Center the data to make its mean zero.
2. Whiten the data to give z.
3. Choose m, the number of independent components to estimate.
4. Choose initial values for the bi, i = 1, ...,m, each of unit norm. Orthogonalize matrix
B as in step 6 below.
21
5. For every i = 1, ...,m, let bi ← E{zg(bTi z)} − E{g′(bTi z)}b, where g and g′
represent the nonlinearities and their derivatives in (2.25)-(2.30).
6. Do a symmetric orthogonalization of the matrix B = (b1, ...,bm)T by
B← (BBT )−1/2B. (2.31)
7. If not converged, go back to step 5.
The FastICA algorithm with deflationary orthogonalization, as presented in [11], is
given below.
1. Center the data to make its mean zero.
2. Whiten the data to give z.
3. Choose m, the number of independent components to estimate. Set counter p← 1.
4. Choose an initial values of unit norm for bp, e.g. randomly.
5. Let bp ← E{zg(bTp z)} − E{g′(bTp z)}b, where g and g′ represent the nonlinearities
and their derivatives in (2.25)-(2.30).
6. Do the following orthogonalization:
bp ← bp −
p−1∑
j=1
(bTp bj)bj (2.32)
7. Let bp ← bp/‖bp‖.
8. If bp has not converged, go back to step 5.
9. Set p← p+ 1. If p ≤ m, go back to step 4.
The deflationary approach extracts independent components one at a time, performing
orthogonalization in a Gram-Schmidt-like manner.
22
Though the deflationary approach does not seem suitable for the present application, a
potential ICA optimization proposed in [6] suggests extracting a single IC with the defla-
tionary approach, and thereby fixing a row of the initial mixing matrix prior to completing
ICA with the symmetric approach. The suitability of this approach as it pertains to the
present work is discussed in Chapter 6.
Algorithm Complexity
The kernel of the FastICA algorithm is an iterative search for an orthogonal mixing ma-
trixB that relates the independent components to the whitened observed data. The relevant
portions of the original FastICA MATLAB code have been adapted for presentation and are
included below for reference. Operations performed only a single time, such as principle
component analysis, whitening, and centering, are not shown since they do not contribute
substantially to the overall algorithm execution time.
1 B = whiteningMatrix * guess;
2 while(!converged) {
3 B = B * (B’ * B)ˆ(-1/2);
4 switch(nonlinearity)
5 case FICA_NONLIN_POW3 :
6 B = (Z * ((Z’ * B) .ˆ 3)) / numSamples - 3 * B;
7 case FICA_NONLIN_TANH :
8 hypTan = tanh(a * Z’ * B);
9 B = Z * hypTan / numSamples - ...
10 ones(size(B,1),1) * sum(1 - hypTan .ˆ 2) .* B / numSamples * a;
11 case FICA_NONLIN_GAUSS :
12 U = Z’ * B;
13 Usquared = U .ˆ 2;
14 ex = exp(-a * Usquared / 2);
15 gauss = U .* ex;
16 dGauss = (1 - a * Usquared) .* ex;
17 B = Z * gauss / numSamples - ...
18 ones(size(B,1),1) * sum(dGauss) .* B / numSamples;
23
19 }
20 W = B’ * whiteningMatrix;
21 A = dewhiteningMatrix * B;
Lines 1 and 20-21 are included to demonstrate how the orthogonal matrix B relates to
the mixing matrix A and its corresponding separating matrix W by the whitening trans-
form. The remaining code is executed until convergence is detected or a maximum iteration
count is reached.
Line 2 describes the symmetric orthogonalization process, which requires the multipli-
cation of three m ×m matrices. The reciprocal square root is performed on the matrix of
eigenvalues found by the eigenvalue decomposition of B, or
(BTB)−1/2 = ED−1/2ET , (2.33)
where E and D are a matrix of the eigenvectors (the modal matrix) and a diagonal ma-
trix of the eigenvalues of BTB, respectively [10]. More time consuming, however, is the
evaluation of the nonlinearity.
The application of a cubic nonlinearity, which approximates kurtosis, is the simplest of
the functions, but also the least resilient to outliers [11]. Line 6 depicts the multiplication
of ZT , the n ×m matrix of whitened observations, with B. Each element of the result is
cubed before being right-multiplied with the whitened observation matrix.
The hyperbolic tangent nonlinearity is also relatively simple, but requires the evaluation
of the transcendental function tanh, which is relatively lengthy process. However, it must
be applied to the n × m results of ZTB. An additional block matrix multiplication is
performed with Z to the right of the result of the hyperbolic tangent.
Finally, the gaussian nonlinearity is the most complex. The determination of a tempo-
rary n × m matrix U necessitates the first block matrix multiplication on line 13. Lines
13-14 show that every element of this matrix is squared before exponentiation, which is the
most time consuming computation for this contrast function. Lines 15-16 show other ma-
nipulations of this temporary matrix and its squared/exponentiated forms, including several
24
element-wise products on these n×m matrices. Finally, line 17 shows the update applied
to B, which requires an accumulated matrix product.
Convergence
Since the data processed by FastICA is white, it is known that the desired mixing ma-
trix is orthogonal. Convergence occurs when the change in the estimated mixing matrix
from one iteration to the next is negligible, and when the orthogonality condition is met.
Therefore, on iteration i, the product of the transpose of the current mixing matrix and the
mixing matrix from the previous iteration should be
BTi Bi−1 ≈ I. (2.34)
Convergence is declared when the maximum deviation from an element on the diagonal of
this product and the diagonal of the identity matrix is less than some user-defined stopping
criteria.
2.3.5 Ambiguities of ICA
Though ICA can recover independent components from a mixed signal, it cannot re-
cover the variances of any ICs. This is because the product of any scalar coefficient with
an independent component could simply be canceled by scaling the corresponding column
of the mixing matrix A in (2.16) by the reciprocal of that scalar. That the independent
components have unit variance is therefore not only an assumption, but also a restriction in
the ICA model. By the same reasoning, the sign of each independent component remains
ambiguous in this model [11].
Additionally, the order of the independent components is non deterministic. It can vary
by a permutation matrix P by inserting this matrix and its inverse in equation (2.15) as
x = AP−1Ps. (2.35)
The unknown independent random vectors Ps differ in order, as does the new unknown
25
mixing matrix AP−1. Thus, the problem reduces to the original ICA model - the search
for an unknown mixing matrix which transforms some sources into unknown ICs.
2.4 Eyeblink Artifact Removal
The eyeblink artifact removal technique that is investigated in this work was originally
proposed in [17]. The algorithm consists of the three primary stages shown in Figure 2.6:
phase I detection, ocular component separation, and phase II detection and removal.
Ocular 
Component 
Separation
Phase I 
Detection
Contaminated 
EEG Data
Independent Components
Mixing Matrix
Phase I Detection Locations
Phase II 
Detection and 
Removal
Uncontaminated 
EEG Data
Figure 2.6: Pesin algorithm overview
The input to the system is a single 10 second epoch of zero-mean EEG data, potentially
contaminated by ocular artifacts. The phase I detection algorithm determines whether eye-
blinks are present in the input and approximates any such locations. If eyeblinks are found,
ocular component separation is performed by FastICA. The independent components and
the mixing matrix are passed to the phase II detection and removal algorithm, which identi-
fies the independent component containing the eyeblinks, locates the center of each ocular
artifact, and zeros the portions of the ocular IC surrounding the blink locations. The mod-
ified independent components are re-mixed to yield artifact-free EEG data, which is the
26
system output.
2.4.1 Phase I Detection
The centered EEG data is the input to this system. From that data, it must be decided
whether and eyeblink is present. If one cannot be found, the computationally expensive
ICA need not be performed. If instead, one or more eyeblinks are found, they are localized
in time prior to performing ICA. Figure 2.7 illustrates this process.
Compute the 
Vertical EOG
Wavelet 
Decomposition
Wavelet 
Reconstruction
Threshold Separate OAs
Contaminated 
EEG Data Vertical EOG
Vertical EOG 
Decomposition
Locate OAs
OA Indices
Grouped OA 
Indices
OA Center 
Indices
Vertical EOG Reconstruction
Subtract LEOG 
from REOG
6th level 
decomposition 
with coif3
6th level detail 
reconstruction
Find indices that 
exceed 0.02
Group indices 
that occur within 
600 msec
Average indices 
to estimate OA 
center
Figure 2.7: Phase I blink detection
Since one of the 25 electrodes is located above one eye, and the another is located
below the other eye, their difference recovers the vertical electrooculogram (EOG). The
absolute value of this difference is decomposed six levels by a third order coiflet wavelet
transform. The resulting 6th level detail coefficients are reconstructed and thresholded at
an empirically determined value such that the spikes produced by eyeblink artifacts tend to
be above the threshold while the remainder of the coefficients tend below.
The third order coiflet is the wavelet of choice since it closely resembles the shape of
the eyeblink artifact. Ideally, the difference between the EOG channels contains minimal
brain contributions and thus the detail coefficients are near-zero, while the coefficients local
in time to the eyeblink artifact are large [17].
The sample indices within a 600 millisecond window for which one or more corre-
sponding coefficients exceeded the threshold are averaged to find what is assumed to be the
27
center of an eyeblink artifact. These indices are stored for future processing.
2.4.2 Ocular Component Separation
Once it has been verified that an eyeblink artifact exists within a 10 second epoch, the
raw 25 channel EEG data is separated into 25 independent components via the FastICA
algorithm.
The Pesin algorithm was initially designed for use in conjunction with a gaussian non-
linearity and symmetric orthogonalization. The gaussian nonlinearity was chosen to re-
duce the separation’s sensitivity to outliers, while symmetric orthogonalization was chosen
to avoid the accumulation of rounding errors in the deflationary orthogonalization process
[17].
2.4.3 Phase II Detection and Removal
Once the IC separation process completes, the ocular IC must be identified. Detec-
tion of the ocular artifact location follows with a process similar to the phase I detection
algorithm. This is depicted in Figure 2.8.
Since the amplitudes of the ICs are nondeterministic (their restriction to unit-variance
does not solely determine their amplitude), they are normalized to unit-amplitude before
further processing. The normalized IC yi follows from the separated IC si with
yi =
si
max{si} , i = {1, . . . , n}. (2.36)
The FP1 channel, sampled from the electrode located above the left eye on the forehead,
is used to determine which IC contains the ocular contributions due to its proximity to the
ocular artifact source. The raw data for FP1 is also normalized to unit-amplitude as in
(2.36).
The normalized FP1 channel, as well as each normalized IC, are decomposed by a
seven level wavelet transform with the coiflet. The correlation between each IC’s recon-
structed detail coefficients and the FP1 detail coefficients is calculated. The IC that gives
28
IC Correlation Coefficients
OA Center 
Indices
Wavelet 
Decomposition
Wavelet 
Reconstruction
Independent 
Components
IC 
Decomposition
7th level 
decomposition 
with coif3
7th level detail 
reconstruction
Correlation
Correlate the IC 
decompositions 
with channel FP1
Vertical EOG 
Reconstruction
Ocular IC 
Selection Ocular IC
Choose IC with 
largest coefficient
Wavelet 
Decomposition
1st level 
decomposition 
with coif3
Ocular IC 
Decomposition
Wavelet 
Reconstruction
1st level 
approximation 
reconstruction
Threshold
Find indices that 
exceed 3.0
OA Indices
Separate OAs
Group indices 
that occur within 
400 msec
Grouped OA 
Indices
Locate OAs
Average indices 
to find OA center, 
if within 1 sec of 
Phase I estimate
Phase I 
Detection 
Locations
Ocular IC Reconstruction
Contaminated 
FP1 Channel
Decomposed 
FP1 Channel
FP1 Channel 
Reconstruction
Figure 2.8: Phase II blink detection
29
the correlation coefficient with the greatest magnitude is then assumed to be the ocular
component.
A single level wavelet decomposition is performed on the ocular IC and the correspond-
ing approximation coefficients are reconstructed. The result is again thresholded, wherein
the sample indices for which the threshold was exceeded within a 400 millisecond window
were averaged to obtain the center of the eyeblink. If the phase I and phase II blink de-
tection locations match to within one second, an 400 millisecond swath of the ocular IC is
zeroed, centered on the phase II detection center.
The reconstructed EEG data is then found by remixing the ICs, including the modified
ocular IC, via multiplication with the mixing matrix A returned by FastICA.
30
Chapter 3
Implementation Details
The MATLAB source code which describes the Pesin process in [17] was available as
a starting point for this work. This code was ported to C/C++ before implementation on
the embedded target. During this process, substantial changes were made to the software.
3.1 Changes to Pesin Process
Several corrections and improvements were made to the Pesin process prior to evaluat-
ing the performance of the FPGA implementation.
3.1.1 Centering
When performing ICA, a necessary preprocessing step is to center the data by subtract-
ing from each EEG channel the channel’s mean. FastICA stores each channel’s mean as
an element of a vector which, when multiplied by the separating matrix, can be summed
with the extracted independent components. This operation yields ICs with the correct DC
offset.
Since a centering operation was performed prior to the phase I detection stage, center-
ing the data a second time as a preprocessing step in the FastICA algorithm is unnecessary.
The original implementation of the Pesin Process did not restore the mean following com-
pletion of the artifact removal process. For the present work, this calculation was included
31
following the eyeblink artifact removal stage. However, no method was employed to com-
pensate for a removed artifact’s contribution to the mean.
3.1.2 Initial Conditions
When initializing the FastICA algorithm, a random mixing matrix can be generated in
lieu of providing a predetermined guess as a starting point. The random initial mixing ma-
trix originally used in the Pesin Process made the verification of the FPGA implementation
of the algorithm difficult since the known “good” MATLAB results were nondeterministic.
The initial guess used on both platforms has been set to the m×m identity matrix.
The FastICA implementation1used in this work did not originally support the use of a
predefined initial mixing matrix, but support for this feature was extended as part of this
work.
3.1.3 Ocular IC Selection
The process by which the ocular independent component is identified was observed to
sometimes fail even in ideal situations. The amplitudes of the ICs were normalized prior to
correlation of the 7th level decompositions of each IC with the 7th level decomposition of
the FP1 channel original data.
Though ICA restricts each IC to unit variance, the amplitude of each independent com-
ponent remains nondeterministic. This is due to the fact that though the variance may be
known, the content of the signal determines its amplitude.
It is understood that the amplitude of the eyeblink artifact tends to dwarf the amplitude
of the neuronal activity by an order of magnitude or more. The relative amplitudes of the
ocular IC and the neuronal ICs are thus not preserved by the normalization.
The mixing matrix, which is returned by FastICA along with the ICs, contains the scal-
ing factor applied to each IC for each EEG channel. This additional information was used
1From the IT++ library.
32
to apply the correct scaling factor to each IC prior to computing its wavelet decomposition,
enabling the correlation to operate more reliably.
3.2 Software
3.2.1 Supporting Libraries
Central to Pesin’s eyeblink artifact detection and removal algorithm was the Helsinki
University of Technology’s MATLAB implementation of FastICA [12]. Fortunately, the
FastICA algorithm was previously ported to C++ and was made part of IT++, a free, open-
source library of signal processing functions2.
This library carries dependencies on several additional libraries including Linear Alge-
bra Package (LAPACK), Basic Linear Algebra Subprograms (BLAS), and others. FastICA
is the only function required from the IT++ library, and in turn it is dependant only on
LAPACK and BLAS. The dependencies are described by the class diagram in Figure 3.1.
IT++ LAPACK
Eigenvalue 
decomposition
Singular value 
decomposition 
BLAS
F2C
FastICA 
Matrix 
operations
Vector 
operations 
Support
Figure 3.1: External library dependency diagram.
2Available at http://itpp.sourceforge.net
33
Linear Algebra Package
LAPACK is also free open-source library. This library provides numerical solvers for
systems of linear equations, eigenvalue problems, and singular value problems.
LAPACK itself does not implement basic linear algebra operations - it is dependant
on BLAS, which provides functionality for basic vector and matrix operations. Certain
distributions of LAPACK include a reference implementation of BLAS.
Basic Linear Algebra Subroutines
In addition to specifying structures for the storage of vectors and matrices, BLAS im-
plements simple vector, matrix-vector, and matrix-matrix operations such as multiplication
and addition.
A variety of machine-specific variations of BLAS exist, such as the Intel Math Kernel
Libraries. These implementations are highly optimized for their target platforms. Com-
piling the reference implementation of BLAS does not provide optimal performance since
the compiler will not generally be able to take advantage of machine-specific architectural
enhancements.
Performance in this system was initially limited due to the unavailability of source code
from which an implementation of BLAS optimized for the PowerPC architecture could be
cross-compiled. However, matrix multiplication functionality, the mainstay of BLAS, was
independently optimized in this work, so the limitations of the reference implementation of
BLAS did not constrain the final performance of this system.
Fortran to C
Both LAPACK and BLAS were originally written in Fortran. Since the Xilinx Software
Development Kit (SDK) tools include a C/C++ cross-compiler, only, the special utility
f2c translates the Fortran libraries to C. This operation, however, introduces an additional
dependance on a corresponding f2c library.
34
Fortunately, distributions of both LAPACK and BLAS exist that have already been
converted to C. These converted libraries remain dependant on the f2c library. As such,
the source required to build the f2c library was acquired and cross-compiled for execution
on the PowerPC platform.
3.2.2 Arithmetic Precision
The original MATLAB implementation relied on double precision (64-bit) floating
point arithmetic. The underlying architecture of the x86 platform on which MATLAB
and this algorithm were implemented supported this type of arithmetic in hardware. How-
ever, the PowerPC target does not have the benefit of a hardware floating point unit (FPU).
Instead, any floating point arithmetic is implemented by calling an appropriate software
emulation routine from a compiler-provided library.
While porting the algorithm to C/C++, it was found to be appropriate to make use of
single precision arithmetic rather than double precision arithmetic, which is the default pre-
cision in MATLAB. Single precision software emulation routines are substantially faster
than their double precision counterparts. Moreover, Xilinx provides a hardware floating
point unit IP core for instantiation in FPGA fabric. However, it is limited to single preci-
sion, only.
The IT++ library leveraged templated classes and functions. However, much of the
library was written in such a way that all floating point operations were assumed to be of
type double. It would be infeasible to extend the capabilities of this library to include
templated floating point data types, so the relevant functions utilizing type double were
made to instead use type float. Similarly, calls to functions in the LAPACK and BLAS
libraries were changed to reference the single-precision versions of the functions. Addi-
tionally, the required LAPACK and BLAS libraries were compiled with single-precision
support, only.
35
3.2.3 Memory Organization
To avoid reallocating the memory required for the matrices used in intermediate cal-
culations, particulary in the FastICA routines, the algorithm was modified to use statically
allocated matrices.
First, it is considered dangerous to rely on dynamic memory allocation (such as malloc)
in embedded systems because the limited amount of memory increases the risk of serious
fragmentation. Moreover, dynamic allocation is time consuming, whereas statically allo-
cating arrays incurs no overhead.
However, for the use of static allocations to be a viable approach, the size of all arrays
must be known at compile time. For this application, the number channels of EEG record-
ing and the number of samples per channel were known, and were not included as part of
the reconfigurability concept.
Initially, it was hoped that the most frequently used matrices could be stored in on-
chip memory (OCM). The embedded PowerPCs are capable of directly accessing a set of
low-latency BRAM resources configured as OCM. However, due the limited number of
BRAMs on the target device, storage of even a single 25 × 5000 matrix was not feasible,
since it would have consumed 222 BRAMs where only 232 BRAMs were available.
3.3 Development Issues
3.3.1 Xilinx Profiler
A profiler is a software development tool to assist in identifying portions of an algorithm
to target for optimization. A profile consists of a histogram of the percentage of execution
time spent executing each function called in the algorithm, and a call graph which shows
the number of times a function was invoked by the “parent” function responsible for every
such invocation.
The histogram is created by setting a timer interrupt to occur periodically, and using
36
the interrupt service routine to sample the value of the program counter saved on the stack.
This histogram is stored in a dedicated portion of memory identified by the developer.
A call graph is generated by forcing the compiler (via a compiler switch) to alter every
function in such a way that every invocation results in a counter corresponding to that
function to be incremented, while also tracking the caller.
During testing, it was found that the Xilinx SDK would issue an error when attempting
to profile an executable larger than one megabyte. Given the size and number of libraries
necessary for this design, the size of the executable could not be reduced to beneath this
threshold.
To work around this issue, the source code for the profile timers were extracted from
the same archive of source files automatically referenced by SDK when compiling with
profiling enabled. These source files were manually included in the project and modified to
store histogram data in memory defined at compile-time, rather than by the SDK interface.
With this, not only was the executable size limitation bypassed, but so was the tool capable
of dumping the histogram into a format readable by gprof, the GNU profile summary
generation tool. As such, a script was written to read the raw histogram data as dumped
from memory, reformat the data in the manner expected by gprof, and add the appropriate
file headers.
In this manner, a profile histogram was created. However, functionality for the call
graph generation was not implemented since this information was not vital to identifying
and optimizing the most time-consuming function in this application.
3.4 Hardware
3.4.1 Development Platform
The Xilinx ML410 Embedded Development Platform was used to implement this work.
A high level overview of the system architecture is given in Figure 3.2.
This development board features a Xilinx Virtex-4 FX60 FPGA (speed grade 11). This
37
Figure 3.2: The ML410 development platform architecture
device features two PowerPC 405 cores, 232 Block RAMs totalling 4,176 kilobits of mem-
ory, 128 18 × 18 multipliers, and 25,280 slices with which to implement general pur-
pose logic. Other relevant features included on the ML410 include two serial ports and 64
megabytes of DDR memory.
The PowerPC 405 is a 32-bit fixed point processor capable of operating at up to 450
MHz. It contains independent, 16 kB, two-way set associative instruction and data caches,
as well as a specialized auxiliary processing unit interface which interfaces the processor
to custom coprocessors [26].
3.4.2 Auxiliary Processing Unit Interface
The Auxiliary Processor Unit (APU) controller is a silicon device that interfaces the em-
bedded PowerPC to the FPGA fabric. Predefined instructions such as APU loads, APU
stores, FPU operations, and others are sent by this controller to custom hardware to be
38
decoded and executed. It is also possible to create user-defined instructions to be executed
in custom hardware. Regardless of the nature of these instructions, it is the responsibility
of the reconfigurable hardware to decode and execute them.
A suitable interface for many custom coprocessing units involves creating a register file
front-end for the custom hardware, wherein APU loads and stores are decoded and
executed. APU load instructions allow a block of data of up to 4 words (128 bits) to be
copied from a location addressable by the PowerPC to the custom hardware for storage.
Similarly, an APU store copies a block of data of up to 4 words from custom hardware
to a memory location addressable by the PowerPC. Note that these load and store
instructions cannot reference PowerPC registers as either a source or destination.
Operations in custom hardware can be performed on the data contained in the register
file by issuing an appropriate predefined or user-defined instruction, or by updating the
contents of a register regarded by the custom hardware as a control register via an additional
APU load. Floating point instructions are carried out by the former method, wherein
instructions like fadd are preceded by one or more load operations to send the operands
to the hardware, and followed by store operations to return the hardware result to the
PowerPC. The latter is the method employed in this work, whereby a control register is
designated in the hardware.
The APU interface is an ideal interface to hardware because of its high data rate, ease
of implementation, and placement within the PPC pipeline. Alternatives to this interface
include the on-chip peripheral bus or the processor local bus, but they are not optimized for
high data rate applications. Also, a direct memory access controller could be implemented,
though it can only address uncacheable memory. Such a restriction could substantially
decrease the performance of the processing performed by software on this data.
This design avoids historical problems with nondeterministic behavior when compiler
optimizations are enabled [2]. The source for this project compiled and ran correctly using
the flag -O3.
The fabric coprocessor bus (FCB) is an interface implemented in reconfigurable logic
39
that allows multiple custom coprocessors to share the APU interface. However, the set of
instructions decoded by each fabric coprocessing module (FCM) is required to be mutually
exclusive, for instance, only a single coprocessor is allowed to decode the lwfcmx (quad-
word FCM load) instruction. The implications of this is that multiple custom hardware
units must share a single, suitable instruction decoding front-end, such as a register file.
The APU interface contains clock-synchronization logic which allows the PowerPC to
operate at a frequency that is different from the custom coprocessor clock rate.
3.4.3 Floating Point Unit
Xilinx provides a single precision floating point unit which is realizable in reconfig-
urable FPGA fabric.
For systems without an FPU, the compiler will perform all floating point operations
with single- and double-precision generic software emulation routines. Operations that are
emulated by software for systems lacking an FPU include fadd, fsub, fmult, fdiv,
and fsqrt. However, these functions are not optimized for the PowerPC 405 target archi-
tecture, and may consume many cycles while executing.
Xilinx provides two different configurations for the FPU. Each has the same perfor-
mance specifications, but they require varying amounts of resources. The full single preci-
sion sp_full configuration supports the complete set of single precision operations, in-
cluding basic arithmetic, fixed-floating point conversions, and square roots. The sp_lite
configuration, on the other hand, supports all single precision operations with the exception
of division and square root, while requiring fewer resources.
To notify the compiler to make use of the FPU for single precision operations rather than
software routines, the source code must be compiled using either the flag -mfpu=sp_full
or -mfpu=sp_lite, which notifies the compiler to use the full-featured or resource-
limited configurations, respectively. It is important to note that all libraries must be recom-
piled with the same flags as the source to which they are linked to ensure correct operation
of the FPU.
40
This logic core substantially increases the performance of single precision floating point
operations up to an advertised 100 million floating point operations per second. For most
practical applications however, the observed performance tends to be much lower due to
the difficulty involved in keeping the FPU pipeline full. The APU FPU operates at one half
of the processor speed, with a maximum clock rate of 275 MHz for the speed grade 11 part.
The resources required to implement the full FPU configuration in FPGA fabric consist of
1600 slices, four embedded multipliers, and two BRAMs [25].
As mentioned in [2], the control signal APUFCMLOADWAIT must be disconnected
for the FPU to function correctly when paired with other devices on the fabric coprocessor
bus.
Following the implementation of the FPU in the system, an execution profile was cre-
ated. Figure 3.3 displays the top 10 functions according to percentage of total execution
time spent within each function. The profile was run on a single 10 second epoch of EEG
data containing two waveforms identified by the Pesin process as eyeblinks. The gaussian
nonlinearity used for this processing converged in 54 iterations.
Profile for System with FPU
0.86
0.9
1.09
1.9
2.29
2.6
2.98
6.79
8.77
66.74
0 10 20 30 40 50 60 70 80
wavedec
finitef
memcpy
transpose
expf
__ieee754_powf
powf
__ieee754_expf
separate
operator*
Execution Time (%)
Figure 3.3: Execution time profile for the system with an FPU
41
Table 3.1 shows the execution time for this system configuration. The speedup is shown
as unity since this represents the “baseline” system, from which further performance im-
provements are measured.
System Configuration Execution Time (seconds) Speedup
Software, FPU 199.05 1.00
Table 3.1: Execution time for the system with an FPU
The most time consuming function, operator*(), is the function which returns the
product of a pair of IT++ mat floating point matrix objects. By Amdahl’s Law, the most
effective way of improving the performance of a system is to improve the performance of
the slowest component, which in this case is the software matrix multiplication routine. A
hardware alternative was investigated.
3.4.4 Register File
A register file capable of decoding and executing APU loads and stores acted as a
front-end for the matrix multiplier. This unit contained four read-only and four write-only
registers, as well as instruction decoding logic.
Register Interfaces
Two write-only registers are allocated to send words of the operand matrices A and
B, respectively. One write-only register acts as a control register, with bit fields for reset,
operation codes, and the like. The remaining write-only register is unused in this design.
Two read-only registers are used to return the result matrix C. Each of these registers are
interfaced to the matrix multiplier core by a FIFO.
A 2:1 clock ratio between the processor clock and the FCB clock is imposed by the
presence of the FPU in this system. Since clock synchronization between the 200 MHz
register file (driven by the processor clock) and the 100 MHz matrix multiplier (driven by
the system clock) is required, FIFOs are used to interface the clock domains. Though the
42
APU provides clock synchronization, it is forced to operate at the processor frequency by
the presence of the FPU, therefore this built-in functionality goes to waste. This configura-
tion is shown in Figure 3.4.
Register File
Write-only Registers
A
u
x
il
ia
ry
 P
ro
c
e
s
s
o
r 
U
n
it
Register [0]
Register [1]
Register [2]
Register [0]
Register [1]
Read-only Registers
Matrix Multiplier
FIFO 
OPERAND A
M
u
lt
ip
li
e
r 
C
o
re
FIFO 
OPERAND B
FIFO 
RESULT C
FIFO 
CONTROL REGISTER
FCB
Figure 3.4: The register file bridges the APU and the matrix multiplier
Each of the FIFOs are 32 bits wide to support the storage of single precision floating
point values and 1024 words deep to ensure that overflow and underflow conditions are
never possible. First-word-fall-through is enabled on each FIFO to eliminate unnecessary
latency in pipelined operations.
Instruction Decode Logic
The PowerPC instruction set includes load and store instructions for data transfers
to and from custom hardware. These instructions support byte, half-word, word, double-
word, and quad-word transfers, where the processor-side sources and targets must be mem-
ory addresses rather than registers. Since only 32-bit floating point values were read from
or written to custom hardware in this design, the register file instruction decode logic sup-
ports only the word, double-word, and quad-word load and store instruction variants. A
high-level state diagram of this logic is given in Figure 3.5.
43
IDLE
LOAD STORE
WAITHW
APUFCMFLUSH
HW Not Ready
HW Ready
Load Instruction
& HW Ready
Store Instruction
Stored < X
Words
Loaded < Y
Words
Stored X
Words
Loaded Y
Words
Figure 3.5: The APU instruction decoder state machine
The latency incurred by directly using most APU signals was too great due to their pas-
sage through the FCB. Registered copies of each signal were used in place of the originals,
causing an unavoidable additional cycle of latency for every load and store operation.
Flushed Instruction Detection Logic
The APUFCMFLUSH signal, which synchronously returns the state machine to its idle
state, is raised by the processor whenever an APU instruction is flushed from the PowerPC
pipeline. Through testing, empirical evidence was gathered which supports the claim that
when any instruction is flushed from the PPC pipeline, the APUFCMFLUSH flag is raised.
An instruction might be flushed from the PowerPC pipeline for the following reasons:
• The outcome of a branch instruction was mispredicted
• An interrupt was received
• An exception occurred
In this design, interrupts and exceptions are expected to be infrequent, if they were to
44
occur at all. Therefore, the most likely cause for a flushed instruction is a branch mispre-
diction. Since the PowerPC utilized static branch prediction, which by default predicts that
branches with negative address displacements are taken, mispredictions are likely to occur
at the final iteration of every loop [26]. If an instruction is flushed because it is not meant to
be executed, such as at the final iteration in a loop, the flushed instruction must not alter the
processor or coprocessor state. Similarly, if an instruction was flushed only to be re-issued
again later, the flushed instruction must not alter the processor state so that the re-issue does
not result in an instruction being incorrectly executed twice.
Previous work indicated that use of compiler optimizations interfered with correct op-
eration of a similar instruction decoder [2]. However, during optimization, the compiler
will re-order instructions to remove dependencies, perform loop-unrolling, and make other
such structural modifications to the source, without changing the result. For instance, a
branch instruction that occurs at the end of a loop without compiler optimization might oc-
cur before the loop after optimization. Rather than a misprediction/flush having no effect
on the outcome of the loop in the unoptimized case, a misprediction/flush could improperly
alter the state of a custom coprocessor before the loop was executed, if the flushes were not
handled properly by the instruction decode logic.
Specifically, in the case of load instructions targeting fabric registers, flushed loads
must not alter the register state. For store instructions sourcing registers, flushes are
nondestructive to the register since they do not alter its contents. With FIFOs, however,
both loads and stores result in a change in FIFO state.
Handling flushed loads is straightforward. Since the load data originates from Pow-
erPC memory, it cannot be written to the custom register file until it has been retrieved.
Provided that the APUFCMFLUSH flag has not been raised by the time the last word of
the transfer is present on the FCB, the FIFO state can be safely updated.
For stores, however, once the final word of the transfer is asserted on the FCB by
the register file, the PowerPC must still write this word to memory - a potentially time con-
suming process. If APUFCMFLUSH were to be raised at this point, the register file logic
45
would have no way to determine whether the store instruction had already completed
successfully.
To that end, an APUFCMWRITEBACKOK flag indicates the ”point of no return”
for APU load instructions. That is, when the currently executing load instruction has
reached a PowerPC pipeline stage such that it is guaranteed to complete without flushing,
this flag is asserted. Since stores which source registers are non-destructive and can be
re-issued without adverse affects, this flag is not raised for store instructions. In the event
that FIFOs are used, wherein stores are destructive, Xilinx has provided the ability to
set an APU control bit that forces this flag to be asserted when a store instruction has
completed, but this feature comes at a cost of decreased performance. Unfortunately, this
option cannot be used in conjunction with an APU FPU in the same system.
There is therefore no convenient method to include a fabric coprocessor which utilizes
FIFOs for I/O in the same system the includes Xilinx’s APU FPU. In the event that a flushed
instruction is an APU load or store, it is the responsibility of the coprocessor designer
to ensure that the instruction can be re-issued without adversely affecting the coprocessor
state.
Some means of determining that a store has completed successfully is required so
that the output FIFO can be clocked appropriately. Thus, a “ping-pong” strategy is used
whereby each consecutive store instruction targets an alternate register, with both regis-
ters tied to the same FIFO output. A store sourced from register[0] followed by a second
from register[1] results in clocking the FIFO such that each store returns the next word
in the FIFO. Two consecutive stores sourced from the same register result in the same
word being returned without clocking the FIFO, since this is an indication that the first
store was flushed and subsequently re-issued.
Floating Point versus Fixed Point Arithmetic
Since matrix multiplication decomposes to many multiplication and addition opera-
tions, a critical design decision involves the selection of an appropriate number system for
46
3031 2829 2627 2425 2223 2021 1819 1617 1415 1213 1011 89 67 45 23 01
S E F
Floating Point Number Representation
3031 2829 2627 2425 2223 2021 1819 1617 1415 1213 1011 89 67 45 23 01
F
Fixed Point Number Representation
M
Figure 3.6: Comparison of fixed and floating point number representation.
performing these arithmetic operations. For the purpose of this comparison, only 32-bit
numbers are considered.
Floating point numbers, specified by the IEEE-754 standard, have both a large dynamic
range and good precision. A number x expressed in floating point format with sign bit S,
exponent E, and fraction F is given by
x = (−1)S · (1.F × 2E−127). (3.1)
For the 32-bit floating point number shown in Figure 3.6 with 8 exponent bits and
23 fraction bits, the range of representable values is between −(2− 2−23)× 2127 and (2−
2−23)×2127. The precision of this format varies with the exponent, however for comparison
purposes it is useful to note that distance to the smallest expressible number larger than 1
is 2−23.
For a 32-bit signed fixed point number, shown in Figure 3.6 with 8 magnitude bits M
(including the twos-complement “sign bit”) and 24 fraction bits F , numbers between−2−7
and 27 − 1 are expressible. The precision, constant over the entire range, is 2−24.
Though the precision is similar for this example of floating and fixed point numbers,
the dynamic range is substantially different. Since the precision and range of the input data
are determined by the analog-digital converter that samples the EEG data, it would seem as
though a suitable fixed point representation could be found.
Unfortunately, for this application, it was not feasible to represent matrices in fixed-
point format and thus achieve a more optimal design in terms of both performance and area.
Since ICA cannot recover the gain of individual independent components, elements of the
47
mixing matrix cannot be bounded. No reasonable fixed-point word length could match
the balance between precision and dynamic range afforded by floating point numbers, thus
fixed point arithmetic operations were not possible.
Implementation of Floating Point Operators
Though the embedded PowerPC target supports an optional pipelined, single-precision
floating point unit implemented in reconfigurable hardware, this represents a severe bottle-
neck. Though Xilinx claims that the maximum performance of the FPU is in excess of 100
MFLOPS, they reason that overhead due to loads and stores and CPU pipeline stalls
impose a practical limitation on performance which is substantially lower [25].
Individual floating point operators are available from a library of Xilinx logic cores. The
operators necessary to implement matrix multiplication are multiplication and addition,
both of which consume embedded 18x18 multipliers. Since there are a limited number of
these embedded multipliers available, there is an upper-bound to the number of processing
elements contained in the linear array. The designer is granted partial control in the form
of trading pipeline stages (performance) for resources (multipliers).
Matrix Multiplication on a Linear Array
Since the FPU coprocessor is an APU peripheral, the load/store overhead cannot be
minimized directly since any software matrix multiplication algorithm would incur similar
overhead by necessity of loading and storing every element of the operand and result matri-
ces many times in the FPU register file. However, when performing matrix multiplication,
it can be seen that each element of each matrix is reused a number of times. This can be
leveraged to indirectly reduce the load/store overhead incurred by each element to only
a single transfer. Furthermore, many floating point operators can be instantiated to allow
for multiple floating point operations to be performed in parallel.
48
Figure 3.7: The structure of the matrix multiplier [27].
49
One such design that operates in this way is given in [27]. A linear array of identi-
cal processing elements (PEs), such as that shown in Figure 3.7, is instantiated in recon-
figurable hardware. The major components comprising each PE include a floating point
multiplier, a floating point adder, and a small amount of local memory.
This algorithm requires m× m
s
PEs to compute the product of an m×m matrix, where
s is the number of words of local storage in a PE. Aside from control signals, each PE has
three data inputs and outputs. A given processing element PEl accepts an element of A
and an element of B as inputs from its left neighbor, PEl−1, and returns to that neighbor
an element of the result C. Similarly, PEl outputs an element of each A and B to its right
neighbor PEl+1 and accepts from that neighbor an element of C.
For this system, the number of words of local storage per-processor was selected to
be m (the number of ICs) to reduce the number of PEs required to implement the linear
array. Since abundant BRAM resources were available in the target FPGA, this selection
reduced the burden on lookup tables, embedded multipliers, and other resources necessary
to implement the other components in the PEs.
To compute the matrix product, elements of B are fed into the leftmost processor PE0
in row-major order. As the first element of the second row of B enters the array, the first
element of A is read into the leftmost PE, proceeding in column-major order. Since the
entry into the array of the first element of A lags the first element of B by m cycles, the
“operand read” operation requires m2 +m cycles.
An example of the data flow for the product of a pair of 2×2 matrices is shown in Figure
3.8 for the leftmost processing element PE0. This demonstration assumes a pipelined mul-
tiplier and adder, each with two stages. The output of the adder is stored in the processor’s
local memory.
After the pair of input matrices fully propagates through the linear array, the local mem-
ory in each PE contains a complete column of the result matrix, which propagates out of
the array to the left. A total of m2 cycles are required to write-back the result.
The hardware resources consumed by a single PE are rather modest, largely due to
50
a00 a10 a01 a11
b00 b01 b10 b11
b10 b10
a00
b00
a10
b00
a01
b10
a11
b10
c00
0
0
c10
0
0
c00
1
c00
0
c10
1
c10
0
b00 b00 b00 b00
b00 b10b00 b10
a00
b00
a10
b00
a01
b10
a11
b10
c00
0
0
c10
0
0
c00
1
c00
0
c10
1
c10
0
RR.A
RR.B
RR.B1
RR.B2
MULT1
MULT2
ADD1
ADD2
1 2 3 4 5 6 7 8 9Cycle
b00 b00
b10 b10
Figure 3.8: The data flow within PE0 for one row/column of operand matrices.
51
the availability of individual floating point operators rather than a full-featured FPU. Thus,
only a single floating point adder and floating point multiplier are implemented for each
processing element.
The target clock rate for this matrix multiplier was 100 MHz. There is no performance
benefit for exceeding this goal since the APU, which runs at 200 MHz, represents a bottle-
neck. For double-word load operations, the rate at which data is supplied to the matrix
multiplier is less than two words per four clock cycles. If the APU could provide data at
this rate, the multiplier pipeline would remain full; therefore a more realistic, lower data
rate yields an under-utilized multiplier pipeline despite the clock frequency difference.
Performance
Table 3.2 compares the performance of software and hardware matrix product computa-
tions. The software computations were performed with the mat objects and the operator*()
method defined by the IT++ library. The hardware calculations were performed with the
matrix multiplier linear array. The execution time for the hardware method includes the
time required by the PowerPC to write the operands to the hardware as well as read the
result back into DDR memory.
Operation 
Type
Software Time 
(msec)
Hardware Time 
(msec)
Speedup
Square (25x25) · (25x25) 2.403 0.087 27.7
Block (5000x25) · (25x25) 606.3 22.10 27.4
Block (25x25) · (25x5000) 486.2 22.18 21.9
Accumulated (25x5000) · (5000x25) 1549 20.65 75.0
Square (25x25)
T · (25x25) 2.501 0.097 25.7
Block (25x5000)
T · (25x25) 664.5 22.35 29.7
Matrix Dimensions
Table 3.2: Matrix multiplication performance comparison.
The matrices in the test were sized with the dimensions of the matrices used in the ICA
calculations which are determined by the number of input channels, 25, or the number of
samples per channel, 5000.
52
A substantial speedup was observed for all three methods. As expected, the two block
matrix multiplication methods required the same amount of time to execute in hardware,
however they achieved significantly different speedups. Since IT++ implements matrix
storage and operations using underlying BLAS routines, the data is stored in column-major
format. Therefore, there is a significant software performance benefit when the rectangular
matrix has more rows than columns, since the order in which the elements are accessed to
perform the many dot-products is sequential by memory address.
An additional benefit to the hardware implementation of the matrix multiplier is that
transpose operations are essentially unnecessary. As the operands are being read from
memory and sent to the hardware, the matrices can be read either row-major or column-
major, the selection of which is equivalent to an on-the-fly transpose operation. For com-
parison, the IT++ transpose operation allocates new memory for the transpose matrix and
copies every element of the operand matrix, which requires a substantial amount of time.
The results for the operations with a concurrent transpose operation show approximately
the same speedup as do the equivalent operations without a transpose. This indicates that
the implicit transpose operation performed when the operands are written to the hardware
is substantially faster than the software transpose method.
The performance for transpose operations for the remaining types of block multiplica-
tion and the accumulated multiplication are not shown because they are not used in this
application.
The execution profile for the eyeblink artifact removal algorithm with hardware matrix
multiplication support is shown in Figure 3.9, using the same 10 second epoch as before.
Previously, operator*() consumed more than two-thirds of the execution time. Now,
the FastICA kernel separate() occupies the largest percentage of execution time, with
single precision numerical methods such as expf and powf together consuming nearly
two-thirds of the remaining execution time. Table 3.3 shows the three-fold improvement in
execution time.
A diagram of the complete hardware architecture for this approach is given in Figure
53
Profile for System with Matrix Multiplier
2.04
2.31
2.67
2.85
3.2
6.48
7.59
9.64
20.52
27.26
0 5 10 15 20 25 30
block_matrix_multB
isnanf
wavedec
finitef
memcpy
expf
__ieee754_powf
powf
__ieee754_expf
separate
Execution Time (%)
Figure 3.9: Execution time profile for system with hardware matrix multiplier.
System Configuration Execution Time (seconds) Speedup
Software, FPU 199.05 1.00
Software, FPU, Matrix Multiplier 65.25 3.05
Table 3.3: Execution time comparison for the system with a matrix multiplier.
54
3.10. It includes the FPU, register file, and matrix multiplier associated with the APU via
the FCB, as well as other peripherals including memory and external I/O attached to the
PowerPC via the processor local bus.
2
0
0
 M
H
z
 F
C
BPPC405 CPU
APU
M
F
C
B
ICACHE
16 kB
DCACHE
16 kB
MPLB
Legend
Bus Master
Bus Slave
FPU
Matrix Multiplier
100 MHz PLB
DDR Controller UART UART
single precision
RS-232
To PC To PC
SPLB SPLB SPLB RS-232
DDR SDRAM
64 MB
Register File
4 read-only registers
4 write-only registers
25x25 square matrices
single precision
SFCB
SFCB
Execution Core
100 MHz200 MHz
200 MHz 100 MHz
100 MHz
Figure 3.10: The hardware architecture.
To implement this system on the FPGA requires nearly 84% of the available slices,
81% of the embedded multipliers, and 19% of the on-chip memory resources. The device
utilization summary generated by Xilinx ISE post-place and route is shown in Figure 3.11.
55
Figure 3.11: Device utilization summary for the target Virtex-4 FX60 FPGA
56
3.5 Overhead Reduction
When passing parameters to functions, there is a significant performance benefit to
passing parameters and return values by reference. The alternative, often used in the IT++
library, is to return result objects by value, which invokes the object’s copy constructor. In
the case of matrices, this can result in a significant overhead due to the process of copying
the data.
The function prototype for the generic matrix multiplication method in IT++ is of the
following form.
const Mat<Num_T> operator*(const Mat<Num_T> &m1, const Mat<Num_T> &
m2);
Though this method does not require creation of local copies of the operand matrices,
this function is not only responsible for allocating memory, but the returned result matrix
is copied from this function to the caller. This return occurs by invoking the mat object’s
operator=() method.
In contrast, when designing the routines that perform matrix multiplication on the linear
array, the caller is required to allocate space for the results and pass the corresponding
variable as a reference. The square matrix multiplication prototype has the following form.
void matrix_mult(const mat &A, const mat &B, mat &C);
Substantial performance benefits were realized by limiting the number of times these
large copy operations were performed. Thus, for element-wise operations such as multipli-
cation, the function
const Mat<Num_T> elem_mult (const Mat<Num_T> &A, const Mat<Num_T> &B
);
was replaced with a different IT++ function.
57
void elem_mult_out (const Mat<Num_T> &A, const Mat<Num_T> &B, Mat<
Num_T> &out);
A similar approach was taken for computing the element-wise exponential of a matrix when
applying the gaussian contrast function.
Additionally, calls to powf where the exponent was a small integer, such as for the
gaussian and cubic contrast functions, were replaced with calls to elem_mult_out, re-
peated if necessary to reach the required power. Since powf is capable of processing a
decimal exponent, its execution time dwarfs that of the one or two multiplications neces-
sary to compute the square or cube of a number.
Using the same test data as before, the final execution profile is shown in Figure 3.12.
Though it appears the percentage of execution time consumed by separate() has not
changed from the previous profile, the overall execution time has decreased nearly 50%,
and powf no longer appears in the profile, as shown in Table 3.4.
Profile for Optimized System
1.84
1.84
2.05
2.53
2.74
3.01
3.91
10.43
26.15
30.85
0 5 10 15 20 25 30 35
wrcoef
finitef
sum
memcpy
accum_matrix_mult
block_matrix_multB
wavedec
expf
separate
__ieee754_expf
Execution Time (%)
Figure 3.12: Execution time profile for system with final optimizations
58
System Configuration Execution Time (seconds) Speedup
Software, FPU 199.05 1.00
Software, FPU, Matrix Multiplier 65.25 3.05
Optimized Software, FPU, Matrix Multiplier 44.47 4.48
Table 3.4: Execution time comparison for system with final optimizations
3.6 Reconfigurability
Some of the design decisions originally applied to the work in [17] were based upon
observations made on intermediate and final results produced when processing a limited
set of data. To that end, though the results may have been accurate for the data tested, it
does not follow that the chosen parameters are suitable for the wide range of individuals
for which eyeblink artifact removal might be necessary.
Furthermore, individual eyeblink characteristics may vary in such a way that a single set
of parameters is not suitable to account for variations found across the general population.
It is thus desired that parameters used in the Pesin process be reconfigurable such that a
clinician or other responsible individual may modify them on-line.
Parameters which were made reconfigurable in the present design include the following:
• The number of levels in the phase I wavelet decomposition
• The number of levels in the phase II wavelet decomposition
• The phase I blink detection threshold
• The phase II blink detection threshold
• The phase I wavelet decomposition and reconstruction coefficients
• The maximum number of iterations in FastICA
• The FastICA convergence criteria
• The FastICA nonlinearity/contrast function
59
• The FastICA nonlinearity constant scalar
• The FastICA orthogonalization method
Key parameters which may be modified according to a patient’s individual needs in-
clude the wavelet transform filter coefficients, the number of levels of decomposition/re-
construction used in artifact detection, and the thresholds used in artifact detection. These
parameters account for the characteristics of the eyeblink artifact which the clinician wishes
to remove, including shape and duration.
The maximum number of iterations and the convergence criteria are directly related to
the tradeoff between the quality of the separation desired and the time required by FastICA.
The selection of an appropriate nonlinearity is determined by the type of data being
processed. This work examines the effect of this choice on the quality of the IC separation
for EEG data.
No investigation as to the effects of the nonlinearity scalar coefficient were made either
in this work or in [17]. A study on the effects of variations in this parameter is left as future
work.
Though selection of the orthogonalization method is possible in this system, the present
design optimizes only the symmetric orthogonalization approach. The outcome of a hybrid
deflationary-symmetric approach is discussed later.
3.7 Interface
3.7.1 Graphical Interface
A graphical user interface (GUI), which ran on the host PC connected to the develop-
ment platform, was created for testing purposes. The GUI was written in Java, and handled
the communications between the host PC and the development platform, including transfer
of the input EEG data, retrieval of the processed EEG data and independent components,
60
and selection of reconfigurable parameters. Screenshots of each of the GUI tabs are shown
in Figure 3.13.
3.7.2 Data Interface
Each of the two serial ports available on the ML410 development platform were con-
figured either as a data or control port. The transfer of the contaminated EEG data from
the host PC to the test platform as well as the return of the results and ICs utilized the data
port, while configuration and status information were exchanged on the control port. Both
ports were set to operate at 567600 baud. Communication at the maximum supported rate
of 115200 baud was found to be unreliable.
61
Figure 3.13: Screenshots of each GUI panel
62
Chapter 4
Method of Evaluation
To properly evaluate the performance of the Pesin process, a quantitative measurement
of the eyeblink removal quality is necessary. Thus, precise knowledge of the uncontami-
nated EEG is required. A method of simulating several epochs of EEG contaminated by
ocular artifacts was devised.
However, to gauge the performance of the algorithm on real data, knowledge of the
location and type of artifacts present in the EEG is required. To that end, EEG data was
collected in a controlled environment and used to determine the most suitable FastICA
contrast function for extracting the ocular IC from the data.
Aditionally, to assist the seizure prediction and detection algorithm, this artifact re-
moval process must be capable of correctly identifying and removing only ocular artifacts,
while ignoring the non-ocular content of an EEG - even if it resembles an eyeblink. There-
fore, data captured from patients presenting with neurological conditions that involve EEG
waveforms resembling eyeblinks was also considered.
4.1 Simulated EEG
4.1.1 Simulated EEG Generation
Past studies have determined the transmission coefficient of vertical EOG contamina-
tion into EEG [8]. The transmission coefficients determined by this study were extended to
63
include the additional EEG channels used in the present data collection. For the additional
channels (with the exception of FP1 and FP2), the omitted channels along the same merid-
ian were assigned the same coefficient as the known channels. For FP1 and FP2, since
no data was available with which to interpolate or extrapolate, an arbitrary transmission
coefficient of 0.5 was assigned due to these electrodes’ proximity to the eyes.
A diagram outlining the location on the scalp of the electrodes considered in the eye-
blink removal algorithm is given in Figure 4.1. For each of these electrodes, the transmis-
sion coefficients either gathered directly from the research in [8] or interpolated from those
results are displayed in Table 4.1.
Figure 4.1: Top view of electrode placement for 10-20 system
EEG Channel Transmission Coefficient
FP1, FP2 0.50
F3, F4, F7, F8, FZ, FT9, FT10 0.23
C3, C4, CZ, T3, T4, A1, A2 0.12
P3, P4, PZ 0.09
O1, O2, T5, T6 0.06
Table 4.1: EOG Transmission Coefficients
A recording free of eyeblink artifacts was used as a baseline for simulating EEG data
64
contaminated by ocular artifacts. The left and right EOG channels from a separate record-
ing which contained eyeblinks, but no eye movements, were subtracted to form the vertical
EOG. The vertical EOG was scaled and summed with each channel of the uncontaminated
EOG, where the scaling factor was the transmission coefficient in Table 4.1.
The simulated EEG was then processed with both the original Pesin algorithm in MAT-
LAB (after updating it to include the deterministic initial condition modification for con-
sistency) and the present FPGA implementation. The data was processed with all three
contrast functions on both platforms.
4.1.2 Eyeblink Removal Parameters
The phase I wavelet detection threshold used in the original Pesin process is 0.03 mV,
whereas the current research showed empirically that a threshold of 0.02 mV was more
successful in detecting eyeblinks, albeit with an added risk of more false detections. For a
fair comparison, the detection threshold in the original Pesin Process was updated to 0.02
mV before performing the experiment.
For the FPGA implementation, the maximum number of iterations FastICA could per-
form was reduced from 5000, as in the original algorithm, to 500. This reduction is nec-
essary to enforce time constraints. If the system were able to execute in real time, this
maximum iteration count would necessarily be fixed such that the maximum number of it-
erations would execute in less than the maximum allowable time of 10 seconds. Moreover,
it was observed that if the iteration count exceeded 500 while attempting to extract ICs for
a given epoch, FastICA tended not to converge.
Additionally, the stopping criteria was reduced from 10−8, as it was in the original
algorithm, to the FastICA default convergence criteria of 10−4. Since the embedded sys-
tem should ideally perform in real-time, a failure to converge must be identified sooner.
Also, the FPGA implemented single precision arithmetic, whereas MATLAB utilized dou-
ble precision math. Since the largest difference between single precision numbers of unit
magnitude is on the order of 10−7, and such a comparison defines the FastICA convergence
65
criteria, the initial value of 10−8 is unreachable.
It was decided that it would be most fair to test the platforms with these differences to
highlight the changes made in the present research. The parameters listed in Table 4.2 were
used for the simulated EEG processing, as well as for the remainder of this experiment.
Parameter MATLAB FPGA
Maximum Iteration Count 5000 500
Convergence Criteria 1.00E-08 1.00E-04
IC Selection Method Normalized Scaled
Phase I Detection Threshold
Phase II Detection Threshold
Contrast Function Scalar
Orthogonalization Approach
Wavelet Transform
Phase I Reconstruction
Phase II Reconstruction
6th level detail
1st level approximation
0.02
3.0
Symmetric
3rd order coiflet
1.0
Table 4.2: Algorithm parameters for each platform
4.2 Experiment Design
An experiment was devised to demonstrate the reconfigurability of the FPGA eyeblink
artifact removal system by utilizing it to process real data while varying system parameters.
By using EEG data gathered in both controlled and clinical environments, the effectiveness
of the eyeblink removal procedure could be tested with data known to contain eyeblink
artifacts as well as data containing neuronal discharges resembling eyeblink artifacts.
Control data was collected for a variety of scenarios including combinations of eye
blinks and eye movements. This control data was processed using varying algorithm pa-
rameters to demonstrate both the reconfigurability of the system as well as to determine an
optimal configuration.
The simplest control case was used to eliminate inadequate parameter combinations
from further consideration. Once the nominal algorithm parameter set was established, real
data from patients with a variety of neurological abnormalities was processed and reviewed
by a physician to evaluate the success of the eyeblink artifact removal process.
66
4.2.1 Control Data
A test subject with no known abnormal neurological conditions was the subject of an
EEG recording session to gather the following control data:
• Subject looking forward with no eye movements and no blinking.
• Subject looking forward with no eye movements and normal blinking - about one
every 5-10 seconds.
• Subject moving eyes alternately left and right once every two seconds, without blink-
ing.
• Subject moving eyes alternately up and down once every two seconds, without blink-
ing.
For each case, data was collected for a minimum 60 second interval. For the purposes
of this study, ”normal blinking” shall be defined as a non-evoked, involuntary eyeblink.
While collecting the data, the test subject observed the following:
• Minimum muscle movements,
• Steady respiration,
• Remain sitting.
The data was collected using a 32-channel headbox, including two channels of ECG
and two channels of EOG. Electrodes were arranged on the test subject’s scalp according to
the internationally recognized 10-20 system. The data was collected using Xltek Exchange
running on the attached PC. The export function in Xltek Exchange was used to dump the
raw EEG data to a text file, which contained the sample voltages in decimal format. Despite
recording 32 channels, only a 25 channel subset was used in the experiment, including the
pair of EOG channels but excluding the ECG channels.
In addition to the EEG data, close-up video of the subject’s eyes was recorded so as to
visually confirm the presence and type of eyeblink artifacts contained in the EEG data.
67
4.2.2 Clinical Data
EEG recordings of patients presenting with the following conditions were used to vali-
date the algorithm with actual clinical data:
• Frontal intermittent rhythmic delta activity (FIRDA)
• Triphasic wave patterns
These conditions were chosen due to the fact that the waveforms indicative of these
conditions tend to resemble eyeblink artifacts. Successful removal of eyeblinks, if present,
and preservation of the neuronal waveforms would validate that the ocular artifact removal
algorithm can discriminate between ocular activity and neuronal activity that resembles it.
These data sets were provided after being stripped of personal identifying information.
The data contained in these recordings was captured using the same electrode configuration
as the control data. Video of the patients faces was not available, so it was not possible to
ascertain the existence or location of ocular artifacts.
4.2.3 Experiment
Phase I
Starting with the simplest control case (no eye movements, no blinking), the data was
processed using the FPGA system parameters in Table 4.2 and each of the three FastICA
contrast functions: hyperbolic tangent (tanh), cubic (pow3), and gaussian (gauss).
For a given data set, any parameter selection that produced incorrect results was dis-
carded before proceeding to the next, potentially more complex, data set. If the results were
incorrect for a simpler data set, little could be gained by analyzing the results from either
more complex control data or real clinical data processed with suboptimal parameters.
After completing tests with a given data set, and potentially eliminating a set of algo-
rithm parameters from further consideration, the next simplest control case with the poten-
tially reduced set of parameters was tested. This process repeated until either test cases or
68
parameters were exhausted, whichever occurred first.
The nominal set(s) of parameters were then be chosen for further study. Though it
is possible that multiple sets of parameters could be found for which the performance is
acceptable, the purpose of this experiment is to demonstrate the flexibility of this system.
A more rigorous experiment with more complete coverage of possible parameter variations
is left as future work.
For analysis, the original EEG data, processed data, and the difference between the two
is written in a European Data Format (EDF) file. For prognostics, the ICs extracted for
each epoch were also printed in EDF files so as to determine whether the correct IC was
identified. The results were sent to Dr. Michel Berg, a neurologist from the Strong Medical
Center in Rochester, NY, for analysis.
Phase II
Using the “optimal” parameter set selected at the completion of Phase I, the clinical
data containing FIRDA or triphasic waveforms was processed. This data, too, was output
in EDF and submitted to Dr. Berg for review.
4.2.4 Analysis
The processed data was reviewed by Dr. Berg. For consistency, the original data from
which the results were generated were converted to the same file format. Additionally, the
difference between the input and output was generated and converted to EDF for analysis.
A quantifiable measurement of the success of the algorithm in removing eyeblink arti-
facts without introducing new distortions was implemented via analysis of simulated ocular
artifacts before and after application of the removal process.
69
Chapter 5
Results
The results from each of the three portions of the experiment are outlined below. From
the simulated data, a quantitative comparison of the performance of each platform was
derived using several key statistics. The evaluation of the “live” control data revealed the
tendencies of the algorithm to detect true or false eyeblinks for each contrast function, while
also showing which contrast functions had the potential to become numerically unstable.
Finally, the clinical data demonstrated the algorithm’s ability to differentiate between true
eyeblinks and neuronal discharges that resembled ocular artifacts.
5.1 Simulated Data
Evaluation of the simulated EEG processing was aided by the use of the video corre-
sponding to the control data set (eyeblinks with no eye movements) to identify the locations
where eyeblinks occurred. Since this data was overlaid on a set of control data containing
no eyeblinks or eye movements, no other ocular artifacts were believed to be present.
When processing the simulated data in both MATLAB and on the FPGA, only a single
case where both the MATLAB and FPGA implementations of FastICA converged was
observed. This simulated artifact and the corresponding uncontaminated CZ EEG channel
to which it was added are shown below in Figure 5.1.
This case, exemplified in Figure 5.2 for the CZ channel, highlights one of the issues
with the way in which the simulation data was generated. For all contrast functions, the
70
0 50 100 150 200 250 300 350 400
-30
-20
-10
0
10
20
30
40
50
Simulated Eyeblink Artifact
Time (msec)
V
o
lt
a
g
e
 (
µ
V
)
Uncontaminated EEG Simulated Contaminated EEG
 
Figure 5.1: Simulated eyeblink and original uncontaminated EEG
71
FPGA and MATLAB results contained only negligible differences. However, the differ-
ences between the uncontaminated processed and original results are more significant. A
portion of these differences can be attributed to the fact that the vertical EOG summed with
each of the EEG channels to simulate eyeblink artifacts contained some non-ocular activ-
ity not originally present in the uncontaminated EEG data. The other contributor to the
observed difference is the contamination of the ocular IC with non-ocular contributions.
0 50 100 150 200 250 300 350 400
-10
0
10
V
o
lt
a
g
e
 (
µ
V
)
Gaussian Nonlinearity
0 50 100 150 200 250 300 350 400
-10
0
10
V
o
lt
a
g
e
 (
µ
V
)
Hyperbolic Tangent Nonlinearity
0 50 100 150 200 250 300 350 400
-10
0
10
Cubic Nonlinearity
V
o
lt
a
g
e
 (
µ
V
)
Time (msec)
Original EEG MATLAB Processed FPGA Processed
 
Figure 5.2: Comparison of MATLAB and FPGA waveforms for simulated eyeblink
For the same simulated eyeblink, a comparison of several metrics used to quantify
the quality of the eyeblink removal for the simulated EEG data between the MATLAB
and FPGA implementations is given in Figure 5.3. These metrics include the correlation
72
coefficient, Euclidean distance, and standard deviation ratio. These calculations are made
between the original, uncontaminated EEG data to which the simulated artifacts were added
and the processed EEG data following the eyeblink removal.
Figures 5.4-5.6 show the outcome of these metrics across all simulated eyeblinks for
which FastICA converged across both platforms. These measurements were made only for
the 400 msec period affected by the blink removal algorithm, and only on the CZ channel,
selected for its central location on the scalp.
5.1.1 Correlation Coefficient
The correlation coefficient ρ between two vectors x and y is given by
ρx,y =
E[(x− µx)(y − µy)]
σxσy
, (5.1)
where µx and µy are the vector means and σx and σy are the standard deviations.1 This
metric is a measure of the similarity of the shapes of the vectors, but says nothing about
their relative amplitudes [23]. Due to the normalization of the numerator by the product
of the standard deviations, the maximum value of the correlation coefficient is 1. Thus, a
value of 1 indicates that the waveforms are identical in all but mean and amplitude.
For both the original Pesin algorithm implemented in MATLAB and the modified al-
gorithm running on the FPGA, the correlation coefficient between the original data (x) and
the processed data (y) is shown in Figure 5.4.
Generally, the gaussian contrast function showed the best results, whereas the hyper-
bolic tangent and cubic results were substantially worse.
5.1.2 Euclidean Distance
The Euclidean distance d between vectors x and y is given by
dx,y =
√√√√ N∑
k=1
(xi − yi)2. (5.2)
1The computation was performed with MATLAB’s corr.
73
Correlation
0.88
0.89
0.9
0.91
0.92
gauss tanh pow3
C
o
rr
e
la
ti
o
n
 C
o
e
ff
ic
ie
n
t
MATLAB FPGA
Euclidean Distance
0.04
0.042
0.044
0.046
0.048
0.05
gauss tanh pow3
E
u
c
lid
e
a
n
 D
is
ta
n
c
e
MATLAB FPGA
Standard Deviation Ratio
0.765
0.77
0.775
0.78
0.785
0.79
0.795
gauss tanh pow3S
ta
n
d
a
rd
 D
e
v
ia
ti
o
n
 R
a
ti
o
MATLAB FPGA
Figure 5.3: Comparison of MATLAB and FPGA statistics for a simulated eyeblink
74
Nonlinearity Correlation Comparison for MATLAB
0
0.2
0.4
0.6
0.8
1
1 2 3 4 5 6 7 8
C
o
rr
e
la
ti
o
n
C
o
e
ff
ic
ie
n
t
Nonlinearity Correlation Comparison for FPGA
0
0.2
0.4
0.6
0.8
1
1 2 3 4 5 6 7 8
Eyeblink Number
C
o
rr
e
la
ti
o
n
 
C
o
e
ff
ic
ie
n
t
Gaussian Hyperbolic Tangent Cubic
Figure 5.4: MATLAB and FPGA correlation coefficient for simulated eyeblink
75
This metric measures both the similarity in shape and amplitude [23]. Large element-
wise differences between x and y increase the value of d, therefore a value of 0 indicates
that the two vectors are identical.
Nonlinearity Euclideam Distance Comparison for MATLAB 
0
20
40
60
80
100
1 2 3 4 5 6 7 8
E
u
c
lid
e
a
n
D
is
ta
n
c
e
 (
µ
V
)
Nonlinearity Euclidean Distance Comparison for FPGA
0
20
40
60
80
100
1 2 3 4 5 6 7 8
Eyeblink Number
E
u
c
lid
e
a
n
D
is
ta
n
c
e
 (
µ
V
)
Gaussian Hyperbolic Tangent Cubic
Figure 5.5: MATLAB and FPGA Euclidean distance for simulated eyeblink
The Euclidean distance over the 200 samples (400 milliseconds) over which the eye-
blinks were removed never exceeded 90 µV for any eyeblink, and were generally less than
about 60 µV. This difference is somewhat significant since the example in Figure 5.2 shows
that the uncontaminated EEG signal levels were at about 10 µV.
The cubic nonlinearity tended to be the worst performer, whereas the performance of
the gaussian method tended to nearly match the performance of the hyperbolic tangent
method.
76
5.1.3 Standard Deviation Ratio
The standard deviation ratio r between two vectors x and y is simply equal to
rx,y =
σx
σy
. (5.3)
The standard deviation of a signal is related to its variance, which in turn is a measure
of the signal power, assuming that it is zero mean. Therefore, a standard deviation ratio
greater than unity indicates that x has a larger signal power than y, whereas a value of 1
indicate that the signal powers are identical [23].
Nonlinearity Standard Deviation Comparison for MATLAB
0
0.5
1
1.5
1 2 3 4 5 6 7 8
S
ta
n
d
a
rd
 D
e
v
ia
ti
o
n
 
R
a
ti
o
Nonlinearity Standard Deviation Comparison for FPGA
0
0.5
1
1.5
1 2 3 4 5 6 7 8
Eyeblink Number
S
ta
n
d
a
rd
 D
e
v
ia
ti
o
n
 
R
a
ti
o
Gaussian Hyperbolic Tangent Cubic
Figure 5.6: MATLAB and FPGA standard deviation ratio for simulated eyeblink
For blinks 3-5, the standard deviation ratio was near-unity for all of the contrast func-
tions. However, for the remainder of the eyeblinks, the less than unity ratio indicated that
the power of the processed output was greater than that of the uncontaminated signal. This
77
could be indicative of either remaining artifact, or contamination from the remaining non-
ocular portion of the vertical EOG signal from which the simulated artifact was constructed.
5.2 Control Data
A summary of the results for the processed control data is given in 5.1.
Eye 
Movement
True Blinks
Correct 
Removals
False 
Positives
Undetected 
Blinks
Diverged (%)
None 8 8 0 0 0.0
Horizontal 7 0 13 7 16.7
Vertical 2 0 16 2 0.0
None 8 3 0 5 66.7
Horizontal -- -- -- -- --
Vertical -- -- -- -- --
None 8 7 0 1 16.7
Horizontal 7 1 11 6 16.7
Vertical 2 0 11 2 50.0
Gaussian
Hyperbolic 
Tangent
Cubic
Table 5.1: Summary of the eyeblink removal results for all contrast functions.
One of the issues encountered during acquisition of the control data was that the test
subject found it difficult to perform eye movements without blinking. In fact, EEGs con-
taining eye movements without blinking may be a rarity given the difficulty encountered
generating such control data. Nevertheless, a significant number of horizontal eye move-
ments were accompanied by blinks, whereas the vertical eye movements contaminated with
blinks were in the minority.
The true eyeblink locations for the control cases were determined by reviewing video
captured of the subject’s face in tandem with the EEG recording. The video showed a
closeup of the subject’s eyes. Because eye movements tended to obscure the waveforms
indicative of eyeblinks in the EEG/EOG recordings, video was the most reliable means of
determining their presence and location.
78
5.3 Clinical Data
Video showing the patients’ faces was not available for reference while evaluating the
results of the clinical data, thus the true locations of eyeblinks, if any exist, are not known.
The type of waveforms indicative of FIRDA and triphasic waves are similar to eyeblinks,
thus they are a valuable test for this algorithm to ensure that clinically important data are not
removed from the EEG. A consequence of this is that eyeblinks are difficult for a clinician
to identify from the EEG alone.
Figure 5.7 shows a sample 10 second epoch of EEG data with FIRDA waveforms.
Figure 5.8 shows 10 seconds of triphasic waves in EEG data.
Figure 5.7: 10 second EEG with FIRDA.
A summary of the performance of the gaussian and cubic nonlinearities for the EEGs
containing FIRDA and triphasic waveforms is given in Table 5.2. The number of detections
listed indicates the number of waveforms resembling eyeblinks found in what the algorithm
79
Figure 5.8: 10 second EEG with triphasic waves.
chose as the ocular IC, whereas the number of distortions represents the number of wave-
form removals. Since the neurologist could not identify any eyeblinks in the EEGs, it was
assumed that any removals that occurred introduced distortion.
5.3.1 FIRDA
For the FIRDA test, the gaussian contrast function successfully avoided identifying or
removing eyeblinks in five out of six epochs examined. This is likely due to misidentifica-
tion of an ocular IC, if one was present at all. In the epoch where potential eyeblinks were
identified and removed, the neurologist believes the removed components were not ocular.
Distortions introduced because of this were characterized as “small.”
When using the cubic nonlinearity, potential ocular artifacts were identified and re-
moved in all six epochs. However, only three distortions were introduced in two epochs. It
is noteworthy that distortions were not introduced into the epochs wherein FastICA failed
80
1 2 3 4 5 6
Detections 0 0 0 0 0 4 0.67
Distortions 0 0 0 0 0 2 0.33
Diverged 0 0 0 1 0 0 0.17
Detections 4 6 4 0 6 3 3.83
Distortions 1 1 2 0 3 0 1.17
Diverged 0 0 0 1 0 0 0.17
Detections 1 1 1 2 1 1 1.17
Distortions 0 0 1 2 0 0 0.50
Diverged 0 1 0 0 1 1 0.50
Detections 0 2 2 0 2 2 1.33
Distortions 0 1 1 0 1 1 0.67
Diverged 1 0 0 1 0 0 0.33
Gaussian
Cubic
FIRDA
Triphasic
FIRDA
Triphasic
Epoch Number  Per Epoch 
Average
Contrast 
Function
Condition Metric
Table 5.2: Summary of eyeblink removal for clinical data.
to converge. In contrast, the gaussian nonlinearity diverged for only one epoch.
5.3.2 Triphasic Waves
When processing EEGs containing triphasic waves, many potential eyeblinks were
identified when using the gaussian nonlinearity. In six epochs, a total of seven potential
blinks were removed. The neurologist believed that each of these removals introduced a
distortion, since there was no evidence of ocular activity present.
Likewise, for the cubic contrast function, each of the four removals introduced dis-
tortions. As before, no changes were made to the EEG data in the two cases where this
algorithm diverged.
5.4 Performance
The mean and standard deviation of the execution times for all contrast functions, aver-
aged across all (control, simulated, and clinical) data sets is given in Table 5.3.
81
Nonlinearity
gauss 123.34 (62.41) 124.88 (109.14) 95.57 (50.29)
tanh 342.04 (197.77)
pow3 33.91 (35.07) 47.36 (35.84) 67.35 (43.39)
No Movement Vertical MovementHorizontal Movement
-- --
Table 5.3: Execution time for each test case, with mean (standard deviation) in seconds
Despite the optimizations, none of the nonlinearities were capable of meeting the real-
time processing goal of removing eyeblink artifacts from 10-second epochs within 10 sec-
onds of execution time on the embedded hardware/software coprocessing platform.
The cubic nonlinearity was the most simple to calculate, and thus had the best computa-
tional performance. In contrast, the tanh contrast function was the least computationally
efficient. The gaussian nonlinearity, on the other hand, provided the best results with only
moderate performance.
Across all tests, including the control, simulated, and clinical data sets, the histogram
representing the number of iterations executed by FastICA for the gaussian contrast func-
tion is given in Figure 5.9.
Gaussian Nonlinearity Iteration Count Histogram
0
2
4
6
8
10
12
50 100 150 200 250 300 350 400 450 500
Iteration Count
F
re
q
u
e
n
c
y
Figure 5.9: Histogram of gaussian nonlinearity iteration counts.
This seems to indicate that ICA tends to converge in less than 400 iterations, whereas
a maximum iteration count of 500 was used for all data sets considered in this work. By
reducing the the maximum iteration count, the average execution time would be improved
82
since divergence could be detected earlier.
83
Chapter 6
Discussion
6.1 Interpretation of Results
6.1.1 Optimal Contrast Function
The gaussian nonlinearity was more reliable than either the hyperbolic tangent or cu-
bic nonlinearities since it tended to introduce fewer distortions into the processed EEG.
Moreover, it diverged less frequently than the cubic nonlinearity.
The hyperbolic tangent nonlinearity tended to diverge even in the most ideal conditions,
such as when eyeblinks occurred without any associated eye movement or other similar
waveforms. As a result, it was dismissed from further consideration.
Though efficient, the cubic nonlinearity tended not to remove the entire eyeblink, leav-
ing a small portions of the original artifact behind.
This information validates Pesin’s claim that the gaussian nonlinearity is the best choice
for this application.
6.2 Observations on the Pesin Process
This research served to expose several limitations of the eyeblink artifact removal pro-
cess which are due further consideration.
84
6.2.1 Detection Thresholds
The phase I and phase II detection thresholds used in this research were determined
empirically, and seem to yield acceptable results over a limited set of test data [17]. The
phase I detection threshold, which identifies blinks from the EOG difference was initially
set to 0.03 mV as per the original research. However, it was observed through testing that
some eyeblinks for a given test subject were not identified in phase I. Since eyeblinks not
detected in phase I were not candidates for removal, this threshold was reduced to 0.02 mV
to improve blink detection performance.
In order to achieve the goal of developing a truly automated artifact removal algorithm,
this process will need to be more robust since the present work has demonstrated that the
necessary threshold may vary even for a single individual. This limitation was noted in
the original work [17], in which the author believed an adaptive technique may meet the
requirements.
Both adaptive an non-adaptive methods have been proposed and show great promise
[13, 14]. Being that these methods were developed with the intent of performing eyeblink
artifact removal in the wavelet domain, the detection/thresholding operation most likely
satisfies the robustness requirement since it has been successful on raw EEG data. For the
process in [17], the phase II detection is performed on a separated independent compo-
nent, so the robustness should be even greater. These methods provide a more statistically
rigorous means of thresholding the wavelet coefficients.
6.2.2 Accuracy
Identification of an eyeblink requires a positive identification in both the phase I and
phase II detection results. The phase I detection process simply thresholds the difference
of the EOG channels, whereas the phase II process thresholds the ocular IC. A phase II
detection must occur within one second of a phase I detection in order for removal to
occur.
85
However, it is believed that gating the phase II detections by locating blinks within
one second of their phase I locations is too coarse a filter. In the experiments containing
eye movements or clinical data, the number of distortions could have been substantially
reduced by enforcing stricter temporal locality requirements between phase I and phase II
eyeblink candidates.
In fact, it may seem as though the phase I detection results should not be considered
at all since the ocular IC should be the most reliable source of information regarding blink
locations. However, phase I operates on data that is known to contain a substantial ocular
contribution, whereas a misidentified ocular IC may contain none. Therefore, the restriction
imposed on the phase II detection process for a nearby detection from phase I aids in
reducing false detections due to an improperly identified ocular IC.
6.3 Independent Component Analysis Observations
Through testing, several observations about ICA initial conditions and potential opti-
mizations were made.
6.3.1 Initial Conditions
It was observed that variations in the initial mixing matrix sometimes had a significant
impact on the final separation of independent components. This is an additional potential
consequence of attempting to separate 25 ICs from the EEG data when in fact far fewer ICs
meet the ICA’s independence and stationarity requirements.
In an effort to reduce the processing necessary to compute the independent components,
it was thought that a mixing matrix calculated for a given epoch could be used as the
initial guess for IC separation for the subsequent epoch. However, testing revealed that this
approach led to consistent divergence of the FastICA algorithm.
86
6.3.2 Attempted Optimization
The optimization in [6] called for a one-step deflationary approach, followed by the
symmetric approach until convergence. In this approach, the deflationary orthogonaliza-
tion method would be responsible for extracting a single IC. The mixing coefficients for
this IC were used to seed the first row of the initial guess provided to the symmetric orthog-
onalization method, with the remaining rows populated by the identity. The research in [6]
showed that performance improved substantially with this approach.
In the present research, though it was observed that the number of iterations required
for convergence during symmetric orthogonalization tended to reduce with this method,
the one-step deflationary approach sometimes failed to converge. Moreover, this architec-
ture was designed to compute the symmetric approach initially called for by Pesin. The
implication of this is that the matrix multiplier, which is central to determining the per-
formance of the system, was designed to perform matrix-matrix multiplication, rather than
matrix-vector multiplication.
Since the deflationary approach calls for many matrix-vector products in inner-loops,
rather than the matrix-matrix products common to the symmetric orthogonalization method,
the performance suffers substantially. The matrix multiplier developed for this work is in-
capable of efficiently computing a matrix-vector product.
A potential alternative to either the symmetric or combined deflationary-symmetric ap-
proaches is the deflationary approach. During testing, however, it was observed that this
method tended to not reliably converge.
It is believed that the convergence failures observed in the deflationary approach are
due to the attempted extraction of 25 ICs. Since ICs are extracted one at a time, each is
subject to a maximum iteration count. The first ICs recovered tend to fall out with few itera-
tions, whereas the remainder may not be recovered before reaching the maximum iteration
count. Indeed, attempting to extract 25 ICs from the EEG data may violate some of the
ICA’s fundamental assumptions, since the 25 ICs may not be independent or nongaussian.
Some more prominent ICs can be attributed to EKG, ocular artifacts, respiration, or muscle
87
movements, whereas the majority are likely neuronal in nature. It is these more prominent
ICs that tend to be separated first in the deflationary approach.
6.4 Future Work
6.4.1 Algorithm Development
In the original Pesin algorithm, no justification was given for the choice of the number
of wavelet decomposition levels or reconstruction types (approximation or detail). Though
the present work has extended the capabilities of the initial implementation to allow for
selection of these parameters, an effort should be made to determine whether an optimal
parameter set exists, and if so, the reasoning behind it.
It may be possible to extend the system developed in this work to detect and remove
other types of artifacts from EEG, such as those due to muscle movements or ECG. Fur-
thermore, with more substantial modifications, this system may be made to remove artifacts
from other types of biomedical signals such as ECG or EMG. Also, the potential exists to
use the methods discussed in this work to process non-biomedical signals.
6.4.2 Further Optimizations
When performing ICA, it was noted that the symmetric orthogonalization method lends
itself to parallelization. Due to resource limitations on the target FPGA, only the matrix
multiplication aspect of the calculations was parallelized. However, if an array of 25 appro-
priate processing elements could be constructed, the algorithm may benefit from substantial
speedup and very low communication overhead. Each processing element would have to
be capable of performing matrix-vector multiplication as well as evaluating an appropriate
nonlinear function.
If this were possible, each weight vector could be updated in parallel for every iteration.
88
The only necessary inter-processor communication would be the broadcast of each proces-
sor’s weight vector such that each PE obtained the entire mixing matrix prior to beginning
each iteration. As was noted earlier, linear arrays offer an area-efficient interconnect which
limits the complexity of FPGA designs.
Since the gaussian nonlinearity was selected as the most suitable contrast function for
EEG processing, an appropriate hardware implementation of an exponential function is
necessary for future optimizations. A 74-stage pipelined, single-precision exponential unit
is described in [5]. Implemented on a Virtex-2, it consumes 5600 slices and operates at 85
MHz. On a more modern FPGA such as the Virtex-4 FX60 used in the present work, it may
operate at the same 100 MHz rate that the rest of the hardware coprocessing units do. Since
exponentiation is required on a matrix of dimension 25× 5000 per each FastICA iteration,
exponentiation could be performed in many fewer cycles than presently consumed. The
concept for this future architecture is shown in Figure 6.1.
Larger FPGAs exist in the Virtex-4 FX line, so implementation of this hardware opti-
mization is possible, though not on the Virtex-4 FX60 used in the present research. More-
over, Virtex-5 FX FPGAs became available prior to publication of this work. These devices
boast a more advanced APU interface capable of transferring data at a much higher rate,
in addition to a substantial increase in maximum achievable clock rates for custom logic.
Moreover, the embedded processors are a more advanced PowerPC 440, which contain
several significant architectural enhancements over the processors used in this work. By
porting the design to this new platform, it is believed that a substantial improvement in
performance, and perhaps even real-time processing, is achievable.
6.4.3 Limitations of Present Approach
This implementation of the Pesin algorithm provides minimal scalability. The recon-
figurable logic resources in the Virtex-4 FPGA used in this implementation were nearly
consumed by the matrix multiplier. If additional EEG channels were required, it would
become difficult to scale the matrix multiplier accordingly.
89
Register File
Write-only Registers
A
u
x
il
ia
ry
 P
ro
c
e
s
s
o
r 
U
n
it
Register [0]
Register [1]
Register [2]
Register [0]
Register [1]
Read-only Registers
Matrix Multiplier
FIFO 
OPERAND A
M
u
lt
ip
li
e
r 
C
o
re
FIFO 
OPERAND B
FIFO 
RESULT C
FIFO 
CONTROL REGISTER
FCB
Exponential Unit
FIFO 
OPERAND
E
x
p
o
n
e
n
ti
a
l 
U
n
it
 C
o
re
FIFO 
RESULT
Read-only Registers
Write-only Registers
Register [3]
Register [2]
Register [3]
Figure 6.1: Proposed future architecture include exponential unit
90
Chapter 7
Conclusions
A real-time implementation of Pesin’s eyeblink artifact removal algorithm was not
achieved. However, it was shown that additional performance improvements exist, and
given a more advanced processing platform, real-time performance may be achievable. De-
spite this, it was shown that an embedded hardware/software coprocessing system presents
a viable platform for the implementation of this algorithm, which initially required the
resources of a PC equipped with MATLAB.
Selection of the gaussian contrast function as the most appropriate ICA nonlinearity for
this application was validated through evaluation of processed EEG data containing both
simulated and real eyeblink artifacts.
It was shown that the Pesin process is capable of successfully removing eyeblinks from
EEG in the absence of eye movements with 100% accuracy and a 0% false alarm rate.
Processing EEG contaminated by eye movements increased the false detection rate, where
each false detection introduced some distortion into the EEG. For clinical data containing
eyeblink-like waveforms such as those caused by FIRDA or triphasic waves, the algo-
rithm was shown to be capable of correctly dismissing these waveforms from considera-
tion, though a small number of distortions were introduced due to artifacts that were likely
misidentified.
91
Bibliography
[1] A. Avila, R. Santoyo, S. O. Martinez, and G. Dieck. Hardware/software implemen-
tation of a discrete cosine transform algorithm using SystemC. In Reconfigurable
Computing and FPGAs, 2005. ReConFig 2005. International Conference on, page 4,
2005.
[2] Dmitriy L Bekker. Hardware and Software Optimization of Fourier Transform In-
frared Spectrometry on Hybrid-FPGAs. Master’s thesis, 2007.
[3] Katherine Compton and Scott Hauck. Reconfigurable computing: a survey of systems
and software. ACM Comput. Surv., 34(2):171–210, 2002.
[4] Ishaan L. Dalal and Fred L. Fontaine. A Reconfigurable FPGA-based 16-Channel
Front-End for MRI. In Signals, Systems and Computers, 2006. ACSSC ’06. Fortieth
Asilomar Conference on, pages 1860–1864, 2006.
[5] C. C. Doss and Jr. Riley, R. L. FPGA-based implementation of a robust IEEE-754
exponential unit. In Field-Programmable Custom Computing Machines, 2004. FCCM
2004. 12th Annual IEEE Symposium on, pages 229–238, 2004.
[6] Changyuan Fan, Baoqiang Wang, and Hui Ju. A New FastICA Algorithm with Sym-
metric Orthogonalization. In Communications, Circuits and Systems Proceedings,
2006 International Conference on, volume 3, pages 2058–2061, 2006.
[7] Bruce J. Fisch and Rainer Spehlmann. Fisch and Spehlmann’s EEG primer : basic
principles of digital and analog EEG. Elsevier, Amsterdam ; New York, 3rd rev. and
enl. edition, 1999.
[8] Theo Gasser, Lothar Sroka, and Joachim Mocks. The Transfer of EOG Activity into
the EEG for Eyes Open and Closed. Electroencephalography and Clinical Neuro-
physiology, 61:181–193, 1985.
92
[9] Jr. Glover, J. R., N. Raghaven, P. Y. Ktonas, and J. D. Jr. Frost. Context-based au-
tomated detection of epileptogenic sharp transients in the EEG: elimination of false
positives. Biomedical Engineering, IEEE Transactions on, 36(5):519–527, 1989.
[10] A. Hyvarinen and E. Oja. Independent component analysis: algorithms and applica-
tions. Neural Netw., 13(4-5):411–430, 2000.
[11] Aapo Hyvarinen, Juha Karhunen, and Erkki Oja. Independent component analysis. J.
Wiley, New York, 2001.
[12] Aapo Hyvarinen and Erkki Oja. A fast fixed-point algorithm for independent compo-
nent analysis. Neural Comput., 9(7):1483–1492, 1997.
[13] V. Krishnaveni, S. Jayaraman, L. Anitha, and K. Ramadoss. Removal of ocular arti-
facts from EEG using adaptive thresholding of wavelet coefficients. Journal of Neural
Engineering, 3(4):338–346, 2006.
[14] V. Krishnaveni, S. Jayaraman, N. Malmurugan, A. Kandaswamy, and K. Ramadoss.
Non adaptive thresholding methods for correcting ocular artifacts in EEG. Acad.
Open Internet Journal, 13, 2004.
[15] S. G. Mallat. A wavelet tour of signal processing. Academic Press, San Diego, 1998.
[16] Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck. Discrete-time signal
processing. Prentice Hall, Upper Saddle River, N.J., 2nd edition, 1999.
[17] Jimy Pesin. Detection and Removal of Eyeblink Artifacts from EEG using Wavelets
and FastICA. Master’s thesis, 2007.
[18] V. J. Samar, A. Bopardikar, R. Rao, and K. Swartz. Wavelet analysis of neuroelectric
waveforms: a conceptual tutorial. Brain Lang, 66(1):7–60, 1999.
[19] Kuo-Kai Shyu and Ming-Huan Li. FPGA Implementation of FastICA based on
Floating-Point Arithmetic Design for Real-Time Blind Source Separation. In Neural
Networks, 2006. IJCNN ’06. International Joint Conference on, pages 2785–2792,
2006.
[20] Gilbert Strang and Troung Nguyen. Wavelets and filter banks. Wellesley-Cambridge
Press, Wellesley, MA, 1996.
[21] F. Vahid. The softening of hardware. Computer, 36(4):27–34, 2003.
93
[22] M. Van Gils, A. Rosenfalck, S. White, P. Prior, J. Gade, L. Senhadji, C. Thomsen,
I. R. Ghosh, R. M. Longford, and K. Jensen. Signal processing in prolonged EEG
recordings during intensive care. Engineering in Medicine and Biology Magazine,
IEEE, 16(6):56–63, 1997.
[23] L. Vigon, M. R. Saatchi, J. E. W. Mayhew, and R. Fernandes. Quantitative evaluation
of techniques for ocular artefact filtering of EEG waveforms. Science, Measurement
and Technology, IEE Proceedings -, 147(5):219–228, 2000.
[24] E. Waterhouse. New horizons in ambulatory electroencephalography. Engineering in
Medicine and Biology Magazine, IEEE, 22(3):74–80, 2003.
[25] Xilinx, Inc. APU Floating-Point Unit v3.0, 2007.
[26] Xilinx, Inc. PowerPC 405 Processor Block Reference Guide, 2007.
[27] Ling Zhuo and V. K. Prasanna. Scalable and Modular Algorithms for Floating-
Point Matrix Multiplication on Reconfigurable Computing Systems. Parallel and
Distributed Systems, IEEE Transactions on, 18(4):433–448, 2007.
94
Appendix A
main.cpp File Reference
Implementation of Pesin Eyeblink Artifact Removal Algorithm for EEG.
Data Structures
• struct sVector
Wrapper for a itpp::Vec object without dynamic memory management.
• struct sConfig
Reconfigurable parameters.
• union uConfig
Mapping of reconfigurable parameters to 32-bit unsigned intergers.
Enumerations
• enum reconstruction t
Functions
• void pad vector (sVector ∗vec, Int32 pad len)
• void shrink vector (sVector ∗vec, Int32 lower, Int32 upper)
• void conv (sVector ∗a, sVector ∗b, sVector ∗c)
• void downsample (sVector ∗vec, Uint8 flag)
• void upsample (sVector ∗vec, bool flag)
• Int32 calculate dec length (Int32 vec len, Uint8 level)
95
• void wavedec (Uint8 level, sVector ∗vec, sVector ∗approx, sVector ∗detail)
• void wrcoef (enum reconstruction t type, Uint8 level, sVector ∗approx, sVector ∗detail,
sVector ∗recon)
• void read data ()
• void write data (mat &reconstructed)
• Sample T subtract mean (sVector ∗vec)
• Int32 blink detect (Uint32 wavelet level, reconstruction t type, Sample T detect threshold,
Uint32 epoch samples, sVector ∗eye channel, bvec &detect, Int32 ∗location)
• Int32 blink detect phase1 (bvec &p1 detect, Int32 ∗p1 location)
• Int32 blink detect phase2 (mat &ic, Int32 eye index, bvec &p2 detect, Int32 ∗p2 location,
Int32 ∗p1 location, Int32 p1 num blinks)
• Int32 choose ic (mat &ic, sVector ∗fp1, mat &A)
• void blink remove (Int32 ∗p2 location, Int32 p2 num blinks, Int32 ocular chan, mat &ic)
• mat pesin process (mat &ma eeg data, mat &ICs original)
• void get config ()
• int main (void)
Detailed Description
Communicates with a host PC via dual RS-232 interfaces to receive unprocessed EEG data and
algorithm parameters, performs Pesin eyeblink artifact removal algorithm, and returns processed
results via serial port.
Define Documentation
#define CHAN FP1 8
Index corresponding to EEG channel FP1.
#define CHAN LEOG 21
Index corresponding to left EOG channel.
#define CHAN REOG 22
Index corresponding to right EOG channel.
#define CONTROL PORT XPAR RS232 UART 1 BASEADDR
Handle for serial port used for parameters and status.
96
#define DATA PORT XPAR RS232 UART 2 BASEADDR
Handle for the serial port used for input and output data.
#define MAX NUM BLINKS 16
Maximum number of blinks that can be detected in NUM SAMPLES.
#define MAX VECTOR LEN (NUM SAMPLES + 1000)
Maximum length of a vector, including intermediate processing.
#define NUM CHANNELS 25
Number of channels of EEG data considered.
#define NUM SAMPLES 5000
Number of EEG samples to process.
Typedef Documentation
typedef real Sample T
Type for matrix computations.
Enumeration Type Documentation
enum filter t
Filter types: lowpass/highpass decomposition and reconstruction.
Enumerator:
LO D
HI D
LO R
HI R
NUM FILTERS
enum reconstruction t
Supported wavelet transforms.
Types of reconstruction: approximation or detail coefficients.
97
Enumerator:
APPROX
DETAIL
Function Documentation
Int32 blink detect (Uint32 wavelet level, reconstruction t type, Sample T detect -
threshold, Uint32 epoch samples, sVector ∗ eye channel, bvec & detect, Int32 ∗
location)
The core of the blink detection algorithm.
Given a vector of time-domain ocular-like signals, perform a wavelet decomposition/reconstruc-
tion at the specified level, with the specified type (approximation or detail) and complete a binary
detection array indicating where the given threshold was exceeded, as well as an array of indices a
given number of samples apart that indicate potential ”centers” of eyeblinks.
Parameters:
← wavelet level The level of the wavelet decomposition/ reconstruction.
← type Approximation or detail reconstruction.
← detect threshold The threshold for blink detection.
← epoch samples Minimum separation to detect consecutive blinks.
← eye channel Time-domain ocular activity.
→ detect Binary results of thresholding operation.
→ location Array of blink ”center” indices.
Returns:
Number of blinks detected.
Int32 blink detect phase1 (bvec & p1 detect, Int32 ∗ p1 location)
Phase 1 blink detection.
Finds the absolute difference between LEOG and REOG channels, then performs a 6 level
wavelet decomposition/detail reconstruction, after which a thresholding operation is applied to de-
termine the blink centerpoints. The threshold is set at 0.02 by default, and distinct blinks must at
least 600 msec apart.
Parameters:
→ p1 detect Binary result of thresholding operation.
98
→ p1 location Array of detected blink centers.
Returns:
Number of blinks detected.
Int32 blink detect phase2 (mat & ic, Int32 eye index, bvec & p2 detect, Int32 ∗
p2 location, Int32 ∗ p1 location, Int32 p1 num blinks)
Phase 2 blink detection.
Given the independent component identified as ocular activity, perform a 1-level wavelet de-
composition/approximation reconstruction, followed by thresholding at 3.0 by default. Distinct
blinks must be at least 400 msec apart. Then, determine if any phase 2 detections occur within +/-
1 sec of phase 1 detections. If so, these blinks are counted.
Parameters:
← ic Matrix of independent components.
← eye index Index of ocular component in ICs.
→ p2 detect Binary array of detection centerpoints.
→ p2 location Sample indices of eyeblink centers from phase 2.
← p1 location Sample indices of eyeblink centers from phase 1.
← p1 num blinks Number of blinks detected in phase 1.
Returns:
Number of blinks detected in phase 2.
void blink remove (Int32 ∗ p2 location, Int32 p2 num blinks, Int32 ocular chan, mat
& ic)
Remove 400 ms swath from the ocular independent component.
Given the location of the center of the eyeblink, remove 200 ms from either side of it. If the
boundary of the data set occurs before 200 ms from the center, then just remove data up to the
boarder.
Parameters:
← p2 location The index of the center of each blink from phase 2.
← p2 num blinks The number of blinks detected in phase 2.
← ocular chan The channel index of the ocular IC.
← ic The matrix of separated ICs.
99
Int32 calculate dec length (Int32 vec len, Uint8 level)
Calculate the expected length of a wavelet decomposition output.
Given the length of an input vector, a wavelet transform, and the number of levels of decompo-
sition to perform, calculate the length of the expected output coefficient vectors.
Parameters:
← vec len The length of the input vector.
← level The number of levels of decomposition performed.
Returns:
The expected length of the output coefficient vectors.
Int32 choose ic (mat & ic, sVector ∗ fp1, mat & A)
Determine which independent component contains ocular activity.
Each independent component is decomposed/detail reconstructed at 7 levels and the correlation
is computed with the decomposition/reconstruction of channel FP1 (which is near to the eyes). Prior
to correlation, each component is scaled by its contribution to FP1 from the mixing matrix A. The
most correlated channel is chosen as the ocular component.
Parameters:
← ic Matrix containing ICs.
← fp1 Vector containing original EEG data from FP1.
← A The mixing matrix.
Returns:
The row index of the ocular component in ICs.
void conv (sVector ∗ a, sVector ∗ b, sVector ∗ c)
Perform the convolution c = a ∗ b.
Convolves vectors a and b and stores the result in c.
Parameters:
← a The first operand.
← b The second operand.
→ c The result.
100
void downsample (sVector ∗ vec, Uint8 flag)
Perform downsampling (by 2) on a vector. Downsample the vector by removing every second
element. If flag is ’1’, then the first element is discarded. If the flag is ’0’, then the first element is
kept.
Parameters:
↔ vec The vector to downsample.
← flag Determines whether the first element is discarded or kept.
void get config ()
Read the reconfigurable parameters from the control port.
Parse the control port input to determine the number and value of any input parameters at the
beginning of each epoch of EEG data. The first word received signifies the number of parameters
to expect. For every parameter, the parameter tag is sent, followed by the number of words for that
parameter, with the data following. The config struct is updated directly.
int main (void)
Perform initialization, then perform eyeblink removal processing.
The instruction and data caches are initialized, followed by the serial ports. The reconfigurable
parameters are then initialized to their default values. An infinite loop is then initiated which cap-
tures configuration data followed by raw EEG data.
void pad vector (sVector ∗ vec, Int32 pad len)
Perform symmetric padding on a vector prior to convolution.
pad len elements will be mirrored (with repetition of the first and last elements, ”half-point”
symmetric extension) at both the beginning and the end of the vector.
Parameters:
↔ vec The vector to pad.
← pad len The number of elements to add/remove.
mat pesin process (mat & ma eeg data, mat & ICs original)
The core eyeblink detection/removal algorithm/.
Raw EEG data, taken as an input, is centered and examined by the phase I detection routine
for potential eyeblinks. If none are found, the original input data is returned. If blinks are found,
independent component analysis is performed and the ocular IC identified. The phase II detection
process is performed on the ocular IC, and any eyeblinks found are removed. The modified ICs are
remixed to form the artifact-free EEG, which is returned.
101
Parameters:
← ma eeg data Matrix containing the input data
↔ ICs original Storage for the independent components.
Returns:
The artifact-free data.
void read data ()
Read raw EEG data and store in an array of vectors.
Read raw binary floating point data (in the correct endianness) from a serial port. Store the data
in the global eeg data variable in column-major order.
void shrink vector (sVector ∗ vec, Int32 lower, Int32 upper)
Remove elements from the beginning/end of a vector.
Remove lower elements from the beginning of the vector and upper elements from the end of
the vector.
Parameters:
↔ vec The vector to process.
← lower The number of elements to remove from the beginning.
← upper The number of elements to remove from the end.
Sample T subtract mean (sVector ∗ vec)
Subtract the mean from a vector.
Given a vector, calculate its mean and subtract it from each element.
Parameters:
↔ vec The vector on which to operate.
void upsample (sVector ∗ vec, bool flag)
Perform upsampling (by 2) on a vector.
Upsample the vector by inserting zeros between every element. If flag is ’0’, then the first and
last elements of the output are the same as the first and last elements of the input. If the flag is ’1’,
then the first and last elements of the output are zeros.
Parameters:
↔ vec The vector to upsample.
← flag Determines whether the first and last elements are preserved or made zeros.
102
void wavedec (Uint8 level, sVector ∗ vec, sVector ∗ approx, sVector ∗ detail)
Compute the 1D wavelet decomposition of a vector.
Given a vector, a wavelet transform, and the number of levels of decomposition to perform,
compute the 1D wavelet decomposition approximation and detail coefficients.
Parameters:
← level The number of levels of decomposition.
← vec The vector to transform.
→ approx The approximation coefficients.
→ detail The detail coefficients.
void wrcoef (enum reconstruction t type, Uint8 level, sVector ∗ approx, sVector ∗
detail, sVector ∗ recon)
Compute the 1D wavelet reconstruction from coefficients.
Given a the approximation and detail coefficients from a wavelet decomposition, perform the
reconstruction with the given transform and level.
Parameters:
← type Choose detail or approximation reconstruction.
← approx The approximation coefficients.
← detail The detail coefficients.
→ recon The reconstructed coefficients.
void write data (mat & reconstructed)
Write processed EEG data to serial port.
Wait for receipt of any character on the control port before writing the processed EEG data to
the data port. The data is sent column-major order.
Parameters:
← reconstructed The matrix containing the processed data.
Variable Documentation
Sample T coefficients[NUM FILTERS][255]
Filter coefficients for each filter for each transform.
103
uConfig config
Configuration parameters.
sVector eeg data[NUM CHANNELS]
Global storage for EEG data.
Int32 num coeffs = 18
Number of coefficients in the filters for each transform.
104
Appendix B
matrix mult.c File Reference
Hardware abstraction library for matrix multiplier.
Functions
• void accum matrix mult (const mat &A, const mat &B, mat &C)
• void matrix mult (const mat &A, int a tran, const mat &B, int b tran, mat &C)
• void matrix mult (const mat &A, const mat &B, mat &C)
• void block matrix multA (const mat &A, int a tran, const mat &B, int b tran, mat &C)
• void block matrix multA (const mat &A, const mat &B, mat &C)
• void block matrix multB (const mat &A, int a tran, const mat &B, int b tran, mat &C)
• void block matrix multB (const mat &A, const mat &B, mat &C)
Detailed Description
Contains functions to perform matrix multiplication operations on embedded matrix multiplier.
Supported operations include square matrix multiplication, block matrix multiplication, and accu-
mulated matrix multiplication. Transpose operations are performed by setting a flag for selected
operations.
Define Documentation
#define control reg(ptr) lwfcmx(2, (Uint32)ptr & 0xFFFFFFE0, (Uint32)ptr & 0x0000001F)
Write a word to the multiplier control register.
105
#define ldfcmx(rn, base, adr) asm volatile ( ”ldfcmx ” #rn ”,%0,%1\n” : : ”b”
(base), ”r” (adr) )
Load double word assembly mnemonic.
#define lqfcmx(rn, base, adr) asm volatile ( ”lqfcmx ” #rn ”,%0,%1\n” : : ”b”
(base), ”r” (adr) )
Load quad word assembly mnemonic.
#define lwfcmx(rn, base, adr) asm volatile ( ”lwfcmx ” #rn ”,%0,%1\n” : :
”b” (base), ”r” (adr) )
Load word assembly mnemonic.
#define stdfcmx(rn, base, adr) asm volatile ( ”stdfcmx ” #rn ”,%0,%1\n” : :
”b” (base), ”r” (adr) )
Store double word assembly mnemonic.
#define stqfcmx(rn, base, adr) asm volatile ( ”stqfcmx ” #rn ”,%0,%1\n” : :
”b” (base), ”r” (adr) )
Store quad word assembly mnemonic.
#define stwfcmx(rn, base, adr) asm volatile ( ”stwfcmx ” #rn ”,%0,%1\n” : :
”b” (base), ”r” (adr) )
Store word assembly mnemonic.
Function Documentation
void accum matrix mult (const mat & A, const mat & B, mat & C)
Perform accumulated matrix multiplication.
Find the inner-product of two rectangular matrices of (N x M) ∗ (M x N) where N <M and and
N divides M. The caller is responsible for allocating memory for the result.
Parameters:
← A An N x M matrix
← B An M x N matrix
→ C The N x N result, allocated by the caller.
106
void block matrix multA (const mat & A, const mat & B, mat & C)
Perform block matrix multiplication.
Find the product of an N x N matrix and an N x M matrix. The caller is responsible for for
allocating memory for the result.
Parameters:
← A An N x N matrix
← B An N x M matrix
→ C The N x M result, allocated by the caller.
void block matrix multA (const mat & A, int a tran, const mat & B, int b tran, mat
& C)
Perform block matrix multiplication.
Find the product of an N x N matrix and an N x M matrix. The caller is responsible for for
allocating memory for the result. Flags are provided to perform the multiplication with the transpose
of either or both operands.
Parameters:
← A An N x N matrix
← a tran Flag to indicate whether to transpose A
← B An N x M matrix
← b tran Flag to indicate whether to transpose B
→ C The N x M result, allocated by the caller.
void block matrix multB (const mat & A, const mat & B, mat & C)
Perform block matrix multiplication.
Find the product of an M x N matrix and an N x N matrix. The caller is responsible for for
allocating memory for the result.
Parameters:
← A An M x N matrix
← B An N x N matrix
→ C The M x N result, allocated by the caller.
107
void block matrix multB (const mat & A, int a tran, const mat & B, int b tran, mat
& C)
Perform block matrix multiplication.
Find the product of an M x N matrix and an N x N matrix. The caller is responsible for for
allocating memory for the result. Flags are provided to perform the multiplication with the transpose
of either or both operands.
Parameters:
← A An M x N matrix
← a tran Flag to indicate whether to transpose A
← B An N x N matrix
← b tran Flag to indicate whether to transpose B
→ C The M x N result, allocated by the caller.
void matrix mult (const mat & A, const mat & B, mat & C)
Perform square matrix multiplication.
Find the product of two square N x N matrices. The caller is responsible for allocating memory
for the result.
Parameters:
← A An N x N matrix
← B An N x N matrix
→ C The N x N result, allocated by the caller.
void matrix mult (const mat & A, int a tran, const mat & B, int b tran, mat & C)
Perform square matrix multiplication.
Find the product of two square N x N matrices. The caller is responsible for allocating memory
for the result. Flags are provided to perform the multiplication with the transpose of either or both
operands.
Parameters:
← A An N x N matrix
← a tran Flag to indicate whether to transpose A
← B An N x N matrix
← b tran Flag to indicate whether to transpose B
→ C The N x N result, allocated by the caller.
108
