Modular Acquisition and Stimulation System for
Timestamp-driven Neuroscience Experiments
Paulo Matias, Rafael T. Guariento,
Lirio O. B. de Almeida, and Jan F. W. Slaets
São Carlos Institute of Physics
University of São Paulo
São Carlos, SP, Brazil
{matias,guariento,lirio,jan}@ifsc.usp.br
Abstract. Dedicated systems are fundamental for neuroscience experi-
mental protocols that require timing determinism and synchronous stim-
uli generation. We developed a data acquisition and stimuli generator
system for neuroscience research, optimized for recording timestamps
from up to 6 spiking neurons and entirely specified in a high-level Hard-
ware Description Language (HDL). Despite the logic complexity penalty
of synthesizing from such a language, it was possible to implement our
design in a low-cost small reconfigurable device. Under a modular frame-
work, we explored two different memory arbitration schemes for our
system, evaluating both their logic element usage and resilience to in-
put activity bursts. One of them was designed with a decoupled and
latency insensitive approach, allowing for easier code reuse, while the
other adopted a centralized scheme, constructed specifically for our ap-
plication. The usage of a high-level HDL allowed straightforward and
stepwise code modifications to transform one architecture into the other.
The achieved modularity is very useful for rapidly prototyping novel
electronic instrumentation systems tailored to scientific research.
Keywords: Spiking Neurons, Data Acquisition, Precise Timing, Re-
source Arbitration, Latency Insensitive, Modular Design.
1 Introduction
Neurons usually behave by emitting stereotyped pulses of electric depolariza-
tion through their membranes, creating temporally localized spikes. It is a com-
mon belief that spiking neurons follow an all-or-none principle, similar to the
processing of digital signals, by encoding information only through spike tim-
ing [1]. Although each individual cell always produces the same waveform, the
most widespread experimental approach employs Analog to Digital Convert-
ers (ADCs) integrated on commercial acquisition systems to capture complete
waveforms. This procedure is required when the researcher desires to analyze
a large population of neurons recording only from a few electrodes, applying
then a neuron classification technique known as spike sorting to discriminate
Preprint submitted to ARC 2015.
The final publication is available at link.springer.com.
arXiv:1504.01718v1 [q-bio.QM] 7 Apr 2015
individual waveforms [2]. However, because of the lack of readily available spe-
cialized acquisition hardware, many works adopt the same recording technique
even though they only need to identify the occurrence of spikes from one neuron
per electrode [3,4,5,6]. The resulting data files are large and spikes need to be
detected by software, demanding a considerable amount of time.
In this paper we present the design of a low-cost alternative hardware solu-
tion based on a dedicated Complex Programmable Logic Device (CPLD). We
have chosen CPLDs instead of Field Programmable Gate Arrays (FPGAs) to
demonstrate the flexibility of our approach, as CPLDs are usually limited to a
small number of logic gates, and lack common FPGA features such as Block
RAMs and Phase Locked Loops (PLLs). We implemented the logic circuits on
the CPLD adopting a modular design, which aims to facilitate future refinement
and customization for specific applications. The complete source code imple-
mented in the Bluespec SystemVerilog (BSV) [7] language is available at [8].
BSV designs targeted at small reconfigurable devices, such as ours, are rare
in literature, since many works show that BSV usually produces a higher logic el-
ement (LE) count than Register-Transfer Level (RTL) languages [9,10,11]. How-
ever, some research [12] argues that microarchitectural choices have greater im-
pact on the LE usage than the specification’s abstraction level, although there
is a lack of studies in glue logic sized architectures with significant modularity
and complexity. This paper showcases such a system, and also explores the im-
pact of latency insensitive module decoupling [13], by comparing two distinct
implementations of a resource arbitration scheme. Similar work evaluating syn-
thesis results exists [14], but we also test the consequences on system resilience
to extreme conditions, many times above our application requirements.
The acquisition input is provided to our digital logic by an analog front-end
system which generates an asynchronous TTL-compatible signal pulse at the oc-
currence of each valid spike. Our entire circuit was designed to be compatible and
easily inserted into a previous experimental setup [15] devised for studying neu-
ral codification in Chrysomya megacephala’s visual system, but it is sufficiently
generic to be suitable for a wide range of neuroscience experiments.
Main contributions of this work:
– Develops a portable, low-cost and precise data acquisition system for neuro-
science and neuroethology experiments.
– Applies the seldom used concept of recording digital events (instead of ADC-
converted data) to increase the precision of neural spike timing.
– Employs the BSV language in a small and resource constrained system.
– Showcases architecture refactoring from a decoupled to a centralized scheme.
Paper organization: The next section describes the basic specifications of our
design and its overall architecture. Section 3 discusses the system implementa-
tion, focusing on points common both to dynamic and static arbiter versions.
Sections 3.1 and 3.2 delve into specific aspects of each one of the implementa-
tions. Section 4 presents synthesis, experimental and simulation results. Finally,
we conclude in Section 5.
2 Overall system architecture
Our system offers 6 TTL-level pulse timestamp acquisition inputs, 4 analog
16-bit resolution outputs for stimuli generation and a Joint Test Action Group
(JTAG) host computer interface. It is composed of a MAX II Micro Kit
(EPM2210F324C3 CPLD), a 74HC4050 buffer for input overvoltage protection, a
MAX5134 Digital-to-Analog Converter (DAC) and an IDT71256 20 ns 32K×8-bit
SRAM. We have divided the project into the following functional subunits:
Synchronizer: Receives asynchronous input pulses and registers 32-bit times-
tamps from a hardware counter, each one paired to a flag indicating which input
channels fired since last counted. In most neural systems, 1 µs is believed to be
enough resolution for studying fine details of information coding [16].
FIFO SRAM: Provides an interface for using the external SRAM memory
as a pair of First-In First-Out (FIFO) queues of 16 KiB each. One of them
buffers data acquired from inputs, and the other buffers stimuli received from a
computer. Our FIFO modules are compatible with the BSV standard library.
JTAG interface: Provides communication with a host computer. We have
wrapped Altera JTAG-UART libraries into a ready-to-use BSV module. By us-
ing this protocol, the same communication module is portable to any CPLD or
FPGA manufactured by the same vendor. As programmable devices are config-
ured via JTAG, the bus is readily available through USB adapters embedded in
almost every evaluation board. However, this approach introduces a significant
protocol overhead by encapsulating UART emulation inside JTAG-USB, limiting
the data rate to about 1 Mbit/s. Also, client software needs to explicitly poll
the device, because the interface is neither interrupt nor event driven. As a
result, software timing determinism can become a bottleneck, depending on
hardware buffer size and desired data rate. Nevertheless, these limitations do
not impair this particular application.
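The dependence on buffer size and data rate can be made concrete with a rough budget. The sketch below is only an estimate under stated assumptions: it uses the 16 KiB acquisition queue and the 5-byte (channelFlags, timestamp) record described in this paper, ignores all JTAG-UART framing overhead, and takes a hypothetical load of 6 channels firing at 500 spikes/s each with no coincident events. The function names are illustrative and not part of the published source code.

```python
def max_poll_gap_s(buffer_bytes, events_per_s, bytes_per_event=5):
    """Longest time the host may stop polling before the queue fills up."""
    fill_rate = events_per_s * bytes_per_event  # bytes per second
    return buffer_bytes / fill_rate


def link_utilization(events_per_s, bytes_per_event=5, link_bit_per_s=1_000_000):
    """Fraction of the ~1 Mbit/s link consumed by acquisition records."""
    return events_per_s * bytes_per_event * 8 / link_bit_per_s


# Hypothetical worst case: 6 channels at 500 spikes/s, one record per spike.
rate = 6 * 500
print(f"tolerated polling gap: {max_poll_gap_s(16 * 1024, rate):.2f} s")
print(f"link utilization: {link_utilization(rate):.0%}")
```

Under these assumptions the host has roughly one second of slack between polls, which is consistent with the statement that the limitations do not impair this application.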
3 BSV module architecture
Bluespec SystemVerilog is a strongly typed high-level hardware description lan-
guage (HDL) with functional paradigm features. A BSV design is organized in
modules and rules. Modules provide interfaces, composed of a set of methods
which can be used to access or modify their internal state. State changes
(side-effects) are clearly separated from read-only operations by means of a
monad [17] called Action, thus any expression which modifies state has an
action type. Modules can be statically elaborated several times, allowing
complex circuit structures to be represented. Rules are entities which describe the connections between
modules and ultimately define hardware dynamics. They are formed by a set of
actions and a boolean predicate, which defines an explicit condition needed to
allow execution of the actions (rule firing). During a single clock cycle, a rule is
guaranteed either to complete its execution entirely or not to fire at all, a property known
as transaction atomicity. Rule firing can also be affected by implicit conditions,
which can be attributed to any BSV expression. The BSV compiler propagates
an implicit condition back to the predicate of the rule which actually executes
the action or queries the value of the corresponding expression. Implicit condi-
tions are usually attributed to method boundaries, serving as an effective way to
specify module contracts. When synthesizing, the compiler defines an execution
order for rules, allowing a hardware scheduler to be generated according to the
Term Rewriting Systems (TRS) formalism [18].
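The firing semantics described above can be illustrated with a small software analogy. This is a toy model in Python, not BSV and not how the compiler works internally: a rule is a function that computes the whole next state from a snapshot, and it returns nothing when either its explicit predicate or an implicit condition (here, a bounded FIFO being full) is false. The names State, push_count and step are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    counter: int = 0
    fifo: Tuple[int, ...] = ()
    capacity: int = 2


def push_count(s: State) -> Optional[State]:
    """Rule body: return the complete next state, or None if a guard is false."""
    explicit_guard = s.counter % 2 == 0        # predicate written on the rule
    implicit_guard = len(s.fifo) < s.capacity  # condition carried by the enqueue
    if not (explicit_guard and implicit_guard):
        return None                            # the rule does not fire at all
    # Both updates read the same snapshot and are committed together.
    return replace(s, counter=s.counter + 1, fifo=s.fifo + (s.counter,))


def step(s: State, rule) -> State:
    """One clock cycle: either every action of the rule happens, or none does."""
    nxt = rule(s)
    return s if nxt is None else nxt
```

Because the enqueue's condition is folded into the guard, a full FIFO blocks the counter update as well, which is the all-or-nothing behaviour the text calls transaction atomicity.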
[Fig. 1 diagram: Synchronizer, SER, FIFO SRAM, DES, JTAG and DAC blocks with their modules, rules and interfaces.]
Fig. 1. Block diagram of the BSV modules and rules. The gray shaded areas represent
internal logic, while white structures depict I/O interfaces. Modules are portrayed
as quadrilaterals. The names inside them are module names, and those outside are
instance names. Small rectangles on their side are interfaces. Ellipses designate rules.
Those which only perform connections are omitted and represented directly by arrows.
Figure 1 illustrates our main module. The fundamental difference between the
arbitration approaches lies in the SRAMFIFO internals and in how communication
with the SRAM occurs. In the dynamic approach this is done through a
client-server interface mediated by FIFOs, whereas in the static one an extra
central module assigns a specific operation to the SRAMFIFOs on each cycle.
Acquisition data flow: Asynchronous pulses arriving from acquisition in-
puts are synchronized to the system clock by the AsyncPulseSync modules,
resulting in the syncedIn signal. The blendChannelFlags rule accumulates one
bit for each input channel in the channelFlags register, if a pulse occurred on
syncedIn since the last collected timestamp. The timestampUpdate rule atomi-
cally increments the timestamp register, sends the channelFlags value and the
current timestamp to the funnel, and resets channelFlags to zero. The funnel
emits one byte of its input per cycle to the uartOutFifo. Finally, data coming
from uartOutFifo can be read in the host computer after being collected by the
JTAG-UART transmitting (tx) interface.
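The acquisition path can be mimicked in software to see what the host receives. The following is a behavioural sketch only, with several assumptions that are not confirmed by the text: channel i maps to bit i of the flags byte, a record is emitted only for periods in which at least one channel fired, the recorded timestamp is the value held before the increment, and the record is laid out as one flags byte followed by a big-endian 32-bit timestamp. The functions acquire and decode are hypothetical helpers.

```python
def acquire(pulses_per_period, start=0):
    """Model blendChannelFlags and timestampUpdate over successive periods.

    pulses_per_period: iterable of sets of channel indices (0..5), one set
    per timestamp period. Returns the 5-byte records sent towards the host.
    """
    timestamp = start & 0xFFFFFFFF
    records = []
    for channels in pulses_per_period:
        flags = 0
        for ch in channels:          # accumulate one bit per input channel
            flags |= 1 << ch
        if flags:                    # ASSUMPTION: silent periods are not sent
            records.append(bytes([flags]) + timestamp.to_bytes(4, "big"))
        timestamp = (timestamp + 1) & 0xFFFFFFFF  # 32-bit counter wraps
    return records


def decode(record):
    """Host side: split a 5-byte record into (flags, timestamp)."""
    return record[0], int.from_bytes(record[1:5], "big")
```

With a 1 µs period the 32-bit counter wraps after about 71.6 minutes, so host software processing long recordings would need to account for wrap-around.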
Stimulus generation data flow: Begins at the JTAG-UART receiving
(rx) interface. The uartHandleCmd rule identifies if the byte received from the
computer represents a start command or a DAC conversion request. A start
command sets a boolean register (omitted from the figure) which unblocks the
predicate of dacLoad, timestampUpdate and blendChannelFlags rules. A DAC
conversion request sends the current byte and the next two bytes to uartInFifo.
After coming out of the FIFO, the bytes feed the unfunnel block, merging three
bytes together. The dacHandleReq controls the request flow to the DAC module.
The dacLoad rule fires when the first input channel receives a pulse, unblocking
the dacHandleReq rule and causing a synchronous update on all DAC outputs.
This input channel is used to synchronize analog outputs to the desired stimuli
clock, e.g. the display controller in a visual stimulation system.
BSV has an implicit condition mechanism which eases the specification of a
provably correct system. We only needed to add error handling to four places of
our design. The first one is related to tx path FIFO overflow and is put in the
timestampUpdate rule, ensuring that the timestamp is always incremented at
each update period. The second check is accomplished in the dacLoad rule, and
verifies if a FIFO underrun has occurred in the rx path, by certifying that the
DAC is ready to receive a new command and that all DAC registers were filled
since the last load. The final two are not directly related to system functional
correctness, but to good debugging practices. The third one checks if bytes re-
ceived from JTAG-UART correspond to valid commands. The last one verifies if
DAC requests are still valid after leaving uartInFifo, aiming to detect any
occurrences of data corruption during communication with the SRAM chip. When
any of these error conditions occurs, we alert the user by blinking LEDs until
the system is reset.
Next, we describe characteristics of the common system sub-modules.
SER/DES: Serializer and deserializer modules are implemented using shift
registers. Our design is generic and type parametrized, making it reusable with
any input or output data types. A code excerpt illustrating these concepts is
shown in Figure 2.
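As a software analogy of the shift-register behaviour, the sketch below splits a wide word into narrower pieces, most significant piece first (matching the truncateLSB usage in the serializer excerpt), and merges them back. The deserializer ordering is assumed to mirror the serializer; the Python names funnel and unfunnel simply echo the module names and are not the BSV implementation.

```python
def funnel(value, width_bits, out_bits=8):
    """Serializer: emit out_bits-wide pieces, most significant piece first."""
    if width_bits % out_bits:
        raise ValueError("width_bits must be a multiple of out_bits")
    n = width_bits // out_bits
    mask = (1 << out_bits) - 1
    return [(value >> (out_bits * (n - 1 - i))) & mask for i in range(n)]


def unfunnel(pieces, in_bits=8):
    """Deserializer: shift pieces back into one word, first piece on top."""
    mask = (1 << in_bits) - 1
    word = 0
    for p in pieces:
        word = (word << in_bits) | (p & mask)
    return word
```

For example, a 40-bit (channelFlags, timestamp) value leaves the serializer as five bytes, and three received bytes are merged into one 24-bit word on the stimulus path.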
DAC: In order to rapidly prototype the control of a Serial Peripheral In-
terface (SPI) and DAC linearity calibration procedures, we employed a stan-
dard BSV library called StmtFSM, which consists of a Domain Specific Language
(DSL) for specifying Finite State Machines (FSMs). The FSMs could be eas-
ily composed and exposed in the form of a simple external module interface.
DAC register update requests supported by the MAX5134 DAC contain three
bytes, one specifying the target channels and two bytes of data. Part of the FSM
implementation is illustrated in Figure 3.
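The 24-bit word passed to send in the put method of Figure 3 is the concatenation {4'b0001, mask, sample}. A minimal sketch of the same packing in Python follows; the 4-bit mask width is inferred from the 24-bit total and the 16-bit sample resolution rather than stated in the text, and dac_command is a hypothetical helper name.

```python
def dac_command(mask, sample):
    """Pack {4'b0001, mask, sample} into a 24-bit integer."""
    if not 0 <= mask < (1 << 4):        # ASSUMPTION: 4-bit channel mask
        raise ValueError("mask must fit in 4 bits")
    if not 0 <= sample < (1 << 16):     # 16-bit DAC sample
        raise ValueError("sample must fit in 16 bits")
    return (0b0001 << 20) | (mask << 16) | sample


def command_bytes(cmd):
    """The three bytes of a command, most significant byte first."""
    return cmd.to_bytes(3, "big")
```

For instance, dac_command(0b0101, 0x8000) yields 0x158000, that is, a mid-scale sample addressed to the channels selected by mask 0101.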
// Defines an interface for generic types a and b
interface Funnel#(type a, type b);
    // ...
endinterface

function Funnel#(a,b)
    toFunnel(FIFOF#(a) infifo, FIFOF#(b) outfifo);
    return (interface Funnel;
        method Bool notFull = infifo.notFull;
        // ...
    endinterface);
endfunction

module mkFunnel(Funnel#(a,b))
    // ...
    rule firstCycle(stage == 0);
        // Shifts the input value from infifo
        shiftReg <= truncateLSB(inval << nB);
        outfifo.enq(unpack(truncateLSB(inval)));
        updateStage(); infifo.deq();
    endrule
    rule(stage != 0);
        outfifo.enq(unpack(truncateLSB(shiftReg)));
        shiftReg <= shiftReg << nB;
        updateStage();
    endrule
    // Returns an in/out interface from the infifo/outfifo
    return toFunnel(infifo, outfifo);
endmodule
Fig. 2. Code excerpt from the serializer shift register implementation, illustrating
parametrized data types and abstract manipulation of interfaces by compile-time
resolved functions (toFunnel).
Stmt shiftRegSenderStmt = seq
    while(!shiftRegDone) seq
        // Send each bit of shiftReg to DAC ...
    endseq;
    rnCS[0] <= 1; delay(2);
endseq;
// Instantiates the state machine from specified instructions
FSM shiftRegSender <- mkFSM(shiftRegSenderStmt);
function Action send(Bit#(24) cmd);
    action
        shiftReg <= {cmd, 1'b1}; rnCS[1] <= 0;
        shiftRegSender.start;
    endaction
endfunction
// Calibrate DAC
Stmt dacCalibrationStmt = seq
    delay(waitVoltageStab); send(commandBits_1);
    shiftRegSender.waitTillDone;
    delay(dacCalibCycles); send(commandBits_2);
endseq;
// ....
interface Put req;
    method Action put(DACReq r) if (rdy);
        match{.mask, .sample} = r;
        // Send command for writing sample
        send({4'b0001, mask, sample});
    endmethod
endinterface
Fig. 3. Code illustrating usage of the Finite State Machine (FSM) Domain Specific
Language (DSL). FSMs can be specified as a series of statements (Stmt) in a DSL
which resembles an imperative software language. The mkFSM module transforms these
statements into a hardware implementation. We have implemented a send function
which is used both to compose different FSMs together and to start an FSM from an
externally accessible method.
SRAMFIFO: This module is also type parametrized. It exposes a fifo
subinterface mimicking the BSV standard library's mkLFIFO, and a cli
subinterface which can be connected to an SRAM or SRAMSplit server interface.
Usage of Ephemeral History Registers (EHRs) [19] greatly simplified the design
in order to achieve the same scheduling specifications as mkLFIFO. EHRs provide a
register-like interface on which same-cycle accesses can be ordered according to
the logical execution order of rules or methods. Head and tail position pointers
to locations inside the SRAM are held in EHRs. The SRAMFIFO stores one unit
of data (in the case of this design, one byte) in a local cache FIFO implemented
using flip-flop registers, whose output is connected directly to the output of
the SRAMFIFO. When the cache FIFO becomes empty, we dispatch a read request to
the SRAM, aiming to keep the cache filled most of the time. When new data is
enqueued to the SRAMFIFO and no space is available in the cache FIFO, a write
request is sent to the SRAM.
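The interplay between the one-element cache and the SRAM-backed ring can be sketched with a small behavioural model. It is a simplification under explicit assumptions: memory accesses complete instantly (the real design has request and response latency), an element goes straight to the cache only when the whole queue is empty, and the class name SramFifoModel and its counters are hypothetical.

```python
class SramFifoModel:
    """Ring buffer in 'SRAM' fronted by a one-element cache at the output."""

    def __init__(self, depth):
        self.mem = [None] * depth   # stands in for the external SRAM region
        self.head = 0               # next location to read
        self.tail = 0               # next location to write
        self.count = 0              # elements currently held in memory
        self.cache = None           # one unit of data next to the output
        self.reads = 0              # SRAM read requests issued
        self.writes = 0             # SRAM write requests issued

    def enq(self, x):
        if self.cache is None and self.count == 0:
            self.cache = x          # room in the cache: no SRAM traffic
            return
        if self.count == len(self.mem):
            raise OverflowError("SRAM ring is full")
        self.mem[self.tail] = x     # no room in the cache: write request
        self.tail = (self.tail + 1) % len(self.mem)
        self.count += 1
        self.writes += 1

    def deq(self):
        if self.cache is None:
            raise IndexError("FIFO is empty")
        x, self.cache = self.cache, None
        if self.count:              # cache went empty: read request to refill
            self.cache = self.mem[self.head]
            self.head = (self.head + 1) % len(self.mem)
            self.count -= 1
            self.reads += 1
        return x
```

In this model a lightly loaded queue never touches the SRAM at all, while a backlog costs one write and one read per element after the first.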
3.1 Dynamic arbiter
In this design, the SRAM controller (Figure 4) is decoupled from the SRAMSplit
module (Figure 5). The former dispatches requests in the order in which they
are received in reqfifo, using an internal cycle register to keep track of its
state during a single request. The latter arbitrates the access of two other
modules to a single SRAM controller. Requests received from both modules are
placed into a pair of FIFOs. A set of mutually exclusive rules then controls
the priority of each request. Two of them are generated by the
getPrioritizeValid function; one of these fires when only a single request
FIFO is non-empty. However, when both FIFOs contain data, the
prioritize_current_turn rule fires, prioritizing the Least Recently Used (LRU)
FIFO. The turn register holds the state needed to infer the LRU FIFO.
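The decision taken by these three mutually exclusive rules can be written as a single function for illustration. This sketch assumes that turn holds the index of the client to be served when both request FIFOs are non-empty, and that after any grant it points to the other client; the paper only states that turn holds the state needed to infer the LRU FIFO, so this encoding and the name arbitrate are assumptions.

```python
def arbitrate(not_empty, turn):
    """Return (granted client index or None, next value of turn).

    not_empty: pair of booleans, one per request FIFO.
    turn: index (0 or 1) of the least recently served client.
    """
    a, b = not_empty
    if a and b:
        return turn, 1 - turn   # both pending: serve the LRU client
    if a:
        return 0, 1             # only client 0 pending
    if b:
        return 1, 0             # only client 1 pending
    return None, turn           # nothing to do, state unchanged
```

When both clients keep requesting, successive calls alternate between them, so neither queue can starve the other.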
[Fig. 4 diagram: mkSRAM controller with reqfifo and respfifo (mkLFIFOF), the cycle register, the cycle_machine and fifo_to_wires rules, and the mkTriState buffer driving SRAM_ADDR, SRAM_NWE and SRAM_DATA.]
Fig. 4. In the dynamic arbiter based design, the SRAM controller is decoupled from
the SRAMSplit module. Its state over the two cycles of operation is determined by an
internal cycle register, handled by the cycle_machine rule. The fifo_to_wires rule
controls the tri-state buffer and drives the outputs.
Arbitration also needs to take place in the SRAMFIFO module, because meth-
ods for enqueuing and dequeuing data are designed not to conflict, in order to
simplify module reuse. When both methods are called during the same cycle, we
check if the queue’s head and tail pointers are equal to each other. This means
that the dequeue method has requested to read the same address which the
[Fig. 5 diagram: mkSRAMSplit with its srv[.] server interfaces and reqfifo[.] request FIFOs, the pending FIFO, the turn register, the prioritize_current_turn and getPrioritizeValid(.) rules, and the mkSRAM controller.]
Fig. 5. SRAMSplit provides two SRAM server interfaces and arbitrates their access to
a single SRAM controller. Requests coming to both servers are held into FIFOs until
being sent to the controller. A set of three mutually exclusive rules arbitrate the access
based on not-empty FIFO flags and on a turn register. A pending FIFO preserves the
requester identification, allowing responses to be served back in the right order.
interface Client cli;
    interface Get request;
        method ActionValue#(SRAMReq#(Bit#(addr_sz), tdata)) get;
            // Obey the scheduled request order
            MemOp turn; // Read or Write enum
            case (req_turn.first) matches
                Read: action turn = Read; req_turn.deq; endaction
                ReadThenWrite: action
                    turn = turn_stage == 1'b0 ? Read : Write;
                    if (turn_stage == 1'b1) req_turn.deq;
                    turn_stage <= ~turn_stage;
                endaction
                // Write and WriteThenRead implementations here ...
            endcase
            last_op[1] <= turn;
            // Forward request to SRAMSplit
            if (turn == Read) begin
                req_read.deq; return tagged Read req_read.first;
            end else /* Write request here ... */
        endmethod
    endinterface
    interface Put response; /* ... */ endinterface
endinterface
rule compute_req_turn(deq_requested_mem || enq_requested_mem);
    if(deq_requested_mem && enq_requested_mem) begin
        if(head[2] == tail[2]) begin
             // conflict: deq has read from head[2]-1 and enq has written to tail[2]-1
             // enforce memory order compatible with method order:
             // deq < enq implies read < write
             req_turn.enq(ReadThenWrite);
        end else begin
             req_turn.enq(last_op[0] == Read ? WriteThenRead : ReadThenWrite); // LRU
        end
    end else if(deq_requested_mem && !enq_requested_mem) begin
        req_turn.enq(Read);
    end else if(!deq_requested_mem && enq_requested_mem) begin
        req_turn.enq(Write);
    end
endrule
Fig. 6. Excerpts from the dynamic arbitrated SRAMFIFO implementation. All of our
modules follow a client/server pattern, demanding the treatment of simultaneous
requests for full decoupling. The compute_req_turn rule (right panel) chooses the turn
according to which requests were issued at the current cycle. Whenever possible, a Least
Recently Used (LRU) scheme is adopted (highlighted in the code). The request.get
method of the client interface (left panel) queries this information from the req_turn
FIFO, updates last_op and forwards the correct request by returning it. The last_op
register is an example of an Ephemeral History Register (EHR): references to it must be
appended with an index (shown as subscript text in the code) which defines the logical
execution order of register read and write operations.
enqueue method asked to write. In this case, we force requests to be sent to
the SRAM in the same order as the logical execution order chosen when designing
the methods (dequeue before enqueue, and thus read before write). Otherwise,
we follow an LRU approach based on the value of a register which holds the type
of the last memory operation issued by the SRAMFIFO module (Figure 6).
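The decision made by the compute_req_turn rule of Figure 6 can be restated as a plain function, which makes the cases easy to check by hand. This is a direct transliteration of the excerpt into Python for illustration only; operation names are represented as strings, and the EHR port indices of the original are dropped because a software function has no same-cycle ordering.

```python
def compute_req_turn(deq_requested, enq_requested, head, tail, last_op):
    """Return the memory operation order to schedule, or None if idle."""
    if deq_requested and enq_requested:
        if head == tail:
            # Conflict on the same address: the method order deq < enq
            # implies the memory order read < write.
            return "ReadThenWrite"
        # No conflict: start with the operation that was not issued last (LRU).
        return "WriteThenRead" if last_op == "Read" else "ReadThenWrite"
    if deq_requested:
        return "Read"
    if enq_requested:
        return "Write"
    return None
```

For example, with both methods called, distinct pointers and a read as the previous operation, the write is scheduled first.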
3.2 Static arbiter
Starting from the code of the dynamic version, we incrementally added new
conditions to method predicates, testing the system after the changes. As implicit
conditions which control the data flow of FIFOs are still present in the logic at
this development stage, designer errors tend to prevent rules from firing, stopping
data flow and making the system hang instead of producing incorrect results.
[Fig. 7 timing diagram: cycles 0 to 7, with req and resp arrows and enq/deq slots for each FIFO.]
Fig. 7. Timing diagram of SRAMFIFO
transactions governed by the static ar-
biter. Arrows identify in which cycles
the memory requests and responses oc-
cur. White rectangles represent actions
on uartInFifo, while black rectangles
depict operations on uartOutFifo.
We added these predicate conditions based on a manually devised arbitration
schedule, shown in Figure 7. This schedule allows for the execution of an enqueue
and a dequeue operation on both uartInFifo and uartOutFifo during the
course of 8 clock cycles. A central arbiter, which consists of a counter reset
every 8 cycles, was implemented just below the top level module. Boolean values
derived from this counter, signaling if each operation could occur during each
cycle, were routed from the top level module to the inner SRAM controller,
SRAMSplit and SRAMFIFOs (Figure 8). After the predicates were changed, some
FIFOs could be removed, reducing the number of LEs needed to implement the
design.
SRAM controller: The cycle register (compare with Figure 4) was re-
moved and replaced by the least significant bit of the central arbiter counter.
Memory requests became allowed only when this bit is zero, which happens in
cycles numbered 0, 2, 4 and 6 (Figure 7).
SRAMSplit: Requests to one of the arbitrated servers became allowed only
during the correct cycle, as defined by the timing diagram — requests coming
from uartInFifo at cycles numbered 0 and 2, and those from uartOutFifo at
cycles 4 and 6. Both reqfifos could then be removed from the design without
affecting its behavior. The order of responses could also be inferred from the
diagram (cycles 3, 5, 7 and 1), allowing us to remove the pending FIFO. After
all changes, the module became just an abstraction which synthesizes purely to
wires (compare with Figure 5).
// We limit the rules and actions to occur on the right
// cycle by inserting predicate conditions
module mkSRAM#(Bit#(1) turn) (SRAM#(taddr, tdata))
    // ...
    rule cycle_machine(turn == 0); // Notice the cycle dependent predicate
        /* ... */
    endrule
    interface Server ifc; interface Put request;
        method Action put(SRAMReq#(taddr, tdata) req) 
            if (turn == 0); // Method implicit condition
                // ...
        endmethod
    endinterface; /* ... */ endinterface;
endmodule
// We also attributed implicit conditions to specific
// actions which depend on the correct cycle 
module mkSRAMFIFO#(Bool turnRead, Bool turnWrite)
(SRAMFIFO#(addr_sz, tdata))
    // ...
    method Action enq(tdata x) if (not_ring_full[1]);
        // ...
        // when adds implicit conditions to an action expression
        when(turnWrite, action
            // ...
        endaction);
    endmethod
endmodule
// Central Arbiter definition
// Just a cyclic counter
interface CentralArbiter#(numeric type arbitrated_units);
    interface ReadOnly#(Bit#(TLog#(arbitrated_units))) turn;
endinterface
module mkCentralArbiter(CentralArbiter#(n));
    Reg#(Bit#(TLog#(n))) turnCounter <- mkRegU;
    rule incrementTurn;
        let maxCount = fromInteger(valueOf(n) - 1);
         turnCounter <= (turnCounter == maxCount) ? 0 : (turnCounter + 1);
    endrule
    interface ReadOnly turn = regToReadOnly(turnCounter);
endmodule
// On the top module, we pass the arbiter turn as argument to the modules
CentralArbiter#(8) arb <- mkCentralArbiter;
SRAMSplit#(AddrSize, Byte) sram
    <- mkSRAMSplit(arb.turn[2], arb.turn[0], arb.turn == 5, arb.turn == 7);
SRAMFIFO#(FifoAddrSize, Byte) uartInFifo
    <- mkSRAMFIFO(arb.turn == 2, arb.turn == 0);
SRAMFIFO#(FifoAddrSize, Byte) uartOutFifo
    <- mkSRAMFIFO(arb.turn == 4, arb.turn == 6);
Fig. 8. Excerpts from the static arbiter implementation. The CentralArbiter module,
which consists of a simple cyclic counter, is instantiated inside the top-level module. Its
turn method describes the current cycle according to the schedule of Figure 7. Some
bits of the cycle counter, or Boolean conditions involving its current state, are routed to
inner modules, where they are appended to predicates or added as implicit conditions.
SRAMFIFO: Dequeues and enqueues became allowed only during the designated
cycles (2 and 4 for dequeues, 0 and 6 for enqueues). This allowed the memory
request output FIFOs to be removed and replaced by wires.
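The fixed schedule can be summarized as a lookup on the arbiter counter. The request slots below follow the turn arguments passed to mkSRAMFIFO in Figure 8 (read turn first, write turn second). Pairing each odd cycle with the request issued one cycle earlier is an assumption made for illustration; the text only lists the response cycles, and for writes the "response" slot is best read as the cycle in which the operation completes.

```python
# Even cycles carry requests; derived from the mkSRAMFIFO turn arguments.
REQUEST_SLOTS = {
    0: ("uartInFifo", "write"),    # enqueue
    2: ("uartInFifo", "read"),     # dequeue
    4: ("uartOutFifo", "read"),    # dequeue
    6: ("uartOutFifo", "write"),   # enqueue
}


def schedule(cycle):
    """Describe what the static arbiter allows in a given clock cycle."""
    turn = cycle % 8                      # the central arbiter is a mod-8 counter
    return {
        "turn": turn,
        "request": REQUEST_SLOTS.get(turn),
        # ASSUMPTION: each operation finishes in the cycle after its request.
        "response": REQUEST_SLOTS.get((turn - 1) % 8),
    }


for c in range(8):
    print(schedule(c))
```

Since every slot is known in advance, no queue is needed to remember who asked for what, which is why the request and pending FIFOs could be dropped.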
4 Results
4.1 Synthesis
Synthesis results are shown in Table 1. On the device actually adopted in our
project (EPM2210F324C3), the dynamically arbitrated circuit occupies 200 more
LEs than the static arbiter design, which corresponds to 9% of the LEs
available in the CPLD. Almost half of the hardware resources are still free and
could be exploited to implement new features. We also synthesized both
architectures on a smaller device (EPM1270F256C3) to demonstrate that the
design can meet the requirements even when approaching the limits of the CPLD
substrate. The Quartus II Fitter clearly expended more effort during synthesis
on this device, as both circuits were implemented using fewer LEs than on the
larger CPLD. Nonetheless, the attainable clock frequency was not significantly
reduced by this area optimization, remaining above 50 MHz.
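The percentages quoted here and in Table 1 can be reproduced from the LE counts. The short check below takes the device capacities as 2210 and 1270 LEs for the EPM2210 and EPM1270 respectively (the MAX II family naming convention, which also matches the percentages in the table); the dictionary and function names are illustrative.

```python
DEVICE_LES = {"EPM2210F324C3": 2210, "EPM1270F256C3": 1270}

USED_LES = {
    ("EPM2210F324C3", "static"): 1017,
    ("EPM2210F324C3", "dynamic"): 1217,
    ("EPM1270F256C3", "static"): 970,
    ("EPM1270F256C3", "dynamic"): 1168,
}


def utilization_percent(device, arbiter):
    """Share of the device's logic elements used by a design, in percent."""
    return 100 * USED_LES[(device, arbiter)] / DEVICE_LES[device]


def arbiter_overhead_percent(device):
    """Extra LEs of the dynamic arbiter as a share of the whole device."""
    extra = USED_LES[(device, "dynamic")] - USED_LES[(device, "static")]
    return 100 * extra / DEVICE_LES[device]


for key in USED_LES:
    print(key, f"{utilization_percent(*key):.0f}%")
```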
4.2 Experimental validation
Workbench validation consisted of connecting independent square-wave periodic
signal generators into each input of the system for 8 hours and then analyzing
Table 1. Synthesis results for both arbiter designs (Altera Quartus II 14.0)

                 EPM2210F324C3                  EPM1270F256C3
Arbiter   Logic         Maximum clock    Logic         Maximum clock
design    elements      frequency        elements      frequency
Static    1017 (46%)    54.57 MHz        970 (76%)     54.36 MHz
Dynamic   1217 (55%)    54.07 MHz        1168 (92%)    53.49 MHz
[Fig. 9 plots: six histogram panels, one per input channel, covering ∆t ranges of about 1999 to 2001, 161 to 162, 163 to 165, 1362 to 1365, 1383 to 1386 and 1402 to 1404 µs; horizontal axis ∆t (µs), vertical axis number of occurrences (logarithmic scale).]
Fig. 9. Histograms displaying the number of occurrences of each time interval ∆t
measured between two consecutive pulses. Each one of the six input channels was
connected to an independent periodic signal source. Neither missed nor spurious
events were observed, even after eight hours of acquisition.
the acquired data to look for spurious or missing detections. Figure 9 shows
histograms of the time interval (∆t) between two consecutive recorded pulses
for all input channels. Histograms shown in the first row correspond to periodic
pulses generated by a 1-channel Hewlett-Packard 33120A and a 2-channel
Sony-Tektronix AFG320 function generator. The three remaining channels were fed
with signals generated by free-running astable oscillators built around the
NE555 timer integrated circuit.
The first input channel was programmed to synchronize DAC conversions,
thus we have fed it with 500 Hz (frame rate frequency adopted by the VSImG [15]
visual stimulation system). The second and third channels were supplied with
close but incommensurable frequencies (6.2 kHz and 6.1 kHz). The remaining
channels were fed by similar frequencies, produced by three identical NE555
circuits, differing only within component nominal tolerances.
Experiments with both architectures (dynamic and static arbiter) resulted in
almost identical histograms, thus the figure only portrays the results for one of
them (dynamic arbiter). The histograms show that during 8 hours of acquisition
no pulses were missed, and no spurious events were registered; otherwise the
abscissa of the graph would reach double or half the value of the baseline period,
respectively. The maximum deviation from the adjusted periods was within the
acceptable thermal drift of the generators. As expected, NE555 oscillators are
less stable and produce more jitter than the commercial function generators,
resulting in sparser histograms.
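The criterion used to read the histograms can be expressed as a simple check on recorded timestamps. The sketch below is not the analysis script used for the paper; it flags intervals that are clearly longer than the baseline period (as a missed pulse would produce) or clearly shorter (as a spurious event would produce). The 25% tolerance and the function name are arbitrary assumptions.

```python
def classify_intervals(timestamps_us, period_us, tolerance=0.25):
    """Count inter-pulse intervals that are normal, too long or too short."""
    counts = {"ok": 0, "long": 0, "short": 0}
    for prev, cur in zip(timestamps_us, timestamps_us[1:]):
        dt = cur - prev
        if dt > period_us * (1 + tolerance):
            counts["long"] += 1    # e.g. a missed pulse doubles the interval
        elif dt < period_us * (1 - tolerance):
            counts["short"] += 1   # e.g. a spurious event splits an interval
        else:
            counts["ok"] += 1
    return counts
```

For a 500 Hz reference (2000 µs period), a recording such as [0, 2000, 4000, 8000, 10000] would report one long interval, hinting at a missed pulse between 4000 and 8000 µs.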
We emphasize that input event periods employed during this test were well
below the minimum intervals between spikes (refractory period) attainable by
a typical neuron. For example, in Chrysomya megacephala’s H1 neuron this
minimum interval is 2 ms [20].
4.3 Arbiter resilience evaluation
Besides the code reusability advantages of the dynamic arbiter design, inherent
to its latency insensitive and decoupled characteristics, it is also more
resilient to failures. To demonstrate this, we needed to increase the timestamp
resolution beyond its original specification of 1 µs. In fact, any update rate
below 1/40 of the clock frequency (50 MHz) can always be correctly scheduled,
as it leaves room for at least 5 rounds of 4 memory operations, each one taking
2 cycles (see Figure 7), i.e. 5 × 4 × 2 = 40 cycles, sufficient to carry the
5-byte (channelFlags, timestamp) tuple in and out of the FIFOs. Thus, to be
able to observe failures related to differences in SRAM arbitration schemes, we
increased the timestamp counter update rate from 1/50 to 1/20 of the clock
frequency.
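The cycle budget behind this argument can be checked with a few lines of
arithmetic. The following is a minimal Python sketch, not part of the BSV
sources; all constant and function names are illustrative, and the values are
the ones quoted in the text.

```python
# Illustrative cycle-budget arithmetic. Names are hypothetical; the values
# (50 MHz clock, 5 rounds x 4 operations x 2 cycles) are quoted from the text.
CLOCK_HZ = 50_000_000
ROUNDS = 5
OPS_PER_ROUND = 4
CYCLES_PER_OP = 2


def cycles_needed_per_tick() -> int:
    """Cycles needed to carry one 5-byte tuple in and out of the FIFOs."""
    return ROUNDS * OPS_PER_ROUND * CYCLES_PER_OP


def tick_ns(divider: int) -> int:
    """Timestamp resolution in ns for a counter updated every `divider` cycles."""
    return divider * 1_000_000_000 // CLOCK_HZ


def always_schedulable(divider: int) -> bool:
    """True when one timestamp tick spans at least the cycles needed per tuple."""
    return divider >= cycles_needed_per_tick()


for divider in (50, 20):
    print(divider, tick_ns(divider), "ns", always_schedulable(divider))
```

With these numbers, a divider of 50 corresponds to the original 1 µs tick and
fits within the 40-cycle budget, whereas a divider of 20 corresponds to a
0.4 µs tick and does not, which is why arbitration failures become observable.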
In order to keep control over parameters such as UART transmission rate, we
simulated the system instead of evaluating it with workbench instruments. Both
architectures were simulated under exactly the same parameters and inputs.
Inputs were fed with trains of pulses generated according to a Poisson process, a
stochastic model that occasionally produces activity bursts, although it possesses
a parameterized mean rate. It has also been adopted in some statistical models
of spiking neurons [1]. The first channel, however, was modeled as an oscillation
whose frequency varies according to a narrow Gaussian distribution, which better
reproduces the behavior of the stimuli reference clock.
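As an illustration of the two input models, the sketch below generates a
Poisson pulse train and a Gaussian-jittered reference clock using only the
Python standard library. It is not the simulation testbench used in this work:
the function names, the seed, the Poisson rate (0.005 events per cycle) and the
frequency spread (1e-4) are assumptions chosen for the example, and time is
treated as continuous rather than quantized to clock cycles.

```python
import random


def poisson_train(rate, cycles, rng):
    """Event times (in cycles) of a homogeneous Poisson process.

    `rate` is the mean number of events per clock cycle; inter-event
    intervals are drawn from an exponential distribution.
    """
    times = []
    t = rng.expovariate(rate)
    while t < cycles:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def jittered_clock(mean_freq, sigma_freq, cycles, rng):
    """Pulse times of an oscillator whose frequency is redrawn from a narrow
    Gaussian at every period (an assumed model of the reference clock)."""
    times = []
    t = 0.0
    while True:
        # Guard against non-positive draws; irrelevant when sigma is narrow.
        freq = max(rng.gauss(mean_freq, sigma_freq), mean_freq * 1e-3)
        t += 1.0 / freq
        if t >= cycles:
            return times
        times.append(t)


rng = random.Random(1234)  # fixed seed, for reproducibility only
spikes = poisson_train(0.005, 1_000_000, rng)
clock = jittered_clock(1 / 60, 1e-4, 1_000_000, rng)
# Measured mean rates, in events per cycle.
print(len(spikes) / 1_000_000, len(clock) / 1_000_000)
```

Because the intervals of the Poisson train are exponentially distributed, short
intervals are the most likely ones, which is what produces the occasional
activity bursts even though the mean rate is fixed.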
We limited UART transmission to 1 byte per 6 cycles, aiming to observe the
system in a regime in which it would eventually acquire more data than it could
transmit. UART reception was constrained to 1 byte per 10 cycles, allowing 3-
byte chunks of stimuli data to be provided to the system at double the speed
required by the first-channel reference clock, whose mean frequency was chosen
at 1/60 of the system clock.
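The rate figures in this paragraph follow from simple ratios, sketched below in
Python. The names are illustrative, and the transmit-side estimate assumes that
each 5-byte tuple is sent over the UART without any framing overhead, which the
text does not state.

```python
from fractions import Fraction

# Values quoted in the text; names are hypothetical.
RX_CYCLES_PER_BYTE = 10
TX_CYCLES_PER_BYTE = 6
STIMULUS_CHUNK_BYTES = 3
REFERENCE_PERIOD = 60   # mean reference-clock period, in cycles
TUPLE_BYTES = 5         # (channelFlags, timestamp) tuple


def rx_headroom() -> Fraction:
    """Stimulus chunks that can be received per reference-clock period."""
    chunk_cycles = STIMULUS_CHUNK_BYTES * RX_CYCLES_PER_BYTE
    return Fraction(REFERENCE_PERIOD, chunk_cycles)


def tx_cycles_per_tuple() -> int:
    """Cycles to transmit one tuple (assumes no framing overhead)."""
    return TUPLE_BYTES * TX_CYCLES_PER_BYTE


print(rx_headroom(), tx_cycles_per_tuple())
```

The receive path thus has a factor of 2 headroom, as stated above. Under the
no-overhead assumption, draining one tuple takes 30 cycles, so a sustained
tuple rate above 1/30 per cycle would eventually fill the transmit buffers.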
Varying the rate parameter of the Poisson processes, we measured the total
mean input event rate to which the system was exposed, i.e. the number of
input events divided by the number of cycles of simulation, relating it to the
mean time between failures (MTBF) in number of cycles. Failures were detected
[Figure 10: plot of mean time between failures (cycles), logarithmic axis from
$10^2$ to $10^8$, versus mean input event rate (divided by clock frequency),
from 0.015 to 0.055. Curves: dynamic arbiter, static arbiter. Annotations:
regions A, B and C; regime transition.]
Fig. 10. Mean time between failures (MTBF) obtained by simulation, with designs
configured for an increased timestamp resolution of 1/20 of the system clock, and subject
to large mean input event rates. The dynamic arbiter is consistently more resilient,
and presented two operating regimes: in regime A, failures are caused by overflow of
uartOutFifo, while in regime B, memory write requests occur at a high rate and
eventually cannot be scheduled on time, overflowing funnel. As the static arbiter
schedules enqueuing operations at a fixed rate of 1 byte per 8 cycles (slower than
the simulated UART transmission rate), regime C is only due to funnel overflow.
if any of the four error conditions discussed in Section 3 were triggered. In this
experiment they happened only due to overflows in the tx path.
Figure 10 shows these results and demonstrates that, besides supporting a
greater mean firing rate than the static arbiter without missing events, the
dynamic arbiter takes longer to fail when the frequency approaches its limits.
At an input rate of 1/20, the maximum meaningful frequency at the adopted
timestamp resolution, the dynamic arbiter fails after circa $10^3$ cycles,
whereas the static arbiter lasts only $10^2$ cycles.
Under the simulation parameters, the static arbiter is not able to fill the
SRAM-contained ring of uartOutFifo with more than 1 byte. This happens because
the FIFO enqueuing rate is limited to a maximum of 1 byte per 8 cycles by the
central arbiter schedule (see Figure 7), while the simulated UART can reach a
transmission rate of 1 byte per 6 cycles, sufficient to keep the FIFO almost
empty. The observed failures were due to overflow of the funnel: the
uartOutFifo never even came close to a full state. Therefore, the failure
process may be viewed as nearly stationary at a time scale greater than $10^2$
cycles. Indeed, the measured data set (Figure 10–C) does not change
significantly if we reset the circuit state a couple of times during the course
of simulation.
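The reason the ring never holds more than 1 byte can be reproduced with a toy
producer/consumer model. The Python sketch below is a deliberate simplification
and not the simulated design: it ignores SRAM access latency, assumes a byte is
always available at every enqueue slot, and uses hypothetical names.

```python
def max_fifo_occupancy(enqueue_period, dequeue_period, cycles):
    """Peak occupancy of a FIFO filled once every `enqueue_period` cycles and
    drained by a consumer that needs `dequeue_period` cycles per byte."""
    occupancy = 0
    max_occupancy = 0
    uart_free_at = 0  # first cycle at which the consumer can take a new byte
    for cycle in range(cycles):
        # Consumer: take a byte if one is waiting and the UART is idle.
        if occupancy > 0 and cycle >= uart_free_at:
            occupancy -= 1
            uart_free_at = cycle + dequeue_period
        # Producer: the fixed schedule enqueues one byte per slot.
        if cycle % enqueue_period == 0:
            occupancy += 1
            max_occupancy = max(max_occupancy, occupancy)
    return max_occupancy


# Static schedule (1 byte per 8 cycles) against the simulated UART (1 per 6).
print(max_fifo_occupancy(8, 6, 100_000))
```

In this model the consumer always finishes a byte before the next enqueue slot
arrives, so the peak occupancy stays at 1. Shortening the enqueue period below
the dequeue period (for example, 4 cycles against 6) makes the occupancy grow
instead.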
On the other hand, the dynamic arbiter is able to move past an operating regime
(Figure 10–B) where failures occur because of funnel overflow, reaching another
regime at lower frequencies (Figure 10–A) where the system does not abort until
uartOutFifo is full. The dashed curve is proportional to $(f - f_{lim})^{-1}$,
where $f$ is the mean input frequency (abscissa) and $f_{lim}$ is a limit
frequency (in this experiment, $f_{lim} \approx 0.0282$) such that $5 f_{lim}$
is less than the effective UART transmission rate, implying the circuit is not
expected to ever fail for $f \le f_{lim}$.
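The shape of the dashed curve can be sketched as follows. This is an
illustrative Python fragment only: the proportionality constant (`scale`) is
not given in the text and is a placeholder, and the comparison uses the nominal
UART rate of 1 byte per 6 cycles because the effective rate is not quantified.

```python
import math

F_LIM = 0.0282          # limit frequency quoted in the text
UART_TX_RATE = 1 / 6    # nominal bytes per cycle (effective rate not quantified)
TUPLE_BYTES = 5


def mtbf_model(f, scale=1.0, f_lim=F_LIM):
    """Curve proportional to 1 / (f - f_lim); infinite at or below f_lim.

    `scale` is a hypothetical proportionality constant.
    """
    if f <= f_lim:
        return math.inf
    return scale / (f - f_lim)


# 5 * f_lim stays below the nominal UART byte rate.
print(TUPLE_BYTES * F_LIM < UART_TX_RATE)
for f in (0.028, 0.030, 0.035, 0.040):
    print(f, mtbf_model(f))
```

The model diverges as the mean input frequency approaches the limit frequency
from above, matching the expectation that the circuit never fails at or below
that limit.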
5 Conclusions
The system described in this paper (source code at [8]) can be applied to
neuroscience research in both in vivo and in vitro experiments requiring
deterministic timing and synchronous stimuli generation, such as the study of
neural coding in the visual system of flies [21]. It can also be applied to
experiments in neuroethology, for example the analysis of electrocommunication
signals produced by pulse-type electric fish [22,23,24,25]. The employed
digital pulse timestamping technique achieves a measurement precision on the
order of 1 µs, much higher than most ADC-based acquisition systems. Even though
our project was programmed into a small reconfigurable device, almost half of
the LEs were left free and can be used to implement future experiments with
real-time feedback [26].
We have also shown that Bluespec SystemVerilog (BSV) can be effectively used
even in projects involving small devices, and have presented an approach to
refactor a decoupled and latency insensitive logic into a statically arbitrated
one, which could be useful when a designer needs to quickly lower a system’s LE
usage. However, keeping dynamic arbitration can justify the LE cost if the
system needs to be resilient to activity bursts. The designed code is modular
and can be reused to implement similar systems, e.g. we have a working
prototype for
closed-loop experiments implemented on an EP4CGX150DF31C7 FPGA, which
occupies 5728 LEs (4% of the device) and interfaces with a real-time operating
system (RTOS) through PCI Express [27].
Acknowledgments. The authors were supported by grants from CAPES and FAPESP.
Maxim Integrated provided free analog IC samples. Altera Corp and Bluespec Inc
supplied free software licenses through their university programs.
References
1. Dayan, P., Abbott, L.F.: Theoretical neuroscience: computational and mathemat-
ical modeling of neural systems. The MIT Press (2005)
2. Lewicki, M.S.: A review of methods for spike sorting: the detection and classifica-
tion of neural action potentials. Network–Comp Neural 9(4) (1998) R53–R78
3. Brochini, L., Carelli, P.V., Pinto, R.D.: Single synapse information coding in
intraburst spike patterns of central pattern generator motor neurons. J Neurosci
31(34) (2011) 12297–12306
4. Spavieri, Jr, D.L., Eichner, H., Borst, A.: Coding efficiency of fly motion processing
is set by firing rate, not firing precision. PLoS Comput Biol 6(7) (July 2010)
e1000860
5. Bolzon, D.M., Nordström, K., O’Carroll, D.C.: Local and large-range inhibition in
feature detection. J Neurosci 29(45) (2009) 14143–14150
6. Rokem, A., Watzl, S., Gollisch, T., Stemmler, M., Herz, A.V.M., Samengo, I.:
Spike-timing precision underlies the coding efficiency of auditory receptor neurons.
J Neurophysiol 95(4) (2006) 2541–2552
7. Nikhil, R.: Bluespec System Verilog: efficient, correct RTL from high level specifi-
cations. In: MEMOCODE’04. (June 2004) 69–70
8. Matias, P.: Low-cost modular acquisition and stimulation system for neuroscience.
http://dx.doi.org/10.5281/zenodo.11034 (July 2014)
9. Gruian, F., Westmijze, M.: BluEJAMM: a Bluespec embedded Java architecture
with memory management. In: SYNASC’07. (September 2007) 459–466
10. Meeus, W., Van Beeck, K., Goedemé, T., Meel, J., Stroobandt, D.: An overview
of today’s high-level synthesis tools. Des Autom Embed Syst 16(3) (2012) 31–51
11. Malik, J.S., Palazzari, P., Hemani, A.: Effort, resources, and abstraction vs perfor-
mance in high-level synthesis: finding new answers to an old question. SIGARCH
Comput Archit News 40(5) (March 2012) 64–69
12. Arvind, Nikhil, R.S., Rosenband, D.L., Dave, N.: High-level synthesis: an essential
ingredient for designing complex ASICs. In: ICCAD’04, Washington, DC, USA,
IEEE Computer Society (2004) 775–782
13. Fleming, K.E., Ng, M.C., Gross, S., Arvind: WiLIS: architectural modeling of
wireless systems. In: ISPASS’11. (April 2011) 197–206
14. Murray, K.E., Betz, V.: Quantifying the cost and benefit of latency insensitive
communication on FPGAs. In: FPGA’14, New York, ACM (2014) 223–232
15. de Almeida, L.O.B., Slaets, J.F.W., Köberle, R.: VSImG: a high frame rate bitmap
based display system for neuroscience research. Neurocomputing 74(10) (2011)
1762–1768
16. Nemenman, I., Lewen, G.D., Bialek, W., de Ruyter van Steveninck, R.R.: Neural
coding of natural stimuli: information at sub-millisecond resolution. PLoS Comput
Biol 4(3) (March 2008) e1000025
17. Wadler, P.: Monads for functional programming. In Jeuring, J., Meijer, E., eds.:
Advanced Functional Programming. Volume 925 of Lecture Notes in Computer
Science. Springer Berlin Heidelberg (1995) 24–52
18. Shen, X., Arvind: Design and verification of speculative processors. In: Proceedings
of the Workshop on Formal Techniques for Hardware and Hardware-like Systems.
(1998)
19. Rosenband, D.L.: The ephemeral history register: flexible scheduling for rule-based
designs. In: MEMOCODE’04. (June 2004) 189–198
20. Baptista, M.S., de Almeida, L.O.B., Slaets, J.F., Köberle, R., Grebogi, C.: A
complex biological system: the fly’s visual module. Philos T R Soc A 366(1864)
(2008) 345–357
21. Fernandes, N., Pinto, B., de Almeida, L.O.B., Slaets, J.F.W., Köberle, R.:
Recording from two neurons: second-order stimulus reconstruction from spike
trains and population coding. Neural Comput 22(10) (October 2010) 2537–2557
22. Forlim, C.G., Pinto, R.D.: Automatic realistic real time stimulation/recording in
weakly electric fish: long time behavior characterization in freely swimming fish
and stimuli discrimination. PLoS ONE 9(1) (January 2014) e84885
23. Matias, P., Slaets, J.F.W., Pinto, R.D.: Individual discrimination of freely swim-
ming pulse-type electric fish from electrode array recordings. Neurocomputing 153
(2015) 191–198
24. Forlim, C.G., de Almeida, L.O.B., Varona, P., Rodríguez, F., Pinto, R.D.: Study
of electric and motor behavior in weakly electric fish, Gymnotus carapo and
Gnathonemus petersii, using information theory tools. In: Neuroscience 2012, New
Orleans, LA, Society for Neuroscience (2012) Program#501.10/EEE20
25. Guariento, R.T., Mosqueiro, T.S., Caputi, A., Pinto, R.D.: A simple model for
eletrocommunication - “refractoriness avoidance response”? BMC Neuroscience
15(Suppl 1) (2014) P68
26. Muñiz, C., Rodríguez, F., Varona, P.: RTBiomanager: a software platform to
expand the applications of real-time technology in neuroscience. BMC Neurosci
10(1) (2009) 49
27. de Almeida, L.O.B., Matias, P., Guariento, R.T.: An embedded system for real-
time feedback neuroscience experiments. In: IV Brazilian Symposium on Comput-
ing Systems Engineering – SBESC 2014, Intel Embedded Systems Competition.
(November 2014) arXiv:1504.00932 [q-bio.QM].
