Abstract-This work describes a new digital front-end for a high-resolution low-cost animal PET scanner which is currently under development. The advances in flexibility and size of modern FPGAs together with the release of new tools enable the integration of most of the front-end electronics in a single FPGA. The implemented system includes a small 32-bit RISC processor, several peripherals attached to the internal buses and a special DSP unit closely attached to the processor which is dedicated to the detection of the gamma events. On top of these, a small footprint real time operating system abstracts the underlying hardware, providing the mechanisms to combine on-chip slow control and data streaming.
I. INTRODUCTION
As programmable logic devices continue to grow in density, designers are increasingly using FPGAs where they previously used ASICs.
This movement towards programmable technologies reduces development time and risks while keeping electronics size comparable [1]. Moreover, online reconfiguration allows for area reutilization when considering alternative modes of operation.
In the last few years front-end detectors for nuclear imaging have evolved in this direction: ASIC-based detectors [2] are being replaced by FPGA-based equivalents, in which most of the analog processing of the scintillation pulse is replaced by its digital counterpart [3]. However, most designs still rely on an external constant fraction discriminator (CFD) or similar circuit to generate the event time stamp, which is required in positron emission tomography (PET) for coincidence sorting. Some exceptions [4] [5] [6] [7] replace the external circuit by additional digital processing.
In this work we describe the digital front-end currently being developed by our group. Our aim is to build a compact, low-cost and flexible detector for small animal imaging. Currently we assume a detector consisting of multi-layer scintillation crystals attached to a position sensitive photomultiplier (PS-PMT) and Anger readout, although the interface will be flexible enough to accommodate other configurations. In order to reduce space and simplify the design, we have integrated the complete system in a single FPGA; in this way the signal processing block (custom DSP) that computes basic parameters of the pulse (energy, position, time stamp and crystal discrimination) can be treated as another peripheral of a more complex system with direct access to the high speed buses embedded in the device. A small footprint real time operating system (RT/OS) runs on top of the HW architecture, simplifying software development.
Manuscript received June 9, 2005. This work was supported in part by the FPU Research Grant from the Spanish Education and Science Ministry by the Spanish Thematic Network IM3 (G03/185) TIC2001-0175-C03-02 and TEC2004-07052-C02-02. P. Guerra
This communication is structured as follows: the first section enumerates the software tools that have been used and gives an overview of the complete system under design; the next section presents the HW/SW architecture of the embedded digital front-end; the communication concludes with results regarding area, speed and streaming bandwidth.
II. MATERIAL AND METHODS

A. Software Tools
The Embedded Development Kit 6.2 (EDK, Xilinx Inc., San Jose, CA, USA) has been used to integrate the complete embedded system, including hardware peripherals (either proprietary or third party cores), RT/OS and different software services.
The implemented digital hardware, as well as the interaction between the developed cores and the low level libraries, has been thoroughly simulated with Modelsim SE (Mentor Graphics, Wilsonville OR, USA).
The DSP core, described in VHDL (Very High Speed Integrated Circuit Hardware Description Language), has been optimized and verified through cosimulation with Modelsim and Simulink 5.0 (The Mathworks, Natick, MA, USA) using the software package XtremeDSP® from Xilinx. Simulink has been used to provide realistic input stimuli to the VHDL simulator through the modeling of the analog elements of the front-end (crystal layers, PS-PMT, analog electronics and ADCs) [8].
The real time kernel μC/OS-II (Micrium, Weston, FL, USA), a highly portable preemptive real-time multitasking kernel, provides all the resources required by our application. This kernel, written in ANSI C, is suitable for safety critical systems common to aviation and medical products and has been certified in an avionics product by the Federal Aviation Administration (FAA) for use in commercial aircraft. Moreover, μC/OS-II has been ported to the microprocessors currently supported by FPGA vendors (PowerPC, ARM, Nios II and MicroBlaze), providing an additional level of independence from the final target technology.
The front-end device is interfaced with internal tools developed with the .NET platform (Microsoft Corp., Redmond, WA, USA).
B. System Overview
The embedded front-end described in this work will be the main building block of the system, which will consist of:
- A master controller, which will distribute the synchronization signals. These signals are required to guarantee that all modules share the same clock and the same value in their internal time counters.
- An even number of acquisition modules, each consisting of the scintillation crystals, PS-PMT, readout, and acquisition and control electronics, the latter being the subject of this work.
- One or several concentrators, depending on the actual number of acquisition modules, which send data to a workstation for off-line fine coincidence resolution and image reconstruction.
The master controller will also include:
- An acquisition interface for the registration of biological signals, such as the cardiac or respiratory cycle. Synchronization of external activity with the internal timestamps is important to ensure the validity of the acquired PET images and helps to improve the reproducibility of PET investigations [9].
- A SW-configurable coincidence detector. On every event detection the modules report the single to this unit, which discriminates for coincidences in a clock-cycle wide time window, producing a result before the pulse is completely characterized.
- Gantry control tasks, such as controlling the rotating motor or stepping the bed.
Each acquisition module includes:
- A custom DSP core that processes the Anger signals generated by the analog readout, which are sampled at a maximum sampling rate of 65MHz by an 8-channel free-running ADC from Texas Instruments (Dallas, TX, USA). When a pulse is detected, a programmable number of samples are extracted from the input stream to compute the basic parameters of the scintillation pulse, producing a data packet. Each module will support a maximum count rate of 2Mcps.
- An Ethernet controller that sends the acquired data to the host computer. In order to simplify HW/SW development, the external Ethernet IC includes the physical (PHY) and medium access (MAC) layers [10].
- A microcontroller that handles the communication through the Ethernet as well as slow control.
- Clock management. For reliable coincidence detection, accurate timestamps are needed. For nanosecond accuracy we must be able to synchronize all modules within 1 ns [7]. This is achieved by a high-speed differential clock distribution and a SW correction after calibration. The master controller distributes a high precision 25MHz LVDS clock, which each module uses to generate a synchronized 62.5MHz clock by means of the low jitter internal PLLs available in current FPGAs.
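As a rough illustration of the coincidence detection described above, the following Python sketch pairs singles from different modules whose timestamps lie within one tick of the 62.5MHz module clock (16 ns). The real unit is a hardware block in the master controller; the tuple layout, the function name, the interpretation of the window as "at most one tick apart" and the example values are all assumptions made for illustration.

```python
CLOCK_HZ = 62.5e6            # module clock frequency stated in the text
TICK_NS = 1e9 / CLOCK_HZ     # duration of one timestamp tick in ns


def find_coincidences(singles, window_ticks=1):
    """Pair singles from different modules that are close in time.

    `singles` is a list of (timestamp_ticks, module_id) tuples
    (hypothetical layout). Two singles form a pair when they come from
    different modules and are at most `window_ticks` ticks apart.
    """
    ordered = sorted(singles)
    pairs = []
    for i, (t_a, mod_a) in enumerate(ordered):
        for t_b, mod_b in ordered[i + 1:]:
            if t_b - t_a > window_ticks:
                break  # all later singles are even further away
            if mod_a != mod_b:
                pairs.append(((t_a, mod_a), (t_b, mod_b)))
    return pairs


# Hypothetical singles: two coincident pairs and two isolated events.
singles = [(100, 0), (100, 3), (250, 1), (251, 2), (400, 0), (900, 2)]
coincidences = find_coincidences(singles)
```

Because the decision uses only the coarse timestamp, it can be taken before the pulse is fully characterized, which is the property the text attributes to the hardware unit.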
In order to reduce space and increase flexibility, most subsystem components are integrated in a single FPGA. A preliminary prototype of the digital part of the acquisition module has been assembled based on development kits from different vendors. This prototype enabled us to prove the 
III. PROPOSED ARCHITECTURE FOR THE EMBEDDED SYSTEM
A. Hardware Overview
The on-chip system designed with the EDK is bus-centric, in the sense that the relation among cores is defined by their attachment to the system buses. Our design, whose architecture is shown in Fig. 3, considers three internal buses: one for data, one for instructions in the internal memory and one for the on-chip peripherals (OPB). The implemented system includes the following cores:
- Application-specific DSP block (DSP) and clock management (CLK).
- The MicroBlaze (uB), a 32-bit RISC processor from Xilinx with Harvard architecture that requires around 1000 logic cells.
- A debug module (MDM) for on-line debugging through JTAG.
- Memory controllers for external SRAM and FLASH (EMC) and for the 16KB internal memory.
- Serial port (UART) and I2C controller.
- Interrupt controller (INTC).
- Custom-made Ethernet controller (ETHE).
- Two timers, one of which is required by the RT/OS for task scheduling.
- General purpose IO, including the interface to an external watchdog timer (WDT).
B. DSP core
This peripheral detects the scintillation pulses generated by the interaction of the gamma rays with the crystal and transfers the extracted information to the processor through the OPB bus. The detection and acquisition process is highly pipelined, so that up to four consecutive pulses may coexist in the core, each in a different phase of processing.
As the block diagram of Fig. 4 shows, the data acquired by the ADC is handled by its controller, which corrects the baseline (BLR) and normalizes the pulse so that, independently of the actual voltages, the pulse goes from zero to positive values. When the instantaneous energy crosses the programmed threshold, a finite state machine (FSM) is triggered, enabling the integration of the pulse within a certain time window, as well as the computation of a timestamp for the pulse and a measure of the decay time, which will be used for depth-of-interaction correction in a phoswich system [11]. As a result of the detection, a data packet of 15 bytes is generated and stored in a queue where it waits for transmission. These packets are then transferred into an internal buffer of the IP interface and, once enough data has been stored, the core raises an interrupt. The corresponding interrupt handler then starts the actual transmission to the host computer through the Ethernet interface.
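The processing chain described above can be modelled in software as a minimal sketch. The actual core is a pipelined VHDL design; here it is assumed that four Anger signals (X+, X-, Y+, Y-) are available per sample, that position follows the usual Anger ratio, and that the decay-time measure is the fraction of charge in the second half of the window. The function name, argument names and example values are hypothetical.

```python
def process_pulse(samples, baseline, threshold, window):
    """Software model of baseline correction, triggering and integration.

    `samples` is a list of (xp, xm, yp, ym) ADC tuples, one per clock tick.
    Returns None when no sample crosses the threshold.
    """
    # Baseline restoration: subtract the per-channel baseline.
    corrected = [tuple(s - b for s, b in zip(row, baseline)) for row in samples]
    # Instantaneous energy: sum of the four Anger signals at each tick.
    inst = [sum(row) for row in corrected]
    # Trigger on the first sample whose energy exceeds the threshold.
    trigger = next((i for i, e in enumerate(inst) if e > threshold), None)
    if trigger is None:
        return None
    # Integrate each channel over the programmed window.
    win = corrected[trigger:trigger + window]
    xp, xm, yp, ym = (sum(channel) for channel in zip(*win))
    energy = xp + xm + yp + ym
    # Assumed decay-time measure: charge in the second half of the window.
    tail = sum(inst[trigger + window // 2:trigger + window])
    return {
        "timestamp": trigger,              # tick index of the trigger
        "energy": energy,
        "x": (xp - xm) / (xp + xm),        # Anger ratio (assumed form)
        "y": (yp - ym) / (yp + ym),
        "tail_fraction": tail / energy,
    }


# Hypothetical input: three quiet samples, a decaying pulse, three quiet samples.
baseline = (10, 10, 10, 10)
quiet = [(10, 10, 10, 10)] * 3
pulse = [(58, 26, 42, 42), (34, 18, 26, 26), (22, 14, 18, 18), (16, 12, 14, 14)]
event = process_pulse(quiet + pulse + quiet, baseline, threshold=20, window=4)
```

A slower scintillator leaves more charge in the tail, so a quantity of this kind could separate the layers of a phoswich detector; the measure actually used by the core is not specified in the text.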
The IP interface also includes 8 addressable registers, which are required to configure the acquisition parameters or read the core status.
As mentioned earlier, acquisition and processing are driven by a 62.5MHz clock which is synchronized to the external reference. However, in order to guarantee the functionality of the system even when the PLL does not succeed in locking to the reference, the microprocessor and cores are driven by a local 50MHz clock. At the DSP IP interface, where both clock domains meet, data is exchanged asynchronously.
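The text does not state how the asynchronous exchange between the 62.5MHz and 50MHz domains is implemented. One common technique for such crossings is a dual-clock FIFO whose read and write pointers are Gray coded, so that only one bit changes per increment and a pointer sampled in the other domain is never wildly wrong. The sketch below only demonstrates that property; the FIFO depth and helper names are assumptions.

```python
def to_gray(n):
    """Convert a binary counter value to its Gray code."""
    return n ^ (n >> 1)


def from_gray(g):
    """Convert a Gray code back to the binary counter value."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


DEPTH = 16  # hypothetical FIFO depth (a power of two)
codes = [to_gray(i) for i in range(DEPTH)]

# Number of bits that change between consecutive pointer values,
# including the wrap-around from DEPTH - 1 back to 0.
bit_changes = [
    bin(codes[i] ^ codes[(i + 1) % DEPTH]).count("1") for i in range(DEPTH)
]
```

Every entry of `bit_changes` is 1, which is why a pointer captured during a transition resolves to either the old or the new value.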
C. Software Services
Following Xilinx's methodology, we have created device drivers for every custom core, so that they automatically integrate with the design tool.
In order to improve design flexibility and reusability, we rely on the synchronization and communication services provided by an RT/OS; in particular, μC/OS-II was selected and extended with a new timer service. Additionally, during system boot a filesystem based on Xilinx's libraries is mounted and the 
D. Application Tasks
The embedded application code is split into a set of concurrent tasks which are scheduled by μC/OS-II based on their priority. These tasks, as summarized in Fig. 5, are the following:
- Slow control task, which is responsible for all steps prior to acquisition, such as calibration or configuration, as well as for executing control commands generated by the master controller or the host computer.
- Acquisition and streaming task, which is woken up by the DSP interrupt handler and performs the actual data transmission to the computer through a UDP socket.
- DSP statistics task, which is periodically woken up by an OS timer to report the number of detected singles, lost events and transmitted singles, together with timer counter values. These data are useful for a posteriori sinogram corrections.
- Error management, which handles the error signals raised by the tasks, creates the corresponding report files and takes the appropriate recovery steps.
- LWIP engine, which keeps the TCP/IP stack alive.
- Self-diagnosis, which performs a periodic system self-test when the processor is idle.
- Shell, which provides a Unix-like command interface through the serial port, allowing the user to log into the system and access the system status, report files, etc.
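The priority-based scheduling of these tasks can be illustrated with a small model. In μC/OS-II each task has a unique priority and a lower number means a higher priority; the numeric values below are hypothetical, chosen only so that control tasks outrank acquisition, and the function is a simplified stand-in for the kernel scheduler.

```python
# Hypothetical priorities; a lower number means a higher priority.
TASKS = {
    "error_management": 3,
    "slow_control": 4,
    "acquisition_streaming": 6,
    "lwip_engine": 7,
    "dsp_statistics": 8,
    "shell": 10,
    "self_diagnosis": 12,
}


def next_task(ready):
    """Return the ready task with the highest priority, or None when idle."""
    if not ready:
        return None
    return min(ready, key=TASKS.__getitem__)


# The shell and the streaming task are ready: streaming runs first.
ready = {"shell", "acquisition_streaming"}
first = next_task(ready)

# A command from the host makes the slow control task ready;
# a preemptive kernel switches to it immediately.
ready.add("slow_control")
second = next_task(ready)
```

With this ordering a control command is served even while the streaming task is busy sending data.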
IV. RESULTS AND CONCLUSIONS
The on-chip hardware/software architecture, which is specially tailored to the needs of low-cost and flexible gamma detection in PET with small animals, has been completely specified, synthesized and implemented. Two different Spartan3-400k boards have been used for HW/SW integration and development.
An embedded processor is used for slow control and data streaming through TCP/IP sockets. The RT/OS scheduler guarantees that no high priority control task gets blocked by lower priority acquisition tasks.
Moreover, HW/SW components have been tuned to achieve efficient and high speed data streaming. In particular, the linker needs to be guided to place the software efficiently in memory.
Although SW development is still going on, we have assessed the streaming bandwidth to be around 40Mbps (~300kcps). However, there is still margin for architectural improvements to meet our expectations of a sustained maximum count rate of 500kcps per PS-PMT, which is the expected maximum number of singles for the conventional detector size with small animals.
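The relation between streaming bandwidth and count rate can be checked with the 15-byte packet size given for the DSP core. The sketch below counts payload bits only; protocol headers (Ethernet, IP, UDP) are ignored, which is an assumption, so real rates will be somewhat lower. Function names are illustrative.

```python
PACKET_BYTES = 15  # event packet size stated for the DSP core


def max_count_rate(bandwidth_bps, packet_bytes=PACKET_BYTES):
    """Events per second that fit in a given payload bandwidth."""
    return bandwidth_bps / (packet_bytes * 8)


def required_bandwidth(count_rate_cps, packet_bytes=PACKET_BYTES):
    """Payload bandwidth in bit/s needed to stream a given count rate."""
    return count_rate_cps * packet_bytes * 8


measured_rate = max_count_rate(40e6)          # roughly 333 kcps of payload
target_bandwidth = required_bandwidth(500e3)  # 60 Mbps of payload
```

Under these assumptions the measured 40Mbps corresponds to roughly 333kcps of payload, in line with the quoted ~300kcps once some overhead is allowed for, while the 500kcps target would need about 60Mbps.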
The current on-chip system fits tightly into our development boards, with the DSP core accounting for half of the area. We have already designed our own prototype with a 1M-Spartan3 that mimics the development kits currently assembled and includes new features to improve reliability and flexibility. Once our prototypes are up and running with the presented design, we will integrate our analog readout and start field acquisitions.
