Abstract: Address-event representation (AER) is an emergent hardware technology that shows high potential for providing, in the near future, a solid technological substrate for emulating brain-like processing structures. When used for vision, AER sensors and processors are not restricted to capturing and processing still image frames, as in commercial frame-based video technology, but sense and process visual information in a pixel-level event-based frameless manner. As a result, vision processing is practically simultaneous to vision sensing, since there is no need to wait for sensing full frames. Also, only meaningful information is sensed, communicated, and processed. Of special interest for brain-like vision processing are some already reported AER convolutional chips, which have revealed a very high computational throughput as well as the possibility of assembling large convolutional neural networks in a modular fashion. It is expected that in the near future we may witness the appearance of large scale convolutional neural networks with hundreds or thousands of individual modules. In the meantime, some research is needed to investigate how to assemble and configure such large scale convolutional networks for specific applications. In this paper, we analyze AER spiking convolutional neural networks for texture recognition hardware applications. Based on the performance figures of already available individual AER convolution chips, we emulate large scale networks using a custom-made event-based behavioral simulator. We have developed a new event-based processing architecture that emulates with AER hardware Manjunath's frame-based feature recognition software algorithm, and have analyzed its performance using our behavioral simulator. Recognition rate performance is not degraded. Regarding speed, we show that recognition can be achieved before an equivalent frame is fully sensed and transmitted.
I. INTRODUCTION
ARTIFICIAL man-made machine vision systems operate in a quite different way from biological brains. Machine vision systems usually capture and process sequences of frames. For example, a video camera captures images at about 25-30 frames per second, which are then processed frame by frame, pixel by pixel, usually with convolution operations, to extract, enhance, and combine features, and perform operations in feature spaces, until a desired recognition is achieved. This frame convolution processing is slow, especially if many convolutions need to be computed in sequence for each input image or frame.
Biological brains seem to not operate on a frame by frame basis. In the retina, each pixel sends spikes (also called events) to the cortex when its activity level reaches a threshold. Pixels are not read by an external scanner. Pixels decide when to send an event. All these spikes are transmitted as they are being produced, and do not wait for an artificial "frame time" before sending them to the next processing layer. 1 Besides this frameless nature, brains are structured hierarchically in cortical layers [1] . Neurons (pixels) in one layer connect to a projection field of neurons (pixels) in the next layer. This processing based on projection fields is similar to convolution-based processing [2] , at least for the earlier cortical layers. For example, it is widely accepted that the first layer of visual cortex V1 performs an operation similar to a bank of 2-D Gabor-like filters at different scales and orientations [3] whose actual parameters have been measured [4] - [6] . This fact has been exploited by many researchers to propose powerful convolution-based image processing algorithms [3] , [7] - [12] . However, convolutions are computationally expensive. It seems unlikely that the high number of convolutions that might be performed by the brain could be emulated fast enough by software programs running on the fastest of today's computers. Many researchers believe that a new hardware technology is required for approaching the processing capability of biological brains.
Address-event representation (AER) is a promising emergent hardware technology that shows potential for providing the computing requirements of large frameless projection-field-based multilayer systems. AER was first proposed in 1991 in one of the California Institute of Technology (Caltech) research labs [13], and has been used since then by a wide community of neuromorphic hardware engineers. AER has been used fundamentally in image sensors, for simple light intensity to frequency transformations [15], time-to-first-spike coding [16], [17], foveated sensors [18], contrast [19], [20], more elaborate transient detectors [21], and motion sensing and computation systems [22]. But AER has also been used for auditory systems [14], [23], competition and winner-takes-all networks [24], [25], and even for systems distributed over wireless networks [26]. However, the high potential of AER has become even more apparent since the availability of AER convolution chips [27], [28]. These chips, which can perform large arbitrary kernel convolutions (32 × 32 in [27]) at speeds of about 3 10 connections/s/chip, can be used as building blocks for larger cortical-like multilayer hierarchical structures, because of the modular and scalable nature of AER-based systems. Currently, only a small number of such chips have been used simultaneously 2 [29], but it is expected that hundreds of such modular AER convolution units could be integrated in a compact volume, such as a miniature printed circuit board (PCB) or into chips of the type known as networks-on-chip (NoC) [30]. This would eventually allow the assembly of large cortical-like convolutional neural networks and event-based frameless vision processing systems operating at very high speeds.
II. FRAME-BASED VERSUS EVENT-BASED SENSING AND PROCESSING
Fig. 1 illustrates the conceptual difference between a frame-based and an event-based sensing and processing system. Each uses a camera sensor to capture reality. In the top row, a frame-based camera captures a sequence of frames, each of which is transmitted to the computing system. Each frame is processed by sophisticated image processing algorithms for achieving some recognition. The computing system needs to have all pixel values of a frame before starting any computation. In the bottom row, an event-based vision sensor operates without frames. Each pixel sends an event (usually its own coordinate) when it senses something (change in intensity [21], contrast with respect to neighboring pixels [20], etc.). Events are sent out to the computing system as they are produced, without waiting for a frame time. The computing system updates its state after each event. Fig. 2 illustrates the inherent difference in timings between both concepts. In the top (frame-based), reality is binned into compartments of fixed duration, the frame time. During the first frame, an event happens (such as a flashing shape), but the information produced by this event does not reach the computing system until the full frame has been captured (at the end of the frame time) and transmitted (with an additional transmission delay). Then, the computing system has to process the full frame, handling a large amount of data and requiring a long frame computation time before the "recognition" information is available. In the bottom of Fig. 2, pixels directly "see" the event in reality and send out their own events to the computing system with a short delay. Events are processed as they flow, with an event computation delay of some nanoseconds [27]. Not all events are necessary for performing recognition. Actually, the more relevant events usually come out first or with higher frequency. Consequently, recognition time can be smaller than the total duration of the events produced.
Note that recognition is possible before the frame time has elapsed, resulting in a negative delay when compared to the recognition delay of a frame-based system. Fig. 3 provides an illustration of a typical operation of AER-based hardware [31]. In this case, the hardware is composed of one temporal contrast (motion) sensing retina of 128 × 128 pixels [21] that sends its output events to a 2-D convolution chip programmed with a 7 × 7 pixel vertical Gabor filter. A pixel in the retina sends out an event (which usually consists of its coordinate) every time its incident light intensity changes by a relative amount of at least 2.5%. Fig. 3(a) shows the 1500 events generated by the retina during about 80 ms when observing two persons walking. The receiver convolution chip processes each event as it comes in with a delay of about 90 ns [27]. Pixels in the 2-D array of integrators of the convolution chip will generate their own output events. Fig. 3(b) shows the 300 output events produced by the convolution chip during the same 80 ms. This 7 × 7 kernel typically requires between 5 and 20 spatio-temporally correlated input events to produce an output event. As soon as these events are fed to the convolution chip, the corresponding output event appears with a delay of 90 ns. Consequently, in practice, input and output event flows are simultaneous.
Fig. 3. (a) Persons walking captured with a 128 × 128 temporal contrast (motion) retina [21]. Pixels sensing a positive time derivative in light intensity send a positive event (white), while those sensing a negative time derivative send a negative event (black). Gray pixels are silent. The figure shows the events captured during an interval of about 80 ms with a total of about 1500 events. (b) As these pixel events are generated asynchronously by the motion retina, they are received and processed one by one by a receiver convolution chip programmed with a 7 × 7 vertical Gabor 2-D spatial filter. The computation delay in the convolution chip is 90 ns per event [27]. The figure shows about 300 output events produced during the same 80 ms by the convolution chip.
Interestingly, AER hardware sensing or processing modules can be assembled into large hierarchical structures, as if one assembles bricks [29] . This is because of the robustness and asynchrony of the AER communication links between the modules, and the availability of "glue" modules such as AER splitters, mergers, and mappers [29] , [32] .
While AER hardware technology takes its time to mature to the point where such large scale modular systems become available, the AER research community also needs to provide a more theoretical substrate for knowing how to assemble, configure, program, and train such systems. What is the optimum hierarchical structure for a desired application? What kernels are best? Can they be learned through a training process? What other parameters should be set? In this paper, our goal is to take a step in this more theoretical direction. We will concentrate on one potential application for AER convolution-based visual processing: texture recognition. Based on performance results of individual AER convolution chips already tested, our goal is to emulate, through behavioral simulations, a relatively large multimodule AER convolutional neural network for texture recognition, and estimate its eventual performance, especially in terms of speed response. We will use an AER behavioral simulator developed in Visual C++ [33], which allows any AER module to be described behaviorally (including timing and nonideal characteristics) and large netlists of many different modules to be assembled. This allows obtaining a realistic estimate of the processing delays of the simulated systems. As we will see, recognition retrieval performance is similar to state-of-the-art frame-based algorithms, while recognition time delays are such that the results are available before the equivalent frame sensing and transmission time.
[36], where pixel intensity is coded directly as pixel event frequency. 3 The continuous-time states of pixels in an emitter chip are transformed into sequences of fast digital pulses (spikes or events) of minimal width (in the order of nanoseconds) but with much longer inter-spike intervals (typically in the order of milliseconds). Each time a pixel generates a spike, its address is written on the interchip digital bus, after proper arbitration [13].
This is called an "address event." The receiver chip reads and decodes the addresses of the incoming events and sends spikes to the corresponding receiving pixels for reconstruction or further processing. This point-to-point communication in Fig. 4 can be extended to a multireceiver scheme [14] . Also, multiple emitters can merge their outputs into a smaller set of receiver chips [29] . Moreover, AER visual information can easily be translated or rotated by remapping the addresses during interchip transmission [37] , [38] . Complex processing such as convolutions has also been demonstrated [27] - [29] .
To illustrate how AER convolution is performed event by event (without frames), consider the example in Fig. 5. Fig. 5(a) corresponds to a conventional frame-based convolution, where a 5 × 5 input image is convolved with a 3 × 3 kernel, producing a 5 × 5 output image. Mathematically, this corresponds to sweeping the kernel over the full pixel array, as given in (1). In an AER system, shown in Fig. 5(b), an intensity retina sensing the same visual stimulus would produce events for some pixels only (those sensing a nonzero light intensity). Every time an event from the retina chip is received by the convolution chip, the kernel is added to the array of pixels (which operate as adders and accumulators) around the pixel having the same event coordinate. Note that this is actually a projection-field operation. This way, after the four retina events have been received and processed, the result accumulated in the array of pixels in Fig. 5(b) is equal to that in Fig. 5(a). In a more realistic situation, the retina pixel values are higher and more events are sent per pixel. However, note that more intense pixels have higher frequencies, and consequently, their events will start to come out earlier, and will be processed first. The first "wave front" of events is therefore more relevant for object recognition. AER visual sensors become significantly more efficient if they include some extra on-chip preprocessing, such as temporal [21] or spatial [20] contrast. In this case, only pixels with a minimum contrast level generate events. These pixels are the most meaningful for object/texture recognition. Using such sensors also dramatically increases the efficiency of the subsequent cortical processing, as the number of events is reduced by at least one order of magnitude while keeping the meaningful information content.
In AER systems, since events are processed by a multilayer corticallike structure as they are produced by the sensor, it is possible to achieve successful recognition after a fraction of the total number of events are processed [39] .
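The event-by-event accumulation described above can be illustrated with a short Python sketch, which is not part of the original work. It assumes integer event counts per pixel, an odd square kernel, and ideal accumulator pixels without firing thresholds or leakage; the function names and the example kernel are hypothetical. Under these assumptions, adding the kernel around each event address reproduces the frame-based result.

```python
def frame_convolution(image, kernel):
    """Frame-based reference: every output pixel gathers kernel-weighted inputs."""
    rows, cols = len(image), len(image[0])
    k = len(kernel) // 2  # kernel half-width (odd square kernel assumed)
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for m in range(-k, k + 1):
                for n in range(-k, k + 1):
                    r, c = i - m, j - n
                    if 0 <= r < rows and 0 <= c < cols:
                        out[i][j] += kernel[m + k][n + k] * image[r][c]
    return out


def event_convolution(events, kernel, rows, cols):
    """Event-driven version: each event adds the kernel around its own address."""
    k = len(kernel) // 2
    acc = [[0] * cols for _ in range(rows)]
    for x, y in events:  # events arrive one by one, in any order
        for m in range(-k, k + 1):
            for n in range(-k, k + 1):
                r, c = x + m, y + n
                if 0 <= r < rows and 0 <= c < cols:
                    acc[r][c] += kernel[m + k][n + k]
    return acc


# Hypothetical 5 x 5 stimulus: pixel values are the number of events each pixel emits.
image = [[0] * 5 for _ in range(5)]
image[1][1], image[1][3], image[3][2] = 1, 1, 2
events = [(1, 1), (1, 3), (3, 2), (3, 2)]  # four events in total
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

accumulated = event_convolution(events, kernel, 5, 5)
reference = frame_convolution(image, kernel)
```

In this sketch `accumulated` and `reference` hold the same array, and the accumulator is already partially meaningful after any prefix of the event list, which is the property exploited for early recognition.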
IV. TEXTURE-BASED AER RETRIEVAL
We have developed an AER system for computing Manjunath's Gabor wavelet features for texture analysis [35] . By performing texture analysis using Gabor filters (2-D convolutions) at different scales and orientations, these patterns can be efficiently described in the frequency domain and localized in the spatial domain. Texture is analyzed by applying a bank of scale and orientation Gabor filters to an image. Next we summarize the sequence of computations performed in Manjunath's method [35] , and indicate how we have adapted them for an AER hardware system.
A. Manjunath's Frame-Based Method
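A 2-D Gabor function can be written as

$$g(x,y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2}+\frac{y^2}{\sigma_y^2}\right)+2\pi j W x\right] \qquad (2)$$

where $\sigma_x$, $\sigma_y$, and $W$ are its geometric parameters. Let $g(x,y)$ be the mother wavelet. A Gabor filter bank can be obtained by appropriate dilations and rotations of $g(x,y)$ through the generating function

$$g_{mn}(x,y) = a^{-m}\, g(x',y'), \quad x' = a^{-m}(x\cos\theta_n + y\sin\theta_n), \quad y' = a^{-m}(-x\sin\theta_n + y\cos\theta_n) \qquad (3)$$

where $n$ represents orientation (with $\theta_n = n\pi/K$ for $K$ orientations) and $m$ the scale (with $a > 1$). The filter bank parameters are computed by Manjunath's method [35]. Given an image $I(x,y)$, its Gabor wavelet transform is then defined as

$$W_{mn}(x,y) = \iint I(x_1,y_1)\, g_{mn}^{*}(x-x_1,\, y-y_1)\, dx_1\, dy_1 \qquad (4)$$

The mean $\mu_{mn}$ and the standard deviation $\sigma_{mn}$ of the magnitude of the transform coefficients

$$\mu_{mn} = \iint |W_{mn}(x,y)|\, dx\, dy, \qquad \sigma_{mn} = \sqrt{\iint \left(|W_{mn}(x,y)| - \mu_{mn}\right)^2 dx\, dy} \qquad (5)$$

are used to represent the region for classification and retrieval purposes. In our AER implementation, we will not compute $\sigma_{mn}$ as given in (5), but as

$$\sigma_{mn} = \iint \big|\, |W_{mn}(x,y)| - \mu_{mn} \big|\, dx\, dy \qquad (6)$$

without any degradation in performance. A feature vector is now constructed using $\mu_{mn}$ and $\sigma_{mn}$ as feature components. In the experiments, we use four scales and six orientations, resulting in a 48 component feature vector

$$\bar{f} = [\mu_{00}\ \sigma_{00}\ \mu_{01}\ \sigma_{01}\ \cdots\ \mu_{35}\ \sigma_{35}] \qquad (7)$$

Consider two image patterns $i$ and $j$, and let $\bar{f}^{(i)}$ and $\bar{f}^{(j)}$ represent the corresponding feature vectors. The distance between the two patterns in feature space is then defined as

$$d(i,j) = \sum_{m}\sum_{n} d_{mn}(i,j), \qquad d_{mn}(i,j) = \left|\frac{\mu_{mn}^{(i)} - \mu_{mn}^{(j)}}{\alpha(\mu_{mn})}\right| + \left|\frac{\sigma_{mn}^{(i)} - \sigma_{mn}^{(j)}}{\alpha(\sigma_{mn})}\right| \qquad (8)$$

where $\alpha(\mu_{mn})$ and $\alpha(\sigma_{mn})$ are the standard deviations of the respective features over the entire database, and are used to normalize the individual feature components. For database texture retrieval, the feature vector of a new input image is compared with a precomputed database of feature vectors. Computation of $d(i,j)$ is fast. However, computing the feature vector is a slow process in conventional computers.

Fig. 6. Scheme of the AER-based system implemented for texture-based retrieval of images.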
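As a minimal sketch of the distance computation, assuming the normalized sum of absolute differences used in Manjunath's method [35], the comparison of two feature vectors could look as follows in Python. The function names and the toy two-component database are hypothetical, and a component with zero spread over the database would need special handling that is omitted here.

```python
from statistics import pstdev


def component_spreads(database):
    """Standard deviation of each feature component over the whole database."""
    return [pstdev(column) for column in zip(*database)]


def feature_distance(f_i, f_j, spreads):
    """Sum of absolute component differences, each normalized by its spread."""
    return sum(abs(a - b) / s for a, b, s in zip(f_i, f_j, spreads))


# Toy database with two feature components per pattern (real vectors have 48).
database = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
spreads = component_spreads(database)
d01 = feature_distance(database[0], database[1], spreads)
```

The normalization keeps components with large numerical ranges from dominating the distance, and the whole computation is a single pass over 48 numbers, which is why it is cheap compared with feature extraction.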
B. Adaptation of Manjunath's Method to AER Convolutional Event-Based Hardware
Our AER system implements a slightly modified version of the algorithm originally proposed by Manjunath for texture retrieval. The AER system is shown in Fig. 6. It has three layers. The first one is composed of a splitter module and 24 AER convolution modules in parallel. It implements a Gabor filter bank with four scales and six orientations. In [40], this configuration of filters was demonstrated to provide the best results. An input texture image is coded by events at intervals of 50 ns. These events are fed to a splitter module that replicates them on the 24 output channels. Each output channel is connected to a convolution module that uses as kernel the real part of a Gabor wavelet with its own scale and orientation. In the system of Fig. 6, each convolution module in the first layer is configured to change the sign bit of negative output events to positive (this is a full-wave rectification). This way, the output of each convolution module represents the magnitude of the corresponding filter response. Note that adding more modules to layer "1" increases the number of scales and orientations in the bank of Gabor filters. This improves classification performance. However, note that adding more modules to a layer will not increase the processing delay of the hardware.
Layer "2" consists of 24 feature extraction modules (FEM in Fig. 6). A FEM module is shown in Fig. 7. The first block is a splitter with three output channels. The top channel (labeled "2" in Fig. 7) goes directly to layer 3, thus providing an AER representation of the rectified convolution output. The bottom channel (labeled "5") goes to an internal merger module with a hardwired positive sign. The central channel (labeled "3") goes to an internal mapper. This mapper ignores the address of the incoming event, and generates a new address by sequentially sweeping all addresses. Consequently, at the mapper output, a uniform AER image is represented with the same number of events as the rectified convolution output. Thus, this represents the mean of (5). This mean is fed to the internal merger with a hardwired negative sign. Consequently, at the merger output, we have all events of the rectified convolution output with a positive sign and all events of the uniform mean image with a negative sign. After convolving them with a unitary kernel C and changing the negative output event signs to positive, the output will represent the absolute deviation from the mean (9). Finally, for each of its two input channels per FEM, layer 3 will count the total number of events (regardless of their addresses) per unit time. We will use these numbers to create our feature vector, described in (10). The two event counts of each FEM represent the mean of (5) and the deviation of (6), respectively, and together form the extracted characteristic feature vector for the input texture. Although slightly different from Manjunath's vector in (7), retrieval performance will not degrade, as shown in Section V and in the Appendix.
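The two quantities counted by layer 3 for one FEM can be sketched in Python as follows. This is an idealized illustration under stated assumptions, not the simulator of [33]: the rectified convolution output is given as a hypothetical array of per-address event counts, and thresholds, timing, and arbitration are ignored.

```python
def fem_event_counts(rectified):
    """Return the two idealized event counts that layer 3 sees for one FEM.

    rectified: 2-D list of nonnegative event counts per address, i.e. the
    rectified output of one layer-1 convolution module.
    """
    pixels = [v for row in rectified for v in row]
    total = sum(pixels)  # events on the direct channel (represents the mean)
    # The mapper re-emits the same number of events swept uniformly over all
    # addresses, so each address receives total / len(pixels) negative events.
    mean = total / len(pixels)
    # The merger, unitary kernel, and rectification leave |count - mean| per address.
    deviation = sum(abs(v - mean) for v in pixels)
    return total, deviation


counts = fem_event_counts([[4, 0], [2, 2]])  # (8, 4.0)
```

Only additions, sign changes, and counting are involved, which is why the absolute deviation is a natural replacement for the squared deviation in event-based hardware.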
V. RESULTS
In this section, we provide a performance evaluation of an eventual hardware implementation. For this we use a behavioral simulator developed in Visual C++ [29], [33], which allows large modular AER systems to be tested. The performance characteristics of the AER modules employed (convolution chips, mergers, splitters, and mappers) are obtained from already manufactured, tested, and reported AER modules [27], [29], [32], [39]. Unfortunately, those AER chips are presently experimental prototypes and only a small number of them are available. At this moment, it is therefore not possible to assemble large AER systems like the ones discussed in this paper. However, using the module performance characteristics together with the AER behavioral simulator, we can obtain a good estimate of the overall system performance.
We have used the Brodatz database [41], which consists of 112 images. Each image has been divided into 16 nonoverlapping subimages of 90 × 90 pixels, thus creating a database of 1792 texture images. These images have been rate-coded into events separated by 50 ns, creating stimulus bursts of 30 ms on average. 4 We used our C++ behavioral simulation tool to estimate the performance of an eventual hardware implementation. The 48 channel outputs of layer 2 (see Fig. 6) obtained for each of the images in the database were collected during the 30 ms (duration of the input burst) to create the feature vector database.
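The relation between event spacing, burst duration, and number of events can be checked with a few lines of Python. The figures of 50 ns, 30 ms, and 90 × 90 pixels come from the text; the function names are hypothetical and the per-pixel average is only an order-of-magnitude illustration.

```python
EVENT_SPACING_NS = 50  # spacing between consecutive input events


def events_in_burst(burst_ms, spacing_ns=EVENT_SPACING_NS):
    """Number of events that fit in a burst of the given duration."""
    return burst_ms * 1_000_000 // spacing_ns


def burst_duration_ms(n_events, spacing_ns=EVENT_SPACING_NS):
    """Duration of a burst made of n_events equally spaced events."""
    return n_events * spacing_ns / 1_000_000


total_events = events_in_burst(30)           # 600000 events in an average burst
events_per_pixel = total_events / (90 * 90)  # about 74 events per pixel
```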
In what follows, a query pattern is any one of the 1792 patterns in the database. This pattern is processed to compute the feature vector as in (10). The distances (8) between the query pattern and every other pattern in the database are computed and sorted in increasing order. Only the closest patterns are retrieved. Ideally, all top 15 retrievals are from the same large image. The performance is measured in terms of the average retrieval rate, which is defined as the average percentage of patterns belonging to the same image as the query pattern in the top 15 matches. Table I summarizes the results. It shows the retrieval accuracy for each of the 112 texture classes in the database, comparing our AER-based method with the original Manjunath results. As can be seen, the retrieval accuracies are approximately equal.
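The average retrieval rate defined above can be sketched in Python as follows. The function name, the generic `distance` argument, and the toy one-dimensional features are hypothetical; in the experiment each class has 16 patterns, so the top 15 matches of a query can at best be the 15 other patterns of its class.

```python
def average_retrieval_rate(features, labels, distance, top=15):
    """Average percentage of same-class patterns among the top matches."""
    n = len(features)
    hits = 0
    for q in range(n):
        # Rank every other pattern by its distance to the query pattern.
        others = [j for j in range(n) if j != q]
        others.sort(key=lambda j: distance(features[q], features[j]))
        hits += sum(labels[j] == labels[q] for j in others[:top])
    return 100.0 * hits / (n * top)


# Toy example: two well-separated classes of three patterns each.
features = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
labels = ["A", "A", "A", "B", "B", "B"]
rate = average_retrieval_rate(
    features, labels, lambda a, b: abs(a[0] - b[0]), top=2
)
```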
To estimate the minimum time for correct texture retrieval, we proceeded as follows. Input stimuli lasted for about 30 ms. Layer 3 counts events coming from the 48 layer 2 output channels during a given counting time. This counting time was increased in steps of 15 μs from 0 to 30 ms. We found that, for a counting time of approximately 10 ms, the results were similar to those shown in Table I. Consequently, an AER hardware implementation would be able to achieve correct texture retrieval in about 10 ms. As an illustration, Fig. 8 shows the retrieval accuracy as a function of the counting time for six of the texture images in [41]. As can be seen, after 10 ms, the retrieval accuracy has stabilized; this is 20 ms before the input stimulus is finished.
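The counting-time sweep can be sketched in Python as follows, assuming a 15 μs step and event timestamps expressed in microseconds. The function names and the toy timestamps are hypothetical; in the real experiment each window would yield a 48 component feature vector that is then used for retrieval.

```python
def counts_within(channels, window_us):
    """Number of events per channel with a timestamp no later than window_us."""
    return [sum(1 for t in times if t <= window_us) for times in channels]


def sweep(channels, step_us=15, stop_us=30_000):
    """Yield (window, per-channel counts) for growing counting windows."""
    for window in range(0, stop_us + 1, step_us):
        yield window, counts_within(channels, window)


# Toy example with two channels of event timestamps (microseconds).
channels = [[5, 20, 40], [10, 10, 100]]
early = counts_within(channels, 30)  # [2, 2]
windows = list(sweep(channels))
```

Retrieval accuracy is then evaluated once per window, and the shortest window at which it matches the full-burst accuracy gives the recognition time.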
In the Appendix, retrieval performance is compared against other state-of-the-art texture retrieval algorithms. The conclusion is that retrieval rate is not degraded in an AER implementation, but speed response is dramatically improved since recognition is achieved before the equivalent frame becomes fully available (see Table III in Appendix).
VI. DISCUSSION
AER is an emerging hardware technology with great potential for providing complex cortical-like sensory-processing systems. Of special interest is its potential for providing very fast spike-processing convolutional neural networks with complex hierarchical structures, similar to those found in biological cortex. Recent work on individual AER convolutional chips reveals the outstanding capabilities of such components as "bricks" for larger, highly sophisticated, and hierarchically structured cortical-like sensory processing systems. To date, the largest AER multimodule system reported uses only four processing stages, one of which is a convolution [29]. We believe that we are not far from seeing systems made out of several hundreds (or thousands) of AER convolutional modules. NoC technology could host around 100 individual convolutional modules on a single chip, and about 100 such chips could be put on one single PCB. Consequently, a small physical volume like a desktop computer could easily hold 20-40 such PCBs, providing a total of 200 000 to 400 000 convolution modules, that is, almost half a million. However, currently, it is not obvious what architectural structures should be used to assemble these AER convolutional "bricks" and how to set their parameters for a desired (recognition) application. In this paper, we have concentrated on one such possible application, texture recognition, emulated it with a behavioral AER simulator, and used it as an exercise to see how to set up such a system and its parameters, and to estimate the performance of multilayer AER convolutional systems. Software works that use massive convolutions for vision processing are starting to appear in the literature. For example, in texture recognition, experiments in recent years have demonstrated that filter-based schemes provide excellent results [42], [61]-[63].
However, massive convolutions on conventional computers result in excessive computational times, making such approaches impractical for real-world applications. In general, vision processing researchers tend to avoid the use of convolutional processing because of its excessive computational load. For example, quoting Serre et al. [3], who use a first stage with 64 Gabor filters (for an input image of 128 × 128 pixels), the main limitation of their powerful recognition system is the delay of this first stage, which requires several tens of seconds. AER-based spiking hardware could perform this processing with delays of a few milliseconds, or fractions of milliseconds, while the visual input is being sensed.
TABLE III. COMPUTATION TIMES USING THE BRODATZ DATABASE
In all reported approaches for texture recognition, there is a relationship between the length of the feature vector and the computational time. The longer the feature vector, the longer the feature extraction time. In AER convolutional hardware, this is not the case, because all the elements of the feature vector are computed in parallel. Consequently, it is possible to increase the feature vector length or elements [54] to improve retrieval rate, without increasing feature extraction time, although at the cost of using more hardware "bricks." Actually, novel approaches for texture retrieval are based on the use of filters that take into account more frequencies or scales [64] , [67] and produce less redundant features as compared to other wavelets (Gabor wavelet in our case).
This property is not specific to the texture retrieval application, but is generic for AER convolutional hardware: increasing the number of convolutional filters in a layer does not degrade the speed response of the overall system. This is because the filters receive the same input events simultaneously and process them in parallel. The hardware will add some delay to distribute the events to a larger number of receivers, but this extra delay will be in the order of nanoseconds, and consequently not perceived by the overall system. We have observed that the main potential for introducing delays in a multimodule AER system comes from the finite bandwidth of individual AER links. For present day reported AER links, a typical bandwidth is in the order of 10-30 Meps (mega events per second). The output event rate of retina sensors is usually below 1 Meps. However, when merging several AER module outputs into one single AER channel, especially if we are thinking of several hundreds for the near future, it is realistic to expect that the limited AER link bandwidth could easily end up being the main delay bottleneck for such systems. A possible solution to this problem is hierarchical merging of outputs combined with replicating the number of AER links to increase bandwidth. Also, we have observed that event traffic is higher for the first stages and is gradually reduced as convolutional processing compresses and extracts relevant information.
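The link bandwidth argument can be made concrete with a small Python estimate. The 10-30 Meps link bandwidth comes from the text; the number of merged modules and the per-module output rate of 1 Meps are assumptions chosen only for illustration, and the function name is hypothetical.

```python
from math import ceil


def links_needed(n_modules, rate_meps, link_bandwidth_meps):
    """Minimum number of parallel AER links that can carry the merged traffic."""
    return ceil(n_modules * rate_meps / link_bandwidth_meps)


# Assumed scenario: 100 module outputs of 1 Meps each merged onto shared links.
slow_links = links_needed(100, 1.0, 10.0)  # 10 links of 10 Meps
fast_links = links_needed(100, 1.0, 30.0)  # 4 links of 30 Meps
```

With a single link, the same traffic would exceed the available bandwidth several times over, so events would queue at the merger and the link, not the convolution modules, would set the system latency.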
Perhaps the most interesting observation is that in AER sensory processing hardware, processing is performed as events are communicated between modules. As a retina is sending out its events they are sent directly to the processing structure and are processed as they flow in. In the same way, each "brick" processes its input events as they flow in and generates new ones. This way the whole system operates as if a wave of (visual) information (in the form of flow of events) travels through the convolutional structure while it is processed. Since processing is on a per event basis, stages do not wait for transmitting full "images" before processing them, thus reducing drastically the latency between input and output information flow.
What we have found with the specific example analyzed in this paper is that, when mapping a known convolutional (frame-based) processing algorithm to AER hardware: 1) the recognition performance remains similar and also comparable to state-of-the-art computational methods not based on convolutions (or filters), and 2) if some day we are able to physically build this hardware, it will be capable of providing output recognition while the input stimulus is being produced by the sensors.
VII. CONCLUSION AND FUTURE WORK
This paper shows performance results for a relatively large multimodule multilayer convolutional neural network frameless AER processing system, estimated through behavioral simulations but using performance figures of real individual AER hardware modules already available. A texture classification system based on Manjunath's method has been analyzed. This scheme uses 48 AER convolutional modules plus a similar number of interfacing modules, such as splitters, mergers, and mappers. We have shown that the recognition performance of the AER system is equivalent to its original frame-based reference. However, if built with realistic AER hardware, recognition is achieved while the sensory stimulus is being generated. This would be equivalent to stating that an AER system has a negative processing delay when compared to a frame-based system, where each frame has to be fully available before starting any recognition computation.
Thus, AER systems reveal some interesting properties. First, they are not constrained to frames and the output is often available even before the input stimulus has finished. Processing delay is given mainly by the number of layers and the number of events needed to represent the input stimulus. The processing capability of such systems is increased by adding more modules per layer, but without increasing the number of layers. Consequently, processing capability can be increased without penalizing delays, although at the cost of adding hardware.
Currently, the available AER hardware modules are quite preliminary, although their performance figures provide very promising system level performance estimations. Future work is focused mainly on miniaturizing present AER modules so that a large number of them (several hundred) could fit on a single PCB or in a large NoC chip. Also, such multimodule elements should allow a large degree of reconfigurability and reprogrammability, so that many different applications can easily be set up. In parallel with the hardware developments, future work also has to focus on analyzing other system level applications, while developing new theoretical frameworks more specific to event-based frameless processing and learning techniques.
APPENDIX COMPARISON TO STATE-OF-THE-ART TEXTURE RETRIEVAL
The commonly used methods for texture characterization can be divided into three categories: statistical, model-based, and filtering approaches [42] . Statistical methods such as cooccurrence features [43] , [44] describe the tonal distribution in textures. Model-based methods such as Markov random field (MRF) [45] and simultaneous autoregressive (SAR) models [46] provide a description of texture in terms of spatial interaction. Most of the statistical and model-based approaches for texture classification consider spatial interactions over relatively small neighborhoods. Therefore, these approaches are more apt only for microtextures [47] , [48] . Filtering approaches including wavelet [49] , [50] , Gabor filters [47] , [51] , steerable pyramid [52] , and directional filter bank (DFB) [53] , [54] characterize textures in the frequency domain. Among the three categories, MPEG-7 has adopted Gabor-like filtering for texture description [55] . The rationale behind is that visual cortex is sensitive to localized frequency components [56] . It has been shown that the direction together with scale information is important for texture perception. In the last decade, researchers have been combining different methods in order to provide a better classification and retrieval of images. Fusion of different types of texture features can be found in the literature [57] - [60] . A comprehensive performance evaluation on filtering (i.e., spectral-based) methods for texture classification is presented in [42] , which suggests that no single set of features derived from filtering approaches has consistent superior performances on all textures. Other comparative studies about all these methods can be found in [61] - [63] .
In [64], two fast algorithms for multiscale directional filter banks (MDFB) are proposed. These two algorithms are compared with the previous algorithm for MDFB proposed in [68] and with the contourlet transform [71], [72] in terms of feature extraction (FE) time and total computational time. In [65], a texture representation is presented that is suitable for recognizing images of textured surfaces under a wide range of transformations, including viewpoint changes and nonrigid deformations. At the feature extraction stage, a sparse set of affine Harris and Laplacian regions is found in the image. Each of these regions can be thought of as a texture element having an elliptic-shape characteristic and a distinctive appearance pattern. The approach achieves a maximum average retrieval rate of 76.26% when combined Harris and Laplacian descriptor channels are used. In [66], a linear family of filters is introduced, which provides certain scale invariance, resulting in a texture description invariant to local changes in orientation, contrast, and scale, and robust to local skew. Then, a texture discrimination method based on the similarity measure is applied to the histograms derived from the filter responses. This approach achieves a maximum average retrieval rate of 78.5%. In [67], the authors propose an approach for rotation-invariant texture image retrieval that jointly uses a set of dual-tree rotated complex wavelet filters (DT-RCWF) and the DT complex wavelet transform (DT-CWT). They compare the average retrieval accuracy obtained using the standard real DWT, DT-CWT, and a combination of DT-CWT and DT-RCWF. In [54], rotation-invariant and scale-invariant Gabor representations are proposed, where each representation requires only a few summations on the conventional Gabor filter impulse responses. The results show that the new implementations behave better than the conventional Gabor-based scheme when rotated or scaled images are considered. However, a conventional Gabor-based scheme provides better results when no rotation or scaling is considered.
In [68], an MDFB is first proposed and compared with Gabor filters in polar form [73] and the steerable pyramid [74] in terms of retrieval accuracy. In [69], fractal-code signatures are proposed for texture-based retrieval of images. Fractal image coding is a block-based scheme that exploits the self-similarity hidden within an image. By combining fractal parameters and collage error, a set of statistical fractal signatures is proposed. In [70], image signatures constructed from the bit planes of wavelet sub-bands are presented [bit plane signature (BP) and three-pass layer probability (TPLP) signature]. As can be observed, the method that provides the highest ARR is filter based: it is the combination of DT-CWT and DT-RCWF implemented by Kokare et al. [67].
In Table II, we compare our AER event-based method with those reported in [54] and [64]-[68] and with Manjunath's approach [35] in terms of average retrieval rate (ARR) using the entire Brodatz database. In Table III, we compare our method with those published in [64], [67], [69], and [70] and also with Manjunath's method [35], in terms of computation times. We distinguish between an FE time [time required to obtain a feature vector of the type in (7)] and a searching and sorting time (additional time to classify the texture: computation of terms , sorting them, and selecting the best match). The sum of both is the total computation time. Note that, because of the conceptual difference between a frame-based and an event-based approach, total computation time for a frame-based system is (as defined in Fig. 2), while for an event-based system it is (as defined in Fig. 2). Consequently, comparing the computational delay of the two approaches by simply comparing times and is not a fair comparison. It is more realistic to either compare against , or the time between the instant a frame is fully available ( in Fig. 2) and the instant the computing system provides a recognition result:
for a frame-based system against (see Fig. 2) for an event-based system. Note that the latter ends up being negative.
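The searching and sorting step, and the per-query retrieval rate that is averaged to obtain an ARR, can be illustrated with the following minimal sketch. It is not the implementation used in any of the compared works: the normalized L1 distance, the function names, and the toy feature vectors are all assumptions introduced only for illustration.

```python
from typing import List, Sequence, Tuple

Entry = Tuple[str, Sequence[float]]  # (texture label, feature vector)


def distance(query: Sequence[float],
             candidate: Sequence[float],
             scale: Sequence[float]) -> float:
    """Normalized L1 distance between two feature vectors.

    Assumption: the metric is chosen for illustration; 'scale' holds one
    positive normalization constant per feature component.
    """
    return sum(abs(q - c) / s for q, c, s in zip(query, candidate, scale))


def search_and_sort(query: Sequence[float],
                    database: List[Entry],
                    scale: Sequence[float]) -> List[Tuple[str, float]]:
    """Compute all distance terms and sort them, best match first."""
    scored = [(label, distance(query, feat, scale)) for label, feat in database]
    return sorted(scored, key=lambda pair: pair[1])


def retrieval_rate(ranked_labels: Sequence[str],
                   true_label: str,
                   n_relevant: int) -> float:
    """Fraction of the top n_relevant results that share the query's label."""
    top = ranked_labels[:n_relevant]
    return sum(1 for label in top if label == true_label) / n_relevant


# Toy database: two hypothetical texture classes, two samples each.
database: List[Entry] = [
    ("bark", [1.0, 2.0]),
    ("bark", [1.2, 2.1]),
    ("sand", [5.0, 0.5]),
    ("sand", [4.8, 0.7]),
]
scale = [1.0, 1.0]
query = [1.1, 2.0]  # hypothetical feature vector of a "bark" sample

ranked = search_and_sort(query, database, scale)
best_label = ranked[0][0]
rate = retrieval_rate([label for label, _ in ranked], "bark", n_relevant=2)
```

Averaging `retrieval_rate` over every query image in a database would give an ARR-style figure under these assumptions; the exact protocol (number of retrieved images, treatment of the query itself) differs between the cited works.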
Begoña Acha received the Ph.D. degree in telecom
