This work explores the use of a wavelet transform, a feature extraction mechanism, and a neural network to classify audio samples as belonging to either a voice class or a music class. The proposed system was implemented as a digital design in VHDL and was synthesized with Synopsys Design Compiler, using the LSI-10K synthesized library cells with a clock frequency of 11.025 kHz. This wavelet neural network design was effective in correctly identifying the test data sets.
I. INTRODUCTION
With the advent of neural network processing, computer systems have been able to replace, and in some cases improve upon, human operators. One such application is auditory processing. Auditory processing is being developed to solve problems such as voice recognition, speaker recognition, multimedia indexing, SONAR analysis, and many others. In order to present audio samples to a neural network, the important features of the sample must first be extracted. Discrete wavelet transform processing and feature extraction techniques have been shown to be effective in this task. The proposed system incorporates a discrete wavelet transform processor, a feature extraction processor, and a neural network to classify audio samples as belonging to either a voice class or a music class. The system was implemented as a digital design using VHDL as the hardware description language.
II. WAVELET TRANSFORM THEORY AND APPLICATION
Signal analysis has benefited from mathematical tools, such as the Fourier Transform and, more recently, Wavelet Transforms, which manipulate the signal into a more meaningful form for analysis. In the case of digital audio processing, a discrete transform can be used in the analysis of the digital signal. Discrete Wavelet Transform (DWT) analysis is of greater benefit to audio processing than Discrete Fourier Transform (DFT) analysis, because DWT analysis is a multi-resolution representation that includes both time and frequency localization, whereas DFT analysis only provides frequency localization [1]. Features generated from time and frequency information are of greater value to audio analysis than features generated from frequency information alone, because this closely approximates human auditory processing [2].
The Haar wavelet transform was chosen as the DWT for this project because it lends itself well to a digital design, and it is the simplest of the wavelet transforms, yet has been proven to be effective in signal analysis. The Haar transform can be implemented rather easily, it requires minimal intermediate storage, and its computation is bounded by order O(n), which allows for a pipelined digital design [1]. Conceptually, the Haar wavelet transform is a recursive filtering, beginning with the original sample vector, which is processed by both a low-pass and a high-pass filter. The output of the high-pass filter is collected into the result vector as detail coefficients, and the output of the low-pass filter is successively filtered as previously described. This repeats until each filter generates only one coefficient. The output of the final low-pass filter is the approximation coefficient. This process is depicted graphically in Figure 1 [3].
The low-pass filter produces outputs that are the sum of adjacent inputs divided by two, and the high-pass filter produces outputs that are the difference of adjacent inputs divided by two. The outputs of each filter are of the same number as the inputs, but have half the frequency band. Therefore, half of the outputs can be discarded, according to the Nyquist Criterion. Thus, the transform process can be performed in log2(n) stages, where n is the number of input samples. This transform produces n-1 detail coefficients, and 1 approximation coefficient. If the process were performed in reverse, the original signal could be reconstructed; therefore there is no loss of information in this procedure [3] .
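The filtering scheme described above can be sketched in a few lines of Python. This is a minimal software illustration, not the VHDL design: the function names are hypothetical, the input length is assumed to be a power of two, and the high-pass difference is assumed to be taken as first sample minus second sample.

```python
def haar_step(x):
    """One stage: pairwise half-sum (low-pass) and half-difference (high-pass).

    Only one output per input pair is kept, which is the discarding of
    half of each filter's outputs described in the text.
    """
    low = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return low, high


def haar_dwt(samples):
    """Return (approximation, details); details[0] is the first (finest) stage."""
    n = len(samples)
    if n < 1 or n & (n - 1):
        raise ValueError("number of samples must be a power of two")
    details = []
    low = list(samples)
    while len(low) > 1:          # runs log2(n) times
        low, high = haar_step(low)
        details.append(high)
    return low[0], details


def haar_idwt(approx, details):
    """Invert haar_dwt: rebuild each pair as (a + d, a - d), coarsest stage first."""
    low = [approx]
    for high in reversed(details):
        rebuilt = []
        for a, d in zip(low, high):
            rebuilt.extend([a + d, a - d])
        low = rebuilt
    return low


approx, details = haar_dwt([4, 2, 6, 8])
print(approx, details)
print(haar_idwt(approx, details))
```

For four input samples the sketch runs two stages and yields three detail coefficients and one approximation coefficient, and the inverse routine recovers the original samples exactly.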
A novel approach was taken in the design of the wavelet transform processor for this project. The wavelet transform processor was designed to accept [...] Due to synchronization issues, the process takes more than 256 cycles to complete. Therefore, temporary save registers were used to prevent the results from being overwritten by the processing of the next block of samples. The wavelet transform processor's outputs are 256 4-bit wavelet coefficients. A block diagram of this system is shown in Figure 2.
III. FEATURE EXTRACTION THEORY AND APPLICATION
The data provided by the wavelet decomposition process, though multi-resolution and time- and frequency-localized, is not in an acceptable form for neural network processing. The wavelet coefficient data set is overly large, and includes data that may be meaningless or misleading to the training or classifying process of neural networks.
A feature extractor processor is required to generate meaningful features from the wavelet coefficients. The feature extraction process requires initial setup steps that involve processing a training set of data. This training set is the same data set that will be used to train the neural network. The use of the training set allows the feature extraction process to be customized for the data space of interest. The feature extraction process depends on the use of clusters to identify groups of wavelet coefficients. The number of clusters and the size of each cluster are determined by the following procedure. First, the coefficients are placed in a matrix, B, such that each row vector contains all of the detail coefficients for a given level of decomposition, and the last row vector contains the approximation coefficient. All empty spaces in the matrix are filled with zeros. For each data set in the training set, a matrix, B, is formed and is inserted into an array of matrices, BK, where K is the length of the training set. Let I be the matrix of the same size as B but containing only 1s as its elements. Let R be the operation that reduces a matrix by its last row. A matrix, G, is calculated by the following equation.
is obtained [4]. The binary matrix, Gh, contains 1s at the center of the proposed clusters. If a row vector of Gh contains no 1s, the entire row vector is treated as a cluster. This ensures that clusters do not overlap across different scales.
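The layout of the coefficient matrix B used in this procedure can be illustrated with a short Python sketch. The function name is hypothetical, and two details are assumptions that the text does not state: rows are ordered from the first decomposition level to the last, and zeros are padded on the right of each row.

```python
def coefficient_matrix(details, approx):
    """Build B: one row of detail coefficients per level, then the approximation row.

    details -- list of lists, one list of detail coefficients per decomposition level
    approx  -- the single approximation coefficient
    Shorter rows are padded with zeros on the right (an assumed convention).
    """
    width = max([1] + [len(row) for row in details])
    rows = [list(row) + [0] * (width - len(row)) for row in details]
    rows.append([approx] + [0] * (width - 1))
    return rows


# Four-sample example: two detail levels plus the approximation coefficient.
for row in coefficient_matrix([[1, -1], [-2]], 5):
    print(row)
```

One such matrix would be formed for every data set in the training set before the cluster analysis is carried out.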
Extracting the features of the wavelet coefficients involves using the clusters that were generated. For each cluster, U, the feature, u, is calculated by taking the square root of the sum of the squares of each coefficient, v, in the cluster, also known as the Euclidean norm [4].

$$u = \sqrt{\sum_{v \in U} v^{2}} \tag{3}$$

The digital design of the feature extractor processor closely follows the feature extraction process described previously. In a software development environment, the training set of audio samples was transformed by the wavelet process into wavelet coefficients. The feature cluster analyzer software, as described above, then processed these wavelet coefficients. This resulted in the discovery of 34 clusters for the data space considered in this project. The digital design of the feature extractor processor consists of a feature extractor module, which contains 34 cluster processors. Each cluster processor performs the operations described in equation (3). The feature extractor module accepts wavelet coefficients as inputs, which it allocates to the cluster processors according to the cluster boundaries. The feature extractor processor outputs 34 4-bit wavelet features. Block diagrams of this system are shown in Figures 3 and 4.
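The per-cluster Euclidean norm can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and representing each cluster as a (start, stop) index range over a flat coefficient vector is an assumption about how the cluster boundaries might be stored, not a description of the hardware.

```python
import math


def cluster_features(coefficients, boundaries):
    """Return one feature per cluster: the Euclidean norm of its coefficients.

    coefficients -- flat sequence of wavelet coefficients
    boundaries   -- list of (start, stop) index pairs, stop exclusive (assumed layout)
    """
    return [
        math.sqrt(sum(v * v for v in coefficients[start:stop]))
        for start, stop in boundaries
    ]


coeffs = [3, 4, 0, 5, 12, 1]
clusters = [(0, 2), (2, 3), (3, 5), (5, 6)]
print(cluster_features(coeffs, clusters))
```

In the design described here, 34 such boundaries would reduce the 256 wavelet coefficients to 34 features.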
IV. NEURAL NETWORK THEORY AND APPLICATION
Neural networks are designed to solve classification problems by means of a learning process. The model of the multi-layer perceptron considered for this project is based on the McCulloch-Pitts model of a neuron. The goal of the multi-layer perceptron is to correctly classify a set of inputs within an input space that is defined by more than two decision regions. The number of layers and the number of neurons in each layer determine the number of decision regions that a multi-layer perceptron can define. A typical multi-layer perceptron consists of an output layer and one or more hidden layers, one of which is also known as the input layer. The multi-layer perceptron is first trained with a known training set. This is accomplished by applying the known input to the input layer, and then forward propagating the results through the other layers. During this phase, the weights remain constant. The results from the output layer are then collected and compared to the desired response. An error signal is calculated from the difference between the actual response and the desired response, and is then back propagated through the neural network. The multi-layer perceptron converges using a back-propagation error-correction learning algorithm [5].
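After each training iteration, the weights in the neural network are modified according to the following back-propagation error-correction equation,

$$\Delta w_{ji}(n) = \eta \, \delta_j(n) \, y_i(n) \tag{4}$$

where $\eta$ is the learning rate, and $y_i$ is the neuron output value. The local gradient, $\delta_j$, is

$$\delta_j(n) = \frac{b}{a} \, [d_j(n) - y_j(n)] \, [a - y_j(n)] \, [a + y_j(n)]$$

if $j$ is in the output layer, or

$$\delta_j(n) = \frac{b}{a} \, [a - y_j(n)] \, [a + y_j(n)] \sum_{k} \delta_k(n) \, w_{kj}(n)$$

if $j$ is in a hidden layer, where $d_j$ is the desired response, and $a$ and $b$ are the scaling values from the neuron activation function. Training continues until the weights of the neural network produce outputs that converge. Convergence is defined by an average error signal, $E_{av}$, reaching a threshold,

$$E_{av} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{2} \sum_{j \in C} e_j^{2}(n), \qquad e_j(n) = d_j(n) - y_j(n)$$

where $C$ is the set of all neurons in the output layer, and $N$ is the number of training patterns.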
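The weight update and convergence measure described above can be illustrated with a small Python sketch. It rests on an assumption that the text does not confirm: the activation function is taken to be the scaled hyperbolic tangent, a*tanh(b*v), which is one common choice that uses two scaling values a and b. All function names are hypothetical.

```python
import math


def activation(v, a, b):
    """Assumed activation: scaled hyperbolic tangent a * tanh(b * v)."""
    return a * math.tanh(b * v)


def output_delta(d, y, a, b):
    """Local gradient of an output neuron: error times the tanh derivative.

    For y = a*tanh(b*v), dy/dv = (b/a) * (a - y) * (a + y).
    """
    return (b / a) * (d - y) * (a - y) * (a + y)


def hidden_delta(y, next_deltas, next_weights, a, b):
    """Local gradient of a hidden neuron: derivative times back-propagated deltas."""
    back = sum(dk * wk for dk, wk in zip(next_deltas, next_weights))
    return (b / a) * (a - y) * (a + y) * back


def update_weights(weights, inputs, delta, eta):
    """Apply w <- w + eta * delta * y_i to each weight of one neuron."""
    return [w + eta * delta * y_i for w, y_i in zip(weights, inputs)]


def average_error(desired, actual):
    """Mean over patterns of half the summed squared output errors."""
    total = 0.0
    for d_vec, y_vec in zip(desired, actual):
        total += 0.5 * sum((d - y) ** 2 for d, y in zip(d_vec, y_vec))
    return total / len(desired)


delta = output_delta(d=1.0, y=0.0, a=1.0, b=1.0)
print(update_weights([0.5, -0.5], [1.0, 2.0], delta, eta=0.1))
print(average_error([[1.0, 0.0]], [[0.0, 0.0]]))
```

Training in software would repeat these updates over the training set until the average error falls below the chosen threshold, after which the resulting weights could be uploaded to the hardware.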
The neural network considered for this project was a 2-layer multi-layer perceptron consisting of 34 input neurons and 2 output neurons. The 34 wavelet features were fully connected to the 34 input neurons, which were, in turn, fully connected to the output neurons. Each output neuron corresponds to one of the two result classes, voice and music.
The digital design of the neural network was inspired by the design in [6]. The weights in the neural network were designed to be uploaded, to avoid the need for training hardware in the design. The training was instead performed in a software simulation model. The neural network parameters were determined experimentally, beginning with typical values [5]. The neural network processor consists of 34 input layer neuron modules, 2 output layer neuron modules, synchronization hardware to ensure proper data flow through the model, and a result generator module. The result generator observes the outputs of the output layer neurons, and generates "valid", "voice", "music", and "other" signals. If the first output neuron's output is greater than or equal to the second's, the result is "voice". If the second output neuron's output is greater than the first's, the result is "music". If both output neurons' outputs are equal to zero, the result is "other". The neuron modules contain 8-bit registers to store weights, muxes to select input and weight combinations, and an 8x4 bit multiplier with an accumulator to implement the activation potential calculation. The input and output neurons implement individual activation functions, which are look-up based comparators. Block diagrams of this system are shown in Figures 5 and 6.
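The decision rule of the result generator can be written as a short Python sketch. The function name is hypothetical, and one point is an assumption: because two zero outputs also satisfy the "greater than or equal" condition for "voice", the sketch checks the "other" case first. The "valid" signal, which concerns timing rather than classification, is not modeled.

```python
def classify(voice_out, music_out):
    """Map the two output-neuron values to a class label.

    The all-zero case is tested first (an assumed priority), since it would
    otherwise be absorbed by the voice_out >= music_out comparison.
    """
    if voice_out == 0 and music_out == 0:
        return "other"
    if voice_out >= music_out:
        return "voice"
    return "music"


for pair in [(9, 3), (2, 7), (5, 5), (0, 0)]:
    print(pair, classify(*pair))
```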
V. RESULTS
The wavelet neural network was originally constructed as a software model in order to experiment with network parameters, to determine ideal results, and to provide a reference to verify correct hardware operation. Using 8-bit weights and 4-bit data paths in the digital design significantly reduced the accuracy of the classifier from the ideal. A model with 8-bit weights and 8-bit data paths was also created in software to demonstrate the capabilities of a larger system. The results of tests performed on the ideal software model and on the hardware models are shown in Table 1. The digital design of the wavelet neural network was written in VHDL and synthesized with Synopsys Design Compiler, using the LSI-10K synthesized library cells with a clock frequency of 11.025 kHz.
VI. CONCLUSIONS
The trained digital design of the wavelet neural network was effective in correctly identifying the test data sets. The novel design of the wavelet transform processor produced an efficient hardware design that was also a high-performance pipeline. The design of the hardware modules was fairly straightforward to implement in VHDL, and the synthesis was fairly simple because of the low clock operating speed. The ideal model of the wavelet neural network demonstrates what could be achieved with much larger hardware sizes.
