Abstract: This study presents preliminary work on developing a single-chip Text-to-Speech (TTS) synthesizer for the Tamil language (Tamil-TTS). Although many recent works address TTS for their authors' native languages, the motivation of this study is to develop a low-cost FPGA-based Tamil TTS synthesizer. The study uses a unique feature of the Tamil language to eliminate the complexity involved in accessing a database of stored audio signals: only the audio signals of consonants and vowels are held in memory. The compound characters of the segmented input text are generated using a Direct Digital Synthesizer operating at three different frequencies, one for each phonetic interval unit of Tamil. The proposed system is implemented on a Cyclone IV E EP4CE115F29C7 FPGA device, and the implementation results show that it outperforms other similar methods in terms of memory utilization, text-to-speech time, area utilization and power dissipation. The accuracy of the system was examined with 25 native speakers and reached an acceptable level on the rating scale.
INTRODUCTION
In recent years, Text-to-Speech (TTS) has attracted many researchers seeking effective solutions in both hardware and software. As the world increasingly revolves around the internet and handheld devices, TTS synthesizers have become as important as other applications. Currently, TTS is used in e-book reading, caller identification on mobile phones, email reading services, news reading and announcement services (Violaro and Boeffard, 1998; Shih and Sproat, 1996; Klatt, 1987; Chazan et al., 2005). Most of these services are aimed at visually and speech-impaired users. They are mostly implemented in hardware for English, and very few researchers have implemented them for Mandarin (Shih and Sproat, 1996). These services follow one of two methods: access to a large stored database, or phonetic pronunciation driven by syllabification. Of the two, the former has become outdated because of its high memory utilization, whereas the latter is a promising technique that uses less memory while maintaining accuracy.
In both techniques, the speech synthesis process is organized into a front end and a back end. In the front end, the input text can be real-time or stored data. Depending on the language, the input text is broken into smaller elements (syllabification) and sent to the back end (Aida-Zade et al., 2010; Phan et al., 2014; Ferreira et al., 2014), where the syllabified inputs are processed using speech-related signal processing techniques. Although there is a great need for better TTS, to date very few notable research works have been carried out for Indian regional languages (Hindi, Tamil, Telugu, etc.) (Rama et al., 2001; Sen and Samudravijaya, 2002; Bellur et al., 2011; Jayasankar and Vijayaselvi, 2014; Saraswathi and Vishalaksh, 2010; Sivaradje and Dananjayan, 2004).
The work presented here uses a unique feature of the regional language Tamil, in which consonants (Vallinam-Mellinam-Idaiyinam) and vowels can be combined to produce compound character sounds. As noted earlier, although TTS can be implemented in both hardware and software, very few researchers have taken up the challenge of implementing TTS in hardware. The major hurdle in hardware implementation is the difficulty of accessing the stored vocabulary database. This study presents a novel TTS technique for the Tamil language (TTS-Tamil) that operates using only the consonants and vowels stored in the database.
Speech synthesis-overview:
Concatenation is the core of the speech synthesis process. Thirukkural-TTS by Rama et al. (2001) proposes both offline and online processes for Tamil text-to-speech. The offline process comprises five stages: combining basic units, building the database, studying prosody in natural speech, consonant-vowel segmentation and pitch marking. The study of prosody includes the grammatical rules for proper pronunciation, based on pauses and durations, that make the synthesized speech sound natural. These duration scales are stored in the database as a look-up table. When implemented, this offline technique is capable of achieving 98% accuracy. In the online process, on the other hand, the database-building step is eliminated to simplify the synthesis process. Both methods appear to be promising solutions for software implementations of TTS in any native language. Although they are reported to produce low distortion, it is not prudent to use this scheme for a hardware implementation. Similarly, the TTS for web browsing by Sen and Samudravijaya (2002) proposes an online solution that avoids the memory cost of storing a database. This scheme was developed for both Hindi and English text and uses exhaustive rule sets. According to the authors, however, the naturalness of the synthesized speech still has to be improved. Bellur et al. (2011) developed a prosody TTS model for Hindi and Tamil in which Classification and Regression Trees (CART) were adapted to syllable-based synthesis. The Mean Opinion Score (MOS) of the system is greater than 3 on a scale of 1 to 5. More recently, a syllable-based TTS scheme for Tamil was proposed that uses a neural network for prosody prediction. This concatenative speech synthesis scheme uses a five-layer auto-associative neural network to improve the naturalness of the final processed speech signal.
It has been shown that TTS with prosody achieves better naturalness than TTS without prosody.
In all these techniques, achieving naturalness (i.e., accuracy) depends largely on the prosody prediction technique. Some of the methods reviewed here use neural network and fuzzy logic concepts for synthesis, which become more complex when implemented in hardware. Addressing this issue for Tamil-TTS, Sivaradje and Dananjayan (2004) designed and implemented a TTS converter for satellite radio receivers on an FPGA. A more comprehensive work was carried out by Jayasankar and Vijayaselvi (2014) to examine the real-time difficulties of implementing TTS-Tamil on an FPGA. It follows a condensed set of rules for the segmentation of Tamil words. As the implementation is done in Verilog, the input text is given in English letters with the Tamil pronunciation. The present study, in turn, proposes a novel speech synthesis technique implemented on an FPGA.
DESIGN OF SPEECH SYNTHESIZING UNIT
The novel speech synthesizing technique was designed with careful attention to the memory utilization issues raised by other FPGA-based TTS methods (Sivaradje and Dananjayan, 2004; Bamini, 2003; Khalifa et al., 2008). One major drawback of both software and hardware designs is the memory needed to store the words, whether in offline or online memory devices. Although certain TTS schemes treat memory utilization (the database) as a trade-off parameter and concentrate on 100% accuracy, the objective of our work is to develop a low-cost, standalone FPGA-based TTS scheme. To that end, we studied the unique features of the Tamil language and combined them to develop the novel TTS system (Fig. 1).
When low-cost solutions were demanded, many researchers came up with optimized solutions such as syllabification based on prosody prediction. Such techniques can provide good results for languages like English, which has only 26 characters. Developing such a system for a language like Tamil, which has 247 characters, is quite a difficult task.
To reduce this complexity, we use the fact that compound characters are produced from consonants (Vallinam-Mellinam-Idaiyinam) and vowels (Uyirezhuthu). The pronunciation accuracy of these characters depends on the time taken to pronounce each character, which is measured in Mathirai (the unit of phonetic interval).
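The character counts behind this argument can be checked with a short calculation. The sketch below assumes the standard Tamil inventory of 12 vowels, 18 consonants and one aytham, which the text does not list explicitly. It counts characters, not bytes of audio, so the resulting numbers are only a rough indication of how much less has to be stored.

```python
# Standard Tamil character inventory (an assumption; the text only gives the total of 247).
NUM_VOWELS = 12       # Uyirezhuthu
NUM_CONSONANTS = 18   # Vallinam, Mellinam and Idaiyinam, six each
NUM_AYTHAM = 1        # the special character aytham


def compound_count(vowels: int = NUM_VOWELS, consonants: int = NUM_CONSONANTS) -> int:
    """Every consonant combines with every vowel to give one compound character."""
    return vowels * consonants


def total_characters() -> int:
    """Vowels + consonants + aytham + all consonant-vowel compounds."""
    return NUM_VOWELS + NUM_CONSONANTS + NUM_AYTHAM + compound_count()


def stored_units() -> int:
    """Units kept in memory when only consonant and vowel sounds are stored."""
    return NUM_VOWELS + NUM_CONSONANTS


if __name__ == "__main__":
    print("compound characters:", compound_count())
    print("total characters:", total_characters())
    print("stored units:", stored_units())
```

Under these assumptions, the compound characters make up most of the inventory, which is why generating them from stored consonant and vowel sounds reduces the stored set so sharply.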
An indigenously developed Direct Digital Synthesizer (DDS) generates three different frequencies for the three units of phonetic interval. The Mathirai units of consonants, kuril and nedil characters are half a second, one second and two seconds, respectively. The synthesizing unit concatenates the compound signals based on these phonetic intervals and generates the speech signal at the receiver end. The design of the proposed system is shown in Fig. 2. The implementation process is explained in the next section.
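The text does not give the internal architecture of the DDS or the three frequency values. As a general illustration only, the following sketch shows the usual phase-accumulator form of a DDS, in which the output frequency equals tuning_word × f_clk / 2^N for an N-bit accumulator. The 50 MHz clock, the 32-bit accumulator, the 8-bit look-up address and the 1 kHz target are placeholder assumptions, not values from this study.

```python
def tuning_word(f_out_hz: float, f_clk_hz: float, acc_bits: int = 32) -> int:
    """Phase increment per clock tick that approximates the requested frequency."""
    return round(f_out_hz * (1 << acc_bits) / f_clk_hz)


def output_frequency(word: int, f_clk_hz: float, acc_bits: int = 32) -> float:
    """Frequency actually produced by a given tuning word."""
    return word * f_clk_hz / (1 << acc_bits)


def lut_addresses(word: int, ticks: int, acc_bits: int = 32, lut_bits: int = 8) -> list:
    """Simulate the accumulator and return the waveform-table address at each tick.

    The address is taken from the top lut_bits of the accumulator, and the
    accumulator wraps around at 2**acc_bits.
    """
    mask = (1 << acc_bits) - 1
    acc = 0
    addresses = []
    for _ in range(ticks):
        addresses.append(acc >> (acc_bits - lut_bits))
        acc = (acc + word) & mask
    return addresses


if __name__ == "__main__":
    F_CLK = 50_000_000  # placeholder clock frequency in Hz
    word = tuning_word(1_000, F_CLK)  # placeholder 1 kHz target
    print("tuning word:", word)
    print("actual output frequency (Hz):", output_frequency(word, F_CLK))
```

Changing only the tuning word changes the output frequency, which is the property that lets a single synthesizer serve several phonetic interval units.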
Implementation of speech synthesizing unit:
This section explains in detail the design of the novel Tamil-TTS technique on a standalone FPGA device. The overall schematic of the proposed system is shown in Fig. 2. The SD card contains the audio samples of the consonant and vowel characters used in the synthesizing process. During initialization, the proposed TTS-Core fetches the audio signal(s) from the SD card and stores them in the on-board SRAM.
This step is needed to achieve a faster TTS scheme than conventional methods. The TTS-Core and the text analyzer are interfaced with the Nios core through the Avalon bus. In the preprocessing stage, the input text read from the PS/2 controller is sent to the text analyzer, where the input data stream is segmented into compound characters (consonants + vowels). In parallel, the text analyzer estimates the phonetic interval unit. Based on this phonetic interval unit, the DDS estimates the frequency variation for the concatenation process. The corresponding audio signal(s) for the consonants and vowels are passed to the frequency matching stage, and the concatenated compound character signals are converted using the DAC by the Audio Codec Controller to produce the final speech output. The pseudo code of the TTS is shown in Fig. 3.
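The segmentation step of the text analyzer can be sketched in software as follows. This is only an illustration and not the pseudo code of Fig. 3. It assumes that the input is Unicode Tamil text (the study reads text from a PS/2 controller and does not state the encoding), and the function names and the grouping of vowels into kuril and nedil are assumptions made for the example.

```python
import unicodedata

PULLI = "\u0BCD"  # virama sign: marks a pure consonant
CONSONANTS = set(
    "\u0B95\u0B99\u0B9A\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8\u0BAA"
    "\u0BAE\u0BAF\u0BB0\u0BB2\u0BB5\u0BB4\u0BB3\u0BB1\u0BA9"
)
VOWELS = set("\u0B85\u0B86\u0B87\u0B88\u0B89\u0B8A\u0B8E\u0B8F\u0B90\u0B92\u0B93\u0B94")
KURIL = set("\u0B85\u0B87\u0B89\u0B8E\u0B92")  # short vowels a, i, u, e, o
INHERENT_A = "\u0B85"  # a consonant letter with no sign carries the vowel a
# Dependent vowel sign -> independent vowel letter
SIGN_TO_VOWEL = {
    "\u0BBE": "\u0B86", "\u0BBF": "\u0B87", "\u0BC0": "\u0B88",
    "\u0BC1": "\u0B89", "\u0BC2": "\u0B8A", "\u0BC6": "\u0B8E",
    "\u0BC7": "\u0B8F", "\u0BC8": "\u0B90", "\u0BCA": "\u0B92",
    "\u0BCB": "\u0B93", "\u0BCC": "\u0B94",
}


def segment(text):
    """Split Tamil text into (consonant, vowel) pairs; either part may be None."""
    text = unicodedata.normalize("NFC", text)
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            if nxt == PULLI:                 # pure consonant
                units.append((ch, None))
                i += 2
            elif nxt in SIGN_TO_VOWEL:       # consonant + explicit vowel sign
                units.append((ch, SIGN_TO_VOWEL[nxt]))
                i += 2
            else:                            # consonant + inherent vowel a
                units.append((ch, INHERENT_A))
                i += 1
        elif ch in VOWELS:                   # standalone vowel
            units.append((None, ch))
            i += 1
        else:                                # spaces, punctuation, other symbols
            i += 1
    return units


def interval_class(unit):
    """Return the phonetic interval class used to select the DDS frequency."""
    _, vowel = unit
    if vowel is None:
        return "consonant"
    return "kuril" if vowel in KURIL else "nedil"
```

Each returned pair names the stored consonant and vowel signals to fetch, and the interval class indicates which of the three phonetic interval units applies to it.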
RESULTS AND DISCUSSION
Voice quality is tested using a subjective test, in which human listeners hear the processed voice files and rank their quality according to the scale in Table 1. Three cases are compared. In the first case, the proposed system, the audio signals of the consonants and vowels listed in Appendix A are stored on the SD card. As explained in the section on the design of the speech synthesizing unit, during initialization of the TTS scheme the memory controller copies the signals stored on the SD card to the SRAM through the Avalon interface of the Nios processor. In the second case, the audio signals of compound characters are stored. In the third case, a vocabulary set of 16 words, listed in Appendix B, is stored. In all three scenarios, the overall QoS of the proposed method is within the acceptable range. Although the accuracy of the proposed system is lower than that achieved with the vocabulary set, it is considerably higher than in the second case. The most notable QoS parameter is Time-TTS (the time taken for Text-to-Speech).
The proposed scheme outperforms the other two cases, with an average Time-TTS of 12 ms. As the proposed system needs less memory for the audio signals, the use of on-chip memory can be considered as a future extension of this study.
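The subjective test described at the start of this section can be summarized numerically. Assuming that the scale in Table 1 is a 1-to-5 opinion scale like the Mean Opinion Score reported for Bellur et al. (2011), a mean score over listeners could be computed as sketched below. The ratings in the example are placeholders for illustration only and are not data from this study.

```python
from statistics import mean


def mean_opinion_score(ratings, low=1, high=5):
    """Average listener ratings after checking that each lies on the scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} is outside the {low}-{high} scale")
    return mean(ratings)


if __name__ == "__main__":
    placeholder_ratings = [4, 3, 5, 4]  # illustrative values, not measured results
    print("mean opinion score:", mean_opinion_score(placeholder_ratings))
```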
CONCLUSION
This study has addressed the demand for a Text-to-Speech synthesizer for the Tamil language. The standalone FPGA-based TTS synthesizer uses the unique features of the Tamil language to reduce the memory complexity of hardware implementation. Much attention has therefore been given to developing a TTS-Core with a Direct Digital Synthesizer that produces a satisfactory speech signal with less area utilization and a much shorter processing time for the speech output. The real-time implementation results show that the proposed Tamil-TTS, using only the stored vowel and consonant sounds, requires 90% less memory than conventional techniques and offers an easy method of synthesizing speech for the Tamil language. As future work, this study can be extended by using the on-chip memories of FPGAs to obtain a faster and cheaper solution for Tamil-TTS.
