An infrastructure for Turkish prosody generation in text-to-speech synthesis
Text-to-speech engines benefit from natural language processing when generating appropriate prosody. In this study, we investigate the natural language processing infrastructure for Turkish prosody generation in three steps: pronunciation disambiguation, phonological phrase detection, and intonation level assignment. We focus on phrase boundary detection and intonation assignment. We propose a phonological phrase detection scheme based on syntactic analysis for Turkish and assign one of three intonation levels to the words in detected phrases. Empirical observations on 100 sentences show that the proposed scheme works with approximately 85% accuracy.
Statistical morphological disambiguation with application to disambiguation of pronunciations in Turkish
The statistical morphological disambiguation of agglutinative languages suffers from data sparseness. In this study, we introduce the notion of distinguishing tag sets (DTS) to overcome the problem. The morphological analyses of words are modeled with DTS and the root major part-of-speech tags. The disambiguator based on these representations performs the statistical morphological disambiguation of Turkish with a recall as high as 95.69 percent. In text-to-speech systems and in developing transcriptions for acoustic speech data, a related problem arises in disambiguating the pronunciation of a token in context, so that the correct pronunciation can be produced or the transcription uses the correct set of phonemes. We apply the morphological disambiguator to this problem of pronunciation disambiguation and achieve 99.54 percent recall with 97.95 percent precision. Most text-to-speech systems perform phrase-level accentuation based on the content word/function word distinction. This approach seems easy and adequate for some right-headed languages such as English but is not suitable for languages such as Turkish. We then use a heuristic approach based on dependency parsing to mark up phrase boundaries as a basis for phrase-level accentuation in Turkish TTS synthesizers.
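The recall and precision figures above can be read with a small sketch. It assumes a definition that is common for disambiguators allowed to keep more than one analysis per token: recall is the share of tokens whose correct analysis is retained, and precision is the share of retained analyses that are correct. The abstract does not state its definition, so this is an assumption; the function name `precision_recall` and the placeholder labels are hypothetical.

```python
def precision_recall(predicted: list[set[str]], gold: list[str]) -> tuple[float, float]:
    """Score a disambiguator that may retain several analyses per token.

    Assumed definitions (not taken from the abstract):
      recall    = tokens whose gold analysis is retained / number of tokens
      precision = tokens whose gold analysis is retained / analyses retained
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    # Count tokens for which the correct analysis survived disambiguation.
    retained = sum(1 for kept, correct in zip(predicted, gold) if correct in kept)
    # Count every analysis the disambiguator left in its output.
    total_kept = sum(len(kept) for kept in predicted)
    recall = retained / len(gold) if gold else 0.0
    precision = retained / total_kept if total_kept else 0.0
    return precision, recall


# Hypothetical placeholder labels standing in for candidate pronunciations.
predicted = [{"reading_a"}, {"reading_b", "reading_c"}, {"reading_d"}, {"reading_e"}]
gold = ["reading_a", "reading_b", "reading_x", "reading_e"]
precision, recall = precision_recall(predicted, gold)
```

Under these definitions, leaving a token ambiguous protects recall but lowers precision, which is one way a system can report higher recall than precision.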
Robustness of Massively Parallel Sequencing Platforms
Improvements in high-throughput sequencing (HTS) technologies have made clinical sequencing projects such as ClinSeq and Genomics England feasible. Although the accuracy and reproducibility of HTS-based analyses have improved significantly, using these data for diagnostic and prognostic applications requires near-perfect data generation. To assess the robustness of a widely used HTS platform for accurate and reproducible clinical applications, we generated whole genome shotgun (WGS) sequence data from the genomes of two human individuals in two different genome sequencing centers. After analyzing the data to characterize SNPs and indels using the same tools (BWA, SAMtools, and GATK), we observed a significant number of discrepancies in the call sets. As expected, most of the disagreements between the call sets were found within genomic regions containing common repeats and segmental duplications, and only a small fraction of the discordant variants were within exons and other functionally relevant regions such as promoters. We conclude that although HTS platforms are sufficiently powerful for providing data for first-pass clinical tests, the variant predictions still need to be confirmed using orthogonal methods before being used in clinical applications.
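The comparison of call sets described above can be illustrated with a minimal sketch. It assumes each variant is identified by chromosome, position, reference allele, and alternate allele, and that two calls agree only when all four match; the study's actual matching rules are not given in the abstract. The names `Variant` and `compare_call_sets` and the example records are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A called variant, keyed by site and alleles (assumed matching rule)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def compare_call_sets(a: set[Variant], b: set[Variant]) -> dict[str, float]:
    """Summarize agreement between two call sets from the same sample."""
    shared = a & b   # calls made at both centers
    union = a | b    # calls made at either center
    return {
        "shared": len(shared),
        "only_a": len(a - b),  # discordant: called only in set a
        "only_b": len(b - a),  # discordant: called only in set b
        # Fraction of all calls that both sets agree on.
        "concordance": len(shared) / len(union) if union else 1.0,
    }


# Invented example records for one sample sequenced at two centers.
center_a = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 75, "G", "GA"),
}
center_b = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr2", 75, "G", "GA"),
    Variant("chr3", 10, "T", "C"),
}
summary = compare_call_sets(center_a, center_b)
```

The discordant sets (`only_a`, `only_b`) are the calls one would then annotate against repeat and segmental duplication regions.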
Pronunciation disambiguation in Turkish
In text-to-speech systems and in developing transcriptions for acoustic speech data, one is faced with the problem of disambiguating the pronunciation of a token in the context in which it is used, so that the correct pronunciation can be produced or the transcription uses the correct set of phonemes. In this paper, we investigate the problem of pronunciation disambiguation in Turkish as a natural language processing problem and present preliminary results using a morphological disambiguation technique based on the notion of distinguishing tag sets.
Comparisons of total and novel SNP and indel intersections of B1 vs. T1 and B2 vs. T2. B1, T1: pooled S1 calls from BGI and TÜBİTAK datasets using HaplotypeCaller; B2, T2: pooled S2 calls from BGI and TÜBİTAK datasets, respectively.
Detailed view of novel SNP and indel distributions of S2 that map to common repeats.
Summary of the sequence datasets.
Basic statistics of the two samples (S1, S2) sequenced at two different centers. S1T refers to sample S1 sequenced at TÜBİTAK, whereas the dataset S1B was generated from the same sample at BGI. Similarly, datasets from sample S2 are denoted as S2T and S2B.
Detailed view of novel SNP and indel distributions of S1 that map to common repeats.
Underlying sequence content of novel SNP and indel calls.
A) SNPs and B) indels in the genome of S1. C) SNPs and D) indels in the genome of S2.