18 research outputs found
Multi-Dimensional Coding of Speech Data
This paper presents specific new techniques for coding speech representations and a new general approach to coding for compression, which directly utilises the multi-dimensional nature of the input data. Many methods of speech analysis yield a two-dimensional pattern, with time as one of the dimensions. Various such speech representations, and power spectrum sequences in particular, are shown here to be amenable to two-dimensional compression using specific models which take account of a large part of their structure in both dimensions.
Two newly developed techniques, Multi-step Adaptive Flux Interpolation (MAFI) and Multi-step Flow Based Prediction (MFBP), are presented. These are able to code power spectral density (PSD) sequences of speech more completely and accurately than conventional methods, and at a low computational cost. This is due to their ability to model non-stationary, piecewise-continuous signals, of which speech is a good example.
MAFI and MFBP are first applied in the time domain and then to the encoded data in the second dimension. This approach allows the coding algorithm to exploit redundancy in both dimensions, giving a significant improvement in the overall compression ratio. Furthermore, the compression may be reapplied several times, and the data is further compressed with each application.
Image Coding by Multi-Step, Adaptive Flux Interpolation
This paper describes and discusses a new technique, multi-step adaptive flux interpolation (MAFI), and its application to image data for coding. The output of MAFI, when applied to an image, is still in image form but has a more uniform feature density. This is because the original image has been warped by removing those rows and columns which contain mostly redundant pixels. The output is also greatly reduced in size, and the side information is minimal. The MAFI output can be further compressed using conventional coders, making the overall compression ratio even higher. Because of its warped nature, the MAFI output's statistics are also more consistent with the properties assumed by block-based discrete cosine transform (DCT) models.
Modelling the Flow Inherent in Speech Representations
This paper presents two new methods for modelling the flow inherent in speech: flow-based prediction (FBP) and acoustic flow interpolation (AFI). These are presented as extensions of the form of prediction implied in calculating the delta and delta-delta coefficients often used in automatic speech recognition. All these methods are presented as special cases of a general vector linear prediction model, but it is shown that the new techniques, which make the flow of features within the data explicit, are significantly better at modelling spectrogram-like data.
Several speech representations, using both parametric and non-parametric analyses, are discussed both in terms of their ability to represent speech accurately and of their appropriateness to these flow-based models. AFI and FBP error coefficients, for both male and female speakers, are measured and compared with the delta and delta-delta coefficients. Wherever possible, the parameters and methods used to produce the representations have been chosen to be directly comparable with one another.
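For readers unfamiliar with the baseline mentioned in this abstract, the following is a minimal sketch of the regression-style delta coefficients commonly used in speech recognition front ends. It is not the paper's FBP or AFI method. The function name `delta`, the `window` parameter and the edge handling (repeating the first and last frames) are assumptions made for illustration only.

```python
def delta(frames, window=2):
    """Regression-based delta coefficients for a sequence of feature vectors.

    frames: list of equal-length lists of floats (one list per time frame).
    window: number of frames on each side used in the regression (assumed value).
    Edge frames are handled by repeating the first/last frame (an assumption).
    """
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    # Normalising term of the regression: 2 * sum of n^2 for n = 1..window.
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for k in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


# Delta-delta coefficients are obtained by applying the same operator twice.
ramp = [[float(t)] for t in range(6)]
first_order = delta(ramp)
second_order = delta(first_order)
```

On a linear ramp the interior delta values equal the slope, which is the sense in which these coefficients act as a simple local trend estimate along the time axis.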
Variable Frame-Rate Speech Coding by Adaptive-Flux Interpolation
Variable frame-rate (VFR) speech coders have many desirable properties but make implicit assumptions concerning the nature of the spectral evolution of speech (Peeling and Ponting 1989). To date, these assumptions have been crude and unable to model speech parameters during extended periods of coarticulation. In particular, they have been unable to cope with steadily changing formants. Thus, existing VFR methods must transmit many more frames than are really necessary. This paper presents a new technique, Adaptive-Flux Interpolation (AFI), which significantly extends the period over which accurate estimation can be performed and is much more robust and accurate than other methods.
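To make the idea of variable frame-rate coding concrete, here is a minimal sketch of a conventional threshold-based scheme, not of AFI itself. A frame is transmitted only when it differs enough from the last transmitted frame, and the dropped frames are rebuilt by straight-line interpolation. The function names, the Euclidean distance measure and the threshold value are all assumptions chosen for illustration.

```python
import math


def select_frames(frames, threshold):
    """Return indices of frames to transmit (hypothetical helper).

    A frame is kept when its Euclidean distance from the last kept frame
    exceeds `threshold`. The first and last frames are always kept so the
    decoder can interpolate across the whole sequence.
    """
    if not frames:
        return []
    kept = [0]
    for t in range(1, len(frames)):
        if math.dist(frames[t], frames[kept[-1]]) > threshold:
            kept.append(t)
    if kept[-1] != len(frames) - 1:
        kept.append(len(frames) - 1)
    return kept


def reconstruct(frames, kept):
    """Rebuild every frame by linear interpolation between kept frames."""
    if not kept:
        return []
    out = []
    for a, b in zip(kept, kept[1:]):
        for t in range(a, b):
            w = (t - a) / (b - a)
            out.append([(1 - w) * x + w * y
                        for x, y in zip(frames[a], frames[b])])
    out.append(list(frames[kept[-1]]))
    return out


frames = [[0.0], [0.1], [0.2], [5.0], [5.1]]
kept = select_frames(frames, threshold=1.0)
approx = reconstruct(frames, kept)
```

In this toy example the frames just before the jump are reconstructed poorly, because straight-line interpolation assumes each parameter changes uniformly between transmitted frames. That is the kind of crude assumption the abstract attributes to existing VFR methods.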
Assimilation of word-final nasals to following word-initial place of articulation in United Kingdom English.
Using very large speech corpora, we can study rare but systematic pronunciation patterns in spontaneous speech. Previous studies have established that word-final alveolar consonants in English (/t/, /d/, /n/, /s/ and /z/) vary their place of articulation to match a following word-initial consonant, e.g., "ran quickly" → "ra[ŋ] quickly." Assimilation of bilabial or velar nasals, e.g., "alar[ŋ] clock" for "alarm clock," is unexpected according to linguistic frameworks such as underspecification theory. The existence of systematic counterexamples would challenge that theory, but these might have been previously overlooked because they are infrequent. From the c. 8-million word Audio BNC (http://www.phon.ox.ac.uk/AudioBNC) we extracted more than 4,000 tokens of relevant word pairs, to determine whether non-alveolar assimilations occur and with what distribution. Word and segment boundaries were obtained by forced alignment, and F1-F3 formant frequencies were estimated using Praat. Formant frequencies in assimilation environments were compared to non-assimilating controls (e.g., "them down" vs. "them back"/"then down"). We also examined patterns of variability in different contexts. We will present evidence that velar and bilabial nasals sometimes do assimilate, though less frequently than alveolars.
Reproducing speech intervals in the sub-hundred millisecond (ms) range with a translocation in 7q31
There is increasing evidence that mutations in the transcription factor FOXP2 impair sensorimotor responses at the brain level.
How some gene interactions produce afferent-efferent circuits involving GABAergic and glutamatergic populations of cells in different parts of the CNS is unclear. It is known that FOXP2 is expressed in a sensorimotor dopaminergic circuit, comprising the striatum, thalamus, deep cerebral cortical layers, the inferior olive and Purkinje cells of the cerebellum.
Here we focus on a case of a subject A with speech and language disorders who has a chromosomal translocation t[7;11] affecting 7q31, the locus of FOXP2.
Since there is substantial evidence that the neural basis for interval timing of fast movement changes, crucial for speech and language, may be regulated by these sensorimotor dopaminergic circuits, we focus here on how interval timing in the ms range is reproduced by A, compared to the tutor, and a control C matched for sex, age, languages and education.
It is found that A reproduces non-word sequences with significantly fewer dynamic changes than C. We discuss these findings, relating them to the debatable hypothesis that the cerebellum may be more involved in the perception and production of sub-second intervals.