22 research outputs found
Stylization of Pitch with Syllable-Based Linear Segments
Fundamental frequency contours for speech, as obtained by common pitch tracking algorithms, contain a great deal of fine detail that is unlikely to hold much perceptual significance for listeners. In our experiments, a radically reduced pitch contour consisting of a single linear segment for each syllable was found to be judged as equally natural as the original pitch track by listeners, based on high-quality analysis-synthesis. We describe the algorithms both for segmenting speech into syllables by fitting Gaussians to the energy envelope, and for approximating the pitch contour with an independent linear segment for each syllable. We report a web-based test in which 40 listeners compared the stylized pitch contour resyntheses to equivalent resyntheses based on the original pitch track, and also to pitch tracks stylized by the existing Momel algorithm. Listeners preferred the original pitch contour to the linear approximation in only 60% of cases, where 50% would indicate random guessing. By contrast, the original was preferred over Momel in 74% of cases.
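A minimal sketch of the per-syllable linear stylization step described above, assuming syllable boundaries (frame indices) are already available from the energy-envelope segmentation; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def stylize_pitch(f0, syllable_bounds):
    """Replace the pitch contour within each syllable by a single
    least-squares linear segment (unvoiced frames, f0 == 0, are skipped)."""
    stylized = np.zeros_like(f0)
    for start, end in syllable_bounds:          # frame index pairs per syllable
        seg = f0[start:end]
        voiced = seg > 0                        # fit only voiced frames
        if voiced.sum() < 2:                    # too few points to fit a line
            stylized[start:end] = seg
            continue
        t = np.arange(len(seg))
        slope, intercept = np.polyfit(t[voiced], seg[voiced], deg=1)
        fit = slope * t + intercept
        stylized[start:end] = np.where(voiced, fit, 0.0)
    return stylized

# Toy usage: two syllables over a 20-frame contour
f0 = np.r_[np.linspace(120, 150, 10), np.linspace(150, 100, 10)]
print(stylize_pitch(f0, [(0, 10), (10, 20)]))
```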
Addressee detection for dialog systems using temporal and spectral dimensions of speaking style
As dialog systems evolve to handle unconstrained input and to operate in open environments, addressee detection (detecting speech directed to the system versus to other people) becomes an increasingly important challenge. We study a corpus in which speakers talk both to a system and to each other, and model two dimensions of speaking style that talkers modify when changing addressee: speech rhythm and vocal effort. For each dimension we design features that require no speech recognition output, session normalization, speaker normalization, or dialog context. Detection experiments show that rhythm and effort features are complementary, outperform lexical models based on recognized words, and reduce error rates even when word recognition is error-free. Simulated online processing experiments show that all features need only the first couple of seconds of speech. Finally, we find that temporal and spectral stylistic models can be trained on outside corpora, such as ATIS and ICSI meetings, with reasonable generalization to the target task, showing promise for domain-independent computer-versus-human addressee detectors.
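A rough illustration of the kind of ASR-free stylistic feature the abstract describes, computed from only the first couple of seconds of audio; the specific statistics here (frame-level log-energy mean, range, and modulation) are assumptions for illustration, not the paper's actual feature set.

```python
import numpy as np

def style_features(x, sr, window_s=2.0, frame_s=0.025, hop_s=0.010):
    """Crude rhythm/effort statistics from the first `window_s` seconds of
    audio, with no ASR, speaker, or session normalization (illustrative only)."""
    x = x[: int(window_s * sr)]
    frame, hop = int(frame_s * sr), int(hop_s * sr)
    n = 1 + max(0, (len(x) - frame) // hop)
    frames = np.stack([x[i * hop : i * hop + frame] for i in range(n)])
    log_e = 10 * np.log10(np.mean(frames**2, axis=1) + 1e-10)  # frame log-energy
    return {
        "effort_mean_db": float(log_e.mean()),         # vocal-effort proxy
        "effort_range_db": float(np.ptp(log_e)),
        "rhythm_mod_std": float(np.diff(log_e).std()),  # energy-envelope modulation
    }

# Toy usage with 2 s of noise at 16 kHz
print(style_features(np.random.randn(32000) * 0.1, sr=16000))
```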
Hasude
Hasude, a novel by İhsan of Üsküdar Kız Sanat (the Üsküdar Girls' Vocational School), serialized in the newspaper Hanım Kızlara Mahsus Gazete.
GraphCast: Learning skillful medium-range global weather forecasting
We introduce a machine-learning (ML)-based weather simulator, called "GraphCast", which outperforms the most accurate deterministic operational medium-range weather forecasting system in the world, as well as all previous ML baselines. GraphCast is an autoregressive model, based on graph neural networks and a novel high-resolution multi-scale mesh representation, which we trained on historical weather data from the European Centre for Medium-Range Weather Forecasts (ECMWF)'s ERA5 reanalysis archive. It can make 10-day forecasts, at 6-hour time intervals, of five surface variables and six atmospheric variables, each at 37 vertical pressure levels, on a 0.25-degree latitude-longitude grid, which corresponds to roughly 25 x 25 kilometer resolution at the equator. Our results show GraphCast is more accurate than ECMWF's deterministic operational forecasting system, HRES, on 90.0% of the 2760 variable and lead time combinations we evaluated. GraphCast also outperforms the most accurate previous ML-based weather forecasting model on 99.2% of the 252 targets it reported. GraphCast can generate a 10-day forecast (35 gigabytes of data) in under 60 seconds on Cloud TPU v4 hardware. Unlike traditional forecasting methods, ML-based forecasting scales well with data: by training on bigger, higher-quality, and more recent data, the skill of the forecasts can improve. Together, these results represent a key step forward in complementing and improving weather modeling with ML, open new opportunities for fast, accurate forecasting, and help realize the promise of ML-based simulation in the physical sciences.
Comment: Main text: 21 pages, 8 figures, 1 table. Appendix: 15 pages, 5 figures, 2 tables.
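A schematic of the autoregressive rollout the abstract describes: a learned single-step (6-hour) transition model applied repeatedly to its own output to produce a 10-day forecast. The `step` function below is a toy stand-in for the trained GNN, not GraphCast itself, and the grid shape is shrunk for illustration.

```python
import numpy as np

def rollout(step, state, n_steps=40):
    """Autoregressive forecasting: repeatedly apply a learned 6-hour
    transition model to its own output. 40 steps x 6 h = a 10-day forecast."""
    trajectory = []
    for _ in range(n_steps):
        state = step(state)         # one 6-hour update (stand-in for the GNN)
        trajectory.append(state)
    return np.stack(trajectory)     # shape: (n_steps, *state.shape)

# Toy stand-in dynamics on a coarse lat-lon grid (the real 0.25-degree grid
# is 721 x 1440 points); a smoothing step substitutes for learned physics.
toy_step = lambda s: 0.5 * (s + np.roll(s, 1, axis=-1))
forecast = rollout(toy_step, np.random.randn(19, 90, 180))  # (levels, lat, lon)
print(forecast.shape)  # (40, 19, 90, 180)
```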
Large-Margin Structured Prediction Extensions of Neural Networks for Automatic Speech Recognition
Neural networks, especially those with more than one hidden layer, have re-emerged in Automatic Speech Recognition (ASR) systems as replacements for emission models based on Gaussian Mixture Models (GMMs). While the use of these so-called Deep Neural Networks (DNNs) has enjoyed widespread success due to improvements in recognition results, the exact source of the better recognition accuracy is not entirely understood. Using a bootstrap resampling framework that generates synthetic test set data satisfying the conditional independence assumptions of the model while still using real observations, I show that DNNs used for both feature generation and hybrid acoustic modeling help compensate for incorrect conditional independence assumptions and help fix poor phone duration estimates of the hidden Markov Model (HMM).
Despite these improvements, the large increase in word error rates for DNN-HMM systems on real data compared to synthetic data suggests that one can improve recognition performance by modifying the training criterion. Since neural networks are log-linear at the output layer, I propose using sequences of last hidden layers as input to a log-linear model, and training that model with large-margin criteria. These Structured Support Vector Machine (SVM) approaches allow us to more directly minimize errors relevant to automatic speech recognition, and provide some guarantees on test set error. First, I show how one can generate better features by combining a neural network with a hidden Markov Support Vector Machine (HMSVM). Then, I propose a hybrid DNN-Structured SVM acoustic model and an online training algorithm that iteratively updates alignments for faster convergence. Training of this model falls under a class of approaches known as sequence-discriminative training, which are used to train state-of-the-art systems. This DNN-latent Structured SVM model beats alternative methods for sequence-discriminative training by 1.0% absolute, while needing 33-66% fewer utterances to converge.
Finally, I analyze the Structured SVM approach to sequence-discriminative training and compare it to standard methods. I show how the loss function for boosted Maximum Mutual Information is an upper bound on the hinge loss for the Structured SVM, and how such a relaxation precludes the use of the aggressive boosting parameters needed for better results. I then analyze four of the most popular sequence-discriminative training criteria (Maximum Mutual Information, boosted Maximum Mutual Information, Minimum Phone Error, and state-level Minimum Bayes Risk) and the latent Structured SVM using the bootstrap resampling framework, and compare how the different criteria compensate for data/model mismatch. Structured SVM models perform better on real than on synthetic data, likely because the model makes fewer distributional assumptions about the underlying data.
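A small numerical illustration of the bound mentioned above, under commonly used definitions (an assumption here, since the thesis's exact notation is not shown): with error-scaled margins and a boosting factor b, the boosted-MMI loss is a log-sum-exp softening of the max inside the structured hinge loss, so it upper-bounds the hinge loss for every b.

```python
import numpy as np

def hinge_loss(scores, errors, ref, b):
    """Structured hinge loss with margin scaled by hypothesis error counts."""
    aug = scores + b * errors                # loss-augmented scores
    competitors = np.delete(aug, ref)
    return max(0.0, competitors.max() - scores[ref])

def bmmi_loss(scores, errors, ref, b):
    """Boosted-MMI loss: negative boosted log-posterior of the reference."""
    aug = scores + b * errors                # errors[ref] is 0 by definition
    return np.logaddexp.reduce(aug) - scores[ref]

# Toy lattice of 4 hypotheses; hypothesis 0 is the reference (0 errors)
scores = np.array([2.0, 1.5, 1.8, 0.3])
errors = np.array([0.0, 2.0, 1.0, 3.0])
for b in (0.1, 0.5, 1.0):
    h, m = hinge_loss(scores, errors, 0, b), bmmi_loss(scores, errors, 0, b)
    assert m >= h                            # log-sum-exp >= max: bMMI bounds hinge
    print(f"b={b}: hinge={h:.3f}  bMMI={m:.3f}")
```

Because the log-sum-exp gap grows with b, an aggressive boosting factor loosens the relaxation, which is one way to read the thesis's point that such settings are precluded under the bMMI surrogate.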