292 research outputs found
Automatic Prosodic Segmentation by F0 Clustering Using Superpositional Modeling.
In this paper, we propose an automatic method for detecting
accent phrase boundaries in Japanese continuous speech by
using F0 information. In the training phase, hand-labeled
accent patterns are parameterized according to the superpositional
model proposed by Fujisaki and assigned to
clusters by a clustering method, in which accent templates
are calculated as the centroid of each cluster. In the segmentation
phase, automatic N-best extraction of boundaries is
performed by One-Stage DP matching between the reference
templates and the target F0 contour. About 90% of
accent phrase boundaries were correctly detected in speaker-independent
experiments with the ATR Japanese continuous
speech database.
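The template-matching segmentation described above can be sketched as a dynamic-programming search that partitions an F0 contour into segments, each scored against the nearest template. This is a minimal illustration, not the paper's One-Stage DP algorithm: the function names, the squared-error cost, and the segment-length bounds are assumptions.

```python
def resample(seq, n):
    """Linearly resample a sequence to length n."""
    if n == 1:
        return [seq[0]]
    out = []
    for i in range(n):
        pos = i * (len(seq) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(seq) - 1)
        frac = pos - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out

def segment(contour, templates, min_len=3, max_len=10):
    """DP segmentation of an F0 contour against accent templates.
    dp[i] holds (best cost, previous boundary) for contour[:i];
    the recovered boundaries approximate accent phrase boundaries."""
    n = len(contour)
    INF = float("inf")
    dp = [(INF, -1)] * (n + 1)
    dp[0] = (0.0, -1)
    for i in range(1, n + 1):
        for length in range(min_len, max_len + 1):
            j = i - length
            if j < 0 or dp[j][0] == INF:
                continue
            # Stretch the candidate segment to template length and
            # score it against the closest template (squared error).
            seg = resample(contour[j:i], len(templates[0]))
            cost = min(sum((a - b) ** 2 for a, b in zip(seg, t))
                       for t in templates)
            if dp[j][0] + cost < dp[i][0]:
                dp[i] = (dp[j][0] + cost, j)
    # Backtrack to recover the interior boundaries.
    bounds, i = [], n
    while i > 0:
        i = dp[i][1]
        bounds.append(i)
    return sorted(bounds)[1:]  # drop the leading 0
```

For a contour built from two copies of a single template, the search recovers the boundary between them.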
On Representation of Fundamental Frequency of Speech for Prosody Analysis Using Reliability Function.
This paper presents a method that provides a new
prosodic feature, called the 'F0 reliability field', based on a
reliability function of the fundamental frequency (F0). The
proposed method does not employ any correction process
for the F0 estimation errors that occur during automatic F0
extraction. By applying this feature as a score function
for prosodic analyses such as prosodic structure estimation
or superpositional modeling of prosodic commands, this
prosodic information can be acquired with higher accuracy.
The feature has been applied to the 'F0 template matching
method', which detects accent phrase boundaries in
Japanese continuous speech. The experimental results
show that, compared to the conventional F0 contour, the
proposed feature overcomes the harmful influence caused
by F0 errors.
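The core idea of scoring against a reliability function, rather than correcting F0 errors, can be sketched as a reliability-weighted distance: frames with low reliability (likely extraction errors such as halving/doubling) are simply down-weighted. The function name and the particular weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
def weighted_distance(frames, template):
    """Reliability-weighted distance between an F0 segment and a template.
    frames: list of (f0, reliability) pairs, reliability in [0, 1].
    Unreliable frames contribute little to the score, so F0 extraction
    errors need not be corrected beforehand."""
    num = sum(r * (f - t) ** 2 for (f, r), t in zip(frames, template))
    den = sum(r for _, r in frames)
    return num / den if den > 0 else float("inf")
```

A gross F0 error with reliability 0 leaves the score untouched, whereas the same error at full reliability dominates it.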
Automated analysis of musical structure
Thesis (Ph.D.), Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2005. Includes bibliographical references (p. 93-96).

Listening to music and perceiving its structure is a fairly easy task for humans, even for listeners without formal musical training. For example, we can notice changes of notes, chords, and keys, though we might not be able to name them (segmentation based on tonality and harmonic analysis); we can parse a musical piece into phrases or sections (segmentation based on recurrent structural analysis); we can identify and memorize the main themes or the catchiest parts - hooks - of a piece (summarization based on hook analysis); and we can detect the most informative musical parts for making certain judgments (detection of salience for classification). However, building computational models to mimic these processes is a hard problem. Furthermore, the amount of digital music that has been generated and stored has already become unfathomable, and how to efficiently store and retrieve this digital content is an important real-world problem.

This dissertation presents our research on automatic music segmentation, summarization, and classification using a framework combining music cognition, machine learning, and signal processing. It inquires scientifically into the nature of human perception of music and offers a practical solution to difficult problems of machine intelligence for automatic musical content analysis and pattern discovery. Specifically, for segmentation, an HMM-based approach is used for key-change and chord-change detection, and a method for detecting the self-similarity property using approximate pattern matching is presented for recurrent structural analysis. For summarization, we investigate the locations where the catchiest parts of a musical piece normally appear and develop strategies for automatically generating music thumbnails based on this analysis. For musical salience detection, we examine methods for weighting the importance of musical segments based on the confidence of classification. Two classification techniques and their definitions of confidence are explored. The effectiveness of all our methods is demonstrated by quantitative evaluations and/or human experiments on complex real-world musical stimuli.

by Wei Chai, Ph.D.
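The recurrent structural analysis mentioned above rests on self-similarity: a matrix of pairwise similarities between per-frame feature vectors, in which repeated sections appear as stripes parallel to the main diagonal. The sketch below, with an assumed cosine similarity and hypothetical function names, shows the construction only, not the thesis's approximate pattern-matching step.

```python
def self_similarity(features):
    """Pairwise cosine-similarity matrix of per-frame feature vectors
    (e.g. chroma). Repeated musical sections show up as diagonal
    stripes off the main diagonal."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
    return [[cos(a, b) for b in features] for a in features]
```

When frame 0 and frame 2 share the same feature vector, the matrix records their similarity as 1, flagging the repetition.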
ReLyMe: Improving Lyric-to-Melody Generation by Incorporating Lyric-Melody Relationships
Lyric-to-melody generation, which generates melody according to given lyrics,
is one of the most important automatic music composition tasks. With the rapid
development of deep learning, previous works address this task with end-to-end
neural network models. However, deep learning models cannot adequately capture the
strict but subtle relationships between lyrics and melodies, which compromises
the harmony between lyrics and generated melodies. In this paper, we propose
ReLyMe, a method that incorporates Relationships between Lyrics and Melodies
from music theory to ensure the harmony between lyrics and melodies.
Specifically, we first introduce several principles that lyrics and melodies
should follow in terms of tone, rhythm, and structure relationships. These
principles are then integrated into neural network lyric-to-melody models by
adding corresponding constraints during the decoding process to improve the
harmony between lyrics and melodies. We use a series of objective and
subjective metrics to evaluate the generated melodies. Experiments on both
English and Chinese song datasets show the effectiveness of ReLyMe,
demonstrating the superiority of incorporating lyric-melody relationships from
the music domain into neural lyric-to-melody generation.

Comment: Accepted by ACMMM 2022 (oral).
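Integrating rule-based constraints into decoding, as ReLyMe does, can be illustrated by adjusting the model's scores for candidate notes at each step. This is a generic sketch of constrained decoding, not the paper's method: the function name, the penalty interface, and the greedy selection are all assumptions.

```python
import math

def constrained_step(log_probs, penalty):
    """One decoding step: combine the model's log-probabilities over
    candidate notes with rule-based penalties (e.g. for violating a
    tone, rhythm, or structure constraint), then pick the best
    candidate. penalty(i) returns a non-negative cost; 0 means
    candidate i satisfies the constraints."""
    scores = [lp - penalty(i) for i, lp in enumerate(log_probs)]
    return max(range(len(scores)), key=lambda i: scores[i])
```

A candidate the model slightly prefers is overridden when it violates a constraint, while an unpenalized decode follows the model alone.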
Prosody generation for text-to-speech synthesis
The absence of convincing intonation makes current parametric speech
synthesis systems sound dull and lifeless, even when trained on expressive
speech data. Typically, these systems use regression techniques to predict the
fundamental frequency (F0) frame by frame. This approach leads to overly
smooth pitch contours and fails to construct an appropriate prosodic structure
across the full utterance. In order to capture and reproduce larger-scale
pitch patterns, we propose a template-based approach for automatic F0 generation,
where per-syllable pitch-contour templates (from a small, automatically
learned set) are predicted by a recurrent neural network (RNN). The use of
syllable templates mitigates the over-smoothing problem and is able to reproduce
pitch patterns observed in the data. The use of an RNN, paired with connectionist
temporal classification (CTC), enables the prediction of structure in
the pitch contour spanning the entire utterance. This novel F0 prediction system
is used alongside separate LSTMs for predicting phone durations and the
other acoustic features, to construct a complete text-to-speech system. Later,
we investigate the benefits of including long-range dependencies in duration
prediction at frame-level using uni-directional recurrent neural networks.
Since prosody is a supra-segmental property, we consider an alternate approach
to intonation generation which exploits long-term dependencies of
F0 by effective modelling of linguistic features using recurrent neural networks.
For this purpose, we propose a hierarchical encoder-decoder and
multi-resolution parallel encoder where the encoder takes word and higher
level linguistic features at the input and upsamples them to phone-level
through a series of hidden layers; this is integrated into a hybrid system
that was then submitted to the Blizzard Challenge workshop. We then highlight
some of the issues in current approaches and outline a plan for future
directions of investigation along with ongoing work.
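The template-based F0 generation described above ends with rendering: each syllable's predicted template (a fixed-length pitch shape) is stretched to that syllable's duration in frames. A minimal sketch, with hypothetical function names and linear interpolation assumed as the stretching method:

```python
def render_f0(template_ids, durations, templates):
    """Render a frame-level F0 contour from per-syllable template
    choices. Each fixed-length template is stretched to its
    syllable's duration (in frames) by linear interpolation."""
    contour = []
    for tid, dur in zip(template_ids, durations):
        t = templates[tid]
        for i in range(dur):
            pos = i * (len(t) - 1) / max(dur - 1, 1)
            lo = int(pos)
            hi = min(lo + 1, len(t) - 1)
            frac = pos - lo
            contour.append(t[lo] * (1 - frac) + t[hi] * frac)
    return contour
```

Because whole pitch shapes are emitted per syllable rather than independent per-frame values, the rendered contour keeps the template's movement instead of collapsing to an average.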
Prosody and speech perception
The major concern of this thesis is with
models of speech perception. Following Gibson's
(1966) work on visual perception, it seeks to establish
whether there are sources of information in the speech
signal which can be responded to directly and which
specify the units of information of speech. The
treatment of intonation follows that of Halliday (1967)
and rhythm that of Abercrombie (1967). "Prosody"
is taken to mean both the intonational and the
rhythmic aspects of speech.

Experiments one to four show the
interdependence of prosody and grammar in the
perception of speech, although they leave open the
question of which sort of information is responded
to first. Experiments five and six, employing a
short-term memory paradigm and Morton's (1970)
"suffix effect" explanation, demonstrate that prosody
could well be responded to before grammar. Since
the previous experiments suggested a close connection
between the two, these results suggest that information
about grammatical structures may well be given
directly by prosody. In the final two experiments
the amount of prosodic information in fluent speech
that can be perceived independently of grammar and
meaning is investigated. Although tone-group
division seems to be given clearly enough by acoustic
cues, there are problems of interpretation with the
data on syllable stress assignments.

In the concluding chapter, a three-stage
model of speech perception is proposed, following
Bever (1970), but incorporating prosodic analysis as
an integral part of the processing. The obtained
experimental results are integrated within this
model.