9 research outputs found

    Rule-embedded network for audio-visual voice activity detection in live musical video streams

    Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio, but audio-based VAD struggles to focus on the target voice in noisy environments. With the help of visual information, this paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule is to coordinate the relationship between the bi-modal information and to use visual representations as a mask that filters out non-target sound. Experiments show that: 1) with the cross-modal fusion imposed by the proposed rule, the detection result of the A-V branch outperforms that of the audio branch; and 2) the bi-modal model far outperforms audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced. Comment: Submitted to ICASSP 202
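    A minimal sketch of the gating idea described above, written in PyTorch: a visual branch produces a [0, 1] mask that filters the audio representation before frame-level classification. The module structure, feature dimensions and shapes below are illustrative assumptions, not the paper's architecture.

        # Hypothetical sketch of rule-based audio-visual gating; not the
        # authors' implementation.
        import torch
        import torch.nn as nn

        class RuleEmbeddedAVVAD(nn.Module):
            def __init__(self, audio_dim=64, visual_dim=128, hidden=64):
                super().__init__()
                self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
                self.visual_enc = nn.GRU(visual_dim, hidden, batch_first=True)
                # The "rule": visual features are squashed to a [0, 1] mask
                # that gates the audio representation, suppressing sound
                # that does not match what is seen (non-target voices).
                self.mask_head = nn.Sequential(nn.Linear(hidden, hidden),
                                               nn.Sigmoid())
                self.classifier = nn.Linear(hidden, 1)  # frame-level VAD

            def forward(self, audio, visual):
                a, _ = self.audio_enc(audio)    # (batch, frames, hidden)
                v, _ = self.visual_enc(visual)  # (batch, frames, hidden)
                mask = self.mask_head(v)        # visual mask in [0, 1]
                fused = a * mask                # cross-modal gating
                return torch.sigmoid(self.classifier(fused)).squeeze(-1)

        model = RuleEmbeddedAVVAD()
        audio = torch.randn(2, 100, 64)    # 100 frames of audio features
        visual = torch.randn(2, 100, 128)  # time-aligned visual features
        probs = model(audio, visual)       # per-frame target-voice probability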

    Combining Metadata, Inferred Similarity of Content, and Human Interpretation for Managing and Listening to Music Collections

    Music services, media players and managers support content classification and access based on filtering metadata values, access statistics and user ratings. This approach fails to capture the characteristics of mood and personal history that are often the deciding factors when creating personal playlists and music collections. This dissertation presents MusicWiz, a music management environment that combines traditional metadata with spatial hypertext-based expression and automatically extracted characteristics of music to generate personalized associations among songs. MusicWiz's similarity inference engine combines the personal expression in the workspace with assessments of similarity based on artists, other metadata, lyrics and the audio signal to make suggestions and to generate playlists. An evaluation of MusicWiz with and without the workspace and suggestion capabilities showed significant differences for organizing and playlist-creation tasks: the workspace features were more valuable for organizing tasks, while the suggestion features had more value for playlist creation.
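    The similarity inference engine can be pictured as a weighted blend of per-facet similarity scores. The sketch below is a hypothetical illustration: the facet functions, weights and workspace-proximity measure are assumptions, not MusicWiz's actual engine.

        import math

        def jaccard(x, y):
            # Overlap of two term sets, in [0, 1].
            x, y = set(x), set(y)
            return len(x & y) / len(x | y) if x | y else 0.0

        def combined_similarity(a, b, sims, weights):
            # Weighted blend of facet similarities, each in [0, 1].
            total = sum(weights.values())
            return sum(weights[f] * sims[f](a, b) for f in sims) / total

        sims = {
            "artist":    lambda a, b: 1.0 if a["artist"] == b["artist"] else 0.0,
            "lyrics":    lambda a, b: jaccard(a["lyric_terms"], b["lyric_terms"]),
            # Spatial-hypertext cue: songs placed close together in the
            # workspace are treated as more similar.
            "workspace": lambda a, b: 1.0 / (1.0 + math.dist(a["pos"], b["pos"])),
        }
        weights = {"artist": 1.0, "lyrics": 2.0, "workspace": 3.0}

        s1 = {"artist": "X", "lyric_terms": {"rain", "night"}, "pos": (0, 1)}
        s2 = {"artist": "X", "lyric_terms": {"rain", "sun"}, "pos": (2, 2)}
        print(combined_similarity(s1, s2, sims, weights))  # ~0.43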

    Content-based music classification, summarization and retrieval

    Ph.D. thesis (Doctor of Philosophy)

    Are All Pixels Equally Important? Towards Multi-Level Salient Object Detection

    When we look at our environment, we primarily pay attention to visually distinctive objects. We refer to these objects as visually important or salient. Our visual system dedicates most of its processing resources to analyzing these salient objects. An analogous resource allocation can be performed in computer vision, where a salient-object detector identifies objects of interest as a pre-processing step. In the literature, salient-object detection is treated as a foreground-background segmentation problem. This approach assumes that there is no variation in object importance: only the most salient object(s) are detected as foreground. In this thesis, we challenge this conventional methodology and introduce multi-level object saliency. In other words, not all pixels are equally important.

    The well-known salient-object ground-truth datasets contain images with single objects and thus are not suited to evaluating the varying importance of objects. In contrast, many natural images contain multiple objects. The saliency levels of these objects depend on two key factors. First, the duration of eye fixation is longer for visually and semantically informative image regions, so a difference in fixation duration should reflect a variation in object importance. Second, visual perception is subjective, so the saliency of an object should be measured by averaging the perception of a group of people; objective saliency can thus be considered collective human attention. To better represent natural images and to measure the saliency levels of objects, we collect new images containing multiple objects and create a Comprehensive Object Saliency (COS) dataset. We provide ground-truth multi-level salient-object maps via eye-tracking and crowd-sourcing experiments.

    We then propose three salient-object detectors. Our first technique is based on multi-scale linear filtering and can detect salient objects of various sizes. The second method uses a bilateral-filtering approach and is capable of producing uniform object-saliency values. Our third method employs image segmentation and machine learning and is robust against image noise and texture. This segmentation-based method performs best on the existing datasets compared to our other methods and the state-of-the-art methods.

    The state-of-the-art salient-object detectors are not designed to assess the relative importance of objects or to provide multi-level saliency values. We thus introduce an Object-Awareness Model (OAM) that estimates the saliency levels of objects from their position and size. We then modify and extend our segmentation-based salient-object detector with the OAM and propose a Comprehensive Salient Object Detection (CSD) method capable of multi-level salient-object detection. We show that the CSD method significantly outperforms the state-of-the-art methods on the COS dataset.

    We use our salient-object detectors as a pre-processing step in three applications. First, we show that multi-level salient-object detection provides more relevant semantic image tags than conventional salient-object detection. Second, we employ our salient-object detector to detect salient objects in videos in real time. Third, we use multi-level object-saliency values in context-aware image compression and obtain perceptually better compression than standard JPEG at the same file size.
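    As a rough illustration of how an Object-Awareness Model might score saliency levels from position and size alone, the sketch below combines a centre-bias term with a moderate-size term; the specific weighting is an assumption, not the thesis' formula.

        import math

        def object_awareness(cx, cy, area, img_w, img_h, sigma=0.3):
            # Higher scores for objects near the image centre and of
            # moderate size; illustrative weighting only.
            dx = cx / img_w - 0.5
            dy = cy / img_h - 0.5
            center_bias = math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
            frac = area / (img_w * img_h)  # fraction of the image covered
            # Damp very large, background-like objects.
            size_term = math.sqrt(frac) * math.exp(-frac)
            return center_bias * size_term

        # Multi-level saliency: scale each object's mask by its score.
        print(object_awareness(320, 240, 40_000, 640, 480))  # central object
        print(object_awareness(40, 40, 40_000, 640, 480))    # corner object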

    Content-based music structure analysis

    Ph.D. thesis (Doctor of Philosophy)

    Modelling Perception of Large-Scale Thematic Structure in Music

    Large-scale thematic structure—the organisation of material within a musical composition—holds an important position in the Western classical music tradition and has subsequently been incorporated into many influential models of music cognition. Whether, and if so how, these structures may be perceived poses an interesting psychological problem, combining many aspects of memory, pattern recognition, and similarity judgement. However, strong experimental evidence supporting the perception of large-scale thematic structures remains limited, often owing to difficulties in measuring and disrupting their perception. To provide a basis for experimental research, this thesis develops a probabilistic computational model that characterises the possible cognitive processes underlying the perception of thematic structure. The modelling is founded on the hypothesis that thematic structures are perceptible through the statistical regularities they form, arising from the repetition and learning of material. Through the formalisation of this hypothesis, features were generated characterising compositions' intra-opus predictability, stylistic predictability, and the amount of repetition and variation of identified thematic material in both the pitch and rhythmic domains. A series of behavioural experiments examined the ability of these modelled features to predict participant responses to important indicators of thematic structure: similarity between thematic elements, identification of large-scale repetitions, perceived structural unity, sensitivity to thematic continuation, and large-scale ordering. Taken together, the results of these experiments provide converging evidence that the perception of large-scale thematic structures can be accounted for by the dynamic learning of statistical regularities within musical compositions.
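    A toy sketch of the underlying hypothesis: an incrementally trained first-order Markov model over pitches whose per-note information content falls as thematic material recurs. The actual modelling is considerably richer (multiple viewpoints, rhythm, stylistic priors); this only illustrates intra-opus predictability.

        import math
        from collections import defaultdict

        def information_contents(pitches):
            # Incremental first-order model: predict each pitch from its
            # predecessor, then update the counts (learning while listening).
            counts = defaultdict(lambda: defaultdict(int))
            ics, prev = [], None
            for p in pitches:
                if prev is not None:
                    ctx = counts[prev]
                    total = sum(ctx.values())
                    # Laplace smoothing over a 128-pitch MIDI alphabet.
                    prob = (ctx[p] + 1) / (total + 128)
                    ics.append(-math.log2(prob))  # information content
                    ctx[p] += 1
                prev = p
            return ics

        theme = [60, 62, 64, 65, 64, 62, 60]    # a short MIDI-pitch motif
        print(information_contents(theme * 3))  # IC drops on repetitions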

    Automated Extraction of Music Snippets

    Similar to image and video thumbnails, a music snippet is defined as the most representative or highlight excerpt of a music clip, and can be used for efficiently browsing large numbers of music files. A music snippet is usually part of the repeated melody, main theme or chorus. In this paper, we present an approach to extracting music snippets automatically. In our approach, the most salient segment of the music is first detected based on its occurrence frequency and energy information. Meanwhile, the boundaries of musical phrases are detected based on the estimated phrase length and the phrase-boundary confidence of each frame; these boundaries ensure that an extracted snippet does not break musical phrases. Finally, the musical phrases containing the most salient segment are extracted as the music snippet. A user study indicates that the proposed algorithm works very well on our music database.
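    The pipeline can be sketched as: score frames by repetition count times energy, take the highest-scoring segment, then snap it outward to enclosing phrase boundaries. In the sketch below the inputs (per-frame repetition counts, energies and phrase-boundary frames) are assumed to come from earlier analysis stages not shown.

        import bisect

        def extract_snippet(repeats, energies, boundaries, seg_len=20):
            # repeats/energies: per-frame values; boundaries: sorted
            # phrase-start frame indices beginning with 0.
            n = len(energies)
            scores = [r * e for r, e in zip(repeats, energies)]
            # 1) Most salient fixed-length segment: highest total score.
            best = max(range(n - seg_len + 1),
                       key=lambda i: sum(scores[i:i + seg_len]))
            # 2) Snap outward to phrase boundaries so that the snippet
            #    never cuts a musical phrase in half.
            start = boundaries[bisect.bisect_right(boundaries, best) - 1]
            j = bisect.bisect_left(boundaries, best + seg_len)
            end = boundaries[j] if j < len(boundaries) else n
            return start, end  # snippet as a frame range [start, end)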