1,267 research outputs found

Automatic tagging and geotagging in video collections and communities

Author: Jones Gareth J.F.
Larson Martha
Serdyukov Pavel
Soleymani Mohammad
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/04/2011

Automatically generated tags and geotags hold great promise to improve access to video collections and online communities. We give an overview of three tasks offered in the MediaEval 2010 benchmarking initiative, describing for each its use scenario, definition and released data set. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available information including user-generated metadata, speech recognition transcripts, audio, and visual features.

Irish Universities

DCU Online Research Access Service

Adapting Metrics for Music Similarity Using Comparative Ratings

Author: Weyde T.
Wolff D.
Publication date: 01/01/2011

City Research Online

Retrieval and Annotation of Music Using Latent Semantic Models

Author: Levy Mark
Publication venue: 'Queen Mary University of London'
Publication date: 01/01/2012

PhD. This thesis investigates the use of latent semantic models for annotation and retrieval from collections of musical audio tracks. In particular, latent semantic analysis (LSA) and aspect models (or probabilistic latent semantic analysis, pLSA) are used to index words in descriptions of music drawn from hundreds of thousands of social tags. A new discrete audio feature representation is introduced to encode musical characteristics of automatically identified regions of interest within each track, using a vocabulary of audio muswords. Finally, a joint aspect model is developed that can learn from both tagged and untagged tracks by indexing both conventional words and muswords. This model is used as the basis of a music search system that supports query by example and by keyword, and of a simple probabilistic machine annotation system. The models are evaluated by their performance in a variety of realistic retrieval and annotation tasks, motivated by applications including playlist generation, internet radio streaming, music recommendation and catalogue search. Engineering and Physical Sciences Research Council

Queen Mary Research Online

Harvesting and Structuring Social Data in Music Information Retrieval

Author: Sergio Oramas
Publication date: 03/04/2020

Abstract. An exponentially growing amount of music and sound resources are being shared by communities of users on the Internet. Social media content can be found with different levels of structuring, and the contributing users might be experts or non-experts of the domain. Harvesting and structuring this information semantically would be very useful in context-aware Music Information Retrieval (MIR). Until now, scant research in this field has taken advantage of the use of formal knowledge representations in the process of structuring information. We propose a methodology that combines Social Media Mining, Knowledge Extraction and Natural Language Processing techniques, to extract meaningful context information from social data. By using the extracted information we aim to improve retrieval, discovery and annotation of music and sound resources. We define three different scenarios to test and develop our methodology

CiteSeerX

Learning Contextualized Music Semantics from Tags via a Siamese Network

Author: Law Edith
Lebret Rémi
Marques Gonçalo
Mikolov Tomas
Singhal Amit
Thierry
Turnbull Douglas
van der Maaten Laurens
Publication date: 07/06/2016

Music information retrieval faces a challenge in modeling contextualized musical concepts formulated by a set of co-occurring tags. In this paper, we investigate the suitability of our recently proposed approach based on a Siamese neural network for addressing this challenge. By means of tag features and probabilistic topic models, the network captures contextualized semantics from tags via unsupervised learning. This leads to a distributed semantics space and a potential solution to the out-of-vocabulary problem, which has yet to be sufficiently addressed. We explore the nature of the resultant music-based semantics and address computational needs. We conduct experiments on three public music tag collections (namely CAL500, MagTag5K and Million Song Dataset) and compare our approach to a number of state-of-the-art semantics learning approaches. Comparative results suggest that this approach outperforms previous approaches in terms of semantic priming and music tag completion. Comment: 20 pages. To appear in ACM TIST: Intelligent Music Systems and Application

arXiv.org e-Print Archive

Crossref

The University of Manchester - Institutional Repository

Evaluating the usability and security of a video CAPTCHA

Author: Kluever Kurt Alfred
Publication venue: RIT Scholar Works
Publication date: 01/01/2008

A CAPTCHA is a variation of the Turing test, in which a challenge is used to distinguish humans from computers ('bots') on the internet. They are commonly used to prevent the abuse of online services. CAPTCHAs discriminate using hard artificial intelligence problems: the most common type requires a user to transcribe distorted characters displayed within a noisy image. Unfortunately, many users find them frustrating and break rates as high as 60% have been reported (for Microsoft's Hotmail). We present a new CAPTCHA in which users provide three words ('tags') that describe a video. A challenge is passed if a user's tag belongs to a set of automatically generated ground-truth tags. In an experiment, we were able to increase human pass rates for our video CAPTCHAs from 69.7% to 90.2% (184 participants over 20 videos). Under the same conditions, the pass rate for an attack submitting the three most frequent tags (estimated over 86,368 videos) remained nearly constant (5% over the 20 videos, roughly 12.9% over a separate sample of 5146 videos). Challenge videos were taken from YouTube.com. For each video, 90 tags were added from related videos to the ground-truth set; security was maintained by pruning all tags with a frequency ≥ 0.6%. Tag stemming and approximate matching were also used to increase human pass rates. Only 20.1% of participants preferred text-based CAPTCHAs, while 58.2% preferred our video-based alternative. Finally, we demonstrate how our technique for extending the ground truth tags allows for different usability/security trade-offs, and discuss how it can be applied to other types of CAPTCHAs.
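
The pass rule described in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the `tag_frequency` mapping, and the reading of the 0.6% threshold as "prune tags at or above this frequency" are assumptions made for illustration, and the stemming and approximate matching mentioned in the abstract are left out.

```python
def build_ground_truth(video_tags, related_tags, tag_frequency, max_frequency=0.006):
    """Extend a video's own tags with tags from related videos, then prune frequent ones.

    tag_frequency is a hypothetical mapping from tag to its relative frequency
    across the corpus; tags missing from it are treated as rare (frequency 0).
    """
    candidates = {t.strip().lower() for t in video_tags} | {
        t.strip().lower() for t in related_tags
    }
    # Assumption: tags at or above the threshold are pruned, so an attacker
    # who submits the most common tags gains little.
    return {t for t in candidates if tag_frequency.get(t, 0.0) < max_frequency}


def passes_challenge(user_tags, ground_truth):
    """A challenge is passed if any submitted tag is in the ground-truth set."""
    return any(t.strip().lower() in ground_truth for t in user_tags)


ground_truth = build_ground_truth(
    video_tags=["cat", "piano"],
    related_tags=["music", "funny"],
    tag_frequency={"music": 0.02, "funny": 0.001},
)
print(passes_challenge(["Dog", "Piano ", "x"], ground_truth))
print(passes_challenge(["music", "dog", "video"], ground_truth))
```

Lowering `max_frequency` prunes more tags, which trades human pass rate for resistance to frequency-based guessing; raising it does the opposite.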

RIT Scholar Works

User-centred video abstraction

Author: Darabi Kaveh
Publication date: 01/01/2015

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London. The rapid growth of digital video content in recent years has created a need for technologies that can effectively produce condensed but semantically rich versions of the input video stream. Consequently, the topic of Video Summarisation is becoming increasingly popular in the multimedia community and numerous video abstraction approaches have been proposed. These techniques can be divided into two major categories, automatic and semi-automatic, according to the level of human intervention required in the summarisation process. The fully automated methods mainly adopt low-level visual, aural and textual features alongside mathematical and statistical algorithms to extract the most significant segments of the original video. However, the effectiveness of this type of technique is restricted by a number of factors such as domain dependency, computational expense and the inability to understand the semantics of videos from low-level features. The second category of techniques, however, attempts to improve the quality of summaries by involving humans in the abstraction process to bridge the semantic gap. Nonetheless, a single user’s subjectivity and other external contributing factors such as distraction can degrade the performance of this group of approaches. Accordingly, in this thesis we have focused on the development of three effective user-centred video summarisation techniques that can be applied to different video categories and generate satisfactory results. In our first proposed approach, a novel mechanism for user-centred video summarisation is presented for scenarios in which multiple actors are employed in the video summarisation process in order to minimise the negative effects of relying on a single user.
Based on our recommended algorithm, the video frames were initially scored by a group of video annotators ‘on the fly’. This was followed by averaging these assigned scores in order to generate a singular saliency score for each video frame and, finally, the highest scored video frames alongside the corresponding audio and textual contents were extracted to be included into the final summary. The effectiveness of our approach has been assessed by comparing the video summaries generated based on our approach against the results obtained from three existing automatic summarisation tools that adopt different modalities for abstraction purposes. The experimental results indicated that our proposed method is capable of delivering remarkable outcomes in terms of Overall Satisfaction and Precision with an acceptable Recall rate, indicating the usefulness of involving user input in the video summarisation process. In an attempt to provide a better user experience, we have proposed our personalised video summarisation method with an ability to customise the generated summaries in accordance with the viewers’ preferences. Accordingly, the end-user’s priority levels towards different video scenes were captured and utilised for updating the average scores previously assigned by the video annotators. Finally, our earlier proposed summarisation method was adopted to extract the most significant audio-visual content of the video. Experimental results indicated the capability of this approach to deliver superior outcomes compared with our previously proposed method and the three other automatic summarisation tools. Finally, we have attempted to reduce the required level of audience involvement for personalisation purposes by proposing a new method for producing personalised video summaries. Accordingly, SIFT visual features were adopted to identify the video scenes’ semantic categories. 
By fusing this retrieved data with pre-built users’ profiles, personalised video abstracts can be created. Experimental results showed the effectiveness of this method in delivering superior outcomes compared to our previously recommended algorithm and the three other automatic summarisation techniques.
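
The first approach in this abstract (several annotators score each frame, the scores are averaged, and the highest-scoring frames form the summary) can be sketched as follows. This is an illustrative reading only: the `summarise` function, the dictionary layout, and the fixed `top_k` cut-off are assumptions, and the extraction of matching audio and text is omitted.

```python
from statistics import mean


def summarise(frame_scores, top_k):
    """Pick the top_k frames by mean annotator score.

    frame_scores maps a frame index to the list of scores that the
    annotators assigned to that frame (hypothetical data layout).
    """
    # One saliency score per frame: the average of all annotator scores.
    saliency = {frame: mean(scores) for frame, scores in frame_scores.items()}
    # Keep the highest-scoring frames, then restore temporal order.
    best = sorted(saliency, key=saliency.get, reverse=True)[:top_k]
    return sorted(best)


scores = {0: [1, 2], 1: [5, 4], 2: [3, 3], 3: [5, 5]}
print(summarise(scores, top_k=2))  # frames 1 and 3 have the highest means
```

Averaging across annotators is what dampens any single user's subjectivity; the personalised variants described in the abstract would adjust these averages before the selection step.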

Brunel University Research Archive

Text-based Sentiment Analysis and Music Emotion Recognition

Author: Cano Erion
Publication venue: Politecnico di Torino

Nowadays, with the expansion of social media, large amounts of user-generated texts like tweets, blog posts or product reviews are shared online. Sentiment polarity analysis of such texts has become highly attractive and is utilized in recommender systems, market predictions, business intelligence and more. We also witness deep learning techniques becoming top performers on those types of tasks. There are however several problems that need to be solved for efficient use of deep neural networks on text mining and text polarity analysis. First of all, deep neural networks are data hungry. They need to be fed with datasets that are big in size, cleaned and preprocessed as well as properly labeled. Second, the modern natural language processing concept of word embeddings as a dense and distributed text feature representation solves sparsity and dimensionality problems of the traditional bag-of-words model. Still, there are various uncertainties regarding the use of word vectors: should they be generated from the same dataset that is used to train the model, or is it better to source them from big and popular collections that work as generic text feature representations? Third, it is not easy for practitioners to find a simple and highly effective deep learning setup for various document lengths and types. Recurrent neural networks are weak with longer texts and optimal convolution-pooling combinations are not easily conceived. It is thus convenient to have generic neural network architectures that are effective and can adapt to various texts, encapsulating much of the design complexity. This thesis addresses the above problems to provide methodological and practical insights for utilizing neural networks on sentiment analysis of texts and achieving state-of-the-art results. Regarding the first problem, the effectiveness of various crowdsourcing alternatives is explored and two medium-sized and emotion-labeled song datasets are created utilizing social tags.
One of the research interests of Telecom Italia was the exploration of relations between music emotional stimulation and driving style. Consequently, a context-aware music recommender system that aims to enhance driving comfort and safety was also designed. To address the second problem, a series of experiments with large text collections of various contents and domains were conducted. Word embeddings of different parameters were exercised and results revealed that their quality is influenced (mostly but not only) by the size of texts they were created from. When working with small text datasets, it is thus important to source word features from popular and generic word embedding collections. Regarding the third problem, a series of experiments involving convolutional and max-pooling neural layers were conducted. Various patterns relating text properties and network parameters with optimal classification accuracy were observed. Combining convolutions of words, bigrams, and trigrams with regional max-pooling layers in a couple of stacks produced the best results. The derived architecture achieves competitive performance on sentiment polarity analysis of movie, business and product reviews. Given that labeled data are becoming the bottleneck of the current deep learning systems, a future research direction could be the exploration of various data programming possibilities for constructing even bigger labeled datasets. Investigation of feature-level or decision-level ensemble techniques in the context of deep neural networks could also be fruitful. Different feature types do usually represent complementary characteristics of data. Combining word embedding and traditional text features or utilizing recurrent networks on document splits and then aggregating the predictions could further increase prediction accuracy of such models

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Promoting Informal Learning Using a Context-Sensitive Recommendation Algorithm For a QRCode-based Visual Tagging System

Author: COOK HENRI
Publication date: 01/01/2010

Structured Abstract. Context: Previous work in the educational field has demonstrated that Informal Learning is an effective way to learn. Due to its casual nature it is often difficult for academic institutions to leverage this method of learning as part of a typical curriculum. Aim: This study planned to determine whether Informal Learning could be encouraged amongst learners at Durham University using an object tagging system and a context-sensitive recommendation algorithm. Method: This study creates a visual tagging system using a type of two-dimensional barcode called the QR Code and describes a tool designed to allow learners to use these ‘tags’ to learn about objects in a physical space. Information about objects features audio media as well as textual descriptions to make information appealing. A collaboratively-filtered, user-based recommendation algorithm uses elements of a learner’s context, namely their university records, physical location and data on the activities of users similar to them, to create a top-N ranked list of objects that they may find interesting. The tool is evaluated in a case study with thirty (n=30) participants taking part in a task in a public space within Durham University. The evaluation uses quantitative and qualitative data to draw conclusions as to the use of the proposed tool for individuals who wish to learn informally. Results: A majority of learners found learning about the objects around them to be an interesting practice. The recommendation system fulfilled its purpose and learners indicated that they would travel a significant distance to view objects that were presented to them. The addition of audio clips to largely textual information did not serve to increase learner interest and the implementation of this part of the system is examined in detail.
Additionally there was found to be no apparent correlation between prior computer usage and the ability to comprehend an informal learning tool such as the one described. Conclusion: Context-sensitive, mobile tools are valuable for motivating Informal Learning. Interaction with tagged objects outside of the experimental setting indicates significant learner interest even from those individuals that did not participate in the study. Learners that did participate in the experiment gained a better understanding of the world around them than they would have without the tool and would use such software again in the future

Durham e-Theses