Automatic tagging and geotagging in video collections and communities
Automatically generated tags and geotags hold great promise
to improve access to video collections and online communities.
We overview three tasks offered in the MediaEval 2010
benchmarking initiative, for each, describing its use scenario, definition and the data set released. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available information including user-generated metadata, speech recognition transcripts, audio, and visual features
Retrieval and Annotation of Music Using Latent Semantic Models
PhD. This thesis investigates the use of latent semantic models for annotation and
retrieval from collections of musical audio tracks. In particular latent semantic
analysis (LSA) and aspect models (or probabilistic latent semantic analysis,
pLSA) are used to index words in descriptions of music drawn from hundreds
of thousands of social tags. A new discrete audio feature representation is introduced
to encode musical characteristics of automatically-identified regions
of interest within each track, using a vocabulary of audio muswords. Finally a
joint aspect model is developed that can learn from both tagged and untagged
tracks by indexing both conventional words and muswords. This model is
used as the basis of a music search system that supports query by example and
by keyword, and of a simple probabilistic machine annotation system. The
models are evaluated by their performance in a variety of realistic retrieval
and annotation tasks, motivated by applications including playlist generation,
internet radio streaming, music recommendation and catalogue search.
Engineering and Physical Sciences Research Council
Harvesting and Structuring Social Data in Music Information Retrieval
Abstract. An exponentially growing amount of music and sound resources are being shared by communities of users on the Internet. Social media content can be found with different levels of structuring, and the contributing users might be experts or non-experts of the domain. Harvesting and structuring this information semantically would be very useful in context-aware Music Information Retrieval (MIR). Until now, scant research in this field has taken advantage of the use of formal knowledge representations in the process of structuring information. We propose a methodology that combines Social Media Mining, Knowledge Extraction and Natural Language Processing techniques, to extract meaningful context information from social data. By using the extracted information we aim to improve retrieval, discovery and annotation of music and sound resources. We define three different scenarios to test and develop our methodology
Learning Contextualized Music Semantics from Tags via a Siamese Network
Music information retrieval faces a challenge in modeling contextualized
musical concepts formulated by a set of co-occurring tags. In this paper, we
investigate the suitability of our recently proposed approach based on a
Siamese neural network in addressing this challenge. By means of tag features
and probabilistic topic models, the network captures contextualized semantics
from tags via unsupervised learning. This leads to a distributed semantics
space and a potential solution to the out of vocabulary problem which has yet
to be sufficiently addressed. We explore the nature of the resultant
music-based semantics and address computational needs. We conduct experiments
on three public music tag collections (CAL500, MagTag5K and the Million
Song Dataset) and compare our approach to a number of state-of-the-art
semantics learning approaches. Comparative results suggest that this approach
outperforms previous approaches in terms of semantic priming and music tag
completion.
Comment: 20 pages. To appear in ACM TIST: Intelligent Music Systems and
Application
Evaluating the usability and security of a video CAPTCHA
A CAPTCHA is a variation of the Turing test, in which a challenge is used to distinguish humans from computers ('bots') on the internet. CAPTCHAs are commonly used to prevent the abuse of online services. They discriminate using hard artificial intelligence problems: the most common type requires a user to transcribe distorted characters displayed within a noisy image. Unfortunately, many users find them frustrating, and break rates as high as 60% have been reported (for Microsoft's Hotmail). We present a new CAPTCHA in which users provide three words ('tags') that describe a video. A challenge is passed if a user's tag belongs to a set of automatically generated ground-truth tags. In an experiment, we were able to increase human pass rates for our video CAPTCHAs from 69.7% to 90.2% (184 participants over 20 videos). Under the same conditions, the pass rate for an attack submitting the three most frequent tags (estimated over 86,368 videos) remained nearly constant (5% over the 20 videos, roughly 12.9% over a separate sample of 5146 videos). Challenge videos were taken from YouTube.com. For each video, 90 tags were added from related videos to the ground-truth set; security was maintained by pruning all tags with a frequency ≥ 0.6%. Tag stemming and approximate matching were also used to increase human pass rates. Only 20.1% of participants preferred text-based CAPTCHAs, while 58.2% preferred our video-based alternative. Finally, we demonstrate how our technique for extending the ground-truth tags allows for different usability/security trade-offs, and discuss how it can be applied to other types of CAPTCHAs.
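The pass rule in this abstract can be illustrated with a short sketch. It is not the authors' implementation: the function names, the toy plural-stripping stemmer, and the sample data are all hypothetical, and it assumes that pruning removes tags whose relative frequency in a reference corpus reaches the threshold (the abstract cites 0.6%).

```python
from collections import Counter


def stem(tag):
    """Toy stemmer (assumption): lowercase and strip one trailing plural 's'."""
    tag = tag.lower().strip()
    return tag[:-1] if len(tag) > 3 and tag.endswith("s") else tag


def prune_frequent(candidates, corpus_tags, max_freq=0.006):
    """Keep only candidate tags whose relative corpus frequency is below max_freq.

    Assumes corpus_tags is non-empty.
    """
    counts = Counter(stem(t) for t in corpus_tags)
    total = sum(counts.values())
    return {stem(t) for t in candidates if counts[stem(t)] / total < max_freq}


def passes(user_tags, ground_truth):
    """A challenge is passed if any submitted tag is in the ground-truth set."""
    return any(stem(t) in ground_truth for t in user_tags)


# Hypothetical corpus of 1000 tags in which 'music' is very common.
corpus = ["music"] * 50 + ["skateboard", "kickflip", "dogs"]
corpus += ["filler%d" % i for i in range(947)]

# 'music' (5% of the corpus) is pruned; the rarer tags survive.
truth = prune_frequent(["music", "skateboard", "kickflip"], corpus)

human_ok = passes(["Skateboards", "park", "trick"], truth)
attack_ok = passes(["music", "video", "funny"], truth)
```

In this toy setup the human answer matches after stemming, while an attacker who submits frequent tags fails because those tags were pruned from the ground truth. Raising `max_freq` would admit more tags, trading security for usability.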
User-centred video abstraction
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London. The rapid growth of digital video content in recent years has created a need for technologies that can effectively produce condensed but semantically rich versions of an input video stream. Consequently, the topic of Video Summarisation is becoming increasingly popular in the multimedia community, and numerous video abstraction approaches have been proposed. These techniques can be divided into two major categories, automatic and semi-automatic, according to the level of human intervention required in the summarisation process. The fully automated methods mainly use low-level visual, aural and textual features alongside mathematical and statistical algorithms to extract the most significant segments of the original video. However, the effectiveness of this type of technique is restricted by a number of factors, such as domain dependency, computational expense and the inability to understand the semantics of videos from low-level features. The second category of techniques attempts to improve the quality of summaries by involving humans in the abstraction process to bridge the semantic gap. Nonetheless, a single user's subjectivity and other external factors such as distraction can degrade the performance of this group of approaches. Accordingly, in this thesis we have focused on the development of three effective user-centred video summarisation techniques that can be applied to different video categories and generate satisfactory results. In our first proposed approach, a novel mechanism for user-centred video summarisation is presented for scenarios in which multiple actors take part in the summarisation process, in order to minimise the negative effects of relying on a single user. 
Based on our recommended algorithm, the video frames were initially scored by a group of video annotators ‘on the fly’. This was followed by averaging these assigned scores in order to generate a singular saliency score for each video frame and, finally, the highest scored video frames alongside the corresponding audio and textual contents were extracted to be included into the final summary. The effectiveness of our approach has been assessed by comparing the video summaries generated based on our approach against the results obtained from three existing automatic summarisation tools that adopt different modalities for abstraction purposes. The experimental results indicated that our proposed method is capable of delivering remarkable outcomes in terms of Overall Satisfaction and Precision with an acceptable Recall rate, indicating the usefulness of involving user input in the video summarisation process. In an attempt to provide a better user experience, we have proposed our personalised video summarisation method with an ability to customise the generated summaries in accordance with the viewers’ preferences. Accordingly, the end-user’s priority levels towards different video scenes were captured and utilised for updating the average scores previously assigned by the video annotators. Finally, our earlier proposed summarisation method was adopted to extract the most significant audio-visual content of the video. Experimental results indicated the capability of this approach to deliver superior outcomes compared with our previously proposed method and the three other automatic summarisation tools. Finally, we have attempted to reduce the required level of audience involvement for personalisation purposes by proposing a new method for producing personalised video summaries. Accordingly, SIFT visual features were adopted to identify the video scenes’ semantic categories. 
By fusing this retrieved data with pre-built user profiles, personalised video abstracts can be created. Experimental results showed the effectiveness of this method in delivering superior outcomes compared with our previously recommended algorithm and the three other automatic summarisation techniques.
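The first method in this abstract (averaging annotators' per-frame scores and keeping the highest-scoring frames) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the function name `summarise`, the score scale, and the sample scores are hypothetical, and the extraction of the matching audio and text is omitted.

```python
def summarise(scores_by_annotator, k):
    """Average per-frame scores across annotators and pick the top-k frames.

    scores_by_annotator: one list of frame scores per annotator, all of the
    same length. Returns (selected frame indices in temporal order, mean scores).
    """
    n_annotators = len(scores_by_annotator)
    n_frames = len(scores_by_annotator[0])
    # One saliency score per frame: the mean of all annotators' scores.
    mean = [
        sum(scores[i] for scores in scores_by_annotator) / n_annotators
        for i in range(n_frames)
    ]
    # Highest-scoring frames first; ties keep their original (temporal) order.
    ranked = sorted(range(n_frames), key=lambda i: mean[i], reverse=True)
    return sorted(ranked[:k]), mean


# Hypothetical scores from three annotators for a four-frame clip.
annotators = [
    [1, 5, 3, 4],
    [2, 4, 3, 5],
    [3, 3, 3, 3],
]
frames, saliency = summarise(annotators, k=2)
```

Averaging is what dampens a single annotator's subjectivity: here the third annotator scores every frame the same, yet frames 1 and 3 are still selected on the strength of the other two.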
Text-based Sentiment Analysis and Music Emotion Recognition
Nowadays, with the expansion of social media, large amounts of user-generated
texts like tweets, blog posts or product reviews are shared online. Sentiment polarity
analysis of such texts has become highly attractive and is utilized in recommender
systems, market predictions, business intelligence and more. We also witness deep
learning techniques becoming top performers on those types of tasks. There are
however several problems that need to be solved for efficient use of deep neural
networks on text mining and text polarity analysis.
First of all, deep neural networks are data hungry. They need to be fed with
datasets that are big in size, cleaned and preprocessed as well as properly labeled.
Second, the modern natural language processing concept of word embeddings as a
dense and distributed text feature representation solves sparsity and dimensionality
problems of the traditional bag-of-words model. Still, there are various uncertainties
regarding the use of word vectors: should they be generated from the same dataset
that is used to train the model or is it better to source them from big and popular
collections that work as generic text feature representations? Third, it is not easy for
practitioners to find a simple and highly effective deep learning setup for various
document lengths and types. Recurrent neural networks are weak with longer texts
and optimal convolution-pooling combinations are not easily conceived. It is thus
convenient to have generic neural network architectures that are effective and can
adapt to various texts, encapsulating much of the design complexity.
This thesis addresses the above problems to provide methodological and practical
insights for utilizing neural networks on sentiment analysis of texts and achieving
state of the art results. Regarding the first problem, the effectiveness of various
crowdsourcing alternatives is explored and two medium-sized and emotion-labeled
song datasets are created utilizing social tags. One of the research interests of Telecom
Italia was the exploration of relations between music emotional stimulation and
driving style. Consequently, a context-aware music recommender system that aims
to enhance driving comfort and safety was also designed. To address the second
problem, a series of experiments with large text collections of various contents and
domains were conducted. Word embeddings of different parameters were exercised
and results revealed that their quality is influenced (mostly but not only) by the
size of texts they were created from. When working with small text datasets, it is
thus important to source word features from popular and generic word embedding
collections. Regarding the third problem, a series of experiments involving convolutional
and max-pooling neural layers were conducted. Various patterns relating
text properties and network parameters with optimal classification accuracy were
observed. Combining convolutions of words, bigrams, and trigrams with regional
max-pooling layers in a couple of stacks produced the best results. The derived
architecture achieves competitive performance on sentiment polarity analysis of
movie, business and product reviews.
Given that labeled data are becoming the bottleneck of the current deep learning
systems, a future research direction could be the exploration of various data programming
possibilities for constructing even bigger labeled datasets. Investigation
of feature-level or decision-level ensemble techniques in the context of deep neural
networks could also be fruitful. Different feature types do usually represent complementary
characteristics of data. Combining word embedding and traditional text
features or utilizing recurrent networks on document splits and then aggregating the
predictions could further increase prediction accuracy of such models
Promoting Informal Learning Using a Context-Sensitive Recommendation Algorithm For a QRCode-based Visual Tagging System
Structured Abstract
Context: Previous work in the educational field has demonstrated that Informal Learning is an effective way to learn. Due to its casual nature it is often difficult
for academic institutions to leverage this method of learning as part of a typical curriculum.
Aim: This study planned to determine whether Informal Learning could be encouraged amongst learners at Durham University using an object tagging system and a context-sensitive recommendation algorithm.
Method: This study creates a visual tagging system using a type of two-dimensional barcode called the QR Code and describes a tool designed to allow learners to use these ‘tags’ to learn about objects in a physical space. Information about objects features audio media as well as textual descriptions to make information appealing.
A collaboratively-filtered, user-based recommendation algorithm uses elements of a learner’s context, namely their university records, physical location and data on
the activities of users similar to them to create a top-N ranked list of objects that they may find interesting. The tool is evaluated in a case study with thirty (n=30) participants taking part in a task in a public space within Durham University. The evaluation uses quantitative and qualitative data to draw conclusions about the use
of the proposed tool for individuals who wish to learn informally.
Results: A majority of learners found learning about the objects around them to be an interesting practice. The recommendation system fulfilled its purpose and
learners indicated that they would travel a significant distance to view objects that were presented to them. The addition of audio clips to largely textual information
did not serve to increase learner interest and the implementation of this part of the system is examined in detail. Additionally there was found to be no apparent
correlation between prior computer usage and the ability to comprehend an informal learning tool such as the one described.
Conclusion: Context-sensitive, mobile tools are valuable for motivating Informal Learning. Interaction with tagged objects outside of the experimental setting
indicates significant learner interest even from those individuals that did not participate in the study. Learners that did participate in the experiment gained a better
understanding of the world around them than they would have without the tool and would use such software again in the future
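The user-based collaborative filtering step mentioned in the Method paragraph can be sketched as below. This is an assumption-laden illustration rather than the study's algorithm: it uses cosine similarity over scan logs only, leaves out the university-record and location context that the abstract describes, and every name and data value is hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse interaction vectors (dicts)."""
    shared = set(a) & set(b)
    num = sum(a[o] * b[o] for o in shared)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def recommend(target, interactions, n=3):
    """Rank objects the target has not seen by similarity-weighted votes."""
    seen_by_target = interactions[target]
    scores = {}
    for user, seen in interactions.items():
        if user == target:
            continue
        sim = cosine(seen_by_target, seen)
        for obj, weight in seen.items():
            if obj not in seen_by_target:
                scores[obj] = scores.get(obj, 0.0) + sim * weight
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda o: (-scores[o], o))[:n]


# Hypothetical scan logs: 1 means the learner scanned that object's QR code.
logs = {
    "alice": {"fossil": 1, "globe": 1},
    "bob": {"fossil": 1, "globe": 1, "telescope": 1},
    "carol": {"statue": 1},
}
top = recommend("alice", logs)
```

Because bob's history overlaps with alice's, his unseen object ranks above carol's. A context-sensitive variant could, for example, fold location or course data into the similarity function.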