
    MediaEval 2016 Predicting Media Interestingness Task

    Volume: 1739. Host publication: MediaEval 2016 Multimedia Benchmark Workshop, Working Notes Proceedings of the MediaEval 2016 Workshop. Non peer reviewed.

    RUC at MediaEval 2016: Predicting Media Interestingness Task

    ABSTRACT Measuring media interestingness has a wide range of applications, such as video recommendation. This paper presents our approach to the MediaEval 2016 Predicting Media Interestingness Task. There are two subtasks: image interestingness prediction and video interestingness prediction. For both subtasks, we use hand-crafted features and CNN features as our visual features. For the video subtask, we also extract acoustic features, including MFCC Fisher Vectors and statistical acoustic features. We train SVM and Random Forest classifiers, and early fusion is applied to combine different features. Experimental results show that combining semantic-level and low-level visual features is beneficial for image interestingness prediction. When predicting video interestingness, the audio modality has superior performance, and early fusion of the visual and audio modalities can further boost performance.
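    As a rough illustration of the early fusion described above, the sketch below concatenates per-modality feature vectors before training a single classifier. All arrays, dimensions, and labels are placeholders, not the authors' actual features.

    ```python
    # Minimal sketch of early fusion: concatenate per-modality features
    # before training a single classifier (all data here is synthetic).
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    n = 200
    visual = rng.normal(size=(n, 128))   # e.g. CNN features (placeholder)
    audio = rng.normal(size=(n, 64))     # e.g. MFCC Fisher Vector (placeholder)
    labels = rng.integers(0, 2, size=n)  # 1 = interesting, 0 = not

    fused = np.hstack([visual, audio])   # early fusion = feature concatenation

    svm = SVC(probability=True).fit(fused, labels)
    rf = RandomForestClassifier().fit(fused, labels)
    print(svm.predict_proba(fused[:3]))
    print(rf.predict_proba(fused[:3]))
    ```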

    Collecting, Analyzing and Predicting Socially-Driven Image Interestingness

    Interestingness has recently become an emerging concept for visual content assessment. However, understanding and predicting image interestingness remains challenging, as its judgment is highly subjective and usually context-dependent. In addition, existing datasets are quite small for in-depth analysis. To push forward research on this topic, a large-scale interestingness dataset (images and their associated metadata) is described in this paper and released for public use. We then propose computational models based on deep learning to predict image interestingness. We show that exploiting relevant contextual information derived from social metadata can greatly improve prediction results. Finally, we discuss some key findings and potential research directions for this emerging topic.
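    The sketch below shows one way contextual social metadata might be fused with image features in a deep model of this kind. The branch sizes and layer choices are illustrative assumptions, not the paper's architecture.

    ```python
    # Two-branch sketch: fuse image features with social metadata to
    # score interestingness. All dimensions are assumed placeholders.
    import torch
    import torch.nn as nn

    class ContextAwareModel(nn.Module):
        def __init__(self, img_dim=2048, meta_dim=16):
            super().__init__()
            self.img_branch = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU())
            self.meta_branch = nn.Sequential(nn.Linear(meta_dim, 32), nn.ReLU())
            self.head = nn.Linear(256 + 32, 1)  # interestingness score

        def forward(self, img_feat, meta_feat):
            h = torch.cat([self.img_branch(img_feat),
                           self.meta_branch(meta_feat)], dim=1)
            return torch.sigmoid(self.head(h))

    model = ContextAwareModel()
    score = model(torch.randn(4, 2048), torch.randn(4, 16))
    print(score.shape)  # torch.Size([4, 1])
    ```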

    Fine-grained Video Attractiveness Prediction Using Multimodal Deep Learning on a Large Real-world Dataset

    Nowadays, billions of videos are online, ready to be viewed and shared. Among this enormous volume, some popular videos are widely viewed while the majority attract little attention. Furthermore, within each video, different segments may attract significantly different numbers of views. This phenomenon leads to a challenging yet important problem: fine-grained video attractiveness prediction. One major obstacle is that no suitable benchmark dataset currently exists. To this end, we construct the first fine-grained video attractiveness dataset (FVAD), collected from one of the most popular video websites in the world. In total, FVAD consists of 1,019 drama episodes totaling 780.6 hours and covering different categories and a wide variety of video contents. Apart from the large number of videos, hundreds of millions of user behaviors recorded during viewing are also included, such as "view counts", "fast-forward", "fast-rewind", and so on, where "view counts" reflects video attractiveness while the other engagements capture interactions between viewers and videos. First, we demonstrate that video attractiveness and the different engagements exhibit different relationships. Second, FVAD gives us an opportunity to study the fine-grained video attractiveness prediction problem. We design different sequential models that perform video attractiveness prediction by relying solely on video contents. These models exploit the multimodal relationships between the visual and audio components of the video contents at different levels. Experimental results demonstrate the effectiveness of the proposed sequential models with different visual and audio representations, the necessity of incorporating both modalities, and the complementary behaviors of the sequential prediction models at different levels. Accepted by WWW 2018, The Big Web Track.
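    A hedged sketch of a sequential multimodal model in this spirit: an LSTM over per-segment visual and audio features that emits an attractiveness score per segment. All dimensions are illustrative assumptions; the paper's exact architecture may differ.

    ```python
    # Sequential attractiveness sketch: per-segment multimodal fusion
    # followed by an LSTM that scores each time step.
    import torch
    import torch.nn as nn

    class SeqAttractiveness(nn.Module):
        def __init__(self, vis_dim=1024, aud_dim=128, hidden=256):
            super().__init__()
            self.lstm = nn.LSTM(vis_dim + aud_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, vis, aud):           # (batch, time, dim) each
            x = torch.cat([vis, aud], dim=-1)  # fuse modalities per segment
            out, _ = self.lstm(x)
            return self.head(out).squeeze(-1)  # one score per time step

    model = SeqAttractiveness()
    scores = model(torch.randn(2, 30, 1024), torch.randn(2, 30, 128))
    print(scores.shape)  # torch.Size([2, 30])
    ```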

    Deep learning for multimedia processing-Predicting media interestingness

    This thesis explores the application of a deep learning approach to the prediction of media interestingness. Two models are investigated: one for predicting image interestingness and one for predicting video interestingness. For image interestingness, the ResNet50 network is fine-tuned to obtain the best results: first, some layers are added; next, the model is trained and fine-tuned using data augmentation, dropout, class weights, and changes to other hyperparameters. For video interestingness, features are first extracted with a 3D convolutional network, and an LSTM network is then trained and fine-tuned on these features. The final result is a binary label for each image/video: 1 for interesting, 0 for not interesting. Additionally, a confidence value is provided for each prediction. Finally, Mean Average Precision (MAP) is employed as the evaluation metric to estimate the quality of the final results.
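    A minimal sketch of the image pipeline the thesis describes, assuming a Keras-style setup: a frozen ResNet50 backbone with added layers, dropout, and (commented) class weights. Layer sizes and hyperparameters here are placeholders, not the thesis's tuned values.

    ```python
    # ResNet50 fine-tuning sketch: frozen backbone + small trainable head.
    from tensorflow.keras.applications import ResNet50
    from tensorflow.keras import layers, models

    base = ResNet50(weights="imagenet", include_top=False, pooling="avg")
    base.trainable = False  # train only the added head at first

    model = models.Sequential([
        base,
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # 1 = interesting, 0 = not
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    # Class weights counteract label imbalance (values are placeholders):
    # model.fit(train_ds, class_weight={0: 1.0, 1: 5.0}, epochs=10)
    ```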

    Annotating, Understanding, and Predicting Long-term Video Memorability

    Memorability can be regarded as a useful metric of video importance, helping to choose between competing videos. Research on the computational understanding of video memorability is, however, in its early stages. There is no dataset available for modelling purposes, and the few previous attempts provided protocols for collecting video memorability data that would be difficult to generalize. Furthermore, the computational features needed to build a robust memorability predictor remain largely undiscovered. In this article, we propose a new protocol to collect long-term video memorability annotations. We measure the memory performance of 104 participants from weeks to years after memorization to build a dataset of 660 videos for video memorability prediction. This dataset is made available to the research community. We then analyze the collected data to better understand video memorability, in particular the effects of response time, duration of memory retention, and repetition of visualization on video memorability. We finally investigate the use of various types of audio and visual features and build a computational model for video memorability prediction. We conclude that high-level visual semantics help to better predict the memorability of videos.
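    As an illustration of the modelling step, the sketch below regresses memorability annotations from precomputed visual-semantic features. The feature matrix and scores are synthetic placeholders, not the released dataset.

    ```python
    # Memorability regression sketch on synthetic stand-in data.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    features = rng.normal(size=(660, 300))      # e.g. semantic embeddings per video
    memorability = rng.uniform(0, 1, size=660)  # long-term memorability scores

    model = Ridge(alpha=1.0)
    print(cross_val_score(model, features, memorability, cv=5, scoring="r2"))
    ```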

    TUD-MMC at MediaEval 2016: Predicting Media Interestingness Task

    ABSTRACT This working notes paper describes the TUD-MMC entry to the MediaEval 2016 Predicting Media Interestingness Task. Noting that the nature of movie trailer shots differs from that of preceding tasks on image and video interestingness, we propose two baseline heuristic approaches based on the clear occurrence of people. MAP scores obtained on the development and test sets suggest that our approaches cover a limited but non-marginal subset of the interestingness spectrum. Most strikingly, our scores on the Image and Video Subtasks are comparable to or better than those obtained when evaluating the ground-truth annotations of the Image Subtask against the Video Subtask and vice versa.
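    One plausible reading of a people-based heuristic, sketched below with an off-the-shelf face detector: score a frame by the area covered by detected faces. This is an assumption for illustration; the actual TUD-MMC heuristic may differ in detail.

    ```python
    # People-occurrence heuristic sketch using OpenCV's Haar cascade.
    import cv2

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def interestingness_score(frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        # Score = fraction of the frame covered by detected faces.
        area = sum(w * h for (_, _, w, h) in faces)
        return area / (gray.shape[0] * gray.shape[1])
    ```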

    BigVid at MediaEval 2016: Predicting Interestingness in Images and Videos

    ABSTRACT Despite growing research interest, predicting the interestingness of images and videos remains an open challenge. The main obstacles are the diversity and complexity of video content and the highly subjective, varying judgments of interestingness across persons. In the MediaEval 2016 Predicting Media Interestingness Task, our BigVid@Fudan team submitted five runs exploring various methods of extracting and modeling low-level features (from the visual and audio modalities) and hundreds of high-level semantic attributes, and of fusing these features for classification. We investigated not only SVM (Support Vector Machine) models but also recent deep learning methods. The five runs used SVM/Ranking-SVM (Run1, Run3, and Run4) and Deep Neural Networks (Run2 and Run5), respectively. We achieved a mean average precision of 0.23 for the image subtask and 0.15 for the video subtask. Furthermore, our experiments revealed some interesting and potentially useful insights into this task. For example, our results show that visual features and high-level attributes are complementary to each other.
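    For reference, the MAP figures reported above are the mean of per-item average precision. The sketch below uses toy labels and confidences to show the computation.

    ```python
    # Mean Average Precision: average precision per video, then the mean.
    import numpy as np
    from sklearn.metrics import average_precision_score

    # One entry per video: ground-truth labels and predicted confidences
    # for its shots/frames (toy values for illustration).
    runs = [
        (np.array([1, 0, 0, 1]), np.array([0.9, 0.2, 0.4, 0.7])),
        (np.array([0, 1, 0, 0]), np.array([0.1, 0.8, 0.3, 0.2])),
    ]
    ap = [average_precision_score(y, s) for y, s in runs]
    print("MAP:", np.mean(ap))
    ```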