9 research outputs found

    An Interface for Image Retrieval and Its Extension to Video Retrieval

    Semantic video retrieval is still an open problem. While many works analyze video content, few present the retrieval results to the user and support interaction. In this article, we first propose a 2D graphical interface adapted to image retrieval that enables bidirectional communication: from the system to the user, to visualize the current retrieval results, and from the user to the system, so that the user can provide relevance feedback to refine the query. The visualization places the query image at the centre of the screen and the result images in a 2D plane, with distances reflecting the similarity between each image and the query. We also propose a validation-based relevance feedback method for image retrieval within this interface. The approach has been implemented and tested on several image databases. Second, we analyze the extension of this approach to video retrieval: key frames are extracted from each video and used both to represent the video retrieval results and to perform relevance feedback.
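    The query-centred layout is straightforward to prototype. Below is a minimal sketch, assuming precomputed image descriptors and Euclidean distance as the similarity measure (neither is fixed by the abstract); the relevance-feedback validation step and the video extension are omitted.

```python
# Minimal sketch of a query-centred 2D retrieval display, assuming
# precomputed feature vectors; descriptors here are synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
query = rng.normal(size=64)            # hypothetical query-image descriptor
results = rng.normal(size=(10, 64))    # hypothetical result descriptors

# Similarity as Euclidean distance in feature space (one plausible choice).
dists = np.linalg.norm(results - query, axis=1)

# Place the query at the origin and spread results on a circle whose
# radius encodes dissimilarity, as in the query-centred layout above.
angles = np.linspace(0, 2 * np.pi, len(results), endpoint=False)
radii = dists / dists.max()
xs, ys = radii * np.cos(angles), radii * np.sin(angles)

plt.scatter([0], [0], c="red", marker="*", s=200, label="query")
plt.scatter(xs, ys, c="blue", label="results")
plt.legend()
plt.gca().set_aspect("equal")
plt.show()
```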

    Feature Fusion Based Audio-Visual Speaker Identification Using Hidden Markov Model under Different Lighting Variations

    This paper proposes a feature-fusion-based Audio-Visual Speaker Identification (AVSI) system evaluated under varying illumination conditions. Among the possible fusion strategies, feature-level fusion is used for the proposed AVSI system, with a Hidden Markov Model (HMM) for learning and classification. Since the feature level retains richer information about the raw biometric data than later stages, integration at this level is expected to give better authentication results. In this paper, Mel Frequency Cepstral Coefficients (MFCCs) and Linear Prediction Cepstral Coefficients (LPCCs) are combined to form the audio feature vectors, and Active Shape Model (ASM) based appearance and shape facial features are concatenated to form the visual feature vectors. These combined audio and visual features are then fused at the feature level. Principal Component Analysis (PCA) is used to reduce the dimensionality of the audio and visual feature vectors. The VALID audio-visual database, which covers four different illumination levels, is used to measure the performance of the proposed system. Experimental results demonstrate the effectiveness of the proposed audio-visual speaker identification system with various combinations of audio and visual features.
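    As a rough illustration of the pipeline (concatenate per-frame audio and visual vectors, reduce with PCA, train one HMM per speaker, identify by maximum log-likelihood), here is a minimal sketch using hmmlearn and scikit-learn, with synthetic frames standing in for the real MFCC+LPCC and ASM features; all dimensions and HMM settings are arbitrary assumptions, not the paper's.

```python
# Minimal sketch: feature-level audio-visual fusion + PCA + one HMM per
# speaker. Synthetic features stand in for MFCC+LPCC and ASM vectors.
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn import hmm

rng = np.random.default_rng(0)
n_frames, d_audio, d_visual = 200, 24, 30

def fake_sequence(offset):
    audio = rng.normal(offset, 1.0, size=(n_frames, d_audio))
    visual = rng.normal(offset, 1.0, size=(n_frames, d_visual))
    return np.hstack([audio, visual])      # feature-level fusion

train = {spk: fake_sequence(spk) for spk in range(3)}

# PCA reduces the fused vectors, fitted on all training frames.
pca = PCA(n_components=16).fit(np.vstack(list(train.values())))

models = {}
for spk, seq in train.items():
    m = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=20)
    m.fit(pca.transform(seq))
    models[spk] = m

# Identification: pick the speaker model with the highest log-likelihood.
test = pca.transform(fake_sequence(1))
scores = {spk: m.score(test) for spk, m in models.items()}
print("identified speaker:", max(scores, key=scores.get))  # expected: 1
```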

    Detection of Hate Speech in Videos Using Machine Learning

    With the growth of the internet and social media, people have multiple platforms on which to share their thoughts and opinions freely. However, this freedom of speech is misused to direct hate towards individuals or groups of people because of their race, religion, gender, etc. The rise of hate speech has led to conflicts and cases of cyberbullying, prompting many organizations to look for optimal solutions to this problem. Developments in machine learning and deep learning have piqued the interest of researchers, leading them to design and implement solutions for detecting hate speech. Currently, machine learning techniques are applied mainly to textual data. With the widespread use of video-sharing sites, there is a need to detect hate speech in videos as well. This project classifies videos as normal or hateful based on their spoken content. The video dataset is built with a crawler that searches for and downloads videos using offensive words as keywords. The audio is extracted from the videos and converted to text with a speech-to-text converter to obtain a transcript of each video. Experiments are conducted by training four models on three different feature sets extracted from the dataset, and the models are compared on the specified evaluation metrics. The results indicate that the random forest classifier delivers the best performance in classifying videos.
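    A minimal sketch of the transcript-classification step follows, assuming TF-IDF features (one plausible choice; the project's three feature sets are not specified in the abstract) feeding a random forest. The transcripts below are invented toy examples.

```python
# Minimal sketch: classify video transcripts as normal (0) or hateful (1)
# with TF-IDF features and a random forest. Toy data, assumed features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

transcripts = [
    "thanks everyone for watching this tutorial",
    "they do not deserve to live here",    # toy hateful example
    "today we review the new phone",
    "those people are all criminals",      # toy hateful example
]
labels = [0, 1, 0, 1]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(transcripts, labels)
print(clf.predict(["we hate them all", "great video as always"]))
```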

    Event detection in soccer video based on audio/visual keywords

    Master's thesis (Master of Science)

    Integration of Multimodal Features for Video Scene Classification Based on HMM

    Along with the advances in multimedia and internet technology, a huge amount of data, including digital video and audio, is generated daily, and tools for efficient indexing and retrieval are indispensable. With multi-modal information present in the data, effective integration is necessary and is still a challenging problem. In this paper, we present four different methods for integrating audio and visual information for video classification based on Hidden Markov Models, considering the classification of a video sequence into one of a few predetermined scene types. Our results show significant improvement over using a single modality.
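    The abstract does not detail the four integration methods, but the two generic ends of the design space are easy to sketch: early fusion, which trains one HMM on concatenated audio-visual frames, and late fusion, which trains one HMM per modality and combines log-likelihoods. The sketch below uses synthetic features and hmmlearn; it illustrates the general idea, not the paper's specific methods.

```python
# Minimal sketch: early vs. late audio-visual fusion with HMMs.
# Per-frame features are synthetic stand-ins.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
audio = rng.normal(size=(300, 12))   # stand-in audio features per frame
visual = rng.normal(size=(300, 20))  # stand-in visual features per frame

# Early fusion: a single HMM over concatenated features.
early = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
early.fit(np.hstack([audio, visual]))

# Late fusion: independent per-modality HMMs whose log-likelihoods are
# summed, i.e. a product of modality likelihoods.
hmm_a = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20).fit(audio)
hmm_v = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20).fit(visual)

test_a, test_v = rng.normal(size=(50, 12)), rng.normal(size=(50, 20))
print("early-fusion score:", early.score(np.hstack([test_a, test_v])))
print("late-fusion score: ", hmm_a.score(test_a) + hmm_v.score(test_v))
```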

    Acoustic event detection and localization using distributed microphone arrays

    Automatic acoustic scene analysis is a complex task that involves several functionalities: detection (time), localization (space), separation, recognition, etc. This thesis focuses on both acoustic event detection (AED) and acoustic source localization (ASL) when several sources may be simultaneously present in a room; in particular, the experimental work is carried out in a meeting-room scenario. Unlike previous works, which either employed models of all possible sound combinations or additionally used video signals, this thesis tackles the time-overlapping sound problem by exploiting the signal diversity that results from using multiple microphone-array beamformers. The core of this work is a computationally efficient approach consisting of three processing stages. In the first, a set of (null-)steering beamformers carries out diverse partial signal separations, using multiple arbitrarily located linear microphone arrays, each composed of a small number of microphones. In the second stage, each beamformer output goes through a classification step that uses models for all the targeted sound classes (HMM-GMM in the experiments). In the third stage, the classifier scores, whether intra- or inter-array, are combined using a probabilistic criterion (such as MAP) or a machine-learning fusion technique (the fuzzy integral (FI) in the experiments). This processing scheme is applied to a set of problems of increasing complexity, defined by the assumptions made regarding the identities (plus time endpoints) and/or positions of the sounds. The thesis starts with the problem of unambiguously mapping identities to positions, continues with AED (positions assumed) and ASL (identities assumed), and ends with the integration of AED and ASL in a single system that needs no assumption about identities or positions. The evaluation experiments are carried out in a meeting-room scenario where two sources are temporally overlapped; one is always speech and the other is an acoustic event from a pre-defined set. Two databases are used: one produced by merging signals actually recorded in the UPC's department smart-room, and one consisting of overlapping sound signals directly recorded in the same room in a rather spontaneous way. The experimental results with a single array show that the proposed detection system performs better than either the model-based system or a blind-source-separation-based system. Moreover, the product-rule combination and the FI-based fusion of the scores from the multiple arrays improve the accuracies further, and the posterior position assignment is performed with a very small error rate. Regarding ASL, and assuming an accurate AED output, the 1-source localization performance of the proposed system is slightly better than that of the widely used SRP-PHAT system working in an event-based mode, and significantly better in the more complex 2-source scenario. Finally, although the joint system suffers a slight degradation in classification accuracy with respect to the case where the source positions are known, it has the advantage of carrying out the two tasks, recognition and localization, with a single system, and it allows the inclusion of information about the prior probabilities of the source positions.
    It is also worth noting that, although the acoustic scenario used for experimentation is rather limited, the approach and its formalism were developed for a general case in which the number and identities of the sources are not constrained.
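    As an illustration of the third stage, here is a minimal sketch of the product-rule (MAP-style) combination of per-array classifier scores. The class names and posteriors are invented; in the thesis they would come from HMM-GMM classifiers fed by the beamformer outputs.

```python
# Minimal sketch: product-rule fusion of class posteriors from several
# microphone-array classifiers, with a MAP decision under a uniform prior.
import numpy as np

classes = ["speech", "door_slam", "keyboard", "phone_ring"]  # hypothetical

# Hypothetical per-array class posteriors (rows: arrays, cols: classes).
posteriors = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.15, 0.55, 0.15, 0.15],
    [0.30, 0.40, 0.20, 0.10],
])

# Product rule in the log domain for numerical stability, then renormalise.
log_joint = np.sum(np.log(posteriors), axis=0)
joint = np.exp(log_joint - log_joint.max())
joint /= joint.sum()

print(dict(zip(classes, joint.round(3))))
print("detected event:", classes[int(np.argmax(joint))])  # -> door_slam
```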

    Semantic indexing of images and videos by active learning

    The general framework of this thesis is semantic indexing and information retrieval applied to multimedia documents. More specifically, we are interested in the semantic indexing of concepts in images and videos by active learning approaches, which we use to build annotated corpora. Throughout this thesis, we show that the main difficulties of this task are often related to the semantic gap and to the class-imbalance problem in large-scale datasets, where most concepts are sparse. For corpus annotation, the main objective of using active learning is to increase system performance while using as few labeled samples as possible, thereby minimizing the cost of labeling data (e.g. money and time). In this thesis, we contribute at several levels of multimedia indexing and propose three approaches that outperform state-of-the-art systems: i) a multi-learner approach (ML) that overcomes the class-imbalance problem in large-scale datasets; ii) a re-ranking method that improves video indexing; iii) an evaluation of power-law normalization and PCA, showing their effectiveness in multimedia indexing.
    Furthermore, we propose the ALML approach, which combines the multi-learner with active learning, as well as an incremental method that speeds up the ALML approach. Moreover, we propose an active cleaning approach, which addresses the quality of annotations. The proposed methods were all validated through several experiments conducted and evaluated on large-scale collections of the well-known international benchmark TRECVID. Finally, we present our real-world annotation system based on active learning, which was used to conduct the annotations of the development set of the TRECVID 2011 campaign, and our participation in the semantic indexing task of that campaign, in which we ranked 3rd out of 19 participants.
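    The thesis's ML/ALML machinery cannot be reconstructed from the abstract, but the underlying active-learning loop is generic: train on the labeled pool, query the most uncertain unlabeled sample, add its label, repeat. Below is a minimal uncertainty-sampling sketch, with an arbitrary classifier and a deliberately class-imbalanced synthetic dataset (echoing the class-imbalance theme); none of it is the thesis's actual setup.

```python
# Minimal sketch: uncertainty-sampling active learning on an imbalanced
# synthetic dataset. Classifier and query strategy are assumed choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
labeled = list(range(20))                       # small seed annotation
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(10):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    # Query the pool sample whose posterior is closest to 0.5.
    uncertainty = np.abs(proba[:, 1] - 0.5)
    pick = pool.pop(int(np.argmin(uncertainty)))
    labeled.append(pick)                        # simulate the annotator

print("labeled set size:", len(labeled))
```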

    Multimodal approaches using deep learning and unimodal approaches using machine learning for music emotion recognition

    Advisor: Profa. Dra. Denise Fukumi Tsunoda. Co-advisor: Profa. Dra. Marília Nunes Silva. Doctoral thesis, Universidade Federal do Paraná, Setor de Ciências Sociais Aplicadas, Programa de Pós-Graduação em Gestão da Informação. Defense: Curitiba, 30/08/2023. Includes references.
    Abstract: This research was motivated by the significance of the relationship between music and emotion in human life, spanning from leisure to scientific study. Although the emotional organization of music is intrinsic to human nature, the automatic recognition of musical emotions faces challenges and remains a complex topic in music information retrieval. Within this context, the central purpose of this thesis was to investigate whether multimodal approaches, involving information from different sources and deep learning architectures, can outperform unimodal approaches based on machine learning algorithms. This inquiry arose from the lack of multimodal strategies in the field and the prospect of improving on the classification results reported in related research. With five specific objectives, this research addressed the identification of a cognitive model of emotions, the definition of modalities, the construction of multimodal databases, the comparison of deep learning architectures, and the comparative evaluation of multimodal approaches against unimodal approaches using traditional machine learning algorithms. The analysis of the results demonstrated that the multimodal approaches achieved superior performance in various classification scenarios compared to the unimodal strategies. These findings contribute to understanding the effectiveness of multimodal approaches and deep learning architectures in recognizing emotions in music. Additionally, the research emphasizes the need for attention to emotional models and metadata on online platforms, in order to avoid biases and noise. This thesis offers relevant contributions to the field of music emotion recognition, particularly in the development of multimodal databases, the evaluation of deep learning architectures for tabular problems, experimental protocols, and approaches focused on musical cognition. The systematic comparison between multimodal and unimodal approaches highlights the advantages of the former, encouraging new research in this field.
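    A minimal sketch of the kind of multimodal fusion network compared in such work: one branch per modality, fused by concatenation before an emotion classifier. The modalities (audio features and lyrics embeddings), the dimensions, and the four-class target are all assumptions for illustration, not the thesis's actual setup.

```python
# Minimal sketch: two-branch multimodal network with concatenation fusion
# for music emotion classification. Inputs and sizes are hypothetical.
import torch
import torch.nn as nn

class MultimodalEmotionNet(nn.Module):
    def __init__(self, d_audio=40, d_lyrics=64, n_emotions=4):
        super().__init__()
        self.audio = nn.Sequential(nn.Linear(d_audio, 32), nn.ReLU())
        self.lyrics = nn.Sequential(nn.Linear(d_lyrics, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(),
                                  nn.Linear(32, n_emotions))

    def forward(self, audio, lyrics):
        # Concatenation fusion of the per-modality representations.
        fused = torch.cat([self.audio(audio), self.lyrics(lyrics)], dim=1)
        return self.head(fused)

model = MultimodalEmotionNet()
logits = model(torch.randn(8, 40), torch.randn(8, 64))
print(logits.shape)  # torch.Size([8, 4])
```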