160 research outputs found

    Robust Audio-Codebooks for Large-Scale Event Detection in Consumer Videos

    Get PDF
    Abstract In this paper we present our audio based system for detecting "events" within consumer videos (e.g. You Tube) and report our experiments on the TRECVID Multimedia Event Detection (MED) task and development data. Codebook or bag-of-words models have been widely used in text, visual and audio domains and form the state-of-the-art in MED tasks. The overall effectiveness of these models on such datasets depends critically on the choice of low-level features, clustering approach, sampling method, codebook size, weighting schemes and choice of classifier. In this work we empirically evaluate several approaches to model expressive and robust audio codebooks for the task of MED while ensuring compactness. First, we introduce the Large Scale Pooling Features (LSPF) and Stacked Cepstral Features for encoding local temporal information in audio codebooks. Second, we discuss several design decisions for generating and representing expressive audio codebooks and show how they scale to large datasets. Third, we apply text based techniques like Latent Dirichlet Allocation (LDA) to learn acoustictopics as a means of providing compact representation while maintaining performance. By aggregating these decisions into our model, we obtained 11% relative improvement over our baseline audio systems

    Visual Concept Detection in Images and Videos

    Get PDF
    The rapidly increasing proliferation of digital images and videos leads to a situation where content-based search in multimedia databases becomes more and more important. A prerequisite for effective image and video search is to analyze and index media content automatically. Current approaches in the field of image and video retrieval focus on semantic concepts serving as an intermediate description to bridge the “semantic gap” between the data representation and the human interpretation. Due to the large complexity and variability in the appearance of visual concepts, the detection of arbitrary concepts represents a very challenging task. In this thesis, the following aspects of visual concept detection systems are addressed: First, enhanced local descriptors for mid-level feature coding are presented. Based on the observation that scale-invariant feature transform (SIFT) descriptors with different spatial extents yield large performance differences, a novel concept detection system is proposed that combines feature representations for different spatial extents using multiple kernel learning (MKL). A multi-modal video concept detection system is presented that relies on Bag-of-Words representations for visual and in particular for audio features. Furthermore, a method for the SIFT-based integration of color information, called color moment SIFT, is introduced. Comparative experimental results demonstrate the superior performance of the proposed systems on the Mediamill and on the VOC Challenge. Second, an approach is presented that systematically utilizes results of object detectors. Novel object-based features are generated based on object detection results using different pooling strategies. For videos, detection results are assembled to object sequences and a shot-based confidence score as well as further features, such as position, frame coverage or movement, are computed for each object class. These features are used as additional input for the support vector machine (SVM)-based concept classifiers. Thus, other related concepts can also profit from object-based features. Extensive experiments on the Mediamill, VOC and TRECVid Challenge show significant improvements in terms of retrieval performance not only for the object classes, but also in particular for a large number of indirectly related concepts. Moreover, it has been demonstrated that a few object-based features are beneficial for a large number of concept classes. On the VOC Challenge, the additional use of object-based features led to a superior performance for the image classification task of 63.8% mean average precision (AP). Furthermore, the generalization capabilities of concept models are investigated. It is shown that different source and target domains lead to a severe loss in concept detection performance. In these cross-domain settings, object-based features achieve a significant performance improvement. Since it is inefficient to run a large number of single-class object detectors, it is additionally demonstrated how a concurrent multi-class object detection system can be constructed to speed up the detection of many object classes in images. Third, a novel, purely web-supervised learning approach for modeling heterogeneous concept classes in images is proposed. Tags and annotations of multimedia data in the WWW are rich sources of information that can be employed for learning visual concepts. The presented approach is aimed at continuous long-term learning of appearance models and improving these models periodically. For this purpose, several components have been developed: a crawling component, a multi-modal clustering component for spam detection and subclass identification, a novel learning component, called “random savanna”, a validation component, an updating component, and a scalability manager. Only a single word describing the visual concept is required to initiate the learning process. Experimental results demonstrate the capabilities of the individual components. Finally, a generic concept detection system is applied to support interdisciplinary research efforts in the field of psychology and media science. The psychological research question addressed in the field of behavioral sciences is, whether and how playing violent content in computer games may induce aggression. Therefore, novel semantic concepts most notably “violence” are detected in computer game videos to gain insights into the interrelationship of violent game events and the brain activity of a player. Experimental results demonstrate the excellent performance of the proposed automatic concept detection approach for such interdisciplinary research

    Bag-of-words representations for computer audition

    Get PDF
    Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as, e.g., the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning, taking the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue is nevertheless that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms as they require a static and fixed-length input. Moreover, also for dynamic classifiers, compressing the information of the LLDs over a temporal block by summarising them can be beneficial. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired from the bag-of-words method in natural language processing, forming a histogram of the terms present in a document. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis. The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits, while being able to preserve the advantages of functionals, such as data-independence. Furthermore, it is shown that both representations are complementary and their fusion improves the performance of a machine listening system.Maschinelles Hören ist im täglichen Leben allgegenwärtig, mit Anwendungen, die von personalisierten virtuellen Agenten bis hin zum Gesundheitswesen reichen. Aus technischer Sicht besteht das Ziel darin, den Inhalt eines Audiosignals hinsichtlich einer Auswahl definierter Labels robust zu klassifizieren. Die Labels beschreiben bspw. die akustische Umgebung der Aufnahme, eine medizinische Diagnose oder - im Falle von Sprache - was gesagt wird oder wie es gesagt wird. Übliche Ansätze hierzu verwenden maschinelles Lernen, d.h., es werden anwendungsspezifische Modelle anhand von Beispieldaten trainiert. Trotz jüngster Erfolge beim Ende-zu-Ende-Lernen mittels neuronaler Netze, in welchen das unverarbeitete Audiosignal als Eingabe benutzt wird, sind Modelle, die auf definierten akustischen Merkmalen basieren, in manchen Bereichen weiterhin überlegen. Dies gilt im Besonderen für Einsatzzwecke, für die nur wenige Daten vorhanden sind. Allerdings besteht dabei das Problem, dass Zeitfolgen von akustischen Deskriptoren in viele Algorithmen des maschinellen Lernens nicht direkt eingespeist werden können, da diese eine statische Eingabe fester Länge benötigen. Außerdem kann es auch für dynamische (zeitabhängige) Klassifikatoren vorteilhaft sein, die Deskriptoren über ein gewisses Zeitintervall zusammenzufassen. Jedoch hat die Art der Merkmalsdarstellung einen grundlegenden Einfluss auf die Leistungsfähigkeit des Modells. In der vorliegenden Dissertation wird der sogenannte Bag-of-Audio-Words-Ansatz (BoAW) als Alternative zum Standardansatz der statistischen Funktionale untersucht. BoAW ist eine Methode des unüberwachten Lernens von Merkmalsdarstellungen, die von der Bag-of-Words-Methode in der Computerlinguistik inspiriert wurde, bei der ein Textdokument als Histogramm der vorkommenden Wörter beschrieben wird. Das Toolkit openXBOW wird vorgestellt, welches systematisches Training und Optimierung dieser Merkmalsdarstellungen - vereinheitlicht für beliebige Modalitäten mit numerischen oder symbolischen Deskriptoren - erlaubt. Es werden einige Experimente zum BoAW-Ansatz durchgeführt und diskutiert, die sich auf eine große Zahl möglicher Anwendungen und entsprechende Datensätze beziehen, von der Emotionserkennung in gesprochener Sprache bis zur medizinischen Diagnostik. Die Auswertungen beinhalten einen Vergleich verschiedener akustischer Deskriptoren und Konfigurationen der BoAW-Methode. Die wichtigsten Erkenntnisse sind, dass BoAW-Merkmalsvektoren eine geeignete Alternative zu statistischen Funktionalen darstellen, gewisse Vorzüge bieten und gleichzeitig wichtige Eigenschaften der Funktionale, wie bspw. die Datenunabhängigkeit, erhalten können. Zudem wird gezeigt, dass beide Darstellungen komplementär sind und eine Fusionierung die Leistungsfähigkeit eines Systems des maschinellen Hörens verbessert

    Concept-based video search with the PicSOM multimedia retrieval system

    Get PDF

    Using the Bag-of-Audio-Words approach for emotion recognition

    Get PDF
    The problem of varying length recordings is a well-known issue in paralinguistics. We investigated how to resolve this problem using the bag-of-audio-words feature extraction approach. The steps of this technique involve preprocessing, clustering, quantization and normalization. The bag-of-audio-words technique is competitive in the area of speech emotion recognition, but the method has several parameters that need to be precisely tuned for good efficiency. The main aim of our study was to analyse the effectiveness of bag-of-audio-words method and try to find the best parameter values for emotion recognition. We optimized the parameters one-by-one, but built on the results of each other. We performed the feature extraction, using openSMILE. Next we transformed our features into same-sized vectors with openXBOW, and finally trained and evaluated SVM models with 10-fold-crossvalidation and UAR. In our experiments, we worked with a Hungarian emotion database. According to our results, the emotion classification performance improves with the bag-of-audio-words feature representation. Not all the BoAW parameters have the optimal settings but later we can make clear recommendations on how to set bag-of-audio-words parameters for emotion detection tasks

    Análise de propriedades intrínsecas e extrínsecas de amostras biométricas para detecção de ataques de apresentação

    Get PDF
    Orientadores: Anderson de Rezende Rocha, Hélio PedriniTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Os recentes avanços nas áreas de pesquisa em biometria, forense e segurança da informação trouxeram importantes melhorias na eficácia dos sistemas de reconhecimento biométricos. No entanto, um desafio ainda em aberto é a vulnerabilidade de tais sistemas contra ataques de apresentação, nos quais os usuários impostores criam amostras sintéticas, a partir das informações biométricas originais de um usuário legítimo, e as apresentam ao sensor de aquisição procurando se autenticar como um usuário válido. Dependendo da modalidade biométrica, os tipos de ataque variam de acordo com o tipo de material usado para construir as amostras sintéticas. Por exemplo, em biometria facial, uma tentativa de ataque é caracterizada quando um usuário impostor apresenta ao sensor de aquisição uma fotografia, um vídeo digital ou uma máscara 3D com as informações faciais de um usuário-alvo. Em sistemas de biometria baseados em íris, os ataques de apresentação podem ser realizados com fotografias impressas ou com lentes de contato contendo os padrões de íris de um usuário-alvo ou mesmo padrões de textura sintéticas. Nos sistemas biométricos de impressão digital, os usuários impostores podem enganar o sensor biométrico usando réplicas dos padrões de impressão digital construídas com materiais sintéticos, como látex, massa de modelar, silicone, entre outros. Esta pesquisa teve como objetivo o desenvolvimento de soluções para detecção de ataques de apresentação considerando os sistemas biométricos faciais, de íris e de impressão digital. As linhas de investigação apresentadas nesta tese incluem o desenvolvimento de representações baseadas nas informações espaciais, temporais e espectrais da assinatura de ruído; em propriedades intrínsecas das amostras biométricas (e.g., mapas de albedo, de reflectância e de profundidade) e em técnicas de aprendizagem supervisionada de características. Os principais resultados e contribuições apresentadas nesta tese incluem: a criação de um grande conjunto de dados publicamente disponível contendo aproximadamente 17K videos de simulações de ataques de apresentações e de acessos genuínos em um sistema biométrico facial, os quais foram coletados com a autorização do Comitê de Ética em Pesquisa da Unicamp; o desenvolvimento de novas abordagens para modelagem e análise de propriedades extrínsecas das amostras biométricas relacionadas aos artefatos que são adicionados durante a fabricação das amostras sintéticas e sua captura pelo sensor de aquisição, cujos resultados de desempenho foram superiores a diversos métodos propostos na literature que se utilizam de métodos tradicionais de análise de images (e.g., análise de textura); a investigação de uma abordagem baseada na análise de propriedades intrínsecas das faces, estimadas a partir da informação de sombras presentes em sua superfície; e, por fim, a investigação de diferentes abordagens baseadas em redes neurais convolucionais para o aprendizado automático de características relacionadas ao nosso problema, cujos resultados foram superiores ou competitivos aos métodos considerados estado da arte para as diferentes modalidades biométricas consideradas nesta tese. A pesquisa também considerou o projeto de eficientes redes neurais com arquiteturas rasas capazes de aprender características relacionadas ao nosso problema a partir de pequenos conjuntos de dados disponíveis para o desenvolvimento e a avaliação de soluções para a detecção de ataques de apresentaçãoAbstract: Recent advances in biometrics, information forensics, and security have improved the recognition effectiveness of biometric systems. However, an ever-growing challenge is the vulnerability of such systems against presentation attacks, in which impostor users create synthetic samples from the original biometric information of a legitimate user and show them to the acquisition sensor seeking to authenticate themselves as legitimate users. Depending on the trait used by the biometric authentication, the attack types vary with the type of material used to build the synthetic samples. For instance, in facial biometric systems, an attempted attack is characterized by the type of material the impostor uses such as a photograph, a digital video, or a 3D mask with the facial information of a target user. In iris-based biometrics, presentation attacks can be accomplished with printout photographs or with contact lenses containing the iris patterns of a target user or even synthetic texture patterns. In fingerprint biometric systems, impostor users can deceive the authentication process using replicas of the fingerprint patterns built with synthetic materials such as latex, play-doh, silicone, among others. This research aimed at developing presentation attack detection (PAD) solutions whose objective is to detect attempted attacks considering different attack types, in each modality. The lines of investigation presented in this thesis aimed at devising and developing representations based on spatial, temporal and spectral information from noise signature, intrinsic properties of the biometric data (e.g., albedo, reflectance, and depth maps), and supervised feature learning techniques, taking into account different testing scenarios including cross-sensor, intra-, and inter-dataset scenarios. The main findings and contributions presented in this thesis include: the creation of a large and publicly available benchmark containing 17K videos of presentation attacks and bona-fide presentations simulations in a facial biometric system, whose collect were formally authorized by the Research Ethics Committee at Unicamp; the development of novel approaches to modeling and analysis of extrinsic properties of biometric samples related to artifacts added during the manufacturing of the synthetic samples and their capture by the acquisition sensor, whose results were superior to several approaches published in the literature that use traditional methods for image analysis (e.g., texture-based analysis); the investigation of an approach based on the analysis of intrinsic properties of faces, estimated from the information of shadows present on their surface; and the investigation of different approaches to automatically learning representations related to our problem, whose results were superior or competitive to state-of-the-art methods for the biometric modalities considered in this thesis. We also considered in this research the design of efficient neural networks with shallow architectures capable of learning characteristics related to our problem from small sets of data available to develop and evaluate PAD solutionsDoutoradoCiência da ComputaçãoDoutor em Ciência da Computação140069/2016-0 CNPq, 142110/2017-5CAPESCNP

    Análise de vídeo sensível

    Get PDF
    Orientadores: Anderson de Rezende Rocha, Siome Klein GoldensteinTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Vídeo sensível pode ser definido como qualquer filme capaz de oferecer ameaças à sua audiência. Representantes típicos incluem ¿ mas não estão limitados a ¿ pornografia, violência, abuso infantil, crueldade contra animais, etc. Hoje em dia, com o papel cada vez mais pervasivo dos dados digitais em nossa vidas, a análise de conteúdo sensível representa uma grande preocupação para representantes da lei, empresas, professores, e pais, devido aos potenciais danos que este tipo de conteúdo pode infligir a menores, estudantes, trabalhadores, etc. Não obstante, o emprego de mediadores humanos, para constantemente analisar grandes quantidades de dados sensíveis, muitas vezes leva a ocorrências de estresse e trauma, o que justifica a busca por análises assistidas por computador. Neste trabalho, nós abordamos este problema em duas frentes. Na primeira, almejamos decidir se um fluxo de vídeo apresenta ou não conteúdo sensível, à qual nos referimos como classificação de vídeo sensível. Na segunda, temos como objetivo encontrar os momentos exatos em que um fluxo começa e termina a exibição de conteúdo sensível, em nível de quadros de vídeo, à qual nos referimos como localização de conteúdo sensível. Para ambos os casos, projetamos e desenvolvemos métodos eficazes e eficientes, com baixo consumo de memória, e adequação à implantação em dispositivos móveis. Neste contexto, nós fornecemos quatro principais contribuições. A primeira é uma nova solução baseada em sacolas de palavras visuais, para a classificação eficiente de vídeos sensíveis, apoiada na análise de fenômenos temporais. A segunda é uma nova solução de fusão multimodal em alto nível semântico, para a localização de conteúdo sensível. A terceira, por sua vez, é um novo detector espaço-temporal de pontos de interesse, e descritor de conteúdo de vídeo. Finalmente, a quarta contribuição diz respeito a uma base de vídeos anotados em nível de quadro, que possui 140 horas de conteúdo pornográfico, e que é a primeira da literatura a ser adequada para a localização de pornografia. Um aspecto relevante das três primeiras contribuições é a sua natureza de generalização, no sentido de poderem ser empregadas ¿ sem modificações no passo a passo ¿ para a detecção de tipos diversos de conteúdos sensíveis, tais como os mencionados anteriormente. Para validação, nós escolhemos pornografia e violência ¿ dois dos tipos mais comuns de material impróprio ¿ como representantes de interesse, de conteúdo sensível. Nestes termos, realizamos experimentos de classificação e de localização, e reportamos resultados para ambos os tipos de conteúdo. As soluções propostas apresentam uma acurácia de 93% em classificação de pornografia, e permitem a correta localização de 91% de conteúdo pornográfico em fluxo de vídeo. Os resultados para violência também são interessantes: com as abordagens apresentadas, nós obtivemos o segundo lugar em uma competição internacional de detecção de cenas violentas. Colocando ambas em perspectiva, nós aprendemos que a detecção de pornografia é mais fácil que a de violência, abrindo várias oportunidades de pesquisa para a comunidade científica. A principal razão para tal diferença está relacionada aos níveis distintos de subjetividade que são inerentes a cada conceito. Enquanto pornografia é em geral mais explícita, violência apresenta um espectro mais amplo de possíveis manifestaçõesAbstract: Sensitive video can be defined as any motion picture that may pose threats to its audience. Typical representatives include ¿ but are not limited to ¿ pornography, violence, child abuse, cruelty to animals, etc. Nowadays, with the ever more pervasive role of digital data in our lives, sensitive-content analysis represents a major concern to law enforcers, companies, tutors, and parents, due to the potential harm of such contents over minors, students, workers, etc. Notwithstanding, the employment of human mediators for constantly analyzing huge troves of sensitive data often leads to stress and trauma, justifying the search for computer-aided analysis. In this work, we tackle this problem in two ways. In the first one, we aim at deciding whether or not a video stream presents sensitive content, which we refer to as sensitive-video classification. In the second one, we aim at finding the exact moments a stream starts and ends displaying sensitive content, at frame level, which we refer to as sensitive-content localization. For both cases, we aim at designing and developing effective and efficient methods, with low memory footprint and suitable for deployment on mobile devices. In this vein, we provide four major contributions. The first one is a novel Bag-of-Visual-Words-based pipeline for efficient time-aware sensitive-video classification. The second is a novel high-level multimodal fusion pipeline for sensitive-content localization. The third, in turn, is a novel space-temporal video interest point detector and video content descriptor. Finally, the fourth contribution comprises a frame-level annotated 140-hour pornographic video dataset, which is the first one in the literature that is appropriate for pornography localization. An important aspect of the first three contributions is their generalization nature, in the sense that they can be employed ¿ without step modifications ¿ to the detection of diverse sensitive content types, such as the previously mentioned ones. For validation, we choose pornography and violence ¿ two of the commonest types of inappropriate material ¿ as target representatives of sensitive content. We therefore perform classification and localization experiments, and report results for both types of content. The proposed solutions present an accuracy of 93% in pornography classification, and allow the correct localization of 91% of pornographic content within a video stream. The results for violence are also compelling: with the proposed approaches, we reached second place in an international competition of violent scenes detection. Putting both in perspective, we learned that pornography detection is easier than its violence counterpart, opening several opportunities for additional investigations by the research community. The main reason for such difference is related to the distinct levels of subjectivity that are inherent to each concept. While pornography is usually more explicit, violence presents a broader spectrum of possible manifestationsDoutoradoCiência da ComputaçãoDoutor em Ciência da Computação1572763, 1197473CAPE
    corecore