
    Unmasking Clever Hans Predictors and Assessing What Machines Really Learn

    Current learning machines have successfully solved hard application problems, reaching high accuracy and displaying seemingly "intelligent" behavior. Here we apply recent techniques for explaining the decisions of state-of-the-art learning machines and analyze various tasks from computer vision and arcade games. This showcases a spectrum of problem-solving behaviors ranging from naive and short-sighted to well-informed and strategic. We observe that standard performance evaluation metrics can fail to distinguish these diverse problem-solving behaviors. We therefore propose a semi-automated Spectral Relevance Analysis that provides a practically effective way of characterizing and validating the behavior of nonlinear learning machines, helping to assess whether a learned model indeed delivers reliably on the problem it was conceived for. Our work also adds a voice of caution to the ongoing excitement about machine intelligence, arguing that some of these recent successes should be evaluated and judged in a more nuanced manner.
    Comment: Accepted for publication in Nature Communications
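
    The Spectral Relevance Analysis referred to above clusters per-prediction relevance (heat) maps to surface groups of predictions that rely on suspicious input regions. Below is a minimal sketch of that idea, assuming the relevance maps (e.g., from Layer-wise Relevance Propagation) are already computed; the function name and clustering parameters are illustrative, not the authors' implementation.

```python
# Minimal SpRAy-style sketch: spectral clustering of per-prediction heatmaps.
# `relevance_maps` is assumed to be precomputed (e.g., via LRP).
import numpy as np
from sklearn.cluster import SpectralClustering

def spray_clusters(relevance_maps: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """relevance_maps: (n_samples, H, W) array of per-prediction heatmaps."""
    n = relevance_maps.shape[0]
    flat = relevance_maps.reshape(n, -1)
    # Normalize each map so clustering compares spatial structure, not magnitude.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    labels = SpectralClustering(
        n_clusters=n_clusters,
        affinity="nearest_neighbors",
        n_neighbors=10,
        random_state=0,
    ).fit_predict(flat)
    return labels

# Usage: labels = spray_clusters(maps); small or unusual clusters are
# candidates for "Clever Hans" strategies and are inspected manually.
```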

    Attention Mechanism for Recognition in Computer Vision

    It has been proven that humans do not focus their attention on an entire scene at once when they perform a recognition task. Instead, they pay attention to the most important parts of the scene to extract the most discriminative information. Inspired by this observation, this dissertation studies the importance of the attention mechanism in recognition tasks in computer vision by designing novel attention-based models. Specifically, four scenarios are investigated that represent the most important aspects of the attention mechanism. First, an attention-based model is designed to reduce the dimensionality of visual features by selectively processing only a small subset of the data. We study this aspect of the attention mechanism in a framework based on object recognition in distributed camera networks. Second, an attention-based image retrieval system (i.e., person re-identification) is proposed, which learns to focus on the most discriminative regions of a person's image and processes those regions with higher computational power using a deep convolutional neural network. Furthermore, we show how visualizing the attention maps can make deep neural networks more interpretable: by visualizing the attention maps we can observe the regions of the input image that the neural network relies on in order to make a decision. Third, a model for estimating the importance of the objects in a scene based on a given task is proposed. More specifically, the proposed model estimates the importance of the road users that a driver (or an autonomous vehicle) should pay attention to in a driving scenario in order to navigate safely. In this scenario, the attention estimate is the final output of the model. Fourth, an attention-based module and a new loss function are proposed for a meta-learning-based few-shot learning system, in order to incorporate the context of the task into the feature representations of the samples and increase few-shot recognition accuracy. In this dissertation, we showed that attention can be multi-faceted and studied the attention mechanism from the perspectives of feature selection, reducing computational cost, interpretable deep learning models, task-driven importance estimation, and context incorporation. Through the study of these four scenarios, we further advanced the field where "attention is all you need".
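
    As a rough illustration of the kind of module such work builds on, the PyTorch sketch below scores each spatial location of a CNN feature map, pools features by the resulting attention weights, and exposes the attention map itself for visualization. The module name and layer choices are my assumptions for illustration, not the dissertation's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Soft spatial attention over a CNN feature map (illustrative sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # per-location score

    def forward(self, feats: torch.Tensor):
        # feats: (B, C, H, W) convolutional feature map
        b, c, h, w = feats.shape
        attn = F.softmax(self.score(feats).view(b, -1), dim=1).view(b, 1, h, w)
        pooled = (feats * attn).sum(dim=(2, 3))  # (B, C) attended descriptor
        return pooled, attn  # attn can be upsampled and overlaid on the input

# Usage: pooled, attn = SpatialAttention(512)(cnn_features)
```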

    Sensitive-video analysis (Análise de vídeo sensível)

    Advisors: Anderson de Rezende Rocha, Siome Klein Goldenstein. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação.
    Sensitive video can be defined as any motion picture that may pose threats to its audience. Typical representatives include, but are not limited to, pornography, violence, child abuse, cruelty to animals, etc. Nowadays, with the ever more pervasive role of digital data in our lives, sensitive-content analysis represents a major concern for law enforcers, companies, tutors, and parents, due to the potential harm such content may inflict on minors, students, workers, etc. Notwithstanding, the employment of human mediators to constantly analyze huge troves of sensitive data often leads to stress and trauma, justifying the search for computer-aided analysis. In this work, we tackle this problem on two fronts. In the first, we aim at deciding whether or not a video stream presents sensitive content, which we refer to as sensitive-video classification. In the second, we aim at finding the exact moments a stream starts and ends displaying sensitive content, at frame level, which we refer to as sensitive-content localization. For both cases, we design and develop effective and efficient methods, with a low memory footprint and suitable for deployment on mobile devices. In this vein, we provide four major contributions. The first is a novel Bag-of-Visual-Words-based pipeline for efficient time-aware sensitive-video classification. The second is a novel high-level multimodal fusion pipeline for sensitive-content localization. The third, in turn, is a novel space-time video interest point detector and video content descriptor. Finally, the fourth contribution comprises a frame-level annotated 140-hour pornographic video dataset, the first in the literature that is appropriate for pornography localization. An important aspect of the first three contributions is their generality, in the sense that they can be employed, without changes to the pipeline, for detecting diverse types of sensitive content, such as the ones previously mentioned. For validation, we choose pornography and violence, two of the commonest types of inappropriate material, as target representatives of sensitive content. We therefore perform classification and localization experiments and report results for both types of content. The proposed solutions present an accuracy of 93% in pornography classification and allow the correct localization of 91% of pornographic content within a video stream. The results for violence are also compelling: with the proposed approaches, we reached second place in an international competition on violent scene detection. Putting both in perspective, we learned that pornography detection is easier than its violence counterpart, opening several opportunities for additional investigation by the research community. The main reason for this difference relates to the distinct levels of subjectivity inherent to each concept: while pornography is usually more explicit, violence presents a broader spectrum of possible manifestations.
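
    The Bag-of-Visual-Words pipeline named in the first contribution can be sketched generically as follows: local space-time descriptors are quantized against a learned codebook, and each video becomes a normalized visual-word histogram fed to a linear classifier. This is a sketch of the general technique under the assumption that descriptor extraction happens elsewhere, not the thesis's exact pipeline.

```python
# Generic Bag-of-Visual-Words sketch: codebook + histogram + linear classifier.
# Each video is assumed to already be a (n_descriptors, dim) array of local
# space-time features.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

def fit_codebook(descriptor_sets, k=256):
    """Learn a k-word visual codebook from all training descriptors."""
    all_desc = np.vstack(descriptor_sets)
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(all_desc)

def bow_histogram(codebook, descriptors):
    """Quantize a video's descriptors into an L1-normalized word histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)

# Training: one histogram per video, then a linear classifier on top, e.g.
# codebook = fit_codebook(train_descriptor_sets)
# X = np.stack([bow_histogram(codebook, d) for d in train_descriptor_sets])
# clf = LinearSVC().fit(X, train_labels)  # e.g., sensitive vs. non-sensitive
```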

    Video metadata extraction in a videoMail system

    The world is swiftly adapting to visual communication. Online services like YouTube and Vine show that video is no longer the domain of broadcast television alone. Video is used for different purposes such as entertainment, information, education, and communication. The rapid growth of today's video archives, with sparsely available editorial data, makes retrieval a major problem. Humans perceive a video as a complex interplay of cognitive concepts, so there is a need to build a bridge between numeric values and semantic concepts; this connection is what facilitates video retrieval by humans. The critical aspect of this bridge is video annotation. The process can be done manually or automatically. Manual annotation is tedious, subjective, and expensive, so automatic annotation is being actively studied. In this thesis we focus on the automatic annotation of multimedia content, namely the use of information-retrieval analysis techniques to automatically extract metadata from video in a videomail system, including the identification of text, people, actions, spaces, and objects (including animals and plants). This makes it possible to align multimedia content with the text presented in the email message and enables applications for semantic video database indexing and retrieval.

    Advanced Biometrics with Deep Learning

    Biometrics, such as fingerprint, iris, face, handprint, hand-vein, speech, and gait recognition, have become a commonplace means of identity management for various applications. Biometric systems follow a typical pipeline composed of separate preprocessing, feature extraction, and classification stages. Deep learning, as a data-driven representation-learning approach, has been shown to be a promising alternative to the conventional data-agnostic, handcrafted preprocessing and feature extraction of biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm that unifies preprocessing, feature extraction, and recognition based solely on biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that deal with challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into 4 categories according to biometric modality: face biometrics, medical electronic signals (EEG and ECG), voiceprint, and others.

    Part-alignment learning for image-based person re-identification (영상 기반 동일인 판별을 위한 부분 정합 학습)

    Doctoral dissertation, Department of Electrical and Computer Engineering, College of Engineering, Seoul National University, February 2019. Advisor: Kyoung Mu Lee.
    Person re-identification is the problem of identifying the same individuals among persons captured from different cameras. It is a challenging problem because the same person captured from non-overlapping cameras usually shows dramatic appearance changes due to viewpoint, pose, and illumination changes. Since it is an essential tool for many surveillance applications, various research directions have been explored; however, the problem is far from being solved. The goal of this thesis is to solve the person re-identification problem in surveillance systems. In particular, we focus on two critical components: designing 1) a better image representation model using human poses and 2) a better training method using hard sample mining. First, we propose a part-aligned representation model which represents an image as the bilinear pooling between appearance and part maps. Since the image similarity is calculated independently of the locations of body parts, it addresses the body-part misalignment issue and effectively distinguishes different people by discriminating fine-grained local differences. Second, we propose a stochastic hard sample mining method that exploits class information to generate diverse and hard examples for training. It efficiently explores the training samples while avoiding getting stuck in a small subset of hard samples, thereby training the model effectively. Finally, we propose an integrated system that combines the two approaches and benefits from both components. Experimental results show that the proposed method works robustly on five datasets with diverse conditions, and suggest its potential extension to more general conditions.
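
    The part-aligned bilinear pooling described above can be sketched as follows: the image descriptor is the sum over spatial locations of the outer product between appearance and part-map vectors, so the inner product between two descriptors compares appearance only where the same body parts respond. The two stream networks producing the maps are placeholders in this sketch, not the thesis's exact networks.

```python
# Sketch of part-aligned bilinear pooling between an appearance map and a
# part map; the streams producing the two maps are assumed to exist.
import torch
import torch.nn.functional as F

def part_aligned_descriptor(appearance: torch.Tensor, parts: torch.Tensor) -> torch.Tensor:
    # appearance: (B, Ca, H, W), parts: (B, Cp, H, W)
    b, ca, h, w = appearance.shape
    cp = parts.shape[1]
    a = appearance.view(b, ca, h * w)
    p = parts.view(b, cp, h * w)
    # Sum over locations of outer products a(x) p(x)^T, flattened to a vector.
    feat = torch.bmm(a, p.transpose(1, 2)).view(b, ca * cp)
    return F.normalize(feat, dim=1)

# Image similarity is then a plain inner product between descriptors:
# sim = (part_aligned_descriptor(a1, p1) * part_aligned_descriptor(a2, p2)).sum(1)
```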

    Human-robot interaction and computer-vision-based services for autonomous robots

    Imitation Learning (IL), or robot Programming by Demonstration (PbD), covers methods by which a robot learns new skills through human guidance and imitation. PbD takes its inspiration from the way humans learn new skills by imitation in order to develop methods by which new tasks can be transmitted to robots. This thesis is motivated by the generic question of "what to imitate?", which concerns the problem of how to extract the essential features of a task. To this end, we adopt an Action Recognition (AR) perspective in order to allow the robot to decide what has to be imitated or inferred when interacting with a human. The proposed approach is based on a well-known method from natural language processing, namely Bag of Words (BoW), applied to large databases in order to obtain a trained model. Although BoW is a machine learning technique used in various fields of research, in action classification for robot learning it is far from accurate; moreover, it focuses on the classification of objects and gestures rather than actions. Thus, in this thesis we show that the method is suitable in action-classification scenarios for merging information from different sources or different trials, as illustrated in the sketch below. This thesis makes three contributions: (1) it proposes a general method for dealing with action recognition and thus contributes to imitation learning; (2) the methodology can be applied to large databases which include different modes of action capture; and (3) the method is applied in a real international innovation project called Vinbot.
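
    As a hedged illustration of the multi-source fusion role the thesis attributes to BoW, the sketch below simply normalizes and concatenates per-source visual-word histograms so that one classifier can use evidence from several capture modes or trials; the source names are hypothetical.

```python
# Late fusion of Bag-of-Words histograms from multiple sources or trials.
import numpy as np

def fuse_bow_histograms(histograms):
    """histograms: list of 1-D visual-word histograms, one per source/trial."""
    parts = []
    for h in histograms:
        h = np.asarray(h, dtype=float)
        parts.append(h / (h.sum() + 1e-12))  # normalize each source separately
    return np.concatenate(parts)             # fused descriptor for a classifier

# fused = fuse_bow_histograms([rgb_hist, depth_hist])  # hypothetical sources
```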

    Automated interpretation of benthic stereo imagery

    Autonomous benthic imaging reduces human risk and increases the amount of collected data. However, manually interpreting these high volumes of data is onerous, time consuming and, in many cases, infeasible. The objective of this thesis is to improve the scientific utility of these large image datasets. Fine-scale terrain complexity is typically quantified by rugosity and measured by divers using chains and tape measures. This thesis proposes a new technique for measuring terrain complexity from 3D stereo image reconstructions, which is non-contact and can be calculated at multiple scales over large spatial extents. Using robots, terrain complexity can be measured beyond scuba depths without endangering humans. Results show that this approach is more robust, flexible and easily repeatable than traditional methods. The proposed terrain-complexity features are combined with visual colour and texture descriptors and applied to classifying imagery. New multi-dataset feature selection methods are proposed for performing feature selection across multiple datasets, and are shown to improve the overall classification performance. The results show that the most informative predictors of benthic habitat types are the new terrain-complexity measurements. This thesis also presents a method that aims to reduce human labelling effort while maximising classification performance, by combining pre-clustering with active learning. The results show that utilising the structure of the unlabelled data in conjunction with uncertainty sampling can significantly reduce the number of labels required for a given level of accuracy. Typically only 0.00001–0.00007% of image data is annotated and processed for science purposes (20–50 points in 1–2% of the images). This thesis proposes a framework that uses existing human-annotated point labels to train a superpixel-based automated classification system, which can extrapolate the classified results to every pixel across all the images of an entire survey.
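
    The non-contact rugosity measurement can be illustrated with a small sketch: given a gridded height map recovered from the stereo reconstruction, rugosity is the ratio of true 3-D surface area to its planar footprint. The gradient-based area approximation below is an assumption for illustration, not the thesis's exact algorithm; smoothing the height map before the computation would yield the measure at coarser scales.

```python
# Rugosity from a gridded height map: surface area / planar area (>= 1).
import numpy as np

def rugosity(height: np.ndarray, cell: float = 1.0) -> float:
    """height: (H, W) elevation grid; cell: grid spacing in metres."""
    dzdy, dzdx = np.gradient(height, cell)
    # Area of each local surface patch relative to its planar footprint.
    area_ratio = np.sqrt(1.0 + dzdx**2 + dzdy**2)
    return float(area_ratio.mean())

# Usage: r = rugosity(stereo_height_map, cell=0.01)  # 1 cm grid, hypothetical
```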