    Towards Video Transformers for Automatic Human Analysis

    [eng] With the aim of creating artificial systems capable of mirroring the nuanced understanding and interpretative powers inherent to human cognition, this thesis explores the intersection between human analysis and Video Transformers. The objective is to harness the potential of Transformers, a promising architectural paradigm, to comprehend the intricacies of human interaction, thus paving the way for the development of empathetic and context-aware intelligent systems. To do so, we explore the whole Computer Vision pipeline, from data gathering, through model design and experimentation, to an in-depth analysis of recent developments. Central to this study is the creation of UDIVA, an expansive multi-modal, multi-view dataset capturing dyadic face-to-face human interactions. Comprising 147 participants across 188 sessions, UDIVA integrates audio-visual recordings, heart-rate measurements, personality assessments, socio-demographic metadata, and conversational transcripts, establishing itself as the largest dataset for dyadic human interaction analysis to date. This dataset provides a rich context for probing the capabilities of Transformers within complex environments. To validate its utility, and to elucidate Transformers' ability to assimilate diverse contextual cues, we address the challenge of personality regression within interaction scenarios. We first adapt an existing Video Transformer to handle multiple contextual sources and conduct rigorous experimentation. We empirically observe a progressive improvement in model performance as more context is added, reinforcing the potential of Transformers to decode intricate human dynamics. Building upon these findings, we propose the Dyadformer, a novel architecture for long-range modeling of dyadic interactions. 
By jointly modeling both participants in the interaction, and by embedding multi-modal integration into the model itself, the Dyadformer surpasses the baseline and other concurrent approaches, underscoring Transformers' aptitude for multifaceted, noisy, and challenging tasks such as the analysis of human personality in interaction. Nonetheless, these experiments reveal the pervasive challenges of training Transformers, particularly managing overfitting, which stems from their demand for extensive datasets. Consequently, we conclude this thesis with a comprehensive investigation into Video Transformers, covering architectural designs, training strategies, input embedding and tokenization, multi-modality, and specific applications. Across these, we highlight trends that best exploit spatio-temporal representations to handle video redundancy and high dimensionality. Finally, a performance comparison on video action classification spotlights strategies that exhibit superior efficacy, even compared to traditional CNN-based methods.

    Scene graph generation: A comprehensive survey

    Deep learning techniques have led to remarkable breakthroughs in object detection and have spawned many scene-understanding tasks in recent years. The scene graph has become a focus of research because of its powerful semantic representation and its applications to scene understanding. Scene Graph Generation (SGG) refers to the task of automatically mapping an image or a video into a semantic structural scene graph, which requires correctly labeling the detected objects and their relationships. This paper provides a comprehensive survey of recent achievements. The survey attempts to connect and systematize the existing visual relationship detection methods, and to summarize and interpret the mechanisms and strategies of SGG in a comprehensive way. It closes with an in-depth discussion of current open problems and future research directions. This survey will help readers develop a better understanding of current research.
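
As a rough illustration of the structure SGG is meant to produce, the sketch below stores labeled objects as nodes and their relationships as (subject, predicate, object) triples. The class names, fields, and the example detections are hypothetical and are not taken from the survey; a real SGG system would obtain objects and predicates from trained detectors.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    """A detected object: a class label plus a bounding box (x1, y1, x2, y2)."""
    obj_id: int
    label: str
    box: tuple


@dataclass
class SceneGraph:
    """Nodes are detected objects; edges are (subject, predicate, object) triples."""
    objects: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_object(self, obj: SceneObject) -> None:
        self.objects[obj.obj_id] = obj

    def add_relation(self, subj_id: int, predicate: str, obj_id: int) -> None:
        # Both endpoints must already be known objects.
        if subj_id not in self.objects or obj_id not in self.objects:
            raise KeyError("relation refers to an unknown object id")
        self.relations.append((subj_id, predicate, obj_id))

    def triples(self) -> list:
        """Return readable (subject label, predicate, object label) triples."""
        return [
            (self.objects[s].label, p, self.objects[o].label)
            for s, p, o in self.relations
        ]


# Hypothetical detections for a single image.
graph = SceneGraph()
graph.add_object(SceneObject(0, "person", (10, 20, 110, 220)))
graph.add_object(SceneObject(1, "horse", (60, 90, 300, 260)))
graph.add_relation(0, "riding", 1)
print(graph.triples())
```

Keeping relations as id-based triples, rather than label-based ones, lets two objects with the same label (for example, two people) take part in different relationships.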

    Attention-based Approaches for Text Analytics in Social Media and Automatic Summarization

    [EN] Nowadays, society has access to, and the possibility to contribute to, large amounts of content on the internet, such as social networks, online newspapers, forums, blogs, or multimedia content platforms. In recent years, these platforms have had an overwhelming impact on the daily life of individuals and organizations, becoming the predominant means of sharing, discussing, and analyzing online content. Therefore, it is of interest to work with these platforms, from different points of view, under the umbrella of Natural Language Processing. In this thesis, we focus on two broad areas within this field, applied to the analysis of online content: text analytics in social media and automatic summarization. 
Neural networks are also a central topic of this thesis: all the experimentation has been performed using deep learning approaches, mainly based on attention mechanisms. In addition, we mostly work with the Spanish language, because it is an underexplored language of great interest to the research projects in which we participated. On the one hand, for text analytics in social media, we focused on affective analysis tasks, including sentiment analysis and emotion detection, along with irony analysis. In this regard, we present an approach based on Transformer Encoders that contextualizes word embeddings pretrained on Spanish tweets, to address sentiment analysis and irony detection tasks. We also propose the use of evaluation metrics as loss functions for training neural networks, in order to reduce the impact of class imbalance in multi-class and multi-label emotion detection tasks. Additionally, we present a specialization of BERT for both the Spanish language and the Twitter domain that takes into account inter-sentence coherence in Twitter conversation flows. The performance of all these approaches has been tested on different corpora from several reference evaluation benchmarks, showing very competitive results in all the tasks addressed. On the other hand, we focused on extractive summarization of news articles and TV talk shows. Regarding the summarization of news articles, we present a theoretical framework for extractive summarization based on siamese hierarchical networks with attention mechanisms, together with two instantiations of this framework: Siamese Hierarchical Attention Networks and Siamese Hierarchical Transformer Encoders. These systems were evaluated on the CNN/DailyMail and NewsRoom corpora, obtaining competitive results in comparison to other contemporary extractive approaches. 
Concerning the TV talk shows, we proposed a text summarization task that consists of summarizing the transcribed interventions of the speakers on a given topic in the Spanish TV talk show "La Noche en 24 Horas". In addition, we propose a corpus of news articles, collected from several Spanish online newspapers, in order to study the domain transferability of siamese hierarchical approaches between news articles and interventions of debate participants. This approach shows better results than other extractive techniques, along with very promising domain transferability.
    González Barba, JÁ. (2021). Attention-based Approaches for Text Analytics in Social Media and Automatic Summarization [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/172245
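
The idea of training with an evaluation metric as the loss can be illustrated with a "soft" macro-F1, where predicted probabilities replace hard 0/1 decisions so that the quantity varies smoothly with the model outputs. This is only a sketch of one common surrogate; the abstract does not state which metrics or formulation the thesis uses, and the function name, the `eps` smoothing term, and the toy data are assumptions. It is written in plain Python for readability; in practice the same arithmetic would be applied to tensors in a deep learning framework so that gradients can flow.

```python
def soft_macro_f1_loss(y_true, y_prob, eps=1e-8):
    """Return 1 - soft macro-F1 for multi-label targets.

    y_true: list of rows with 0/1 labels, one column per class.
    y_prob: list of rows with predicted probabilities in [0, 1].
    """
    n_classes = len(y_true[0])
    f1_sum = 0.0
    for c in range(n_classes):
        # Soft confusion counts: probabilities stand in for hard decisions.
        tp = sum(t[c] * p[c] for t, p in zip(y_true, y_prob))
        fp = sum((1 - t[c]) * p[c] for t, p in zip(y_true, y_prob))
        fn = sum(t[c] * (1 - p[c]) for t, p in zip(y_true, y_prob))
        f1_sum += 2 * tp / (2 * tp + fp + fn + eps)
    # Macro-averaging weights every class equally, including rare ones.
    return 1.0 - f1_sum / n_classes


# Toy example with two samples and two emotion classes.
labels = [[1, 0], [0, 1]]
loss_perfect = soft_macro_f1_loss(labels, [[1.0, 0.0], [0.0, 1.0]])
loss_unsure = soft_macro_f1_loss(labels, [[0.5, 0.5], [0.5, 0.5]])
loss_wrong = soft_macro_f1_loss(labels, [[0.0, 1.0], [1.0, 0.0]])
print(loss_perfect, loss_unsure, loss_wrong)
```

Because every class contributes equally to the macro average, errors on a rare class cost as much as errors on a frequent one, which is the property that makes such losses attractive under class imbalance.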

    Learning Narrative Text Generators from Visual Stories via Latent Embeddings (잠재 임베딩을 통한 시각적 스토리로부터의 서사 텍스트 생성기 학습)

    Doctoral dissertation (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2019. 장병탁.
    The ability to understand stories is an essential part of what makes humans unique among primates and other animals. The capability of story understanding is crucial for AI agents to live with people in everyday life and understand their context. However, most research on story AI focuses on automated story generation based on manually designed closed worlds, which are widely used for computational authoring. Machine learning techniques applied to story corpora face the same problems as natural language processing in general, such as omitted details and missing commonsense knowledge. Given the remarkable success of deep learning in computer vision and the growing interest in research bridging vision and language, vision-grounded story data can potentially improve the performance of story understanding and narrative text generation. Let us assume that AI agents are situated in an environment that they sense through a camera. Those agents observe their surroundings, translate them into a story in natural language, and predict the next event, or several subsequent events in sequence. This dissertation studies the related problems: learning stories and generating narrative text from image streams or videos. The first problem is to generate a narrative text from a sequence of ordered images. As a solution, we introduce GLAC Net (Global-local Attention Cascading Network), an encoder-decoder framework that translates image sequences into narrative paragraphs in a sequence-to-sequence setting. It uses convolutional neural networks to extract information from images and recurrent neural networks to generate text. We introduce visual cue encoders with stacked bidirectional LSTMs, in which the outputs of all layers are aggregated into contextualized image vectors that capture visual cues. 
The coherence of the generated text is further improved by conveying (cascading) the information of the previous sentence to the next sentence serially in the decoders. We evaluate its performance on the Visual Storytelling (VIST) dataset. It outperforms other state-of-the-art results and achieved the best total score and the best score in all six aspects of the first Visual Storytelling Challenge, as evaluated by human judges. The second problem is to predict the following events or narrative text given the earlier parts of a story. Prediction should be possible at any step and for stories of arbitrary length. We propose recurrent event retrieval models as a solution. These models train a context accumulation function and two embedding functions so as to reduce the distance, in a latent space, between the cumulative context at the current time and the next probable events. They update the cumulative context with each new input event using bilinear operations, and the updated cumulative context is then used to find candidates for the next event. We train them on the ROCStories dataset and evaluate them on the Story Cloze Test, where they show competitive performance and the best results in the open-ended generation setting. We also demonstrate working examples in an interactive setting. The third problem concerns learning composite representations of semantics and order for video stories. We embed each episode as a trajectory-like sequence of events in the latent space and propose ViStoryNet to regenerate video stories from these embeddings (the story completion task). We convert event sentences to thought vectors and train functions that embed successive events close to each other, so that episodes form trajectories. Bidirectional LSTMs are trained as sequence models, and GRU-based decoders generate the event sentences. We test them experimentally on the PororoQA dataset and observe that most episodes take the form of trajectories. We use them to complete the masked parts of stories; the results are not perfect but are similar overall. 
These results can be applied to AI agents that sense their living environment through cameras, describe situations as stories, infer unobserved parts, and predict how the story will continue.
    Contents:
    Abstract
    Chapter 1 Introduction
        1.1 Story of Everyday lives in Videos and Story Understanding; 1.2 Problems to be addressed; 1.3 Approach and Contribution; 1.4 Organization of Dissertation
    Chapter 2 Background and Related Work
        2.1 Why We Study Stories; 2.2 Latent Embedding; 2.3 Order Embedding and Ordinal Embedding; 2.4 Comparison to Story Understanding; 2.5 Story Generation (2.5.1 Abstract Event Representations; 2.5.2 Seq-to-seq Attentional Models; 2.5.3 Story Generation from Images)
    Chapter 3 Visual Storytelling via Global-local Attention Cascading Networks
        3.1 Introduction; 3.2 Evaluation for Visual Storytelling; 3.3 Global-local Attention Cascading Networks (GLAC Net) (3.3.1 Encoder: Contextualized Image Vector Extractor; 3.3.2 Decoder: Story Generator with Attention and Cascading Mechanism); 3.4 Experimental Results (3.4.1 VIST Dataset; 3.4.2 Experiment Settings; 3.4.3 Network Training Details; 3.4.4 Qualitative Analysis; 3.4.5 Quantitative Analysis); 3.5 Summary
    Chapter 4 Common Space Learning on Cumulative Contexts and the Next Events: Recurrent Event Retrieval Models
        4.1 Overview; 4.2 Problems of Context Accumulation; 4.3 Recurrent Event Retrieval Models for Next Event Prediction; 4.4 Experimental Results (4.4.1 Preliminaries; 4.4.2 Story Cloze Test; 4.4.3 Open-ended Story Generation); 4.5 Summary
    Chapter 5 ViStoryNet: Order Embedding of Successive Events and the Networks for Story Regeneration
        5.1 Introduction; 5.2 Order Embedding with Triple Learning (5.2.1 Embedding Ordered Objects in Sequences); 5.3 Problems and Contextual Events (5.3.1 Problem Definition; 5.3.2 Contextual Event Vectors from Kids Videos); 5.4 Architectures for the Story Regeneration Task (5.4.1 Two Sentence Generators as Decoders; 5.4.2 Successive Event Order Embedding (SEOE); 5.4.3 Sequence Models of the Event Space); 5.5 Experimental Results (5.5.1 Experimental setup; 5.5.2 Quantitative Analysis; 5.5.3 Qualitative Analysis); 5.6 Summary
    Chapter 6 Concluding Remarks
        6.1 Summary of Methods and Contributions; 6.2 Limitation and Outlook; 6.3 Suggestions for Future Research
    Abstract (in Korean)
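
The recurrent event retrieval idea described in this abstract (accumulate context with a bilinear update, then retrieve the next event by distance in a shared latent space) can be sketched as follows. This is a toy illustration under stated assumptions, not the dissertation's model: the weights `W`, `A`, and `B` are random and untrained, the `tanh` nonlinearity, the initial context of ones, the Euclidean distance, and all function names are hypothetical choices made for this example.

```python
import math
import random


def bilinear_update(context, event, W):
    """New context: c_next[k] = tanh(sum_ij context[i] * W[k][i][j] * event[j])."""
    return [
        math.tanh(sum(context[i] * W[k][i][j] * event[j]
                      for i in range(len(context))
                      for j in range(len(event))))
        for k in range(len(W))
    ]


def linear(M, v):
    """Matrix-vector product used as a stand-in embedding function."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def retrieve_next(context, candidates, A, B, top_k=1):
    """Rank candidate events by distance to the context in the shared space."""
    z_ctx = linear(A, context)  # embedding of the cumulative context
    scored = sorted(
        (math.dist(z_ctx, linear(B, e)), idx)  # embedding of each candidate
        for idx, e in enumerate(candidates)
    )
    return [idx for _, idx in scored[:top_k]]


rng = random.Random(0)
d = 3
# Untrained, random parameters (assumption: training would fit these so that
# true next events end up close to their preceding context).
W = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)] for _ in range(d)]
A = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
B = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

# Accumulate context over a story of any length, one event vector at a time.
story = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]
context = [1.0] * d
for event in story:
    context = bilinear_update(context, event, W)

candidates = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.3, 0.3, 0.3]]
ranking = retrieve_next(context, candidates, A, B, top_k=3)
print(ranking)
```

Because the context is a fixed-size vector that is updated recurrently, prediction can be requested after any number of events, which matches the abstract's requirement of predicting at any step for stories of arbitrary length.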

    Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work

    Inspired by the fact that human brains can emphasize discriminative parts of the input and suppress irrelevant ones, a substantial number of local mechanisms have been designed to advance computer vision. They can not only focus on target parts to learn discriminative local representations, but also process information selectively to improve efficiency. Local mechanisms have different characteristics depending on the application scenario and paradigm. In this survey, we provide a systematic review of local mechanisms for various computer vision tasks and approaches, including fine-grained visual recognition, person re-identification, few-/zero-shot learning, multi-modal learning, self-supervised learning, Vision Transformers, and so on. We summarize the categorization of local mechanisms in each field. We then analyze the advantages and disadvantages of every category in depth, identifying room for further exploration. Finally, we discuss future research directions for local mechanisms that may benefit future work. To the best of our knowledge, this is the first survey of local mechanisms in computer vision. We hope that this survey can shed light on future research in the computer vision field.

    Using Visual Relations for Semantic Understanding of Images (이미지의 의미적 이해를 위한 시각적 관계의 이용)

    Doctoral dissertation (Ph.D.), Seoul National University Graduate School, Graduate School of Convergence Science and Technology, Department of Convergence Science (Digital Information Convergence major), August 2019. 곽노준.
    Understanding an image is one of the fundamental goals of computer vision and can provide important breakthroughs for various industries. In particular, the ability to recognize objective instances such as objects and poses has advanced greatly thanks to recent deep learning approaches. 
However, deeply comprehending a visual scene requires a higher level of understanding, such as that found in human beings. Humans usually exploit contextual information from visual inputs to detect meaningful features. In this dissertation, visual relations in various contexts, from the construction phase to the application phase, are studied through three tasks. We first propose a new algorithm for constructing relation graphs that contain the relational knowledge in diagrams. Although diagrams contain richer information than individual image-based or language-based data, proper solutions for automatically understanding diagrams have not been proposed, due to their innate multimodality and the arbitrariness of their layouts. To address this problem, we propose a unified diagram-parsing network for generating knowledge from diagrams, based on an object detector and a recurrent neural network designed for a graphical structure. Specifically, we propose a dynamic graph-generation network (DGGN) that is based on dynamic memory and graph theory. We explore the dynamics of information in a diagram through the activation of gates in gated recurrent unit (GRU) cells. On publicly available diagram datasets, our model achieves state-of-the-art results, outperforming other baselines. Moreover, further experiments on question answering demonstrate the potential of the proposed method for use in various applications. Next, we introduce a novel algorithm to solve the Textbook Question Answering (TQA) task; this task describes more realistic question answering (QA) problems than other recent tasks. We mainly focus on two issues related to the analysis of the TQA dataset. First, solving the TQA problems requires an understanding of multimodal contexts in complicated input data. 
To overcome this issue of extracting knowledge features from long text lessons and merging them with visual features, we establish a context graph from texts and images and propose a new module, f-GCN, based on graph convolutional networks (GCN). Second, in the TQA dataset, scientific terms are not spread over the chapters and subjects are split. To overcome this so-called "out-of-domain" issue, we introduce a novel, self-supervised, open-set learning process without any annotations before learning QA problems. The experimental results indicate that our model significantly outperforms prior state-of-the-art methods. Moreover, ablation studies confirm that both methods (incorporating f-GCN to extract knowledge from multimodal contexts and our newly proposed, self-supervised learning process) are effective for TQA problems. Third, we introduce a novel, weakly supervised object detection (WSOD) paradigm to detect objects belonging to rare classes that do not have many examples. We use transferable knowledge from human-object interactions (HOI). While WSOD has lower performance than full supervision, we mainly focus on HOI that can strongly supervise complex semantics in images. Therefore, we propose a novel module called the "relational region proposal network" (RRPN) that outputs an object-localizing attention map with only human poses and action verbs. In the source domain, we fully train an object detector and the RRPN with full supervision of HOI. With transferred knowledge about the localization map from the trained RRPN, a new object detector can learn unseen objects with weak verbal supervision of HOI without bounding box annotations in the target domain. Because the RRPN is designed as an add-on type, we can apply it not only to object detection but also to other domains such as semantic segmentation. 
The experimental results using a HICO-DET dataset suggest the possibility that the proposed method can be a cheap alternative for the current supervised object detection paradigm. Moreover, qualitative results demonstrate that our model can properly localize unseen objects in HICO-DET and V-COCO datasets.

    1. Introduction
      1.1 Problem Definition
      1.2 Motivation
      1.3 Challenges
      1.4 Contributions
        1.4.1 Generating Visual Relation Graphs from Diagrams
        1.4.2 Application of the Relation Graph in Textbook Question Answering
        1.4.3 Weakly Supervised Object Detection with Human-object Interaction
      1.5 Outline
    2. Background
      2.1 Visual relationships
      2.2 Neural networks on a graph
      2.3 Human-object interaction
    3. Generating Visual Relation Graphs from Diagrams
      3.1 Related Work
      3.2 Proposed Method
        3.2.1 Detecting Constituents in a Diagram
        3.2.2 Generating a Graph of relationships
        3.2.3 Multi-task Training and Cascaded Inference
        3.2.4 Details of Post-processing
      3.3 Experiment
        3.3.1 Datasets
        3.3.2 Baseline
        3.3.3 Metrics
        3.3.4 Implementation Details
        3.3.5 Quantitative Results
        3.3.6 Qualitative Results
      3.4 Discussion
      3.5 Conclusion
    4. Application of the Relation Graph in Textbook Question Answering
      4.1 Related Work
      4.2 Problem
      4.3 Proposed Method
        4.3.1 Multi-modal Context Graph Understanding
        4.3.2 Multi-modal Problem Solving
        4.3.3 Self-supervised open-set comprehension
        4.3.4 Process of Building Textual Context Graph
      4.4 Experiment
        4.4.1 Implementation Details
        4.4.2 Dataset
        4.4.3 Baselines
        4.4.4 Quantitative Results
        4.4.5 Qualitative Results
      4.5 Conclusion
    5. Weakly Supervised Object Detection with Human-object Interaction
      5.1 Related Work
      5.2 Algorithm Overview
      5.3 Proposed Method
        5.3.1 Training on the Source classes Ds
        5.3.2 Training on the Target classes Dt
      5.4 Experiment
        5.4.1 Implementation details
        5.4.2 Dataset and Pre-processing
        5.4.3 Metrics
        5.4.4 Comparison with different feature combination
        5.4.5 Comparison with different attention loss balance and box threshold
        5.4.6 Comparison with prior works
        5.4.7 Qualitative results
      5.5 Conclusion
    6. Concluding Remarks
      6.1 Summary
      6.2 Limitation and Future Directions

    Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability

    Video segmentation encompasses a wide range of problem formulations, e.g., object, scene, actor-action and multimodal video segmentation, for delineating task-specific scene components with pixel-level masks. Recently, approaches in this research area have shifted from ConvNet-based to transformer-based models. In addition, various interpretability approaches have appeared for transformer models and video temporal dynamics, motivated by the growing interest in basic scientific understanding, model diagnostics and societal implications of real-world deployment. Previous surveys mainly focused on ConvNet models on a subset of video segmentation tasks or transformers for classification tasks. Moreover, component-wise discussion of transformer-based video segmentation models has not yet received due focus. In addition, previous reviews of interpretability methods focused on transformers for classification, while analysis of the temporal dynamics modelling capabilities of video models received less attention. In this survey, we address the above with a thorough discussion of various categories of video segmentation, a component-wise discussion of the state-of-the-art transformer-based models, and a review of related interpretability methods. We first present an introduction to the different video segmentation task categories, their objectives, specific challenges and benchmark datasets. Next, we provide a component-wise review of recent transformer-based models and document the state of the art on different video segmentation tasks. Subsequently, we discuss post-hoc and ante-hoc interpretability methods for transformer models and interpretability methods for understanding the role of the temporal dimension in video models. Finally, we conclude our discussion with future research directions.