
    Video Sequence Alignment

    The task of aligning multiple audio-visual sequences with similar content requires careful synchronisation in both the spatial and temporal domains. It is challenging due to a broad range of content variations, background clutter, occlusions, and other factors. This thesis is concerned with aligning video content by characterising the spatial and temporal information embedded in the high-dimensional space. To that end, a three-stage framework is developed, involving space-time representation of video clips with local linear coding, followed by their alignment in the manifold-embedded space. The first two stages present video representation techniques based on local feature extraction and linear coding methods. Firstly, the scale-invariant feature transform (SIFT) is extended to extract interest points not only from the spatial plane but also from the planes along the space-time axis. Locality-constrained coding is then incorporated to project each descriptor into a local coordinate system produced by a pooling technique. Human action classification benchmarks are adopted to evaluate these two stages, comparing their performance against existing techniques. The results show that the space-time extension of SIFT with a linear coding scheme outperforms most state-of-the-art approaches on the action classification task, owing to its ability to represent complex events in video sequences. The final stage presents a manifold learning algorithm with spatio-temporal constraints to embed a video clip in a lower-dimensional space while preserving the intrinsic geometry of the data. The similarities observed between frame sequences are captured by defining two types of correlation graphs: an intra-correlation graph within a single video sequence and an inter-correlation graph between two sequences. Video retrieval and ranking tasks are designed to evaluate the manifold learning stage. The experimental outcome shows that the approach outperforms conventional techniques in identifying similar video content and capturing the spatio-temporal correlations between them.
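    The alignment stage admits a compact graph formulation. Below is a minimal Python sketch, not the thesis's implementation: it assumes per-frame descriptors for two clips are given, builds the intra-correlation graph from temporal adjacency within each clip and the inter-correlation graph from descriptor similarity across clips, and embeds both clips jointly through the graph Laplacian, in the spirit of Laplacian eigenmaps.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def align_embed(X1, X2, k=5, dim=2):
    """Jointly embed two clips given per-frame descriptors X1 (n1, d) and X2 (n2, d)."""
    n1 = len(X1)
    n = n1 + len(X2)
    W = np.zeros((n, n))
    # intra-correlation graph: connect temporally adjacent frames within each clip
    for start, end in ((0, n1), (n1, n)):
        for i in range(start, end - 1):
            W[i, i + 1] = W[i + 1, i] = 1.0
    # inter-correlation graph: connect each frame of clip 1 to its k most similar
    # frames of clip 2, weighted by a Gaussian kernel on descriptor distance
    D = cdist(X1, X2)
    sigma = D.mean() + 1e-8
    for i in range(n1):
        for j in np.argsort(D[i])[:k]:
            W[i, n1 + j] = W[n1 + j, i] = np.exp(-(D[i, j] / sigma) ** 2)
    L = np.diag(W.sum(axis=1)) - W   # combined graph Laplacian
    _, vecs = eigh(L)                # eigenvectors in ascending eigenvalue order
    Y = vecs[:, 1:dim + 1]           # drop the trivial constant eigenvector
    return Y[:n1], Y[n1:]            # low-dimensional trajectories of both clips

# toy usage with random descriptors standing in for space-time SIFT features
rng = np.random.default_rng(0)
Y1, Y2 = align_embed(rng.normal(size=(40, 16)), rng.normal(size=(50, 16)))
```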

    ์ž ์žฌ ์ž„๋ฒ ๋”ฉ์„ ํ†ตํ•œ ์‹œ๊ฐ์  ์Šคํ† ๋ฆฌ๋กœ๋ถ€ํ„ฐ์˜ ์„œ์‚ฌ ํ…์ŠคํŠธ ์ƒ์„ฑ๊ธฐ ํ•™์Šต

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2019. Advisor: Byoung-Tak Zhang.

    The ability to understand stories is essential to what makes humans unique among primates and other animals. Story understanding is crucial for AI agents that are to live with people in everyday life and understand their context. However, most research on story AI focuses on automated story generation based on manually designed closed worlds, which are widely used for computational authoring. Machine learning techniques on story corpora face the same problems as natural language processing in general, such as omitted details and missing commonsense knowledge. With the remarkable success of deep learning in computer vision increasing interest in research bridging vision and language, vision-grounded story data can potentially improve the performance of story understanding and narrative text generation. Let us assume that AI agents are placed in an environment where sensing information arrives through a camera. Such agents observe their surroundings, translate them into a story in natural language, and predict the following event, or several events, sequentially. This dissertation studies the related problems of learning stories and generating narrative text from image streams or videos.

    The first problem is to generate a narrative text from a sequence of ordered images. As a solution, we introduce GLAC Net (Global-Local Attention Cascading Network), which translates image sequences into narrative paragraphs using an encoder-decoder framework in a sequence-to-sequence setting. It uses convolutional neural networks to extract information from images and recurrent neural networks for text generation. We introduce visual cue encoders with stacked bidirectional LSTMs, aggregating the outputs of each layer into contextualized image vectors that capture visual clues. The coherency of the generated text is further improved by conveying (cascading) the information of the previous sentence to the next sentence serially in the decoders. We evaluate the model on the Visual Storytelling (VIST) dataset; it outperforms other state-of-the-art methods, achieving the best total score and the best score on all six aspects in the human evaluation of the visual storytelling challenge.

    The second problem is to predict the following events or narrative texts from the earlier parts of a story; prediction should be possible at any step and over an arbitrary length. We propose recurrent event retrieval models as a solution. They train a context accumulation function and two embedding functions that reduce the distance, in a latent space, between the cumulative context at the current time and the next probable events. The cumulative context is updated with each new event as input using bilinear operations, and the updated context is used to find candidate next events. Evaluated on the Story Cloze Test, they show competitive performance and the best results in the open-ended generation setting; we also demonstrate working examples in an interactive setting.

    The third problem concerns composite representation learning of semantics and order for video stories. We embed each episode as a trajectory-like sequence of events in the latent space and propose ViStoryNet to regenerate video stories from these embeddings (the story completion task).
    We convert event sentences to thought vectors and train functions that embed successive events close to each other, so that episodes form trajectories. Bidirectional LSTMs are trained as sequence models, with GRU-based decoders generating event sentences. We test the approach on the PororoQA dataset and observe that most episodes take the form of trajectories. Using them to complete blocked parts of stories yields results that are not perfect but broadly similar to the originals. These results can be applied to AI agents that sense their living area with cameras, describe situations as stories, infer unobserved parts, and predict the future story.

    [Abstract in Korean, translated:] The ability to understand stories is an important capability that distinguishes humans not only from animals but also from other primates. For artificial intelligence to live alongside people in everyday life and understand the context of their lives, the ability to understand stories is essential. However, because of the difficulty of language processing, existing story research has mainly focused on techniques for generating high-quality works under predefined world models. Attempts to handle stories with machine learning have largely had to rely on data expressed in natural language and therefore suffer the same problems as natural language processing. Data linked with visual information can help to overcome this. Thanks to the recent rapid progress of deep learning, research on the relationship between vision and language is growing. As a vision for this research, consider an AI agent placed in an environment where information about its surroundings arrives through a camera. Within it, the agent observes its surroundings, generates a story about them in natural language, and, based on the generated story, predicts the story to follow, from one step to several steps ahead. This dissertation covers methods for learning the stories (visual stories) that appear in photographs and videos, converting them into narrative text, and inferring hidden and subsequent events.

    First, we address the problem of generating story text from a given sequence of photographs (visual storytelling). To solve it, we proposed GLAC Net. Convolutional neural networks extract information from the photographs, and recurrent neural networks generate the sentences. As a sequence-to-sequence encoder, multi-layer bidirectional recurrent networks are arranged to represent the overall story structure, and a global-local attention model is proposed to exploit the information of each individual photograph. In addition, a mechanism that passes on the information of the preceding sentence is proposed so that context and local information are not lost while generating multiple sentences. With this method we trained on the VIST dataset and received the highest scores, both in total and in all six categories, under human evaluation at the first Visual Storytelling Challenge.

    Second, we address the problem of predicting the next sentence when part of a story is given as sentences.
์ž„์˜์˜ ๊ธธ์ด์˜ ์Šคํ† ๋ฆฌ์— ๋Œ€ํ•ด ์ž„์˜์˜ ์œ„์น˜์—์„œ ์˜ˆ์ธก์ด ๊ฐ€๋Šฅํ•ด์•ผ ํ•˜๊ณ , ์˜ˆ์ธกํ•˜๋ ค๋Š” ๋‹จ๊ณ„ ์ˆ˜์— ๋ฌด๊ด€ํ•˜๊ฒŒ ์ž‘๋™ํ•ด์•ผ ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ ์ˆœํ™˜ ์‚ฌ๊ฑด ์ธ์ถœ ๋ชจ๋ธ(Recurrent Event Retrieval Models)์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ์€๋‹‰ ๊ณต๊ฐ„ ์ƒ์—์„œ ํ˜„์žฌ๊นŒ์ง€ ๋ˆ„์ ๋œ ๋งฅ๋ฝ๊ณผ ๋‹ค์Œ์— ๋ฐœ์ƒํ•  ์œ ๋ ฅ ์‚ฌ๊ฑด ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๊ฐ€๊น๊ฒŒ ํ•˜๋„๋ก ๋งฅ๋ฝ๋ˆ„์ ํ•จ์ˆ˜์™€ ๋‘ ๊ฐœ์˜ ์ž„๋ฒ ๋”ฉ ํ•จ์ˆ˜๋ฅผ ํ•™์Šตํ•œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์ด๋ฏธ ์ž…๋ ฅ๋˜์–ด ์žˆ๋˜ ์Šคํ† ๋ฆฌ์— ์ƒˆ๋กœ์šด ์‚ฌ๊ฑด์ด ์ž…๋ ฅ๋˜๋ฉด ์Œ์„ ํ˜•์  ์—ฐ์‚ฐ์„ ํ†ตํ•ด ๊ธฐ์กด์˜ ๋งฅ๋ฝ์„ ๊ฐœ์„ ํ•˜์—ฌ ๋‹ค์Œ์— ๋ฐœ์ƒํ•  ์œ ๋ ฅํ•œ ์‚ฌ๊ฑด๋“ค์„ ์ฐพ๋Š”๋‹ค. ์ด ๋ฐฉ๋ฒ•์œผ๋กœ ๋ฝ์Šคํ† ๋ฆฌ(ROCStories) ๋ฐ์ดํ„ฐ์ง‘ํ•ฉ์„ ํ•™์Šตํ•˜์˜€๊ณ , ์Šคํ† ๋ฆฌ ํด๋กœ์ฆˆ ํ…Œ์ŠคํŠธ(Story Cloze Test)๋ฅผ ํ†ตํ•ด ํ‰๊ฐ€ํ•œ ๊ฒฐ๊ณผ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, ํŠนํžˆ ์ž„์˜์˜ ๊ธธ์ด๋กœ ์ถ”๋ก ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฒ• ์ค‘์— ์ตœ๊ณ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค. ์„ธ ๋ฒˆ์งธ๋กœ, ๋น„๋””์˜ค ์Šคํ† ๋ฆฌ์—์„œ ์‚ฌ๊ฑด ์‹œํ€€์Šค ์ค‘ ์ผ๋ถ€๊ฐ€ ๊ฐ€๋ ค์กŒ์„ ๋•Œ ์ด๋ฅผ ๋ณต๊ตฌํ•˜๋Š” ๋ฌธ์ œ๋ฅผ ๋‹ค๋ฃฌ๋‹ค. ํŠนํžˆ, ๊ฐ ์‚ฌ๊ฑด์˜ ์˜๋ฏธ ์ •๋ณด์™€ ์ˆœ์„œ๋ฅผ ๋ชจ๋ธ์˜ ํ‘œํ˜„ ํ•™์Šต์— ๋ฐ˜์˜ํ•˜๊ณ ์ž ํ•˜์˜€๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์€๋‹‰ ๊ณต๊ฐ„ ์ƒ์— ๊ฐ ์—ํ”ผ์†Œ๋“œ๋“ค์„ ๊ถค์  ํ˜•ํƒœ๋กœ ์ž„๋ฒ ๋”ฉํ•˜๊ณ , ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์Šคํ† ๋ฆฌ๋ฅผ ์žฌ์ƒ์„ฑ์„ ํ•˜์—ฌ ์Šคํ† ๋ฆฌ ์™„์„ฑ์„ ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์ธ ๋น„์Šคํ† ๋ฆฌ๋„ท(ViStoryNet)์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ๊ฐ ์—ํ”ผ์†Œ๋“œ๋ฅผ ๊ถค์  ํ˜•ํƒœ๋ฅผ ๊ฐ€์ง€๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ๊ฑด ๋ฌธ์žฅ์„ ์‚ฌ๊ณ ๋ฒกํ„ฐ(thought vector)๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , ์—ฐ์† ์ด๋ฒคํŠธ ์ˆœ์„œ ์ž„๋ฒ ๋”ฉ์„ ํ†ตํ•ด ์ „ํ›„ ์‚ฌ๊ฑด๋“ค์ด ์„œ๋กœ ๊ฐ€๊น๊ฒŒ ์ž„๋ฒ ๋”ฉ๋˜๋„๋ก ํ•˜์—ฌ ํ•˜๋‚˜์˜ ์—ํ”ผ์†Œ๋“œ๊ฐ€ ๊ถค์ ์˜ ๋ชจ์–‘์„ ๊ฐ€์ง€๋„๋ก ํ•™์Šตํ•˜์˜€๋‹ค. ๋ฝ€๋กœ๋กœQA ๋ฐ์ดํ„ฐ์ง‘ํ•ฉ์„ ํ†ตํ•ด ์‹คํ—˜์ ์œผ๋กœ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜์˜€๋‹ค. ์ž„๋ฒ ๋”ฉ ๋œ ์—ํ”ผ์†Œ๋“œ๋“ค์€ ๊ถค์  ํ˜•ํƒœ๋กœ ์ž˜ ๋‚˜ํƒ€๋‚ฌ์œผ๋ฉฐ, ์—ํ”ผ์†Œ๋“œ๋“ค์„ ์žฌ์ƒ์„ฑ ํ•ด๋ณธ ๊ฒฐ๊ณผ ์ „์ฒด์ ์ธ ์ธก๋ฉด์—์„œ ์œ ์‚ฌํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์˜€๋‹ค. ์œ„ ๊ฒฐ๊ณผ๋ฌผ๋“ค์€ ์นด๋ฉ”๋ผ๋กœ ์ž…๋ ฅ๋˜๋Š” ์ฃผ๋ณ€ ์ •๋ณด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์Šคํ† ๋ฆฌ๋ฅผ ์ดํ•ดํ•˜๊ณ  ์ผ๋ถ€ ๊ด€์ธก๋˜์ง€ ์•Š์€ ๋ถ€๋ถ„์„ ์ถ”๋ก ํ•˜๋ฉฐ, ํ–ฅํ›„ ์Šคํ† ๋ฆฌ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ๋ฒ•๋“ค์— ๋Œ€์‘๋œ๋‹ค.Abstract i Chapter 1 Introduction 1 1.1 Story of Everyday lives in Videos and Story Understanding . . . 1 1.2 Problems to be addressed . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Approach and Contribution . . . . . . . . . . . . . . . . . . . . . 6 1.4 Organization of Dissertation . . . . . . . . . . . . . . . . . . . . . 9 Chapter 2 Background and Related Work 10 2.1 Why We Study Stories . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Latent Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3 Order Embedding and Ordinal Embedding . . . . . . . . . . . . 14 2.4 Comparison to Story Understanding . . . . . . . . . . . . . . . . 15 2.5 Story Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.5.1 Abstract Event Representations . . . . . . . . . . . . . . . 17 2.5.2 Seq-to-seq Attentional Models . . . . . . . . . . . . . . . . 18 2.5.3 Story Generation from Images . . . . . . . . . . . . . . . 19 Chapter 3 Visual Storytelling via Global-local Attention Cascading Networks 21 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2 Evaluation for Visual Storytelling . . . . . . . . . . . . . . . . . . 26 3.3 Global-local Attention Cascading Networks (GLAC Net) . . . . . 27 3.3.1 Encoder: Contextualized Image Vector Extractor . . . . . 
28 3.3.2 Decoder: Story Generator with Attention and Cascading Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.4.1 VIST Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.4.2 Experiment Settings . . . . . . . . . . . . . . . . . . . . . 33 3.4.3 Network Training Details . . . . . . . . . . . . . . . . . . 36 3.4.4 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . 38 3.4.5 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . 38 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Chapter 4 Common Space Learning on Cumulative Contexts and the Next Events: Recurrent Event Retrieval Models 44 4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2 Problems of Context Accumulation . . . . . . . . . . . . . . . . . 45 4.3 Recurrent Event Retrieval Models for Next Event Prediction . . 46 4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.4.2 Story Cloze Test . . . . . . . . . . . . . . . . . . . . . . . 52 4.4.3 Open-ended Story Generation . . . . . . . . . . . . . . . . 53 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Chapter 5 ViStoryNet: Order Embedding of Successive Events and the Networks for Story Regeneration 58 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.2 Order Embedding with Triple Learning . . . . . . . . . . . . . . 60 5.2.1 Embedding Ordered Objects in Sequences . . . . . . . . . 62 5.3 Problems and Contextual Events . . . . . . . . . . . . . . . . . . 62 5.3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . 62 5.3.2 Contextual Event Vectors from Kids Videos . . . . . . . . 64 5.4 Architectures for the Story Regeneration Task . . . . . . . . . . . 67 5.4.1 Two Sentence Generators as Decoders . . . . . . . . . . . 68 5.4.2 Successive Event Order Embedding (SEOE) . . . . . . . . 68 5.4.3 Sequence Models of the Event Space . . . . . . . . . . . . 72 5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.5.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . 73 5.5.2 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . 73 5.5.3 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . 74 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Chapter 6 Concluding Remarks 80 6.1 Summary of Methods and Contributions . . . . . . . . . . . . . . 80 6.2 Limitation and Outlook . . . . . . . . . . . . . . . . . . . . . . . 81 6.3 Suggestions for Future Research . . . . . . . . . . . . . . . . . . . 81 ์ดˆ๋ก 101Docto
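    The accumulate-then-retrieve structure of the recurrent event retrieval models can be illustrated with a toy sketch. Everything below is an assumption for illustration: the bilinear tensor is random rather than learned, the embedding dimension is arbitrary, and the two learned embedding functions are reduced to identity maps; only the overall structure is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                       # embedding size (assumed)
W = rng.normal(scale=0.1, size=(d, d, d))    # bilinear update tensor (learned in practice)

def accumulate(context, event):
    """Update the cumulative context with a new event via a bilinear operation."""
    out = np.einsum("i,ijk,j->k", context, W, event)
    return out / (np.linalg.norm(out) + 1e-8)

def next_event(context, candidates):
    """Retrieve the candidate event embedding closest to the accumulated context."""
    return int(np.argmax(candidates @ context))

# toy usage: accumulate three observed events, then rank candidate continuations
context = np.zeros(d)
context[0] = 1.0
for event in rng.normal(size=(3, d)):
    context = accumulate(context, event)
candidates = rng.normal(size=(10, d))
print("best continuation:", next_event(context, candidates))
```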

    Discriminative Video Representation Learning

    Representation learning is a fundamental research problem in machine learning, refining raw data to discover the representations needed for various applications. However, real-world data, particularly video data, is neither mathematically nor computationally convenient to process due to its semantic redundancy and complexity. Video data, as opposed to images, includes temporal correlation and motion dynamics, yet the ground-truth labels are normally limited to category labels, which makes video representation learning a challenging problem. To this end, this thesis addresses the problem of video representation learning, specifically discriminative video representation learning, which focuses on capturing useful data distributions and reliable feature representations that improve the performance of varied downstream tasks. We argue that not all frames in a video, nor all dimensions of a feature vector, are useful, and that they should not be treated equally in video representation learning. Based on this argument, several novel algorithms are investigated in this thesis under multiple application scenarios, such as action recognition, action detection, and one-class video anomaly detection. The proposed video representation learning methods produce discriminative video features in both deep and non-deep learning setups. Specifically, they take the form of: 1) an early fusion layer that adopts a temporal ranking SVM formulation, agglomerating several optical flow images from consecutive frames into a novel compact representation, named dynamic optical flow images; 2) an intermediate feature aggregation layer that applies weakly-supervised contrastive learning techniques, learning discriminative video representations by contrasting positive and negative samples from a sequence; 3) a new formulation for one-class feature learning that learns a set of discriminative subspaces with orthonormal hyperplanes to flexibly bound the one-class data distribution using Riemannian optimisation methods. We provide extensive experiments to give intuition into why the learned representations are discriminative and useful. All the proposed methods in this thesis are evaluated on standard publicly available benchmarks, demonstrating state-of-the-art performance.
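    The first of these, the dynamic optical flow image, is amenable to a closed-form approximation. The sketch below uses the simplified rank-pooling weights popularised for dynamic images (a linear weighting of frames by temporal order) as a stand-in for the thesis's temporal ranking SVM formulation; the per-frame optical-flow features are assumed given.

```python
import numpy as np

def approx_rank_pool(F):
    """F: (T, d) per-frame optical-flow features in temporal order.
    Returns a (d,) vector whose direction encodes the temporal evolution."""
    T = F.shape[0]
    alpha = 2.0 * np.arange(1, T + 1) - T - 1   # simplified rank-pooling weights
    u = (alpha[:, None] * F).sum(axis=0)        # later frames weigh more
    return u / (np.linalg.norm(u) + 1e-8)

# toy usage: 20 frames of 128-dimensional flow features
rng = np.random.default_rng(0)
dynamic_image = approx_rank_pool(rng.normal(size=(20, 128)))
```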

    Unsupervised object candidate discovery for activity recognition

    The automatic interpretation of human motion from video is an important component of many computer vision applications, such as human-robot interaction, video surveillance, and content-based analysis of multimedia data. Unlike most approaches in this field, which mainly target the classification of simple actions such as standing up or walking, this work focuses on the recognition of human activities, i.e. complex action sequences that usually involve interactions between a person and objects. According to action identification theory, human activities derive their meaning not only from the motion patterns involved but above all from the general context in which they take place. This contextual information includes, among other things, the set of previously performed actions, the location of the acting person, and the objects being manipulated. It is impossible, for example, to decide on the basis of motion patterns alone, without any object knowledge, whether a person raising a hand to the mouth is eating, drinking, smoking, or merely wiping their lips. Most work on computational action and activity recognition, however, ignores contextual information entirely and restricts itself to identifying human activities from the observed motion. Where object knowledge is incorporated into the classification, it is usually obtained from supervised detectors, whose setup in turn requires a considerable amount of training data. Given the high time cost of annotating this training data, extending such systems, for example by adding new action types, becomes the real bottleneck. A further drawback of supervised object detectors is their error-proneness, even when state-of-the-art algorithms are used. Based on this observation, the goal of this work is to improve the performance of computational activity recognition by incorporating object knowledge that, in contrast to previous approaches, can be obtained without supervised training.

    We humans have the remarkable ability to focus attention selectively on certain regions of the visual field while suppressing irrelevant regions. This cognitive process allows us to direct our limited conscious resources, unconsciously, to content that the brain then evaluates, for example by interpreting visual patterns as objects of a particular type. The regions of the visual field that attract our attention unconsciously are called proto-objects: indeterminate parts of the visual information spectrum that can later be perceived as actual objects once a person directs attention at them. Put simply, proto-objects are candidates for objects, or parts of objects, that have been localized but not yet identified. Inspired by the human ability to reliably distinguish such visually salient regions from the background, many researchers have developed methods for localizing proto-objects, all of which presuppose as little statistical knowledge about actual objects as possible. Visual attention and object recognition are closely intertwined processes in the human visual system, so there is keen interest in computer vision in integrating both concepts to raise the performance of current image recognition systems. The methods developed in this work go in a similar direction: we demonstrate that localizing proto-objects yields object candidates suitable as an additional modality for motion-based recognition of human activities. The foundation of this work is a highly efficient algorithm that approximates visual saliency using quaternion-based DCT image signatures. To extract a set of suitable object candidates (i.e. proto-objects) from the resulting saliency maps, we developed a method that implements the cognitive mechanism of inhibition of return. We then use the object candidates obtained in this way, in combination with state-of-the-art bag-of-words motion descriptors, to classify complex activities of daily living. We evaluate the system on several widely used benchmark datasets and show experimentally that incorporating proto-objects leads to a considerable performance gain over purely motion-based approaches. We also demonstrate that the proposed system makes markedly fewer errors in recognizing human activities than a variety of state-of-the-art methods; surprisingly, it even outperforms approaches built on object knowledge obtained from supervised detectors or manual annotations.

    Benchmark datasets are an important means of quantitatively comparing pattern recognition methods. After reviewing all publicly available, relevant benchmarks, however, we found that none was suitable for a detailed evaluation of methods for recognizing complex human activities. Part of this work therefore consisted of designing and recording such a dataset, the KIT Robo-kitchen benchmark. As the name suggests, we chose a kitchen scenario, since it allows a wide range of activities of daily living to be captured, many of them involving object manipulation. To obtain as rich a set of natural motions as possible, the participants were hardly restricted during recording in how the various activities were to be performed; they were told only which activity to perform, where the required objects were located, and whether the activity was to be carried out at the kitchen table or on the worktop. This sets KIT Robo-kitchen clearly apart from most existing datasets, which contain very unrealistically acted activities recorded under laboratory conditions. Since its publication, the benchmark has repeatedly been used to evaluate algorithms aimed at recognizing long-running, realistic, complex, and quasi-periodic human activities.
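    To make the two unsupervised steps concrete, here is a minimal Python sketch: an image-signature saliency map, computed per colour channel with a real DCT as a stand-in for the quaternion-based DCT signature, followed by an inhibition-of-return loop that repeatedly selects the most salient location and suppresses its surround to yield proto-object candidates. Parameters such as the blur width and suppression radius are assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn
from scipy.ndimage import gaussian_filter

def saliency_map(img):
    """img: (H, W, C) float array. Returns an (H, W) saliency map."""
    sal = np.zeros(img.shape[:2])
    for c in range(img.shape[2]):
        signature = np.sign(dctn(img[..., c], norm="ortho"))  # DCT image signature
        sal += idctn(signature, norm="ortho") ** 2            # reconstruct and square
    return gaussian_filter(sal, sigma=3)

def proto_objects(sal, n=5, radius=20):
    """Inhibition of return: take the maximum, suppress its surround, repeat."""
    sal = sal.copy()
    yy, xx = np.mgrid[0:sal.shape[0], 0:sal.shape[1]]
    peaks = []
    for _ in range(n):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)
        peaks.append((y, x))
        sal[(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = -np.inf
    return peaks

# toy usage on a random image
peaks = proto_objects(saliency_map(np.random.rand(120, 160, 3)))
```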

    Feature based dynamic intra-video indexing

    A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy. With the advent of digital imagery and its widespread application in all walks of life, video has become an important component of the world of communication. Video content ranging from broadcast news, sports, personal videos, surveillance, movies, and entertainment is growing exponentially in quantity, and retrieving content of interest from the corpora is becoming a challenge. This has led to increased interest among researchers in video structure analysis, feature extraction, content annotation, tagging, video indexing, querying, and retrieval. However, most previous work is confined to specific domains and constrained by quality, processing, and storage capabilities. This thesis presents a novel framework agglomerating established approaches, from feature extraction to browsing, in one content-based video retrieval system. The proposed framework fills the identified gap while satisfying the imposed constraints on processing, storage, quality, and retrieval times. The output comprises a framework, a methodology, and a prototype application that allow the user to efficiently and effectively retrieve content of interest, such as age, gender, and activity, by specifying the relevant query. Experiments have shown plausible results, with an average precision and recall of 0.91 and 0.92 respectively for face detection using a Haar-wavelet-based approach. Precision for age ranges from 0.82 to 0.91 and recall from 0.78 to 0.84. Gender recognition gives better precision for males (0.89) than for females, while recall is higher for females (0.92). The subject's activity is detected using the Hough transform and classified using a Hidden Markov Model. A comprehensive dataset to support similar studies has also been developed as part of the research. A graphical user interface (GUI) providing a friendly and intuitive front end has been integrated into the system to facilitate retrieval. Comparison of intraclass correlation coefficients (ICC) shows that the system's performance closely resembles that of a human annotator. The performance has been optimised for time and error rate.
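    As an illustration of the face-detection component, the sketch below runs OpenCV's stock Haar cascade on sampled frames. The thesis's own Haar-wavelet detector, sampling rate, and index format are not specified here, so those choices are assumptions.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def index_faces(video_path, step=25):
    """Sample every `step`-th frame and record face bounding boxes for indexing."""
    cap = cv2.VideoCapture(video_path)
    index, frame_no = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_no % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = cascade.detectMultiScale(
                gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
            if len(faces):
                index.append((frame_no, faces.tolist()))
        frame_no += 1
    cap.release()
    return index
```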

    Geospatial Information Research: State of the Art, Case Studies and Future Perspectives

    Geospatial information science (GI science) is concerned with the development and application of geodetic and information science methods for modeling, acquiring, sharing, managing, exploring, analyzing, synthesizing, visualizing, and evaluating data on spatio-temporal phenomena related to the Earth. As an interdisciplinary scientific discipline, it focuses on developing and adapting information technologies to understand processes on the Earth and human-place interactions, to detect and predict trends and patterns in the observed data, and to support decision making. The authors – members of DGK, the Geoinformatics division, as part of the Committee on Geodesy of the Bavarian Academy of Sciences and Humanities, representing geodetic research and university teaching in Germany – have prepared this paper as a means to point out future research questions and directions in geospatial information science. For the different facets of geospatial information science, the state of the art is presented and illustrated, largely with the authors' own case studies. The paper thus shows what contributions the German GI community makes and which research perspectives arise in geospatial information science. It further demonstrates that GI science, with its expertise in data acquisition and interpretation, information modeling and management, integration, decision support, visualization, and dissemination, can help solve many of the grand challenges facing society today and in the future.

    Fourth SIAM Conference on Applications of Dynamical Systems


    Large-scale interactive exploratory visual search

    Large-scale visual search has been one of the challenging issues in the era of big data. It demands techniques that are not only highly effective and efficient but also allow users to conveniently express their information needs and refine their intents. In this thesis, we focus on developing an exploratory framework for large-scale visual search, together with a number of enabling techniques, including compact visual content representation for scalable search, near-duplicate video shot detection, and action-based event detection. We propose a novel scheme for extremely low bit-rate visual search, which sends compressed visual words, consisting of a vocabulary tree histogram and descriptor orientations, rather than raw descriptors. Compact representation of video data is achieved by identifying keyframes of a video, which also helps users comprehend visual content efficiently; to this end we propose a novel Bag-of-Importance model for static video summarization. Near-duplicate detection is a key issue for large-scale visual search, since a large number of nearly identical images and videos exist; we propose an improved near-duplicate video shot detection approach for more effective shot representation. Event detection is one way to bridge the semantic gap in visual search; we focus in particular on human-action-centred event detection and propose an enhanced sparse coding scheme to model human actions, which significantly reduces computational cost while achieving recognition accuracy highly comparable to state-of-the-art methods. Finally, we propose an integrated solution addressing the prime challenges of large-scale interactive visual search. The proposed system is also one of the first attempts at exploratory visual search; it provides users with more robust results to support their exploration.
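    The extremely low bit-rate scheme, transmitting quantised visual words instead of raw descriptors, can be sketched with a flat k-means vocabulary standing in for the vocabulary tree; the random stand-in descriptors, the vocabulary size, and the omission of descriptor-orientation coding are all simplifying assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
train_desc = rng.normal(size=(5000, 128))   # stand-in local descriptors (e.g. SIFT)
vocab = MiniBatchKMeans(n_clusters=256, n_init=3, random_state=0).fit(train_desc)

def compress_query(descriptors):
    """Send a normalised visual-word histogram instead of the descriptors themselves."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-8)   # L2-normalised for cosine matching

query_hist = compress_query(rng.normal(size=(300, 128)))
print("non-zero words:", int((query_hist > 0).sum()), "of", vocab.n_clusters)
```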

    A Study on Human Motion Acquisition and Recognition Employing Structured Motion Database

    ไนๅทžๅทฅๆฅญๅคงๅญฆๅšๅฃซๅญฆไฝ่ซ–ๆ–‡ ๅญฆไฝ่จ˜็•ชๅท:ๅทฅๅš็”ฒ็ฌฌ332ๅทใ€€ๅญฆไฝๆŽˆไธŽๅนดๆœˆๆ—ฅ:ๅนณๆˆ24ๅนด3ๆœˆ23ๆ—ฅ1 Introduction||2 Human Motion Representation||3 Human Motion Recognition||4 Automatic Human Motion Acquisition||5 Human Motion Recognition Employing Structured Motion Database||6 Analysis on the Constraints in Human Motion Recognition||7 Multiple Personsโ€™ Action Recognition||8 Discussion and ConclusionsHuman motion analysis is an emerging research field for the video-based applications capable of acquiring and recognizing human motions or actions. The automaticity of such a system with these capabilities has vital importance in real-life scenarios. With the increasing number of applications, the demand for a human motion acquisition system is gaining importance day-by-day. We develop such kind of acquisition system based on body-parts modeling strategy. The system is able to acquire the motion by positioning body joints and interpreting those joints by the inter-parts inclination. Besides the development of the acquisition system, there is increasing need for a reliable human motion recognition system in recent years. There are a number of researches on motion recognition is performed in last two decades. At the same time, an enormous amount of bulk motion datasets are becoming available. Therefore, it becomes an indispensable task to develop a motion database that can deal with large variability of motions efficiently. We have developed such a system based on the structured motion database concept. In order to gain a perspective on this issue, we have analyzed various aspects of the motion database with a view to establishing a standard recognition scheme. The conventional structured database is subjected to improvement by considering three aspects: directional organization, nearest neighbor searching problem resolution, and prior direction estimation. In order to investigate and analyze comprehensively the effect of those aspects on motion recognition, we have adopted two forms of motion representation, eigenspace-based motion compression, and B-Tree structured database. Moreover, we have also analyzed the two important constraints in motion recognition: missing information and clutter outdoor motions. Two separate systems based on these constraints are also developed that shows the suitable adoption of the constraints. However, several people occupy a scene in practical cases. We have proposed a detection-tracking-recognition integrated action recognition system to deal with multiple people case. The system shows decent performance in outdoor scenarios. The experimental results empirically illustrate the suitability and compatibility of various factors of the motion recognition
    • โ€ฆ
    corecore