49 research outputs found
A Transformer-based Multi-task Model for Attribute-based Person Retrieval
Person retrieval is a crucial task in video surveillance. While searching for persons of interest based on so-called query images has gained much interest in the research community, attribute-based approaches are rarely studied. Attribute-based person retrieval takes a person's semantic attributes as input and provides a ranked list of search results that match the description. Typically, such approaches either build on a pedestrian attribute recognition approach or learn a joint feature space between attribute descriptions and image data. In this work, both approaches are combined in a multi-task model to benefit from the advantages of both procedures. Moreover, transformer modules are incorporated to increase performance further. Experimental evaluation proves the effectiveness of the approach and shows that the proposed architecture significantly outperforms the baselines.
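Concretely, the final ranking step such attribute-based retrieval systems end with, scoring each gallery image's predicted attribute probabilities against a binary query description, can be sketched as follows. This is an illustrative scoring rule with invented numbers, not the paper's transformer model:

```python
# Hypothetical ranking step for attribute-based person retrieval.
# query_attrs encodes the description (1 = attribute requested, 0 = absent);
# gallery_probs would come from a pedestrian attribute recognition model.
import numpy as np

def rank_gallery(query_attrs: np.ndarray, gallery_probs: np.ndarray) -> np.ndarray:
    """Return gallery indices ordered from best to worst match."""
    eps = 1e-6
    p = np.clip(gallery_probs, eps, 1 - eps)
    # Log-likelihood of the queried attribute vector under each image's
    # predicted attribute probabilities.
    scores = (query_attrs * np.log(p) + (1 - query_attrs) * np.log(1 - p)).sum(axis=1)
    return np.argsort(-scores)

query = np.array([1, 0, 1])            # e.g. "backpack, no hat, long sleeves"
gallery = np.array([[0.9, 0.1, 0.8],   # strong match
                    [0.2, 0.9, 0.1],   # poor match
                    [0.6, 0.4, 0.7]])  # middling
print(rank_gallery(query, gallery))    # → [0 2 1]
```

In a joint-feature-space variant, the same ranking would instead use cosine similarity between an embedded query and embedded gallery images.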
Lidar-based Gait Analysis and Activity Recognition in a 4D Surveillance System
This paper presents new approaches for gait and activity analysis based on the data streams of a Rotating Multi-Beam (RMB) Lidar sensor. The proposed algorithms are embedded into an integrated 4D vision and visualization system, which is able to analyze and interactively display real scenarios in natural outdoor environments with walking pedestrians. The main focus of the investigation is gait-based person re-identification during tracking, and the recognition of specific activity patterns such as bending, waving, making phone calls, and checking the time on a wristwatch.
The descriptors for training and recognition are observed and extracted from realistic outdoor surveillance scenarios, where multiple pedestrians are walking in the field of interest following possibly intersecting trajectories, thus the observations might often be affected by occlusions or background noise.
Since no public database is available for such scenarios, we created and published a new Lidar-based outdoor gait and activity dataset on our website, which contains point cloud sequences of 28 different persons extracted and aggregated from 35-minute-long measurements.
The presented results confirm that both efficient gait-based identification and activity recognition are achievable in the sparse point clouds of a single RMB Lidar sensor. After extracting the pedestrian trajectories, we synthesized a free-viewpoint video in which moving avatar models follow the trajectories of the observed pedestrians in real time, ensuring that the leg movements of the animated avatars are synchronized with the real gait cycles observed in the Lidar stream.
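As a toy illustration of gait-cycle recovery from such Lidar streams, suppose each tracked pedestrian yields a 1-D signal per frame (for instance the horizontal distance between the two leg clusters); the gait frequency is then the signal's spectral peak. This is a hedged sketch on synthetic data, not the paper's Lidar descriptors:

```python
# Estimate a walker's dominant gait frequency from a per-frame leg-motion
# signal via a spectral peak. Frame rate and signal shape are assumptions.
import numpy as np

def gait_frequency(leg_signal: np.ndarray, fps: float) -> float:
    """Dominant oscillation frequency (Hz) of a leg-motion signal."""
    x = leg_signal - leg_signal.mean()           # drop the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    return float(freqs[np.argmax(spectrum)])

fps = 15.0                                       # assumed Lidar frame rate
t = np.arange(0, 10, 1 / fps)                    # 10 s of frames
noise = 0.02 * np.random.default_rng(1).normal(size=t.size)
signal = 0.3 * np.sin(2 * np.pi * 0.9 * t) + noise   # ~0.9 Hz gait cycle
print(round(gait_frequency(signal, fps), 1))     # → 0.9
```

The recovered frequency is what would drive the leg phase of an animated avatar so that it stays synchronized with the observed gait cycle.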
Proceedings of the 2021 Joint Workshop of Fraunhofer IOSB and Institute for Anthropomatics, Vision and Fusion Laboratory
In 2021, the annual joint workshop of the Fraunhofer IOSB and KIT IES was hosted at the IOSB in Karlsruhe. For a week, from 2 to 6 July, the doctoral students presented extensive reports on the status of their research. The results and ideas presented at the workshop are collected in this book in the form of detailed technical reports.
Robust density modelling using the Student's t-distribution for human action recognition
The extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can severely affect density modelling when the Gaussian distribution is used as the model, since it is highly sensitive to outliers. The Gaussian distribution is also often used as the base component of graphical models for recognising human actions in videos (hidden Markov models and others), and the presence of outliers can significantly affect the recognition accuracy. In contrast, the Student's t-distribution is more robust to outliers and can be exploited to improve the recognition rate in the presence of abnormal data. In this paper, we present an HMM which uses mixtures of t-distributions as observation probabilities and show, through experiments on two well-known datasets (Weizmann, MuHAVi), a remarkable improvement in classification accuracy. © 2011 IEEE
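The robustness argument is easy to check numerically: the Student's t density keeps appreciable mass in the tails where the Gaussian has essentially none, so a maximum-likelihood t fit is far less distorted by gross outliers. A minimal sketch assuming SciPy, with invented data rather than real video features:

```python
import numpy as np
from scipy import stats

# Log-density at a point 10 standard deviations out: the Gaussian all but
# rules it out, while a t-distribution with 3 degrees of freedom does not.
tail_gap = stats.t.logpdf(10, df=3) - stats.norm.logpdf(10)
print(f"log p_t(10) - log p_N(10) = {tail_gap:.1f}")   # ≈ 42.8

# Maximum-likelihood fits on data with a few gross outliers: the fitted
# Gaussian scale is inflated, while the fitted t absorbs the outliers
# with heavy tails and keeps a scale close to that of the inliers.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(size=200), [15.0, -12.0, 20.0]])
g_mu, g_sigma = stats.norm.fit(data)
t_df, t_mu, t_sigma = stats.t.fit(data)
print(g_sigma > t_sigma)   # the Gaussian scale is dragged by the outliers
```

The same effect applies inside an HMM: a t-mixture observation density assigns outlying feature vectors non-negligible likelihood instead of letting them dominate the state posteriors.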
Object Tracking-by-Segmentation in Videos
This thesis focuses on the problem of object tracking. Given a video, the general objective of tracking is to track the location over time of one or more targets in the image sequence. This is a very challenging task, as algorithms need to deal with problems such as appearance variations, non-rigid deformations, cluttered backgrounds, occlusions, etc. While most existing methods use bounding boxes to represent the target, we use segmentations instead, which provide better access to target pixels and can better handle occlusions. Our first contribution is a new tracking algorithm that, given an over-segmentation of a video, tracks multiple targets through interactions and occlusions. We develop a provably convergent learning algorithm for this approach, which leverages training data to improve performance. Our second contribution targets the case when an over-segmentation is not available due to poor video quality or low resolution. For this case, we develop a new algorithm that tracks coherent regions and estimates the number of target objects in each region. This count representation of a video can be used to help inform more traditional tracking techniques. Finally, we develop the first tracking-by-segmentation approach based on deep learning. We propose a novel deep network architecture and training algorithms for learning to segment and track a target object throughout a video. All of our algorithms are rigorously evaluated on challenging benchmark video collections, which demonstrate improvements over the state-of-the-art.
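A basic building block behind any tracking-by-segmentation pipeline is linking segments across frames by mask overlap. The following is an illustrative sketch of that step only (greedy IoU matching, not the thesis's learned algorithms; all names are hypothetical):

```python
# Link segment masks across two frames by greedy IoU matching.
# Masks are boolean arrays over the image grid.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union else 0.0

def link_segments(prev_masks, next_masks, thresh=0.3):
    """Match each previous-frame segment to its best unused successor."""
    links, used = {}, set()
    for i, pm in enumerate(prev_masks):
        best_j, best = None, thresh
        for j, nm in enumerate(next_masks):
            s = iou(pm, nm)
            if j not in used and s > best:
                best_j, best = j, s
        if best_j is not None:
            links[i] = best_j
            used.add(best_j)
    return links

# One target moving one pixel to the right between frames.
f1 = np.zeros((6, 6), bool); f1[2:4, 1:3] = True
f2 = np.zeros((6, 6), bool); f2[2:4, 2:4] = True
print(link_segments([f1], [f2]))  # → {0: 0}
```

Working on masks rather than bounding boxes is what gives segmentation-based trackers direct access to target pixels during occlusions.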
Large-Scale Video Learning Using Narrative Descriptions
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Computer Science and Engineering, College of Engineering, February 2021. Advisor: Gunhee Kim.
Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn a lot of interest in computer vision research, including image/video captioning, video retrieval and video question answering.
These techniques can be applied to high-level computer vision tasks and to various future industries such as search engines, social marketing, automated driving, and robotics support, through QA/dialogue generation about the surrounding environment.
However, despite these developments, video-language learning suffers from a higher degree of complexity.
This thesis investigates methodologies for learning the relationship between videos and free-formed language, including explanations, conversations, and question-and-answer, so that the machine can easily adapt to target downstream tasks.
First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce approaches for supervising human attention transfer in the video attention model, which show that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce an end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using representative visual concept words detected by an attention-based detector. As a follow-up study on these methods, we introduce the JSFusion (Joint Sequence Fusion) method, which enables efficient video search and QA through many-to-many matching of the attention model.
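The temporal attention these methods build on can be reduced to one operation: a language-side query softly selects among per-frame video features. A minimal sketch of that core (scaled dot-product attention; shapes and values are illustrative, not the thesis's exact formulation):

```python
# Scaled dot-product attention of a language query over video frames.
import numpy as np

def attend(query: np.ndarray, frames: np.ndarray) -> np.ndarray:
    """query: (d,) language-side state; frames: (T, d) per-frame features.
    Returns the attention-weighted video context vector, shape (d,)."""
    scores = frames @ query / np.sqrt(query.size)   # similarity per frame, (T,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                    # softmax over time
    return w @ frames                               # weighted frame average

# Toy check with orthogonal frame features: a query aligned with frame 2
# makes the context vector peak at index 2.
frames = np.eye(4)
query = 3.0 * frames[2]
ctx = attend(query, frames)
print(int(np.argmax(ctx)))  # → 2
```

Gaze supervision constrains the weights `w` toward human fixations, while semantic attention replaces the frame features with detected concept words to shrink the attention space.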
Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to increase the performance of character grounding and character re-identification in movies. Finally, we introduce Transitional Adaptation, which promotes caption generation models to generate coherent narratives for long videos.
In summary, this thesis presents novel approaches for automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge about objects and motion in videos, as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language task, they are expected to be applied to the latest models, bringing additional performance improvements.
Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in industry by integrating an unlimited amount of video, audio, and free-formed language data from the web.

[Korean-language abstract, translated:] Vision-language learning is an important and actively studied field, applicable not only to high-level computer vision tasks such as image/video captioning, visual question answering, video retrieval, scene understanding, and event detection, but also, through question answering and dialogue generation about the surrounding environment, to internet search and to many future industries such as social marketing, automated driving, and robotics. On the basis of this importance, computer vision and natural language processing have each advanced within their own domains, but with the recent advent of deep learning they have developed rapidly, complementing each other and producing strong synergy that improves learning results. Despite this progress, however, video-language learning often struggles because of the considerably higher complexity of the problem.

This thesis aims to learn the relationship between videos and corresponding free-formed language (descriptions, dialogues, question answering, and beyond) more efficiently, and to improve models so that they handle target tasks well. First, we introduce several methods for efficiently learning the relationship between long sentences and videos, whose visual complexity is higher than that of images: a method that supervises video-language models with human attention, a Semantic Attention method that further reduces the complexity of the attention algorithm by mediating through representative visual words first detected in the video, and a Joint Sequence Fusion method that enables efficient video retrieval and question answering based on many-to-many matching of the attention model. Next, we introduce the Character in Story Identification Network, in which the attention model goes beyond object-word relations to perform person search and person re-identification in videos simultaneously, with the two tasks reinforcing each other; finally, we introduce a method that, through self-supervised learning, guides attention-based language models to generate coherent descriptions of long videos.

In summary, the novel methods proposed in this thesis serve as technical stepping stones for video-language tasks such as video captioning, video retrieval, and video question answering; the attention modules trained through video captioning were transplanted into the networks for retrieval, question answering, and person search, simultaneously achieving state-of-the-art performance on these new problems. This shows experimentally that transferring linguistic knowledge obtained from video-language learning greatly helps multimodal video learning across vision and audio. As future work, building on these studies, we aim to create unsupervised learning models that integrate large-scale language, video, and audio data from the web and can thereby solve many open challenges in industry.

Chapter 1
Introduction
1.1 Contributions
1.2 Outline of the Thesis
Chapter 2 Related Work
2.1 Video Captioning
2.2 Video Retrieval with Natural Language
2.3 Video Question and Answering
2.4 Cross-modal Representation Learning for Vision and Language Tasks
Chapter 3 Human Attention Transfer for Video Captioning
3.1 Introduction
3.2 Video Datasets for Caption and Gaze
3.3 Approach
3.3.1 Video Pre-processing and Description
3.3.2 The Recurrent Gaze Prediction (RGP) Model
3.3.3 Construction of Visual Feature Pools
3.3.4 The Decoder for Caption Generation
3.3.5 Training
3.4 Experiments
3.4.1 Evaluation of Gaze Prediction
3.4.2 Evaluation of Video Captioning
3.4.3 Human Evaluation via AMT
3.5 Conclusion
Chapter 4 Semantic Word Attention for Video QA and Video Captioning
4.1 Introduction
4.1.1 Related Work
4.1.2 Contributions
4.2 Approach
4.2.1 Preprocessing
4.2.2 An Attention Model for Concept Detection
4.2.3 Video-to-Language Models
4.2.4 A Model for Description
4.2.5 A Model for Fill-in-the-Blank
4.2.6 A Model for Multiple-Choice Test
4.2.7 A Model for Retrieval
4.3 Experiments
4.3.1 The LSMDC Dataset and Tasks
4.3.2 Quantitative Results
4.3.3 Qualitative Results
4.4 Conclusion
Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data
5.1 Introduction
5.2 Related Work
5.3 Approach
5.3.1 Preprocessing
5.3.2 The Joint Semantic Tensor
5.3.3 The Convolutional Hierarchical Decoder
5.3.4 An Illustrative Example of How the JSFusion Model Works
5.3.5 Training
5.3.6 Implementation of Video-Language Models
5.4 Experiments
5.4.1 LSMDC Dataset and Tasks
5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks
5.4.3 Quantitative Results
5.4.4 Qualitative Results
5.5 Conclusion
Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding
6.1 Introduction
6.2 Related Work
6.3 Approach
6.3.1 Video Preprocessing
6.3.2 Visual Track Embedding
6.3.3 Textual Character Embedding
6.3.4 Character Grounding
6.3.5 Re-Identification
6.3.6 Joint Training
6.4 Experiments
6.4.1 Experimental Setup
6.4.2 Quantitative Results
6.4.3 Qualitative Results
6.5 Conclusion
Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling
7.1 Introduction
7.2 Related Work
7.3 Approach
7.3.1 The Visual Encoder
7.3.2 The Language Generator
7.3.3 Adaptation Training
7.3.4 The Sequential Coherence Loss
7.3.5 Training with the Adaptation Loss
7.3.6 Fine-tuning and Inference
7.4 Experiments
7.4.1 Experimental Setup
7.4.2 Quantitative Results
7.4.3 Further Analyses
7.4.4 Human Evaluation Results
7.4.5 Qualitative Results
7.5 Conclusion
Chapter 8 Conclusion
8.1 Summary
8.2 Future Works
Bibliography
Abstract (in Korean)
Acknowledgements
Optimization and Technology-Based Strategies to Improve Public Transit Performance Accounting for Demand Distribution
Public transit is important to societies worldwide. The operation of public transit systems is generally associated with great benefits for the users, but there are also cases in which these systems demonstrate inefficient performance. Quantifying transit performance has been an important area of research over the last few decades. This dissertation presents models to improve transit system performance through optimization techniques and new technologies, recognizing the effects of the non-uniform distribution of demand over space and time. The contributions span fixed-route transit services and on-demand transit, as well as models for flexible transit operations that lie in between.
Regarding fixed-route systems, a methodology is proposed to estimate the number of passengers left behind by subway trains due to overcrowding. Methods to identify appropriate time periods and locations for studying this phenomenon are presented. The effects of overcrowding on passenger waiting times are also investigated. The challenging case of transit networks where passengers tap in only upon entrance is analyzed, adding a new methodology to a very short list of similar studies and enhancing previous work in this field.
For demand-responsive systems, this dissertation focuses on optimizing the operation of paratransit services through coordination with alternative providers in order to decrease the high operating costs of such a service. The analysis includes a heuristic-based method. The proposed model is more detailed than existing aggregated methods and is able to perform well at high demand levels, unlike existing exact approaches. This part of the dissertation also assists in making transportation network companies a complementary part of public transit, rather than a competitor.
Finally, flexible transit systems are studied to identify the operational and demand-related characteristics of a service area that could serve as indicators of such systems' efficient performance. The focus here is on route-deviation flexible services. Continuous approximation is used to model this flexible system. A new optimized hybrid transit system with elements of both fixed-route and flexible services is proposed. Lastly, it is highlighted that the current COVID-19 pandemic has proven the need for public transit systems that can be adjusted to accommodate changes in transit demand.
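For flavor, continuous-approximation models of the kind mentioned above trade operator cost against passenger waiting in closed form; the textbook square-root headway rule is the simplest instance. This is a standard illustration, not the dissertation's formulation, and all parameter values are invented:

```python
# Textbook continuous-approximation headway optimization: minimize
#   C(h) = cost_per_dispatch / h + demand_per_hr * (h / 2) * wait_cost_per_hr
# where h is the headway in hours and average wait is h/2.
import math

def optimal_headway(cost_per_dispatch: float, demand_per_hr: float,
                    wait_cost_per_hr: float) -> float:
    """Closed-form minimizer of C(h): the classic square-root headway."""
    return math.sqrt(2 * cost_per_dispatch / (demand_per_hr * wait_cost_per_hr))

h = optimal_headway(cost_per_dispatch=120.0,   # $ per vehicle dispatch
                    demand_per_hr=200.0,       # boarding passengers per hour
                    wait_cost_per_hr=15.0)     # $ value of waiting time
print(round(h * 60, 1), "minutes")             # → 17.0 minutes
```

The same trade-off logic, with demand varying over space and time, is what more detailed models refine when comparing fixed-route, flexible, and hybrid designs.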
Open Platforms for Connected Vehicles
The abstract is in the attachment.