10,749 research outputs found

    Learning to Extract Motion from Videos in Convolutional Neural Networks

    Full text link
    This paper shows how to extract dense optical flow from videos with a convolutional neural network (CNN). The proposed model constitutes a potential building block for deeper architectures that use motion without resorting to an external algorithm, e.g., for recognition in videos. We derive our network architecture from signal processing principles to provide the desired invariances to image contrast, phase, and texture. We constrain weights within the network to enforce strict rotation invariance and to substantially reduce the number of parameters to learn. We demonstrate end-to-end training on only 8 sequences of the Middlebury dataset, orders of magnitude less data than competing CNN-based motion estimation methods require, and obtain performance comparable to classical methods on the Middlebury benchmark. Importantly, our method outputs a distributed representation of motion that can represent multiple, transparent motions and dynamic textures. Our contributions on network design and rotation invariance offer insights that are not specific to motion estimation.
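    As a loose illustration of such a building block, the sketch below maps two stacked grayscale frames to a per-pixel distribution over discretized velocities, the kind of distributed motion representation described above. The layer sizes, the 81-bin velocity grid, and the omission of the paper's rotation-invariant weight constraints are all simplifying assumptions.

        # Minimal sketch, not the paper's architecture: a CNN mapping a frame
        # pair to a per-pixel softmax over a discrete grid of candidate
        # velocities, so several motions can keep separate modes alive.
        import torch
        import torch.nn as nn

        class MotionCNN(nn.Module):
            def __init__(self, n_velocities=81):  # e.g. a 9x9 grid of (u, v) bins
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(2, 32, 7, padding=3),  # two stacked grayscale frames
                    nn.ReLU(),
                    nn.Conv2d(32, 64, 7, padding=3),
                    nn.ReLU(),
                )
                self.head = nn.Conv2d(64, n_velocities, 1)  # per-pixel velocity logits

            def forward(self, frame_pair):  # (B, 2, H, W)
                logits = self.head(self.features(frame_pair))
                return torch.softmax(logits, dim=1)  # distribution over velocity bins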

    XNect: Real-time Multi-Person 3D Motion Capture with a Single RGB Camera

    Full text link
    We present a real-time approach for multi-person 3D motion capture at over 30 fps using a single RGB camera. It operates successfully in generic scenes that may contain occlusions by objects and by other people. Our method operates in subsequent stages. The first stage is a convolutional neural network (CNN) that estimates 2D and 3D pose features along with identity assignments for all visible joints of all individuals. We contribute a new architecture for this CNN, called SelecSLS Net, that uses novel selective long- and short-range skip connections to improve the information flow, allowing for a drastically faster network without compromising accuracy. In the second stage, a fully connected neural network turns the possibly partial (on account of occlusion) 2D pose and 3D pose features for each subject into a complete 3D pose estimate per individual. The third stage applies space-time skeletal model fitting to the predicted 2D and 3D pose per subject to further reconcile them and enforce temporal coherence. Our method returns the full skeletal pose in joint angles for each subject. This is a further key distinction from previous work, which does not produce joint-angle results for a coherent skeleton in real time in multi-person scenes. The proposed system runs on consumer hardware at a previously unseen speed of more than 30 fps given 512x320 images as input while achieving state-of-the-art accuracy, which we demonstrate on a range of challenging real-world scenes. Comment: To appear in ACM Transactions on Graphics (SIGGRAPH) 2020.
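    The staged design reads naturally as a per-frame pipeline. The stubs below sketch that control flow only; the function names, signatures, and return values are placeholders, not the authors' implementation.

        # Schematic sketch of the three-stage flow described above (stubbed).
        def stage1_detect(frame):
            """Per-frame CNN (SelecSLS Net in the paper): 2D/3D pose features
            plus identity assignments for every visible joint."""
            return []  # one feature record per detected person

        def stage2_complete(person_features):
            """Fully connected network: completes possibly partial (occluded)
            features into a full per-person 3D pose estimate."""
            return person_features

        def stage3_fit(poses, previous_skeletons):
            """Space-time skeletal model fit: reconciles 2D/3D evidence and
            enforces temporal coherence, yielding joint angles."""
            return poses

        def track(video_frames):
            skeletons = []
            for frame in video_frames:
                people = stage1_detect(frame)
                poses = [stage2_complete(p) for p in people]
                skeletons.append(stage3_fit(poses, skeletons))
            return skeletons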

    RGB-D datasets using Microsoft Kinect or similar sensors: a survey

    Get PDF
    RGB-D data has turned out to be a very useful representation of an indoor scene for solving fundamental computer vision problems. It combines the advantages of the color image, which provides appearance information about an object, with those of the depth image, which is immune to variations in color, illumination, rotation angle, and scale. With the invention of the low-cost Microsoft Kinect sensor, initially designed for gaming but later a popular device for computer vision, high-quality RGB-D data can be acquired easily. In recent years, more and more RGB-D image/video datasets dedicated to various applications have become available, and these are of great importance for benchmarking the state of the art. In this paper, we systematically survey popular RGB-D datasets for different applications, including object recognition, scene classification, hand gesture recognition, 3D simultaneous localization and mapping, and pose estimation. We provide insights into the characteristics of each important dataset and compare the popularity and difficulty of those datasets. Overall, the main goal of this survey is to give a comprehensive description of the available RGB-D datasets and thus to guide researchers in the selection of suitable datasets for evaluating their algorithms.
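    Part of what makes RGB-D data convenient is that, given the camera intrinsics, every depth pixel back-projects to a 3D point. A minimal sketch with illustrative Kinect-class intrinsics (fx = fy = 525 for a 640x480 depth map); a real sensor needs its calibrated values.

        # Back-project a depth image to a point cloud via the pinhole model.
        import numpy as np

        def depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
            """depth: (H, W) array in meters; returns (H, W, 3) XYZ points."""
            h, w = depth.shape
            u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
            x = (u - cx) * depth / fx
            y = (v - cy) * depth / fy
            return np.stack([x, y, depth], axis=-1)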

    XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera

    No full text
    We present a real-time approach for multi-person 3D motion capture at over 30 fps using a single RGB camera. It operates in generic scenes and is robust to difficult occlusions by both other people and objects. Our method operates in subsequent stages. The first stage is a convolutional neural network (CNN) that estimates 2D and 3D pose features along with identity assignments for all visible joints of all individuals. We contribute a new architecture for this CNN, called SelecSLS Net, that uses novel selective long- and short-range skip connections to improve the information flow, allowing for a drastically faster network without compromising accuracy. In the second stage, a fully connected neural network turns the possibly partial (on account of occlusion) 2D pose and 3D pose features for each subject into a complete 3D pose estimate per individual. The third stage applies space-time skeletal model fitting to the predicted 2D and 3D pose per subject to further reconcile them and enforce temporal coherence. Our method returns the full skeletal pose in joint angles for each subject. This is a further key distinction from previous work, which neither extracted global body positions nor produced joint-angle results for a coherent skeleton in real time in multi-person scenes. The proposed system runs on consumer hardware at a previously unseen speed of more than 30 fps given 512x320 images as input while achieving state-of-the-art accuracy, which we demonstrate on a range of challenging real-world scenes.

    Gait Recognition from Motion Capture Data

    Full text link
    Gait recognition from motion capture data, as a pattern classification discipline, can be improved by the use of machine learning. This paper contributes to the state of the art with a statistical approach for extracting robust gait features directly from raw data through a modification of Linear Discriminant Analysis with the Maximum Margin Criterion. Experiments on the CMU MoCap database show that the suggested method outperforms thirteen relevant methods based on geometric features, as well as a method that learns the features through a combination of Principal Component Analysis and Linear Discriminant Analysis. The methods are evaluated in terms of the distribution of biometric templates in their respective feature spaces, expressed through a number of class-separability coefficients and classification metrics. Results also indicate a high portability of the learned features; that is, we can learn in which aspects of walking people generally differ and extract those as general gait features. Recognizing people without the need for group-specific features is convenient, as particular people might not always provide annotated learning data. As a contribution to reproducible research, our evaluation framework and database have been made publicly available. This research makes motion capture technology directly applicable to human recognition. Comment: Preprint. Full paper accepted at ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), special issue on Representation, Analysis and Recognition of 3D Humans. 18 pages. arXiv admin note: substantial text overlap with arXiv:1701.00995, arXiv:1609.04392, arXiv:1609.0693
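    The Maximum Margin Criterion modification admits a compact sketch: instead of maximizing tr(Sw^-1 Sb) as classical LDA does, take the top eigenvectors of Sb - Sw, which avoids inverting the within-class scatter. The generic version below assumes gait samples already flattened into fixed-length vectors; it is an illustration, not the paper's exact formulation.

        # Learn a Maximum Margin Criterion projection from labeled samples.
        import numpy as np

        def mmc_features(X, y, n_dims):
            """X: (n_samples, n_feats); y: (n_samples,) labels.
            Returns an (n_feats, n_dims) projection matrix."""
            mean_all = X.mean(axis=0)
            Sb = np.zeros((X.shape[1], X.shape[1]))  # between-class scatter
            Sw = np.zeros_like(Sb)                   # within-class scatter
            for c in np.unique(y):
                Xc = X[y == c]
                mc = Xc.mean(axis=0)
                Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)
                Sw += (Xc - mc).T @ (Xc - mc)
            vals, vecs = np.linalg.eigh(Sb - Sw)     # symmetric eigenproblem
            return vecs[:, np.argsort(vals)[::-1][:n_dims]]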

    Anti-spoofing using challenge-response user interaction

    Get PDF
    2D facial identification has attracted a great amount of attention over the past years owing to several advantages, including practicality and simple requirements. However, without the capability to distinguish a real user from an impersonator, a face identification system becomes ineffective and vulnerable to spoof attacks. With the rapid evolution of smart portable devices, more advanced kinds of attacks have been developed, especially replayed-video spoofing attempts, which are becoming more difficult to recognize. Consequently, several studies have investigated the types of vulnerabilities a face biometric system might encounter and proposed various successful anti-spoofing algorithms. Unlike spoofing detection for passive or motionless authentication methods, which has been studied in depth, anti-spoofing for interactive user verification methods has been far less thoroughly examined, despite its promise as a robust spoofing prevention approach. This study aims, first, to compare the performance of existing spoofing detection techniques on passive and interactive authentication methods using a more balanced collected dataset and, second, to propose a fusion scheme that combines texture analysis with interaction in order to improve the accuracy of spoofing detection.
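    A minimal sketch of the fusion idea, assuming score-level combination: a texture-analysis liveness score and a challenge-response interaction score are blended with a weight and thresholded. The weight and threshold below are illustrative placeholders, not values from the study.

        def fused_liveness(texture_score, interaction_score, w=0.5, threshold=0.5):
            """Both scores are assumed to lie in [0, 1]; returns True if the
            authentication session is judged live."""
            combined = w * texture_score + (1.0 - w) * interaction_score
            return combined >= threshold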

    ์‹ ์ฒด ์ž„๋ฒ ๋”ฉ์„ ํ™œ์šฉํ•œ ์˜คํ† ์ธ์ฝ”๋” ๊ธฐ๋ฐ˜ ์ปดํ“จํ„ฐ ๋น„์ „ ๋ชจํ˜•์˜ ์„ฑ๋Šฅ ๊ฐœ์„ 

    Get PDF
    Ph.D. thesis, Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering, 2021.8. Advisor: Jonghun Park. Deep learning models have dominated the field of computer vision, achieving state-of-the-art performance in various tasks. In particular, with recent increases in images and videos of people posted on social media, research on computer vision tasks for analyzing human visual information is being used in various ways. This thesis addresses two human-related computer vision tasks: fashion style classification and motion similarity measurement. In real-world fashion style classification problems, the number of samples collected for each style class varies with the fashion trend at the time of data collection, resulting in class imbalance. To cope with this imbalance, generalized few-shot learning is employed, in which both minority and majority classes are used for learning and evaluation. Additionally, two modalities, foreground images cropped to show only the body and fashion-item parts, and fashion attribute information, are reflected in the fashion image embedding through a variational autoencoder. The K-fashion dataset, collected from a Korean fashion shopping mall, is used for model training and evaluation. Motion similarity measurement is used as a sub-module in tasks such as action recognition, anomaly detection, and person re-identification; however, it has attracted less attention than those tasks because the same motion can be represented differently depending on the performer's body structure and the camera angle. The lack of public datasets for model training and evaluation also makes research challenging. We therefore propose an artificial dataset for model training and use an autoencoder architecture to learn motion embeddings disentangled from body-structure and camera-angle attributes. The autoencoder is designed to generate a motion embedding for each body part, so that motion similarity can be measured per body part. Furthermore, motion speed is synchronized by matching patches performing similar motions using dynamic time warping. The similarity score dataset for evaluation was collected through a crowdsourcing platform using videos from NTU RGB+D 120, an action recognition dataset. When the proposed models were verified on their respective evaluation datasets, both outperformed the baselines. In the fashion style classification problem, the proposed model showed the most balanced performance among all models, without bias toward either the minority or the majority classes. In the motion similarity measurement experiments, the correlation coefficient of the proposed model with the human-measured similarity scores was higher than that of the baselines.
    Contents: 1. Introduction (background and motivation; research contributions; thesis outline). 2. Literature review (fashion style classification: machine learning and deep learning approaches, class imbalance, variational autoencoders; human motion similarity: measuring similarity between two people, human body embedding, similarity datasets, triplet and quadruplet losses, dynamic time warping). 3. Fashion style classification (the K-fashion dataset; multimodal variational inference with CADA-VAE, multimodal feature generation, and classifier training with cyclic oversampling; experimental results and qualitative analysis; effectiveness of cyclic oversampling). 4. Motion similarity measurement (the synthetic SARA dataset and NTU RGB+D 120 similarity annotations; the body part embedding model and similarity measurement; experimental results and visualization of motion latent clusters; applications to dancing videos and tuning similarity scores to human perception). 5. Conclusions (summary and contributions; limitations and future research). Appendices: NTU RGB+D 120 similarity annotations (data collection, AMT score analysis); data cleansing of NTU RGB+D 120 skeletal data; motion sequence generation using Mixamo. Bibliography. Abstract in Korean.

    Learning and Acting in Peripersonal Space: Moving, Reaching, and Grasping

    Get PDF
    The young infant explores its body, its sensorimotor system, and the immediately accessible parts of its environment, over the course of a few months creating a model of peripersonal space useful for reaching and grasping objects around it. Drawing on constraints from the empirical literature on infant behavior, we present a preliminary computational model of this learning process, implemented and evaluated on a physical robot. The learning agent explores the relationship between the configuration space of the arm, sensing joint angles through proprioception, and its visual perceptions of the hand and grippers. The resulting knowledge is represented as the peripersonal space (PPS) graph, where nodes represent states of the arm, edges represent safe movements, and paths represent safe trajectories from one pose to another. In our model, the learning process is driven by intrinsic motivation. When repeatedly performing an action, the agent learns the typical result, but also detects unusual outcomes, and is motivated to learn how to make those unusual results reliable. Arm motions typically leave the static background unchanged, but occasionally bump an object, changing its static position. The reach action is learned as a reliable way to bump and move an object in the environment. Similarly, once a reliable reach action is learned, it typically makes a quasi-static change in the environment, moving an object from one static position to another. The unusual outcome is that the object is accidentally grasped (thanks to the innate Palmar reflex), and thereafter moves dynamically with the hand. Learning to make grasps reliable is more complex than for reaches, but we demonstrate significant progress. Our current results are steps toward autonomous sensorimotor learning of motion, reaching, and grasping in peripersonal space, based on unguided exploration and intrinsic motivation. Comment: 35 pages, 13 figures.
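    Since the PPS graph stores arm configurations as nodes and safely executed movements as edges, reaching reduces to a path search. A minimal sketch, assuming the graph is a plain adjacency mapping over configuration ids rather than the authors' data structures:

        # Breadth-first search for a safe trajectory through the PPS graph.
        from collections import deque

        def safe_path(graph, start, goal):
            """graph: dict mapping a node to an iterable of neighbors connected
            by known-safe movements; returns a node path or None."""
            frontier, parents = deque([start]), {start: None}
            while frontier:
                node = frontier.popleft()
                if node == goal:
                    path = []
                    while node is not None:
                        path.append(node)
                        node = parents[node]
                    return path[::-1]  # reverse so the path runs start -> goal
                for nxt in graph.get(node, ()):
                    if nxt not in parents:
                        parents[nxt] = node
                        frontier.append(nxt)
            return None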

    On the evolution of flow topology in turbulent Rayleigh-Bรฉnard convection

    Get PDF
    Copyright 2016 AIP Publishing. This article may be downloaded for personal use only. Any other use requires prior permission of the author and AIP Publishing. Small-scale dynamics is the spirit of turbulence physics. It implicates many attributes of flow topology evolution, coherent structures, hairpin vorticity dynamics, and the mechanism of the kinetic energy cascade. In this work, several dynamical aspects of the small-scale motions have been numerically studied in the framework of Rayleigh-Bénard convection (RBC). To do so, direct numerical simulations have been carried out at two Rayleigh numbers, Ra = 10^8 and 10^10, inside an air-filled rectangular cell of unit aspect ratio and a span-wise open-ended distance of π. As a main feature, the average rate of the invariants of the velocity gradient tensor (Q_G, R_G) has displayed the so-called…
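    For reference, with an incompressible (traceless) velocity gradient tensor A_ij = du_i/dx_j, the invariants mentioned above reduce to Q = -tr(A^2)/2 and R = -det(A). A small numpy sketch for a single spatial point; the DNS post-processing behind the paper's statistics is of course far more involved.

        # Second and third invariants of a traceless velocity gradient tensor.
        import numpy as np

        def qr_invariants(A):
            """A: (3, 3) velocity gradient tensor at one point (tr(A) ~ 0)."""
            Q = -0.5 * np.trace(A @ A)
            R = -np.linalg.det(A)
            return Q, R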

    Automatic visual detection of human behavior: a review from 2000 to 2014

    Get PDF
    Due to advances in information technology (e.g., digital video cameras, ubiquitous sensors), the automatic detection of human behaviors from video is a very recent research topic. In this paper, we perform a systematic review of the recent literature on this topic, from 2000 to 2014, covering a selection of 193 papers searched from six major scientific publishers. The selected papers were classified into three main subjects: detection techniques, datasets, and applications. The detection techniques were divided into four categories (initialization, tracking, pose estimation, and recognition). The list of datasets includes eight examples (e.g., Hollywood action). Finally, several application areas were identified, including human detection, abnormal activity detection, action recognition, player modeling, and pedestrian detection. Our analysis provides a road map to guide future research in designing automatic visual human behavior detection systems. This work is funded by the Portuguese Foundation for Science and Technology (FCT, Fundação para a Ciência e a Tecnologia) under research grant SFRH/BD/84939/2012.
    • โ€ฆ
    corecore