A Wide Area Multiview Static Crowd Estimation System Using UAV and 3D Training Simulator
Crowd size estimation is a challenging problem, especially when the crowd is spread over a significant geographical area. It has applications in monitoring rallies and demonstrations and in calculating assistance requirements in humanitarian disasters; building a surveillance system that covers large crowds is therefore an important goal. UAV-based techniques are an appealing choice for crowd estimation over a large region, but they present a variety of interesting challenges, such as integrating per-frame estimates across a video without counting individuals twice. Large quantities of annotated training data are required to design, train, and test such a system. In this paper, we first review several crowd estimation techniques, existing crowd simulators, and data sets available for crowd analysis. We then describe a simulation system that provides such data, avoiding the need for tedious and error-prone manual annotation, and evaluate synthetic video from the simulator using various existing single-frame crowd estimation techniques. Our findings show that the simulated data can be used to train and test crowd estimation methods, providing a suitable platform on which to develop such techniques. We also propose an automated UAV-based 3D crowd estimation system for approximately static or slow-moving crowds, such as public events, political rallies, and natural or man-made disasters. We evaluate the framework on a variety of scenarios with varying crowd sizes. The proposed system gives promising results under widely accepted metrics, including MAE, RMSE, Precision, Recall, and F1 score.
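For reference, the counting metrics named in this abstract are standard and easy to state. The sketch below shows how they are typically computed over per-frame count estimates; the example counts and the tp/fp/fn values are hypothetical placeholders, and the paper's exact matching criterion for detections is not reproduced here.

```python
import numpy as np

# Hypothetical per-frame crowd counts: ground truth vs. model estimates.
y_true = np.array([120, 340, 95, 410, 220], dtype=float)
y_pred = np.array([131, 318, 101, 395, 240], dtype=float)

mae = np.mean(np.abs(y_pred - y_true))           # Mean Absolute Error
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))  # Root Mean Squared Error
print(f"MAE: {mae:.1f}, RMSE: {rmse:.1f}")

# Precision/Recall/F1 for detection-style counting: tp, fp, fn are the
# numbers of matched, spurious, and missed detections across all frames.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf1(tp=900, fp=80, fn=120))  # illustrative totals
```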
RGB-D-based Action Recognition Datasets: A Survey
Human action recognition from RGB-D (Red, Green, Blue and Depth) data has attracted increasing attention since the first work reported in 2010. Over this period, many benchmark datasets have been created to facilitate the development and evaluation of new algorithms. This raises the question of which dataset to select and how to use it to provide a fair and objective comparative evaluation against state-of-the-art methods. To address this issue, this paper provides a comprehensive review of the most commonly used action recognition related RGB-D video datasets, including 27 single-view datasets, 10 multi-view datasets, and 7 multi-person datasets. The detailed information and analysis of these datasets is a useful resource for guiding insightful selection of datasets for future research. In addition, the issues with current algorithm evaluation vis-à-vis the limitations of the available datasets and evaluation protocols are also highlighted, resulting in a number of recommendations for the collection of new datasets and the use of evaluation protocols.
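For concreteness, one evaluation protocol commonly used with such datasets is a cross-subject split, in which training and test sets contain disjoint performers. A minimal sketch follows, assuming each sample carries a subject ID; the record fields and ID sets are illustrative, not taken from any particular dataset.

```python
# Minimal cross-subject split: train and test sets share no performers.
# The sample records and the subject ID set below are placeholders.
samples = [
    {"video": "s01_a05_r1.avi", "subject": 1, "action": 5},
    {"video": "s02_a03_r2.avi", "subject": 2, "action": 3},
    {"video": "s03_a05_r1.avi", "subject": 3, "action": 5},
]
train_subjects = {1, 3}  # held-in performers
train = [s for s in samples if s["subject"] in train_subjects]
test = [s for s in samples if s["subject"] not in train_subjects]
```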
An end-to-end review of gaze estimation and its interactive applications on handheld mobile devices
In recent years we have witnessed an increasing number of interactive systems on handheld mobile devices which utilise gaze as a single or complementary interaction modality. This trend is driven by the enhanced computational power of these devices, the higher resolution and capacity of their cameras, and improved gaze estimation accuracy obtained from advanced machine learning techniques, especially deep learning. As the literature is progressing fast, there is a pressing need to review the state of the art, delineate its boundary, and identify the key research challenges and opportunities in gaze estimation and interaction. This paper aims to serve this purpose by presenting an end-to-end holistic view of this area, from gaze capturing sensors, to gaze estimation workflows, to deep learning techniques, and to gaze interactive applications.
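To make the estimation step concrete: appearance-based gaze estimation is typically cast as regression from a face or eye crop to a point of regard on the screen. Below is a minimal PyTorch sketch of such a regressor; the architecture, layer sizes, and tensor shapes are illustrative assumptions, not those of any specific system from the literature.

```python
import torch
import torch.nn as nn

class GazeRegressor(nn.Module):
    """Toy appearance-based gaze model: eye crop -> (x, y) screen point."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 2)  # normalised screen coordinates

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.head(f)

model = GazeRegressor()
eye_crops = torch.randn(4, 3, 64, 64)  # batch of RGB eye crops
gaze_xy = model(eye_crops)             # shape (4, 2)
```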
Advances in Object and Activity Detection in Remote Sensing Imagery
The recent revolution in deep learning has enabled considerable development in the fields of object and activity detection. Visual object detection tries to find objects of target classes with precise localisation in an image and assign each object instance a corresponding class label. At the same time, activity recognition aims to determine the actions or activities of an agent or group of agents from sensor or video observation data. Detecting, identifying, tracking, and understanding the behaviour of objects through images and videos taken by various cameras is a very important and challenging problem. Together, object and activity recognition in imaging data captured by remote sensing platforms is a highly dynamic and challenging research topic. During the last decade, there has been significant growth in the number of publications in this field. In particular, many researchers have explored application domains that identify objects and their specific behaviours from airborne and spaceborne imagery. This Special Issue includes papers that explore novel and challenging topics for object and activity detection in remote sensing images and videos acquired by diverse platforms.
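As a point of reference for "precise localisation": object detections are usually scored with intersection-over-union (IoU) between predicted and ground-truth boxes, with a detection counted as correct when IoU exceeds a threshold such as 0.5. A minimal sketch with boxes in (x1, y1, x2, y2) form follows; the coordinates are made up for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14, below a 0.5 cutoff
```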
Deep Learning-Based Human Pose Estimation: A Survey
Human pose estimation aims to locate the human body parts and build a human body representation (e.g., a body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been utilized in a wide range of applications including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although recently developed deep learning-based solutions have achieved high performance in human pose estimation, challenges remain due to insufficient training data, depth ambiguities, and occlusion. The goal of this survey is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 240 research papers since 2014 are covered. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included. Quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, the challenges involved, applications, and future research directions are summarized. We also provide a regularly updated project page: https://github.com/zczcwh/DL-HPE
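Among the evaluation metrics such surveys cover, a common one for 3D pose is the mean per-joint position error (MPJPE): the average Euclidean distance between predicted and ground-truth joint positions. A minimal sketch is shown below; the joint arrays are random placeholders and the 17-joint skeleton is just one common convention.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error; pred, gt of shape (num_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=1).mean()

rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))            # e.g. a 17-joint skeleton
pred = gt + rng.normal(scale=0.05, size=gt.shape)
print(f"MPJPE: {mpjpe(pred, gt):.3f}")   # in the same units as the input
```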
A survey of face recognition techniques under occlusion
The limited capacity to recognize faces under occlusion is a long-standing problem that presents a unique challenge for face recognition systems, and even for humans. Occlusion has received less research attention than other challenges such as pose variation and differing expressions. Nevertheless, occluded face recognition is imperative to exploit the full potential of face recognition in real-world applications. In this paper, we restrict the scope to occluded face recognition. First, we explore what the occlusion problem is and what inherent difficulties can arise. As part of this review, we introduce face detection under occlusion, a preliminary step in face recognition. Second, we present how existing face recognition methods cope with the occlusion problem and classify them into three categories: 1) occlusion robust feature extraction approaches, 2) occlusion aware face recognition approaches, and 3) occlusion recovery based face recognition approaches. Furthermore, we analyze the motivations, innovations, pros and cons, and performance of representative approaches for comparison. Finally, future challenges and method trends in occluded face recognition are thoroughly discussed.
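As a toy illustration of the first category (occlusion robust feature extraction), one simple strategy is to compare faces only over regions predicted to be visible, for example by masking block-wise features before computing a cosine similarity. The sketch below assumes block-wise features and a binary visibility mask; it is illustrative, not a method from the survey.

```python
import numpy as np

def masked_cosine(f1, f2, visible):
    """Cosine similarity over feature blocks flagged visible in both faces.

    f1, f2: (num_blocks, dim) block-wise face features.
    visible: (num_blocks,) boolean mask of unoccluded blocks.
    """
    a = f1[visible].ravel()
    b = f2[visible].ravel()
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
gallery = rng.normal(size=(16, 64))          # enrolled face, 16 blocks
probe = gallery + rng.normal(scale=0.1, size=(16, 64))
mask = np.ones(16, dtype=bool)
mask[4:8] = False                            # e.g. blocks hidden by a scarf
print(masked_cosine(probe, gallery, mask))   # close to 1 for a true match
```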
Data and methods for a visual understanding of sign languages
Signed languages are complete and natural languages used as the first or preferred mode of communication by millions of people worldwide. Unfortunately, however, they continue to be marginalized languages. Designing, building, and evaluating models that work on sign languages presents compelling research challenges and requires interdisciplinary and collaborative efforts. Recent advances in Machine Learning (ML) and Artificial Intelligence (AI) have the power to enable better accessibility for sign language users and to narrow the existing communication barrier between the Deaf community and non-sign language users. However, recent AI-powered technologies still do not account for sign language in their pipelines. This is mainly because sign languages are visual languages that use manual and non-manual features to convey information and do not have a standard written form. The goal of this thesis is therefore to contribute to the development of new technologies that account for sign language, by creating large-scale multimodal resources suitable for training modern data-hungry machine learning models and by developing automatic systems for computer vision tasks that aim at a better visual understanding of sign languages.
In Part I, we introduce the How2Sign dataset, a large-scale collection of multimodal and multiview sign language videos in American Sign Language. In Part II, we contribute to the development of technologies that account for sign languages: in Chapter 4 we present Spot-Align, a framework based on sign spotting methods that automatically annotates sign instances in continuous sign language; we demonstrate the benefits of this framework and establish a baseline for the sign language recognition task on the How2Sign dataset. In Chapter 5, we build on the different annotations and modalities of How2Sign to explore sign language video retrieval by learning cross-modal embeddings. Finally, in Chapter 6, we explore sign language video generation by applying Generative Adversarial Networks to the sign language domain, and assess whether and how well sign language users can understand automatically generated sign language videos by proposing an evaluation protocol based on How2Sign topics and their English translations.
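To illustrate the cross-modal retrieval setup mentioned for Chapter 5: the usual recipe embeds videos and text queries into a shared space and ranks videos by similarity to the query. A minimal sketch with random placeholder embeddings follows; a real system would learn these embeddings jointly rather than sample them.

```python
import numpy as np

rng = np.random.default_rng(2)
video_emb = rng.normal(size=(100, 256))  # 100 sign language video embeddings
query_emb = rng.normal(size=(256,))      # one text query embedding

# Normalise and rank videos by cosine similarity to the query.
video_emb /= np.linalg.norm(video_emb, axis=1, keepdims=True)
query_emb /= np.linalg.norm(query_emb)
scores = video_emb @ query_emb
top5 = np.argsort(-scores)[:5]           # indices of best-matching videos
print(top5, scores[top5])
```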
An Outlook into the Future of Egocentric Vision
What will the future be? We wonder! In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward-facing cameras and digital overlays, is expected to be integrated into our everyday lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through examples the limitations of current technology. We then provide a mapping between this future and previously defined research tasks. For each task, we survey its seminal works, current state-of-the-art methodologies, and available datasets, then reflect on the shortcomings that limit its applicability to future research. Note that this survey focuses on software models for egocentric vision, independent of any specific hardware. The paper concludes with recommendations for areas of immediate exploration so as to unlock our path to the future of always-on, personalised, and life-enhancing egocentric vision. Comments, suggestions, and corrections are invited at: https://openreview.net/forum?id=V3974SUk1